{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fundamentals of Accelerated Data Science # " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 02 - K-Means ##\n", "\n", "**Table of Contents**\n", "
\n", "This notebook uses GPU-accelerated K-means to find the best locations for a fixed number of humanitarian supply airdrop depots. It covers the following sections: \n", "1. [Environment](#Environment)\n", "2. [Load Data](#Load-Data)\n", "3. [K-Means Clustering](#K-Means-Clustering)\n", " * [Exercise #1 - Make Another `KMeans` Instance](#Exercise-#1---Make-Another-KMeans-Instance)\n", "4. [Visualize the Clusters](#Visualize-the-Clusters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment ##\n", "This is the first time we import `cuml`, the RAPIDS GPU-accelerated library containing many common machine learning algorithms. We will visualize the results of your work in this notebook, so we also import `cuxfilter`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# DO NOT CHANGE THIS CELL\n", "import cudf\n", "import cuml\n", "\n", "import cuxfilter as cxf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data ##\n", "For this notebook we again load the cleaned UK population data. Here we are not specifically looking at counties, so we omit that column and keep only the grid coordinate columns." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "northing float64\n", "easting float64\n", "dtype: object\n" ] }, { "data": { "text/plain": [ "(58479894, 2)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "gdf = cudf.read_csv('./data/clean_uk_pop.csv', usecols=['easting', 'northing'])\n", "print(gdf.dtypes)\n", "gdf.shape" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
" <thead>\n",
" <tr><th></th><th>northing</th><th>easting</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>515491.5313</td><td>430772.1875</td></tr>\n",
" <tr><th>1</th><td>503572.4688</td><td>434685.8750</td></tr>\n",
" <tr><th>2</th><td>517903.6563</td><td>432565.5313</td></tr>\n",
" <tr><th>3</th><td>517059.9063</td><td>427660.6250</td></tr>\n",
" <tr><th>4</th><td>509228.6875</td><td>425527.7813</td></tr>\n",
" </tbody>\n",
"</table>
" ], "text/plain": [ " northing easting\n", "0 515491.5313 430772.1875\n", "1 503572.4688 434685.8750\n", "2 517903.6563 432565.5313\n", "3 517059.9063 427660.6250\n", "4 509228.6875 425527.7813" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## K-Means Clustering ##\n", "The unsupervised K-means clustering algorithm will look for a fixed number *k* of centroids in the data and clusters each point with its closest centroid. K-means can be effective when the number of clusters *k* is known or has a good estimate (such as from a model of the underlying mechanics of a problem).\n", "\n", "Assume that in addition to knowing the distribution of the population, which we do, we would like to estimate the best locations to build a fixed number of humanitarian supply depots from which we can perform airdrops and reach the population most efficiently. We can use K-means, setting *k* to the number of supply depots available and fitting on the locations of the population, to identify candidate locations.\n", "\n", "GPU-accelerated K-means is just as easy as its CPU-only scikit-learn counterpart. In this series of exercises, you will use it to optimize the locations for 5 supply depots.\n", "\n", "`cuml.KMeans()` will initialize a K-means instance. Use it now to initialize a K-means instance called `km`, passing the named argument `n_clusters` set equal to our desired number `5`. Use the `km.fit` method to fit `km` to the population's locations by passing it the population data. After fitting, add the cluster labels back to the `gdf` in a new column named `cluster`. Finally, you can use `km.cluster_centers_` to see where the algorithm created the 5 centroids.\n", "\n", "Below we train a K-means clustering algorithm to find 5 clusters. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
" <thead>\n",
" <tr><th></th><th>0</th><th>1</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>306647.898235</td><td>408370.452191</td></tr>\n",
" <tr><th>1</th><td>442109.465392</td><td>402673.747673</td></tr>\n",
" <tr><th>2</th><td>288997.149971</td><td>553805.430444</td></tr>\n",
" <tr><th>3</th><td>148770.463641</td><td>311786.805381</td></tr>\n",
" <tr><th>4</th><td>170553.110214</td><td>521605.459724</td></tr>\n",
" </tbody>\n",
"</table>
" ], "text/plain": [ " 0 1\n", "0 306647.898235 408370.452191\n", "1 442109.465392 402673.747673\n", "2 288997.149971 553805.430444\n", "3 148770.463641 311786.805381\n", "4 170553.110214 521605.459724" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DO NOT CHANGE THIS CELL\n", "# instantaite\n", "km = cuml.KMeans(n_clusters=5)\n", "\n", "# fit\n", "km.fit(gdf)\n", "\n", "# assign cluster as new column\n", "gdf['cluster'] = km.labels_\n", "km.cluster_centers_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Exercise #1 - Make Another `KMeans` Instance ##\n", "\n", "**Instructions**:
\n", "* Modify the `` only and execute the cell below to instantiate a K-means instance with 6 clusters.\n", "* Modify the `` only and execute the cell below to fit the data. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "km = cuml.KMeans(n_clusters=6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fit on the coordinate columns only: gdf now also holds the earlier `cluster` labels\n", "km.fit(gdf[['northing', 'easting']])\n", "gdf['cluster'] = km.labels_\n", "km.cluster_centers_" ] }, { "cell_type": "raw", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "\n", "km = cuml.KMeans(n_clusters=6)\n", "\n", "# fit on the coordinate columns only: gdf now also holds the earlier `cluster` labels\n", "km.fit(gdf[['northing', 'easting']])\n", "gdf['cluster'] = km.labels_\n", "km.cluster_centers_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click ... for solution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Visualize the Clusters ##\n", "To help us understand where clusters are located, we make a visualization that separates them, using the same three steps as before.\n", "\n", "Below we plot the clusters with cuxfilter. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# DO NOT CHANGE THIS CELL\n", "# associate a data source with cuXfilter\n", "cxf_data = cxf.DataFrame.from_dataframe(gdf)\n", "\n", "# define charts\n", "scatter_chart = cxf.charts.datashader.scatter(x='easting', y='northing')\n", "\n", "# define widget using the `cluster` column for multiselect\n", "# use the same technique to scale the scatterplot, then add a widget to let us select which cluster to look at\n", "cluster_widget = cxf.charts.panel_widgets.multi_select('cluster')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create dashboard\n", "dash = cxf_data.dashboard(charts=[scatter_chart],sidebar=[cluster_widget], theme=cxf.themes.dark, data_size_widget=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dash.app()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Well Done!** Let's move to the [next notebook](3-03_dbscan.ipynb). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 4 }