{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 02 - K-Means ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook uses GPU-accelerated K-means to find the best locations for a fixed number of humanitarian supply airdrop depots. It covers the following sections: \n",
"1. [Environment](#Environment)\n",
"2. [Load Data](#Load-Data)\n",
"3. [K-Means Clustering](#K-Means-Clustering)\n",
" * [Exercise #1 - Make Another `KMeans` Instance](#Exercise-#1---Make-Another-KMeans-Instance)\n",
"4. [Visualize the Clusters](#Visualize-the-Clusters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment ##\n",
"For the first time we import `cuml`, the RAPIDS GPU-accelerated library containing many common machine learning algorithms. We will visualize the results of our work in this notebook, so we also import `cuxfilter`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import cudf\n",
"import cuml\n",
"\n",
"import cuxfilter as cxf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data ##\n",
"For this notebook we again load the cleaned UK population data. Because we are not specifically looking at counties here, we omit that column and keep only the grid coordinate columns, `easting` and `northing`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"northing float64\n",
"easting float64\n",
"dtype: object\n"
]
},
{
"data": {
"text/plain": [
"(58479894, 2)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"gdf = cudf.read_csv('./data/clean_uk_pop.csv', usecols=['easting', 'northing'])\n",
"print(gdf.dtypes)\n",
"gdf.shape"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>northing</th>\n",
" <th>easting</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>515491.5313</td>\n",
" <td>430772.1875</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>503572.4688</td>\n",
" <td>434685.8750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>517903.6563</td>\n",
" <td>432565.5313</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>517059.9063</td>\n",
" <td>427660.6250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>509228.6875</td>\n",
" <td>425527.7813</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" northing easting\n",
"0 515491.5313 430772.1875\n",
"1 503572.4688 434685.8750\n",
"2 517903.6563 432565.5313\n",
"3 517059.9063 427660.6250\n",
"4 509228.6875 425527.7813"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name='#s2-3'></a>\n",
"## K-Means Clustering ##\n",
"The unsupervised K-means clustering algorithm looks for a fixed number *k* of centroids in the data and assigns each point to the cluster of its closest centroid. K-means can be effective when the number of clusters *k* is known or can be estimated well (for example, from a model of the underlying mechanics of a problem).\n",
"\n",
"We already know the distribution of the population. Assume that we would also like to estimate the best locations to build a fixed number of humanitarian supply depots, from which we can perform airdrops and reach the population most efficiently. We can use K-means to identify candidate locations by setting *k* to the number of supply depots available and fitting on the locations of the population.\n",
"\n",
"GPU-accelerated K-means is just as easy to use as its CPU-only scikit-learn counterpart. In this series of exercises, you will use it to optimize the locations for 5 supply depots.\n",
"\n",
"`cuml.KMeans()` will initialize a K-means instance. Use it now to initialize a K-means instance called `km`, passing the named argument `n_clusters` set equal to our desired number, `5`. Use the `km.fit` method to fit `km` to the population's locations by passing it the population data. After fitting, add the cluster labels back to `gdf` in a new column named `cluster`. Finally, you can use `km.cluster_centers_` to see where the algorithm placed the 5 centroids.\n",
"\n",
"Below we train a K-means clustering algorithm to find 5 clusters. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>306647.898235</td>\n",
" <td>408370.452191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>442109.465392</td>\n",
" <td>402673.747673</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>288997.149971</td>\n",
" <td>553805.430444</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>148770.463641</td>\n",
" <td>311786.805381</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>170553.110214</td>\n",
" <td>521605.459724</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 306647.898235 408370.452191\n",
"1 442109.465392 402673.747673\n",
"2 288997.149971 553805.430444\n",
"3 148770.463641 311786.805381\n",
"4 170553.110214 521605.459724"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# instantiate\n",
"km = cuml.KMeans(n_clusters=5)\n",
"\n",
"# fit\n",
"km.fit(gdf)\n",
"\n",
"# assign cluster as new column\n",
"gdf['cluster'] = km.labels_\n",
"km.cluster_centers_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name='#s2-e1'></a>\n",
"## Exercise #1 - Make Another `KMeans` Instance ##\n",
"\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the first cell below to instantiate a K-means instance with 6 clusters.\n",
"* Modify the `<FIXME>` only and execute the second cell below to fit the data. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"km = cuml.KMeans(n_clusters=6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"km.fit(gdf)\n",
"gdf['cluster'] = km.labels_\n",
"km.cluster_centers_"
]
},
{
"cell_type": "raw",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [
"\n",
"km = cuml.KMeans(n_clusters=6)\n",
"\n",
"km.fit(gdf)\n",
"gdf['cluster'] = km.labels_\n",
"km.cluster_centers_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='#s2-4'></a>\n",
"## Visualize the Clusters ##\n",
"To help us understand where clusters are located, we make a visualization that separates them, using the same three steps as before.\n",
"\n",
"Below we plot the clusters with cuxfilter. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# associate a data source with cuXfilter\n",
"cxf_data = cxf.DataFrame.from_dataframe(gdf)\n",
"\n",
"# define charts\n",
"scatter_chart = cxf.charts.datashader.scatter(x='easting', y='northing')\n",
"\n",
"# define widget using the `cluster` column for multiselect\n",
"# use the same technique to scale the scatterplot, then add a widget to let us select which cluster to look at\n",
"cluster_widget = cxf.charts.panel_widgets.multi_select('cluster')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create dashboard\n",
"dash = cxf_data.dashboard(charts=[scatter_chart], sidebar=[cluster_widget], theme=cxf.themes.dark, data_size_widget=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dash.app()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# restart the kernel (do_shutdown(True) requests a restart)\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](3-03_dbscan.ipynb). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}