lab/5/data science/2/3-02_k-means.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"./images/DLI_Header.png\" width=400/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fundamentals of Accelerated Data Science # "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 02 - K-Means ##\n",
    "\n",
    "**Table of Contents**\n",
    "<br>\n",
    "This notebook uses GPU-accelerated K-means to find the best locations for a fixed number of humanitarian supply airdrop depots. This notebook covers the below sections: \n",
    "1. [Environment](#Environment)\n",
    "2. [Load Data](#Load-Data)\n",
    "3. [K-Means Clustering](#K-Means-Clustering)\n",
    "    * [Exercise #1 - Make Another `KMeans` Instance](#Exercise-#1---Make-Another-KMeans-Instance)\n",
    "4. [Visualize the Clusters](#Visualize-the-Clusters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment ##\n",
    "For the first time we import `cuml`, the RAPIDS GPU-accelerated library containing many common machine learning algorithms. We will be visualizing the results of your work in this notebook, so we also import `cuxfilter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "import cudf\n",
    "import cuml\n",
    "\n",
    "import cuxfilter as cxf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data ##\n",
    "For this notebook we load again the cleaned UK population data--in this case, we are not specifically looking at counties, so we omit that column and just keep the grid coordinate columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "northing    float64\n",
      "easting     float64\n",
      "dtype: object\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(58479894, 2)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "gdf = cudf.read_csv('./data/clean_uk_pop.csv', usecols=['easting', 'northing'])\n",
    "print(gdf.dtypes)\n",
    "gdf.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>northing</th>\n",
       "      <th>easting</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>515491.5313</td>\n",
       "      <td>430772.1875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>503572.4688</td>\n",
       "      <td>434685.8750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>517903.6563</td>\n",
       "      <td>432565.5313</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>517059.9063</td>\n",
       "      <td>427660.6250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>509228.6875</td>\n",
       "      <td>425527.7813</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      northing      easting\n",
       "0  515491.5313  430772.1875\n",
       "1  503572.4688  434685.8750\n",
       "2  517903.6563  432565.5313\n",
       "3  517059.9063  427660.6250\n",
       "4  509228.6875  425527.7813"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='#s2-3'></a>\n",
    "## K-Means Clustering ##\n",
    "The unsupervised K-means clustering algorithm will look for a fixed number *k* of centroids in the data and clusters each point with its closest centroid. K-means can be effective when the number of clusters *k* is known or has a good estimate (such as from a model of the underlying mechanics of a problem).\n",
    "\n",
    "Assume that in addition to knowing the distribution of the population, which we do, we would like to estimate the best locations to build a fixed number of humanitarian supply depots from which we can perform airdrops and reach the population most efficiently. We can use K-means, setting *k* to the number of supply depots available and fitting on the locations of the population, to identify candidate locations.\n",
    "\n",
    "GPU-accelerated K-means is just as easy as its CPU-only scikit-learn counterpart. In this series of exercises, you will use it to optimize the locations for 5 supply depots.\n",
    "\n",
    "`cuml.KMeans()` will initialize a K-means instance. Use it now to initialize a K-means instance called `km`, passing the named argument `n_clusters` set equal to our desired number `5`. Use the `km.fit` method to fit `km` to the population's locations by passing it the population data. After fitting, add the cluster labels back to the `gdf` in a new column named `cluster`. Finally, you can use `km.cluster_centers_` to see where the algorithm created the 5 centroids.\n",
    "\n",
    "Below we train a K-means clustering algorithm to find 5 clusters. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>306647.898235</td>\n",
       "      <td>408370.452191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>442109.465392</td>\n",
       "      <td>402673.747673</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>288997.149971</td>\n",
       "      <td>553805.430444</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>148770.463641</td>\n",
       "      <td>311786.805381</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>170553.110214</td>\n",
       "      <td>521605.459724</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               0              1\n",
       "0  306647.898235  408370.452191\n",
       "1  442109.465392  402673.747673\n",
       "2  288997.149971  553805.430444\n",
       "3  148770.463641  311786.805381\n",
       "4  170553.110214  521605.459724"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "# instantaite\n",
    "km = cuml.KMeans(n_clusters=5)\n",
    "\n",
    "# fit\n",
    "km.fit(gdf)\n",
    "\n",
    "# assign cluster as new column\n",
    "gdf['cluster'] = km.labels_\n",
    "km.cluster_centers_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name='#s2-e1'></a>\n",
    "## Exercise #1 - Make Another `KMeans` Instance ##\n",
    "\n",
    "**Instructions**: <br>\n",
    "* Modify the `<FIXME>` only and execute the below cell to instantiate a K-means instance with 6 clusters.\n",
    "* Modify the `<FIXME>` only and execute the cell below to fit the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "km = cuml.KMeans(n_clusters=6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "km.fit(gdf)\n",
    "gdf['cluster'] = km.labels_\n",
    "km.cluster_centers_"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "source": [
    "\n",
    "km = cuml.KMeans(n_clusters=6)\n",
    "\n",
    "km.fit(gdf)\n",
    "gdf['cluster'] = km.labels_\n",
    "km.cluster_centers_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click ... for solution. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='#s2-4'></a>\n",
    "## Visualize the Clusters ##\n",
    "To help us understand where clusters are located, we make a visualization that separates them, using the same three steps as before.\n",
    "\n",
    "Below we plot the clusters with cuxfilter. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "# associate a data source with cuXfilter\n",
    "cxf_data = cxf.DataFrame.from_dataframe(gdf)\n",
    "\n",
    "# define charts\n",
    "scatter_chart = cxf.charts.datashader.scatter(x='easting', y='northing')\n",
    "\n",
    "# define widget using the `cluster` column for multiselect\n",
    "# use the same technique to scale the scatterplot, then add a widget to let us select which cluster to look at\n",
    "cluster_widget = cxf.charts.panel_widgets.multi_select('cluster')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create dashboard\n",
    "dash = cxf_data.dashboard(charts=[scatter_chart],sidebar=[cluster_widget], theme=cxf.themes.dark, data_size_widget=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dash.app()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Well Done!** Let's move to the [next notebook](3-03_dbscan.ipynb). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"./images/DLI_Header.png\" width=400/>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}