{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Header\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Week 1: Find Clusters of Infected People" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**URGENT WARNING**\n", "\n", "We have been receiving reports from health facilities that a new, fast-spreading virus has been discovered in the population. To prepare our response, we need to understand the geospatial distribution of those who have been infected. Find out whether there are identifiable clusters of infected individuals and where they are. \n", "\n", "\n", "Your goal for this notebook will be to estimate the location of dense geographic clusters of infected people using incoming data from week 1 of the simulated epidemic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The cudf.pandas extension is already loaded. To reload it, use:\n", " %reload_ext cudf.pandas\n" ] } ], "source": [ "%load_ext cudf.pandas\n", "import pandas as pd\n", "import cuml\n", "\n", "import cupy as cp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Begin by loading the data you've received about week 1 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `'./data/week1.csv'`. For this notebook you will only need the `'lat'`, `'long'`, and `'infected'` columns. Either drop the columns after loading, or use the `pd.read_csv` named argument `usecols` to provide a list of only the columns you need." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
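<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>lat</th>\n", "      <th>long</th>\n", "      <th>infected</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>54.522511</td>\n", "      <td>-1.571896</td>\n", "      <td>False</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>54.554031</td>\n", "      <td>-1.524968</td>\n", "      <td>False</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>54.552483</td>\n", "      <td>-1.435203</td>\n", "      <td>False</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>54.537186</td>\n", "      <td>-1.566215</td>\n", "      <td>False</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>54.528210</td>\n", "      <td>-1.588462</td>\n", "      <td>False</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>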
" ], "text/plain": [ " lat long infected\n", "0 54.522511 -1.571896 False\n", "1 54.554031 -1.524968 False\n", "2 54.552483 -1.435203 False\n", "3 54.537186 -1.566215 False\n", "4 54.528210 -1.588462 False" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./data/week1.csv', dtype = {\n", " 'lat': 'float32',\n", " 'long': 'float32',\n", " 'infected': 'category',\n", "}, usecols = ['lat', 'long', 'infected'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make Data Frame of the Infected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a new DataFrame `infected_df` that contains only the infected members of the population.\n", "\n", "**Tip**: Reset the index of `infected_df` with `.reset_index(drop=True)`. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[28928759 28930512 28930904 ... 57410428 57411005 57411919]\n", "[ 0 1 2 ... 18145 18146 18147]\n" ] }, { "data": { "text/html": [ "
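<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>lat</th>\n", "      <th>long</th>\n", "      <th>infected</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>54.472767</td>\n", "      <td>-1.654932</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>54.529720</td>\n", "      <td>-1.667143</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>54.512981</td>\n", "      <td>-1.589866</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>54.522320</td>\n", "      <td>-1.380694</td>\n", "      <td>True</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>54.541656</td>\n", "      <td>-1.613490</td>\n", "      <td>True</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>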
" ], "text/plain": [ " lat long infected\n", "0 54.472767 -1.654932 True\n", "1 54.529720 -1.667143 True\n", "2 54.512981 -1.589866 True\n", "3 54.522320 -1.380694 True\n", "4 54.541656 -1.613490 True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "infected_df = df[df['infected'] == True]\n", "print(infected_df.index.values)\n", "\n", "infected_df = infected_df.reset_index(drop=True)\n", "\n", "print(infected_df.index.values)\n", "infected_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Make Grid Coordinates for Infected Locations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Provided for you in the next cell (which you can expand by clicking on the \"...\" and contract again after executing by clicking on the blue left border of the cell) is the lat/long to OSGB36 grid coordinates converter you used earlier in the workshop. Use this converter to create grid coordinate values stored in `northing` and `easting` columns of the `infected_df` you created in the last step." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "# https://www.ordnancesurvey.co.uk/docs/support/guide-coordinate-systems-great-britain.pdf\n", "\n", "def latlong2osgbgrid_cupy(lat, long, input_degrees=True):\n", " '''\n", " Converts latitude and longitude (ellipsoidal) coordinates into northing and easting (grid) coordinates, using a Transverse Mercator projection.\n", " \n", " Inputs:\n", " lat: latitude coordinate (N)\n", " long: longitude coordinate (E)\n", " input_degrees: if True (default), interprets the coordinates as degrees; otherwise, interprets coordinates as radians\n", " \n", " Output:\n", " (northing, easting)\n", " '''\n", " \n", " if input_degrees:\n", " lat = lat * cp.pi/180\n", " long = long * cp.pi/180\n", "\n", " a = 6377563.396\n", " b = 6356256.909\n", " e2 = (a**2 - b**2) / a**2\n", "\n", " N0 = -100000 # northing of true origin\n", " E0 = 400000 # easting of true origin\n", " F0 = .9996012717 # scale factor on central meridian\n", " phi0 = 49 * cp.pi / 180 # latitude of true origin\n", " lambda0 = -2 * cp.pi / 180 # longitude of true origin and central meridian\n", " \n", " sinlat = cp.sin(lat)\n", " coslat = cp.cos(lat)\n", " tanlat = cp.tan(lat)\n", " \n", " latdiff = lat-phi0\n", " longdiff = long-lambda0\n", "\n", " n = (a-b) / (a+b)\n", " nu = a * F0 * (1 - e2 * sinlat ** 2) ** -.5\n", " rho = a * F0 * (1 - e2) * (1 - e2 * sinlat ** 2) ** -1.5\n", " eta2 = nu / rho - 1\n", " M = b * F0 * ((1 + n + 5/4 * (n**2 + n**3)) * latdiff - \n", " (3*(n+n**2) + 21/8 * n**3) * cp.sin(latdiff) * cp.cos(lat+phi0) +\n", " 15/8 * (n**2 + n**3) * cp.sin(2*(latdiff)) * cp.cos(2*(lat+phi0)) - \n", " 35/24 * n**3 * cp.sin(3*(latdiff)) * cp.cos(3*(lat+phi0)))\n", " I = M + N0\n", " II = nu/2 * sinlat * coslat\n", " III = nu/24 * sinlat * coslat ** 3 * (5 - tanlat ** 2 + 9 * eta2)\n", " IIIA = nu/720 * sinlat * coslat ** 5 * (61-58 * tanlat**2 + tanlat**4)\n", " IV = nu * 
coslat\n", " V = nu / 6 * coslat**3 * (nu/rho - cp.tan(lat)**2)\n", " VI = nu / 120 * coslat ** 5 * (5 - 18 * tanlat**2 + tanlat**4 + 14 * eta2 - 58 * tanlat**2 * eta2)\n", "\n", " northing = I + II * longdiff**2 + III * longdiff**4 + IIIA * longdiff**6\n", " easting = E0 + IV * longdiff + V * longdiff**3 + VI * longdiff**5\n", "\n", " return(northing, easting)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
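<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n", "      <th>lat</th>\n", "      <th>long</th>\n", "      <th>infected</th>\n", "      <th>northing</th>\n", "      <th>easting</th>\n", "      <th>cluster</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>54.472767</td>\n", "      <td>-1.654932</td>\n", "      <td>True</td>\n", "      <td>508670.609809</td>\n", "      <td>422359.747233</td>\n", "      <td>-1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>54.529720</td>\n", "      <td>-1.667143</td>\n", "      <td>True</td>\n", "      <td>515003.452959</td>\n", "      <td>421538.534748</td>\n", "      <td>-1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>54.512981</td>\n", "      <td>-1.589866</td>\n", "      <td>True</td>\n", "      <td>513167.311551</td>\n", "      <td>426549.871569</td>\n", "      <td>-1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>54.522320</td>\n", "      <td>-1.380694</td>\n", "      <td>True</td>\n", "      <td>514305.528712</td>\n", "      <td>440081.234190</td>\n", "      <td>-1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>54.541656</td>\n", "      <td>-1.613490</td>\n", "      <td>True</td>\n", "      <td>516349.193146</td>\n", "      <td>425002.998690</td>\n", "      <td>-1</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>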
" ], "text/plain": [ " lat long infected northing easting cluster\n", "0 54.472767 -1.654932 True 508670.609809 422359.747233 -1\n", "1 54.529720 -1.667143 True 515003.452959 421538.534748 -1\n", "2 54.512981 -1.589866 True 513167.311551 426549.871569 -1\n", "3 54.522320 -1.380694 True 514305.528712 440081.234190 -1\n", "4 54.541656 -1.613490 True 516349.193146 425002.998690 -1" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cupy_lat = cp.asarray(infected_df['lat'])\n", "cupy_long = cp.asarray(infected_df['long'])\n", "\n", "infected_df['northing'], infected_df['easting'] = latlong2osgbgrid_cupy(cupy_lat, cupy_long)\n", "infected_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find Clusters of Infected People" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use DBSCAN to find clusters of at least 25 infected people where no member is more than 2000m from at least one other cluster member. Create a new column in `infected_df` which contains the cluster to which each infected person belongs." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dbscan = cuml.DBSCAN(eps = 2000, min_samples = 25)\n", "infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n", "infected_df.groupby('cluster')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find the Centroid of Each Cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use grouping to find the mean `northing` and `easting` values for each cluster identified above." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
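<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>northing</th>\n", "      <th>easting</th>\n", "    </tr>\n",
"    <tr>\n", "      <th>cluster</th>\n", "      <th></th>\n", "      <th></th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>-1</th>\n", "      <td>378094.622647</td>\n", "      <td>401880.682473</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>397661.319575</td>\n", "      <td>371410.021738</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>436475.527827</td>\n", "      <td>332980.449214</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>347062.477357</td>\n", "      <td>389386.823243</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>359668.552556</td>\n", "      <td>379638.020362</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>391630.403390</td>\n", "      <td>431158.137254</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>5</th>\n", "      <td>386471.397432</td>\n", "      <td>426559.085587</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>6</th>\n", "      <td>434970.462486</td>\n", "      <td>406985.278520</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>7</th>\n", "      <td>412772.652344</td>\n", "      <td>410069.663793</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>8</th>\n", "      <td>415808.971615</td>\n", "      <td>414713.750256</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>9</th>\n", "      <td>417322.530166</td>\n", "      <td>409583.737652</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>10</th>\n", "      <td>334208.471668</td>\n", "      <td>435937.777721</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>11</th>\n", "      <td>300568.023792</td>\n", "      <td>391901.514790</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>12</th>\n", "      <td>291539.540205</td>\n", "      <td>401640.663845</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>13</th>\n", "      <td>289855.069902</td>\n", "      <td>394518.295606</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>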
" ], "text/plain": [ " northing easting\n", "cluster \n", "-1 378094.622647 401880.682473\n", " 0 397661.319575 371410.021738\n", " 1 436475.527827 332980.449214\n", " 2 347062.477357 389386.823243\n", " 3 359668.552556 379638.020362\n", " 4 391630.403390 431158.137254\n", " 5 386471.397432 426559.085587\n", " 6 434970.462486 406985.278520\n", " 7 412772.652344 410069.663793\n", " 8 415808.971615 414713.750256\n", " 9 417322.530166 409583.737652\n", " 10 334208.471668 435937.777721\n", " 11 300568.023792 391901.514790\n", " 12 291539.540205 401640.663845\n", " 13 289855.069902 394518.295606" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "centroids_df = infected_df[['northing', 'easting', 'cluster']].groupby('cluster').mean()\n", "centroids_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the number of people in each cluster by counting the number of appearances of each cluster's label in the column produced by DBSCAN." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cluster\n", "-1 8451\n", " 0 8638\n", " 1 68\n", " 2 403\n", " 3 25\n", " 4 66\n", " 5 43\n", " 6 27\n", " 7 39\n", " 8 92\n", " 9 21\n", " 10 64\n", " 11 68\n", " 12 72\n", " 13 71\n", "dtype: int64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "infected_df.groupby(['cluster']).size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find the Centroid of the Cluster with the Most Members ##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the cluster label for with the most people to filter `centroid_df` and write the answer to `my_assessment/question_1.json`. 
" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.10/site-packages/cudf/io/json.py:194: UserWarning: Using CPU via Pandas to write JSON dataset\n", " warnings.warn(\"Using CPU via Pandas to write JSON dataset\")\n" ] } ], "source": [ "centroids_df.loc[0].to_json('my_assessment/question_1.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Submission ##" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"northing\":397661.3195752321,\"easting\":371410.0217381102}" ] } ], "source": [ "!cat my_assessment/question_1.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: Your submission file should contain one line of text, similar to: \n", "\n", "```\n", "{'northing':XXX.XX,'easting':XXX.XX}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Please Restart the Kernel

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Header\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 4 }