feat(ds-2.2, 2e)

2025-10-31 12:17:30 +03:00
parent 6b6ac94483
commit d5bf47ddd5
16 changed files with 210543 additions and 1959 deletions

@ -1,278 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "19051402",
"metadata": {
"tags": []
},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "67ed6062",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science #"
]
},
{
"cell_type": "markdown",
"id": "a65f57f0",
"metadata": {},
"source": [
"## 00 - Introduction ##\n",
"Welcome to NVIDIA's Deep Learning Institute workshop on the Fundamentals of Accelerated Data Science. This interactive lab offers practical experience with every stage of the development process, empowering participants to tailor solutions for their unique applications."
]
},
{
"cell_type": "markdown",
"id": "50d32b6c",
"metadata": {},
"source": [
"**Learning Objectives**\n",
"<br>\n",
"In this workshop, you will learn: \n",
"* Overview of data science\n",
"* Demonstrations of data science workflows\n",
"* How acceleration is achieved\n",
"* How to design operations to maximize GPU acceleration\n",
"* Implications of acceleration"
]
},
{
"cell_type": "markdown",
"id": "3a02c2b6",
"metadata": {},
"source": [
"### JupyterLab ###\n",
"For this hands-on lab, we use [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) to manage our environment. The [JupyterLab Interface](https://jupyterlab.readthedocs.io/en/stable/user/interface.html) is a dashboard that provides access to interactive iPython notebooks, as well as the folder structure of our environment and a terminal window into the Ubuntu operating system. The first view includes a **menu bar** at the top, a **file browser** in the **left sidebar**, and a **main work area** that is initially open to this \"introduction\" notebook. \n",
"\n",
"<p><img src=\"images/jl_launcher.png\" width=720></p>\n",
"\n",
"* The file browser can be navigated just like any other file explorer. A double click on any of the items will open a new tab with its content. \n",
"* The main work area includes tabbed views of open files that can be closed, moved, and edited as needed. \n",
"* The notebooks, including this one, consist of a series of content and code **cells**. To execute code in a code cell, press `Shift+Enter` or the `Run` button in the menu bar above, while a cell is highlighted. Sometimes, a content cell will get switched to editing mode. Executing the cell with `Shift+Enter` or the `Run` button will switch it back to a readable form.\n",
"* To interrupt cell execution, click the `Stop` button in the menu bar or navigate to the `Kernel` menu, and select `Interrupt Kernel`. \n",
"* We can use terminal commands in the notebook cells by prepending an exclamation point/bang(`!`) to the beginning of the command.\n",
"* We can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below)."
]
},
{
"cell_type": "markdown",
"id": "4492c58d",
"metadata": {},
"source": [
"<a name='e1'></a>\n",
"### Exercise #1 - Practice ###\n",
"**Instructions**: <br>\n",
"* Try executing the simple print statement in the below cell.\n",
"* Then try executing the terminal command in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e69a6515",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# activate this cell by selecting it with the mouse or arrow keys then use the keyboard shortcut [Shift+Enter] to execute\n",
"print('This is just a simple print statement.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e54fe372",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!echo 'This is another simple print statement.'"
]
},
{
"cell_type": "markdown",
"id": "c2e5151b-4842-465e-a20d-bb64af66d011",
"metadata": {},
"source": [
"<a name='e2'></a>\n",
"### Exercise #2 - Available GPU Accelerators ###\n",
"The `nvidia-smi` (NVIDIA System Management Interface) command is a powerful utility for managing and monitoring NVIDIA GPU devices. It will print information about available GPUs, their current memory usage, and any processes currently utilizing them. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to learn about this environment's available GPUs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08d543eb-a951-4eb9-8107-b13c01b3ac46",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "adee74e3-613a-4986-be34-ff3ae113ccc7",
"metadata": {},
"source": [
"**Note**: Currently, GPU memory usage is minimal, with no active processes utilizing the GPUs. Throughout our session, we'll employ this command to monitor memory consumption. When conducting GPU-based data analysis, it's advisable to maintain approximately 50% of GPU memory free, allowing for operations that may expand data stored on the device."
]
},
{
"cell_type": "markdown",
"id": "f0839f2e-dfe3-4d8f-8010-ed8445c171fb",
"metadata": {},
"source": [
"<a name='e3'></a>\n",
"### Exercise #3 - Magic Commands ###\n",
"The Jupyter environment come installed with *magic* commands, which can be recognized by the presence of `%` or `%%`. We will be using two magic commands liberally in this workshop: \n",
"* `%time`: prints summary information about how long it took to run code for a single line of code\n",
"* `%%time`: prints summary information about how long it took to run code for an entire cell\n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to import the `time` library. \n",
"* Execute the cell below to time the single line of code. \n",
"* Execute the cell below to time the entire cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c34489-7812-4ffe-bd2e-748a52903481",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"from time import sleep"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db1d5de9-f6e6-4984-8c32-f13b51aa27db",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# %time only times one line\n",
"%time sleep(2) \n",
"sleep(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daf2f6f0-58a9-43a5-af8f-0b69b4a2a3a8",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%time\n",
"# DO NOT CHANGE THIS CELL\n",
"# %%time will time the entire cell\n",
"sleep(1)\n",
"sleep(1)"
]
},
{
"cell_type": "markdown",
"id": "42ed873e-f7b5-4668-8e96-ce31d53d43b1",
"metadata": {},
"source": [
"<a name='e4'></a>\n",
"### Exercise #4 - Jupyter Kernels and GPU Memory ###\n",
"The compute backend for Jupyter is called the *kernel*. The Jupyter environment starts up a separate kernel for each new notebook. The many notebooks in this workshop are each intended to stand alone with regard to memory and computation. \n",
"\n",
"To ensure we have enough memory and compute for each notebook, we can clear the memory at the conclusion of each notebook in two ways: \n",
"1. Shut down the kernel with `ipykernel.kernelapp.IPKernelApp.do_shutdown()` or\n",
"2. Shut down the kernel through the *Running Terminals and Kernels* panel. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to shut down and restart the current kernel. \n",
"* Shut down the current kernel through the *Running Terminals and Kernels* panel.\n",
"\n",
"<p><img src=\"images/kernel_restart.png\" width=720></p>\n",
"\n",
"**Note**: Restarting the kernel from the *Kernel* menu will only clear the memory for *the current notebook's kernel*, while notebooks other than the one we're working on may still have memory allocated for *their unique kernels*. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98e05b77-6019-428b-8e18-a2477692ef6f",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "0321075e-433e-42d4-b849-de3fa17b54e1",
"metadata": {},
"source": [
"**Note**: Executing the provided code cell will shut down the kernel and activate a popup indicating that the kernel has restarted."
]
},
{
"cell_type": "markdown",
"id": "8e950df2",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-01_section_overview.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b604003a",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
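
Note on Exercise #3 in the notebook above: `%time` and `%%time` are IPython magics, so they are only available inside a notebook or IPython session. For plain Python scripts, a rough standard-library equivalent might look like the sketch below. The `timed` helper is a hypothetical name introduced here for illustration, not part of the workshop material, and it reports wall time only (the magics also report CPU time). The sleeps are shortened to keep the example quick.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Hypothetical helper: measure the wall time spent inside the with-block."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the elapsed time so callers can inspect it after the block.
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} s")


# Comparable to `%time sleep(0.2)`: only the one statement is timed.
with timed("single line") as single:
    time.sleep(0.2)

# Comparable to `%%time`: everything inside the block is timed together.
with timed("whole block") as block:
    time.sleep(0.1)
    time.sleep(0.1)
```

Using a context manager keeps the start/stop bookkeeping in one place, which is close in spirit to how `%%time` wraps a whole cell.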

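Note on Exercise #2 in the notebook above: the markdown suggests keeping roughly 50% of GPU memory free. A minimal sketch of checking that from Python is shown below, assuming `nvidia-smi` is on the `PATH` and supports the `--query-gpu` CSV output. The helper names (`parse_gpu_memory`, `free_fraction`, `query_gpu_memory`) are hypothetical and not part of the workshop material; on a machine without `nvidia-smi` the query simply returns an empty list.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line; skip unparsable lines."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            used, total = (int(field) for field in line.split(","))
        except ValueError:
            continue  # e.g. a field reported as "[N/A]"
        rows.append((used, total))
    return rows


def free_fraction(used_mib, total_mib):
    """Fraction of GPU memory that is still free."""
    return 1.0 - used_mib / total_mib


def query_gpu_memory():
    """Return [(used_mib, total_mib)] per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        completed = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_memory(completed.stdout)


for index, (used, total) in enumerate(query_gpu_memory()):
    status = "ok" if free_fraction(used, total) >= 0.5 else "low headroom"
    print(f"GPU {index}: {used}/{total} MiB used ({status})")
```

The 0.5 threshold mirrors the 50% guideline from the notebook text; it is a rule of thumb rather than a hard limit.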
@ -1,78 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "b53a7b12-538d-4459-b82a-a35c8c417849",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "ae497b71-bc43-471e-8970-88a1878e7cf9",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "3a61cc06-80da-4f73-ba61-8ff1b5af71d8",
"metadata": {},
"source": [
"## 01 - Section Overview ##\n",
"\n",
"**Table of Contents**\n",
"This section focuses on data processing. We'll work with multiple datasets, conduct high-level analyses, and prepare the data for subsequent machine learning tasks. \n",
"<br>\n",
"* **1-01_section_overview.ipynb**\n",
"* **1-02_data_manipulation.ipynb**\n",
"* **1-03_memory_management.ipynb**\n",
"* **1-04_interoperability.ipynb**\n",
"* **1-05_grouping.ipynb**\n",
"* **1-06_data_visualization.ipynb**\n",
"* **1-07_etl.ipynb**\n",
"* **1-08_dask-cudf.ipynb**\n",
"* **1-09_cudf-polars.ipynb**"
]
},
{
"cell_type": "markdown",
"id": "9b1485a5-00e8-4495-85b0-b48671674818",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-02_data_manipulation.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "81e47f0a-547e-4714-878d-34eb9b75c835",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -183,7 +183,7 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"Duration: 3.67 seconds\n" "Duration: 3.76 seconds\n"
] ]
} }
], ],
@ -877,7 +877,11 @@
{ {
"cell_type": "raw", "cell_type": "raw",
"id": "738892d3-4bab-4404-af83-b8623804ca5d", "id": "738892d3-4bab-4404-af83-b8623804ca5d",
"metadata": {}, "metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [ "source": [
"\n", "\n",
"df['county'].str.title()" "df['county'].str.title()"
@ -990,7 +994,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 13, "execution_count": 11,
"id": "c65e80f8-1cc3-453c-85f2-910dab451228", "id": "c65e80f8-1cc3-453c-85f2-910dab451228",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1020,7 +1024,7 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"Duration: 0.05 seconds\n" "Duration: 2.17 seconds\n"
] ]
} }
], ],
@ -1079,7 +1083,7 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"Duration: 0.05 seconds\n" "Duration: 0.06 seconds\n"
] ]
} }
], ],
@ -1101,7 +1105,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 14, "execution_count": 13,
"id": "ecadefaa-380c-412c-87af-05c63d3f7871", "id": "ecadefaa-380c-412c-87af-05c63d3f7871",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1131,7 +1135,7 @@
"name": "stdout", "name": "stdout",
"output_type": "stream", "output_type": "stream",
"text": [ "text": [
"Duration: 0.02 seconds\n" "Duration: 0.01 seconds\n"
] ]
} }
], ],
@ -1153,7 +1157,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 15, "execution_count": 14,
"id": "c4a3e4e1-fd83-4024-bcbf-29216c11016f", "id": "c4a3e4e1-fd83-4024-bcbf-29216c11016f",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1176,7 +1180,7 @@
"Name: name, Length: 58479894, dtype: int32" "Name: name, Length: 58479894, dtype: int32"
] ]
}, },
"execution_count": 15, "execution_count": 14,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1201,7 +1205,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 19, "execution_count": 15,
"id": "cf9cc540-1de6-4e50-986a-5bf9bd9056a6", "id": "cf9cc540-1de6-4e50-986a-5bf9bd9056a6",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1358,7 +1362,7 @@
"[5081794 rows x 6 columns]" "[5081794 rows x 6 columns]"
] ]
}, },
"execution_count": 19, "execution_count": 15,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1383,7 +1387,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 20, "execution_count": 16,
"id": "03713403-6575-437d-99f0-c7f8ec3cb13b", "id": "03713403-6575-437d-99f0-c7f8ec3cb13b",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1540,7 +1544,7 @@
"[47085782 rows x 6 columns]" "[47085782 rows x 6 columns]"
] ]
}, },
"execution_count": 20, "execution_count": 16,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1565,7 +1569,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 23, "execution_count": null,
"id": "8c1394db-6bca-473b-a053-61a6066bd835", "id": "8c1394db-6bca-473b-a053-61a6066bd835",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
@ -1583,7 +1587,7 @@
"Name: county, dtype: object" "Name: county, dtype: object"
] ]
}, },
"execution_count": 23, "execution_count": 17,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1591,13 +1595,19 @@
"source": [ "source": [
"sunderland_residents=df.loc[df['county'] =='SUNDERLAND']\n", "sunderland_residents=df.loc[df['county'] =='SUNDERLAND']\n",
"northmost_sunderland_lat=sunderland_residents['lat'].max()\n", "northmost_sunderland_lat=sunderland_residents['lat'].max()\n",
"df.loc[df['lat'] > northmost_sunderland_lat]['county'].unique()" "north = df.loc[df['count'] == 'NORTH YORKSHIRE']\n",
"uzhnee=north['lat'].min()\n",
"df.loc[df['lat'] > northmost_sunderland_lat and df['lat'] < uzhnee]['county'].unique()"
] ]
}, },
{ {
"cell_type": "raw", "cell_type": "raw",
"id": "0cf99881-1a27-409a-822b-7e62b5953f3a", "id": "0cf99881-1a27-409a-822b-7e62b5953f3a",
"metadata": {}, "metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [ "source": [
"\n", "\n",
"sunderland_residents=df.loc[df['county'] == 'SUNDERLAND']\n", "sunderland_residents=df.loc[df['county'] == 'SUNDERLAND']\n",
@ -1633,7 +1643,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 24, "execution_count": null,
"id": "977fdb2b-dbf1-4842-ab0f-31b9af65e0d1", "id": "977fdb2b-dbf1-4842-ab0f-31b9af65e0d1",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1752,7 +1762,7 @@
"4 M Darlington " "4 M Darlington "
] ]
}, },
"execution_count": 24, "execution_count": 18,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1776,7 +1786,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 25, "execution_count": 19,
"id": "66c5d332-a6ef-4f8d-9560-fa860ea1679a", "id": "66c5d332-a6ef-4f8d-9560-fa860ea1679a",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
@ -1788,7 +1798,7 @@
"{'status': 'ok', 'restart': True}" "{'status': 'ok', 'restart': True}"
] ]
}, },
"execution_count": 25, "execution_count": 19,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1854,18 +1864,27 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 1, "execution_count": 9,
"id": "5fed82ae-0ecb-4471-bb8f-060b1bf4542f", "id": "5fed82ae-0ecb-4471-bb8f-060b1bf4542f",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The cudf.pandas extension is already loaded. To reload it, use:\n",
" %reload_ext cudf.pandas\n"
]
}
],
"source": [ "source": [
"# DO NOT CHANGE THIS CELL\n", "# DO NOT CHANGE THIS CELL\n",
"# %load_ext cudf.pandas" "%load_ext cudf.pandas"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 2, "execution_count": 10,
"id": "7671791e-c491-4831-bd1b-956de6b455e5", "id": "7671791e-c491-4831-bd1b-956de6b455e5",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
@ -1878,31 +1897,495 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 3, "execution_count": 11,
"id": "47c87c9f-5b97-4a0d-bfa7-a26c1369314f", "id": "47c87c9f-5b97-4a0d-bfa7-a26c1369314f",
"metadata": { "metadata": {},
"scrolled": true
},
"outputs": [ "outputs": [
{ {
"ename": "ParserError", "name": "stdout",
"evalue": "Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.", "output_type": "stream",
"output_type": "error", "text": [
"traceback": [ "Duration: 2.3 seconds\n"
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mParserError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[3], line 5\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# %%cudf.pandas.line_profile\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;66;03m# DO NOT CHANGE THIS CELL\u001b[39;00m\n\u001b[1;32m 3\u001b[0m start\u001b[38;5;241m=\u001b[39mtime\u001b[38;5;241m.\u001b[39mtime()\n\u001b[0;32m----> 5\u001b[0m df\u001b[38;5;241m=\u001b[39m\u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_csv\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m./data/uk_pop.csv\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 6\u001b[0m current_year\u001b[38;5;241m=\u001b[39mdatetime\u001b[38;5;241m.\u001b[39mnow()\u001b[38;5;241m.\u001b[39myear\n\u001b[1;32m 8\u001b[0m df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mbirth_year\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m=\u001b[39mcurrent_year\u001b[38;5;241m-\u001b[39mdf[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mage\u001b[39m\u001b[38;5;124m'\u001b[39m]\n",
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1026\u001b[0m, in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[1;32m 1013\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[1;32m 1014\u001b[0m dialect,\n\u001b[1;32m 1015\u001b[0m delimiter,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1022\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[1;32m 1023\u001b[0m )\n\u001b[1;32m 1024\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[0;32m-> 1026\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_read\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfilepath_or_buffer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkwds\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py:626\u001b[0m, in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 623\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n\u001b[1;32m 625\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m parser:\n\u001b[0;32m--> 626\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mparser\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnrows\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1923\u001b[0m, in \u001b[0;36mTextFileReader.read\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 1916\u001b[0m nrows \u001b[38;5;241m=\u001b[39m validate_integer(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnrows\u001b[39m\u001b[38;5;124m\"\u001b[39m, nrows)\n\u001b[1;32m 1917\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 1918\u001b[0m \u001b[38;5;66;03m# error: \"ParserBase\" has no attribute \"read\"\u001b[39;00m\n\u001b[1;32m 1919\u001b[0m (\n\u001b[1;32m 1920\u001b[0m index,\n\u001b[1;32m 1921\u001b[0m columns,\n\u001b[1;32m 1922\u001b[0m col_dict,\n\u001b[0;32m-> 1923\u001b[0m ) \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[attr-defined]\u001b[39;49;00m\n\u001b[1;32m 1924\u001b[0m \u001b[43m \u001b[49m\u001b[43mnrows\u001b[49m\n\u001b[1;32m 1925\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1926\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m:\n\u001b[1;32m 1927\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mclose()\n",
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234\u001b[0m, in \u001b[0;36mCParserWrapper.read\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 232\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 233\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlow_memory:\n\u001b[0;32m--> 234\u001b[0m chunks \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_reader\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mread_low_memory\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnrows\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 235\u001b[0m \u001b[38;5;66;03m# destructive to chunks\u001b[39;00m\n\u001b[1;32m 236\u001b[0m data \u001b[38;5;241m=\u001b[39m _concatenate_chunks(chunks)\n",
"File \u001b[0;32mparsers.pyx:838\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader.read_low_memory\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mparsers.pyx:905\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mparsers.pyx:874\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._tokenize_rows\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mparsers.pyx:891\u001b[0m, in \u001b[0;36mpandas._libs.parsers.TextReader._check_tokenize_status\u001b[0;34m()\u001b[0m\n",
"File \u001b[0;32mparsers.pyx:2061\u001b[0m, in \u001b[0;36mpandas._libs.parsers.raise_parser_error\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mParserError\u001b[0m: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'."
] ]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>county</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>name</th>\n",
" <th>birth_year</th>\n",
" <th>sex_normalize</th>\n",
" <th>county_normalize</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.533644</td>\n",
" <td>-1.524401</td>\n",
" <td>Francis</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.426256</td>\n",
" <td>-1.465314</td>\n",
" <td>Edward</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.555200</td>\n",
" <td>-1.496417</td>\n",
" <td>Teddy</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.547906</td>\n",
" <td>-1.572341</td>\n",
" <td>Angus</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.477639</td>\n",
" <td>-1.605995</td>\n",
" <td>Charlie</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age sex county lat long name birth_year \\\n",
"0 0 m DARLINGTON 54.533644 -1.524401 Francis 2025 \n",
"1 0 m DARLINGTON 54.426256 -1.465314 Edward 2025 \n",
"2 0 m DARLINGTON 54.555200 -1.496417 Teddy 2025 \n",
"3 0 m DARLINGTON 54.547906 -1.572341 Angus 2025 \n",
"4 0 m DARLINGTON 54.477639 -1.605995 Charlie 2025 \n",
"\n",
" sex_normalize county_normalize \n",
"0 M Darlington \n",
"1 M Darlington \n",
"2 M Darlington \n",
"3 M Darlington \n",
"4 M Darlington "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Total time elapsed: 4.036 seconds </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Stats </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
"┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
"│ 2 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> start</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 4 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'./data/uk_pop.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ 1.849591138 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 5 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> current_year</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">datetime</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">now()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">year</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 7 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'birth_year'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">current_year</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"background-color: #272822\"> </span> │ 0.015203007 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 9 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex_normalize'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">str</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">upper()</span><span style=\"background-color: #272822\"> </span> │ 0.014988246 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 10 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county_normalize'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">str</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">title()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">str</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">replace(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">' '</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'_…</span> │ 0.064808281 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 11 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'name'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'name'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">str</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">title()</span><span style=\"background-color: #272822\"> </span> │ 0.037497676 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 13 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> print(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">f'Duration: {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">round(time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">start, </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">2</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">} seconds'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 15 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">head())</span><span style=\"background-color: #272822\"> </span> │ 1.427565133 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
"</pre>\n"
],
"text/plain": [
"\u001b[3m \u001b[0m\n",
"\u001b[3m Total time elapsed: 4.036 seconds \u001b[0m\n",
"\u001b[3m \u001b[0m\n",
"\u001b[3m Stats \u001b[0m\n",
"\u001b[3m \u001b[0m\n",
"┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
"┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
"┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
"│ 2 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 4 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_csv\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m./data/uk_pop.csv\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 1.849591138 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 5 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcurrent_year\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdatetime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mnow\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34myear\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 7 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mbirth_year\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcurrent_year\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.015203007 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 9 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex_normalize\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mupper\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.014988246 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 10 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty_normalize\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtitle\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mreplace\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m_\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m…\u001b[0m │ 0.064808281 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 11 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mname\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mname\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstr\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtitle\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.037497676 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 13 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mprint\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mf\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mDuration: \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m{\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mround\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m2\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m}\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m seconds\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 15 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mhead\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 1.427565133 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%cudf.pandas.line_profile\n",
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"current_year=datetime.now().year\n",
"\n",
"df['birth_year']=current_year-df['age']\n",
"\n",
"df['sex_normalize']=df['sex'].str.upper()\n",
"df['county_normalize']=df['county'].str.title().str.replace(' ', '_')\n",
"df['name']=df['name'].str.title()\n",
"\n",
"print(f'Duration: {round(time.time()-start, 2)} seconds')\n",
"\n",
"display(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "27e9495e-493f-40bd-86d6-324cae46c598",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Duration: 4.47 seconds\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>county</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>name</th>\n",
" <th>birth_year</th>\n",
" <th>sex_normalize</th>\n",
" <th>county_normalize</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.533644</td>\n",
" <td>-1.524401</td>\n",
" <td>Francis</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.426256</td>\n",
" <td>-1.465314</td>\n",
" <td>Edward</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.555200</td>\n",
" <td>-1.496417</td>\n",
" <td>Teddy</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.547906</td>\n",
" <td>-1.572341</td>\n",
" <td>Angus</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.477639</td>\n",
" <td>-1.605995</td>\n",
" <td>Charlie</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age sex county lat long name birth_year \\\n",
"0 0 m DARLINGTON 54.533644 -1.524401 Francis 2025 \n",
"1 0 m DARLINGTON 54.426256 -1.465314 Edward 2025 \n",
"2 0 m DARLINGTON 54.555200 -1.496417 Teddy 2025 \n",
"3 0 m DARLINGTON 54.547906 -1.572341 Angus 2025 \n",
"4 0 m DARLINGTON 54.477639 -1.605995 Charlie 2025 \n",
"\n",
" sex_normalize county_normalize \n",
"0 M Darlington \n",
"1 M Darlington \n",
"2 M Darlington \n",
"3 M Darlington \n",
"4 M Darlington "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# %%cudf.pandas.line_profile\n",
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"current_year=datetime.now().year\n",
"\n",
"df['birth_year']=current_year-df['age']\n",
"\n",
"df['sex_normalize']=df['sex'].str.upper()\n",
"df['county_normalize']=df['county'].str.title().str.replace(' ', '_')\n",
"df['name']=df['name'].str.title()\n",
"\n",
"print(f'Duration: {round(time.time()-start, 2)} seconds')\n",
"\n",
"display(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1487644e-2ae9-4cad-aab8-4638e554c5df",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Duration: 79.92 seconds\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>county</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>name</th>\n",
" <th>birth_year</th>\n",
" <th>sex_normalize</th>\n",
" <th>county_normalize</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.533644</td>\n",
" <td>-1.524401</td>\n",
" <td>Francis</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.426256</td>\n",
" <td>-1.465314</td>\n",
" <td>Edward</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.555200</td>\n",
" <td>-1.496417</td>\n",
" <td>Teddy</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.547906</td>\n",
" <td>-1.572341</td>\n",
" <td>Angus</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.477639</td>\n",
" <td>-1.605995</td>\n",
" <td>Charlie</td>\n",
" <td>2025</td>\n",
" <td>M</td>\n",
" <td>Darlington</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age sex county lat long name birth_year \\\n",
"0 0 m DARLINGTON 54.533644 -1.524401 Francis 2025 \n",
"1 0 m DARLINGTON 54.426256 -1.465314 Edward 2025 \n",
"2 0 m DARLINGTON 54.555200 -1.496417 Teddy 2025 \n",
"3 0 m DARLINGTON 54.547906 -1.572341 Angus 2025 \n",
"4 0 m DARLINGTON 54.477639 -1.605995 Charlie 2025 \n",
"\n",
" sex_normalize county_normalize \n",
"0 M Darlington \n",
"1 M Darlington \n",
"2 M Darlington \n",
"3 M Darlington \n",
"4 M Darlington "
]
},
"metadata": {},
"output_type": "display_data"
} }
], ],
"source": [ "source": [
@ -1940,23 +2423,12 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 4, "execution_count": null,
"id": "f1688462-783c-4fea-ae18-5d37524d26d8", "id": "f1688462-783c-4fea-ae18-5d37524d26d8",
"metadata": { "metadata": {
"scrolled": true "scrolled": true
}, },
"outputs": [ "outputs": [],
{
"data": {
"text/plain": [
"{'status': 'ok', 'restart': True}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [ "source": [
"# DO NOT CHANGE THIS CELL\n", "# DO NOT CHANGE THIS CELL\n",
"import IPython\n", "import IPython\n",

View File

@ -647,7 +647,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 17, "execution_count": null,
"id": "99bdd7b0-8563-41db-bd8e-3a7279394ede", "id": "99bdd7b0-8563-41db-bd8e-3a7279394ede",
"metadata": { "metadata": {
"scrolled": true "scrolled": true

View File

@ -490,7 +490,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 8, "execution_count": null,
"id": "0772c391-4ad6-4fcc-8754-97b575bca1c5", "id": "0772c391-4ad6-4fcc-8754-97b575bca1c5",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
@ -512,7 +512,10 @@
} }
], ],
"source": [ "source": [
"df[['county', 'age']].groupby('county')['age']\\\n", "group = df[['county', 'age']].groupby('county').agg([pl.max().alias('max'), pl.min()])\n",
"min = group.min()\n",
"max = group.max()\n",
"['age']\\\n",
" .mean()\\\n", " .mean()\\\n",
" .sort_values(ascending=False)\\\n", " .sort_values(ascending=False)\\\n",
" .head()" " .head()"

File diff suppressed because one or more lines are too long

View File

@ -930,7 +930,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 45, "execution_count": null,
"id": "69e5074d-d15b-471a-a9dd-e1f7a52013a5", "id": "69e5074d-d15b-471a-a9dd-e1f7a52013a5",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

2882
ds/25-1/2/2-03_cugraph.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,669 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2d190a78-7253-4fad-9d9c-6b4fb33c8bf2",
"metadata": {
"tags": []
},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "8a2c4abf-6278-4edd-83f8-f0afac4c834f",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science #"
]
},
{
"cell_type": "markdown",
"id": "e1e78ef4-c0de-433e-8616-bd946f69d30e",
"metadata": {},
"source": [
"## 04 - cuGraph as a NetworkX backend ##"
]
},
{
"cell_type": "markdown",
"id": "0828e0b4-7935-4b77-95ef-e06b72f0319e",
"metadata": {},
"source": [
"**Table of Contents**\n",
"<br>\n",
"This notebook introduces the various methods of utilizing the cuGraph backend for NetworkX and runs centrality algorithms on the dataset. This notebook covers the below sections:\n",
"1. [Background](#Background)\n",
"2. [Installation](#Installation)\n",
"3. [Utilizing nx-cugraph](#Utilizing-nx-cugraph)\n",
" * [Runtime Environment Variable](#Runtime-Environment-Variable)\n",
" * [Backend Keyword Argument](#Backend-Keyword-Argument)\n",
" * [Type-Based Dispatching](#Type-Based-Dispatching)\n",
"4. [Computing Centrality](#Computing-Centrality)\n",
" * [Creating Graph](#Creating-Graph)\n",
" * [Running Centrality Algorithms](#Running-Centrality-Algorithms)\n",
" * [Betweenness Centrality](#Betweenness-Centrality)\n",
" * [Degree Centrality](#Degree-Centrality)\n",
" * [Katz Centrality](#Katz-Centrality)\n",
" * [Pagerank Centrality](#Pagerank-Centrality)\n",
" * [Eigenvector Centrality](#Eigenvector-Centrality)\n",
" * [Visualize Results](#Visualize-Results)\n",
" * [Exercise #1 - Type Dispatch](#Exercise-#1---Type-Dispatch)"
]
},
{
"cell_type": "markdown",
"id": "c57b79ba-c7c7-49d2-9e21-c388bbe6ca98",
"metadata": {},
"source": [
"## Background ##\n",
"RAPIDS recently introduced a new backend to NetworkX called nx-cugraph. With this backend, you can automatically accelerate supported algorithms. In this notebook, we will cover the various methods of enabling the cugraph backend, and use the backend to run different centrality algorithms."
]
},
{
"cell_type": "markdown",
"id": "697ea4c9-b416-43d5-9d2c-28aa41ef2561",
"metadata": {},
"source": [
"## Installation ##\n",
"We have already prepared the environment with nx-cugraph installed. When you are using your own environment, below is the command for installation. "
]
},
{
"cell_type": "raw",
"id": "2fe07200-4f66-4604-9950-40ade1938f4c",
"metadata": {},
"source": [
"pip install nx-cugraph-cu12 --no-deps --extra-index-url https://pypi.anaconda.org/rapidsai-wheels-nightly/simple"
]
},
{
"cell_type": "markdown",
"id": "a9ea09f4-6c93-4785-bcc3-44c6f040dfc6",
"metadata": {},
"source": [
"## Utilizing nx-cugraph ##\n",
"There are 3 ways to utilize nx-cugraph\n",
"\n",
"1. **Environment Variable at Runtime**\n",
"2. **Backend keyword argument**\n",
"3. **Type-Based dispatching**\n",
"\n",
"Let's dig a little deeper in to each of these methods."
]
},
{
"cell_type": "markdown",
"id": "8b4322fd-9f56-4cbc-a00c-8fac4b2b2fe1",
"metadata": {},
"source": [
"### Runtime Environment Variable ###\n",
"The NETWORKX_AUTOMATIC_BACKENDS environment variable can be used to have NetworkX automatically dispatch to specified backends. Set NETWORKX_AUTOMATIC_BACKENDS=cugraph to use nx-cugraph to GPU accelerate supported APIs with no code changes. We will also be loading the cuDF pandas module to accelerate csv loading."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b41fef7f-5d43-4481-98a7-d9f3cb54066c",
"metadata": {},
"outputs": [],
"source": [
"!NETWORKX_AUTOMATIC_BACKENDS=cugraph python -m cudf.pandas scripts/networkx.py"
]
},
{
"cell_type": "markdown",
"id": "5ffb6c4b-a03a-4bfb-9b92-14c59e6dcd75",
"metadata": {},
"source": [
"### Backend Keyword Argument ###\n",
"NetworkX also supports explicitly specifying a particular backend for supported APIs with the backend= keyword argument. This argument takes precedence over the NETWORKX_AUTOMATIC_BACKENDS environment variable. This method also requires that the specified backend already be installed."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8183ecc7-8544-4914-8c07-c904ba12225a",
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"%load_ext cudf.pandas\n",
"import networkx as nx\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Load the CSV file\n",
"road_graph = pd.read_csv('./data/road_graph.csv', dtype=['int32', 'int32', 'float32'], nrows=1000)\n",
"\n",
"# Create an empty graph\n",
"G = nx.from_pandas_edgelist(road_graph, source='src', target='dst', edge_attr='length')\n",
"b = nx.betweenness_centrality(G, k=1000, backend=\"cugraph\")"
]
},
{
"cell_type": "markdown",
"id": "e588aa65-6281-4c19-a51c-42f044636ac0",
"metadata": {},
"source": [
"### Type-Based Dispatching ###\n",
"For users wanting to ensure a particular behavior, without the potential for runtime conversions, NetworkX offers type-based dispatching. To utilize this method, users must import the desired backend and create a Graph instance for it."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5fea9300-8d75-443a-9ec0-ee65c8ccaf0f",
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"import nx_cugraph as nxcg\n",
"\n",
"# Loading data from previous cell\n",
"G = nx.from_pandas_edgelist(road_graph, source='src', target='dst', edge_attr='length') \n",
"\n",
"nxcg_G = nxcg.from_networkx(G) # conversion happens once here\n",
"b = nx.betweenness_centrality(nxcg_G, k=1000) # nxcg Graph type causes cugraph backend to be used, no conversion necessary"
]
},
{
"cell_type": "markdown",
"id": "cb5a17e1-d886-4d20-8d4b-ce900280279c",
"metadata": {},
"source": [
"## Computing Centrality ##\n",
"Now that we learned how to enable nx-cugraph, let's try to use it in a workflow! We will be using the backend argument for this example. First let's create a graph."
]
},
{
"cell_type": "markdown",
"id": "19bea37c-bccf-4815-81bd-aa1de553812d",
"metadata": {},
"source": [
"### Creating Graph ###"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2b4420d7-7c89-4914-809f-4e323a12f47f",
"metadata": {},
"outputs": [],
"source": [
"# Create a graph from already loaded dataframe\n",
"G = nx.from_pandas_edgelist(road_graph, source='src', target='dst', edge_attr='length')"
]
},
{
"cell_type": "markdown",
"id": "7dc1ad5b-8454-4277-9568-0cdacbebd9f1",
"metadata": {},
"source": [
"### Running Centrality Algorithms ###\n",
"Now, let's run the various centrality algorithms!"
]
},
{
"cell_type": "markdown",
"id": "1c52b7b3-6c23-45be-9ace-34a667f132aa",
"metadata": {},
"source": [
"### Betweenness Centrality ###\n",
"Quantifies the number of times a node acts as a bridge along the shortest path between two other nodes, highlighting its importance in information flow"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "281374af-c7cf-4592-a34d-796c1158dab6",
"metadata": {},
"outputs": [],
"source": [
"b = nx.betweenness_centrality(G, backend=\"cugraph\")"
]
},
{
"cell_type": "markdown",
"id": "f98b2975-1f72-4bff-83c7-ace7aab65d98",
"metadata": {},
"source": [
"### Degree Centrality ###\n",
"Measures the number of direct connections a node has, indicating how well-connected it is within the network"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3e0c4460-6d25-4a2b-8b8f-8f8c6ef617b0",
"metadata": {},
"outputs": [],
"source": [
"d = nx.degree_centrality(G, backend=\"cugraph\")"
]
},
{
"cell_type": "markdown",
"id": "0665a659-16b1-48b4-b3bb-9aa5659ef91c",
"metadata": {},
"source": [
"### Katz Centrality ###\n",
"Measures a node's centrality based on its global influence in the network, considering both direct and indirect connections"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8ce418d2-9eda-40bc-9733-b82d8d7556b1",
"metadata": {},
"outputs": [],
"source": [
"k = nx.katz_centrality(G, backend=\"cugraph\")"
]
},
{
"cell_type": "markdown",
"id": "0712cedb-87ba-4a08-a74d-24997d02a636",
"metadata": {},
"source": [
"### Pagerank Centrality ###\n",
"Determines a node's importance based on the quantity and quality of links to it, similar to Google's original PageRank algorithm"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a17ee15b-8758-484b-82b9-a158187231c5",
"metadata": {},
"outputs": [],
"source": [
"p = nx.pagerank(G, max_iter=10, tol=1.0e-3, backend=\"cugraph\")"
]
},
{
"cell_type": "markdown",
"id": "c5f57a5e-95e4-47f7-a9ec-04a99fa2c1dc",
"metadata": {},
"source": [
"### Eigenvector Centrality ###\n",
"Assigns scores to nodes based on the principle that connections to high-scoring nodes contribute more to the node's own score than connections to low-scoring nodes"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3eb1e358-ae8e-4399-bf45-90616b663e9d",
"metadata": {},
"outputs": [],
"source": [
"e = nx.eigenvector_centrality(G, max_iter=1000, tol=1.0e-3, backend=\"cugraph\")"
]
},
{
"cell_type": "markdown",
"id": "0bc9178c-e66a-4c75-bf91-0c5d668b5634",
"metadata": {},
"source": [
"### Visualize Results ###\n",
"Now let's visualize results! We will only display the top 5 rows for readibility. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "69b6c23d-78a0-4dbb-be19-913ad180fe94",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style type=\"text/css\">\n",
"</style>\n",
"<table id=\"T_9f2bb\" style='display:inline'>\n",
" <caption>Degree</caption>\n",
" <thead>\n",
" <tr>\n",
" <th id=\"T_9f2bb_level0_col0\" class=\"col_heading level0 col0\" >vertex</th>\n",
" <th id=\"T_9f2bb_level0_col1\" class=\"col_heading level0 col1\" >degree_centrality</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td id=\"T_9f2bb_row0_col0\" class=\"data row0 col0\" >24</td>\n",
" <td id=\"T_9f2bb_row0_col1\" class=\"data row0 col1\" >0.002847</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_9f2bb_row1_col0\" class=\"data row1 col0\" >72</td>\n",
" <td id=\"T_9f2bb_row1_col1\" class=\"data row1 col1\" >0.002847</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_9f2bb_row2_col0\" class=\"data row2 col0\" >86</td>\n",
" <td id=\"T_9f2bb_row2_col1\" class=\"data row2 col1\" >0.002847</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_9f2bb_row3_col0\" class=\"data row3 col0\" >127</td>\n",
" <td id=\"T_9f2bb_row3_col1\" class=\"data row3 col1\" >0.002847</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_9f2bb_row4_col0\" class=\"data row4 col0\" >133</td>\n",
" <td id=\"T_9f2bb_row4_col1\" class=\"data row4 col1\" >0.002847</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<style type=\"text/css\">\n",
"</style>\n",
"<table id=\"T_c13b0\" style='display:inline'>\n",
" <caption>Betweenness</caption>\n",
" <thead>\n",
" <tr>\n",
" <th id=\"T_c13b0_level0_col0\" class=\"col_heading level0 col0\" >vertex</th>\n",
" <th id=\"T_c13b0_level0_col1\" class=\"col_heading level0 col1\" >betweenness_centrality</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td id=\"T_c13b0_row0_col0\" class=\"data row0 col0\" >222</td>\n",
" <td id=\"T_c13b0_row0_col1\" class=\"data row0 col1\" >0.000007</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_c13b0_row1_col0\" class=\"data row1 col0\" >381</td>\n",
" <td id=\"T_c13b0_row1_col1\" class=\"data row1 col1\" >0.000007</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_c13b0_row2_col0\" class=\"data row2 col0\" >24</td>\n",
" <td id=\"T_c13b0_row2_col1\" class=\"data row2 col1\" >0.000006</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_c13b0_row3_col0\" class=\"data row3 col0\" >72</td>\n",
" <td id=\"T_c13b0_row3_col1\" class=\"data row3 col1\" >0.000006</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_c13b0_row4_col0\" class=\"data row4 col0\" >86</td>\n",
" <td id=\"T_c13b0_row4_col1\" class=\"data row4 col1\" >0.000006</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<style type=\"text/css\">\n",
"</style>\n",
"<table id=\"T_afb59\" style='display:inline'>\n",
" <caption>Katz</caption>\n",
" <thead>\n",
" <tr>\n",
" <th id=\"T_afb59_level0_col0\" class=\"col_heading level0 col0\" >vertex</th>\n",
" <th id=\"T_afb59_level0_col1\" class=\"col_heading level0 col1\" >katz_centrality</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td id=\"T_afb59_row0_col0\" class=\"data row0 col0\" >24</td>\n",
" <td id=\"T_afb59_row0_col1\" class=\"data row0 col1\" >0.033058</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_afb59_row1_col0\" class=\"data row1 col0\" >72</td>\n",
" <td id=\"T_afb59_row1_col1\" class=\"data row1 col1\" >0.033058</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_afb59_row2_col0\" class=\"data row2 col0\" >86</td>\n",
" <td id=\"T_afb59_row2_col1\" class=\"data row2 col1\" >0.033058</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_afb59_row3_col0\" class=\"data row3 col0\" >127</td>\n",
" <td id=\"T_afb59_row3_col1\" class=\"data row3 col1\" >0.033058</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_afb59_row4_col0\" class=\"data row4 col0\" >133</td>\n",
" <td id=\"T_afb59_row4_col1\" class=\"data row4 col1\" >0.033058</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<style type=\"text/css\">\n",
"</style>\n",
"<table id=\"T_bb8df\" style='display:inline'>\n",
" <caption>PageRank</caption>\n",
" <thead>\n",
" <tr>\n",
" <th id=\"T_bb8df_level0_col0\" class=\"col_heading level0 col0\" >vertex</th>\n",
" <th id=\"T_bb8df_level0_col1\" class=\"col_heading level0 col1\" >pagerank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td id=\"T_bb8df_row0_col0\" class=\"data row0 col0\" >24</td>\n",
" <td id=\"T_bb8df_row0_col1\" class=\"data row0 col1\" >0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_bb8df_row1_col0\" class=\"data row1 col0\" >72</td>\n",
" <td id=\"T_bb8df_row1_col1\" class=\"data row1 col1\" >0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_bb8df_row2_col0\" class=\"data row2 col0\" >86</td>\n",
" <td id=\"T_bb8df_row2_col1\" class=\"data row2 col1\" >0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_bb8df_row3_col0\" class=\"data row3 col0\" >127</td>\n",
" <td id=\"T_bb8df_row3_col1\" class=\"data row3 col1\" >0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_bb8df_row4_col0\" class=\"data row4 col0\" >133</td>\n",
" <td id=\"T_bb8df_row4_col1\" class=\"data row4 col1\" >0.002525</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<style type=\"text/css\">\n",
"</style>\n",
"<table id=\"T_f5314\" style='display:inline'>\n",
" <caption>EigenVector</caption>\n",
" <thead>\n",
" <tr>\n",
" <th id=\"T_f5314_level0_col0\" class=\"col_heading level0 col0\" >vertex</th>\n",
" <th id=\"T_f5314_level0_col1\" class=\"col_heading level0 col1\" >eigenvector_centrality</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td id=\"T_f5314_row0_col0\" class=\"data row0 col0\" >24</td>\n",
" <td id=\"T_f5314_row0_col1\" class=\"data row0 col1\" >0.064086</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_f5314_row1_col0\" class=\"data row1 col0\" >72</td>\n",
" <td id=\"T_f5314_row1_col1\" class=\"data row1 col1\" >0.064086</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_f5314_row2_col0\" class=\"data row2 col0\" >86</td>\n",
" <td id=\"T_f5314_row2_col1\" class=\"data row2 col1\" >0.064086</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_f5314_row3_col0\" class=\"data row3 col0\" >127</td>\n",
" <td id=\"T_f5314_row3_col1\" class=\"data row3 col1\" >0.064086</td>\n",
" </tr>\n",
" <tr>\n",
" <td id=\"T_f5314_row4_col0\" class=\"data row4 col0\" >133</td>\n",
" <td id=\"T_f5314_row4_col1\" class=\"data row4 col1\" >0.064086</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import display_html\n",
"dc_top = pd.DataFrame(sorted(d.items(), key=lambda x:x[1], reverse=True)[:5], columns=[\"vertex\", \"degree_centrality\"])\n",
"bc_top = pd.DataFrame(sorted(b.items(), key=lambda x:x[1], reverse=True)[:5], columns=[\"vertex\", \"betweenness_centrality\"])\n",
"katz_top = pd.DataFrame(sorted(k.items(), key=lambda x:x[1], reverse=True)[:5], columns=[\"vertex\", \"katz_centrality\"])\n",
"pr_top = pd.DataFrame(sorted(p.items(), key=lambda x:x[1], reverse=True)[:5], columns=[\"vertex\", \"pagerank\"])\n",
"ev_top = pd.DataFrame(sorted(e.items(), key=lambda x:x[1], reverse=True)[:5], columns=[\"vertex\", \"eigenvector_centrality\"])\n",
"\n",
"df1_styler = dc_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Degree').hide(axis='index')\n",
"df2_styler = bc_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Betweenness').hide(axis='index')\n",
"df3_styler = katz_top.style.set_table_attributes(\"style='display:inline'\").set_caption('Katz').hide(axis='index')\n",
"df4_styler = pr_top.style.set_table_attributes(\"style='display:inline'\").set_caption('PageRank').hide(axis='index')\n",
"df5_styler = ev_top.style.set_table_attributes(\"style='display:inline'\").set_caption('EigenVector').hide(axis='index')\n",
"\n",
"display_html(df1_styler._repr_html_()+df2_styler._repr_html_()+df3_styler._repr_html_()+df4_styler._repr_html_()+df5_styler._repr_html_(), raw=True)"
]
},
{
"cell_type": "markdown",
"id": "1a653ca9-9448-4ba5-85b2-f6c885c273a9",
"metadata": {},
"source": [
"### Exercise #1 - Type Dispatch ###\n",
"Use the type-dispatching method to obtain PageRank centrality results with the cuGraph backend: convert the NetworkX graph to an `nx_cugraph` graph and pass it to `nx.pagerank`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "6eb90078-1479-4847-97b7-eb119e9d5478",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Graph with 1406 nodes and 999 edges\n",
"CudaGraph with 1406 nodes and 999 edges\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vertex</th>\n",
" <th>pagerank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>24</td>\n",
" <td>0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>72</td>\n",
" <td>0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>86</td>\n",
" <td>0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>127</td>\n",
" <td>0.002525</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>133</td>\n",
" <td>0.002525</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" vertex pagerank\n",
"0 24 0.002525\n",
"1 72 0.002525\n",
"2 86 0.002525\n",
"3 127 0.002525\n",
"4 133 0.002525"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import networkx as nx\n",
"import nx_cugraph as nxcg\n",
"\n",
"G = nx.from_pandas_edgelist(road_graph, source='src', target='dst', edge_attr='length')\n",
"nxcg_G = nxcg.from_networkx(G)\n",
"p = nx.pagerank(nxcg_G, max_iter=10, tol=1.0e-3)\n",
"\n",
"print(G)\n",
"print(nxcg_G)\n",
"\n",
"pd.DataFrame(sorted(p.items(), key=lambda x:x[1], reverse=True)[:5], columns=[\"vertex\", \"pagerank\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d70c78b7-551d-4d9e-b428-32b26adcd3c4",
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "2279fdf1-82c0-4c6e-ac8e-b952f4777562",
"metadata": {},
"source": [
"**Well Done!** "
]
},
{
"cell_type": "markdown",
"id": "3fbc12b2-585c-48a9-a176-b2572040d378",
"metadata": {
"tags": []
},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@ -0,0 +1,426 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 02 - K-Means ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook uses GPU-accelerated K-means to find the best locations for a fixed number of humanitarian supply airdrop depots. It covers the following sections: \n",
"1. [Environment](#Environment)\n",
"2. [Load Data](#Load-Data)\n",
"3. [K-Means Clustering](#K-Means-Clustering)\n",
" * [Exercise #1 - Make Another `KMeans` Instance](#Exercise-#1---Make-Another-KMeans-Instance)\n",
"4. [Visualize the Clusters](#Visualize-the-Clusters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment ##\n",
"For the first time we import `cuml`, the RAPIDS GPU-accelerated library containing many common machine learning algorithms. We will be visualizing the results of your work in this notebook, so we also import `cuxfilter`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import cudf\n",
"import cuml\n",
"\n",
"import cuxfilter as cxf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data ##\n",
"For this notebook, we again load the cleaned UK population data. We are not looking at counties here, so we omit that column and keep only the grid coordinate columns."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"northing float64\n",
"easting float64\n",
"dtype: object\n"
]
},
{
"data": {
"text/plain": [
"(58479894, 2)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"gdf = cudf.read_csv('./data/clean_uk_pop.csv', usecols=['easting', 'northing'])\n",
"print(gdf.dtypes)\n",
"gdf.shape"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>northing</th>\n",
" <th>easting</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>515491.5313</td>\n",
" <td>430772.1875</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>503572.4688</td>\n",
" <td>434685.8750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>517903.6563</td>\n",
" <td>432565.5313</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>517059.9063</td>\n",
" <td>427660.6250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>509228.6875</td>\n",
" <td>425527.7813</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" northing easting\n",
"0 515491.5313 430772.1875\n",
"1 503572.4688 434685.8750\n",
"2 517903.6563 432565.5313\n",
"3 517059.9063 427660.6250\n",
"4 509228.6875 425527.7813"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name='#s2-3'></a>\n",
"## K-Means Clustering ##\n",
"The unsupervised K-means clustering algorithm looks for a fixed number *k* of centroids in the data and assigns each point to the cluster of its closest centroid. K-means can be effective when the number of clusters *k* is known or can be estimated well (such as from a model of the underlying mechanics of a problem).\n",
"\n",
"Assume that, in addition to knowing the distribution of the population, we would like to estimate the best locations to build a fixed number of humanitarian supply depots from which we can perform airdrops and reach the population most efficiently. We can use K-means, setting *k* to the number of supply depots available and fitting on the locations of the population, to identify candidate locations.\n",
"\n",
"GPU-accelerated K-means is just as easy to use as its CPU-only scikit-learn counterpart. In this series of exercises, you will use it to optimize the locations for 5 supply depots.\n",
"\n",
"`cuml.KMeans()` initializes a K-means instance. Use it to create an instance called `km`, passing the named argument `n_clusters` set to our desired number, `5`. Use the `km.fit` method to fit `km` to the population's locations by passing it the population data. After fitting, add the cluster labels (`km.labels_`) back to `gdf` in a new column named `cluster`. Finally, use `km.cluster_centers_` to see where the algorithm placed the 5 centroids.\n",
"\n",
"Below we train a K-means clustering algorithm to find 5 clusters. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>306647.898235</td>\n",
" <td>408370.452191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>442109.465392</td>\n",
" <td>402673.747673</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>288997.149971</td>\n",
" <td>553805.430444</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>148770.463641</td>\n",
" <td>311786.805381</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>170553.110214</td>\n",
" <td>521605.459724</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 306647.898235 408370.452191\n",
"1 442109.465392 402673.747673\n",
"2 288997.149971 553805.430444\n",
"3 148770.463641 311786.805381\n",
"4 170553.110214 521605.459724"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# instantiate\n",
"km = cuml.KMeans(n_clusters=5)\n",
"\n",
"# fit\n",
"km.fit(gdf)\n",
"\n",
"# assign cluster as new column\n",
"gdf['cluster'] = km.labels_\n",
"km.cluster_centers_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name='#s2-e1'></a>\n",
"## Exercise #1 - Make Another `KMeans` Instance ##\n",
"\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the first cell below to instantiate a K-means instance with 6 clusters.\n",
"* Modify the `<FIXME>` only and execute the second cell below to fit the data. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"km = cuml.KMeans(n_clusters=6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"km.fit(gdf)\n",
"gdf['cluster'] = km.labels_\n",
"km.cluster_centers_"
]
},
{
"cell_type": "raw",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [
"\n",
"km = cuml.KMeans(n_clusters=6)\n",
"\n",
"km.fit(gdf)\n",
"gdf['cluster'] = km.labels_\n",
"km.cluster_centers_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='#s2-4'></a>\n",
"## Visualize the Clusters ##\n",
"To help us understand where clusters are located, we make a visualization that separates them, using the same three steps as before.\n",
"\n",
"Below we plot the clusters with cuxfilter. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# associate a data source with cuXfilter\n",
"cxf_data = cxf.DataFrame.from_dataframe(gdf)\n",
"\n",
"# define charts\n",
"scatter_chart = cxf.charts.datashader.scatter(x='easting', y='northing')\n",
"\n",
"# define widget using the `cluster` column for multiselect\n",
"# use the same technique to scale the scatterplot, then add a widget to let us select which cluster to look at\n",
"cluster_widget = cxf.charts.panel_widgets.multi_select('cluster')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create dashboard\n",
"dash = cxf_data.dashboard(charts=[scatter_chart], sidebar=[cluster_widget], theme=cxf.themes.dark, data_size_widget=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dash.app()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](3-03_dbscan.ipynb). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

373
ds/25-1/2/3-03_dbscan.ipynb Normal file

@ -0,0 +1,373 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 03 - DBSCAN ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook uses GPU-accelerated DBSCAN to identify clusters of infected people. It covers the following sections: \n",
"1. [Environment](#Environment)\n",
"2. [Load Data](#Load-Data)\n",
"3. [DBSCAN Clustering](#DBSCAN-Clustering)\n",
" * [Exercise #1 - Make Another DBSCAN Instance](#Exercise-#1---Make-Another-DBSCAN-Instance)\n",
"4. [Visualize the Clusters](#Visualize-the-Clusters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment ##"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import cuml\n",
"\n",
"import cuxfilter as cxf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data ##\n",
"For this notebook, we again load a subset of our population data with only the columns we need. An `infected` column has been added to the data to indicate whether or not a person is known to be infected with our simulated virus."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"northing float32\n",
"easting float32\n",
"infected float32\n",
"dtype: object\n"
]
},
{
"data": {
"text/plain": [
"(1000000, 3)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf = cudf.read_csv('./data/pop_sample.csv', dtype=['float32', 'float32', 'float32'])\n",
"print(gdf.dtypes)\n",
"gdf.shape"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>northing</th>\n",
" <th>easting</th>\n",
" <th>infected</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>178547.296875</td>\n",
" <td>368012.1250</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>174068.281250</td>\n",
" <td>543802.1250</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>358293.687500</td>\n",
" <td>435639.8750</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>87240.304688</td>\n",
" <td>389607.3750</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>158261.015625</td>\n",
" <td>340764.9375</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" northing easting infected\n",
"0 178547.296875 368012.1250 0.0\n",
"1 174068.281250 543802.1250 0.0\n",
"2 358293.687500 435639.8750 0.0\n",
"3 87240.304688 389607.3750 0.0\n",
"4 158261.015625 340764.9375 0.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"infected\n",
"0.0 984331\n",
"1.0 15669\n",
"Name: count, dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf['infected'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DBSCAN Clustering ##\n",
"DBSCAN is another unsupervised clustering algorithm. It is particularly effective when the number of clusters is not known up front and the clusters may have concave or other unusual shapes, a situation that often applies in geospatial analytics.\n",
"\n",
"In this series of exercises, you will use DBSCAN to identify clusters of infected people by location, which may help us identify groups becoming infected from common patient zeroes and assist in response planning.\n",
"\n",
"Create a DBSCAN instance by using `cuml.DBSCAN`. Pass the named argument `eps` (the maximum distance a point can be from the nearest point in a cluster and still possibly be considered part of that cluster) set to `5000`. Since the `northing` and `easting` values we created are measured in meters, this will allow us to identify clusters of infected people where individuals may be separated from the rest of the cluster by up to 5 kilometers.\n",
"\n",
"Below we train a DBSCAN algorithm. We start by creating a new dataframe from the rows of the original dataframe where `infected` is `1` (true) and call it `infected_df`; be sure to reset the dataframe's index afterward. Use `dbscan.fit_predict` to perform clustering on the `northing` and `easting` columns of `infected_df`, and turn the resulting series into a new column in `infected_df` called \"cluster\". Finally, compute the number of distinct cluster labels that DBSCAN produced."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"96"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dbscan = cuml.DBSCAN(eps=5000)\n",
"# dbscan = cuml.DBSCAN(eps=10000)\n",
"\n",
"infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
"infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
"infected_df['cluster'].nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise #1 - Make Another DBSCAN Instance ###\n",
"\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the first cell below to instantiate a DBSCAN instance with `10000` for `eps`.\n",
"* Modify the `<FIXME>` only and execute the second cell below to fit the data and identify infected clusters. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"dbscan = cuml.DBSCAN(eps=10000)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
"infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
"infected_df['cluster'].nunique()"
]
},
{
"cell_type": "raw",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [
"\n",
"dbscan = cuml.DBSCAN(eps=10000)\n",
"\n",
"infected_df = gdf[gdf['infected'] == 1].reset_index()\n",
"infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])\n",
"infected_df['cluster'].nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the Clusters ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we have the same column names as in the K-means example (`easting`, `northing`, and `cluster`), we could reuse the same cuxfilter code to visualize the clusters. The cell below instead draws a quick scatter plot with pandas, colored by cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"infected_df.to_pandas().plot(kind='scatter', x='easting', y='northing', c='cluster')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](3-04_logistic_regression.ipynb). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

155148
ds/25-1/2e/BoardingData.csv Normal file

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

48060
ds/25-1/2e/worldcities.csv Executable file

File diff suppressed because it is too large