feat(ds-2.1, circuit-3)

This commit is contained in:
2025-10-24 22:04:14 +03:00
parent 4a27006658
commit dd905ac0c9
24 changed files with 12601 additions and 0 deletions

BIN
circuit/25-1/3/TJK.png Normal file

BIN
circuit/25-1/3/TT.png Normal file

BIN
circuit/25-1/3/TT_table.png Normal file

BIN
circuit/25-1/3/lab3.pdf Normal file

BIN
circuit/25-1/3/schema.png Normal file


@@ -0,0 +1,278 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "19051402",
"metadata": {
"tags": []
},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "67ed6062",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science #"
]
},
{
"cell_type": "markdown",
"id": "a65f57f0",
"metadata": {},
"source": [
"## 00 - Introduction ##\n",
"Welcome to NVIDIA's Deep Learning Institute workshop on the Fundamentals of Accelerated Data Science. This interactive lab offers practical experience with every stage of the development process, empowering participants to tailor solutions for their unique applications."
]
},
{
"cell_type": "markdown",
"id": "50d32b6c",
"metadata": {},
"source": [
"**Learning Objectives**\n",
"<br>\n",
"In this workshop, you will learn: \n",
"* Overview of data science\n",
"* Demonstrations of data science workflows\n",
"* How acceleration is achieved\n",
"* How to design operations to maximize GPU acceleration\n",
"* Implications of acceleration"
]
},
{
"cell_type": "markdown",
"id": "3a02c2b6",
"metadata": {},
"source": [
"### JupyterLab ###\n",
"For this hands-on lab, we use [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) to manage our environment. The [JupyterLab Interface](https://jupyterlab.readthedocs.io/en/stable/user/interface.html) is a dashboard that provides access to interactive IPython notebooks, as well as the folder structure of our environment and a terminal window into the Ubuntu operating system. The first view includes a **menu bar** at the top, a **file browser** in the **left sidebar**, and a **main work area** that is initially open to this \"introduction\" notebook. \n",
"\n",
"<p><img src=\"images/jl_launcher.png\" width=720></p>\n",
"\n",
"* The file browser can be navigated just like any other file explorer. A double click on any of the items will open a new tab with its content. \n",
"* The main work area includes tabbed views of open files that can be closed, moved, and edited as needed. \n",
"* The notebooks, including this one, consist of a series of content and code **cells**. To execute code in a code cell, press `Shift+Enter` or the `Run` button in the menu bar above, while a cell is highlighted. Sometimes, a content cell will get switched to editing mode. Executing the cell with `Shift+Enter` or the `Run` button will switch it back to a readable form.\n",
"* To interrupt cell execution, click the `Stop` button in the menu bar or navigate to the `Kernel` menu, and select `Interrupt Kernel`. \n",
"* We can run terminal commands in notebook cells by prefixing the command with an exclamation point, or bang (`!`).\n",
"* We can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below)."
]
},
{
"cell_type": "markdown",
"id": "4492c58d",
"metadata": {},
"source": [
"<a name='e1'></a>\n",
"### Exercise #1 - Practice ###\n",
"**Instructions**: <br>\n",
"* Try executing the simple print statement in the below cell.\n",
"* Then try executing the terminal command in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e69a6515",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# activate this cell by selecting it with the mouse or arrow keys then use the keyboard shortcut [Shift+Enter] to execute\n",
"print('This is just a simple print statement.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e54fe372",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!echo 'This is another simple print statement.'"
]
},
{
"cell_type": "markdown",
"id": "c2e5151b-4842-465e-a20d-bb64af66d011",
"metadata": {},
"source": [
"<a name='e2'></a>\n",
"### Exercise #2 - Available GPU Accelerators ###\n",
"The `nvidia-smi` (NVIDIA System Management Interface) command is a powerful utility for managing and monitoring NVIDIA GPU devices. It will print information about available GPUs, their current memory usage, and any processes currently utilizing them. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to learn about this environment's available GPUs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08d543eb-a951-4eb9-8107-b13c01b3ac46",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "adee74e3-613a-4986-be34-ff3ae113ccc7",
"metadata": {},
"source": [
"**Note**: Currently, GPU memory usage is minimal, with no active processes utilizing the GPUs. Throughout our session, we'll employ this command to monitor memory consumption. When conducting GPU-based data analysis, it's advisable to maintain approximately 50% of GPU memory free, allowing for operations that may expand data stored on the device."
]
},
{
"cell_type": "markdown",
"id": "f0839f2e-dfe3-4d8f-8010-ed8445c171fb",
"metadata": {},
"source": [
"<a name='e3'></a>\n",
"### Exercise #3 - Magic Commands ###\n",
"The Jupyter environment comes preinstalled with *magic* commands, recognized by a `%` or `%%` prefix. We will be using two magic commands liberally in this workshop: \n",
"* `%time`: prints timing information for a single line of code\n",
"* `%%time`: prints timing information for an entire cell\n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to import the `time` library. \n",
"* Execute the cell below to time the single line of code. \n",
"* Execute the cell below to time the entire cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c34489-7812-4ffe-bd2e-748a52903481",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"from time import sleep"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db1d5de9-f6e6-4984-8c32-f13b51aa27db",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# %time only times one line\n",
"%time sleep(2) \n",
"sleep(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daf2f6f0-58a9-43a5-af8f-0b69b4a2a3a8",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%time\n",
"# DO NOT CHANGE THIS CELL\n",
"# %%time will time the entire cell\n",
"sleep(1)\n",
"sleep(1)"
]
},
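As a standalone sketch outside the notebook (plain Python, not part of the original lab), the same two measurements can be approximated with `time.perf_counter`; the sleep durations here are illustrative:

```python
from time import perf_counter, sleep

# like %time: measure a single statement
start = perf_counter()
sleep(0.1)
line_elapsed = perf_counter() - start

# like %%time: measure a whole block of statements
start = perf_counter()
sleep(0.1)
sleep(0.1)
cell_elapsed = perf_counter() - start

print(f'line: {line_elapsed:.2f}s, cell: {cell_elapsed:.2f}s')
```

The magics are simply a convenient wrapper around this kind of wall-clock bookkeeping.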
{
"cell_type": "markdown",
"id": "42ed873e-f7b5-4668-8e96-ce31d53d43b1",
"metadata": {},
"source": [
"<a name='e4'></a>\n",
"### Exercise #4 - Jupyter Kernels and GPU Memory ###\n",
"The compute backend for Jupyter is called the *kernel*. The Jupyter environment starts up a separate kernel for each new notebook. The many notebooks in this workshop are each intended to stand alone with regard to memory and computation. \n",
"\n",
"To ensure we have enough memory and compute for each notebook, we can clear the memory at the conclusion of each notebook in two ways: \n",
"1. Shut down the kernel programmatically with the kernel's `do_shutdown()` method (as in the code cell below), or\n",
"2. Shut down the kernel through the *Running Terminals and Kernels* panel. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to shut down and restart the current kernel. \n",
"* Shut down the current kernel through the *Running Terminals and Kernels* panel.\n",
"\n",
"<p><img src=\"images/kernel_restart.png\" width=720></p>\n",
"\n",
"**Note**: Restarting the kernel from the *Kernel* menu will only clear the memory for *the current notebook's kernel*, while notebooks other than the one we're working on may still have memory allocated for *their unique kernels*. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98e05b77-6019-428b-8e18-a2477692ef6f",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "0321075e-433e-42d4-b849-de3fa17b54e1",
"metadata": {},
"source": [
"**Note**: Executing the provided code cell will shut down the kernel and activate a popup indicating that the kernel has restarted."
]
},
{
"cell_type": "markdown",
"id": "8e950df2",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-01_section_overview.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b604003a",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,78 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "b53a7b12-538d-4459-b82a-a35c8c417849",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "ae497b71-bc43-471e-8970-88a1878e7cf9",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "3a61cc06-80da-4f73-ba61-8ff1b5af71d8",
"metadata": {},
"source": [
"## 01 - Section Overview ##\n",
"\n",
"**Table of Contents**\n",
"This section focuses on data processing. We'll work with multiple datasets, conduct high-level analyses, and prepare the data for subsequent machine learning tasks. \n",
"<br>\n",
"* **1-01_section_overview.ipynb**\n",
"* **1-02_data_manipulation.ipynb**\n",
"* **1-03_memory_management.ipynb**\n",
"* **1-04_interoperability.ipynb**\n",
"* **1-05_grouping.ipynb**\n",
"* **1-06_data_visualization.ipynb**\n",
"* **1-07_etl.ipynb**\n",
"* **1-08_dask-cudf.ipynb**\n",
"* **1-09_cudf-polars.ipynb**"
]
},
{
"cell_type": "markdown",
"id": "9b1485a5-00e8-4495-85b0-b48671674818",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-02_data_manipulation.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "81e47f0a-547e-4714-878d-34eb9b75c835",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because it is too large


@@ -0,0 +1,958 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "def31b0f-921a-43eb-9807-8b9b31eb7b32",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "4a0fd4dd-f7be-4c90-8ddd-384a760ac04f",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "6a8fdf2e-a481-455e-8a52-8be8472b63bf",
"metadata": {},
"source": [
"## 03 - Memory Management ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook explores the dynamics between data and memory. It covers the below sections: \n",
"1. [Memory Management](#Memory-Management)\n",
" * [Memory Usage](#Memory-Usage)\n",
"2. [Data Types](#Data-Types)\n",
" * [Convert Data Types](#Convert-Data-Types)\n",
" * [Exercise #1 - Modify `dtypes`](#Exercise-#1---Modify-dtypes)\n",
" * [Categorical](#Categorical)\n",
"3. [Efficient Data Loading](#Efficient-Data-Loading)"
]
},
{
"cell_type": "markdown",
"id": "1b59367c-48bc-4c72-b1f4-4cfdfa5470cf",
"metadata": {},
"source": [
"## Memory Management ##\n",
"During the data acquisition process, data is transferred to memory in order to be operated on by the processor. Memory management is crucial for cuDF and GPU operations for several key reasons: \n",
"* **Limited GPU memory**: GPUs typically have less memory than CPUs, therefore efficient memory management is essential to maximize the use of available GPU memory, especially for large datasets.\n",
"* **Data transfer overhead**: Transferring data between CPU and GPU memory is relatively slow compared to GPU computation speed. Minimizing these transfers through smart memory management is critical for performance.\n",
"* **Performance tuning**: Understanding and optimizing memory usage is key to achieving peak performance in GPU-accelerated data processing tasks.\n",
"\n",
"When done correctly, keeping the data on the GPU can enable cuDF and the RAPIDS ecosystem to achieve significant performance improvements, handle larger datasets, and provide more efficient data processing capabilities. \n",
"\n",
"Below we import the data from the csv file. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b7b8a623-f799-4dad-aca9-0e571bb6e527",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import pandas as pd\n",
"import random\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "711d0a7f-8598-49fc-949c-5caf6029ce47",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>county</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.533644</td>\n",
" <td>-1.524401</td>\n",
" <td>FRANCIS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.426256</td>\n",
" <td>-1.465314</td>\n",
" <td>EDWARD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.555200</td>\n",
" <td>-1.496417</td>\n",
" <td>TEDDY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.547906</td>\n",
" <td>-1.572341</td>\n",
" <td>ANGUS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>m</td>\n",
" <td>DARLINGTON</td>\n",
" <td>54.477639</td>\n",
" <td>-1.605995</td>\n",
" <td>CHARLIE</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age sex county lat long name\n",
"0 0 m DARLINGTON 54.533644 -1.524401 FRANCIS\n",
"1 0 m DARLINGTON 54.426256 -1.465314 EDWARD\n",
"2 0 m DARLINGTON 54.555200 -1.496417 TEDDY\n",
"3 0 m DARLINGTON 54.547906 -1.572341 ANGUS\n",
"4 0 m DARLINGTON 54.477639 -1.605995 CHARLIE"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"\n",
"# preview\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "36416fd0-7081-42aa-bf31-d1231b81ec0b",
"metadata": {},
"source": [
"### Memory Usage ###\n",
"Memory utilization of a DataFrame depends on the data types of each column.\n",
"\n",
"<p><img src='images/dtypes.png' width=720></p>\n",
"\n",
"We can use `DataFrame.memory_usage()` to see the memory usage of each column (in bytes). Most common data types have a fixed size in memory, such as `int`, `float`, `datetime`, and `bool`. Memory usage for these data types is the per-element size multiplied by the number of data points. For the `string` data type, the memory usage reported _by pandas_ is the number of elements times 8 bytes. This accounts for the 64 bits (8 bytes) required for the pointer to each string's address in memory, but not for the actual string values. The actual memory required for a string value is 49 bytes of overhead plus an additional byte for each character. Passing `deep=True` provides a more accurate report that accounts for the system-level memory consumption of the contained `string` values. \n",
"\n",
"Below we get the memory usage. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8378207b-2d9e-4102-8408-c2dddafc8a40",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"age 467839152\n",
"sex 3391833852\n",
"county 3934985133\n",
"lat 467839152\n",
"long 467839152\n",
"name 3666922374\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# pandas memory utilization\n",
"mem_usage_df=df.memory_usage(deep=True)\n",
"mem_usage_df"
]
},
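The per-string accounting described above (a 49-byte object header plus one byte per ASCII character) can be checked directly with `sys.getsizeof`; a small standalone sketch, assuming a 64-bit CPython build:

```python
import sys

# CPython stores ASCII strings compactly: a fixed object header plus
# one byte per character (49-byte header on a 64-bit build)
overhead = sys.getsizeof('')
name = 'FRANCIS'

print(overhead, sys.getsizeof(name))  # header, header + 7
```

This is the estimate that `memory_usage(deep=True)` sums over every element of a `string` column, on top of the 8-byte pointers.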
{
"cell_type": "markdown",
"id": "07c24bb1-c4f7-440c-a949-d4c57800ec61",
"metadata": {},
"source": [
"Below we define a `make_decimal()` function to convert a byte count into human-readable units based on powers of 2 (dividing by 1024 per step). In contrast to units based on powers of 10, this customary convention is commonly used to report memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units). "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5ae42218-1547-49fd-9123-ab508a2b03de",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
"def make_decimal(nbytes):\n",
" i=0\n",
" while nbytes >= 1024 and i < len(suffixes)-1:\n",
" nbytes/=1024.\n",
" i+=1\n",
" f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
" return '%s %s' % (f, suffixes[i])"
]
},
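To sanity-check the conversion, the helper can be exercised on a few round byte counts (re-defined here so the sketch stands alone):

```python
suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']

def make_decimal(nbytes):
    # repeatedly divide by 1024 (binary units) until under one step
    i = 0
    while nbytes >= 1024 and i < len(suffixes) - 1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])

print(make_decimal(1536))       # 1.5 kB
print(make_decimal(1024 ** 3))  # 1 GB
```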
{
"cell_type": "code",
"execution_count": 5,
"id": "e6d4a613-3eea-4dce-8e71-39593ff6f226",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'11.55 GB'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"make_decimal(mem_usage_df.sum())"
]
},
{
"cell_type": "markdown",
"id": "a352c0b2-65aa-4231-b753-556aca46ff49",
"metadata": {},
"source": [
"Below we calculate the memory usage manually based on the data types. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "630327b9-6dc1-4b70-9fdf-9f7763ec4d50",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Numerical columns use 467839152 bytes of memory\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# get number of rows\n",
"num_rows=len(df)\n",
"\n",
"# 64-bit numbers uses 8 bytes of memory\n",
"print(f'Numerical columns use {num_rows*8} bytes of memory')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bb22b5f4-e38f-438e-9426-61746b509e50",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"county column uses 3934985133 bytes of memory.\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# check random string-typed column\n",
"string_cols=[col for col in df.columns if df[col].dtype=='object' ]\n",
"column_to_check=random.choice(string_cols)\n",
"\n",
"overhead=49\n",
"pointer_size=8\n",
"\n",
"# nan==nan when value is not a number\n",
"# nan uses 32 bytes of memory\n",
"string_col_mem_usage_df=df[column_to_check].map(lambda x: len(x)+overhead+pointer_size if x else 32)\n",
"string_col_mem_usage=string_col_mem_usage_df.sum()\n",
"print(f'{column_to_check} column uses {string_col_mem_usage} bytes of memory.')"
]
},
{
"cell_type": "markdown",
"id": "94e393c2-c0d0-40ee-82d2-730c4667e9b8",
"metadata": {},
"source": [
"**Note**: The `string` data type is stored differently in cuDF than it is in pandas. More information about how `libcudf` stores string data using the [Arrow format](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) can be found [here](https://developer.nvidia.com/blog/mastering-string-transformations-in-rapids-libcudf/). "
]
},
{
"cell_type": "markdown",
"id": "737ff50b-9426-4e08-a00a-d7ee69f48b9f",
"metadata": {},
"source": [
"## Data Types ##\n",
"By default, pandas (and cuDF) uses 64-bit containers for numerical values. Using 64-bit numbers provides the highest precision, but many applications do not require 64-bit precision, even when aggregating over a very large number of data points. When possible, using 32-bit numbers cuts storage and memory requirements in half and also typically speeds up computations considerably, because only half as much data needs to be moved through memory. "
]
},
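The halving is easy to see on a raw array; a small sketch with NumPy (array size illustrative, not the lab's dataset):

```python
import numpy as np

n = 1_000_000
a64 = np.zeros(n, dtype=np.float64)  # 8 bytes per element
a32 = a64.astype(np.float32)         # 4 bytes per element

print(a64.nbytes, a32.nbytes)  # 8000000 4000000
```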
{
"cell_type": "markdown",
"id": "0b77d450-c415-44b8-87ac-20ce616ec809",
"metadata": {},
"source": [
"### Convert Data Types ###\n",
"The `.astype()` method can be used to convert numerical data types to use different bit-size containers. Here we convert the `age` column from `int64` to `int8`. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "603f7c70-134e-4466-a790-8a18b9088ca6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age int8\n",
"sex object\n",
"county object\n",
"lat float64\n",
"long float64\n",
"name object\n",
"dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df['age']=df['age'].astype('int8')\n",
"\n",
"df.dtypes"
]
},
{
"cell_type": "markdown",
"id": "973a6dd4-2aef-44d9-8b01-8853032eddae",
"metadata": {},
"source": [
"### Exercise #1 - Modify `dtypes` ###\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the below cell to convert any 64-bit data types to their 32-bit counterparts."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "beb7d71b-6672-462e-b65c-a64dbe5f7a57",
"metadata": {},
"outputs": [],
"source": [
"df['lat']=df['lat'].astype('float32')\n",
"df['long']=df['long'].astype('float32')"
]
},
{
"cell_type": "raw",
"id": "3b44fb22-a0f1-4e43-a332-1ccbad50caee",
"metadata": {},
"source": [
"\n",
"df['lat']=df['lat'].astype('float32')\n",
"df['long']=df['long'].astype('float32')"
]
},
{
"cell_type": "markdown",
"id": "98b6542d-22cc-4926-b600-a3e052c37c96",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "markdown",
"id": "7b2cd622-977c-4915-a87f-2fe03c1793f5",
"metadata": {},
"source": [
"### Categorical ###\n",
"Categorical data is a type of data that represents discrete, distinct categories or groups. Categories can have a meaningful order or ranking but generally cannot be used for numerical operations. When appropriate, using the `categorical` data type can reduce memory usage and lead to faster operations. It can also be used to define and maintain a custom order of categories. \n",
"\n",
"Below we get the number of unique values in the string columns. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f249e4b8-5d7a-4b44-ac15-bd3360a43f2a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sex 2\n",
"county 171\n",
"name 13212\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df.select_dtypes(include='object').nunique()"
]
},
{
"cell_type": "markdown",
"id": "f1d8bd88-b39b-4043-9039-d8bd75fe851a",
"metadata": {},
"source": [
"Below we convert columns with few discrete values to `category`. The `category` data type has `.categories` and `.codes` properties that are accessed through the `.cat` accessor. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a99bebbf-2e5b-4720-96f9-9fd7d42d2fe8",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df['sex']=df['sex'].astype('category')\n",
"df['county']=df['county'].astype('category')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "41b7b290-cfcf-4ff6-b6b4-454c19b44a62",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['BARKING AND DAGENHAM', 'BARNET', 'BARNSLEY',\n",
" 'BATH AND NORTH EAST SOMERSET', 'BEDFORD', 'BEXLEY', 'BIRMINGHAM',\n",
" 'BLACKBURN WITH DARWEN', 'BLACKPOOL', 'BLAENAU GWENT',\n",
" ...\n",
" 'WESTMINSTER', 'WIGAN', 'WILTSHIRE', 'WINDSOR AND MAIDENHEAD', 'WIRRAL',\n",
" 'WOKINGHAM', 'WOLVERHAMPTON', 'WORCESTERSHIRE', 'WREXHAM', 'YORK'],\n",
" dtype='object', length=171)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"----------------------------------------\n"
]
},
{
"data": {
"text/plain": [
"0 37\n",
"1 37\n",
"2 37\n",
"3 37\n",
"4 37\n",
" ..\n",
"58479889 96\n",
"58479890 96\n",
"58479891 96\n",
"58479892 96\n",
"58479893 96\n",
"Length: 58479894, dtype: int16"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"display(df['county'].cat.categories)\n",
"print('-'*40)\n",
"display(df['county'].cat.codes)"
]
},
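A small self-contained sketch of the memory effect, using synthetic low-cardinality data in the spirit of the `county` column (plain pandas; with `cudf.pandas` loaded the same code runs on the GPU):

```python
import pandas as pd

# a repetitive string column: only 2 distinct values across 100k rows
s_obj = pd.Series(['DARLINGTON', 'YORK'] * 50_000)  # object dtype
s_cat = s_obj.astype('category')

# category stores each distinct value once, plus small integer codes
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)

print(s_cat.cat.categories.tolist())  # ['DARLINGTON', 'YORK']
```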
{
"cell_type": "markdown",
"id": "737385ab-677c-4bef-a86a-10aa3119e29a",
"metadata": {},
"source": [
"**Note**: `.astype()` can also be used to convert data to `datetime` or `object` to enable datetime and string methods. "
]
},
{
"cell_type": "markdown",
"id": "552c47c2-0fbc-455e-8745-cb98fc777243",
"metadata": {},
"source": [
"## Efficient Data Loading ##\n",
"It is often advantageous to specify the most appropriate data type for each column at load time, based on its range, precision requirements, and how it is used. "
]
},
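A minimal sketch of load-time type declaration, using an in-memory CSV with the lab's column names (the two rows of sample data are made up for illustration):

```python
import io
import pandas as pd

csv = io.StringIO(
    'age,sex,county,lat,long,name\n'
    '0,m,DARLINGTON,54.53,-1.52,FRANCIS\n'
    '0,m,DARLINGTON,54.42,-1.46,EDWARD\n'
)

# declare compact types up front instead of converting after the load
dtype_dict = {
    'age': 'int8',
    'sex': 'category',
    'county': 'category',
    'lat': 'float64',
    'long': 'float64',
    'name': 'category',
}
df = pd.read_csv(csv, dtype=dtype_dict)
print(df.dtypes)
```

Declaring dtypes in `read_csv` avoids ever materializing the wasteful 64-bit/object representation.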
{
"cell_type": "code",
"execution_count": 14,
"id": "c2b9f0c3-8598-4a28-9481-ce28fea7544b",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"age 467839152\n",
"sex 3391833852\n",
"county 3934985133\n",
"lat 467839152\n",
"long 467839152\n",
"name 3666922374\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 11.55 GB took 33.63 seconds.\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"duration=time.time()-start\n",
"\n",
"mem_usage_df=df.memory_usage(deep=True)\n",
"display(mem_usage_df)\n",
"\n",
"print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
]
},
{
"cell_type": "markdown",
"id": "5729520e-3ed8-4ec6-ae1f-ba46d642f48d",
"metadata": {},
"source": [
"Below we enable `cudf.pandas` to see the difference. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "99aa0f32-4d2a-43a7-bec1-f1b88bcc37c2",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"%load_ext cudf.pandas\n",
"\n",
"import pandas as pd\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2b724201-9ad1-4e9b-b712-f3b31bdc4104",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
"def make_decimal(nbytes):\n",
" i=0\n",
" while nbytes >= 1024 and i < len(suffixes)-1:\n",
" nbytes/=1024.\n",
" i+=1\n",
" f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
" return '%s %s' % (f, suffixes[i])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "99bdd7b0-8563-41db-bd8e-3a7279394ede",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"age 58479894\n",
"sex 58479908\n",
"county 58482446\n",
"lat 467839152\n",
"long 467839152\n",
"name 117096917\n",
"Index 0\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 1.14 GB took 2.13 seconds.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Total time elapsed: 2.705 seconds </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Stats </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
"┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
"│ 2 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> start</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 5 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> dtype_dict</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">{</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 6 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'int8'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 7 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 8 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 9 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 10 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 11 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'name'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 14 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'./data/uk_pop.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, dtype</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">dtype_dict)</span><span style=\"background-color: #272822\"> </span> │ 1.728013188 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 15 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> duration</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">start</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 17 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">memory_usage(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'deep'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ 0.005340174 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 18 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(mem_usage_df)</span><span style=\"background-color: #272822\"> </span> │ 0.011073721 │ 0.006896915 │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 20 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> print(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">f'Loading {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">make_decimal(mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">sum())</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">} took {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">round(dura…</span> │ 0.004693074 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
"</pre>\n"
],
"text/plain": [
"\u001b[3m \u001b[0m\n",
"\u001b[3m Total time elapsed: 2.705 seconds \u001b[0m\n",
"\u001b[3m \u001b[0m\n",
"\u001b[3m Stats \u001b[0m\n",
"\u001b[3m \u001b[0m\n",
"┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
"┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
"┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
"│ 2 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 5 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdtype_dict\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m{\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 6 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mint8\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 7 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcategory\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 8 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcategory\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 9 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mfloat64\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 10 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mfloat64\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 11 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mname\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m:\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcategory\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 14 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mefficient_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_csv\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m./data/uk_pop.csv\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdtype\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdtype_dict\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 1.728013188 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 15 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mduration\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mtime\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstart\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 17 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmem_usage_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mefficient_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmemory_usage\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mdeep\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.005340174 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 18 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmem_usage_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m │ 0.011073721 │ 0.006896915 │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"│ 20 │ \u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mprint\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mf\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mLoading \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m{\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmake_decimal\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmem_usage_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msum\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m}\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m took \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m{\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mround\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdura…\u001b[0m │ 0.004693074 │ │\n",
"│ │ \u001b[48;2;39;40;34m \u001b[0m │ │ │\n",
"└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%cudf.pandas.line_profile\n",
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"\n",
"# define data types for each column\n",
"dtype_dict={\n",
" 'age': 'int8', \n",
" 'sex': 'category', \n",
" 'county': 'category', \n",
" 'lat': 'float64', \n",
" 'long': 'float64', \n",
" 'name': 'category'\n",
"}\n",
" \n",
"efficient_df=pd.read_csv('./data/uk_pop.csv', dtype=dtype_dict)\n",
"duration=time.time()-start\n",
"\n",
"mem_usage_df=efficient_df.memory_usage('deep')\n",
"display(mem_usage_df)\n",
"\n",
"print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
]
},
{
"cell_type": "markdown",
"id": "0f4607d8-6de3-4b27-96d4-a9720d268333",
"metadata": {},
"source": [
"By specifying compact dtypes up front, we loaded the data faster and with a much smaller memory footprint. \n",
"\n",
"**Note**: Notice that the memory utilized on the GPU is larger than the memory used by the DataFrame. This is expected: intermediate steps in the loading process, chiefly parsing the CSV file, temporarily allocate additional GPU memory. \n",
"\n",
"```\n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 Off | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 32C P0 26W / 70W | 1378MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 Off | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 31C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 Off | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n",
"```"
]
},
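{
"cell_type": "markdown",
"id": "b3e1f7a2-9c41-4d5a-8e02-000000000001",
"metadata": {},
"source": [
"As a quick illustration of why the `category` and `int8` dtypes shrink the DataFrame, the following sketch (using a small synthetic frame, not the workshop data) compares per-column memory before and after casting:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3e1f7a2-9c41-4d5a-8e02-000000000002",
"metadata": {},
"outputs": [],
"source": [
"# sketch with synthetic data (not part of the original workshop)\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'age': [25, 61, 34] * 1000, 'sex': ['m', 'f', 'f'] * 1000})\n",
"before = df.memory_usage(deep=True)\n",
"after = df.astype({'age': 'int8', 'sex': 'category'}).memory_usage(deep=True)\n",
"\n",
"# side-by-side per-column byte counts\n",
"display(pd.DataFrame({'before': before, 'after': after}))"
]
},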
{
"cell_type": "code",
"execution_count": 18,
"id": "92f7ee37-4acb-46aa-bb73-4c0139d3f6b8",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 08:08:25 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 28C P0 24W / 70W | 11314MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 29C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 28C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 24W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "c031d2c7-03cb-4ac7-a195-70fc25cb191d",
"metadata": {},
"source": [
"Loading data with compact dtypes lets us fit more rows in GPU memory. The optimal dataset size depends on various factors, including the specific operations being performed, the complexity of the workload, and the available GPU memory. To maximize acceleration, datasets should ideally fit within GPU memory, with ample headroom for operations that can spike memory requirements. As a general rule of thumb, cuDF recommends datasets smaller than 50% of GPU memory capacity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec6cefea-dc64-4f13-815e-081cd35651b9",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# 1 GiB = 1073741824 bytes\n",
"mem_capacity=16*1073741824\n",
"\n",
"mem_per_record=mem_usage_df.sum()/len(efficient_df)\n",
"\n",
"print(f'We can load {int(mem_capacity/2/mem_per_record)} rows.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddaaa1ac-66ec-4323-9842-2543c6d85e4e",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "658e9847-775f-4d12-af4e-8f896df4e6fe",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-04_interoperability.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b86451cf-60e6-4733-b431-1bc0bd586bc2",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

ds/25-1/2/1-07_etl.ipynb Normal file
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transition Path: cuDF lets users scale their pandas workflows as data sizes grow, offering a middle ground between single-threaded pandas and distributed computing solutions like Dask or Apache Spark."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 09 - Introduction to Dask cuDF ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"[Dask](https://dask.org/) cuDF can be used to distribute dataframe operations to multiple GPUs. In this notebook we will introduce some key Dask concepts, learn how to set up a Dask cluster for utilizing multiple GPUs, and see how to perform simple dataframe operations on distributed Dask dataframes. This notebook covers the following sections: \n",
"1. [An Introduction to Dask](#An-Introduction-to-Dask)\n",
"2. [Setting up a Dask Scheduler](#Setting-up-a-Dask-Scheduler)\n",
" * [Obtaining the Local IP Address](#Obtaining-the-Local-IP-Address)\n",
" * [Starting a `LocalCUDACluster`](#Starting-a-LocalCUDACluster)\n",
" * [Instantiating a Client Connection](#Instantiating-a-Client-Connection)\n",
" * [The Dask Dashboard](#The-Dask-Dashboard)\n",
"3. [Reading Data with Dask cuDF](#Reading-Data-with-Dask-cuDF)\n",
"4. [Computational Graph](#Computational-Graph)\n",
" * [Visualizing the Computational Graph](#Visualizing-the-Computational-Graph)\n",
" * [Extending the Computational Graph](#Extending-the-Computational-Graph)\n",
" * [Computing with the Computational Graph](#Computing-with-the-Computational-Graph)\n",
" * [Persisting Data in the Cluster](#Persisting-Data-in-the-Cluster)\n",
"5. [Initial Data Exploration with Dask cuDF](#Initial-Data-Exploration-with-Dask-cuDF)\n",
" * [Exercise #1 - Counties North of Sunderland with Dask](#Exercise-#1---Counties-North-of-Sunderland-with-Dask)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An Introduction to Dask ##\n",
"[Dask](https://dask.org/) is a Python library for parallel computing. In Dask programming, we create computational graphs that define code we **would like** to execute, and then, give these computational graphs to a Dask scheduler which evaluates them lazily, and efficiently, in parallel. \n",
"\n",
"In addition to using multiple CPU cores or threads to execute computational graphs in parallel, Dask schedulers can also be configured to execute computational graphs on multiple machines, or, as we will do in this workshop, multiple GPUs. As a result, Dask programming facilitates operating on data sets that are larger than the memory of a single compute resource.\n",
"\n",
"Because Dask computational graphs can consist of arbitrary Python code, they provide [a level of control and flexibility superior to many other systems](https://docs.dask.org/en/latest/spark.html) that can operate on massive data sets. However, we will focus for this workshop primarily on the Dask DataFrame, one of several data structures whose operations and methods natively utilize Dask's parallel scheduling:\n",
"* Dask DataFrame, which closely resembles the Pandas DataFrame\n",
"* Dask Array, which closely resembles the NumPy ndarray\n",
"* Dask Bag, a set which allows duplicates and can hold heterogeneously-typed data\n",
"\n",
"In particular, we will use a Dask-cuDF dataframe, which combines the interface of Dask with the GPU power of cuDF for distributed dataframe operations on multiple GPUs. We will now turn our attention to utilizing all 4 NVIDIA T4 GPUs in this environment for operations on an 18GB UK population data set that would not fit into the memory of a single 16GB GPU."
]
},
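{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"define now, execute later\" idea can be sketched with `dask.delayed` (a minimal CPU-only example, not part of the workshop data): building the graph does no work until `.compute()` is called."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: lazy evaluation with dask.delayed\n",
"from dask import delayed\n",
"\n",
"# each call only records a task in the graph; nothing executes yet\n",
"total = delayed(sum)([delayed(lambda x: x + 1)(i) for i in range(4)])\n",
"\n",
"print(total.compute())  # executes the whole graph; prints 10"
]
},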
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up a Dask Scheduler ##\n",
"We begin by starting a Dask scheduler, which will take care of distributing our work across the 4 available GPUs. To do this, we start a `LocalCUDACluster` instance, using our host machine's IP, and then instantiate a client that can communicate with the cluster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Obtaining the Local IP Address ###"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import subprocess # we will use this to obtain our local IP using the following command\n",
"cmd = \"hostname --all-ip-addresses\"\n",
"\n",
"process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)\n",
"output, error = process.communicate()\n",
"IPADDR = str(output.decode()).split()[0]"
]
},
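{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative sketch uses the standard-library `socket` module instead of shelling out (with the caveat that, depending on how the container's hostname resolves, it can return the loopback address instead):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: stdlib alternative to the subprocess call above\n",
"import socket\n",
"\n",
"print(socket.gethostbyname(socket.gethostname()))"
]
},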
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Starting a `LocalCUDACluster` ###\n",
"`dask_cuda` provides utilities for Dask and CUDA (the \"cu\" in cuDF) interactions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-21 13:31:13,108 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:44687' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 16), ('read_csv-910ec886221afde30c768158c33b486c', 67), ('read_csv-910ec886221afde30c768158c33b486c', 0), ('read_csv-910ec886221afde30c768158c33b486c', 41), ('read_csv-910ec886221afde30c768158c33b486c', 54), ('read_csv-910ec886221afde30c768158c33b486c', 9), ('read_csv-910ec886221afde30c768158c33b486c', 38), ('read_csv-910ec886221afde30c768158c33b486c', 5), ('read_csv-910ec886221afde30c768158c33b486c', 34), ('read_csv-910ec886221afde30c768158c33b486c', 12), ('read_csv-910ec886221afde30c768158c33b486c', 2), ('read_csv-910ec886221afde30c768158c33b486c', 27), ('read_csv-910ec886221afde30c768158c33b486c', 62), ('read_csv-910ec886221afde30c768158c33b486c', 46), ('read_csv-910ec886221afde30c768158c33b486c', 30), ('read_csv-910ec886221afde30c768158c33b486c', 59), ('read_csv-910ec886221afde30c768158c33b486c', 23)} (stimulus_id='handle-worker-cleanup-1761053473.108198')\n",
"2025-10-21 13:31:13,110 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:35977' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 29), ('read_csv-910ec886221afde30c768158c33b486c', 48), ('read_csv-910ec886221afde30c768158c33b486c', 32), ('read_csv-910ec886221afde30c768158c33b486c', 10), ('read_csv-910ec886221afde30c768158c33b486c', 51), ('read_csv-910ec886221afde30c768158c33b486c', 25), ('read_csv-910ec886221afde30c768158c33b486c', 60), ('read_csv-910ec886221afde30c768158c33b486c', 44), ('read_csv-910ec886221afde30c768158c33b486c', 14), ('read_csv-910ec886221afde30c768158c33b486c', 57), ('read_csv-910ec886221afde30c768158c33b486c', 18), ('read_csv-910ec886221afde30c768158c33b486c', 8), ('read_csv-910ec886221afde30c768158c33b486c', 66), ('read_csv-910ec886221afde30c768158c33b486c', 21), ('read_csv-910ec886221afde30c768158c33b486c', 36), ('read_csv-910ec886221afde30c768158c33b486c', 4), ('read_csv-910ec886221afde30c768158c33b486c', 55)} (stimulus_id='handle-worker-cleanup-1761053473.1105292')\n",
"2025-10-21 13:31:13,112 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:39371' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 7), ('read_csv-910ec886221afde30c768158c33b486c', 58), ('read_csv-910ec886221afde30c768158c33b486c', 3), ('read_csv-910ec886221afde30c768158c33b486c', 26), ('read_csv-910ec886221afde30c768158c33b486c', 61), ('read_csv-910ec886221afde30c768158c33b486c', 22), ('read_csv-910ec886221afde30c768158c33b486c', 19), ('read_csv-910ec886221afde30c768158c33b486c', 15), ('read_csv-910ec886221afde30c768158c33b486c', 50), ('read_csv-910ec886221afde30c768158c33b486c', 47), ('read_csv-910ec886221afde30c768158c33b486c', 53), ('read_csv-910ec886221afde30c768158c33b486c', 37), ('read_csv-910ec886221afde30c768158c33b486c', 43), ('read_csv-910ec886221afde30c768158c33b486c', 11), ('read_csv-910ec886221afde30c768158c33b486c', 40), ('read_csv-910ec886221afde30c768158c33b486c', 65), ('read_csv-910ec886221afde30c768158c33b486c', 33)} (stimulus_id='handle-worker-cleanup-1761053473.1126676')\n",
"2025-10-21 13:31:13,114 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:36291' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 52), ('read_csv-910ec886221afde30c768158c33b486c', 13), ('read_csv-910ec886221afde30c768158c33b486c', 42), ('read_csv-910ec886221afde30c768158c33b486c', 45), ('read_csv-910ec886221afde30c768158c33b486c', 6), ('read_csv-910ec886221afde30c768158c33b486c', 35), ('read_csv-910ec886221afde30c768158c33b486c', 64), ('read_csv-910ec886221afde30c768158c33b486c', 31), ('read_csv-910ec886221afde30c768158c33b486c', 28), ('read_csv-910ec886221afde30c768158c33b486c', 63), ('read_csv-910ec886221afde30c768158c33b486c', 24), ('read_csv-910ec886221afde30c768158c33b486c', 56), ('read_csv-910ec886221afde30c768158c33b486c', 17), ('read_csv-910ec886221afde30c768158c33b486c', 1), ('read_csv-910ec886221afde30c768158c33b486c', 20), ('read_csv-910ec886221afde30c768158c33b486c', 49), ('read_csv-910ec886221afde30c768158c33b486c', 39), ('read_csv-910ec886221afde30c768158c33b486c', 68)} (stimulus_id='handle-worker-cleanup-1761053473.1145272')\n"
]
}
],
"source": [
"from dask_cuda import LocalCUDACluster\n",
"cluster = LocalCUDACluster(ip=IPADDR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instantiating a Client Connection ###\n",
"The `dask.distributed` library gives us distributed functionality, including the ability to connect to the CUDA Cluster we just created. The `progress` import will give us a handy progress bar we can utilize below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client, progress\n",
"\n",
"client = Client(cluster)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Dask Dashboard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dask ships with a very helpful dashboard that in our case runs on port `8787`. Open a new browser tab now and copy this lab's URL into it, replacing `/lab/lab` with `:8787` (so it ends with `.com:8787`). This should open the Dask dashboard, currently idle."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Data with Dask cuDF ##\n",
"With `dask_cudf` we can create a dataframe from several file formats (including from multiple files and directly from cloud storage like S3), from cuDF dataframes, from Pandas dataframes, and even from vanilla CPU Dask dataframes. Here we will create a Dask cuDF dataframe from the local csv file `pop5x_1-07.csv`, which has similar features to the `pop.csv` files you have already been using, except scaled up to 5 times larger (18GB), representing a population of almost 300 million, nearly the size of the entire United States."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"18G data/uk_pop5x.csv\n"
]
}
],
"source": [
"# get the file size of `pop5x_1-07.csv` in GB\n",
"!ls -sh data/uk_pop5x.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We import dask_cudf (and other RAPIDS components when necessary) after setting up the cluster to ensure that they establish correctly inside the CUDA context it creates."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import dask_cudf"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"ddf = dask_cudf.read_csv('./data/uk_pop5x.csv', dtype=['float32', 'str', 'str', 'float32', 'float32', 'str'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"age float32\n",
"sex object\n",
"county object\n",
"lat float32\n",
"long float32\n",
"name object\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computational Graph ##\n",
"As mentioned above, when programming with Dask, we create computational graphs that we **would eventually like** to be executed. We can already observe this behavior in action: in calling `dask_cudf.read_csv` we have indicated that **would eventually like** to read the entire contents of `pop5x_1-07.csv`. However, Dask will not ask the scheduler execute this work until we explicitly indicate that we would like it do so.\n",
"\n",
"Observe the memory usage for each of the 4 GPUs by executing the following cell, and notice that the GPU memory usage is not nearly large enough to indicate that the entire 18GB file has been read into memory:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 13:29:09 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 14956MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizing the Computational Graph ###\n",
"Computational graphs that have not yet been executed provide the `.visualize` method that, when used in a Jupyter environment such as this one, will display the computational graph, including how Dask intends to go about distributing the work. Thus, we can visualize how the `read_csv` operation will be distributed by Dask by executing the following cell:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"115pt\" height=\"44pt\"\n",
" viewBox=\"0.00 0.00 115.00 44.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 40)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-40 111,-40 111,4 -4,4\"/>\n",
"<!-- &#45;6332770613817605186 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>&#45;6332770613817605186</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"107,-36 0,-36 0,0 107,0 107,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"53.5\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">ReadCSV</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94de3b45b0>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.visualize(format='svg') # This visualization is very large, and using `format='svg'` will make it easier to view."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, when we indicate for Dask to actually execute this operation, it will parallelize the work across the 4 GPUs in something like 69 parallel partitions. We can see the exact number of partitions with the `npartitions` property:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"69"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.npartitions"
]
},
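  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lazy-graph-plus-partitions pattern is not GPU-specific. Here is a minimal CPU-only sketch of the same idea using plain Dask and Pandas (the names and toy data below are illustrative, not part of this workshop's data set):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import dask.dataframe as dd\n",
    "\n",
    "pdf = pd.DataFrame({'age': [10.0, 20.0, 30.0, 40.0]})\n",
    "ddf_cpu = dd.from_pandas(pdf, npartitions=2)  # split into 2 partitions\n",
    "\n",
    "lazy_sum = ddf_cpu['age'].sum()  # builds a graph; nothing executes yet\n",
    "print(ddf_cpu.npartitions)       # 2\n",
    "print(lazy_sum.compute())        # the graph executes here\n",
    "```"
   ]
  },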
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extending the Computational Graph ###\n",
"The concept of constructing computational graphs with arbitrary operations before executing them is a core part of Dask. Let's add some operations to the existing computational graph and visualize it again.\n",
"\n",
"After running the next cell, although it will take some scrolling to get a clear sense of it (the challenges of distributed data analytics!), you can see that the graph already constructed for `read_csv` now continues upward. It selects the `age` column across all partitions (visualized as `getitem`) and eventually performs the `.mean()` reduction (visualized as `series-sum-chunk`, `series-sum-agg`, `count-chunk`, `sum-agg` and `true-div`)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"276pt\" height=\"188pt\"\n",
" viewBox=\"0.00 0.00 276.00 188.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 184)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-184 272,-184 272,4 -4,4\"/>\n",
"<!-- 2336549067836068764 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>2336549067836068764</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"221,-180 47,-180 47,-144 221,-144 221,-180\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-157\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">Sum(Projection)</text>\n",
"</g>\n",
"<!-- 553658985626135620 -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>553658985626135620</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"268,-108 0,-108 0,-72 268,-72 268,-108\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-85\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">Projection(ReadCSV, age)</text>\n",
"</g>\n",
"<!-- 553658985626135620&#45;&gt;2336549067836068764 -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>553658985626135620&#45;&gt;2336549067836068764</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M134,-108.3C134,-116.02 134,-125.29 134,-133.89\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"130.5,-133.9 134,-143.9 137.5,-133.9 130.5,-133.9\"/>\n",
"</g>\n",
"<!-- &#45;6332770613817605186 -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>&#45;6332770613817605186</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"187.5,-36 80.5,-36 80.5,0 187.5,0 187.5,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">ReadCSV</text>\n",
"</g>\n",
"<!-- &#45;6332770613817605186&#45;&gt;553658985626135620 -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>&#45;6332770613817605186&#45;&gt;553658985626135620</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M134,-36.3C134,-44.02 134,-53.29 134,-61.89\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"130.5,-61.9 134,-71.9 137.5,-61.9 130.5,-61.9\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94de3b59f0>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_age = ddf['age'].sum()\n",
"mean_age.visualize(format='svg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing with the Computational Graph ###\n",
"There are several ways to indicate to Dask that we would like to perform the computations described in the computational graphs we have constructed. The first we will show is the `.compute` method, which will return the output of the computation as an object in one GPU's memory - no longer distributed across GPUs.\n",
"\n",
"**NOTE**: This value is actually a [*future*](https://docs.python.org/3/library/concurrent.futures.html) that it can be immediately used in code, even before it completes evaluating. While this can be tremendously useful in many scenarios, we will not need in this workshop to do anything fancy with the futures we generate except to wait for them to evaluate so we can visualize their values.\n",
"\n",
"Below we send the computational graph we have created to the Dask scheduler to be executed in parallel on our 4 GPUs. If you have the Dask Dashboard open on another tab from before, you can watch it while the operation completes. Because our graph involves reading the entire 18GB data set (as we declared when adding `read_csv` to the call graph), you can expect the operation to take a little time. If you closely watch the dashboard, you will see that Dask begins follow-on calculations for `mean` even while data is still being read into memory."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11732293000.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_age.compute()"
]
},
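  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the future-based workflow explicitly, here is a hedged, CPU-only sketch (it creates its own lightweight in-process client so it is self-contained; the names are illustrative). `client.compute` returns a `Future` immediately, and `progress` renders the progress bar mentioned earlier:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import dask.dataframe as dd\n",
    "from dask.distributed import Client, progress\n",
    "\n",
    "client_cpu = Client(processes=False)  # in-process client, for illustration only\n",
    "ddf_cpu = dd.from_pandas(pd.DataFrame({'age': [1.0, 2.0, 3.0]}), npartitions=3)\n",
    "\n",
    "future = client_cpu.compute(ddf_cpu['age'].sum())  # returns a Future right away\n",
    "progress(future)        # live progress bar in the notebook\n",
    "print(future.result())  # blocks until the result is ready\n",
    "client_cpu.close()\n",
    "```"
   ]
  },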
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Persisting Data in the Cluster ###\n",
"As you can see, the previous operation, which read the entire 18GB csv into the GPUs' memory, did not retain the data in memory after completing the computational graph:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 13:31:04 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 14094MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A typical Dask workflow, which we will utilize, is to persist data we would like to work with to the cluster and then perform fast operations on that persisted data. We do this with the `.persist` method. From the [Dask documentation](https://distributed.dask.org/en/latest/manage-computation.html#client-persist):\n",
"\n",
">The `.persist` method submits the task graph behind the Dask collection to the scheduler, obtaining Futures for all of the top-most tasks (for example one Future for each Pandas [*or cuDF*] DataFrame in a Dask[*-cudf*] DataFrame). It then returns a copy of the collection pointing to these futures instead of the previous graph. This new collection is semantically equivalent but now points to actively running data rather than a lazy graph.\n",
"\n",
"Below we persist `ddf` to the cluster so that it will reside in GPU memory for us to perform fast operations on. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"ddf = ddf.persist()"
]
},
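  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `.persist` returns immediately while the work continues in the background. When you need to block until every partition is actually materialized, `dask.distributed` provides `wait`. A minimal CPU-only sketch of the pattern (illustrative names, in-process client):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import dask.dataframe as dd\n",
    "from dask.distributed import Client, wait\n",
    "\n",
    "client_cpu = Client(processes=False)  # in-process client, for illustration only\n",
    "ddf_cpu = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)\n",
    "\n",
    "ddf_cpu = ddf_cpu.persist()  # submits the graph; returns immediately\n",
    "wait(ddf_cpu)                # block until all partitions are in memory\n",
    "print(ddf_cpu['x'].sum().compute())  # fast: data is already materialized\n",
    "client_cpu.close()\n",
    "```"
   ]
  },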
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see by executing `nvidia-smi` (after letting the `persist` finish), each GPU now has parts of the distributed dataframe in its memory:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 13:31:08 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 32C P0 33W / 70W | 14218MiB / 15360MiB | 46% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 32C P0 32W / 70W | 3768MiB / 15360MiB | 19% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 31C P0 32W / 70W | 3804MiB / 15360MiB | 24% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 31C P0 32W / 70W | 3764MiB / 15360MiB | 45% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running `ddf.visualize` now shows that we no longer have operations in our task graph, only partitions of data, ready for us to perform operations:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"135pt\" height=\"44pt\"\n",
" viewBox=\"0.00 0.00 135.00 44.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 40)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-40 131,-40 131,4 -4,4\"/>\n",
"<!-- &#45;4538719848559110466 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>&#45;4538719848559110466</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"127,-36 0,-36 0,0 127,0 127,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"63.5\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">FromGraph</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94b80d4550>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.visualize(format='svg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Computing operations on this data will now be much faster:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"40.1241924549316"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf['age'].mean().compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial Data Exploration with Dask cuDF ##\n",
"The beauty of Dask is that working with your data, even though it is distributed and massive, is a lot like working with smaller in-memory data sets."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>county</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" <th>name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>m</td>\n",
" <td>Darlington</td>\n",
" <td>54.549641</td>\n",
" <td>-1.493884</td>\n",
" <td>HARRISON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.0</td>\n",
" <td>m</td>\n",
" <td>Darlington</td>\n",
" <td>54.523945</td>\n",
" <td>-1.401142</td>\n",
" <td>LAKSH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>m</td>\n",
" <td>Darlington</td>\n",
" <td>54.561127</td>\n",
" <td>-1.690068</td>\n",
" <td>MUHAMMAD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.0</td>\n",
" <td>m</td>\n",
" <td>Darlington</td>\n",
" <td>54.542988</td>\n",
" <td>-1.543216</td>\n",
" <td>GRAYSON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.0</td>\n",
" <td>m</td>\n",
" <td>Darlington</td>\n",
" <td>54.532101</td>\n",
" <td>-1.569116</td>\n",
" <td>FINLAY</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age sex county lat long name\n",
"0 0.0 m Darlington 54.549641 -1.493884 HARRISON\n",
"1 0.0 m Darlington 54.523945 -1.401142 LAKSH\n",
"2 0.0 m Darlington 54.561127 -1.690068 MUHAMMAD\n",
"3 0.0 m Darlington 54.542988 -1.543216 GRAYSON\n",
"4 0.0 m Darlington 54.532101 -1.569116 FINLAY"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.head() # As a convenience, no need to `.compute` the `head()` method"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age 292399470\n",
"sex 292399470\n",
"county 292399470\n",
"lat 292399470\n",
"long 292399470\n",
"name 292399470\n",
"dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.count().compute()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age float32\n",
"sex object\n",
"county object\n",
"lat float32\n",
"long float32\n",
"name object\n",
"dtype: object"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise #1 - Counties North of Sunderland with Dask ###\n",
"Here we ask you to revisit an earlier exercise, but on the distributed data set. Hopefully, it's clear how similar the code is for single-GPU dataframes and distributed dataframes with Dask.\n",
"\n",
"Identify the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), and then determine which counties have any residents north of this resident. Use the `unique` method of a cudf `Series` to de-duplicate the result.\n",
"\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the below cell to identify counties north of Sunderland. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'ddf' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m sunderland_residents \u001b[38;5;241m=\u001b[39m \u001b[43mddf\u001b[49m\u001b[38;5;241m.\u001b[39mloc[[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcounty\u001b[39m\u001b[38;5;124m'\u001b[39m], [\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mSUNDERLAND\u001b[39m\u001b[38;5;124m'\u001b[39m]]\n\u001b[1;32m 2\u001b[0m northmost_sunderland_lat \u001b[38;5;241m=\u001b[39m sunderland_residents[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlat\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mmax()\n\u001b[1;32m 3\u001b[0m counties_with_pop_north_of \u001b[38;5;241m=\u001b[39m ddf\u001b[38;5;241m.\u001b[39mloc[ddf[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlat\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m>\u001b[39m northmost_sunderland_lat][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcounty\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39munique()\n",
"\u001b[0;31mNameError\u001b[0m: name 'ddf' is not defined"
]
}
],
"source": [
"sunderland_residents = ddf.loc[['county'], ['SUNDERLAND']]\n",
"northmost_sunderland_lat = sunderland_residents['lat'].max()\n",
"counties_with_pop_north_of = ddf.loc[ddf['lat'] > northmost_sunderland_lat]['county'].unique()\n",
"results=counties_with_pop_north_of.compute()\n",
"results.head()"
]
},
{
"cell_type": "raw",
"metadata": {
"jupyter": {
"source_hidden": true
}
},
"source": [
"\n",
"sunderland_residents = ddf.loc[ddf['county'] == 'Sunderland']\n",
"northmost_sunderland_lat = sunderland_residents['lat'].max()\n",
"counties_with_pop_north_of = ddf.loc[ddf['lat'] > northmost_sunderland_lat]['county'].unique()\n",
"results=counties_with_pop_north_of.compute()\n",
"results.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'status': 'ok', 'restart': True}"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-09_cudf-polars.ipynb). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,172 @@
county,lat_county_center,long_county_center
BARKING AND DAGENHAM,51.621048311776526,0.12958319845588165
BARNET,51.81255163972051,-0.21821206632197684
BARNSLEY,53.57190690010971,-1.5487193565226611
BATH AND NORTH EAST SOMERSET,51.35496548780361,-2.486675162410336
BEDFORD,52.145475839485385,-0.4549734374180617
BEXLEY,51.33625605642689,0.14633321710015448
BIRMINGHAM,52.12178304394528,-1.881329432771379
BLACKBURN WITH DARWEN,53.63718763008419,-2.463700844959783
BLACKPOOL,53.882118373353435,-3.0229009637127167
BLAENAU GWENT,51.75159582861159,-3.1862426125686745
BOLTON,53.73813128127497,-2.4794091133678147
BRACKNELL FOREST,51.457925145468295,-0.7336441271286038
BRADFORD,53.972113267048044,-1.8738762931122748
BRENT,51.761695309784,-0.2756927203781798
BRIDGEND,51.522888539164526,-3.6137468421270604
BRIGHTON AND HOVE,50.94890407892698,-0.1507807253912774
"BRISTOL, CITY OF",51.53203785026057,-2.5774864859032594
BROMLEY,51.2251371203518,0.03905163114984023
BUCKINGHAMSHIRE,51.92925587759856,-0.8053996183750294
BURY,53.61553432785575,-2.3088650595977023
CAERPHILLY,51.62781255006381,-3.1973649865483735
CALDERDALE,53.769761331289686,-1.9616103771384508
CAMBRIDGESHIRE,52.1333820427886,-0.23503728806014595
CAMDEN,51.69346289078886,-0.1629412552292679
CARDIFF,51.56635588939404,-3.222317281083218
CARMARTHENSHIRE,51.92106862577838,-4.211293704149962
CENTRAL BEDFORDSHIRE,51.99983427713095,-0.4775810785914261
CEREDIGION,52.297905934896974,-3.9524382809074967
CHESHIRE EAST,53.209779668583735,-2.2923524120906538
CHESHIRE WEST AND CHESTER,53.12468649229667,-2.703640874356098
CITY OF LONDON,51.515869084539396,-0.09345024349003202
CONWY,53.125451225027945,-3.7469275629154897
CORNWALL,50.2491094902892,-4.642072961722217
COUNTY DURHAM,54.46928915708376,-1.840983172985692
COVENTRY,52.20619163815314,-1.5190329484575433
CROYDON,51.33122440611814,-0.07773715861848832
CUMBRIA,54.470582575648244,-2.902600383252353
DARLINGTON,54.51355967194039,-1.5680201999230523
DENBIGHSHIRE,53.07313542431554,-3.347662396412462
DERBY,52.98317870391253,-1.471762916352353
DERBYSHIRE,52.96237103431297,-1.6019383162802616
DEVON,50.75993290464059,-3.6572707805745353
DONCASTER,53.579077870304175,-1.1091519021581622
DORSET,50.80117614559981,-2.4141088997141975
DUDLEY,52.466075739334926,-2.101688961593882
EALING,51.69946371446451,-0.31413253292570953
EAST RIDING OF YORKSHIRE,53.9506321883079,-0.6619808168243948
EAST SUSSEX,50.8319515317622,0.33441692286193403
ENFIELD,51.79829813489722,-0.08133941451400101
ESSEX,51.61177562858481,0.5408806396014519
FLINTSHIRE,53.18448452051185,-3.176529270275655
GATESHEAD,54.984104331680726,-1.6867966327256207
GLOUCESTERSHIRE,51.95116469210396,-2.152140175011601
GREENWICH,51.298529627584855,0.05009798110429057
GWYNEDD,52.90798692199907,-3.815807248465912
HACKNEY,51.715573990309835,-0.06047668080560671
HALTON,53.37945371869939,-2.6885285111965866
HAMMERSMITH AND FULHAM,51.45669431471315,-0.21734862391196488
HAMPSHIRE,51.35882747857323,-1.2472236572124424
HARINGEY,51.71488485869694,-0.10670896820865851
HARROW,51.69502976226169,-0.3360141730528605
HARTLEPOOL,54.67019690697325,-1.2702881849113061
HAVERING,51.68803382335829,0.23538931286606415
"HEREFORDSHIRE, COUNTY OF",52.05661428266539,-2.7394973894756567
HERTFORDSHIRE,51.97545351306396,-0.2768104374496038
HILLINGDON,51.67744993832507,-0.44168376669816023
HOUNSLOW,51.31550103034914,-0.37851470463324743
ISLE OF ANGLESEY,53.27637540915653,-4.323495411729392
ISLE OF WIGHT,50.62684579406237,-1.3335589426514434
ISLES OF SCILLY,49.923857744201605,-6.302263516809768
ISLINGTON,51.66454658738323,-0.10992970115558956
KENSINGTON AND CHELSEA,51.49977592399342,-0.18981078381787103
KENT,51.066980402556894,0.72177006521006
"KINGSTON UPON HULL, CITY OF",53.894135701816644,-0.30380941990063115
KINGSTON UPON THAMES,51.42789080754545,-0.28368404321251495
KIRKLEES,53.84779145117579,-1.7808194218728275
KNOWSLEY,53.48284092504563,-2.8329791954991275
LAMBETH,51.252923290285565,-0.11380231585035454
LANCASHIRE,53.39410422518683,-2.460896340904076
LEEDS,53.55494339794778,-1.5074406609781625
LEICESTER,52.7035904712036,-1.1304165681356237
LEICESTERSHIRE,52.372384242153444,-1.3774821236258858
LEWISHAM,51.26146486742923,-0.017302263531446847
LINCOLNSHIRE,53.019325697607805,-0.23840017404638325
LIVERPOOL,53.51161042331058,-2.9133522899513755
LUTON,51.96794156247519,-0.4231450525783596
MANCHESTER,53.618174414336764,-2.2337215842169944
MEDWAY,51.32754494250598,0.5632336335498731
MERTHYR TYDFIL,51.749169200604825,-3.36403864047987
MERTON,51.37364806533906,-0.18868296177359278
MIDDLESBROUGH,54.5098082464691,-1.211038279554591
MILTON KEYNES,52.01693552290149,-0.7406232665194876
MONMOUTHSHIRE,51.78143655329183,-2.9039386644643197
NEATH PORT TALBOT,51.59538437854254,-3.7458617902677283
NEWCASTLE UPON TYNE,55.00208530426788,-1.652806624671881
NEWHAM,51.75154898367921,0.027418339450078835
NEWPORT,51.53253056059282,-2.8977514562758477
NORFOLK,52.3032223796034,0.9647662889518414
NORTH EAST LINCOLNSHIRE,53.50967645052903,-0.13922750148994814
NORTH LINCOLNSHIRE,53.57540769163687,-0.5237063875323392
NORTH SOMERSET,51.35265217208383,-2.754333708085771
NORTH TYNESIDE,55.00390319683472,-1.5092377782362794
NORTH YORKSHIRE,54.037083506236726,-1.5496083229591298
NORTHAMPTONSHIRE,52.090056204873584,-0.8673643733062965
NORTHUMBERLAND,55.268382697315424,-2.075107564148198
NOTTINGHAM,52.95517248670217,-1.166635297324727
NOTTINGHAMSHIRE,53.03298887412134,-1.006945929298795
OLDHAM,53.659965283524954,-2.052688245629671
OXFORDSHIRE,51.93769526591072,-1.2911207463303098
PEMBROKESHIRE,51.87232817560273,-4.908191395785854
PETERBOROUGH,52.62511626981561,-0.2689975241368676
PLYMOUTH,50.29446598251615,-4.112955625237552
PORTSMOUTH,50.91433206435089,-1.0702659081823802
POWYS,52.35028728472521,-3.4364646802117074
READING,51.48972751726377,-0.9907195716377762
REDBRIDGE,51.74619394585629,0.0701000048233879
REDCAR AND CLEVELAND,54.52674848959172,-1.0057471172413288
RICHMOND UPON THAMES,51.40228740909276,-0.28924251316631455
ROCHDALE,53.67734692115036,-2.14815188340053
ROTHERHAM,53.27571588878268,-1.2866084213986422
RUTLAND,52.66741819281054,-0.6255844565552813
SALFORD,53.39900474827836,-2.3848977331687684
SANDWELL,52.58696674791831,-2.007627650605722
SEFTON,53.41754419091054,-2.9918998460398845
SHEFFIELD,53.594572416421464,-1.5427564265432459
SHROPSHIRE,52.68421414164122,-2.7366875706426375
SLOUGH,51.500375556628576,-0.5761037634462686
SOLIHULL,52.36591301434561,-1.7157174664625492
SOMERSET,51.15203995716832,-3.2953379430424437
SOUTH GLOUCESTERSHIRE,51.619868102630875,-2.469430184260059
SOUTH TYNESIDE,54.994706019365786,-1.4469508035803413
SOUTHAMPTON,50.984805930473584,-1.4002768042215858
SOUTHEND-ON-SEA,51.562157807336284,0.7069905953535786
SOUTHWARK,51.26247572937943,-0.07306483663823536
ST. HELENS,53.442240723358644,-2.7032424159534347
STAFFORDSHIRE,52.54946704767607,-2.027491119365553
STOCKPORT,53.243567817667724,-2.1248973952531918
STOCKTON-ON-TEES,54.60356568786033,-1.3063893005278557
STOKE-ON-TRENT,53.0018684063432,-2.1588155163720084
SUFFOLK,52.07327606663186,1.049040133490474
SUNDERLAND,54.95658521287448,-1.433572135990224
SURREY,51.75817482314145,-0.3386369800762059
SUTTON,51.33189096687447,-0.17228958486126392
SWANSEA,51.734320352502984,-3.967180818043868
SWINDON,51.64295753076632,-1.7336382187066433
TAMESIDE,53.4185402114593,-2.0769462404028474
TELFORD AND WREKIN,52.709149095326744,-2.4894724871905916
THURROCK,51.508227793073466,0.33492786371540356
TORBAY,50.494049197230815,-3.5551646045072913
TORFAEN,51.69896506141925,-3.0509328418360218
TOWER HAMLETS,51.68485859523772,-0.03638140322291906
TRAFFORD,53.314621144815334,-2.3656560688750687
VALE OF GLAMORGAN,51.477096810804674,-3.3980039155600954
WAKEFIELD,53.81677380462442,-1.4208545508030999
WALSALL,52.742742908764974,-1.9703315889024553
WALTHAM FOREST,51.723501987712325,-0.01886180175957716
WANDSWORTH,51.24653418036352,-0.2001743797936436
WARRINGTON,53.338554119123636,-2.561564052456012
WARWICKSHIRE,52.04847200574421,-1.5686356193411675
WEST BERKSHIRE,51.472960442069805,-1.2740171035533379
WEST SUSSEX,51.11473921001523,-0.4593527537340543
WESTMINSTER,51.613346179755915,-0.15298252171750404
WIGAN,53.58763891955546,-2.5723844100365545
WILTSHIRE,51.48575283497703,-1.926537553406791
WINDSOR AND MAIDENHEAD,51.494612540256846,-0.6753936432282348
WIRRAL,53.237217504292545,-3.0650813262796417
WOKINGHAM,51.45966460093226,-0.8993706058495408
WOLVERHAMPTON,52.71684834050869,-2.127594624973283
WORCESTERSHIRE,52.05799103802506,-2.209184250840713
WREXHAM,53.00080440180421,-2.991958507191866
YORK,53.99232942499273,-1.073788787620359