feat(ds-2.1, circuit-3)
circuit/25-1/3/DVJK_table.png      (new binary file, 16 KiB)
circuit/25-1/3/TDCE_schema.png     (new binary file, 30 KiB)
circuit/25-1/3/TDCE_timing.png     (new binary file, 57 KiB)
circuit/25-1/3/TJK.png             (new binary file, 35 KiB)
circuit/25-1/3/TJK_schema.png      (new binary file, 38 KiB)
circuit/25-1/3/TJK_timing.png      (new binary file, 59 KiB)
circuit/25-1/3/TT.png              (new binary file, 31 KiB)
circuit/25-1/3/TT_schema.png       (new binary file, 30 KiB)
circuit/25-1/3/TT_table.png        (new binary file, 3.0 KiB)
circuit/25-1/3/TT_timing.png       (new binary file, 182 KiB)
circuit/25-1/3/TT_transition.png   (new binary file, 16 KiB)
circuit/25-1/3/lab3.pdf            (new binary file)
circuit/25-1/3/schema.png          (new binary file, 186 KiB)
ds/25-1/2/1-00_introduction.ipynb  (new file, +278)
@@ -0,0 +1,278 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "19051402",
"metadata": {
"tags": []
},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "67ed6062",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science #"
]
},
{
"cell_type": "markdown",
"id": "a65f57f0",
"metadata": {},
"source": [
"## 00 - Introduction ##\n",
"Welcome to NVIDIA's Deep Learning Institute workshop on the Fundamentals of Accelerated Data Science. This interactive lab offers practical experience with every stage of the development process, empowering participants to tailor solutions for their unique applications."
]
},
{
"cell_type": "markdown",
"id": "50d32b6c",
"metadata": {},
"source": [
"**Learning Objectives**\n",
"<br>\n",
"In this workshop, you will learn: \n",
"* Overview of data science\n",
"* Demonstrations of data science workflows\n",
"* How acceleration is achieved\n",
"* How to design operations to maximize GPU acceleration\n",
"* Implications of acceleration"
]
},
{
"cell_type": "markdown",
"id": "3a02c2b6",
"metadata": {},
"source": [
"### JupyterLab ###\n",
"For this hands-on lab, we use [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) to manage our environment. The [JupyterLab Interface](https://jupyterlab.readthedocs.io/en/stable/user/interface.html) is a dashboard that provides access to interactive iPython notebooks, as well as the folder structure of our environment and a terminal window into the Ubuntu operating system. The first view includes a **menu bar** at the top, a **file browser** in the **left sidebar**, and a **main work area** that is initially open to this \"introduction\" notebook. \n",
"\n",
"<p><img src=\"images/jl_launcher.png\" width=720></p>\n",
"\n",
"* The file browser can be navigated just like any other file explorer. Double-clicking any item opens a new tab with its content. \n",
"* The main work area includes tabbed views of open files that can be closed, moved, and edited as needed. \n",
"* The notebooks, including this one, consist of a series of content and code **cells**. To execute code in a code cell, press `Shift+Enter` or click the `Run` button in the menu bar while the cell is highlighted. Sometimes, a content cell will get switched to editing mode. Executing the cell with `Shift+Enter` or the `Run` button will switch it back to a readable form.\n",
"* To interrupt cell execution, click the `Stop` button in the menu bar or navigate to the `Kernel` menu, and select `Interrupt Kernel`. \n",
"* We can run terminal commands in notebook cells by prefixing the command with an exclamation point, or \"bang\" (`!`).\n",
"* We can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below)."
]
},
{
"cell_type": "markdown",
"id": "4492c58d",
"metadata": {},
"source": [
"<a name='e1'></a>\n",
"### Exercise #1 - Practice ###\n",
"**Instructions**: <br>\n",
"* Try executing the simple print statement in the below cell.\n",
"* Then try executing the terminal command in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e69a6515",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# activate this cell by selecting it with the mouse or arrow keys then use the keyboard shortcut [Shift+Enter] to execute\n",
"print('This is just a simple print statement.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e54fe372",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!echo 'This is another simple print statement.'"
]
},
{
"cell_type": "markdown",
"id": "c2e5151b-4842-465e-a20d-bb64af66d011",
"metadata": {},
"source": [
"<a name='e2'></a>\n",
"### Exercise #2 - Available GPU Accelerators ###\n",
"The `nvidia-smi` (NVIDIA System Management Interface) command is a powerful utility for managing and monitoring NVIDIA GPU devices. It will print information about available GPUs, their current memory usage, and any processes currently utilizing them. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to learn about this environment's available GPUs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08d543eb-a951-4eb9-8107-b13c01b3ac46",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "adee74e3-613a-4986-be34-ff3ae113ccc7",
"metadata": {},
"source": [
"**Note**: Currently, GPU memory usage is minimal, with no active processes utilizing the GPUs. Throughout our session, we'll employ this command to monitor memory consumption. When conducting GPU-based data analysis, it's advisable to maintain approximately 50% of GPU memory free, allowing for operations that may expand data stored on the device."
]
},
{
"cell_type": "markdown",
"id": "f0839f2e-dfe3-4d8f-8010-ed8445c171fb",
"metadata": {},
"source": [
"<a name='e3'></a>\n",
"### Exercise #3 - Magic Commands ###\n",
"The Jupyter environment comes with *magic* commands installed, recognizable by the `%` or `%%` prefix. We will be using two magic commands liberally in this workshop: \n",
"* `%time`: reports how long a single line of code took to run\n",
"* `%%time`: reports how long an entire cell took to run\n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to import the `time` library. \n",
"* Execute the cell below to time the single line of code. \n",
"* Execute the cell below to time the entire cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c34489-7812-4ffe-bd2e-748a52903481",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"from time import sleep"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db1d5de9-f6e6-4984-8c32-f13b51aa27db",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# %time only times one line\n",
"%time sleep(2) \n",
"sleep(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daf2f6f0-58a9-43a5-af8f-0b69b4a2a3a8",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%time\n",
"# DO NOT CHANGE THIS CELL\n",
"# %%time will time the entire cell\n",
"sleep(1)\n",
"sleep(1)"
]
},
{
"cell_type": "markdown",
"id": "42ed873e-f7b5-4668-8e96-ce31d53d43b1",
"metadata": {},
"source": [
"<a name='e4'></a>\n",
"### Exercise #4 - Jupyter Kernels and GPU Memory ###\n",
"The compute backend for Jupyter is called the *kernel*. The Jupyter environment starts up a separate kernel for each new notebook. The many notebooks in this workshop are each intended to stand alone with regard to memory and computation. \n",
"\n",
"To ensure we have enough memory and compute for each notebook, we can clear the memory at the conclusion of each notebook in two ways: \n",
"1. Shut down the kernel programmatically with the kernel's `do_shutdown()` method, or\n",
"2. Shut down the kernel through the *Running Terminals and Kernels* panel. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to shut down and restart the current kernel. \n",
"* Shut down the current kernel through the *Running Terminals and Kernels* panel.\n",
"\n",
"<p><img src=\"images/kernel_restart.png\" width=720></p>\n",
"\n",
"**Note**: Restarting the kernel from the *Kernel* menu will only clear the memory for *the current notebook's kernel*, while notebooks other than the one we're working on may still have memory allocated for *their unique kernels*. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98e05b77-6019-428b-8e18-a2477692ef6f",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "0321075e-433e-42d4-b849-de3fa17b54e1",
"metadata": {},
"source": [
"**Note**: Executing the provided code cell will shut down the kernel and activate a popup indicating that the kernel has restarted."
]
},
{
"cell_type": "markdown",
"id": "8e950df2",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-01_section_overview.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b604003a",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
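The introduction notebook's timing exercise relies on the IPython magics `%time` and `%%time`. As an illustrative sketch only (not part of the course notebooks), the same wall-clock measurement can be reproduced in plain Python with `time.perf_counter`, which is also what you would reach for outside an IPython session:

```python
# Plain-Python sketch of what %time (one statement) and %%time (a whole
# block) measure; hypothetical helper code, not from the course material.
from time import perf_counter, sleep

# time a single statement, as %time does
start = perf_counter()
sleep(0.2)
single_line = perf_counter() - start

# time a whole block, as %%time does
start = perf_counter()
sleep(0.1)
sleep(0.1)
whole_block = perf_counter() - start

print(f"single line: {single_line:.2f} s, whole block: {whole_block:.2f} s")
```

Unlike `%%time`, this approach requires explicit bookkeeping around the block being measured, which is why the magics are preferred inside notebooks.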
ds/25-1/2/1-01_section_overview.ipynb  (new file, +78)
@@ -0,0 +1,78 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "b53a7b12-538d-4459-b82a-a35c8c417849",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "ae497b71-bc43-471e-8970-88a1878e7cf9",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "3a61cc06-80da-4f73-ba61-8ff1b5af71d8",
"metadata": {},
"source": [
"## 01 - Section Overview ##\n",
"\n",
"**Table of Contents**\n",
"This section focuses on data processing. We'll work with multiple datasets, conduct high-level analyses, and prepare the data for subsequent machine learning tasks. \n",
"<br>\n",
"* **1-01_section_overview.ipynb**\n",
"* **1-02_data_manipulation.ipynb**\n",
"* **1-03_memory_management.ipynb**\n",
"* **1-04_interoperability.ipynb**\n",
"* **1-05_grouping.ipynb**\n",
"* **1-06_data_visualization.ipynb**\n",
"* **1-07_etl.ipynb**\n",
"* **1-08_dask-cudf.ipynb**\n",
"* **1-09_cudf-polars.ipynb**"
]
},
{
"cell_type": "markdown",
"id": "9b1485a5-00e8-4495-85b0-b48671674818",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-02_data_manipulation.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "81e47f0a-547e-4714-878d-34eb9b75c835",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

ds/25-1/2/1-02_data_manipulation.ipynb  (new file, +2005)
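The memory-management notebook that follows measures per-column memory with `memory_usage(deep=True)` and then shrinks 64-bit columns with `.astype()`. A minimal self-contained sketch of that workflow, run on a small synthetic DataFrame rather than the course's `uk_pop.csv` (the column names and values here are illustrative):

```python
# Sketch of the measure-then-downcast workflow taught in
# 1-03_memory_management.ipynb, on synthetic data instead of uk_pop.csv.
import pandas as pd

df = pd.DataFrame({
    "age": pd.Series([0, 17, 42, 99] * 1000, dtype="int64"),
    "lat": pd.Series([54.53, 54.42, 54.55, 54.54] * 1000, dtype="float64"),
})

# total bytes before downcasting (deep=True also counts string payloads)
before = df.memory_usage(deep=True).sum()

# smaller fixed-width containers: ages fit in int8, float32 suffices here
df["age"] = df["age"].astype("int8")
df["lat"] = df["lat"].astype("float32")

after = df.memory_usage(deep=True).sum()
print(f"{before} -> {after} bytes")
```

The same calls work unchanged on a cuDF DataFrame, which is what makes this a useful habit before moving data to the GPU.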
ds/25-1/2/1-03_memory_management.ipynb  (new file, +958)
@@ -0,0 +1,958 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "def31b0f-921a-43eb-9807-8b9b31eb7b32",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "4a0fd4dd-f7be-4c90-8ddd-384a760ac04f",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "6a8fdf2e-a481-455e-8a52-8be8472b63bf",
"metadata": {},
"source": [
"## 03 - Memory Management ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook explores the dynamics between data and memory. It covers the following sections: \n",
"1. [Memory Management](#Memory-Management)\n",
"    * [Memory Usage](#Memory-Usage)\n",
"2. [Data Types](#Data-Types)\n",
"    * [Convert Data Types](#Convert-Data-Types)\n",
"    * [Exercise #1 - Modify `dtypes`](#Exercise-#1---Modify-dtypes)\n",
"    * [Categorical](#Categorical)\n",
"3. [Efficient Data Loading](#Efficient-Data-Loading)"
]
},
{
"cell_type": "markdown",
"id": "1b59367c-48bc-4c72-b1f4-4cfdfa5470cf",
"metadata": {},
"source": [
"## Memory Management ##\n",
"During the data acquisition process, data is transferred to memory in order to be operated on by the processor. Memory management is crucial for cuDF and GPU operations for several key reasons: \n",
"* **Limited GPU memory**: GPUs typically have less memory than CPUs, therefore efficient memory management is essential to maximize the use of available GPU memory, especially for large datasets.\n",
"* **Data transfer overhead**: Transferring data between CPU and GPU memory is relatively slow compared to GPU computation speed. Minimizing these transfers through smart memory management is critical for performance.\n",
"* **Performance tuning**: Understanding and optimizing memory usage is key to achieving peak performance in GPU-accelerated data processing tasks.\n",
"\n",
"When done correctly, keeping the data on the GPU can enable cuDF and the RAPIDS ecosystem to achieve significant performance improvements, handle larger datasets, and provide more efficient data processing capabilities. \n",
"\n",
"Below we import the data from the CSV file. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b7b8a623-f799-4dad-aca9-0e571bb6e527",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import pandas as pd\n",
"import random\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "711d0a7f-8598-49fc-949c-5caf6029ce47",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>age</th>\n",
"      <th>sex</th>\n",
"      <th>county</th>\n",
"      <th>lat</th>\n",
"      <th>long</th>\n",
"      <th>name</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.533644</td>\n",
"      <td>-1.524401</td>\n",
"      <td>FRANCIS</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.426256</td>\n",
"      <td>-1.465314</td>\n",
"      <td>EDWARD</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.555200</td>\n",
"      <td>-1.496417</td>\n",
"      <td>TEDDY</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.547906</td>\n",
"      <td>-1.572341</td>\n",
"      <td>ANGUS</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.477639</td>\n",
"      <td>-1.605995</td>\n",
"      <td>CHARLIE</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   age sex      county        lat      long     name\n",
"0    0   m  DARLINGTON  54.533644 -1.524401  FRANCIS\n",
"1    0   m  DARLINGTON  54.426256 -1.465314   EDWARD\n",
"2    0   m  DARLINGTON  54.555200 -1.496417    TEDDY\n",
"3    0   m  DARLINGTON  54.547906 -1.572341    ANGUS\n",
"4    0   m  DARLINGTON  54.477639 -1.605995  CHARLIE"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"\n",
"# preview\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "36416fd0-7081-42aa-bf31-d1231b81ec0b",
"metadata": {},
"source": [
"### Memory Usage ###\n",
"Memory utilization of a DataFrame depends on the data types of its columns.\n",
"\n",
"<p><img src='images/dtypes.png' width=720></p>\n",
"\n",
"We can use `DataFrame.memory_usage()` to see the memory usage for each column (in bytes). Most of the common data types have a fixed size in memory, such as `int`, `float`, `datetime`, and `bool`. Memory usage for these data types is the respective memory requirement multiplied by the number of data points. For the `string` data type, the memory usage reported _for pandas_ is the number of elements times 8 bytes. This accounts for the 64 bits required for the pointer that points to an address in memory, but not the memory used for the actual string values. The actual memory required for a string value is 49 bytes plus an additional byte for each character. The `deep` parameter provides a more accurate memory usage report that accounts for the system-level memory consumption of the contained `string` data type. \n",
"\n",
"Below we get the memory usage. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8378207b-2d9e-4102-8408-c2dddafc8a40",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index          128\n",
"age      467839152\n",
"sex     3391833852\n",
"county  3934985133\n",
"lat      467839152\n",
"long     467839152\n",
"name    3666922374\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# pandas memory utilization\n",
"mem_usage_df=df.memory_usage(deep=True)\n",
"mem_usage_df"
]
},
{
"cell_type": "markdown",
"id": "07c24bb1-c4f7-440c-a949-d4c57800ec61",
"metadata": {},
"source": [
"Below we define a `make_decimal()` function to convert memory size into units based on powers of 2. In contrast to units based on powers of 10, this customary convention is commonly used to report memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units). "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5ae42218-1547-49fd-9123-ab508a2b03de",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
"def make_decimal(nbytes):\n",
"    i=0\n",
"    while nbytes >= 1024 and i < len(suffixes)-1:\n",
"        nbytes/=1024.\n",
"        i+=1\n",
"    f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
"    return '%s %s' % (f, suffixes[i])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e6d4a613-3eea-4dce-8e71-39593ff6f226",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'11.55 GB'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"make_decimal(mem_usage_df.sum())"
]
},
{
"cell_type": "markdown",
"id": "a352c0b2-65aa-4231-b753-556aca46ff49",
"metadata": {},
"source": [
"Below we calculate the memory usage manually based on the data types. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "630327b9-6dc1-4b70-9fdf-9f7763ec4d50",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Numerical columns use 467839152 bytes of memory\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# get number of rows\n",
"num_rows=len(df)\n",
"\n",
"# a 64-bit number uses 8 bytes of memory\n",
"print(f'Numerical columns use {num_rows*8} bytes of memory')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bb22b5f4-e38f-438e-9426-61746b509e50",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"county column uses 3934985133 bytes of memory.\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# check random string-typed column\n",
"string_cols=[col for col in df.columns if df[col].dtype=='object']\n",
"column_to_check=random.choice(string_cols)\n",
"\n",
"overhead=49\n",
"pointer_size=8\n",
"\n",
"# a missing value (NaN) is stored as a float and uses 32 bytes of memory\n",
"string_col_mem_usage_df=df[column_to_check].map(lambda x: len(x)+overhead+pointer_size if isinstance(x, str) else 32)\n",
"string_col_mem_usage=string_col_mem_usage_df.sum()\n",
"print(f'{column_to_check} column uses {string_col_mem_usage} bytes of memory.')"
]
},
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "94e393c2-c0d0-40ee-82d2-730c4667e9b8",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"**Note**: The `string` data type is stored differently in cuDF than it is in pandas. More information about `libcudf` stores string data using the [Arrow format](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) can be found [here](https://developer.nvidia.com/blog/mastering-string-transformations-in-rapids-libcudf/). "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "737ff50b-9426-4e08-a00a-d7ee69f48b9f",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Data Types ##\n",
|
||||||
|
"By default, pandas (and cuDF) uses 64-bit for numerical values. Using 64-bit numbers provides the highest precision but many applications do not require 64-bit precision when aggregating over a very large number of data points. When possible, using 32-bit numbers reduces storage and memory requirements in half, and also typically greatly speeds up computations because only half as much data needs to be accessed in memory. "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "0b77d450-c415-44b8-87ac-20ce616ec809",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Convert Data Types ###\n",
|
||||||
|
"The `.astype()` method can be used to convert numerical data types to use different bit-size containers. Here we convert the `age` column from `int64` to `int8`. "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 9,
|
||||||
|
"id": "603f7c70-134e-4466-a790-8a18b9088ca6",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"age int8\n",
|
||||||
|
"sex object\n",
|
||||||
|
"county object\n",
|
||||||
|
"lat float64\n",
|
||||||
|
"long float64\n",
|
||||||
|
"name object\n",
|
||||||
|
"dtype: object"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 9,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# DO NOT CHANGE THIS CELL\n",
|
||||||
|
"df['age']=df['age'].astype('int8')\n",
|
||||||
|
"\n",
|
||||||
|
"df.dtypes"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
"cell_type": "markdown",
"id": "973a6dd4-2aef-44d9-8b01-8853032eddae",
"metadata": {},
"source": [
"### Exercise #1 - Modify `dtypes` ###\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the below cell to convert any 64-bit data types to their 32-bit counterparts."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "beb7d71b-6672-462e-b65c-a64dbe5f7a57",
"metadata": {},
"outputs": [],
"source": [
"df['lat']=df['lat'].astype('float32')\n",
"df['long']=df['long'].astype('float32')"
]
},
{
"cell_type": "raw",
"id": "3b44fb22-a0f1-4e43-a332-1ccbad50caee",
"metadata": {},
"source": [
"\n",
"df['lat']=df['lat'].astype('float32')\n",
"df['long']=df['long'].astype('float32')"
]
},
{
"cell_type": "markdown",
"id": "98b6542d-22cc-4926-b600-a3e052c37c96",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "markdown",
"id": "7b2cd622-977c-4915-a87f-2fe03c1793f5",
"metadata": {},
"source": [
"### Categorical ###\n",
"Categorical data represents discrete, distinct categories or groups. Categories can have a meaningful order or ranking but generally cannot be used in numerical operations. When appropriate, using the `category` data type can reduce memory usage and lead to faster operations. It can also be used to define and maintain a custom order of categories. \n",
"\n",
"Below we get the number of unique values in the string columns. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f249e4b8-5d7a-4b44-ac15-bd3360a43f2a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sex 2\n",
"county 171\n",
"name 13212\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df.select_dtypes(include='object').nunique()"
]
},
{
"cell_type": "markdown",
"id": "f1d8bd88-b39b-4043-9039-d8bd75fe851a",
"metadata": {},
"source": [
"Below we convert columns with few discrete values to `category`. The `category` data type has `.categories` and `.codes` properties that are accessed through the `.cat` accessor. "
]
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a99bebbf-2e5b-4720-96f9-9fd7d42d2fe8",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df['sex']=df['sex'].astype('category')\n",
"df['county']=df['county'].astype('category')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "41b7b290-cfcf-4ff6-b6b4-454c19b44a62",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['BARKING AND DAGENHAM', 'BARNET', 'BARNSLEY',\n",
" 'BATH AND NORTH EAST SOMERSET', 'BEDFORD', 'BEXLEY', 'BIRMINGHAM',\n",
" 'BLACKBURN WITH DARWEN', 'BLACKPOOL', 'BLAENAU GWENT',\n",
" ...\n",
" 'WESTMINSTER', 'WIGAN', 'WILTSHIRE', 'WINDSOR AND MAIDENHEAD', 'WIRRAL',\n",
" 'WOKINGHAM', 'WOLVERHAMPTON', 'WORCESTERSHIRE', 'WREXHAM', 'YORK'],\n",
" dtype='object', length=171)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"----------------------------------------\n"
]
},
{
"data": {
"text/plain": [
"0 37\n",
"1 37\n",
"2 37\n",
"3 37\n",
"4 37\n",
" ..\n",
"58479889 96\n",
"58479890 96\n",
"58479891 96\n",
"58479892 96\n",
"58479893 96\n",
"Length: 58479894, dtype: int16"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"display(df['county'].cat.categories)\n",
"print('-'*40)\n",
"display(df['county'].cat.codes)"
]
},
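{
"cell_type": "markdown",
"id": "added-category-memory-note",
"metadata": {},
"source": [
"As an optional illustration (an added sketch; it assumes `df` from the cells above is still in memory): comparing the memory footprint of the `object` and `category` representations of the same column shows why `category` helps when there are few distinct values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-category-memory-check",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: bytes used by 'county' as object strings vs. as a category\n",
"print(df['county'].astype('object').memory_usage(deep=True))\n",
"print(df['county'].memory_usage(deep=True))"
]
},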
{
"cell_type": "markdown",
"id": "737385ab-677c-4bef-a86a-10aa3119e29a",
"metadata": {},
"source": [
"**Note**: `.astype()` can also be used to convert data to `datetime` or `object` to enable datetime and string methods. "
]
},
{
"cell_type": "markdown",
"id": "552c47c2-0fbc-455e-8745-cb98fc777243",
"metadata": {},
"source": [
"## Efficient Data Loading ##\n",
"It is often advantageous to specify the most appropriate data type for each column, based on its value range, precision requirements, and how it will be used. "
]
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c2b9f0c3-8598-4a28-9481-ce28fea7544b",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"age 467839152\n",
"sex 3391833852\n",
"county 3934985133\n",
"lat 467839152\n",
"long 467839152\n",
"name 3666922374\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 11.55 GB took 33.63 seconds.\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"duration=time.time()-start\n",
"\n",
"mem_usage_df=df.memory_usage(deep=True)\n",
"display(mem_usage_df)\n",
"\n",
"print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
]
},
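{
"cell_type": "markdown",
"id": "added-memory-estimate-note",
"metadata": {},
"source": [
"A back-of-the-envelope check (an added sketch): each `float64` or `int64` column stores 8 bytes per row, so 58,479,894 rows should take 58,479,894 × 8 = 467,839,152 bytes per column, matching the `memory_usage` output above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-memory-estimate-check",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: expected bytes for one 64-bit column of 58,479,894 rows\n",
"print(58479894 * 8)  # 467839152"
]
},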
{
"cell_type": "markdown",
"id": "5729520e-3ed8-4ec6-ae1f-ba46d642f48d",
"metadata": {},
"source": [
"Below we enable `cudf.pandas` to see the difference. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "99aa0f32-4d2a-43a7-bec1-f1b88bcc37c2",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"%load_ext cudf.pandas\n",
"\n",
"import pandas as pd\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2b724201-9ad1-4e9b-b712-f3b31bdc4104",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
"def make_decimal(nbytes):\n",
" i=0\n",
" while nbytes >= 1024 and i < len(suffixes)-1:\n",
" nbytes/=1024.\n",
" i+=1\n",
" f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
" return '%s %s' % (f, suffixes[i])"
]
},
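{
"cell_type": "markdown",
"id": "added-make-decimal-demo-note",
"metadata": {},
"source": [
"A quick check of the helper (an added sketch): `make_decimal()` repeatedly divides by 1024 and appends the matching unit suffix."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-make-decimal-demo",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: 467839152 bytes is about 446.17 MB\n",
"print(make_decimal(467839152))"
]
},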
{
"cell_type": "code",
"execution_count": 17,
"id": "99bdd7b0-8563-41db-bd8e-3a7279394ede",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"age 58479894\n",
"sex 58479908\n",
"county 58482446\n",
"lat 467839152\n",
"long 467839152\n",
"name 117096917\n",
"Index 0\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 1.14 GB took 2.13 seconds.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Total time elapsed: 2.705 seconds </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Stats </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
"┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
"│ 2 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> start</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 5 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> dtype_dict</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">{</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 6 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'int8'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 7 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 8 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 9 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 10 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 11 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'name'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 14 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'./data/uk_pop.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, dtype</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">dtype_dict)</span><span style=\"background-color: #272822\"> </span> │ 1.728013188 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 15 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> duration</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">start</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 17 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">memory_usage(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'deep'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ 0.005340174 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 18 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(mem_usage_df)</span><span style=\"background-color: #272822\"> </span> │ 0.011073721 │ 0.006896915 │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 20 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> print(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">f'Loading {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">make_decimal(mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">sum())</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">} took {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">round(dura…</span> │ 0.004693074 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
"</pre>\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%cudf.pandas.line_profile\n",
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"\n",
"# define data types for each column\n",
"dtype_dict={\n",
" 'age': 'int8', \n",
" 'sex': 'category', \n",
" 'county': 'category', \n",
" 'lat': 'float64', \n",
" 'long': 'float64', \n",
" 'name': 'category'\n",
"}\n",
" \n",
"efficient_df=pd.read_csv('./data/uk_pop.csv', dtype=dtype_dict)\n",
"duration=time.time()-start\n",
"\n",
"mem_usage_df=efficient_df.memory_usage('deep')\n",
"display(mem_usage_df)\n",
"\n",
"print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
]
},
{
"cell_type": "markdown",
"id": "0f4607d8-6de3-4b27-96d4-a9720d268333",
"metadata": {},
"source": [
"We were able to load the data faster and with a much smaller memory footprint. \n",
"\n",
"**Note**: Notice that the memory utilized on the GPU is larger than the memory used by the DataFrame. This is expected because there are intermediary processes that use some memory during the data loading process, specifically related to parsing the CSV file in this case. \n",
"\n",
"```\n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 Off | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 32C P0 26W / 70W | 1378MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 Off | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 31C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 Off | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n",
"```"
]
},
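{
"cell_type": "markdown",
"id": "added-dtype-verification-note",
"metadata": {},
"source": [
"A quick sanity check (an added sketch; it assumes `efficient_df` from the profiled cell above is still in memory): confirm that the data types requested in `dtype_dict` took effect."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-dtype-verification",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the columns should now be int8, category, and float64 as requested\n",
"efficient_df.dtypes"
]
},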
|
||||||
|
{
"cell_type": "code",
"execution_count": 18,
"id": "92f7ee37-4acb-46aa-bb73-4c0139d3f6b8",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 08:08:25 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 28C P0 24W / 70W | 11314MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 29C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 28C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 24W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "c031d2c7-03cb-4ac7-a195-70fc25cb191d",
"metadata": {},
"source": [
"When loading data this way, we may be able to fit more data. The optimal dataset size depends on various factors including the specific operations being performed, the complexity of the workload, and the available GPU memory. To maximize acceleration, datasets should ideally fit within GPU memory, with ample space left for operations that can spike memory requirements. As a general rule of thumb, cuDF recommends datasets that are less than 50% of the GPU memory capacity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec6cefea-dc64-4f13-815e-081cd35651b9",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# 1 gigabyte = 1073741824 bytes\n",
"mem_capacity=16*1073741824\n",
"\n",
"mem_per_record=mem_usage_df.sum()/len(efficient_df)\n",
"\n",
"print(f'We can load {int(mem_capacity/2/mem_per_record)} rows.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddaaa1ac-66ec-4323-9842-2543c6d85e4e",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "658e9847-775f-4d12-af4e-8f896df4e6fe",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-04_interoperability.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b86451cf-60e6-4733-b431-1bc0bd586bc2",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.10.15"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
1269
ds/25-1/2/1-04_interoperability.ipynb
Normal file
1085
ds/25-1/2/1-05_grouping.ipynb
Normal file
3435
ds/25-1/2/1-06_data_visualization.ipynb
Normal file
1123
ds/25-1/2/1-07_etl.ipynb
Normal file
1220
ds/25-1/2/1-08_cudf-polars.ipynb
Normal file
978
ds/25-1/2/1-09_dask-cudf.ipynb
Normal file
@@ -0,0 +1,978 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transition Path: cuDF provides a way for users to scale their pandas workflows as data sizes grow, offering a middle ground between single-threaded pandas and distributed computing solutions like Dask or Apache Spark ."
|
||||||
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 09 - Introduction to Dask cuDF ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"[Dask](https://dask.org/) cuDF can be used to distribute dataframe operations to multiple GPUs. In this notebook we will introduce some key Dask concepts, learn how to setup a Dask cluster for utilizing multiple GPUs, and see how to perform simple dataframe operations on distributed Dask dataframes. This notebook covers the below sections: \n",
|
||||||
|
"1. [An Introduction to Dask](#An-Introduction-to-Dask)\n",
|
||||||
|
"2. [Setting up a Dask Scheduler](#Setting-up-a-Dask-Scheduler)\n",
|
||||||
|
" * [Obtaining the Local IP Address](#Obtaining-the-Local-IP-Address)\n",
|
||||||
|
" * [Starting a `LocalCUDACluster`](#Starting-a-LocalCUDACluster)\n",
|
||||||
|
" * [Instantiating a Client Connection](#Instantiating-a-Client-Connection)\n",
|
||||||
|
" * [The Dask Dashboard](#The-Dask-Dashboard)\n",
|
||||||
|
"3. [Reading Data with Dask cuDF](#Reading-Data-with-Dask-cuDF)\n",
|
||||||
|
"4. [Computational Graph](#Computational-Graph)\n",
|
||||||
|
" * [Visualizing the Computational Graph](#Visualizing-the-Computational-Graph)\n",
|
||||||
|
" * [Extending the Computational Graph](#Extending-the-Computational-Graph)\n",
|
||||||
|
" * [Computing with the Computational Graph](#Computing-with-the-Computational-Graph)\n",
|
||||||
|
" * [Persisting Data in the Cluster](#Persisting-Data-in-the-Cluster)\n",
|
||||||
|
"6. [Initial Data Exploration with Dask cuDF](#Initial-Data-Exploration-with-Dask-cuDF)\n",
|
||||||
|
" * [Exercise #1 - Counties North of Sunderland with Dask](#Exercise-#1---Counties-North-of-Sunderland-with-Dask)"
|
||||||
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An Introduction to Dask ##\n",
"[Dask](https://dask.org/) is a Python library for parallel computing. In Dask programming, we create computational graphs that define code we **would like** to execute, and then give these computational graphs to a Dask scheduler, which evaluates them lazily and efficiently in parallel.\n",
"\n",
"In addition to using multiple CPU cores or threads to execute computational graphs in parallel, Dask schedulers can also be configured to execute computational graphs on multiple CPUs, or, as we will do in this workshop, multiple GPUs. As a result, Dask programming facilitates operating on data sets that are larger than the memory of a single compute resource.\n",
"\n",
"Because Dask computational graphs can consist of arbitrary Python code, they provide [a level of control and flexibility superior to many other systems](https://docs.dask.org/en/latest/spark.html) that can operate on massive data sets. However, we will focus for this workshop primarily on the Dask DataFrame, one of several data structures whose operations and methods natively utilize Dask's parallel scheduling:\n",
"* Dask DataFrame, which closely resembles the Pandas DataFrame\n",
"* Dask Array, which closely resembles the NumPy ndarray\n",
"* Dask Bag, a set which allows duplicates and can hold heterogeneously-typed data\n",
"\n",
"In particular, we will use a Dask-cuDF dataframe, which combines the interface of Dask with the GPU power of cuDF for distributed dataframe operations on multiple GPUs. We will now turn our attention to utilizing all 4 NVIDIA V100 GPUs in this environment for operations on an 18GB UK population data set that would not fit into the memory of a single 16GB GPU."
|
||||||
|
]
},
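The lazy-graph model described above can be illustrated with a tiny CPU-only sketch using `dask.delayed` (a hypothetical toy example, not part of this notebook's workflow): wrapped calls only record nodes in a graph, and nothing executes until we explicitly request a result.

```python
import dask

# Wrapping functions with dask.delayed defers their execution:
# calling them only records a node in a computational graph.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Build the graph; no work has happened yet.
total = add(inc(1), inc(2))

# Hand the graph to the scheduler for (potentially parallel) execution.
print(total.compute())  # 5
```

The same deferred-then-computed pattern is what Dask-cuDF dataframe operations use under the hood.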
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up a Dask Scheduler ##\n",
"We begin by starting a Dask scheduler which will take care to distribute our work across the 4 available GPUs. In order to do this we need to start a `LocalCUDACluster` instance, using our host machine's IP, and then instantiate a client that can communicate with the cluster."
|
||||||
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Obtaining the Local IP Address ###"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import subprocess # we will use this to obtain our local IP using the following command\n",
"cmd = \"hostname --all-ip-addresses\"\n",
"\n",
"process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)\n",
"output, error = process.communicate()\n",
"IPADDR = str(output.decode()).split()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Starting a `LocalCUDACluster` ###\n",
"`dask_cuda` provides utilities for Dask and CUDA (the \"cu\" in cuDF) interactions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-21 13:31:13,108 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:44687' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 16), ('read_csv-910ec886221afde30c768158c33b486c', 67), ('read_csv-910ec886221afde30c768158c33b486c', 0), ('read_csv-910ec886221afde30c768158c33b486c', 41), ('read_csv-910ec886221afde30c768158c33b486c', 54), ('read_csv-910ec886221afde30c768158c33b486c', 9), ('read_csv-910ec886221afde30c768158c33b486c', 38), ('read_csv-910ec886221afde30c768158c33b486c', 5), ('read_csv-910ec886221afde30c768158c33b486c', 34), ('read_csv-910ec886221afde30c768158c33b486c', 12), ('read_csv-910ec886221afde30c768158c33b486c', 2), ('read_csv-910ec886221afde30c768158c33b486c', 27), ('read_csv-910ec886221afde30c768158c33b486c', 62), ('read_csv-910ec886221afde30c768158c33b486c', 46), ('read_csv-910ec886221afde30c768158c33b486c', 30), ('read_csv-910ec886221afde30c768158c33b486c', 59), ('read_csv-910ec886221afde30c768158c33b486c', 23)} (stimulus_id='handle-worker-cleanup-1761053473.108198')\n",
"2025-10-21 13:31:13,110 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:35977' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 29), ('read_csv-910ec886221afde30c768158c33b486c', 48), ('read_csv-910ec886221afde30c768158c33b486c', 32), ('read_csv-910ec886221afde30c768158c33b486c', 10), ('read_csv-910ec886221afde30c768158c33b486c', 51), ('read_csv-910ec886221afde30c768158c33b486c', 25), ('read_csv-910ec886221afde30c768158c33b486c', 60), ('read_csv-910ec886221afde30c768158c33b486c', 44), ('read_csv-910ec886221afde30c768158c33b486c', 14), ('read_csv-910ec886221afde30c768158c33b486c', 57), ('read_csv-910ec886221afde30c768158c33b486c', 18), ('read_csv-910ec886221afde30c768158c33b486c', 8), ('read_csv-910ec886221afde30c768158c33b486c', 66), ('read_csv-910ec886221afde30c768158c33b486c', 21), ('read_csv-910ec886221afde30c768158c33b486c', 36), ('read_csv-910ec886221afde30c768158c33b486c', 4), ('read_csv-910ec886221afde30c768158c33b486c', 55)} (stimulus_id='handle-worker-cleanup-1761053473.1105292')\n",
"2025-10-21 13:31:13,112 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:39371' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 7), ('read_csv-910ec886221afde30c768158c33b486c', 58), ('read_csv-910ec886221afde30c768158c33b486c', 3), ('read_csv-910ec886221afde30c768158c33b486c', 26), ('read_csv-910ec886221afde30c768158c33b486c', 61), ('read_csv-910ec886221afde30c768158c33b486c', 22), ('read_csv-910ec886221afde30c768158c33b486c', 19), ('read_csv-910ec886221afde30c768158c33b486c', 15), ('read_csv-910ec886221afde30c768158c33b486c', 50), ('read_csv-910ec886221afde30c768158c33b486c', 47), ('read_csv-910ec886221afde30c768158c33b486c', 53), ('read_csv-910ec886221afde30c768158c33b486c', 37), ('read_csv-910ec886221afde30c768158c33b486c', 43), ('read_csv-910ec886221afde30c768158c33b486c', 11), ('read_csv-910ec886221afde30c768158c33b486c', 40), ('read_csv-910ec886221afde30c768158c33b486c', 65), ('read_csv-910ec886221afde30c768158c33b486c', 33)} (stimulus_id='handle-worker-cleanup-1761053473.1126676')\n",
"2025-10-21 13:31:13,114 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:36291' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 52), ('read_csv-910ec886221afde30c768158c33b486c', 13), ('read_csv-910ec886221afde30c768158c33b486c', 42), ('read_csv-910ec886221afde30c768158c33b486c', 45), ('read_csv-910ec886221afde30c768158c33b486c', 6), ('read_csv-910ec886221afde30c768158c33b486c', 35), ('read_csv-910ec886221afde30c768158c33b486c', 64), ('read_csv-910ec886221afde30c768158c33b486c', 31), ('read_csv-910ec886221afde30c768158c33b486c', 28), ('read_csv-910ec886221afde30c768158c33b486c', 63), ('read_csv-910ec886221afde30c768158c33b486c', 24), ('read_csv-910ec886221afde30c768158c33b486c', 56), ('read_csv-910ec886221afde30c768158c33b486c', 17), ('read_csv-910ec886221afde30c768158c33b486c', 1), ('read_csv-910ec886221afde30c768158c33b486c', 20), ('read_csv-910ec886221afde30c768158c33b486c', 49), ('read_csv-910ec886221afde30c768158c33b486c', 39), ('read_csv-910ec886221afde30c768158c33b486c', 68)} (stimulus_id='handle-worker-cleanup-1761053473.1145272')\n"
]
}
],
"source": [
"from dask_cuda import LocalCUDACluster\n",
"cluster = LocalCUDACluster(ip=IPADDR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instantiating a Client Connection ###\n",
"The `dask.distributed` library gives us distributed functionality, including the ability to connect to the CUDA cluster we just created. The `progress` import will give us a handy progress bar we can utilize below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client, progress\n",
"\n",
"client = Client(cluster)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Dask Dashboard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dask ships with a very helpful dashboard that, in our case, runs on port `8787`. Open a new browser tab now and copy this lab's URL into it, replacing `/lab/lab` with `:8787` (so it ends with `.com:8787`). This should open the Dask dashboard, currently idle."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Data with Dask cuDF ##\n",
"With `dask_cudf` we can create a dataframe from several file formats (including from multiple files and directly from cloud storage like S3), from cuDF dataframes, from Pandas dataframes, and even from vanilla CPU Dask dataframes. Here we will create a Dask cuDF dataframe from the local csv file `uk_pop5x.csv`, which has similar features to the `pop.csv` files you have already been using, except scaled up to 5 times larger (18GB), representing a population of almost 300 million, nearly the size of the entire United States."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"18G data/uk_pop5x.csv\n"
]
}
],
"source": [
"# get the file size of `uk_pop5x.csv` in GB\n",
"!ls -sh data/uk_pop5x.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We import `dask_cudf` (and other RAPIDS components when necessary) after setting up the cluster, to ensure that they are initialized correctly within the CUDA context the cluster creates."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import dask_cudf"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"ddf = dask_cudf.read_csv('./data/uk_pop5x.csv', dtype=['float32', 'str', 'str', 'float32', 'float32', 'str'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"age float32\n",
"sex object\n",
"county object\n",
"lat float32\n",
"long float32\n",
"name object\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computational Graph ##\n",
|
||||||
|
"As mentioned above, when programming with Dask, we create computational graphs that we **would eventually like** to be executed. We can already observe this behavior in action: in calling `dask_cudf.read_csv` we have indicated that **would eventually like** to read the entire contents of `pop5x_1-07.csv`. However, Dask will not ask the scheduler execute this work until we explicitly indicate that we would like it do so.\n",
|
||||||
|
"\n",
|
||||||
|
"Observe the memory usage for each of the 4 GPUs by executing the following cell, and notice that the GPU memory usage is not nearly large enough to indicate that the entire 18GB file has been read into memory:"
|
||||||
|
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 13:29:09 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 14956MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizing the Computational Graph ###\n",
"Computational graphs that have not yet been executed provide the `.visualize` method that, when used in a Jupyter environment such as this one, will display the computational graph, including how Dask intends to go about distributing the work. Thus, we can visualize how the `read_csv` operation will be distributed by Dask by executing the following cell:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"115pt\" height=\"44pt\"\n",
" viewBox=\"0.00 0.00 115.00 44.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 40)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-40 111,-40 111,4 -4,4\"/>\n",
"<!-- -6332770613817605186 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>-6332770613817605186</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"107,-36 0,-36 0,0 107,0 107,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"53.5\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">ReadCSV</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94de3b45b0>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.visualize(format='svg') # Using `format='svg'` makes the visualization easier to view."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we eventually tell Dask to execute this operation, it will parallelize the work across the 4 GPUs in 69 parallel partitions. We can see the exact number of partitions with the `npartitions` property:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"69"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.npartitions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extending the Computational Graph ###\n",
|
||||||
|
"The concept of constructing computational graphs with arbitrary operations before executing them is a core part of Dask. Let's add some operations to the existing computational graph and visualize it again.\n",
|
||||||
|
"\n",
|
||||||
|
"After running the next cell, although it will take some scrolling to get a clear sense of it (the challenges of distributed data analytics!), you can see that the graph already constructed for `read_csv` now continues upward. It selects the `age` column across all partitions (visualized as `getitem`) and eventually performs the `.mean()` reduction (visualized as `series-sum-chunk`, `series-sum-agg`, `count-chunk`, `sum-agg` and `true-div`)."
|
||||||
|
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"276pt\" height=\"188pt\"\n",
" viewBox=\"0.00 0.00 276.00 188.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 184)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-184 272,-184 272,4 -4,4\"/>\n",
"<!-- 2336549067836068764 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>2336549067836068764</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"221,-180 47,-180 47,-144 221,-144 221,-180\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-157\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">Sum(Projection)</text>\n",
"</g>\n",
"<!-- 553658985626135620 -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>553658985626135620</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"268,-108 0,-108 0,-72 268,-72 268,-108\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-85\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">Projection(ReadCSV, age)</text>\n",
"</g>\n",
"<!-- 553658985626135620->2336549067836068764 -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>553658985626135620->2336549067836068764</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M134,-108.3C134,-116.02 134,-125.29 134,-133.89\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"130.5,-133.9 134,-143.9 137.5,-133.9 130.5,-133.9\"/>\n",
"</g>\n",
"<!-- -6332770613817605186 -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>-6332770613817605186</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"187.5,-36 80.5,-36 80.5,0 187.5,0 187.5,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">ReadCSV</text>\n",
"</g>\n",
"<!-- -6332770613817605186->553658985626135620 -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>-6332770613817605186->553658985626135620</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M134,-36.3C134,-44.02 134,-53.29 134,-61.89\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"130.5,-61.9 134,-71.9 137.5,-61.9 130.5,-61.9\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94de3b59f0>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
|
||||||
|
"mean_age = ddf['age'].sum()\n",
|
||||||
|
"mean_age.visualize(format='svg')"
|
||||||
|
]
|
||||||
|
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing with the Computational Graph ###\n",
|
||||||
|
"There are several ways to indicate to Dask that we would like to perform the computations described in the computational graphs we have constructed. The first we will show is the `.compute` method, which returns the output of the computation as an object in a single GPU's memory - no longer distributed across GPUs.\n",
"\n",
"**NOTE**: This value is actually a [*future*](https://docs.python.org/3/library/concurrent.futures.html), which can be used in code immediately, even before it has finished evaluating. While this can be tremendously useful in many scenarios, in this workshop we will not need to do anything fancy with the futures we generate except wait for them to evaluate so we can visualize their values.\n",
"\n",
"Below we send the computational graph we have created to the Dask scheduler to be executed in parallel on our 4 GPUs. If you still have the Dask Dashboard open in another tab, you can watch it while the operation completes. Because our graph involves reading the entire 18GB data set (as we declared when adding `read_csv` to the call graph), you can expect the operation to take a little time. If you watch the dashboard closely, you will see that Dask begins follow-on calculations for the `sum` even while data is still being read into memory."
|
||||||
|
]
|
||||||
|
},
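The *future* mechanic referenced above comes from Python's standard `concurrent.futures` module. As a minimal, CPU-only sketch using only the standard library (no Dask or GPUs involved), a future can be handed around immediately while its result is still being computed in the background:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_sum(values):
    # Simulate a long-running aggregation.
    time.sleep(0.1)
    return sum(values)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(slow_sum, range(10))  # returns a Future immediately
    print(fut.done())    # likely False: the work is still in flight
    print(fut.result())  # blocks until the value is ready, then prints 45
```

Dask's futures follow the same idea, except the work runs on the cluster's workers rather than a local thread pool.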
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 12,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"11732293000.0"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 12,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"mean_age.compute()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Persisting Data in the Cluster ###\n",
|
||||||
|
"As you can see, the previous operation, which read the entire 18GB CSV into the GPUs' memory, did not retain the data in memory after the computational graph completed:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 13,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"Tue Oct 21 13:31:04 2025 \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
|
||||||
|
"|-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
|
||||||
|
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
|
||||||
|
"| | | MIG M. |\n",
|
||||||
|
"|===============================+======================+======================|\n",
|
||||||
|
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
|
||||||
|
"| N/A 30C P0 26W / 70W | 14094MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
|
||||||
|
"| N/A 30C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
|
||||||
|
"| N/A 30C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
|
||||||
|
"| N/A 29C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
" \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| Processes: |\n",
|
||||||
|
"| GPU GI CI PID Type Process name GPU Memory |\n",
|
||||||
|
"| ID ID Usage |\n",
|
||||||
|
"|=============================================================================|\n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"!nvidia-smi"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"A typical Dask workflow, and the one we will use here, is to persist the data we would like to work with to the cluster, then perform fast operations on that persisted data. We do this with the `.persist` method. From the [Dask documentation](https://distributed.dask.org/en/latest/manage-computation.html#client-persist):\n",
"\n",
">The `.persist` method submits the task graph behind the Dask collection to the scheduler, obtaining Futures for all of the top-most tasks (for example one Future for each Pandas [*or cuDF*] DataFrame in a Dask[*-cudf*] DataFrame). It then returns a copy of the collection pointing to these futures instead of the previous graph. This new collection is semantically equivalent but now points to actively running data rather than a lazy graph.\n",
"\n",
"Below we persist `ddf` to the cluster so that it resides in GPU memory, ready for fast operations."
|
||||||
|
]
|
||||||
|
},
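To make the persist/compute split concrete, here is a toy, hypothetical sketch (not Dask's real implementation): a lazy collection holds either a recipe (a zero-argument callable) per partition or, after `.persist()`, the materialized partition data itself, while `.compute()` gathers everything into one local result:

```python
class LazyFrame:
    """Toy stand-in for a Dask collection: partitions are either
    zero-arg callables (lazy tasks) or already-materialized lists."""

    def __init__(self, partitions):
        self.partitions = partitions

    def persist(self):
        # Run every partition task now; keep the results resident,
        # analogous to data living in the cluster's (GPU) memory.
        return LazyFrame([p() if callable(p) else p for p in self.partitions])

    def compute(self):
        # Gather all partitions into a single local result.
        out = []
        for p in self.partitions:
            out.extend(p() if callable(p) else p)
        return out

lazy = LazyFrame([lambda: list(range(3)), lambda: list(range(3, 6))])
resident = lazy.persist()  # partitions are now concrete lists
print(resident.compute())  # prints [0, 1, 2, 3, 4, 5]
```

Follow-on operations against `resident` skip the partition tasks entirely, which is why computations after `persist` are so much faster.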
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 14,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"ddf = ddf.persist()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"As you can see by executing `nvidia-smi` (after letting the `persist` finish), each GPU now has parts of the distributed dataframe in its memory:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 15,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"Tue Oct 21 13:31:08 2025 \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
|
||||||
|
"|-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
|
||||||
|
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
|
||||||
|
"| | | MIG M. |\n",
|
||||||
|
"|===============================+======================+======================|\n",
|
||||||
|
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
|
||||||
|
"| N/A 32C P0 33W / 70W | 14218MiB / 15360MiB | 46% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
|
||||||
|
"| N/A 32C P0 32W / 70W | 3768MiB / 15360MiB | 19% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
|
||||||
|
"| N/A 31C P0 32W / 70W | 3804MiB / 15360MiB | 24% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
|
||||||
|
"| N/A 31C P0 32W / 70W | 3764MiB / 15360MiB | 45% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
" \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| Processes: |\n",
|
||||||
|
"| GPU GI CI PID Type Process name GPU Memory |\n",
|
||||||
|
"| ID ID Usage |\n",
|
||||||
|
"|=============================================================================|\n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"!nvidia-smi"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Running `ddf.visualize` now shows that our task graph no longer contains any operations, only partitions of data ready for us to operate on:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"image/svg+xml": [
|
||||||
|
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
|
||||||
|
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
|
||||||
|
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
|
||||||
|
"<!-- Generated by graphviz version 2.43.0 (0)\n",
|
||||||
|
" -->\n",
|
||||||
|
"<!-- Title: %3 Pages: 1 -->\n",
|
||||||
|
"<svg width=\"135pt\" height=\"44pt\"\n",
|
||||||
|
" viewBox=\"0.00 0.00 135.00 44.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
|
||||||
|
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 40)\">\n",
|
||||||
|
"<title>%3</title>\n",
|
||||||
|
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-40 131,-40 131,4 -4,4\"/>\n",
|
||||||
|
"<!-- -4538719848559110466 -->\n",
|
||||||
|
"<g id=\"node1\" class=\"node\">\n",
|
||||||
|
"<title>-4538719848559110466</title>\n",
|
||||||
|
"<polygon fill=\"none\" stroke=\"black\" points=\"127,-36 0,-36 0,0 127,0 127,-36\"/>\n",
|
||||||
|
"<text text-anchor=\"middle\" x=\"63.5\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">FromGraph</text>\n",
|
||||||
|
"</g>\n",
|
||||||
|
"</g>\n",
|
||||||
|
"</svg>\n"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<graphviz.graphs.Digraph at 0x7f94b80d4550>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.visualize(format='svg')"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Computing operations on this data will now be much faster:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"40.1241924549316"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf['age'].mean().compute()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Initial Data Exploration with Dask cuDF ##\n",
|
||||||
|
"The beauty of Dask is that working with your data, even though it is distributed and massive, is a lot like working with smaller in-memory data sets."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 18,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/html": [
|
||||||
|
"<div>\n",
|
||||||
|
"<style scoped>\n",
|
||||||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||||||
|
" vertical-align: middle;\n",
|
||||||
|
" }\n",
|
||||||
|
"\n",
|
||||||
|
" .dataframe tbody tr th {\n",
|
||||||
|
" vertical-align: top;\n",
|
||||||
|
" }\n",
|
||||||
|
"\n",
|
||||||
|
" .dataframe thead th {\n",
|
||||||
|
" text-align: right;\n",
|
||||||
|
" }\n",
|
||||||
|
"</style>\n",
|
||||||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||||||
|
" <thead>\n",
|
||||||
|
" <tr style=\"text-align: right;\">\n",
|
||||||
|
" <th></th>\n",
|
||||||
|
" <th>age</th>\n",
|
||||||
|
" <th>sex</th>\n",
|
||||||
|
" <th>county</th>\n",
|
||||||
|
" <th>lat</th>\n",
|
||||||
|
" <th>long</th>\n",
|
||||||
|
" <th>name</th>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" </thead>\n",
|
||||||
|
" <tbody>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>0</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.549641</td>\n",
|
||||||
|
" <td>-1.493884</td>\n",
|
||||||
|
" <td>HARRISON</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>1</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.523945</td>\n",
|
||||||
|
" <td>-1.401142</td>\n",
|
||||||
|
" <td>LAKSH</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>2</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.561127</td>\n",
|
||||||
|
" <td>-1.690068</td>\n",
|
||||||
|
" <td>MUHAMMAD</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>3</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.542988</td>\n",
|
||||||
|
" <td>-1.543216</td>\n",
|
||||||
|
" <td>GRAYSON</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>4</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.532101</td>\n",
|
||||||
|
" <td>-1.569116</td>\n",
|
||||||
|
" <td>FINLAY</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" </tbody>\n",
|
||||||
|
"</table>\n",
|
||||||
|
"</div>"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
" age sex county lat long name\n",
|
||||||
|
"0 0.0 m Darlington 54.549641 -1.493884 HARRISON\n",
|
||||||
|
"1 0.0 m Darlington 54.523945 -1.401142 LAKSH\n",
|
||||||
|
"2 0.0 m Darlington 54.561127 -1.690068 MUHAMMAD\n",
|
||||||
|
"3 0.0 m Darlington 54.542988 -1.543216 GRAYSON\n",
|
||||||
|
"4 0.0 m Darlington 54.532101 -1.569116 FINLAY"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 18,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.head() # As a convenience, no need to `.compute` the `head()` method"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"age 292399470\n",
|
||||||
|
"sex 292399470\n",
|
||||||
|
"county 292399470\n",
|
||||||
|
"lat 292399470\n",
|
||||||
|
"long 292399470\n",
|
||||||
|
"name 292399470\n",
|
||||||
|
"dtype: int64"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.count().compute()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"age float32\n",
|
||||||
|
"sex object\n",
|
||||||
|
"county object\n",
|
||||||
|
"lat float32\n",
|
||||||
|
"long float32\n",
|
||||||
|
"name object\n",
|
||||||
|
"dtype: object"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.dtypes"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Exercise #1 - Counties North of Sunderland with Dask ###\n",
|
||||||
|
"Here we ask you to revisit an earlier exercise, but on the distributed data set. Hopefully, it's clear how similar the code is for single-GPU dataframes and distributed dataframes with Dask.\n",
|
||||||
|
"\n",
|
||||||
|
"Identify the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), then determine which counties have any residents north of that resident. Use the `unique` method of a Dask-cuDF `Series` to de-duplicate the result.\n",
|
||||||
|
"\n",
|
||||||
|
"**Instructions**: <br>\n",
|
||||||
|
"* Modify the `<FIXME>` only and execute the below cell to identify counties north of Sunderland. "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 1,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"ename": "NameError",
|
||||||
|
"evalue": "name 'ddf' is not defined",
|
||||||
|
"output_type": "error",
|
||||||
|
"traceback": [
|
||||||
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
||||||
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
||||||
|
"Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m sunderland_residents \u001b[38;5;241m=\u001b[39m \u001b[43mddf\u001b[49m\u001b[38;5;241m.\u001b[39mloc[[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcounty\u001b[39m\u001b[38;5;124m'\u001b[39m], [\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mSUNDERLAND\u001b[39m\u001b[38;5;124m'\u001b[39m]]\n\u001b[1;32m 2\u001b[0m northmost_sunderland_lat \u001b[38;5;241m=\u001b[39m sunderland_residents[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlat\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mmax()\n\u001b[1;32m 3\u001b[0m counties_with_pop_north_of \u001b[38;5;241m=\u001b[39m ddf\u001b[38;5;241m.\u001b[39mloc[ddf[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlat\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m>\u001b[39m northmost_sunderland_lat][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcounty\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39munique()\n",
|
||||||
|
"\u001b[0;31mNameError\u001b[0m: name 'ddf' is not defined"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"sunderland_residents = ddf.loc[ddf['county'] == <FIXME>]\n",
|
||||||
|
"northmost_sunderland_lat = sunderland_residents['lat'].max()\n",
|
||||||
|
"counties_with_pop_north_of = ddf.loc[ddf['lat'] > northmost_sunderland_lat]['county'].unique()\n",
|
||||||
|
"results=counties_with_pop_north_of.compute()\n",
|
||||||
|
"results.head()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "raw",
|
||||||
|
"metadata": {
|
||||||
|
"jupyter": {
|
||||||
|
"source_hidden": true
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"source": [
|
||||||
|
"\n",
|
||||||
|
"sunderland_residents = ddf.loc[ddf['county'] == 'Sunderland']\n",
|
||||||
|
"northmost_sunderland_lat = sunderland_residents['lat'].max()\n",
|
||||||
|
"counties_with_pop_north_of = ddf.loc[ddf['lat'] > northmost_sunderland_lat]['county'].unique()\n",
|
||||||
|
"results=counties_with_pop_north_of.compute()\n",
|
||||||
|
"results.head()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Click ... for solution. "
|
||||||
|
]
|
||||||
|
},
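The max-then-filter-then-`unique` pattern in this exercise is not Dask-specific. As a plain-Python sketch on a few hypothetical toy rows (not the workshop data set), the same logic looks like this:

```python
# Toy rows standing in for the population data (hypothetical values).
rows = [
    {"county": "Sunderland", "lat": 54.95},
    {"county": "Sunderland", "lat": 54.90},
    {"county": "Northumberland", "lat": 55.20},
    {"county": "Darlington", "lat": 54.52},
    {"county": "North Tyneside", "lat": 55.00},
]

# 1. Latitude of the northernmost Sunderland resident.
northmost = max(r["lat"] for r in rows if r["county"] == "Sunderland")

# 2. Counties with any resident north of that latitude, de-duplicated.
counties = sorted({r["county"] for r in rows if r["lat"] > northmost})
print(counties)  # prints ['North Tyneside', 'Northumberland']
```

With Dask the same steps run lazily across partitions; only the final `.compute()` materializes the de-duplicated result.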
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 22,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"{'status': 'ok', 'restart': True}"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 22,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"import IPython\n",
|
||||||
|
"app = IPython.Application.instance()\n",
|
||||||
|
"app.kernel.do_shutdown(True)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"**Well Done!** Let's move to the [next notebook](1-09_cudf-polars.ipynb). "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"<img src=\"./images/DLI_Header.png\" width=400/>"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.10.15"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
172
ds/25-1/2/county_centroid.csv
Normal file
@ -0,0 +1,172 @@
county,lat_county_center,long_county_center
BARKING AND DAGENHAM,51.621048311776526,0.12958319845588165
BARNET,51.81255163972051,-0.21821206632197684
BARNSLEY,53.57190690010971,-1.5487193565226611
BATH AND NORTH EAST SOMERSET,51.35496548780361,-2.486675162410336
BEDFORD,52.145475839485385,-0.4549734374180617
BEXLEY,51.33625605642689,0.14633321710015448
BIRMINGHAM,52.12178304394528,-1.881329432771379
BLACKBURN WITH DARWEN,53.63718763008419,-2.463700844959783
BLACKPOOL,53.882118373353435,-3.0229009637127167
BLAENAU GWENT,51.75159582861159,-3.1862426125686745
BOLTON,53.73813128127497,-2.4794091133678147
BRACKNELL FOREST,51.457925145468295,-0.7336441271286038
BRADFORD,53.972113267048044,-1.8738762931122748
BRENT,51.761695309784,-0.2756927203781798
BRIDGEND,51.522888539164526,-3.6137468421270604
BRIGHTON AND HOVE,50.94890407892698,-0.1507807253912774
"BRISTOL, CITY OF",51.53203785026057,-2.5774864859032594
BROMLEY,51.2251371203518,0.03905163114984023
BUCKINGHAMSHIRE,51.92925587759856,-0.8053996183750294
BURY,53.61553432785575,-2.3088650595977023
CAERPHILLY,51.62781255006381,-3.1973649865483735
CALDERDALE,53.769761331289686,-1.9616103771384508
CAMBRIDGESHIRE,52.1333820427886,-0.23503728806014595
CAMDEN,51.69346289078886,-0.1629412552292679
CARDIFF,51.56635588939404,-3.222317281083218
CARMARTHENSHIRE,51.92106862577838,-4.211293704149962
CENTRAL BEDFORDSHIRE,51.99983427713095,-0.4775810785914261
CEREDIGION,52.297905934896974,-3.9524382809074967
CHESHIRE EAST,53.209779668583735,-2.2923524120906538
CHESHIRE WEST AND CHESTER,53.12468649229667,-2.703640874356098
CITY OF LONDON,51.515869084539396,-0.09345024349003202
CONWY,53.125451225027945,-3.7469275629154897
CORNWALL,50.2491094902892,-4.642072961722217
COUNTY DURHAM,54.46928915708376,-1.840983172985692
COVENTRY,52.20619163815314,-1.5190329484575433
CROYDON,51.33122440611814,-0.07773715861848832
CUMBRIA,54.470582575648244,-2.902600383252353
DARLINGTON,54.51355967194039,-1.5680201999230523
DENBIGHSHIRE,53.07313542431554,-3.347662396412462
DERBY,52.98317870391253,-1.471762916352353
DERBYSHIRE,52.96237103431297,-1.6019383162802616
DEVON,50.75993290464059,-3.6572707805745353
DONCASTER,53.579077870304175,-1.1091519021581622
DORSET,50.80117614559981,-2.4141088997141975
DUDLEY,52.466075739334926,-2.101688961593882
EALING,51.69946371446451,-0.31413253292570953
EAST RIDING OF YORKSHIRE,53.9506321883079,-0.6619808168243948
EAST SUSSEX,50.8319515317622,0.33441692286193403
ENFIELD,51.79829813489722,-0.08133941451400101
ESSEX,51.61177562858481,0.5408806396014519
FLINTSHIRE,53.18448452051185,-3.176529270275655
GATESHEAD,54.984104331680726,-1.6867966327256207
GLOUCESTERSHIRE,51.95116469210396,-2.152140175011601
GREENWICH,51.298529627584855,0.05009798110429057
GWYNEDD,52.90798692199907,-3.815807248465912
HACKNEY,51.715573990309835,-0.06047668080560671
HALTON,53.37945371869939,-2.6885285111965866
HAMMERSMITH AND FULHAM,51.45669431471315,-0.21734862391196488
HAMPSHIRE,51.35882747857323,-1.2472236572124424
HARINGEY,51.71488485869694,-0.10670896820865851
HARROW,51.69502976226169,-0.3360141730528605
HARTLEPOOL,54.67019690697325,-1.2702881849113061
HAVERING,51.68803382335829,0.23538931286606415
"HEREFORDSHIRE, COUNTY OF",52.05661428266539,-2.7394973894756567
HERTFORDSHIRE,51.97545351306396,-0.2768104374496038
HILLINGDON,51.67744993832507,-0.44168376669816023
HOUNSLOW,51.31550103034914,-0.37851470463324743
ISLE OF ANGLESEY,53.27637540915653,-4.323495411729392
ISLE OF WIGHT,50.62684579406237,-1.3335589426514434
ISLES OF SCILLY,49.923857744201605,-6.302263516809768
ISLINGTON,51.66454658738323,-0.10992970115558956
KENSINGTON AND CHELSEA,51.49977592399342,-0.18981078381787103
KENT,51.066980402556894,0.72177006521006
"KINGSTON UPON HULL, CITY OF",53.894135701816644,-0.30380941990063115
KINGSTON UPON THAMES,51.42789080754545,-0.28368404321251495
KIRKLEES,53.84779145117579,-1.7808194218728275
KNOWSLEY,53.48284092504563,-2.8329791954991275
LAMBETH,51.252923290285565,-0.11380231585035454
LANCASHIRE,53.39410422518683,-2.460896340904076
LEEDS,53.55494339794778,-1.5074406609781625
LEICESTER,52.7035904712036,-1.1304165681356237
LEICESTERSHIRE,52.372384242153444,-1.3774821236258858
LEWISHAM,51.26146486742923,-0.017302263531446847
LINCOLNSHIRE,53.019325697607805,-0.23840017404638325
LIVERPOOL,53.51161042331058,-2.9133522899513755
LUTON,51.96794156247519,-0.4231450525783596
MANCHESTER,53.618174414336764,-2.2337215842169944
MEDWAY,51.32754494250598,0.5632336335498731
MERTHYR TYDFIL,51.749169200604825,-3.36403864047987
MERTON,51.37364806533906,-0.18868296177359278
MIDDLESBROUGH,54.5098082464691,-1.211038279554591
MILTON KEYNES,52.01693552290149,-0.7406232665194876
MONMOUTHSHIRE,51.78143655329183,-2.9039386644643197
NEATH PORT TALBOT,51.59538437854254,-3.7458617902677283
NEWCASTLE UPON TYNE,55.00208530426788,-1.652806624671881
NEWHAM,51.75154898367921,0.027418339450078835
NEWPORT,51.53253056059282,-2.8977514562758477
NORFOLK,52.3032223796034,0.9647662889518414
NORTH EAST LINCOLNSHIRE,53.50967645052903,-0.13922750148994814
NORTH LINCOLNSHIRE,53.57540769163687,-0.5237063875323392
NORTH SOMERSET,51.35265217208383,-2.754333708085771
NORTH TYNESIDE,55.00390319683472,-1.5092377782362794
NORTH YORKSHIRE,54.037083506236726,-1.5496083229591298
NORTHAMPTONSHIRE,52.090056204873584,-0.8673643733062965
NORTHUMBERLAND,55.268382697315424,-2.075107564148198
NOTTINGHAM,52.95517248670217,-1.166635297324727
NOTTINGHAMSHIRE,53.03298887412134,-1.006945929298795
OLDHAM,53.659965283524954,-2.052688245629671
OXFORDSHIRE,51.93769526591072,-1.2911207463303098
PEMBROKESHIRE,51.87232817560273,-4.908191395785854
PETERBOROUGH,52.62511626981561,-0.2689975241368676
PLYMOUTH,50.29446598251615,-4.112955625237552
PORTSMOUTH,50.91433206435089,-1.0702659081823802
POWYS,52.35028728472521,-3.4364646802117074
READING,51.48972751726377,-0.9907195716377762
REDBRIDGE,51.74619394585629,0.0701000048233879
REDCAR AND CLEVELAND,54.52674848959172,-1.0057471172413288
RICHMOND UPON THAMES,51.40228740909276,-0.28924251316631455
ROCHDALE,53.67734692115036,-2.14815188340053
ROTHERHAM,53.27571588878268,-1.2866084213986422
RUTLAND,52.66741819281054,-0.6255844565552813
SALFORD,53.39900474827836,-2.3848977331687684
SANDWELL,52.58696674791831,-2.007627650605722
SEFTON,53.41754419091054,-2.9918998460398845
SHEFFIELD,53.594572416421464,-1.5427564265432459
SHROPSHIRE,52.68421414164122,-2.7366875706426375
SLOUGH,51.500375556628576,-0.5761037634462686
SOLIHULL,52.36591301434561,-1.7157174664625492
SOMERSET,51.15203995716832,-3.2953379430424437
SOUTH GLOUCESTERSHIRE,51.619868102630875,-2.469430184260059
SOUTH TYNESIDE,54.994706019365786,-1.4469508035803413
SOUTHAMPTON,50.984805930473584,-1.4002768042215858
SOUTHEND-ON-SEA,51.562157807336284,0.7069905953535786
SOUTHWARK,51.26247572937943,-0.07306483663823536
ST. HELENS,53.442240723358644,-2.7032424159534347
STAFFORDSHIRE,52.54946704767607,-2.027491119365553
STOCKPORT,53.243567817667724,-2.1248973952531918
STOCKTON-ON-TEES,54.60356568786033,-1.3063893005278557
STOKE-ON-TRENT,53.0018684063432,-2.1588155163720084
SUFFOLK,52.07327606663186,1.049040133490474
SUNDERLAND,54.95658521287448,-1.433572135990224
SURREY,51.75817482314145,-0.3386369800762059
SUTTON,51.33189096687447,-0.17228958486126392
SWANSEA,51.734320352502984,-3.967180818043868
SWINDON,51.64295753076632,-1.7336382187066433
TAMESIDE,53.4185402114593,-2.0769462404028474
TELFORD AND WREKIN,52.709149095326744,-2.4894724871905916
THURROCK,51.508227793073466,0.33492786371540356
TORBAY,50.494049197230815,-3.5551646045072913
TORFAEN,51.69896506141925,-3.0509328418360218
TOWER HAMLETS,51.68485859523772,-0.03638140322291906
TRAFFORD,53.314621144815334,-2.3656560688750687
VALE OF GLAMORGAN,51.477096810804674,-3.3980039155600954
WAKEFIELD,53.81677380462442,-1.4208545508030999
WALSALL,52.742742908764974,-1.9703315889024553
WALTHAM FOREST,51.723501987712325,-0.01886180175957716
WANDSWORTH,51.24653418036352,-0.2001743797936436
WARRINGTON,53.338554119123636,-2.561564052456012
WARWICKSHIRE,52.04847200574421,-1.5686356193411675
WEST BERKSHIRE,51.472960442069805,-1.2740171035533379
WEST SUSSEX,51.11473921001523,-0.4593527537340543
WESTMINSTER,51.613346179755915,-0.15298252171750404
WIGAN,53.58763891955546,-2.5723844100365545
WILTSHIRE,51.48575283497703,-1.926537553406791
WINDSOR AND MAIDENHEAD,51.494612540256846,-0.6753936432282348
WIRRAL,53.237217504292545,-3.0650813262796417
WOKINGHAM,51.45966460093226,-0.8993706058495408
WOLVERHAMPTON,52.71684834050869,-2.127594624973283
WORCESTERSHIRE,52.05799103802506,-2.209184250840713
WREXHAM,53.00080440180421,-2.991958507191866
YORK,53.99232942499273,-1.073788787620359