feat(ds-2.1, circuit-3)
circuit/25-1/3/DVJK_table.png      (new binary file, 16 KiB)
circuit/25-1/3/TDCE_schema.png     (new binary file, 30 KiB)
circuit/25-1/3/TDCE_timing.png     (new binary file, 57 KiB)
circuit/25-1/3/TJK.png             (new binary file, 35 KiB)
circuit/25-1/3/TJK_schema.png      (new binary file, 38 KiB)
circuit/25-1/3/TJK_timing.png      (new binary file, 59 KiB)
circuit/25-1/3/TT.png              (new binary file, 31 KiB)
circuit/25-1/3/TT_schema.png       (new binary file, 30 KiB)
circuit/25-1/3/TT_table.png        (new binary file, 3.0 KiB)
circuit/25-1/3/TT_timing.png       (new binary file, 182 KiB)
circuit/25-1/3/TT_transition.png   (new binary file, 16 KiB)
circuit/25-1/3/lab3.pdf            (new binary file)
circuit/25-1/3/schema.png          (new binary file, 186 KiB)
ds/25-1/2/1-00_introduction.ipynb  (new file, +278)
@@ -0,0 +1,278 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "19051402",
"metadata": {
"tags": []
},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "67ed6062",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science #"
]
},
{
"cell_type": "markdown",
"id": "a65f57f0",
"metadata": {},
"source": [
"## 00 - Introduction ##\n",
"Welcome to NVIDIA's Deep Learning Institute workshop on the Fundamentals of Accelerated Data Science. This interactive lab offers practical experience with every stage of the development process, empowering participants to tailor solutions for their unique applications."
]
},
{
"cell_type": "markdown",
"id": "50d32b6c",
"metadata": {},
"source": [
"**Learning Objectives**\n",
"<br>\n",
"In this workshop, you will learn: \n",
"* Overview of data science\n",
"* Demonstrations of data science workflows\n",
"* How acceleration is achieved\n",
"* How to design operations to maximize GPU acceleration\n",
"* Implications of acceleration"
]
},
{
"cell_type": "markdown",
"id": "3a02c2b6",
"metadata": {},
"source": [
"### JupyterLab ###\n",
"For this hands-on lab, we use [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) to manage our environment. The [JupyterLab Interface](https://jupyterlab.readthedocs.io/en/stable/user/interface.html) is a dashboard that provides access to interactive iPython notebooks, as well as the folder structure of our environment and a terminal window into the Ubuntu operating system. The first view includes a **menu bar** at the top, a **file browser** in the **left sidebar**, and a **main work area** that is initially open to this \"introduction\" notebook. \n",
"\n",
"<p><img src=\"images/jl_launcher.png\" width=720></p>\n",
"\n",
"* The file browser can be navigated just like any other file explorer. Double-clicking any item opens a new tab with its content. \n",
"* The main work area includes tabbed views of open files that can be closed, moved, and edited as needed. \n",
"* The notebooks, including this one, consist of a series of content and code **cells**. To execute code in a code cell, press `Shift+Enter` or click the `Run` button in the menu bar while the cell is highlighted. Sometimes, a content cell will get switched to editing mode. Executing the cell with `Shift+Enter` or the `Run` button will switch it back to a readable form.\n",
"* To interrupt cell execution, click the `Stop` button in the menu bar or navigate to the `Kernel` menu, and select `Interrupt Kernel`. \n",
"* We can run terminal commands in notebook cells by prefixing the command with an exclamation point, or \"bang\" (`!`).\n",
"* We can create additional interactive cells by clicking the `+` button above, or by switching to command mode with `Esc` and using the keyboard shortcuts `a` (for new cell above) and `b` (for new cell below)."
]
},
{
"cell_type": "markdown",
"id": "4492c58d",
"metadata": {},
"source": [
"<a name='e1'></a>\n",
"### Exercise #1 - Practice ###\n",
"**Instructions**: <br>\n",
"* Try executing the simple print statement in the below cell.\n",
"* Then try executing the terminal command in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e69a6515",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# activate this cell by selecting it with the mouse or arrow keys then use the keyboard shortcut [Shift+Enter] to execute\n",
"print('This is just a simple print statement.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e54fe372",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!echo 'This is another simple print statement.'"
]
},
{
"cell_type": "markdown",
"id": "c2e5151b-4842-465e-a20d-bb64af66d011",
"metadata": {},
"source": [
"<a name='e2'></a>\n",
"### Exercise #2 - Available GPU Accelerators ###\n",
"The `nvidia-smi` (NVIDIA System Management Interface) command is a powerful utility for managing and monitoring NVIDIA GPU devices. It will print information about available GPUs, their current memory usage, and any processes currently utilizing them. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to learn about this environment's available GPUs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08d543eb-a951-4eb9-8107-b13c01b3ac46",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "adee74e3-613a-4986-be34-ff3ae113ccc7",
"metadata": {},
"source": [
"**Note**: Currently, GPU memory usage is minimal, with no active processes utilizing the GPUs. Throughout our session, we'll employ this command to monitor memory consumption. When conducting GPU-based data analysis, it's advisable to maintain approximately 50% of GPU memory free, allowing for operations that may expand data stored on the device."
]
},
{
"cell_type": "markdown",
"id": "f0839f2e-dfe3-4d8f-8010-ed8445c171fb",
"metadata": {},
"source": [
"<a name='e3'></a>\n",
"### Exercise #3 - Magic Commands ###\n",
"The Jupyter environment comes with *magic* commands installed, recognizable by the `%` or `%%` prefix. We will be using two magic commands liberally in this workshop: \n",
"* `%time`: reports how long a single line of code took to run\n",
"* `%%time`: reports how long an entire cell took to run\n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to import the `time` library. \n",
"* Execute the cell below to time the single line of code. \n",
"* Execute the cell below to time the entire cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c34489-7812-4ffe-bd2e-748a52903481",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"from time import sleep"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db1d5de9-f6e6-4984-8c32-f13b51aa27db",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# %time only times one line\n",
"%time sleep(2) \n",
"sleep(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daf2f6f0-58a9-43a5-af8f-0b69b4a2a3a8",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%time\n",
"# DO NOT CHANGE THIS CELL\n",
"# %%time will time the entire cell\n",
"sleep(1)\n",
"sleep(1)"
]
},
{
"cell_type": "markdown",
"id": "42ed873e-f7b5-4668-8e96-ce31d53d43b1",
"metadata": {},
"source": [
"<a name='e4'></a>\n",
"### Exercise #4 - Jupyter Kernels and GPU Memory ###\n",
"The compute backend for Jupyter is called the *kernel*. The Jupyter environment starts up a separate kernel for each new notebook. The many notebooks in this workshop are each intended to stand alone with regard to memory and computation. \n",
"\n",
"To ensure we have enough memory and compute for each notebook, we can clear the memory at the conclusion of each notebook in two ways: \n",
"1. Shut down the kernel programmatically with the kernel's `do_shutdown()` method, or\n",
"2. Shut down the kernel through the *Running Terminals and Kernels* panel. \n",
"\n",
"**Instructions**: <br>\n",
"* Execute the below cell to shut down and restart the current kernel. \n",
"* Shut down the current kernel through the *Running Terminals and Kernels* panel.\n",
"\n",
"<p><img src=\"images/kernel_restart.png\" width=720></p>\n",
"\n",
"**Note**: Restarting the kernel from the *Kernel* menu will only clear the memory for *the current notebook's kernel*, while notebooks other than the one we're working on may still have memory allocated for *their unique kernels*. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98e05b77-6019-428b-8e18-a2477692ef6f",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "0321075e-433e-42d4-b849-de3fa17b54e1",
"metadata": {},
"source": [
"**Note**: Executing the provided code cell will shut down the kernel and activate a popup indicating that the kernel has restarted."
]
},
{
"cell_type": "markdown",
"id": "8e950df2",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-01_section_overview.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b604003a",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
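The introduction notebook's timing exercise relies on the IPython magics `%time` and `%%time`. As an illustrative sketch only (not part of the course notebooks), the same wall-clock measurement can be reproduced in plain Python with `time.perf_counter`, which is also what you would reach for outside an IPython session:

```python
# Plain-Python sketch of what %time (one statement) and %%time (a whole
# block) measure; hypothetical helper code, not from the course material.
from time import perf_counter, sleep

# time a single statement, as %time does
start = perf_counter()
sleep(0.2)
single_line = perf_counter() - start

# time a whole block, as %%time does
start = perf_counter()
sleep(0.1)
sleep(0.1)
whole_block = perf_counter() - start

print(f"single line: {single_line:.2f} s, whole block: {whole_block:.2f} s")
```

Unlike `%%time`, this approach requires explicit bookkeeping around the block being measured, which is why the magics are preferred inside notebooks.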
ds/25-1/2/1-01_section_overview.ipynb  (new file, +78)
@@ -0,0 +1,78 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "b53a7b12-538d-4459-b82a-a35c8c417849",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "ae497b71-bc43-471e-8970-88a1878e7cf9",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "3a61cc06-80da-4f73-ba61-8ff1b5af71d8",
"metadata": {},
"source": [
"## 01 - Section Overview ##\n",
"\n",
"**Table of Contents**\n",
"This section focuses on data processing. We'll work with multiple datasets, conduct high-level analyses, and prepare the data for subsequent machine learning tasks. \n",
"<br>\n",
"* **1-01_section_overview.ipynb**\n",
"* **1-02_data_manipulation.ipynb**\n",
"* **1-03_memory_management.ipynb**\n",
"* **1-04_interoperability.ipynb**\n",
"* **1-05_grouping.ipynb**\n",
"* **1-06_data_visualization.ipynb**\n",
"* **1-07_etl.ipynb**\n",
"* **1-08_dask-cudf.ipynb**\n",
"* **1-09_cudf-polars.ipynb**"
]
},
{
"cell_type": "markdown",
"id": "9b1485a5-00e8-4495-85b0-b48671674818",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-02_data_manipulation.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "81e47f0a-547e-4714-878d-34eb9b75c835",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

ds/25-1/2/1-02_data_manipulation.ipynb  (new file, +2005)
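The memory-management notebook that follows measures per-column memory with `memory_usage(deep=True)` and then shrinks 64-bit columns with `.astype()`. A minimal self-contained sketch of that workflow, run on a small synthetic DataFrame rather than the course's `uk_pop.csv` (the column names and values here are illustrative):

```python
# Sketch of the measure-then-downcast workflow taught in
# 1-03_memory_management.ipynb, on synthetic data instead of uk_pop.csv.
import pandas as pd

df = pd.DataFrame({
    "age": pd.Series([0, 17, 42, 99] * 1000, dtype="int64"),
    "lat": pd.Series([54.53, 54.42, 54.55, 54.54] * 1000, dtype="float64"),
})

# total bytes before downcasting (deep=True also counts string payloads)
before = df.memory_usage(deep=True).sum()

# smaller fixed-width containers: ages fit in int8, float32 suffices here
df["age"] = df["age"].astype("int8")
df["lat"] = df["lat"].astype("float32")

after = df.memory_usage(deep=True).sum()
print(f"{before} -> {after} bytes")
```

The same calls work unchanged on a cuDF DataFrame, which is what makes this a useful habit before moving data to the GPU.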
ds/25-1/2/1-03_memory_management.ipynb  (new file, +958)
@@ -0,0 +1,958 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "def31b0f-921a-43eb-9807-8b9b31eb7b32",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"id": "4a0fd4dd-f7be-4c90-8ddd-384a760ac04f",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"id": "6a8fdf2e-a481-455e-8a52-8be8472b63bf",
"metadata": {},
"source": [
"## 03 - Memory Management ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook explores the dynamics between data and memory. It covers the following sections: \n",
"1. [Memory Management](#Memory-Management)\n",
"    * [Memory Usage](#Memory-Usage)\n",
"2. [Data Types](#Data-Types)\n",
"    * [Convert Data Types](#Convert-Data-Types)\n",
"    * [Exercise #1 - Modify `dtypes`](#Exercise-#1---Modify-dtypes)\n",
"    * [Categorical](#Categorical)\n",
"3. [Efficient Data Loading](#Efficient-Data-Loading)"
]
},
{
"cell_type": "markdown",
"id": "1b59367c-48bc-4c72-b1f4-4cfdfa5470cf",
"metadata": {},
"source": [
"## Memory Management ##\n",
"During the data acquisition process, data is transferred to memory in order to be operated on by the processor. Memory management is crucial for cuDF and GPU operations for several key reasons: \n",
"* **Limited GPU memory**: GPUs typically have less memory than CPUs, therefore efficient memory management is essential to maximize the use of available GPU memory, especially for large datasets.\n",
"* **Data transfer overhead**: Transferring data between CPU and GPU memory is relatively slow compared to GPU computation speed. Minimizing these transfers through smart memory management is critical for performance.\n",
"* **Performance tuning**: Understanding and optimizing memory usage is key to achieving peak performance in GPU-accelerated data processing tasks.\n",
"\n",
"When done correctly, keeping the data on the GPU can enable cuDF and the RAPIDS ecosystem to achieve significant performance improvements, handle larger datasets, and provide more efficient data processing capabilities. \n",
"\n",
"Below we import the data from the CSV file. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b7b8a623-f799-4dad-aca9-0e571bb6e527",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import pandas as pd\n",
"import random\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "711d0a7f-8598-49fc-949c-5caf6029ce47",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>age</th>\n",
"      <th>sex</th>\n",
"      <th>county</th>\n",
"      <th>lat</th>\n",
"      <th>long</th>\n",
"      <th>name</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.533644</td>\n",
"      <td>-1.524401</td>\n",
"      <td>FRANCIS</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.426256</td>\n",
"      <td>-1.465314</td>\n",
"      <td>EDWARD</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.555200</td>\n",
"      <td>-1.496417</td>\n",
"      <td>TEDDY</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.547906</td>\n",
"      <td>-1.572341</td>\n",
"      <td>ANGUS</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0</td>\n",
"      <td>m</td>\n",
"      <td>DARLINGTON</td>\n",
"      <td>54.477639</td>\n",
"      <td>-1.605995</td>\n",
"      <td>CHARLIE</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   age sex      county        lat      long     name\n",
"0    0   m  DARLINGTON  54.533644 -1.524401  FRANCIS\n",
"1    0   m  DARLINGTON  54.426256 -1.465314   EDWARD\n",
"2    0   m  DARLINGTON  54.555200 -1.496417    TEDDY\n",
"3    0   m  DARLINGTON  54.547906 -1.572341    ANGUS\n",
"4    0   m  DARLINGTON  54.477639 -1.605995  CHARLIE"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"\n",
"# preview\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "36416fd0-7081-42aa-bf31-d1231b81ec0b",
"metadata": {},
"source": [
"### Memory Usage ###\n",
"Memory utilization of a DataFrame depends on the data types of its columns.\n",
"\n",
"<p><img src='images/dtypes.png' width=720></p>\n",
"\n",
"We can use `DataFrame.memory_usage()` to see the memory usage for each column (in bytes). Most of the common data types have a fixed size in memory, such as `int`, `float`, `datetime`, and `bool`. Memory usage for these data types is the respective memory requirement multiplied by the number of data points. For the `string` data type, the memory usage reported _for pandas_ is the number of elements times 8 bytes. This accounts for the 64 bits required for the pointer that points to an address in memory, but not the memory used for the actual string values. The actual memory required for a string value is 49 bytes plus an additional byte for each character. The `deep` parameter provides a more accurate memory usage report that accounts for the system-level memory consumption of the contained `string` data type. \n",
"\n",
"Below we get the memory usage. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8378207b-2d9e-4102-8408-c2dddafc8a40",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index          128\n",
"age      467839152\n",
"sex     3391833852\n",
"county  3934985133\n",
"lat      467839152\n",
"long     467839152\n",
"name    3666922374\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# pandas memory utilization\n",
"mem_usage_df=df.memory_usage(deep=True)\n",
"mem_usage_df"
]
},
{
"cell_type": "markdown",
"id": "07c24bb1-c4f7-440c-a949-d4c57800ec61",
"metadata": {},
"source": [
"Below we define a `make_decimal()` function to convert memory size into units based on powers of 2. In contrast to units based on powers of 10, this customary convention is commonly used to report memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units). "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5ae42218-1547-49fd-9123-ab508a2b03de",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
"def make_decimal(nbytes):\n",
"    i=0\n",
"    while nbytes >= 1024 and i < len(suffixes)-1:\n",
"        nbytes/=1024.\n",
"        i+=1\n",
"    f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
"    return '%s %s' % (f, suffixes[i])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e6d4a613-3eea-4dce-8e71-39593ff6f226",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'11.55 GB'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"make_decimal(mem_usage_df.sum())"
]
},
{
"cell_type": "markdown",
"id": "a352c0b2-65aa-4231-b753-556aca46ff49",
"metadata": {},
"source": [
"Below we calculate the memory usage manually based on the data types. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "630327b9-6dc1-4b70-9fdf-9f7763ec4d50",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Numerical columns use 467839152 bytes of memory\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# get number of rows\n",
"num_rows=len(df)\n",
"\n",
"# a 64-bit number uses 8 bytes of memory\n",
"print(f'Numerical columns use {num_rows*8} bytes of memory')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bb22b5f4-e38f-438e-9426-61746b509e50",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"county column uses 3934985133 bytes of memory.\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# check random string-typed column\n",
"string_cols=[col for col in df.columns if df[col].dtype=='object']\n",
"column_to_check=random.choice(string_cols)\n",
"\n",
"overhead=49\n",
"pointer_size=8\n",
"\n",
"# a missing value (NaN) is stored as a float and uses 32 bytes of memory\n",
"string_col_mem_usage_df=df[column_to_check].map(lambda x: len(x)+overhead+pointer_size if isinstance(x, str) else 32)\n",
"string_col_mem_usage=string_col_mem_usage_df.sum()\n",
"print(f'{column_to_check} column uses {string_col_mem_usage} bytes of memory.')"
]
},
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "94e393c2-c0d0-40ee-82d2-730c4667e9b8",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"**Note**: The `string` data type is stored differently in cuDF than it is in pandas. More information about `libcudf` stores string data using the [Arrow format](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) can be found [here](https://developer.nvidia.com/blog/mastering-string-transformations-in-rapids-libcudf/). "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "737ff50b-9426-4e08-a00a-d7ee69f48b9f",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Data Types ##\n",
|
||||||
|
"By default, pandas (and cuDF) uses 64-bit for numerical values. Using 64-bit numbers provides the highest precision but many applications do not require 64-bit precision when aggregating over a very large number of data points. When possible, using 32-bit numbers reduces storage and memory requirements in half, and also typically greatly speeds up computations because only half as much data needs to be accessed in memory. "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "0b77d450-c415-44b8-87ac-20ce616ec809",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Convert Data Types ###\n",
|
||||||
|
"The `.astype()` method can be used to convert numerical data types to use different bit-size containers. Here we convert the `age` column from `int64` to `int8`. "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 9,
|
||||||
|
"id": "603f7c70-134e-4466-a790-8a18b9088ca6",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"age int8\n",
|
||||||
|
"sex object\n",
|
||||||
|
"county object\n",
|
||||||
|
"lat float64\n",
|
||||||
|
"long float64\n",
|
||||||
|
"name object\n",
|
||||||
|
"dtype: object"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 9,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# DO NOT CHANGE THIS CELL\n",
|
||||||
|
"df['age']=df['age'].astype('int8')\n",
|
||||||
|
"\n",
|
||||||
|
"df.dtypes"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
"cell_type": "markdown",
"id": "973a6dd4-2aef-44d9-8b01-8853032eddae",
"metadata": {},
"source": [
"### Exercise #1 - Modify `dtypes` ###\n",
"**Instructions**: <br>\n",
"* Modify the `<FIXME>` only and execute the below cell to convert any 64-bit data types to their 32-bit counterparts."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "beb7d71b-6672-462e-b65c-a64dbe5f7a57",
"metadata": {},
"outputs": [],
"source": [
"df['lat']=df['lat'].astype('float32')\n",
"df['long']=df['long'].astype('float32')"
]
},
{
"cell_type": "raw",
"id": "3b44fb22-a0f1-4e43-a332-1ccbad50caee",
"metadata": {},
"source": [
"\n",
"df['lat']=df['lat'].astype('float32')\n",
"df['long']=df['long'].astype('float32')"
]
},
{
"cell_type": "markdown",
"id": "98b6542d-22cc-4926-b600-a3e052c37c96",
"metadata": {},
"source": [
"Click ... for solution. "
]
},
{
"cell_type": "markdown",
"id": "7b2cd622-977c-4915-a87f-2fe03c1793f5",
"metadata": {},
"source": [
"### Categorical ###\n",
"Categorical data represents discrete, distinct categories or groups. Categories can have a meaningful order or ranking but generally cannot be used in numerical operations. When appropriate, using the `category` data type can reduce memory usage and lead to faster operations. It can also be used to define and maintain a custom order of categories. \n",
"\n",
"Below we get the number of unique values in the string columns. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f249e4b8-5d7a-4b44-ac15-bd3360a43f2a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sex 2\n",
"county 171\n",
"name 13212\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df.select_dtypes(include='object').nunique()"
]
},
{
"cell_type": "markdown",
"id": "f1d8bd88-b39b-4043-9039-d8bd75fe851a",
"metadata": {},
"source": [
"Below we convert columns with few discrete values to `category`. The `category` data type has `.categories` and `.codes` properties that are accessed through the `.cat` accessor. "
]
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a99bebbf-2e5b-4720-96f9-9fd7d42d2fe8",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"df['sex']=df['sex'].astype('category')\n",
"df['county']=df['county'].astype('category')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "41b7b290-cfcf-4ff6-b6b4-454c19b44a62",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['BARKING AND DAGENHAM', 'BARNET', 'BARNSLEY',\n",
" 'BATH AND NORTH EAST SOMERSET', 'BEDFORD', 'BEXLEY', 'BIRMINGHAM',\n",
" 'BLACKBURN WITH DARWEN', 'BLACKPOOL', 'BLAENAU GWENT',\n",
" ...\n",
" 'WESTMINSTER', 'WIGAN', 'WILTSHIRE', 'WINDSOR AND MAIDENHEAD', 'WIRRAL',\n",
" 'WOKINGHAM', 'WOLVERHAMPTON', 'WORCESTERSHIRE', 'WREXHAM', 'YORK'],\n",
" dtype='object', length=171)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"----------------------------------------\n"
]
},
{
"data": {
"text/plain": [
"0 37\n",
"1 37\n",
"2 37\n",
"3 37\n",
"4 37\n",
" ..\n",
"58479889 96\n",
"58479890 96\n",
"58479891 96\n",
"58479892 96\n",
"58479893 96\n",
"Length: 58479894, dtype: int16"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"display(df['county'].cat.categories)\n",
"print('-'*40)\n",
"display(df['county'].cat.codes)"
]
},
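{
"cell_type": "markdown",
"id": "added-category-memory-note",
"metadata": {},
"source": [
"As an optional illustration (an added sketch; it assumes `df` from the cells above is still in memory): comparing the memory footprint of the `object` and `category` representations of the same column shows why `category` helps when there are few distinct values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-category-memory-check",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: bytes used by 'county' as object strings vs. as a category\n",
"print(df['county'].astype('object').memory_usage(deep=True))\n",
"print(df['county'].memory_usage(deep=True))"
]
},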
{
"cell_type": "markdown",
"id": "737385ab-677c-4bef-a86a-10aa3119e29a",
"metadata": {},
"source": [
"**Note**: `.astype()` can also be used to convert data to `datetime` or `object` to enable datetime and string methods. "
]
},
{
"cell_type": "markdown",
"id": "552c47c2-0fbc-455e-8745-cb98fc777243",
"metadata": {},
"source": [
"## Efficient Data Loading ##\n",
"It is often advantageous to specify the most appropriate data type for each column, based on its value range, precision requirements, and how it will be used. "
]
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c2b9f0c3-8598-4a28-9481-ce28fea7544b",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index 128\n",
"age 467839152\n",
"sex 3391833852\n",
"county 3934985133\n",
"lat 467839152\n",
"long 467839152\n",
"name 3666922374\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 11.55 GB took 33.63 seconds.\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"df=pd.read_csv('./data/uk_pop.csv')\n",
"duration=time.time()-start\n",
"\n",
"mem_usage_df=df.memory_usage(deep=True)\n",
"display(mem_usage_df)\n",
"\n",
"print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
]
},
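{
"cell_type": "markdown",
"id": "added-memory-estimate-note",
"metadata": {},
"source": [
"A back-of-the-envelope check (an added sketch): each `float64` or `int64` column stores 8 bytes per row, so 58,479,894 rows should take 58,479,894 × 8 = 467,839,152 bytes per column, matching the `memory_usage` output above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-memory-estimate-check",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: expected bytes for one 64-bit column of 58,479,894 rows\n",
"print(58479894 * 8)  # 467839152"
]
},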
{
"cell_type": "markdown",
"id": "5729520e-3ed8-4ec6-ae1f-ba46d642f48d",
"metadata": {},
"source": [
"Below we enable `cudf.pandas` to see the difference. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "99aa0f32-4d2a-43a7-bec1-f1b88bcc37c2",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"%load_ext cudf.pandas\n",
"\n",
"import pandas as pd\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2b724201-9ad1-4e9b-b712-f3b31bdc4104",
"metadata": {},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"suffixes = ['B', 'kB', 'MB', 'GB', 'TB', 'PB']\n",
"def make_decimal(nbytes):\n",
" i=0\n",
" while nbytes >= 1024 and i < len(suffixes)-1:\n",
" nbytes/=1024.\n",
" i+=1\n",
" f=('%.2f' % nbytes).rstrip('0').rstrip('.')\n",
" return '%s %s' % (f, suffixes[i])"
]
},
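{
"cell_type": "markdown",
"id": "added-make-decimal-demo-note",
"metadata": {},
"source": [
"A quick check of the helper (an added sketch): `make_decimal()` repeatedly divides by 1024 and appends the matching unit suffix."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-make-decimal-demo",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: 467839152 bytes is about 446.17 MB\n",
"print(make_decimal(467839152))"
]
},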
{
"cell_type": "code",
"execution_count": 17,
"id": "99bdd7b0-8563-41db-bd8e-3a7279394ede",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"age 58479894\n",
"sex 58479908\n",
"county 58482446\n",
"lat 467839152\n",
"long 467839152\n",
"name 117096917\n",
"Index 0\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading 1.14 GB took 2.13 seconds.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Total time elapsed: 2.705 seconds </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"<span style=\"font-style: italic\"> Stats </span>\n",
"<span style=\"font-style: italic\"> </span>\n",
"┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
"┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
"│ 2 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> start</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 5 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> dtype_dict</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">{</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 6 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'int8'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 7 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 8 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 9 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 10 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'float64'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 11 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'name'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">: </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'category'</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 14 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'./data/uk_pop.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, dtype</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">dtype_dict)</span><span style=\"background-color: #272822\"> </span> │ 1.728013188 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 15 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> duration</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">time()</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">start</span><span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 17 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">efficient_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">memory_usage(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'deep'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\"> </span> │ 0.005340174 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 18 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> display(mem_usage_df)</span><span style=\"background-color: #272822\"> </span> │ 0.011073721 │ 0.006896915 │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"│ 20 │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> print(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">f'Loading {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">make_decimal(mem_usage_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">sum())</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">} took {</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">round(dura…</span> │ 0.004693074 │ │\n",
"│ │ <span style=\"background-color: #272822\"> </span> │ │ │\n",
"└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
"</pre>\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%cudf.pandas.line_profile\n",
"# DO NOT CHANGE THIS CELL\n",
"start=time.time()\n",
"\n",
"# define data types for each column\n",
"dtype_dict={\n",
" 'age': 'int8', \n",
" 'sex': 'category', \n",
" 'county': 'category', \n",
" 'lat': 'float64', \n",
" 'long': 'float64', \n",
" 'name': 'category'\n",
"}\n",
" \n",
"efficient_df=pd.read_csv('./data/uk_pop.csv', dtype=dtype_dict)\n",
"duration=time.time()-start\n",
"\n",
"mem_usage_df=efficient_df.memory_usage('deep')\n",
"display(mem_usage_df)\n",
"\n",
"print(f'Loading {make_decimal(mem_usage_df.sum())} took {round(duration, 2)} seconds.')"
]
},
{
"cell_type": "markdown",
"id": "0f4607d8-6de3-4b27-96d4-a9720d268333",
"metadata": {},
"source": [
"We were able to load the data faster and with a much smaller memory footprint. \n",
"\n",
"**Note**: Notice that the memory utilized on the GPU is larger than the memory used by the DataFrame. This is expected because there are intermediary processes that use some memory during the data loading process, specifically related to parsing the CSV file in this case. \n",
"\n",
"```\n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 Off | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 32C P0 26W / 70W | 1378MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 Off | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 31C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 Off | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n",
"```"
]
},
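{
"cell_type": "markdown",
"id": "added-dtype-verification-note",
"metadata": {},
"source": [
"A quick sanity check (an added sketch; it assumes `efficient_df` from the profiled cell above is still in memory): confirm that the data types requested in `dtype_dict` took effect."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-dtype-verification",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the columns should now be int8, category, and float64 as requested\n",
"efficient_df.dtypes"
]
},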
|
||||||
|
{
"cell_type": "code",
"execution_count": 18,
"id": "92f7ee37-4acb-46aa-bb73-4c0139d3f6b8",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 08:08:25 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 28C P0 24W / 70W | 11314MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 29C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 28C P0 25W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 24W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "c031d2c7-03cb-4ac7-a195-70fc25cb191d",
"metadata": {},
"source": [
"When loading data this way, we may be able to fit more data. The optimal dataset size depends on various factors including the specific operations being performed, the complexity of the workload, and the available GPU memory. To maximize acceleration, datasets should ideally fit within GPU memory, with ample space left for operations that can spike memory requirements. As a general rule of thumb, cuDF recommends datasets that are less than 50% of the GPU memory capacity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec6cefea-dc64-4f13-815e-081cd35651b9",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"# 1 gigabyte = 1073741824 bytes\n",
"mem_capacity=16*1073741824\n",
"\n",
"mem_per_record=mem_usage_df.sum()/len(efficient_df)\n",
"\n",
"print(f'We can load {int(mem_capacity/2/mem_per_record)} rows.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddaaa1ac-66ec-4323-9842-2543c6d85e4e",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# DO NOT CHANGE THIS CELL\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "658e9847-775f-4d12-af4e-8f896df4e6fe",
"metadata": {},
"source": [
"**Well Done!** Let's move to the [next notebook](1-04_interoperability.ipynb). "
]
},
{
"cell_type": "markdown",
"id": "b86451cf-60e6-4733-b431-1bc0bd586bc2",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
}
],
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.10.15"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
1269
ds/25-1/2/1-04_interoperability.ipynb
Normal file
1085
ds/25-1/2/1-05_grouping.ipynb
Normal file
3435
ds/25-1/2/1-06_data_visualization.ipynb
Normal file
1123
ds/25-1/2/1-07_etl.ipynb
Normal file
1220
ds/25-1/2/1-08_cudf-polars.ipynb
Normal file
978
ds/25-1/2/1-09_dask-cudf.ipynb
Normal file
@@ -0,0 +1,978 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/DLI_Header.png\" width=400/>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fundamentals of Accelerated Data Science # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transition Path: cuDF provides a way for users to scale their pandas workflows as data sizes grow, offering a middle ground between single-threaded pandas and distributed computing solutions like Dask or Apache Spark ."
|
||||||
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 09 - Introduction to Dask cuDF ##\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"[Dask](https://dask.org/) cuDF can be used to distribute dataframe operations to multiple GPUs. In this notebook we will introduce some key Dask concepts, learn how to setup a Dask cluster for utilizing multiple GPUs, and see how to perform simple dataframe operations on distributed Dask dataframes. This notebook covers the below sections: \n",
|
||||||
|
"1. [An Introduction to Dask](#An-Introduction-to-Dask)\n",
|
||||||
|
"2. [Setting up a Dask Scheduler](#Setting-up-a-Dask-Scheduler)\n",
|
||||||
|
" * [Obtaining the Local IP Address](#Obtaining-the-Local-IP-Address)\n",
|
||||||
|
" * [Starting a `LocalCUDACluster`](#Starting-a-LocalCUDACluster)\n",
|
||||||
|
" * [Instantiating a Client Connection](#Instantiating-a-Client-Connection)\n",
|
||||||
|
" * [The Dask Dashboard](#The-Dask-Dashboard)\n",
|
||||||
|
"3. [Reading Data with Dask cuDF](#Reading-Data-with-Dask-cuDF)\n",
|
||||||
|
"4. [Computational Graph](#Computational-Graph)\n",
|
||||||
|
" * [Visualizing the Computational Graph](#Visualizing-the-Computational-Graph)\n",
|
||||||
|
" * [Extending the Computational Graph](#Extending-the-Computational-Graph)\n",
|
||||||
|
" * [Computing with the Computational Graph](#Computing-with-the-Computational-Graph)\n",
|
||||||
|
" * [Persisting Data in the Cluster](#Persisting-Data-in-the-Cluster)\n",
|
||||||
|
"6. [Initial Data Exploration with Dask cuDF](#Initial-Data-Exploration-with-Dask-cuDF)\n",
|
||||||
|
" * [Exercise #1 - Counties North of Sunderland with Dask](#Exercise-#1---Counties-North-of-Sunderland-with-Dask)"
|
||||||
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An Introduction to Dask ##\n",
"[Dask](https://dask.org/) is a Python library for parallel computing. In Dask programming, we create computational graphs that define code we **would like** to execute, and then give these computational graphs to a Dask scheduler, which evaluates them lazily and efficiently in parallel.\n",
"\n",
"In addition to using multiple CPU cores or threads to execute computational graphs in parallel, Dask schedulers can also be configured to execute computational graphs on multiple CPUs, or, as we will do in this workshop, multiple GPUs. As a result, Dask programming facilitates operating on data sets that are larger than the memory of a single compute resource.\n",
"\n",
"Because Dask computational graphs can consist of arbitrary Python code, they provide [a level of control and flexibility superior to many other systems](https://docs.dask.org/en/latest/spark.html) that can operate on massive data sets. However, we will focus for this workshop primarily on the Dask DataFrame, one of several data structures whose operations and methods natively utilize Dask's parallel scheduling:\n",
"* Dask DataFrame, which closely resembles the Pandas DataFrame\n",
"* Dask Array, which closely resembles the NumPy ndarray\n",
"* Dask Bag, a set which allows duplicates and can hold heterogeneously-typed data\n",
"\n",
"In particular, we will use a Dask-cuDF dataframe, which combines the interface of Dask with the GPU power of cuDF for distributed dataframe operations on multiple GPUs. We will now turn our attention to utilizing all 4 NVIDIA V100 GPUs in this environment for operations on an 18GB UK population data set that would not fit into the memory of a single 16GB GPU."
|
||||||
|
]
},
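The lazy-graph model described above can be illustrated with a tiny CPU-only sketch using `dask.delayed` (a hypothetical toy example, not part of this notebook's workflow): wrapped calls only record nodes in a graph, and nothing executes until we explicitly request a result.

```python
import dask

# Wrapping functions with dask.delayed defers their execution:
# calling them only records a node in a computational graph.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Build the graph; no work has happened yet.
total = add(inc(1), inc(2))

# Hand the graph to the scheduler for (potentially parallel) execution.
print(total.compute())  # 5
```

The same deferred-then-computed pattern is what Dask-cuDF dataframe operations use under the hood.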
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up a Dask Scheduler ##\n",
"We begin by starting a Dask scheduler which will take care to distribute our work across the 4 available GPUs. In order to do this we need to start a `LocalCUDACluster` instance, using our host machine's IP, and then instantiate a client that can communicate with the cluster."
|
||||||
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Obtaining the Local IP Address ###"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import subprocess # we will use this to obtain our local IP using the following command\n",
"cmd = \"hostname --all-ip-addresses\"\n",
"\n",
"process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)\n",
"output, error = process.communicate()\n",
"IPADDR = str(output.decode()).split()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Starting a `LocalCUDACluster` ###\n",
"`dask_cuda` provides utilities for Dask and CUDA (the \"cu\" in cuDF) interactions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-21 13:31:13,108 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:44687' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 16), ('read_csv-910ec886221afde30c768158c33b486c', 67), ('read_csv-910ec886221afde30c768158c33b486c', 0), ('read_csv-910ec886221afde30c768158c33b486c', 41), ('read_csv-910ec886221afde30c768158c33b486c', 54), ('read_csv-910ec886221afde30c768158c33b486c', 9), ('read_csv-910ec886221afde30c768158c33b486c', 38), ('read_csv-910ec886221afde30c768158c33b486c', 5), ('read_csv-910ec886221afde30c768158c33b486c', 34), ('read_csv-910ec886221afde30c768158c33b486c', 12), ('read_csv-910ec886221afde30c768158c33b486c', 2), ('read_csv-910ec886221afde30c768158c33b486c', 27), ('read_csv-910ec886221afde30c768158c33b486c', 62), ('read_csv-910ec886221afde30c768158c33b486c', 46), ('read_csv-910ec886221afde30c768158c33b486c', 30), ('read_csv-910ec886221afde30c768158c33b486c', 59), ('read_csv-910ec886221afde30c768158c33b486c', 23)} (stimulus_id='handle-worker-cleanup-1761053473.108198')\n",
"2025-10-21 13:31:13,110 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:35977' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 29), ('read_csv-910ec886221afde30c768158c33b486c', 48), ('read_csv-910ec886221afde30c768158c33b486c', 32), ('read_csv-910ec886221afde30c768158c33b486c', 10), ('read_csv-910ec886221afde30c768158c33b486c', 51), ('read_csv-910ec886221afde30c768158c33b486c', 25), ('read_csv-910ec886221afde30c768158c33b486c', 60), ('read_csv-910ec886221afde30c768158c33b486c', 44), ('read_csv-910ec886221afde30c768158c33b486c', 14), ('read_csv-910ec886221afde30c768158c33b486c', 57), ('read_csv-910ec886221afde30c768158c33b486c', 18), ('read_csv-910ec886221afde30c768158c33b486c', 8), ('read_csv-910ec886221afde30c768158c33b486c', 66), ('read_csv-910ec886221afde30c768158c33b486c', 21), ('read_csv-910ec886221afde30c768158c33b486c', 36), ('read_csv-910ec886221afde30c768158c33b486c', 4), ('read_csv-910ec886221afde30c768158c33b486c', 55)} (stimulus_id='handle-worker-cleanup-1761053473.1105292')\n",
"2025-10-21 13:31:13,112 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:39371' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 7), ('read_csv-910ec886221afde30c768158c33b486c', 58), ('read_csv-910ec886221afde30c768158c33b486c', 3), ('read_csv-910ec886221afde30c768158c33b486c', 26), ('read_csv-910ec886221afde30c768158c33b486c', 61), ('read_csv-910ec886221afde30c768158c33b486c', 22), ('read_csv-910ec886221afde30c768158c33b486c', 19), ('read_csv-910ec886221afde30c768158c33b486c', 15), ('read_csv-910ec886221afde30c768158c33b486c', 50), ('read_csv-910ec886221afde30c768158c33b486c', 47), ('read_csv-910ec886221afde30c768158c33b486c', 53), ('read_csv-910ec886221afde30c768158c33b486c', 37), ('read_csv-910ec886221afde30c768158c33b486c', 43), ('read_csv-910ec886221afde30c768158c33b486c', 11), ('read_csv-910ec886221afde30c768158c33b486c', 40), ('read_csv-910ec886221afde30c768158c33b486c', 65), ('read_csv-910ec886221afde30c768158c33b486c', 33)} (stimulus_id='handle-worker-cleanup-1761053473.1126676')\n",
"2025-10-21 13:31:13,114 - distributed.scheduler - WARNING - Removing worker 'tcp://172.18.0.2:36291' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('read_csv-910ec886221afde30c768158c33b486c', 52), ('read_csv-910ec886221afde30c768158c33b486c', 13), ('read_csv-910ec886221afde30c768158c33b486c', 42), ('read_csv-910ec886221afde30c768158c33b486c', 45), ('read_csv-910ec886221afde30c768158c33b486c', 6), ('read_csv-910ec886221afde30c768158c33b486c', 35), ('read_csv-910ec886221afde30c768158c33b486c', 64), ('read_csv-910ec886221afde30c768158c33b486c', 31), ('read_csv-910ec886221afde30c768158c33b486c', 28), ('read_csv-910ec886221afde30c768158c33b486c', 63), ('read_csv-910ec886221afde30c768158c33b486c', 24), ('read_csv-910ec886221afde30c768158c33b486c', 56), ('read_csv-910ec886221afde30c768158c33b486c', 17), ('read_csv-910ec886221afde30c768158c33b486c', 1), ('read_csv-910ec886221afde30c768158c33b486c', 20), ('read_csv-910ec886221afde30c768158c33b486c', 49), ('read_csv-910ec886221afde30c768158c33b486c', 39), ('read_csv-910ec886221afde30c768158c33b486c', 68)} (stimulus_id='handle-worker-cleanup-1761053473.1145272')\n"
]
}
],
"source": [
"from dask_cuda import LocalCUDACluster\n",
"cluster = LocalCUDACluster(ip=IPADDR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instantiating a Client Connection ###\n",
"The `dask.distributed` library gives us distributed functionality, including the ability to connect to the CUDA cluster we just created. The `progress` import will give us a handy progress bar we can utilize below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client, progress\n",
"\n",
"client = Client(cluster)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Dask Dashboard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dask ships with a very helpful dashboard that, in our case, runs on port `8787`. Open a new browser tab now and copy this lab's URL into it, replacing `/lab/lab` with `:8787` (so it ends with `.com:8787`). This should open the Dask dashboard, currently idle."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Data with Dask cuDF ##\n",
"With `dask_cudf` we can create a dataframe from several file formats (including from multiple files and directly from cloud storage like S3), from cuDF dataframes, from Pandas dataframes, and even from vanilla CPU Dask dataframes. Here we will create a Dask cuDF dataframe from the local csv file `uk_pop5x.csv`, which has similar features to the `pop.csv` files you have already been using, except scaled up to 5 times larger (18GB), representing a population of almost 300 million, nearly the size of the entire United States."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"18G data/uk_pop5x.csv\n"
]
}
],
"source": [
"# get the file size of `uk_pop5x.csv` in GB\n",
"!ls -sh data/uk_pop5x.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We import `dask_cudf` (and other RAPIDS components when necessary) after setting up the cluster, to ensure that they are initialized correctly within the CUDA context the cluster creates."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import dask_cudf"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"ddf = dask_cudf.read_csv('./data/uk_pop5x.csv', dtype=['float32', 'str', 'str', 'float32', 'float32', 'str'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"age float32\n",
"sex object\n",
"county object\n",
"lat float32\n",
"long float32\n",
"name object\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computational Graph ##\n",
|
||||||
|
"As mentioned above, when programming with Dask, we create computational graphs that we **would eventually like** to be executed. We can already observe this behavior in action: in calling `dask_cudf.read_csv` we have indicated that **would eventually like** to read the entire contents of `pop5x_1-07.csv`. However, Dask will not ask the scheduler execute this work until we explicitly indicate that we would like it do so.\n",
|
||||||
|
"\n",
|
||||||
|
"Observe the memory usage for each of the 4 GPUs by executing the following cell, and notice that the GPU memory usage is not nearly large enough to indicate that the entire 18GB file has been read into memory:"
|
||||||
|
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Oct 21 13:29:09 2025 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|===============================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 14956MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
"| N/A 30C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
"| N/A 29C P0 26W / 70W | 168MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=============================================================================|\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizing the Computational Graph ###\n",
"Computational graphs that have not yet been executed provide the `.visualize` method that, when used in a Jupyter environment such as this one, will display the computational graph, including how Dask intends to go about distributing the work. Thus, we can visualize how the `read_csv` operation will be distributed by Dask by executing the following cell:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"115pt\" height=\"44pt\"\n",
" viewBox=\"0.00 0.00 115.00 44.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 40)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-40 111,-40 111,4 -4,4\"/>\n",
"<!-- -6332770613817605186 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>-6332770613817605186</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"107,-36 0,-36 0,0 107,0 107,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"53.5\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">ReadCSV</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94de3b45b0>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.visualize(format='svg') # Using `format='svg'` makes the visualization easier to view."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we eventually tell Dask to execute this operation, it will parallelize the work across the 4 GPUs in 69 parallel partitions. We can see the exact number of partitions with the `npartitions` property:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"69"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.npartitions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extending the Computational Graph ###\n",
|
||||||
|
"The concept of constructing computational graphs with arbitrary operations before executing them is a core part of Dask. Let's add some operations to the existing computational graph and visualize it again.\n",
|
||||||
|
"\n",
|
||||||
|
"After running the next cell, although it will take some scrolling to get a clear sense of it (the challenges of distributed data analytics!), you can see that the graph already constructed for `read_csv` now continues upward. It selects the `age` column across all partitions (visualized as `getitem`) and eventually performs the `.mean()` reduction (visualized as `series-sum-chunk`, `series-sum-agg`, `count-chunk`, `sum-agg` and `true-div`)."
|
||||||
|
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.43.0 (0)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"276pt\" height=\"188pt\"\n",
" viewBox=\"0.00 0.00 276.00 188.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 184)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-184 272,-184 272,4 -4,4\"/>\n",
"<!-- 2336549067836068764 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>2336549067836068764</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"221,-180 47,-180 47,-144 221,-144 221,-180\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-157\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">Sum(Projection)</text>\n",
"</g>\n",
"<!-- 553658985626135620 -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>553658985626135620</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"268,-108 0,-108 0,-72 268,-72 268,-108\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-85\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">Projection(ReadCSV, age)</text>\n",
"</g>\n",
"<!-- 553658985626135620->2336549067836068764 -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>553658985626135620->2336549067836068764</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M134,-108.3C134,-116.02 134,-125.29 134,-133.89\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"130.5,-133.9 134,-143.9 137.5,-133.9 130.5,-133.9\"/>\n",
"</g>\n",
"<!-- -6332770613817605186 -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>-6332770613817605186</title>\n",
"<polygon fill=\"none\" stroke=\"black\" points=\"187.5,-36 80.5,-36 80.5,0 187.5,0 187.5,-36\"/>\n",
"<text text-anchor=\"middle\" x=\"134\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">ReadCSV</text>\n",
"</g>\n",
"<!-- -6332770613817605186->553658985626135620 -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>-6332770613817605186->553658985626135620</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M134,-36.3C134,-44.02 134,-53.29 134,-61.89\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"130.5,-61.9 134,-71.9 137.5,-61.9 130.5,-61.9\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.graphs.Digraph at 0x7f94de3b59f0>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
|
||||||
|
"mean_age = ddf['age'].sum()\n",
|
||||||
|
"mean_age.visualize(format='svg')"
|
||||||
|
]
|
||||||
|
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing with the Computational Graph ###\n",
|
||||||
|
"There are several ways to indicate to Dask that we would like to perform the computations described in the computational graphs we have constructed. The first we will show is the `.compute` method, which returns the output of the computation as an object in a single GPU's memory - no longer distributed across GPUs.\n",
"\n",
"**NOTE**: This value is actually a [*future*](https://docs.python.org/3/library/concurrent.futures.html), which can be used in code immediately, even before it has finished evaluating. While this can be tremendously useful in many scenarios, in this workshop we will not need to do anything fancy with the futures we generate except wait for them to evaluate so we can visualize their values.\n",
"\n",
"Below we send the computational graph we have created to the Dask scheduler to be executed in parallel on our 4 GPUs. If you still have the Dask Dashboard open in another tab, you can watch it while the operation completes. Because our graph involves reading the entire 18GB data set (as we declared when adding `read_csv` to the call graph), you can expect the operation to take a little time. If you watch the dashboard closely, you will see that Dask begins follow-on calculations for the `sum` even while data is still being read into memory."
|
||||||
|
]
|
||||||
|
},
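The *future* mechanic referenced above comes from Python's standard `concurrent.futures` module. As a minimal, CPU-only sketch using only the standard library (no Dask or GPUs involved), a future can be handed around immediately while its result is still being computed in the background:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_sum(values):
    # Simulate a long-running aggregation.
    time.sleep(0.1)
    return sum(values)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(slow_sum, range(10))  # returns a Future immediately
    print(fut.done())    # likely False: the work is still in flight
    print(fut.result())  # blocks until the value is ready, then prints 45
```

Dask's futures follow the same idea, except the work runs on the cluster's workers rather than a local thread pool.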
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 12,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"11732293000.0"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 12,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"mean_age.compute()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Persisting Data in the Cluster ###\n",
|
||||||
|
"As you can see, the previous operation, which read the entire 18GB CSV into the GPUs' memory, did not retain the data in memory after the computational graph completed:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 13,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"Tue Oct 21 13:31:04 2025 \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
|
||||||
|
"|-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
|
||||||
|
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
|
||||||
|
"| | | MIG M. |\n",
|
||||||
|
"|===============================+======================+======================|\n",
|
||||||
|
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
|
||||||
|
"| N/A 30C P0 26W / 70W | 14094MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
|
||||||
|
"| N/A 30C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
|
||||||
|
"| N/A 30C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
|
||||||
|
"| N/A 29C P0 26W / 70W | 690MiB / 15360MiB | 0% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
" \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| Processes: |\n",
|
||||||
|
"| GPU GI CI PID Type Process name GPU Memory |\n",
|
||||||
|
"| ID ID Usage |\n",
|
||||||
|
"|=============================================================================|\n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"!nvidia-smi"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"A typical Dask workflow, and the one we will use here, is to persist the data we would like to work with to the cluster, then perform fast operations on that persisted data. We do this with the `.persist` method. From the [Dask documentation](https://distributed.dask.org/en/latest/manage-computation.html#client-persist):\n",
"\n",
">The `.persist` method submits the task graph behind the Dask collection to the scheduler, obtaining Futures for all of the top-most tasks (for example one Future for each Pandas [*or cuDF*] DataFrame in a Dask[*-cudf*] DataFrame). It then returns a copy of the collection pointing to these futures instead of the previous graph. This new collection is semantically equivalent but now points to actively running data rather than a lazy graph.\n",
"\n",
"Below we persist `ddf` to the cluster so that it resides in GPU memory, ready for fast operations."
|
||||||
|
]
|
||||||
|
},
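To make the persist/compute split concrete, here is a toy, hypothetical sketch (not Dask's real implementation): a lazy collection holds either a recipe (a zero-argument callable) per partition or, after `.persist()`, the materialized partition data itself, while `.compute()` gathers everything into one local result:

```python
class LazyFrame:
    """Toy stand-in for a Dask collection: partitions are either
    zero-arg callables (lazy tasks) or already-materialized lists."""

    def __init__(self, partitions):
        self.partitions = partitions

    def persist(self):
        # Run every partition task now; keep the results resident,
        # analogous to data living in the cluster's (GPU) memory.
        return LazyFrame([p() if callable(p) else p for p in self.partitions])

    def compute(self):
        # Gather all partitions into a single local result.
        out = []
        for p in self.partitions:
            out.extend(p() if callable(p) else p)
        return out

lazy = LazyFrame([lambda: list(range(3)), lambda: list(range(3, 6))])
resident = lazy.persist()  # partitions are now concrete lists
print(resident.compute())  # prints [0, 1, 2, 3, 4, 5]
```

Follow-on operations against `resident` skip the partition tasks entirely, which is why computations after `persist` are so much faster.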
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 14,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"ddf = ddf.persist()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"As you can see by executing `nvidia-smi` (after letting the `persist` finish), each GPU now has parts of the distributed dataframe in its memory:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 15,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"Tue Oct 21 13:31:08 2025 \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |\n",
|
||||||
|
"|-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
|
||||||
|
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
|
||||||
|
"| | | MIG M. |\n",
|
||||||
|
"|===============================+======================+======================|\n",
|
||||||
|
"| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |\n",
|
||||||
|
"| N/A 32C P0 33W / 70W | 14218MiB / 15360MiB | 46% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |\n",
|
||||||
|
"| N/A 32C P0 32W / 70W | 3768MiB / 15360MiB | 19% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |\n",
|
||||||
|
"| N/A 31C P0 32W / 70W | 3804MiB / 15360MiB | 24% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
"| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
|
||||||
|
"| N/A 31C P0 32W / 70W | 3764MiB / 15360MiB | 45% Default |\n",
|
||||||
|
"| | | N/A |\n",
|
||||||
|
"+-------------------------------+----------------------+----------------------+\n",
|
||||||
|
" \n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n",
|
||||||
|
"| Processes: |\n",
|
||||||
|
"| GPU GI CI PID Type Process name GPU Memory |\n",
|
||||||
|
"| ID ID Usage |\n",
|
||||||
|
"|=============================================================================|\n",
|
||||||
|
"+-----------------------------------------------------------------------------+\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"!nvidia-smi"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Running `ddf.visualize` now shows that our task graph no longer contains any operations, only partitions of data ready for us to operate on:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"image/svg+xml": [
|
||||||
|
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
|
||||||
|
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
|
||||||
|
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
|
||||||
|
"<!-- Generated by graphviz version 2.43.0 (0)\n",
|
||||||
|
" -->\n",
|
||||||
|
"<!-- Title: %3 Pages: 1 -->\n",
|
||||||
|
"<svg width=\"135pt\" height=\"44pt\"\n",
|
||||||
|
" viewBox=\"0.00 0.00 135.00 44.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
|
||||||
|
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 40)\">\n",
|
||||||
|
"<title>%3</title>\n",
|
||||||
|
"<polygon fill=\"white\" stroke=\"transparent\" points=\"-4,4 -4,-40 131,-40 131,4 -4,4\"/>\n",
|
||||||
|
"<!-- -4538719848559110466 -->\n",
|
||||||
|
"<g id=\"node1\" class=\"node\">\n",
|
||||||
|
"<title>-4538719848559110466</title>\n",
|
||||||
|
"<polygon fill=\"none\" stroke=\"black\" points=\"127,-36 0,-36 0,0 127,0 127,-36\"/>\n",
|
||||||
|
"<text text-anchor=\"middle\" x=\"63.5\" y=\"-13\" font-family=\"Helvetica,sans-Serif\" font-size=\"20.00\">FromGraph</text>\n",
|
||||||
|
"</g>\n",
|
||||||
|
"</g>\n",
|
||||||
|
"</svg>\n"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<graphviz.graphs.Digraph at 0x7f94b80d4550>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.visualize(format='svg')"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Computing operations on this data will now be much faster:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": true
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"40.1241924549316"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf['age'].mean().compute()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Initial Data Exploration with Dask cuDF ##\n",
|
||||||
|
"The beauty of Dask is that working with your data, even though it is distributed and massive, is a lot like working with smaller in-memory data sets."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 18,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/html": [
|
||||||
|
"<div>\n",
|
||||||
|
"<style scoped>\n",
|
||||||
|
" .dataframe tbody tr th:only-of-type {\n",
|
||||||
|
" vertical-align: middle;\n",
|
||||||
|
" }\n",
|
||||||
|
"\n",
|
||||||
|
" .dataframe tbody tr th {\n",
|
||||||
|
" vertical-align: top;\n",
|
||||||
|
" }\n",
|
||||||
|
"\n",
|
||||||
|
" .dataframe thead th {\n",
|
||||||
|
" text-align: right;\n",
|
||||||
|
" }\n",
|
||||||
|
"</style>\n",
|
||||||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
||||||
|
" <thead>\n",
|
||||||
|
" <tr style=\"text-align: right;\">\n",
|
||||||
|
" <th></th>\n",
|
||||||
|
" <th>age</th>\n",
|
||||||
|
" <th>sex</th>\n",
|
||||||
|
" <th>county</th>\n",
|
||||||
|
" <th>lat</th>\n",
|
||||||
|
" <th>long</th>\n",
|
||||||
|
" <th>name</th>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" </thead>\n",
|
||||||
|
" <tbody>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>0</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.549641</td>\n",
|
||||||
|
" <td>-1.493884</td>\n",
|
||||||
|
" <td>HARRISON</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>1</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.523945</td>\n",
|
||||||
|
" <td>-1.401142</td>\n",
|
||||||
|
" <td>LAKSH</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>2</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.561127</td>\n",
|
||||||
|
" <td>-1.690068</td>\n",
|
||||||
|
" <td>MUHAMMAD</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>3</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.542988</td>\n",
|
||||||
|
" <td>-1.543216</td>\n",
|
||||||
|
" <td>GRAYSON</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <th>4</th>\n",
|
||||||
|
" <td>0.0</td>\n",
|
||||||
|
" <td>m</td>\n",
|
||||||
|
" <td>Darlington</td>\n",
|
||||||
|
" <td>54.532101</td>\n",
|
||||||
|
" <td>-1.569116</td>\n",
|
||||||
|
" <td>FINLAY</td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
" </tbody>\n",
|
||||||
|
"</table>\n",
|
||||||
|
"</div>"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
" age sex county lat long name\n",
|
||||||
|
"0 0.0 m Darlington 54.549641 -1.493884 HARRISON\n",
|
||||||
|
"1 0.0 m Darlington 54.523945 -1.401142 LAKSH\n",
|
||||||
|
"2 0.0 m Darlington 54.561127 -1.690068 MUHAMMAD\n",
|
||||||
|
"3 0.0 m Darlington 54.542988 -1.543216 GRAYSON\n",
|
||||||
|
"4 0.0 m Darlington 54.532101 -1.569116 FINLAY"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 18,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.head() # As a convenience, no need to `.compute` the `head()` method"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"age 292399470\n",
|
||||||
|
"sex 292399470\n",
|
||||||
|
"county 292399470\n",
|
||||||
|
"lat 292399470\n",
|
||||||
|
"long 292399470\n",
|
||||||
|
"name 292399470\n",
|
||||||
|
"dtype: int64"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.count().compute()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"age float32\n",
|
||||||
|
"sex object\n",
|
||||||
|
"county object\n",
|
||||||
|
"lat float32\n",
|
||||||
|
"long float32\n",
|
||||||
|
"name object\n",
|
||||||
|
"dtype: object"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"ddf.dtypes"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Exercise #1 - Counties North of Sunderland with Dask ###\n",
|
||||||
|
"Here we ask you to revisit an earlier exercise, but on the distributed data set. Hopefully, it's clear how similar the code is for single-GPU dataframes and distributed dataframes with Dask.\n",
|
||||||
|
"\n",
|
||||||
|
"Identify the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), then determine which counties have any residents north of that resident. Use the `unique` method of a Dask-cuDF `Series` to de-duplicate the result.\n",
|
||||||
|
"\n",
|
||||||
|
"**Instructions**: <br>\n",
|
||||||
|
"* Modify the `<FIXME>` only and execute the below cell to identify counties north of Sunderland. "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 1,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"ename": "NameError",
|
||||||
|
"evalue": "name 'ddf' is not defined",
|
||||||
|
"output_type": "error",
|
||||||
|
"traceback": [
|
||||||
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
||||||
|
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
||||||
|
"Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m sunderland_residents \u001b[38;5;241m=\u001b[39m \u001b[43mddf\u001b[49m\u001b[38;5;241m.\u001b[39mloc[[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcounty\u001b[39m\u001b[38;5;124m'\u001b[39m], [\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mSUNDERLAND\u001b[39m\u001b[38;5;124m'\u001b[39m]]\n\u001b[1;32m 2\u001b[0m northmost_sunderland_lat \u001b[38;5;241m=\u001b[39m sunderland_residents[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlat\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mmax()\n\u001b[1;32m 3\u001b[0m counties_with_pop_north_of \u001b[38;5;241m=\u001b[39m ddf\u001b[38;5;241m.\u001b[39mloc[ddf[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlat\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m>\u001b[39m northmost_sunderland_lat][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcounty\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39munique()\n",
|
||||||
|
"\u001b[0;31mNameError\u001b[0m: name 'ddf' is not defined"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"sunderland_residents = ddf.loc[ddf['county'] == <FIXME>]\n",
|
||||||
|
"northmost_sunderland_lat = sunderland_residents['lat'].max()\n",
|
||||||
|
"counties_with_pop_north_of = ddf.loc[ddf['lat'] > northmost_sunderland_lat]['county'].unique()\n",
|
||||||
|
"results=counties_with_pop_north_of.compute()\n",
|
||||||
|
"results.head()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "raw",
|
||||||
|
"metadata": {
|
||||||
|
"jupyter": {
|
||||||
|
"source_hidden": true
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"source": [
|
||||||
|
"\n",
|
||||||
|
"sunderland_residents = ddf.loc[ddf['county'] == 'Sunderland']\n",
|
||||||
|
"northmost_sunderland_lat = sunderland_residents['lat'].max()\n",
|
||||||
|
"counties_with_pop_north_of = ddf.loc[ddf['lat'] > northmost_sunderland_lat]['county'].unique()\n",
|
||||||
|
"results=counties_with_pop_north_of.compute()\n",
|
||||||
|
"results.head()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Click ... for solution. "
|
||||||
|
]
|
||||||
|
},
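The max-then-filter-then-`unique` pattern in this exercise is not Dask-specific. As a plain-Python sketch on a few hypothetical toy rows (not the workshop data set), the same logic looks like this:

```python
# Toy rows standing in for the population data (hypothetical values).
rows = [
    {"county": "Sunderland", "lat": 54.95},
    {"county": "Sunderland", "lat": 54.90},
    {"county": "Northumberland", "lat": 55.20},
    {"county": "Darlington", "lat": 54.52},
    {"county": "North Tyneside", "lat": 55.00},
]

# 1. Latitude of the northernmost Sunderland resident.
northmost = max(r["lat"] for r in rows if r["county"] == "Sunderland")

# 2. Counties with any resident north of that latitude, de-duplicated.
counties = sorted({r["county"] for r in rows if r["lat"] > northmost})
print(counties)  # prints ['North Tyneside', 'Northumberland']
```

With Dask the same steps run lazily across partitions; only the final `.compute()` materializes the de-duplicated result.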
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 22,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"{'status': 'ok', 'restart': True}"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 22,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"import IPython\n",
|
||||||
|
"app = IPython.Application.instance()\n",
|
||||||
|
"app.kernel.do_shutdown(True)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"**Well Done!** Let's move to the [next notebook](1-09_cudf-polars.ipynb). "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"<img src=\"./images/DLI_Header.png\" width=400/>"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.10.15"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
172
ds/25-1/2/county_centroid.csv
Normal file
@ -0,0 +1,172 @@
county,lat_county_center,long_county_center
BARKING AND DAGENHAM,51.621048311776526,0.12958319845588165
BARNET,51.81255163972051,-0.21821206632197684
BARNSLEY,53.57190690010971,-1.5487193565226611
BATH AND NORTH EAST SOMERSET,51.35496548780361,-2.486675162410336
BEDFORD,52.145475839485385,-0.4549734374180617
BEXLEY,51.33625605642689,0.14633321710015448
BIRMINGHAM,52.12178304394528,-1.881329432771379
BLACKBURN WITH DARWEN,53.63718763008419,-2.463700844959783
BLACKPOOL,53.882118373353435,-3.0229009637127167
BLAENAU GWENT,51.75159582861159,-3.1862426125686745
BOLTON,53.73813128127497,-2.4794091133678147
BRACKNELL FOREST,51.457925145468295,-0.7336441271286038
BRADFORD,53.972113267048044,-1.8738762931122748
BRENT,51.761695309784,-0.2756927203781798
BRIDGEND,51.522888539164526,-3.6137468421270604
BRIGHTON AND HOVE,50.94890407892698,-0.1507807253912774
"BRISTOL, CITY OF",51.53203785026057,-2.5774864859032594
BROMLEY,51.2251371203518,0.03905163114984023
BUCKINGHAMSHIRE,51.92925587759856,-0.8053996183750294
BURY,53.61553432785575,-2.3088650595977023
CAERPHILLY,51.62781255006381,-3.1973649865483735
CALDERDALE,53.769761331289686,-1.9616103771384508
CAMBRIDGESHIRE,52.1333820427886,-0.23503728806014595
CAMDEN,51.69346289078886,-0.1629412552292679
CARDIFF,51.56635588939404,-3.222317281083218
CARMARTHENSHIRE,51.92106862577838,-4.211293704149962
CENTRAL BEDFORDSHIRE,51.99983427713095,-0.4775810785914261
CEREDIGION,52.297905934896974,-3.9524382809074967
CHESHIRE EAST,53.209779668583735,-2.2923524120906538
CHESHIRE WEST AND CHESTER,53.12468649229667,-2.703640874356098
CITY OF LONDON,51.515869084539396,-0.09345024349003202
CONWY,53.125451225027945,-3.7469275629154897
CORNWALL,50.2491094902892,-4.642072961722217
COUNTY DURHAM,54.46928915708376,-1.840983172985692
COVENTRY,52.20619163815314,-1.5190329484575433
CROYDON,51.33122440611814,-0.07773715861848832
CUMBRIA,54.470582575648244,-2.902600383252353
DARLINGTON,54.51355967194039,-1.5680201999230523
DENBIGHSHIRE,53.07313542431554,-3.347662396412462
DERBY,52.98317870391253,-1.471762916352353
DERBYSHIRE,52.96237103431297,-1.6019383162802616
DEVON,50.75993290464059,-3.6572707805745353
DONCASTER,53.579077870304175,-1.1091519021581622
DORSET,50.80117614559981,-2.4141088997141975
DUDLEY,52.466075739334926,-2.101688961593882
EALING,51.69946371446451,-0.31413253292570953
EAST RIDING OF YORKSHIRE,53.9506321883079,-0.6619808168243948
EAST SUSSEX,50.8319515317622,0.33441692286193403
ENFIELD,51.79829813489722,-0.08133941451400101
ESSEX,51.61177562858481,0.5408806396014519
FLINTSHIRE,53.18448452051185,-3.176529270275655
GATESHEAD,54.984104331680726,-1.6867966327256207
GLOUCESTERSHIRE,51.95116469210396,-2.152140175011601
GREENWICH,51.298529627584855,0.05009798110429057
GWYNEDD,52.90798692199907,-3.815807248465912
HACKNEY,51.715573990309835,-0.06047668080560671
HALTON,53.37945371869939,-2.6885285111965866
HAMMERSMITH AND FULHAM,51.45669431471315,-0.21734862391196488
HAMPSHIRE,51.35882747857323,-1.2472236572124424
HARINGEY,51.71488485869694,-0.10670896820865851
HARROW,51.69502976226169,-0.3360141730528605
HARTLEPOOL,54.67019690697325,-1.2702881849113061
HAVERING,51.68803382335829,0.23538931286606415
"HEREFORDSHIRE, COUNTY OF",52.05661428266539,-2.7394973894756567
HERTFORDSHIRE,51.97545351306396,-0.2768104374496038
HILLINGDON,51.67744993832507,-0.44168376669816023
HOUNSLOW,51.31550103034914,-0.37851470463324743
ISLE OF ANGLESEY,53.27637540915653,-4.323495411729392
ISLE OF WIGHT,50.62684579406237,-1.3335589426514434
ISLES OF SCILLY,49.923857744201605,-6.302263516809768
ISLINGTON,51.66454658738323,-0.10992970115558956
KENSINGTON AND CHELSEA,51.49977592399342,-0.18981078381787103
KENT,51.066980402556894,0.72177006521006
"KINGSTON UPON HULL, CITY OF",53.894135701816644,-0.30380941990063115
KINGSTON UPON THAMES,51.42789080754545,-0.28368404321251495
KIRKLEES,53.84779145117579,-1.7808194218728275
KNOWSLEY,53.48284092504563,-2.8329791954991275
LAMBETH,51.252923290285565,-0.11380231585035454
LANCASHIRE,53.39410422518683,-2.460896340904076
LEEDS,53.55494339794778,-1.5074406609781625
LEICESTER,52.7035904712036,-1.1304165681356237
LEICESTERSHIRE,52.372384242153444,-1.3774821236258858
LEWISHAM,51.26146486742923,-0.017302263531446847
LINCOLNSHIRE,53.019325697607805,-0.23840017404638325
LIVERPOOL,53.51161042331058,-2.9133522899513755
LUTON,51.96794156247519,-0.4231450525783596
MANCHESTER,53.618174414336764,-2.2337215842169944
MEDWAY,51.32754494250598,0.5632336335498731
MERTHYR TYDFIL,51.749169200604825,-3.36403864047987
MERTON,51.37364806533906,-0.18868296177359278
MIDDLESBROUGH,54.5098082464691,-1.211038279554591
MILTON KEYNES,52.01693552290149,-0.7406232665194876
MONMOUTHSHIRE,51.78143655329183,-2.9039386644643197
NEATH PORT TALBOT,51.59538437854254,-3.7458617902677283
NEWCASTLE UPON TYNE,55.00208530426788,-1.652806624671881
NEWHAM,51.75154898367921,0.027418339450078835
NEWPORT,51.53253056059282,-2.8977514562758477
NORFOLK,52.3032223796034,0.9647662889518414
NORTH EAST LINCOLNSHIRE,53.50967645052903,-0.13922750148994814
NORTH LINCOLNSHIRE,53.57540769163687,-0.5237063875323392
NORTH SOMERSET,51.35265217208383,-2.754333708085771
NORTH TYNESIDE,55.00390319683472,-1.5092377782362794
NORTH YORKSHIRE,54.037083506236726,-1.5496083229591298
NORTHAMPTONSHIRE,52.090056204873584,-0.8673643733062965
NORTHUMBERLAND,55.268382697315424,-2.075107564148198
NOTTINGHAM,52.95517248670217,-1.166635297324727
NOTTINGHAMSHIRE,53.03298887412134,-1.006945929298795
OLDHAM,53.659965283524954,-2.052688245629671
OXFORDSHIRE,51.93769526591072,-1.2911207463303098
PEMBROKESHIRE,51.87232817560273,-4.908191395785854
PETERBOROUGH,52.62511626981561,-0.2689975241368676
PLYMOUTH,50.29446598251615,-4.112955625237552
PORTSMOUTH,50.91433206435089,-1.0702659081823802
POWYS,52.35028728472521,-3.4364646802117074
READING,51.48972751726377,-0.9907195716377762
REDBRIDGE,51.74619394585629,0.0701000048233879
REDCAR AND CLEVELAND,54.52674848959172,-1.0057471172413288
RICHMOND UPON THAMES,51.40228740909276,-0.28924251316631455
ROCHDALE,53.67734692115036,-2.14815188340053
ROTHERHAM,53.27571588878268,-1.2866084213986422
RUTLAND,52.66741819281054,-0.6255844565552813
SALFORD,53.39900474827836,-2.3848977331687684
SANDWELL,52.58696674791831,-2.007627650605722
SEFTON,53.41754419091054,-2.9918998460398845
SHEFFIELD,53.594572416421464,-1.5427564265432459
SHROPSHIRE,52.68421414164122,-2.7366875706426375
SLOUGH,51.500375556628576,-0.5761037634462686
SOLIHULL,52.36591301434561,-1.7157174664625492
SOMERSET,51.15203995716832,-3.2953379430424437
SOUTH GLOUCESTERSHIRE,51.619868102630875,-2.469430184260059
SOUTH TYNESIDE,54.994706019365786,-1.4469508035803413
SOUTHAMPTON,50.984805930473584,-1.4002768042215858
SOUTHEND-ON-SEA,51.562157807336284,0.7069905953535786
SOUTHWARK,51.26247572937943,-0.07306483663823536
ST. HELENS,53.442240723358644,-2.7032424159534347
STAFFORDSHIRE,52.54946704767607,-2.027491119365553
STOCKPORT,53.243567817667724,-2.1248973952531918
STOCKTON-ON-TEES,54.60356568786033,-1.3063893005278557
STOKE-ON-TRENT,53.0018684063432,-2.1588155163720084
SUFFOLK,52.07327606663186,1.049040133490474
SUNDERLAND,54.95658521287448,-1.433572135990224
SURREY,51.75817482314145,-0.3386369800762059
SUTTON,51.33189096687447,-0.17228958486126392
SWANSEA,51.734320352502984,-3.967180818043868
SWINDON,51.64295753076632,-1.7336382187066433
TAMESIDE,53.4185402114593,-2.0769462404028474
TELFORD AND WREKIN,52.709149095326744,-2.4894724871905916
THURROCK,51.508227793073466,0.33492786371540356
TORBAY,50.494049197230815,-3.5551646045072913
TORFAEN,51.69896506141925,-3.0509328418360218
TOWER HAMLETS,51.68485859523772,-0.03638140322291906
TRAFFORD,53.314621144815334,-2.3656560688750687
VALE OF GLAMORGAN,51.477096810804674,-3.3980039155600954
WAKEFIELD,53.81677380462442,-1.4208545508030999
WALSALL,52.742742908764974,-1.9703315889024553
WALTHAM FOREST,51.723501987712325,-0.01886180175957716
WANDSWORTH,51.24653418036352,-0.2001743797936436
WARRINGTON,53.338554119123636,-2.561564052456012
WARWICKSHIRE,52.04847200574421,-1.5686356193411675
WEST BERKSHIRE,51.472960442069805,-1.2740171035533379
WEST SUSSEX,51.11473921001523,-0.4593527537340543
WESTMINSTER,51.613346179755915,-0.15298252171750404
WIGAN,53.58763891955546,-2.5723844100365545
WILTSHIRE,51.48575283497703,-1.926537553406791
WINDSOR AND MAIDENHEAD,51.494612540256846,-0.6753936432282348
WIRRAL,53.237217504292545,-3.0650813262796417
WOKINGHAM,51.45966460093226,-0.8993706058495408
WOLVERHAMPTON,52.71684834050869,-2.127594624973283
WORCESTERSHIRE,52.05799103802506,-2.209184250840713
WREXHAM,53.00080440180421,-2.991958507191866
YORK,53.99232942499273,-1.073788787620359