{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4b2efdc2-313c-493b-9d6c-432ae77d342c",
   "metadata": {},
   "source": [
    "<img src=\"./images/DLI_Header.png\" width=400/>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b754f1a2-24e8-44d4-ae6a-257483573434",
   "metadata": {},
   "source": [
    "# Fundamentals of Accelerated Data Science # "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "881e48fa-78ac-4d92-a8bb-d419c14df9e9",
   "metadata": {},
   "source": [
    "## 07 - Extract, Transform, and Load ##\n",
    "\n",
    "**Table of Contents**\n",
    "<br>\n",
    "In this notebook, we will go through the basics of extract, transform, and load. This notebook covers the below sections: \n",
    "1. [Extract, Transform, and Load (ETL)](#Extract,-Transform,-and-Load-(ETL))\n",
    "    * [Extract](#Extract)\n",
    "    * [Transform](#Transform)\n",
    "    * [Load](#Load)\n",
    "2. [Save to Parquet Format](#Save-to-Parquet-Format)\n",
    "    * [Reading from Parquet](#Reading-from-Parquet)\n",
    "3. [Accelerated ETL for Downstream Tasks](#Accelerated-ETL-for-Downstream-Tasks)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a083cd6e-73f4-42b8-be04-bdd2bdeb5185",
   "metadata": {},
   "source": [
    "## Extract, Transform, and Load (ETL) ##\n",
    "An important but perhaps not as highly glorified use case of RAPIDS is extract, transform, and load, or ETL for short. It is a data integration process used to combine data from multiple sources into a single, consistent data store. It's primary goals are: \n",
    "* Consolidates data from multiple sources into a single, consistent format\n",
    "* Improves data quality through cleaning and validation\n",
    "* Enables more efficient data analysis and reporting\n",
    "* Supports data-driven decision making"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9d5d73e-1665-4706-8840-6e14c2fabaa7",
   "metadata": {},
   "source": [
    "### Extract ###\n",
    "**Extract** is the first step where data is collected from various source systems. These sources could include: \n",
    "* Static files (csv, json)\n",
    "* SQL RDBMS\n",
    "* Webpages\n",
    "* API\n",
    "\n",
    "**Note**: cuDF doesn't have a way to get transactions from external SQL databases directly to GPU. The workaround is reading with pandas and create cuDF dataframe with `cudf.from_pandas()`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "17be68c6-49c0-429d-86d2-c2cb711a6dc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext cudf.pandas\n",
    "# DO NOT CHANGE THIS CELL\n",
    "import pandas as pd\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "37c75532-1bb6-4aab-9cee-3215d7137cee",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>county</th>\n",
       "      <th>lat</th>\n",
       "      <th>long</th>\n",
       "      <th>name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.533638</td>\n",
       "      <td>-1.524400</td>\n",
       "      <td>FRANCIS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.426254</td>\n",
       "      <td>-1.465314</td>\n",
       "      <td>EDWARD</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.555199</td>\n",
       "      <td>-1.496417</td>\n",
       "      <td>TEDDY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.547909</td>\n",
       "      <td>-1.572342</td>\n",
       "      <td>ANGUS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.477638</td>\n",
       "      <td>-1.605995</td>\n",
       "      <td>CHARLIE</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age sex      county        lat      long     name\n",
       "0    0   m  DARLINGTON  54.533638 -1.524400  FRANCIS\n",
       "1    0   m  DARLINGTON  54.426254 -1.465314   EDWARD\n",
       "2    0   m  DARLINGTON  54.555199 -1.496417    TEDDY\n",
       "3    0   m  DARLINGTON  54.547909 -1.572342    ANGUS\n",
       "4    0   m  DARLINGTON  54.477638 -1.605995  CHARLIE"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "dtype_dict={\n",
    "    'age': 'int8', \n",
    "    'sex': 'object', \n",
    "    'county': 'object', \n",
    "    'lat': 'float32', \n",
    "    'long': 'float32', \n",
    "    'name': 'object'\n",
    "}\n",
    "        \n",
    "df=pd.read_csv('./data/uk_pop.csv', dtype=dtype_dict)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd97bfdd-5ead-4cd7-bc17-0e1eeb4aa62b",
   "metadata": {},
   "source": [
    "When importing data, it is important to only include columns that are relevant to reduce the memory and compute burden. \n",
    "\n",
    "Below we read in the county centroid data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d064ff45-4cf4-48cd-9cbe-3f3f177fb875",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>county</th>\n",
       "      <th>lat_county_center</th>\n",
       "      <th>long_county_center</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>BARKING AND DAGENHAM</td>\n",
       "      <td>51.621048</td>\n",
       "      <td>0.129583</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>BARNET</td>\n",
       "      <td>51.812552</td>\n",
       "      <td>-0.218212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>BARNSLEY</td>\n",
       "      <td>53.571907</td>\n",
       "      <td>-1.548719</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>BATH AND NORTH EAST SOMERSET</td>\n",
       "      <td>51.354965</td>\n",
       "      <td>-2.486675</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>BEDFORD</td>\n",
       "      <td>52.145476</td>\n",
       "      <td>-0.454973</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         county  lat_county_center  long_county_center\n",
       "0          BARKING AND DAGENHAM          51.621048            0.129583\n",
       "1                        BARNET          51.812552           -0.218212\n",
       "2                      BARNSLEY          53.571907           -1.548719\n",
       "3  BATH AND NORTH EAST SOMERSET          51.354965           -2.486675\n",
       "4                       BEDFORD          52.145476           -0.454973"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "centroid_df=pd.read_csv('county_centroid.csv')\n",
    "centroid_df.columns=['county', 'lat_county_center', 'long_county_center']\n",
    "centroid_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2dbf6f18-f28d-45ff-a4c4-1f50bfcf5860",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\">                                                                                             </span>\n",
       "<span style=\"font-style: italic\">                                  Total time elapsed: 0.559 seconds                          </span>\n",
       "<span style=\"font-style: italic\">                                                                                             </span>\n",
       "<span style=\"font-style: italic\">                                                Stats                                        </span>\n",
       "<span style=\"font-style: italic\">                                                                                             </span>\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line                                               </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 1        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    combined_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">merge(centroid_df, on</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span> │ 0.347158869 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                  </span> │             │             │\n",
       "└──────────┴────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[3m                                                                                             \u001b[0m\n",
       "\u001b[3m                                  Total time elapsed: 0.559 seconds                          \u001b[0m\n",
       "\u001b[3m                                                                                             \u001b[0m\n",
       "\u001b[3m                                                Stats                                        \u001b[0m\n",
       "\u001b[3m                                                                                             \u001b[0m\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine                                              \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 1        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmerge\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcentroid_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mon\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m │ 0.347158869 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                  \u001b[0m │             │             │\n",
       "└──────────┴────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%cudf.pandas.line_profile\n",
    "combined_df=df.merge(centroid_df, on='county')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3294d6e0-00d9-4e33-91d8-2dc02db51512",
   "metadata": {},
   "source": [
    "### Transform ###\n",
    "During the **Transform** step, the extract data is cleaned, validated, and converted into a suitable format for analysis. \n",
    "\n",
    "Below we add a new column, representing each persons's distance from their respective county center. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1e6ce3e0-625b-49ec-a755-2bfddee41431",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                             Total time elapsed: 2.406 seconds                                     </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                           Stats                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line                                                                     </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 1        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    c</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'long'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"background-color: #272822\">                                                   </span> │             │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "│ 2        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    combined_df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'R'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">((combined_df[c] </span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> combined_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">groupby(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)[c…</span> │ 2.261978734 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                             Total time elapsed: 2.406 seconds                                     \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                                           Stats                                                   \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine                                                                    \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 1        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mc\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlong\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m                                                   \u001b[0m │             │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "│ 2        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mR\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mgroupby\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m…\u001b[0m │ 2.261978734 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%cudf.pandas.line_profile\n",
    "c=['lat', 'long']\n",
    "combined_df['R']=((combined_df[c] - combined_df.groupby('county')[c].transform('mean')) ** 2).sum(axis=1) ** 0.5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e676b1b-143d-4cf4-a534-8c25dafac1a6",
   "metadata": {},
   "source": [
    "Using joins to get lookup values can be faster than deriving those. It is not uncommon to store group statistics for this purpose. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4339b0c7-c391-48f4-9ba3-00f79ba50b5b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                             Total time elapsed: 0.830 seconds                                     </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                           Stats                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line                                                                     </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 3        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    centroid_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county_centroid.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)</span><span style=\"background-color: #272822\">                      </span> │ 0.005319933 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "│ 6        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    combined_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">merge(centroid_df, on</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, suffixes</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">''</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'_coun…</span> │ 0.452281164 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "│ 9        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    combined_df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'R'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">((combined_df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">-</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">combined_df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'lat_county_cente…</span> │ 0.185010254 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                             Total time elapsed: 0.830 seconds                                     \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                                           Stats                                                   \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine                                                                    \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 3        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcentroid_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_csv\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty_centroid.csv\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m                      \u001b[0m │ 0.005319933 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "│ 6        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mmerge\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcentroid_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mon\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msuffixes\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m_coun…\u001b[0m │ 0.452281164 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "│ 9        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mR\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m-\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mlat_county_cente…\u001b[0m │ 0.185010254 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%cudf.pandas.line_profile\n",
    "\n",
    "# read in centroid data\n",
    "centroid_df=pd.read_csv('county_centroid.csv')\n",
    "\n",
    "# merge \n",
    "combined_df=df.merge(centroid_df, on='county', suffixes=['', '_county_center'])\n",
    "\n",
    "# calculate distance from county center\n",
    "combined_df['R']=((combined_df['lat']-combined_df['lat_county_center'])**2+(combined_df['long']-combined_df['long_county_center'])**2)**0.5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa017ea0-4ad2-4184-9669-40747e31381a",
   "metadata": {},
   "source": [
    "Below we filter the data to only include adults. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "917aa1ed-1778-40d4-bbbc-5892935ce7cd",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>county</th>\n",
       "      <th>lat</th>\n",
       "      <th>long</th>\n",
       "      <th>name</th>\n",
       "      <th>lat_county_center</th>\n",
       "      <th>long_county_center</th>\n",
       "      <th>R</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22443378</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.542645</td>\n",
       "      <td>-1.611084</td>\n",
       "      <td>ELLIOTT</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.051966</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443379</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.547031</td>\n",
       "      <td>-1.517514</td>\n",
       "      <td>SANTINO</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.060590</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443380</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.470505</td>\n",
       "      <td>-1.527565</td>\n",
       "      <td>LAWRENCE</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.059079</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443381</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.520664</td>\n",
       "      <td>-1.620248</td>\n",
       "      <td>THEO</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.052709</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443382</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.569939</td>\n",
       "      <td>-1.498693</td>\n",
       "      <td>CHARLIE</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.089358</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          age sex      county        lat      long      name  \\\n",
       "22443378   60   m  DARLINGTON  54.542645 -1.611084   ELLIOTT   \n",
       "22443379   60   m  DARLINGTON  54.547031 -1.517514   SANTINO   \n",
       "22443380   60   m  DARLINGTON  54.470505 -1.527565  LAWRENCE   \n",
       "22443381   60   m  DARLINGTON  54.520664 -1.620248      THEO   \n",
       "22443382   60   m  DARLINGTON  54.569939 -1.498693   CHARLIE   \n",
       "\n",
       "          lat_county_center  long_county_center         R  \n",
       "22443378           54.51356            -1.56802  0.051966  \n",
       "22443379           54.51356            -1.56802  0.060590  \n",
       "22443380           54.51356            -1.56802  0.059079  \n",
       "22443381           54.51356            -1.56802  0.052709  \n",
       "22443382           54.51356            -1.56802  0.089358  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\">                                                                                          </span>\n",
       "<span style=\"font-style: italic\">                                Total time elapsed: 0.574 seconds                         </span>\n",
       "<span style=\"font-style: italic\">                                                                                          </span>\n",
       "<span style=\"font-style: italic\">                                              Stats                                       </span>\n",
       "<span style=\"font-style: italic\">                                                                                          </span>\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line                                            </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    senior_df_filter</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">combined_df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">] </span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">&gt;=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\"> </span><span style=\"color: #ae81ff; text-decoration-color: #ae81ff; background-color: #272822\">60</span><span style=\"background-color: #272822\">  </span> │ 0.008445255 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                               </span> │             │             │\n",
       "│ 3        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    senior_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">combined_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">loc[senior_df_filter]</span> │ 0.081001887 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                               </span> │             │             │\n",
       "│ 5        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    display(senior_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">head())</span><span style=\"background-color: #272822\">                  </span> │ 0.287346369 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                               </span> │             │             │\n",
       "└──────────┴─────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[3m                                                                                          \u001b[0m\n",
       "\u001b[3m                                Total time elapsed: 0.574 seconds                         \u001b[0m\n",
       "\u001b[3m                                                                                          \u001b[0m\n",
       "\u001b[3m                                              Stats                                       \u001b[0m\n",
       "\u001b[3m                                                                                          \u001b[0m\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine                                           \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msenior_df_filter\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m>\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m60\u001b[0m\u001b[48;2;39;40;34m  \u001b[0m │ 0.008445255 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                               \u001b[0m │             │             │\n",
       "│ 3        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msenior_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcombined_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mloc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msenior_df_filter\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m │ 0.081001887 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                               \u001b[0m │             │             │\n",
       "│ 5        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdisplay\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msenior_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mhead\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m                  \u001b[0m │ 0.287346369 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                               \u001b[0m │             │             │\n",
       "└──────────┴─────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%cudf.pandas.line_profile\n",
    "\n",
    "senior_df_filter=combined_df['age'] >= 60\n",
    "senior_df=combined_df.loc[senior_df_filter]\n",
    "\n",
    "display(senior_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb474531-389e-44aa-a6b3-8de205d9cb0b",
   "metadata": {},
   "source": [
    "### Load ###\n",
    "The final **Load** step is where the transformed data is loaded into a target system. The target system can be a database or a file. The key is to develop a system that is efficient for downstream tasks. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6f637229-1770-439b-9702-6356723c96f8",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>county</th>\n",
       "      <th>lat</th>\n",
       "      <th>long</th>\n",
       "      <th>name</th>\n",
       "      <th>lat_county_center</th>\n",
       "      <th>long_county_center</th>\n",
       "      <th>R</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22443378</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.542645</td>\n",
       "      <td>-1.611084</td>\n",
       "      <td>ELLIOTT</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.051966</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443379</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.547031</td>\n",
       "      <td>-1.517514</td>\n",
       "      <td>SANTINO</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.060590</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443380</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.470505</td>\n",
       "      <td>-1.527565</td>\n",
       "      <td>LAWRENCE</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.059079</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443381</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.520664</td>\n",
       "      <td>-1.620248</td>\n",
       "      <td>THEO</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.052709</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22443382</th>\n",
       "      <td>60</td>\n",
       "      <td>m</td>\n",
       "      <td>DARLINGTON</td>\n",
       "      <td>54.569939</td>\n",
       "      <td>-1.498693</td>\n",
       "      <td>CHARLIE</td>\n",
       "      <td>54.51356</td>\n",
       "      <td>-1.56802</td>\n",
       "      <td>0.089358</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          age sex      county        lat      long      name  \\\n",
       "22443378   60   m  DARLINGTON  54.542645 -1.611084   ELLIOTT   \n",
       "22443379   60   m  DARLINGTON  54.547031 -1.517514   SANTINO   \n",
       "22443380   60   m  DARLINGTON  54.470505 -1.527565  LAWRENCE   \n",
       "22443381   60   m  DARLINGTON  54.520664 -1.620248      THEO   \n",
       "22443382   60   m  DARLINGTON  54.569939 -1.498693   CHARLIE   \n",
       "\n",
       "          lat_county_center  long_county_center         R  \n",
       "22443378           54.51356            -1.56802  0.051966  \n",
       "22443379           54.51356            -1.56802  0.060590  \n",
       "22443380           54.51356            -1.56802  0.059079  \n",
       "22443381           54.51356            -1.56802  0.052709  \n",
       "22443382           54.51356            -1.56802  0.089358  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "senior_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "348eac86-3d67-42e2-9c7f-e0acb14021cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "senior_df.to_csv('senior_df.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e25ae20b-7eea-4775-9743-d43cbf6877a7",
   "metadata": {},
   "source": [
    "**Note**: If the downstream task involves querying and analyzing the data further, the csv file format may not be the best choice. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08413512-47f9-4543-a80c-442f1500b49a",
   "metadata": {},
   "source": [
    "<a name='s1-6'></a>\n",
    "## Save to Parquet Format ##\n",
    "After processing the data, we persist it for later use. [Apache Parquet](https://parquet.apache.org/) is a columnar binary format and has become the de-facto standard for the storage of large volumes of tabular data. Converting to Parquet file format is important and csv files should generally be avoided in data products. While the csv file format is convenient and human-readable, importing csv files requires reading and parsing entire records, which can be a bottleneck. In fact, many developers will start their analysis by first converting csv files to the Parquet file format. There are many reasons to use Parquet format for analytics: \n",
    "* The columnar nature of Parquet files allows for column pruning, which often yields big query performance gains. \n",
    "* It uses metadata to store the schema and supports more advanced data types such as categorical, datetimes, and more. This means that importing data would not require schema inference or manual schema specification. \n",
    "* It captures metadata related to row-group level statistics for each column. This enables predicate pushdown filtering, which is a form of query pushdown that allows computations to happen at the “database layer” instead of the “execution engine layer”. In this case, the database layer is Parquet files in a filesystem, and the execution engine is Dask. \n",
    "* It supports flexible compression options, making it more compact to store and more portable than a database. \n",
    "\n",
    "We will use `.to_parquet(path)`[[doc]](https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html#dask-dataframe-to-parquet) to write to Parquet files. By default, files will be created in the specified output directory using the convention `part.0.parquet`, `part.1.parquet`, `part.2.parquet`, ... and so on for each partition in the DataFrame. This can be changed using the `name_function` parameter. Ouputting multiple files lets Dask write to multiple files in parallel, which is faster than writing to a single file. \n",
    "\n",
    "<p><img src='images/parquet.png' width=240></p>\n",
    "\n",
    "When working with large datasets, decoding and encoding is often an expensive task. This challenge tends to compound as the data size grows. A common pattern in data science is to subset the dataset by columns, row slices, or both. Moving these filtering operations to the read phase of the workflow can: 1) reduce I/O time, and 2) reduce the amount of memory required, which is important for GPUs where memory can be a limiting factor. Parquet file format enables filtered reading through **column pruning** and **statistic-based predicate filtering** to skip portions of the data that are irrelevant to the problem. Below are some tips for writing Parquet files: \n",
    "* When writing data, sorting the data by the columns that expect the most filters to be applied or columns with the highest cardinality can lead to meaningful performance benefits. The metadata calculated for each row group will enable predicate pushdown filters to the fullest extent. \n",
    "* Writing Parquet format, which requires reprocessing entire data sets, can be expensive. The format works remarkably well for read-intensive applications and low latency data storage and retrieval. \n",
    "* Partitions in Dask DataFrame can write out files in parallel, so multiple Parquet files are written simultaneously.\n",
    "\n",
    "Below we write the data into Parquet format, after sorting by the county. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "35007914-f7af-4083-a42a-fc02048c7ec4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "senior_df=senior_df.sort_values('county')\n",
    "\n",
    "senior_df.to_parquet('senior_df.parquet', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7e82b557-a2c2-4a42-96d0-35174370aaee",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'status': 'ok', 'restart': True}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# DO NOT CHANGE THIS CELL\n",
    "import IPython\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a3438a6-02d8-4ea0-824f-08c4bcac1063",
   "metadata": {},
   "source": [
    "### Reading from Parquet ###\n",
    "Querying data in Parquet format can be significantly more performant, especially as the size of the data increases. \n",
    "\n",
    "Below we read from both the csv format and Parquet format for comparison. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "11c2aab8-3f67-4a57-8eaa-b5d4b224a7b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext cudf.pandas\n",
    "import pandas as pd\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3876ba06-ef43-4849-8a16-58ee2f3a8423",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                             Total time elapsed: 0.245 seconds                                     </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                           Stats                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line                                                                     </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    sel</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'='</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'BLACKPOOL'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">)]</span><span style=\"background-color: #272822\">                                  </span> │             │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "│ 3        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    parquet_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_parquet(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'senior_df.parquet'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, columns</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'se…</span> │ 0.046496972 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "│ 4        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    parquet_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">parquet_df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">loc[parquet_df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">==</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'BLACKPOOL'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"background-color: #272822\">        </span> │ 0.014774265 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                             Total time elapsed: 0.245 seconds                                     \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                                           Stats                                                   \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine                                                                    \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msel\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m=\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mBLACKPOOL\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m                                  \u001b[0m │             │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "│ 3        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mparquet_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_parquet\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msenior_df.parquet\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcolumns\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mse…\u001b[0m │ 0.046496972 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "│ 4        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mparquet_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mparquet_df\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mloc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mparquet_df\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m==\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mBLACKPOOL\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m        \u001b[0m │ 0.014774265 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%cudf.pandas.line_profile\n",
    "\n",
    "sel=[('county', '=', 'BLACKPOOL')]\n",
    "parquet_df=pd.read_parquet('senior_df.parquet', columns=['age', 'sex', 'county', 'lat', 'long', 'name', 'R'], filters=sel)\n",
    "parquet_df=parquet_df.loc[parquet_df['county']=='BLACKPOOL']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9c67aaf7-4dda-4bf5-9205-fd097d567ea4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['BLACKPOOL'], dtype=object)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parquet_df['county'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "61a225a7-3106-48a3-b100-24c829a6089c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                             Total time elapsed: 0.697 seconds                                     </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                           Stats                                                   </span>\n",
       "<span style=\"font-style: italic\">                                                                                                                   </span>\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃<span style=\"font-weight: bold\"> Line no. </span>┃<span style=\"font-weight: bold\"> Line                                                                     </span>┃<span style=\"font-weight: bold\"> GPU TIME(s) </span>┃<span style=\"font-weight: bold\"> CPU TIME(s) </span>┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">pd</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">read_csv(</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'./senior_df.csv'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, usecols</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'age'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'sex'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">, </span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">…</span> │ 0.535917163 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "│ 3        │ <span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">    df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">=</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">df</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">.</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">loc[df[</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'county'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"color: #ff4689; text-decoration-color: #ff4689; background-color: #272822\">==</span><span style=\"color: #e6db74; text-decoration-color: #e6db74; background-color: #272822\">'BLACKPOOL'</span><span style=\"color: #f8f8f2; text-decoration-color: #f8f8f2; background-color: #272822\">]</span><span style=\"background-color: #272822\">                                </span> │ 0.017859971 │             │\n",
       "│          │ <span style=\"background-color: #272822\">                                                                        </span> │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                             Total time elapsed: 0.697 seconds                                     \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "\u001b[3m                                                           Stats                                                   \u001b[0m\n",
       "\u001b[3m                                                                                                                   \u001b[0m\n",
       "┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n",
       "┃\u001b[1m \u001b[0m\u001b[1mLine no.\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mLine                                                                    \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mGPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mCPU TIME(s)\u001b[0m\u001b[1m \u001b[0m┃\n",
       "┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n",
       "│ 2        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mpd\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mread_csv\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m./senior_df.csv\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34musecols\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mage\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msex\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m,\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m…\u001b[0m │ 0.535917163 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "│ 3        │ \u001b[38;2;248;248;242;48;2;39;40;34m    \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mloc\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mdf\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m[\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mcounty\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m==\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34mBLACKPOOL\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m'\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m]\u001b[0m\u001b[48;2;39;40;34m                                \u001b[0m │ 0.017859971 │             │\n",
       "│          │ \u001b[48;2;39;40;34m                                                                        \u001b[0m │             │             │\n",
       "└──────────┴──────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┘\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%cudf.pandas.line_profile\n",
    "\n",
    "df=pd.read_csv('./senior_df.csv', usecols=['age', 'sex', 'county', 'lat', 'long', 'name', 'R'])\n",
    "df=df.loc[df['county']=='BLACKPOOL']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b1b5fff0-ccfa-47fe-b821-d68c5858f844",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['BLACKPOOL'], dtype=object)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['county'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cdf4ff5-c182-41e5-b17d-3c0cd02dd9d5",
   "metadata": {},
   "source": [
    "## Accelerated ETL for Downstream Tasks ##\n",
    "Accelerating the ETL process is important for data science as it provides the below benefits: \n",
    "* **Timely insights**: Faster ETL allows for more up-to-date data analysis, enabling data scientists to work with the most current information.\n",
    "* **Increased productivity**: Reduced processing time means data scientists can spend more time on analysis and model development rather than waiting for data to be ready.\n",
    "* **Handling larger datasets**: Accelerated ETL processes can manage larger volumes of data more efficiently.\n",
    "* **Cost efficiency**: Accelerated ETL can reduce computational resources and time, leading to lower infrastructure costs.\n",
    "\n",
    "<p><img src='images/etl.png' width=720></p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5105158c-ce64-4b8c-9ee7-6d5d684ea5cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "app = IPython.Application.instance()\n",
    "app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c88554b0-23c6-4316-912e-f962b4c97456",
   "metadata": {},
   "source": [
    "**Well Done!** Let's move to the [next notebook](1-08_dask-cudf.ipynb). "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "391f35d9-6768-4bf3-8d96-b706351c2ad6",
   "metadata": {},
   "source": [
    "<img src=\"./images/DLI_Header.png\" width=400/>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}