{ "cells": [ { "cell_type": "markdown", "id": "_XMDuUoUl5cQ", "metadata": { "id": "_XMDuUoUl5cQ" }, "source": [ " \"Header\" " ] }, { "cell_type": "markdown", "id": "T1WbsSK_0Xqo", "metadata": { "id": "T1WbsSK_0Xqo" }, "source": [ "# Speed Up DataFrame Operations with RAPIDS cuDF" ] }, { "cell_type": "markdown", "id": "-bPAvj4fwjbq", "metadata": { "id": "-bPAvj4fwjbq" }, "source": [ "## Welcome\n", "A **DataFrame** is a 2-dimensional data structure used to represent data in a tabular format, like a spreadsheet or SQL table. Popularized in Python by the [pandas](https://pandas.pydata.org/docs/) data analysis library, DataFrames have become very popular for their familiar representation and their robust set of intuitive, expressive features. \n", "\n", "Raw data often needs to be manipulated before it can be used for purposes such as generating **Business Intelligence**, creating **Dashboard Visualizations**, or training **Machine Learning** models. These preprocessing steps can include **filtering**, **merging**, **grouping**, and **aggregating**. \n", "\n", "Below is a typical data processing pipeline: \n", "

\n", "\n", "According to [studies](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=29f71b266f63), data preparation accounts for ~80% of the work for analysts. This could be due in part to the rapid increase in the size of data as well as the iterative nature of analytics. \n", "\n", "Recognizing this potential bottleneck, NVIDIA created [**cuDF**](https://docs.rapids.ai/api/cudf/stable/), which leverages GPU hardware and software to perform data manipulation tasks with parallel computing, **saving valuable time and resources**. The cuDF library is part of the larger [**RAPIDS**](https://rapids.ai/) data science framework that allows for the execution of **end-to-end analytics pipelines** entirely on GPUs. One focus of cuDF and its companion suite of open source software libraries is to provide syntax that is similar to their CPU counterparts, **making it easy to implement**. \n", "\n", "This notebook demonstrates the speedup in data processing gained by moving common DataFrame operations to the GPU with minimal changes to existing code. " ] }, { "cell_type": "markdown", "id": "ComTzf6gEWwT", "metadata": { "id": "ComTzf6gEWwT" }, "source": [ "### Environment Sanity Check\n", "Check the output of `!nvidia-smi` to make sure you've been allocated a RAPIDS-supported GPU such as Tesla T4, P4, or P100." ] }, { "cell_type": "code", "execution_count": 1, "id": "c58af14d", "metadata": { "id": "c58af14d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tue Feb 3 12:46:12 2026 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. 
ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n", "| N/A 23C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", "| No running processes found |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "id": "GM2FQ7-P8iaF", "metadata": { "id": "GM2FQ7-P8iaF" }, "source": [ "## Interactive Exercise" ] }, { "cell_type": "code", "execution_count": 2, "id": "XKUJgAqC38jR", "metadata": { "id": "XKUJgAqC38jR" }, "outputs": [], "source": [ "import numpy as np # for generating sample data\n", "\n", "import pandas as df # CPU backend; comment this out to run on the GPU\n", "# import cudf as df # GPU backend; uncomment to run on the GPU\n", "import time # for clocking process times\n", "import matplotlib.pyplot as plt # for visualizing results\n", "\n", "class Timer: # creating a Timer helper class to measure execution time\n", "    def __enter__(self):\n", "        self.start=time.perf_counter()\n", "        return self\n", "    def __exit__(self, *args):\n", "        self.end=time.perf_counter()\n", "        self.interval=self.end-self.start # elapsed seconds, available after the with block" ] }, { "cell_type": "markdown", "id": "GjeW2Mdh0huU", "metadata": { "id": "GjeW2Mdh0huU" }, "source": [ "### Loading Sample Data\n", "We start our demonstration by generating two 2-dimensional arrays of random integers; we've sized them at 1 million rows by 50 columns each. 
Then they are converted to DataFrames using ```pandas.DataFrame()``` or ```cudf.DataFrame()```:" ] }, { "cell_type": "code", "execution_count": 3, "id": "RSCUQYModrAd", "metadata": { "id": "RSCUQYModrAd" }, "outputs": [], "source": [ "rows=1000000\n", "columns=50" ] }, { "cell_type": "code", "execution_count": 4, "id": "108eb7cb", "metadata": { "id": "108eb7cb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The loading process took 0.96 seconds\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9 ... a_40 a_41 \\\n", "999995 40 90 16 13 91 95 77 68 48 56 ... 94 28 \n", "999996 77 40 80 84 34 67 95 7 12 12 ... 98 71 \n", "999997 37 22 85 33 44 93 83 96 48 91 ... 12 66 \n", "999998 78 34 30 76 70 53 68 97 64 39 ... 76 56 \n", "999999 52 53 58 10 94 42 66 85 33 34 ... 62 69 \n", "\n", " a_42 a_43 a_44 a_45 a_46 a_47 a_48 a_49 \n", "999995 5 68 50 70 48 76 61 94 \n", "999996 73 5 6 76 66 37 79 43 \n", "999997 90 34 35 61 0 58 70 34 \n", "999998 46 81 75 15 10 47 8 74 \n", "999999 9 95 37 55 65 29 45 82 \n", "\n", "[5 rows x 50 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " b_0 b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 ... b_40 b_41 \\\n", "999995 75 8 66 43 25 78 53 13 0 34 ... 92 80 \n", "999996 51 79 23 23 32 53 96 7 28 60 ... 0 62 \n", "999997 23 13 89 16 36 34 39 48 15 4 ... 16 31 \n", "999998 34 98 32 40 29 38 15 50 34 38 ... 85 45 \n", "999999 87 90 47 25 99 11 92 87 81 45 ... 5 80 \n", "\n", " b_42 b_43 b_44 b_45 b_46 b_47 b_48 b_49 \n", "999995 86 79 19 36 28 75 64 48 \n", "999996 57 29 30 12 47 15 36 75 \n", "999997 48 65 98 22 11 6 53 39 \n", "999998 24 50 63 4 38 66 76 42 \n", "999999 28 99 68 96 38 13 86 73 \n", "\n", "[5 rows x 50 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def load_data(): \n", " data_a=np.random.randint(0, 100, (rows, columns))\n", " data_b=np.random.randint(0, 100, (rows, columns))\n", " dataframe_a=df.DataFrame(data_a, columns=[f'a_{i}' for i in range(columns)])\n", " dataframe_b=df.DataFrame(data_b, columns=[f'b_{i}' for i in range(columns)])\n", " return dataframe_a, dataframe_b\n", "\n", "with Timer() as process_time: \n", " dataframe_a, dataframe_b=load_data()\n", "\n", "print(f'The loading process took {process_time.interval:.2f} seconds')\n", "display(dataframe_a.tail(5))\n", "display(dataframe_b.tail(5))" ] }, { "cell_type": "markdown", "id": "sXlraNW9cl31", "metadata": { "id": "sXlraNW9cl31" }, "source": [ "

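The load step relies on the alias `df` pointing at either pandas or cuDF. As a minimal sketch (not part of the original notebook), the backend name could be chosen at runtime instead of by commenting import lines in and out. The helper name `pick_backend` is hypothetical, and it only checks whether a module is installed; it does not verify that a supported GPU is present.

```python
import importlib.util

def pick_backend(preferred='cudf', fallback='pandas'):
    # Hypothetical helper: return the name of the first installed module,
    # or None when neither candidate can be found.
    for name in (preferred, fallback):
        if importlib.util.find_spec(name) is not None:
            return name
    return None

backend_name = pick_backend()
print(f'Selected backend: {backend_name}')
```

With a name in hand, `importlib.import_module(backend_name)` would return the module to bind to the `df` alias used throughout this notebook.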
\n", "\n", "We created two DataFrames, _dataframe_a_ and _dataframe_b_, that are 1000000 rows by 50 columns each (a_0, a_1, ..., a_49 and b_0, b_1, ..., b_49, respectively). " ] }, { "cell_type": "markdown", "id": "DKYzyh6bxwAB", "metadata": { "id": "DKYzyh6bxwAB" }, "source": [ "### Merging Data\n", "Data often comes from multiple sources and needs to be merged into a single DataFrame with ```DataFrame.merge()```. For example, a typical retail data storage infrastructure may include a customer table and separate transaction and product tables. Merging the data brings the relevant details together in a single DataFrame to get the insight needed. " ] }, { "cell_type": "code", "execution_count": 5, "id": "bAGSwY8qx2DB", "metadata": { "id": "bAGSwY8qx2DB" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The merging process took 1.28 seconds\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9 ... b_40 b_41 b_42 \\\n", "0 49 95 72 58 86 20 54 81 4 52 ... 19 22 90 \n", "1 77 32 47 57 22 98 73 98 25 70 ... 4 91 74 \n", "2 94 15 58 22 52 0 90 48 70 88 ... 48 74 26 \n", "3 98 67 36 76 55 8 48 61 98 77 ... 77 59 36 \n", "4 70 80 35 51 76 14 34 25 99 49 ... 2 99 21 \n", "\n", " b_43 b_44 b_45 b_46 b_47 b_48 b_49 \n", "0 69 28 50 53 17 60 79 \n", "1 1 36 61 8 29 32 87 \n", "2 98 93 15 71 54 97 26 \n", "3 7 62 39 27 14 26 30 \n", "4 18 78 32 3 97 48 23 \n", "\n", "[5 rows x 100 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def merge_data(left_df, right_df):\n", " combined_df=df.merge(left_df, right_df, left_index=True, right_index=True)\n", " return combined_df\n", "\n", "with Timer() as process_time: \n", " combined_df=merge_data(dataframe_a, dataframe_b)\n", "\n", "print(f'The merging process took {process_time.interval:.2f} seconds')\n", "display(combined_df.head())" ] }, { "cell_type": "markdown", "id": "S_1QcS17c3S5", "metadata": { "id": "S_1QcS17c3S5" }, "source": [ "

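To make the index-aligned merge concrete, here is a rough, library-free sketch of an inner join on a shared index. Plain dicts stand in for DataFrames, and the function name `inner_join_on_index` and the sample rows are illustrative only; pandas and cuDF use far more efficient implementations.

```python
def inner_join_on_index(left, right):
    # Keep only index labels present in both inputs and combine their columns.
    joined = {}
    for label, left_row in left.items():
        if label in right:
            joined[label] = {**left_row, **right[label]}
    return joined

left = {0: {'a_0': 49}, 1: {'a_0': 77}, 2: {'a_0': 94}}
right = {0: {'b_0': 19}, 1: {'b_0': 4}, 3: {'b_0': 48}}
merged = inner_join_on_index(left, right)
print(merged)
```

Labels 2 and 3 are dropped because each appears on only one side. In this notebook both DataFrames share the same 0 to 999999 index, so no rows are lost in the merge.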
\n", "\n", "We merged two DataFrames, _dataframe_a_ and _dataframe_b_, on their _index_ into one larger DataFrame that is 1000000 rows by 100 columns (a_0, a_1, ..., b_48, b_49). " ] }, { "cell_type": "markdown", "id": "UhdsvT-gABvZ", "metadata": { "id": "UhdsvT-gABvZ" }, "source": [ "### Summarize\n", "Exploring data begins with **descriptive statistics**, which often involves finding the **central tendency** and **dispersion**. They are a quick way to summarize distributions. Measures of central tendency include the mean, median, and mode; they describe the center of a set of data values. Measures of dispersion include variance and standard deviation; they describe how widely the data is spread around the center. We can quickly compute simple descriptive statistics with the ```DataFrame.describe()``` method. " ] }, { "cell_type": "code", "execution_count": 6, "id": "26a2c5b6", "metadata": { "id": "26a2c5b6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The summarizing process took 4.43 seconds\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " a_0 a_1 a_2 a_3 \\\n", "count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n", "mean 49.511061 49.446406 49.482991 49.493942 \n", "std 28.861918 28.874678 28.882845 28.866755 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 24.000000 24.000000 24.000000 24.000000 \n", "50% 50.000000 49.000000 49.000000 49.000000 \n", "75% 75.000000 74.000000 74.000000 75.000000 \n", "max 99.000000 99.000000 99.000000 99.000000 \n", "\n", " a_4 a_5 a_6 a_7 \\\n", "count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n", "mean 49.483117 49.487522 49.525894 49.504130 \n", "std 28.878626 28.864333 28.860845 28.863316 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 24.000000 24.000000 25.000000 24.000000 \n", "50% 49.000000 49.000000 50.000000 50.000000 \n", "75% 75.000000 74.000000 75.000000 74.000000 \n", "max 99.000000 99.000000 99.000000 99.000000 \n", "\n", " a_8 a_9 ... b_40 b_41 \\\n", "count 1000000.000000 1000000.000000 ... 1000000.000000 1000000.000000 \n", "mean 49.484901 49.483078 ... 49.504033 49.507267 \n", "std 28.858059 28.887034 ... 28.863021 28.858843 \n", "min 0.000000 0.000000 ... 0.000000 0.000000 \n", "25% 24.000000 24.000000 ... 25.000000 25.000000 \n", "50% 49.000000 49.000000 ... 50.000000 50.000000 \n", "75% 74.000000 75.000000 ... 75.000000 74.000000 \n", "max 99.000000 99.000000 ... 
99.000000 99.000000 \n", "\n", " b_42 b_43 b_44 b_45 \\\n", "count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n", "mean 49.537563 49.450110 49.486217 49.471796 \n", "std 28.858617 28.852121 28.847583 28.859198 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 25.000000 24.000000 25.000000 24.000000 \n", "50% 50.000000 49.000000 49.000000 49.000000 \n", "75% 75.000000 74.000000 74.000000 74.000000 \n", "max 99.000000 99.000000 99.000000 99.000000 \n", "\n", " b_46 b_47 b_48 b_49 \n", "count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n", "mean 49.499426 49.510149 49.476653 49.512459 \n", "std 28.855159 28.852837 28.840443 28.867724 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 25.000000 25.000000 25.000000 24.000000 \n", "50% 49.000000 49.000000 49.000000 50.000000 \n", "75% 74.000000 74.000000 74.000000 74.000000 \n", "max 99.000000 99.000000 99.000000 99.000000 \n", "\n", "[8 rows x 100 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def summarize(dataframe):\n", " summary_df=dataframe.describe()\n", " return summary_df\n", "\n", "with Timer() as process_time: \n", " summary_df=summarize(combined_df)\n", "\n", "print(f'The summarizing process took {process_time.interval:.2f} seconds')\n", "display(summary_df)" ] }, { "cell_type": "markdown", "id": "KPz54wMldInX", "metadata": { "id": "KPz54wMldInX" }, "source": [ "

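The mean and standard deviation reported by `describe()` can be checked against theory. Assuming each value is drawn uniformly from the integers 0 through 99 (which is what `np.random.randint(0, 100, ...)` produces), the following sketch computes the expected statistics with the standard library only.

```python
import statistics

values = range(100)  # every integer that randint(0, 100) can return
expected_mean = statistics.mean(values)
expected_std = statistics.pstdev(values)  # population standard deviation
print(f'mean={expected_mean:.2f}, std={expected_std:.2f}')
```

The results, about 49.5 and 28.87, line up with the `mean` and `std` rows in the output. `describe()` reports the sample standard deviation, but with one million rows the difference from the population value is negligible.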
\n", "\n", "Since this is a sample data set of uniformly distributed integers, each of the columns/features (a_0, a_1, ..., b_48, b_49) has 1000000 values with a mean of ~49.5 and a standard deviation of ~28.9. " ] }, { "cell_type": "markdown", "id": "w7N64bdRAclS", "metadata": { "id": "w7N64bdRAclS" }, "source": [ "### Correlation - Exploring Relationships\n", "We might be interested in finding relationships/dependencies between two or more variables through their correlation with ```DataFrame.corr()```. Correlation is a number between -1 and 1 that describes the strength and direction of the linear association between two variables. A correlation of 1 indicates that two variables change together in the same direction, while a correlation of -1 indicates that they change together in opposite directions. A correlation near 0 indicates little or no linear association. " ] }, { "cell_type": "code", "execution_count": 7, "id": "2538ccdd", "metadata": { "id": "2538ccdd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The correlation process took 23.03 seconds\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " a_0 a_1 a_2 a_3 a_4 a_5 a_6 \\\n", "a_0 1.000000 -0.001281 -0.001831 0.000314 -0.001505 -0.001247 0.000018 \n", "a_1 -0.001281 1.000000 0.000033 -0.000212 0.000829 0.001192 -0.000334 \n", "a_2 -0.001831 0.000033 1.000000 -0.000632 -0.001345 -0.000222 -0.000713 \n", "a_3 0.000314 -0.000212 -0.000632 1.000000 0.002325 -0.001373 -0.000923 \n", "a_4 -0.001505 0.000829 -0.001345 0.002325 1.000000 -0.000842 -0.000515 \n", "\n", " a_7 a_8 a_9 ... b_40 b_41 b_42 \\\n", "a_0 0.000585 0.000628 -0.001549 ... 0.002055 0.000192 0.001457 \n", "a_1 -0.000551 -0.000434 -0.000121 ... -0.000726 0.000390 0.000088 \n", "a_2 -0.001515 -0.000810 -0.000193 ... 0.000263 0.000430 -0.000263 \n", "a_3 -0.000373 0.000230 -0.000529 ... 0.000448 0.000080 -0.000237 \n", "a_4 -0.000127 -0.000170 -0.000975 ... 0.001551 -0.000489 -0.000425 \n", "\n", " b_43 b_44 b_45 b_46 b_47 b_48 b_49 \n", "a_0 -0.000054 -0.000405 -0.001055 0.000768 0.001139 0.000209 -0.000652 \n", "a_1 0.000975 -0.000287 0.001054 0.000370 0.000552 0.000185 0.001366 \n", "a_2 -0.000569 0.001625 -0.000449 -0.001388 -0.000414 0.001550 -0.000436 \n", "a_3 -0.000018 -0.000217 -0.000565 0.000607 0.000945 -0.000555 -0.000179 \n", "a_4 0.000450 0.000633 0.000267 0.000340 0.000945 0.000047 -0.000825 \n", "\n", "[5 rows x 100 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def correlation(dataframe): \n", " corr_df=dataframe.corr()\n", " return corr_df\n", "\n", "with Timer() as process_time: \n", " corr_df=correlation(combined_df)\n", "\n", "print(f'The correlation process took {process_time.interval:.2f} seconds')\n", "display(corr_df.head())" ] }, { "cell_type": "markdown", "id": "uaiK9t2CdgFS", "metadata": { "id": "uaiK9t2CdgFS" }, "source": [ "

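For intuition about the values in the correlation matrix, here is a small from-scratch sketch of the Pearson correlation coefficient, the default method of `DataFrame.corr()` in pandas. The function name `pearson` and the toy inputs are illustrative; this is not how pandas or cuDF compute it internally.

```python
import math

def pearson(x, y):
    # Pearson correlation: co-variation divided by the product of the spreads.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
same_direction = pearson(x, [2, 4, 6, 8, 10])      # y rises with x
opposite_direction = pearson(x, [10, 8, 6, 4, 2])  # y falls as x rises
print(same_direction, opposite_direction)
```

The first pair yields a value of approximately 1 and the second approximately -1, up to floating-point rounding. Independent random columns, like the ones in this notebook, land close to 0.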
\n", "\n", "The resulting correlation matrix shows that each column/feature (a_0, a_1, ..., b_48, b_49) has a perfect correlation (1) with itself and is essentially uncorrelated (~0) with every other column. " ] }, { "cell_type": "markdown", "id": "1j1Y3Y_kBYyY", "metadata": { "id": "1j1Y3Y_kBYyY" }, "source": [ "### Grouping\n", "We can compare subsets of the data to explore the significance of categories and classes with the ```DataFrame.groupby()``` method. We can even group continuous data values into a smaller number of bins with ```pandas.cut()``` or ```cudf.cut()``` to simplify our analysis. Grouping is usually followed by an aggregation such as mean or count. For example, we can group our data into 5 equal-width bins based on their sequential index. " ] }, { "cell_type": "code", "execution_count": 8, "id": "d050021a", "metadata": { "id": "d050021a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The grouping process took 1.04 seconds\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " a_0 a_1 a_2 a_3 a_4 a_5 \\\n", "0 49.529515 49.518860 49.389220 49.522915 49.456145 49.506795 \n", "1 49.533135 49.434285 49.548565 49.473550 49.587670 49.450590 \n", "2 49.463990 49.423480 49.525750 49.559600 49.556430 49.513570 \n", "3 49.486165 49.432445 49.528215 49.410420 49.378440 49.530500 \n", "4 49.542500 49.422960 49.423205 49.503225 49.436900 49.436155 \n", "\n", " a_6 a_7 a_8 a_9 ... b_40 b_41 \\\n", "0 49.566935 49.514930 49.490685 49.438430 ... 49.417695 49.457570 \n", "1 49.433595 49.538200 49.483115 49.520890 ... 49.490945 49.531120 \n", "2 49.490225 49.483775 49.453115 49.368385 ... 49.578650 49.550740 \n", "3 49.538760 49.491275 49.552640 49.550090 ... 49.437585 49.481485 \n", "4 49.599955 49.492470 49.444950 49.537595 ... 49.595290 49.515420 \n", "\n", " b_42 b_43 b_44 b_45 b_46 b_47 b_48 \\\n", "0 49.552260 49.370180 49.548660 49.543460 49.50539 49.483245 49.467315 \n", "1 49.535955 49.465815 49.445720 49.393615 49.54800 49.607035 49.514965 \n", "2 49.506965 49.421825 49.542365 49.439395 49.43738 49.508520 49.517670 \n", "3 49.605755 49.498830 49.488715 49.535180 49.49352 49.529620 49.429200 \n", "4 49.486880 49.493900 49.405625 49.447330 49.51284 49.422325 49.454115 \n", "\n", " b_49 \n", "0 49.504140 \n", "1 49.483715 \n", "2 49.505025 \n", "3 49.604420 \n", "4 49.464995 \n", "\n", "[5 rows x 100 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def groupby_summarize(dataframe):\n", " dataframe['group']=dataframe.index\n", " dataframe['group']=df.cut(dataframe['group'], 5)\n", " group_describe_df=dataframe.groupby('group').mean().reset_index(drop=True)\n", " return group_describe_df\n", "\n", "with Timer() as process_time: \n", " group_describe_df=groupby_summarize(combined_df)\n", "\n", "print(f'The grouping process took {process_time.interval:.2f} seconds')\n", "display(group_describe_df)" ] }, { "cell_type": "markdown", "id": "LdVGBVr9e_o8", "metadata": { "id": "LdVGBVr9e_o8" }, "source": 
[ "

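The binning and aggregation step can be illustrated with a simplified, library-free sketch: assign each index to one of 5 equal-width bins, then average a column within each bin. The function name `equal_width_bin` and the ten sample values are made up for illustration. Note that `pandas.cut` uses right-closed intervals and pads the range slightly, so edge values may be assigned differently than in this sketch.

```python
def equal_width_bin(value, low, high, bins):
    # Map a value in [low, high] to a bin number in 0..bins-1.
    width = (high - low) / bins
    return min(int((value - low) / width), bins - 1)

index = list(range(10))
column = [40, 77, 37, 78, 52, 49, 94, 98, 70, 23]  # made-up sample values

# Collect the column values that fall into each index bin.
groups = {}
for i, v in zip(index, column):
    groups.setdefault(equal_width_bin(i, 0, 9, 5), []).append(v)

# Aggregate each bin with a mean, mirroring groupby(...).mean().
group_means = {g: sum(vs) / len(vs) for g, vs in sorted(groups.items())}
print(group_means)
```

Each bin here holds two consecutive index positions, just as each of the five groups in the notebook holds 200000 consecutive rows.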
\n", "\n", "The resulting DataFrame shows that each group maintains an average of ~49.5 for each column/feature (a_0, a_1, ..., b_48, b_49), as expected for this sample data. " ] }, { "cell_type": "markdown", "id": "b-9gbIriKa85", "metadata": { "id": "b-9gbIriKa85" }, "source": [ "### Putting It Together\n", "We can measure the elapsed time of each step in this sample data processing workflow and plot the total as a stacked bar. " ] }, { "cell_type": "code", "execution_count": 10, "id": "HMLKNN_RPB0c", "metadata": { "id": "HMLKNN_RPB0c" }, "outputs": [], "source": [ "def pipeline():\n", "    performance={} # elapsed seconds per step\n", "    with Timer() as process_time: \n", "        dataframe_a, dataframe_b=load_data()\n", "    performance['load data']=process_time.interval\n", "    with Timer() as process_time: \n", "        combined_df=merge_data(dataframe_a, dataframe_b)\n", "    performance['merge data']=process_time.interval\n", "    with Timer() as process_time: \n", "        summarize(combined_df)\n", "    performance['summarize']=process_time.interval\n", "    with Timer() as process_time: \n", "        correlation(combined_df)\n", "    performance['correlation']=process_time.interval\n", "    with Timer() as process_time: \n", "        groupby_summarize(combined_df)\n", "    performance['groupby & summarize']=process_time.interval\n", "    # plot the per-step timings as one stacked bar, labeled by backend\n", "    if df.__name__=='cudf': \n", "        df.DataFrame([performance], index=['gpu']).to_pandas().plot(kind='bar', stacked=True)\n", "    else: \n", "        df.DataFrame([performance], index=['cpu']).plot(kind='bar', stacked=True)\n", "    return None" ] }, { "cell_type": "markdown", "id": "csfRLkjsc2v8", "metadata": { "id": "csfRLkjsc2v8" }, "source": [ "### Timing the Pipeline on CPU" ] }, { "cell_type": "code", "execution_count": 11, "id": "8DcmBph9cyjm", "metadata": { "id": "8DcmBph9cyjm" }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXAAAAEACAYAAACqOy3+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAg90lEQVR4nO3de3zPdf/H8cd7M2ZOTaaIjOsawzAzp1SONclo5VC5hBISpdMVEdevVukkXEiUQ0U5l0oqOriSYmMYI6lp5GJzyrkd3r8/9rXLbPOdHcxne95vt922z3vvz+fz2jc99977+/m8P8Zai4iIOI9HURcgIiJ5owAXEXEoBbiIiEMpwEVEHEoBLiLiUKUu58mqVKli/f39L+cpRUQcLzo6Osla63dh+2UNcH9/f6Kioi7nKUVEHM8Ysye7dk2hiIg4lAJcRMShFOAiIg6lABcRcSgFuIiIQynARUQcSgEuIuJQCnAREYdSgIuIONRlvRNTCtbUIV8XdQki2Xp4eoeiLqFE0AhcRMShFOAiIg6lABcRcSgFuIiIQ7kNcGOMtzFmvTFmszFmmzHm/1ztlY0xXxljdrk++xZ+uSIick5uRuBngQ7W2iZAMNDZGNMKGAmsttYGAKtd2yIicpm4vYzQWmuBE65NL9eHBboD7Vztc4FvgacLvELJUYdvHy7qEkRyEFfUBZQIuZoDN8Z4GmNigIPAV9ban4BrrLX7AVyfq+aw7yBjTJQxJioxMbGAyhYRkVwFuLU21VobDNQAWhhjgnJ7AmvtDGttqLU21M8vyyPdREQkjy7pKhRr7VHSp0o6AweMMdUAXJ8PFnRxIiKSs9xcheJnjLnK9XVZoBOwA1gO9HN16wd8XEg1iohINnKzFko1YK4xxpP0wF9orf3UGLMOWGiMeQD4HehZiHWKiMgFcnMVyhagaTbth4COhVGUiIi4pzsxRUQcSgEuIuJQCnAREYdSgIuIOJQCXETEoRTgIiIOpQAXEXEoBbiIiEMpwEVEHEoBLiLiUApwERGHUoCLiDiUAlxExKEU4CIiDqUAFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShFOAiIg6lABcRcSgFuIiIQ7kNcGNMTWPMN8aYOGPMNmPMo672fxlj9hljYlwfXQq/XBEROadULvqkAE9YazcaYyoA0caYr1zfe8Na+1rhlSciIjlxG+DW2v3AftfXx40xccB1hV2YiIhc3CXNgRtj/IGmwE+upmHGmC3GmFnGGN8c9hlkjIkyxkQlJibmr1oREcmQ6wA3xpQHlgAjrLV/Am8CfwOCSR+hv57dftbaGdbaUGttqJ+fX/4rFhERIJcBbozxIj2851lrlwJYaw9Ya1OttWnATKBF4ZUpIiIXys1VKAZ4B4iz1k44r73aed0igNiCL09ERHKSm6tQ2gB9ga3GmBhX2zPAPcaYYMAC8cDgQqhPRERykJurUL4HTDbfWlHw5YiISG7pTkwREYdSgIuIOJQCXETEoRTgIiIOpQAXEXEoBbiIiEMpwEVEHEoBLiLiUApwERGHUoCLiDiUAlxExKEU4CIiDpWb1QhFxA1bsSKpQwZja9YED42L4uLiiroER/L29qZGjRp4eXnlqr8CXKQApA4ZzNXBwVzl5UX6EvolW9n69Yu6BMex1nLo0CH27t1L7dq1c7WPhgoiBcDWrKnwlnwxxnD11Vdz5syZXO+jABcpCB4eCm/Jt0v9N6QAFxFxKM2BixSC+u/vKdDjxf2jlts+fi1akLh+fb7PFTltGuV9fBjRv3++znf06FHmz5/P0KFD812TZE8jcBEpFEePHmXatGlFXUaxpgAXKWastTzz+uuERkTQPCKCxStXAnDi1Cm6DBxI6169aB4RwSdff52xz8szZtAkPJzbBw5kV3x8tseN37uXdn36cOPdd/N///53RntOxx05ciS7d+8mODiYp556ihMnTtCxY0dCQkJo1KgRH3/8ceG9CCW
EplBEipmPV61iy44d/LR4MUlHjnDTPffQplkz/Hx9+XDiRCqWL0/SkSO069OHru3bs2n7dhZ//jnrFi4kJTWVG3r1ommDBlmO++TLL/Ng79706daN6R98kNHuXbp0luP2HD6c8ePHExsbS0xMDAApKSksW7aMihUrkpSURKtWrejWrZve/M0HBbhIMfPDxo307NIFT09PrqlShZtCQ4mOjSXsxhsZN2kSa6OjMR4e/HHwIAcOHeKHjRsJ79gRn7JlAbi9Xbtsj/vjpk18MGECAPeGh/PsG28A6SP+LMc9cCDL/tZannnmGdasWYOHhwf79u3jwIEDXHvttYXzQpQACnCRYsbm0P7hZ5+RdOQIaxcswMvLi8CwMM6ePQvk/vK17Ppld9zsrmWeN28eiYmJREdH4+Xlhb+//yVd8yxZaQ5cpJhp06wZS1auJDU1lcTDh/k+OprQRo3488QJ/CpXxsvLi+/Wr+f3P/7I6P/J6tWcPnOG4ydPsuK777I9bqumTVn0+edAemifk9NxK1SowPHjxzP6HTt2jKpVq+Ll5cU333zDnj0Fe6VOSeR2BG6MqQm8C1wLpAEzrLWTjDGVgQWAPxAP9LLWHim8UkWcIzeX/RWW7h07sn7zZlr26IEBXnj8ca6tUoXet99Oj2HDaNO7N40DA6nnul27aYMG3NW5M6169uT6atW4ISQk2+O+9vTT9H/6aabOm8cdnTpltOd03Kuvvpo2bdoQFBTEbbfdxtNPP014eDihoaEEBwcTGBhY6K9FcWeszekPLlcHY6oB1ay1G40xFYBo4A6gP3DYWjveGDMS8LXWPn2xY4WGhtqoqKgCKVwgLlDrTVwpkqdOIeCaa4q6jCtG2aCgoi7BseLi4qh/wVoyxphoa23ohX3dTqFYa/dbaze6vj4OxAHXAd2Bua5uc0kPdRERuUwuaQ7cGOMPNAV+Aq6x1u6H9JAHquawzyBjTJQxJioxMTGf5YqIyDm5DnBjTHlgCTDCWvtnbvez1s6w1oZaa0P9/PzyUqOIiGQjVwFujPEiPbznWWuXupoPuObHz82THyycEkVEJDtuA9ykX/j5DhBnrZ1w3reWA/1cX/cDdF+siMhllJsbedoAfYGtxpgYV9szwHhgoTHmAeB3oGehVCgiItlyG+DW2u+BnG7T6liw5YgUD2UXtynQ453usbZAj3c5xcfH07VrV2JjYy/a54cffuDee++9jJU5n+7EFJFMUlJSLvs54+PjmT9//mU/r9NpLRSRYmDPvn10HzKE1iEhbNiyhUZ169L3jjuInDaNxMOHmTV+PM0bNeLkqVM8/tJLbNu1i5TUVEY/9BDhHTrw3kcfsXLNGs789RenTp9myZQpDBozhp9/+416deqw548/eGP0aJo1bMiqH34gcupUziYnU6dGDd6KjKS8j0+meqKjo7n//vvx8fHhxhtvzGiPj4+nb9++nDx5EoApU6Zwww03MHLkSOLi4ggODqZfv35ERERk208y0whcpJjYnZDAw336sH7JEnb+9hsLVqxg9bvv8uITT/DqzJkAvDxzJu1atOD7Dz9k5TvvMHrCBE6eOgXAT5s3M/OFF/j8nXeYsWABV1WsyPqlSxk5eDCbtm8HIOnIEV5+6y0+mzmTdQsXEtKwIZPnzs1Sy4ABA5g8eTLr1q3L1F61alW++uorNm7cyIIFC3jkkUcAGD9+PDfddBMxMTE89thjOfaTzDQCFykm/K+7jqC6dQFo8Pe/075lS4wxBAUEsMe1wNTqH35gxbffMtEVumfOniXhv/8FoEPr1lSuVAlIX5L24X/8A4CGAQEZx12/ZQs7fv2VDvfdB0BycjItmjTJVMex48c5evQobdu2BaBv37587loEKzk5mWHDhhETE4Onpyc///xztj9LbvuVdApwkWKiTOnSGV97GJOx7eHhQWpqKpC+Jvf8CROo61pw6pwNW7ZQzrUe+Ll+2bHW0qF1a+a+8kqOdVhrc1ye9o033uC
aa65h8+bNpKWl4e3tna9+JZ2mUERKkE5t2vDm/PkZAR0TF5dtvxtCQljyxRcAxO3ezbZduwBo0bgx6zZtYvfvvwNw6vTpLI9gu6piRSpVqsT3338PpK8Dfs6xY8eoVq0aHh4evPfeexm/WLJbeja7fpKZRuAiheBKvexv1ODBPPXyy7S4804scH316iydOjVLv0G9e/PgmDG0uPNOmtSvT1BAAJXKl8evcmVmREbS75//5K+//gJg7PDhBPj7Z9p/9uzZGW9ihoWFZbQPHTqUu+66i0WLFtG+fXvKlSsHQOPGjSlVqhRNmjShf//+OfaTzNwuJ1uQtJxswdJysleO4racbGpqKskpKXiXKcOvCQl0GTiQLZ9+Smkvr1ztr+Vk8+5SlpPVCFxEsjh15gyd77+flJQUrLVMGjMm1+Etl48CXESyqFCuHGsXLCjqMsQNvYkpIuJQCnAREYdSgIuIOJQCXETEofQmpkghaBF9T4Eeb32zDwr0eEWlS5cuzJ8/n6uuuqqoSykWFOAiUuistVhrWbFiRVGXUqxoCkWkmDh56hQRQ4fS8q67CI2IYPHKlQSGhZF05AgA0du2ETZgAACR06bx4OjRhA8aRGBYGB+tWsXoCRNoHhFBtyFDSE5OBiAwLIyxkybRrk8f2vTuzabt2+k2eDANb7uNmQsXAnDi1Cm6DBxI6169aB4RwSdffw2kLx1bv359hg4dSkhICAkJCfj7+5OUlMT06dMJDg4mODiY2rVr0759ewC+/PJLWrduTUhICD179uTEiROX+2V0FAW4SDHx1dq1VKtalZ+WLCFq2TJuaXPxpwL9mpDA0qlTWTh5Mg+MGsXNzZuzYdkyypYpw+dr1mT0q3HttXw7bx5tQkIYPGYM8yZM4Nt584h03YLvXbo0H06cyLqFC/l81ixGvfZaxlorO3fu5L777mPTpk3UqlUr45hDhgwhJiaGDRs2UKNGDR5//HGSkpKIjIxk1apVbNy4kdDQUCZMmIDkTFMoIsVEw4AARr3+OmMmTOC2tm1p06zZRfvfeuONeHl5ERQQQGpqKre6HrzQMCCA313LzwLc3q5denvdupw4fZoK5cpRoVw5ypQuzdE//6Rc2bKMmzSJtdHRGA8P/jh4kAMHDgBQq1YtWrVqlWMNjz76KB06dCA8PJxPP/2U7du308b1i+evv/6idevW+XlJij0FuEgxEeDvz9oFC/hizRrGTppEx9atKeXpSVpaGgBnz57N1P/85Wa9SpXKWALWw8ODlPNW/8voZwxlzrud/ly/Dz/7jKQjR1i7YAFeXl4EhoVx5swZgIsuQjVnzhz27NnDlClTgPR58ltuuYUPPigeb9heDppCESkm/jh4EB9vb+4JD+fRfv2IiYujVvXqGU/T+eirrwrlvH+eOIFf5cp4eXnx3fr1mUbvOYmOjua1117j/fffx8MjPYZatWrF2rVr+eWXXwA4deqUHuTghkbgIoWgKC7727ZrF6Nffx3jGlFPevZZzpw5w0PjxvHq22/TvFGjQjlv79tvp8ewYbTp3ZvGgYHUu+BhEdmZMmUKhw8fznjzMjQ0lLfffps5c+Zwzz33ZPy1EBkZSV3X04AkKy0n62BaTvbKUdyWk80vLSebd5eynKymUEREHMptgBtjZhljDhpjYs9r+5cxZp8xJsb10aVwyxQRkQvlZgQ+B+icTfsb1tpg14durxIRuczcBri1dg1w+DLUIiIilyA/c+DDjDFbXFMsvjl1MsYMMsZEGWOiEhMT83E6ERE5X14D/E3gb0AwsB94PaeO1toZ1tpQa22on59fHk8nIiIXytN14NbaA+e+NsbMBD4tsIpEioH4Hj0L9Hj+ixcV6PHyKnLaNMr7+DCif/8c+yxfvZpGHh40aNAAgLFjx3LzzTfTqVOny1RlyZGnADfGVLPW7ndtRgCxF+svIleulJQUSpUqleP2pfr066/xvO66jAB/7rnn8l2jZM/tfyVjzAdAO6CKMWY
vMA5oZ4wJBiwQDwwuvBJFJLfmLV/OpDlzMMYQVLcu44YPZ8jYsSQdPkyVypV56/nnqVmtGoNGj8a3UiU279hBcP36HD56NNP2oLvvZsQLL5B0+DA+Zcsyddw46tWpk+lcsxYvZtbixSQnJ1Pn+ut558UX2bJzJ599+y3fb9lCZGQkS5Ys4fnnn6dr16706NGD1atX8+STT5KSkkLz5s158803KVOmDP7+/vTr149PPvmE5ORkFi1aRGBgYBG9is7hNsCttdk9WuSdQqhFRPJh+y+/8MrMmax+912q+Ppy+NgxHhw9mnvDw/lH9+7MXbaMJ156iYWTJwOwa88ePps5E09PTwaNHp1pu8vAgUx+9ln+XqsW67dsYcQLL/D5O5n/t+/eqRP39+gBwL8mT2bu0qU81KcPt7drR/e+fenh+t45Z86coX///qxevZq6dety33338eabbzJixAgAqlSpwsaNG5k2bRqvvfYab7/9duG/aA6nOzFFionvfvqJO265hSq+6ReFVa5UifWbN9O7S/p9dvd27cq6TZsy+t956614enpm2T5x6hQ/xsTQ54knaNmjB8Ofe47/ZnMF2fZdu+jUrx/NIyJYsGIF23fvvmh9O3fupHbt2hlrm/Tr14815607fueddwLQrFkz4uPj8/YilDBazEqkmLCAcdPn3JKxAOXKls30vXPbaWlpVKpQgZ8WL77osQY9+ywLJk2icb16vPfRR/xnw4aL1+dm3aUyZcoA4OnpSUpKykX7SjqNwEWKiXYtW7L0yy85dPQoAIePHaNlcDCLVq4E4MPPPqN106Zuj1OxfHn8r7uOpV98AaQH75adO7P0O3HyJNdWqUJycjILPvsso718uXIcP348S//AwEDi4+Mzlot97733aNu27SX/nPI/GoGLFIKiuOyvwd//zj8ffJCwAQPw9PCgSWAgr48cyZCxY5k4e3bGm5i5MXv8eB6JjOTlGTNITkmhR+fONK5XL1OfZ4cNo22fPlxfrRoNAwI4cfIkAD1vu41hL73E5MmTWXzeKN7b25vZs2fTs2fPjDcxhwwZUnAvQAmk5WQdTMvJXjm0nGxmWk4277ScrIhICaAAFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh9J14CKFYNaUgwV6vPuHVS3Q4xWENRs2MHHOHJZOnZqn/c+ePUvv3r3ZvXs3pUqVYsmSJdS5YMEsJxk4cCCPP/54xiqMl4MCXKQEye9SsQVp4cKFVKpUia1bt3LkyJFMt/k7TWpqapEsvqUpFJFi4qXp0wkOD6frgw/S75//ZOKcOQCEDRjA2EmTuLV/f6bOm8c3P/5Iq549aR4RweBnn+XsX38BEBgWRtKRIwBEb9tG2IABQPpDHB4YNYrbHniARrffzqzz7q48fuIEvR99lJDu3Rn+3HOkpaUxZ+lSHnvssYw+M2fO5PHHH89Sb+nSpdm3bx/WWnx9fbnqqquy/blSU1Pp378/QUFBNGrUiDfeeAOAdu3ace7GwKSkJPz9/QGYM2cOd9xxB+Hh4dSuXZspU6YwYcIEmjZtSqtWrTh8+HDG/o899hg333wz9evXZ8OGDdx5550EBAQwZsyYjPPfcccdNGvWjIYNGzJjxoyM9vLlyzN27FhatmzJunXrMupZvnw5wcHBBAcHU69ePWrXrp3+mkZH07ZtW5o1a0ZYWBj79+8nvxTgIsVA9LZtfLRqFesWLeKDiRPZuG1bpu8fO36cL+fMYfDddzNozBjee/VVNixbRmpqKjMXLHB7/Niff2bp1Kl88/77vDR9On8cTJ8iioqNZfyTT7Jh6VJ+S0jg41Wr6Nm5M8uXLyc5ORmA2bNnM8D1y+B8derUITo6mlGjRl303DExMezbt4/Y2Fi2bt2a7bGy1Bsby/z581m/fj2jR4/Gx8eHTZs20bp1a959992MfqVLl2bNmjUMGTKE7t27M3XqVGJjY5kzZw6HDh0CYNasWURHRxMVFcXkyZMz2k+
ePElQUBA//fQTN954Y8Yxu3XrRkxMDDExMTRp0oQnn3yS5ORkhg8fzuLFi4mOjub+++9n9OjRbn8OdxTgIsXAuo0b6dq+PWW9valQrhxdLlgkqkdYGAA/x8fjf911BLhGq326deP76Gi3x7/ddewqvr60bdGCqK1bAQgNCqJ2zZp4enrSs0sXfti0iXI+PnTo0IFPP/2UHTt2kJycTKNGjTId7/Tp0/Tv359t27YRExPDxIkTAejSpQvbLvjlU6dOHX799VeGDx/OypUrqVixott627dvT4UKFfDz86NSpUqEh4cD0KhRo0xL1Xbr1i2jvWHDhlSrVo0yZcpQp04dEhISAJg8eTJNmjShVatWJCQksGvXLiB91cS77rorxxpeeeUVypYty8MPP8zOnTuJjY3llltuITg4mMjISPbu3ev253DnypgME5F8cbemkY+Pj9t+pTw9SUtLA9LfYDzfhfPT57aztLs+Dxw4kBdffJHAwMBsR8xbt27Fz8+P6tWrs2TJEjp16oQxhqNHj2Z5E9DX15fNmzfzxRdfMHXqVBYuXMisWbMoVapURr1nzpzJtM+5pWkBPDw8MrY9PDwyLVV7fvuF+6SkpPDtt9+yatUq1q1bh4+PD+3atcs4l7e3d6b11M+3evVqFi1alLHeubWWhg0bsm7dumz755VG4CLFQOuQEFZ89x1nzp7lxKlTrPzPf7LtV692bfb88Qe7f/8dgA8++YSbQtPXSKpVvTqbtm8H4KOvvsq036fffMOZs2c5dPQoazZsoJlrsaqo2Fji9+4lLS2NJStXckNICAAtW7YkISGB+fPnc889WR/qFRAQwI4dO9i2bRvlypXjnXfe4amnnqJbt25ZfikkJSWRlpbGXXfdxfPPP8/GjRsB8Pf3J9r118NiN2uX59WxY8fw9fXFx8eHHTt28OOPP7rdZ8+ePQwdOpSFCxdS1rXGer169UhMTMwI8OTk5Cx/aeSFRuAiheByX/YXGhTE7e3a0bJHD66vVo2QBg2oWL58ln7eZcrw1vPP0+eJJ0hNSSEkKIiBvXoB8MxDD/HQuHG8+vbbNL9gyiM0KIg7H36YhP37GTl4MNWrVuWXPXto2aQJz06cyLZdu2jTrBndOnbM2KdXr17ExMTg63pC0Pl8fX2ZO3cuffv2xVpLpUqVmDdvHqNGjeLmm2/mhhtuyOi7b98+BgwYkDHafumllwB48skn6dWrF++99x4dOnTI/4uYjc6dOzN9+nQaN25MvXr1aNWqldt9zs2fR0REAFC9enVWrFjB4sWLeeSRRzh27BgpKSmMGDGChg0b5qs+LSfrYFpO9spxJSwne+LUKcr7+HDq9Glu6d+fKePG0bQArkmOnDaN8j4+jOjfP9f7lA0KomvXrjz22GN0PC/Uxb1LWU5WI3CRYmLYv/5F3K+/cvbsWfp0714g4Z0XR//8kyZ169KkSROFdyFTgIsUE3NeeaVQjjtm6NBL6n9VxYr8/PPPhVKLZKY3MUUKQlqa2ytBRNy51H9DCnCRAmASEjianKwQlzyz1nLo0CG8vb1zvY/bKRRjzCygK3DQWhvkaqsMLAD8gXigl7X2SB5qFikWPKe/xaEhg0mqWRM8NC7yyuH6aLk4b29vatSokev+uZkDnwNMAd49r20ksNpaO94YM9K1/fQl1ClSrJg//6TUK68WdRlXjPo74oq6hBLB7VDBWrsGOHxBc3dgruvrucAdBVuWiIi4k9e/9a6x1u4HcH3O8a4FY8wgY0yUMSYqMTExj6cTEZELFfpknbV2hrU21Fob6ufnV9inExEpMfIa4AeMMdUAXJ8L9vEjIiLiVl4DfDnQz/V1P+DjgilHRERyy22AG2M+ANYB9Ywxe40xDwDjgVuMMbuAW1zbIiJyGbm9jNBam3UtyHRa5EBEpAjpjgMREYdSgIuIOJQCXETEoRTgIiIOpQAXEXEoBbiIiEMpwEVEHEoBLiLiUApwERGHUoCLiDiUAlxExKE
U4CIiDqUAFxFxKAW4iIhD5eap9HKF6jVK//nkyrS1qAsoITQCFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShFOAiIg6lABcRcSgFuIiIQ+XrXmxjTDxwHEgFUqy1oQVRlIiIuFcQi2m0t9YmFcBxRETkEmgKRUTEofIb4Bb40hgTbYwZlF0HY8wgY0yUMSYqMTExn6cTEZFz8hvgbay1IcBtwMPGmJsv7GCtnWGtDbXWhvr5+eXzdCIick6+Atxa+4fr80FgGdCiIIoSERH38hzgxphyxpgK574GbgViC6owERG5uPxchXINsMwYc+448621KwukKhERcSvPAW6t/RVoUoC1iIjIJdBlhCIiDqUAFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShFOAiIg6lABcRcSgFuIiIQynARUQcSgEuIuJQCnAREYcqiKfSSxHZ+tvvRV2CiBQhjcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShdBWKg/mfmV/UJYhkK76oCyghNAIXEXEoBbiIiEMpwEVEHCpfAW6M6WyM2WmM+cUYM7KgihIREffyHODGGE9gKnAb0AC4xxjToKAKExGRi8vPCLwF8Iu19ldr7V/Ah0D3gilLRETcyc9lhNcBCedt7wVaXtjJGDMIGOTaPGGM2ZmPc4oUlipAUlEXUVyYl4u6gmKnVnaN+Qlwk02bzdJg7QxgRj7OI1LojDFR1trQoq5D5FLkZwplL1DzvO0awB/5K0dERHIrPwG+AQgwxtQ2xpQG7gaWF0xZIiLiTp6nUKy1KcaYYcAXgCcwy1q7rcAqE7m8NM0njmOszTJtLSIiDqA7MUVEHEoBLiLiUApwERGHUoCLiDiUHuggJZYx5ubs2q21ay53LSJ5oatQpMQyxnxy3qY36ev7RFtrOxRRSSKXRCNwKbGsteHnbxtjagKvFFE5IpdMc+Ai/7MXCCrqIkRySyNwKbGMMf/mfwuweQDBwOYiK0jkEmkOXEosY0y/8zZTgHhr7dqiqkfkUinApURzLcQWSPpIfKfr4SQijqAAlxLLGNMFeAvYTfr69rWBwdbaz4u0MJFcUoBLiWWM2QF0tdb+4tr+G/CZtTawaCsTyR1dhSIl2cFz4e3yK3CwqIoRuVQagUuJZYx5k/RnDS4kfQ68J7ATWAtgrV1adNWJuKcAlxLLGDM7m2ZL+ny4tdbef5lLErkkug5cSjIP4FFr7VEAY4wv8Lq1dkCRViWSS5oDl5Ks8bnwBrDWHgGaFl05IpdGAS4lmYdr1A2AMaYy+qtUHET/WKUkex34wRizmPS5717AC0Vbkkju6U1MKdGMMQ2ADqS/cbnaWru9iEsSyTUFuIiIQ2kOXETEoRTgIiIOpQAXEXEoBbiIiEP9PxtQtnlihBk9AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import pandas as df\n", "pipeline()" ] }, { "cell_type": "markdown", "id": "T2nKfQWD7V1k", "metadata": { "id": "T2nKfQWD7V1k" }, "source": [ "### Switching to GPU\n", "Traditionally, these tasks are frequently done (as we did) using the popular [**pandas**](https://pandas.pydata.org/) library, which only runs on a single CPU. NVIDIA's [**cuDF**](https://docs.rapids.ai/api/cudf/stable/) library was built with the users in mind - by offering nearly identical syntax to its CPU counterpart, developers only have to make few changes to their existing code to take advantage of its capabilities. " ] }, { "cell_type": "code", "execution_count": 12, "id": "TfDvbYbIU4b1", "metadata": { "id": "TfDvbYbIU4b1" }, "outputs": [], "source": [ "import cudf as df" ] }, { "cell_type": "markdown", "id": "oeYOoMVOLIbD", "metadata": { "id": "oeYOoMVOLIbD" }, "source": [ "**That's it!** cuDF uses nearly identical syntax to the familiar pandas API. **Brilliant!** It's worth noting that there are some features that are unique to each library, but conviniently there are a lot of overlaps. " ] }, { "cell_type": "code", "execution_count": null, "id": "Ocdd-JmXK5gg", "metadata": { "id": "Ocdd-JmXK5gg" }, "outputs": [], "source": [ "pipeline()" ] }, { "cell_type": "markdown", "id": "cgU3PNaPLZsS", "metadata": { "id": "cgU3PNaPLZsS" }, "source": [ "### Comparing Results\n", "In a trial run, **cuDF** completed the data processing tasks in nearly 10x faster than **pandas**. The expectations is that the speedup will be even more significant as the size of the data becomes largers. Feel free to give it a try by modifying the dimensions of the data above. 
\n", "\n", "![result](https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/result.png?raw=true)" ] }, { "cell_type": "markdown", "id": "lYPyye6BYNbr", "metadata": { "id": "lYPyye6BYNbr" }, "source": [ "## Conclusion\n", "Congratulations on completing the notebook! Want to learn more about cuDF and the rest of the RAPIDS framework? Check out the follow-up to this course, [Accelerating End-to-End Data Science Workflows](https://courses.nvidia.com/courses/course-v1:DLI+S-DS-01+V1/about), or our other online courses at [NVIDIA DLI](https://www.nvidia.com/en-us/training/online/)." ] } ], "metadata": { "accelerator": "GPU", "colab": { "name": "cuDF_speed_up.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }