{
"cells": [
{
"cell_type": "markdown",
"id": "_XMDuUoUl5cQ",
"metadata": {
"id": "_XMDuUoUl5cQ"
},
"source": [
"<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
]
},
{
"cell_type": "markdown",
"id": "T1WbsSK_0Xqo",
"metadata": {
"id": "T1WbsSK_0Xqo"
},
"source": [
"# Speed Up DataFrame Operations w/ RAPIDS cuDF"
]
},
{
"cell_type": "markdown",
"id": "-bPAvj4fwjbq",
"metadata": {
"id": "-bPAvj4fwjbq"
},
"source": [
"## Welcome\n",
"A **DataFrame** is a 2-dimensional data structure used to represent data in a tabular format, like a spreadsheet or SQL table. Originally offered through the Python Data Analysis library ([pandas](https://pandas.pydata.org/docs/)), DataFrames have become very popular for their familiar representation and a robust set of features that are intuitive and expressive. \n",
"\n",
"Raw data often needs to be manipulated before it can be used for further purposes such as generating **Business Intelligence**, creating **Dashboard Visualizations**, or training **Machine Learning** models. These preprocessing steps can include **filtering**, **merging**, **grouping**, and **aggregating**. \n",
"\n",
"Below is a typical data processing pipeline: \n",
"<p><img src='https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/flow.png?raw=true' alt='flow' width=1080></p>\n",
"\n",
"According to [a survey](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=29f71b266f63), data preparation accounts for ~80% of the work for analysts. This could be due in part to the rapid increase in the size of data as well as the iterative nature of analytics. \n",
"\n",
"Recognizing this potential bottleneck, NVIDIA created [**cuDF**](https://docs.rapids.ai/api/cudf/stable/), a library that leverages GPU hardware and software to perform data manipulation tasks with parallel computing, **saving valuable time and resources**. The cuDF library is part of the larger [**RAPIDS**](https://rapids.ai/) data science framework that allows for the execution of **end-to-end analytics pipelines** entirely on GPUs. One focus of cuDF and its companion suite of open source software libraries is to provide syntax that is similar to that of their CPU counterparts, **making it easy to implement**. \n",
"\n",
"This notebook is intended to demonstrate the speedup in data processing achieved by moving common DataFrame operations to the GPU with minimal changes to existing code. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ComTzf6gEWwT",
|
||
"metadata": {
|
||
"id": "ComTzf6gEWwT"
|
||
},
|
||
"source": [
"### Environment Sanity Check\n",
"Check the output of `!nvidia-smi` to make sure you've been allocated a RAPIDS-supported GPU such as a Tesla T4, P4, or P100."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "c58af14d",
|
||
"metadata": {
|
||
"id": "c58af14d"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Tue Feb 3 12:46:12 2026 \n",
|
||
"+-----------------------------------------------------------------------------+\n",
|
||
"| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |\n",
|
||
"|-------------------------------+----------------------+----------------------+\n",
|
||
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
|
||
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
|
||
"| | | MIG M. |\n",
|
||
"|===============================+======================+======================|\n",
|
||
"| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |\n",
|
||
"| N/A 23C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |\n",
|
||
"| | | N/A |\n",
|
||
"+-------------------------------+----------------------+----------------------+\n",
|
||
" \n",
|
||
"+-----------------------------------------------------------------------------+\n",
|
||
"| Processes: |\n",
|
||
"| GPU GI CI PID Type Process name GPU Memory |\n",
|
||
"| ID ID Usage |\n",
|
||
"|=============================================================================|\n",
|
||
"| No running processes found |\n",
|
||
"+-----------------------------------------------------------------------------+\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"!nvidia-smi"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "GM2FQ7-P8iaF",
|
||
"metadata": {
|
||
"id": "GM2FQ7-P8iaF"
|
||
},
|
||
"source": [
|
||
"## Interactive Exercise"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "XKUJgAqC38jR",
|
||
"metadata": {
|
||
"id": "XKUJgAqC38jR"
|
||
},
|
||
"outputs": [],
|
||
"source": [
"import numpy as np  # for generating sample data\n",
"\n",
"import pandas as df  # CPU DataFrame library; comment this out to use cuDF instead\n",
"# import cudf as df  # GPU DataFrame library; uncomment to run the same code on the GPU\n",
"import time  # for clocking process times\n",
"import matplotlib.pyplot as plt  # for visualizing results\n",
"\n",
"class Timer:  # context manager that measures the wall-clock time of its block\n",
"    def __enter__(self):\n",
"        self.start = time.perf_counter()\n",
"        return self\n",
"    def __exit__(self, *args):\n",
"        self.end = time.perf_counter()\n",
"        self.interval = self.end - self.start"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "GjeW2Mdh0huU",
|
||
"metadata": {
|
||
"id": "GjeW2Mdh0huU"
|
||
},
|
||
"source": [
"### Loading Sample Data\n",
"We start our demonstration by generating two 2-dimensional arrays of random integers, each configured to be sizeable at 1 million rows by 50 columns. They are then converted to DataFrames using ```pandas.DataFrame()``` or ```cudf.DataFrame()```, depending on which library was imported as `df`:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "RSCUQYModrAd",
|
||
"metadata": {
|
||
"id": "RSCUQYModrAd"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"rows=1000000\n",
|
||
"columns=50"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "108eb7cb",
|
||
"metadata": {
|
||
"id": "108eb7cb"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"The loading process took 0.96 seconds\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>a_0</th>\n",
|
||
" <th>a_1</th>\n",
|
||
" <th>a_2</th>\n",
|
||
" <th>a_3</th>\n",
|
||
" <th>a_4</th>\n",
|
||
" <th>a_5</th>\n",
|
||
" <th>a_6</th>\n",
|
||
" <th>a_7</th>\n",
|
||
" <th>a_8</th>\n",
|
||
" <th>a_9</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>a_40</th>\n",
|
||
" <th>a_41</th>\n",
|
||
" <th>a_42</th>\n",
|
||
" <th>a_43</th>\n",
|
||
" <th>a_44</th>\n",
|
||
" <th>a_45</th>\n",
|
||
" <th>a_46</th>\n",
|
||
" <th>a_47</th>\n",
|
||
" <th>a_48</th>\n",
|
||
" <th>a_49</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>999995</th>\n",
|
||
" <td>40</td>\n",
|
||
" <td>90</td>\n",
|
||
" <td>16</td>\n",
|
||
" <td>13</td>\n",
|
||
" <td>91</td>\n",
|
||
" <td>95</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>56</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>94</td>\n",
|
||
" <td>28</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>50</td>\n",
|
||
" <td>70</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>61</td>\n",
|
||
" <td>94</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999996</th>\n",
|
||
" <td>77</td>\n",
|
||
" <td>40</td>\n",
|
||
" <td>80</td>\n",
|
||
" <td>84</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>67</td>\n",
|
||
" <td>95</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>73</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>37</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>43</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999997</th>\n",
|
||
" <td>37</td>\n",
|
||
" <td>22</td>\n",
|
||
" <td>85</td>\n",
|
||
" <td>33</td>\n",
|
||
" <td>44</td>\n",
|
||
" <td>93</td>\n",
|
||
" <td>83</td>\n",
|
||
" <td>96</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>91</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>90</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>35</td>\n",
|
||
" <td>61</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>70</td>\n",
|
||
" <td>34</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999998</th>\n",
|
||
" <td>78</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>70</td>\n",
|
||
" <td>53</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>97</td>\n",
|
||
" <td>64</td>\n",
|
||
" <td>39</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>56</td>\n",
|
||
" <td>46</td>\n",
|
||
" <td>81</td>\n",
|
||
" <td>75</td>\n",
|
||
" <td>15</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>47</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>74</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999999</th>\n",
|
||
" <td>52</td>\n",
|
||
" <td>53</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>94</td>\n",
|
||
" <td>42</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>85</td>\n",
|
||
" <td>33</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>62</td>\n",
|
||
" <td>69</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>95</td>\n",
|
||
" <td>37</td>\n",
|
||
" <td>55</td>\n",
|
||
" <td>65</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>45</td>\n",
|
||
" <td>82</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 50 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9 ... a_40 a_41 \\\n",
|
||
"999995 40 90 16 13 91 95 77 68 48 56 ... 94 28 \n",
|
||
"999996 77 40 80 84 34 67 95 7 12 12 ... 98 71 \n",
|
||
"999997 37 22 85 33 44 93 83 96 48 91 ... 12 66 \n",
|
||
"999998 78 34 30 76 70 53 68 97 64 39 ... 76 56 \n",
|
||
"999999 52 53 58 10 94 42 66 85 33 34 ... 62 69 \n",
|
||
"\n",
|
||
" a_42 a_43 a_44 a_45 a_46 a_47 a_48 a_49 \n",
|
||
"999995 5 68 50 70 48 76 61 94 \n",
|
||
"999996 73 5 6 76 66 37 79 43 \n",
|
||
"999997 90 34 35 61 0 58 70 34 \n",
|
||
"999998 46 81 75 15 10 47 8 74 \n",
|
||
"999999 9 95 37 55 65 29 45 82 \n",
|
||
"\n",
|
||
"[5 rows x 50 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>b_0</th>\n",
|
||
" <th>b_1</th>\n",
|
||
" <th>b_2</th>\n",
|
||
" <th>b_3</th>\n",
|
||
" <th>b_4</th>\n",
|
||
" <th>b_5</th>\n",
|
||
" <th>b_6</th>\n",
|
||
" <th>b_7</th>\n",
|
||
" <th>b_8</th>\n",
|
||
" <th>b_9</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>b_40</th>\n",
|
||
" <th>b_41</th>\n",
|
||
" <th>b_42</th>\n",
|
||
" <th>b_43</th>\n",
|
||
" <th>b_44</th>\n",
|
||
" <th>b_45</th>\n",
|
||
" <th>b_46</th>\n",
|
||
" <th>b_47</th>\n",
|
||
" <th>b_48</th>\n",
|
||
" <th>b_49</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>999995</th>\n",
|
||
" <td>75</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>43</td>\n",
|
||
" <td>25</td>\n",
|
||
" <td>78</td>\n",
|
||
" <td>53</td>\n",
|
||
" <td>13</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>92</td>\n",
|
||
" <td>80</td>\n",
|
||
" <td>86</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>19</td>\n",
|
||
" <td>36</td>\n",
|
||
" <td>28</td>\n",
|
||
" <td>75</td>\n",
|
||
" <td>64</td>\n",
|
||
" <td>48</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999996</th>\n",
|
||
" <td>51</td>\n",
|
||
" <td>79</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>32</td>\n",
|
||
" <td>53</td>\n",
|
||
" <td>96</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>28</td>\n",
|
||
" <td>60</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>62</td>\n",
|
||
" <td>57</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>30</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>47</td>\n",
|
||
" <td>15</td>\n",
|
||
" <td>36</td>\n",
|
||
" <td>75</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999997</th>\n",
|
||
" <td>23</td>\n",
|
||
" <td>13</td>\n",
|
||
" <td>89</td>\n",
|
||
" <td>16</td>\n",
|
||
" <td>36</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>39</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>15</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>16</td>\n",
|
||
" <td>31</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>65</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>22</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>53</td>\n",
|
||
" <td>39</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999998</th>\n",
|
||
" <td>34</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>32</td>\n",
|
||
" <td>40</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>38</td>\n",
|
||
" <td>15</td>\n",
|
||
" <td>50</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>38</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>85</td>\n",
|
||
" <td>45</td>\n",
|
||
" <td>24</td>\n",
|
||
" <td>50</td>\n",
|
||
" <td>63</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>38</td>\n",
|
||
" <td>66</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>42</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>999999</th>\n",
|
||
" <td>87</td>\n",
|
||
" <td>90</td>\n",
|
||
" <td>47</td>\n",
|
||
" <td>25</td>\n",
|
||
" <td>99</td>\n",
|
||
" <td>11</td>\n",
|
||
" <td>92</td>\n",
|
||
" <td>87</td>\n",
|
||
" <td>81</td>\n",
|
||
" <td>45</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>80</td>\n",
|
||
" <td>28</td>\n",
|
||
" <td>99</td>\n",
|
||
" <td>68</td>\n",
|
||
" <td>96</td>\n",
|
||
" <td>38</td>\n",
|
||
" <td>13</td>\n",
|
||
" <td>86</td>\n",
|
||
" <td>73</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 50 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" b_0 b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 ... b_40 b_41 \\\n",
|
||
"999995 75 8 66 43 25 78 53 13 0 34 ... 92 80 \n",
|
||
"999996 51 79 23 23 32 53 96 7 28 60 ... 0 62 \n",
|
||
"999997 23 13 89 16 36 34 39 48 15 4 ... 16 31 \n",
|
||
"999998 34 98 32 40 29 38 15 50 34 38 ... 85 45 \n",
|
||
"999999 87 90 47 25 99 11 92 87 81 45 ... 5 80 \n",
|
||
"\n",
|
||
" b_42 b_43 b_44 b_45 b_46 b_47 b_48 b_49 \n",
|
||
"999995 86 79 19 36 28 75 64 48 \n",
|
||
"999996 57 29 30 12 47 15 36 75 \n",
|
||
"999997 48 65 98 22 11 6 53 39 \n",
|
||
"999998 24 50 63 4 38 66 76 42 \n",
|
||
"999999 28 99 68 96 38 13 86 73 \n",
|
||
"\n",
|
||
"[5 rows x 50 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
"def load_data():\n",
"    # generate two arrays of random integers in [0, 100)\n",
"    data_a = np.random.randint(0, 100, (rows, columns))\n",
"    data_b = np.random.randint(0, 100, (rows, columns))\n",
"    # wrap each array in a DataFrame with columns a_0..a_49 and b_0..b_49\n",
"    dataframe_a = df.DataFrame(data_a, columns=[f'a_{i}' for i in range(columns)])\n",
"    dataframe_b = df.DataFrame(data_b, columns=[f'b_{i}' for i in range(columns)])\n",
"    return dataframe_a, dataframe_b\n",
"\n",
"with Timer() as process_time:\n",
"    dataframe_a, dataframe_b = load_data()\n",
"\n",
"print(f'The loading process took {process_time.interval:.2f} seconds')\n",
"display(dataframe_a.tail(5))\n",
"display(dataframe_b.tail(5))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "sXlraNW9cl31",
|
||
"metadata": {
|
||
"id": "sXlraNW9cl31"
|
||
},
|
||
"source": [
"<p><img src='https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true' width=720 alt='check'></p>\n",
"\n",
"We created two DataFrames, _dataframe_a_ and _dataframe_b_, that are 1,000,000 rows by 50 columns each, with columns named a_0, a_1, ..., a_49 and b_0, b_1, ..., b_49, respectively. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "DKYzyh6bxwAB",
|
||
"metadata": {
|
||
"id": "DKYzyh6bxwAB"
|
||
},
|
||
"source": [
"### Merging Data\n",
"Data often comes from multiple sources and needs to be merged into one DataFrame with ```DataFrame.merge()``` or the equivalent module-level ```merge()``` function used below. For example, a typical retail data storage infrastructure may include a customer table and separate transaction and product tables. Merging the data brings the relevant details together in a single DataFrame to get the insight needed. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "bAGSwY8qx2DB",
|
||
"metadata": {
|
||
"id": "bAGSwY8qx2DB"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"The merging process took 1.28 seconds\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>a_0</th>\n",
|
||
" <th>a_1</th>\n",
|
||
" <th>a_2</th>\n",
|
||
" <th>a_3</th>\n",
|
||
" <th>a_4</th>\n",
|
||
" <th>a_5</th>\n",
|
||
" <th>a_6</th>\n",
|
||
" <th>a_7</th>\n",
|
||
" <th>a_8</th>\n",
|
||
" <th>a_9</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>b_40</th>\n",
|
||
" <th>b_41</th>\n",
|
||
" <th>b_42</th>\n",
|
||
" <th>b_43</th>\n",
|
||
" <th>b_44</th>\n",
|
||
" <th>b_45</th>\n",
|
||
" <th>b_46</th>\n",
|
||
" <th>b_47</th>\n",
|
||
" <th>b_48</th>\n",
|
||
" <th>b_49</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>49</td>\n",
|
||
" <td>95</td>\n",
|
||
" <td>72</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>86</td>\n",
|
||
" <td>20</td>\n",
|
||
" <td>54</td>\n",
|
||
" <td>81</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>52</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>19</td>\n",
|
||
" <td>22</td>\n",
|
||
" <td>90</td>\n",
|
||
" <td>69</td>\n",
|
||
" <td>28</td>\n",
|
||
" <td>50</td>\n",
|
||
" <td>53</td>\n",
|
||
" <td>17</td>\n",
|
||
" <td>60</td>\n",
|
||
" <td>79</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>77</td>\n",
|
||
" <td>32</td>\n",
|
||
" <td>47</td>\n",
|
||
" <td>57</td>\n",
|
||
" <td>22</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>73</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>25</td>\n",
|
||
" <td>70</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>91</td>\n",
|
||
" <td>74</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>36</td>\n",
|
||
" <td>61</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>29</td>\n",
|
||
" <td>32</td>\n",
|
||
" <td>87</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>94</td>\n",
|
||
" <td>15</td>\n",
|
||
" <td>58</td>\n",
|
||
" <td>22</td>\n",
|
||
" <td>52</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>90</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>70</td>\n",
|
||
" <td>88</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>74</td>\n",
|
||
" <td>26</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>93</td>\n",
|
||
" <td>15</td>\n",
|
||
" <td>71</td>\n",
|
||
" <td>54</td>\n",
|
||
" <td>97</td>\n",
|
||
" <td>26</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>98</td>\n",
|
||
" <td>67</td>\n",
|
||
" <td>36</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>55</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>61</td>\n",
|
||
" <td>98</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>77</td>\n",
|
||
" <td>59</td>\n",
|
||
" <td>36</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>62</td>\n",
|
||
" <td>39</td>\n",
|
||
" <td>27</td>\n",
|
||
" <td>14</td>\n",
|
||
" <td>26</td>\n",
|
||
" <td>30</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>70</td>\n",
|
||
" <td>80</td>\n",
|
||
" <td>35</td>\n",
|
||
" <td>51</td>\n",
|
||
" <td>76</td>\n",
|
||
" <td>14</td>\n",
|
||
" <td>34</td>\n",
|
||
" <td>25</td>\n",
|
||
" <td>99</td>\n",
|
||
" <td>49</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>99</td>\n",
|
||
" <td>21</td>\n",
|
||
" <td>18</td>\n",
|
||
" <td>78</td>\n",
|
||
" <td>32</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>97</td>\n",
|
||
" <td>48</td>\n",
|
||
" <td>23</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 100 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7 a_8 a_9 ... b_40 b_41 b_42 \\\n",
|
||
"0 49 95 72 58 86 20 54 81 4 52 ... 19 22 90 \n",
|
||
"1 77 32 47 57 22 98 73 98 25 70 ... 4 91 74 \n",
|
||
"2 94 15 58 22 52 0 90 48 70 88 ... 48 74 26 \n",
|
||
"3 98 67 36 76 55 8 48 61 98 77 ... 77 59 36 \n",
|
||
"4 70 80 35 51 76 14 34 25 99 49 ... 2 99 21 \n",
|
||
"\n",
|
||
" b_43 b_44 b_45 b_46 b_47 b_48 b_49 \n",
|
||
"0 69 28 50 53 17 60 79 \n",
|
||
"1 1 36 61 8 29 32 87 \n",
|
||
"2 98 93 15 71 54 97 26 \n",
|
||
"3 7 62 39 27 14 26 30 \n",
|
||
"4 18 78 32 3 97 48 23 \n",
|
||
"\n",
|
||
"[5 rows x 100 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
"def merge_data(left_df, right_df):\n",
"    # join the two DataFrames row by row on their index\n",
"    combined_df = df.merge(left_df, right_df, left_index=True, right_index=True)\n",
"    return combined_df\n",
"\n",
"with Timer() as process_time:\n",
"    combined_df = merge_data(dataframe_a, dataframe_b)\n",
"\n",
"print(f'The merging process took {process_time.interval:.2f} seconds')\n",
"display(combined_df.head())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "S_1QcS17c3S5",
|
||
"metadata": {
|
||
"id": "S_1QcS17c3S5"
|
||
},
|
||
"source": [
"<p><img src='https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true' width=720 alt='check'></p>\n",
"\n",
"We merged two DataFrames, _dataframe_a_ and _dataframe_b_, on their _index_ into one larger DataFrame that is 1,000,000 rows by 100 columns (a_0, a_1, ..., b_48, b_49). "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "UhdsvT-gABvZ",
|
||
"metadata": {
|
||
"id": "UhdsvT-gABvZ"
|
||
},
|
||
"source": [
"### Summarize\n",
"Exploring data begins with **descriptive statistics**, which often involves finding the **central tendency** and **dispersion**. These are a quick way to summarize distributions. Measures of central tendency include the mean, median, and mode; they describe the center of a set of data values. Measures of dispersion include variance and standard deviation; they describe the degree to which data is spread around the center. We can quickly compute simple descriptive statistics with the ```DataFrame.describe()``` method. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "26a2c5b6",
|
||
"metadata": {
|
||
"id": "26a2c5b6"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"The summarizing process took 4.43 seconds\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>a_0</th>\n",
|
||
" <th>a_1</th>\n",
|
||
" <th>a_2</th>\n",
|
||
" <th>a_3</th>\n",
|
||
" <th>a_4</th>\n",
|
||
" <th>a_5</th>\n",
|
||
" <th>a_6</th>\n",
|
||
" <th>a_7</th>\n",
|
||
" <th>a_8</th>\n",
|
||
" <th>a_9</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>b_40</th>\n",
|
||
" <th>b_41</th>\n",
|
||
" <th>b_42</th>\n",
|
||
" <th>b_43</th>\n",
|
||
" <th>b_44</th>\n",
|
||
" <th>b_45</th>\n",
|
||
" <th>b_46</th>\n",
|
||
" <th>b_47</th>\n",
|
||
" <th>b_48</th>\n",
|
||
" <th>b_49</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>count</th>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" <td>1000000.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>mean</th>\n",
|
||
" <td>49.511061</td>\n",
|
||
" <td>49.446406</td>\n",
|
||
" <td>49.482991</td>\n",
|
||
" <td>49.493942</td>\n",
|
||
" <td>49.483117</td>\n",
|
||
" <td>49.487522</td>\n",
|
||
" <td>49.525894</td>\n",
|
||
" <td>49.504130</td>\n",
|
||
" <td>49.484901</td>\n",
|
||
" <td>49.483078</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>49.504033</td>\n",
|
||
" <td>49.507267</td>\n",
|
||
" <td>49.537563</td>\n",
|
||
" <td>49.450110</td>\n",
|
||
" <td>49.486217</td>\n",
|
||
" <td>49.471796</td>\n",
|
||
" <td>49.499426</td>\n",
|
||
" <td>49.510149</td>\n",
|
||
" <td>49.476653</td>\n",
|
||
" <td>49.512459</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>std</th>\n",
|
||
" <td>28.861918</td>\n",
|
||
" <td>28.874678</td>\n",
|
||
" <td>28.882845</td>\n",
|
||
" <td>28.866755</td>\n",
|
||
" <td>28.878626</td>\n",
|
||
" <td>28.864333</td>\n",
|
||
" <td>28.860845</td>\n",
|
||
" <td>28.863316</td>\n",
|
||
" <td>28.858059</td>\n",
|
||
" <td>28.887034</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>28.863021</td>\n",
|
||
" <td>28.858843</td>\n",
|
||
" <td>28.858617</td>\n",
|
||
" <td>28.852121</td>\n",
|
||
" <td>28.847583</td>\n",
|
||
" <td>28.859198</td>\n",
|
||
" <td>28.855159</td>\n",
|
||
" <td>28.852837</td>\n",
|
||
" <td>28.840443</td>\n",
|
||
" <td>28.867724</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>min</th>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>25%</th>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>25.000000</td>\n",
|
||
" <td>24.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>50%</th>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>50.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>49.000000</td>\n",
|
||
" <td>50.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>75%</th>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>75.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" <td>74.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>max</th>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" <td>99.000000</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>8 rows × 100 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" a_0 a_1 a_2 a_3 \\\n",
|
||
"count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n",
|
||
"mean 49.511061 49.446406 49.482991 49.493942 \n",
|
||
"std 28.861918 28.874678 28.882845 28.866755 \n",
|
||
"min 0.000000 0.000000 0.000000 0.000000 \n",
|
||
"25% 24.000000 24.000000 24.000000 24.000000 \n",
|
||
"50% 50.000000 49.000000 49.000000 49.000000 \n",
|
||
"75% 75.000000 74.000000 74.000000 75.000000 \n",
|
||
"max 99.000000 99.000000 99.000000 99.000000 \n",
|
||
"\n",
|
||
" a_4 a_5 a_6 a_7 \\\n",
|
||
"count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n",
|
||
"mean 49.483117 49.487522 49.525894 49.504130 \n",
|
||
"std 28.878626 28.864333 28.860845 28.863316 \n",
|
||
"min 0.000000 0.000000 0.000000 0.000000 \n",
|
||
"25% 24.000000 24.000000 25.000000 24.000000 \n",
|
||
"50% 49.000000 49.000000 50.000000 50.000000 \n",
|
||
"75% 75.000000 74.000000 75.000000 74.000000 \n",
|
||
"max 99.000000 99.000000 99.000000 99.000000 \n",
|
||
"\n",
|
||
" a_8 a_9 ... b_40 b_41 \\\n",
|
||
"count 1000000.000000 1000000.000000 ... 1000000.000000 1000000.000000 \n",
|
||
"mean 49.484901 49.483078 ... 49.504033 49.507267 \n",
|
||
"std 28.858059 28.887034 ... 28.863021 28.858843 \n",
|
||
"min 0.000000 0.000000 ... 0.000000 0.000000 \n",
|
||
"25% 24.000000 24.000000 ... 25.000000 25.000000 \n",
|
||
"50% 49.000000 49.000000 ... 50.000000 50.000000 \n",
|
||
"75% 74.000000 75.000000 ... 75.000000 74.000000 \n",
|
||
"max 99.000000 99.000000 ... 99.000000 99.000000 \n",
|
||
"\n",
|
||
" b_42 b_43 b_44 b_45 \\\n",
|
||
"count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n",
|
||
"mean 49.537563 49.450110 49.486217 49.471796 \n",
|
||
"std 28.858617 28.852121 28.847583 28.859198 \n",
|
||
"min 0.000000 0.000000 0.000000 0.000000 \n",
|
||
"25% 25.000000 24.000000 25.000000 24.000000 \n",
|
||
"50% 50.000000 49.000000 49.000000 49.000000 \n",
|
||
"75% 75.000000 74.000000 74.000000 74.000000 \n",
|
||
"max 99.000000 99.000000 99.000000 99.000000 \n",
|
||
"\n",
|
||
" b_46 b_47 b_48 b_49 \n",
|
||
"count 1000000.000000 1000000.000000 1000000.000000 1000000.000000 \n",
|
||
"mean 49.499426 49.510149 49.476653 49.512459 \n",
|
||
"std 28.855159 28.852837 28.840443 28.867724 \n",
|
||
"min 0.000000 0.000000 0.000000 0.000000 \n",
|
||
"25% 25.000000 25.000000 25.000000 24.000000 \n",
|
||
"50% 49.000000 49.000000 49.000000 50.000000 \n",
|
||
"75% 74.000000 74.000000 74.000000 74.000000 \n",
|
||
"max 99.000000 99.000000 99.000000 99.000000 \n",
|
||
"\n",
|
||
"[8 rows x 100 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"def summarize(dataframe):\n",
|
||
"    summary_df=dataframe.describe()\n",
|
||
"    return summary_df\n",
|
||
"\n",
|
||
"with Timer() as process_time: \n",
|
||
"    summary_df=summarize(combined_df)\n",
|
||
"\n",
|
||
"print(f'The summarizing process took {process_time.interval:.2f} seconds')\n",
|
||
"display(summary_df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "KPz54wMldInX",
|
||
"metadata": {
|
||
"id": "KPz54wMldInX"
|
||
},
|
||
"source": [
|
||
"<p><img src='https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true' width=720 alt='check'></p>\n",
|
||
"\n",
|
||
"Because this is a sample data set, each column/feature (a_0, a_1, ..., b_48, b_49) has 1,000,000 values ranging from 0 to 99, with a mean of ~50 and a standard deviation of ~29."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "w7N64bdRAclS",
|
||
"metadata": {
|
||
"id": "w7N64bdRAclS"
|
||
},
|
||
"source": [
|
||
"### Correlation - Exploring Relationships\n",
|
||
"We can look for relationships between variables by computing their pairwise correlations with ```DataFrame.corr()```, which uses the Pearson correlation coefficient by default. The coefficient is a number between -1 and 1 that describes the strength and direction of the linear association between two variables: a value of 1 means they change together in the same direction, a value of -1 means they change together in opposite directions, and a value near 0 means there is little or no linear relationship. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "2538ccdd",
|
||
"metadata": {
|
||
"id": "2538ccdd"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"The correlation process took 23.03 seconds\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>a_0</th>\n",
|
||
" <th>a_1</th>\n",
|
||
" <th>a_2</th>\n",
|
||
" <th>a_3</th>\n",
|
||
" <th>a_4</th>\n",
|
||
" <th>a_5</th>\n",
|
||
" <th>a_6</th>\n",
|
||
" <th>a_7</th>\n",
|
||
" <th>a_8</th>\n",
|
||
" <th>a_9</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>b_40</th>\n",
|
||
" <th>b_41</th>\n",
|
||
" <th>b_42</th>\n",
|
||
" <th>b_43</th>\n",
|
||
" <th>b_44</th>\n",
|
||
" <th>b_45</th>\n",
|
||
" <th>b_46</th>\n",
|
||
" <th>b_47</th>\n",
|
||
" <th>b_48</th>\n",
|
||
" <th>b_49</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>a_0</th>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>-0.001281</td>\n",
|
||
" <td>-0.001831</td>\n",
|
||
" <td>0.000314</td>\n",
|
||
" <td>-0.001505</td>\n",
|
||
" <td>-0.001247</td>\n",
|
||
" <td>0.000018</td>\n",
|
||
" <td>0.000585</td>\n",
|
||
" <td>0.000628</td>\n",
|
||
" <td>-0.001549</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.002055</td>\n",
|
||
" <td>0.000192</td>\n",
|
||
" <td>0.001457</td>\n",
|
||
" <td>-0.000054</td>\n",
|
||
" <td>-0.000405</td>\n",
|
||
" <td>-0.001055</td>\n",
|
||
" <td>0.000768</td>\n",
|
||
" <td>0.001139</td>\n",
|
||
" <td>0.000209</td>\n",
|
||
" <td>-0.000652</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>a_1</th>\n",
|
||
" <td>-0.001281</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>0.000033</td>\n",
|
||
" <td>-0.000212</td>\n",
|
||
" <td>0.000829</td>\n",
|
||
" <td>0.001192</td>\n",
|
||
" <td>-0.000334</td>\n",
|
||
" <td>-0.000551</td>\n",
|
||
" <td>-0.000434</td>\n",
|
||
" <td>-0.000121</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>-0.000726</td>\n",
|
||
" <td>0.000390</td>\n",
|
||
" <td>0.000088</td>\n",
|
||
" <td>0.000975</td>\n",
|
||
" <td>-0.000287</td>\n",
|
||
" <td>0.001054</td>\n",
|
||
" <td>0.000370</td>\n",
|
||
" <td>0.000552</td>\n",
|
||
" <td>0.000185</td>\n",
|
||
" <td>0.001366</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>a_2</th>\n",
|
||
" <td>-0.001831</td>\n",
|
||
" <td>0.000033</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>-0.000632</td>\n",
|
||
" <td>-0.001345</td>\n",
|
||
" <td>-0.000222</td>\n",
|
||
" <td>-0.000713</td>\n",
|
||
" <td>-0.001515</td>\n",
|
||
" <td>-0.000810</td>\n",
|
||
" <td>-0.000193</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000263</td>\n",
|
||
" <td>0.000430</td>\n",
|
||
" <td>-0.000263</td>\n",
|
||
" <td>-0.000569</td>\n",
|
||
" <td>0.001625</td>\n",
|
||
" <td>-0.000449</td>\n",
|
||
" <td>-0.001388</td>\n",
|
||
" <td>-0.000414</td>\n",
|
||
" <td>0.001550</td>\n",
|
||
" <td>-0.000436</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>a_3</th>\n",
|
||
" <td>0.000314</td>\n",
|
||
" <td>-0.000212</td>\n",
|
||
" <td>-0.000632</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>0.002325</td>\n",
|
||
" <td>-0.001373</td>\n",
|
||
" <td>-0.000923</td>\n",
|
||
" <td>-0.000373</td>\n",
|
||
" <td>0.000230</td>\n",
|
||
" <td>-0.000529</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.000448</td>\n",
|
||
" <td>0.000080</td>\n",
|
||
" <td>-0.000237</td>\n",
|
||
" <td>-0.000018</td>\n",
|
||
" <td>-0.000217</td>\n",
|
||
" <td>-0.000565</td>\n",
|
||
" <td>0.000607</td>\n",
|
||
" <td>0.000945</td>\n",
|
||
" <td>-0.000555</td>\n",
|
||
" <td>-0.000179</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>a_4</th>\n",
|
||
" <td>-0.001505</td>\n",
|
||
" <td>0.000829</td>\n",
|
||
" <td>-0.001345</td>\n",
|
||
" <td>0.002325</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>-0.000842</td>\n",
|
||
" <td>-0.000515</td>\n",
|
||
" <td>-0.000127</td>\n",
|
||
" <td>-0.000170</td>\n",
|
||
" <td>-0.000975</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0.001551</td>\n",
|
||
" <td>-0.000489</td>\n",
|
||
" <td>-0.000425</td>\n",
|
||
" <td>0.000450</td>\n",
|
||
" <td>0.000633</td>\n",
|
||
" <td>0.000267</td>\n",
|
||
" <td>0.000340</td>\n",
|
||
" <td>0.000945</td>\n",
|
||
" <td>0.000047</td>\n",
|
||
" <td>-0.000825</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 100 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" a_0 a_1 a_2 a_3 a_4 a_5 a_6 \\\n",
|
||
"a_0 1.000000 -0.001281 -0.001831 0.000314 -0.001505 -0.001247 0.000018 \n",
|
||
"a_1 -0.001281 1.000000 0.000033 -0.000212 0.000829 0.001192 -0.000334 \n",
|
||
"a_2 -0.001831 0.000033 1.000000 -0.000632 -0.001345 -0.000222 -0.000713 \n",
|
||
"a_3 0.000314 -0.000212 -0.000632 1.000000 0.002325 -0.001373 -0.000923 \n",
|
||
"a_4 -0.001505 0.000829 -0.001345 0.002325 1.000000 -0.000842 -0.000515 \n",
|
||
"\n",
|
||
" a_7 a_8 a_9 ... b_40 b_41 b_42 \\\n",
|
||
"a_0 0.000585 0.000628 -0.001549 ... 0.002055 0.000192 0.001457 \n",
|
||
"a_1 -0.000551 -0.000434 -0.000121 ... -0.000726 0.000390 0.000088 \n",
|
||
"a_2 -0.001515 -0.000810 -0.000193 ... 0.000263 0.000430 -0.000263 \n",
|
||
"a_3 -0.000373 0.000230 -0.000529 ... 0.000448 0.000080 -0.000237 \n",
|
||
"a_4 -0.000127 -0.000170 -0.000975 ... 0.001551 -0.000489 -0.000425 \n",
|
||
"\n",
|
||
" b_43 b_44 b_45 b_46 b_47 b_48 b_49 \n",
|
||
"a_0 -0.000054 -0.000405 -0.001055 0.000768 0.001139 0.000209 -0.000652 \n",
|
||
"a_1 0.000975 -0.000287 0.001054 0.000370 0.000552 0.000185 0.001366 \n",
|
||
"a_2 -0.000569 0.001625 -0.000449 -0.001388 -0.000414 0.001550 -0.000436 \n",
|
||
"a_3 -0.000018 -0.000217 -0.000565 0.000607 0.000945 -0.000555 -0.000179 \n",
|
||
"a_4 0.000450 0.000633 0.000267 0.000340 0.000945 0.000047 -0.000825 \n",
|
||
"\n",
|
||
"[5 rows x 100 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"def correlation(dataframe): \n",
|
||
"    corr_df=dataframe.corr()\n",
|
||
"    return corr_df\n",
|
||
"\n",
|
||
"with Timer() as process_time: \n",
|
||
"    corr_df=correlation(combined_df)\n",
|
||
"\n",
|
||
"print(f'The correlation process took {process_time.interval:.2f} seconds')\n",
|
||
"display(corr_df.head())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "uaiK9t2CdgFS",
|
||
"metadata": {
|
||
"id": "uaiK9t2CdgFS"
|
||
},
|
||
"source": [
|
||
"<p><img src='https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true' width=720 alt='check'></p>\n",
|
||
"\n",
|
||
"The resulting correlation matrix shows that each column/feature (a_0, a_1, ..., b_48, b_49) is perfectly correlated (1) with itself and essentially uncorrelated (~0) with every other column. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1j1Y3Y_kBYyY",
|
||
"metadata": {
|
||
"id": "1j1Y3Y_kBYyY"
|
||
},
|
||
"source": [
|
||
"### Grouping\n",
|
||
"We can compare subsets of the data to explore the significance of categories and classes with the ```DataFrame.groupby()``` method. We can even group continuous values into a smaller number of bins with ```pandas.cut()``` or ```cudf.cut()``` to simplify our analysis. A grouping is usually followed by an aggregation such as mean or count. For example, we can group our data into 5 equal-width bins based on the sequential index and take the mean of every column within each bin. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "d050021a",
|
||
"metadata": {
|
||
"id": "d050021a"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"The grouping process took 1.04 seconds\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>a_0</th>\n",
|
||
" <th>a_1</th>\n",
|
||
" <th>a_2</th>\n",
|
||
" <th>a_3</th>\n",
|
||
" <th>a_4</th>\n",
|
||
" <th>a_5</th>\n",
|
||
" <th>a_6</th>\n",
|
||
" <th>a_7</th>\n",
|
||
" <th>a_8</th>\n",
|
||
" <th>a_9</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>b_40</th>\n",
|
||
" <th>b_41</th>\n",
|
||
" <th>b_42</th>\n",
|
||
" <th>b_43</th>\n",
|
||
" <th>b_44</th>\n",
|
||
" <th>b_45</th>\n",
|
||
" <th>b_46</th>\n",
|
||
" <th>b_47</th>\n",
|
||
" <th>b_48</th>\n",
|
||
" <th>b_49</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>49.529515</td>\n",
|
||
" <td>49.518860</td>\n",
|
||
" <td>49.389220</td>\n",
|
||
" <td>49.522915</td>\n",
|
||
" <td>49.456145</td>\n",
|
||
" <td>49.506795</td>\n",
|
||
" <td>49.566935</td>\n",
|
||
" <td>49.514930</td>\n",
|
||
" <td>49.490685</td>\n",
|
||
" <td>49.438430</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>49.417695</td>\n",
|
||
" <td>49.457570</td>\n",
|
||
" <td>49.552260</td>\n",
|
||
" <td>49.370180</td>\n",
|
||
" <td>49.548660</td>\n",
|
||
" <td>49.543460</td>\n",
|
||
" <td>49.50539</td>\n",
|
||
" <td>49.483245</td>\n",
|
||
" <td>49.467315</td>\n",
|
||
" <td>49.504140</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>49.533135</td>\n",
|
||
" <td>49.434285</td>\n",
|
||
" <td>49.548565</td>\n",
|
||
" <td>49.473550</td>\n",
|
||
" <td>49.587670</td>\n",
|
||
" <td>49.450590</td>\n",
|
||
" <td>49.433595</td>\n",
|
||
" <td>49.538200</td>\n",
|
||
" <td>49.483115</td>\n",
|
||
" <td>49.520890</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>49.490945</td>\n",
|
||
" <td>49.531120</td>\n",
|
||
" <td>49.535955</td>\n",
|
||
" <td>49.465815</td>\n",
|
||
" <td>49.445720</td>\n",
|
||
" <td>49.393615</td>\n",
|
||
" <td>49.54800</td>\n",
|
||
" <td>49.607035</td>\n",
|
||
" <td>49.514965</td>\n",
|
||
" <td>49.483715</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>49.463990</td>\n",
|
||
" <td>49.423480</td>\n",
|
||
" <td>49.525750</td>\n",
|
||
" <td>49.559600</td>\n",
|
||
" <td>49.556430</td>\n",
|
||
" <td>49.513570</td>\n",
|
||
" <td>49.490225</td>\n",
|
||
" <td>49.483775</td>\n",
|
||
" <td>49.453115</td>\n",
|
||
" <td>49.368385</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>49.578650</td>\n",
|
||
" <td>49.550740</td>\n",
|
||
" <td>49.506965</td>\n",
|
||
" <td>49.421825</td>\n",
|
||
" <td>49.542365</td>\n",
|
||
" <td>49.439395</td>\n",
|
||
" <td>49.43738</td>\n",
|
||
" <td>49.508520</td>\n",
|
||
" <td>49.517670</td>\n",
|
||
" <td>49.505025</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>49.486165</td>\n",
|
||
" <td>49.432445</td>\n",
|
||
" <td>49.528215</td>\n",
|
||
" <td>49.410420</td>\n",
|
||
" <td>49.378440</td>\n",
|
||
" <td>49.530500</td>\n",
|
||
" <td>49.538760</td>\n",
|
||
" <td>49.491275</td>\n",
|
||
" <td>49.552640</td>\n",
|
||
" <td>49.550090</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>49.437585</td>\n",
|
||
" <td>49.481485</td>\n",
|
||
" <td>49.605755</td>\n",
|
||
" <td>49.498830</td>\n",
|
||
" <td>49.488715</td>\n",
|
||
" <td>49.535180</td>\n",
|
||
" <td>49.49352</td>\n",
|
||
" <td>49.529620</td>\n",
|
||
" <td>49.429200</td>\n",
|
||
" <td>49.604420</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>49.542500</td>\n",
|
||
" <td>49.422960</td>\n",
|
||
" <td>49.423205</td>\n",
|
||
" <td>49.503225</td>\n",
|
||
" <td>49.436900</td>\n",
|
||
" <td>49.436155</td>\n",
|
||
" <td>49.599955</td>\n",
|
||
" <td>49.492470</td>\n",
|
||
" <td>49.444950</td>\n",
|
||
" <td>49.537595</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>49.595290</td>\n",
|
||
" <td>49.515420</td>\n",
|
||
" <td>49.486880</td>\n",
|
||
" <td>49.493900</td>\n",
|
||
" <td>49.405625</td>\n",
|
||
" <td>49.447330</td>\n",
|
||
" <td>49.51284</td>\n",
|
||
" <td>49.422325</td>\n",
|
||
" <td>49.454115</td>\n",
|
||
" <td>49.464995</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 100 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" a_0 a_1 a_2 a_3 a_4 a_5 \\\n",
|
||
"0 49.529515 49.518860 49.389220 49.522915 49.456145 49.506795 \n",
|
||
"1 49.533135 49.434285 49.548565 49.473550 49.587670 49.450590 \n",
|
||
"2 49.463990 49.423480 49.525750 49.559600 49.556430 49.513570 \n",
|
||
"3 49.486165 49.432445 49.528215 49.410420 49.378440 49.530500 \n",
|
||
"4 49.542500 49.422960 49.423205 49.503225 49.436900 49.436155 \n",
|
||
"\n",
|
||
" a_6 a_7 a_8 a_9 ... b_40 b_41 \\\n",
|
||
"0 49.566935 49.514930 49.490685 49.438430 ... 49.417695 49.457570 \n",
|
||
"1 49.433595 49.538200 49.483115 49.520890 ... 49.490945 49.531120 \n",
|
||
"2 49.490225 49.483775 49.453115 49.368385 ... 49.578650 49.550740 \n",
|
||
"3 49.538760 49.491275 49.552640 49.550090 ... 49.437585 49.481485 \n",
|
||
"4 49.599955 49.492470 49.444950 49.537595 ... 49.595290 49.515420 \n",
|
||
"\n",
|
||
" b_42 b_43 b_44 b_45 b_46 b_47 b_48 \\\n",
|
||
"0 49.552260 49.370180 49.548660 49.543460 49.50539 49.483245 49.467315 \n",
|
||
"1 49.535955 49.465815 49.445720 49.393615 49.54800 49.607035 49.514965 \n",
|
||
"2 49.506965 49.421825 49.542365 49.439395 49.43738 49.508520 49.517670 \n",
|
||
"3 49.605755 49.498830 49.488715 49.535180 49.49352 49.529620 49.429200 \n",
|
||
"4 49.486880 49.493900 49.405625 49.447330 49.51284 49.422325 49.454115 \n",
|
||
"\n",
|
||
" b_49 \n",
|
||
"0 49.504140 \n",
|
||
"1 49.483715 \n",
|
||
"2 49.505025 \n",
|
||
"3 49.604420 \n",
|
||
"4 49.464995 \n",
|
||
"\n",
|
||
"[5 rows x 100 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
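"def groupby_summarize(dataframe):\n",
|
||
"    # Bin the sequential row index into 5 equal-width intervals.\n",
|
||
"    # `df` is whichever library (pandas or cudf) is imported under that alias.\n",
|
||
"    # Note: this adds a 'group' column to the input DataFrame in place.\n",
|
||
"    dataframe['group']=dataframe.index\n",
|
||
"    dataframe['group']=df.cut(dataframe['group'], 5)\n",
|
||
"    group_describe_df=dataframe.groupby('group').mean().reset_index(drop=True)\n",
|
||
"    return group_describe_df\n",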
|
||
"\n",
|
||
"with Timer() as process_time: \n",
|
||
"    group_describe_df=groupby_summarize(combined_df)\n",
|
||
"\n",
|
||
"print(f'The grouping process took {process_time.interval:.2f} seconds')\n",
|
||
"display(group_describe_df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "LdVGBVr9e_o8",
|
||
"metadata": {
|
||
"id": "LdVGBVr9e_o8"
|
||
},
|
||
"source": [
|
||
"<p><img src='https://github.com/NVDLI/notebooks/blob/kl/cudf_speed_up/images/check.png?raw=true' width=720 alt='check'></p>\n",
|
||
"\n",
|
||
"The resulting DataFrame shows that each group maintains an average of ~50 for each column/feature (a_0, a_1, ..., b_48, b_49) as expected for this sample data. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b-9gbIriKa85",
|
||
"metadata": {
|
||
"id": "b-9gbIriKa85"
|
||
},
|
||
"source": [
|
||
"### Putting it together\n",
|
||
"We can measure the total elapsed time for this sample data processing workflow. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"id": "HMLKNN_RPB0c",
|
||
"metadata": {
|
||
"id": "HMLKNN_RPB0c"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def pipeline():\n",
|
||
"    # Time each stage separately; each interval is read after its with-block exits\n",
|
||
"    performance={}\n",
|
||
"    with Timer() as process_time:\n",
|
||
"        dataframe_a, dataframe_b=load_data()\n",
|
||
"    performance['load data']=process_time.interval\n",
|
||
"    with Timer() as process_time:\n",
|
||
"        combined_df=merge_data(dataframe_a, dataframe_b)\n",
|
||
"    performance['merge data']=process_time.interval\n",
|
||
"    with Timer() as process_time:\n",
|
||
"        summarize(combined_df)\n",
|
||
"    performance['summarize']=process_time.interval\n",
|
||
"    with Timer() as process_time:\n",
|
||
"        correlation(combined_df)\n",
|
||
"    performance['correlation']=process_time.interval\n",
|
||
"    with Timer() as process_time:\n",
|
||
"        groupby_summarize(combined_df)\n",
|
||
"    performance['groupby & summarize']=process_time.interval\n",
|
||
"    # Plot the per-stage timings as one stacked bar; a cuDF frame is converted to pandas for plotting\n",
|
||
"    if df.__name__=='cudf':\n",
|
||
"        df.DataFrame([performance], index=['gpu']).to_pandas().plot(kind='bar', stacked=True)\n",
|
||
"    else:\n",
|
||
"        df.DataFrame([performance], index=['cpu']).plot(kind='bar', stacked=True)\n",
|
||
"    return None"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "csfRLkjsc2v8",
|
||
"metadata": {
|
||
"id": "csfRLkjsc2v8"
|
||
},
|
||
"source": [
|
||
"### Timing the Pipeline on CPU"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"id": "8DcmBph9cyjm",
|
||
"metadata": {
|
||
"id": "8DcmBph9cyjm"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAEACAYAAACqOy3+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAg90lEQVR4nO3de3zPdf/H8cd7M2ZOTaaIjOsawzAzp1SONclo5VC5hBISpdMVEdevVukkXEiUQ0U5l0oqOriSYmMYI6lp5GJzyrkd3r8/9rXLbPOdHcxne95vt922z3vvz+fz2jc99977+/m8P8Zai4iIOI9HURcgIiJ5owAXEXEoBbiIiEMpwEVEHEoBLiLiUKUu58mqVKli/f39L+cpRUQcLzo6Osla63dh+2UNcH9/f6Kioi7nKUVEHM8Ysye7dk2hiIg4lAJcRMShFOAiIg6lABcRcSgFuIiIQynARUQcSgEuIuJQCnAREYdSgIuIONRlvRNTCtbUIV8XdQki2Xp4eoeiLqFE0AhcRMShFOAiIg6lABcRcSgFuIiIQ7kNcGOMtzFmvTFmszFmmzHm/1ztlY0xXxljdrk++xZ+uSIick5uRuBngQ7W2iZAMNDZGNMKGAmsttYGAKtd2yIicpm4vYzQWmuBE65NL9eHBboD7Vztc4FvgacLvELJUYdvHy7qEkRyEFfUBZQIuZoDN8Z4GmNigIPAV9ban4BrrLX7AVyfq+aw7yBjTJQxJioxMbGAyhYRkVwFuLU21VobDNQAWhhjgnJ7AmvtDGttqLU21M8vyyPdREQkjy7pKhRr7VHSp0o6AweMMdUAXJ8PFnRxIiKSs9xcheJnjLnK9XVZoBOwA1gO9HN16wd8XEg1iohINnKzFko1YK4xxpP0wF9orf3UGLMOWGiMeQD4HehZiHWKiMgFcnMVyhagaTbth4COhVGUiIi4pzsxRUQcSgEuIuJQCnAREYdSgIuIOJQCXETEoRTgIiIOpQAXEXEoBbiIiEMpwEVEHEoBLiLiUApwERGHUoCLiDiUAlxExKEU4CIiDqUAFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShFOAiIg6lABcRcSgFuIiIQ7kNcGNMTWPMN8aYOGPMNmPMo672fxlj9hljYlwfXQq/XBEROadULvqkAE9YazcaYyoA0caYr1zfe8Na+1rhlSciIjlxG+DW2v3AftfXx40xccB1hV2YiIhc3CXNgRtj/IGmwE+upmHGmC3GmFnGGN8c9hlkjIkyxkQlJibmr1oREcmQ6wA3xpQHlgAjrLV/Am8CfwOCSR+hv57dftbaGdbaUGttqJ+fX/4rFhERIJcBbozxIj2851lrlwJYaw9Ya1OttWnATKBF4ZUpIiIXys1VKAZ4B4iz1k44r73aed0igNiCL09ERHKSm6tQ2gB9ga3GmBhX2zPAPcaYYMAC8cDgQqhPRERykJurUL4HTDbfWlHw5YiISG7pTkwREYdSgIuIOJQCXETEoRTgIiIOpQAXEXEoBbiIiEMpwEVEHEoBLiLiUApwERGHUoCLiDiUAlxExKEU4CIiDpWb1QhFxA1bsSKpQwZja9YED42L4uLiiroER/L29qZGjRp4eXnlqr8CXKQApA4ZzNXBwVzl5UX6EvolW9n69Yu6BMex1nLo0CH27t1L7dq1c7WPhgoiBcDWrKnwlnwxxnD11Vdz5syZXO+jABcpCB4eCm/Jt0v9N6QAFxFxKM2BixSC+u/vKdDjxf2jlts+fi1akLh+fb7PFTltGuV9fBjRv3++znf06FHmz5/P0KFD812TZE8jcBEpFEePHmXatGlFXUaxpgAXKWastTzz+uuERkTQPCKCxStXAnDi1Cm6DBxI6169aB4RwSdff52xz8szZtAkPJzbBw5kV3x8tseN37uXdn36cOPdd/N///53RntOxx05ciS7d+8mODiYp556ihMnTtCxY0dCQkJo1K
gRH3/8ceG9CCWEplBEipmPV61iy44d/LR4MUlHjnDTPffQplkz/Hx9+XDiRCqWL0/SkSO069OHru3bs2n7dhZ//jnrFi4kJTWVG3r1ommDBlmO++TLL/Ng79706daN6R98kNHuXbp0luP2HD6c8ePHExsbS0xMDAApKSksW7aMihUrkpSURKtWrejWrZve/M0HBbhIMfPDxo307NIFT09PrqlShZtCQ4mOjSXsxhsZN2kSa6OjMR4e/HHwIAcOHeKHjRsJ79gRn7JlAbi9Xbtsj/vjpk18MGECAPeGh/PsG28A6SP+LMc9cCDL/tZannnmGdasWYOHhwf79u3jwIEDXHvttYXzQpQACnCRYsbm0P7hZ5+RdOQIaxcswMvLi8CwMM6ePQvk/vK17Ppld9zsrmWeN28eiYmJREdH4+Xlhb+//yVd8yxZaQ5cpJhp06wZS1auJDU1lcTDh/k+OprQRo3488QJ/CpXxsvLi+/Wr+f3P/7I6P/J6tWcPnOG4ydPsuK777I9bqumTVn0+edAemifk9NxK1SowPHjxzP6HTt2jKpVq+Ll5cU333zDnj0Fe6VOSeR2BG6MqQm8C1wLpAEzrLWTjDGVgQWAPxAP9LLWHim8UkWcIzeX/RWW7h07sn7zZlr26IEBXnj8ca6tUoXet99Oj2HDaNO7N40DA6nnul27aYMG3NW5M6169uT6atW4ISQk2+O+9vTT9H/6aabOm8cdnTpltOd03Kuvvpo2bdoQFBTEbbfdxtNPP014eDihoaEEBwcTGBhY6K9FcWeszekPLlcHY6oB1ay1G40xFYBo4A6gP3DYWjveGDMS8LXWPn2xY4WGhtqoqKgCKVwgLlDrTVwpkqdOIeCaa4q6jCtG2aCgoi7BseLi4qh/wVoyxphoa23ohX3dTqFYa/dbaze6vj4OxAHXAd2Bua5uc0kPdRERuUwuaQ7cGOMPNAV+Aq6x1u6H9JAHquawzyBjTJQxJioxMTGf5YqIyDm5DnBjTHlgCTDCWvtnbvez1s6w1oZaa0P9/PzyUqOIiGQjVwFujPEiPbznWWuXupoPuObHz82THyycEkVEJDtuA9ykX/j5DhBnrZ1w3reWA/1cX/cDdF+siMhllJsbedoAfYGtxpgYV9szwHhgoTHmAeB3oGehVCgiItlyG+DW2u+BnG7T6liw5YgUD2UXtynQ453usbZAj3c5xcfH07VrV2JjYy/a54cffuDee++9jJU5n+7EFJFMUlJSLvs54+PjmT9//mU/r9NpLRSRYmDPvn10HzKE1iEhbNiyhUZ169L3jjuInDaNxMOHmTV+PM0bNeLkqVM8/tJLbNu1i5TUVEY/9BDhHTrw3kcfsXLNGs789RenTp9myZQpDBozhp9/+416deqw548/eGP0aJo1bMiqH34gcupUziYnU6dGDd6KjKS8j0+meqKjo7n//vvx8fHhxhtvzGiPj4+nb9++nDx5EoApU6Zwww03MHLkSOLi4ggODqZfv35ERERk208y0whcpJjYnZDAw336sH7JEnb+9hsLVqxg9bvv8uITT/DqzJkAvDxzJu1atOD7Dz9k5TvvMHrCBE6eOgXAT5s3M/OFF/j8nXeYsWABV1WsyPqlSxk5eDCbtm8HIOnIEV5+6y0+mzmTdQsXEtKwIZPnzs1Sy4ABA5g8eTLr1q3L1F61alW++uorNm7cyIIFC3jkkUcAGD9+PDfddBMxMTE89thjOfaTzDQCFykm/K+7jqC6dQFo8Pe/075lS4wxBAUEsMe1wNTqH35gxbffMtEVumfOniXhv/8FoEPr1lSuVAlIX5L24X/8A4CGAQEZx12/ZQs7fv2VDvfdB0BycjItmjTJVMex48c5evQobdu2BaBv37587loEKzk5mWHDhhETE4Onpyc///xztj9LbvuVdApwkWKiTOnSGV97GJOx7eHhQWpqKpC+Jvf8CROo61pw6pwNW7ZQzrUe+Ll+2bHW0qF1a+a+8kqOdV
hrc1ye9o033uCaa65h8+bNpKWl4e3tna9+JZ2mUERKkE5t2vDm/PkZAR0TF5dtvxtCQljyxRcAxO3ezbZduwBo0bgx6zZtYvfvvwNw6vTpLI9gu6piRSpVqsT3338PpK8Dfs6xY8eoVq0aHh4evPfeexm/WLJbeja7fpKZRuAiheBKvexv1ODBPPXyy7S4804scH316iydOjVLv0G9e/PgmDG0uPNOmtSvT1BAAJXKl8evcmVmREbS75//5K+//gJg7PDhBPj7Z9p/9uzZGW9ihoWFZbQPHTqUu+66i0WLFtG+fXvKlSsHQOPGjSlVqhRNmjShf//+OfaTzNwuJ1uQtJxswdJysleO4racbGpqKskpKXiXKcOvCQl0GTiQLZ9+Smkvr1ztr+Vk8+5SlpPVCFxEsjh15gyd77+flJQUrLVMGjMm1+Etl48CXESyqFCuHGsXLCjqMsQNvYkpIuJQCnAREYdSgIuIOJQCXETEofQmpkghaBF9T4Eeb32zDwr0eEWlS5cuzJ8/n6uuuqqoSykWFOAiUuistVhrWbFiRVGXUqxoCkWkmDh56hQRQ4fS8q67CI2IYPHKlQSGhZF05AgA0du2ETZgAACR06bx4OjRhA8aRGBYGB+tWsXoCRNoHhFBtyFDSE5OBiAwLIyxkybRrk8f2vTuzabt2+k2eDANb7uNmQsXAnDi1Cm6DBxI6169aB4RwSdffw2kLx1bv359hg4dSkhICAkJCfj7+5OUlMT06dMJDg4mODiY2rVr0759ewC+/PJLWrduTUhICD179uTEiROX+2V0FAW4SDHx1dq1VKtalZ+WLCFq2TJuaXPxpwL9mpDA0qlTWTh5Mg+MGsXNzZuzYdkyypYpw+dr1mT0q3HttXw7bx5tQkIYPGYM8yZM4Nt584h03YLvXbo0H06cyLqFC/l81ixGvfZaxlorO3fu5L777mPTpk3UqlUr45hDhgwhJiaGDRs2UKNGDR5//HGSkpKIjIxk1apVbNy4kdDQUCZMmIDkTFMoIsVEw4AARr3+OmMmTOC2tm1p06zZRfvfeuONeHl5ERQQQGpqKre6HrzQMCCA313LzwLc3q5denvdupw4fZoK5cpRoVw5ypQuzdE//6Rc2bKMmzSJtdHRGA8P/jh4kAMHDgBQq1YtWrVqlWMNjz76KB06dCA8PJxPP/2U7du308b1i+evv/6idevW+XlJij0FuEgxEeDvz9oFC/hizRrGTppEx9atKeXpSVpaGgBnz57N1P/85Wa9SpXKWALWw8ODlPNW/8voZwxlzrud/ly/Dz/7jKQjR1i7YAFeXl4EhoVx5swZgIsuQjVnzhz27NnDlClTgPR58ltuuYUPPigeb9heDppCESkm/jh4EB9vb+4JD+fRfv2IiYujVvXqGU/T+eirrwrlvH+eOIFf5cp4eXnx3fr1mUbvOYmOjua1117j/fffx8MjPYZatWrF2rVr+eWXXwA4deqUHuTghkbgIoWgKC7727ZrF6Nffx3jGlFPevZZzpw5w0PjxvHq22/TvFGjQjlv79tvp8ewYbTp3ZvGgYHUu+BhEdmZMmUKhw8fznjzMjQ0lLfffps5c+Zwzz33ZPy1EBkZSV3X04AkKy0n62BaTvbKUdyWk80vLSebd5eynKymUEREHMptgBtjZhljDhpjYs9r+5cxZp8xJsb10aVwyxQRkQvlZgQ+B+icTfsb1tpg14durxIRuczcBri1dg1w+DLUIiIilyA/c+DDjDFbXFMsvjl1MsYMMsZEGWOiEhMT83E6ERE5X14D/E3gb0AwsB94PaeO1toZ1tpQa22on59fHk8nIiIXytN14NbaA+e+NsbMBD4tsIpEioH4Hj0L9Hj+ixcV6PHyKnLaNMr7+DCif/8c+yxfvZpGHh40aNAAgLFjx3LzzTfTqVOny1RlyZGnADfGVLPW7ndtRgCxF+svIleulJQUSpUqleP2pfr066/xvO66jAB/7rnn8l2jZM/tfy
VjzAdAO6CKMWYvMA5oZ4wJBiwQDwwuvBJFJLfmLV/OpDlzMMYQVLcu44YPZ8jYsSQdPkyVypV56/nnqVmtGoNGj8a3UiU279hBcP36HD56NNP2oLvvZsQLL5B0+DA+Zcsyddw46tWpk+lcsxYvZtbixSQnJ1Pn+ut558UX2bJzJ599+y3fb9lCZGQkS5Ys4fnnn6dr16706NGD1atX8+STT5KSkkLz5s158803KVOmDP7+/vTr149PPvmE5ORkFi1aRGBgYBG9is7hNsCttdk9WuSdQqhFRPJh+y+/8MrMmax+912q+Ppy+NgxHhw9mnvDw/lH9+7MXbaMJ156iYWTJwOwa88ePps5E09PTwaNHp1pu8vAgUx+9ln+XqsW67dsYcQLL/D5O5n/t+/eqRP39+gBwL8mT2bu0qU81KcPt7drR/e+fenh+t45Z86coX///qxevZq6dety33338eabbzJixAgAqlSpwsaNG5k2bRqvvfYab7/9duG/aA6nOzFFionvfvqJO265hSq+6ReFVa5UifWbN9O7S/p9dvd27cq6TZsy+t956614enpm2T5x6hQ/xsTQ54knaNmjB8Ofe47/ZnMF2fZdu+jUrx/NIyJYsGIF23fvvmh9O3fupHbt2hlrm/Tr14815607fueddwLQrFkz4uPj8/YilDBazEqkmLCAcdPn3JKxAOXKls30vXPbaWlpVKpQgZ8WL77osQY9+ywLJk2icb16vPfRR/xnw4aL1+dm3aUyZcoA4OnpSUpKykX7SjqNwEWKiXYtW7L0yy85dPQoAIePHaNlcDCLVq4E4MPPPqN106Zuj1OxfHn8r7uOpV98AaQH75adO7P0O3HyJNdWqUJycjILPvsso718uXIcP348S//AwEDi4+Mzlot97733aNu27SX/nPI/GoGLFIKiuOyvwd//zj8ffJCwAQPw9PCgSWAgr48cyZCxY5k4e3bGm5i5MXv8eB6JjOTlGTNITkmhR+fONK5XL1OfZ4cNo22fPlxfrRoNAwI4cfIkAD1vu41hL73E5MmTWXzeKN7b25vZs2fTs2fPjDcxhwwZUnAvQAmk5WQdTMvJXjm0nGxmWk4277ScrIhICaAAFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh9J14CKFYNaUgwV6vPuHVS3Q4xWENRs2MHHOHJZOnZqn/c+ePUvv3r3ZvXs3pUqVYsmSJdS5YMEsJxk4cCCPP/54xiqMl4MCXKQEye9SsQVp4cKFVKpUia1bt3LkyJFMt/k7TWpqapEsvqUpFJFi4qXp0wkOD6frgw/S75//ZOKcOQCEDRjA2EmTuLV/f6bOm8c3P/5Iq549aR4RweBnn+XsX38BEBgWRtKRIwBEb9tG2IABQPpDHB4YNYrbHniARrffzqzz7q48fuIEvR99lJDu3Rn+3HOkpaUxZ+lSHnvssYw+M2fO5PHHH89Sb+nSpdm3bx/WWnx9fbnqqquy/blSU1Pp378/QUFBNGrUiDfeeAOAdu3ace7GwKSkJPz9/QGYM2cOd9xxB+Hh4dSuXZspU6YwYcIEmjZtSqtWrTh8+HDG/o899hg333wz9evXZ8OGDdx5550EBAQwZsyYjPPfcccdNGvWjIYNGzJjxoyM9vLlyzN27FhatmzJunXrMupZvnw5wcHBBAcHU69ePWrXrp3+mkZH07ZtW5o1a0ZYWBj79+8nvxTgIsVA9LZtfLRqFesWLeKDiRPZuG1bpu8fO36cL+fMYfDddzNozBjee/VVNixbRmpqKjMXLHB7/Niff2bp1Kl88/77vDR9On8cTJ8iioqNZfyTT7Jh6VJ+S0jg41Wr6Nm5M8uXLyc5ORmA2bNnM8D1y+B8derUITo6mlGjRl303DExMezbt4/Y2Fi2bt2a7bGy1Bsby/z581m/fj2jR4/Gx8eHTZs20bp1a959992MfqVLl2bNmjUMGTKE7t27M3XqVGJjY5kzZw6HDh0CYNasWURHRx
MVFcXkyZMz2k+ePElQUBA//fQTN954Y8Yxu3XrRkxMDDExMTRp0oQnn3yS5ORkhg8fzuLFi4mOjub+++9n9OjRbn8OdxTgIsXAuo0b6dq+PWW9valQrhxdLlgkqkdYGAA/x8fjf911BLhGq326deP76Gi3x7/ddewqvr60bdGCqK1bAQgNCqJ2zZp4enrSs0sXfti0iXI+PnTo0IFPP/2UHTt2kJycTKNGjTId7/Tp0/Tv359t27YRExPDxIkTAejSpQvbLvjlU6dOHX799VeGDx/OypUrqVixott627dvT4UKFfDz86NSpUqEh4cD0KhRo0xL1Xbr1i2jvWHDhlSrVo0yZcpQp04dEhISAJg8eTJNmjShVatWJCQksGvXLiB91cS77rorxxpeeeUVypYty8MPP8zOnTuJjY3llltuITg4mMjISPbu3ev253DnypgME5F8cbemkY+Pj9t+pTw9SUtLA9LfYDzfhfPT57aztLs+Dxw4kBdffJHAwMBsR8xbt27Fz8+P6tWrs2TJEjp16oQxhqNHj2Z5E9DX15fNmzfzxRdfMHXqVBYuXMisWbMoVapURr1nzpzJtM+5pWkBPDw8MrY9PDwyLVV7fvuF+6SkpPDtt9+yatUq1q1bh4+PD+3atcs4l7e3d6b11M+3evVqFi1alLHeubWWhg0bsm7dumz755VG4CLFQOuQEFZ89x1nzp7lxKlTrPzPf7LtV692bfb88Qe7f/8dgA8++YSbQtPXSKpVvTqbtm8H4KOvvsq036fffMOZs2c5dPQoazZsoJlrsaqo2Fji9+4lLS2NJStXckNICAAtW7YkISGB+fPnc889WR/qFRAQwI4dO9i2bRvlypXjnXfe4amnnqJbt25ZfikkJSWRlpbGXXfdxfPPP8/GjRsB8Pf3J9r118NiN2uX59WxY8fw9fXFx8eHHTt28OOPP7rdZ8+ePQwdOpSFCxdS1rXGer169UhMTMwI8OTk5Cx/aeSFRuAiheByX/YXGhTE7e3a0bJHD66vVo2QBg2oWL58ln7eZcrw1vPP0+eJJ0hNSSEkKIiBvXoB8MxDD/HQuHG8+vbbNL9gyiM0KIg7H36YhP37GTl4MNWrVuWXPXto2aQJz06cyLZdu2jTrBndOnbM2KdXr17ExMTg63pC0Pl8fX2ZO3cuffv2xVpLpUqVmDdvHqNGjeLmm2/mhhtuyOi7b98+BgwYkDHafumllwB48skn6dWrF++99x4dOnTI/4uYjc6dOzN9+nQaN25MvXr1aNWqldt9zs2fR0REAFC9enVWrFjB4sWLeeSRRzh27BgpKSmMGDGChg0b5qs+LSfrYFpO9spxJSwne+LUKcr7+HDq9Glu6d+fKePG0bQArkmOnDaN8j4+jOjfP9f7lA0KomvXrjz22GN0PC/Uxb1LWU5WI3CRYmLYv/5F3K+/cvbsWfp0714g4Z0XR//8kyZ169KkSROFdyFTgIsUE3NeeaVQjjtm6NBL6n9VxYr8/PPPhVKLZKY3MUUKQlqa2ytBRNy51H9DCnCRAmASEjianKwQlzyz1nLo0CG8vb1zvY/bKRRjzCygK3DQWhvkaqsMLAD8gXigl7X2SB5qFikWPKe/xaEhg0mqWRM8NC7yyuH6aLk4b29vatSokev+uZkDnwNMAd49r20ksNpaO94YM9K1/fQl1ClSrJg//6TUK68WdRlXjPo74oq6hBLB7VDBWrsGOHxBc3dgruvrucAdBVuWiIi4k9e/9a6x1u4HcH3O8a4FY8wgY0yUMSYqMTExj6cTEZELFfpknbV2hrU21Fob6ufnV9inExEpMfIa4AeMMdUAXJ8L9vEjIiLiVl4DfDnQz/V1P+DjgilHRERyy22AG2M+ANYB9Ywxe40xDwDjgVuMMbuAW1zbIiJyGbm9jNBam3UtyHRa5EBEpAjpjgMREYdSgIuIOJQCXETEoRTgIiIOpQAXEXEoBbiIiEMpwEVEHEoBLiLiUApwERGHUo
CLiDiUAlxExKEU4CIiDqUAFxFxKAW4iIhD5eap9HKF6jVK//nkyrS1qAsoITQCFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShFOAiIg6lABcRcSgFuIiIQ+XrXmxjTDxwHEgFUqy1oQVRlIiIuFcQi2m0t9YmFcBxRETkEmgKRUTEofIb4Bb40hgTbYwZlF0HY8wgY0yUMSYqMTExn6cTEZFz8hvgbay1IcBtwMPGmJsv7GCtnWGtDbXWhvr5+eXzdCIick6+Atxa+4fr80FgGdCiIIoSERH38hzgxphyxpgK574GbgViC6owERG5uPxchXINsMwYc+448621KwukKhERcSvPAW6t/RVoUoC1iIjIJdBlhCIiDqUAFxFxKAW4iIhDKcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShFOAiIg6lABcRcSgFuIiIQynARUQcSgEuIuJQCnAREYcqiKfSSxHZ+tvvRV2CiBQhjcBFRBxKAS4i4lAKcBERh1KAi4g4lAJcRMShdBWKg/mfmV/UJYhkK76oCyghNAIXEXEoBbiIiEMpwEVEHCpfAW6M6WyM2WmM+cUYM7KgihIREffyHODGGE9gKnAb0AC4xxjToKAKExGRi8vPCLwF8Iu19ldr7V/Ah0D3gilLRETcyc9lhNcBCedt7wVaXtjJGDMIGOTaPGGM2ZmPc4oUlipAUlEXUVyYl4u6gmKnVnaN+Qlwk02bzdJg7QxgRj7OI1LojDFR1trQoq5D5FLkZwplL1DzvO0awB/5K0dERHIrPwG+AQgwxtQ2xpQG7gaWF0xZIiLiTp6nUKy1KcaYYcAXgCcwy1q7rcAqE7m8NM0njmOszTJtLSIiDqA7MUVEHEoBLiLiUApwERGHUoCLiDiUHuggJZYx5ubs2q21ay53LSJ5oatQpMQyxnxy3qY36ev7RFtrOxRRSSKXRCNwKbGsteHnbxtjagKvFFE5IpdMc+Ai/7MXCCrqIkRySyNwKbGMMf/mfwuweQDBwOYiK0jkEmkOXEosY0y/8zZTgHhr7dqiqkfkUinApURzLcQWSPpIfKfr4SQijqAAlxLLGNMFeAvYTfr69rWBwdbaz4u0MJFcUoBLiWWM2QF0tdb+4tr+G/CZtTawaCsTyR1dhSIl2cFz4e3yK3CwqIoRuVQagUuJZYx5k/RnDS4kfQ68J7ATWAtgrV1adNWJuKcAlxLLGDM7m2ZL+ny4tdbef5lLErkkug5cSjIP4FFr7VEAY4wv8Lq1dkCRViWSS5oDl5Ks8bnwBrDWHgGaFl05IpdGAS4lmYdr1A2AMaYy+qtUHET/WKUkex34wRizmPS5717AC0Vbkkju6U1MKdGMMQ2ADqS/cbnaWru9iEsSyTUFuIiIQ2kOXETEoRTgIiIOpQAXEXEoBbiIiEP9PxtQtnlihBk9AAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import pandas as df\n",
"pipeline()"
]
},
{
"cell_type": "markdown",
"id": "T2nKfQWD7V1k",
"metadata": {
"id": "T2nKfQWD7V1k"
},
"source": [
"### Switching to GPU\n",
"Traditionally, these tasks are done (as we did above) using the popular [**pandas**](https://pandas.pydata.org/) library, which runs on the CPU and is largely single-threaded. NVIDIA's [**cuDF**](https://docs.rapids.ai/api/cudf/stable/) library was built with pandas users in mind: because it offers nearly identical syntax to its CPU counterpart, developers only have to make a few changes to their existing code to take advantage of the GPU. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "TfDvbYbIU4b1",
"metadata": {
"id": "TfDvbYbIU4b1"
},
"outputs": [],
"source": [
"import cudf as df"
]
},
{
"cell_type": "markdown",
"id": "oeYOoMVOLIbD",
"metadata": {
"id": "oeYOoMVOLIbD"
},
"source": [
"**That's it!** cuDF uses nearly identical syntax to the familiar pandas API. **Brilliant!** Each library has some features that the other lacks, but conveniently there is a lot of overlap. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "Ocdd-JmXK5gg",
"metadata": {
"id": "Ocdd-JmXK5gg"
},
"outputs": [],
"source": [
"pipeline()"
]
},
{
"cell_type": "markdown",
"id": "cgU3PNaPLZsS",
"metadata": {
"id": "cgU3PNaPLZsS"
},
"source": [
"### Comparing Results\n",
"In a trial run, **cuDF** completed the data processing tasks nearly 10x faster than **pandas**. The expectation is that the speedup will be even more significant as the size of the data grows. Feel free to give it a try by modifying the dimensions of the data above. \n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "lYPyye6BYNbr",
"metadata": {
"id": "lYPyye6BYNbr"
},
"source": [
"## Conclusion\n",
"Congratulations on completing the notebook! Want to learn more about cuDF and the rest of the RAPIDS framework? Check out the follow-up to this course, [Accelerating End-to-End Data Science Workflows](https://courses.nvidia.com/courses/course-v1:DLI+S-DS-01+V1/about), or our other online courses at [NVIDIA DLI](https://www.nvidia.com/en-us/training/online/)."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "cuDF_speed_up.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}