{
"cells": [
{
"cell_type": "markdown",
"id": "_XMDuUoUl5cQ",
"metadata": {
"id": "_XMDuUoUl5cQ"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"id": "T1WbsSK_0Xqo",
"metadata": {
"id": "T1WbsSK_0Xqo"
},
"source": [
"# Speed Up DataFrame Operations w/ RAPIDS cuDF"
]
},
{
"cell_type": "markdown",
"id": "-bPAvj4fwjbq",
"metadata": {
"id": "-bPAvj4fwjbq"
},
"source": [
"## Welcome\n",
"A **DataFrame** is a 2-dimensional data structure used to represent data in a tabular format, like a spreadsheet or SQL table. Originally offered through the Python Data Analysis ([pandas](https://pandas.pydata.org/docs/)) library, DataFrames have become very popular for its familiar representation along with a robust set of features that are intuitive and expressive. \n",
"\n",
"Raw data often needs to be manipulated before it can be used for further purposes such as generating **Business Intelligence**, creating **Dashboard Visualization**, or training **Machine Learning** models. These preprocessing steps can include **filtering**, **merging**, **grouping**, and **aggregating**. \n",
"\n",
"Below is a typical data processing pipeline: \n",
"

|  | a_0 | a_1 | a_2 | a_3 | a_4 | a_5 | a_6 | a_7 | a_8 | a_9 | ... | a_40 | a_41 | a_42 | a_43 | a_44 | a_45 | a_46 | a_47 | a_48 | a_49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999995 | 40 | 90 | 16 | 13 | 91 | 95 | 77 | 68 | 48 | 56 | ... | 94 | 28 | 5 | 68 | 50 | 70 | 48 | 76 | 61 | 94 |
| 999996 | 77 | 40 | 80 | 84 | 34 | 67 | 95 | 7 | 12 | 12 | ... | 98 | 71 | 73 | 5 | 6 | 76 | 66 | 37 | 79 | 43 |
| 999997 | 37 | 22 | 85 | 33 | 44 | 93 | 83 | 96 | 48 | 91 | ... | 12 | 66 | 90 | 34 | 35 | 61 | 0 | 58 | 70 | 34 |
| 999998 | 78 | 34 | 30 | 76 | 70 | 53 | 68 | 97 | 64 | 39 | ... | 76 | 56 | 46 | 81 | 75 | 15 | 10 | 47 | 8 | 74 |
| 999999 | 52 | 53 | 58 | 10 | 94 | 42 | 66 | 85 | 33 | 34 | ... | 62 | 69 | 9 | 95 | 37 | 55 | 65 | 29 | 45 | 82 |

5 rows × 50 columns

|  | b_0 | b_1 | b_2 | b_3 | b_4 | b_5 | b_6 | b_7 | b_8 | b_9 | ... | b_40 | b_41 | b_42 | b_43 | b_44 | b_45 | b_46 | b_47 | b_48 | b_49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999995 | 75 | 8 | 66 | 43 | 25 | 78 | 53 | 13 | 0 | 34 | ... | 92 | 80 | 86 | 79 | 19 | 36 | 28 | 75 | 64 | 48 |
| 999996 | 51 | 79 | 23 | 23 | 32 | 53 | 96 | 7 | 28 | 60 | ... | 0 | 62 | 57 | 29 | 30 | 12 | 47 | 15 | 36 | 75 |
| 999997 | 23 | 13 | 89 | 16 | 36 | 34 | 39 | 48 | 15 | 4 | ... | 16 | 31 | 48 | 65 | 98 | 22 | 11 | 6 | 53 | 39 |
| 999998 | 34 | 98 | 32 | 40 | 29 | 38 | 15 | 50 | 34 | 38 | ... | 85 | 45 | 24 | 50 | 63 | 4 | 38 | 66 | 76 | 42 |
| 999999 | 87 | 90 | 47 | 25 | 99 | 11 | 92 | 87 | 81 | 45 | ... | 5 | 80 | 28 | 99 | 68 | 96 | 38 | 13 | 86 | 73 |

5 rows × 50 columns

|  | a_0 | a_1 | a_2 | a_3 | a_4 | a_5 | a_6 | a_7 | a_8 | a_9 | ... | b_40 | b_41 | b_42 | b_43 | b_44 | b_45 | b_46 | b_47 | b_48 | b_49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | 95 | 72 | 58 | 86 | 20 | 54 | 81 | 4 | 52 | ... | 19 | 22 | 90 | 69 | 28 | 50 | 53 | 17 | 60 | 79 |
| 1 | 77 | 32 | 47 | 57 | 22 | 98 | 73 | 98 | 25 | 70 | ... | 4 | 91 | 74 | 1 | 36 | 61 | 8 | 29 | 32 | 87 |
| 2 | 94 | 15 | 58 | 22 | 52 | 0 | 90 | 48 | 70 | 88 | ... | 48 | 74 | 26 | 98 | 93 | 15 | 71 | 54 | 97 | 26 |
| 3 | 98 | 67 | 36 | 76 | 55 | 8 | 48 | 61 | 98 | 77 | ... | 77 | 59 | 36 | 7 | 62 | 39 | 27 | 14 | 26 | 30 |
| 4 | 70 | 80 | 35 | 51 | 76 | 14 | 34 | 25 | 99 | 49 | ... | 2 | 99 | 21 | 18 | 78 | 32 | 3 | 97 | 48 | 23 |

5 rows × 100 columns
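
The preview above shows `a_*` columns followed by `b_*` columns in a single 100-column frame. The code that produced it is not part of this section, so the following is only a sketch of one way such a frame could be built: two frames of random integers that share a default index, combined column-wise with `pd.concat`. The names `df_a`, `df_b`, and `combined`, the seed, and the reduced sizes are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes, far smaller than the 1,000,000-row frames shown above.
n_rows, n_cols = 1_000, 5
rng = np.random.default_rng(0)

# Two frames of random integers in [0, 100) with a_* and b_* column names.
df_a = pd.DataFrame(
    rng.integers(0, 100, size=(n_rows, n_cols)),
    columns=[f"a_{i}" for i in range(n_cols)],
)
df_b = pd.DataFrame(
    rng.integers(0, 100, size=(n_rows, n_cols)),
    columns=[f"b_{i}" for i in range(n_cols)],
)

# Combine column-wise; rows are aligned on the shared default index.
combined = pd.concat([df_a, df_b], axis=1)
print(combined.shape)
```

cuDF is designed to mirror much of the pandas API, so similar calls are expected to be available under `cudf` on a machine with a supported GPU; that is not verified here.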

|  | a_0 | a_1 | a_2 | a_3 | a_4 | a_5 | a_6 | a_7 | a_8 | a_9 | ... | b_40 | b_41 | b_42 | b_43 | b_44 | b_45 | b_46 | b_47 | b_48 | b_49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | ... | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 | 1000000.000000 |
| mean | 49.511061 | 49.446406 | 49.482991 | 49.493942 | 49.483117 | 49.487522 | 49.525894 | 49.504130 | 49.484901 | 49.483078 | ... | 49.504033 | 49.507267 | 49.537563 | 49.450110 | 49.486217 | 49.471796 | 49.499426 | 49.510149 | 49.476653 | 49.512459 |
| std | 28.861918 | 28.874678 | 28.882845 | 28.866755 | 28.878626 | 28.864333 | 28.860845 | 28.863316 | 28.858059 | 28.887034 | ... | 28.863021 | 28.858843 | 28.858617 | 28.852121 | 28.847583 | 28.859198 | 28.855159 | 28.852837 | 28.840443 | 28.867724 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 25.000000 | 24.000000 | 24.000000 | 24.000000 | ... | 25.000000 | 25.000000 | 25.000000 | 24.000000 | 25.000000 | 24.000000 | 25.000000 | 25.000000 | 25.000000 | 24.000000 |
| 50% | 50.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 50.000000 | 50.000000 | 49.000000 | 49.000000 | ... | 50.000000 | 50.000000 | 50.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 50.000000 |
| 75% | 75.000000 | 74.000000 | 74.000000 | 75.000000 | 75.000000 | 74.000000 | 75.000000 | 74.000000 | 74.000000 | 75.000000 | ... | 75.000000 | 74.000000 | 75.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 |
| max | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | ... | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 |

8 rows × 100 columns
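
The row labels above (`count`, `mean`, `std`, `min`, the quartiles, `max`) match what `DataFrame.describe()` returns for numeric columns. A minimal sketch on a small frame follows; the frame `df`, its three columns, and the seed are assumptions for illustration rather than the notebook's own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical small frame of random integers in [0, 100).
df = pd.DataFrame(
    rng.integers(0, 100, size=(1_000, 3)),
    columns=["a_0", "a_1", "b_0"],
)

# One summary column per numeric input column.
summary = df.describe()
print(summary)
```

The statistics in the table above are consistent with values drawn uniformly from the integers 0 to 99, for which the expected mean is 49.5 and the expected standard deviation is sqrt((100² − 1) / 12) ≈ 28.87.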

|  | a_0 | a_1 | a_2 | a_3 | a_4 | a_5 | a_6 | a_7 | a_8 | a_9 | ... | b_40 | b_41 | b_42 | b_43 | b_44 | b_45 | b_46 | b_47 | b_48 | b_49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a_0 | 1.000000 | -0.001281 | -0.001831 | 0.000314 | -0.001505 | -0.001247 | 0.000018 | 0.000585 | 0.000628 | -0.001549 | ... | 0.002055 | 0.000192 | 0.001457 | -0.000054 | -0.000405 | -0.001055 | 0.000768 | 0.001139 | 0.000209 | -0.000652 |
| a_1 | -0.001281 | 1.000000 | 0.000033 | -0.000212 | 0.000829 | 0.001192 | -0.000334 | -0.000551 | -0.000434 | -0.000121 | ... | -0.000726 | 0.000390 | 0.000088 | 0.000975 | -0.000287 | 0.001054 | 0.000370 | 0.000552 | 0.000185 | 0.001366 |
| a_2 | -0.001831 | 0.000033 | 1.000000 | -0.000632 | -0.001345 | -0.000222 | -0.000713 | -0.001515 | -0.000810 | -0.000193 | ... | 0.000263 | 0.000430 | -0.000263 | -0.000569 | 0.001625 | -0.000449 | -0.001388 | -0.000414 | 0.001550 | -0.000436 |
| a_3 | 0.000314 | -0.000212 | -0.000632 | 1.000000 | 0.002325 | -0.001373 | -0.000923 | -0.000373 | 0.000230 | -0.000529 | ... | 0.000448 | 0.000080 | -0.000237 | -0.000018 | -0.000217 | -0.000565 | 0.000607 | 0.000945 | -0.000555 | -0.000179 |
| a_4 | -0.001505 | 0.000829 | -0.001345 | 0.002325 | 1.000000 | -0.000842 | -0.000515 | -0.000127 | -0.000170 | -0.000975 | ... | 0.001551 | -0.000489 | -0.000425 | 0.000450 | 0.000633 | 0.000267 | 0.000340 | 0.000945 | 0.000047 | -0.000825 |

5 rows × 100 columns
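
The matrix above has 1.0 on the diagonal and is symmetric in the entries shown, as a pairwise correlation matrix would be. Off-diagonal values close to zero are what independently generated columns tend to produce. The sketch below uses `DataFrame.corr()`, which computes Pearson correlation by default, on a small frame; the frame, its column names, and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame of independent random integers in [0, 100).
df = pd.DataFrame(
    rng.integers(0, 100, size=(10_000, 3)),
    columns=["a_0", "a_1", "b_0"],
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr)
```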

|  | a_0 | a_1 | a_2 | a_3 | a_4 | a_5 | a_6 | a_7 | a_8 | a_9 | ... | b_40 | b_41 | b_42 | b_43 | b_44 | b_45 | b_46 | b_47 | b_48 | b_49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49.529515 | 49.518860 | 49.389220 | 49.522915 | 49.456145 | 49.506795 | 49.566935 | 49.514930 | 49.490685 | 49.438430 | ... | 49.417695 | 49.457570 | 49.552260 | 49.370180 | 49.548660 | 49.543460 | 49.50539 | 49.483245 | 49.467315 | 49.504140 |
| 1 | 49.533135 | 49.434285 | 49.548565 | 49.473550 | 49.587670 | 49.450590 | 49.433595 | 49.538200 | 49.483115 | 49.520890 | ... | 49.490945 | 49.531120 | 49.535955 | 49.465815 | 49.445720 | 49.393615 | 49.54800 | 49.607035 | 49.514965 | 49.483715 |
| 2 | 49.463990 | 49.423480 | 49.525750 | 49.559600 | 49.556430 | 49.513570 | 49.490225 | 49.483775 | 49.453115 | 49.368385 | ... | 49.578650 | 49.550740 | 49.506965 | 49.421825 | 49.542365 | 49.439395 | 49.43738 | 49.508520 | 49.517670 | 49.505025 |
| 3 | 49.486165 | 49.432445 | 49.528215 | 49.410420 | 49.378440 | 49.530500 | 49.538760 | 49.491275 | 49.552640 | 49.550090 | ... | 49.437585 | 49.481485 | 49.605755 | 49.498830 | 49.488715 | 49.535180 | 49.49352 | 49.529620 | 49.429200 | 49.604420 |
| 4 | 49.542500 | 49.422960 | 49.423205 | 49.503225 | 49.436900 | 49.436155 | 49.599955 | 49.492470 | 49.444950 | 49.537595 | ... | 49.595290 | 49.515420 | 49.486880 | 49.493900 | 49.405625 | 49.447330 | 49.51284 | 49.422325 | 49.454115 | 49.464995 |

5 rows × 100 columns
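
The table above has five rows indexed 0 to 4 with per-column averages near 49.5, which looks like the result of grouping followed by a mean. The grouping key is not shown in this section, so the `key` column in the sketch below is purely hypothetical, as are the frame, its column names, and the seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1_000

# Hypothetical frame of random integers in [0, 100).
df = pd.DataFrame(
    rng.integers(0, 100, size=(n_rows, 3)),
    columns=["a_0", "a_1", "b_0"],
)

# Hypothetical key: assign rows to five equally sized groups labelled 0..4.
df["key"] = np.arange(n_rows) % 5

# Group on the key, then average every remaining column within each group.
group_means = df.groupby("key").mean()
print(group_means)
```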