<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
|
|
|
|
# Enhancing Data Science Outcomes With Efficient Workflow #
|
|
|
|
## 02 - Data Exploration and Data Visualization ##
|
|
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, uses domain knowledge to derive new, more predictive features from the raw data.
|
|
|
|
<p><img src='images/pipeline_overview_1.png' width=1080></p>
|
|
|
|
**Table of Contents**
|
|
<br>
|
|
In this notebook, we will load data from the Parquet file format into a Dask DataFrame and perform various data transformations and exploratory data analysis. This notebook covers the following sections:
|
|
1. [Quick Recap](#s2-1)
|
|
2. [Data Exploration and Data Visualization](#s2-2)
|
|
* [Plotly](#s2-2.1)
|
|
* [Summarize](#s2-2.2)
|
|
* [Visualizing Distribution](#s2-2.3)
|
|
* [Exercise #1 - Histogram with GPU](#s2-e1)
|
|
* [Exercise #2 - Histogram with Log Scale](#s2-e2)
|
|
* [GroupBy Summarize](#s2-2.4)
|
|
* [Exercise #3 - Probability Bar Chart](#s2-e3)
|
|
* [User Features]()
|
|
* [Exercise #4 - Custom GroupBy Aggregation](#s2-e4)
|
|
* [Exercise #5 - Time-Series Analysis](#s2-e5)
|
|
* [Pivot Table](#s2-2.5)
|
|
3. [Summary](#s2-3)
|
|
|
|
<a name='s2-1'></a>
|
|
## Quick Recap ##
|
|
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
|
|
* Reading data without a schema or specifying `dtype`
|
|
* Having too many partitions due to small `chunksize`
|
|
* Memory spilling due to partitions being too large
|
|
* Performing groupby operations on too many groups scattered across multiple partitions
|
|
|
|
Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
|
|
|
|
<a name='s2-2'></a>
|
|
## Data Exploration and Data Visualization ##
|
|
Exploratory data analysis involves identification of predictor/feature variables and the target/class variable. We use this time to understand the distribution of the features and identify potentially problematic outliers. Data exploration helps users understand the data in order to better tackle a problem. It can be a way to ascertain the validity of the data as we begin to look for useful features that will help in the following stages of the development workflow.
|
|
|
|
<a name='s2-2.1'></a>
|
|
### Plotly ###
|
|
**Plotly** [[Doc]](https://plotly.com/) is a popular library for graphing and data dashboards. Plotly uses `plotly.graph_objects` to create figures for data visualization. Graph objects can be created with `plotly.express` or built from the ground up. In order for Plotly to make a graph, the data needs to be on the host, not the GPU. If the dataset is small, it may be more efficient to use `pandas` instead of `cudf` or `Dask-cuDF`. However, if the dataset is large, sending data to the GPU is a great way to speed up computation before sending it to the host for visualization. When using GPU acceleration with Plotly, only move the GPU DataFrame(s) to the host at the end with `to_pandas()`, as opposed to converting the entire GPU DataFrame(s) to pandas immediately. This lets us take advantage of GPU acceleration for processing.
|
|
|
|
For more information about how to use Plotly, we recommend [this guide](https://plotly.com/python/getting-started/).
|
|
|
|
We start by initiating the `LocalCUDACluster()` and Dask `Client()`, followed by loading data from the Parquet files into a Dask DataFrame.
|
|
|
|
|
|
```python
|
|
# import dependencies
|
|
from dask.distributed import Client, wait
|
|
from dask_cuda import LocalCUDACluster
|
|
import cudf
|
|
import dask_cudf
|
|
import numpy as np
|
|
import cupy as cp
|
|
import plotly.express as px
|
|
import gc
|
|
|
|
# create cluster
|
|
cluster=LocalCUDACluster()
|
|
|
|
# instantiate client
|
|
client=Client(cluster)
|
|
```
|
|
|
|
|
|
```python
|
|
# get the machine's external IP address
|
|
from requests import get
|
|
|
|
ip=get('https://api.ipify.org').content.decode('utf8')
|
|
|
|
print(f'Dask dashboard (status) address is: http://{ip}:8787/status')
|
|
print(f'Dask dashboard (GPU) address is: http://{ip}:8787/gpu')
|
|
```
|
|
|
|
|
|
```python
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet')
|
|
|
|
print(f'Total of {len(ddf)} records split across {ddf.npartitions} partitions. ')
|
|
|
|
ddf.dtypes
|
|
```
|
|
|
|
<p><img src='images/tip.png' width=720></p>
|
|
The Parquet file format includes metadata to inform `Dask-cuDF` which data types to use for each column.
|
|
|
|
|
|
```python
|
|
# create continuous and categorical column lists
|
|
continuous_cols=['price', 'target', 'ts_hour', 'ts_minute', 'ts_weekday', 'ts_day', 'ts_month', 'ts_year']
|
|
categorical_cols=['event_type', 'category_code', 'brand', 'user_session', 'session_product', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'product_id', 'category_id', 'user_id']
|
|
```
|
|
|
|
|
|
```python
|
|
# preview DataFrame
|
|
ddf.head()
|
|
```
|
|
|
|
<a name='s2-2.2'></a>
|
|
### Summarize ###
|
|
We can use the `describe()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.describe/) method to generate summary statistics for continuous features.
|
|
|
|
|
|
```python
|
|
# generate summary statistics for continuous features
|
|
ddf[continuous_cols].describe().compute().to_pandas().apply(lambda s: s.apply('{0:.2f}'.format))
|
|
```
|
|
|
|
For categorical values, we are often interested in the [cardinality](https://en.wikipedia.org/wiki/Cardinality) of each feature. Cardinality is the number of unique elements a set contains. We use `.nunique()` to get the number of possible values for each categorical feature, as it informs how they can be encoded for machine learning model consumption.
|
|
|
|
|
|
```python
|
|
# count number of unique values for categorical features
|
|
ddf[categorical_cols].nunique().compute()
|
|
```
|
|
|
|
Note that in the previous step, we added `read_parquet()` to the task graph but did not `.persist()` data in memory. Recall that the Dask DataFrame APIs build the task graph until `.compute()`. The result of `.compute()` is a cuDF DataFrame and should be small.
|
|
|
|
**Observations**:
|
|
* The dataset has an ~41% purchase rate
|
|
* All data come from March of 2020
|
|
* Price has a very large standard deviation
|
|
|
|
<a name='s2-2.3'></a>
|
|
### Visualizing Distribution ###
|
|
A histogram is a graph that shows the frequency of data using rectangles. It's used to visualize the distribution of the data so we can quickly approximate concentration, skewness, and variability. We will use histograms to identify popularity characteristics in the dataset.
|
|
|
|
When using Plotly or other visualization libraries, it's best to keep the data on the GPU for as long as possible and only move it to the host when needed. For example, instead of relying on Plotly Express's `.histogram()`[[doc]](https://plotly.com/python/histograms/) function, we can use the `.value_counts()` method to count the number of occurrences. We can then pass the results to the `.bar()` function to generate a frequency bar chart. This can yield faster results by enabling GPU acceleration. Furthermore, we can use `.nlargest()` to limit the number of bars in the chart.
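As a sketch of this count-then-plot pattern, shown here with pandas (whose `Series` API cuDF mirrors) and hypothetical brand names:

```python
import pandas as pd

# a small stand-in Series; with cuDF the same methods run on the GPU
brands = pd.Series(['acme', 'acme', 'acme', 'zenith', 'zenith', 'orbit'])

# count occurrences, then keep only the top n bars before moving data to the host
counts = brands.value_counts().nlargest(2)
# acme -> 3, zenith -> 2
```

Only the small `counts` result would then be converted with `to_pandas()` and handed to `px.bar()`.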
|
|
|
|
We want to visualize the distribution for specific features. Now that the data is in Parquet file format, we can use column pruning to only read in one column at a time to reduce the memory burden.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='cat_0'
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
|
|
|
|
# create histogram
|
|
px.histogram(
|
|
# move data to CPU
|
|
ddf[cat_col].compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=cat_col,
|
|
title=f'Distribution of {cat_col}'
|
|
)
|
|
```
|
|
|
|
<a name='s2-e1'></a>
|
|
### Exercise #1 - Histogram with GPU ###
|
|
Instead of generating the histogram on the CPU, we can use `.value_counts()` to achieve similar results.
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the frequency of each `cat_0` value.
|
|
* Compare the performance efficiency with the previous CPU approach.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='cat_0'
|
|
n_bars=25
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
|
|
|
|
# create frequency count DataFrame
|
|
cat_count_df=<<<<FIXME>>>>
|
|
|
|
# create histogram
|
|
px.bar(
|
|
# move data to CPU
|
|
cat_count_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=cat_col,
|
|
title=f'Distribution of {cat_col}'
|
|
)
|
|
```
|
|
```python
%%time
# set cat column of interest
cat_col='cat_0'
n_bars=25

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)

# create frequency count DataFrame
cat_count_df=ddf[cat_col].value_counts().nlargest(n_bars)

# create histogram
px.bar(
    # move data to CPU
    cat_count_df.compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=cat_col,
    title=f'Distribution of {cat_col}'
)
```

Click ... to show **solution**.
|
|
|
|
Using cuDF to calculate the frequency is much more efficient. For continuous features, we often have to bin the values into buckets. We can use `cudf.Series.digitize()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.series.digitize/), but it's not implemented for `dask_cudf.core.Series`, so we have to use `.map_partitions()` to perform the `cudf.Series.digitize()` method on each partition.
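For reference, `cudf.Series.digitize()` follows NumPy's `np.digitize()` semantics: each value is mapped to the index of the bin it falls into. A minimal NumPy sketch with made-up price bins:

```python
import numpy as np

# hypothetical price bin edges
bins = np.array([0, 50, 100, 150], dtype='float32')
prices = np.array([25.0, 75.0, 120.0])

# each price maps to the index i of its bin, where bins[i-1] <= x < bins[i]
bin_ids = np.digitize(prices, bins)
# -> [1, 2, 3]
```

The resulting bin indices are what `.value_counts()` then tallies to build the histogram.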
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cont column of interest
|
|
cont_col='price'
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
|
|
|
|
# set bin
|
|
bins=np.array(range(-1, 10000, 50)).astype('float32')
|
|
|
|
# create frequency count DataFrame
|
|
cont_hist_df=ddf[cont_col].map_partitions(lambda p: p.digitize(bins)).value_counts()
|
|
|
|
# create histogram
|
|
px.bar(
|
|
# move data to CPU
|
|
cont_hist_df.compute().to_pandas(),
|
|
).update_xaxes(
|
|
tickmode='array',
|
|
tickvals=np.array(range(1, len(bins))),
|
|
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx in range(len(bins)-1)],
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=f'{cont_col} Bin',
|
|
title=f'Distribution of {cont_col}'
|
|
)
|
|
```
|
|
|
|
<a name='s2-e2'></a>
|
|
### Exercise #2 - Histogram with Log Scale ###
|
|
`price` is positively skewed. We might get a better view of the distribution by creating bins on a logarithmic scale. We will create bin edges using `numpy.logspace()` and pass them to `cudf.Series.digitize()`.
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the frequency of `price` bins in log scale.
|
|
|
|
|
|
```python
|
|
# set cont column of interest
|
|
cont_col='price'
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
|
|
|
|
# set bin
|
|
bins=np.logspace(0, 5).astype('float32')
|
|
|
|
# create frequency count DataFrame
|
|
cont_hist_df=ddf['price'].map_partitions(<<<<FIXME>>>>).value_counts()
|
|
|
|
# create histogram
|
|
px.bar(
|
|
# move data to CPU
|
|
cont_hist_df.compute().to_pandas(),
|
|
).update_xaxes(
|
|
tickmode='array',
|
|
tickvals=np.array(range(1, len(bins))),
|
|
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx in range(len(bins)-1)]
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=f'{cont_col} Bin',
|
|
title=f'Distribution of {cont_col}'
|
|
)
|
|
```
|
|
```python
# set cont column of interest
cont_col='price'

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)

# set bins in log scale
bins=np.logspace(0, 5).astype('float32')

# create frequency count DataFrame
cont_hist_df=ddf['price'].map_partitions(lambda p: p.digitize(bins)).value_counts()

# create histogram
px.bar(
    # move data to CPU
    cont_hist_df.compute().to_pandas(),
).update_xaxes(
    tickmode='array',
    tickvals=np.array(range(1, len(bins))),
    ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx in range(len(bins)-1)]
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=f'{cont_col} Bin',
    title=f'Distribution of {cont_col}'
)
```

Click ... to show **solution**.
|
|
|
|
**Observations**:
|
|
* The vast majority of products are priced below $300
|
|
|
|
<a name='s2-2.4'></a>
|
|
### GroupBy Summarize ###
|
|
We can use a variety of groupby aggregations to learn about the data. The aggregations supported by cuDF, as described [here](https://docs.rapids.ai/api/cudf/stable/user_guide/groupby/#aggregation), are very efficient. We might be interested in exploring several variations. To make execution more efficient, we can `.persist()` the data in memory after reading from the source, so subsequent operations will not require loading from the source again.
|
|
|
|
For example, we can visualize the probability of a positive outcome for each category. When the target column is a binary indicator, the group-wise arithmetic mean is exactly that _probability_, so a simple mean aggregation gives us the answer quickly.
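The mean-equals-probability identity can be sketched in plain Python with a made-up target list:

```python
# mean of a 0/1 target is the fraction of positive outcomes, i.e. a probability
targets = [1, 0, 1, 1, 0]
purchase_rate = sum(targets) / len(targets)
# -> 0.6
```

Grouping first and then taking this mean per group yields a per-category probability.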
|
|
|
|
We use `.groupby()` on `cat_0`, followed by `.agg('mean')` on `target` to determine the probability of positive outcome for each `cat_0` group.
|
|
|
|
|
|
```python
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet')
|
|
|
|
# persist data in memory
|
|
ddf=ddf.persist()
|
|
wait(ddf)
|
|
```
|
|
|
|
|
|
```python
|
|
# set cat column of interest
|
|
cat_col='cat_0'
|
|
n_bars=25
|
|
|
|
# create groupby probability DataFrame
|
|
cat_target_df=ddf.groupby(cat_col)['target'].agg('mean').nlargest(n_bars)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
cat_target_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Probability',
|
|
xaxis_title=cat_col,
|
|
title=f'Probability of {cat_col}'
|
|
)
|
|
```
|
|
|
|
Some categories have a higher probability than others.
|
|
|
|
Other groupby aggregations include:

* How the purchase probability varies across each time of the week

```python
# show probability of each ts_weekday and ts_hour group
ddf.groupby(['ts_weekday', 'ts_hour'])['target'].agg('mean')
```
|
|
|
|
<p><img src='images/tip.png' width=720></p>
|
|
|
|
`.groupby().size()` or `.groupby().agg('size')` is very similar to `.value_counts()`.
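On a regular pandas Series (cuDF follows the same API), the equivalence in the tip above can be sketched as:

```python
import pandas as pd

events = pd.Series(['view', 'cart', 'view', 'view'])

vc = events.value_counts()          # sorted by count: view -> 3, cart -> 1
gb = events.groupby(events).size()  # same counts, sorted by group key
# both give {'view': 3, 'cart': 1} as a mapping
```

The counts are identical; only the ordering of the result differs.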
|
|
|
|
<a name='s2-e3'></a>
|
|
### Exercise #3 - Probability Bar Chart ###
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the probability of each `ts_hour` value.
|
|
|
|
|
|
```python
|
|
# set cat column of interest
|
|
cat_col=<<<<FIXME>>>>
|
|
n_bars=25
|
|
|
|
# create groupby probability DataFrame
|
|
cat_target_df=ddf.groupby(cat_col)['target'].agg('mean').nlargest(n_bars)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
cat_target_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Probability',
|
|
xaxis_title=cat_col,
|
|
title=f'Probability of {cat_col}'
|
|
)
|
|
```
|
|
```python
# set cat column of interest
cat_col='ts_hour'
n_bars=25

# create groupby probability DataFrame
cat_target_df=ddf.groupby(cat_col)['target'].agg('mean').nlargest(n_bars)

# create bar chart
px.bar(
    # move data to CPU
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability',
    xaxis_title=cat_col,
    title=f'Probability of {cat_col}'
)
```

Click ... to show **solution**.
|
|
|
|
Some aggregations, such as `median`, `mode`, and `nunique`, require all data within the same group to be in memory. For these operations, `.groupby().apply()` is used. Because `.groupby().apply()` performs a shuffle, these operations scale poorly with a large number of groups.
|
|
|
|
We use `.groupby()` on `brand` and `.apply()` on `SeriesGroupBy['user_id'].nunique()` to get the number of unique customers that have interacted with each brand.
|
|
|
|
|
|
```python
|
|
# set cat columns of interest
|
|
cat_col='brand'
|
|
group_statistic='user_id'
|
|
|
|
# create groupby summarize DataFrame
|
|
product_frequency=ddf.groupby(cat_col)[group_statistic].apply(lambda g: g.nunique(), meta=(f'{group_statistic}_count', 'int32')).nlargest(25)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
product_frequency.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title=f'Number of Unique {group_statistic}',
|
|
xaxis_title=cat_col,
|
|
title=f'Number of Unique {group_statistic} per {cat_col}'
|
|
)
|
|
```
|
|
|
|
|
|
```python
|
|
# visualize graph
|
|
product_frequency.visualize(rankdir='LR')
|
|
```
|
|
|
|
Certain brands and categories have a higher penetration and higher probability of positive outcome.
|
|
|
|
Other groupby aggregations include:
|
|
* How many unique `customer_id` are in each `product_id` group
|
|
|
|
```python
# show how many customers interacted with each product
ddf.groupby('product_id')['user_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
* How many unique `cat_0` are in each `brand` group
|
|
|
|
```python
# show how many product categories each brand carries
ddf.groupby('brand')['cat_0'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
* How many unique `product_id` are in each `user_session` group
|
|
|
|
```python
# show how many products are viewed in each session
ddf.groupby('user_session')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
* How many unique `product_id` are in each `user_id` group
|
|
|
|
```python
# show how many products each user interacts with
ddf.groupby('user_id')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
<p><img src='images/tip.png' width=720></p>
|
|
|
|
For Dask, we can ensure the result of `.groupby()` is sorted using the `sort` parameter.
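A small pandas sketch of the `sort` parameter's effect (Dask's `groupby` exposes a similar parameter; treat the exact Dask behavior as version-dependent):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'c'], 'val': [1, 2, 3, 4]})

# sort=True returns the group labels in sorted key order: a, b, c
sizes = df.groupby('key', sort=True).size()
```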
|
|
|
|
Sometimes we want to perform custom aggregations that are not yet supported. For custom aggregations, we can use `.groupby().apply()` and user-defined functions. For example, we might be interested in the range of `price` for each `category_code`. Arithmetically, this is done by taking the difference between the group-specific maximum and the group-specific minimum. We can normalize the range by dividing it by the group-specific mean.
|
|
|
|
It's best to avoid using `.groupby().apply()` when possible. Similar results can be calculated by using `.groupby().agg()` to obtain the `max`, `min`, and `mean` separately, then applying a row-wise calculation with `.apply()`. This can be more efficient.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='category_code'
|
|
|
|
# define group-wise function
|
|
def normalized_range(group):
|
|
return (group.max()-group.min())/group.mean()
|
|
|
|
# create groupby apply DataFrame
|
|
normalized_range_df=ddf.groupby(cat_col)['price'].apply(normalized_range, meta=('normalize_range', 'float64')).nlargest(25)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
normalized_range_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Normalize Range',
|
|
xaxis_title=cat_col,
|
|
title=f'Normalize Range of price per {cat_col}'
|
|
)
|
|
```
|
|
|
|
|
|
```python
|
|
# visualize graph
|
|
normalized_range_df.visualize(rankdir='LR')
|
|
```
|
|
|
|
<a name='s2-e4'></a>
|
|
### Exercise #4 - Custom GroupBy Aggregation ###
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the normalized range of each `category_code`.
|
|
* Compare the performance efficiency with the previous `.groupby().apply()` approach.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='category_code'
|
|
|
|
# define row-wise function
|
|
def normalized_range(group):
|
|
return <<<<FIXME>>>>
|
|
|
|
# create groupby aggregate DataFrame
|
|
normalized_range_df=ddf.groupby(cat_col)['price'].agg(['max', 'min', 'mean']).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
normalized_range_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Normalize Range',
|
|
xaxis_title=cat_col,
|
|
title=f'Normalize Range of price per {cat_col}'
|
|
)
|
|
```
|
|
```python
# set cat column of interest
cat_col='category_code'

# define row-wise function
def normalized_range(group):
    return (group['max']-group['min'])/group['mean']

# create groupby aggregate DataFrame
normalized_range_df=ddf.groupby(cat_col)['price'].agg(['max', 'min', 'mean']).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalize Range',
    xaxis_title=cat_col,
    title=f'Normalize Range of price per {cat_col}'
)
```

Click ... to show **solution**.
|
|
|
|
|
|
```python
|
|
# visualize graph
|
|
normalized_range_df.visualize(rankdir='LR')
|
|
```
|
|
|
|
We can apply predicate pushdown filters when reading from Parquet files with the `filters` parameter. This enables Dask-cuDF to skip row groups and files where _none_ of the rows can satisfy the criteria. This works best when the partition layout is designed around the filter column. Since that is not the case for our dataset, we will apply a separate filter after importing the data.
|
|
|
|
<a name='s2-e5'></a>
|
|
### Exercise #5 - Time-Series Analysis ###
|
|
When dealing with time-series data, cuDF provides powerful `.rolling()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.rolling/) and `.resample()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.resample/) methods to perform window operations. Functionally, they behave very similarly to `.groupby()` operations. We use `.resample()` followed by `.interpolate()` to find the frequency and probability over the entire span of the dataset.
|
|
|
|
We use `.map_partitions()` to perform `.resample()` on each partition. Because `.map_partitions()` doesn't perform a shuffle first, we manually perform a `.shuffle()` to ensure all members of each group are together. Once we have all the needed data in the same partitions, we can use `.map_partitions()` and pass the cuDF DataFrame `.resample()` operation.
|
|
|
|
**Instructions**: <br>
|
|
* Execute the below cell to clear memory.
|
|
* Execute the cell below to read data into memory and shuffle based on `ts_day`. This ensures that all records belonging to the same group are in the same partition.
|
|
* Execute the cell below to show the user shopping behavior.
|
|
* Modify the `resample_frequency` to various frequencies to look for more obvious patterns.
|
|
|
|
|
|
```python
|
|
del ddf
|
|
gc.collect()
|
|
```
|
|
|
|
|
|
```python
|
|
# read data with predicate pushdown
|
|
ddf=dask_cudf.read_parquet('clean_parquet', filters=[('ts_day', "<", 15)])
|
|
|
|
# apply filtering
|
|
ddf=ddf[ddf['ts_day']<15]
|
|
|
|
# shuffle first on ts_day
|
|
ddf=ddf.shuffle('ts_day')
|
|
```
|
|
|
|
|
|
```python
|
|
# set resample frequency
|
|
resample_frequency='3h'
|
|
|
|
# get time-series DataFrame
|
|
activity_amount_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time').size().interpolate('linear'))
|
|
purchase_rate_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time')['target'].mean().interpolate('linear'))
|
|
|
|
# create scatter plot
|
|
px.scatter(
|
|
# move data to CPU
|
|
activity_amount_trend.compute().to_pandas().sort_index(),
|
|
color=purchase_rate_trend.compute().to_pandas().sort_index()
|
|
).update_traces(
|
|
mode='markers+lines'
|
|
).update_layout(
|
|
yaxis_title='Number of Records',
|
|
xaxis_title='event_time',
|
|
title=f'Amount of Transactions Over Time'
|
|
)
|
|
```
|
|
|
|
<a name='s2-2.5'></a>
|
|
### Pivot Table ###
|
|
When the data is small enough to fit on a single GPU, it's often faster to perform data transformations with cuDF. Below we read a few numerical columns, which fit nicely in memory. We use `.pivot_table()` to find the probability and frequency for each `ts_hour` and `ts_weekday` group.
|
|
|
|
|
|
```python
|
|
# read data
|
|
gdf=cudf.read_parquet('clean_parquet', columns=['ts_weekday', 'ts_hour', 'target'])
|
|
```
|
|
|
|
|
|
```python
|
|
# create pivot table
|
|
activity_amount=gdf.pivot_table(index=['ts_weekday'], columns=['ts_hour'], values=['target'], aggfunc='size')['target']
|
|
|
|
# create heatmap
|
|
px.imshow(
|
|
# move data to CPU
|
|
activity_amount.to_pandas(),
|
|
title='there is more activity in the day'
|
|
).update_layout(
|
|
title=f'Number of Records Heatmap'
|
|
)
|
|
```
|
|
|
|
|
|
```python
|
|
# create pivot table
|
|
purchase_rate=gdf[['target', 'ts_weekday', 'ts_hour']].pivot_table(index=['ts_weekday'], columns=['ts_hour'], aggfunc='mean')['target'].to_pandas()
|
|
|
|
# create heatmap
|
|
px.imshow(
|
|
# move data to CPU
|
|
purchase_rate,
|
|
title='there is potentially a higher purchase rate in the evening'
|
|
).update_layout(
|
|
title=f'Probability Heatmap'
|
|
)
|
|
```
|
|
|
|
**Observations**:
|
|
* Behavior changes with `ts_weekday` and `ts_hour` - e.g., during the week, users do not stay up as late because they work the next day.
|
|
|
|
<a name='s2-3'></a>
|
|
## Summary ##
|
|
* `.groupby().apply()` requires shuffling, which is time-expensive. When possible, try to use `.groupby().agg()` instead
|
|
* Keeping data processing on the GPU can help generate visualizations quickly
|
|
* Use predicate filtering and column pruning to reduce the amount of data read into memory. When data size is small, processing on cuDF can be more efficient than Dask-cuDF
|
|
* Use `.persist()` if subsequent operations are exploratory
|
|
* `.map_partitions()` does not involve shuffling
|
|
|
|
|
|
```python
|
|
# clean GPU memory
|
|
import IPython
|
|
app = IPython.Application.instance()
|
|
app.kernel.do_shutdown(True)
|
|
```
|
|
|
|
**Well Done!** Let's move to the [next notebook](1_03_categorical_feature_engineering.ipynb).
|
|
|
|
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
|