<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
|
|
|
|
# Enhancing Data Science Outcomes With Efficient Workflow #
|
|
|
|
## 02 - Data Exploration and Data Visualization ##
|
|
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, uses domain knowledge to derive new, more predictive features from the raw data.
|
|
|
|
<p><img src='images/pipeline_overview_1.png' width=1080></p>
|
|
|
|
**Table of Contents**
|
|
<br>
|
|
In this notebook, we will load data from the Parquet file format into a Dask DataFrame and perform various data transformations and exploratory data analysis. This notebook covers the following sections:
|
|
1. [Quick Recap](#s2-1)
|
|
2. [Data Exploration and Data Visualization](#s2-2)
|
|
* [Plotly](#s2-2.1)
|
|
* [Summarize](#s2-2.2)
|
|
* [Visualizing Distribution](#s2-2.3)
|
|
* [Exercise #1 - Histogram with GPU](#s2-e1)
|
|
* [Exercise #2 - Histogram with Log Scale](#s2-e2)
|
|
* [GroupBy Summarize](#s2-2.4)
|
|
* [Exercise #3 - Probability Bar Chart](#s2-e3)
|
|
* [User Features]()
|
|
* [Exercise #4 - Custom GroupBy Aggregation](#s2-e4)
|
|
* [Exercise #5 - Time-Series Analysis](#s2-e5)
|
|
* [Pivot Table](#s2-2.5)
|
|
3. [Summary](#s2-3)
|
|
|
|
<a name='s2-1'></a>
|
|
## Quick Recap ##
|
|
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
|
|
* Reading data without a schema or specifying `dtype`
|
|
* Having too many partitions due to small `chunksize`
|
|
* Memory spilling due to partitions being too large
|
|
* Performing groupby operations on too many groups scattered across multiple partitions
|
|
|
|
Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
|
|
|
|
<a name='s2-2'></a>
|
|
## Data Exploration and Data Visualization ##
|
|
Exploratory data analysis involves identification of predictor/feature variables and the target/class variable. We use this time to understand the distribution of the features and identify potentially problematic outliers. Data exploration helps users understand the data in order to better tackle a problem. It can be a way to ascertain the validity of the data as we begin to look for useful features that will help in the following stages of the development workflow.
|
|
|
|
<a name='s2-2.1'></a>
|
|
### Plotly ###
|
|
**Plotly** [[Doc]](https://plotly.com/) is a popular library for graphing and data dashboards. Plotly uses `plotly.graph_objects` to create figures for data visualization. Graph objects can be created with `plotly.express` or built from the ground up. In order for Plotly to make a graph, the data needs to be on the host, not the GPU. If the dataset is small, it may be more efficient to use `pandas` instead of `cudf` or `Dask-cuDF`. However, if the dataset is large, sending data to the GPU is a great way to speed up computation before sending it to the host for visualization. When using GPU acceleration with Plotly, only move the GPU DataFrame(s) to the host at the end with `to_pandas()`, as opposed to converting the entire GPU DataFrame(s) to pandas immediately. This lets us take advantage of GPU acceleration for processing.
|
|
|
|
For more information about how to use Plotly, we recommend [this guide](https://plotly.com/python/getting-started/).
|
|
|
|
We start by initiating the `LocalCUDACluster()` and Dask `Client()`, followed by loading data from the Parquet files into a Dask DataFrame.
|
|
|
|
|
|
```python
|
|
# import dependencies
|
|
from dask.distributed import Client, wait
|
|
from dask_cuda import LocalCUDACluster
|
|
import cudf
|
|
import dask_cudf
|
|
import numpy as np
|
|
import cupy as cp
|
|
import plotly.express as px
|
|
import gc
|
|
|
|
# create cluster
|
|
cluster=LocalCUDACluster()
|
|
|
|
# instantiate client
|
|
client=Client(cluster)
|
|
```
|
|
|
|
|
|
```python
|
|
# get the machine's external IP address
|
|
from requests import get
|
|
|
|
ip=get('https://api.ipify.org').content.decode('utf8')
|
|
|
|
print(f'Dask dashboard (status) address is: http://{ip}:8787/status')
|
|
print(f'Dask dashboard (GPU) address is: http://{ip}:8787/gpu')
|
|
```
|
|
|
|
|
|
```python
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet')
|
|
|
|
print(f'Total of {len(ddf)} records split across {ddf.npartitions} partitions. ')
|
|
|
|
ddf.dtypes
|
|
```
|
|
|
|
<p><img src='images/tip.png' width=720></p>
|
|
The Parquet file format includes metadata to inform `Dask-cuDF` which data types to use for each column.
|
|
|
|
|
|
```python
|
|
# create continuous and categorical column lists
|
|
continuous_cols=['price', 'target', 'ts_hour', 'ts_minute', 'ts_weekday', 'ts_day', 'ts_month', 'ts_year']
|
|
categorical_cols=['event_type', 'category_code', 'brand', 'user_session', 'session_product', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'product_id', 'category_id', 'user_id']
|
|
```
|
|
|
|
|
|
```python
|
|
# preview DataFrame
|
|
ddf.head()
|
|
```
|
|
|
|
<a name='s2-2.2'></a>
|
|
### Summarize ###
|
|
We can use the `describe()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.describe/) method to generate summary statistics for continuous features.
|
|
|
|
|
|
```python
|
|
# generate summary statistics for continuous features
|
|
ddf[continuous_cols].describe().compute().to_pandas().apply(lambda s: s.apply('{0:.2f}'.format))
|
|
```
|
|
|
|
For categorical values, we are often interested in the [cardinality](https://en.wikipedia.org/wiki/Cardinality) of each feature. Cardinality is the number of unique elements a set contains. We use `.nunique()` to get the number of possible values for each categorical feature, as it informs how they can be encoded for machine learning model consumption.
|
|
|
|
|
|
```python
|
|
# count number of unique values for categorical features
|
|
ddf[categorical_cols].nunique().compute()
|
|
```
|
|
|
|
Note that in the previous step, we added `read_parquet()` to the task graph but did not `.persist()` data in memory. Recall that the Dask DataFrame APIs build the task graph until `.compute()`. The result of `.compute()` is a cuDF DataFrame and should be small.
|
|
|
|
**Observations**:
|
|
* The dataset has an ~41% purchase rate
|
|
* All data come from March of 2020
|
|
* Price has a very large standard deviation
|
|
|
|
<a name='s2-2.3'></a>
|
|
### Visualizing Distribution ###
|
|
A histogram is a graph that shows the frequency of data using rectangles. It's used to visualize the distribution of the data so we can quickly approximate concentration, skewness, and variability. We will use histograms to identify popularity characteristics in the dataset.
|
|
|
|
When using Plotly or other visualization libraries, it's best to keep the data on the GPU for as long as possible and only move it to the host when needed. For example, instead of relying on Plotly Express's `.histogram()`[[doc]](https://plotly.com/python/histograms/) function, we can use the `.value_counts()` method to count the number of occurrences. We can then pass the results to the `.bar()` function to generate a frequency bar chart. This can yield faster results by enabling GPU acceleration. Furthermore, we can use `.nlargest()` to limit the number of bars in the chart.
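As a sketch of this count-then-plot pattern, shown here with pandas (whose `Series` API cuDF mirrors) and hypothetical brand names:

```python
import pandas as pd

# a small stand-in Series; with cuDF the same methods run on the GPU
brands = pd.Series(['acme', 'acme', 'acme', 'zenith', 'zenith', 'orbit'])

# count occurrences, then keep only the top n bars before moving data to the host
counts = brands.value_counts().nlargest(2)
# acme -> 3, zenith -> 2
```

Only the small `counts` result would then be converted with `to_pandas()` and handed to `px.bar()`.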
|
|
|
|
We want to visualize the distribution for specific features. Now that the data is in Parquet file format, we can use column pruning to only read in one column at a time to reduce the memory burden.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='cat_0'
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
|
|
|
|
# create histogram
|
|
px.histogram(
|
|
# move data to CPU
|
|
ddf[cat_col].compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=cat_col,
|
|
title=f'Distribution of {cat_col}'
|
|
)
|
|
```
|
|
|
|
<a name='s2-e1'></a>
|
|
### Exercise #1 - Histogram with GPU ###
|
|
Instead of generating the histogram on the CPU, we can use `.value_counts()` to achieve similar results.
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the frequency of each `cat_0` value.
|
|
* Compare the performance efficiency with the previous CPU approach.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='cat_0'
|
|
n_bars=25
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
|
|
|
|
# create frequency count DataFrame
|
|
cat_count_df=<<<<FIXME>>>>
|
|
|
|
# create histogram
|
|
px.bar(
|
|
# move data to CPU
|
|
cat_count_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=cat_col,
|
|
title=f'Distribution of {cat_col}'
|
|
)
|
|
```
|
|
```python
%%time
# set cat column of interest
cat_col='cat_0'
n_bars=25

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)

# create frequency count DataFrame
cat_count_df=ddf[cat_col].value_counts().nlargest(n_bars)

# create histogram
px.bar(
    # move data to CPU
    cat_count_df.compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=cat_col,
    title=f'Distribution of {cat_col}'
)
```

Click ... to show **solution**.
|
|
|
|
Using cuDF to calculate the frequency is much more efficient. For continuous features, we often have to bin the values into buckets. We can use `cudf.Series.digitize()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.series.digitize/), but it's not implemented for `dask_cudf.core.Series`, so we have to use `.map_partitions()` to perform the `cudf.Series.digitize()` method on each partition.
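For reference, `cudf.Series.digitize()` follows NumPy's `np.digitize()` semantics: each value is mapped to the index of the bin it falls into. A minimal NumPy sketch with made-up price bins:

```python
import numpy as np

# hypothetical price bin edges
bins = np.array([0, 50, 100, 150], dtype='float32')
prices = np.array([25.0, 75.0, 120.0])

# each price maps to the index i of its bin, where bins[i-1] <= x < bins[i]
bin_ids = np.digitize(prices, bins)
# -> [1, 2, 3]
```

The resulting bin indices are what `.value_counts()` then tallies to build the histogram.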
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cont column of interest
|
|
cont_col='price'
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
|
|
|
|
# set bin
|
|
bins=np.array(range(-1, 10000, 50)).astype('float32')
|
|
|
|
# create frequency count DataFrame
|
|
cont_hist_df=ddf[cont_col].map_partitions(lambda p: p.digitize(bins)).value_counts()
|
|
|
|
# create histogram
|
|
px.bar(
|
|
# move data to CPU
|
|
cont_hist_df.compute().to_pandas(),
|
|
).update_xaxes(
|
|
tickmode='array',
|
|
tickvals=np.array(range(1, len(bins))),
|
|
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx in range(len(bins)-1)],
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=f'{cont_col} Bin',
|
|
title=f'Distribution of {cont_col}'
|
|
)
|
|
```
|
|
|
|
<a name='s2-e2'></a>
|
|
### Exercise #2 - Histogram with Log Scale ###
|
|
`price` is positively skewed. We might get a better view of the distribution by creating bins on a logarithmic scale. We will create bin edges using `numpy.logspace()` and pass them to `cudf.Series.digitize()`.
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the frequency of `price` bins in log scale.
|
|
|
|
|
|
```python
|
|
# set cont column of interest
|
|
cont_col='price'
|
|
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
|
|
|
|
# set bin
|
|
bins=np.logspace(0, 5).astype('float32')
|
|
|
|
# create frequency count DataFrame
|
|
cont_hist_df=ddf['price'].map_partitions(<<<<FIXME>>>>).value_counts()
|
|
|
|
# create histogram
|
|
px.bar(
|
|
# move data to CPU
|
|
cont_hist_df.compute().to_pandas(),
|
|
).update_xaxes(
|
|
tickmode='array',
|
|
tickvals=np.array(range(1, len(bins))),
|
|
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx in range(len(bins)-1)]
|
|
).update_layout(
|
|
yaxis_title='Frequency Count',
|
|
xaxis_title=f'{cont_col} Bin',
|
|
title=f'Distribution of {cont_col}'
|
|
)
|
|
```
|
|
```python
# set cont column of interest
cont_col='price'

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)

# set bins in log scale
bins=np.logspace(0, 5).astype('float32')

# create frequency count DataFrame
cont_hist_df=ddf['price'].map_partitions(lambda p: p.digitize(bins)).value_counts()

# create histogram
px.bar(
    # move data to CPU
    cont_hist_df.compute().to_pandas(),
).update_xaxes(
    tickmode='array',
    tickvals=np.array(range(1, len(bins))),
    ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx in range(len(bins)-1)]
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=f'{cont_col} Bin',
    title=f'Distribution of {cont_col}'
)
```

Click ... to show **solution**.
|
|
|
|
**Observations**:
|
|
* The vast majority of products are priced below $300
|
|
|
|
<a name='s2-2.4'></a>
|
|
### GroupBy Summarize ###
|
|
We can use a variety of groupby aggregations to learn about the data. The aggregations supported by cuDF, as described [here](https://docs.rapids.ai/api/cudf/stable/user_guide/groupby/#aggregation), are very efficient. We might be interested in exploring several variations. To make execution more efficient, we can `.persist()` the data in memory after reading from the source, so subsequent operations will not require loading from the source again.
|
|
|
|
For example, we can visualize the probability of a positive outcome for each category. When the target column is a binary indicator, the group-wise arithmetic mean is exactly that _probability_, so a simple mean aggregation gives us the answer quickly.
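The mean-equals-probability identity can be sketched in plain Python with a made-up target list:

```python
# mean of a 0/1 target is the fraction of positive outcomes, i.e. a probability
targets = [1, 0, 1, 1, 0]
purchase_rate = sum(targets) / len(targets)
# -> 0.6
```

Grouping first and then taking this mean per group yields a per-category probability.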
|
|
|
|
We use `.groupby()` on `cat_0`, followed by `.agg('mean')` on `target` to determine the probability of positive outcome for each `cat_0` group.
|
|
|
|
|
|
```python
|
|
# read data
|
|
ddf=dask_cudf.read_parquet('clean_parquet')
|
|
|
|
# persist data in memory
|
|
ddf=ddf.persist()
|
|
wait(ddf)
|
|
```
|
|
|
|
|
|
```python
|
|
# set cat column of interest
|
|
cat_col='cat_0'
|
|
n_bars=25
|
|
|
|
# create groupby probability DataFrame
|
|
cat_target_df=ddf.groupby(cat_col)['target'].agg('mean').nlargest(n_bars)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
cat_target_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Probability',
|
|
xaxis_title=cat_col,
|
|
title=f'Probability of {cat_col}'
|
|
)
|
|
```
|
|
|
|
Some categories have a higher probability than others.
|
|
|
|
Other groupby aggregations include:

* How the purchase probability varies across each time of the week

```python
# show probability of each ts_weekday and ts_hour group
ddf.groupby(['ts_weekday', 'ts_hour'])['target'].agg('mean')
```
|
|
|
|
<p><img src='images/tip.png' width=720></p>
|
|
|
|
`.groupby().size()` or `.groupby().agg('size')` is very similar to `.value_counts()`.
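On a regular pandas Series (cuDF follows the same API), the equivalence in the tip above can be sketched as:

```python
import pandas as pd

events = pd.Series(['view', 'cart', 'view', 'view'])

vc = events.value_counts()          # sorted by count: view -> 3, cart -> 1
gb = events.groupby(events).size()  # same counts, sorted by group key
# both give {'view': 3, 'cart': 1} as a mapping
```

The counts are identical; only the ordering of the result differs.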
|
|
|
|
<a name='s2-e3'></a>
|
|
### Exercise #3 - Probability Bar Chart ###
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the probability of each `ts_hour` value.
|
|
|
|
|
|
```python
|
|
# set cat column of interest
|
|
cat_col=<<<<FIXME>>>>
|
|
n_bars=25
|
|
|
|
# create groupby probability DataFrame
|
|
cat_target_df=ddf.groupby(cat_col)['target'].agg('mean').nlargest(n_bars)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
cat_target_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Probability',
|
|
xaxis_title=cat_col,
|
|
title=f'Probability of {cat_col}'
|
|
)
|
|
```
|
|
```python
# set cat column of interest
cat_col='ts_hour'
n_bars=25

# create groupby probability DataFrame
cat_target_df=ddf.groupby(cat_col)['target'].agg('mean').nlargest(n_bars)

# create bar chart
px.bar(
    # move data to CPU
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability',
    xaxis_title=cat_col,
    title=f'Probability of {cat_col}'
)
```

Click ... to show **solution**.
|
|
|
|
Some aggregations, such as `median`, `mode`, and `nunique`, require all data within the same group to be in memory. For these operations, `.groupby().apply()` is used. Because `.groupby().apply()` performs a shuffle, these operations scale poorly with a large number of groups.
|
|
|
|
We use `.groupby()` on `brand` and `.apply()` on `SeriesGroupBy['user_id'].nunique()` to get the number of unique customers that have interacted with each brand.
|
|
|
|
|
|
```python
|
|
# set cat columns of interest
|
|
cat_col='brand'
|
|
group_statistic='user_id'
|
|
|
|
# create groupby summarize DataFrame
|
|
product_frequency=ddf.groupby(cat_col)[group_statistic].apply(lambda g: g.nunique(), meta=(f'{group_statistic}_count', 'int32')).nlargest(25)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
product_frequency.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title=f'Number of Unique {group_statistic}',
|
|
xaxis_title=cat_col,
|
|
title=f'Number of Unique {group_statistic} per {cat_col}'
|
|
)
|
|
```
|
|
|
|
|
|
```python
|
|
# visualize graph
|
|
product_frequency.visualize(rankdir='LR')
|
|
```
|
|
|
|
Certain brands and categories have a higher penetration and higher probability of positive outcome.
|
|
|
|
Other groupby aggregations include:
|
|
* How many unique `customer_id` are in each `product_id` group
|
|
|
|
```python
# show how many customers interacted with each product
ddf.groupby('product_id')['user_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
* How many unique `cat_0` are in each `brand` group
|
|
|
|
```python
# show how many product categories each brand carries
ddf.groupby('brand')['cat_0'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
* How many unique `product_id` are in each `user_session` group
|
|
|
|
```python
# show how many products are viewed in each session
ddf.groupby('user_session')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
* How many unique `product_id` are in each `user_id` group
|
|
|
|
```python
# show how many products each user interacts with
ddf.groupby('user_id')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
|
|
|
|
<p><img src='images/tip.png' width=720></p>
|
|
|
|
For Dask, we can ensure the result of `.groupby()` is sorted using the `sort` parameter.
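A small pandas sketch of the `sort` parameter's effect (Dask's `groupby` exposes a similar parameter; treat the exact Dask behavior as version-dependent):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'c'], 'val': [1, 2, 3, 4]})

# sort=True returns the group labels in sorted key order: a, b, c
sizes = df.groupby('key', sort=True).size()
```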
|
|
|
|
Sometimes we want to perform custom aggregations that are not yet supported. For custom aggregations, we can use `.groupby().apply()` and user-defined functions. For example, we might be interested in the range of `price` for each `category_code`. Arithmetically, this is done by taking the difference between the group-specific maximum and the group-specific minimum. We can normalize the range by dividing it by the group-specific mean.
|
|
|
|
It's best to avoid using `.groupby().apply()` when possible. Similar results can be calculated by using `.groupby().agg()` to obtain the `max`, `min`, and `mean` separately, then applying a row-wise calculation with `.apply()`. This can be more efficient.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='category_code'
|
|
|
|
# define group-wise function
|
|
def normalized_range(group):
|
|
return (group.max()-group.min())/group.mean()
|
|
|
|
# create groupby apply DataFrame
|
|
normalized_range_df=ddf.groupby(cat_col)['price'].apply(normalized_range, meta=('normalize_range', 'float64')).nlargest(25)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
normalized_range_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Normalize Range',
|
|
xaxis_title=cat_col,
|
|
title=f'Normalize Range of price per {cat_col}'
|
|
)
|
|
```
|
|
|
|
|
|
```python
|
|
# visualize graph
|
|
normalized_range_df.visualize(rankdir='LR')
|
|
```
|
|
|
|
<a name='s2-e4'></a>
|
|
### Exercise #4 - Custom GroupBy Aggregation ###
|
|
|
|
**Instructions**: <br>
|
|
* Modify the `<FIXME>` only and execute the below cell to visualize the normalized range of each `category_code`.
|
|
* Compare the performance efficiency with the previous `.groupby().apply()` approach.
|
|
|
|
|
|
```python
|
|
%%time
|
|
# set cat column of interest
|
|
cat_col='category_code'
|
|
|
|
# define row-wise function
|
|
def normalized_range(group):
|
|
return <<<<FIXME>>>>
|
|
|
|
# create groupby aggregate DataFrame
|
|
normalized_range_df=ddf.groupby(cat_col)['price'].agg(['max', 'min', 'mean']).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)
|
|
|
|
# create bar chart
|
|
px.bar(
|
|
# move data to CPU
|
|
normalized_range_df.compute().to_pandas()
|
|
).update_layout(
|
|
yaxis_title='Normalize Range',
|
|
xaxis_title=cat_col,
|
|
title=f'Normalize Range of price per {cat_col}'
|
|
)
|
|
```
|
|
```python
# set cat column of interest
cat_col='category_code'

# define row-wise function
def normalized_range(group):
    return (group['max']-group['min'])/group['mean']

# create groupby aggregate DataFrame
normalized_range_df=ddf.groupby(cat_col)['price'].agg(['max', 'min', 'mean']).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalize Range',
    xaxis_title=cat_col,
    title=f'Normalize Range of price per {cat_col}'
)
```

Click ... to show **solution**.
|
|
|
|
|
|
```python
|
|
# visualize graph
|
|
normalized_range_df.visualize(rankdir='LR')
|
|
```
|
|
|
|
We can apply predicate pushdown filters when reading from Parquet files with the `filters` parameter. This enables Dask-cuDF to skip row groups and files where _none_ of the rows can satisfy the criteria. This works best when the partition layout is designed around the filter column. Since that is not the case for our dataset, we will apply a separate filter after importing the data.
|
|
|
|
<a name='s2-e5'></a>
|
|
### Exercise #5 - Time-Series Analysis ###
|
|
When dealing with time-series data, cuDF provides powerful `.rolling()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.rolling/) and `.resample()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.resample/) methods to perform window operations. Functionally, they behave very similarly to `.groupby()` operations. We use `.resample()` followed by `.interpolate()` to find the frequency and probability over the entire span of the dataset.
|
|
|
|
We use `.map_partitions()` to perform `.resample()` on each partition. Because `.map_partitions()` doesn't perform a shuffle first, we manually perform a `.shuffle()` to ensure all members of each group are together. Once we have all the needed data in the same partitions, we can use `.map_partitions()` and pass the cuDF DataFrame `.resample()` operation.
|
|
|
|
**Instructions**: <br>
|
|
* Execute the below cell to clear memory.
|
|
* Execute the cell below to read data into memory and shuffle based on `ts_day`. This ensures that all records belonging to the same group are in the same partition.
|
|
* Execute the cell below to show the user shopping behavior.
|
|
* Modify the `resample_frequency` to various frequencies to look for more obvious patterns.
|
|
|
|
|
|
```python
|
|
del ddf
|
|
gc.collect()
|
|
```
|
|
|
|
|
|
```python
|
|
# read data with predicate pushdown
|
|
ddf=dask_cudf.read_parquet('clean_parquet', filters=[('ts_day', "<", 15)])
|
|
|
|
# apply filtering
|
|
ddf=ddf[ddf['ts_day']<15]
|
|
|
|
# shuffle first on ts_day
|
|
ddf=ddf.shuffle('ts_day')
|
|
```
|
|
|
|
|
|
```python
|
|
# set resample frequency
|
|
resample_frequency='3h'
|
|
|
|
# get time-series DataFrame
|
|
activity_amount_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time').size().interpolate('linear'))
|
|
purchase_rate_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time')['target'].mean().interpolate('linear'))
|
|
|
|
# create scatter plot
|
|
px.scatter(
|
|
# move data to CPU
|
|
activity_amount_trend.compute().to_pandas().sort_index(),
|
|
color=purchase_rate_trend.compute().to_pandas().sort_index()
|
|
).update_traces(
|
|
mode='markers+lines'
|
|
).update_layout(
|
|
yaxis_title='Number of Records',
|
|
xaxis_title='event_time',
|
|
title=f'Amount of Transactions Over Time'
|
|
)
|
|
```
|
|
|
|
<a name='s2-2.5'></a>
|
|
### Pivot Table ###
|
|
When the data is small enough to fit on a single GPU, it's often faster to perform data transformations with cuDF. Below we read a few numerical columns, which fit nicely in memory. We use `.pivot_table()` to find the probability and frequency for each `ts_hour` and `ts_weekday` group.
|
|
|
|
|
|
```python
|
|
# read data
|
|
gdf=cudf.read_parquet('clean_parquet', columns=['ts_weekday', 'ts_hour', 'target'])
|
|
```
|
|
|
|
|
|
```python
|
|
# create pivot table
|
|
activity_amount=gdf.pivot_table(index=['ts_weekday'], columns=['ts_hour'], values=['target'], aggfunc='size')['target']
|
|
|
|
# create heatmap
|
|
px.imshow(
|
|
# move data to CPU
|
|
activity_amount.to_pandas(),
|
|
title='there is more activity in the day'
|
|
).update_layout(
|
|
title=f'Number of Records Heatmap'
|
|
)
|
|
```
|
|
|
|
|
|
```python
|
|
# create pivot table
|
|
purchase_rate=gdf[['target', 'ts_weekday', 'ts_hour']].pivot_table(index=['ts_weekday'], columns=['ts_hour'], aggfunc='mean')['target'].to_pandas()
|
|
|
|
# create heatmap
|
|
px.imshow(
|
|
# move data to CPU
|
|
purchase_rate,
|
|
title='there is potentially a higher purchase rate in the evening'
|
|
).update_layout(
|
|
title=f'Probability Heatmap'
|
|
)
|
|
```
|
|
|
|
**Observations**:
|
|
* Behavior changes with `ts_weekday` and `ts_hour` - e.g., during the week, users do not stay up as late because they work the next day.
|
|
|
|
<a name='s2-3'></a>
|
|
## Summary ##
|
|
* `.groupby().apply()` requires shuffling, which is time-expensive. When possible, try to use `.groupby().agg()` instead
|
|
* Keeping data processing on the GPU can help generate visualizations quickly
|
|
* Use predicate filtering and column pruning to reduce the amount of data read into memory. When data size is small, processing on cuDF can be more efficient than Dask-cuDF
|
|
* Use `.persist()` if subsequent operations are exploratory
|
|
* `.map_partitions()` does not involve shuffling
|
|
|
|
|
|
```python
|
|
# clean GPU memory
|
|
import IPython
|
|
app = IPython.Application.instance()
|
|
app.kernel.do_shutdown(True)
|
|
```
|
|
|
|
**Well Done!** Let's move to the [next notebook](1_03_categorical_feature_engineering.ipynb).
|
|
|
|
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
|