<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Enhancing Data Science Outcomes With Efficient Workflow #

## 02 - Data Exploration and Data Visualization ##

In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process of transforming data into a proper structure for querying and analysis. Feature engineering, on the other hand, involves extracting and transforming features from raw data.

<p><img src='images/pipeline_overview_1.png' width=1080></p>

**Table of Contents**
<br>
In this notebook, we will load data from the Parquet file format into a Dask DataFrame and perform various data transformations and exploratory data analysis. This notebook covers the sections below:
1. [Quick Recap](#s2-1)
2. [Data Exploration and Data Visualization](#s2-2)
    * [Plotly](#s2-2.1)
    * [Summarize](#s2-2.2)
    * [Visualizing Distribution](#s2-2.3)
        * [Exercise #1 - Histogram with GPU](#s2-e1)
        * [Exercise #2 - Histogram with Log Scale](#s2-e2)
    * [GroupBy Summarize](#s2-2.4)
        * [Exercise #3 - Probability Bar Chart](#s2-e3)
        * [Exercise #4 - Custom GroupBy Aggregation](#s2-e4)
        * [Exercise #5 - Time-Series Analysis](#s2-e5)
    * [Pivot Table](#s2-2.5)
3. [Summary](#s2-3)

<a name='s2-1'></a>
## Quick Recap ##
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
* Reading data without a schema or specifying `dtype`
* Having too many partitions due to small `chunksize`
* Memory spilling due to partitions being too large
* Performing groupby operations on too many groups scattered across multiple partitions

Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.

<a name='s2-2'></a>
## Data Exploration and Data Visualization ##
Exploratory data analysis involves identifying the predictor/feature variables and the target/class variable. We use this time to understand the distribution of the features and identify potentially problematic outliers. Data exploration helps users understand the data in order to better tackle a problem. It can be a way to ascertain the validity of the data as we begin to look for useful features that will help in the following stages of the development workflow.

<a name='s2-2.1'></a>
### Plotly ###
**Plotly** [[Doc]](https://plotly.com/) is a popular library for graphing and data dashboards. Plotly uses `plotly.graph_objects` to create figures for data visualization. Graph objects can be created using `plotly.express` or from the ground up. In order for Plotly to make a graph, data needs to be on the host, not the GPU. If the dataset is small, it may be more efficient to use `pandas` instead of `cudf` or `Dask-cuDF`. However, if the dataset is large, sending data to the GPU is a great way to speed up computation before sending it to the host for visualization. When using GPU acceleration with Plotly, only move the GPU DataFrame(s) to the host at the end with `to_pandas()`, as opposed to converting the entire GPU DataFrame(s) to pandas immediately. This allows us to take advantage of GPU acceleration for processing.

For more information about how to use Plotly, we recommend [this guide](https://plotly.com/python/getting-started/).

We start by instantiating the `LocalCUDACluster()` and Dask `Client()`, followed by loading data from the Parquet files into a Dask DataFrame.

```python
# import dependencies
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask_cudf
import numpy as np
import cupy as cp
import plotly.express as px
import gc

# create a local CUDA cluster
cluster = LocalCUDACluster()

# instantiate the client
client = Client(cluster)
```

```python
# get the machine's external IP address
from requests import get

ip = get('https://api.ipify.org').content.decode('utf8')

print(f'Dask dashboard (status) address is: http://{ip}:8787/status')
print(f'Dask dashboard (GPU) address is: http://{ip}:8787/gpu')
```

```python
# read data
ddf = dask_cudf.read_parquet('clean_parquet')

print(f'Total of {len(ddf)} records split across {ddf.npartitions} partitions.')

ddf.dtypes
```

<p><img src='images/tip.png' width=720></p>
The Parquet file format includes metadata that informs `Dask-cuDF` which data types to use for each column.

```python
# create continuous and categorical column lists
continuous_cols = ['price', 'target', 'ts_hour', 'ts_minute', 'ts_weekday', 'ts_day', 'ts_month', 'ts_year']
categorical_cols = ['event_type', 'category_code', 'brand', 'user_session', 'session_product', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'product_id', 'category_id', 'user_id']
```

```python
# preview DataFrame
ddf.head()
```

<a name='s2-2.2'></a>
### Summarize ###
We can use the `describe()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.describe/) method to generate summary statistics for continuous features.

```python
# generate summary statistics for continuous features
ddf[continuous_cols].describe().compute().to_pandas().apply(lambda s: s.apply('{0:.2f}'.format))
```

For categorical values, we are often interested in the [cardinality](https://en.wikipedia.org/wiki/Cardinality) of each feature. Cardinality is the number of unique elements a set contains. We use `.nunique()` to get the number of possible values for each categorical feature, as it informs how each feature can be encoded for machine learning model consumption.

```python
# count number of unique values for categorical features
ddf[categorical_cols].nunique().compute()
```

Note that in the previous step, we added `read_parquet()` to the task graph but did not `.persist()` data in memory. Recall that the Dask DataFrame APIs build the task graph until `.compute()` is called. The result of `.compute()` is a cuDF DataFrame and should be small.

**Observations**:
* The dataset has a ~41% purchase rate
* All data come from March of 2020
* Price has a very large standard deviation

<a name='s2-2.3'></a>
### Visualizing Distribution ###
A histogram is a graph that shows the frequency of data using rectangles. It's used to visualize the distribution of the data so we can quickly approximate concentration, skewness, and variability. We will use histograms to identify popularity characteristics in the dataset.

When using Plotly or other visualization libraries, it's best to keep the data on the GPU for as long as possible and only move the data to the host when needed. For example, instead of relying on Plotly Express's `.histogram()`[[doc]](https://plotly.com/python/histograms/) function, we can use the `.value_counts()` method to count the number of occurrences. We can then pass the results to the `.bar()` function to generate a frequency bar chart. This can yield faster results by enabling GPU acceleration. Furthermore, we can use `.nlargest()` to limit the number of bars in the chart.

We want to visualize the distribution for specific features. Now that the data is in Parquet file format, we can use column pruning to read in only one column at a time to reduce the memory burden.

```python
%%time
# set cat column of interest
cat_col = 'cat_0'

# read data
ddf = dask_cudf.read_parquet('clean_parquet', columns=cat_col)

# create histogram
px.histogram(
    # move data to CPU
    ddf[cat_col].compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=cat_col,
    title=f'Distribution of {cat_col}'
)
```

<a name='s2-e1'></a>
### Exercise #1 - Histogram with GPU ###
Instead of generating the histogram on the CPU, we can use `.value_counts()` to achieve similar results.

**Instructions**: <br>
* Modify the `<FIXME>` only and execute the cell below to visualize the frequency of each `cat_0` value.
* Compare the performance efficiency with the previous CPU approach.

```python
%%time
# set cat column of interest
cat_col = 'cat_0'
n_bars = 25

# read data
ddf = dask_cudf.read_parquet('clean_parquet', columns=cat_col)

# create frequency count DataFrame
cat_count_df = <<<<FIXME>>>>

# create histogram
px.bar(
    # move data to CPU
    cat_count_df.compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=cat_col,
    title=f'Distribution of {cat_col}'
)
```

```python
%%time
cat_col = 'cat_0'
n_bars = 25
ddf = dask_cudf.read_parquet('clean_parquet', columns=cat_col)

cat_count_df = ddf[cat_col].value_counts().nlargest(n_bars)

px.bar(
    cat_count_df.compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=cat_col,
    title=f'Distribution of {cat_col}'
)
```

Click ... to show **solution**.

Using cuDF to calculate the frequency is much more efficient. For continuous features, we often have to bin the values into buckets. We can use `cudf.Series.digitize()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.series.digitize/), but it's not implemented for `dask_cudf.core.Series`, so we use `.map_partitions()` to apply `cudf.Series.digitize()` to each partition.
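
As a CPU-side sketch of what digitizing does, NumPy's `np.digitize` returns the index of the bin each value falls into; counting those indices yields the histogram. The prices below are toy values, not from the dataset:

```python
import numpy as np

# hypothetical prices and 50-unit-wide bin edges (toy values)
prices = np.array([5.0, 49.9, 50.0, 120.0, 7500.0], dtype='float32')
bins = np.arange(-1, 10000, 50, dtype='float32')

# np.digitize returns, for each value, the index of the bin it falls into
bin_ids = np.digitize(prices, bins)

# counting the bin indices yields the frequency histogram
ids, counts = np.unique(bin_ids, return_counts=True)
print(dict(zip(ids.tolist(), counts.tolist())))  # {1: 1, 2: 2, 3: 1, 151: 1}
```

The cuDF code below performs the same bin-then-count pattern, with the binning distributed across partitions.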

```python
%%time
# set cont column of interest
cont_col = 'price'

# read data
ddf = dask_cudf.read_parquet('clean_parquet', columns=cont_col)

# set bin edges
bins = np.array(range(-1, 10000, 50)).astype('float32')

# create frequency count DataFrame
cont_hist_df = ddf[cont_col].map_partitions(lambda p: p.digitize(bins)).value_counts()

# create histogram
px.bar(
    # move data to CPU
    cont_hist_df.compute().to_pandas(),
).update_xaxes(
    tickmode='array',
    tickvals=np.array(range(1, len(bins))),
    ticktext=[f'{int(bins[idx])} - {int(bins[idx + 1])}' for idx in range(len(bins) - 1)],
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=f'{cont_col} Bin',
    title=f'Distribution of {cont_col}'
)
```

<a name='s2-e2'></a>
### Exercise #2 - Histogram with Log Scale ###
`price` is positively skewed. We might get a better view of the distribution by creating bins on a logarithmic scale. We will create bin ranges using `numpy.logspace()` and pass them to `cudf.Series.digitize()`.

**Instructions**: <br>
* Modify the `<FIXME>` only and execute the cell below to visualize the frequency of `price` bins in log scale.

```python
# set cont column of interest
cont_col = 'price'

# read data
ddf = dask_cudf.read_parquet('clean_parquet', columns=cont_col)

# set bin edges in log scale
bins = np.logspace(0, 5).astype('float32')

# create frequency count DataFrame
cont_hist_df = ddf['price'].map_partitions(<<<<FIXME>>>>).value_counts()

# create histogram
px.bar(
    # move data to CPU
    cont_hist_df.compute().to_pandas(),
).update_xaxes(
    tickmode='array',
    tickvals=np.array(range(1, len(bins))),
    ticktext=[f'{int(bins[idx])} - {int(bins[idx + 1])}' for idx in range(len(bins) - 1)]
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=f'{cont_col} Bin',
    title=f'Distribution of {cont_col}'
)
```

```python
cont_col = 'price'

ddf = dask_cudf.read_parquet('clean_parquet', columns=cont_col)

bins = np.logspace(0, 5).astype('float32')

cont_hist_df = ddf['price'].map_partitions(lambda p: p.digitize(bins)).value_counts()

px.bar(
    cont_hist_df.compute().to_pandas(),
).update_xaxes(
    tickmode='array',
    tickvals=np.array(range(1, len(bins))),
    ticktext=[f'{int(bins[idx])} - {int(bins[idx + 1])}' for idx in range(len(bins) - 1)]
).update_layout(
    yaxis_title='Frequency Count',
    xaxis_title=f'{cont_col} Bin',
    title=f'Distribution of {cont_col}'
)
```

Click ... to show **solution**.

**Observations**:
* The vast majority of the products are below $300.

<a name='s2-2.4'></a>
### GroupBy Summarize ###
We can use a variety of groupby aggregations to learn about the data. The aggregations supported by cuDF, as described [here](https://docs.rapids.ai/api/cudf/stable/user_guide/groupby/#aggregation), are very efficient. We might be interested in exploring several variations. To make execution more efficient, we can `.persist()` the data in memory after reading it from the source. Subsequent operations will then not require loading from the source again.

For example, we can visualize the probability of an event for each category. When the target column is a binary indicator, we can do this quickly by calculating the aggregate mean. For a categorical feature with binary outcomes, the arithmetic mean gives the _probability_ of a positive outcome.
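
A quick CPU sanity check of that identity, using a toy binary column (illustrative values only):

```python
import numpy as np

# toy binary target: 1 = positive outcome, 0 = negative outcome (illustrative)
target = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])

# the arithmetic mean of a 0/1 indicator equals the fraction of positives
probability = target.mean()
print(probability)  # 0.4
```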

We use `.groupby()` on `cat_0`, followed by `.agg('mean')` on `target`, to determine the probability of a positive outcome for each `cat_0` group.

```python
# read data
ddf = dask_cudf.read_parquet('clean_parquet')

# persist data in memory
ddf = ddf.persist()
wait(ddf)
```

```python
# set cat column of interest
cat_col = 'cat_0'
n_bars = 25

# create groupby probability DataFrame
cat_target_df = ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)

# create bar chart
px.bar(
    # move data to CPU
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability',
    xaxis_title=cat_col,
    title=f'Probability of {cat_col}'
)
```

Some categories have a higher probability than others.

Other groupby aggregations include:
* What time of the week is the busiest

```python
# show probability of each ts_weekday and ts_hour group
ddf.groupby(['ts_weekday', 'ts_hour'])['target'].agg({'target': 'mean'})
```

<p><img src='images/tip.png' width=720></p>

`.groupby().size()` or `.groupby().agg('size')` is very similar to `.value_counts()`.
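
A small CPU illustration of that equivalence with pandas, on toy data (cuDF mirrors this API):

```python
import pandas as pd

df = pd.DataFrame({'cat_0': ['a', 'b', 'a', 'c', 'a', 'b']})

by_group = df.groupby('cat_0').size()   # counts per group, indexed by group key
by_value = df['cat_0'].value_counts()   # same counts, sorted by frequency

# the two agree once aligned on the group key
assert by_group.sort_index().to_dict() == by_value.sort_index().to_dict()
print(by_group.to_dict())  # {'a': 3, 'b': 2, 'c': 1}
```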

<a name='s2-e3'></a>
### Exercise #3 - Probability Bar Chart ###

**Instructions**: <br>
* Modify the `<FIXME>` only and execute the cell below to visualize the probability of each `ts_hour` value.

```python
# set cat column of interest
cat_col = <<<<FIXME>>>>
n_bars = 25

# create groupby probability DataFrame
cat_target_df = ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)

# create bar chart
px.bar(
    # move data to CPU
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability',
    xaxis_title=cat_col,
    title=f'Probability of {cat_col}'
)
```

```python
cat_col = 'ts_hour'
n_bars = 25
cat_target_df = ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)

px.bar(
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability',
    xaxis_title=cat_col,
    title=f'Probability of {cat_col}'
)
```

Click ... to show **solution**.

Some aggregations, such as `median`, `mode`, and `nunique`, require all data within the same group to be in memory for calculation. For these operations, `.groupby().apply()` is used. Because `.groupby().apply()` performs a shuffle, these operations scale poorly with large numbers of groups.

We use `.groupby()` on `brand` and `.apply()` on `SeriesGroupBy['user_id'].nunique()` to get the number of unique customers that have interacted with each brand.

```python
# set cat columns of interest
cat_col = 'brand'
group_statistic = 'user_id'

# create groupby summarize DataFrame
product_frequency = ddf.groupby(cat_col)[group_statistic].apply(lambda g: g.nunique(), meta=(f'{group_statistic}_count', 'int32')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    product_frequency.compute().to_pandas()
).update_layout(
    yaxis_title=f'Number of Unique {group_statistic}',
    xaxis_title=cat_col,
    title=f'Number of Unique {group_statistic} per {cat_col}'
)
```

```python
# visualize the task graph
product_frequency.visualize(rankdir='LR')
```

Certain brands and categories have higher penetration and a higher probability of a positive outcome.

Other groupby aggregations include:
* How many unique `user_id` are in each `product_id` group

```python
# show how many customers interacted with each product
ddf.groupby('product_id')['user_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```

* How many unique `cat_0` are in each `brand` group

```python
# show how many categories of products each brand carries
ddf.groupby('brand')['cat_0'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```

* How many unique `product_id` are in each `user_session` group

```python
# show how many products are viewed in each session
ddf.groupby('user_session')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```

* How many unique `product_id` are in each `user_id` group

```python
# show how many products each user interacts with
ddf.groupby('user_id')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```

<p><img src='images/tip.png' width=720></p>

For Dask, we can ensure the result of `.groupby()` is sorted using the `sort` parameter.

Sometimes we want to perform custom aggregations that are not yet supported. For custom aggregations, we can use `.groupby().apply()` and user-defined functions. For example, we might be interested in the range of `price` for each `category_code`. Arithmetically, this is done by taking the difference between the group-specific maximum and the group-specific minimum. We can normalize the range by dividing it by the group-specific mean.

It's best to avoid using `.groupby().apply()` when possible. Similar results can be calculated by using `.groupby().agg()` to obtain the `max`, `min`, and `mean` separately, then applying a row-wise calculation with `.apply()`. This can be more efficient.
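
A CPU-side pandas sketch of the two approaches on toy data (illustrative column values; cuDF and Dask-cuDF mirror these APIs):

```python
import pandas as pd

df = pd.DataFrame({
    'category_code': ['x', 'x', 'y', 'y', 'y'],
    'price':         [10.0, 30.0, 4.0, 10.0, 16.0],
})

# approach 1: a group-wise UDF (needs each whole group in memory)
via_apply = df.groupby('category_code')['price'].apply(
    lambda g: (g.max() - g.min()) / g.mean()
)

# approach 2: built-in aggregations, then a row-wise calculation
stats = df.groupby('category_code')['price'].agg(['max', 'min', 'mean'])
via_agg = (stats['max'] - stats['min']) / stats['mean']

print(via_agg.to_dict())  # {'x': 1.0, 'y': 1.2}
```

Both produce the same normalized range per group; the second stays within built-in aggregations, which is the pattern Exercise #4 asks for.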

```python
%%time
# set cat column of interest
cat_col = 'category_code'

# define group-wise function
def normalized_range(group):
    return (group.max() - group.min()) / group.mean()

# create groupby apply DataFrame
normalized_range_df = ddf.groupby(cat_col)['price'].apply(normalized_range, meta=('normalize_range', 'float64')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalized Range',
    xaxis_title=cat_col,
    title=f'Normalized Range of price per {cat_col}'
)
```

```python
# visualize the task graph
normalized_range_df.visualize(rankdir='LR')
```

<a name='s2-e4'></a>
### Exercise #4 - Custom GroupBy Aggregation ###

**Instructions**: <br>
* Modify the `<FIXME>` only and execute the cell below to visualize the normalized range of each `category_code`.
* Compare the performance efficiency with the previous `.groupby().apply()` approach.

```python
%%time
# set cat column of interest
cat_col = 'category_code'

# define row-wise function
def normalized_range(group):
    return <<<<FIXME>>>>

# create groupby aggregate DataFrame
normalized_range_df = ddf.groupby(cat_col)['price'].agg({'price': ['max', 'min', 'mean']}).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalized Range',
    xaxis_title=cat_col,
    title=f'Normalized Range of price per {cat_col}'
)
```

```python
cat_col = 'category_code'

def normalized_range(group):
    return (group['max'] - group['min']) / group['mean']

normalized_range_df = ddf.groupby(cat_col)['price'].agg({'price': ['max', 'min', 'mean']}).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)

px.bar(
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalized Range',
    xaxis_title=cat_col,
    title=f'Normalized Range of price per {cat_col}'
)
```

Click ... to show **solution**.

```python
# visualize the task graph
normalized_range_df.visualize(rankdir='LR')
```

We can apply predicate pushdown filters when reading from Parquet files with the `filters` parameter. This enables Dask-cuDF to skip row-groups and files where _none_ of the rows can satisfy the criteria. This works well when the partitioning is planned around the filter column. Since this is not the case for our dataset, we will apply a separate filter after importing the data.

<a name='s2-e5'></a>
### Exercise #5 - Time-Series Analysis ###
When dealing with time-series data, cuDF provides powerful `.rolling()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.rolling/) and `.resample()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.resample/) methods to perform window operations. Functionally, they behave very similarly to `.groupby()` operations. We use `.resample()` followed by `.interpolate()` to find the frequency and probability over the entire span of the dataset.
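
As a CPU-side sketch of what `.resample()` does, here is pandas (whose API cuDF follows) bucketing a toy event log into fixed 3-hour windows; `.size()` gives activity and `.mean()` on the binary target gives the purchase rate (timestamps are illustrative only):

```python
import pandas as pd

# toy event log: one row per event, with a binary purchase target
events = pd.DataFrame({
    'event_time': pd.to_datetime([
        '2020-03-01 00:10', '2020-03-01 01:20',
        '2020-03-01 03:05', '2020-03-01 04:45',
    ]),
    'target': [1, 0, 0, 1],
})

# bucket events into 3-hour windows
counts = events.resample('3h', on='event_time').size()
rate = events.resample('3h', on='event_time')['target'].mean()

print(counts.tolist())  # [2, 2]
print(rate.tolist())    # [0.5, 0.5]
```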

We use `.map_partitions()` to perform `.resample()` on each partition. Because `.map_partitions()` doesn't perform a shuffle first, we manually perform a `.shuffle()` to ensure all members of each group end up together. Once all the needed data is in the same partitions, we can use `.map_partitions()` to apply the cuDF DataFrame `.resample()` operation.

**Instructions**: <br>
* Execute the cell below to clear memory.
* Execute the next cell to read data into memory and shuffle it based on `ts_day`. This ensures that all records belonging to the same group are in the same partition.
* Execute the following cell to show user shopping behavior.
* Modify the `resample_frequency` to various frequencies to look for more obvious patterns.

```python
# release memory
del ddf
gc.collect()
```

```python
# read data with predicate pushdown
ddf = dask_cudf.read_parquet('clean_parquet', filters=[('ts_day', '<', 15)])

# apply filtering
ddf = ddf[ddf['ts_day'] < 15]

# shuffle on ts_day first
ddf = ddf.shuffle('ts_day')
```

```python
# set resample frequency
resample_frequency = '3h'

# get time-series DataFrames
activity_amount_trend = ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time').size().interpolate('linear'))
purchase_rate_trend = ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time')['target'].mean().interpolate('linear'))

# create scatter plot
px.scatter(
    # move data to CPU
    activity_amount_trend.compute().to_pandas().sort_index(),
    color=purchase_rate_trend.compute().to_pandas().sort_index()
).update_traces(
    mode='markers+lines'
).update_layout(
    yaxis_title='Number of Records',
    xaxis_title='event_time',
    title='Amount of Transactions Over Time'
)
```

<a name='s2-2.5'></a>
### Pivot Table ###
When the data is small enough to fit on a single GPU, it's often faster to perform data transformation with cuDF. Below we read a few numerical columns, which fit nicely in memory. We use `.pivot_table()` to find the probability and frequency for each `ts_hour` and `ts_weekday` group.
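
A quick pandas sketch of the pivot shape we're after, on toy values (cuDF's `pivot_table` mirrors this API):

```python
import pandas as pd

df = pd.DataFrame({
    'ts_weekday': [0, 0, 0, 1, 1, 1],
    'ts_hour':    [9, 9, 18, 9, 18, 18],
    'target':     [0, 1, 1, 0, 1, 0],
})

# rows: weekday, columns: hour, cells: purchase probability in that slot
prob = df.pivot_table(index='ts_weekday', columns='ts_hour', values='target', aggfunc='mean')
print(prob)
```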

```python
# read data
gdf = cudf.read_parquet('clean_parquet', columns=['ts_weekday', 'ts_hour', 'target'])
```

```python
# create pivot table
activity_amount = gdf.pivot_table(index=['ts_weekday'], columns=['ts_hour'], values=['target'], aggfunc='size')['target']

# create heatmap - there is more activity during the day
px.imshow(
    # move data to CPU
    activity_amount.to_pandas()
).update_layout(
    title='Number of Records Heatmap'
)
```

```python
# create pivot table
purchase_rate = gdf[['target', 'ts_weekday', 'ts_hour']].pivot_table(index=['ts_weekday'], columns=['ts_hour'], aggfunc='mean')['target'].to_pandas()

# create heatmap - there is potentially a higher purchase rate in the evening
px.imshow(
    # data already moved to CPU above
    purchase_rate
).update_layout(
    title='Probability Heatmap'
)
```

**Observations**:
* Behavior changes with `ts_weekday` and `ts_hour` - e.g., during the week, users do not stay up as late because they work the next day.

<a name='s2-3'></a>
## Summary ##
* `.groupby().apply()` requires shuffling, which is time-expensive. When possible, try to use `.groupby().agg()` instead
* Keeping data processing on the GPU can help generate visualizations quickly
* Use predicate filtering and column pruning to reduce the amount of data read into memory. When the data size is small, processing with cuDF can be more efficient than Dask-cuDF
* Use `.persist()` if subsequent operations are exploratory
* `.map_partitions()` does not involve shuffling

```python
# shut down the kernel to release GPU memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)
```

**Well Done!** Let's move to the [next notebook](1_03_categorical_feature_engineering.ipynb).

<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Enhancing Data Science Outcomes With Efficient Workflow #

## 03 - Feature Engineering for Categorical Features ##

In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process of transforming data into a proper structure for querying and analysis. Feature engineering, on the other hand, involves extracting and transforming features from raw data.

<p><img src='images/pipeline_overview_1.png' width=1080></p>

**Table of Contents**
<br>
In this notebook, we will load data from the Parquet file format into a Dask DataFrame and create additional features for machine learning model training. This notebook covers the sections below:
1. [Quick Recap](#s3-1)
2. [Feature Engineering](#s3-2)
    * [User-Defined Functions](#s3-2.1)
3. [Feature Engineering Techniques](#s3-3)
    * [One-Hot Encoding](#s3-3.1)
    * [Combining Categories](#s3-3.2)
    * [Categorify / Label Encoding](#s3-3.3)
    * [Count Encoding](#s3-3.4)
    * [Target Encoding](#s3-3.5)
    * [Embeddings](#s3-3.6)
4. [Summary](#s3-4)

<a name='s3-1'></a>
## Quick Recap ##
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
* Reading data without a schema or specifying `dtype`
* Having too many partitions due to small `chunksize`
* Memory spilling due to partitions being too large
* Performing groupby operations on too many groups scattered across multiple partitions

Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.

<a name='s3-2'></a>
## Feature Engineering ##
Feature engineering converts raw data into numeric vectors for model consumption. This is generally referred to as encoding, which transforms categorical data into continuous values. When encoding categorical values, there are three primary methods:
* Label encoding, when there is no ordered relationship
* Ordinal encoding, when there is an ordered relationship
* One-hot encoding, when the categorical variable data is binary in nature

Additionally, we can create numerous sets of new features from existing ones, which are then tested for effectiveness during model training. Feature engineering is an important step when working with tabular data, as it can improve a machine learning model's ability to learn faster and extract patterns. Feature engineering can be a time-consuming process, particularly when the dataset is large and each processing cycle takes a long time. The ability to perform feature engineering efficiently enables more exploration of useful features.

<a name='s3-2.1'></a>
### User-Defined Functions ###
Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame-style operators. While out-of-the-box functions are flexible and useful, it is sometimes necessary to write custom code, or **user-defined functions** (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.

Users can execute UDFs on `cudf.Series` with:
* `cudf.Series.apply()` or
* Numba's `forall` syntax [(link)](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#lower-level-control-with-custom-numba-kernels)

Users can execute UDFs on `cudf.DataFrame` with:
* `cudf.DataFrame.apply()`
* `cudf.DataFrame.apply_rows()`
* `cudf.DataFrame.apply_chunks()`
* `cudf.rolling().apply()`
* `cudf.groupby().apply_grouped()`

Note that applying UDFs directly with Dask-cuDF is not yet implemented. For now, users can use `map_partitions` to apply a function to each partition of the distributed DataFrame.

Currently, the use of string data within UDFs is provided through the `strings_udf` library. This is powerful for use cases such as string splitting, regular expressions, and tokenization. The topic of handling string data is discussed extensively [here](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#string-data). In addition to `Series.str`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/string_handling.html), cuDF also supports `Series.list`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/list_handling.html) for applying custom transformations.
|
||||
|
||||
<p><img src='images/tip.png' width=720></p>
|
||||
|
||||
Below are some tips:
|
||||
* `apply` works by applying the provided function to each group sequentially, and concatenating the results together. This can be very slow, especially for a large number of small groups. For a small number of large groups, it can give acceptable performance.
|
||||
* With cuDF, we can also combine NumPy or cuPy methods into the precedure.
|
||||
* Related to `apply`, iterating over a cuDF Series, DataFrame or Index is not supported. This is because iterating over data that resides on the GPU will yield extremely poor performance, as GPUs are optimized for highly parallel operations rather than sequential operations. In the vast majority of cases, it is possible to avoid iteration and use an existing function or methods to accomplish the same task. It is recommended that users copy the data from GPU to host with `.to_arrow()` or `.to_pandas()`, then copy the result back to GPU using `.from_arrow()` or `.from_pandas()`.
|
||||
|
||||
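The partition-wise pattern behind `map_partitions` is easy to see without any dataframe library. A minimal, framework-agnostic sketch, where plain Python lists stand in for partitions and `square_all` is a made-up per-partition UDF:

```python
# Sketch of the map_partitions idea: apply the same function to each
# partition independently, then concatenate the per-partition results.
def map_partitions(func, partitions):
    return [row for part in partitions for row in func(part)]

# hypothetical per-partition UDF: square every value in the partition
def square_all(part):
    return [x * x for x in part]

partitions = [[1, 2], [3, 4, 5]]  # two "partitions" of one column
print(map_partitions(square_all, partitions))  # → [1, 4, 9, 16, 25]
```

With Dask-cuDF, the function passed to `map_partitions` receives a `cudf.DataFrame` for each partition instead of a plain list.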
<a name='s3-3'></a>
## Feature Engineering Techniques ##
Below is a list of common feature engineering techniques.

<img src='images/feature_engineering_methods.png' width=720>

```python
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask.dataframe as dd
import dask_cudf
import gc

# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
```

```python
# get the machine's external IP address
from requests import get

ip=get('https://api.ipify.org').content.decode('utf8')

print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
```

```python
# read data as Dask-cuDF DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')
ddf=ddf.categorize(columns=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3'])
```

```python
ddf=ddf.persist()
```

<p><img src='images/check.png' width=720></p>
Did you get an error message? This notebook depends on the processed source file from previous notebooks.

<a name='s3-3.1'></a>
### One-Hot Encoding ###
**One-Hot Encoding**, also known as dummy encoding, creates several binary columns to indicate whether a row belongs to a specific category. It works well for categorical features that are not ordinal and have low cardinality. With one-hot encoding, each row gets a 1 in a single column and 0 everywhere else.

For example, we can use `cudf.get_dummies()` to perform one-hot encoding on one of the categorical columns.

<img src='images/tip.png' width=720>
One-hot encoding doesn't work well for categorical features with large cardinality, as it results in high dimensionality. This is particularly an issue for neural network optimizers. Furthermore, data should not be saved in one-hot encoded format. If needed, it should only be used temporarily for specific tasks.

```python
def one_hot(df, cat):
    temp=dd.get_dummies(df[cat])
    return dask_cudf.concat([df, temp], axis=1)
```

```python
one_hot(ddf, 'cat_0').head()
```
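The 0/1 expansion itself is simple to sketch without a dataframe library; the color values below are made up for illustration:

```python
# One indicator column (here, a plain list) per category: 1 where the row's
# value matches the category, 0 everywhere else.
def one_hot_encode(values, categories):
    return {c: [1 if v == c else 0 for v in values] for c in categories}

encoded = one_hot_encode(['red', 'blue', 'red'], categories=['blue', 'red'])
print(encoded)  # → {'blue': [0, 1, 0], 'red': [1, 0, 1]}
```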

<a name='s3-3.2'></a>
### Combining Categories ###

**Combining categories** creates new features that better identify patterns when the categories independently don't provide enough information to predict the target. It's also known as _cross column_ or _cross product_. It's a common data preprocessing step for machine learning since it reduces the cost of model training, and it's also common in exploratory data analysis. Properly combined categorical features encourage more effective splits in tree-based methods than considering each feature independently.

For example, while `ts_weekday` and `ts_hour` may independently have no significant patterns, we might observe more obvious patterns if the two features are combined into `ts_weekday_hour`.

<img src='images/tip.png' width=720>
When deciding which categorical features should be combined, it's important to balance the number of categories used, the number of observations in each combined category, and the information gain. Combining features reduces the number of observations per resulting category, which can lead to overfitting. Typically, combining low-cardinality categories is recommended. Otherwise, experimentation is needed to discover the best combinations.

```python
def combine_cats(df, left, right):
    df['-'.join([left, right])]=df[left].astype('str').str.cat(df[right].astype('str'))
    return df
```

```python
combine_cats(ddf, 'ts_weekday', 'ts_hour').head()
```
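Underneath, the cross product is just pairwise string concatenation; a minimal sketch with made-up weekday and hour values:

```python
# Pair up two categorical columns into one combined category per row.
def combine_categories(left, right, sep='-'):
    return [f'{l}{sep}{r}' for l, r in zip(left, right)]

ts_weekday = [0, 0, 6]
ts_hour = [9, 17, 9]
print(combine_categories(ts_weekday, ts_hour))  # → ['0-9', '0-17', '6-9']
```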

<a name='s3-3.3'></a>
### Categorify and Grouping ###

**Categorify**, also known as *Label Encoding*, converts features into continuous integers. Typically, it converts the values into monotonically increasing positive integers from 0 to *C*, the cardinality. It enables numerical computations and can also reduce memory utilization if the original feature contains string values. Categorify is a necessary preprocessing step for using categorical features in deep learning models with embedding layers.

Categorifying works well when the feature is ordinal, and is sometimes necessary when the cardinality is large. Categories with low frequency can be grouped together to prevent the model from overfitting on sparse signals. When categorifying a feature, we can apply a threshold that groups all categories with a lower frequency count into an `other` category.

In other words, we encode a categorical feature into continuous integer values only if the category occurs more often than the specified frequency threshold; all infrequent categories are mapped to the same special index. This handy functionality keeps the model from overfitting to sparse signals.

```python
def categorify(df, cat, freq_threshold):
    freq=df[cat].value_counts()
    freq=freq.reset_index()
    freq.columns=[cat, 'count']

    # reset index on the frequency dataframe for a new sequential index
    freq=freq.reset_index()
    freq.columns=[cat+'_Categorify', cat, 'count']

    # apply the frequency threshold to group low-frequency categories together
    freq_filtered=freq[freq['count']>freq_threshold]

    # add 2 to the new index as we want to use index 0 for others and 1 for unknown
    freq_filtered[cat+'_Categorify']=freq_filtered[cat+'_Categorify']+2
    freq_filtered=freq_filtered.drop(columns=['count'])

    # merge original dataframe with newly created dataframe to obtain the categorified value
    df=df.merge(freq_filtered, how='left', on=cat)

    # fill null values with 0 to represent low-frequency categories grouped as other
    df[cat + '_Categorify'] = df[cat + '_Categorify'].fillna(0)
    return df
```

```python
categorify(ddf, 'cat_0', 10).head()
```
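The mapping logic above (codes starting at 2, with 0 reserved for low-frequency categories and 1 for unknowns) can be sketched in plain Python:

```python
from collections import Counter

def categorify_list(values, freq_threshold):
    counts = Counter(values)
    # categories occurring more often than the threshold get codes from 2 up;
    # everything else collapses to 0 ("other"); 1 is reserved for "unknown"
    frequent = sorted(c for c in counts if counts[c] > freq_threshold)
    mapping = {c: i + 2 for i, c in enumerate(frequent)}
    return [mapping.get(v, 0) for v in values]

values = ['a', 'a', 'a', 'b', 'a', 'c']
print(categorify_list(values, freq_threshold=2))  # → [2, 2, 2, 0, 2, 0]
```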

<a name='s3-3.4'></a>
### Count Encoding ###

**Count Encoding** represents a feature based on its frequency. This can be interpreted as the popularity of a category.

For example, we can count the frequency of `user_id` with `cudf.Series.value_counts()`. This creates a feature that can help a machine learning model learn the behavior pattern of low-frequency users together.

```python
def count_encoding(df, cat):
    count_df=df[cat].value_counts()
    count_df=count_df.reset_index()
    count_df.columns=[cat, cat+'_CE']
    df=df.merge(count_df, on=cat)
    return df
```

```python
count_encoding(ddf, 'user_id').head()
```
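The arithmetic is just a frequency lookup per row; a framework-agnostic sketch with made-up user IDs:

```python
from collections import Counter

def count_encode(values):
    # replace each value with how often it occurs in the column
    counts = Counter(values)
    return [counts[v] for v in values]

print(count_encode(['u1', 'u2', 'u1', 'u1']))  # → [3, 1, 3, 3]
```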

<a name='s3-3.5'></a>
### Target Encoding ###

**Target Encoding** represents a categorical feature based on its effect on the target variable. One common technique is to replace values with the probability of the target given a category. Target encoding creates a new feature, which can be used by the model for training. The advantage of target encoding is that it processes the categorical features and makes them more easily accessible to the model during training and validation.

Mathematically, target encoding on a binary target can be expressed as:

$p(t = 1 \mid x = c_i)$

For a binary classifier, we can calculate the probability that the target is `true` or `1` by taking the mean for each category group. This is also known as *Mean Encoding*.

In other words, it calculates statistics, such as the arithmetic mean, from a target variable grouped by the unique values of one or more categorical features.
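For a binary target, the group mean is exactly this empirical probability; a small worked example with made-up categories and 0/1 targets:

```python
from collections import defaultdict

def target_encode(categories, targets):
    # mean of the target within each category group
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]

cats    = ['a', 'a', 'b', 'b', 'b']
targets = [ 1,   0,   1,   1,   0 ]
# 'a' has target mean (1+0)/2 = 0.5, 'b' has (1+1+0)/3 = 2/3
print(target_encode(cats, targets))
```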

<img src='images/tip.png' width=720>

*Leakage*, also known as data leakage or target leakage, occurs when training a model with information that would not be available at the time of prediction. The resulting inflated performance score overestimates the model's utility. For example, including "temperature_celsius" as a feature when training a model to predict "temperature_fahrenheit" is leakage.

```python
def target_encoding(df, cat):
    te_df=df.groupby(cat)['target'].mean().reset_index()
    te_df.columns=[cat, cat+'_TE']
    df=df.merge(te_df, on=cat)
    return df
```

```python
target_encoding(ddf, 'brand').head()
```
<a name='s3-3.6'></a>
### Embeddings ###

Deep learning models often apply **Embedding Layers** to categorical features. Over the past few years, this has become an increasingly popular technique for encoding categorical features. Since the embeddings need to be trained through a neural network, we will cover this in the next lab.

```python
ddf=one_hot(ddf, 'cat_0')
ddf=combine_cats(ddf, 'ts_weekday', 'ts_hour')
ddf=categorify(ddf, 'product_id', 100)
ddf=count_encoding(ddf, 'user_id')
ddf=count_encoding(ddf, 'product_id')
ddf=target_encoding(ddf, 'brand')
ddf=target_encoding(ddf, 'product_id')
ddf.head()
```

```python
# clean GPU memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)
```

**Well Done!** Let's move to the [next notebook](1_04_nvtabular_and_mgpu.ipynb).

<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

341
ds/25-1/3/1_04_nvtabular_and_mgpu.md
Normal file
@ -0,0 +1,341 @@
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Enhancing Data Science Outcomes With Efficient Workflow #

## 04 - NVTabular ##
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, involves the extraction and transformation of raw data.

<p><img src='images/pipeline_overview_1.png' width=1080></p>

**Table of Contents**
<br>
In this notebook, we will use NVTabular to perform feature engineering. This notebook covers the below sections:
1. [NVTabular](#s4-1)
    * [Multi-GPU Scaling in NVTabular with Dask](#s4-1.1)
2. [Operators](#s4-2)
3. [Feature Engineering and Preprocessing with NVTabular](#s4-3)
    * [Defining the Workflow](#s4-3.1)
    * [Exercise #1 - Using NVTabular Operators](#s4-e1)
    * [Defining the Dataset](#s4-3.2)
    * [Fit, Transform, and Persist](#s4-3.3)
    * [Exercise #2 - Load Saved Workflow](#s4-e2)

<a name='s4-1'></a>
## NVTabular ##
[NVTabular](https://nvidia-merlin.github.io/NVTabular/main/index.html) is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets. It provides a high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS [cuDF](https://docs.rapids.ai/api/cudf/stable/) library. While NVTabular is built upon cuDF, it improves on it by not limiting data to GPU memory capacity. The API documentation can be found [here](https://nvidia-merlin.github.io/NVTabular/main/api.html#).

Core features of NVTabular include:
* Easily process data by leveraging built-in or custom operators specifically designed for machine learning algorithms
* Computations are carried out on the GPU with best practices baked into the library, allowing us to realize significant acceleration
* Provides a higher-level API to greatly simplify code complexity while still providing the same level of performance
* Works on arbitrarily large datasets when used with [Dask](https://www.dask.org/)
* Minimizes the number of passes through the data with [lazy execution](https://en.wikipedia.org/wiki/Lazy_evaluation)

In doing so, NVTabular helps data scientists and machine learning engineers to:
* Process datasets that exceed GPU and CPU memory without having to worry about scale
* Focus on what to do with the data and not how to do it by using abstraction at the operation level
* Prepare datasets quickly and easily for experimentation so that more models can be trained

Data science can be an iterative process that requires extensive repeated experimentation. The ability to perform feature engineering and preprocessing quickly translates into faster iteration cycles, which can help us arrive at an optimal solution.

<a name='s4-1.1'></a>
### Multi-GPU Scaling in NVTabular with Dask ###
NVTabular supports multi-GPU scaling with [Dask-CUDA](https://github.com/rapidsai/dask-cuda) and `dask.distributed`[[doc]](https://distributed.dask.org/en/latest/). For multi-GPU, NVTabular uses [Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) for internal data processing. The parallel performance can depend strongly on the size of the partitions, the shuffling procedure used for data output, and the arguments used for transformation operations.

<a name='s4-2'></a>
## Operators ##
NVTabular has already implemented several data transformations, called `ops`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/generated/nvtabular.ops.Operator.html). An `op` can be applied to a `ColumnGroup` through an overloaded `>>` operator, which in turn returns a new `ColumnGroup`. A `ColumnGroup` is a list of column names as text.

```
features = [ column_name_1, column_name_2, ...] >> op1 >> op2 >> ...
```
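A toy sketch of how an overloaded `>>` can chain transformations over column values (this is an illustration only, not NVTabular's actual implementation; `add_one` and `double` are made-up ops):

```python
class Op:
    """A chainable transformation over a list of values."""
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, values):
        # called for `values >> op` when the left side is a plain list
        return [self.func(v) for v in values]

add_one = Op(lambda v: v + 1)
double = Op(lambda v: v * 2)

# chains left to right: add one first, then double
print([1, 2, 3] >> add_one >> double)  # → [4, 6, 8]
```

Real NVTabular ops build a lazy graph instead of executing eagerly, which lets the library minimize passes over the data.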

Since the `Dataset` API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex preprocessing operations that are not yet supported in NVTabular can still be accomplished with the Dask-cuDF API.

Common operators include:
* [Categorify](https://nvidia-merlin.github.io/NVTabular/main/api/ops/categorify.html) - transform categorical features into unique integer values
    * Can apply a frequency threshold to group low-frequency categories together
* [TargetEncoding](https://nvidia-merlin.github.io/NVTabular/main/api/ops/targetencoding.html) - transform categorical features into the group-specific mean of each row
    * Using `kfold=1` and `p_smooth=0` disables this additional logic
* [Groupby](https://nvidia-merlin.github.io/NVTabular/main/api/ops/groupby.html) - transform a feature into the result of one or more groupby aggregations
    * **NOTE**: Does not move data between partitions, which means data should be shuffled by `groupby_cols`
* [JoinGroupby](https://nvidia-merlin.github.io/NVTabular/main/api/ops/joingroupby.html) - add new features based on desired group-specific statistics of requested continuous features
    * Supported statistics include [`count`, `sum`, `mean`, `std`, `var`]
* [LogOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/log.html) - log transform the continuous features
* [FillMissing](https://nvidia-merlin.github.io/NVTabular/main/api/ops/fillmissing.html) - replace missing values with a constant pre-defined value
* [Bucketize](https://nvidia-merlin.github.io/NVTabular/main/api/ops/bucketize.html) - transform continuous features into categorical features with bins based on provided bin boundaries
* [LambdaOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/lambdaop.html) - enables custom row-wise dataframe manipulations with NVTabular
* [Rename](https://nvidia-merlin.github.io/NVTabular/main/api/ops/rename.html) - rename columns
* [Normalize](https://nvidia-merlin.github.io/NVTabular/main/api/ops/normalize.html) - standardize features using the mean and standard deviation
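The arithmetic behind a log transform followed by standardization can be sketched in plain Python (population standard deviation is assumed here; the library's exact conventions may differ):

```python
import math

def log_transform(values):
    # log1p handles zeros in skewed, non-negative features
    return [math.log1p(v) for v in values]

def normalize(values):
    # standardize to zero mean and unit standard deviation
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

prices = [0.0, 9.0, 99.0]
print(normalize(log_transform(prices)))
```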

```python
# import dependencies
import nvtabular as nvt
from nvtabular.ops import *

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
import cudf
import gc

# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
```

```python
# get the machine's external IP address
from requests import get

ip=get('https://api.ipify.org').content.decode('utf8')

print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
```

```python
# read data as Dask-cuDF DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')

# preview DataFrame
ddf.head()
```

<a name='s4-3'></a>
## Feature Engineering and Preprocessing with NVTabular ##
The typical steps for developing with NVTabular include:
1. Design and Define Operations in the Pipeline
2. Create Workflow
3. Create Dataset
4. Apply Workflow to Dataset

<p><img src='images/nvtabular_diagram.png' width=720></p>

<a name='s4-3.1'></a>
### Defining the Workflow ###
We start by creating the `nvtabular.workflow.workflow.Workflow`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html), which defines the operations and preprocessing steps that we would like to perform on the data.

We will perform the following feature engineering and preprocessing steps:
* Categorify the categorical features
* Log transform and normalize continuous features
* Calculate the group-specific `sum`, `count`, and `mean` of the `target` for categorical features
* Log transform `price`
* Calculate the `product_id`-specific relative `price` to the average `price`
* Target encode all categorical features

One of the key advantages of using NVTabular is the high-level abstraction we can use, which simplifies code significantly.

```python
# assign features and label
cat_cols=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3']
cont_cols=['price', 'ts_hour', 'ts_minute', 'ts_weekday']
label='target'
```

```python
# categorify categorical features
cat_features=cat_cols >> Categorify()
```

<a name='s4-e1'></a>
### Exercise #1 - Using NVTabular Operators ###
We can use the `>>` operator to specify how columns will be transformed. We need to transform the `price` feature by performing the log transformation and normalization.

**Instructions**: <br>
* Review the documentation for the `LogOp()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/ops/log.html) and `Normalize()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/ops/normalize.html) operators.
* Modify the `<FIXME>`s only and execute the cell below to create a workflow.

```python
# log transform
price = (
    ['price']
    >> FillMissing(0)
    >> <<<<FIXME>>>>
    >> <<<<FIXME>>>>
    >> LambdaOp(lambda col: col.astype("float32"), dtype='float32')
)
```

```python
price = (
    ['price']
    >> FillMissing(0)
    >> LogOp()
    >> Normalize()
    >> LambdaOp(lambda col: col.astype("float32"), dtype='float32')
)
```

Click ... to show **solution**.


There are several ways to create a feature for the relative `price` to the average. We will do so with the steps below:
1. Calculate the average `price` per group.
2. Define a function to calculate the percentage difference.
3. Apply the user-defined function to `price` and the average `price`.

```python
# relative price to the average price for the product_id
# create product_id specific average price feature
avg_price_product = ['product_id'] >> JoinGroupby(cont_cols=['price'], stats=["mean"])

# create user-defined function to calculate percent difference
def relative_price_to_avg(col, gdf):
    # introduce a tiny number in case of 0
    epsilon = 1e-5
    col = ((gdf['price'] - col) / (col + epsilon)) * (col > 0).astype(int)
    return col

# create product_id specific relative price to average
relative_price_to_avg_product = (
    avg_price_product
    >> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64')
    >> Rename(name='relative_price_product')
)
```
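The percentage-difference formula in `relative_price_to_avg` is easy to check by hand with scalar values (made-up numbers):

```python
def relative_price(avg, price, epsilon=1e-5):
    # percent difference from the group average; 0 when the average is not positive
    return ((price - avg) / (avg + epsilon)) * (1 if avg > 0 else 0)

print(round(relative_price(avg=10.0, price=12.0), 4))  # → 0.2
```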

```python
avg_price_category = ['category_code'] >> JoinGroupby(cont_cols=['price'], stats=["mean"])

# create category_code specific relative price to average
relative_price_to_avg_category = (
    avg_price_category
    >> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64')
    >> Rename(name='relative_price_category')
)
```

```python
# calculate group-specific statistics for categorical features
ce_features=cat_cols >> JoinGroupby(stats=['sum', 'count'], cont_cols=label)

# target encode
te_features=cat_cols >> TargetEncoding(label)
```

We also add the target, i.e. `label`, to the set of returned columns. We can visualize our data processing pipeline with `graphviz` by calling `.graph`. The data processing pipeline is a DAG (directed acyclic graph).

```python
features=cat_features+cont_cols+ce_features+te_features+price+relative_price_to_avg_product+relative_price_to_avg_category+[label]
features.graph
```

We are now ready to construct a `Workflow` that will run the operations we defined above. To enable distributed parallelism, the NVTabular `Workflow` must be initialized with a `dask.distributed.Client` object. Since NVTabular already uses Dask-cuDF for internal data processing, there are no other requirements for multi-GPU scaling.

```python
# for multi-GPU execution, the only requirement is that we specify a client
# when initializing the NVTabular Workflow
workflow=nvt.Workflow(features, client=client)
```

<a name='s4-3.2'></a>
### Defining the Dataset ###
All external data needs to be converted to the universal `nvtabular.io.dataset.Dataset`[[doc]](https://nvidia-merlin.github.io/NVTabular/v0.7.1/api/dataset.html) type. The main purpose of this class is to abstract away the raw format of the data, and to allow other NVTabular classes to reliably materialize a `dask.dataframe.DataFrame` collection and/or collection-based iterator on demand.

The collection-based iterator is important when working with large datasets that do not fit into GPU memory, since operations in the `Workflow` often require statistics calculated across the entire dataset. For example, `Normalize` requires measurements of the dataset mean and standard deviation, and `Categorify` requires an accounting of all the unique categories a particular feature can manifest. The `Dataset` object partitions the dataset into chunks that will fit into GPU memory to compute statistics in an online fashion.

A `Dataset` can be initialized from a variety of raw-data formats:
1. A Parquet dataset directory
2. A list of files
3. An existing cuDF DataFrame or `dask.dataframe.DataFrame` (in addition to data stored on disk)

The data we pass to the `Dataset` constructor is usually the result of a query from some source, for example a data warehouse or data lake. The output is usually in Parquet, ORC, or CSV format. In our case, we have the data in Parquet format saved on disk from previous steps. When initializing a `Dataset` from a directory path, the `engine` argument should be used to specify either `parquet` or `csv` format. If initializing a `Dataset` from a list of files, the engine can be inferred.

Memory is an important consideration. The workflow processes data in chunks, so increasing the number of partitions limits the memory footprint. Since we will initialize the `Dataset` with a DataFrame type (`cudf.DataFrame` or `dask.dataframe.DataFrame`), most of the parameters will be ignored and the partitions will be preserved. Otherwise, the data would be converted to a `dask.dataframe.DataFrame` with a maximum partition size of roughly 12.5% of the total memory on a single device by default. We can use the `npartitions` parameter to specify into how many chunks the data should be split. The partition size can be changed to a different fraction of total memory on a single device with the `part_mem_fraction` argument. Alternatively, a specific byte size can be specified with the `part_size` argument.
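The partition-sizing arithmetic can be sketched as follows (illustrative numbers; the default fraction of roughly 12.5% comes from the text above):

```python
import math

def num_partitions(total_bytes, device_mem_bytes, part_mem_fraction=0.125):
    # each partition is at most part_mem_fraction of one device's memory
    part_size = device_mem_bytes * part_mem_fraction
    return math.ceil(total_bytes / part_size)

# e.g. a 100 GB dataset on a 16 GB GPU with the default fraction
print(num_partitions(100 * 2**30, 16 * 2**30))  # → 50
```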

<p><img src='images/tip.png' width=720></p>

The NVTabular dataset should be created from Parquet files in order to get the best possible performance, preferably with a row group size of around 128MB. While NVTabular also supports reading from CSV files, reading CSV can be over twice as slow as reading from Parquet. It's recommended to convert a CSV dataset into Parquet format for use with NVTabular.

```python
# create dataset
dataset=nvt.Dataset(ddf)

print(f'The Dataset is split into {dataset.npartitions} partitions')
```

<a name='s4-3.3'></a>
### Fit, Transform, and Persist ###
NVTabular follows a familiar API for pipeline operations. We can `.fit()` the workflow to a training set to calculate the statistics for this workflow. Afterwards, we can use it to `.transform()` the training set and the validation set. We will persist the transformed data to disk in Parquet format for fast reading at training time. Importantly, we can use the `.save()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html#nvtabular.workflow.workflow.Workflow.save) method so that our `Workflow` can be reused during model inference.

<p><img src='images/tip.png' width=720></p>

Since the `Dataset` API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex preprocessing operations that are not yet supported in NVTabular can still be accomplished with the `dask_cudf.DataFrame` API after the `Dataset` is converted with `.to_ddf()`.

```python
# fit and transform dataset
workflow.fit(dataset)
output_dataset=workflow.transform(dataset)
```


```python
# save the workflow
workflow.save('nvt_workflow')

!ls -l nvt_workflow
```

```python
# remove existing parquet directory
!rm -R processed_parquet/*

# save output to parquet directory
output_path='processed_parquet'
output_dataset.to_parquet(output_path=output_path)
```

If needed, we can convert the `Dataset` object to a `dask.dataframe.DataFrame` to inspect the results.

```python
# convert to DataFrame and preview
output_dataset.to_ddf().head()
```

<a name='s4-e2'></a>
|
||||
### Exercise #2 - Load Saved Workflow ###
|
||||
We can load a saved workflow, which will contain the graph, schema, and statistics. This is useful if the workflow should be applied to future datasets.
|
||||
|
||||
**Instructions**: <br>
|
||||
* Review the [documentation](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html#nvtabular.workflow.workflow.Workflow.load) for the `.load()` _class_ method.
|
||||
* Modify the `<FIXME>` only and execute the cell below to create a workflow.
|
||||
* Execute the cell below to apply the graph of operators to transform the data.
|
||||
|
||||
|

```python
# load workflow
loaded_workflow=<<<<FIXME>>>>
```

Click ... to show **solution**.

```python
loaded_workflow=nvt.Workflow.load('nvt_workflow')
```

```python
# create dataset from parquet directory
dataset=nvt.Dataset('clean_parquet', engine='parquet')

# transform dataset
loaded_workflow.transform(dataset).to_ddf().head()
```

```python
# clean GPU memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
```
Notes on the lab setup:

* Some categorical features, such as `event_type`, `category_code`, and `brand`, are stored as raw text.
* Prefer Dask over cuDF when the dataset doesn't fit into memory.
* Without `dask_cuda.LocalCUDACluster()`, none of the optimizations introduced by Dask-CUDA are available.
* Connect to the cluster with `dask.distributed.Client`.
* Spilling uses host memory (`memory_limit`) once `device_memory_limit` is reached.
<a name='s1-2.2'></a>
### Memory Utilization ###
Memory utilization on a DataFrame depends largely on the data types for each column.

<p><img src='images/dtypes.png' width=720></p>

We can use `DataFrame.memory_usage()` to see the memory usage for each column (in bytes). Most of the common data types have a fixed size in memory, such as `int`, `float`, `datetime`, and `bool`. Memory usage for these data types is the respective memory requirement multiplied by the number of data points. For `string` data types, the memory usage reported is the number of data points times 8 bytes. This accounts for the 64 bits required for the pointer that points to an address in memory, but doesn't include the memory used for the actual string values. The actual memory required for a `string` value is 49 bytes plus an additional byte for each character. The `deep` parameter provides a more accurate memory usage report that accounts for the system-level memory consumption of the contained `string` data type.

Separately, we've provided a `dli_utils.make_decimal()` function to convert memory size into units based on powers of 2. In contrast to units based on powers of 10, this customary convention is commonly used to report memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units).
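The implementation of `dli_utils.make_decimal()` isn't shown here; a minimal sketch of such a helper, assuming it simply divides by powers of 1024 and appends a binary-prefix unit, might look like:

```python
# hypothetical stand-in for dli_utils.make_decimal(), NOT the actual course utility
def make_decimal(num_bytes):
    # walk up the binary-prefix units, dividing by 1024 at each step
    for unit in ['B', 'KiB', 'MiB', 'GiB', 'TiB']:
        if num_bytes < 1024:
            return f'{num_bytes:.2f} {unit}'
        num_bytes /= 1024
    return f'{num_bytes:.2f} PiB'

print(make_decimal(500))        # 500.00 B
print(make_decimal(3_000_000))  # 2.86 MiB
```

The real utility may format differently; the point is only the divide-by-1024 convention.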

```python
# import dependencies
import pandas as pd
import sys
import random

# import utility
from dli_utils import make_decimal

# import data
df=pd.read_csv('2020-Mar.csv')

# preview DataFrame
df.head()
```

```python
# convert feature to datetime data type
df['event_time']=pd.to_datetime(df['event_time'])
```

```python
# lists each column at 8 bytes/row
memory_usage_df=df.memory_usage(index=False)
memory_usage_df.name='memory_usage'
dtypes_df=df.dtypes
dtypes_df.name='dtype'

# show each column uses roughly number of rows * 8 bytes
# 8 bytes for 64-bit numerical data as well as 8 bytes to store a pointer for object data types
byte_size=len(df) * 8 * len(df.columns)

print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')

pd.concat([memory_usage_df, dtypes_df], axis=1)
```

```python
# lists each column's full memory usage
memory_usage_df=df.memory_usage(deep=True, index=False)
memory_usage_df.name='memory_usage'

byte_size=memory_usage_df.sum()

# show total memory usage
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')

pd.concat([memory_usage_df, dtypes_df], axis=1)
```

```python
# alternatively, use sys.getsizeof() instead
byte_size=sys.getsizeof(df)

print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
```

```python
# check a random string-typed column
string_cols=[col for col in df.columns if df[col].dtype=='object']
column_to_check=random.choice(string_cols)

overhead=49
pointer_size=8

# item==item is False only when item is NaN
# NaN uses 32 bytes of memory
print(f'{column_to_check} column uses : {sum([(len(item)+overhead+pointer_size) if item==item else 32 for item in df[column_to_check].values])} bytes of memory.')
```
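The 49-byte overhead plus one byte per character can be checked directly on plain Python strings with `sys.getsizeof()` (this holds for ASCII strings on 64-bit CPython; wider characters cost more bytes each):

```python
import sys

# empty ASCII string: just the object overhead (typically 49 bytes on 64-bit CPython)
print(sys.getsizeof(''))
# each additional ASCII character adds one byte
print(sys.getsizeof('abc') - sys.getsizeof(''))  # 3
```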

<p><img src='images/tip.png' width=720></p>

When Python stores a string, it actually uses memory for the overhead of the Python object, metadata about the string, and the string itself. The amount of memory usage we calculated includes temporary objects that get deallocated after the initial import. It's important to note that Python has memory optimization mechanics for strings: when the same string is created multiple times, Python may cache or "intern" it in memory and reuse it for later string objects.
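Interning can be observed directly with `sys.intern()`; a small sketch (the joined string here is an arbitrary example):

```python
import sys

# two equal strings built at runtime are not guaranteed to be the same object
a = ''.join(['category', '_code'])
b = ''.join(['category', '_code'])
print(a == b)   # True: equal values

# interning caches the string, so both names point at the same object
a, b = sys.intern(a), sys.intern(b)
print(a is b)   # True: one shared object
```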
---
title: "Lab 9: Decision trees"
author: "Vladislav Litvinov <vlad@sek1ro>"
output:
  pdf_document:
    toc: TRUE
---
# Data preparation
```{r}
setwd('/home/sek1ro/git/public/lab/ds/25-1/r')
survey <- read.csv('survey.csv')

train_df = survey[1:600,]
test_df = survey[601:750,]
```
# Building classification tree
The decision formula is `MYDEPV ~ Price + Income + Age`.

* Use three-fold cross-validation and the information gain splitting index.
* Which features were actually used to construct the tree?
* Plot the tree using the "rpart.plot" package.

Three-fold cross-validation makes 3 runs:

* Run 1: train on B + C, test on A
* Run 2: train on A + C, test on B
* Run 3: train on A + B, test on C

This yields 3 values of the chosen metric (accuracy, F1, MSE, etc.);
their mean is the final estimate of model quality.
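The three runs can be sketched as index splits (a language-neutral illustration, not how rpart partitions internally):

```python
# minimal sketch of a 3-fold split: partition row indices into folds A, B, C,
# train on two folds, hold out the third, and repeat for each fold
def three_fold_splits(n):
    folds = [list(range(k, n, 3)) for k in range(3)]  # interleaved folds
    for k in range(3):
        test = folds[k]
        train = [i for j in range(3) if j != k for i in folds[j]]
        yield train, test

for run, (train, test) in enumerate(three_fold_splits(9), start=1):
    print(f'Run {run}: train on {len(train)} rows, test on {len(test)} rows')
```

Each run's metric would be computed on the held-out fold, then averaged across the three runs.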
`rpart` drops features on its own if they do not improve the split by information gain.

The CP table relates tree complexity to error:

* Root node error: the error with no splits
* nsplit: the number of splits
* rel error: training error relative to the root
* xerror: cross-validation error
* xstd: standard deviation of xerror

`rpart.plot` arguments:

* type: placement of the split labels
* extra: additional information in the nodes
* fallen.leaves: align the leaves at the bottom of the plot

$$H = -x\log(x) - (1-x)\log(1-x)$$

$Gain(A) = Info(S) - Info(S_A)$; we maximize the gain.
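These two formulas can be checked numerically (using base-2 logarithms, a common convention for entropy; the toy split below is hypothetical):

```python
import math

def entropy(x):
    # binary entropy H(x) = -x*log2(x) - (1-x)*log2(1-x), with H(0) = H(1) = 0
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# entropy peaks when the two classes are balanced
print(entropy(0.5))   # 1.0

# gain of a split: Info(S) minus the weighted entropy of the children;
# a perfect split of a balanced node recovers the full bit
gain = entropy(0.5) - (0.5 * entropy(1.0) + 0.5 * entropy(0.0))
print(gain)           # 1.0
```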

Early stopping: limit the tree depth; require a minimum number of examples in a node.

Pruning:

1. Build the full tree, in which each leaf contains examples of a single class.
2. Compute two measures: the model's relative accuracy and the absolute error.
3. Remove the leaves and nodes whose loss has minimal impact on the model's accuracy and on the growth of the error.

```{r}
library(rpart)
tree = rpart(
  MYDEPV ~ Price + Income + Age,
  data = train_df,
  method = "class",
  parms = list(split = "information"),
  control = rpart.control(xval = 3)
)
printcp(tree)

library(rpart.plot)

rpart.plot(
  tree,
  type = 1,
  extra = 106,
  # 6: class models, the probability of the second class only; useful for binary responses
  # 100: display the percentage of observations in the node
  fallen.leaves = TRUE
)
```
Score the model with the training data and create the model's confusion matrix. Which class of MYDEPV was the model better able to classify?
```{r}
pred_class = predict(tree, train_df, type="class")

conf_mat = table(
  Actual = train_df$MYDEPV,
  Predicted = pred_class
)

conf_mat
print(diag(conf_mat) / rowSums(conf_mat))
```
Define the resubstitution error rate, and then calculate it using the confusion matrix from the previous step. Is it a good indicator of predictive performance? Why or why not?

The resubstitution error rate is the share of incorrect predictions on the same data the model was trained on. It is not a good indicator of predictive performance, because it is measured on the training data and therefore tends to underestimate the error on unseen data.
```{r}
print(1 - sum(diag(conf_mat)) / sum(conf_mat))
```
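The same calculation, spelled out on a confusion matrix with hypothetical counts:

```python
# resubstitution error rate from a confusion matrix (hypothetical counts)
conf_mat = [[50, 10],   # actual class 0: predicted 0, predicted 1
            [5,  35]]   # actual class 1: predicted 0, predicted 1

correct = conf_mat[0][0] + conf_mat[1][1]         # diagonal = correct predictions
total = sum(sum(row) for row in conf_mat)
error = 1 - correct / total
print(round(error, 2))  # 0.15
```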
ROC curve (Receiver Operating Characteristic):

* x axis: FPR = FP / (FP + TN)
* y axis: TPR = TP / (TP + FN)
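These two rates are simple ratios; with hypothetical confusion counts:

```python
# TPR and FPR from hypothetical confusion counts
TP, FN = 40, 10   # actual positives
FP, TN = 5, 45    # actual negatives

TPR = TP / (TP + FN)   # y-axis of the ROC curve
FPR = FP / (FP + TN)   # x-axis of the ROC curve
print(TPR, FPR)  # 0.8 0.1
```

Sweeping the classification threshold traces out one (FPR, TPR) point per threshold, forming the curve.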
```{r}
pred_prob = predict(tree, train_df, type="prob")[,2]

library(ROCR)
pred = prediction(pred_prob, train_df$MYDEPV)
perf = performance(pred, "tpr", "fpr")

plot(perf)
abline(a = 0, b = 1)

auc_perf = performance(pred, measure = "auc")
auc_perf@y.values[[1]]
```
Score the model with the testing data. How accurate are the tree's predictions?
Repeat part (a), but set the splitting index to the Gini coefficient splitting index. How does the new tree compare to the previous one?

The Gini index shows how often a randomly chosen example from the training set would be misclassified.

$Gini(Q) = 1 - \sum p^2$; splits are chosen to maximize the decrease in impurity:

* 0: all examples belong to one class
* maximum (0.5 in the binary case): the classes are equiprobable

In the binary case: $1 - x^2 - (1-x)^2$
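The binary Gini impurity and its extremes can be verified directly:

```python
def gini(x):
    # binary Gini impurity: 1 - x^2 - (1 - x)^2
    return 1 - x**2 - (1 - x)**2

print(gini(0.0))  # 0.0, a pure node: all examples in one class
print(gini(0.5))  # 0.5, maximum impurity for two classes
```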
```{r}
pred_test = predict(tree, test_df, type="class")
conf_mat_test = table(Actual = test_df$MYDEPV, Predicted = pred_test)
conf_mat_test
print(diag(conf_mat_test) / rowSums(conf_mat_test))

tree_gini = rpart(
  MYDEPV ~ Price + Income + Age,
  data = train_df,
  method = "class",
  parms = list(split = "gini")
)

printcp(tree_gini)

rpart.plot(
  tree_gini,
  type = 1,
  extra = 106,
  fallen.leaves = TRUE
)
```
One way to prune a tree is according to the complexity parameter associated with the smallest cross-validation error. Prune the new tree in this way using the "prune" function. Which features were actually used in the pruned tree? Why were certain variables not used?
```{r}
best_cp <- tree_gini$cptable[which.min(tree_gini$cptable[, "xerror"]), "CP"]
best_cp

pruned_tree = prune(tree_gini, cp = best_cp)

printcp(pruned_tree)

rpart.plot(pruned_tree)
```
Create the confusion matrix for the new model, and compare the performance of the model before and after pruning.
```{r}
pruned_pred = predict(pruned_tree, test_df, type="class")
pruned_conf_mat = table(Actual = test_df$MYDEPV, Predicted = pruned_pred)
pruned_conf_mat
print(diag(pruned_conf_mat) / rowSums(pruned_conf_mat))
```