This commit is contained in:
2026-02-13 14:03:28 +03:00
parent 417326498e
commit 65218abfb1
159 changed files with 2577567 additions and 2553 deletions

ds/25-1/3/1_02_EDA.ipynb Normal file

File diff suppressed because one or more lines are too long

View File

@ -1,654 +0,0 @@
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
# Enhancing Data Science Outcomes With Efficient Workflow #
## 02 - Data Exploration and Data Visualization ##
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, involves extracting and transforming raw data into features that better represent the underlying problem.
<p><img src='images/pipeline_overview_1.png' width=1080></p>
**Table of Contents**
<br>
In this notebook, we will load data from Parquet file format into a Dask DataFrame and perform various data transformations and exploratory data analysis. This notebook covers the below sections:
1. [Quick Recap](#s2-1)
2. [Data Exploration and Data Visualization](#s2-2)
* [Plotly](#s2-2.1)
* [Summarize](#s2-2.2)
* [Visualizing Distribution](#s2-2.3)
* [Exercise #1 - Histogram with GPU](#s2-e1)
* [Exercise #2 - Histogram with Log Scale](#s2-e2)
* [GroupBy Summarize](#s2-2.4)
* [Exercise #3 - Probability Bar Chart](#s2-e3)
* [Exercise #4 - Custom GroupBy Aggregation](#s2-e4)
* [Exercise #5 - Time-Series Analysis](#s2-e5)
* [Pivot Table](#s2-2.5)
3. [Summary](#s2-3)
<a name='s2-1'></a>
## Quick Recap ##
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
* Reading data without a schema or specifying `dtype`
* Having too many partitions due to small `chunksize`
* Memory spilling due to partitions being too large
* Performing groupby operations on too many groups scattered across multiple partitions
Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
<a name='s2-2'></a>
## Data Exploration and Data Visualization ##
Exploratory data analysis involves identification of predictor/feature variables and the target/class variable. We use this time to understand the distribution of the features and identify potentially problematic outliers. Data exploration helps users understand the data in order to better tackle a problem. It can be a way to ascertain the validity of the data as we begin to look for useful features that will help in the following stages of the development workflow.
<a name='s2-2.1'></a>
### Plotly ###
**Plotly** [[Doc]](https://plotly.com/) is a popular library for graphing and data dashboards. Plotly uses `plotly.graph_objects` to create figures for data visualization. Graph objects can be created using `plotly.express` or from the ground up. In order for Plotly to make a graph, data needs to be on the host, not the GPU. If the dataset is small, it may be more efficient to use `pandas` instead of `cudf` or `Dask-cuDF`. However, if the dataset is large, sending data to the GPU is a great way to speed up computation before sending it to the host for visualization. When using GPU acceleration and Plotly, only move the GPU DataFrame(s) to the host at the end with `to_pandas()`, as opposed to converting the entire GPU DataFrame(s) to a pandas DataFrame immediately. This will allow us to take advantage of GPU acceleration for processing.
For more information about how to use Plotly, we recommend [this guide](https://plotly.com/python/getting-started/).
We start by initiating the `LocalCUDACluster()` and Dask `Client()`, followed by loading data from the Parquet files into a Dask DataFrame.
```python
# import dependencies
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask_cudf
import numpy as np
import cupy as cp
import plotly.express as px
import gc
# create cluster
cluster=LocalCUDACluster()
# instantiate client
client=Client(cluster)
```
```python
# get the machine's external IP address
from requests import get
ip=get('https://api.ipify.org').content.decode('utf8')
print(f'Dask dashboard (status) address is: http://{ip}:8787/status')
print(f'Dask dashboard (GPU) address is: http://{ip}:8787/gpu')
```
```python
# read data
ddf=dask_cudf.read_parquet('clean_parquet')
print(f'Total of {len(ddf)} records split across {ddf.npartitions} partitions. ')
ddf.dtypes
```
<p><img src='images/tip.png' width=720></p>
The Parquet file format includes metadata to inform `Dask-cuDF` which data types to use for each column.
```python
# create continuous and categorical column lists
continuous_cols=['price', 'target', 'ts_hour', 'ts_minute', 'ts_weekday', 'ts_day', 'ts_month', 'ts_year']
categorical_cols=['event_type', 'category_code', 'brand', 'user_session', 'session_product', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'product_id', 'category_id', 'user_id']
```
```python
# preview DataFrame
ddf.head()
```
<a name='s2-2.2'></a>
### Summarize ###
We can use the `describe()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.describe/) method to generate summary statistics for continuous features.
```python
# generate summary statistics for continuous features
ddf[continuous_cols].describe().compute().to_pandas().apply(lambda s: s.apply('{0:.2f}'.format))
```
For categorical values, we are often interested in the [cardinality](https://en.wikipedia.org/wiki/Cardinality) of each feature. Cardinality is the number of unique elements a set contains. We use `.nunique()` to get the number of possible values for each categorical feature, as it informs how they can be encoded for machine learning model consumption.
```python
# count number of unique values for categorical features
ddf[categorical_cols].nunique().compute()
```
Note that in the previous step, we added `read_parquet()` to the task graph but did not `.persist()` data in memory. Recall that the Dask DataFrame APIs build the task graph until `.compute()` is called. The result of `.compute()` is a cuDF DataFrame and should be small.
**Observations**:
* The dataset has an ~41% purchase rate
* All data come from March of 2020
* Price has a very large standard deviation
<a name='s2-2.3'></a>
### Visualizing Distribution ###
A histogram is a graph that shows the frequency of data using rectangles. It's used to visualize the distribution of the data so we can quickly approximate concentration, skewness, and variability. We will use histograms to identify popularity characteristics in the dataset.
When using Plotly or other visualization libraries, it's best to keep the data on the GPU for as long as possible and only move it to the host when needed. For example, instead of relying on Plotly Express's `.histogram()`[[doc]](https://plotly.com/python/histograms/) function, we can use the `.value_counts()` method to count the number of occurrences. We can then pass the results to the `.bar()` function to generate a frequency bar chart. This can yield faster results by enabling GPU acceleration. Furthermore, we can use `.nlargest()` to limit the number of bars in the chart.
We want to visualize the distribution for specific features. Now that the data is in Parquet file format, we can use column pruning to only read in one column at a time to reduce the memory burden.
```python
%%time
# set cat column of interest
cat_col='cat_0'
# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
# create histogram
px.histogram(
# move data to CPU
ddf[cat_col].compute().to_pandas()
).update_layout(
yaxis_title='Frequency Count',
xaxis_title=cat_col,
title=f'Distribution of {cat_col}'
)
```
<a name='s2-e1'></a>
### Exercise #1 - Histogram with GPU ###
Instead of generating the histogram on CPU, we can use `.value_counts()` to achieve similar results.
**Instructions**: <br>
* Modify the `<FIXME>` only and execute the below cell to visualize the frequency of each `cat_0` value.
* Compare the performance efficiency with the previous CPU approach.
```python
%%time
# set cat column of interest
cat_col='cat_0'
n_bars=25
# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
# create frequency count DataFrame
cat_count_df=<<<<FIXME>>>>
# create histogram
px.bar(
# move data to CPU
cat_count_df.compute().to_pandas()
).update_layout(
yaxis_title='Frequency Count',
xaxis_title=cat_col,
title=f'Distribution of {cat_col}'
)
```
```python
%%time
cat_col='cat_0'
n_bars=25
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)
cat_count_df=ddf[cat_col].value_counts().nlargest(n_bars)
px.bar(
cat_count_df.compute().to_pandas()
).update_layout(
yaxis_title='Frequency Count',
xaxis_title=cat_col,
title=f'Distribution of {cat_col}'
)
```
Click ... to show **solution**.
Using cuDF to calculate the frequency is much more efficient. For continuous features, we often have to bin the values into buckets. We can use `cudf.Series.digitize()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.series.digitize/), but it's not implemented for `dask_cudf.core.Series`, so we have to use `.map_partitions()` to perform the `cudf.Series.digitize()` method on each partition.
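`cudf.Series.digitize()` follows the semantics of `numpy.digitize()`: each value is mapped to the (1-based) index of the bin it falls into. A minimal NumPy sketch of the binning idea, with made-up bin edges and prices:

```python
import numpy as np

# hypothetical bin edges and price values for illustration
bins = np.array([0.0, 50.0, 100.0, 150.0], dtype='float32')
prices = np.array([10.0, 75.0, 120.0, 200.0], dtype='float32')

# bin i covers [bins[i-1], bins[i]); values past the last edge get len(bins)
bin_ids = np.digitize(prices, bins)
print(bin_ids)  # [1 2 3 4]
```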
```python
%%time
# set cont column of interest
cont_col='price'
# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
# set bin
bins=np.array(range(-1, 10000, 50)).astype('float32')
# create frequency count DataFrame
cont_hist_df=ddf[cont_col].map_partitions(lambda p: p.digitize(bins)).value_counts()
# create histogram
px.bar(
# move data to CPU
cont_hist_df.compute().to_pandas(),
).update_xaxes(
tickmode='array',
tickvals=np.array(range(1, len(bins))),
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx, bin in enumerate(bins[:-1])],
).update_layout(
yaxis_title='Frequency Count',
xaxis_title=f'{cont_col} Bin',
title=f'Distribution of {cont_col}'
)
```
<a name='s2-e2'></a>
### Exercise #2 - Histogram with Log Scale ###
`price` is positively skewed. We might be able to visualize the distribution better by creating bins on a logarithmic scale. We will create bin ranges using `numpy.logspace()` and pass them to `cudf.Series.digitize()`.
**Instructions**: <br>
* Modify the `<FIXME>` only and execute the below cell to visualize the frequency of `price` bins in log scale.
```python
# set cont column of interest
cont_col='price'
# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
# set bin
bins=np.logspace(0, 5).astype('float32')
# create frequency count DataFrame
cont_hist_df=ddf['price'].map_partitions(<<<<FIXME>>>>).value_counts()
# create histogram
px.bar(
# move data to CPU
cont_hist_df.compute().to_pandas(),
).update_xaxes(
tickmode='array',
tickvals=np.array(range(1, len(bins))),
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx, bin in enumerate(bins[:-1])]
).update_layout(
yaxis_title='Frequency Count',
xaxis_title=f'{cont_col} Bin',
title=f'Distribution of {cont_col}'
)
```
```python
cont_col='price'
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)
bins=np.logspace(0, 5).astype('float32')
cont_hist_df=ddf['price'].map_partitions(lambda p: p.digitize(bins)).value_counts()
px.bar(
cont_hist_df.compute().to_pandas(),
).update_xaxes(
tickmode='array',
tickvals=np.array(range(1, len(bins))),
ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx, bin in enumerate(bins[:-1])]
).update_layout(
yaxis_title='Frequency Count',
xaxis_title=f'{cont_col} Bin',
title=f'Distribution of {cont_col}'
)
```
Click ... to show **solution**.
**Observations**:
* The vast majority of products are priced below $300.
<a name='s2-2.4'></a>
### GroupBy Summarize ###
We can use a variety of groupby aggregations to learn about the data. The aggregations supported by cuDF, as described [here](https://docs.rapids.ai/api/cudf/stable/user_guide/groupby/#aggregation), are very efficient. We might be interested in exploring several variations. To make the execution more efficient, we can `.persist()` the data into memory after reading from the source. Subsequent operations will not require loading from the source again.
For example, we can visualize the probability of an event for each category. When the target column is a binary indicator, we can do this quickly by calculating the aggregate mean. For a categorical feature with binary outcomes, users can use the arithmetic mean to find the _probability_.
We use `.groupby()` on `cat_0`, followed by `.agg('mean')` on `target` to determine the probability of positive outcome for each `cat_0` group.
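The idea that a groupwise mean of a binary target equals the empirical probability of a positive outcome can be sketched on the CPU with pandas (cuDF and Dask-cuDF mirror this API); the tiny DataFrame below is hypothetical:

```python
import pandas as pd

# made-up binary-outcome data for illustration
df = pd.DataFrame({
    'cat_0': ['a', 'a', 'b', 'b', 'b'],
    'target': [1, 0, 1, 1, 0],
})

# with a 0/1 target, the group mean is the fraction of positive outcomes
prob = df.groupby('cat_0')['target'].mean()
print(prob['a'])  # 0.5  (1 positive out of 2)
```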
```python
# read data
ddf=dask_cudf.read_parquet('clean_parquet')
# persist data in memory
ddf=ddf.persist()
wait(ddf)
```
```python
# set cat column of interest
cat_col='cat_0'
n_bars=25
# create groupby probability DataFrame
cat_target_df=ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)
# create bar chart
px.bar(
# move data to CPU
cat_target_df.compute().to_pandas()
).update_layout(
yaxis_title='Probability',
xaxis_title=cat_col,
title=f'Probability of {cat_col}'
)
```
Some categories have a higher probability than others.
Other groupby aggregations include:
* What time of the week is the busiest
```
# show probability of each ts_weekday and ts_hour group
ddf.groupby(['ts_weekday', 'ts_hour'])['target'].agg({'target': 'mean'})
```
<p><img src='images/tip.png' width=720></p>
`.groupby().size()` or `.groupby().agg('size')` is very similar to `.value_counts()`.
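A small pandas sketch of that equivalence (cuDF mirrors this API); the sample Series is made up:

```python
import pandas as pd

s = pd.Series(['x', 'y', 'x', 'x', 'y'])

# both produce per-value counts; value_counts additionally sorts by frequency
by_group = s.groupby(s).size().sort_index()
by_value = s.value_counts().sort_index()
print(by_group.equals(by_value))  # True
```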
<a name='s2-e3'></a>
### Exercise #3 - Probability Bar Chart ###
**Instructions**: <br>
* Modify the `<FIXME>` only and execute the below cell to visualize the probability of each `ts_hour` value.
```python
# set cat column of interest
cat_col=<<<<FIXME>>>>
n_bars=25
# create groupby probability DataFrame
cat_target_df=ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)
# create bar chart
px.bar(
# move data to CPU
cat_target_df.compute().to_pandas()
).update_layout(
yaxis_title='Probability',
xaxis_title=cat_col,
title=f'Probability of {cat_col}'
)
```
```python
cat_col='ts_hour'
n_bars=25
cat_target_df=ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)
px.bar(
cat_target_df.compute().to_pandas()
).update_layout(
yaxis_title='Probability',
xaxis_title=cat_col,
title=f'Probability of {cat_col}'
)
```
Click ... to show **solution**.
Some aggregations, such as `median`, `mode`, and `nunique`, require all data within the same group to be in memory for calculation. For these operations, `.groupby().apply()` is used. Because `.groupby().apply()` performs a shuffle, these operations scale poorly with a large number of groups.
We use `.groupby()` on `brand` and `.apply()` on `SeriesGroupBy['user_id'].nunique()` to get the number of unique customers that have interacted with each brand.
```python
# set cat columns of interest
cat_col='brand'
group_statistic='user_id'
# create groupby summarize DataFrame
product_frequency=ddf.groupby(cat_col)[group_statistic].apply(lambda g: g.nunique(), meta=(f'{group_statistic}_count', 'int32')).nlargest(25)
# create bar chart
px.bar(
# move data to CPU
product_frequency.compute().to_pandas()
).update_layout(
yaxis_title=f'Number of Unique {group_statistic}',
xaxis_title=cat_col,
title=f'Number of Unique {group_statistic} per {cat_col}'
)
```
```python
# visualize graph
product_frequency.visualize(rankdir='LR')
```
Certain brands and categories have a higher penetration and higher probability of positive outcome.
Other groupby aggregations include:
* How many unique `customer_id` are in each `product_id` group
```
# show how many customers interacted with each product
ddf.groupby('product_id')['user_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
* How many unique `cat_0` are in each `brand` group
```
# show how many categories of product do each brand carry
ddf.groupby('brand')['cat_0'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
* How many unique `product_id` are in each `user_session` group
```
# show how many products are viewed in each session
ddf.groupby('user_session')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
* How many unique `product_id` are in each `user_id` group
```
# show how many products each user interacts with
ddf.groupby('user_id')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
```
<p><img src='images/tip.png' width=720></p>
For Dask, we can ensure the result of `.groupby()` is sorted using the `sort` parameter.
Sometimes we want to perform custom aggregations that are not yet supported. For custom aggregations, we can use `.groupby().apply()` and user-defined functions. For example, we might be interested in the range of `price` for each `category_code`. Arithmetically, this is done by taking the difference between the group-specific maximum and the group-specific minimum. We can normalize the range by dividing it by the group-specific mean.
It's best to avoid using `.groupby().apply()` when possible. Similar results can be calculated by using `.groupby().agg()` to obtain the `max`, `min`, and `mean` separately, then applying a row-wise calculation with `.apply()`. This can be more efficient.
```python
%%time
# set cat column of interest
cat_col='category_code'
# define group-wise function
def normalized_range(group):
    return (group.max()-group.min())/group.mean()
# create groupby apply DataFrame
normalized_range_df=ddf.groupby(cat_col)['price'].apply(normalized_range, meta=('normalize_range', 'float64')).nlargest(25)
# create bar chart
px.bar(
# move data to CPU
normalized_range_df.compute().to_pandas()
).update_layout(
yaxis_title='Normalize Range',
xaxis_title=cat_col,
title=f'Normalize Range of price per {cat_col}'
)
```
```python
# visualize graph
normalized_range_df.visualize(rankdir='LR')
```
<a name='s2-e4'></a>
### Exercise #4 - Custom GroupBy Aggregation ###
**Instructions**: <br>
* Modify the `<FIXME>` only and execute the below cell to visualize the normalized range of each `category_code`.
* Compare the performance efficiency with the previous `.groupby().apply()` approach.
```python
%%time
# set cat column of interest
cat_col='category_code'
# define row-wise function
def normalized_range(group):
    return <<<<FIXME>>>>
# create groupby aggregate DataFrame
normalized_range_df=ddf.groupby(cat_col)['price'].agg({'price': ['max', 'min', 'mean']}).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)
# create bar chart
px.bar(
# move data to CPU
normalized_range_df.compute().to_pandas()
).update_layout(
yaxis_title='Normalize Range',
xaxis_title=cat_col,
title=f'Normalize Range of price per {cat_col}'
)
```
```python
cat_col='category_code'
def normalized_range(group):
    return (group['max']-group['min'])/group['mean']
normalized_range_df=ddf.groupby(cat_col)['price'].agg({'price': ['max', 'min', 'mean']}).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)
px.bar(
normalized_range_df.compute().to_pandas()
).update_layout(
yaxis_title='Normalize Range',
xaxis_title=cat_col,
title=f'Normalize Range of price per {cat_col}'
)
```
Click ... to show **solution**.
```python
# visualize graph
normalized_range_df.visualize(rankdir='LR')
```
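As a CPU-side sanity check, the `.groupby().apply()` version and the `.groupby().agg()` plus row-wise version compute the same normalized range; a pandas sketch with made-up data (cuDF mirrors this API):

```python
import pandas as pd

# hypothetical prices for two categories
df = pd.DataFrame({
    'category_code': ['a', 'a', 'b', 'b'],
    'price': [10.0, 30.0, 5.0, 15.0],
})

# group-wise UDF: one pass per group, requires a shuffle in Dask
via_apply = df.groupby('category_code')['price'].apply(
    lambda g: (g.max() - g.min()) / g.mean())

# built-in aggregations followed by a cheap row-wise calculation
agg = df.groupby('category_code')['price'].agg(['max', 'min', 'mean'])
via_agg = (agg['max'] - agg['min']) / agg['mean']

print(via_apply.tolist())  # [1.0, 1.0]
```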
We can apply predicate pushdown filters when reading from Parquet files with the `filters` parameter. This enables Dask-cuDF to skip row groups and files where _none_ of the rows can satisfy the criteria. This works well when the partitioning was planned around the filter column. Since this is not the case for our dataset, we will apply a separate filter after importing the data.
<a name='s2-e5'></a>
### Exercise #5 - Time-Series Analysis ###
When dealing with time-series data, cuDF provides powerful `.rolling()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.rolling/) and `.resample()`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.dataframe.resample/) methods to perform window operations. Functionally, they behave very similarly to `.groupby()` operations. We use `.resample()` followed by `.interpolate()` to find the frequency and probability over the entire span of the dataset.
We use `.map_partitions()` to perform `.resample()` on each partition. Because `.map_partitions()` doesn't perform a shuffle first, we manually perform a `.shuffle()` to ensure all members of each group are together. Once we have all the needed data in the same partitions, we can use `.map_partitions()` and pass the cuDF DataFrame `.resample()` operation.
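How `.resample()` buckets records into fixed time windows can be sketched with pandas on a single partition (cuDF's `.resample()` behaves the same way); the timestamps and targets below are made up:

```python
import pandas as pd

# hypothetical events with a binary purchase target
df = pd.DataFrame({
    'event_time': pd.to_datetime([
        '2020-03-01 00:10', '2020-03-01 00:40',
        '2020-03-01 01:05', '2020-03-01 03:30']),
    'target': [1, 0, 1, 1],
})

# number of records and purchase rate per 2-hour window
counts = df.resample('2h', on='event_time').size()
rate = df.resample('2h', on='event_time')['target'].mean()
print(counts.tolist())  # [3, 1]
```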
**Instructions**: <br>
* Execute the below cell to clear memory.
* Execute the cell below to read data into memory and shuffle based on `ts_day`. This ensures that all records belonging to the same group are in the same partition.
* Execute the cell below to show the user shopping behavior.
* Modify the `resample_frequency` to various frequencies to look for more obvious patterns.
```python
del ddf
gc.collect()
```
```python
# read data with predicate pushdown
ddf=dask_cudf.read_parquet('clean_parquet', filters=[('ts_day', "<", 15)])
# apply filtering
ddf=ddf[ddf['ts_day']<15]
# shuffle first on ts_day
ddf=ddf.shuffle('ts_day')
```
```python
# set resample frequency
resample_frequency='3h'
# get time-series DataFrame
activity_amount_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time').size().interpolate('linear'))
purchase_rate_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time')['target'].mean().interpolate('linear'))
# create scatter plot
px.scatter(
# move data to CPU
activity_amount_trend.compute().to_pandas().sort_index(),
color=purchase_rate_trend.compute().to_pandas().sort_index()
).update_traces(
mode='markers+lines'
).update_layout(
yaxis_title='Number of Records',
xaxis_title='event_time',
title=f'Amount of Transactions Over Time'
)
```
<a name='s2-2.5'></a>
### Pivot Table ###
When data is small enough to fit on a single GPU, it's often faster to perform data transformations with cuDF. Below we read a few numerical columns, which fit nicely in memory. We use `.pivot_table()` to find the probability and frequency of each `ts_hour` and `ts_weekday` group.
```python
# read data
gdf=cudf.read_parquet('clean_parquet', columns=['ts_weekday', 'ts_hour', 'target'])
```
```python
# create pivot table
activity_amount=gdf.pivot_table(index=['ts_weekday'], columns=['ts_hour'], values=['target'], aggfunc='size')['target']
# create heatmap
px.imshow(
# move data to CPU
activity_amount.to_pandas(),
title='there is more activity in the day'
).update_layout(
title=f'Number of Records Heatmap'
)
```
```python
# create pivot table
purchase_rate=gdf[['target', 'ts_weekday', 'ts_hour']].pivot_table(index=['ts_weekday'], columns=['ts_hour'], aggfunc='mean')['target'].to_pandas()
# create heatmap
px.imshow(
# move data to CPU
purchase_rate,
title='there is potentially a higher purchase rate in the evening'
).update_layout(
title=f'Probability Heatmap'
)
```
**Observations**:
* Behavior changes with `ts_weekday` and `ts_hour` - e.g. during the week, users do not stay up late since they work the next day.
<a name='s2-3'></a>
## Summary ##
* `.groupby().apply()` requires shuffling, which is time-expensive. When possible, try to use `.groupby().agg()` instead
* Keeping data processing on the GPU can help generate visualizations quickly
* Use predicate filtering and column pruning to reduce the amount of data read into memory. When data size is small, processing on cuDF can be more efficient than Dask-cuDF
* Use `.persist()` if subsequent operations are exploratory
* `.map_partitions()` does not involve shuffling
```python
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
```
**Well Done!** Let's move to the [next notebook](1_03_categorical_feature_engineering.ipynb).
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
# Enhancing Data Science Outcomes With Efficient Workflow #
## 03 - Feature Engineering for Categorical Features ##
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, involves extracting and transforming raw data into features that better represent the underlying problem.
<p><img src='images/pipeline_overview_1.png' width=1080></p>
**Table of Contents**
<br>
In this notebook, we will load data from Parquet file format into a Dask DataFrame and create additional features for machine learning model training. This notebook covers the below sections:
1. [Quick Recap](#s3-1)
2. [Feature Engineering](#s3-2)
* [User Defined Functions](#s3-2.1)
3. [Feature Engineering Techniques](#s3-3)
* [One-Hot Encoding](#s3-3.1)
* [Combining Categories](#s3-3.2)
* [Categorify / Label Encoding](#s3-3.3)
* [Count Encoding](#s3-3.4)
* [Target Encoding](#s3-3.5)
* [Embeddings](#s3-3.6)
4. [Summary](#s3-4)
<a name='s3-1'></a>
## Quick Recap ##
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
* Reading data without a schema or specifying `dtype`
* Having too many partitions due to small `chunksize`
* Memory spilling due to partitions being too large
* Performing groupby operations on too many groups scattered across multiple partitions
Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
<a name='s3-2'></a>
## Feature Engineering ##
Feature engineering converts raw data into numeric vectors for model consumption. For categorical data this is generally referred to as encoding, which transforms categories into continuous values. When encoding categorical values, there are three primary methods:
* Label encoding, when the categories have no ordered relationship
* Ordinal encoding, when the categories have an ordered relationship
* One-hot encoding, when the categorical data is binary in nature
Additionally, we can create numerous sets of new features from existing ones, which are then tested for effectiveness during model training. Feature engineering is an important step when working with tabular data as it can improve a machine learning model's ability to learn faster and extract patterns. Feature engineering can be a time-consuming process, particularly when the dataset is large and each processing cycle takes a long time. The ability to perform feature engineering efficiently enables more exploration of useful features.
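The three encoding strategies above can be sketched with pandas (cuDF mirrors these APIs); the example Series and ordering are made up:

```python
import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'low'])

# label encoding: an arbitrary integer per category (here, alphabetical)
label_codes = s.astype('category').cat.codes

# ordinal encoding: integers that respect a known order
order = {'low': 0, 'medium': 1, 'high': 2}
ordinal_codes = s.map(order)

# one-hot encoding: one binary column per category
one_hot = pd.get_dummies(s)
print(one_hot.shape)  # (4, 3)
```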
<a name='s3-2.1'></a>
### User-Defined Functions ###
Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame style operators. While out of the box functions are flexible and useful, it is sometimes necessary to write custom code, or **user-defined functions** (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.
Users can execute UDFs on `cudf.Series` with:
* `cudf.Series.apply()` or
* Numba's `forall` syntax [(link)](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#lower-level-control-with-custom-numba-kernels)
Users can execute UDFs on `cudf.DataFrame` with:
* `cudf.DataFrame.apply()`
* `cudf.DataFrame.apply_rows()`
* `cudf.DataFrame.apply_chunks()`
* `cudf.rolling().apply()`
* `cudf.groupby().apply_grouped()`
Note that applying UDFs directly with Dask-cuDF is not yet implemented. For now, users can use `map_partitions` to apply a function to each partition of the distributed dataframe.
Currently, the use of string data within UDFs is provided through the `string_udf` library. This is powerful for use cases such as string splitting, regular expression, and tokenization. The topic of handling string data is discussed extensively [here](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#string-data). In addition to `Series.str`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/string_handling.html), cudf also supports `Series.list`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/list_handling.html) for applying custom transformations.
<p><img src='images/tip.png' width=720></p>
Below are some tips:
* `apply` works by applying the provided function to each group sequentially, and concatenating the results together. This can be very slow, especially for a large number of small groups. For a small number of large groups, it can give acceptable performance.
* With cuDF, we can also combine NumPy or CuPy methods into the procedure.
* Related to `apply`, iterating over a cuDF Series, DataFrame, or Index is not supported, because iterating over data that resides on the GPU yields extremely poor performance: GPUs are optimized for highly parallel operations rather than sequential ones. In the vast majority of cases, iteration can be avoided by using an existing function or method to accomplish the same task. If iteration is truly unavoidable, copy the data from GPU to host with `.to_arrow()` or `.to_pandas()`, iterate there, then copy the result back to the GPU using `.from_arrow()` or `.from_pandas()`.
<a name='s3-3'></a>
## Feature Engineering Techniques ##
Below is a list of common feature engineering techniques.
<img src='images/feature_engineering_methods.png' width=720>
```python
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask.dataframe as dd
import dask_cudf
import gc
# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
```
```python
# get the machine's external IP address
from requests import get
ip=get('https://api.ipify.org').content.decode('utf8')
print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
```
```python
# read data as Dask-cuDF DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')
ddf=ddf.categorize(columns=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3'])
```
```python
ddf=ddf.persist()
```
<p><img src='images/check.png' width=720></p>
Did you get an error message? This notebook depends on the processed source file from previous notebooks.
<a name='s3-3.1'></a>
### One-Hot Encoding ###
**One-Hot Encoding**, also known as dummy encoding, creates several binary columns to indicate whether a row belongs to a specific category. It works well for categorical features that are not ordinal and have low cardinality. With one-hot encoding, each row gets a 1 in the column for its category and a 0 everywhere else.
For example, we can use `cudf.get_dummies()` to perform one-hot encoding on one of the categorical columns.
<img src='images/tip.png' width=720>
One-hot encoding doesn't work well for categorical features with large cardinality, as it results in high dimensionality. This is particularly an issue for neural network optimizers. Furthermore, data should not be saved in one-hot encoded format; if needed, it should only be used temporarily for specific tasks.
```python
def one_hot(df, cat):
    temp=dd.get_dummies(df[cat])
    return dask_cudf.concat([df, temp], axis=1)
```
```python
one_hot(ddf, 'cat_0').head()
```
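The same idea can be sketched on the CPU with plain pandas as a stand-in for the Dask-cuDF DataFrame used above (the toy column values below are made up for illustration):

```python
import pandas as pd

# toy stand-in for one categorical column of the real dataset
df = pd.DataFrame({"cat_0": ["electronics", "apparel", "electronics", "appliances"]})

# one binary column per category; each row has exactly one 1
dummies = pd.get_dummies(df["cat_0"], dtype=int)
print(dummies)
```

Note how the number of new columns equals the cardinality of the feature, which is exactly why this encoding becomes impractical for high-cardinality columns such as `user_id`.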
<a name='s3-3.2'></a>
### Combining Categories ###
**Combining categories** creates new features that better identify patterns when the categories independently don't provide enough information to predict the target. It's also known as _cross column_ or _cross product_. It's a common data preprocessing step for machine learning since it reduces the cost of model training, and it's also common in exploratory data analysis. Properly combined categorical features encourage more effective splits in tree-based methods than considering each feature independently.
For example, while `ts_weekday` and `ts_hour` may independently have no significant patterns, we might observe more obvious patterns if the two features are combined into `ts_weekday_hour`.
<img src='images/tip.png' width=720>
When deciding which categorical features should be combined, it's important to balance the number of categories used, the number of observations in each combined category, and the information gain. Combining features reduces the number of observations per resulting category, which can lead to overfitting. Typically, combining low-cardinality categories is recommended. Otherwise, experimentation is needed to discover the best combinations.
```python
def combine_cats(df, left, right):
df['-'.join([left, right])]=df[left].astype('str').str.cat(df[right].astype('str'))
return df
```
```python
combine_cats(ddf, 'ts_weekday', 'ts_hour').head()
```
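A minimal CPU sketch of the same recipe, using pandas and made-up weekday/hour values, shows what `combine_cats` produces:

```python
import pandas as pd

# toy weekday/hour columns standing in for ts_weekday and ts_hour
df = pd.DataFrame({"ts_weekday": [0, 0, 5], "ts_hour": [9, 18, 9]})

# same recipe as combine_cats: cast both columns to strings and concatenate
df["ts_weekday-ts_hour"] = (
    df["ts_weekday"].astype(str).str.cat(df["ts_hour"].astype(str))
)
print(df["ts_weekday-ts_hour"].tolist())  # ['09', '018', '59']
```

Rows that share both weekday and hour now fall into the same combined category, so a model can pick up interaction patterns (e.g. "Saturday morning") that neither column exposes on its own.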
<a name='s3-3.3'></a>
### Categorify and Grouping ###
**Categorify**, also known as *Label Encoding*, converts categorical features into continuous integers. Typically, it maps the values to monotonically increasing positive integers from 0 to *C*, the cardinality. It enables numerical computations and can also reduce memory utilization if the original feature contains string values. Categorify is a required preprocessing step for using categorical features with the embedding layers of deep learning models.
Categorifying works well when the feature is ordinal, and is sometimes necessary when the cardinality is large. When categorifying a feature, we can apply a frequency threshold that groups all categories with lower frequency counts into an `other` category, while values unseen during fitting map to a special `unknown` category. Mapping all infrequent categories to the same index keeps the model from overfitting to sparse signals.
```python
def categorify(df, cat, freq_threshold):
freq=df[cat].value_counts()
freq=freq.reset_index()
freq.columns=[cat, 'count']
# reset index on the frequency dataframe for a new sequential index
freq=freq.reset_index()
freq.columns=[cat+'_Categorify', cat, 'count']
    # apply the frequency threshold to group low-frequency categories together
    freq_filtered=freq[freq['count']>freq_threshold]
# add 2 to the new index as we want to use index 0 for others and 1 for unknown
freq_filtered[cat+'_Categorify']=freq_filtered[cat+'_Categorify']+2
freq_filtered=freq_filtered.drop(columns=['count'])
# merge original dataframe with newly created dataframe to obtain the categorified value
df=df.merge(freq_filtered, how='left', on=cat)
# fill null values with 0 to represent low frequency categories grouped as other
df[cat + '_Categorify'] = df[cat + '_Categorify'].fillna(0)
return df
```
```python
categorify(ddf, 'cat_0', 10).head()
```
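The thresholding logic above can be sketched on the CPU with pandas; this is a simplified stand-in for the Dask-cuDF version (no merge needed on a single Series), with toy data made up for illustration:

```python
import pandas as pd

def categorify_cpu(s, freq_threshold):
    # frequency count per category, sorted most frequent first
    freq = s.value_counts()
    # categories above the threshold get codes 2, 3, ... (0 = other, 1 = unknown)
    keep = freq[freq > freq_threshold]
    mapping = {cat: code + 2 for code, cat in enumerate(keep.index)}
    # anything not kept falls back to the 'other' code 0
    return s.map(mapping).fillna(0).astype(int)

s = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"])
print(categorify_cpu(s, 2).tolist())  # [2, 2, 2, 2, 2, 3, 3, 3, 0]
```

With a threshold of 2, `a` and `b` survive and get codes 2 and 3, while the single `c` observation is grouped into the `other` bucket (code 0).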
<a name='s3-3.4'></a>
### Count Encoding ###
*Count Encoding* represents a feature by its frequency count, which can be interpreted as the popularity of a category.
For example, we can count the frequency of `user_id` with `cudf.Series.value_counts()`. This creates a feature that can help a machine learning model learn the behavior pattern of users with low frequency together.
```python
def count_encoding(df, cat):
count_df=df[cat].value_counts()
count_df=count_df.reset_index()
count_df.columns=[cat, cat+'_CE']
df=df.merge(count_df, on=cat)
return df
```
```python
count_encoding(ddf, 'user_id').head()
```
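A minimal pandas sketch (toy `user_id` values made up for illustration) shows what the merged `_CE` column contains:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [10, 10, 10, 20, 30, 30]})

# frequency of each user_id, mapped back onto every row as a new feature
df["user_id_CE"] = df["user_id"].map(df["user_id"].value_counts())
print(df["user_id_CE"].tolist())  # [3, 3, 3, 1, 2, 2]
```

Every row belonging to the same user receives that user's total event count, so rarely-seen users all share small values and can be learned about jointly.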
<a name='s3-3.5'></a>
### Target Encoding ###
**Target Encoding** represents a categorical feature based on its effect on the target variable. One common technique is to replace values with the probability of the target given a category. Target encoding creates a new feature, which can be used by the model for training. The advantage of target encoding is that it processes the categorical features and makes them more easily accessible to the model during training and validation.
Mathematically, target encoding on a binary target can be:
$$p(t = 1 \mid x = c_i)$$
For a binary classifier, we can calculate the probability when the target is `true` or `1` by taking the mean for each category group. This is also known as *Mean Encoding*.
In other words, it calculates statistics, such as the arithmetic mean, from a target variable grouped by the unique values of one or more categorical features.
<img src='images/tip.png' width=720>
*Leakage*, also known as data leakage or target leakage, occurs when training a model with information that would not be available at the time of prediction. The inflated performance score that results overestimates the model's real-world utility. For example, including "temperature_celsius" as a feature when training and predicting "temperature_fahrenheit".
```python
def target_encoding(df, cat):
te_df=df.groupby(cat)['target'].mean().reset_index()
te_df.columns=[cat, cat+'_TE']
df=df.merge(te_df, on=cat)
return df
```
```python
target_encoding(ddf, 'brand').head()
```
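Mean encoding of a binary target can be sketched with pandas on made-up data; `transform("mean")` broadcasts each group mean back onto its rows, which is equivalent to the groupby-then-merge used above:

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["x", "x", "x", "y", "y"],
    "target": [1, 0, 1, 0, 0],
})

# mean of the binary target per brand estimates p(target = 1 | brand)
df["brand_TE"] = df.groupby("brand")["target"].transform("mean")
print(df["brand_TE"].tolist())
```

Brand `x` converts 2 out of 3 times, so its rows receive 2/3, while brand `y` never converts and receives 0. In practice, smoothing or out-of-fold computation is added to avoid leaking the target of the row being encoded.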
<a name='s3-3.6'></a>
### Embeddings ###
Deep learning models often apply **Embedding Layers** to categorical features. Over the past few years, this has become an increasingly popular technique for encoding categorical features. Since the embeddings need to be trained through a neural network, we will cover this in the next lab.
```python
ddf=one_hot(ddf, 'cat_0')
ddf=combine_cats(ddf, 'ts_weekday', 'ts_hour')
ddf=categorify(ddf, 'product_id', 100)
ddf=count_encoding(ddf, 'user_id')
ddf=count_encoding(ddf, 'product_id')
ddf=target_encoding(ddf, 'brand')
ddf=target_encoding(ddf, 'product_id')
ddf.head()
```
```python
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
```
**Well Done!** Let's move to the [next notebook](1_04_nvtabular_and_mgpu.ipynb).
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>


@ -1,341 +0,0 @@
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
# Enhancing Data Science Outcomes With Efficient Workflow #
## 04 - NVTabular ##
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, involves the extraction and transformation of raw data.
<p><img src='images/pipeline_overview_1.png' width=1080></p>
**Table of Contents**
<br>
In this notebook, we will use NVTabular to perform feature engineering. This notebook covers the below sections:
1. [NVTabular](#s4-1)
* [Multi-GPU Scaling in NVTabular with Dask](#s4-1.1)
2. [Operators](#s4-2)
3. [Feature Engineering and Preprocessing with NVTabular](#s4-3)
* [Defining the Workflow](#s4-3.1)
* [Exercise #1 - Using NVTabular Operators](#s4-e1)
* [Defining the Dataset](#s4-3.2)
* [Fit, Transform, and Persist](#s4-3.3)
* [Exercise #2 - Load Saved Workflow](#s4-e2)
<a name='s4-1'></a>
## NVTabular ##
[NVTabular](https://nvidia-merlin.github.io/NVTabular/main/index.html) is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte scale datasets. It provides high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS [cuDF](https://docs.rapids.ai/api/cudf/stable/) library. While NVTabular is built upon the RAPIDS cuDF library, it improves upon cuDF in that data is not limited to GPU memory capacity. The API documentation can be found [here](https://nvidia-merlin.github.io/NVTabular/main/api.html#).
Core features of NVTabular include:
* Easily process data by leveraging built-in or custom operators specifically designed for machine learning algorithms
* Computations are carried out on the GPU with best practices baked into the library, allowing us to realize significant acceleration
* Provide higher-level API to greatly simplify code complexity while still providing the same level of performance
* Work on arbitrarily large datasets when used with [Dask](https://www.dask.org/)
* Minimize the number of passes through the data with [Lazy execution](https://en.wikipedia.org/wiki/Lazy_evaluation)
In doing so, NVTabular helps data scientists and machine learning engineers to:
* Process datasets that exceed GPU and CPU memory without having to worry about scale
* Focus on what to do with the data and not how to do it by using abstraction at the operation level
* Prepare datasets quickly and easily for experimentation so that more models can be trained
Data science can be an iterative process that requires extensive repeated experimentation. The ability to perform feature engineering and preprocessing quickly translates into faster iteration cycles, which can help us to arrive at an optimal solution.
<a name='s4-1.1'></a>
### Multi-GPU Scaling in NVTabular with Dask ###
NVTabular supports multi-GPU scaling with [Dask-CUDA](https://github.com/rapidsai/dask-cuda) and `dask.distributed`[[doc]](https://distributed.dask.org/en/latest/). For multi-GPU, NVTabular uses [Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) for internal data processing. The parallel performance can depend strongly on the size of the partitions, the shuffling procedure used for data output, and the arguments used for transformation operations.
<a name='s4-2'></a>
## Operators ##
NVTabular has already implemented several data transformations, called `ops`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/generated/nvtabular.ops.Operator.html). An `op` can be applied to a `ColumnGroup` from an overloaded `>>` operator, which in turn returns a new `ColumnGroup`. A `ColumnGroup` is a list of column names as text.
```
features = [ column_name_1, column_name_2, ...] >> op1 >> op2 >> ...
```
Since the Dataset API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex preprocessing operations that are not yet supported in NVTabular can still be accomplished with the Dask-cuDF API.
Common operators include:
* [Categorify](https://nvidia-merlin.github.io/NVTabular/main/api/ops/categorify.html) - transform categorical features into unique integer values
* Can apply a frequency threshold to group low frequent categories together
* [TargetEncoding](https://nvidia-merlin.github.io/NVTabular/main/api/ops/targetencoding.html) - transform categorical features into group-specific mean of each row
    * Using `kfold=1` and `p_smooth=0` is the same as disabling this additional logic
* [Groupby](https://nvidia-merlin.github.io/NVTabular/main/api/ops/groupby.html) - transform feature into the result of one or more groupby aggregations
    * **NOTE**: Does not move data between partitions, which means data should be shuffled by `groupby_cols`
* [JoinGroupby](https://nvidia-merlin.github.io/NVTabular/main/api/ops/joingroupby.html) - add new feature based on desired group-specific statistics of requested continuous features
* Supported statistics include [`count`, `sum`, `mean`, `std`, `var`].
* [LogOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/log.html) - log transform with the continuous features
* [FillMissing](https://nvidia-merlin.github.io/NVTabular/main/api/ops/fillmissing.html) - replaces missing values with a constant pre-defined value
* [Bucketize](https://nvidia-merlin.github.io/NVTabular/main/api/ops/bucketize.html) - transform continuous features into categorical features with bins based on provided bin boundaries
* [LambdaOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/lambdaop.html) - enables custom row-wise dataframe manipulations with NVTabular
* [Rename](https://nvidia-merlin.github.io/NVTabular/main/api/ops/rename.html) - rename columns
* [Normalize](https://nvidia-merlin.github.io/NVTabular/main/api/ops/normalize.html) - standardize continuous features using the mean and standard deviation
```python
# import dependencies
import nvtabular as nvt
from nvtabular.ops import *
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
import cudf
import gc
# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
```
```python
# get the machine's external IP address
from requests import get
ip=get('https://api.ipify.org').content.decode('utf8')
print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
```
```python
# read data as Dask-cuDF DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')
# preview DataFrame
ddf.head()
```
<a name='s4-3'></a>
## Feature Engineering and Preprocessing with NVTabular ##
The typical steps for developing with NVTabular include:
1. Design and Define Operations in the Pipeline
2. Create Workflow
3. Create Dataset
4. Apply Workflow to Dataset
<p><img src='images/nvtabular_diagram.png' width=720></p>
<a name='s4-3.1'></a>
### Defining the Workflow ###
We start by creating the `nvtabular.workflow.workflow.Workflow`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html), which defines the operations and preprocessing steps that we would like to perform on the data.
We will perform the following feature engineering and preprocessing steps:
* Categorify the categorical features
* Log transform and normalize continuous features
* Calculate group-specific `sum`, `count`, and `mean` of the `target` for categorical features
* Log transform `price`
* Calculate `product_id` specific relative `price` to average `price`
* Target encode all categorical features
One of the key advantages of using NVTabular is the high-level abstraction we can use, which simplifies code significantly.
```python
# assign features and label
cat_cols=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3']
cont_cols=['price', 'ts_hour', 'ts_minute', 'ts_weekday']
label='target'
```
```python
# categorify categorical features
cat_features=cat_cols >> Categorify()
```
<a name='s4-e1'></a>
### Exercise #1 - Using NVTabular Operators ###
We can use the `>>` operator to specify how columns will be transformed. We need to transform the `price` feature by performing the log transformation and normalization.
**Instructions**: <br>
* Review the documentation for the `LogOp()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/ops/log.html) and `Normalize()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/ops/normalize.html) operators.
* Modify the `<FIXME>`s only and execute the cell below to create a workflow.
```python
# log transform
price = (
['price']
>> FillMissing(0)
>> <<<<FIXME>>>>
>> <<<<FIXME>>>>
>> LambdaOp(lambda col: col.astype("float32"), dtype='float32')
)
```
```python
price = (
    ['price']
    >> FillMissing(0)
    >> LogOp()
    >> Normalize()
    >> LambdaOp(lambda col: col.astype("float32"), dtype='float32')
)
```
Click ... to show **solution**.
There are several ways to create a feature for relative `price` to average. We will do so with the below steps:
1. Calculate average `price` per group.
2. Define a function to calculate the percentage difference
3. Apply the user defined function to `price` and average `price`
```python
# relative price to the average price for the product_id
# create product_id specific average price feature
avg_price_product = ['product_id'] >> JoinGroupby(cont_cols =['price'], stats=["mean"])
# create user defined function to calculate percent difference
def relative_price_to_avg(col, gdf):
# introduce tiny number in case of 0
epsilon = 1e-5
col = ((gdf['price'] - col) / (col + epsilon)) * (col > 0).astype(int)
return col
# create product_id specific relative price to average
relative_price_to_avg_product = (
avg_price_product
>> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64')
>> Rename(name='relative_price_product')
)
```
```python
avg_price_category = ['category_code'] >> JoinGroupby(cont_cols =['price'], stats=["mean"])
# create product_id specific relative price to average
relative_price_to_avg_category = (
avg_price_category
>> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64')
>> Rename(name='relative_price_category')
)
```
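The relative-price computation these cells perform can be sketched on the CPU with pandas and made-up prices; `transform("mean")` stands in for the group mean that `JoinGroupby(stats=["mean"])` would join back onto each row:

```python
import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 1, 2],
    "price": [90.0, 110.0, 50.0],
})

# group-specific mean price per product_id
avg = df.groupby("product_id")["price"].transform("mean")

# percent difference from the group average, guarding against zero means
epsilon = 1e-5
df["relative_price_product"] = (
    (df["price"] - avg) / (avg + epsilon)
) * (avg > 0).astype(int)
print(df["relative_price_product"].round(4).tolist())  # [-0.1, 0.1, 0.0]
```

Product 1 averages 100, so its two rows land 10% below and 10% above average, while product 2's single price matches its own average exactly.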
```python
# calculate group-specific statistics for categorical features
ce_features=cat_cols >> JoinGroupby(stats=['sum', 'count'], cont_cols=label)
# target encode
te_features=cat_cols >> TargetEncoding(label)
```
We also add the target, i.e. `label`, to the set of returned columns. We can visualize our data processing pipeline with `graphviz` by calling `.graph`. The data processing pipeline is a DAG (directed acyclic graph).
```python
features=cat_features+cont_cols+ce_features+te_features+price+relative_price_to_avg_product+relative_price_to_avg_category+[label]
features.graph
```
We are now ready to construct a `Workflow` that will run the operations we defined above. To enable distributed parallelism, the NVTabular `Workflow` must be initialized with a `dask.distributed.Client` object. Since NVTabular already uses Dask-cuDF for internal data processing, there are no other requirements for multi-GPU scaling.
```python
# define our NVTabular Workflow with client to enable multi-GPU execution
# for multi-GPU execution, the only requirement is that we specify a client when
# initializing the NVTabular Workflow.
workflow=nvt.Workflow(features, client=client)
```
<a name='s4-3.2'></a>
### Defining the Dataset ###
All external data need to be converted to the universal `nvtabular.io.dataset.Dataset`[[doc]](https://nvidia-merlin.github.io/NVTabular/v0.7.1/api/dataset.html) type. The main purpose of this class is to abstract away the raw format of the data, and to allow other NVTabular classes to reliably materialize a `dask.dataframe.DataFrame` collection and/or collection-based iterator on demand.
The collection-based iterator is important when working with large datasets that do not fit into GPU memory since operations in the `Workflow` often require statistics calculated across the entire dataset. For example, `Normalize` requires measurements of the dataset mean and standard deviation, and `Categorify` requires an accounting of all the unique categories a particular feature can manifest. The `Dataset` object partitions the dataset into chunks that will fit into GPU memory to compute statistics in an online fashion.
A `Dataset` can be initialized from a variety of different raw-data formats:
1. With a parquet-dataset directory
2. With a list of files
3. In addition to handling data stored on disk, a `Dataset` can also be initialized from an existing cuDF DataFrame, or from a `dask.dataframe.DataFrame`
The data we pass to the `Dataset` constructor is usually the result of a query from some source, for example a data warehouse or data lake. The output is usually in Parquet, ORC, or CSV format. In our case, we have the data in Parquet format saved on disk from previous steps. When initializing a `Dataset` from a directory path, the `engine` argument should be used to specify either `parquet` or `csv` format. If initializing a `Dataset` from a list of files, the engine can be inferred.
Memory is an important consideration. The workflow will process data in chunks, therefore increasing the number of partitions will limit the memory footprint. Since we will initialize the `Dataset` with a DataFrame type (`cudf.DataFrame` or `dask.dataframe.DataFrame`), most of the parameters will be ignored and the partitions will be preserved. Otherwise, the data would be converted to a `dask.dataframe.DataFrame` with a maximum partition size of roughly 12.5% of the total memory on a single device by default. We can use the `npartitions` parameter for specifying into how many chunks we would like the data to be split. The partition size can be changed to a different fraction of total memory on a single device with the `part_mem_fraction` argument. Alternatively, a specific byte size can be specified with the `part_size` argument.
<p><img src='images/tip.png' width=720></p>
The NVTabular dataset should be created from Parquet files in order to get the best possible performance, preferably with a row group size of around 128MB. While NVTabular also supports reading from CSV files, reading CSV can be over twice as slow as reading from Parquet. It's recommended to convert a CSV dataset into Parquet format for use with NVTabular.
```python
# create dataset
dataset=nvt.Dataset(ddf)
print(f'The Dataset is split into {dataset.npartitions} partitions')
```
<a name='s4-3.3'></a>
### Fit, Transform, and Persist ###
NVTabular follows a familiar API for pipeline operations. We can `.fit()` the workflow to a training set to calculate the statistics for this workflow. Afterwards, we can use it to `.transform()` the training set and validation dataset. We will persist the transformed data to disk in parquet format for fast reading at train time. Importantly, we can use the `.save()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html#nvtabular.workflow.workflow.Workflow.save) method so that our `Workflow` can be used during model inference.
<p><img src='images/tip.png' width=720></p>
Since the `Dataset` API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex pre-processing operations, that are not yet supported in NVTabular, can still be accomplished with the `dask_cudf.DataFrame` API after the `Dataset` is converted with `.to_ddf`.
```python
# fit and transform dataset
workflow.fit(dataset)
output_dataset=workflow.transform(dataset)
```
```python
# save the workflow
workflow.save('nvt_workflow')
!ls -l nvt_workflow
```
```python
# remove existing parquet directory
!rm -R processed_parquet/*
# save output to parquet directory
output_path='processed_parquet'
output_dataset.to_parquet(output_path=output_path)
```
If needed, we can convert the `Dataset` object to `dask.dataframe.DataFrame` to inspect the results.
```python
# convert to DataFrame and preview
output_dataset.to_ddf().head()
```
<a name='s4-e2'></a>
### Exercise #2 - Load Saved Workflow ###
We can load a saved workflow, which will contain the graph, schema, and statistics. This is useful if the workflow should be applied to future datasets.
**Instructions**: <br>
* Review the [documentation](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html#nvtabular.workflow.workflow.Workflow.load) for the `.load()` _class_ method.
* Modify the `<FIXME>` only and execute the cell below to create a workflow.
* Execute the cell below to apply the graph of operators to transform the data.
```python
# load workflow
loaded_workflow=<<<<FIXME>>>>
```
```python
loaded_workflow=nvt.Workflow.load('nvt_workflow')
```
Click ... to show **solution**.
```python
# create dataset from parquet directory
dataset=nvt.Dataset('clean_parquet', engine='parquet')
# transform dataset
loaded_workflow.transform(dataset).to_ddf().head()
```
```python
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
```
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>


@ -0,0 +1,776 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0bf7f930-76a1-4c16-84e4-cf1e73b54c55",
"metadata": {},
"source": [
"<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
]
},
{
"cell_type": "markdown",
"id": "400a41da-bc38-4e9a-9ece-d2744ffb16b0",
"metadata": {
"tags": []
},
"source": [
"# Enhancing Data Science Outcomes With Efficient Workflow #"
]
},
{
"cell_type": "markdown",
"id": "8897c66c-4f9d-48b4-a60b-ddae16f2f61b",
"metadata": {},
"source": [
"## 04 - Embeddings ##\n",
"In this lab, you will use high-performance computing to create machine learning solutions. This lab covers the model development portion of the data science workflow. A good machine learning solution excels at both accuracy and inference performance. \n",
"\n",
"<p><img src='images/pipeline_overview_2.png' width=1080></p>\n",
"\n",
"**Table of Contents**\n",
"<br>\n",
"This notebook covers the below sections: \n",
"1. [Entity Embedding](#s4-1)\n",
"2. [Training the Embeddings](#s4-2)\n",
" * [Preparing the Data - Normalization](#s4-2.1)\n",
" * [Model Building](#s4-2.2)\n",
" * [Begin Training](#s4-2.3)\n",
"3. [Visualizing the Embeddings](#s4-3)\n",
"4. [Conclusion](#s4-4)"
]
},
{
"cell_type": "markdown",
"id": "28538773-6b95-4840-aca2-73a6f7d98b07",
"metadata": {},
"source": [
"<a name='s4-1'></a>\n",
"## Entity Embeddings ##\n",
"[Entity Embeddings](https://arxiv.org/pdf/1604.06737.pdf) are very similar to word embeddings used in NLP. They are a way to represent categorical features in a defined latent space. In the latent space, categories that are semantically similar have similar vectors. Embeddings can be trained to assign a learnable feature vector to each category. Using embeddings, each categorical value is mapped to its own associated vector representation that is more informative than a single point value. Even though embeddings require a large amount of data and computational resources to train, they have proven to be a great alternative encoding method to consider. Once trained, embeddings can boost the performance of downstream machine learning tasks when used as the input features. Users can combine the power of deep learning with traditional machine learning on tabular data. \n",
"\n",
"<p><img src='images/embedding.png' width=720></p>\n",
"\n",
"Reasons for using embeddings include: \n",
"* It is much more efficient than the one-hot approach for encoding when cardinality is high\n",
"* Allows rich relationships and complexities between categories to be captured\n",
"* Reduce memory usage and speed up downstream machine learning model training\n",
"* Once trained, the same embedding can be used for various use cases\n",
"* Can be used to visualize categorical data and for data clustering, since the embedding space quantifies semantic similarity as distance between the categories in the latent space\n",
"* Mitigates the need to perform cumbersome manual feature engineering, which requires extensive domain knowledge\n",
"\n",
"<p><img src='images/tip.png' width=720></p>\n",
"\n",
"Below are some tips about embeddings: \n",
"* Requires training with large amounts of data, making it inappropriate for unseen data such as when new categories are added\n",
"* Can overfit\n",
"* Difficult to interpret"
]
},
{
"cell_type": "markdown",
"id": "6ba4160d-4b41-40d3-93bc-f1fae0b9dddc",
"metadata": {},
"source": [
"<a name='s4-2'></a>\n",
"## Training the Embeddings ##\n",
"Embeddings aim to represent each entity as a numeric vector such that products in similar context have similar vectors. Mathematically, similar entities will have a large dot product whereas every entity when one-hot encoded has a zero dot product with every other entity. This is because all one-hot vectors are orthogonal. \n",
"\n",
"We will use [PyTorch](https://pytorch.org/) to train a simple fully-connected neural network. A surrogate problem is set up for the purpose of finding the embedding vectors. Neural networks have difficulty with sparse categorical features. Traditionally, embeddings are a way to reduce those features to increase model performance. \n",
"\n",
"Technically, the idea of an embedding layer is very similar to a dense or linear layer (without bias) in the neural network. When training an embedding this way, users will one-hot encode the categorical data so each record becomes a vector with C features, where C is the cardinality. We then perform matrix-vector multiplication on the input vector and the weights before feeding the result to the next layer. This is inefficient when the number of input features is large and sparse, as is the case for categorical features from a tabular dataset. \n",
"\n",
"A better and more efficient approach would be to train a `torch.nn.Embedding` layer, which can be treated as a \"lookup\" table with the label-encoded category id as the index. By choosing this approach, we avoid one-hot encoding and the matrix-vector multiplication. \n",
"\n",
"<p><img src='images/surrogate_problem.png' width=720></p>\n",
"\n",
"<p><img src='images/tip.png' width=720></p>\n",
"\n",
"Embeddings will naturally be affected by how the surrogate problem is defined. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ec50a570-247f-4cfc-8dc5-2c2b501de703",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# import dependencies\n",
"from tqdm import tqdm\n",
"import cudf\n",
"import cuml\n",
"import dask_cudf\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as torch_optim\n",
"from torch.utils.data import Dataset, DataLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "036bf6ee-d5cb-4f20-a591-681706a098ac",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# set device cuda to use GPU\n",
"device=torch.device('cuda')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3726fc69-2a2b-42e2-be12-d235ce2322c1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# define features and label\n",
"cols=['brand', 'cat_0', 'cat_1', 'cat_2', 'price', 'target']\n",
"cat_cols=['brand', 'cat_0', 'cat_1', 'cat_2']\n",
"label='target'\n",
"\n",
"feature_cols=[col for col in cols if col != label]\n",
"cont_cols=[col for col in feature_cols if col not in cat_cols] # ['price']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ae87d23f-0c67-4758-8842-ca5770e740f9",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total of 2461697 records.\n"
]
}
],
"source": [
"# read data\n",
"parquet_dir='processed_parquet'\n",
"\n",
"ddf=dask_cudf.read_parquet(parquet_dir, columns=cols)\n",
"gdf=ddf.compute()\n",
"\n",
"print(f'Total of {len(gdf)} records.')"
]
},
{
"cell_type": "markdown",
"id": "b9110c9d-5924-4cb2-8bf3-cabd398aad0e",
"metadata": {},
"source": [
"<p><img src='images/tip.png' width=720></p>\n",
"\n",
"Even though we intend to keep all the data in one GPU, we still recommend loading data with `Dask-cuDF`. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f782bc7e-e6c4-4d87-a839-5a99227dca7c",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>brand</th>\n",
" <th>cat_0</th>\n",
" <th>cat_1</th>\n",
" <th>cat_2</th>\n",
" <th>price</th>\n",
" <th>target</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>100.229996</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>871.839966</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>872.090027</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>306.690002</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>13</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>24</td>\n",
" <td>334.349976</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" brand cat_0 cat_1 cat_2 price target\n",
"0 1 6 5 2 100.229996 1\n",
"1 2 1 1 1 871.839966 1\n",
"2 2 1 1 1 872.090027 1\n",
"3 2 6 5 2 306.690002 1\n",
"4 13 2 3 24 334.349976 1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gdf.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3673f202-7aea-43a7-a569-4c210a614529",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'brand': (3303, 7), 'cat_0': (14, 3), 'cat_1': (61, 3), 'cat_2': (90, 3)}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the embedding vectors will start with 0 so we decrease the categorical values by 1 to match\n",
"gdf[cat_cols]=gdf[cat_cols]-1\n",
"\n",
"n_uniques=gdf.nunique()\n",
"\n",
"# use higher of 4th root of nunique and 3 for vector dimension\n",
"embedding_sizes={col: (n_uniques[col], max(3, int(n_uniques[col]**0.25))) for col in cat_cols}\n",
"embedding_sizes"
]
},
{
"cell_type": "markdown",
"id": "a327c1f9-0683-45f1-90a6-6d4d4daa093c",
"metadata": {
"tags": []
},
"source": [
"<p><img src='images/tip.png' width=720></p>\n",
"\n",
"The size of embeddings can become very large. For example, large embeddings are usually needed for users and items for large platforms. "
]
},
{
"cell_type": "markdown",
"id": "2c1c7fee-dad0-4009-a55c-513465db8a7c",
"metadata": {},
"source": [
"<a name='s4-2.1'></a>\n",
"### Preparing the Data - Normalization ###\n",
"**Normalization** is required to enable neural networks to leverage numerical features. Tree-based models do not require normalization as they define the split independent of the scale of a feature. Without normalization, neural networks are difficult to train. The reason is that different numerical features have different scales. When we combine the features in a hidden layer, the different scales make it more difficult to extract patterns from it. \n",
"\n",
"<p><img src='images/tip.png' width=720></p>\n",
"\n",
"We will also implement a `torch.nn.BatchNorm1d`[[doc]](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html) layer to mitigate the exploding gradient problem. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "fb1840b3-a7d8-4b91-98ef-bddf59afd5e6",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# normalize data\n",
"gdf['price']=cuml.preprocessing.StandardScaler().fit_transform(gdf[['price']])"
]
},
{
"cell_type": "markdown",
"id": "d6991948-f79a-4b51-b3a9-2571b2be5262",
"metadata": {
"tags": []
},
"source": [
"<a name='s4-2.2'></a>\n",
"### Model Building ###\n",
"We construct a model with several layers. The embeddings will be the same dimension as num_unique x vector_size. The embeddings will be concatenated, along with the continous variable(s), before they are fed into the next layer. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "35a8055b-8b7b-4fb8-8d3a-9f36fc03b171",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# define neural network with embedding layers\n",
"class ProductPurchaseModel(nn.Module):\n",
" def __init__(self, embedding_sizes, n_cont):\n",
" super().__init__()\n",
" # make an embedding for each categorical feature\n",
" # The `nn.Embedding` layer can be thought of as a lookup table where the key is \n",
" # the category index and the value is the corresponding embedding vector\n",
" self.embeddings=nn.ModuleList([nn.Embedding(n_categories, size) for n_categories, size in embedding_sizes.values()])\n",
" \n",
" # n_emb is the length of all embeddings combined\n",
" n_emb=sum(e.embedding_dim for e in self.embeddings)\n",
" \n",
" self.n_emb=n_emb\n",
" self.n_cont=n_cont\n",
" self.emb_drop = nn.Dropout(0.6)\n",
" \n",
" # apply dropout, batch norm and linear layers\n",
" self.bn1=nn.BatchNorm1d(self.n_cont)\n",
" self.lin1=nn.Linear(self.n_emb + self.n_cont, 200)\n",
" self.drop1=nn.Dropout(0.3)\n",
" self.bn2=nn.BatchNorm1d(200)\n",
" self.drop2=nn.Dropout(0.3)\n",
" self.lin2=nn.Linear(200, 70)\n",
" self.bn3=nn.BatchNorm1d(70)\n",
" self.lin3=nn.Linear(70, 2)\n",
"\n",
" def forward(self, X_cat, X_cont):\n",
" # map each categorical feature to the embedding vector on its corresponding embedding layer\n",
" x_1=[embedding(X_cat[:, idx]) for idx, embedding in enumerate(self.embeddings)]\n",
" \n",
" # concatenate all categorical embedding vectors together\n",
" x_1=torch.cat(x_1, 1)\n",
" \n",
" # apply random drop out, normalization, and activation\n",
" x_1=self.emb_drop(x_1)\n",
" x_2=self.bn1(X_cont)\n",
" \n",
" # concatenate categorical embeddings to input layer from continuous variable(s)\n",
" x_1=torch.cat([x_1, x_2], 1)\n",
" \n",
" # apply random drop out, normalization, and activation\n",
" x_1=F.relu(self.lin1(x_1))\n",
" x_1=self.drop1(x_1)\n",
" x_1=self.bn2(x_1)\n",
" x_1=F.relu(self.lin2(x_1))\n",
" x_1=self.drop2(x_1)\n",
" x_1=self.bn3(x_1)\n",
" x_1=self.lin3(x_1)\n",
" return x_1"
]
},
{
"cell_type": "markdown",
"id": "c52e50a2-99b6-4a8c-aa65-5f11a7806c6e",
"metadata": {},
"source": [
"<p><img src='images/tip.png' width=720></p>\n",
"\n",
"Tabular data uses shallow models with huge embedding tables and few feed-forward layers. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5b7d18b1-d29e-43d4-8091-3aba41968ebf",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"ProductPurchaseModel(\n",
" (embeddings): ModuleList(\n",
" (0): Embedding(3303, 7)\n",
" (1): Embedding(14, 3)\n",
" (2): Embedding(61, 3)\n",
" (3): Embedding(90, 3)\n",
" )\n",
" (emb_drop): Dropout(p=0.6, inplace=False)\n",
" (bn1): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (lin1): Linear(in_features=17, out_features=200, bias=True)\n",
" (drop1): Dropout(p=0.3, inplace=False)\n",
" (bn2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (drop2): Dropout(p=0.3, inplace=False)\n",
" (lin2): Linear(in_features=200, out_features=70, bias=True)\n",
" (bn3): BatchNorm1d(70, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" (lin3): Linear(in_features=70, out_features=2, bias=True)\n",
")"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# instantiate model\n",
"model=ProductPurchaseModel(embedding_sizes, len(cont_cols))\n",
"model.to(device)"
]
},
{
"cell_type": "markdown",
"id": "f35dab8e-f1cd-484b-999e-b9e0f7e79edd",
"metadata": {},
"source": [
"Next, we define a `torch.utils.data.Dataset` class to be use by `torch.utils.data.DataLoader`. The Dataset is makes it easier to track separate categorical and continuous variables. The DatalLoader wraps an iterable around the Dataset to enable easy access to the samples. More information about Dataset and DataLoader can be found in quick PyTorch [guide](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "98f74906-7b79-4fda-8626-df17023ee512",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# define dataset\n",
"class myDataset(Dataset):\n",
" def __init__(self, X, y, cat_cols, cont_cols):\n",
" self.X_cat=torch.as_tensor(X.loc[:, cat_cols].copy().values.astype('int32'), device=device)\n",
" self.X_cont=torch.as_tensor(X.loc[:, cont_cols].copy().values.astype('float32'), device=device)\n",
" self.y=torch.as_tensor(y.astype('int64'), device=device)\n",
" \n",
" def __len__(self):\n",
" return len(self.y)\n",
" \n",
" def __getitem__(self, idx): \n",
" return self.X_cat[idx], self.X_cont[idx], self.y[idx]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a0973509-6a11-49d8-b346-ab9ec8cfaef5",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# instantiate dataset\n",
"X_train=gdf[feature_cols]\n",
"y_train=gdf['target'].values\n",
"\n",
"train_ds=myDataset(X_train, y_train, cat_cols, cont_cols)"
]
},
{
"cell_type": "markdown",
"id": "5336cfd0-39ed-4285-9b66-e4f5d1b7d75e",
"metadata": {},
"source": [
"<a name='s4-2.3'></a>\n",
"### Begin Training ###\n",
"We will set some parameters for training. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0604708e-1c2c-485b-a029-eadd17356a03",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# set optimizer\n",
"def get_optimizer(model, lr = 0.001, wd = 0.0):\n",
" parameters=filter(lambda p: p.requires_grad, model.parameters())\n",
" optim=torch_optim.Adam(parameters, lr=lr, weight_decay=wd)\n",
" return optim"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "39e0ce25-f65c-4330-98cc-34ee4b30bae4",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# define training function\n",
"def train_model(model, optim, train_dl):\n",
" # set the model to training, which is useful for BatchNorm and Dropout layers that behave differently during training and evaluation\n",
" model.train()\n",
" total=0\n",
" sum_loss=0\n",
" \n",
" # iterate through batches\n",
" for b, (X_cat, X_cont, y) in enumerate(train_dl):\n",
" batch=y.shape[0]\n",
" \n",
" # forward pass\n",
" output=model(X_cat, X_cont)\n",
" \n",
" # calculate loss\n",
" loss=F.cross_entropy(output, y)\n",
" \n",
" # zero out the gradients so the parameters update correctly, otherwise gradients would be combined with old\n",
" optim.zero_grad()\n",
" loss.backward()\n",
" optim.step()\n",
" \n",
" # calculate total loss per batch\n",
" total+=batch\n",
" sum_loss+=batch*(loss.item())\n",
" return sum_loss/total"
]
},
{
"cell_type": "markdown",
"id": "a60dd511-3121-4eb0-beb7-3a03d56de202",
"metadata": {},
"source": [
"Instantiate a `torch.utils.data.DataLoader` and begin training. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5a25e4e6-f0b5-4bbc-8a1d-0eee74c7faaf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# define training loop\n",
"def train_loop(model, epochs, lr=0.01, wd=0.0):\n",
" # instantiate optimizer\n",
" optim=get_optimizer(model, lr = lr, wd = wd)\n",
" \n",
" # iterate through number of epochs\n",
" for i in tqdm(range(epochs)): \n",
" loss=train_model(model, optim, train_dl)\n",
" print(\"training loss: \", round(loss, 3))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68b43459-fb0a-4c13-9371-7c15327ff624",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 33%|███▎ | 1/3 [00:28<00:57, 28.79s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"training loss: 0.666\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 67%|██████▋ | 2/3 [00:57<00:28, 28.67s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"training loss: 0.665\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# define batch size and begin training\n",
"batch_size=1000\n",
"train_dl=DataLoader(train_ds, batch_size=batch_size, shuffle=True)\n",
"\n",
"train_loop(model, epochs=3, lr=0.05, wd=0.00001)"
]
},
{
"cell_type": "markdown",
"id": "7d6656b3-3642-4279-b787-0c034c45b739",
"metadata": {},
"source": [
"<a name='s4-3'></a>\n",
"## Visualizing the Embeddings ##"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20973ee4-a723-4931-bf50-8efffe275026",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# visualize embeddings\n",
"\n",
"# import dependencies\n",
"import plotly.express as px\n",
"import pandas as pd\n",
"\n",
"# pick category to visualize\n",
"category='brand'\n",
"\n",
"category_label=pd.read_parquet(f'categories/unique.{category}.parquet')[category]\n",
"category_label=category_label[1:]\n",
"\n",
"embeddings_idx=list(embedding_sizes.keys()).index(category)\n",
"embeddings=model.embeddings[embeddings_idx].weight.detach().cpu().numpy()\n",
"\n",
"fig=px.scatter_3d(\n",
" x=embeddings[:, 0], \n",
" y=embeddings[:, 1], \n",
" z=embeddings[:, 2], \n",
" text=category_label, \n",
" height=720\n",
")\n",
"fig.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "130a2b16-89e5-4eda-8155-014a75a3638e",
"metadata": {},
"outputs": [],
"source": [
"# persist embeddings\n",
"!mkdir trained_embedding_weights\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"for idx, each_col in enumerate(cat_cols): \n",
" weights=model.embeddings[idx].weight.detach().cpu().numpy()\n",
" pd.DataFrame(weights).to_csv(f'trained_embedding_weights/{each_col}.csv', index=False)"
]
},
{
"cell_type": "markdown",
"id": "bc7cce0e-6dcb-4d5a-82dd-e8074abaaaec",
"metadata": {},
"source": [
"<a name='s4-4'></a>\n",
"## Conclusion ##\n",
"Deep Learning is very good at feature extraction, which can be used for finding categorical embeddings. This is the advantage of using a Deep Learning approach, as it requires way less feature engineering and less dependent on domain knowledge. "
]
},
{
"cell_type": "markdown",
"id": "997bd6f7-9efb-4fee-b3d4-9d4454694c7b",
"metadata": {},
"source": [
"<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long

View File

@ -1,7 +0,0 @@
Some categorical features, such as `event_type`, `category_code`, and `brand`, are stored as raw text.
Prefer Dask-cuDF over cuDF if the dataset doesn't fit into GPU memory
none of the optimizations introduced by Dask-CUDA are available without a dask_cuda.LocalCUDACluster()
dask.distributed.Client
spilling moves data to host memory (up to <memory_limit>) when <device_memory_limit> is reached
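These notes can be made concrete with a small configuration sketch (assuming the `dask_cuda` and `dask.distributed` packages are installed; the memory limits shown are illustrative placeholders, not recommendations):

```python
# Sketch: a single-node GPU cluster where data spills from device to host
# memory once the per-GPU limit is reached. Limit values are placeholders.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    device_memory_limit="10GB",  # spill to host when a GPU exceeds this
    memory_limit="32GB",         # host memory available per worker
)
client = Client(cluster)
```

Without the `LocalCUDACluster`, plain `dask.distributed.Client` workers run on the CPU and the GPU-aware spilling described above does not apply.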

1552
ds/25-1/3/assessment.ipynb Normal file

File diff suppressed because one or more lines are too long

Binary image files added under ds/25-1/3/images/ (named entries from the diff):

BIN ds/25-1/3/images/agg.png (282 KiB)
BIN ds/25-1/3/images/check.png (24 KiB)
BIN ds/25-1/3/images/credit.png (792 KiB)
BIN ds/25-1/3/images/dask.png (230 KiB)
BIN ds/25-1/3/images/dtypes.png (222 KiB)
BIN ds/25-1/3/images/kernel.png (9.2 KiB)
BIN ds/25-1/3/images/tip.png (18 KiB)

View File

@ -1,92 +0,0 @@
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
<a name='s1-2.2'></a>
### Memory Utilization ###
Memory utilization on a DataFrame depends largely on the date types for each column.
<p><img src='images/dtypes.png' width=720></p>
We can use `DataFrame.memory_usage()` to see the memory usage for each column (in bytes). Most of the common data types have a fixed size in memory, such as `int`, `float`, `datetime`, and `bool`. Memory usage for these data types is the respective memory requirement multiplied by the number of data points. For `string` data types, the memory usage reported is the number of data points times 8 bytes. This accounts for the 64-bit required for the pointer that points to an address in memory but doesn't include the memory used for the actual string values. The actual memory required for a `string` value is 49 bytes plus an additional byte for each character. The `deep` parameter provides a more accurate memory usage report that accounts for the system-level memory consumption of the contained `string` data type.
Separately, we've provided a `dli_utils.make_decimal()` function to convert memory size into units based on powers of 2. In contrast to units based on powers of 10, this customary convention is commonly used to report memory capacity. More information about the two definitions can be found [here](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units).
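The implementation of `dli_utils.make_decimal()` is not shown here, but a minimal sketch of such a converter (binary, powers-of-2 units, written for this explanation and not the actual utility) might look like this:

```python
def make_decimal(num_bytes):
    """Convert a raw byte count into binary (powers-of-2) units."""
    for unit in ['B', 'KiB', 'MiB', 'GiB', 'TiB']:
        if num_bytes < 1024 or unit == 'TiB':
            return f'{num_bytes:.2f} {unit}'
        num_bytes /= 1024

print(make_decimal(2_500_000))  # 2.38 MiB
```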
```python
# import dependencies
import pandas as pd
import sys
import random
# import utility
from dli_utils import make_decimal
# import data
df=pd.read_csv('2020-Mar.csv')
# preview DataFrame
df.head()
```
```python
# convert feature as datetime data type
df['event_time']=pd.to_datetime(df['event_time'])
```
```python
# lists each column at 8 bytes/row
memory_usage_df=df.memory_usage(index=False)
memory_usage_df.name='memory_usage'
dtypes_df=df.dtypes
dtypes_df.name='dtype'
# show each column uses roughly number of rows * 8 bytes
# 8 bytes from 64-bit numerical data as well as 8 bytes to store a pointer for object data type
byte_size=len(df) * 8 * len(df.columns)
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
pd.concat([memory_usage_df, dtypes_df], axis=1)
```
```python
# lists each column's full memory usage
memory_usage_df=df.memory_usage(deep=True, index=False)
memory_usage_df.name='memory_usage'
byte_size=memory_usage_df.sum()
# show total memory usage
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
pd.concat([memory_usage_df, dtypes_df], axis=1)
```
```python
# alternatively, use sys.getsizeof() instead
byte_size=sys.getsizeof(df)
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
```
```python
# check random string-typed column
string_cols=[col for col in df.columns if df[col].dtype=='object' ]
column_to_check=random.choice(string_cols)
overhead=49
pointer_size=8
# nan==nan when value is not a number
# nan uses 32 bytes of memory
print(f'{column_to_check} column uses : {sum([(len(item)+overhead+pointer_size) if item==item else 32 for item in df[column_to_check].values])} bytes of memory.')
```
<p><img src='images/tip.png' width=720></p>
When Python stores a string, it actually uses memory for the overhead of the Python object, metadata about the string, and the string itself. The amount of memory usage we calculated includes temporary objects that get deallocated after the initial import. It's important to note that Python has memory optimization mechanics for strings: when the same string is created multiple times, Python will cache or "intern" it in memory and reuse it for later string objects.
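The per-string overhead and the interning behavior are easy to check with `sys.getsizeof` (CPython-specific behavior; the exact byte counts can vary between builds):

```python
import sys

a = 'hello'
b = 'hello'

# Identical literals compiled together are interned: both names point
# at the same string object in memory.
print(a is b)  # True

# An empty string carries the fixed object overhead; each additional
# ASCII character adds one byte on top of it.
print(sys.getsizeof(''))                         # typically 49 on 64-bit CPython
print(sys.getsizeof('abc') - sys.getsizeof(''))  # 3
```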
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

2239
ds/25-1/4/01_mnist.ipynb Normal file

File diff suppressed because one or more lines are too long

1669
ds/25-1/4/02_asl.ipynb Normal file

File diff suppressed because one or more lines are too long

1374
ds/25-1/4/03_asl_cnn.ipynb Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

1033
ds/25-1/4/06_nlp.ipynb Normal file

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,654 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center><a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 7. Assessment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations on going through today's course! Hopefully, you've learned some valuable skills along the way and had fun doing it. Now it's time to put those skills to the test. In this assessment, you will train a new model that is able to recognize fresh and rotten fruit. You will need to get the model to a validation accuracy of `92%` in order to pass the assessment, though we challenge you to do even better if you can. You will have the use the skills that you learned in the previous exercises. Specifically, we suggest using some combination of transfer learning, data augmentation, and fine tuning. Once you have trained the model to be at least 92% accurate on the validation dataset, save your model, and then assess its accuracy. Let's get started! "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"from torch.optim import Adam\n",
"from torch.utils.data import Dataset, DataLoader\n",
"import torchvision.transforms.v2 as transforms\n",
"import torchvision.io as tv_io\n",
"\n",
"import glob\n",
"from PIL import Image\n",
"\n",
"import utils\n",
"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"torch.cuda.is_available()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.1 The Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, you will train a model to recognize fresh and rotten fruits. The dataset comes from [Kaggle](https://www.kaggle.com/sriramr/fruits-fresh-and-rotten-for-classification), a great place to go if you're interested in starting a project after this class. The dataset structure is in the `data/fruits` folder. There are 6 categories of fruits: fresh apples, fresh oranges, fresh bananas, rotten apples, rotten oranges, and rotten bananas. This will mean that your model will require an output layer of 6 neurons to do the categorization successfully. You'll also need to compile the model with `categorical_crossentropy`, as we have more than two categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/fruits.png\" style=\"width: 600px;\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.2 Load ImageNet Base Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We encourage you to start with a model pretrained on ImageNet. Load the model with the correct weights. Because these pictures are in color, there will be three channels for red, green, and blue. We've filled in the input shape for you. If you need a reference for setting up the pretrained model, please take a look at [notebook 05encourageb](05b_presidential_doggy_door.ipynb) where we implemented transfer learning."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from torchvision.models import vgg16\n",
"from torchvision.models import VGG16_Weights\n",
"\n",
"weights = VGG16_Weights.DEFAULT\n",
"vgg_model = vgg16(weights=weights)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.3 Freeze Base Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we suggest freezing the base model, as done in [notebook 05b](05b_presidential_doggy_door.ipynb). This is done so that all the learning from the ImageNet dataset does not get destroyed in the initial training."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Freeze base model\n",
"vgg_model.requires_grad_(False)\n",
"next(iter(vgg_model.parameters())).requires_grad"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.4 Add Layers to Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it's time to add layers to the pretrained model. [Notebook 05b](05b_presidential_doggy_door.ipynb) can be used as a guide. Pay close attention to the last dense layer and make sure it has the correct number of neurons to classify the different types of fruit.\n",
"\n",
"The later layers of a model become more specific to the data the model trained on. Since we want the more general learnings from VGG, we can select parts of it, like so:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sequential(\n",
" (0): Linear(in_features=25088, out_features=4096, bias=True)\n",
" (1): ReLU(inplace=True)\n",
" (2): Dropout(p=0.5, inplace=False)\n",
")"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vgg_model.classifier[0:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we've taken what we've wanted from VGG16, we can then add our own modifications. No matter what additional modules we add, we still need to end with one value for each output."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sequential(\n",
" (0): Sequential(\n",
" (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (1): ReLU(inplace=True)\n",
" (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (3): ReLU(inplace=True)\n",
" (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n",
" (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (6): ReLU(inplace=True)\n",
" (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (8): ReLU(inplace=True)\n",
" (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n",
" (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (11): ReLU(inplace=True)\n",
" (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (13): ReLU(inplace=True)\n",
" (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (15): ReLU(inplace=True)\n",
" (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n",
" (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (18): ReLU(inplace=True)\n",
" (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (20): ReLU(inplace=True)\n",
" (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (22): ReLU(inplace=True)\n",
" (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n",
" (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (25): ReLU(inplace=True)\n",
" (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (27): ReLU(inplace=True)\n",
" (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n",
" (29): ReLU(inplace=True)\n",
" (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n",
" )\n",
" (1): AdaptiveAvgPool2d(output_size=(7, 7))\n",
" (2): Flatten(start_dim=1, end_dim=-1)\n",
" (3): Sequential(\n",
" (0): Linear(in_features=25088, out_features=4096, bias=True)\n",
" (1): ReLU(inplace=True)\n",
" (2): Dropout(p=0.5, inplace=False)\n",
" )\n",
" (4): Linear(in_features=4096, out_features=500, bias=True)\n",
" (5): ReLU()\n",
" (6): Linear(in_features=500, out_features=6, bias=True)\n",
")"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"N_CLASSES = 6\n",
"\n",
"my_model = nn.Sequential(\n",
" vgg_model.features,\n",
" vgg_model.avgpool,\n",
" nn.Flatten(),\n",
" vgg_model.classifier[0:3],\n",
" nn.Linear(4096, 500),\n",
" nn.ReLU(),\n",
" nn.Linear(500, N_CLASSES)\n",
")\n",
"my_model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.5 Compile Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Now it's time to choose a loss function and an optimizer and compile the model. We have 6 classes, so which loss function should we use?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"loss_function = nn.CrossEntropyLoss()\n",
"optimizer = Adam(my_model.parameters())\n",
"my_model = torch.compile(my_model.to(device))"
]
},
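  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick aside, `CrossEntropyLoss` fits a 6-class problem because it expects raw logits of shape `(batch, classes)` and integer class labels, applying the softmax internally. The hypothetical sanity check below (standalone, not part of the assessment) illustrates this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sanity check: CrossEntropyLoss takes raw logits and integer labels\n",
    "ce = nn.CrossEntropyLoss()\n",
    "dummy_logits = torch.randn(4, 6)           # 4 samples, 6 classes (raw scores, no softmax)\n",
    "dummy_labels = torch.tensor([0, 2, 5, 1])  # integer class indices, not one-hot vectors\n",
    "dummy_loss = ce(dummy_logits, dummy_labels)\n",
    "print(dummy_loss.item())                   # a non-negative scalar"
   ]
  },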
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.6 Data Transforms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To preprocess our input images, we will use the transforms included with the VGG16 weights."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"pre_trans = weights.transforms()"
]
},
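  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These bundled transforms resize and center-crop any input to the 224 x 224 resolution VGG16 expects, then normalize with the ImageNet statistics. A hypothetical shape check (standalone, assuming the default VGG16 weights and an arbitrary 500 x 400 input):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchvision.models import VGG16_Weights  # assumed to match the weights loaded above\n",
    "\n",
    "# Hypothetical check: the bundled transforms map any image size to (3, 224, 224)\n",
    "check_pre = VGG16_Weights.DEFAULT.transforms()\n",
    "out_shape = check_pre(torch.zeros(3, 500, 400)).shape\n",
    "print(out_shape)"
   ]
  },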
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try to randomly augment the data to improve the dataset. Feel free to look at [notebook 04a](04a_asl_augmentation.ipynb) and [notebook 05b](05b_presidential_doggy_door.ipynb) for augmentation examples. There is also documentation for the [TorchVision Transforms class](https://pytorch.org/vision/stable/transforms.html).\n",
"\n",
"**Hint**: Remember not to make the data augmentation too extreme."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"IMG_WIDTH, IMG_HEIGHT = (224, 224)\n",
"\n",
"random_trans = transforms.Compose([\n",
" transforms.RandomRotation(5),\n",
" transforms.RandomResizedCrop((IMG_WIDTH, IMG_HEIGHT), scale=(.8, 1), ratio=(1, 1)),\n",
" transforms.RandomHorizontalFlip(),\n",
"])"
]
},
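  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to confirm the augmentation is not too extreme is to check that it never changes the tensor shape the network expects. The hypothetical check below rebuilds the same pipeline standalone so it can run on its own:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical check: augmentation should preserve the (3, 224, 224) shape\n",
    "check_aug = transforms.Compose([\n",
    "    transforms.RandomRotation(5),\n",
    "    transforms.RandomResizedCrop((224, 224), scale=(.8, 1), ratio=(1, 1)),\n",
    "    transforms.RandomHorizontalFlip(),\n",
    "])\n",
    "aug_shape = check_aug(torch.zeros(3, 224, 224)).shape\n",
    "print(aug_shape)  # torch.Size([3, 224, 224])"
   ]
  },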
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.7 Load Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it's time to load the train and validation datasets. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"DATA_LABELS = [\"freshapples\", \"freshbanana\", \"freshoranges\", \"rottenapples\", \"rottenbanana\", \"rottenoranges\"] \n",
" \n",
"class MyDataset(Dataset):\n",
" def __init__(self, data_dir):\n",
" self.imgs = []\n",
" self.labels = []\n",
" \n",
" for l_idx, label in enumerate(DATA_LABELS):\n",
" data_paths = glob.glob(data_dir + label + '/*.png', recursive=True)\n",
" for path in data_paths:\n",
" img = tv_io.read_image(path, tv_io.ImageReadMode.RGB)\n",
" self.imgs.append(pre_trans(img).to(device))\n",
" self.labels.append(torch.tensor(l_idx).to(device))\n",
"\n",
"\n",
" def __getitem__(self, idx):\n",
" img = self.imgs[idx]\n",
" label = self.labels[idx]\n",
" return img, label\n",
"\n",
" def __len__(self):\n",
" return len(self.imgs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Select the batch size `n` and set `shuffle` to `True` or `False` depending on whether we are `train`ing or `valid`ating. For a reference, check out [notebook 05b](05b_presidential_doggy_door.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"n = 32\n",
"\n",
"train_path = \"data/fruits/train/\"\n",
"train_data = MyDataset(train_path)\n",
"train_loader = DataLoader(train_data, batch_size=n, shuffle=True)\n",
"train_N = len(train_loader.dataset)\n",
"\n",
"valid_path = \"data/fruits/valid/\"\n",
"valid_data = MyDataset(valid_path)\n",
"valid_loader = DataLoader(valid_data, batch_size=n, shuffle=False)\n",
"valid_N = len(valid_loader.dataset)"
]
},
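  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why the different `shuffle` settings? Reshuffling the training set each epoch breaks up any ordering in the data, while a fixed validation order keeps evaluation reproducible. A toy illustration (hypothetical data, unrelated to the fruit images):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import TensorDataset\n",
    "\n",
    "# Toy illustration: shuffle=False preserves dataset order; shuffle=True reorders each epoch\n",
    "toy_ds = TensorDataset(torch.arange(8))\n",
    "ordered = [int(x) for (x,) in DataLoader(toy_ds, batch_size=1, shuffle=False)]\n",
    "print(ordered)  # [0, 1, 2, 3, 4, 5, 6, 7]"
   ]
  },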
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.8 Train the Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Time to train the model! We've moved the `train` and `validate` functions to our [utils.py](./utils.py) file. Before running the cell below, make sure all of your variables are correctly defined.\n",
"\n",
"It may help to rerun this cell or change the number of `epochs`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 0\n",
"Train - Loss: 12.3180 Accuracy: 0.8832\n",
"Valid - Loss: 1.8627 Accuracy: 0.9544\n",
"Epoch: 1\n",
"Train - Loss: 4.7510 Accuracy: 0.9560\n",
"Valid - Loss: 1.4336 Accuracy: 0.9514\n",
"Epoch: 2\n",
"Train - Loss: 2.0692 Accuracy: 0.9856\n",
"Valid - Loss: 1.0474 Accuracy: 0.9635\n",
"Epoch: 3\n",
"Train - Loss: 3.4884 Accuracy: 0.9679\n",
"Valid - Loss: 1.6070 Accuracy: 0.9666\n",
"Epoch: 4\n",
"Train - Loss: 2.5929 Accuracy: 0.9729\n",
"Valid - Loss: 1.2029 Accuracy: 0.9726\n",
"Epoch: 5\n",
"Train - Loss: 1.3059 Accuracy: 0.9873\n",
"Valid - Loss: 2.0978 Accuracy: 0.9544\n",
"Epoch: 6\n",
"Train - Loss: 1.2224 Accuracy: 0.9848\n",
"Valid - Loss: 1.1851 Accuracy: 0.9666\n",
"Epoch: 7\n",
"Train - Loss: 1.3899 Accuracy: 0.9856\n",
"Valid - Loss: 1.3799 Accuracy: 0.9666\n",
"Epoch: 8\n",
"Train - Loss: 1.1155 Accuracy: 0.9907\n",
"Valid - Loss: 1.9695 Accuracy: 0.9514\n",
"Epoch: 9\n",
"Train - Loss: 1.4266 Accuracy: 0.9873\n",
"Valid - Loss: 2.0177 Accuracy: 0.9635\n"
]
}
],
"source": [
"epochs = 10\n",
"\n",
"for epoch in range(epochs):\n",
" print('Epoch: {}'.format(epoch))\n",
" utils.train(my_model, train_loader, train_N, random_trans, optimizer, loss_function)\n",
" utils.validate(my_model, valid_loader, valid_N, loss_function)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.9 Unfreeze Model for Fine Tuning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "If you have already reached 92% validation accuracy, this next step is optional. If not, we suggest fine-tuning the model with a very low learning rate."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Unfreeze the base model\n",
"vgg_model.requires_grad_(True)\n",
"optimizer = Adam(my_model.parameters(), lr=.0001)"
]
},
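  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`requires_grad_(True)` re-enables gradient tracking for every parameter in the VGG base, so the whole backbone now updates at the small learning rate. A hypothetical toy check of the toggle (on a throwaway layer, not the VGG model):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical toy check: requires_grad_ flips gradient tracking for all parameters\n",
    "toy_layer = nn.Linear(4, 2)                # a throwaway layer with weight and bias\n",
    "toy_layer.requires_grad_(False)\n",
    "frozen_count = sum(p.requires_grad for p in toy_layer.parameters())\n",
    "toy_layer.requires_grad_(True)\n",
    "trainable_count = sum(p.requires_grad for p in toy_layer.parameters())\n",
    "print(frozen_count, trainable_count)       # 0 2"
   ]
  },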
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 0\n",
"Train - Loss: 0.4828 Accuracy: 0.9949\n",
"Valid - Loss: 1.3939 Accuracy: 0.9757\n"
]
}
],
"source": [
"epochs = 1\n",
"\n",
"for epoch in range(epochs):\n",
" print('Epoch: {}'.format(epoch))\n",
" utils.train(my_model, train_loader, train_N, random_trans, optimizer, loss_function)\n",
" utils.validate(my_model, valid_loader, valid_N, loss_function)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.10 Evaluate the Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Hopefully, you now have a model with a validation accuracy of 92% or higher. If not, you may want to go back and either run more epochs of training or adjust your data augmentation.\n",
    "\n",
    "Once you are satisfied with the validation accuracy, evaluate the model by executing the following cell. The `validate` function will report your loss and accuracy. To pass, the model will need to have an accuracy of `92% or higher`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Valid - Loss: 1.3939 Accuracy: 0.9757\n"
]
}
],
"source": [
"utils.validate(my_model, valid_loader, valid_N, loss_function)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.11 Run the Assessment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "To assess your model, run the following two cells.\n",
    "\n",
    "**NOTE:** `run_assessment` assumes your model is named `my_model`. If for any reason you have modified this variable name, please update the name of the argument passed to `run_assessment`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from run_assessment import run_assessment"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluating model to obtain average accuracy...\n",
"\n",
"Accuracy: 0.9757\n",
"\n",
"Accuracy required to pass the assessment is 0.92 or greater.\n",
"Your average accuracy is 0.9757.\n",
"\n",
"Congratulations! You passed the assessment!\n",
"See instructions below to generate a certificate.\n"
]
}
],
"source": [
"run_assessment(my_model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.12 Generate a Certificate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you passed the assessment, please return to the course page (shown below) and click the \"ASSESS TASK\" button, which will generate your certificate for the course."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./images/assess_task.png\" style=\"width: 800px;\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center><a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a></center>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

1506
ds/25-1/5/assessment.ipynb Normal file

File diff suppressed because it is too large Load Diff

50
ds/25-1/5/config.txt Normal file

@@ -0,0 +1,50 @@
model_config {
arch: "vgg"
n_layers: 19
use_batch_norm: True
freeze_blocks: 0
input_image_size: "3,224,224"
}
train_config {
train_dataset_path: "/workspace/tao-experiments/data/train"
val_dataset_path: "/workspace/tao-experiments/data/val"
pretrained_model_path: "/workspace/tao-experiments/classification/pretrained_vgg19/pretrained_classification_vvgg19/vgg_19.hdf5"
optimizer {
sgd {
lr: 0.01
decay: 0.0
momentum: 0.9
nesterov: False
}
}
n_epochs: 5
batch_size_per_gpu: 32
n_workers: 8
enable_random_crop: False
enable_center_crop: False
enable_color_augmentation: False
preprocess_mode: "caffe"
reg_config {
type: "L2"
scope: "Conv2D, Dense"
weight_decay: 0.00005
}
lr_config {
step {
learning_rate: 0.006
step_size: 10
gamma: 0.1
}
}
}
eval_config {
eval_dataset_path: "/workspace/tao-experiments/data/val"
model_path: "/workspace/tao-experiments/classification/vgg19/weights/vgg_005.hdf5"
top_k: 1
batch_size: 32
n_workers: 8
enable_center_crop: False
}

File diff suppressed because one or more lines are too long

Binary file not shown.

After

Width:  |  Height:  |  Size: 82 KiB

BIN
ds/25-1/5/images/check.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 24 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 5.5 KiB

BIN
ds/25-1/5/images/credit.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 792 KiB

BIN
ds/25-1/5/images/dali.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 175 KiB

Some files were not shown because too many files have changed in this diff