Enhancing Data Science Outcomes With Efficient Workflow
04 - NVTabular
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load (ETL) is the process by which data is transformed into a proper structure for querying and analysis. Feature engineering, in turn, derives new features from raw data to better represent the underlying problem to machine learning models.
Table of Contents
In this notebook, we will use NVTabular to perform feature engineering. This notebook covers the following sections:
NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets. It provides a high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS cuDF library. While NVTabular is built upon RAPIDS cuDF, it goes beyond cuDF in that datasets are not limited by GPU memory capacity. The API documentation can be found here.
Core features of NVTabular include:
- Easily process data by leveraging built-in or custom operators specifically designed for machine learning algorithms
- Carry out computations on the GPU with best practices baked into the library, allowing us to realize significant acceleration
- Provide a higher-level API that greatly simplifies code while still delivering the same level of performance
- Work on arbitrarily large datasets when used with Dask
- Minimize the number of passes through the data with lazy execution
In doing so, NVTabular helps data scientists and machine learning engineers to:
- Process datasets that exceed GPU and CPU memory without having to worry about scale
- Focus on what to do with the data and not how to do it by using abstraction at the operation level
- Prepare datasets quickly and easily for experimentation so that more models can be trained
Data science can be an iterative process that requires extensive repeated experimentation. The ability to perform feature engineering and preprocessing quickly translates into faster iteration cycles, which can help us to arrive at an optimal solution.
Multi-GPU Scaling in NVTabular with Dask
NVTabular supports multi-GPU scaling with Dask-CUDA and dask.distributed[doc]. For multi-GPU, NVTabular uses Dask-cuDF for internal data processing. The parallel performance can depend strongly on the size of the partitions, the shuffling procedure used for data output, and the arguments used for transformation operations.
Operators
NVTabular has already implemented several data transformations, called ops[doc]. An op can be applied to a ColumnGroup via the overloaded >> operator, which in turn returns a new ColumnGroup. A ColumnGroup is a list of column names as text.
features = [ column_name_1, column_name_2, ...] >> op1 >> op2 >> ...
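The >> chaining works through Python operator overloading. As a toy sketch (these are hypothetical stand-in classes, not NVTabular's actual implementation), a plain list of column names can start a chain because the op's right-hand "reflected" operator converts it into a ColumnGroup:

```python
class ColumnGroup:
    """Toy stand-in: tracks selected columns and the ops applied so far."""
    def __init__(self, columns, ops=None):
        self.columns = list(columns)
        self.ops = list(ops or [])

    def __rshift__(self, op):
        # `group >> op` returns a NEW ColumnGroup with the op appended
        return ColumnGroup(self.columns, self.ops + [op.name])

class Op:
    """Toy stand-in for an NVTabular op."""
    def __init__(self, name):
        self.name = name

    def __rrshift__(self, columns):
        # called for `[...] >> op` when the left side is a plain list
        return ColumnGroup(columns) >> self

features = ["brand", "price"] >> Op("Categorify") >> Op("LogOp")
print(features.columns, features.ops)
```

Each >> returns a new object, so partial chains can be reused to branch a pipeline.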
Common operators include:
- Categorify - transform categorical features into unique integer values
  - A frequency threshold can be applied to group low-frequency categories together
- TargetEncoding - transform categorical features into the group-specific mean of the target
  - Using kfold=1 and p_smooth=0 is equivalent to disabling the out-of-fold encoding and smoothing logic
- Groupby - transform a feature into the result of one or more groupby aggregations
  - NOTE: Does not move data between partitions, which means data should be shuffled by the groupby_cols first
- JoinGroupby - add new features based on desired group-specific statistics of requested continuous features
  - Supported statistics include count, sum, mean, std, and var
- LogOp - log transform the continuous features
- FillMissing - replace missing values with a constant pre-defined value
- Bucketize - transform continuous features into categorical features with bins based on provided bin boundaries
- LambdaOp - enables custom row-wise DataFrame manipulations with NVTabular
- Rename - rename columns
- Normalize - standardize continuous features using the mean and standard deviation
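To make the Categorify bullets concrete, here is a CPU-only sketch of what the integer encoding and frequency threshold do conceptually (illustration only; the real op runs on the GPU via cuDF, and its exact ID assignment may differ):

```python
from collections import Counter

def categorify(values, freq_threshold=0):
    # Assign each sufficiently frequent category a unique positive ID;
    # rare (below-threshold) and unseen categories collapse to ID 0.
    counts = Counter(values)
    keep = sorted(c for c in counts if counts[c] >= freq_threshold)
    mapping = {cat: i + 1 for i, cat in enumerate(keep)}
    return [mapping.get(v, 0) for v in values]

encoded = categorify(["a", "b", "a", "c", "a", "b"], freq_threshold=2)
print(encoded)  # -> [1, 2, 1, 0, 1, 2]
```

Here "c" appears only once, below the threshold of 2, so it is bucketed to 0 along with any category unseen at fit time.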
# import dependencies
import nvtabular as nvt
from nvtabular.ops import *
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
import cudf
import gc
# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
# get the machine's external IP address
from requests import get
ip=get('https://api.ipify.org').content.decode('utf8')
print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
# read data as Dask DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')
# preview DataFrame
ddf.head()
Feature Engineering and Preprocessing with NVTabular
The typical steps for developing with NVTabular include:
- Design and Define Operations in the Pipeline
- Create Workflow
- Create Dataset
- Apply Workflow to Dataset
Defining the Workflow
We start by creating the nvtabular.workflow.workflow.Workflow[doc], which defines the operations and preprocessing steps that we would like to perform on the data.
We will perform the following feature engineering and preprocessing steps:
- Categorify the categorical features
- Log transform and normalize continuous features
- Calculate group-specific sum, count, and mean of the target for categorical features
- Log transform price
- Calculate product_id-specific relative price to the average price
- Target encode all categorical features
One of the key advantages of using NVTabular is the high-level abstraction we can use, which simplifies code significantly.
# assign features and label
cat_cols=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3']
cont_cols=['price', 'ts_hour', 'ts_minute', 'ts_weekday']
label='target'
# categorify categorical features
cat_features=cat_cols >> Categorify()
Exercise #1 - Using NVTabular Operators
We can use the >> operator to specify how columns will be transformed. We need to transform the price feature by performing the log transformation and normalization.
Instructions:
- Review the documentation for the LogOp()[doc] and Normalize()[doc] operators.
- Modify the <FIXME>s only and execute the cell below to create a workflow.
# log transform
price = (
['price']
>> FillMissing(0)
>> <<<<FIXME>>>>
>> <<<<FIXME>>>>
>> LambdaOp(lambda col: col.astype("float32"), dtype='float32')
)
price = (
['price']
>> FillMissing(0)
>> LogOp()
>> Normalize()
>> LambdaOp(lambda col: col.astype("float32"), dtype='float32')
)
Click ... to show solution.
There are several ways to create a feature for relative price to average. We will do so with the below steps:
- Calculate the average price per group
- Define a function to calculate the percentage difference
- Apply the user-defined function to price and the average price
# relative price to the average price for the product_id
# create product_id specific average price feature
avg_price_product = ['product_id'] >> JoinGroupby(cont_cols=['price'], stats=["mean"])
# create user defined function to calculate percent difference
def relative_price_to_avg(col, gdf):
# introduce tiny number in case of 0
epsilon = 1e-5
col = ((gdf['price'] - col) / (col + epsilon)) * (col > 0).astype(int)
return col
# create product_id specific relative price to average
relative_price_to_avg_product = (
avg_price_product
>> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64')
>> Rename(name='relative_price_product')
)
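As a quick sanity check of the arithmetic inside relative_price_to_avg, the same percent-difference formula can be evaluated on scalar values in plain Python (a hypothetical helper, for illustration only):

```python
def relative_price(price, avg_price, epsilon=1e-5):
    # Same formula as relative_price_to_avg above, on scalars:
    # (price - avg) / (avg + epsilon), zeroed when the average is not positive
    if avg_price <= 0:
        return 0.0
    return (price - avg_price) / (avg_price + epsilon)

print(round(relative_price(120.0, 100.0), 4))  # 0.2 -> 20% above average
print(round(relative_price(80.0, 100.0), 4))   # -0.2 -> 20% below average
print(relative_price(50.0, 0.0))               # 0.0 -> guarded zero-average case
```

Positive values mean the item is priced above its group average, negative below; the epsilon and the zero guard keep the division safe when the group mean is missing or zero.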
# create category_code specific average price feature
avg_price_category = ['category_code'] >> JoinGroupby(cont_cols=['price'], stats=["mean"])
# create category_code specific relative price to average
relative_price_to_avg_category = (
avg_price_category
>> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64')
>> Rename(name='relative_price_category')
)
# calculate group-specific statistics for categorical features
ce_features=cat_cols >> JoinGroupby(stats=['sum', 'count'], cont_cols=label)
# target encode
te_features=cat_cols >> TargetEncoding(label)
We also add the target, i.e. label, to the set of returned columns. We can visualize our data processing pipeline with graphviz by calling .graph. The data processing pipeline is a DAG (directed acyclic graph).
features=cat_features+cont_cols+ce_features+te_features+price+relative_price_to_avg_product+relative_price_to_avg_category+[label]
features.graph
We are now ready to construct a Workflow that will run the operations we defined above. To enable distributed parallelism, the NVTabular Workflow must be initialized with a dask.distributed.Client object. Since NVTabular already uses Dask-CuDF for internal data processing, there are no other requirements for multi-GPU scaling.
# define our NVTabular Workflow with client to enable multi-GPU execution
# for multi-GPU execution, the only requirement is that we specify a client when
# initializing the NVTabular Workflow.
workflow=nvt.Workflow(features, client=client)
Defining the Dataset
All external data need to be converted to the universal nvtabular.io.dataset.Dataset[doc] type. The main purpose of this class is to abstract away the raw format of the data, and to allow other NVTabular classes to reliably materialize a dask.dataframe.DataFrame collection and/or collection-based iterator on demand.
The collection-based iterator is important when working with large datasets that do not fit into GPU memory since operations in the Workflow often require statistics calculated across the entire dataset. For example, Normalize requires measurements of the dataset mean and standard deviation, and Categorify requires an accounting of all the unique categories a particular feature can manifest. The Dataset object partitions the dataset into chunks that will fit into GPU memory to compute statistics in an online fashion.
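The idea of computing global statistics chunk by chunk can be sketched with a plain-Python streaming mean/std (a conceptual illustration only; NVTabular performs the equivalent on GPU-sized partitions):

```python
import math

def chunked_mean_std(chunks):
    # Accumulate count, sum, and sum of squares one chunk at a time --
    # no chunk ever needs the full dataset in memory.
    n = total = total_sq = 0.0
    for chunk in chunks:
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(x * x for x in chunk)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, math.sqrt(max(variance, 0.0))

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
in_chunks = chunked_mean_std([data[:2], data[2:4], data[4:]])
all_at_once = chunked_mean_std([data])
print(in_chunks, all_at_once)
```

The statistics come out the same whether the data is processed in three chunks or all at once, which is exactly what lets a Workflow fit ops like Normalize on datasets larger than GPU memory.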
A Dataset can be initialized from a variety of different raw-data formats:
- With a parquet-dataset directory
- With a list of files
- In addition to handling data stored on disk, a
Datasetcan also be initialized from an existing cuDF DataFrame, or from adask.dataframe.DataFrame
The data we pass to the Dataset constructor is usually the result of a query from some source, for example a data warehouse or data lake. The output is usually in Parquet, ORC, or CSV format. In our case, we have the data in parquet format saved to disk from previous steps. When initializing a Dataset from a directory path, the engine argument should be used to specify either parquet or csv format. When initializing a Dataset from a list of files, the engine can be inferred.
Memory is an important consideration. The workflow will process data in chunks, therefore increasing the number of partitions will limit the memory footprint. Since we will initialize the Dataset with a DataFrame type (cudf.DataFrame or dask.dataframe.DataFrame), most of the parameters will be ignored and the partitions will be preserved. Otherwise, the data would be converted to a dask.dataframe.DataFrame with a maximum partition size of roughly 12.5% of the total memory on a single device by default. We can use the npartitions parameter for specifying into how many chunks we would like the data to be split. The partition size can be changed to a different fraction of total memory on a single device with the part_mem_fraction argument. Alternatively, a specific byte size can be specified with the part_size argument.
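To see how these knobs interact, here is some back-of-the-envelope arithmetic with hypothetical sizes (the 16 GiB device and 100 GiB dataset are made-up numbers for illustration):

```python
# Illustrative partition-sizing arithmetic only -- not an NVTabular API call.
device_mem = 16 * 1024**3        # assume one GPU with 16 GiB of memory
dataset_size = 100 * 1024**3     # assume a 100 GiB dataset on disk
part_mem_fraction = 0.125        # the ~12.5% default fraction mentioned above

part_size = int(device_mem * part_mem_fraction)  # bytes per partition
npartitions = -(-dataset_size // part_size)      # ceiling division
print(part_size // 1024**2, npartitions)         # 2048 (MiB), 50 partitions
```

Halving part_mem_fraction (or passing a smaller part_size) doubles the partition count and halves the per-chunk memory footprint, trading more passes for a smaller peak footprint.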
The NVTabular dataset should be created from Parquet files in order to get the best possible performance, preferably with a row group size of around 128MB. While NVTabular also supports reading from CSV files, reading CSV can be over twice as slow as reading from Parquet. It's recommended to convert a CSV dataset into Parquet format for use with NVTabular.
# create dataset
dataset=nvt.Dataset(ddf)
print(f'The Dataset is split into {dataset.npartitions} partitions')
Fit, Transform, and Persist
NVTabular follows a familiar API for pipeline operations. We can .fit() the workflow to a training set to calculate the statistics for this workflow. Afterwards, we can use it to .transform() the training set and validation dataset. We will persist the transformed data to disk in parquet format for fast reading at train time. Importantly, we can use the .save()[doc] method so that our Workflow can be reused during model inference.
Since the Dataset API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex pre-processing operations, that are not yet supported in NVTabular, can still be accomplished with the dask_cudf.DataFrame API after the Dataset is converted with .to_ddf.
# fit and transform dataset
workflow.fit(dataset)
output_dataset=workflow.transform(dataset)
# save the workflow
workflow.save('nvt_workflow')
!ls -l nvt_workflow
# remove existing parquet directory
!rm -R processed_parquet/*
# save output to parquet directory
output_path='processed_parquet'
output_dataset.to_parquet(output_path=output_path)
If needed, we can convert the Dataset object to dask.dataframe.DataFrame to inspect the results.
# convert to DataFrame and preview
output_dataset.to_ddf().head()
Exercise #2 - Load Saved Workflow
We can load a saved workflow, which will contain the graph, schema, and statistics. This is useful if the workflow should be applied to future datasets.
Instructions:
- Review the documentation for the .load() class method.
- Modify the <FIXME> only and execute the cell below to create a workflow.
- Execute the cell below to apply the graph of operators to transform the data.
# load workflow
loaded_workflow=<<<<FIXME>>>>
loaded_workflow=nvt.Workflow.load('nvt_workflow')
Click ... to show solution.
# create dataset from parquet directory
dataset=nvt.Dataset('clean_parquet', engine='parquet')
# transform dataset
loaded_workflow.transform(dataset).to_ddf().head()
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)


