<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>
# Enhancing Data Science Outcomes With Efficient Workflow #
## 03 - Feature Engineering for Categorical Features ##
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process where data is transformed into a proper structure for the purposes of querying and analysis. Feature engineering, on the other hand, involves extracting and transforming raw data into features that are more useful for modeling.
<p><img src='images/pipeline_overview_1.png' width=1080></p>
**Table of Contents**
<br>
In this notebook, we will load data from Parquet file format into a Dask DataFrame and create additional features for machine learning model training. This notebook covers the below sections:
1. [Quick Recap](#s3-1)
2. [Feature Engineering](#s3-2)
* [User Defined Functions](#s3-2.1)
3. [Feature Engineering Techniques](#s3-3)
* [One-Hot Encoding](#s3-3.1)
* [Combining Categories](#s3-3.2)
* [Categorify / Label Encoding](#s3-3.3)
* [Count Encoding](#s3-3.4)
* [Target Encoding](#s3-3.5)
* [Embeddings](#s3-3.6)
4. [Summary](#s3-4)
<a name='s3-1'></a>
## Quick Recap ##
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
* Reading data without a schema or specifying `dtype`
* Having too many partitions due to small `chunksize`
* Memory spilling due to partitions being too large
* Performing groupby operations on too many groups scattered across multiple partitions
Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
<a name='s3-2'></a>
## Feature Engineering ##
Feature engineering converts raw data into numeric vectors for model consumption. For categorical data, this is generally referred to as encoding, which transforms categories into numeric values. There are three primary encoding methods:
* Label encoding, when the categories have no ordered relationship
* Ordinal encoding, when the categories have an ordered relationship
* One-hot encoding, when the categorical data is binary in nature
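The three encodings above can be contrasted with a minimal sketch. It uses plain pandas for illustration (cuDF mirrors this API on the GPU), and the T-shirt-size column is an assumed toy example, not the lab dataset:

```python
import pandas as pd

# a toy column of T-shirt sizes (assumed example, not from the lab dataset)
s = pd.Series(['S', 'M', 'L', 'M', 'S'])

# label encoding: an arbitrary integer per category, no order implied
label = s.astype('category').cat.codes

# ordinal encoding: integers that preserve the known order S < M < L
order = {'S': 0, 'M': 1, 'L': 2}
ordinal = s.map(order)

# one-hot encoding: one binary column per category
one_hot = pd.get_dummies(s, prefix='size')

print(label.tolist())    # [2, 1, 0, 1, 2]: codes follow alphabetical category order, not size
print(ordinal.tolist())  # [0, 1, 2, 1, 0]: integers respect the size order
print(list(one_hot.columns))
```

Note how the label codes carry no meaning beyond identity, while the ordinal codes encode the size relationship; this is why the choice between them depends on whether an ordered relationship exists.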
Additionally, we can create numerous sets of new features from existing ones, which are then tested for effectiveness during model training. Feature engineering is an important step when working with tabular data as it can improve a machine learning model's ability to learn faster and extract patterns. Feature engineering can be a time-consuming process, particularly when the dataset is large and each processing cycle takes a long time. The ability to perform feature engineering efficiently enables more exploration of useful features.
<a name='s3-2.1'></a>
### User-Defined Functions ###
Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame style operators. While out of the box functions are flexible and useful, it is sometimes necessary to write custom code, or **user-defined functions** (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.
Users can execute UDFs on `cudf.Series` with:
* `cudf.Series.apply()` or
* Numba's `forall` syntax [(link)](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#lower-level-control-with-custom-numba-kernels)
Users can execute UDFs on `cudf.DataFrame` with:
* `cudf.DataFrame.apply()`
* `cudf.DataFrame.apply_rows()`
* `cudf.DataFrame.apply_chunks()`
* `cudf.rolling().apply()`
* `cudf.groupby().apply_grouped()`
Note that applying UDFs directly with Dask-cuDF is not yet implemented. For now, users can use `map_partitions` to apply a function to each partition of the distributed dataframe.
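Conceptually, `map_partitions` runs an ordinary per-DataFrame function on each partition independently and concatenates the results. A CPU-only sketch of that idea with plain pandas (the `price` column and the two-partition split are assumptions for illustration, not the lab schema):

```python
import pandas as pd

def add_price_bucket(pdf):
    # an ordinary single-DataFrame UDF (hypothetical 'price' column)
    pdf = pdf.copy()
    pdf['price_bucket'] = (pdf['price'] // 10).astype('int64')
    return pdf

# simulate two partitions of a distributed dataframe
partitions = [
    pd.DataFrame({'price': [3.0, 12.0]}),
    pd.DataFrame({'price': [25.0, 7.0]}),
]

# what ddf.map_partitions(add_price_bucket) does conceptually:
# run the UDF on each partition, then stitch the results back together
result = pd.concat([add_price_bucket(p) for p in partitions], ignore_index=True)
print(result['price_bucket'].tolist())  # [0, 1, 2, 0]
```

Because each partition is processed independently, the UDF must not depend on rows living in other partitions; per-row and per-column transformations like this one are the safe cases.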
Currently, the use of string data within UDFs is provided through the `strings_udf` library. This is powerful for use cases such as string splitting, regular expressions, and tokenization. The topic of handling string data is discussed extensively [here](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#string-data). In addition to `Series.str`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/string_handling.html), cuDF also supports `Series.list`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/list_handling.html) for applying custom transformations.
<p><img src='images/tip.png' width=720></p>
Below are some tips:
* `apply` works by applying the provided function to each group sequentially, and concatenating the results together. This can be very slow, especially for a large number of small groups. For a small number of large groups, it can give acceptable performance.
* With cuDF, we can also combine NumPy or CuPy methods into the procedure.
* Related to `apply`, iterating over a cuDF Series, DataFrame or Index is not supported, because iterating over data that resides on the GPU yields extremely poor performance: GPUs are optimized for highly parallel operations rather than sequential ones. In the vast majority of cases, iteration can be avoided by using an existing function or method to accomplish the same task. If iteration is truly necessary, it is recommended that users copy the data from GPU to host with `.to_arrow()` or `.to_pandas()`, iterate there, then copy the result back to the GPU with `.from_arrow()` or `.from_pandas()`.
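To make the iteration point concrete, here is a toy sketch contrasting a slow row-by-row loop with the equivalent vectorized expression. It uses plain pandas (cuDF shares this API), and the `price`/`qty` columns are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})

# slow pattern (and unsupported on cuDF): iterate row by row
totals = []
for _, row in df.iterrows():
    totals.append(row['price'] * row['qty'])

# fast pattern: one vectorized column expression, no iteration
df['total'] = df['price'] * df['qty']

# both produce the same values, but the vectorized form runs as one
# parallel operation instead of a sequential Python loop
print(df['total'].tolist())  # [10.0, 40.0, 90.0]
```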
<a name='s3-3'></a>
## Feature Engineering Techniques ##
Below is a list of common feature engineering techniques.
<img src='images/feature_engineering_methods.png' width=720>
```python
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask.dataframe as dd
import dask_cudf
import gc
# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
```
```python
# get the machine's external IP address
from requests import get
ip=get('https://api.ipify.org').content.decode('utf8')
print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
```
```python
# read data as Dask-cuDF DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')
ddf=ddf.categorize(columns=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3'])
```
```python
ddf=ddf.persist()
```
<p><img src='images/check.png' width=720></p>
Did you get an error message? This notebook depends on the processed source file from previous notebooks.
<a name='s3-3.1'></a>
### One-Hot Encoding ###
**One-Hot Encoding**, also known as dummy encoding, creates several binary columns that indicate whether a row belongs to a specific category. It works well for categorical features that are not ordinal and have low cardinality. With one-hot encoding, each row gets a 1 in the column for its category and 0 everywhere else.
For example, we can use `cudf.get_dummies()` to perform one-hot encoding on one of the categorical columns.
<img src='images/tip.png' width=720>
One-hot encoding doesn't work well for categorical features when the cardinality is large, as it results in high dimensionality. This is particularly an issue for neural network optimizers. Furthermore, data should not be saved in one-hot encoding format. If needed, it should only be used temporarily for specific tasks.
```python
def one_hot(df, cat):
    temp = dd.get_dummies(df[cat])
    return dask_cudf.concat([df, temp], axis=1)
```
```python
one_hot(ddf, 'cat_0').head()
```
<a name='s3-3.2'></a>
### Combining Categories ###
**Combining categories** creates new features that better identify patterns when the categories independently don't provide enough information to predict the target. It's also known as _cross column_ or _cross product_. It's a common data preprocessing step for machine learning since it reduces the cost of model training, and it's also common in exploratory data analysis. Properly combined categorical features encourage more effective splits in tree-based methods than considering each feature independently.
For example, while `ts_weekday` and `ts_hour` may independently have no significant patterns, we might observe more obvious patterns if the two features are combined into `ts_weekday_hour`.
<img src='images/tip.png' width=720>
When deciding which categorical features should be combined, it's important to balance the number of categories used, the number of observations in each combined category, and information gain. Combining features reduces the number of observations per resulting category, which can lead to overfitting. Typically, combining low-cardinality categories is recommended; otherwise, experimentation is needed to discover the best combinations.
```python
def combine_cats(df, left, right):
    df['-'.join([left, right])] = df[left].astype('str').str.cat(df[right].astype('str'))
    return df
```
```python
combine_cats(ddf, 'ts_weekday', 'ts_hour').head()
```
<a name='s3-3.3'></a>
### Categorify and Grouping ###
**Categorify**, also known as *Label Encoding*, converts features into continuous integers. Typically, it converts the values into monotonically increasing positive integers from 0 to *C*, the cardinality. It enables numerical computations and can also reduce memory utilization if the original feature contains string values. Categorify is a necessary data preprocessing step for using categorical features in deep learning models with embedding layers.
Categorifying works well when the feature is ordinal, and is sometimes necessary when the cardinality is large. When categorifying a feature, we can apply a frequency threshold: categories that occur more often than the threshold are encoded as continuous integer values, while all infrequent categories are mapped to the same `other` index (with a separate index reserved for unknown categories). Grouping low-frequency categories this way keeps the model from overfitting to sparse signals.
```python
def categorify(df, cat, freq_threshold):
    freq = df[cat].value_counts()
    freq = freq.reset_index()
    freq.columns = [cat, 'count']
    # reset index on the frequency dataframe for a new sequential index
    freq = freq.reset_index()
    freq.columns = [cat+'_Categorify', cat, 'count']
    # apply the frequency threshold to group low-frequency categories together
    freq_filtered = freq[freq['count'] > freq_threshold]
    # add 2 to the new index as we want to use index 0 for others and 1 for unknown
    freq_filtered[cat+'_Categorify'] = freq_filtered[cat+'_Categorify'] + 2
    freq_filtered = freq_filtered.drop(columns=['count'])
    # merge original dataframe with newly created dataframe to obtain the categorified value
    df = df.merge(freq_filtered, how='left', on=cat)
    # fill null values with 0 to represent low-frequency categories grouped as other
    df[cat+'_Categorify'] = df[cat+'_Categorify'].fillna(0)
    return df
```
```python
categorify(ddf, 'cat_0', 10).head()
```
<a name='s3-3.4'></a>
### Count Encoding ###
*Count Encoding* represents a feature based on its frequency of occurrence, which can be interpreted as the popularity of a category.
For example, we can count the frequency of `user_id` with `cudf.Series.value_counts()`. This creates a feature that lets a machine learning model treat users with similar activity levels collectively and learn their shared behavior patterns.
```python
def count_encoding(df, cat):
    count_df = df[cat].value_counts()
    count_df = count_df.reset_index()
    count_df.columns = [cat, cat+'_CE']
    df = df.merge(count_df, on=cat)
    return df
```
```python
count_encoding(ddf, 'user_id').head()
```
<a name='s3-3.5'></a>
### Target Encoding ###
**Target Encoding** represents a categorical feature based on its effect on the target variable. One common technique is to replace values with the probability of the target given a category. Target encoding creates a new feature, which can be used by the model for training. The advantage of target encoding is that it processes the categorical features and makes them more easily accessible to the model during training and validation.
Mathematically, target encoding on a binary target can be:
$p(t = 1 \mid x = c_i)$
For a binary classifier, we can calculate the probability when the target is `true` or `1` by taking the mean for each category group. This is also known as *Mean Encoding*.
In other words, it calculates statistics, such as the arithmetic mean, from a target variable grouped by the unique values of one or more categorical features.
<img src='images/tip.png' width=720>
*Leakage*, also known as data leakage or target leakage, occurs when training a model with information that would not be available at the time of prediction. This inflates the model's performance score and overestimates its real-world utility. For example, including `temperature_celsius` as a feature when training a model to predict `temperature_fahrenheit`.
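One common way to damp leakage and overfitting on rare categories is to smooth each category's mean toward the global mean, so that categories with few observations contribute little category-specific signal. A minimal sketch with plain pandas (the smoothing weight `w` and the toy frame are assumptions for illustration, not part of this lab's pipeline; out-of-fold encoding would be a more thorough remedy):

```python
import pandas as pd

# a toy frame with a binary target (assumed columns, not the lab dataset)
df = pd.DataFrame({'brand': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'target': [1, 1, 0, 0, 0, 1]})

def smoothed_target_encoding(df, cat, target='target', w=5):
    # blend each category's mean with the global mean; categories with
    # few observations are pulled toward the global mean, which damps
    # overfitting to rare categories
    global_mean = df[target].mean()
    agg = df.groupby(cat)[target].agg(['mean', 'count'])
    smooth = (agg['count'] * agg['mean'] + w * global_mean) / (agg['count'] + w)
    return df[cat].map(smooth)

df['brand_TE'] = smoothed_target_encoding(df, 'brand')
print(df['brand_TE'].tolist())
```

Brand `c` appears only once with target 1, yet its encoded value stays close to the global mean of 0.5 instead of jumping to 1.0, which is the intended damping effect.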
```python
def target_encoding(df, cat):
    te_df = df.groupby(cat)['target'].mean().reset_index()
    te_df.columns = [cat, cat+'_TE']
    df = df.merge(te_df, on=cat)
    return df
```
```python
target_encoding(ddf, 'brand').head()
```
<a name='s3-3.6'></a>
### Embeddings ###
Deep learning models often apply **Embedding Layers** to categorical features. Over the past few years, this has become an increasingly popular technique for encoding categorical features. Since the embeddings need to be trained through a neural network, we will cover this in the next lab.
```python
ddf=one_hot(ddf, 'cat_0')
ddf=combine_cats(ddf, 'ts_weekday', 'ts_hour')
ddf=categorify(ddf, 'product_id', 100)
ddf=count_encoding(ddf, 'user_id')
ddf=count_encoding(ddf, 'product_id')
ddf=target_encoding(ddf, 'brand')
ddf=target_encoding(ddf, 'product_id')
ddf.head()
```
```python
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
```
**Well Done!** Let's move to the [next notebook](1_04_nvtabular_and_mgpu.ipynb).
<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>