Enhancing Data Science Outcomes With Efficient Workflow
03 - Feature Engineering for Categorical Features
In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or ETL, is the process of transforming data into a proper structure for querying and analysis. Feature engineering, on the other hand, involves creating new features from raw data through extraction and transformation.
Table of Contents
In this notebook, we will load data from Parquet file format into a Dask DataFrame and create additional features for machine learning model training. This notebook covers the below sections:
- Quick Recap
- Feature Engineering
- User-Defined Functions
- Feature Engineering Techniques
  - One-Hot Encoding
  - Combining Categories
  - Categorify and Grouping
  - Count Encoding
  - Target Encoding
  - Embeddings
Quick Recap
So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:
- Reading data without specifying a schema or dtype
- Having too many partitions due to a small chunksize
- Memory spilling due to partitions being too large
- Performing groupby operations on too many groups scattered across multiple partitions
Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
Feature Engineering
Feature engineering converts raw data into numeric vectors for model consumption. For categorical data, this is generally referred to as encoding, which transforms category values into numeric ones. When encoding categorical values, there are three primary methods:
- Label encoding, when the categories have no ordered relationship
- Ordinal encoding, when the categories have an ordered relationship
- One-hot encoding, when the categorical data is binary in nature (each row either belongs to a category or it doesn't)
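As a quick illustration of the three methods, here is a minimal sketch using pandas, whose API cuDF mirrors (the `size`/`color` columns and their values are invented for this example):

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "L", "M", "S"],
                   "color": ["red", "blue", "red", "green"]})

# label encoding: an arbitrary integer per category, no order implied
df["color_label"] = df["color"].astype("category").cat.codes

# ordinal encoding: integers that respect a known order
size_order = {"S": 0, "M": 1, "L": 2}
df["size_ordinal"] = df["size"].map(size_order)

# one-hot encoding: one binary column per category value
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, one_hot], axis=1)
```

With cuDF, the same calls exist as `cudf.Series.map`, `cudf.get_dummies`, and `.astype('category')`.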
Additionally, we can create numerous sets of new features from existing ones, which are then tested for effectiveness during model training. Feature engineering is an important step when working with tabular data, as it can improve a machine learning model's ability to learn faster and extract patterns. It can also be a time-consuming process, particularly when the dataset is large and each processing cycle takes a long time. The ability to perform feature engineering efficiently enables more exploration of useful features.
User-Defined Functions
Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame-style operators. While out-of-the-box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.
Users can execute UDFs on cudf.Series with:
- cudf.Series.apply()
- Numba's forall syntax
Users can execute UDFs on cudf.DataFrame with:
- cudf.DataFrame.apply()
- cudf.DataFrame.apply_rows()
- cudf.DataFrame.apply_chunks()
- cudf.rolling().apply()
- cudf.groupby().apply_grouped()
Note that applying UDFs directly with Dask-cuDF is not yet implemented. For now, users can use map_partitions to apply a function to each partition of the distributed dataframe.
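As a sketch of the map_partitions pattern: write the function against a single partition (a regular pandas or cuDF DataFrame), then hand it to map_partitions. The `price` column is a made-up example, not from this dataset:

```python
import numpy as np
import pandas as pd

def add_log_price(pdf):
    # runs on one partition at a time; pdf is an ordinary (cu)DataFrame
    pdf = pdf.copy()
    pdf["price_log"] = np.log1p(pdf["price"])
    return pdf

# with a distributed dataframe, the same function is applied per partition:
# ddf = ddf.map_partitions(add_log_price)

# locally, it behaves like any DataFrame function:
out = add_log_price(pd.DataFrame({"price": [0.0, 9.0]}))
```

Because Dask only sees the function as a black box, it helps to also pass `meta` (a description of the output columns and dtypes) so the output schema doesn't have to be inferred.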
Currently, the use of string data within UDFs is provided through the string_udf library. This is powerful for use cases such as string splitting, regular expressions, and tokenization. The topic of handling string data is discussed extensively in the cuDF documentation. In addition to Series.str, cuDF also supports Series.list for applying custom transformations.
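A small sketch of the kind of string handling Series.str supports, shown with pandas, whose string accessor cuDF mirrors (the sample category strings are invented):

```python
import pandas as pd

s = pd.Series(["electronics.smartphone", "apparel.shoes", "electronics.audio"])

# split each string on '.' into a list, then pick out elements by position
parts = s.str.split(".")
cat_0 = parts.str.get(0)   # first token of each string
cat_1 = parts.str.get(1)   # second token of each string

# regular-expression matching is also available through .str
is_elec = s.str.contains(r"^electronics")
```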
Below are some tips:
- apply works by applying the provided function to each group sequentially and concatenating the results together. This can be very slow, especially for a large number of small groups; for a small number of large groups, it can give acceptable performance.
- With cuDF, we can also combine NumPy or CuPy methods into the procedure.
- Related to apply, iterating over a cuDF Series, DataFrame, or Index is not supported. This is because iterating over data that resides on the GPU will yield extremely poor performance, as GPUs are optimized for highly parallel operations rather than sequential ones. In the vast majority of cases, it is possible to avoid iteration and use an existing function or method to accomplish the same task. If iteration is truly required, copy the data from GPU to host with .to_arrow() or .to_pandas(), then copy the result back to the GPU using .from_arrow() or .from_pandas().
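As a sketch of a scalar UDF of the kind cudf.Series.apply accepts (cuDF compiles such element-wise functions with Numba; illustrated here with the equivalent pandas call, and an invented clamp-and-scale rule):

```python
import pandas as pd

def clip_and_scale(x):
    # element-wise UDF: clamp negatives to zero, then scale down
    if x < 0:
        return 0.0
    return x * 0.1

s = pd.Series([-5.0, 10.0, 25.0])
# with cuDF this would be cudf.Series([...]).apply(clip_and_scale)
result = s.apply(clip_and_scale)
```

Note that cuDF restricts such UDFs to numeric operations it can compile for the GPU, so arbitrary Python objects inside the function body will not work.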
Feature Engineering Techniques
Below is a list of common feature engineering techniques.
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask.dataframe as dd
import dask_cudf
import gc
# instantiate a Client
cluster=LocalCUDACluster()
client=Client(cluster)
# get the machine's external IP address
from requests import get
ip=get('https://api.ipify.org').content.decode('utf8')
print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')
print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')
# read data as Dask-cuDF DataFrame
ddf=dask_cudf.read_parquet('clean_parquet')
ddf=ddf.categorize(columns=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3'])
ddf=ddf.persist()
One-Hot Encoding
One-Hot Encoding, also known as dummy encoding, creates several binary columns to indicate whether a row belongs to a specific category. It works well for categorical features that are not ordinal and have low cardinality. With one-hot encoding, each row gets a 1 in the column matching its category and 0 everywhere else.
For example, we can use cudf.get_dummies() to perform one-hot encoding on all or one of the categorical columns.
One-hot encoding doesn't work well for categorical features when the cardinality is large, as it results in high dimensionality. This is particularly an issue for neural network optimizers. Furthermore, data should not be saved in one-hot encoded format; if needed, it should only be used temporarily for specific tasks.
def one_hot(df, cat):
    temp = dd.get_dummies(df[cat])
    return dask_cudf.concat([df, temp], axis=1)
one_hot(ddf, 'cat_0').head()
Combining Categories
Combining categories creates new features that better identify patterns when the categories independently don't provide enough information to predict the target. It's also known as cross column or cross product. It's a common data preprocessing step for machine learning, since it can reduce the cost of model training, and it's also common in exploratory data analysis. Properly combined categorical features encourage more effective splits in tree-based methods than considering each feature independently.
For example, while ts_weekday and ts_hour may independently have no significant patterns, we might observe more obvious patterns if the two features are combined into ts_weekday_hour.
When deciding which categorical features should be combined, it's important to balance the number of categories used, the number of observations in each combined category, and the information gain. Combining features reduces the number of observations per resulting category, which can lead to overfitting. Typically, combining low-cardinality categories is recommended; otherwise, experimentation is needed to discover the best combinations.
def combine_cats(df, left, right):
    df['-'.join([left, right])] = df[left].astype('str').str.cat(df[right].astype('str'))
    return df
combine_cats(ddf, 'ts_weekday', 'ts_hour').head()
Categorify and Grouping
Categorify, also known as Label Encoding, converts features into continuous integers. Typically, it converts the values into monotonically increasing positive integers from 0 to C, the cardinality. It enables numerical computations and can also reduce memory utilization if the original feature contains string values. Categorify is a necessary preprocessing step for using categorical features in deep learning models with embedding layers.
Categorifying works well when the feature is ordinal, and is sometimes necessary when the cardinality is large. Categories with low frequency can be grouped together to prevent the model from overfitting on sparse signals. When categorifying a feature, we can apply a threshold to group all categories with a lower frequency count into an "other" category.
The function below encodes categorical values into continuous integers when the category occurs more often than the specified frequency threshold. All infrequent categories are mapped to the same "other" index, with a separate index reserved for unknown values, keeping the model from overfitting to sparse signals.
def categorify(df, cat, freq_threshold):
    # count the frequency of each category
    freq = df[cat].value_counts()
    freq = freq.reset_index()
    freq.columns = [cat, 'count']
    # reset index on the frequency dataframe for a new sequential index
    freq = freq.reset_index()
    freq.columns = [cat + '_Categorify', cat, 'count']
    # apply the frequency threshold to group low-frequency categories together
    freq_filtered = freq[freq['count'] > freq_threshold]
    # add 2 to the new index as we want to use index 0 for other and 1 for unknown
    freq_filtered[cat + '_Categorify'] = freq_filtered[cat + '_Categorify'] + 2
    freq_filtered = freq_filtered.drop(columns=['count'])
    # merge original dataframe with the frequency dataframe to obtain the categorified values
    df = df.merge(freq_filtered, how='left', on=cat)
    # fill null values with 0 to represent low-frequency categories grouped as other
    df[cat + '_Categorify'] = df[cat + '_Categorify'].fillna(0)
    return df
categorify(ddf, 'cat_0', 10).head()
Count Encoding
Count Encoding represents a feature based on its frequency. This can be interpreted as the popularity of a category.
For example, we can count the frequency of user_id with cudf.Series.value_counts(). This creates a feature that can help a machine learning model learn the behavior of low-frequency users as a group.
def count_encoding(df, cat):
    count_df = df[cat].value_counts()
    count_df = count_df.reset_index()
    count_df.columns = [cat, cat + '_CE']
    df = df.merge(count_df, on=cat)
    return df
count_encoding(ddf, 'user_id').head()
Target Encoding
Target Encoding represents a categorical feature based on its effect on the target variable. One common technique is to replace values with the probability of the target given a category. Target encoding creates a new feature, which can be used by the model for training. The advantage of target encoding is that it processes the categorical features and makes them more easily accessible to the model during training and validation.
Mathematically, target encoding on a binary target can be:
p(t = 1 | x = c_i)
For a binary classifier, we can calculate the probability when the target is true or 1 by taking the mean for each category group. This is also known as Mean Encoding.
In other words, it calculates statistics, such as the arithmetic mean, from a target variable grouped by the unique values of one or more categorical features.
Leakage, also known as data leakage or target leakage, occurs when a model is trained with information that would not be available at the time of prediction. This inflates performance scores and overestimates the model's utility. For example, including "temperature_celsius" as a feature when training a model to predict "temperature_fahrenheit".
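One common way to reduce overfitting on rare categories (not used in this lab's simpler version below) is smoothing: blend each category's mean with the global mean, so categories with few observations are pulled toward the overall average. A minimal pandas sketch, with made-up column names and a hypothetical smoothing weight w:

```python
import pandas as pd

def smoothed_target_encoding(df, cat, target, w=20):
    # blend the per-category mean with the global mean; w controls how
    # strongly small categories are pulled toward the global mean
    global_mean = df[target].mean()
    agg = df.groupby(cat)[target].agg(["mean", "count"])
    smoothed = (agg["count"] * agg["mean"] + w * global_mean) / (agg["count"] + w)
    return df[cat].map(smoothed)

df = pd.DataFrame({"brand": ["a", "a", "b", "b"], "target": [1, 1, 0, 0]})
te = smoothed_target_encoding(df, "brand", "target", w=2)
```

For stricter leakage control, the encoding is usually computed out-of-fold: each row's encoded value comes from statistics of the other folds, never from its own target.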
def target_encoding(df, cat):
    te_df = df.groupby(cat)['target'].mean().reset_index()
    te_df.columns = [cat, cat + '_TE']
    df = df.merge(te_df, on=cat)
    return df
target_encoding(ddf, 'brand').head()
Embeddings
Deep learning models often apply embedding layers to categorical features. Over the past few years, this has become an increasingly popular technique for encoding categorical features. Since embeddings need to be trained through a neural network, we will cover this in the next lab.
ddf=one_hot(ddf, 'cat_0')
ddf=combine_cats(ddf, 'ts_weekday', 'ts_hour')
ddf=categorify(ddf, 'product_id', 100)
ddf=count_encoding(ddf, 'user_id')
ddf=count_encoding(ddf, 'product_id')
ddf=target_encoding(ddf, 'brand')
ddf=target_encoding(ddf, 'product_id')
ddf.head()
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Well Done! Let's move to the next notebook.

