lab/ds/25-1/3/1_02_EDA.md
2026-01-26 20:58:26 +03:00

Enhancing Data Science Outcomes With Efficient Workflow

02 - Data Exploration and Data Visualization

In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load (ETL) is the process of transforming data into a proper structure for querying and analysis. Feature engineering, on the other hand, involves extracting and transforming raw data into features that better represent the underlying problem.

Table of Contents
In this notebook, we will load data from Parquet file format into a Dask DataFrame and perform various data transformations and exploratory data analysis. This notebook covers the below sections:

  1. Quick Recap
  2. Data Exploration and Data Visualization
  3. Summary

Quick Recap

So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF:

  • Reading data without a schema or specifying dtype
  • Having too many partitions due to small chunksize
  • Memory spilling due to partitions being too large
  • Performing groupby operations on too many groups scattered across multiple partitions

Going forward, we will continue to learn how to use Dask and RAPIDS efficiently.
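The first pitfall above — reading without explicit dtypes — can be sketched on CPU with pandas; cudf.read_csv accepts the same dtype argument. The file contents here are made up for illustration:

```python
import io
import pandas as pd

# toy CSV standing in for a real data file (hypothetical values)
csv_data = "product_id,price\n1001,19.99\n1002,5.49\n"

# without dtype, the reader must infer types by scanning the data;
# supplying them up front skips inference and pins down the schema
df = pd.read_csv(io.StringIO(csv_data),
                 dtype={"product_id": "int32", "price": "float32"})

print(df.dtypes.tolist())  # [dtype('int32'), dtype('float32')]
```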

Data Exploration and Data Visualization

Exploratory data analysis involves identification of predictor/feature variables and the target/class variable. We use this time to understand the distribution of the features and identify potentially problematic outliers. Data exploration helps users understand the data in order to better tackle a problem. It can be a way to ascertain the validity of the data as we begin to look for useful features that will help in the following stages of the development workflow.

Plotly

Plotly [Doc] is a popular library for graphing and data dashboards. Plotly uses plotly.graph_objects to create figures for data visualization. Graph objects can be created with plotly.express or built from the ground up. For Plotly to draw a graph, the data needs to be on the host, not the GPU. If the dataset is small, it may be more efficient to use pandas instead of cudf or Dask-cuDF. If the dataset is large, however, processing it on the GPU before sending the result to the host for visualization can greatly speed up the workflow. When combining GPU acceleration with Plotly, move only the final GPU DataFrame(s) to the host with to_pandas(), rather than converting the entire GPU DataFrame(s) to pandas up front. This lets us take advantage of GPU acceleration for processing.

For more information about how to use Plotly, we recommend this guide.
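As a minimal sketch of this reduce-first, convert-last pattern — with pandas standing in for cudf, since the API shape is the same — the toy events below are made up:

```python
import pandas as pd

# toy event data; with cudf this Series would live on the GPU
events = pd.Series(["view"] * 6 + ["cart"] * 3 + ["purchase"])

# heavy reduction first (on the GPU in the real workflow)
counts = events.value_counts()

# only the small aggregated result crosses to the host, e.g.:
#   px.bar(counts.to_pandas())   # cudf -> pandas at the very end
print(len(counts))  # 3 rows instead of 10
```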

We start by instantiating the LocalCUDACluster() and Dask Client(), followed by loading data from the Parquet files into a Dask DataFrame.

# import dependencies
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import cudf
import dask_cudf
import numpy as np
import cupy as cp
import plotly.express as px
import gc

# create cluster
cluster=LocalCUDACluster()

# instantiate client
client=Client(cluster)
# get the machine's external IP address
from requests import get

ip=get('https://api.ipify.org').content.decode('utf8')

print(f'Dask dashboard (status) address is: http://{ip}:8787/status')
print(f'Dask dashboard (GPU) address is: http://{ip}:8787/gpu')
# read data
ddf=dask_cudf.read_parquet('clean_parquet')

print(f'Total of {len(ddf)} records split across {ddf.npartitions} partitions. ')

ddf.dtypes

The Parquet file format includes metadata to inform `Dask-cuDF` which data types to use for each column.
# create continuous and categorical column lists
continuous_cols=['price', 'target', 'ts_hour', 'ts_minute', 'ts_weekday', 'ts_day', 'ts_month', 'ts_year']
categorical_cols=['event_type', 'category_code', 'brand', 'user_session', 'session_product', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'product_id', 'category_id', 'user_id']
# preview DataFrame
ddf.head()

Summarize

We can use the describe()[doc] method to generate summary statistics for continuous features.

# generate summary statistics for continuous features
ddf[continuous_cols].describe().compute().to_pandas().apply(lambda s: s.apply('{0:.2f}'.format))

For categorical values, we are often interested in the cardinality of each feature. Cardinality is the number of unique elements a set contains. We use .nunique() to get the number of possible values for each categorical feature, as this informs how they can be encoded for machine learning model consumption.

# count number of unique values for categorical features
ddf[categorical_cols].nunique().compute()

Note that in the previous step, we added read_parquet() to the task graph but did not .persist() data in memory. Recall that the Dask DataFrame APIs build the task graph until .compute(). The result of .compute() is a cuDF DataFrame and should be small.

Observations:

  • The dataset has an ~41% purchase rate
  • All data come from March of 2020
  • Price has a very large standard deviation

Visualizing Distribution

A histogram is a graph that shows the frequency of data using rectangles. It's used to visualize the distribution of the data so we can quickly approximate concentration, skewness, and variability. We will use histograms to identify popularity characteristics in the dataset.

When using Plotly or other visualization libraries, it's best to keep the data on the GPU for as long as possible and only move the data to the host when needed. For example, instead of relying on Plotly Express's .histogram()[doc] function, we can use the .value_counts() method to count the number of occurrences. We can then pass the results to the .bar() function to generate a frequency bar chart. This can yield faster results by enabling GPU acceleration. Furthermore, we can use .nlargest() to limit the number of bars in the chart.

We want to visualize the distribution for specific features. Now that the data is in Parquet file format, we can use column pruning to only read in one column at a time to reduce the memory burden.

%%time
# set cat column of interest
cat_col='cat_0'

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)

# create histogram
px.histogram(
    # move data to CPU
    ddf[cat_col].compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count', 
    xaxis_title=cat_col, 
    title=f'Distribution of {cat_col}'
)

Exercise #1 - Histogram with GPU

Instead of generating the histogram on CPU, we can use .value_counts() to achieve similar results.

Instructions:

  • Modify the <FIXME> only and execute the below cell to visualize the frequency of each cat_0 value.
  • Compare the performance efficiency with the previous CPU approach.
%%time
# set cat column of interest
cat_col='cat_0'
n_bars=25

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)

# create frequency count DataFrame
cat_count_df=<<<<FIXME>>>>

# create histogram
px.bar(
    # move data to CPU
    cat_count_df.compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count', 
    xaxis_title=cat_col, 
    title=f'Distribution of {cat_col}'
)

%%time
cat_col='cat_0'
n_bars=25

ddf=dask_cudf.read_parquet('clean_parquet', columns=cat_col)

cat_count_df=ddf[cat_col].value_counts().nlargest(n_bars)

px.bar(
    cat_count_df.compute().to_pandas()
).update_layout(
    yaxis_title='Frequency Count', 
    xaxis_title=cat_col, 
    title=f'Distribution of {cat_col}'
)

Click ... to show solution.

Using cuDF to calculate the frequency is much more efficient. For continuous features, we often have to bin the values into buckets. We can use cudf.Series.digitize()[doc], but it's not implemented for dask_cudf.core.Series, so we have to use .map_partitions() to perform the cudf.Series.digitize() method on each partition.
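cudf.Series.digitize() broadly follows the semantics of numpy.digitize(), which we can demonstrate on CPU: each value is mapped to the index of the bin it falls into (the bin edges and prices below are made up):

```python
import numpy as np

# bin edges and toy prices
bins = np.array([0.0, 50.0, 100.0, 150.0], dtype="float32")
prices = np.array([5.0, 49.9, 50.0, 120.0, 200.0], dtype="float32")

# values below bins[1] map to 1, values past the last edge to len(bins)
bin_ids = np.digitize(prices, bins)
print(bin_ids.tolist())  # [1, 1, 2, 3, 4]
```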

%%time
# set cont column of interest
cont_col='price'

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)

# set bin
bins=np.array(range(-1, 10000, 50)).astype('float32')

# create frequency count DataFrame
cont_hist_df=ddf[cont_col].map_partitions(lambda p: p.digitize(bins)).value_counts()

# create histogram
px.bar(
    # move data to CPU
    cont_hist_df.compute().to_pandas(), 
).update_xaxes(
    tickmode='array', 
    tickvals=np.array(range(1, len(bins))), 
    ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx, bin in enumerate(bins[:-1])], 
).update_layout(
    yaxis_title='Frequency Count', 
    xaxis_title=f'{cont_col} Bin', 
    title=f'Distribution of {cont_col}'
)

Exercise #2 - Histogram with Log Scale

price is positively skewed. We might be able to visualize the distribution better by creating bins on a logarithmic scale. We will create bin ranges using numpy.logspace() and pass them to cudf.Series.digitize().

Instructions:

  • Modify the <FIXME> only and execute the below cell to visualize the frequency of price bins in log scale.
# set cont column of interest
cont_col='price'

# read data
ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)

# set bin
bins=np.logspace(0, 5).astype('float32')

# create frequency count DataFrame
cont_hist_df=ddf['price'].map_partitions(<<<<FIXME>>>>).value_counts()

# create histogram
px.bar(
    # move data to CPU
    cont_hist_df.compute().to_pandas(), 
).update_xaxes(
    tickmode='array', 
    tickvals=np.array(range(1, len(bins))), 
    ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx, bin in enumerate(bins[:-1])]
).update_layout(
    yaxis_title='Frequency Count', 
    xaxis_title=f'{cont_col} Bin', 
    title=f'Distribution of {cont_col}'
)

cont_col='price'

ddf=dask_cudf.read_parquet('clean_parquet', columns=cont_col)

bins=np.logspace(0, 5).astype('float32')

cont_hist_df=ddf['price'].map_partitions(lambda p: p.digitize(bins)).value_counts()

px.bar(
    cont_hist_df.compute().to_pandas(), 
).update_xaxes(
    tickmode='array', 
    tickvals=np.array(range(1, len(bins))), 
    ticktext=[f'{int(bins[idx])} - {int(bins[idx+1])}' for idx, bin in enumerate(bins[:-1])]
).update_layout(
    yaxis_title='Frequency Count', 
    xaxis_title=f'{cont_col} Bin', 
    title=f'Distribution of {cont_col}'
)

Click ... to show solution.

Observations:

  • Vast majority of the products are below $300.

GroupBy Summarize

We can use a variety of groupby aggregations to learn about the data. The aggregations supported by cuDF, as described here, are very efficient. We might be interested in exploring several variations. To make execution more efficient, we can .persist() the data in memory after reading from the source, so that subsequent operations do not require loading from the source again.

For example, we can visualize the probability of an event for each category. When the target column is a binary indicator, we can do this quickly by calculating the aggregate mean: for a 0/1 outcome, the arithmetic mean equals the probability of a positive outcome.

We use .groupby() on cat_0, followed by .agg('mean') on target to determine the probability of positive outcome for each cat_0 group.
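That a group-wise mean of a 0/1 target is a probability can be checked with a toy pandas example (the column names mirror the lab's; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "cat_0": ["electronics", "electronics", "apparel", "apparel"],
    "target": [1, 0, 1, 1],   # 1 marks a purchase
})

# mean of a binary indicator = fraction of positives = purchase probability
prob = df.groupby("cat_0")["target"].mean()
print(prob["electronics"])  # 0.5 -> 1 purchase out of 2 events
```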

# read data
ddf=dask_cudf.read_parquet('clean_parquet')

# persist data in memory
ddf=ddf.persist()
wait(ddf)
# set cat column of interest
cat_col='cat_0'
n_bars=25

# create groupby probability DataFrame
cat_target_df=ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)

# create bar chart
px.bar(
    # move data to CPU
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability', 
    xaxis_title=cat_col, 
    title=f'Probability of {cat_col}'
)

Some categories have a higher probability than others.

Other groupby aggregations include:

  • What time of the week is the busiest
# show probability of each ts_weekday and ts_hour group
ddf.groupby(['ts_weekday', 'ts_hour'])['target'].agg({'target': 'mean'})

.groupby().size() or .groupby().agg('size') is very similar to .value_counts().
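The equivalence is easy to verify on CPU with pandas — the same counts come back, just ordered differently (.size() sorts by key, .value_counts() by descending count); the toy Series is made up:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])

via_groupby = s.groupby(s).size()
via_value_counts = s.value_counts()

# identical counts once ordering is ignored
assert via_groupby.to_dict() == via_value_counts.to_dict()
print(via_value_counts.to_dict())  # {'a': 3, 'b': 2, 'c': 1}
```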

Exercise #3 - Probability Bar Chart

Instructions:

  • Modify the <FIXME> only and execute the below cell to visualize the probability of each ts_hour value.
# set cat column of interest
cat_col=<<<<FIXME>>>>
n_bars=25

# create groupby probability DataFrame
cat_target_df=ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)

# create bar chart
px.bar(
    # move data to CPU
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability', 
    xaxis_title=cat_col, 
    title=f'Probability of {cat_col}'
)

cat_col='ts_hour'
n_bars=25

cat_target_df=ddf.groupby(cat_col)['target'].agg({'target': 'mean'}).nlargest(n_bars)

px.bar(
    cat_target_df.compute().to_pandas()
).update_layout(
    yaxis_title='Probability', 
    xaxis_title=cat_col, 
    title=f'Probability of {cat_col}'
)

Click ... to show solution.

Some aggregations, such as median, mode, and nunique, require all data within the same group to be in memory for calculation. For these operations, .groupby().apply() is used. Because .groupby().apply() performs a shuffle, these operations scale poorly when there are many groups.

We use .groupby() on brand and .apply() on SeriesGroupBy['user_id'].nunique() to get the number of unique customers that have interacted with each brand.

# set cat columns of interest
cat_col='brand'
group_statistic='user_id'

# create groupby summarize DataFrame
product_frequency=ddf.groupby(cat_col)[group_statistic].apply(lambda g: g.nunique(), meta=(f'{group_statistic}_count', 'int32')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    product_frequency.compute().to_pandas()
).update_layout(
    yaxis_title=f'Number of Unique {group_statistic}', 
    xaxis_title=cat_col, 
    title=f'Number of Unique {group_statistic} per {cat_col}'
)
# visualize graph
product_frequency.visualize(rankdir='LR')

Certain brands and categories have a higher penetration and higher probability of positive outcome.

Other groupby aggregations include:

  • How many unique customer_id are in each product_id group
# show how many customers interacted with each product
ddf.groupby('product_id')['user_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
  • How many unique cat_0 are in each brand group
# show how many categories of product do each brand carry
ddf.groupby('brand')['cat_0'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
  • How many unique product_id are in each user_session group
# show how many products are viewed in each session
ddf.groupby('user_session')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))
  • How many unique product_id are in each user_id group
# show how many products each user interacts with
ddf.groupby('user_id')['product_id'].apply(lambda g: g.nunique(), meta=('nunique', 'int32'))

For Dask, we can ensure the result of .groupby() is sorted by using the sort parameter.
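pandas exposes the same parameter, so the effect can be shown on CPU (the toy keys below are made up):

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "c"], "val": [1, 2, 3, 4]})

# sort=True (the default in pandas; Dask exposes the same parameter)
# returns group keys in sorted order; sort=False keeps first-seen order
sorted_keys = df.groupby("key", sort=True)["val"].sum().index.tolist()
unsorted_keys = df.groupby("key", sort=False)["val"].sum().index.tolist()

print(sorted_keys)    # ['a', 'b', 'c']
print(unsorted_keys)  # ['b', 'a', 'c']
```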

Sometimes we want to perform custom aggregations that are not yet supported. For custom aggregations, we can use .groupby().apply() with user-defined functions. For example, we might be interested in the range of price for each category_code. Arithmetically, this is the difference between the group-specific maximum and the group-specific minimum. We can normalize the range by dividing it by the group-specific mean.

It's best to avoid using .groupby().apply() when possible. Similar results can be calculated by using .groupby().agg() to obtain the max, min, and mean separately, then applying a row-wise calculation with .apply(). This can be more efficient.
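A CPU-sized pandas sketch of the two approaches, confirming they agree (the toy data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "category_code": ["a", "a", "b", "b"],
    "price": [10.0, 30.0, 5.0, 15.0],
})

# one pass of supported aggregations...
stats = df.groupby("category_code")["price"].agg(["max", "min", "mean"])

# ...then a cheap row-wise combination on the small aggregated result
normalized_range = (stats["max"] - stats["min"]) / stats["mean"]

# matches the groupby().apply() version, without the per-group shuffle
via_apply = df.groupby("category_code")["price"].apply(
    lambda g: (g.max() - g.min()) / g.mean()
)
assert normalized_range.equals(via_apply)
```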

%%time
# set cat column of interest
cat_col='category_code'

# define group-wise function
def normalized_range(group): 
    return (group.max()-group.min())/group.mean()

# create groupby apply DataFrame
normalized_range_df=ddf.groupby(cat_col)['price'].apply(normalized_range, meta=('normalize_range', 'float64')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalize Range', 
    xaxis_title=cat_col, 
    title=f'Normalize Range of price per {cat_col}'
)
# visualize graph
normalized_range_df.visualize(rankdir='LR')

Exercise #4 - Custom GroupBy Aggregation

Instructions:

  • Modify the <FIXME> only and execute the below cell to visualize the normalized range of each category_code.
  • Compare the performance efficiency with the previous .groupby().apply() approach.
%%time
# set cat column of interest
cat_col='category_code'

# define row-wise function
def normalized_range(group): 
    return <<<<FIXME>>>>

# create groupby aggregate DataFrame
normalized_range_df=ddf.groupby(cat_col)['price'].agg({'price': ['max', 'min', 'mean']}).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)

# create bar chart
px.bar(
    # move data to CPU
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalize Range', 
    xaxis_title=cat_col, 
    title=f'Normalize Range of price per {cat_col}'
)

cat_col='category_code'

def normalized_range(group): 
    return (group['max']-group['min'])/group['mean']

normalized_range_df=ddf.groupby(cat_col)['price'].agg({'price': ['max', 'min', 'mean']}).apply(normalized_range, axis=1, meta=('normalize_range', 'float64')).nlargest(25)

px.bar(
    normalized_range_df.compute().to_pandas()
).update_layout(
    yaxis_title='Normalize Range', 
    xaxis_title=cat_col, 
    title=f'Normalize Range of price per {cat_col}'
)

Click ... to show solution.

# visualize graph
normalized_range_df.visualize(rankdir='LR')

We can apply predicate pushdown filters when reading from Parquet files with the filters parameter. This enables Dask-cuDF to skip row groups and files in which no rows can satisfy the criteria. It works well when the partitioning is thought out and aligned with the filter column. Since that is not the case for our dataset, we will apply a separate filter after importing the data.

Exercise #5 - Time-Series Analysis

When dealing with time-series data, cuDF provides powerful .rolling()[doc] and .resample()[doc] methods to perform window operations. Functionally, they behave very similarly to .groupby() operations. We use .resample() followed by .interpolate() to find the frequency and probability over the entire span of the dataset.

We use .map_partitions() to perform .resample() on each partition. Because .map_partitions() doesn't perform a shuffle first, we manually perform a .shuffle() to ensure all members of each group are together. Once all the needed data is in the same partition, we can use .map_partitions() and pass in the cuDF DataFrame's .resample() operation.
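On a single partition, .resample() behaves like its pandas counterpart, which we can sketch on CPU (the timestamps and targets below are made up):

```python
import pandas as pd

# toy click stream with a timestamp column
df = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2020-03-01 00:10", "2020-03-01 00:40",
        "2020-03-01 01:20", "2020-03-01 03:05",
    ]),
    "target": [0, 1, 1, 0],
})

# record count per 1-hour window; the 02:00 window is empty...
counts = df.resample("1h", on="event_time").size()

# ...so interpolate() fills the resulting gap for plotting
rate = df.resample("1h", on="event_time")["target"].mean().interpolate("linear")
print(counts.tolist())  # [2, 1, 0, 1]
```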

Instructions:

  • Execute the below cell to clear memory.
  • Execute the cell below to read data into memory and shuffle based on ts_day. This ensures that all records belonging to the same group are in the same partition.
  • Execute the cell below to show the user shopping behavior.
  • Modify the resample_frequency to various frequencies to look for more obvious patterns.
del ddf
gc.collect()
# read data with predicate pushdown
ddf=dask_cudf.read_parquet('clean_parquet', filters=[('ts_day', "<", 15)])

# apply filtering
ddf=ddf[ddf['ts_day']<15]

# shuffle first on ts_day
ddf=ddf.shuffle('ts_day')
# set resample frequency
resample_frequency='3h'

# get time-series DataFrame
activity_amount_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time').size().interpolate('linear'))
purchase_rate_trend=ddf.map_partitions(lambda x: x.resample(resample_frequency, on='event_time')['target'].mean().interpolate('linear'))

# create scatter plot
px.scatter(
    # move data to CPU
    activity_amount_trend.compute().to_pandas().sort_index(), 
    color=purchase_rate_trend.compute().to_pandas().sort_index()
).update_traces(
    mode='markers+lines'
).update_layout(
    yaxis_title='Number of Records', 
    xaxis_title='event_time', 
    title=f'Amount of Transactions Over Time'
)

Pivot Table

When the data is small enough to fit on a single GPU, it's often faster to perform data transformation with cuDF. Below we read a few numerical columns, which fit nicely in memory. We use .pivot_table() to find the probability and frequency of each ts_hour and ts_weekday group.
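The same pivot can be sketched on CPU with pandas (the toy rows mirror the lab's columns; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "ts_weekday": [0, 0, 1, 1],
    "ts_hour":    [9, 9, 9, 18],
    "target":     [1, 0, 1, 1],
})

# probability of purchase for each (weekday, hour) cell
rate = df.pivot_table(index="ts_weekday", columns="ts_hour",
                      values="target", aggfunc="mean")
print(rate.loc[0, 9])  # 0.5
```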

# read data
gdf=cudf.read_parquet('clean_parquet', columns=['ts_weekday', 'ts_hour', 'target'])
# create pivot table
activity_amount=gdf.pivot_table(index=['ts_weekday'], columns=['ts_hour'], values=['target'], aggfunc='size')['target']

# create heatmap
px.imshow(
    # move data to CPU
    activity_amount.to_pandas(), 
    title='there is more activity in the day'
).update_layout(
    title=f'Number of Records Heatmap'
)
# create pivot table
purchase_rate=gdf[['target', 'ts_weekday', 'ts_hour']].pivot_table(index=['ts_weekday'], columns=['ts_hour'], aggfunc='mean')['target'].to_pandas()

# create heatmap
px.imshow(
    # move data to CPU
    purchase_rate, 
    title='there is potentially a higher purchase rate in the evening'
).update_layout(
    title=f'Probability Heatmap'
)

Observations:

  • Behavior changes with ts_weekday and ts_hour - e.g., during the week, users do not stay up as late because they work the next day.

Summary

  • .groupby().apply() requires shuffling, which is time-expensive. When possible, try to use .groupby().agg() instead
  • Keeping data processing on the GPU can help generate visualizations quickly
  • Use predicate filtering and column pruning to reduce the amount of data read into memory. When data size is small, processing on cuDF can be more efficient than Dask-cuDF
  • Use .persist() if subsequent operations are exploratory
  • .map_partitions() does not involve shuffling
# clean GPU memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Well Done! Let's move to the next notebook.
