{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0bf7f930-76a1-4c16-84e4-cf1e73b54c55",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "400a41da-bc38-4e9a-9ece-d2744ffb16b0",
|
||
"metadata": {
|
||
"tags": []
|
||
},
|
||
"source": [
|
||
"# Enhancing Data Science Outcomes With Efficient Workflow #"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8897c66c-4f9d-48b4-a60b-ddae16f2f61b",
|
||
"metadata": {
|
||
"tags": []
|
||
},
|
||
"source": [
|
||
"## 03 - Feature Engineering for Categorical Features ##\n",
|
"In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process of transforming data into a structure suitable for querying and analysis. Feature engineering, on the other hand, extracts and transforms raw data into features that a model can learn from. \n",
|
||
"\n",
|
||
"<p><img src='images/pipeline_overview_1.png' width=1080></p>\n",
|
||
"\n",
|
||
"**Table of Contents**\n",
|
||
"<br>\n",
|
"In this notebook, we will load data from Parquet files into a Dask-cuDF DataFrame and create additional features for machine learning model training. This notebook covers the following sections: \n",
|
||
"1. [Quick Recap](#s3-1)\n",
|
||
"2. [Feature Engineering](#s3-2)\n",
|
" * [User-Defined Functions](#s3-2.1)\n",
|
||
"3. [Feature Engineering Techniques](#s3-3)\n",
|
||
" * [One-Hot Encoding](#s3-3.1)\n",
|
||
" * [Combining Categories](#s3-3.2)\n",
|
||
" * [Categorify / Label Encoding](#s3-3.3)\n",
|
||
" * [Count Encoding](#s3-3.4)\n",
|
||
" * [Target Encoding](#s3-3.5)\n",
|
||
" * [Embeddings](#s3-3.6)\n",
|
||
"4. [Summary](#s3-4)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5570d950-5b5a-48c2-93fa-dfd80e2beaf9",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-1'></a>\n",
|
||
"## Quick Recap ##\n",
|
||
"So far, we've identified several sources of hidden slowdowns when working with Dask and cuDF: \n",
|
||
"* Reading data without a schema or specifying `dtype`\n",
|
||
"* Having too many partitions due to small `chunksize`\n",
|
||
"* Memory spilling due to partitions being too large\n",
|
||
"* Performing groupby operations on too many groups scattered across multiple partitions\n",
|
||
"\n",
|
||
"Going forward, we will continue to learn how to use Dask and RAPIDS efficiently. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "cba7c117-cd05-423e-a958-93a203e879b6",
|
||
"metadata": {
|
||
"tags": []
|
||
},
|
||
"source": [
|
||
"<a name='s3-2'></a>\n",
|
||
"## Feature Engineering ##\n",
|
"Feature engineering converts raw data into numeric vectors for model consumption. For categorical data, this is generally referred to as encoding, which transforms category values into numbers. When encoding categorical values, there are three primary methods: \n",
"* Label encoding, when the categories have no ordered relationship\n",
"* Ordinal encoding, when the categories have an ordered relationship\n",
"* One-hot encoding, when each category is best represented by its own binary (0/1) column\n",
|
||
"\n",
|
"Additionally, we can create numerous sets of new features from existing ones, which are then tested for effectiveness during model training. Feature engineering is an important step when working with tabular data, as it can help a machine learning model learn faster and extract patterns. Feature engineering can be a time-consuming process, particularly when the dataset is large and each processing cycle takes a long time. The ability to perform feature engineering efficiently enables more exploration of useful features. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "82aa2285-30ce-4d43-a754-e2c308506fdd",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-2.1'></a>\n",
|
||
"### User-Defined Functions ###\n",
|
"Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame-style operators. While out-of-the-box functions are flexible and useful, it is sometimes necessary to write custom code, or **user-defined functions** (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.\n",
|
||
"\n",
|
||
"Users can execute UDFs on `cudf.Series` with: \n",
|
||
"* `cudf.Series.apply()` or \n",
|
||
"* Numba's `forall` syntax [(link)](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#lower-level-control-with-custom-numba-kernels)\n",
|
||
"\n",
|
||
"Users can execute UDFs on `cudf.DataFrame` with: \n",
|
||
"* `cudf.DataFrame.apply()`\n",
|
||
"* `cudf.DataFrame.apply_rows()`\n",
|
||
"* `cudf.DataFrame.apply_chunks()`\n",
|
||
"* `cudf.rolling().apply()`\n",
|
||
"* `cudf.groupby().apply_grouped()`\n",
|
||
"\n",
|
||
"Note that applying UDFs directly with Dask-cuDF is not yet implemented. For now, users can use `map_partitions` to apply a function to each partition of the distributed dataframe.\n",
|
||
"\n",
|
"Currently, the use of string data within UDFs is provided through the `string_udf` library. This is powerful for use cases such as string splitting, regular expression matching, and tokenization. The topic of handling string data is discussed extensively [here](https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html#string-data). In addition to `Series.str`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/string_handling.html), cuDF also supports `Series.list`[[doc]](https://docs.rapids.ai/api/cudf/stable/api_docs/list_handling.html) for applying custom transformations. \n",
|
||
"\n",
|
||
"<p><img src='images/tip.png' width=720></p>\n",
|
||
"\n",
|
||
"Below are some tips: \n",
|
"* Groupby `apply` works by applying the provided function to each group sequentially and concatenating the results. This can be very slow, especially for a large number of small groups. For a small number of large groups, it can give acceptable performance.\n",
"* With cuDF, we can also combine NumPy or CuPy methods into the procedure. \n",
"* Related to `apply`, iterating over a cuDF Series, DataFrame, or Index is not supported. This is because iterating over data that resides on the GPU yields extremely poor performance, as GPUs are optimized for highly parallel operations rather than sequential operations. In the vast majority of cases, it is possible to avoid iteration and use an existing function or method to accomplish the same task. If iteration is unavoidable, copy the data from GPU to host with `.to_arrow()` or `.to_pandas()`, then copy the result back to GPU using `.from_arrow()` or `.from_pandas()`. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b7dae2ea-9354-43bd-8f56-fcda283f3455",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-3'></a>\n",
|
||
"## Feature Engineering Techniques ##\n",
|
||
"Below is a list of common feature engineering techniques.\n",
|
||
"<mark>\n",
|
"Ordinal - Assigns a unique integer to each category based on its order.<br>\n",
"Label - Assigns a unique integer to each category without considering any ranking.\n",
|
||
"</mark>\n",
|
||
"<img src='images/feature_engineering_methods.png' width=720>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "83513c1b-346e-4cb6-abc7-aa5e9a74dd27",
|
||
"metadata": {
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.\n",
|
||
"Perhaps you already have a cluster running?\n",
|
||
"Hosting the HTTP server on port 42643 instead\n",
|
||
" warnings.warn(\n",
|
||
"2026-02-11 12:49:00,123 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,123 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,123 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,123 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,127 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,127 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,128 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n",
|
||
"2026-02-11 12:49:00,128 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from dask.distributed import Client, wait\n",
|
||
"from dask_cuda import LocalCUDACluster\n",
|
||
"import cudf\n",
|
||
"import dask.dataframe as dd\n",
|
||
"import dask_cudf\n",
|
||
"import gc\n",
|
||
"\n",
|
||
"# instantiate a Client\n",
|
||
"cluster=LocalCUDACluster()\n",
|
||
"client=Client(cluster)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "7eed5742-23fb-44cc-b048-11b64858644e",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# get the machine's external IP address\n",
|
||
"from requests import get\n",
|
||
"\n",
|
||
"ip=get('https://api.ipify.org').content.decode('utf8')\n",
|
||
"\n",
|
||
"print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')\n",
|
||
"print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "50a80952-c378-4777-bdb3-ee625efafd2b",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# read data as Dask-cuDF DataFrame\n",
|
||
"ddf=dask_cudf.read_parquet('clean_parquet')\n",
|
||
"ddf=ddf.categorize(columns=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3'])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "8250caf8-28a9-4d35-b1a8-efadcfbe092a",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"ddf=ddf.persist()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "94b1ccfc-bcf3-433d-aafd-f943818cf9fb",
|
||
"metadata": {},
|
||
"source": [
|
||
"<p><img src='images/check.png' width=720></p>\n",
|
"Did you get an error message? This notebook depends on the processed source file from previous notebooks. If `clean_parquet` is missing, run the previous notebooks first. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7a7013bd-a83b-487a-8578-404e0b1c1f00",
|
||
"metadata": {
|
||
"tags": []
|
||
},
|
||
"source": [
|
||
"<a name='s3-3.1'></a>\n",
|
||
"### One-Hot Encoding ###\n",
|
||
"**One-Hot Encoding**, also known as dummy encoding, creates several binary columns to indicate a row belonging to a specific category. It works well for categorical features that are not ordinal and have low cardinality. With one-hot encoding, each row would get a single column with a 1 and 0 everywhere else. \n",
|
||
"\n",
|
"For example, we can use `cudf.get_dummies()` to perform one-hot encoding on one of the categorical columns. \n",
|
||
"\n",
|
||
"<img src='images/tip.png' width=720>\n",
|
"<mark>One-hot encoding doesn't work well for categorical features when the cardinality is large</mark>, as it results in high dimensionality. This is particularly an issue for neural network optimizers. Furthermore, data should not be saved in one-hot encoding format. If needed, it should only be used temporarily for specific tasks. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "e3a44993-a73c-4560-9291-0b8d3de03a1c",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
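"def one_hot(df, cat):\n",
"    # build one binary indicator column per category of `cat`\n",
"    temp=dd.get_dummies(df[cat])\n",
"    # append the indicator columns to the original DataFrame\n",
"    return dask_cudf.concat([df, temp], axis=1)"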
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "64e95a04-ede4-4dd7-9907-ab3d66d4c4de",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/opt/conda/envs/rapids/lib/python3.9/site-packages/dask/dataframe/multi.py:1287: UserWarning: Concatenating dataframes with unknown divisions.\n",
|
||
"We're assuming that the indices of each dataframes are \n",
|
||
" aligned. This assumption is not generally safe.\n",
|
||
" warnings.warn(\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>event_time</th>\n",
|
||
" <th>event_type</th>\n",
|
||
" <th>product_id</th>\n",
|
||
" <th>category_id</th>\n",
|
||
" <th>category_code</th>\n",
|
||
" <th>brand</th>\n",
|
||
" <th>price</th>\n",
|
||
" <th>user_id</th>\n",
|
||
" <th>user_session</th>\n",
|
||
" <th>session_product</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>auto</th>\n",
|
||
" <th>computers</th>\n",
|
||
" <th>construction</th>\n",
|
||
" <th>country_yard</th>\n",
|
||
" <th>electronics</th>\n",
|
||
" <th>furniture</th>\n",
|
||
" <th>kids</th>\n",
|
||
" <th>medicine</th>\n",
|
||
" <th>sport</th>\n",
|
||
" <th>stationery</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2020-03-01 04:54:09</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>10301400</td>\n",
|
||
" <td>2232732104888681081</td>\n",
|
||
" <td>apparel.scarf</td>\n",
|
||
" <td>bburago</td>\n",
|
||
" <td>19.280001</td>\n",
|
||
" <td>537144080</td>\n",
|
||
" <td>053d5ad3-01c7-4dfb-9079-e121c33b0938</td>\n",
|
||
" <td>053d5ad3-01c7-4dfb-9079-e121c33b0938_10301400</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2020-03-01 04:55:26</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>15700285</td>\n",
|
||
" <td>2232732094134485388</td>\n",
|
||
" <td>UNKNOWN</td>\n",
|
||
" <td>UNKNOWN</td>\n",
|
||
" <td>154.190002</td>\n",
|
||
" <td>514686549</td>\n",
|
||
" <td>3c842e53-1e47-4941-83e0-2a27a8fdeaf1</td>\n",
|
||
" <td>3c842e53-1e47-4941-83e0-2a27a8fdeaf1_15700285</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2020-03-01 04:54:46</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>21406331</td>\n",
|
||
" <td>2232732082063278200</td>\n",
|
||
" <td>electronics.clocks</td>\n",
|
||
" <td>casio</td>\n",
|
||
" <td>30.369999</td>\n",
|
||
" <td>522564661</td>\n",
|
||
" <td>cfa89b7f-5b34-4d65-a135-bb924d98af9c</td>\n",
|
||
" <td>cfa89b7f-5b34-4d65-a135-bb924d98af9c_21406331</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2020-03-01 07:45:47</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>1004665</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>816.690002</td>\n",
|
||
" <td>596178054</td>\n",
|
||
" <td>f84b2b78-50a0-4e34-ad8d-da60a6178091</td>\n",
|
||
" <td>f84b2b78-50a0-4e34-ad8d-da60a6178091_1004665</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2020-03-02 05:26:04</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>21400996</td>\n",
|
||
" <td>2232732082063278200</td>\n",
|
||
" <td>electronics.clocks</td>\n",
|
||
" <td>casio</td>\n",
|
||
" <td>81.159996</td>\n",
|
||
" <td>537131991</td>\n",
|
||
" <td>b19b380b-2db7-4c6e-bb91-998556315d0a</td>\n",
|
||
" <td>b19b380b-2db7-4c6e-bb91-998556315d0a_21400996</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 36 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" event_time event_type product_id category_id \\\n",
|
||
"0 2020-03-01 04:54:09 purchase 10301400 2232732104888681081 \n",
|
||
"1 2020-03-01 04:55:26 purchase 15700285 2232732094134485388 \n",
|
||
"2 2020-03-01 04:54:46 purchase 21406331 2232732082063278200 \n",
|
||
"3 2020-03-01 07:45:47 purchase 1004665 2232732093077520756 \n",
|
||
"4 2020-03-02 05:26:04 purchase 21400996 2232732082063278200 \n",
|
||
"\n",
|
||
" category_code brand price user_id \\\n",
|
||
"0 apparel.scarf bburago 19.280001 537144080 \n",
|
||
"1 UNKNOWN UNKNOWN 154.190002 514686549 \n",
|
||
"2 electronics.clocks casio 30.369999 522564661 \n",
|
||
"3 construction.tools.light samsung 816.690002 596178054 \n",
|
||
"4 electronics.clocks casio 81.159996 537131991 \n",
|
||
"\n",
|
||
" user_session \\\n",
|
||
"0 053d5ad3-01c7-4dfb-9079-e121c33b0938 \n",
|
||
"1 3c842e53-1e47-4941-83e0-2a27a8fdeaf1 \n",
|
||
"2 cfa89b7f-5b34-4d65-a135-bb924d98af9c \n",
|
||
"3 f84b2b78-50a0-4e34-ad8d-da60a6178091 \n",
|
||
"4 b19b380b-2db7-4c6e-bb91-998556315d0a \n",
|
||
"\n",
|
||
" session_product ... auto computers \\\n",
|
||
"0 053d5ad3-01c7-4dfb-9079-e121c33b0938_10301400 ... 0 0 \n",
|
||
"1 3c842e53-1e47-4941-83e0-2a27a8fdeaf1_15700285 ... 0 0 \n",
|
||
"2 cfa89b7f-5b34-4d65-a135-bb924d98af9c_21406331 ... 0 0 \n",
|
||
"3 f84b2b78-50a0-4e34-ad8d-da60a6178091_1004665 ... 0 0 \n",
|
||
"4 b19b380b-2db7-4c6e-bb91-998556315d0a_21400996 ... 0 0 \n",
|
||
"\n",
|
||
" construction country_yard electronics furniture kids medicine sport \\\n",
|
||
"0 0 0 0 0 0 0 0 \n",
|
||
"1 0 0 0 0 0 0 0 \n",
|
||
"2 0 0 1 0 0 0 0 \n",
|
||
"3 1 0 0 0 0 0 0 \n",
|
||
"4 0 0 1 0 0 0 0 \n",
|
||
"\n",
|
||
" stationery \n",
|
||
"0 0 \n",
|
||
"1 0 \n",
|
||
"2 0 \n",
|
||
"3 0 \n",
|
||
"4 0 \n",
|
||
"\n",
|
||
"[5 rows x 36 columns]"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"one_hot(ddf, 'cat_0').head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f5048e13-41b7-4462-974c-abe03ac8704f",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-3.2'></a>\n",
|
||
"### Combining Categories ###\n",
|
||
"\n",
|
"**Combining categories** creates new features that better identify patterns when the categories independently don't provide enough information to predict the target. It's also known as _cross column_ or _cross product_. It's a common data preprocessing step for machine learning since it reduces the cost of model training. It's also common for exploratory data analysis. Properly combined categorical features encourage more effective splits in tree-based methods than considering each feature independently. \n",
|
||
"\n",
|
||
"For example, while `ts_weekday` and `ts_hour` may independently have no significant patterns, we might observe more obvious patterns if the two features are combined into `ts_weekday_hour`. \n",
|
||
"\n",
|
||
"<img src='images/tip.png' width=720>\n",
|
"When deciding which categorical features should be combined, it's important to balance the number of categories used, the number of observations in each combined category, and information gain. Combining features reduces the number of observations per resulting category, which can lead to overfitting. Typically, combining low-cardinality categories is recommended. Otherwise, experimentation is needed to discover the best combinations. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "a16194ee-bcfd-4405-b57c-76ac2f131785",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
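"def combine_cats(df, left, right):\n",
"    # concatenate the string forms of both columns into a new '<left>-<right>' column\n",
"    # note: no separator is used; pass sep= to str.cat if values could collide (e.g. '1'+'12' vs '11'+'2')\n",
"    df['-'.join([left, right])]=df[left].astype('str').str.cat(df[right].astype('str'))\n",
"    return df"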
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "b4ebbd2a-0b72-407c-8613-4a6980446fea",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>event_time</th>\n",
|
||
" <th>event_type</th>\n",
|
||
" <th>product_id</th>\n",
|
||
" <th>category_id</th>\n",
|
||
" <th>category_code</th>\n",
|
||
" <th>brand</th>\n",
|
||
" <th>price</th>\n",
|
||
" <th>user_id</th>\n",
|
||
" <th>user_session</th>\n",
|
||
" <th>session_product</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>cat_2</th>\n",
|
||
" <th>cat_3</th>\n",
|
||
" <th>date</th>\n",
|
||
" <th>ts_hour</th>\n",
|
||
" <th>ts_minute</th>\n",
|
||
" <th>ts_weekday</th>\n",
|
||
" <th>ts_day</th>\n",
|
||
" <th>ts_month</th>\n",
|
||
" <th>ts_year</th>\n",
|
||
" <th>ts_weekday-ts_hour</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2020-03-01 04:54:09</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>10301400</td>\n",
|
||
" <td>2232732104888681081</td>\n",
|
||
" <td>apparel.scarf</td>\n",
|
||
" <td>bburago</td>\n",
|
||
" <td>19.280001</td>\n",
|
||
" <td>537144080</td>\n",
|
||
" <td>053d5ad3-01c7-4dfb-9079-e121c33b0938</td>\n",
|
||
" <td>053d5ad3-01c7-4dfb-9079-e121c33b0938_10301400</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-01</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>54</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>64</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2020-03-01 04:55:26</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>15700285</td>\n",
|
||
" <td>2232732094134485388</td>\n",
|
||
" <td>UNKNOWN</td>\n",
|
||
" <td>UNKNOWN</td>\n",
|
||
" <td>154.190002</td>\n",
|
||
" <td>514686549</td>\n",
|
||
" <td>3c842e53-1e47-4941-83e0-2a27a8fdeaf1</td>\n",
|
||
" <td>3c842e53-1e47-4941-83e0-2a27a8fdeaf1_15700285</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-01</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>55</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>64</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2020-03-01 04:54:46</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>21406331</td>\n",
|
||
" <td>2232732082063278200</td>\n",
|
||
" <td>electronics.clocks</td>\n",
|
||
" <td>casio</td>\n",
|
||
" <td>30.369999</td>\n",
|
||
" <td>522564661</td>\n",
|
||
" <td>cfa89b7f-5b34-4d65-a135-bb924d98af9c</td>\n",
|
||
" <td>cfa89b7f-5b34-4d65-a135-bb924d98af9c_21406331</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-01</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>54</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>64</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2020-03-01 07:45:47</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>1004665</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>816.690002</td>\n",
|
||
" <td>596178054</td>\n",
|
||
" <td>f84b2b78-50a0-4e34-ad8d-da60a6178091</td>\n",
|
||
" <td>f84b2b78-50a0-4e34-ad8d-da60a6178091_1004665</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>light</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-01</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>45</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>67</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2020-03-02 05:26:04</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>21400996</td>\n",
|
||
" <td>2232732082063278200</td>\n",
|
||
" <td>electronics.clocks</td>\n",
|
||
" <td>casio</td>\n",
|
||
" <td>81.159996</td>\n",
|
||
" <td>537131991</td>\n",
|
||
" <td>b19b380b-2db7-4c6e-bb91-998556315d0a</td>\n",
|
||
" <td>b19b380b-2db7-4c6e-bb91-998556315d0a_21400996</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>26</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>05</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 23 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" event_time event_type product_id category_id \\\n",
|
||
"0 2020-03-01 04:54:09 purchase 10301400 2232732104888681081 \n",
|
||
"1 2020-03-01 04:55:26 purchase 15700285 2232732094134485388 \n",
|
||
"2 2020-03-01 04:54:46 purchase 21406331 2232732082063278200 \n",
|
||
"3 2020-03-01 07:45:47 purchase 1004665 2232732093077520756 \n",
|
||
"4 2020-03-02 05:26:04 purchase 21400996 2232732082063278200 \n",
|
||
"\n",
|
||
" category_code brand price user_id \\\n",
|
||
"0 apparel.scarf bburago 19.280001 537144080 \n",
|
||
"1 UNKNOWN UNKNOWN 154.190002 514686549 \n",
|
||
"2 electronics.clocks casio 30.369999 522564661 \n",
|
||
"3 construction.tools.light samsung 816.690002 596178054 \n",
|
||
"4 electronics.clocks casio 81.159996 537131991 \n",
|
||
"\n",
|
||
" user_session \\\n",
|
||
"0 053d5ad3-01c7-4dfb-9079-e121c33b0938 \n",
|
||
"1 3c842e53-1e47-4941-83e0-2a27a8fdeaf1 \n",
|
||
"2 cfa89b7f-5b34-4d65-a135-bb924d98af9c \n",
|
||
"3 f84b2b78-50a0-4e34-ad8d-da60a6178091 \n",
|
||
"4 b19b380b-2db7-4c6e-bb91-998556315d0a \n",
|
||
"\n",
|
||
" session_product ... cat_2 cat_3 date \\\n",
|
||
"0 053d5ad3-01c7-4dfb-9079-e121c33b0938_10301400 ... NA NA 2020-03-01 \n",
|
||
"1 3c842e53-1e47-4941-83e0-2a27a8fdeaf1_15700285 ... NA NA 2020-03-01 \n",
|
||
"2 cfa89b7f-5b34-4d65-a135-bb924d98af9c_21406331 ... NA NA 2020-03-01 \n",
|
||
"3 f84b2b78-50a0-4e34-ad8d-da60a6178091_1004665 ... light NA 2020-03-01 \n",
|
||
"4 b19b380b-2db7-4c6e-bb91-998556315d0a_21400996 ... NA NA 2020-03-02 \n",
|
||
"\n",
|
||
" ts_hour ts_minute ts_weekday ts_day ts_month ts_year \\\n",
|
||
"0 4 54 6 1 3 2020 \n",
|
||
"1 4 55 6 1 3 2020 \n",
|
||
"2 4 54 6 1 3 2020 \n",
|
||
"3 7 45 6 1 3 2020 \n",
|
||
"4 5 26 0 2 3 2020 \n",
|
||
"\n",
|
||
" ts_weekday-ts_hour \n",
|
||
"0 64 \n",
|
||
"1 64 \n",
|
||
"2 64 \n",
|
||
"3 67 \n",
|
||
"4 05 \n",
|
||
"\n",
|
||
"[5 rows x 23 columns]"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"combine_cats(ddf, 'ts_weekday', 'ts_hour').head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "daa12853-a976-4e57-8787-cc9f5f1369ee",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-3.3'></a>\n",
|
||
"### Categorify and Grouping ###\n",
|
||
"\n",
|
"**Categorify**, also known as *Label Encoding*, converts features into contiguous integers. Typically, it converts the values into monotonically increasing non-negative integers from 0 to *C*, where *C* is the cardinality. It enables numerical computations and can also reduce memory utilization if the original feature contains string values. Categorify is a necessary preprocessing step for using categorical features in deep learning models with embedding layers. \n",
"\n",
"Categorifying works well when the feature is ordinal, and is sometimes necessary when the cardinality is large. Categories with low frequency can be grouped together to prevent the model from overfitting on sparse signals. When categorifying a feature, we can apply a threshold to group all categories with a lower frequency count into the `other` category."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a2171a74-7b2a-432c-993f-befea42ee6bd",
|
||
"metadata": {},
|
||
"source": [
|
"Categorify encodes a categorical feature into contiguous integer values when the category occurs more often than a specified frequency threshold. Categories at or below the threshold are all mapped to the same shared index (`other`), which keeps the model from overfitting to sparse signals."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "5e70ce8e-4032-4bb4-b42e-2687a29d50b2",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
"def categorify(df, cat, freq_threshold):\n",
"    # count how often each category occurs\n",
"    freq=df[cat].value_counts()\n",
"    freq=freq.reset_index()\n",
"    freq.columns=[cat, 'count']\n",
"    \n",
"    # reset index on the frequency dataframe for a new sequential index\n",
"    freq=freq.reset_index()\n",
"    freq.columns=[cat+'_Categorify', cat, 'count']\n",
"    \n",
"    # keep only categories that occur more often than freq_threshold; the rest are grouped together\n",
"    freq_filtered=freq[freq['count']>freq_threshold]\n",
"    \n",
"    # add 2 to the new index as we want to use index 0 for others and 1 for unknown\n",
"    freq_filtered[cat+'_Categorify']=freq_filtered[cat+'_Categorify']+2\n",
"    freq_filtered=freq_filtered.drop(columns=['count'])\n",
"    \n",
"    # merge original dataframe with newly created dataframe to obtain the categorified value\n",
"    df=df.merge(freq_filtered, how='left', on=cat)\n",
"    \n",
"    # fill null values with 0 to represent low frequency categories grouped as other\n",
"    df[cat + '_Categorify'] = df[cat + '_Categorify'].fillna(0)\n",
"    return df"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"id": "391ebb7a-df65-4b77-9205-ee7bc6cf567d",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>event_time</th>\n",
|
||
" <th>event_type</th>\n",
|
||
" <th>product_id</th>\n",
|
||
" <th>category_id</th>\n",
|
||
" <th>category_code</th>\n",
|
||
" <th>brand</th>\n",
|
||
" <th>price</th>\n",
|
||
" <th>user_id</th>\n",
|
||
" <th>user_session</th>\n",
|
||
" <th>session_product</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>cat_3</th>\n",
|
||
" <th>date</th>\n",
|
||
" <th>ts_hour</th>\n",
|
||
" <th>ts_minute</th>\n",
|
||
" <th>ts_weekday</th>\n",
|
||
" <th>ts_day</th>\n",
|
||
" <th>ts_month</th>\n",
|
||
" <th>ts_year</th>\n",
|
||
" <th>ts_weekday-ts_hour</th>\n",
|
||
" <th>cat_0_Categorify</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2020-03-01 12:40:02</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>100110405</td>\n",
|
||
" <td>2053013551932506308</td>\n",
|
||
" <td>construction.tools.drill</td>\n",
|
||
" <td>spotter</td>\n",
|
||
" <td>41.160000</td>\n",
|
||
" <td>552186714</td>\n",
|
||
" <td>d9899f66-61fe-4320-8223-4e481816d452</td>\n",
|
||
" <td>d9899f66-61fe-4320-8223-4e481816d452_100110405</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-01</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>40</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>612</td>\n",
|
||
" <td>2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2020-03-02 06:16:44</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>1004767</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>241.830002</td>\n",
|
||
" <td>622603102</td>\n",
|
||
" <td>ac95e795-0395-4e1c-9813-be168aa3c0c5</td>\n",
|
||
" <td>ac95e795-0395-4e1c-9813-be168aa3c0c5_1004767</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>16</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>06</td>\n",
|
||
" <td>2</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2020-03-02 06:42:45</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>4200998</td>\n",
|
||
" <td>2053013557695480347</td>\n",
|
||
" <td>appliances.environment.air_conditioner</td>\n",
|
||
" <td>almacom</td>\n",
|
||
" <td>256.619995</td>\n",
|
||
" <td>512853766</td>\n",
|
||
" <td>a892b7fc-cf07-4674-8f62-2e97afa3a110</td>\n",
|
||
" <td>a892b7fc-cf07-4674-8f62-2e97afa3a110_4200998</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>42</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>06</td>\n",
|
||
" <td>3</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2020-03-02 07:21:00</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>14100052</td>\n",
|
||
" <td>2232732107883414193</td>\n",
|
||
" <td>electronics.audio.acoustic</td>\n",
|
||
" <td>yamaha</td>\n",
|
||
" <td>118.050003</td>\n",
|
||
" <td>536444961</td>\n",
|
||
" <td>c63c383b-ca94-486a-bfd0-030b44f427e6</td>\n",
|
||
" <td>c63c383b-ca94-486a-bfd0-030b44f427e6_14100052</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>7</td>\n",
|
||
" <td>21</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>07</td>\n",
|
||
" <td>6</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2020-03-01 12:23:39</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>3600666</td>\n",
|
||
" <td>2232732092297380188</td>\n",
|
||
" <td>appliances.kitchen.washer</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>308.859985</td>\n",
|
||
" <td>541091080</td>\n",
|
||
" <td>7660dc3e-4593-46da-8afe-6773e7dce6f2</td>\n",
|
||
" <td>7660dc3e-4593-46da-8afe-6773e7dce6f2_3600666</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-01</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>612</td>\n",
|
||
" <td>3</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 24 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" event_time event_type product_id category_id \\\n",
|
||
"0 2020-03-01 12:40:02 purchase 100110405 2053013551932506308 \n",
|
||
"1 2020-03-02 06:16:44 purchase 1004767 2232732093077520756 \n",
|
||
"2 2020-03-02 06:42:45 purchase 4200998 2053013557695480347 \n",
|
||
"3 2020-03-02 07:21:00 purchase 14100052 2232732107883414193 \n",
|
||
"4 2020-03-01 12:23:39 purchase 3600666 2232732092297380188 \n",
|
||
"\n",
|
||
" category_code brand price user_id \\\n",
|
||
"0 construction.tools.drill spotter 41.160000 552186714 \n",
|
||
"1 construction.tools.light samsung 241.830002 622603102 \n",
|
||
"2 appliances.environment.air_conditioner almacom 256.619995 512853766 \n",
|
||
"3 electronics.audio.acoustic yamaha 118.050003 536444961 \n",
|
||
"4 appliances.kitchen.washer samsung 308.859985 541091080 \n",
|
||
"\n",
|
||
" user_session \\\n",
|
||
"0 d9899f66-61fe-4320-8223-4e481816d452 \n",
|
||
"1 ac95e795-0395-4e1c-9813-be168aa3c0c5 \n",
|
||
"2 a892b7fc-cf07-4674-8f62-2e97afa3a110 \n",
|
||
"3 c63c383b-ca94-486a-bfd0-030b44f427e6 \n",
|
||
"4 7660dc3e-4593-46da-8afe-6773e7dce6f2 \n",
|
||
"\n",
|
||
" session_product ... cat_3 date \\\n",
|
||
"0 d9899f66-61fe-4320-8223-4e481816d452_100110405 ... NA 2020-03-01 \n",
|
||
"1 ac95e795-0395-4e1c-9813-be168aa3c0c5_1004767 ... NA 2020-03-02 \n",
|
||
"2 a892b7fc-cf07-4674-8f62-2e97afa3a110_4200998 ... NA 2020-03-02 \n",
|
||
"3 c63c383b-ca94-486a-bfd0-030b44f427e6_14100052 ... NA 2020-03-02 \n",
|
||
"4 7660dc3e-4593-46da-8afe-6773e7dce6f2_3600666 ... NA 2020-03-01 \n",
|
||
"\n",
|
||
" ts_hour ts_minute ts_weekday ts_day ts_month ts_year \\\n",
|
||
"0 12 40 6 1 3 2020 \n",
|
||
"1 6 16 0 2 3 2020 \n",
|
||
"2 6 42 0 2 3 2020 \n",
|
||
"3 7 21 0 2 3 2020 \n",
|
||
"4 12 23 6 1 3 2020 \n",
|
||
"\n",
|
||
" ts_weekday-ts_hour cat_0_Categorify \n",
|
||
"0 612 2 \n",
|
||
"1 06 2 \n",
|
||
"2 06 3 \n",
|
||
"3 07 6 \n",
|
||
"4 612 3 \n",
|
||
"\n",
|
||
"[5 rows x 24 columns]"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"categorify(ddf, 'cat_0', 10).head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4497bf2d-01e1-4786-a2e9-402b9985ceeb",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-3.4'></a>\n",
|
||
"### Count Encoding ###\n",
|
||
"\n",
|
||
"*Count Encoding* replaces each category with the number of times it occurs in the dataset. This count can be interpreted as the popularity of the category. \n",
|
||
"\n",
|
||
"For example, we can count how often each `user_id` appears with `cudf.Series.value_counts()`. Because users with similar activity levels receive similar values, this feature can help a machine learning model learn a shared behavior pattern for low-frequency users. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"id": "7b3a864c-2149-4348-9eda-97593ad50a5d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def count_encoding(df, cat):\n",
|
||
" # count the occurrences of each category value\n",
|
||
" count_df=df[cat].value_counts()\n",
|
||
" count_df=count_df.reset_index()\n",
|
||
" count_df.columns=[cat, cat+'_CE']\n",
|
||
" df=df.merge(count_df, on=cat)\n",
|
||
" return df"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"id": "fdaf7b68-770c-46f7-ad59-3051905658bf",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>event_time</th>\n",
|
||
" <th>event_type</th>\n",
|
||
" <th>product_id</th>\n",
|
||
" <th>category_id</th>\n",
|
||
" <th>category_code</th>\n",
|
||
" <th>brand</th>\n",
|
||
" <th>price</th>\n",
|
||
" <th>user_id</th>\n",
|
||
" <th>user_session</th>\n",
|
||
" <th>session_product</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>cat_3</th>\n",
|
||
" <th>date</th>\n",
|
||
" <th>ts_hour</th>\n",
|
||
" <th>ts_minute</th>\n",
|
||
" <th>ts_weekday</th>\n",
|
||
" <th>ts_day</th>\n",
|
||
" <th>ts_month</th>\n",
|
||
" <th>ts_year</th>\n",
|
||
" <th>ts_weekday-ts_hour</th>\n",
|
||
" <th>user_id_CE</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2020-03-04 14:49:19</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>2602033</td>\n",
|
||
" <td>2232732101835227701</td>\n",
|
||
" <td>UNKNOWN</td>\n",
|
||
" <td>bosch</td>\n",
|
||
" <td>643.490051</td>\n",
|
||
" <td>590324989</td>\n",
|
||
" <td>572278b1-f8cc-4941-acda-30eca4313a77</td>\n",
|
||
" <td>572278b1-f8cc-4941-acda-30eca4313a77_2602033</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-04</td>\n",
|
||
" <td>14</td>\n",
|
||
" <td>49</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>214</td>\n",
|
||
" <td>56</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2020-03-04 14:40:52</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>100115963</td>\n",
|
||
" <td>2232732089042600205</td>\n",
|
||
" <td>furniture.bedroom.bed</td>\n",
|
||
" <td>UNKNOWN</td>\n",
|
||
" <td>16.730001</td>\n",
|
||
" <td>590324989</td>\n",
|
||
" <td>572278b1-f8cc-4941-acda-30eca4313a77</td>\n",
|
||
" <td>572278b1-f8cc-4941-acda-30eca4313a77_100115963</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-04</td>\n",
|
||
" <td>14</td>\n",
|
||
" <td>40</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>214</td>\n",
|
||
" <td>56</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2020-03-19 10:52:11</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>100166722</td>\n",
|
||
" <td>2053013553887052089</td>\n",
|
||
" <td>apparel.shoes</td>\n",
|
||
" <td>quechua</td>\n",
|
||
" <td>22.910000</td>\n",
|
||
" <td>630257323</td>\n",
|
||
" <td>a84ace23-b9f6-46cd-ac90-7d29279e5e0a</td>\n",
|
||
" <td>a84ace23-b9f6-46cd-ac90-7d29279e5e0a_100166722</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-19</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>52</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>19</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>310</td>\n",
|
||
" <td>56</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2020-03-19 10:52:09</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>100166722</td>\n",
|
||
" <td>2053013553887052089</td>\n",
|
||
" <td>apparel.shoes</td>\n",
|
||
" <td>quechua</td>\n",
|
||
" <td>22.910000</td>\n",
|
||
" <td>630257323</td>\n",
|
||
" <td>a84ace23-b9f6-46cd-ac90-7d29279e5e0a</td>\n",
|
||
" <td>a84ace23-b9f6-46cd-ac90-7d29279e5e0a_100166722</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-19</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>52</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>19</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>310</td>\n",
|
||
" <td>56</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2020-03-19 10:52:06</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>100166722</td>\n",
|
||
" <td>2053013553887052089</td>\n",
|
||
" <td>apparel.shoes</td>\n",
|
||
" <td>quechua</td>\n",
|
||
" <td>22.910000</td>\n",
|
||
" <td>630257323</td>\n",
|
||
" <td>a84ace23-b9f6-46cd-ac90-7d29279e5e0a</td>\n",
|
||
" <td>a84ace23-b9f6-46cd-ac90-7d29279e5e0a_100166722</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-19</td>\n",
|
||
" <td>10</td>\n",
|
||
" <td>52</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>19</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>310</td>\n",
|
||
" <td>56</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 24 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" event_time event_type product_id category_id \\\n",
|
||
"0 2020-03-04 14:49:19 purchase 2602033 2232732101835227701 \n",
|
||
"1 2020-03-04 14:40:52 purchase 100115963 2232732089042600205 \n",
|
||
"2 2020-03-19 10:52:11 cart 100166722 2053013553887052089 \n",
|
||
"3 2020-03-19 10:52:09 cart 100166722 2053013553887052089 \n",
|
||
"4 2020-03-19 10:52:06 cart 100166722 2053013553887052089 \n",
|
||
"\n",
|
||
" category_code brand price user_id \\\n",
|
||
"0 UNKNOWN bosch 643.490051 590324989 \n",
|
||
"1 furniture.bedroom.bed UNKNOWN 16.730001 590324989 \n",
|
||
"2 apparel.shoes quechua 22.910000 630257323 \n",
|
||
"3 apparel.shoes quechua 22.910000 630257323 \n",
|
||
"4 apparel.shoes quechua 22.910000 630257323 \n",
|
||
"\n",
|
||
" user_session \\\n",
|
||
"0 572278b1-f8cc-4941-acda-30eca4313a77 \n",
|
||
"1 572278b1-f8cc-4941-acda-30eca4313a77 \n",
|
||
"2 a84ace23-b9f6-46cd-ac90-7d29279e5e0a \n",
|
||
"3 a84ace23-b9f6-46cd-ac90-7d29279e5e0a \n",
|
||
"4 a84ace23-b9f6-46cd-ac90-7d29279e5e0a \n",
|
||
"\n",
|
||
" session_product ... cat_3 date \\\n",
|
||
"0 572278b1-f8cc-4941-acda-30eca4313a77_2602033 ... NA 2020-03-04 \n",
|
||
"1 572278b1-f8cc-4941-acda-30eca4313a77_100115963 ... NA 2020-03-04 \n",
|
||
"2 a84ace23-b9f6-46cd-ac90-7d29279e5e0a_100166722 ... NA 2020-03-19 \n",
|
||
"3 a84ace23-b9f6-46cd-ac90-7d29279e5e0a_100166722 ... NA 2020-03-19 \n",
|
||
"4 a84ace23-b9f6-46cd-ac90-7d29279e5e0a_100166722 ... NA 2020-03-19 \n",
|
||
"\n",
|
||
" ts_hour ts_minute ts_weekday ts_day ts_month ts_year \\\n",
|
||
"0 14 49 2 4 3 2020 \n",
|
||
"1 14 40 2 4 3 2020 \n",
|
||
"2 10 52 3 19 3 2020 \n",
|
||
"3 10 52 3 19 3 2020 \n",
|
||
"4 10 52 3 19 3 2020 \n",
|
||
"\n",
|
||
" ts_weekday-ts_hour user_id_CE \n",
|
||
"0 214 56 \n",
|
||
"1 214 56 \n",
|
||
"2 310 56 \n",
|
||
"3 310 56 \n",
|
||
"4 310 56 \n",
|
||
"\n",
|
||
"[5 rows x 24 columns]"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"count_encoding(ddf, 'user_id').head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "22883677-31a0-4999-b51b-8bb0f63c21dc",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-3.5'></a>\n",
|
||
"### Target Encoding ###\n",
|
||
"\n",
|
||
"|x|y|\n",
|
||
"|-|-|\n",
|
||
"|0|0|\n",
|
||
"|0|1|\n",
|
||
"|1|0|\n",
|
||
"|1|1|\n",
|
||
"|1|1|\n",
|
||
"\n",
|
||
"Mean of `y` for `x = 0`: 1/2 <br>\n",
|
||
"Mean of `y` for `x = 1`: 2/3\n",
|
||
"\n",
|
||
"**Target Encoding** represents a categorical feature based on its effect on the target variable. One common technique is to replace values with the probability of the target given a category. Target encoding creates a new feature, which can be used by the model for training. The advantage of target encoding is that it processes the categorical features and makes them more easily accessible to the model during training and validation. \n",
|
||
"\n",
|
||
"Mathematically, target encoding on a binary target can be: \n",
|
||
"\n",
|
||
"$$p(t = 1 \\mid x = c_i)$$\n",
|
||
"\n",
|
||
"For a binary target, we can estimate the probability that the target is `true` (or `1`) by taking the mean of the target within each category group. This is also known as *Mean Encoding*. \n",
|
||
"\n",
|
||
"In other words, it calculates statistics, such as the arithmetic mean, from a target variable grouped by the unique values of one or more categorical features. \n",
|
||
"\n",
|
||
"<img src='images/tip.png' width=720>\n",
|
||
"\n",
|
||
"*Leakage*, also known as data leakage or target leakage, occurs when a model is trained with information that would not be available at the time of prediction. This inflates the model's performance score, which then overestimates the model's real-world utility. An example is including \"temperature_celsius\" as a feature when the value being predicted is \"temperature_fahrenheit\". Target encoding is prone to this problem because each row's encoded value is computed from target values, including its own. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"id": "4676161d-ccb5-49cc-92ee-f0402cdcc168",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def target_encoding(df, cat):\n",
|
||
" # mean of the 'target' column for each category value\n",
|
||
" te_df=df.groupby(cat)['target'].mean().reset_index()\n",
|
||
" te_df.columns=[cat, cat+'_TE']\n",
|
||
" df=df.merge(te_df, on=cat)\n",
|
||
" return df"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"id": "b04a7cea-08c8-43ba-b77b-8b13e70413ef",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>event_time</th>\n",
|
||
" <th>event_type</th>\n",
|
||
" <th>product_id</th>\n",
|
||
" <th>category_id</th>\n",
|
||
" <th>category_code</th>\n",
|
||
" <th>brand</th>\n",
|
||
" <th>price</th>\n",
|
||
" <th>user_id</th>\n",
|
||
" <th>user_session</th>\n",
|
||
" <th>session_product</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>cat_3</th>\n",
|
||
" <th>date</th>\n",
|
||
" <th>ts_hour</th>\n",
|
||
" <th>ts_minute</th>\n",
|
||
" <th>ts_weekday</th>\n",
|
||
" <th>ts_day</th>\n",
|
||
" <th>ts_month</th>\n",
|
||
" <th>ts_year</th>\n",
|
||
" <th>ts_weekday-ts_hour</th>\n",
|
||
" <th>brand_TE</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2020-03-02 09:59:59</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>1002544</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>apple</td>\n",
|
||
" <td>397.100006</td>\n",
|
||
" <td>620541411</td>\n",
|
||
" <td>4429a3fc-c281-4698-a306-08cebcf60556</td>\n",
|
||
" <td>4429a3fc-c281-4698-a306-08cebcf60556_1002544</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>59</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>09</td>\n",
|
||
" <td>0.481383</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2020-03-02 09:12:38</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>4804055</td>\n",
|
||
" <td>2232732079706079299</td>\n",
|
||
" <td>sport.bicycle</td>\n",
|
||
" <td>apple</td>\n",
|
||
" <td>193.739990</td>\n",
|
||
" <td>556813366</td>\n",
|
||
" <td>3b036554-77e4-4eb4-95d0-25977b78bc21</td>\n",
|
||
" <td>3b036554-77e4-4eb4-95d0-25977b78bc21_4804055</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>09</td>\n",
|
||
" <td>0.481383</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2020-03-02 08:40:27</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>100068491</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>299.309998</td>\n",
|
||
" <td>551196403</td>\n",
|
||
" <td>43fd3e13-d723-4edb-ba1d-f648103e78c2</td>\n",
|
||
" <td>43fd3e13-d723-4edb-ba1d-f648103e78c2_100068491</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>8</td>\n",
|
||
" <td>40</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>08</td>\n",
|
||
" <td>0.494324</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2020-03-02 09:05:09</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>4804056</td>\n",
|
||
" <td>2232732079706079299</td>\n",
|
||
" <td>sport.bicycle</td>\n",
|
||
" <td>apple</td>\n",
|
||
" <td>166.720001</td>\n",
|
||
" <td>610702122</td>\n",
|
||
" <td>f4a624f0-cce4-464b-bddc-1a8dd9d8044d</td>\n",
|
||
" <td>f4a624f0-cce4-464b-bddc-1a8dd9d8044d_4804056</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>9</td>\n",
|
||
" <td>5</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>09</td>\n",
|
||
" <td>0.481383</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2020-03-02 12:55:41</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>1002544</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>apple</td>\n",
|
||
" <td>397.100006</td>\n",
|
||
" <td>612025825</td>\n",
|
||
" <td>5130f545-28ad-4f1a-beb4-9a2795aa19eb</td>\n",
|
||
" <td>5130f545-28ad-4f1a-beb4-9a2795aa19eb_1002544</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NA</td>\n",
|
||
" <td>2020-03-02</td>\n",
|
||
" <td>12</td>\n",
|
||
" <td>55</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>2</td>\n",
|
||
" <td>3</td>\n",
|
||
" <td>2020</td>\n",
|
||
" <td>012</td>\n",
|
||
" <td>0.481383</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 24 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" event_time event_type product_id category_id \\\n",
|
||
"0 2020-03-02 09:59:59 purchase 1002544 2232732093077520756 \n",
|
||
"1 2020-03-02 09:12:38 purchase 4804055 2232732079706079299 \n",
|
||
"2 2020-03-02 08:40:27 purchase 100068491 2232732093077520756 \n",
|
||
"3 2020-03-02 09:05:09 purchase 4804056 2232732079706079299 \n",
|
||
"4 2020-03-02 12:55:41 purchase 1002544 2232732093077520756 \n",
|
||
"\n",
|
||
" category_code brand price user_id \\\n",
|
||
"0 construction.tools.light apple 397.100006 620541411 \n",
|
||
"1 sport.bicycle apple 193.739990 556813366 \n",
|
||
"2 construction.tools.light samsung 299.309998 551196403 \n",
|
||
"3 sport.bicycle apple 166.720001 610702122 \n",
|
||
"4 construction.tools.light apple 397.100006 612025825 \n",
|
||
"\n",
|
||
" user_session \\\n",
|
||
"0 4429a3fc-c281-4698-a306-08cebcf60556 \n",
|
||
"1 3b036554-77e4-4eb4-95d0-25977b78bc21 \n",
|
||
"2 43fd3e13-d723-4edb-ba1d-f648103e78c2 \n",
|
||
"3 f4a624f0-cce4-464b-bddc-1a8dd9d8044d \n",
|
||
"4 5130f545-28ad-4f1a-beb4-9a2795aa19eb \n",
|
||
"\n",
|
||
" session_product ... cat_3 date \\\n",
|
||
"0 4429a3fc-c281-4698-a306-08cebcf60556_1002544 ... NA 2020-03-02 \n",
|
||
"1 3b036554-77e4-4eb4-95d0-25977b78bc21_4804055 ... NA 2020-03-02 \n",
|
||
"2 43fd3e13-d723-4edb-ba1d-f648103e78c2_100068491 ... NA 2020-03-02 \n",
|
||
"3 f4a624f0-cce4-464b-bddc-1a8dd9d8044d_4804056 ... NA 2020-03-02 \n",
|
||
"4 5130f545-28ad-4f1a-beb4-9a2795aa19eb_1002544 ... NA 2020-03-02 \n",
|
||
"\n",
|
||
" ts_hour ts_minute ts_weekday ts_day ts_month ts_year \\\n",
|
||
"0 9 59 0 2 3 2020 \n",
|
||
"1 9 12 0 2 3 2020 \n",
|
||
"2 8 40 0 2 3 2020 \n",
|
||
"3 9 5 0 2 3 2020 \n",
|
||
"4 12 55 0 2 3 2020 \n",
|
||
"\n",
|
||
" ts_weekday-ts_hour brand_TE \n",
|
||
"0 09 0.481383 \n",
|
||
"1 09 0.481383 \n",
|
||
"2 08 0.494324 \n",
|
||
"3 09 0.481383 \n",
|
||
"4 012 0.481383 \n",
|
||
"\n",
|
||
"[5 rows x 24 columns]"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"target_encoding(ddf, 'brand').head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b36b33ba-c14c-41f9-b224-bf48ee654bd4",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a name='s3-3.6'></a>\n",
|
||
"### Embeddings ###\n",
|
||
"\n",
|
||
"Deep learning models often apply **Embedding Layers** to categorical features. Over the past few years, this has become an increasingly popular technique for encoding categorical features. Since the embeddings need to be trained through a neural network, we will cover this in the next lab. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"id": "811f0409-d1aa-4895-8c21-49ade3e0e7e5",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/opt/conda/envs/rapids/lib/python3.9/site-packages/dask/dataframe/multi.py:1287: UserWarning: Concatenating dataframes with unknown divisions.\n",
|
||
"We're assuming that the indices of each dataframes are \n",
|
||
" aligned. This assumption is not generally safe.\n",
|
||
" warnings.warn(\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>event_time</th>\n",
|
||
" <th>event_type</th>\n",
|
||
" <th>product_id</th>\n",
|
||
" <th>category_id</th>\n",
|
||
" <th>category_code</th>\n",
|
||
" <th>brand</th>\n",
|
||
" <th>price</th>\n",
|
||
" <th>user_id</th>\n",
|
||
" <th>user_session</th>\n",
|
||
" <th>session_product</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>furniture</th>\n",
|
||
" <th>kids</th>\n",
|
||
" <th>medicine</th>\n",
|
||
" <th>sport</th>\n",
|
||
" <th>stationery</th>\n",
|
||
" <th>product_id_Categorify</th>\n",
|
||
" <th>user_id_CE</th>\n",
|
||
" <th>product_id_CE</th>\n",
|
||
" <th>brand_TE</th>\n",
|
||
" <th>product_id_TE</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>2020-03-24 04:05:48</td>\n",
|
||
" <td>purchase</td>\n",
|
||
" <td>100068488</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>286.260010</td>\n",
|
||
" <td>519448564</td>\n",
|
||
" <td>bf12eb4d-bf0c-466a-90f8-44f3f18c072e</td>\n",
|
||
" <td>bf12eb4d-bf0c-466a-90f8-44f3f18c072e_100068488</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>6</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>34447</td>\n",
|
||
" <td>0.494324</td>\n",
|
||
" <td>0.533631</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2020-03-11 11:32:24</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>1005100</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>150.039993</td>\n",
|
||
" <td>613919337</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da_1005100</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>42304</td>\n",
|
||
" <td>0.494324</td>\n",
|
||
" <td>0.561436</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>2020-03-11 11:33:06</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>1005100</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>150.039993</td>\n",
|
||
" <td>613919337</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da_1005100</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>42304</td>\n",
|
||
" <td>0.494324</td>\n",
|
||
" <td>0.561436</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>2020-03-11 11:32:11</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>1005100</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>150.039993</td>\n",
|
||
" <td>613919337</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da_1005100</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>42304</td>\n",
|
||
" <td>0.494324</td>\n",
|
||
" <td>0.561436</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>2020-03-11 11:33:06</td>\n",
|
||
" <td>cart</td>\n",
|
||
" <td>1005100</td>\n",
|
||
" <td>2232732093077520756</td>\n",
|
||
" <td>construction.tools.light</td>\n",
|
||
" <td>samsung</td>\n",
|
||
" <td>150.039993</td>\n",
|
||
" <td>613919337</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da</td>\n",
|
||
" <td>43b7b450-3220-41f2-8df6-c8027e9982da_1005100</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>4</td>\n",
|
||
" <td>23</td>\n",
|
||
" <td>42304</td>\n",
|
||
" <td>0.494324</td>\n",
|
||
" <td>0.561436</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>5 rows × 42 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" event_time event_type product_id category_id \\\n",
|
||
"0 2020-03-24 04:05:48 purchase 100068488 2232732093077520756 \n",
|
||
"1 2020-03-11 11:32:24 cart 1005100 2232732093077520756 \n",
|
||
"2 2020-03-11 11:33:06 cart 1005100 2232732093077520756 \n",
|
||
"3 2020-03-11 11:32:11 cart 1005100 2232732093077520756 \n",
|
||
"4 2020-03-11 11:33:06 cart 1005100 2232732093077520756 \n",
|
||
"\n",
|
||
" category_code brand price user_id \\\n",
|
||
"0 construction.tools.light samsung 286.260010 519448564 \n",
|
||
"1 construction.tools.light samsung 150.039993 613919337 \n",
|
||
"2 construction.tools.light samsung 150.039993 613919337 \n",
|
||
"3 construction.tools.light samsung 150.039993 613919337 \n",
|
||
"4 construction.tools.light samsung 150.039993 613919337 \n",
|
||
"\n",
|
||
" user_session \\\n",
|
||
"0 bf12eb4d-bf0c-466a-90f8-44f3f18c072e \n",
|
||
"1 43b7b450-3220-41f2-8df6-c8027e9982da \n",
|
||
"2 43b7b450-3220-41f2-8df6-c8027e9982da \n",
|
||
"3 43b7b450-3220-41f2-8df6-c8027e9982da \n",
|
||
"4 43b7b450-3220-41f2-8df6-c8027e9982da \n",
|
||
"\n",
|
||
" session_product ... furniture kids \\\n",
|
||
"0 bf12eb4d-bf0c-466a-90f8-44f3f18c072e_100068488 ... 0 0 \n",
|
||
"1 43b7b450-3220-41f2-8df6-c8027e9982da_1005100 ... 0 0 \n",
|
||
"2 43b7b450-3220-41f2-8df6-c8027e9982da_1005100 ... 0 0 \n",
|
||
"3 43b7b450-3220-41f2-8df6-c8027e9982da_1005100 ... 0 0 \n",
|
||
"4 43b7b450-3220-41f2-8df6-c8027e9982da_1005100 ... 0 0 \n",
|
||
"\n",
|
||
" medicine sport stationery product_id_Categorify user_id_CE \\\n",
|
||
"0 0 0 0 6 23 \n",
|
||
"1 0 0 0 4 23 \n",
|
||
"2 0 0 0 4 23 \n",
|
||
"3 0 0 0 4 23 \n",
|
||
"4 0 0 0 4 23 \n",
|
||
"\n",
|
||
" product_id_CE brand_TE product_id_TE \n",
|
||
"0 34447 0.494324 0.533631 \n",
|
||
"1 42304 0.494324 0.561436 \n",
|
||
"2 42304 0.494324 0.561436 \n",
|
||
"3 42304 0.494324 0.561436 \n",
|
||
"4 42304 0.494324 0.561436 \n",
|
||
"\n",
|
||
"[5 rows x 42 columns]"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"ddf = one_hot(ddf, 'cat_0')\n",
|
||
"ddf = combine_cats(ddf, 'ts_weekday', 'ts_hour')\n",
|
||
"ddf = categorify(ddf, 'product_id', 100)\n",
|
||
"ddf = count_encoding(ddf, 'user_id')\n",
|
||
"ddf = count_encoding(ddf, 'product_id')\n",
|
||
"ddf = target_encoding(ddf, 'brand')\n",
|
||
"ddf = target_encoding(ddf, 'product_id')\n",
|
||
"ddf.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"id": "a09925b8-812e-470a-9b00-dfa899984eb1",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"{'status': 'ok', 'restart': True}"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# restart the kernel to release GPU memory\n",
|
||
"import IPython\n",
|
||
"app = IPython.Application.instance()\n",
|
||
"app.kernel.do_shutdown(True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d0408660-dd38-474c-a6df-529fdd91a3d7",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Well Done!** Let's move to the [next notebook](1_04_nvtabular_and_mgpu.ipynb)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "997bd6f7-9efb-4fee-b3d4-9d4454694c7b",
|
||
"metadata": {},
|
||
"source": [
|
||
"<a href=\"https://www.nvidia.com/dli\"> <img src=\"images/DLI_Header.png\" alt=\"Header\" style=\"width: 400px;\"/> </a>"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.9.16"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|