{ "cells": [ { "cell_type": "markdown", "id": "a5863a6f-d28f-4f3c-bceb-69a8d6482cc8", "metadata": {}, "source": [ " \"Header\" " ] }, { "cell_type": "markdown", "id": "e60bbdfc-6e2c-4275-ba3e-d72e70ab021b", "metadata": { "tags": [] }, "source": [ "# Enhancing Data Science Outcomes With Efficient Workflow #" ] }, { "cell_type": "markdown", "id": "b9e9d7e6-8b43-45f0-9de5-9889a67a229b", "metadata": {}, "source": [ "## 04 - NVTabular ##\n", "In this lab, you will learn the motivation behind doing data science on a GPU cluster. This lab covers the ETL, data exploration, and feature engineering steps of the data processing pipeline. Extract, transform, load, or [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load), is the process of converting raw data into a structure suitable for querying and analysis. Feature engineering, on the other hand, derives new features from the raw data so that they better represent the underlying problem to a model. \n", "\n", "**Table of Contents**\n", "
\n", "In this notebook, we will use NVTabular to perform feature engineering. This notebook covers the following sections: \n", "1. [NVTabular](#s4-1)\n", " * [Multi-GPU Scaling in NVTabular with Dask](#s4-1.1)\n", "2. [Operators](#s4-2)\n", "3. [Feature Engineering and Preprocessing with NVTabular](#s4-3)\n", " * [Defining the Workflow](#s4-3.1)\n", " * [Exercise #1 - Using NVTabular Operators](#s4-e1)\n", " * [Defining the Dataset](#s4-3.2)\n", " * [Fit, Transform, and Persist](#s4-3.3)\n", " * [Exercise #2 - Load Saved Workflow](#s4-e2)" ] }, { "cell_type": "markdown", "id": "2eaa7334-ffc5-4ab2-b76c-3bfb9d21b9cd", "metadata": {}, "source": [ "\n", "## NVTabular ##\n", "[NVTabular](https://nvidia-merlin.github.io/NVTabular/main/index.html) is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets. It provides a high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS [cuDF](https://docs.rapids.ai/api/cudf/stable/) library. While NVTabular is built upon the RAPIDS cuDF library, it extends cuDF by removing the constraint that data must fit into GPU memory. The API documentation can be found [here](https://nvidia-merlin.github.io/NVTabular/main/api.html#). 
\n", "\n", "Core features of NVTabular include: \n", "* Process data easily by leveraging built-in or custom operators specifically designed for machine learning algorithms\n", "* Carry out computations on the GPU with best practices baked into the library, realizing significant acceleration\n", "* Provide a higher-level API that greatly reduces code complexity while delivering the same level of performance\n", "* Work on arbitrarily large datasets when used with [Dask](https://www.dask.org/)\n", "* Minimize the number of passes through the data with [lazy execution](https://en.wikipedia.org/wiki/Lazy_evaluation)\n", "\n", "In doing so, NVTabular helps data scientists and machine learning engineers to: \n", "* Process datasets that exceed GPU and CPU memory without having to worry about scale\n", "* Focus on what to do with the data, not how to do it, by using abstraction at the operation level\n", "* Prepare datasets quickly and easily for experimentation so that more models can be trained\n", "\n", "Data science is an iterative process that requires extensive, repeated experimentation. The ability to perform feature engineering and preprocessing quickly translates into faster iteration cycles, which helps us arrive at an optimal solution. " ] }, { "cell_type": "markdown", "id": "dbf181ea-5e26-48f5-a4eb-e2b95112f744", "metadata": {}, "source": [ "\n", "### Multi-GPU Scaling in NVTabular with Dask ###\n", "NVTabular supports multi-GPU scaling with [Dask-CUDA](https://github.com/rapidsai/dask-cuda) and `dask.distributed`[[doc]](https://distributed.dask.org/en/latest/). For multi-GPU execution, NVTabular uses [Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) for internal data processing. Parallel performance can depend strongly on the size of the partitions, the shuffling procedure used for data output, and the arguments used for transformation operations. 
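The op-composition style introduced in the next section relies on deferred execution: each `>>` appends an operation to a graph rather than running it immediately, so the whole pipeline can later be executed in a minimal number of passes. A tiny pure-Python mock (the `ColumnGroup` class below is a hypothetical stand-in, not NVTabular's implementation) illustrates the pattern:

```python
# Pure-Python mock (NOT NVTabular) of the deferred ">>" composition pattern:
# each ">>" records an operation instead of executing it immediately.
class ColumnGroup:
    def __init__(self, columns, ops=None):
        self.columns = list(columns)
        self.ops = list(ops or [])

    def __rshift__(self, op):
        # composing returns a NEW ColumnGroup; nothing runs yet (lazy)
        return ColumnGroup(self.columns, self.ops + [op])

    def transform(self, table):
        # a single pass applies every queued op to each selected column
        out = dict(table)
        for op in self.ops:
            for col in self.columns:
                out[col] = [op(v) for v in out[col]]
        return out

add_one = lambda v: v + 1
double = lambda v: v * 2

# build the graph lazily, then execute it once on a toy table
features = ColumnGroup(["price"]) >> add_one >> double
table = {"price": [1, 2, 3], "brand": ["a", "b", "c"]}
result = features.transform(table)
print(result["price"])  # [4, 6, 8]
```

In the real library the left-hand side is a list of column names and the right-hand side is an `op` instance, but the key idea is the same: composition builds a graph, and execution is deferred until the workflow runs.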
" ] }, { "cell_type": "markdown", "id": "2ed12dc2-5f40-4fc2-b7c0-2a8535c81d7e", "metadata": {}, "source": [ "\n", "## Operators ##\n", "NVTabular provides a number of built-in data transformations, called `ops`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/generated/nvtabular.ops.Operator.html). An `op` is applied to a `ColumnGroup` via the overloaded `>>` operator, which in turn returns a new `ColumnGroup`. A `ColumnGroup` is a list of column names as text. \n", "\n", "```\n", "features = [ column_name_1, column_name_2, ...] >> op1 >> op2 >> ...\n", "```\n", "\n", "Since the Dataset API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex preprocessing operations that are not yet supported in NVTabular can still be accomplished with the Dask-cuDF API. \n", "\n", "Common operators include: \n", "* [Categorify](https://nvidia-merlin.github.io/NVTabular/main/api/ops/categorify.html) - transform categorical features into unique integer values\n", " * Can apply a frequency threshold to group low-frequency categories together\n", "* [TargetEncoding](https://nvidia-merlin.github.io/NVTabular/main/api/ops/targetencoding.html) - replace each categorical value with the group-specific mean of the target\n", " * Using `kfold=1` and `p_smooth=0` disables this additional logic\n", "* [Groupby](https://nvidia-merlin.github.io/NVTabular/main/api/ops/groupby.html) - transform a feature into the result of one or more groupby aggregations\n", " * **NOTE**: Does not move data between partitions, which means data should be shuffled by `groupby_cols`\n", "* [JoinGroupby](https://nvidia-merlin.github.io/NVTabular/main/api/ops/joingroupby.html) - add new features based on desired group-specific statistics of requested continuous features\n", " * Supported statistics include [`count`, `sum`, `mean`, `std`, `var`]. 
\n", "* [LogOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/log.html) - apply a log transform to continuous features\n", "* [FillMissing](https://nvidia-merlin.github.io/NVTabular/main/api/ops/fillmissing.html) - replace missing values with a pre-defined constant value\n", "* [Bucketize](https://nvidia-merlin.github.io/NVTabular/main/api/ops/bucketize.html) - transform continuous features into categorical features with bins based on provided bin boundaries\n", "* [LambdaOp](https://nvidia-merlin.github.io/NVTabular/main/api/ops/lambdaop.html) - enables custom row-wise dataframe manipulations with NVTabular\n", "* [Rename](https://nvidia-merlin.github.io/NVTabular/main/api/ops/rename.html) - rename columns\n", "* [Normalize](https://nvidia-merlin.github.io/NVTabular/main/api/ops/normalize.html) - standardize continuous features using the mean and standard deviation" ] }, { "cell_type": "code", "execution_count": 1, "id": "ef7617b1-c799-4c33-8c34-b339c7df649d", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/envs/rapids/lib/python3.9/site-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'\n", " warn(f\"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}\")\n", "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.\n", "Perhaps you already have a cluster running?\n", "Hosting the HTTP server on port 45589 instead\n", " warnings.warn(\n", "2026-02-11 13:20:08,688 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", "2026-02-11 13:20:08,688 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", "2026-02-11 13:20:08,712 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", "2026-02-11 13:20:08,712 - distributed.preloading - INFO - 
Import preload module: dask_cuda.initialize\n", "2026-02-11 13:20:08,718 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", "2026-02-11 13:20:08,718 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", "2026-02-11 13:20:08,723 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", "2026-02-11 13:20:08,723 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" ] } ], "source": [ "# import dependencies\n", "import nvtabular as nvt\n", "from nvtabular.ops import * \n", "\n", "from dask.distributed import Client, wait\n", "from dask_cuda import LocalCUDACluster\n", "import dask_cudf\n", "import cudf\n", "import gc\n", "\n", "# instantiate a Client\n", "cluster=LocalCUDACluster()\n", "client=Client(cluster)" ] }, { "cell_type": "code", "execution_count": null, "id": "b4646966-9d65-4d44-ad9a-abc0c68763bc", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# get the machine's external IP address\n", "from requests import get\n", "\n", "ip=get('https://api.ipify.org').content.decode('utf8')\n", "\n", "print(f'Dask dashboard (status) is accessible on http://{ip}:8787/status')\n", "print(f'Dask dashboard (gpu) is accessible on http://{ip}:8787/gpu')" ] }, { "cell_type": "code", "execution_count": 2, "id": "6941e910-3611-4edc-ba89-a480e5386fb6", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " event_time event_type product_id category_id \\\n", "0 2020-03-01 04:54:09 purchase 10301400 2232732104888681081 \n", "1 2020-03-01 04:55:26 purchase 15700285 2232732094134485388 \n", "2 2020-03-01 04:54:46 purchase 21406331 2232732082063278200 \n", "3 2020-03-01 07:45:47 purchase 1004665 2232732093077520756 \n", "4 2020-03-02 05:26:04 purchase 21400996 2232732082063278200 \n", "\n", " category_code brand price user_id \\\n", "0 apparel.scarf bburago 19.280001 537144080 \n", "1 UNKNOWN UNKNOWN 154.190002 514686549 \n", "2 electronics.clocks casio 30.369999 522564661 \n", "3 construction.tools.light samsung 816.690002 596178054 \n", "4 electronics.clocks casio 81.159996 537131991 \n", "\n", " user_session \\\n", "0 053d5ad3-01c7-4dfb-9079-e121c33b0938 \n", "1 3c842e53-1e47-4941-83e0-2a27a8fdeaf1 \n", "2 cfa89b7f-5b34-4d65-a135-bb924d98af9c \n", "3 f84b2b78-50a0-4e34-ad8d-da60a6178091 \n", "4 b19b380b-2db7-4c6e-bb91-998556315d0a \n", "\n", " session_product ... cat_1 cat_2 cat_3 \\\n", "0 053d5ad3-01c7-4dfb-9079-e121c33b0938_10301400 ... scarf NA NA \n", "1 3c842e53-1e47-4941-83e0-2a27a8fdeaf1_15700285 ... NA NA NA \n", "2 cfa89b7f-5b34-4d65-a135-bb924d98af9c_21406331 ... clocks NA NA \n", "3 f84b2b78-50a0-4e34-ad8d-da60a6178091_1004665 ... tools light NA \n", "4 b19b380b-2db7-4c6e-bb91-998556315d0a_21400996 ... 
clocks NA NA \n", "\n", " date ts_hour ts_minute ts_weekday ts_day ts_month ts_year \n", "0 2020-03-01 4 54 6 1 3 2020 \n", "1 2020-03-01 4 55 6 1 3 2020 \n", "2 2020-03-01 4 54 6 1 3 2020 \n", "3 2020-03-01 7 45 6 1 3 2020 \n", "4 2020-03-02 5 26 0 2 3 2020 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read data as Dask DataFrame\n", "ddf=dask_cudf.read_parquet('clean_parquet')\n", "\n", "# preview DataFrame\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "c0df8197-a6fe-46f3-8e77-231e7853b16e", "metadata": { "tags": [] }, "source": [ "\n", "## Feature Engineering and Preprocessing with NVTabular ##\n", "The typical steps for developing with NVTabular include: \n", "1. Design and Define Operations in the Pipeline\n", "2. Create Workflow\n", "3. Create Dataset\n", "4. Apply Workflow to Dataset\n", "\n", "
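The four steps above hinge on a two-pass pattern: `fit` streams over the dataset chunk by chunk to accumulate the statistics the operators need, and `transform` then applies those statistics, so no chunk ever has to hold the full dataset. The class and data below are a toy pure-Python stand-in for this pattern, not the NVTabular API:

```python
import math

# Toy illustration (NOT NVTabular) of the fit-then-transform workflow pattern:
# fit() accumulates dataset-wide statistics one chunk at a time, and
# transform() applies them in a second streaming pass.
class ToyNormalizeWorkflow:
    def fit(self, chunks):
        n = total = total_sq = 0.0
        for chunk in chunks:              # first pass: statistics only
            n += len(chunk)
            total += sum(chunk)
            total_sq += sum(v * v for v in chunk)
        self.mean = total / n
        self.std = math.sqrt(total_sq / n - self.mean ** 2)
        return self

    def transform(self, chunks):
        for chunk in chunks:              # second pass: apply the statistics
            yield [(v - self.mean) / self.std for v in chunk]

chunks = [[1.0, 2.0], [3.0, 4.0]]         # a "dataset" split into partitions
wf = ToyNormalizeWorkflow().fit(chunks)
out = [v for chunk in wf.transform(chunks) for v in chunk]
print(wf.mean)  # 2.5
```

This is exactly why the `Dataset` abstraction introduced below partitions data into GPU-memory-sized chunks: operators like `Normalize` and `Categorify` need dataset-wide statistics, which can be computed in this online fashion.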

" ] }, { "cell_type": "markdown", "id": "bedb235f-8d08-4a28-956c-a6298379a278", "metadata": {}, "source": [ "\n", "### Defining the Workflow ###\n", "We start by creating the `nvtabular.workflow.workflow.Workflow`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html), which defines the operations and preprocessing steps that we would like to perform on the data. \n", "\n", "We will perform the following feature engineering and preprocessing steps: \n", "* Categorify the categorical features\n", "* Log transform and normalize continuous features\n", "* Calculate group-specific `sum`, `count`, and `mean` of the `target` for categorical features\n", "* Log transform `price`\n", "* Calculate `product_id` specific relative `price` to average `price`\n", "* Target encode all categorical features\n", "\n", "One of the key advantages of using NVTabular is the high-level abstraction we can use, which simplifies code significantly. " ] }, { "cell_type": "code", "execution_count": 3, "id": "78b305b3-76f8-44ce-912d-f3adbdef7a69", "metadata": { "tags": [] }, "outputs": [], "source": [ "# assign features and label\n", "cat_cols=['brand', 'cat_0', 'cat_1', 'cat_2', 'cat_3']\n", "cont_cols=['price', 'ts_hour', 'ts_minute', 'ts_weekday']\n", "label='target'" ] }, { "cell_type": "code", "execution_count": 4, "id": "a527a25e-e6c2-4e27-a2af-ffdeb7992892", "metadata": { "tags": [] }, "outputs": [], "source": [ "# categorify categorical features\n", "cat_features=cat_cols >> Categorify()" ] }, { "cell_type": "markdown", "id": "fd1a2ac8-8b14-4664-9cd6-b34a2c98eea2", "metadata": {}, "source": [ "\n", "### Exercise #1 - Using NVTabular Operators ###\n", "We can use the `>>` operator to specify how columns will be transformed. We need to transform the `price` feature by performing the log transformation and normalization. \n", "\n", "**Instructions**:
\n", "* Review the documentation for the `LogOp()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/ops/log.html) and `Normalize()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/ops/normalize.html) operators. `LogOp` calculates the log of continuous columns; to handle the common case of zero-filled null values, it computes log(1+x) instead of log(x). `Normalize` standardizes features around 0 with a standard deviation of 1.\n", "* Modify the indicated placeholders only and execute the cell below to create a workflow. " ] }, { "cell_type": "code", "execution_count": 6, "id": "a734412d-c1f2-47bb-8016-fdc4e1c4c4b0", "metadata": { "tags": [] }, "outputs": [], "source": [ "# fill missing values, log transform, and normalize the price feature\n", "price = (\n", " ['price']\n", " >> FillMissing(0)\n", " >> LogOp()\n", " >> Normalize()\n", " >> LambdaOp(lambda col: col.astype(\"float32\"), dtype='float32')\n", ") " ] }, { "cell_type": "markdown", "id": "5b76f36b-d7a8-46c8-91c2-c68733dc1d99", "metadata": {}, "source": [ "There are several ways to create a relative-price-to-average feature. We will do so with the following steps: \n", "1. Calculate the average `price` per group\n", "2. Define a function to calculate the percentage difference\n", "3. 
Apply the user-defined function to `price` and the average `price`" ] }, { "cell_type": "code", "execution_count": 7, "id": "b0641d59-ad3e-4ffe-a7e3-144b4b98141b", "metadata": { "tags": [] }, "outputs": [], "source": [ "# relative price to the average price for the product_id\n", "# create product_id specific average price feature\n", "avg_price_product = ['product_id'] >> JoinGroupby(cont_cols=['price'], stats=[\"mean\"])\n", "\n", "# create user defined function to calculate percent difference\n", "def relative_price_to_avg(col, gdf):\n", " # introduce tiny number in case of 0\n", " epsilon = 1e-5\n", " col = ((gdf['price'] - col) / (col + epsilon)) * (col > 0).astype(int)\n", " return col\n", "\n", "# create product_id specific relative price to average\n", "relative_price_to_avg_product = (\n", " avg_price_product \n", " >> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64') \n", " >> Rename(name='relative_price_product')\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "id": "89136070-effd-4bd6-a7a8-8ae03cf40e31", "metadata": { "tags": [] }, "outputs": [], "source": [ "# create category_code specific average price feature\n", "avg_price_category = ['category_code'] >> JoinGroupby(cont_cols=['price'], stats=[\"mean\"])\n", "\n", "# create category_code specific relative price to average\n", "relative_price_to_avg_category = (\n", " avg_price_category \n", " >> LambdaOp(relative_price_to_avg, dependency=['price'], dtype='float64') \n", " >> Rename(name='relative_price_category')\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "id": "c9c77298-7884-4d05-9d3c-b2c7c277fa7a", "metadata": { "tags": [] }, "outputs": [], "source": [ "# calculate group-specific statistics for categorical features\n", "ce_features=cat_cols >> JoinGroupby(stats=['sum', 'count'], cont_cols=label)\n", "\n", "# target encode\n", "te_features=cat_cols >> TargetEncoding(label)" ] }, { "cell_type": "markdown", "id": "ba7db9e8-9597-4193-a54b-78752ac9e651", "metadata": {}, "source": [ "We also add the target, i.e. 
`label`, to the set of returned columns. We can visualize our data processing pipeline with `graphviz` by calling `.graph`. The data processing pipeline is a DAG (directed acyclic graph). " ] }, { "cell_type": "code", "execution_count": 10, "id": "09666729-0cba-438e-aeb6-97bb97c274a1", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features=cat_features+cont_cols+ce_features+te_features+price+relative_price_to_avg_product+relative_price_to_avg_category+[label]\n", "features.graph" ] }, { "cell_type": "markdown", "id": "696f1cdd-3a14-462c-9fea-e9f03a9a47d6", "metadata": {}, "source": [ "We are now ready to construct a `Workflow` that will run the operations we defined above. To enable distributed parallelism, the NVTabular `Workflow` must be initialized with a `dask.distributed.Client` object. Since NVTabular already uses Dask-cuDF for internal data processing, there are no other requirements for multi-GPU scaling. " ] }, { "cell_type": "code", "execution_count": 11, "id": "7ae89f60-b420-4871-ad5f-b7f292c56add", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/envs/rapids/lib/python3.9/site-packages/merlin/core/utils.py:361: FutureWarning: The `client` argument is deprecated from DaskExecutor and will be removed in a future version of NVTabular. 
By default, a global client in the same python context will be detected automatically, and `merlin.utils.set_dask_client` (as well as `Distributed` and `Serial`) can be used for explicit control.\n", " warnings.warn(\n" ] } ], "source": [ "# define our NVTabular Workflow with client to enable multi-GPU execution\n", "# for multi-GPU execution, the only requirement is that we specify a client when \n", "# initializing the NVTabular Workflow.\n", "workflow=nvt.Workflow(features, client=client)" ] }, { "cell_type": "markdown", "id": "88e03d7f-03b1-4968-855c-e8656e9ff062", "metadata": {}, "source": [ "\n", "### Defining the Dataset ###\n", "All external data need to be converted to the universal `nvtabular.io.dataset.Dataset`[[doc]](https://nvidia-merlin.github.io/NVTabular/v0.7.1/api/dataset.html) type. The main purpose of this class is to abstract away the raw format of the data, and to allow other NVTabular classes to reliably materialize a `dask.dataframe.DataFrame` collection and/or collection-based iterator on demand. \n", "\n", "The collection-based iterator is important when working with large datasets that do not fit into GPU memory since operations in the `Workflow` often require statistics calculated across the entire dataset. For example, `Normalize` requires measurements of the dataset mean and standard deviation, and `Categorify` requires an accounting of all the unique categories a particular feature can manifest. The `Dataset` object partitions the dataset into chunks that will fit into GPU memory to compute statistics in an online fashion. \n", "\n", "A `Dataset` can be initialized from a variety of different raw-data formats: \n", "1. With a parquet-dataset directory\n", "2. With a list of files\n", "3. 
From an existing cuDF DataFrame or `dask.dataframe.DataFrame` (in addition to data stored on disk)\n", "\n", "The data we pass to the `Dataset` constructor is usually the result of a query from some source, for example a data warehouse or data lake. The output is usually in Parquet, ORC, or CSV format. In our case, we have the data in parquet format saved on disk from previous steps. When initializing a `Dataset` from a directory path, the `engine` argument should be used to specify either `parquet` or `csv` format. If initializing a `Dataset` from a list of files, the engine can be inferred. \n", "\n", "Memory is an important consideration. The workflow will process data in chunks, so increasing the number of partitions will limit the memory footprint. Since we will initialize the `Dataset` with a DataFrame type (`cudf.DataFrame` or `dask.dataframe.DataFrame`), most of the parameters will be ignored and the partitions will be preserved. Otherwise, the data would be converted to a `dask.dataframe.DataFrame` with a maximum partition size of roughly 12.5% of the total memory on a single device by default. We can use the `npartitions` parameter to specify how many chunks the data should be split into. The partition size can be changed to a different fraction of total memory on a single device with the `part_mem_fraction` argument. Alternatively, a specific byte size can be specified with the `part_size` argument. \n", "\n", "
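As a back-of-envelope illustration of the default partition sizing described above — the 16 GiB device and 10 GiB dataset are hypothetical numbers, and `partition_plan` is a helper written just for this sketch, not part of NVTabular:

```python
# Illustrative arithmetic only: estimate partition size and count from the
# sizing rules described above (default part_mem_fraction of roughly 12.5%).
def partition_plan(dataset_bytes, device_mem_bytes, part_mem_fraction=0.125):
    part_size = int(device_mem_bytes * part_mem_fraction)  # bytes per partition
    npartitions = -(-dataset_bytes // part_size)           # ceiling division
    return part_size, npartitions

GiB = 1024 ** 3
part_size, nparts = partition_plan(dataset_bytes=10 * GiB,
                                   device_mem_bytes=16 * GiB)
print(part_size / GiB, nparts)  # 2.0 5
```

With these assumed numbers, 12.5% of a 16 GiB device gives 2 GiB partitions, so a 10 GiB dataset would be split into 5 partitions; passing a smaller `part_mem_fraction` or `part_size` would yield more, smaller partitions and a lower peak memory footprint.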

\n", "\n", "The NVTabular dataset should be created from Parquet files in order to get the best possible performance, preferably with a row group size of around 128MB. While NVTabular also supports reading from CSV files, reading CSV can be over twice as slow as reading from Parquet. It is recommended to convert a CSV dataset into Parquet format for use with NVTabular. " ] }, { "cell_type": "code", "execution_count": 12, "id": "25f69f56-7883-4630-a916-866c21984f77", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dataset is split into 4 partitions\n" ] } ], "source": [ "# create dataset\n", "dataset=nvt.Dataset(ddf)\n", "\n", "print(f'The Dataset is split into {dataset.npartitions} partitions')" ] }, { "cell_type": "markdown", "id": "a9d465b6-f695-43d5-b33c-adc4a19fc35b", "metadata": {}, "source": [ "\n", "### Fit, Transform, and Persist ###\n", "NVTabular follows a familiar API for pipeline operations. We can `.fit()` the workflow to a training set to calculate the statistics it requires. Afterwards, we can use it to `.transform()` the training and validation datasets. We will persist the transformed data to disk in parquet format for fast reading at train time. Importantly, we can use the `.save()`[[doc]](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html#nvtabular.workflow.workflow.Workflow.save) method so that our `Workflow` can be used during model inference. \n", "\n", "

\n", "\n", "Since the `Dataset` API can both ingest and output a Dask collection, it is straightforward to transform data either before or after an NVTabular workflow is executed. This means that some complex pre-processing operations, that are not yet supported in NVTabular, can still be accomplished with the `dask_cudf.DataFrame` API after the `Dataset` is converted with `.to_ddf`. " ] }, { "cell_type": "code", "execution_count": 13, "id": "2bff2a60-5892-4ee8-9277-3d5be0caff91", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/envs/rapids/lib/python3.9/site-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'\n", " warn(f\"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}\")\n", "/opt/conda/envs/rapids/lib/python3.9/site-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'\n", " warn(f\"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}\")\n", "/opt/conda/envs/rapids/lib/python3.9/site-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'\n", " warn(f\"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}\")\n", "/opt/conda/envs/rapids/lib/python3.9/site-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'\n", " warn(f\"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}\")\n" ] } ], "source": [ "# fit and transform dataset\n", "workflow.fit(dataset)\n", "output_dataset=workflow.transform(dataset)" ] }, { "cell_type": "code", "execution_count": 14, "id": 
"3eb48ab0-16a7-4b83-91d6-719af97acdca", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 32\n", "drwxr-xr-x 14 root root 4096 Feb 11 13:29 categories\n", "-rw-r--r-- 1 root root 187 Feb 11 13:29 metadata.json\n", "-rw-r--r-- 1 root root 21110 Feb 11 13:29 workflow.pkl\n" ] } ], "source": [ "# save the workflow\n", "workflow.save('nvt_workflow')\n", "\n", "!ls -l nvt_workflow" ] }, { "cell_type": "code", "execution_count": 15, "id": "c7ea31ca-4b58-409b-b8cd-f357dc2a1abc", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'processed_parquet/*': No such file or directory\n" ] } ], "source": [ "# remove existing parquet directory\n", "!rm -R processed_parquet/*\n", "\n", "# save output to parquet directory\n", "output_path='processed_parquet'\n", "output_dataset.to_parquet(output_path=output_path)" ] }, { "cell_type": "markdown", "id": "1e8ae451-3ba7-4a27-b8fa-d5d79d4768ab", "metadata": {}, "source": [ "If needed, we can convert the `Dataset` object to `dask.dataframe.DataFrame` to inspect the results. " ] }, { "cell_type": "code", "execution_count": 16, "id": "658f563b-76e0-4927-80e7-87a52d5f8d63", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
brandcat_0cat_1cat_2cat_3ts_hourts_minutets_weekdaybrand_target_sumbrand_count...cat_3_countTE_brand_targetTE_cat_0_targetTE_cat_1_targetTE_cat_2_targetTE_cat_3_targetpricerelative_price_productrelative_price_categorytarget
09526194345461271...24604050.2203510.3496300.2629900.3401220.410012-1.5839840.000000-0.6788531
155443455681333234973...24604050.3462850.2972940.2972940.3393050.4101210.031592-0.0341460.2619511
23471043454632598325...24604050.3873530.4020700.4297620.3407030.410679-1.2376760.000000-0.8835151
3333337456226354457906...24604050.4943550.4831190.4825310.4886410.4106791.350904-0.0173990.7727171
43471043526032598325...24604050.3969420.4013960.4279100.3393050.410121-0.473307-0.006232-0.6887081
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ " brand cat_0 cat_1 cat_2 cat_3 ts_hour ts_minute ts_weekday \\\n", "0 952 6 19 4 3 4 54 6 \n", "1 5 5 4 4 3 4 55 6 \n", "2 34 7 10 4 3 4 54 6 \n", "3 3 3 3 3 3 7 45 6 \n", "4 34 7 10 4 3 5 26 0 \n", "\n", " brand_target_sum brand_count ... cat_3_count TE_brand_target \\\n", "0 12 71 ... 2460405 0.220351 \n", "1 81333 234973 ... 2460405 0.346285 \n", "2 3259 8325 ... 2460405 0.387353 \n", "3 226354 457906 ... 2460405 0.494355 \n", "4 3259 8325 ... 2460405 0.396942 \n", "\n", " TE_cat_0_target TE_cat_1_target TE_cat_2_target TE_cat_3_target \\\n", "0 0.349630 0.262990 0.340122 0.410012 \n", "1 0.297294 0.297294 0.339305 0.410121 \n", "2 0.402070 0.429762 0.340703 0.410679 \n", "3 0.483119 0.482531 0.488641 0.410679 \n", "4 0.401396 0.427910 0.339305 0.410121 \n", "\n", " price relative_price_product relative_price_category target \n", "0 -1.583984 0.000000 -0.678853 1 \n", "1 0.031592 -0.034146 0.261951 1 \n", "2 -1.237676 0.000000 -0.883515 1 \n", "3 1.350904 -0.017399 0.772717 1 \n", "4 -0.473307 -0.006232 -0.688708 1 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert to DataFrame and preview\n", "output_dataset.to_ddf().head()" ] }, { "cell_type": "markdown", "id": "d8f1253e-19d0-4a36-b065-42324f5f3e2a", "metadata": {}, "source": [ "\n", "### Exercise #2 - Load Saved Workflow ###\n", "We can load a saved workflow, which will contain the graph, schema, and statistics. This is useful if the workflow should be applied to future datasets. \n", "\n", "**Instructions**:
\n", "* Review the [documentation](https://nvidia-merlin.github.io/NVTabular/main/api/workflow/workflow.html#nvtabular.workflow.workflow.Workflow.load) for the `.load()` _class_ method. \n", "* Modify the `` only and execute the cell below to create a workflow. \n", "* Execute the cell below to apply the graph of operators to transform the data. " ] }, { "cell_type": "code", "execution_count": 17, "id": "2748e6e8-a866-4c96-a534-decf2fe269c7", "metadata": { "tags": [] }, "outputs": [], "source": [ "# load workflow\n", "loaded_workflow=nvt.Workflow.load('nvt_workflow')" ] }, { "cell_type": "markdown", "id": "de1f6efe-e6cf-4576-88e6-6ae95dbd49b3", "metadata": {}, "source": [ "Click ... to show **solution**." ] }, { "cell_type": "code", "execution_count": 18, "id": "8ee80ee7-7162-4352-90d4-51bd2c6f1531", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
brandcat_0cat_1cat_2cat_3ts_hourts_minutets_weekdaybrand_target_sumbrand_count...cat_3_countTE_brand_targetTE_cat_0_targetTE_cat_1_targetTE_cat_2_targetTE_cat_3_targetpricerelative_price_productrelative_price_categorytarget
09526194345461271...24604050.2203510.3496300.2629900.3401220.410012-1.5839840.000000-0.6788531
155443455681333234973...24604050.3462850.2972940.2972940.3393050.4101210.031592-0.0341460.2619511
23471043454632598325...24604050.3873530.4020700.4297620.3407030.410679-1.2376760.000000-0.8835151
3333337456226354457906...24604050.4943550.4831190.4825310.4886410.4106791.350904-0.0173990.7727171
43471043526032598325...24604050.3969420.4013960.4279100.3393050.410121-0.473307-0.006232-0.6887081
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ " brand cat_0 cat_1 cat_2 cat_3 ts_hour ts_minute ts_weekday \\\n", "0 952 6 19 4 3 4 54 6 \n", "1 5 5 4 4 3 4 55 6 \n", "2 34 7 10 4 3 4 54 6 \n", "3 3 3 3 3 3 7 45 6 \n", "4 34 7 10 4 3 5 26 0 \n", "\n", " brand_target_sum brand_count ... cat_3_count TE_brand_target \\\n", "0 12 71 ... 2460405 0.220351 \n", "1 81333 234973 ... 2460405 0.346285 \n", "2 3259 8325 ... 2460405 0.387353 \n", "3 226354 457906 ... 2460405 0.494355 \n", "4 3259 8325 ... 2460405 0.396942 \n", "\n", " TE_cat_0_target TE_cat_1_target TE_cat_2_target TE_cat_3_target \\\n", "0 0.349630 0.262990 0.340122 0.410012 \n", "1 0.297294 0.297294 0.339305 0.410121 \n", "2 0.402070 0.429762 0.340703 0.410679 \n", "3 0.483119 0.482531 0.488641 0.410679 \n", "4 0.401396 0.427910 0.339305 0.410121 \n", "\n", " price relative_price_product relative_price_category target \n", "0 -1.583984 0.000000 -0.678853 1 \n", "1 0.031592 -0.034146 0.261951 1 \n", "2 -1.237676 0.000000 -0.883515 1 \n", "3 1.350904 -0.017399 0.772717 1 \n", "4 -0.473307 -0.006232 -0.688708 1 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create dataset from parquet directory\n", "dataset=nvt.Dataset('clean_parquet', engine='parquet')\n", "\n", "# transform dataset\n", "loaded_workflow.transform(dataset).to_ddf().head()" ] }, { "cell_type": "code", "execution_count": 19, "id": "48f8ca09-920d-48e9-aa78-7fab88364953", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'status': 'ok', 'restart': False}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# clean GPU memory\n", "import IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(restart=False)" ] }, { "cell_type": "markdown", "id": "5d921fa5-0f64-4e9c-bdc5-2c13432f1cc5", "metadata": {}, "source": [ " \"Header\" " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 
(ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }