From f6b0305d216847fb9d78958cbbc92c106e17b428 Mon Sep 17 00:00:00 2001 From: Pritam Dodeja Date: Wed, 25 May 2022 10:26:17 -0400 Subject: [PATCH 1/3] Added information about preprocessing layers. Added information about the usage of preprocessing layers to docs/get_started.ipynb, including caveats for which layers can be used, and which cannot, along with deserialization issues with Lambda layers. --- docs/get_started.ipynb | 2588 ++++++++++++++++++++++++---------------- 1 file changed, 1585 insertions(+), 1003 deletions(-) diff --git a/docs/get_started.ipynb b/docs/get_started.ipynb index aa4962d4..5e106926 100644 --- a/docs/get_started.ipynb +++ b/docs/get_started.ipynb @@ -1,1074 +1,1656 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "KpXGE33umpig" - }, - "source": [ - "\u003c!-- See: www.tensorflow.org/tfx/transform/ --\u003e\n", - "\n", - "# Get Started with TensorFlow Transform" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FT1xumYJ-oEW" - }, - "source": [ - "This guide introduces the basic concepts of `tf.Transform` and how to use them.\n", - "It will:\n", - "\n", - "* Define a *preprocessing function*, a logical description of the pipeline\n", - " that transforms the raw data into the data used to train a machine learning\n", - " model.\n", - "* Show the [Apache Beam](https://beam.apache.org/) implementation used to\n", - " transform data by converting the *preprocessing function* into a *Beam\n", - " pipeline*.\n", - "* Show additional usage examples." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9_SoiTcNmkVu" - }, - "source": [ - "## Setup" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gc6oSu9BnwJe" - }, - "outputs": [], - "source": [ - "!pip install -U tensorflow_transform" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pY_BNfLemjY4" - }, - "outputs": [], - "source": [ - "!pip install pyarrow" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L7Mtis2Jn2Af" - }, - "outputs": [], - "source": [ - "import pkg_resources\n", - "import importlib\n", - "importlib.reload(pkg_resources)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PvDoWUfynTWh" - }, - "outputs": [], - "source": [ - "import os\n", - "import tempfile\n", - "\n", - "import tensorflow as tf\n", - "import tensorflow_transform as tft\n", - "import tensorflow_transform.beam as tft_beam\n", - "\n", - "from tensorflow_transform.tf_metadata import dataset_metadata\n", - "from tensorflow_transform.tf_metadata import schema_utils\n", - "\n", - "from tfx_bsl.public import tfxio" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "j4W_yuSr-ro3" - }, - "source": [ - "## Define a preprocessing function\n", - "\n", - "The *preprocessing function* is the most important concept of `tf.Transform`.\n", - "The preprocessing function is a logical description of a transformation of the\n", - "dataset. The preprocessing function accepts and returns a dictionary of tensors,\n", - "where a *tensor* means `Tensor` or `SparseTensor`. There are two kinds of\n", - "functions used to define the preprocessing function:\n", - "\n", - "1. Any function that accepts and returns tensors. These add TensorFlow\n", - " operations to the graph that transform raw data into transformed data.\n", - "2. Any of the *analyzers* provided by `tf.Transform`. 
Analyzers also accept\n", - " and return tensors, but unlike TensorFlow functions, they *do not* add\n", - " operations to the graph. Instead, analyzers cause `tf.Transform` to compute\n", - " a full-pass operation outside of TensorFlow. They use the input tensor values\n", - " over the entire dataset to generate a constant tensor that is returned as the\n", - " output. For example, `tft.min` computes the minimum of a tensor over the\n", - " dataset. `tf.Transform` provides a fixed set of analyzers, but this will be\n", - " extended in future versions.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "72ff0efc" - }, - "source": [ - "### Preprocessing function example\n", - "\n", - "By combining analyzers and regular TensorFlow functions, users can create\n", - "flexible pipelines for transforming data. The following preprocessing function\n", - "transforms each of the three features in different ways, and combines two of the\n", - "features:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "c6bf64fe" - }, - "outputs": [], - "source": [ - "def preprocessing_fn(inputs):\n", - " x = inputs['x']\n", - " y = inputs['y']\n", - " s = inputs['s']\n", - " x_centered = x - tft.mean(x)\n", - " y_normalized = tft.scale_to_0_1(y)\n", - " s_integerized = tft.compute_and_apply_vocabulary(s)\n", - " x_centered_times_y_normalized = x_centered * y_normalized\n", - " return {\n", - " 'x_centered': x_centered,\n", - " 'y_normalized': y_normalized,\n", - " 'x_centered_times_y_normalized': x_centered_times_y_normalized,\n", - " 's_integerized': s_integerized\n", - " }" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LU8aclPGAZLX" - }, - "source": [ - "Here, `x`, `y` and `s` are `Tensor`s that represent input features. The first\n", - "new tensor that is created, `x_centered`, is built by applying `tft.mean` to `x`\n", - "and subtracting this from `x`. `tft.mean(x)` returns a tensor representing the\n", - "mean of the tensor `x`. `x_centered` is the tensor `x` with the mean subtracted.\n", - "\n", - "The second new tensor, `y_normalized`, is created in a similar manner but using\n", - "the convenience method `tft.scale_to_0_1`. This method does something similar to\n", - "computing `x_centered`, namely computing a maximum and minimum and using these\n", - "to scale `y`.\n", - "\n", - "The tensor `s_integerized` shows an example of string manipulation. In this\n", - "case, we take a string and map it to an integer. This uses the convenience\n", - "function `tft.compute_and_apply_vocabulary`. This function uses an analyzer to\n", - "compute the unique values taken by the input strings, and then uses TensorFlow\n", - "operations to convert the input strings to indices in the table of unique\n", - "values.\n", - "\n", - "The final column shows that it is possible to use TensorFlow operations to\n", - "create new features by combining tensors.\n", - "\n", - "The preprocessing function defines a pipeline of operations on a dataset. In\n", - "order to apply the pipeline, we rely on a concrete implementation of the\n", - "`tf.Transform` API. The Apache Beam implementation provides `PTransform` which\n", - "applies a user's preprocessing function to data. The typical workflow of a\n", - "`tf.Transform` user will construct a preprocessing function, then incorporate\n", - "this into a larger Beam pipeline, creating the data for training." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nxnXDEK1AezF" - }, - "source": [ - "### Batching\n", - "\n", - "Batching is an important part of TensorFlow. Since one of the goals of\n", - "`tf.Transform` is to provide a TensorFlow graph for preprocessing that can be\n", - "incorporated into the serving graph (and, optionally, the training graph),\n", - "batching is also an important concept in `tf.Transform`.\n", - "\n", - "While not obvious in the example above, the user defined preprocessing function\n", - "is passed tensors representing *batches* and not individual instances, as\n", - "happens during training and serving with TensorFlow. On the other hand,\n", - "analyzers perform a computation over the entire dataset that returns a single\n", - "value and not a batch of values. `x` is a `Tensor` with a shape of\n", - "`(batch_size,)`, while `tft.mean(x)` is a `Tensor` with a shape of `()`. The\n", - "subtraction `x - tft.mean(x)` broadcasts where the value of `tft.mean(x)` is\n", - "subtracted from every element of the batch represented by `x`." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "09bf63cc" - }, - "source": [ - "## Apache Beam Implementation\n", - "\n", - "While the *preprocessing function* is intended as a logical description of a\n", - "*preprocessing pipeline* implemented on multiple data processing frameworks,\n", - "`tf.Transform` provides a canonical implementation used on Apache Beam. This\n", - "implementation demonstrates the functionality required from an implementation.\n", - "There is no formal API for this functionality, so each implementation can use an\n", - "API that is idiomatic for its particular data processing framework.\n", - "\n", - "The Apache Beam implementation provides two `PTransform`s used to process data\n", - "for a preprocessing function. The following shows the usage for the composite\n", - "`PTransform` - `tft_beam.AnalyzeAndTransformDataset`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2e1e01ec" - }, - "outputs": [], - "source": [ - "raw_data = [\n", - " {'x': 1, 'y': 1, 's': 'hello'},\n", - " {'x': 2, 'y': 2, 's': 'world'},\n", - " {'x': 3, 'y': 3, 's': 'hello'}\n", - "]\n", - "\n", - "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", - " schema_utils.schema_from_feature_spec({\n", - " 'y': tf.io.FixedLenFeature([], tf.float32),\n", - " 'x': tf.io.FixedLenFeature([], tf.float32),\n", - " 's': tf.io.FixedLenFeature([], tf.string),\n", - " }))\n", - "\n", - "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", - " transformed_dataset, transform_fn = (\n", - " (raw_data, raw_data_metadata) |\n", - " tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jl2gbkvUICd_" - }, - "outputs": [], - "source": [ - "transformed_data, transformed_metadata = transformed_dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e6029b09" - }, - "source": [ - "The `transformed_data` content is shown below and contains the transformed\n", - "columns in the same format as the raw data. In particular, the values of\n", - "`s_integerized` are `[0, 1, 0]`—these values depend on how the words `hello` and\n", - "`world` were mapped to integers, which is deterministic. For the column\n", - "`x_centered`, we subtracted the mean so the values of the column `x`, which were\n", - "`[1.0, 2.0, 3.0]`, became `[-1.0, 0.0, 1.0]`. 
Similarly, the rest of the columns\n", - "match their expected values." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "vcMpG2bFFcgP" - }, - "outputs": [], - "source": [ - "transformed_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0ee0d9ac" - }, - "source": [ - "Both `raw_data` and `transformed_data` are datasets. The next two sections show\n", - "how the Beam implementation represents datasets and how to read and write data\n", - "to disk. The other return value, `transform_fn`, represents the transformation\n", - "applied to the data, covered in detail below.\n", - "\n", - "The `tft_beam.AnalyzeAndTransformDataset` class is the composition of the two\n", - "fundamental transforms provided by the implementation\n", - "`tft_beam.AnalyzeDataset` and `tft_beam.TransformDataset`. So the following\n", - "two code snippets are equivalent:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "BCZqx7OfGjZ_" - }, - "outputs": [], - "source": [ - "my_data = (raw_data, raw_data_metadata)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "816cdb9b" - }, - "outputs": [], - "source": [ - "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", - " transformed_data, transform_fn = (\n", - " my_data | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dJImGAaeHDTo" - }, - "outputs": [], - "source": [ - "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", - " transform_fn = my_data | tft_beam.AnalyzeDataset(preprocessing_fn)\n", - " transformed_data = (my_data, transform_fn) | tft_beam.TransformDataset()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M4kl5IA5H29G" - }, - "source": [ - "`transform_fn` is a pure function that represents an operation that is applied\n", - "to each row of the dataset. In particular, the analyzer values are already\n", - "computed and treated as constants. In the example, the `transform_fn` contains\n", - "as constants the mean of column `x`, the min and max of column `y`, and the\n", - "vocabulary used to map the strings to integers.\n", - "\n", - "An important feature of `tf.Transform` is that `transform_fn` represents a map\n", - "*over rows*—it is a pure function applied to each row separately. All of the\n", - "computation for aggregating rows is done in `AnalyzeDataset`. Furthermore, the\n", - "`transform_fn` is represented as a TensorFlow `Graph` which can be embedded into\n", - "the serving graph.\n", - "\n", - "`AnalyzeAndTransformDataset` is provided for optimizations in this special case.\n", - "This is the same pattern used in\n", - "[scikit-learn](http://scikit-learn.org/stable/index.html), providing the `fit`,\n", - "`transform`, and `fit_transform` methods.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2bedd48a" - }, - "source": [ - "## Data Formats and Schema\n", - "\n", - "TFT Beam implementation accepts two different input data formats. 
The\n", - "\"instance dict\" format (as seen in the example above and [simple.ipynb](https://www.tensorflow.org/tfx/tutorials/transform/simple) \u0026 [simple_example.py](https://github.com/tensorflow/transform/blob/master/examples/simple_example.py))\n", - "is an intuitive format and is suitable for small datasets while the TFXIO\n", - "([Apache Arrow](https://arrow.apache.org)) format provides improved performance\n", - "and is suitble for large datasets.\n", - "\n", - "The \"metadata\" accompanying the `PCollection` tells the Beam implementation the format of the `PCollection`.\n", - "\n", - "```\n", - "(raw_data, raw_data_metadata) | tft.AnalyzeDataset(...)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5dc76c5a" - }, - "source": [ - "- If `raw_data_metadata` is a `dataset_metadata.DatasetMetadata` (see below,\n", - " \"The 'instance dict' format\" section),\n", - " then `raw_data` is expected to be in the \"instance dict\" format.\n", - "- If `raw_data_metadata` is a `tfxio.TensorAdapterConfig`\n", - " (see below, \"The TFXIO format\" section), then `raw_data` is expected to be\n", - " in the TFXIO format." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XPjE0a7kNU5i" - }, - "source": [ - "### The \"instance dict\" format\n", - "\n", - "The previous code examples used this format. The metadata contains the schema that defines the layout of the data and how it is read from and written to various formats. Even this in-memory format is not self-describing and requires the schema in order to be interpreted as tensors.\n", - "\n", - "Again, here is the definition of the schema for the example data:\n", - "\n", - "\u003c!--\n", - "TODO(b/223384488): Switch to `tft.DatasetMetadata.from_feature_spec` once\n", - "version 1.8 is released.\n", - "--\u003e" - ] - }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "KpXGE33umpig" + }, + "source": [ + "\n", + "\n", + "# Get Started with TensorFlow Transform" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FT1xumYJ-oEW" + }, + "source": [ + "This guide introduces the basic concepts of `tf.Transform` and how to use them.\n", + "It will:\n", + "\n", + "* Define a *preprocessing function*, a logical description of the pipeline\n", + " that transforms the raw data into the data used to train a machine learning\n", + " model.\n", + "* Show the [Apache Beam](https://beam.apache.org/) implementation used to\n", + " transform data by converting the *preprocessing function* into a *Beam\n", + " pipeline*.\n", + "* Show additional usage examples." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9_SoiTcNmkVu" + }, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gc6oSu9BnwJe" + }, + "outputs": [], + "source": [ + "!pip install -U tensorflow_transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pY_BNfLemjY4" + }, + "outputs": [], + "source": [ + "!pip install pyarrow" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "L7Mtis2Jn2Af" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "372894b6" - }, - "outputs": [], - "source": [ - "from tensorflow_transform.tf_metadata import dataset_metadata\n", - "from tensorflow_transform.tf_metadata import schema_utils\n", - "\n", - "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", - " schema_utils.schema_from_feature_spec({\n", - " 's': tf.io.FixedLenFeature([], tf.string),\n", - " 'y': tf.io.FixedLenFeature([], tf.float32),\n", - " 'x': tf.io.FixedLenFeature([], tf.float32),\n", - " }))" + "data": { + "text/plain": [ + "" ] - }, + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pkg_resources\n", + "import importlib\n", + "importlib.reload(pkg_resources)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "PvDoWUfynTWh" + }, + "outputs": [], + "source": [ + "import os\n", + "import tempfile\n", + "\n", + "import tensorflow as tf\n", + "import tensorflow_transform as tft\n", + "import tensorflow_transform.beam as tft_beam\n", + "\n", + "from tensorflow_transform.tf_metadata import dataset_metadata\n", + "from tensorflow_transform.tf_metadata import schema_utils\n", + "\n", + "from tfx_bsl.public import tfxio" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j4W_yuSr-ro3" + }, + "source": [ + "## Define a preprocessing function\n", + "\n", + "The *preprocessing function* is the most important concept of `tf.Transform`.\n", + "The preprocessing function is a logical description of a transformation of the\n", + "dataset. The preprocessing function accepts and returns a dictionary of tensors,\n", + "where a *tensor* means `Tensor` or `SparseTensor`. There are two kinds of\n", + "functions used to define the preprocessing function:\n", + "\n", + "1. Any function that accepts and returns tensors. These add TensorFlow\n", + " operations to the graph that transform raw data into transformed data.\n", + "2. Any of the *analyzers* provided by `tf.Transform`. Analyzers also accept\n", + " and return tensors, but unlike TensorFlow functions, they *do not* add\n", + " operations to the graph. Instead, analyzers cause `tf.Transform` to compute\n", + " a full-pass operation outside of TensorFlow. They use the input tensor values\n", + " over the entire dataset to generate a constant tensor that is returned as the\n", + " output. For example, `tft.min` computes the minimum of a tensor over the\n", + " dataset. `tf.Transform` provides a fixed set of analyzers, but this will be\n", + " extended in future versions.\n", + "3. Any stateless [preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) (i.e. these layers must not invoke the ```adapt()``` method). These can be added as operations to the graph as they do not require a full pass over the data outside of the management of ```tf.Transform```. For example, you can add

\n",
+    "[tf.keras.layers.experimental.preprocessing.HashedCrossing](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/HashedCrossing),\n",
+    "but not\n",
+    "[tf.keras.layers.Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization),\n",
+    "
as the latter needs to be adapted over the entire dataset. Do note that if you use [Lambda layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda), there are some de-serialization limitations which might prevent ```preprocessing_fn``` from being fully re-loaded off of disk by [tft.TFTransformOutput](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/TFTransformOutput). \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "72ff0efc" + }, + "source": [ + "### Preprocessing function example\n", + "\n", + "By combining analyzers and regular TensorFlow functions, users can create\n", + "flexible pipelines for transforming data. The following preprocessing function\n", + "transforms each of the three features in different ways, and combines two of the\n", + "features:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "c6bf64fe" + }, + "outputs": [], + "source": [ + "def preprocessing_fn(inputs):\n", + " x = inputs['x']\n", + " y = inputs['y']\n", + " s = inputs['s']\n", + " x_centered = x - tft.mean(x)\n", + " y_normalized = tft.scale_to_0_1(y)\n", + " s_integerized = tft.compute_and_apply_vocabulary(s)\n", + " x_centered_times_y_normalized = x_centered * y_normalized\n", + " return {\n", + " 'x_centered': x_centered,\n", + " 'y_normalized': y_normalized,\n", + " 'x_centered_times_y_normalized': x_centered_times_y_normalized,\n", + " 's_integerized': s_integerized\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LU8aclPGAZLX" + }, + "source": [ + "Here, `x`, `y` and `s` are `Tensor`s that represent input features. The first\n", + "new tensor that is created, `x_centered`, is built by applying `tft.mean` to `x`\n", + "and subtracting this from `x`. `tft.mean(x)` returns a tensor representing the\n", + "mean of the tensor `x`. `x_centered` is the tensor `x` with the mean subtracted.\n", + "\n", + "The second new tensor, `y_normalized`, is created in a similar manner but using\n", + "the convenience method `tft.scale_to_0_1`. This method does something similar to\n", + "computing `x_centered`, namely computing a maximum and minimum and using these\n", + "to scale `y`.\n", + "\n", + "The tensor `s_integerized` shows an example of string manipulation. In this\n", + "case, we take a string and map it to an integer. This uses the convenience\n", + "function `tft.compute_and_apply_vocabulary`. This function uses an analyzer to\n", + "compute the unique values taken by the input strings, and then uses TensorFlow\n", + "operations to convert the input strings to indices in the table of unique\n", + "values.\n", + "\n", + "The final column shows that it is possible to use TensorFlow operations to\n", + "create new features by combining tensors.\n", + "\n", + "The preprocessing function defines a pipeline of operations on a dataset. In\n", + "order to apply the pipeline, we rely on a concrete implementation of the\n", + "`tf.Transform` API. The Apache Beam implementation provides `PTransform` which\n", + "applies a user's preprocessing function to data. The typical workflow of a\n", + "`tf.Transform` user will construct a preprocessing function, then incorporate\n", + "this into a larger Beam pipeline, creating the data for training." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nxnXDEK1AezF" + }, + "source": [ + "### Batching\n", + "\n", + "Batching is an important part of TensorFlow. 
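\n",
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The sketch below previews the batch-versus-scalar shapes discussed in the rest of this section. It is plain TensorFlow, with `tf.reduce_mean` standing in for `tft.mean` (whose output is likewise a scalar):\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A batch has shape (batch_size,); an analyzer result is a scalar with shape ().\n",
+    "# tf.reduce_mean stands in here for tft.mean.\n",
+    "x = tf.constant([1.0, 2.0, 3.0])  # a batch of 3 instances, shape (3,)\n",
+    "mean = tf.reduce_mean(x)          # a scalar, shape ()\n",
+    "print(x - mean)                   # broadcasts: [-1.  0.  1.]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "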
Since one of the goals of\n", + "`tf.Transform` is to provide a TensorFlow graph for preprocessing that can be\n", + "incorporated into the serving graph (and, optionally, the training graph),\n", + "batching is also an important concept in `tf.Transform`.\n", + "\n", + "While not obvious in the example above, the user defined preprocessing function\n", + "is passed tensors representing *batches* and not individual instances, as\n", + "happens during training and serving with TensorFlow. On the other hand,\n", + "analyzers perform a computation over the entire dataset that returns a single\n", + "value and not a batch of values. `x` is a `Tensor` with a shape of\n", + "`(batch_size,)`, while `tft.mean(x)` is a `Tensor` with a shape of `()`. The\n", + "subtraction `x - tft.mean(x)` broadcasts where the value of `tft.mean(x)` is\n", + "subtracted from every element of the batch represented by `x`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "09bf63cc" + }, + "source": [ + "## Apache Beam Implementation\n", + "\n", + "While the *preprocessing function* is intended as a logical description of a\n", + "*preprocessing pipeline* implemented on multiple data processing frameworks,\n", + "`tf.Transform` provides a canonical implementation used on Apache Beam. This\n", + "implementation demonstrates the functionality required from an implementation.\n", + "There is no formal API for this functionality, so each implementation can use an\n", + "API that is idiomatic for its particular data processing framework.\n", + "\n", + "The Apache Beam implementation provides two `PTransform`s used to process data\n", + "for a preprocessing function. The following shows the usage for the composite\n", + "`PTransform` - `tft_beam.AnalyzeAndTransformDataset`:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "2e1e01ec" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "58c2402c" - }, - "source": [ - "The `Schema` proto contains the information needed to parse the\n", - "data from its on-disk or in-memory format, into tensors. It is typically\n", - "constructed by calling `schema_utils.schema_from_feature_spec` with a dict\n", - "mapping feature keys to `tf.io.FixedLenFeature`, `tf.io.VarLenFeature`, and\n", - "`tf.io.SparseFeature` values. See the documentation for\n", - "[`tf.parse_example`](https://www.tensorflow.org/api_docs/python/tf/parse_example)\n", - "for more details.\n", - "\n", - "Above we use `tf.io.FixedLenFeature` to indicate that each feature contains a\n", - "fixed number of values, in this case a single scalar value. Because\n", - "`tf.Transform` batches instances, the actual `Tensor` representing the feature\n", - "will have shape `(None,)` where the unknown dimension is the batch dimension.\n" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "jatXeEayOhza" - }, - "source": [ - "### The TFXIO format\n", - "\n", - "With this format, the data is expected to be contained in a\n", - "[`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html).\n", - "For tabular data, our Apache Beam implementation\n", - "accepts Arrow `RecordBatch`es that consist of columns of the following types:\n", - "\n", - " - `pa.list_(\u003cprimitive\u003e)`, where `\u003cprimitive\u003e` is `pa.int64()`, `pa.float32()`\n", - " `pa.binary()` or `pa.large_binary()`.\n", - "\n", - " - `pa.large_list(\u003cprimitive\u003e)`\n", - "\n", - "The toy input dataset we used above, when represented as a `RecordBatch`, looks\n", - "like the following:" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "fd01900a" - }, - "outputs": [], - "source": [ - "import pyarrow as pa\n", - "\n", - "raw_data = [\n", - " pa.record_batch(\n", - " data=[\n", - " pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n", - " pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n", - " pa.array([['hello'], ['world'], ['hello']], pa.list_(pa.binary())),\n", - " ],\n", - " names=['x', 'y', 's'])\n", - "]" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "114d171e" - }, - "source": [ - "Similar to the `dataset_metadata.DatasetMetadata` instance that accompanies the \"instance dict\" format, a `tfxio.TensorAdapterConfig`\n", - "is must accompany the `RecordBatch`es. It consists of the Arrow schema of\n", - "the `RecordBatch`es, and\n", - "`tfxio.TensorRepresentations` to uniquely determine how columns in `RecordBatch`es can be interpreted as TensorFlow Tensors (including but not limited to `tf.Tensor`, `tf.SparseTensor`).\n", - "\n", - "`tfxio.TensorRepresentations` is type alias for a `Dict[str, tensorflow_metadata.proto.v0.schema_pb2.TensorRepresentation]` which\n", - "establishes the relationship between a Tensor that a `preprocessing_fn` accepts\n", - "and columns in the `RecordBatch`es. For example:" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", + "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", + "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "b8478d18" - }, - "outputs": [], - "source": [ - "from google.protobuf import text_format\n", - "from tensorflow_metadata.proto.v0 import schema_pb2\n", - "\n", - "tensor_representation = {\n", - " 'x': text_format.Parse(\n", - " \"\"\"dense_tensor { column_name: \"col1\" shape { dim { size: 2 } } }\"\"\",\n", - " schema_pb2.TensorRepresentation())\n", - "}" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/05913662504346a59fa1f1348c34f3ef/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "ZAqE0Fb2Ta47" - }, - "source": [ - "Means that `inputs['x']` in `preprocessing_fn` should be a dense `tf.Tensor`,\n", - "whose values come from a column of name `'col1'` in the input `RecordBatch`es,\n", - "and its (batched) shape should be `[batch_size, 2]`.\n", - "\n", - "A `schema_pb2.TensorRepresentation` is a Protobuf defined in\n", - "[TensorFlow Metadata](https://github.com/tensorflow/metadata/blob/v0.22.2/tensorflow_metadata/proto/v0/schema.proto#L592)." - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/05913662504346a59fa1f1348c34f3ef/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "qbPmiaJOTe09" - }, - "source": [ - "## Compatibility with TensorFlow\n", - "\n", - "`tf.Transform` provides support for exporting the `transform_fn` as\n", - "a SavedModel, see the [simple tutorial](https://www.tensorflow.org/tfx/tutorials/transform/simple) for an example. The default behavior before the `0.30` release\n", - "exported a TF 1.x SavedModel. Starting with the `0.30` release, the default\n", - "behavior is to export a TF 2.x SavedModel unless TF 2.x behaviors are explicitly\n", - "disabled (by calling `tf.compat.v1.disable_v2_behavior()`).\n", - "\n", - "If using TF 1.x concepts such as `tf.estimator` and `tf.Sessions`, you can retain the previous behavior by passing `force_tf_compat_v1=True` to\n", - "[`tft_beam.Context`](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/Context)\n", - "if using `tf.Transform` as a standalone library or to the\n", - "[Transform](https://www.tensorflow.org/tfx/api_docs/python/tfx/components/Transform)\n", - "component in TFX.\n", - "\n", - "When exporting the `transform_fn` as a TF 2.x SavedModel, the `preprocessing_fn`\n", - "is expected to be traceable using `tf.function`. Additionally, if running your\n", - "pipeline remotely (for example with the `DataflowRunner`), ensure that the\n", - "`preprocessing_fn` and any dependencies are packaged properly as described\n", - "[here](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).\n", - "\n", - "Known issues with using `tf.Transform` to export a TF 2.x SavedModel are\n", - "documented [here](https://www.tensorflow.org/tfx/transform/tf2_support)." 
- ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/596c854c9fc0413d9cbf11fa6e1e5d59/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "XZBRgTv4Th31" - }, - "source": [ - "## Input and output with Apache Beam\n", - "\n", - "So far, we've seen input and output data in python lists (of `RecordBatch`es or\n", - "instance dictionaries). This is a simplification that relies on Apache Beam's\n", - "ability to work with lists as well as its main representation of data, the\n", - "`PCollection`.\n", - "\n", - "A `PCollection` is a data representation that forms a part of a Beam pipeline.\n", - "A Beam pipeline is formed by applying various `PTransform`s, including\n", - "`AnalyzeDataset` and `TransformDataset`, and running the pipeline. A\n", - "`PCollection` is not created in the memory of the main binary, but instead is\n", - "distributed among the workers (although this section uses the in-memory\n", - "execution mode).\n" - ] - }, + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/596c854c9fc0413d9cbf11fa6e1e5d59/assets\n" + ] + } + ], + "source": [ + "raw_data = [\n", + " {'x': 1, 'y': 1, 's': 'hello'},\n", + " {'x': 2, 'y': 2, 's': 'world'},\n", + " {'x': 3, 'y': 3, 's': 'hello'}\n", + "]\n", + "\n", + "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", + " schema_utils.schema_from_feature_spec({\n", + " 'y': tf.io.FixedLenFeature([], tf.float32),\n", + " 'x': tf.io.FixedLenFeature([], tf.float32),\n", + " 's': tf.io.FixedLenFeature([], tf.string),\n", + " }))\n", + "\n", + "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " transformed_dataset, transform_fn = (\n", + " (raw_data, raw_data_metadata) |\n", + " tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "jl2gbkvUICd_" + }, + "outputs": [], + "source": [ + "transformed_data, transformed_metadata = transformed_dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e6029b09" + }, + "source": [ + "The `transformed_data` content is shown below and contains the transformed\n", + "columns in the same format as the raw data. In particular, the values of\n", + "`s_integerized` are `[0, 1, 0]`—these values depend on how the words `hello` and\n", + "`world` were mapped to integers, which is deterministic. For the column\n", + "`x_centered`, we subtracted the mean so the values of the column `x`, which were\n", + "`[1.0, 2.0, 3.0]`, became `[-1.0, 0.0, 1.0]`. Similarly, the rest of the columns\n", + "match their expected values." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "vcMpG2bFFcgP" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "oz_PA4dLTlEe" - }, - "source": [ - "### Pre-canned `PCollection` Sources (`TFXIO`)\n", - "\n", - "The `RecordBatch` format that our implementation accepts is a common format that\n", - "other TFX libraries accept. 
Therefore TFX offers convenient \"sources\" (a.k.a\n", - "`TFXIO`) that read files of various formats on disk and produce `RecordBatch`es\n", - "and can also give `tfxio.TensorAdapterConfig`, including inferred\n", - "`tfxio.TensorRepresentations`.\n", - "\n", - "Those `TFXIO`s can be found in package `tfx_bsl` ([`tfx_bsl.public.tfxio`](https://www.tensorflow.org/tfx/tfx_bsl/api_docs/python/tfx_bsl/public/tfxio)).\n" + "data": { + "text/plain": [ + "[{'s_integerized': 0,\n", + " 'x_centered': -1.0,\n", + " 'x_centered_times_y_normalized': -0.0,\n", + " 'y_normalized': 0.0},\n", + " {'s_integerized': 1,\n", + " 'x_centered': 0.0,\n", + " 'x_centered_times_y_normalized': 0.0,\n", + " 'y_normalized': 0.5},\n", + " {'s_integerized': 0,\n", + " 'x_centered': 1.0,\n", + " 'x_centered_times_y_normalized': 1.0,\n", + " 'y_normalized': 1.0}]" ] - }, + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "transformed_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0ee0d9ac" + }, + "source": [ + "Both `raw_data` and `transformed_data` are datasets. The next two sections show\n", + "how the Beam implementation represents datasets and how to read and write data\n", + "to disk. The other return value, `transform_fn`, represents the transformation\n", + "applied to the data, covered in detail below.\n", + "\n", + "The `tft_beam.AnalyzeAndTransformDataset` class is the composition of the two\n", + "fundamental transforms provided by the implementation\n", + "`tft_beam.AnalyzeDataset` and `tft_beam.TransformDataset`. So the following\n", + "two code snippets are equivalent:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "BCZqx7OfGjZ_" + }, + "outputs": [], + "source": [ + "my_data = (raw_data, raw_data_metadata)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "816cdb9b" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "135596bb" - }, - "source": [ - "## Example: \"Census Income\" dataset\n", - "\n", - "The following example requires both reading and writing data on disk and\n", - "representing data as a `PCollection` (not a list), see:\n", - "[`census_example.py`](https://github.com/tensorflow/transform/tree/master/examples/census_example.py).\n", - "Below we show how to download the data and run this example. The \"Census Income\"\n", - "dataset is provided by the\n", - "[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", - "This dataset contains both categorical and numeric data.\n", - "\n", - "Here is some code to download and preview this data:" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "p-iPFfR-y-Nb" - }, - "outputs": [], - "source": [ - "!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yP-YBifvwh3C" - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "train_data_file = \"adult.data\"" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "Fca2FC8IKwnt" - }, - "source": [ - "There's some configuration code hidden in the cell below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "fo3aBF4CyxTW" - }, - "outputs": [], - "source": [ - "#@title\n", - "ORDERED_CSV_COLUMNS = [\n", - " 'age', 'workclass', 'fnlwgt', 'education', 'education-num',\n", - " 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n", - " 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label'\n", - "]\n", - "\n", - "CATEGORICAL_FEATURE_KEYS = [\n", - " 'workclass',\n", - " 'education',\n", - " 'marital-status',\n", - " 'occupation',\n", - " 'relationship',\n", - " 'race',\n", - " 'sex',\n", - " 'native-country',\n", - "]\n", - "\n", - "NUMERIC_FEATURE_KEYS = [\n", - " 'age',\n", - " 'capital-gain',\n", - " 'capital-loss',\n", - " 'hours-per-week',\n", - " 'education-num',\n", - "]\n", - "\n", - "LABEL_KEY = 'label'\n", - "\n", - "RAW_DATA_FEATURE_SPEC = dict(\n", - " [(name, tf.io.FixedLenFeature([], tf.string))\n", - " for name in CATEGORICAL_FEATURE_KEYS] +\n", - " [(name, tf.io.FixedLenFeature([], tf.float32))\n", - " for name in NUMERIC_FEATURE_KEYS] +\n", - " [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]\n", - ")\n", - "\n", - "SCHEMA = tft.tf_metadata.dataset_metadata.DatasetMetadata(\n", - " tft.tf_metadata.schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)).schema" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", + "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", + "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wCoqaKcgwMyC" - }, - "outputs": [], - "source": [ - "pd.read_csv(train_data_file, names = ORDERED_CSV_COLUMNS).head()" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/3525992e1f39412abc0fa8a4a293037a/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "ed4e8acf" - }, - "source": [ - "The columns of the dataset are either categorical or numeric. This dataset\n", - "describes a classification problem: predicting the last column where the\n", - "individual earns more or less than 50K per year. 
However, from the perspective\n", - "of `tf.Transform`, this label is just another categorical column.\n", - "\n", - "We use a Pre-canned `tfxio.BeamRecordCsvTFXIO` to translate the CSV lines\n", - "into `RecordBatches`. `TFXIO` requires two important piece of information:\n", - "\n", - " - a TensorFlow Metadata Schema,`tfmd.proto.v0.shema_pb2`,\n", - " that contains type and shape information about each CSV column.\n", - " `schema_pb2.TensorRepresentation`s are an optional part of the Schema;\n", - " if not provided (which is the case in this example), they will be inferred\n", - " from the type and shape information. One can get the Schema either by\n", - " using a helper function we provide to translate from TF parsing specs\n", - " (shown in this example), or by running\n", - " [TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic).\n", - " - a list of column names, in the order they appear in the CSV file. Note\n", - " that those names must match the feature names in the Schema." - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/3525992e1f39412abc0fa8a4a293037a/assets\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "TSwWOApYojXn" - }, - "outputs": [], - "source": [ - "!pip install -U -q tfx_bsl" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/e803b4b0258b452d9b4ac7e0d5b15bdb/assets\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "97V3x2FQyMWE" - }, - "outputs": [], - "source": [ - "from tfx_bsl.public import tfxio\n", - "from tfx_bsl.coders.example_coder import RecordBatchToExamples\n", - "\n", - "import apache_beam as beam" - ] - }, + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/e803b4b0258b452d9b4ac7e0d5b15bdb/assets\n" + ] + } + ], + "source": [ + "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " transformed_data, transform_fn = (\n", + " my_data | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6641a8c9" - }, - "outputs": [], - "source": [ - "pipeline = beam.Pipeline()\n", - "\n", - "csv_tfxio = tfxio.BeamRecordCsvTFXIO(\n", - " physical_format='text', column_names=ORDERED_CSV_COLUMNS, schema=SCHEMA)\n", - "\n", - "raw_data = (\n", - " pipeline\n", - " | 'ReadTrainData' \u003e\u003e beam.io.ReadFromText(\n", - " train_data_file, coder=beam.coders.BytesCoder())\n", - " | 'FixCommasTrainData' \u003e\u003e beam.Map(\n", - " lambda line: line.replace(b', ', b','))\n", - " | 'DecodeTrainData' \u003e\u003e csv_tfxio.BeamSource())" + "data": { + "text/plain": [ + "(['/tmp/tmps5l389rr/tftransform_tmp/e803b4b0258b452d9b4ac7e0d5b15bdb'],\n", + " BeamDatasetMetadata(dataset_metadata={'_schema': feature {\n", + " name: \"s_integerized\"\n", + " type: INT\n", + " int_domain {\n", + " is_categorical: true\n", + " }\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " feature {\n", + " name: \"x_centered\"\n", + " type: FLOAT\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " feature {\n", + " name: \"x_centered_times_y_normalized\"\n", + " type: 
FLOAT\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " feature {\n", + " name: \"y_normalized\"\n", + " type: FLOAT\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " }, deferred_metadata=[{'_schema': feature {\n", + " name: \"s_integerized\"\n", + " type: INT\n", + " int_domain {\n", + " min: -1\n", + " max: 1\n", + " is_categorical: true\n", + " }\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " feature {\n", + " name: \"x_centered\"\n", + " type: FLOAT\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " feature {\n", + " name: \"x_centered_times_y_normalized\"\n", + " type: FLOAT\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " feature {\n", + " name: \"y_normalized\"\n", + " type: FLOAT\n", + " presence {\n", + " min_fraction: 1.0\n", + " }\n", + " shape {\n", + " }\n", + " }\n", + " }], asset_map={'vocab_compute_and_apply_vocabulary_vocabulary': 'vocab_compute_and_apply_vocabulary_vocabulary'}))" ] - }, + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "transform_fn" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "dJImGAaeHDTo" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5qATseJbK91x" - }, - "outputs": [], - "source": [ - "raw_data" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "e9b2f63b" - }, - "source": [ - "Note that we had to do some additional fix-ups after the CSV lines are read\n", - "in. Otherwise, we could rely on the `tfxio.CsvTFXIO` to handle both reading the files\n", - "and translating to `RecordBatch`es:" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", + "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", + "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9ede0fb4" - }, - "outputs": [], - "source": [ - "csv_tfxio = tfxio.CsvTFXIO(train_data_file,\n", - " telemetry_descriptors=[], #???\n", - " column_names=ORDERED_CSV_COLUMNS,\n", - " schema=SCHEMA)\n", - "\n", - "p2 = beam.Pipeline()\n", - "raw_data_2 = p2 | 'TFXIORead' \u003e\u003e csv_tfxio.BeamSource()" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/a4b4feb1883a42afa8fd95f4aca657df/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "67d86fba" - }, - "source": [ - "Preprocessing for this dataset is similar to the previous example,\n", - " except the preprocessing function is programmatically generated instead of manually specifying each column. In the preprocessing function below, `NUMERICAL_COLUMNS` and `CATEGORICAL_COLUMNS` are lists that contain the names of the numeric and categorical columns:" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/a4b4feb1883a42afa8fd95f4aca657df/assets\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5f880a78" - }, - "outputs": [], - "source": [ - "NUM_OOV_BUCKETS = 1\n", - "\n", - "def preprocessing_fn(inputs):\n", - " \"\"\"Preprocess input columns into transformed columns.\"\"\"\n", - " # Since we are modifying some features and leaving others unchanged, we\n", - " # start by setting `outputs` to a copy of `inputs.\n", - " outputs = inputs.copy()\n", - "\n", - " # Scale numeric columns to have range [0, 1].\n", - " for key in NUMERIC_FEATURE_KEYS:\n", - " outputs[key] = tft.scale_to_0_1(outputs[key])\n", - "\n", - " # For all categorical columns except the label column, we generate a\n", - " # vocabulary but do not modify the feature. 
This vocabulary is instead\n", - " # used in the trainer, by means of a feature column, to convert the feature\n", - " # from a string to an integer id.\n", - " for key in CATEGORICAL_FEATURE_KEYS:\n", - " outputs[key] = tft.compute_and_apply_vocabulary(\n", - " tf.strings.strip(inputs[key]),\n", - " num_oov_buckets=NUM_OOV_BUCKETS,\n", - " vocab_filename=key)\n", - "\n", - " # For the label column we provide the mapping from string to index.\n", - " with tf.init_scope():\n", - " # `init_scope` - Only initialize the table once.\n", - " initializer = tf.lookup.KeyValueTensorInitializer(\n", - " keys=['\u003e50K', '\u003c=50K'],\n", - " values=tf.cast(tf.range(2), tf.int64),\n", - " key_dtype=tf.string,\n", - " value_dtype=tf.int64)\n", - " table = tf.lookup.StaticHashTable(initializer, default_value=-1)\n", - "\n", - " outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])\n", - "\n", - " return outputs" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/d8513b34b8c2494d951f1a70ed122d07/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "afa165ac" - }, - "source": [ - "One difference from the previous example is the label column manually specifies\n", - "the mapping from the string to an index. So `'\u003e50'` is mapped to `0` and\n", - "`'\u003c=50K'` is mapped to `1` because it's useful to know which index in the\n", - "trained model corresponds to which label.\n", - "\n", - "The `record_batches` variable represents a `PCollection` of\n", - "`pyarrow.RecordBatch`es. The `tensor_adapter_config` is given by `csv_tfxio`,\n", - "which is inferred from `SCHEMA` (and ultimately, in this example, from the TF\n", - "parsing specs).\n", - "\n", - "The final stage is to write the transformed data to disk and has a similar form\n", - "to reading the raw data. The schema used to do this is part of the output of\n", - "`tft_beam.AnalyzeAndTransformDataset` which infers a schema for the output data. The code to write to disk is shown below. The schema is a part of the metadata but uses the two interchangeably in the `tf.Transform` API (i.e. pass the metadata to the `tft.coders.ExampleProtoCoder`). Be aware that this writes to a different format. Instead of `textio.WriteToText`, use Beam's built-in support for the `TFRecord` format and use a coder to encode the data as `Example` protos. This is a better format to use for training, as shown in the next section. `transformed_eval_data_base` provides the base filename for the individual shards that are written." - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/d8513b34b8c2494d951f1a70ed122d07/assets\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PiHLl83FLRXi" - }, - "outputs": [], - "source": [ - "raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "giIQd-8xKubp" - }, - "outputs": [], - "source": [ - "working_dir = tempfile.mkdtemp()\n", - "with tft_beam.Context(temp_dir=working_dir):\n", - " transformed_dataset, transform_fn = (\n", - " raw_dataset | tft_beam.AnalyzeAndTransformDataset(\n", - " preprocessing_fn, output_record_batches=True))" - ] - }, + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", + "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", + "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" + ] + } + ], + "source": [ + "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " transform_fn = my_data | tft_beam.AnalyzeDataset(preprocessing_fn)\n", + " transformed_data = (my_data, transform_fn) | tft_beam.TransformDataset()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M4kl5IA5H29G" + }, + "source": [ + "`transform_fn` is a pure function that represents an operation that is applied\n", + "to each row of the dataset. In particular, the analyzer values are already\n", + "computed and treated as constants. In the example, the `transform_fn` contains\n", + "as constants the mean of column `x`, the min and max of column `y`, and the\n", + "vocabulary used to map the strings to integers.\n", + "\n", + "An important feature of `tf.Transform` is that `transform_fn` represents a map\n", + "*over rows*—it is a pure function applied to each row separately. All of the\n", + "computation for aggregating rows is done in `AnalyzeDataset`. Furthermore, the\n", + "`transform_fn` is represented as a TensorFlow `Graph` which can be embedded into\n", + "the serving graph.\n", + "\n", + "`AnalyzeAndTransformDataset` is provided for optimizations in this special case.\n", + "This is the same pattern used in\n", + "[scikit-learn](http://scikit-learn.org/stable/index.html), providing the `fit`,\n", + "`transform`, and `fit_transform` methods.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2bedd48a" + }, + "source": [ + "## Data Formats and Schema\n", + "\n", + "TFT Beam implementation accepts two different input data formats. 
The\n", + "\"instance dict\" format (as seen in the example above and [simple.ipynb](https://www.tensorflow.org/tfx/tutorials/transform/simple) & [simple_example.py](https://github.com/tensorflow/transform/blob/master/examples/simple_example.py))\n", + "is an intuitive format and is suitable for small datasets while the TFXIO\n", + "([Apache Arrow](https://arrow.apache.org)) format provides improved performance\n", + "and is suitble for large datasets.\n", + "\n", + "The \"metadata\" accompanying the `PCollection` tells the Beam implementation the format of the `PCollection`.\n", + "\n", + "```\n", + "(raw_data, raw_data_metadata) | tft.AnalyzeDataset(...)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5dc76c5a" + }, + "source": [ + "- If `raw_data_metadata` is a `dataset_metadata.DatasetMetadata` (see below,\n", + " \"The 'instance dict' format\" section),\n", + " then `raw_data` is expected to be in the \"instance dict\" format.\n", + "- If `raw_data_metadata` is a `tfxio.TensorAdapterConfig`\n", + " (see below, \"The TFXIO format\" section), then `raw_data` is expected to be\n", + " in the TFXIO format." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XPjE0a7kNU5i" + }, + "source": [ + "### The \"instance dict\" format\n", + "\n", + "The previous code examples used this format. The metadata contains the schema that defines the layout of the data and how it is read from and written to various formats. Even this in-memory format is not self-describing and requires the schema in order to be interpreted as tensors.\n", + "\n", + "Again, here is the definition of the schema for the example data:\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "372894b6" + }, + "outputs": [], + "source": [ + "from tensorflow_transform.tf_metadata import dataset_metadata\n", + "from tensorflow_transform.tf_metadata import schema_utils\n", + "\n", + "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", + " schema_utils.schema_from_feature_spec({\n", + " 's': tf.io.FixedLenFeature([], tf.string),\n", + " 'y': tf.io.FixedLenFeature([], tf.float32),\n", + " 'x': tf.io.FixedLenFeature([], tf.float32),\n", + " }))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "58c2402c" + }, + "source": [ + "The `Schema` proto contains the information needed to parse the\n", + "data from its on-disk or in-memory format, into tensors. It is typically\n", + "constructed by calling `schema_utils.schema_from_feature_spec` with a dict\n", + "mapping feature keys to `tf.io.FixedLenFeature`, `tf.io.VarLenFeature`, and\n", + "`tf.io.SparseFeature` values. See the documentation for\n", + "[`tf.parse_example`](https://www.tensorflow.org/api_docs/python/tf/parse_example)\n", + "for more details.\n", + "\n", + "Above we use `tf.io.FixedLenFeature` to indicate that each feature contains a\n", + "fixed number of values, in this case a single scalar value. 
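\n",
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a further illustration, here is a sketch (with hypothetical feature names) of a feature spec that mixes fixed-length and variable-length features; `tf.io.VarLenFeature` entries are presented to the `preprocessing_fn` as `SparseTensor`s:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch with hypothetical feature names; dataset_metadata and schema_utils\n",
+    "# are imported above.\n",
+    "sketch_feature_spec = {\n",
+    "    'scalar_feature': tf.io.FixedLenFeature([], tf.float32),   # one value per instance\n",
+    "    'vector_feature': tf.io.FixedLenFeature([3], tf.float32),  # exactly 3 values\n",
+    "    'tags': tf.io.VarLenFeature(tf.string),                    # 0..N values -> SparseTensor\n",
+    "}\n",
+    "sketch_metadata = dataset_metadata.DatasetMetadata(\n",
+    "    schema_utils.schema_from_feature_spec(sketch_feature_spec))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "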
Because\n",
+     "`tf.Transform` batches instances, the actual `Tensor` representing the feature\n",
+     "will have shape `(None,)` where the unknown dimension is the batch dimension.\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "jatXeEayOhza"
+    },
+    "source": [
+     "### The TFXIO format\n",
+     "\n",
+     "With this format, the data is expected to be contained in a\n",
+     "[`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html).\n",
+     "For tabular data, our Apache Beam implementation\n",
+     "accepts Arrow `RecordBatch`es that consist of columns of the following types:\n",
+     "\n",
+     "  - `pa.list_(<primitive>)`, where `<primitive>` is `pa.int64()`, `pa.float32()`,\n",
+     "    `pa.binary()` or `pa.large_binary()`.\n",
+     "\n",
+     "  - `pa.large_list(<primitive>)`\n",
+     "\n",
+     "The toy input dataset we used above, when represented as a `RecordBatch`, looks\n",
+     "like the following:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 15,
+    "metadata": {
+     "id": "fd01900a"
+    },
+    "outputs": [],
+    "source": [
+     "import pyarrow as pa\n",
+     "\n",
+     "raw_data = [\n",
+     "    pa.record_batch(\n",
+     "        data=[\n",
+     "            pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n",
+     "            pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n",
+     "            pa.array([['hello'], ['world'], ['hello']], pa.list_(pa.binary())),\n",
+     "        ],\n",
+     "        names=['x', 'y', 's'])\n",
+     "]"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "114d171e"
+    },
+    "source": [
+     "Similar to the `dataset_metadata.DatasetMetadata` instance that accompanies the \"instance dict\" format, a `tfxio.TensorAdapterConfig`\n",
+     "must accompany the `RecordBatch`es. It consists of the Arrow schema of\n",
+     "the `RecordBatch`es, and\n",
+     "`tfxio.TensorRepresentations` to uniquely determine how columns in `RecordBatch`es can be interpreted as TensorFlow Tensors (including but not limited to `tf.Tensor`, `tf.SparseTensor`).\n",
+     "\n",
+     "`tfxio.TensorRepresentations` is a type alias for `Dict[str, tensorflow_metadata.proto.v0.schema_pb2.TensorRepresentation]` which\n",
+     "establishes the relationship between a Tensor that a `preprocessing_fn` accepts\n",
+     "and columns in the `RecordBatch`es. For example:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 16,
+    "metadata": {
+     "id": "b8478d18"
+    },
+    "outputs": [],
+    "source": [
+     "from google.protobuf import text_format\n",
+     "from tensorflow_metadata.proto.v0 import schema_pb2\n",
+     "\n",
+     "tensor_representation = {\n",
+     "    'x': text_format.Parse(\n",
+     "        \"\"\"dense_tensor { column_name: \"col1\" shape { dim { size: 2 } } }\"\"\",\n",
+     "        schema_pb2.TensorRepresentation())\n",
+     "}"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "ZAqE0Fb2Ta47"
+    },
+    "source": [
+     "This means that `inputs['x']` in `preprocessing_fn` should be a dense `tf.Tensor`,\n",
+     "whose values come from a column named `'col1'` in the input `RecordBatch`es,\n",
+     "and its (batched) shape should be `[batch_size, 2]`.\n",
+     "\n",
+     "A `schema_pb2.TensorRepresentation` is a Protobuf defined in\n",
+     "[TensorFlow Metadata](https://github.com/tensorflow/metadata/blob/v0.22.2/tensorflow_metadata/proto/v0/schema.proto#L592)."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "qbPmiaJOTe09"
+    },
+    "source": [
+     "## Compatibility with TensorFlow\n",
+     "\n",
+     "`tf.Transform` provides support for exporting the `transform_fn` as\n",
+     "a SavedModel, see the [simple tutorial](https://www.tensorflow.org/tfx/tutorials/transform/simple) for an example. 
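\n",
+     "\n",
+     "As a minimal sketch (not part of the original example), a transform written out with `tft_beam.WriteTransformFn` to a directory such as `output_dir` can later be re-loaded and applied to raw features:\n",
+     "\n",
+     "```\n",
+     "tf_transform_output = tft.TFTransformOutput(output_dir)\n",
+     "\n",
+     "# A Keras layer that applies the saved transformation to raw features.\n",
+     "transform_layer = tf_transform_output.transform_features_layer()\n",
+     "```\n",
+     "\n",
+     "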
The default behavior before the `0.30` release\n",
+     "exported a TF 1.x SavedModel. Starting with the `0.30` release, the default\n",
+     "behavior is to export a TF 2.x SavedModel unless TF 2.x behaviors are explicitly\n",
+     "disabled (by calling `tf.compat.v1.disable_v2_behavior()`).\n",
+     "\n",
+     "If using TF 1.x concepts such as `tf.estimator` and `tf.Session`, you can retain the previous behavior by passing `force_tf_compat_v1=True` to\n",
+     "[`tft_beam.Context`](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/Context)\n",
+     "if using `tf.Transform` as a standalone library or to the\n",
+     "[Transform](https://www.tensorflow.org/tfx/api_docs/python/tfx/components/Transform)\n",
+     "component in TFX.\n",
+     "\n",
+     "When exporting the `transform_fn` as a TF 2.x SavedModel, the `preprocessing_fn`\n",
+     "is expected to be traceable using `tf.function`. Additionally, if running your\n",
+     "pipeline remotely (for example with the `DataflowRunner`), ensure that the\n",
+     "`preprocessing_fn` and any dependencies are packaged properly as described\n",
+     "[here](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).\n",
+     "\n",
+     "Known issues with using `tf.Transform` to export a TF 2.x SavedModel are\n",
+     "documented [here](https://www.tensorflow.org/tfx/transform/tf2_support)."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "XZBRgTv4Th31"
+    },
+    "source": [
+     "## Input and output with Apache Beam\n",
+     "\n",
+     "So far, we've seen input and output data in Python lists (of `RecordBatch`es or\n",
+     "instance dictionaries). This is a simplification that relies on Apache Beam's\n",
+     "ability to work with lists as well as its main representation of data, the\n",
+     "`PCollection`.\n",
+     "\n",
+     "A `PCollection` is a data representation that forms a part of a Beam pipeline.\n",
+     "A Beam pipeline is formed by applying various `PTransform`s, including\n",
+     "`AnalyzeDataset` and `TransformDataset`, and running the pipeline. A\n",
+     "`PCollection` is not created in the memory of the main binary, but instead is\n",
+     "distributed among the workers (although this section uses the in-memory\n",
+     "execution mode).\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "oz_PA4dLTlEe"
+    },
+    "source": [
+     "### Pre-canned `PCollection` Sources (`TFXIO`)\n",
+     "\n",
+     "The `RecordBatch` format that our implementation accepts is a common format that\n",
+     "other TFX libraries accept. Therefore TFX offers convenient \"sources\" (a.k.a.\n",
+     "`TFXIO`) that read files of various formats on disk, produce `RecordBatch`es,\n",
+     "and can also give a `tfxio.TensorAdapterConfig`, including inferred\n",
+     "`tfxio.TensorRepresentations`.\n",
+     "\n",
+     "Those `TFXIO`s can be found in the package `tfx_bsl` ([`tfx_bsl.public.tfxio`](https://www.tensorflow.org/tfx/tfx_bsl/api_docs/python/tfx_bsl/public/tfxio)).\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "135596bb"
+    },
+    "source": [
+     "## Example: \"Census Income\" dataset\n",
+     "\n",
+     "The following example requires both reading and writing data on disk and\n",
+     "representing data as a `PCollection` (not a list); see:\n",
+     "[`census_example.py`](https://github.com/tensorflow/transform/tree/master/examples/census_example.py).\n",
+     "Below we show how to download the data and run this example. 
The \"Census Income\"\n", + "dataset is provided by the\n", + "[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", + "This dataset contains both categorical and numeric data.\n", + "\n", + "Here is some code to download and preview this data:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "p-iPFfR-y-Nb" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "EEVc2Hdr0Upe" - }, - "outputs": [], - "source": [ - "output_dir = tempfile.mkdtemp()" - ] - }, + "name": "stdout", + "output_type": "stream", + "text": [ + "--2022-05-24 15:07:07-- https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data\n", + "Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.40.176, 142.250.64.80, 142.250.64.112, ...\n", + "Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.40.176|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 3974305 (3.8M) [application/octet-stream]\n", + "Saving to: ‘adult.data’\n", + "\n", + "adult.data 100%[===================>] 3.79M --.-KB/s in 0.1s \n", + "\n", + "2022-05-24 15:07:07 (27.3 MB/s) - ‘adult.data’ saved [3974305/3974305]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "yP-YBifvwh3C" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "train_data_file = \"adult.data\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fca2FC8IKwnt" + }, + "source": [ + "There's some configuration code hidden in the cell below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "fo3aBF4CyxTW" + }, + "outputs": [], + "source": [ + "#@title\n", + "ORDERED_CSV_COLUMNS = [\n", + " 'age', 'workclass', 'fnlwgt', 'education', 'education-num',\n", + " 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n", + " 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label'\n", + "]\n", + "\n", + "CATEGORICAL_FEATURE_KEYS = [\n", + " 'workclass',\n", + " 'education',\n", + " 'marital-status',\n", + " 'occupation',\n", + " 'relationship',\n", + " 'race',\n", + " 'sex',\n", + " 'native-country',\n", + "]\n", + "\n", + "NUMERIC_FEATURE_KEYS = [\n", + " 'age',\n", + " 'capital-gain',\n", + " 'capital-loss',\n", + " 'hours-per-week',\n", + " 'education-num',\n", + "]\n", + "\n", + "LABEL_KEY = 'label'\n", + "\n", + "RAW_DATA_FEATURE_SPEC = dict(\n", + " [(name, tf.io.FixedLenFeature([], tf.string))\n", + " for name in CATEGORICAL_FEATURE_KEYS] +\n", + " [(name, tf.io.FixedLenFeature([], tf.float32))\n", + " for name in NUMERIC_FEATURE_KEYS] +\n", + " [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]\n", + ")\n", + "\n", + "SCHEMA = tft.tf_metadata.dataset_metadata.DatasetMetadata(\n", + " tft.tf_metadata.schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)).schema" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "id": "wCoqaKcgwMyC" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sB5m6v_GPUM5" - }, - "outputs": [], - "source": [ - "transformed_data, _ = transformed_dataset\n", - "\n", - "_ = (\n", - " transformed_data\n", - " | 'EncodeTrainData' \u003e\u003e\n", - " beam.FlatMapTuple(lambda batch, _: RecordBatchToExamples(batch))\n", - " | 'WriteTrainData' \u003e\u003e beam.io.WriteToTFRecord(\n", - " os.path.join(output_dir , 'transformed.tfrecord')))" + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countrylabel
039State-gov77516Bachelors13Never-marriedAdm-clericalNot-in-familyWhiteMale2174040United-States<=50K
150Self-emp-not-inc83311Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0013United-States<=50K
238Private215646HS-grad9DivorcedHandlers-cleanersNot-in-familyWhiteMale0040United-States<=50K
353Private23472111th7Married-civ-spouseHandlers-cleanersHusbandBlackMale0040United-States<=50K
428Private338409Bachelors13Married-civ-spouseProf-specialtyWifeBlackFemale0040Cuba<=50K
\n", + "
" + ], + "text/plain": [ + " age workclass fnlwgt education education-num \\\n", + "0 39 State-gov 77516 Bachelors 13 \n", + "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", + "2 38 Private 215646 HS-grad 9 \n", + "3 53 Private 234721 11th 7 \n", + "4 28 Private 338409 Bachelors 13 \n", + "\n", + " marital-status occupation relationship race sex \\\n", + "0 Never-married Adm-clerical Not-in-family White Male \n", + "1 Married-civ-spouse Exec-managerial Husband White Male \n", + "2 Divorced Handlers-cleaners Not-in-family White Male \n", + "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", + "4 Married-civ-spouse Prof-specialty Wife Black Female \n", + "\n", + " capital-gain capital-loss hours-per-week native-country label \n", + "0 2174 0 40 United-States <=50K \n", + "1 0 0 13 United-States <=50K \n", + "2 0 0 40 United-States <=50K \n", + "3 0 0 40 United-States <=50K \n", + "4 0 0 40 Cuba <=50K " ] - }, + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.read_csv(train_data_file, names = ORDERED_CSV_COLUMNS).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ed4e8acf" + }, + "source": [ + "The columns of the dataset are either categorical or numeric. This dataset\n", + "describes a classification problem: predicting the last column where the\n", + "individual earns more or less than 50K per year. However, from the perspective\n", + "of `tf.Transform`, this label is just another categorical column.\n", + "\n", + "We use a Pre-canned `tfxio.BeamRecordCsvTFXIO` to translate the CSV lines\n", + "into `RecordBatches`. `TFXIO` requires two important piece of information:\n", + "\n", + " - a TensorFlow Metadata Schema,`tfmd.proto.v0.shema_pb2`,\n", + " that contains type and shape information about each CSV column.\n", + " `schema_pb2.TensorRepresentation`s are an optional part of the Schema;\n", + " if not provided (which is the case in this example), they will be inferred\n", + " from the type and shape information. One can get the Schema either by\n", + " using a helper function we provide to translate from TF parsing specs\n", + " (shown in this example), or by running\n", + " [TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic).\n", + " - a list of column names, in the order they appear in the CSV file. Note\n", + " that those names must match the feature names in the Schema." 
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {
+     "id": "TSwWOApYojXn"
+    },
+    "outputs": [],
+    "source": [
+     "!pip install -U -q tfx_bsl"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 21,
+    "metadata": {
+     "id": "97V3x2FQyMWE"
+    },
+    "outputs": [],
+    "source": [
+     "from tfx_bsl.public import tfxio\n",
+     "from tfx_bsl.coders.example_coder import RecordBatchToExamples\n",
+     "\n",
+     "import apache_beam as beam"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 22,
+    "metadata": {
+     "id": "6641a8c9"
+    },
+    "outputs": [],
+    "source": [
+     "pipeline = beam.Pipeline()\n",
+     "\n",
+     "csv_tfxio = tfxio.BeamRecordCsvTFXIO(\n",
+     "    physical_format='text', column_names=ORDERED_CSV_COLUMNS, schema=SCHEMA)\n",
+     "\n",
+     "raw_data = (\n",
+     "    pipeline\n",
+     "    | 'ReadTrainData' >> beam.io.ReadFromText(\n",
+     "        train_data_file, coder=beam.coders.BytesCoder())\n",
+     "    | 'FixCommasTrainData' >> beam.Map(\n",
+     "        lambda line: line.replace(b', ', b','))\n",
+     "    | 'DecodeTrainData' >> csv_tfxio.BeamSource())"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 23,
+    "metadata": {
+     "id": "5qATseJbK91x"
+    },
+    "outputs": [
+     {
+      "data": {
+       "text/plain": [
+        ""
+       ]
+      },
+      "execution_count": 23,
+      "metadata": {},
+      "output_type": "execute_result"
+     }
+    ],
+    "source": [
+     "raw_data"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "e9b2f63b"
+    },
+    "source": [
+     "Note that we had to do some additional fix-ups after the CSV lines were read\n",
+     "in. Otherwise, we could rely on `tfxio.CsvTFXIO` to handle both reading the files\n",
+     "and translating them to `RecordBatch`es:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 24,
+    "metadata": {
+     "id": "9ede0fb4"
+    },
+    "outputs": [],
+    "source": [
+     "csv_tfxio = tfxio.CsvTFXIO(train_data_file,\n",
+     "                           telemetry_descriptors=[],  # identifies the caller for TFX telemetry; empty here\n",
+     "                           column_names=ORDERED_CSV_COLUMNS,\n",
+     "                           schema=SCHEMA)\n",
+     "\n",
+     "p2 = beam.Pipeline()\n",
+     "raw_data_2 = p2 | 'TFXIORead' >> csv_tfxio.BeamSource()"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "67d86fba"
+    },
+    "source": [
+     "Preprocessing for this dataset is similar to the previous example,\n",
+     "except the preprocessing function is programmatically generated instead of manually specifying each column. In the preprocessing function below, `NUMERIC_FEATURE_KEYS` and `CATEGORICAL_FEATURE_KEYS` are lists that contain the names of the numeric and categorical columns:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 25,
+    "metadata": {
+     "id": "5f880a78"
+    },
+    "outputs": [],
+    "source": [
+     "NUM_OOV_BUCKETS = 1\n",
+     "\n",
+     "def preprocessing_fn(inputs):\n",
+     "  \"\"\"Preprocess input columns into transformed columns.\"\"\"\n",
+     "  # Since we are modifying some features and leaving others unchanged, we\n",
+     "  # start by setting `outputs` to a copy of `inputs`.\n",
+     "  outputs = inputs.copy()\n",
+     "\n",
+     "  # Scale numeric columns to have range [0, 1].\n",
+     "  for key in NUMERIC_FEATURE_KEYS:\n",
+     "    outputs[key] = tft.scale_to_0_1(outputs[key])\n",
+     "\n",
+     "  # For all categorical columns except the label column, we generate a\n",
+     "  # vocabulary but do not modify the feature. 
This vocabulary is instead\n",
+     "  # used in the trainer, by means of a feature column, to convert the feature\n",
+     "  # from a string to an integer id.\n",
+     "  for key in CATEGORICAL_FEATURE_KEYS:\n",
+     "    outputs[key] = tft.compute_and_apply_vocabulary(\n",
+     "        tf.strings.strip(inputs[key]),\n",
+     "        num_oov_buckets=NUM_OOV_BUCKETS,\n",
+     "        vocab_filename=key)\n",
+     "\n",
+     "  # For the label column we provide the mapping from string to index.\n",
+     "  with tf.init_scope():\n",
+     "    # `init_scope` - Only initialize the table once.\n",
+     "    initializer = tf.lookup.KeyValueTensorInitializer(\n",
+     "        keys=['>50K', '<=50K'],\n",
+     "        values=tf.cast(tf.range(2), tf.int64),\n",
+     "        key_dtype=tf.string,\n",
+     "        value_dtype=tf.int64)\n",
+     "    table = tf.lookup.StaticHashTable(initializer, default_value=-1)\n",
+     "\n",
+     "  outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])\n",
+     "\n",
+     "  return outputs"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {
+     "id": "afa165ac"
+    },
+    "source": [
+     "One difference from the previous example is that the label column manually specifies\n",
+     "the mapping from the string to an index. So `'>50K'` is mapped to `0` and\n",
+     "`'<=50K'` is mapped to `1` because it's useful to know which index in the\n",
+     "trained model corresponds to which label.\n",
+     "\n",
+     "The `raw_data` variable represents a `PCollection` of\n",
+     "`pyarrow.RecordBatch`es. The tensor adapter config is given by `csv_tfxio`\n",
+     "(`csv_tfxio.TensorAdapterConfig()`), and is inferred from `SCHEMA` (and ultimately, in\n",
+     "this example, from the TF parsing specs).\n",
+     "\n",
+     "The final stage is to write the transformed data to disk, and it has a similar form\n",
+     "to reading the raw data. The schema used to do this is part of the output of\n",
+     "`tft_beam.AnalyzeAndTransformDataset`, which infers a schema for the output data. Because `output_record_batches=True` is passed to `tft_beam.AnalyzeAndTransformDataset` below, the transformed data comes back as `RecordBatch`es; `RecordBatchToExamples` converts them to serialized `tf.train.Example` protos, which are written using Beam's built-in support for the `TFRecord` format. This is a better format to use for training, as sketched below. The path passed to `beam.io.WriteToTFRecord` provides the base filename for the individual shards that are written."
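,
+     "\n",
+     "A minimal sketch (not from the original example) of reading the written shards back for training with `tf.data`, using the transformed schema that `tf.Transform` writes alongside them:\n",
+     "\n",
+     "```\n",
+     "tf_transform_output = tft.TFTransformOutput(output_dir)\n",
+     "feature_spec = tf_transform_output.transformed_feature_spec()\n",
+     "\n",
+     "dataset = tf.data.TFRecordDataset(\n",
+     "    tf.io.gfile.glob(os.path.join(output_dir, 'transformed.tfrecord*')))\n",
+     "dataset = dataset.batch(32).map(\n",
+     "    lambda batch: tf.io.parse_example(batch, feature_spec))\n",
+     "```"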
+ ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "PiHLl83FLRXi" + }, + "outputs": [], + "source": [ + "raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "id": "giIQd-8xKubp" + }, + "outputs": [], + "source": [ + "working_dir = tempfile.mkdtemp()\n", + "with tft_beam.Context(temp_dir=working_dir):\n", + " transformed_dataset, transform_fn = (\n", + " raw_dataset | tft_beam.AnalyzeAndTransformDataset(\n", + " preprocessing_fn, output_record_batches=True))" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "EEVc2Hdr0Upe" + }, + "outputs": [], + "source": [ + "output_dir = tempfile.mkdtemp()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "id": "sB5m6v_GPUM5" + }, + "outputs": [], + "source": [ + "transformed_data, _ = transformed_dataset\n", + "\n", + "_ = (\n", + " transformed_data\n", + " | 'EncodeTrainData' >>\n", + " beam.FlatMapTuple(lambda batch, _: RecordBatchToExamples(batch))\n", + " | 'WriteTrainData' >> beam.io.WriteToTFRecord(\n", + " os.path.join(output_dir , 'transformed.tfrecord')))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5a9df5da" + }, + "source": [ + "In addition to the training data, `transform_fn` is also written out with the\n", + "metadata:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "id": "cdd42661" + }, + "outputs": [], + "source": [ + "_ = (\n", + " transform_fn\n", + " | 'WriteTransformFn' >> tft_beam.WriteTransformFn(output_dir))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-PFSCDrJQXen" + }, + "source": [ + "Run the entire Beam pipeline with `pipeline.run().wait_until_finish()`. Up until this point, the Beam pipeline represents a deferred, distributed computation. It provides instructions for what will be done, but the instructions have not been executed. This final call executes the specified pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "IZWHQSesQW3I" + }, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cdd42661" - }, - "outputs": [], - "source": [ - "_ = (\n", - " transform_fn\n", - " | 'WriteTransformFn' \u003e\u003e tft_beam.WriteTransformFn(output_dir))" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "-PFSCDrJQXen" - }, - "source": [ - "Run the entire Beam pipeline with `pipeline.run().wait_until_finish()`. Up until this point, the Beam pipeline represents a deferred, distributed computation. It provides instructions for what will be done, but the instructions have not been executed. This final call executes the specified pipeline." 
- ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/ef3ef7611e4c485d92b6a0b4a0807de7/assets\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "IZWHQSesQW3I" - }, - "outputs": [], - "source": [ - "result = pipeline.run().wait_until_finish()" - ] + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/ef3ef7611e4c485d92b6a0b4a0807de7/assets\n" + ] }, { - "cell_type": "markdown", - "metadata": { - "id": "0dSYQALF05ug" - }, - "source": [ - "After running the pipeline the output directory contains two artifacts.\n", - "\n", - "* The transformed data, and the metadata describing it.\n", - "* The `tf.saved_model` containing the resulting `preprocessing_fn`" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/e3d3a20dc9324545847a3b060fd3ec71/assets\n" + ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yxM83xAuzL6o" - }, - "outputs": [], - "source": [ - "!ls {output_dir}" - ] - }, + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/e3d3a20dc9324545847a3b060fd3ec71/assets\n" + ] + } + ], + "source": [ + "result = pipeline.run().wait_until_finish()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "OllPVQJl2dRx" - }, - "source": [ - "To see how to use these artifacts refer to the [Advanced preprocessing tutorial](https://www.tensorflow.org/tfx/tutorials/transform/census)." - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "get_started.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" + ], + "source": [ + "print(pipeline)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0dSYQALF05ug" + }, + "source": [ + "After running the pipeline the output directory contains two artifacts.\n", + "\n", + "* The transformed data, and the metadata describing it.\n", + "* The `tf.saved_model` containing the resulting `preprocessing_fn`" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "id": "yxM83xAuzL6o" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "transformed_metadata transformed.tfrecord-00000-of-00001 transform_fn\n" + ] } + ], + "source": [ + "!ls {output_dir}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OllPVQJl2dRx" + }, + "source": [ + "To see how to use these artifacts refer to the [Advanced preprocessing tutorial](https://www.tensorflow.org/tfx/tutorials/transform/census)." 
+ ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "get_started.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 0 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 } From 724060e1f44fa02fefcd4c04e76b391d2550a019 Mon Sep 17 00:00:00 2001 From: Pritam Dodeja Date: Wed, 16 Nov 2022 10:54:04 -0500 Subject: [PATCH 2/3] Corrected number of types of preprocessing layers in notebook. --- docs/get_started.ipynb | 2772 ++++++++++++++++------------------------ 1 file changed, 1121 insertions(+), 1651 deletions(-) diff --git a/docs/get_started.ipynb b/docs/get_started.ipynb index 5e106926..63d41dba 100644 --- a/docs/get_started.ipynb +++ b/docs/get_started.ipynb @@ -1,1656 +1,1126 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "KpXGE33umpig" - }, - "source": [ - "\n", - "\n", - "# Get Started with TensorFlow Transform" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FT1xumYJ-oEW" - }, - "source": [ - "This guide introduces the basic concepts of `tf.Transform` and how to use them.\n", - "It will:\n", - "\n", - "* Define a *preprocessing function*, a logical description of the pipeline\n", - " that transforms the raw data into the data used to train a machine learning\n", - " model.\n", - "* Show the [Apache Beam](https://beam.apache.org/) implementation used to\n", - " transform data by converting the *preprocessing function* into a *Beam\n", - " pipeline*.\n", - "* Show additional usage examples." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9_SoiTcNmkVu" - }, - "source": [ - "## Setup" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gc6oSu9BnwJe" - }, - "outputs": [], - "source": [ - "!pip install -U tensorflow_transform" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pY_BNfLemjY4" - }, - "outputs": [], - "source": [ - "!pip install pyarrow" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "L7Mtis2Jn2Af" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pkg_resources\n", - "import importlib\n", - "importlib.reload(pkg_resources)" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "PvDoWUfynTWh" - }, - "outputs": [], - "source": [ - "import os\n", - "import tempfile\n", - "\n", - "import tensorflow as tf\n", - "import tensorflow_transform as tft\n", - "import tensorflow_transform.beam as tft_beam\n", - "\n", - "from tensorflow_transform.tf_metadata import dataset_metadata\n", - "from tensorflow_transform.tf_metadata import schema_utils\n", - "\n", - "from tfx_bsl.public import tfxio" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "j4W_yuSr-ro3" - }, - "source": [ - "## Define a preprocessing function\n", - "\n", - "The *preprocessing function* is the most important concept of `tf.Transform`.\n", - "The preprocessing function is a logical description of a transformation of the\n", - "dataset. 
The preprocessing function accepts and returns a dictionary of tensors,\n", - "where a *tensor* means `Tensor` or `SparseTensor`. There are two kinds of\n", - "functions used to define the preprocessing function:\n", - "\n", - "1. Any function that accepts and returns tensors. These add TensorFlow\n", - " operations to the graph that transform raw data into transformed data.\n", - "2. Any of the *analyzers* provided by `tf.Transform`. Analyzers also accept\n", - " and return tensors, but unlike TensorFlow functions, they *do not* add\n", - " operations to the graph. Instead, analyzers cause `tf.Transform` to compute\n", - " a full-pass operation outside of TensorFlow. They use the input tensor values\n", - " over the entire dataset to generate a constant tensor that is returned as the\n", - " output. For example, `tft.min` computes the minimum of a tensor over the\n", - " dataset. `tf.Transform` provides a fixed set of analyzers, but this will be\n", - " extended in future versions.\n", - "3. Any stateless [preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) (i.e. these layers must not invoke the ```adapt()``` method). These can be added as operations to the graph as they do not require a full pass over the data outside of the management of ```tf.Transform```. For example, you can add
\n", - "[tf.keras.layers.experimental.preprocessing.HashedCrossing](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/HashedCrossing),

\n", - "but not

\n", - "[tf.keras.layers.Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization),

as the latter needs to be adapted over the entire dataset. Do note that if you use [Lambda layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda), there are some de-serialization limitations which might prevent ```preprocessing_fn``` from being fully re-loaded off of disk by [tft.TFTransformOutput](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/TFTransformOutput). \n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "72ff0efc" - }, - "source": [ - "### Preprocessing function example\n", - "\n", - "By combining analyzers and regular TensorFlow functions, users can create\n", - "flexible pipelines for transforming data. The following preprocessing function\n", - "transforms each of the three features in different ways, and combines two of the\n", - "features:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "id": "c6bf64fe" - }, - "outputs": [], - "source": [ - "def preprocessing_fn(inputs):\n", - " x = inputs['x']\n", - " y = inputs['y']\n", - " s = inputs['s']\n", - " x_centered = x - tft.mean(x)\n", - " y_normalized = tft.scale_to_0_1(y)\n", - " s_integerized = tft.compute_and_apply_vocabulary(s)\n", - " x_centered_times_y_normalized = x_centered * y_normalized\n", - " return {\n", - " 'x_centered': x_centered,\n", - " 'y_normalized': y_normalized,\n", - " 'x_centered_times_y_normalized': x_centered_times_y_normalized,\n", - " 's_integerized': s_integerized\n", - " }" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LU8aclPGAZLX" - }, - "source": [ - "Here, `x`, `y` and `s` are `Tensor`s that represent input features. The first\n", - "new tensor that is created, `x_centered`, is built by applying `tft.mean` to `x`\n", - "and subtracting this from `x`. `tft.mean(x)` returns a tensor representing the\n", - "mean of the tensor `x`. `x_centered` is the tensor `x` with the mean subtracted.\n", - "\n", - "The second new tensor, `y_normalized`, is created in a similar manner but using\n", - "the convenience method `tft.scale_to_0_1`. This method does something similar to\n", - "computing `x_centered`, namely computing a maximum and minimum and using these\n", - "to scale `y`.\n", - "\n", - "The tensor `s_integerized` shows an example of string manipulation. In this\n", - "case, we take a string and map it to an integer. This uses the convenience\n", - "function `tft.compute_and_apply_vocabulary`. This function uses an analyzer to\n", - "compute the unique values taken by the input strings, and then uses TensorFlow\n", - "operations to convert the input strings to indices in the table of unique\n", - "values.\n", - "\n", - "The final column shows that it is possible to use TensorFlow operations to\n", - "create new features by combining tensors.\n", - "\n", - "The preprocessing function defines a pipeline of operations on a dataset. In\n", - "order to apply the pipeline, we rely on a concrete implementation of the\n", - "`tf.Transform` API. The Apache Beam implementation provides `PTransform` which\n", - "applies a user's preprocessing function to data. The typical workflow of a\n", - "`tf.Transform` user will construct a preprocessing function, then incorporate\n", - "this into a larger Beam pipeline, creating the data for training." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nxnXDEK1AezF" - }, - "source": [ - "### Batching\n", - "\n", - "Batching is an important part of TensorFlow. 
Since one of the goals of\n", - "`tf.Transform` is to provide a TensorFlow graph for preprocessing that can be\n", - "incorporated into the serving graph (and, optionally, the training graph),\n", - "batching is also an important concept in `tf.Transform`.\n", - "\n", - "While not obvious in the example above, the user defined preprocessing function\n", - "is passed tensors representing *batches* and not individual instances, as\n", - "happens during training and serving with TensorFlow. On the other hand,\n", - "analyzers perform a computation over the entire dataset that returns a single\n", - "value and not a batch of values. `x` is a `Tensor` with a shape of\n", - "`(batch_size,)`, while `tft.mean(x)` is a `Tensor` with a shape of `()`. The\n", - "subtraction `x - tft.mean(x)` broadcasts where the value of `tft.mean(x)` is\n", - "subtracted from every element of the batch represented by `x`." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "09bf63cc" - }, - "source": [ - "## Apache Beam Implementation\n", - "\n", - "While the *preprocessing function* is intended as a logical description of a\n", - "*preprocessing pipeline* implemented on multiple data processing frameworks,\n", - "`tf.Transform` provides a canonical implementation used on Apache Beam. This\n", - "implementation demonstrates the functionality required from an implementation.\n", - "There is no formal API for this functionality, so each implementation can use an\n", - "API that is idiomatic for its particular data processing framework.\n", - "\n", - "The Apache Beam implementation provides two `PTransform`s used to process data\n", - "for a preprocessing function. The following shows the usage for the composite\n", - "`PTransform` - `tft_beam.AnalyzeAndTransformDataset`:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "2e1e01ec" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", - "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", - "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/05913662504346a59fa1f1348c34f3ef/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/05913662504346a59fa1f1348c34f3ef/assets\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/596c854c9fc0413d9cbf11fa6e1e5d59/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmpt382s9g0/tftransform_tmp/596c854c9fc0413d9cbf11fa6e1e5d59/assets\n" - ] - } - ], - "source": [ - "raw_data = [\n", - " {'x': 1, 'y': 1, 's': 'hello'},\n", - " {'x': 2, 'y': 2, 's': 'world'},\n", - " {'x': 3, 'y': 3, 's': 'hello'}\n", - "]\n", - "\n", - "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", - " schema_utils.schema_from_feature_spec({\n", - " 'y': tf.io.FixedLenFeature([], tf.float32),\n", - " 'x': tf.io.FixedLenFeature([], tf.float32),\n", - " 's': tf.io.FixedLenFeature([], tf.string),\n", - " }))\n", - "\n", - "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", - " transformed_dataset, transform_fn = (\n", - " (raw_data, raw_data_metadata) |\n", - " tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "jl2gbkvUICd_" - }, - "outputs": [], - "source": [ - "transformed_data, transformed_metadata = transformed_dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e6029b09" - }, - "source": [ - "The `transformed_data` content is shown below and contains the transformed\n", - "columns in the same format as the raw data. In particular, the values of\n", - "`s_integerized` are `[0, 1, 0]`—these values depend on how the words `hello` and\n", - "`world` were mapped to integers, which is deterministic. For the column\n", - "`x_centered`, we subtracted the mean so the values of the column `x`, which were\n", - "`[1.0, 2.0, 3.0]`, became `[-1.0, 0.0, 1.0]`. Similarly, the rest of the columns\n", - "match their expected values." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "vcMpG2bFFcgP" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'s_integerized': 0,\n", - " 'x_centered': -1.0,\n", - " 'x_centered_times_y_normalized': -0.0,\n", - " 'y_normalized': 0.0},\n", - " {'s_integerized': 1,\n", - " 'x_centered': 0.0,\n", - " 'x_centered_times_y_normalized': 0.0,\n", - " 'y_normalized': 0.5},\n", - " {'s_integerized': 0,\n", - " 'x_centered': 1.0,\n", - " 'x_centered_times_y_normalized': 1.0,\n", - " 'y_normalized': 1.0}]" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "transformed_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0ee0d9ac" - }, - "source": [ - "Both `raw_data` and `transformed_data` are datasets. 
The next two sections show\n", - "how the Beam implementation represents datasets and how to read and write data\n", - "to disk. The other return value, `transform_fn`, represents the transformation\n", - "applied to the data, covered in detail below.\n", - "\n", - "The `tft_beam.AnalyzeAndTransformDataset` class is the composition of the two\n", - "fundamental transforms provided by the implementation\n", - "`tft_beam.AnalyzeDataset` and `tft_beam.TransformDataset`. So the following\n", - "two code snippets are equivalent:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "BCZqx7OfGjZ_" - }, - "outputs": [], - "source": [ - "my_data = (raw_data, raw_data_metadata)" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "816cdb9b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", - "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", - "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/3525992e1f39412abc0fa8a4a293037a/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/3525992e1f39412abc0fa8a4a293037a/assets\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/e803b4b0258b452d9b4ac7e0d5b15bdb/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmps5l389rr/tftransform_tmp/e803b4b0258b452d9b4ac7e0d5b15bdb/assets\n" - ] - } - ], - "source": [ - "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", - " transformed_data, transform_fn = (\n", - " my_data | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(['/tmp/tmps5l389rr/tftransform_tmp/e803b4b0258b452d9b4ac7e0d5b15bdb'],\n", - " BeamDatasetMetadata(dataset_metadata={'_schema': feature {\n", - " name: \"s_integerized\"\n", - " type: INT\n", - " int_domain {\n", - " is_categorical: true\n", - " }\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " feature {\n", - " name: \"x_centered\"\n", - " type: FLOAT\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " feature {\n", - " name: \"x_centered_times_y_normalized\"\n", - " type: FLOAT\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " feature {\n", - " name: \"y_normalized\"\n", - " type: FLOAT\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " }, deferred_metadata=[{'_schema': feature {\n", - " name: \"s_integerized\"\n", - " type: INT\n", - " int_domain {\n", - " min: -1\n", - " max: 1\n", - " is_categorical: true\n", - " }\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " feature {\n", - " name: \"x_centered\"\n", - " type: FLOAT\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " feature {\n", - " name: \"x_centered_times_y_normalized\"\n", - " type: FLOAT\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " feature {\n", - " name: \"y_normalized\"\n", - " type: FLOAT\n", - " presence {\n", - " min_fraction: 1.0\n", - " }\n", - " shape {\n", - " }\n", - " }\n", - " }], asset_map={'vocab_compute_and_apply_vocabulary_vocabulary': 'vocab_compute_and_apply_vocabulary_vocabulary'}))" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "transform_fn" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "id": "dJImGAaeHDTo" - }, - "outputs": [ - { - "name": "stdout", - 
"output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", - "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", - "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/a4b4feb1883a42afa8fd95f4aca657df/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/a4b4feb1883a42afa8fd95f4aca657df/assets\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/d8513b34b8c2494d951f1a70ed122d07/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp34rhmzgj/tftransform_tmp/d8513b34b8c2494d951f1a70ed122d07/assets\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n", - "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/home/pdodeja/.pyenv/versions/3.8.5/lib/python3.8/site-packages/ipykernel_launcher.py', '-f', '/home/pdodeja/.local/share/jupyter/runtime/kernel-a0e59961-bc03-4678-855c-373020284cb1.json']\n", - "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" - ] - } - ], - "source": [ - "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", - " transform_fn = my_data | tft_beam.AnalyzeDataset(preprocessing_fn)\n", - " transformed_data = (my_data, transform_fn) | tft_beam.TransformDataset()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M4kl5IA5H29G" - }, - "source": [ - "`transform_fn` is a pure function that represents an operation that is applied\n", - "to each row of the dataset. In particular, the analyzer values are already\n", - "computed and treated as constants. In the example, the `transform_fn` contains\n", - "as constants the mean of column `x`, the min and max of column `y`, and the\n", - "vocabulary used to map the strings to integers.\n", - "\n", - "An important feature of `tf.Transform` is that `transform_fn` represents a map\n", - "*over rows*—it is a pure function applied to each row separately. 
All of the\n", - "computation for aggregating rows is done in `AnalyzeDataset`. Furthermore, the\n", - "`transform_fn` is represented as a TensorFlow `Graph` which can be embedded into\n", - "the serving graph.\n", - "\n", - "`AnalyzeAndTransformDataset` is provided for optimizations in this special case.\n", - "This is the same pattern used in\n", - "[scikit-learn](http://scikit-learn.org/stable/index.html), providing the `fit`,\n", - "`transform`, and `fit_transform` methods.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2bedd48a" - }, - "source": [ - "## Data Formats and Schema\n", - "\n", - "TFT Beam implementation accepts two different input data formats. The\n", - "\"instance dict\" format (as seen in the example above and [simple.ipynb](https://www.tensorflow.org/tfx/tutorials/transform/simple) & [simple_example.py](https://github.com/tensorflow/transform/blob/master/examples/simple_example.py))\n", - "is an intuitive format and is suitable for small datasets while the TFXIO\n", - "([Apache Arrow](https://arrow.apache.org)) format provides improved performance\n", - "and is suitble for large datasets.\n", - "\n", - "The \"metadata\" accompanying the `PCollection` tells the Beam implementation the format of the `PCollection`.\n", - "\n", - "```\n", - "(raw_data, raw_data_metadata) | tft.AnalyzeDataset(...)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5dc76c5a" - }, - "source": [ - "- If `raw_data_metadata` is a `dataset_metadata.DatasetMetadata` (see below,\n", - " \"The 'instance dict' format\" section),\n", - " then `raw_data` is expected to be in the \"instance dict\" format.\n", - "- If `raw_data_metadata` is a `tfxio.TensorAdapterConfig`\n", - " (see below, \"The TFXIO format\" section), then `raw_data` is expected to be\n", - " in the TFXIO format." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XPjE0a7kNU5i" - }, - "source": [ - "### The \"instance dict\" format\n", - "\n", - "The previous code examples used this format. The metadata contains the schema that defines the layout of the data and how it is read from and written to various formats. Even this in-memory format is not self-describing and requires the schema in order to be interpreted as tensors.\n", - "\n", - "Again, here is the definition of the schema for the example data:\n", - "\n", - "" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "id": "372894b6" - }, - "outputs": [], - "source": [ - "from tensorflow_transform.tf_metadata import dataset_metadata\n", - "from tensorflow_transform.tf_metadata import schema_utils\n", - "\n", - "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", - " schema_utils.schema_from_feature_spec({\n", - " 's': tf.io.FixedLenFeature([], tf.string),\n", - " 'y': tf.io.FixedLenFeature([], tf.float32),\n", - " 'x': tf.io.FixedLenFeature([], tf.float32),\n", - " }))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "58c2402c" - }, - "source": [ - "The `Schema` proto contains the information needed to parse the\n", - "data from its on-disk or in-memory format, into tensors. It is typically\n", - "constructed by calling `schema_utils.schema_from_feature_spec` with a dict\n", - "mapping feature keys to `tf.io.FixedLenFeature`, `tf.io.VarLenFeature`, and\n", - "`tf.io.SparseFeature` values. 
See the documentation for\n", - "[`tf.parse_example`](https://www.tensorflow.org/api_docs/python/tf/parse_example)\n", - "for more details.\n", - "\n", - "Above we use `tf.io.FixedLenFeature` to indicate that each feature contains a\n", - "fixed number of values, in this case a single scalar value. Because\n", - "`tf.Transform` batches instances, the actual `Tensor` representing the feature\n", - "will have shape `(None,)` where the unknown dimension is the batch dimension.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jatXeEayOhza" - }, - "source": [ - "### The TFXIO format\n", - "\n", - "With this format, the data is expected to be contained in a\n", - "[`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html).\n", - "For tabular data, our Apache Beam implementation\n", - "accepts Arrow `RecordBatch`es that consist of columns of the following types:\n", - "\n", - " - `pa.list_()`, where `` is `pa.int64()`, `pa.float32()`\n", - " `pa.binary()` or `pa.large_binary()`.\n", - "\n", - " - `pa.large_list()`\n", - "\n", - "The toy input dataset we used above, when represented as a `RecordBatch`, looks\n", - "like the following:" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "id": "fd01900a" - }, - "outputs": [], - "source": [ - "import pyarrow as pa\n", - "\n", - "raw_data = [\n", - " pa.record_batch(\n", - " data=[\n", - " pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n", - " pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n", - " pa.array([['hello'], ['world'], ['hello']], pa.list_(pa.binary())),\n", - " ],\n", - " names=['x', 'y', 's'])\n", - "]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "114d171e" - }, - "source": [ - "Similar to the `dataset_metadata.DatasetMetadata` instance that accompanies the \"instance dict\" format, a `tfxio.TensorAdapterConfig`\n", - "is must accompany the `RecordBatch`es. It consists of the Arrow schema of\n", - "the `RecordBatch`es, and\n", - "`tfxio.TensorRepresentations` to uniquely determine how columns in `RecordBatch`es can be interpreted as TensorFlow Tensors (including but not limited to `tf.Tensor`, `tf.SparseTensor`).\n", - "\n", - "`tfxio.TensorRepresentations` is type alias for a `Dict[str, tensorflow_metadata.proto.v0.schema_pb2.TensorRepresentation]` which\n", - "establishes the relationship between a Tensor that a `preprocessing_fn` accepts\n", - "and columns in the `RecordBatch`es. For example:" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": { - "id": "b8478d18" - }, - "outputs": [], - "source": [ - "from google.protobuf import text_format\n", - "from tensorflow_metadata.proto.v0 import schema_pb2\n", - "\n", - "tensor_representation = {\n", - " 'x': text_format.Parse(\n", - " \"\"\"dense_tensor { column_name: \"col1\" shape { dim { size: 2 } } }\"\"\",\n", - " schema_pb2.TensorRepresentation())\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZAqE0Fb2Ta47" - }, - "source": [ - "Means that `inputs['x']` in `preprocessing_fn` should be a dense `tf.Tensor`,\n", - "whose values come from a column of name `'col1'` in the input `RecordBatch`es,\n", - "and its (batched) shape should be `[batch_size, 2]`.\n", - "\n", - "A `schema_pb2.TensorRepresentation` is a Protobuf defined in\n", - "[TensorFlow Metadata](https://github.com/tensorflow/metadata/blob/v0.22.2/tensorflow_metadata/proto/v0/schema.proto#L592)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qbPmiaJOTe09" - }, - "source": [ - "## Compatibility with TensorFlow\n", - "\n", - "`tf.Transform` provides support for exporting the `transform_fn` as\n", - "a SavedModel, see the [simple tutorial](https://www.tensorflow.org/tfx/tutorials/transform/simple) for an example. The default behavior before the `0.30` release\n", - "exported a TF 1.x SavedModel. Starting with the `0.30` release, the default\n", - "behavior is to export a TF 2.x SavedModel unless TF 2.x behaviors are explicitly\n", - "disabled (by calling `tf.compat.v1.disable_v2_behavior()`).\n", - "\n", - "If using TF 1.x concepts such as `tf.estimator` and `tf.Sessions`, you can retain the previous behavior by passing `force_tf_compat_v1=True` to\n", - "[`tft_beam.Context`](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/Context)\n", - "if using `tf.Transform` as a standalone library or to the\n", - "[Transform](https://www.tensorflow.org/tfx/api_docs/python/tfx/components/Transform)\n", - "component in TFX.\n", - "\n", - "When exporting the `transform_fn` as a TF 2.x SavedModel, the `preprocessing_fn`\n", - "is expected to be traceable using `tf.function`. Additionally, if running your\n", - "pipeline remotely (for example with the `DataflowRunner`), ensure that the\n", - "`preprocessing_fn` and any dependencies are packaged properly as described\n", - "[here](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).\n", - "\n", - "Known issues with using `tf.Transform` to export a TF 2.x SavedModel are\n", - "documented [here](https://www.tensorflow.org/tfx/transform/tf2_support)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XZBRgTv4Th31" - }, - "source": [ - "## Input and output with Apache Beam\n", - "\n", - "So far, we've seen input and output data in python lists (of `RecordBatch`es or\n", - "instance dictionaries). This is a simplification that relies on Apache Beam's\n", - "ability to work with lists as well as its main representation of data, the\n", - "`PCollection`.\n", - "\n", - "A `PCollection` is a data representation that forms a part of a Beam pipeline.\n", - "A Beam pipeline is formed by applying various `PTransform`s, including\n", - "`AnalyzeDataset` and `TransformDataset`, and running the pipeline. A\n", - "`PCollection` is not created in the memory of the main binary, but instead is\n", - "distributed among the workers (although this section uses the in-memory\n", - "execution mode).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oz_PA4dLTlEe" - }, - "source": [ - "### Pre-canned `PCollection` Sources (`TFXIO`)\n", - "\n", - "The `RecordBatch` format that our implementation accepts is a common format that\n", - "other TFX libraries accept. 
Therefore TFX offers convenient \"sources\" (a.k.a\n", - "`TFXIO`) that read files of various formats on disk and produce `RecordBatch`es\n", - "and can also give `tfxio.TensorAdapterConfig`, including inferred\n", - "`tfxio.TensorRepresentations`.\n", - "\n", - "Those `TFXIO`s can be found in package `tfx_bsl` ([`tfx_bsl.public.tfxio`](https://www.tensorflow.org/tfx/tfx_bsl/api_docs/python/tfx_bsl/public/tfxio)).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "135596bb" - }, - "source": [ - "## Example: \"Census Income\" dataset\n", - "\n", - "The following example requires both reading and writing data on disk and\n", - "representing data as a `PCollection` (not a list), see:\n", - "[`census_example.py`](https://github.com/tensorflow/transform/tree/master/examples/census_example.py).\n", - "Below we show how to download the data and run this example. The \"Census Income\"\n", - "dataset is provided by the\n", - "[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", - "This dataset contains both categorical and numeric data.\n", - "\n", - "Here is some code to download and preview this data:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "id": "p-iPFfR-y-Nb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2022-05-24 15:07:07-- https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data\n", - "Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.40.176, 142.250.64.80, 142.250.64.112, ...\n", - "Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.40.176|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 3974305 (3.8M) [application/octet-stream]\n", - "Saving to: ‘adult.data’\n", - "\n", - "adult.data 100%[===================>] 3.79M --.-KB/s in 0.1s \n", - "\n", - "2022-05-24 15:07:07 (27.3 MB/s) - ‘adult.data’ saved [3974305/3974305]\n", - "\n" - ] - } - ], - "source": [ - "!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "id": "yP-YBifvwh3C" - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "train_data_file = \"adult.data\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Fca2FC8IKwnt" - }, - "source": [ - "There's some configuration code hidden in the cell below." 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "id": "fo3aBF4CyxTW" - }, - "outputs": [], - "source": [ - "#@title\n", - "ORDERED_CSV_COLUMNS = [\n", - " 'age', 'workclass', 'fnlwgt', 'education', 'education-num',\n", - " 'marital-status', 'occupation', 'relationship', 'race', 'sex',\n", - " 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label'\n", - "]\n", - "\n", - "CATEGORICAL_FEATURE_KEYS = [\n", - " 'workclass',\n", - " 'education',\n", - " 'marital-status',\n", - " 'occupation',\n", - " 'relationship',\n", - " 'race',\n", - " 'sex',\n", - " 'native-country',\n", - "]\n", - "\n", - "NUMERIC_FEATURE_KEYS = [\n", - " 'age',\n", - " 'capital-gain',\n", - " 'capital-loss',\n", - " 'hours-per-week',\n", - " 'education-num',\n", - "]\n", - "\n", - "LABEL_KEY = 'label'\n", - "\n", - "RAW_DATA_FEATURE_SPEC = dict(\n", - " [(name, tf.io.FixedLenFeature([], tf.string))\n", - " for name in CATEGORICAL_FEATURE_KEYS] +\n", - " [(name, tf.io.FixedLenFeature([], tf.float32))\n", - " for name in NUMERIC_FEATURE_KEYS] +\n", - " [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]\n", - ")\n", - "\n", - "SCHEMA = tft.tf_metadata.dataset_metadata.DatasetMetadata(\n", - " tft.tf_metadata.schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)).schema" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": { - "id": "wCoqaKcgwMyC" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countrylabel
039State-gov77516Bachelors13Never-marriedAdm-clericalNot-in-familyWhiteMale2174040United-States<=50K
150Self-emp-not-inc83311Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0013United-States<=50K
238Private215646HS-grad9DivorcedHandlers-cleanersNot-in-familyWhiteMale0040United-States<=50K
353Private23472111th7Married-civ-spouseHandlers-cleanersHusbandBlackMale0040United-States<=50K
428Private338409Bachelors13Married-civ-spouseProf-specialtyWifeBlackFemale0040Cuba<=50K
\n", - "
" - ], - "text/plain": [ - " age workclass fnlwgt education education-num \\\n", - "0 39 State-gov 77516 Bachelors 13 \n", - "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", - "2 38 Private 215646 HS-grad 9 \n", - "3 53 Private 234721 11th 7 \n", - "4 28 Private 338409 Bachelors 13 \n", - "\n", - " marital-status occupation relationship race sex \\\n", - "0 Never-married Adm-clerical Not-in-family White Male \n", - "1 Married-civ-spouse Exec-managerial Husband White Male \n", - "2 Divorced Handlers-cleaners Not-in-family White Male \n", - "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", - "4 Married-civ-spouse Prof-specialty Wife Black Female \n", - "\n", - " capital-gain capital-loss hours-per-week native-country label \n", - "0 2174 0 40 United-States <=50K \n", - "1 0 0 13 United-States <=50K \n", - "2 0 0 40 United-States <=50K \n", - "3 0 0 40 United-States <=50K \n", - "4 0 0 40 Cuba <=50K " - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.read_csv(train_data_file, names = ORDERED_CSV_COLUMNS).head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ed4e8acf" - }, - "source": [ - "The columns of the dataset are either categorical or numeric. This dataset\n", - "describes a classification problem: predicting the last column where the\n", - "individual earns more or less than 50K per year. However, from the perspective\n", - "of `tf.Transform`, this label is just another categorical column.\n", - "\n", - "We use a Pre-canned `tfxio.BeamRecordCsvTFXIO` to translate the CSV lines\n", - "into `RecordBatches`. `TFXIO` requires two important piece of information:\n", - "\n", - " - a TensorFlow Metadata Schema,`tfmd.proto.v0.shema_pb2`,\n", - " that contains type and shape information about each CSV column.\n", - " `schema_pb2.TensorRepresentation`s are an optional part of the Schema;\n", - " if not provided (which is the case in this example), they will be inferred\n", - " from the type and shape information. One can get the Schema either by\n", - " using a helper function we provide to translate from TF parsing specs\n", - " (shown in this example), or by running\n", - " [TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic).\n", - " - a list of column names, in the order they appear in the CSV file. Note\n", - " that those names must match the feature names in the Schema." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "TSwWOApYojXn" - }, - "outputs": [], - "source": [ - "!pip install -U -q tfx_bsl" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": { - "id": "97V3x2FQyMWE" - }, - "outputs": [], - "source": [ - "from tfx_bsl.public import tfxio\n", - "from tfx_bsl.coders.example_coder import RecordBatchToExamples\n", - "\n", - "import apache_beam as beam" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": { - "id": "6641a8c9" - }, - "outputs": [], - "source": [ - "pipeline = beam.Pipeline()\n", - "\n", - "csv_tfxio = tfxio.BeamRecordCsvTFXIO(\n", - " physical_format='text', column_names=ORDERED_CSV_COLUMNS, schema=SCHEMA)\n", - "\n", - "raw_data = (\n", - " pipeline\n", - " | 'ReadTrainData' >> beam.io.ReadFromText(\n", - " train_data_file, coder=beam.coders.BytesCoder())\n", - " | 'FixCommasTrainData' >> beam.Map(\n", - " lambda line: line.replace(b', ', b','))\n", - " | 'DecodeTrainData' >> csv_tfxio.BeamSource())" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "id": "5qATseJbK91x" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "raw_data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e9b2f63b" - }, - "source": [ - "Note that we had to do some additional fix-ups after the CSV lines are read\n", - "in. Otherwise, we could rely on the `tfxio.CsvTFXIO` to handle both reading the files\n", - "and translating to `RecordBatch`es:" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "id": "9ede0fb4" - }, - "outputs": [], - "source": [ - "csv_tfxio = tfxio.CsvTFXIO(train_data_file,\n", - " telemetry_descriptors=[], #???\n", - " column_names=ORDERED_CSV_COLUMNS,\n", - " schema=SCHEMA)\n", - "\n", - "p2 = beam.Pipeline()\n", - "raw_data_2 = p2 | 'TFXIORead' >> csv_tfxio.BeamSource()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "67d86fba" - }, - "source": [ - "Preprocessing for this dataset is similar to the previous example,\n", - " except the preprocessing function is programmatically generated instead of manually specifying each column. In the preprocessing function below, `NUMERICAL_COLUMNS` and `CATEGORICAL_COLUMNS` are lists that contain the names of the numeric and categorical columns:" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": { - "id": "5f880a78" - }, - "outputs": [], - "source": [ - "NUM_OOV_BUCKETS = 1\n", - "\n", - "def preprocessing_fn(inputs):\n", - " \"\"\"Preprocess input columns into transformed columns.\"\"\"\n", - " # Since we are modifying some features and leaving others unchanged, we\n", - " # start by setting `outputs` to a copy of `inputs.\n", - " outputs = inputs.copy()\n", - "\n", - " # Scale numeric columns to have range [0, 1].\n", - " for key in NUMERIC_FEATURE_KEYS:\n", - " outputs[key] = tft.scale_to_0_1(outputs[key])\n", - "\n", - " # For all categorical columns except the label column, we generate a\n", - " # vocabulary but do not modify the feature. 
This vocabulary is instead\n", - " # used in the trainer, by means of a feature column, to convert the feature\n", - " # from a string to an integer id.\n", - " for key in CATEGORICAL_FEATURE_KEYS:\n", - " outputs[key] = tft.compute_and_apply_vocabulary(\n", - " tf.strings.strip(inputs[key]),\n", - " num_oov_buckets=NUM_OOV_BUCKETS,\n", - " vocab_filename=key)\n", - "\n", - " # For the label column we provide the mapping from string to index.\n", - " with tf.init_scope():\n", - " # `init_scope` - Only initialize the table once.\n", - " initializer = tf.lookup.KeyValueTensorInitializer(\n", - " keys=['>50K', '<=50K'],\n", - " values=tf.cast(tf.range(2), tf.int64),\n", - " key_dtype=tf.string,\n", - " value_dtype=tf.int64)\n", - " table = tf.lookup.StaticHashTable(initializer, default_value=-1)\n", - "\n", - " outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])\n", - "\n", - " return outputs" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "afa165ac" - }, - "source": [ - "One difference from the previous example is the label column manually specifies\n", - "the mapping from the string to an index. So `'>50'` is mapped to `0` and\n", - "`'<=50K'` is mapped to `1` because it's useful to know which index in the\n", - "trained model corresponds to which label.\n", - "\n", - "The `record_batches` variable represents a `PCollection` of\n", - "`pyarrow.RecordBatch`es. The `tensor_adapter_config` is given by `csv_tfxio`,\n", - "which is inferred from `SCHEMA` (and ultimately, in this example, from the TF\n", - "parsing specs).\n", - "\n", - "The final stage is to write the transformed data to disk and has a similar form\n", - "to reading the raw data. The schema used to do this is part of the output of\n", - "`tft_beam.AnalyzeAndTransformDataset` which infers a schema for the output data. The code to write to disk is shown below. The schema is a part of the metadata but uses the two interchangeably in the `tf.Transform` API (i.e. pass the metadata to the `tft.coders.ExampleProtoCoder`). Be aware that this writes to a different format. Instead of `textio.WriteToText`, use Beam's built-in support for the `TFRecord` format and use a coder to encode the data as `Example` protos. This is a better format to use for training, as shown in the next section. `transformed_eval_data_base` provides the base filename for the individual shards that are written." 
- ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": { - "id": "PiHLl83FLRXi" - }, - "outputs": [], - "source": [ - "raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": { - "id": "giIQd-8xKubp" - }, - "outputs": [], - "source": [ - "working_dir = tempfile.mkdtemp()\n", - "with tft_beam.Context(temp_dir=working_dir):\n", - " transformed_dataset, transform_fn = (\n", - " raw_dataset | tft_beam.AnalyzeAndTransformDataset(\n", - " preprocessing_fn, output_record_batches=True))" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "id": "EEVc2Hdr0Upe" - }, - "outputs": [], - "source": [ - "output_dir = tempfile.mkdtemp()" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": { - "id": "sB5m6v_GPUM5" - }, - "outputs": [], - "source": [ - "transformed_data, _ = transformed_dataset\n", - "\n", - "_ = (\n", - " transformed_data\n", - " | 'EncodeTrainData' >>\n", - " beam.FlatMapTuple(lambda batch, _: RecordBatchToExamples(batch))\n", - " | 'WriteTrainData' >> beam.io.WriteToTFRecord(\n", - " os.path.join(output_dir , 'transformed.tfrecord')))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5a9df5da" - }, - "source": [ - "In addition to the training data, `transform_fn` is also written out with the\n", - "metadata:" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": { - "id": "cdd42661" - }, - "outputs": [], - "source": [ - "_ = (\n", - " transform_fn\n", - " | 'WriteTransformFn' >> tft_beam.WriteTransformFn(output_dir))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-PFSCDrJQXen" - }, - "source": [ - "Run the entire Beam pipeline with `pipeline.run().wait_until_finish()`. Up until this point, the Beam pipeline represents a deferred, distributed computation. It provides instructions for what will be done, but the instructions have not been executed. This final call executes the specified pipeline." 
- ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": { - "id": "IZWHQSesQW3I" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/ef3ef7611e4c485d92b6a0b4a0807de7/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/ef3ef7611e4c485d92b6a0b4a0807de7/assets\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/e3d3a20dc9324545847a3b060fd3ec71/assets\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:tensorflow:Assets written to: /tmp/tmp9j7iakzh/tftransform_tmp/e3d3a20dc9324545847a3b060fd3ec71/assets\n" - ] - } - ], - "source": [ - "result = pipeline.run().wait_until_finish()" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n" - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "KpXGE33umpig" + }, + "source": [ + "\n", + "\n", + "# Get Started with TensorFlow Transform" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FT1xumYJ-oEW" + }, + "source": [ + "This guide introduces the basic concepts of `tf.Transform` and how to use them.\n", + "It will:\n", + "\n", + "* Define a *preprocessing function*, a logical description of the pipeline\n", + " that transforms the raw data into the data used to train a machine learning\n", + " model.\n", + "* Show the [Apache Beam](https://beam.apache.org/) implementation used to\n", + " transform data by converting the *preprocessing function* into a *Beam\n", + " pipeline*.\n", + "* Show additional usage examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9_SoiTcNmkVu" + }, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gc6oSu9BnwJe" + }, + "outputs": [], + "source": [ + "!pip install -U tensorflow_transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pY_BNfLemjY4" + }, + "outputs": [], + "source": [ + "!pip install pyarrow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L7Mtis2Jn2Af" + }, + "outputs": [], + "source": [ + "import importlib\n", + "\n", + "import pkg_resources\n", + "\n", + "importlib.reload(pkg_resources)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PvDoWUfynTWh" + }, + "outputs": [], + "source": [ + "import os\n", + "import tempfile\n", + "\n", + "import tensorflow as tf\n", + "import tensorflow_transform as tft\n", + "import tensorflow_transform.beam as tft_beam\n", + "from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils\n", + "from tfx_bsl.public import tfxio" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j4W_yuSr-ro3" + }, + "source": [ + "## Define a preprocessing function\n", + "\n", + "The *preprocessing function* is the most important concept of `tf.Transform`.\n", + "The preprocessing function is a logical description of a transformation of the\n", + "dataset. 
The preprocessing function accepts and returns a dictionary of tensors,\n", + "where a *tensor* means `Tensor` or `SparseTensor`. There are three kinds of\n", + "functions used to define the preprocessing function:\n", + "\n", + "1. Any function that accepts and returns tensors. These add TensorFlow\n", + " operations to the graph that transform raw data into transformed data.\n", + "2. Any of the *analyzers* provided by `tf.Transform`. Analyzers also accept\n", + " and return tensors, but unlike TensorFlow functions, they *do not* add\n", + " operations to the graph. Instead, analyzers cause `tf.Transform` to compute\n", + " a full-pass operation outside of TensorFlow. They use the input tensor values\n", + " over the entire dataset to generate a constant tensor that is returned as the\n", + " output. For example, `tft.min` computes the minimum of a tensor over the\n", + " dataset. `tf.Transform` provides a fixed set of analyzers, but this will be\n", + " extended in future versions.\n", + "3. Any stateless [preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) (i.e. these layers must not invoke the ```adapt()``` method). These can be added as operations to the graph as they do not require a full pass over the data outside of the management of ```tf.Transform```. For example, you can add

\n", + "[tf.keras.layers.experimental.preprocessing.HashedCrossing](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/HashedCrossing),

\n", + "but not

\n", + "[tf.keras.layers.Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization),

as the latter needs to be adapted over the entire dataset. Do note that if you use [Lambda layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda), there are some de-serialization limitations which might prevent ```preprocessing_fn``` from being fully re-loaded off of disk by [tft.TFTransformOutput](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/TFTransformOutput). \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "72ff0efc" + }, + "source": [ + "### Preprocessing function example\n", + "\n", + "By combining analyzers and regular TensorFlow functions, users can create\n", + "flexible pipelines for transforming data. The following preprocessing function\n", + "transforms each of the three features in different ways, and combines two of the\n", + "features:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c6bf64fe" + }, + "outputs": [], + "source": [ + "def preprocessing_fn(inputs):\n", + " x = inputs[\"x\"]\n", + " y = inputs[\"y\"]\n", + " s = inputs[\"s\"]\n", + " x_centered = x - tft.mean(x)\n", + " y_normalized = tft.scale_to_0_1(y)\n", + " s_integerized = tft.compute_and_apply_vocabulary(s)\n", + " x_centered_times_y_normalized = x_centered * y_normalized\n", + " return {\n", + " \"x_centered\": x_centered,\n", + " \"y_normalized\": y_normalized,\n", + " \"x_centered_times_y_normalized\": x_centered_times_y_normalized,\n", + " \"s_integerized\": s_integerized,\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LU8aclPGAZLX" + }, + "source": [ + "Here, `x`, `y` and `s` are `Tensor`s that represent input features. The first\n", + "new tensor that is created, `x_centered`, is built by applying `tft.mean` to `x`\n", + "and subtracting this from `x`. `tft.mean(x)` returns a tensor representing the\n", + "mean of the tensor `x`. `x_centered` is the tensor `x` with the mean subtracted.\n", + "\n", + "The second new tensor, `y_normalized`, is created in a similar manner but using\n", + "the convenience method `tft.scale_to_0_1`. This method does something similar to\n", + "computing `x_centered`, namely computing a maximum and minimum and using these\n", + "to scale `y`.\n", + "\n", + "The tensor `s_integerized` shows an example of string manipulation. In this\n", + "case, we take a string and map it to an integer. This uses the convenience\n", + "function `tft.compute_and_apply_vocabulary`. This function uses an analyzer to\n", + "compute the unique values taken by the input strings, and then uses TensorFlow\n", + "operations to convert the input strings to indices in the table of unique\n", + "values.\n", + "\n", + "The final column shows that it is possible to use TensorFlow operations to\n", + "create new features by combining tensors.\n", + "\n", + "The preprocessing function defines a pipeline of operations on a dataset. In\n", + "order to apply the pipeline, we rely on a concrete implementation of the\n", + "`tf.Transform` API. The Apache Beam implementation provides `PTransform` which\n", + "applies a user's preprocessing function to data. The typical workflow of a\n", + "`tf.Transform` user will construct a preprocessing function, then incorporate\n", + "this into a larger Beam pipeline, creating the data for training." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nxnXDEK1AezF" + }, + "source": [ + "### Batching\n", + "\n", + "Batching is an important part of TensorFlow. 
Since one of the goals of\n", + "`tf.Transform` is to provide a TensorFlow graph for preprocessing that can be\n", + "incorporated into the serving graph (and, optionally, the training graph),\n", + "batching is also an important concept in `tf.Transform`.\n", + "\n", + "While not obvious in the example above, the user defined preprocessing function\n", + "is passed tensors representing *batches* and not individual instances, as\n", + "happens during training and serving with TensorFlow. On the other hand,\n", + "analyzers perform a computation over the entire dataset that returns a single\n", + "value and not a batch of values. `x` is a `Tensor` with a shape of\n", + "`(batch_size,)`, while `tft.mean(x)` is a `Tensor` with a shape of `()`. The\n", + "subtraction `x - tft.mean(x)` broadcasts where the value of `tft.mean(x)` is\n", + "subtracted from every element of the batch represented by `x`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "09bf63cc" + }, + "source": [ + "## Apache Beam Implementation\n", + "\n", + "While the *preprocessing function* is intended as a logical description of a\n", + "*preprocessing pipeline* implemented on multiple data processing frameworks,\n", + "`tf.Transform` provides a canonical implementation used on Apache Beam. This\n", + "implementation demonstrates the functionality required from an implementation.\n", + "There is no formal API for this functionality, so each implementation can use an\n", + "API that is idiomatic for its particular data processing framework.\n", + "\n", + "The Apache Beam implementation provides two `PTransform`s used to process data\n", + "for a preprocessing function. The following shows the usage for the composite\n", + "`PTransform` - `tft_beam.AnalyzeAndTransformDataset`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2e1e01ec" + }, + "outputs": [], + "source": [ + "raw_data = [\n", + " {\"x\": 1, \"y\": 1, \"s\": \"hello\"},\n", + " {\"x\": 2, \"y\": 2, \"s\": \"world\"},\n", + " {\"x\": 3, \"y\": 3, \"s\": \"hello\"},\n", + "]\n", + "\n", + "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", + " schema_utils.schema_from_feature_spec(\n", + " {\n", + " \"y\": tf.io.FixedLenFeature([], tf.float32),\n", + " \"x\": tf.io.FixedLenFeature([], tf.float32),\n", + " \"s\": tf.io.FixedLenFeature([], tf.string),\n", + " }\n", + " )\n", + ")\n", + "\n", + "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " transformed_dataset, transform_fn = (\n", + " raw_data,\n", + " raw_data_metadata,\n", + " ) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jl2gbkvUICd_" + }, + "outputs": [], + "source": [ + "transformed_data, transformed_metadata = transformed_dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e6029b09" + }, + "source": [ + "The `transformed_data` content is shown below and contains the transformed\n", + "columns in the same format as the raw data. In particular, the values of\n", + "`s_integerized` are `[0, 1, 0]`—these values depend on how the words `hello` and\n", + "`world` were mapped to integers, which is deterministic. For the column\n", + "`x_centered`, we subtracted the mean so the values of the column `x`, which were\n", + "`[1.0, 2.0, 3.0]`, became `[-1.0, 0.0, 1.0]`. Similarly, the rest of the columns\n", + "match their expected values." 
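+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before displaying the output, here is a minimal plain-TensorFlow sketch of the\n",
+ "batching and broadcasting behavior described earlier. It is not part of the\n",
+ "Beam pipeline, and the values assume the toy dataset above: the batch `x` has\n",
+ "shape `(batch_size,)`, while an analyzer result behaves like a constant of\n",
+ "shape `()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# `x` arrives in the preprocessing function as a batch of shape (batch_size,).\n",
+ "x = tf.constant([1.0, 2.0, 3.0])\n",
+ "# An analyzer result such as `tft.mean(x)` behaves like a constant of shape ().\n",
+ "x_mean = tf.constant(2.0)\n",
+ "# Broadcasting subtracts the scalar from every element of the batch.\n",
+ "x - x_mean  # [-1., 0., 1.]"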
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vcMpG2bFFcgP" + }, + "outputs": [], + "source": [ + "transformed_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0ee0d9ac" + }, + "source": [ + "Both `raw_data` and `transformed_data` are datasets. The next two sections show\n", + "how the Beam implementation represents datasets and how to read and write data\n", + "to disk. The other return value, `transform_fn`, represents the transformation\n", + "applied to the data, covered in detail below.\n", + "\n", + "The `tft_beam.AnalyzeAndTransformDataset` class is the composition of the two\n", + "fundamental transforms provided by the implementation\n", + "`tft_beam.AnalyzeDataset` and `tft_beam.TransformDataset`. So the following\n", + "two code snippets are equivalent:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BCZqx7OfGjZ_" + }, + "outputs": [], + "source": [ + "my_data = (raw_data, raw_data_metadata)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "816cdb9b" + }, + "outputs": [], + "source": [ + "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " transformed_data, transform_fn = my_data | tft_beam.AnalyzeAndTransformDataset(\n", + " preprocessing_fn\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3ecf416c08f5" + }, + "outputs": [], + "source": [ + "transform_fn" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dJImGAaeHDTo" + }, + "outputs": [], + "source": [ + "with tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n", + " transform_fn = my_data | tft_beam.AnalyzeDataset(preprocessing_fn)\n", + " transformed_data = (my_data, transform_fn) | tft_beam.TransformDataset()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M4kl5IA5H29G" + }, + "source": [ + "`transform_fn` is a pure function that represents an operation that is applied\n", + "to each row of the dataset. In particular, the analyzer values are already\n", + "computed and treated as constants. In the example, the `transform_fn` contains\n", + "as constants the mean of column `x`, the min and max of column `y`, and the\n", + "vocabulary used to map the strings to integers.\n", + "\n", + "An important feature of `tf.Transform` is that `transform_fn` represents a map\n", + "*over rows*—it is a pure function applied to each row separately. All of the\n", + "computation for aggregating rows is done in `AnalyzeDataset`. Furthermore, the\n", + "`transform_fn` is represented as a TensorFlow `Graph` which can be embedded into\n", + "the serving graph.\n", + "\n", + "`AnalyzeAndTransformDataset` is provided for optimizations in this special case.\n", + "This is the same pattern used in\n", + "[scikit-learn](http://scikit-learn.org/stable/index.html), providing the `fit`,\n", + "`transform`, and `fit_transform` methods.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2bedd48a" + }, + "source": [ + "## Data Formats and Schema\n", + "\n", + "TFT Beam implementation accepts two different input data formats. 
The\n", + "\"instance dict\" format (as seen in the example above and [simple.ipynb](https://www.tensorflow.org/tfx/tutorials/transform/simple) & [simple_example.py](https://github.com/tensorflow/transform/blob/master/examples/simple_example.py))\n", + "is an intuitive format and is suitable for small datasets while the TFXIO\n", + "([Apache Arrow](https://arrow.apache.org)) format provides improved performance\n", + "and is suitble for large datasets.\n", + "\n", + "The \"metadata\" accompanying the `PCollection` tells the Beam implementation the format of the `PCollection`.\n", + "\n", + "```\n", + "(raw_data, raw_data_metadata) | tft.AnalyzeDataset(...)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5dc76c5a" + }, + "source": [ + "- If `raw_data_metadata` is a `dataset_metadata.DatasetMetadata` (see below,\n", + " \"The 'instance dict' format\" section),\n", + " then `raw_data` is expected to be in the \"instance dict\" format.\n", + "- If `raw_data_metadata` is a `tfxio.TensorAdapterConfig`\n", + " (see below, \"The TFXIO format\" section), then `raw_data` is expected to be\n", + " in the TFXIO format." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XPjE0a7kNU5i" + }, + "source": [ + "### The \"instance dict\" format\n", + "\n", + "The previous code examples used this format. The metadata contains the schema that defines the layout of the data and how it is read from and written to various formats. Even this in-memory format is not self-describing and requires the schema in order to be interpreted as tensors.\n", + "\n", + "Again, here is the definition of the schema for the example data:\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "372894b6" + }, + "outputs": [], + "source": [ + "from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils\n", + "\n", + "raw_data_metadata = dataset_metadata.DatasetMetadata(\n", + " schema_utils.schema_from_feature_spec(\n", + " {\n", + " \"s\": tf.io.FixedLenFeature([], tf.string),\n", + " \"y\": tf.io.FixedLenFeature([], tf.float32),\n", + " \"x\": tf.io.FixedLenFeature([], tf.float32),\n", + " }\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "58c2402c" + }, + "source": [ + "The `Schema` proto contains the information needed to parse the\n", + "data from its on-disk or in-memory format, into tensors. It is typically\n", + "constructed by calling `schema_utils.schema_from_feature_spec` with a dict\n", + "mapping feature keys to `tf.io.FixedLenFeature`, `tf.io.VarLenFeature`, and\n", + "`tf.io.SparseFeature` values. See the documentation for\n", + "[`tf.parse_example`](https://www.tensorflow.org/api_docs/python/tf/parse_example)\n", + "for more details.\n", + "\n", + "Above we use `tf.io.FixedLenFeature` to indicate that each feature contains a\n", + "fixed number of values, in this case a single scalar value. 
Because\n", + "`tf.Transform` batches instances, the actual `Tensor` representing the feature\n", + "will have shape `(None,)` where the unknown dimension is the batch dimension.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jatXeEayOhza" + }, + "source": [ + "### The TFXIO format\n", + "\n", + "With this format, the data is expected to be contained in a\n", + "[`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html).\n", + "For tabular data, our Apache Beam implementation\n", + "accepts Arrow `RecordBatch`es that consist of columns of the following types:\n", + "\n", + " - `pa.list_()`, where `` is `pa.int64()`, `pa.float32()`\n", + " `pa.binary()` or `pa.large_binary()`.\n", + "\n", + " - `pa.large_list()`\n", + "\n", + "The toy input dataset we used above, when represented as a `RecordBatch`, looks\n", + "like the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fd01900a" + }, + "outputs": [], + "source": [ + "import pyarrow as pa\n", + "\n", + "raw_data = [\n", + " pa.record_batch(\n", + " data=[\n", + " pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n", + " pa.array([[1], [2], [3]], pa.list_(pa.float32())),\n", + " pa.array([[\"hello\"], [\"world\"], [\"hello\"]], pa.list_(pa.binary())),\n", + " ],\n", + " names=[\"x\", \"y\", \"s\"],\n", + " )\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "114d171e" + }, + "source": [ + "Similar to the `dataset_metadata.DatasetMetadata` instance that accompanies the \"instance dict\" format, a `tfxio.TensorAdapterConfig`\n", + "is must accompany the `RecordBatch`es. It consists of the Arrow schema of\n", + "the `RecordBatch`es, and\n", + "`tfxio.TensorRepresentations` to uniquely determine how columns in `RecordBatch`es can be interpreted as TensorFlow Tensors (including but not limited to `tf.Tensor`, `tf.SparseTensor`).\n", + "\n", + "`tfxio.TensorRepresentations` is type alias for a `Dict[str, tensorflow_metadata.proto.v0.schema_pb2.TensorRepresentation]` which\n", + "establishes the relationship between a Tensor that a `preprocessing_fn` accepts\n", + "and columns in the `RecordBatch`es. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b8478d18" + }, + "outputs": [], + "source": [ + "from google.protobuf import text_format\n", + "from tensorflow_metadata.proto.v0 import schema_pb2\n", + "\n", + "tensor_representation = {\n", + " \"x\": text_format.Parse(\n", + " \"\"\"dense_tensor { column_name: \"col1\" shape { dim { size: 2 } } }\"\"\",\n", + " schema_pb2.TensorRepresentation(),\n", + " )\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZAqE0Fb2Ta47" + }, + "source": [ + "Means that `inputs['x']` in `preprocessing_fn` should be a dense `tf.Tensor`,\n", + "whose values come from a column of name `'col1'` in the input `RecordBatch`es,\n", + "and its (batched) shape should be `[batch_size, 2]`.\n", + "\n", + "A `schema_pb2.TensorRepresentation` is a Protobuf defined in\n", + "[TensorFlow Metadata](https://github.com/tensorflow/metadata/blob/v0.22.2/tensorflow_metadata/proto/v0/schema.proto#L592)." 
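+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Putting these pieces together, here is a sketch of how such a\n",
+ "`tfxio.TensorAdapterConfig` could be assembled. It assumes `RecordBatch`es\n",
+ "whose `'col1'` column carries two `float32` values per row, matching the\n",
+ "`tensor_representation` defined above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The Arrow schema of the incoming RecordBatches: 'col1' holds lists of floats.\n",
+ "arrow_schema = pa.schema([pa.field(\"col1\", pa.list_(pa.float32()))])\n",
+ "\n",
+ "# Pairing the Arrow schema with the TensorRepresentations above tells the\n",
+ "# implementation to interpret 'col1' as the dense tensor `inputs['x']`.\n",
+ "tensor_adapter_config = tfxio.TensorAdapterConfig(\n",
+ "    arrow_schema, tensor_representation\n",
+ ")"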
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qbPmiaJOTe09" + }, + "source": [ + "## Compatibility with TensorFlow\n", + "\n", + "`tf.Transform` provides support for exporting the `transform_fn` as\n", + "a SavedModel, see the [simple tutorial](https://www.tensorflow.org/tfx/tutorials/transform/simple) for an example. The default behavior before the `0.30` release\n", + "exported a TF 1.x SavedModel. Starting with the `0.30` release, the default\n", + "behavior is to export a TF 2.x SavedModel unless TF 2.x behaviors are explicitly\n", + "disabled (by calling `tf.compat.v1.disable_v2_behavior()`).\n", + "\n", + "If using TF 1.x concepts such as `tf.estimator` and `tf.Sessions`, you can retain the previous behavior by passing `force_tf_compat_v1=True` to\n", + "[`tft_beam.Context`](https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/Context)\n", + "if using `tf.Transform` as a standalone library or to the\n", + "[Transform](https://www.tensorflow.org/tfx/api_docs/python/tfx/components/Transform)\n", + "component in TFX.\n", + "\n", + "When exporting the `transform_fn` as a TF 2.x SavedModel, the `preprocessing_fn`\n", + "is expected to be traceable using `tf.function`. Additionally, if running your\n", + "pipeline remotely (for example with the `DataflowRunner`), ensure that the\n", + "`preprocessing_fn` and any dependencies are packaged properly as described\n", + "[here](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).\n", + "\n", + "Known issues with using `tf.Transform` to export a TF 2.x SavedModel are\n", + "documented [here](https://www.tensorflow.org/tfx/transform/tf2_support)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XZBRgTv4Th31" + }, + "source": [ + "## Input and output with Apache Beam\n", + "\n", + "So far, we've seen input and output data in python lists (of `RecordBatch`es or\n", + "instance dictionaries). This is a simplification that relies on Apache Beam's\n", + "ability to work with lists as well as its main representation of data, the\n", + "`PCollection`.\n", + "\n", + "A `PCollection` is a data representation that forms a part of a Beam pipeline.\n", + "A Beam pipeline is formed by applying various `PTransform`s, including\n", + "`AnalyzeDataset` and `TransformDataset`, and running the pipeline. A\n", + "`PCollection` is not created in the memory of the main binary, but instead is\n", + "distributed among the workers (although this section uses the in-memory\n", + "execution mode).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oz_PA4dLTlEe" + }, + "source": [ + "### Pre-canned `PCollection` Sources (`TFXIO`)\n", + "\n", + "The `RecordBatch` format that our implementation accepts is a common format that\n", + "other TFX libraries accept. 
Therefore TFX offers convenient \"sources\" (a.k.a\n", + "`TFXIO`) that read files of various formats on disk and produce `RecordBatch`es\n", + "and can also give `tfxio.TensorAdapterConfig`, including inferred\n", + "`tfxio.TensorRepresentations`.\n", + "\n", + "Those `TFXIO`s can be found in package `tfx_bsl` ([`tfx_bsl.public.tfxio`](https://www.tensorflow.org/tfx/tfx_bsl/api_docs/python/tfx_bsl/public/tfxio)).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "135596bb" + }, + "source": [ + "## Example: \"Census Income\" dataset\n", + "\n", + "The following example requires both reading and writing data on disk and\n", + "representing data as a `PCollection` (not a list), see:\n", + "[`census_example.py`](https://github.com/tensorflow/transform/tree/master/examples/census_example.py).\n", + "Below we show how to download the data and run this example. The \"Census Income\"\n", + "dataset is provided by the\n", + "[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", + "This dataset contains both categorical and numeric data.\n", + "\n", + "Here is some code to download and preview this data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "p-iPFfR-y-Nb" + }, + "outputs": [], + "source": [ + "!wget https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/census/adult.data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yP-YBifvwh3C" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "train_data_file = \"adult.data\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fca2FC8IKwnt" + }, + "source": [ + "There's some configuration code hidden in the cell below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fo3aBF4CyxTW" + }, + "outputs": [], + "source": [ + "# @title\n", + "ORDERED_CSV_COLUMNS = [\n", + " \"age\",\n", + " \"workclass\",\n", + " \"fnlwgt\",\n", + " \"education\",\n", + " \"education-num\",\n", + " \"marital-status\",\n", + " \"occupation\",\n", + " \"relationship\",\n", + " \"race\",\n", + " \"sex\",\n", + " \"capital-gain\",\n", + " \"capital-loss\",\n", + " \"hours-per-week\",\n", + " \"native-country\",\n", + " \"label\",\n", + "]\n", + "\n", + "CATEGORICAL_FEATURE_KEYS = [\n", + " \"workclass\",\n", + " \"education\",\n", + " \"marital-status\",\n", + " \"occupation\",\n", + " \"relationship\",\n", + " \"race\",\n", + " \"sex\",\n", + " \"native-country\",\n", + "]\n", + "\n", + "NUMERIC_FEATURE_KEYS = [\n", + " \"age\",\n", + " \"capital-gain\",\n", + " \"capital-loss\",\n", + " \"hours-per-week\",\n", + " \"education-num\",\n", + "]\n", + "\n", + "LABEL_KEY = \"label\"\n", + "\n", + "RAW_DATA_FEATURE_SPEC = dict(\n", + " [(name, tf.io.FixedLenFeature([], tf.string)) for name in CATEGORICAL_FEATURE_KEYS]\n", + " + [(name, tf.io.FixedLenFeature([], tf.float32)) for name in NUMERIC_FEATURE_KEYS]\n", + " + [(LABEL_KEY, tf.io.FixedLenFeature([], tf.string))]\n", + ")\n", + "\n", + "SCHEMA = tft.tf_metadata.dataset_metadata.DatasetMetadata(\n", + " tft.tf_metadata.schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC)\n", + ").schema" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wCoqaKcgwMyC" + }, + "outputs": [], + "source": [ + "pd.read_csv(train_data_file, names=ORDERED_CSV_COLUMNS).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ed4e8acf" + }, + "source": [ + "The columns of the dataset are either categorical or numeric. This dataset\n", + "describes a classification problem: predicting the last column where the\n", + "individual earns more or less than 50K per year. However, from the perspective\n", + "of `tf.Transform`, this label is just another categorical column.\n", + "\n", + "We use a Pre-canned `tfxio.BeamRecordCsvTFXIO` to translate the CSV lines\n", + "into `RecordBatches`. `TFXIO` requires two important piece of information:\n", + "\n", + " - a TensorFlow Metadata Schema,`tfmd.proto.v0.shema_pb2`,\n", + " that contains type and shape information about each CSV column.\n", + " `schema_pb2.TensorRepresentation`s are an optional part of the Schema;\n", + " if not provided (which is the case in this example), they will be inferred\n", + " from the type and shape information. One can get the Schema either by\n", + " using a helper function we provide to translate from TF parsing specs\n", + " (shown in this example), or by running\n", + " [TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic).\n", + " - a list of column names, in the order they appear in the CSV file. Note\n", + " that those names must match the feature names in the Schema." 
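+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick check, you can print part of the `SCHEMA` generated above from the\n",
+ "TF parsing spec. Each CSV column appears as a feature with its type; because no\n",
+ "`TensorRepresentation`s were provided, they will be inferred from this type and\n",
+ "shape information:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Show the first feature of the Schema proto built from RAW_DATA_FEATURE_SPEC.\n",
+ "print(SCHEMA.feature[0])"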
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TSwWOApYojXn" + }, + "outputs": [], + "source": [ + "!pip install -U -q tfx_bsl" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "97V3x2FQyMWE" + }, + "outputs": [], + "source": [ + "import apache_beam as beam\n", + "from tfx_bsl.coders.example_coder import RecordBatchToExamples\n", + "from tfx_bsl.public import tfxio" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6641a8c9" + }, + "outputs": [], + "source": [ + "pipeline = beam.Pipeline()\n", + "\n", + "csv_tfxio = tfxio.BeamRecordCsvTFXIO(\n", + " physical_format=\"text\", column_names=ORDERED_CSV_COLUMNS, schema=SCHEMA\n", + ")\n", + "\n", + "raw_data = (\n", + " pipeline\n", + " | \"ReadTrainData\"\n", + " >> beam.io.ReadFromText(train_data_file, coder=beam.coders.BytesCoder())\n", + " | \"FixCommasTrainData\" >> beam.Map(lambda line: line.replace(b\", \", b\",\"))\n", + " | \"DecodeTrainData\" >> csv_tfxio.BeamSource()\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5qATseJbK91x" + }, + "outputs": [], + "source": [ + "raw_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e9b2f63b" + }, + "source": [ + "Note that we had to do some additional fix-ups after the CSV lines are read\n", + "in. Otherwise, we could rely on the `tfxio.CsvTFXIO` to handle both reading the files\n", + "and translating to `RecordBatch`es:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9ede0fb4" + }, + "outputs": [], + "source": [ + "csv_tfxio = tfxio.CsvTFXIO(\n", + " train_data_file,\n", + " telemetry_descriptors=[], # ???\n", + " column_names=ORDERED_CSV_COLUMNS,\n", + " schema=SCHEMA,\n", + ")\n", + "\n", + "p2 = beam.Pipeline()\n", + "raw_data_2 = p2 | \"TFXIORead\" >> csv_tfxio.BeamSource()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "67d86fba" + }, + "source": [ + "Preprocessing for this dataset is similar to the previous example,\n", + " except the preprocessing function is programmatically generated instead of manually specifying each column. In the preprocessing function below, `NUMERICAL_COLUMNS` and `CATEGORICAL_COLUMNS` are lists that contain the names of the numeric and categorical columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5f880a78" + }, + "outputs": [], + "source": [ + "NUM_OOV_BUCKETS = 1\n", + "\n", + "\n", + "def preprocessing_fn(inputs):\n", + " \"\"\"Preprocess input columns into transformed columns.\"\"\"\n", + " # Since we are modifying some features and leaving others unchanged, we\n", + " # start by setting `outputs` to a copy of `inputs.\n", + " outputs = inputs.copy()\n", + "\n", + " # Scale numeric columns to have range [0, 1].\n", + " for key in NUMERIC_FEATURE_KEYS:\n", + " outputs[key] = tft.scale_to_0_1(outputs[key])\n", + "\n", + " # For all categorical columns except the label column, we generate a\n", + " # vocabulary but do not modify the feature. 
This vocabulary is instead\n", + " # used in the trainer, by means of a feature column, to convert the feature\n", + " # from a string to an integer id.\n", + " for key in CATEGORICAL_FEATURE_KEYS:\n", + " outputs[key] = tft.compute_and_apply_vocabulary(\n", + " tf.strings.strip(inputs[key]),\n", + " num_oov_buckets=NUM_OOV_BUCKETS,\n", + " vocab_filename=key,\n", + " )\n", + "\n", + " # For the label column we provide the mapping from string to index.\n", + " with tf.init_scope():\n", + " # `init_scope` - Only initialize the table once.\n", + " initializer = tf.lookup.KeyValueTensorInitializer(\n", + " keys=[\">50K\", \"<=50K\"],\n", + " values=tf.cast(tf.range(2), tf.int64),\n", + " key_dtype=tf.string,\n", + " value_dtype=tf.int64,\n", + " )\n", + " table = tf.lookup.StaticHashTable(initializer, default_value=-1)\n", + "\n", + " outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])\n", + "\n", + " return outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "afa165ac" + }, + "source": [ + "One difference from the previous example is the label column manually specifies\n", + "the mapping from the string to an index. So `'>50'` is mapped to `0` and\n", + "`'<=50K'` is mapped to `1` because it's useful to know which index in the\n", + "trained model corresponds to which label.\n", + "\n", + "The `record_batches` variable represents a `PCollection` of\n", + "`pyarrow.RecordBatch`es. The `tensor_adapter_config` is given by `csv_tfxio`,\n", + "which is inferred from `SCHEMA` (and ultimately, in this example, from the TF\n", + "parsing specs).\n", + "\n", + "The final stage is to write the transformed data to disk and has a similar form\n", + "to reading the raw data. The schema used to do this is part of the output of\n", + "`tft_beam.AnalyzeAndTransformDataset` which infers a schema for the output data. The code to write to disk is shown below. The schema is a part of the metadata but uses the two interchangeably in the `tf.Transform` API (i.e. pass the metadata to the `tft.coders.ExampleProtoCoder`). Be aware that this writes to a different format. Instead of `textio.WriteToText`, use Beam's built-in support for the `TFRecord` format and use a coder to encode the data as `Example` protos. This is a better format to use for training, as shown in the next section. `transformed_eval_data_base` provides the base filename for the individual shards that are written." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PiHLl83FLRXi" + }, + "outputs": [], + "source": [ + "raw_dataset = (raw_data, csv_tfxio.TensorAdapterConfig())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "giIQd-8xKubp" + }, + "outputs": [], + "source": [ + "working_dir = tempfile.mkdtemp()\n", + "with tft_beam.Context(temp_dir=working_dir):\n", + " (\n", + " transformed_dataset,\n", + " transform_fn,\n", + " ) = raw_dataset | tft_beam.AnalyzeAndTransformDataset(\n", + " preprocessing_fn, output_record_batches=True\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EEVc2Hdr0Upe" + }, + "outputs": [], + "source": [ + "output_dir = tempfile.mkdtemp()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sB5m6v_GPUM5" + }, + "outputs": [], + "source": [ + "transformed_data, _ = transformed_dataset\n", + "\n", + "_ = (\n", + " transformed_data\n", + " | \"EncodeTrainData\"\n", + " >> beam.FlatMapTuple(lambda batch, _: RecordBatchToExamples(batch))\n", + " | \"WriteTrainData\"\n", + " >> beam.io.WriteToTFRecord(os.path.join(output_dir, \"transformed.tfrecord\"))\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5a9df5da" + }, + "source": [ + "In addition to the training data, `transform_fn` is also written out with the\n", + "metadata:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cdd42661" + }, + "outputs": [], + "source": [ + "_ = transform_fn | \"WriteTransformFn\" >> tft_beam.WriteTransformFn(output_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-PFSCDrJQXen" + }, + "source": [ + "Run the entire Beam pipeline with `pipeline.run().wait_until_finish()`. Up until this point, the Beam pipeline represents a deferred, distributed computation. It provides instructions for what will be done, but the instructions have not been executed. This final call executes the specified pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IZWHQSesQW3I" + }, + "outputs": [], + "source": [ + "result = pipeline.run().wait_until_finish()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6457b1a7dea1" + }, + "outputs": [], + "source": [ + "print(pipeline)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0dSYQALF05ug" + }, + "source": [ + "After running the pipeline the output directory contains two artifacts.\n", + "\n", + "* The transformed data, and the metadata describing it.\n", + "* The `tf.saved_model` containing the resulting `preprocessing_fn`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yxM83xAuzL6o" + }, + "outputs": [], + "source": [ + "!ls {output_dir}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OllPVQJl2dRx" + }, + "source": [ + "To see how to use these artifacts refer to the [Advanced preprocessing tutorial](https://www.tensorflow.org/tfx/tutorials/transform/census)." 
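+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch of that usage (assuming the `output_dir` written above),\n",
+ "the exported artifacts can be reloaded with `tft.TFTransformOutput`. Note that\n",
+ "the Lambda layer deserialization caveat mentioned earlier applies at this\n",
+ "step:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Reload the artifacts written by WriteTransformFn.\n",
+ "tf_transform_output = tft.TFTransformOutput(output_dir)\n",
+ "\n",
+ "# The feature spec of the transformed data, useful for parsing the TFRecords.\n",
+ "print(tf_transform_output.transformed_feature_spec())\n",
+ "\n",
+ "# A layer that applies the same transform graph at training or serving time.\n",
+ "tft_layer = tf_transform_output.transform_features_layer()"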
+ ] } - ], - "source": [ - "print(pipeline)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0dSYQALF05ug" - }, - "source": [ - "After running the pipeline the output directory contains two artifacts.\n", - "\n", - "* The transformed data, and the metadata describing it.\n", - "* The `tf.saved_model` containing the resulting `preprocessing_fn`" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": { - "id": "yxM83xAuzL6o" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "transformed_metadata transformed.tfrecord-00000-of-00001 transform_fn\n" - ] + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "get_started.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" } - ], - "source": [ - "!ls {output_dir}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OllPVQJl2dRx" - }, - "source": [ - "To see how to use these artifacts refer to the [Advanced preprocessing tutorial](https://www.tensorflow.org/tfx/tutorials/transform/census)." - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "get_started.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - } - }, - "nbformat": 4, - "nbformat_minor": 4 + "nbformat": 4, + "nbformat_minor": 0 } From 49ee768bcdc79fb2602a2dc74c10f1c0527ae218 Mon Sep 17 00:00:00 2001 From: Pritam Dodeja Date: Wed, 16 Nov 2022 11:14:27 -0500 Subject: [PATCH 3/3] Switched to two space indentation. --- docs/get_started.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/get_started.ipynb b/docs/get_started.ipynb index 63d41dba..e0ce5e79 100644 --- a/docs/get_started.ipynb +++ b/docs/get_started.ipynb @@ -929,7 +929,7 @@ " vocab_filename=key,\n", " )\n", "\n", - " # For the label column we provide the mapping from string to index.\n", + " # For the label column we provide the mapping from string to index.\n", " with tf.init_scope():\n", " # `init_scope` - Only initialize the table once.\n", " initializer = tf.lookup.KeyValueTensorInitializer(\n",