This repository hosts code for the following series of blog posts:
- Variable-Length Sequences in TensorFlow Part 1: Optimizing Sequence Padding
- Variable-Length Sequences in TensorFlow Part 2: Training with a Simple BERT Model
- Variable-Length Sequences in TensorFlow Part 3: Using a Sentence-Conditioned BERT Encoder
We provide the code in the form of Jupyter Notebooks so that you can experiment with the methods interactively.
Text data comes in different shapes and forms: sometimes as sequences of characters, sometimes as sequences of words, sometimes as sequences of sentences, and so on. For machine learning (ML) algorithms to process text sequences in batches, every sequence within a batch needs to have the same length. However, text sequences naturally come in varying lengths.
In this project, we implement different strategies for handling variable-length sequences in TensorFlow with a focus on performance. We discuss the pros and cons of each strategy along with their implementations in TensorFlow. We have successfully applied some of these strategies to large-scale data here at Carted and have greatly benefited from them. We hope you’ll be able to apply them in your own projects as well. Below is a side-by-side comparison of how efficient handling of variable-length sequences can speed up model training:
We’ll be using this dataset from Kaggle, which concerns a text classification problem: given the description of a movie, we need to predict its genre. Each description consists of multiple sentences, and there are 27 unique genres (such as action, adult, adventure, animation, biography, and comedy).
As a disclaimer, our focus is on designing efficient data pipelines for handling variable-length text sequences, which helps reduce compute waste. For completeness, we also show how to use these pipelines to train text classification models.
Note: A central theme of our code is padding a batch of sequences with respect to the maximum sequence length of that batch rather than a fixed, global sequence length.
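For illustration, here is a minimal sketch (not taken from the notebooks) of what this looks like with `tf.data`; the toy token sequences and batch size are made up:

```python
import tensorflow as tf

# Made-up token-id sequences of different lengths.
sequences = [[3, 1, 4, 1, 5], [9, 2], [6, 5, 3, 5, 8, 9, 7], [2, 7]]
labels = [0, 1, 0, 1]

dataset = tf.data.Dataset.from_generator(
    lambda: zip(sequences, labels),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# `padded_shapes=([None], [])` pads each batch only up to the longest
# sequence in *that* batch instead of a global fixed length.
batched = dataset.padded_batch(2, padded_shapes=([None], []))

for tokens, targets in batched:
    print(tokens.shape)  # (2, 5) for the first batch, (2, 7) for the second
```

Going one step further, `tf.data.Dataset.bucket_by_sequence_length()` groups sequences of similar lengths into the same batch, which reduces the amount of padding even more.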
The repository contains the following notebooks:

- `bigram-tfidf-shallow-mlp.ipynb`: Shows how to use bigrams along with TF-IDF vectorization to train a simple text classification model with fully-connected layers (see the TF-IDF sketch after this list).
- `smart-batching-shallow-mlp.ipynb`: Shows how to train a text classifier using simple models consisting of embeddings, GRUs, and fully-connected layers (see the embedding + GRU sketch after this list).
- `bert/`
  - `data-preparation.ipynb`: Shows how to prepare text data into TensorFlow Records (TFRecords) with tokenization (see the TFRecord sketch after this list). The corresponding modeling notebook is present at `bert/train-vanilla-bert.ipynb`.
  - `data-preparation-sentence-splitter.ipynb`: Treats each movie description as a sequence of sentences and serializes them into TFRecords. The corresponding modeling notebook is present at `bert/train-model-split-sentence.ipynb`.
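To give a flavor of the notebooks without opening them, below are a few minimal sketches. They are simplified stand-ins rather than copies of the notebook code, and the toy texts, layer sizes, and hyperparameters are made up. First, TF-IDF over unigrams and bigrams with `TextVectorization` feeding a shallow MLP, in the spirit of `bigram-tfidf-shallow-mlp.ipynb`:

```python
import tensorflow as tf

# Stand-in corpus; the notebook uses the Kaggle movie descriptions.
train_texts = tf.constant([
    "a retired detective returns for one last case",
    "two friends open a bakery in a small town",
    "an astronaut is stranded on a distant planet",
])
train_labels = tf.constant([0, 1, 2])

# TF-IDF over unigrams and bigrams maps every description to a fixed-size
# vector, so variable sequence length stops being a concern here.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000, ngrams=2, output_mode="tf_idf"
)
vectorizer.adapt(train_texts)
features = vectorizer(train_texts)

# A shallow fully-connected classifier on top of the TF-IDF features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(27),  # 27 genres
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(features, train_labels, epochs=1)
```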
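Next, a sketch in the spirit of `smart-batching-shallow-mlp.ipynb`: an embedding + GRU classifier consuming batches padded to their own maximum length (as in the `padded_batch` sketch above). The vocabulary size and layer sizes are placeholders; `mask_zero=True` is what makes batch-level padding safe for the recurrent layer.

```python
import tensorflow as tf

VOCAB_SIZE = 20_000  # placeholder; set from the tokenizer/vectorizer you use
NUM_GENRES = 27

model = tf.keras.Sequential([
    # `mask_zero=True` propagates a mask so padded positions (id 0) are
    # ignored by the GRU.
    tf.keras.layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_GENRES),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# `batched` would be a tf.data pipeline like the `padded_batch` example above,
# yielding (token_ids, label) pairs padded to each batch's own max length.
# model.fit(batched, epochs=5)
```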
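Finally, a sketch of the TFRecord round trip that the `bert/` data-preparation notebooks revolve around: serialize each tokenized description at its natural length, then parse and pad per batch at training time. The token ids and file name below are made up (in the notebooks they come from a tokenizer); the sentence-splitter variant follows the same idea but serializes each description as a list of sentence-level sequences instead.

```python
import tensorflow as tf

def serialize_example(token_ids, label):
    """Pack one tokenized description and its label into a tf.train.Example.
    The int64_list keeps its natural length -- no padding at write time."""
    feature = {
        "token_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=token_ids)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Made-up token ids standing in for real tokenizer output.
examples = [([101, 7592, 2088, 102], 3), ([101, 2023, 2003, 1037, 3185, 102], 11)]

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for token_ids, label in examples:
        writer.write(serialize_example(token_ids, label))

# Read the records back as variable-length sequences and pad per batch.
feature_spec = {
    "token_ids": tf.io.VarLenFeature(tf.int64),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    parsed = tf.io.parse_single_example(record, feature_spec)
    return tf.sparse.to_dense(parsed["token_ids"]), parsed["label"]

dataset = (
    tf.data.TFRecordDataset("train.tfrecord")
    .map(parse)
    .padded_batch(2, padded_shapes=([None], []))  # pad to the batch max
)
```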
All the results discussed in the above-mentioned articles can be found at the following links:
- https://wandb.ai/carted/smart-batching-simpler-models (simpler models)
- https://wandb.ai/carted/batching-experiments (BERT models)
Have questions or suggestions? Feel free to open an issue and let us know.