This repository presents an optimized Apache Beam pipeline for generating sentence embeddings (runnable on Cloud Dataflow). It accompanies our blog post: Improving Dataflow Pipelines for Text Data Processing.

We assume you already have a billing-enabled Google Cloud Platform (GCP) project if you want to run the pipeline on Cloud Dataflow.
To run the code locally, first install the dependencies: `pip install -r requirements`. If you cannot create a Google Cloud Storage (GCS) Bucket, download the data from here. We only need the `train_data.txt` file for our purpose. Note, however, that without a GCS Bucket you cannot run the pipeline on Cloud Dataflow, which is the main objective of this repository.

After downloading the dataset, update the paths and command-line arguments in `main.py` that use GCS. Then execute `python main.py -r DirectRunner`.
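As a rough illustration of how a script like `main.py` might translate the `-r` flag into Beam pipeline options, here is a minimal sketch. The flag names and helper function are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical sketch: parse a runner flag (as in `python main.py -r DirectRunner`)
# and turn it into Beam-style option strings. Not the repo's actual code.
import argparse


def build_pipeline_args(argv):
    """Parse the runner flag and return Beam-style pipeline option strings."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-r", "--runner", default="DirectRunner",
        help="Beam runner: DirectRunner (local) or DataflowRunner.",
    )
    # parse_known_args lets any remaining flags pass through to Beam untouched.
    args, _ = parser.parse_known_args(argv)
    # Beam's PipelineOptions accepts flags of the form --key=value.
    return [f"--runner={args.runner}"]
```

A list like this can then be handed to `apache_beam.options.pipeline_options.PipelineOptions` when constructing the pipeline.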
- Create a GCS Bucket and note its name.
- Create a folder called `data` inside the Bucket.
- Copy the `train_data.txt` file to the `data` folder: `gsutil cp train_data.txt gs://<BUCKET-NAME>/data`.
- Then run the following from the terminal:

  ```shell
  python main.py \
    --project <GCP-Project> \
    --gcs-bucket <BUCKET-NAME> \
    --runner DataflowRunner
  ```
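Internally, a command like the one above boils down to a list of Beam option strings derived from the project and bucket names. The sketch below shows one plausible way to assemble them; the `--input` flag and the exact staging/temp paths are assumptions for illustration (though `--temp_location` and `--staging_location` are standard Dataflow options):

```python
# Hypothetical sketch: build the Beam/Dataflow option strings that the
# terminal command above would produce. Paths and the --input flag are
# assumptions, not the repo's actual interface.
def dataflow_options(project, gcs_bucket):
    """Return Beam-style option strings for a Dataflow run."""
    return [
        f"--project={project}",
        "--runner=DataflowRunner",
        f"--temp_location=gs://{gcs_bucket}/tmp",         # Dataflow scratch space
        f"--staging_location=gs://{gcs_bucket}/staging",  # uploaded pipeline packages
        f"--input=gs://{gcs_bucket}/data/train_data.txt", # the file copied above
    ]
```

Keeping the bucket name as a single argument (as the `--gcs-bucket` flag does) lets the script derive all of these `gs://` paths consistently from one place.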
For more details, please refer to our blog post: Improving Dataflow Pipelines for Text Data Processing.