Parallax
is a distributed query engine for private/sensitive data. Its main features are:
- Privacy as Code: It lets you write privacy policies in a simple YAML manifest and enforces them rigorously at query-time.
- Horizontally Scalable: It scales as much as the database where your data is.
- Declarative: You state what you need and it figures out how to do it. With no unpredictable side-effect and with complete reproducibility.
At a high-level, Parallax
works by recursively pattern-matching and transforming input user queries to adhere to the administrator's policies.
Parallax then re-writes queries into semantically equivalent private queries which are executed against the underlying backend.
The fastest way to get hacking with Parallax
is to run it straight from our offical containers registry
$ docker run -it --rm eu.gcr.io/openquery/parallax
The image will start a single parallax-worker
node in the background. This lets you issue commands against it using the parallax
CLI. The first thing you will want to do is initialize the deployment with
root@xxxxxxxx:/opt/work# parallax init
which will configure an initial cluster state for us. That's it! We're up and running with the basics.
Should you wish to quickly onboard some data so we can start doing some queries, the easiest is to use the example manifests under example/manifest
. In the rest of this guide we'll walk through that.
In the example we will link up with a public dataset on BigQuery. Of course, it does not need any anonymization or privacy features, but it will be enough to demonstrate the most basic features of Parallax
. For this part you will need a Service Account Key for Google Cloud Platform (GCP) so that Parallax
can authenticate with GCP to run queries on your behalf. That service account will need write access to a blank BigQuery dataset (called a 'staging' dataset) so that Parallax
can store staging artifacts. The queries in this example never exceed 5gb in size so you should comfortably stay within the BigQuery free-tier of 1TiB at the time of writing this.
Let's use the public BigQuery london_bicycles
dataset, which has some data on individual usage of rental bikes in London between 2015 and mid 2017. We have already made a simple set of policies for it here. That manifest declares a dataset "safe_dataset" consisting of two underlying BigQuery tables: london_bicycles.cycle_hire
and london_bicycles.cycle_stations
. After this manifest is applied to Parallax
, these tables will show up to be queried under safe_dataset.cycle_hire
and safe_dataset.cycle_stations
. The policies in the manifest ensure that only some select columns (e.g. start_date
, start_station_id
, etc) are allowed to be queried as-is through Parallax
, while some values must be obfuscated (e.g. start_station_name
, etc) before being released to the user. All the columns not mentioned explicitly in the manifest will cause the query to be denied (the default is to not reveal anything).
The very first thing we need to do is get the current configuration state of Parallax
checked out in the working directory. Parallax
is completely declarative, a bit like Terraform (we do not use HCL currently but we definitely want to). So the workflow is to add/modify/remove items from a collection of local manifest files, then run parallax gov apply
to commit the changes back to Parallax
.
To check out the entire state in the file state.yaml
we run
root@xxxxxxxx:/opt/work# parallax --disable-tls gov fetch --output=state.yaml
At this point the state.yaml
file will look something like this:
- group:
name: wheel
members:
- resource.user.root
- user:
name: root
email: ""
public_keys:
- "{SOME PEM ENCODED PUBLIC KEY}"
auth_provider: ~
primary_group: resource.group.wheel
super_user: true
So it only contains a root
user and an administrators group called wheel
. Both were generated by our first call to parallax init
. Note that we had to use the --disable-tls
flag as the worker node on the image has TLS disabled by default.
Now copy your service account key into the running container (see docker cp) and set the following environment variables
root@xxxxxxxx:/opt/work# export SERVICE_ACCOUNT_KEY={where my key is}
root@xxxxxxxx:/opt/work# export STAGING_PROJECT_ID={a GCP project_id}
root@xxxxxxxx:/opt/work# export STAGING_DATASET_ID={a GCP dataset_id}
We append the example manifest to our existing state.yaml
with something like
root@xxxxxxxx:/opt/work# curl https://raw.githubusercontent.com/openquery-io/core/master/parallax/example/manifest/london_bicycles.yaml |\
envsubst >> state.yaml
Finally we have to apply the new manifest back to Parallax
so that it actually uses it. This can be done with
root@xxxxxxxx:/opt/work# parallax --disable-tls gov apply
We can then start running SQL queries to the protected dataset. Queries are managed via the parallax jobs
subcommand. For example
root@xxxxxxxx:/opt/work# cat<<EOF | parallax --disable-tls jobs insert
SELECT start_station_id, start_station_name
FROM safe_dataset.cycle_hire
EOF
We can find out the state of submitted jobs with parallax jobs list
and parallax jobs get
. When it is done (should only take a few seconds, depending on BigQuery's mood) we can fetch the result as a csv with
root@xxxxxxxx:/opt/work# parallax jobs fetch {JOB_ID} --format=csv
You will first need to clone this repository locally, including all the submodules with,
$ git clone --recursive [email protected]:openquery-io/core.git openquery-core
The easiest is to build the whole thing in Docker. Simply navigate to the root of the core
repository and run
$ docker build -f parallax/docker/Dockerfile.build .
Run cargo build
in this workspace. Building Parallax
currently requires Rust nightly.
Most tests can be run locally as usual with cargo test
. Some will require additional credentials to query which by default have to supplied by filling in a secret/
folder at the root of the repository (currently used by the BigQuery backend looking for a GCP service account key to run under). The easiest way to find out what is missing is trying to run them and filling them in as required. We're actually working on a solution to make these publically runnable using our credentials so hang tight!
Apart from that you will need a quick and dirty Redis box for integration tests. This can be done simply by running off the official Redis docker image. For example:
$ docker run -it --rm -p 6379:6379 redis
First of all, we sincerely appreciate any contributions large or small - so thank you.
See the contributing page for more details.
This project is licensed under the BSL license.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in Parallax by you, shall be licensed as BSL license, without any additional terms or conditions.