Update build docs
hellais committed Jan 8, 2025
1 parent 7224338 commit c063a4a
Showing 2 changed files with 27 additions and 143 deletions.
122 changes: 5 additions & 117 deletions oonipipeline/Design.md
@@ -81,10 +81,11 @@ There should be clear instructions on how to set it up and get it running.
The analysis engine is made up of several components:

- Observation generation
- Ground truth generation
- Experiment result generation

As part of future work we might also perform:
- Response body archiving

Below we explain each step of this process in detail.

At a high level the pipeline looks like this:
@@ -94,17 +95,8 @@
```mermaid
graph
M{{Measurement}} --> OGEN[[make_observations]]
OGEN --> |many| O{{Observations}}
NDB[(NetInfoDB)] --> OGEN
OGEN --> RB{{ResponseBodies}}
RB --> BA[(BodyArchive)]
FDB[(FingerprintDB)] --> FPH
FPH --> BA
RB --> FPH[[fingerprint_hunter]]
O --> ODB[(ObservationTables)]
ODB --> MKGT[[make_ground_truths]]
MKGT --> GTDB[(GroundTruthDB)]
GTDB --> MKER
BA --> MKER
ODB --> MKER[[make_experiment_results]]
MKER --> |one| ER{{ExperimentResult}}
```
@@ -137,64 +129,12 @@ incredibly fast.
A side effect is that we end up with tables that can be a bit sparse (several
columns are NULL).

The tricky part, in the case of complex tests like `web_connectivity`, is to
figure out which individual sub measurements fit into the same observation row.
For example, we would like the TCP connect result to appear in the same row
as the DNS query that led to it, together with the TLS handshake towards that
IP, port combination.
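A minimal sketch of this grouping logic, keying each sub-measurement by `(ip, port)` so related results land in the same row. The dict shapes and field names here are illustrative assumptions, not the actual oonidata data model:

```python
# Illustrative sketch: key each sub-measurement by (ip, port) so that the
# DNS answer, TCP connect and TLS handshake for the same endpoint end up
# in the same observation row. Field names are hypothetical.

def group_observations(dns_answers, tcp_connects, tls_handshakes, port=443):
    rows = {}
    for ans in dns_answers:
        rows.setdefault((ans["ip"], port), {})["dns_answer"] = ans["ip"]
    for tcp in tcp_connects:
        rows.setdefault((tcp["ip"], tcp["port"]), {})["tcp_success"] = tcp["success"]
    for tls in tls_handshakes:
        rows.setdefault((tls["ip"], tls["port"]), {})["tls_failure"] = tls["failure"]
    # Any column a sub-measurement did not fill stays absent (NULL-like),
    # which is why the resulting tables can be sparse.
    return rows

rows = group_observations(
    dns_answers=[{"ip": "1.2.3.4"}],
    tcp_connects=[{"ip": "1.2.3.4", "port": 443, "success": True}],
    tls_handshakes=[{"ip": "1.2.3.4", "port": 443, "failure": "connection_reset"}],
)
```

Here all three sub-measurements share the `("1.2.3.4", 443)` key and therefore collapse into a single row.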

You can run the observation generation with a clickhouse backend like so:

```
poetry run python -m oonidata mkobs --clickhouse clickhouse://localhost/ --data-dir tests/data/datadir/ --start-day 2022-08-01 --end-day 2022-10-01 --create-tables --parallelism 20
```

Here is the list of supported observations so far:

- [x] WebObservation, which has information about DNS, TCP, TLS and HTTP(s)
- [x] WebControlObservation, has the control measurements run by web connectivity (is used to generate ground truths)
- [ ] CircumventionToolObservation, still needs to be designed and implemented
(ideally we would use the same for OpenVPN, Psiphon, VanillaTor)

### Response body archiving

It is optionally possible to also create WAR archives of HTTP response bodies
when running the observation generation.

This is enabled by passing the extra command line argument `--archives-dir`.

Whenever a response body is detected in a measurement, it is sent to the
archiving queue, which looks up in the database whether the body has been seen
already (so we don't store exact duplicate bodies).
If we haven't archived it yet, we write the body to a WAR file and record its
sha1 hash, together with the filename we wrote it to, in a database.
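The dedup-by-hash step can be sketched like this. It is a simplified stand-in: a set and a list replace the real database and WAR file:

```python
import hashlib

def archive_body(body: bytes, seen_hashes: set, archive: list) -> str:
    """Archive a response body only if its sha1 digest has not been seen yet."""
    digest = hashlib.sha1(body).hexdigest()
    if digest not in seen_hashes:
        seen_hashes.add(digest)
        archive.append((digest, body))  # stand-in for the WAR file + DB row
    return digest

seen, archive = set(), []
archive_body(b"<html>blocked</html>", seen, archive)
archive_body(b"<html>blocked</html>", seen, archive)  # exact duplicate: skipped
```

After both calls the archive holds a single entry, since the second body hashes to the same digest.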

These WAR archives can then be mined asynchronously for blockpages using the
fingerprint hunter command:

```
oonidata fphunt --data-dir tests/data/datadir/ --archives-dir warchives/ --parallelism 20
```

When a blockpage matching a known fingerprint is detected, the relevant
database row for that body is updated with the ID of the fingerprint which was
detected.

### Ground Truth generation

In order to establish if something is being blocked or not, we need some ground truth for comparison.

The goal of the ground truth generation task is to build a ground truth
database, which contains all the ground truths for every target that has been
tested in a particular day.

Currently it's implemented using the WebControlObservations, but in the future
we could also use other WebObservations.

Each ground truth database is actually just a sqlite3 database. For a given
day it's approximately 150MB in size, and we load it in memory when running
the analysis workflow.

### ExperimentResult generation

An experiment result is the interpretation of one or more observations with a
@@ -218,7 +158,7 @@ blocking https://facebook.com/ with the following logic:
- any TLS handshake with SNI facebook.com gets a RST

In this scenario, assuming the probe has discovered other IPs for facebook.com
through other means (e.g. through the test helper or DoH, as `web_connectivity` 0.5
does), we would like to emit the following experiment results:

- BLOCKED, `dns.bogon`, `facebook.com`
@@ -231,55 +171,3 @@ does), we would like to emit the following experiment results:

This way we are fully characterising the block in all the methods through which
it is implemented.
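The per-method outcomes above can be represented as simple tuples. This is a hypothetical shape (the `tcp.rst` and `tls.rst` labels are assumptions based on the scenario), not the actual ExperimentResult schema:

```python
# Hypothetical per-method outcomes for the facebook.com scenario above.
outcomes = [
    ("BLOCKED", "dns.bogon", "facebook.com"),
    ("BLOCKED", "tcp.rst", "facebook.com"),
    ("BLOCKED", "tls.rst", "facebook.com"),
]

def blocking_methods(outcomes):
    """Collect every method through which the block is implemented."""
    return sorted(method for verdict, method, _ in outcomes if verdict == "BLOCKED")

methods = blocking_methods(outcomes)
```

Collecting every method, rather than stopping at the first, is what makes the characterisation of the block complete.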

### Current pipeline

This section documents the current [ooni/pipeline](https://github.com/ooni/pipeline)
design.

```mermaid
graph LR
Probes --> ProbeServices
ProbeServices --> Fastpath
Fastpath --> S3MiniCans
Fastpath --> S3JSONL
Fastpath --> FastpathClickhouse
S3JSONL --> API
FastpathClickhouse --> API
API --> Explorer
```

```mermaid
classDiagram
direction RL
class CommonMeta{
measurement_uid
report_id
input
domain
probe_cc
probe_asn
test_name
test_start_time
measurement_start_time
platform
software_name
software_version
}
class Measurement{
+Dict test_keys
}
class Fastpath{
anomaly
confirmed
msm_failure
blocking_general
+Dict scores
}
Fastpath "1" --> "1" Measurement
Measurement *-- CommonMeta
Fastpath *-- CommonMeta
```
48 changes: 22 additions & 26 deletions scripts/build_docs.sh
@@ -12,35 +12,31 @@ strip_title() {
cat $infile | awk 'BEGIN{p=1} /^#/{if(p){p=0; next}} {print}'
}
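The awk one-liner above drops only the first Markdown heading line and keeps everything else, including later headings. A standalone demonstration, writing to a temporary file for illustration:

```shell
# Demonstration of the strip_title helper: only the first '#' heading is
# dropped; later headings and body text pass through unchanged.
strip_title() {
  infile=$1
  cat "$infile" | awk 'BEGIN{p=1} /^#/{if(p){p=0; next}} {print}'
}

printf '# Title\nintro text\n## Section\nbody\n' > /tmp/strip_title_demo.md
strip_title /tmp/strip_title_demo.md  # prints: intro text, ## Section, body
```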

generate_doc() {
local order="$1"
local slug="$2"
local input_file="$3"
local output_file="$4"
local title="$5"
local description="$6"

cat <<EOF>"$DOCS_ROOT/$output_file"
---
# Do not edit! This file is automatically generated
# version: $REPO_NAME/$input_file:$COMMIT_HASH
title: $title
description: $description
slug: $slug
sidebar:
order: $order
---
EOF
echo "[edit file](https://github.com/$REPO_NAME/edit/$MAIN_BRANCH/$input_file)" >> "$DOCS_ROOT/$output_file"
strip_title "$input_file" >> "$DOCS_ROOT/$output_file"
}
generate_doc 0 "data" "Readme.md" "00-index.md" "Accessing OONI data" "How to access OONI data"
generate_doc 1 "data/oonidata" "oonidata/Readme.md" "01-oonidata.md" "OONI Data CLI" "Using the oonidata command line interface"
generate_doc 2 "data/pipeline-design" "oonipipeline/Design.md" "02-pipeline-design.md" "OONI Data Pipeline design" "Design for OONI Pipeline v5"
generate_doc 3 "data/pipeline" "oonipipeline/Readme.md" "03-pipeline.md" "OONI Data Pipeline v5" "OONI Data Pipeline v5"
