Evaluate the analysis engine against the analysis we currently have #21

hellais opened this issue Dec 13, 2022 · 2 comments

hellais commented Dec 13, 2022

This is about setting up some form of evaluation criteria for the OONI Data analysis engine and measuring some key metrics to assess how well it is doing its job.

hellais self-assigned this Oct 6, 2023

hellais commented Oct 16, 2023

An initial evaluation has been done by focusing only on DNS-level anomalies. The findings from this initial investigation have been documented in an internal presentation made to the team here: https://docs.google.com/presentation/d/1rw7a02lpTj4CcguAz_nbqzNzkACKdFvMzTLrairAQ70/edit.

The approach we followed was to restrict the comparison to a location where we have good ground truth, which in this case was Russia, since there we can easily build ground truth from the official blocklists.
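
For illustration, the kind of query one could use to score the engine against such a ground truth might look roughly like the sketch below; every table and column name in it is hypothetical and only meant to convey the shape of the evaluation:

        -- Hypothetical tables: dns_verdicts(domain, predicted_blocked)
        --                      ru_blocklist_ground_truth(domain, actually_blocked)
        SELECT
            countIf(predicted_blocked = 1 AND actually_blocked = 1) AS true_positives,
            countIf(predicted_blocked = 1 AND actually_blocked = 0) AS false_positives,
            countIf(predicted_blocked = 0 AND actually_blocked = 1) AS false_negatives,
            true_positives / (true_positives + false_positives) AS precision,
            true_positives / (true_positives + false_negatives) AS recall
        FROM dns_verdicts
        INNER JOIN ru_blocklist_ground_truth USING (domain)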

Some ML-based analysis was also done as part of this.

The next steps for this issue involve:

  • extending it to support more than just DNS-based anomalies
  • expanding the ground truth to other countries (optional)
  • writing up the findings in a report
  • sharing the report with some field experts to collect feedback
  • setting up the analysis so it can easily be reproduced as we move forward with the analysis engine


hellais commented Nov 25, 2024

Progress is being made WRT comparing the new analysis against the old analysis for web_connectivity.

To set up a comparison between the two, I am running a query against the new analysis tables where the critical piece is the following:

        -- Classify each measurement by whether the new analysis outcome (final_outcome_label)
        -- agrees with the probe's own verdict (top_probe_analysis); 'null' when the probe had no verdict
        multiIf(
            final_outcome_label = 'ok' AND top_probe_analysis = 'false', 'consistent',
            startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', 'consistent',
            startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', 'consistent',
            startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', 'consistent',
            top_probe_analysis IS NULL, 'null',
            'inconsistent'
        ) AS probe_analysis_consistency

I am limiting the comparison to 25 days' worth of data.
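
For context, this is a minimal sketch of how that expression could sit inside the full query; the table name (analysis_web_measurement) and the timestamp column (measurement_start_time) are placeholders for whatever the new analysis tables actually use:

        SELECT
            multiIf(
                final_outcome_label = 'ok' AND top_probe_analysis = 'false', 'consistent',
                startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', 'consistent',
                startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', 'consistent',
                startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', 'consistent',
                top_probe_analysis IS NULL, 'null',
                'inconsistent'
            ) AS probe_analysis_consistency,
            count() AS cnt
        FROM analysis_web_measurement                              -- placeholder table name
        WHERE measurement_start_time > now() - INTERVAL 25 DAY     -- assumed timestamp column
        GROUP BY probe_analysis_consistency
        ORDER BY cnt DESC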

To start off, we look at how the new analysis and the probe analysis compare against each other overall:
[image: visualization (94)]

Right off the bat, we can see that in 75% of the cases both analyses agree about the outcome of a measurement, they disagree in 3% of the cases, and in 21% of the cases the probe analysis produced a NULL outcome (meaning it wasn't able to draw any conclusion).

We then try to get some understanding of the kinds of outcomes for which the probe analysis and the new analysis most often agree or disagree:
[image: visualization (95)]

In this chart we take the outcomes from the probe analysis and disaggregate them by how often they agree or disagree with the new analysis. Ignoring the http-diff analysis (which is currently not performed as part of the new analysis), we can see that the two mostly disagree for cases of tcp_ip and tls level blocking, while they agree a lot more for the dns type analysis.
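
A breakdown like this one boils down to counting agreement and disagreement per probe outcome, for example (same placeholder table and column names as above):

        SELECT
            top_probe_analysis,
            countIf(probe_analysis_consistency = 'consistent')   AS agree,
            countIf(probe_analysis_consistency = 'inconsistent') AS disagree
        FROM (
            SELECT
                top_probe_analysis,
                multiIf(
                    final_outcome_label = 'ok' AND top_probe_analysis = 'false', 'consistent',
                    startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', 'consistent',
                    startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', 'consistent',
                    startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', 'consistent',
                    top_probe_analysis IS NULL, 'null',
                    'inconsistent'
                ) AS probe_analysis_consistency
            FROM analysis_web_measurement                          -- placeholder table name
            WHERE measurement_start_time > now() - INTERVAL 25 DAY -- assumed timestamp column
        )
        GROUP BY top_probe_analysis
        ORDER BY agree + disagree DESC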

[image: visualization (96)]
Setting aside the cases in which the probe can't tell what's blocked, we should probably focus our attention on the dns.blocked and tcp.blocked cases.

The following heatmap shows a comparison between the new analysis and the probe analysis, which helps us highlight which areas deserve the most attention.
[image: visualization (97)]
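
The cross-tabulation feeding a heatmap like this one is essentially a two-dimensional GROUP BY over the two outcome columns (again with the placeholder names from above):

        SELECT
            final_outcome_label,
            top_probe_analysis,
            count() AS cnt
        FROM analysis_web_measurement                              -- placeholder table name
        WHERE measurement_start_time > now() - INTERVAL 25 DAY     -- assumed timestamp column
        GROUP BY final_outcome_label, top_probe_analysis
        ORDER BY cnt DESC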

To summarize:

  • It's reassuring that overall the new analysis disagrees with the probe analysis in only 3% of cases
  • The fact that we are able to associate a valid outcome with the 21% of cases where the probe analysis was NULL is already a pretty big win for the new analysis

More work is now needed to take a closer look at the other cases and understand which of the two analysis engines is correct.

Another approach we might consider is to apply a method similar to ensemble forecasting in meteorology, producing a final outcome that combines the probe analysis with the new analysis.
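
As a very rough illustration of the idea (not the actual combination rule we would adopt, and with the same placeholder table name as above, plus an assumed measurement_uid identifier column), such a combined outcome could be expressed as:

        SELECT
            measurement_uid,  -- assumed measurement identifier
            multiIf(
                -- the probe couldn't reach a conclusion: trust the new analysis
                top_probe_analysis IS NULL, final_outcome_label,
                -- the two engines agree: keep the (more granular) new analysis outcome
                final_outcome_label = 'ok' AND top_probe_analysis = 'false', final_outcome_label,
                startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', final_outcome_label,
                startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', final_outcome_label,
                startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', final_outcome_label,
                -- the two engines disagree: flag the measurement for closer inspection
                'needs_review'
            ) AS ensemble_outcome
        FROM analysis_web_measurement  -- placeholder table name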

hellais moved this to In Progress in Sprint Planning on Jan 8, 2025