Evaluate the analysis engine against the analysis we currently have #21

hellais opened this issue Dec 13, 2022 · 2 comments

hellais commented Dec 13, 2022

This is about setting up some form of evaluation criteria for the OONI Data analysis engine and measuring some key metrics to assess how well it is doing its job.

hellais self-assigned this Oct 6, 2023

hellais commented Oct 16, 2023

An initial evaluation has been done by focusing only on DNS-level anomalies. The findings from this initial investigation have been documented in an internal presentation made to the team here: https://docs.google.com/presentation/d/1rw7a02lpTj4CcguAz_nbqzNzkACKdFvMzTLrairAQ70/edit.

The approach we followed was to restrict the comparison to a location where we have good ground truth, which in this case was Russia, since there we can easily build ground truth from the official blocklists.
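
For illustration, the kind of query one could use to score the engine against such a ground truth might look roughly like the sketch below; every table and column name in it is hypothetical and only meant to convey the shape of the evaluation:

        -- Hypothetical tables: dns_verdicts(domain, predicted_blocked)
        --                      ru_blocklist_ground_truth(domain, actually_blocked)
        SELECT
            countIf(predicted_blocked = 1 AND actually_blocked = 1) AS true_positives,
            countIf(predicted_blocked = 1 AND actually_blocked = 0) AS false_positives,
            countIf(predicted_blocked = 0 AND actually_blocked = 1) AS false_negatives,
            true_positives / (true_positives + false_positives) AS precision,
            true_positives / (true_positives + false_negatives) AS recall
        FROM dns_verdicts
        INNER JOIN ru_blocklist_ground_truth USING (domain)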

Some ML-based analysis was also done as part of this.

The next steps for this issue involve:

  • extending it to support more than just DNS-based anomalies
  • expanding the ground truth to other countries (optional)
  • writing up the findings in a report
  • sharing the report with some field experts to collect feedback
  • setting up the analysis so it can easily be reproduced as we move forward with the analysis engine


hellais commented Nov 25, 2024

Progress is being made WRT comparing the new analysis against the old analysis for web_connectivity.

To set up a comparison between the two, I am running a query against the new analysis tables where the critical piece is the following:

        -- Classify each measurement by whether the new analysis outcome (final_outcome_label)
        -- agrees with the probe's own verdict (top_probe_analysis); 'null' when the probe had no verdict
        multiIf(
            final_outcome_label = 'ok' AND top_probe_analysis = 'false', 'consistent',
            startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', 'consistent',
            startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', 'consistent',
            startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', 'consistent',
            top_probe_analysis IS NULL, 'null',
            'inconsistent'
        ) AS probe_analysis_consistency

I am limiting the comparison to 25 days' worth of data.
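
For context, this is a minimal sketch of how that expression could sit inside the full query; the table name (analysis_web_measurement) and the timestamp column (measurement_start_time) are placeholders for whatever the new analysis tables actually use:

        SELECT
            multiIf(
                final_outcome_label = 'ok' AND top_probe_analysis = 'false', 'consistent',
                startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', 'consistent',
                startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', 'consistent',
                startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', 'consistent',
                top_probe_analysis IS NULL, 'null',
                'inconsistent'
            ) AS probe_analysis_consistency,
            count() AS cnt
        FROM analysis_web_measurement                              -- placeholder table name
        WHERE measurement_start_time > now() - INTERVAL 25 DAY     -- assumed timestamp column
        GROUP BY probe_analysis_consistency
        ORDER BY cnt DESC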

To start off, we look at how the new analysis and the probe analysis compare against each other overall:
[image: visualization (94)]

Right off the bat, we can see that in 75% of the cases both analyses agree about the outcome of a measurement, they disagree in 3% of the cases, and in 21% of the cases the probe analysis produced a NULL outcome (meaning it wasn't able to draw any conclusion).

We then try to get some understanding of the kinds of outcomes for which the probe analysis and the new analysis most often agree or disagree:
[image: visualization (95)]

In this chart we take the outcomes from the probe analysis and disaggregate them by how often they agree or disagree with the new analysis. Ignoring the http-diff analysis (which is currently not performed as part of the new analysis), we can see that the two mostly disagree for cases of tcp_ip and tls level blocking, while they agree a lot more for the dns type analysis.
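
A breakdown like this one boils down to counting agreement and disagreement per probe outcome, for example (same placeholder table and column names as above):

        SELECT
            top_probe_analysis,
            countIf(probe_analysis_consistency = 'consistent')   AS agree,
            countIf(probe_analysis_consistency = 'inconsistent') AS disagree
        FROM (
            SELECT
                top_probe_analysis,
                multiIf(
                    final_outcome_label = 'ok' AND top_probe_analysis = 'false', 'consistent',
                    startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', 'consistent',
                    startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', 'consistent',
                    startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', 'consistent',
                    top_probe_analysis IS NULL, 'null',
                    'inconsistent'
                ) AS probe_analysis_consistency
            FROM analysis_web_measurement                          -- placeholder table name
            WHERE measurement_start_time > now() - INTERVAL 25 DAY -- assumed timestamp column
        )
        GROUP BY top_probe_analysis
        ORDER BY agree + disagree DESC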

[image: visualization (96)]
Setting aside the cases in which the probe can't tell what's blocked, we should probably focus our attention on the dns.blocked and tcp.blocked cases.

The following heatmap shows a comparison between the new analysis and the probe analysis, which helps us highlight which areas deserve the most attention.
[image: visualization (97)]
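
The cross-tabulation feeding a heatmap like this one is essentially a two-dimensional GROUP BY over the two outcome columns (again with the placeholder names from above):

        SELECT
            final_outcome_label,
            top_probe_analysis,
            count() AS cnt
        FROM analysis_web_measurement                              -- placeholder table name
        WHERE measurement_start_time > now() - INTERVAL 25 DAY     -- assumed timestamp column
        GROUP BY final_outcome_label, top_probe_analysis
        ORDER BY cnt DESC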

To summarize:

  • It's reassuring that overall the new analysis disagrees with the probe analysis in only 3% of cases
  • The fact that we are able to associate a valid outcome with the 21% of cases where the probe analysis was NULL is already a pretty big win for the new analysis

More work is now needed to take a closer look at the other cases and understand which of the two analysis engines is correct.

Another approach we might consider is to apply a method similar to ensemble forecasting in meteorology, producing a final outcome that combines the probe analysis with the new analysis.
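
As a very rough illustration of the idea (not the actual combination rule we would adopt, and with the same placeholder table name as above, plus an assumed measurement_uid identifier column), such a combined outcome could be expressed as:

        SELECT
            measurement_uid,  -- assumed measurement identifier
            multiIf(
                -- the probe couldn't reach a conclusion: trust the new analysis
                top_probe_analysis IS NULL, final_outcome_label,
                -- the two engines agree: keep the (more granular) new analysis outcome
                final_outcome_label = 'ok' AND top_probe_analysis = 'false', final_outcome_label,
                startsWith(final_outcome_label, 'dns.') AND top_probe_analysis = 'dns', final_outcome_label,
                startsWith(final_outcome_label, 'tcp.') AND top_probe_analysis = 'tcp_ip', final_outcome_label,
                startsWith(final_outcome_label, 'tls.') AND top_probe_analysis = 'http-failure', final_outcome_label,
                -- the two engines disagree: flag the measurement for closer inspection
                'needs_review'
            ) AS ensemble_outcome
        FROM analysis_web_measurement  -- placeholder table name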

hellais moved this to In Progress in Sprint Planning on Jan 8, 2025