Look into applying some more rigorous approach to generating experiment results #84
Comments
Some work in progress on this front is being done on this branch: #85. In particular, see the notebook, which implements an early-stage version of the Bayes net: https://github.com/ooni/data/blob/bayes-net/oonipipeline/notebooks/web-analysis-bn.ipynb There are still a few critical theoretical hurdles to overcome, which I would like to pose as questions to people who have more experience with this, namely:
After more experimentation with the Bayesian network approach, and with a working PoC of it, I came to the conclusion that for the moment its performance is not going to scale to our use case without significant work to re-engineer the analysis pipeline. This led to the conclusion that, for the time being, it was probably best to roll back to an approach that is simpler and closer to what we had done before: a fuzzy-logic, rule-based classifier. Put in simpler terms, this is just a list of IF THEN clauses that produce the confidence estimates we have in a particular outcome being true. Through these we are effectively encoding the knowledge we have about certain signals in the measurements being a sign of blocking or not blocking. In terms of implementation, it is done directly as SQL queries, which has the benefit of being more performant than carrying data in and out of Python, and also makes the rules easier to inspect and update, since they all live in one place. Work related to this is done in the following PR: #99. I will be following up with more extensive documentation explaining how this whole system works.
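To make the "list of IF THEN clauses expressed as SQL" idea concrete, here is a minimal sketch. It is not the query from the PR: the table name `observations`, its columns, and the confidence values are all hypothetical and chosen only for illustration. It uses an in-memory SQLite database so it can be run standalone.

```python
import sqlite3

# Each WHEN branch is one IF THEN rule; the first matching rule wins.
# All column names and confidence values are hypothetical, not the OONI schema.
RULES_SQL = """
SELECT
    measurement_id,
    CASE
        WHEN dns_answer_is_bogon = 1 THEN 0.9
        WHEN tcp_failure = 'connection_reset' AND control_ok = 1 THEN 0.7
        WHEN tls_failure = 'timeout' AND control_ok = 1 THEN 0.6
        WHEN control_ok = 0 THEN 0.0
        ELSE 0.1
    END AS blocked_confidence
FROM observations
ORDER BY measurement_id
"""


def classify(rows):
    """Load observation rows into SQLite and apply the rules in a single query.

    Returns a dict mapping measurement_id to blocked_confidence.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "measurement_id TEXT, dns_answer_is_bogon INTEGER, "
        "tcp_failure TEXT, tls_failure TEXT, control_ok INTEGER)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?)", rows)
    result = dict(conn.execute(RULES_SQL).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        ("m1", 1, None, None, 1),                # bogon DNS answer
        ("m2", 0, "connection_reset", None, 1),  # TCP reset, control fine
        ("m3", 0, None, None, 0),                # control failed too
        ("m4", 0, None, None, 1),                # nothing suspicious
    ]
    for measurement_id, confidence in classify(sample).items():
        print(measurement_id, confidence)
```

Because all the rules sit in one `CASE` expression, changing a weight or adding a rule is a one-line edit to the query, which is the maintainability benefit described above.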
Probably a dumb question, but still worth asking: why not use embeddings to do some pre/post grouping, and apply labels based on the clusters that form?
That's kind of what we are doing, though the clustering and labeling process is currently done using a fuzzy rule-based system. You can find the list of what you could call embeddings in this mega SQL query, which are recomputed every day based on the observations: https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/analysis/web_analysis.py#L111. Examples of these are things like:
In the future it would be interesting to apply some ML to these feature vectors to see if it's possible to generate the labels/outcomes automatically. However, the biggest challenge in doing so is labelling the data through some form of ground truth.
I think the approach we have at the moment is working OK for our intents and purposes, so I am going to close this issue as done. Follow-up issues will be created as more progress is made on this front.
Currently, experiment results are semi-manually coded using Bayesian-style reasoning to come up with the weights.
It is, however, possible to do this using a more rigorous approach that makes use of well-established graph-based modeling systems such as Bayesian networks.
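As a rough illustration of the Bayesian-style reasoning behind such weights, the sketch below applies Bayes' rule to a single binary signal. The function name and all probabilities are hypothetical; they are not values used in the pipeline.

```python
def posterior_blocked(prior, p_signal_given_blocked, p_signal_given_ok):
    """Return P(blocked | signal) via Bayes' rule.

    prior                   -- P(blocked) before seeing the signal
    p_signal_given_blocked  -- P(signal | blocked)
    p_signal_given_ok       -- P(signal | not blocked)
    """
    # Total probability of observing the signal at all.
    evidence = (
        p_signal_given_blocked * prior
        + p_signal_given_ok * (1.0 - prior)
    )
    return p_signal_given_blocked * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: 10% prior, signal seen in 80% of blocked cases
    # and in 5% of unblocked cases.
    print(posterior_blocked(0.1, 0.8, 0.05))
```

With these made-up inputs the posterior is about 0.64. A Bayesian network generalizes this single update to many signals with explicit dependencies between them, rather than hand-tuning each weight.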
Work on this started a few months ago, and I had a very fruitful conversation about this topic with Joss, who provided key insight.
As part of this activity, the plan is to move this forward by doing some more modeling using Bayesian networks and seeing how it works.
Some sub-activities as part of this might include: