Very Wide Observation Rows + Experiment generation (#17)
As part of this PR I did a major rework of the observation generation system.

tl;dr

* Observations are being generated at a rate of ~5k measurements per second (i.e. it should be possible to reprocess the full dataset in less than 3 days).
* Bodies from the measurements are being archived in WAR files. This is currently what slows down observation generation the most and is what leads to the process stalling at the end with a huge queue of bodies to archive, so it needs some further optimisation.
* We are now able to generate Experiment results from the base observations using ground truths at a rate of 15k measurements per second (i.e. it should be possible to re-analyse the full OONI dataset in less than a day).

If you care to read more details, see below.

## Very Wide Observation Rows

Each Web Connectivity measurement ends up producing observations that are all of the same type and are written to the same DB table. This has the benefit that we don't need to look up the observations we care about across several disparate tables, but can do it all in the same one, which is incredibly fast. A side effect is that the tables can be a bit sparse (several columns are NULL), but this doesn't seem to present major difficulties.

The biggest challenge in this approach is figuring out which observations are related to each other so that they can be packed into the same row. To do this I kept the original observation model in place, which gives guarantees that the data structures are properly filled out, and then for each of them I look up the relevant related ones. Any observation that doesn't have a friend just ends up on its own database row, all alone.

## WAR Body writer

I worked on separating the process of archiving bodies from that of finding blocking fingerprints in them. Basically, during processing we create a WAR file containing the raw bodies and write to a dedicated database (or potentially the same one, in its own table, but I need to figure out how to get that to perform well). We are then able to separately scan through all these WAR files hunting for blockpage fingerprints, which is actually pretty fast. If we add new blockpage fingerprints we can just re-scan the WAR files looking for them and update the database column with what we found.

## Misc performance improvements

It turns out ClickHouse is not too happy when you do many writes per second to it. In their docs they state you shouldn't be making more than 1 insert request per second (https://clickhouse.com/docs/en/about-us/performance/#performance-when-inserting-data). I ran into this once I had optimised the processor to the point of hitting that limit: the ClickHouse process starts consuming a lot of CPU and memory and eventually just stops, dropping any connection attempt to it.

To overcome this I added the concept of a row buffer to the ClickhouseConnection database abstraction: rows accumulate until the buffer reaches a certain size and are only then flushed to the ClickHouse connection (see the sketch below). This worked surprisingly well and improved the overall performance of the reprocessing task by one order of magnitude.

Quite a few additional changes were made to how multiprocessing is done, plus small tweaks here and there based on iterations.
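For illustration, here is a minimal sketch of the row buffer idea, assuming a client object that exposes an `execute(query, rows)` method (e.g. `clickhouse_driver.Client`); the class and method names are simplified and don't necessarily match the actual `ClickhouseConnection` code in this PR:

```python
from collections import defaultdict


class BufferedClickhouseWriter:
    """Sketch: batch rows per table and flush them as bulk INSERTs,
    so we stay well below ClickHouse's one-insert-per-second guidance."""

    def __init__(self, client, row_buffer_size=10_000):
        # `client` is assumed to expose execute(query, rows),
        # e.g. clickhouse_driver.Client
        self.client = client
        self.row_buffer_size = row_buffer_size
        self._row_buffer = defaultdict(list)  # table name -> list of row tuples

    def write_row(self, table_name, row):
        self._row_buffer[table_name].append(row)
        if len(self._row_buffer[table_name]) >= self.row_buffer_size:
            self.flush(table_name)

    def flush(self, table_name=None):
        tables = [table_name] if table_name else list(self._row_buffer)
        for table in tables:
            rows = self._row_buffer[table]
            if not rows:
                continue
            # One bulk INSERT instead of thousands of tiny ones
            self.client.execute(f"INSERT INTO {table} VALUES", rows)
            self._row_buffer[table] = []

    def close(self):
        # Flush whatever is left so buffered rows are not lost on shutdown
        self.flush()
```

The real abstraction likely differs (explicit column lists, error handling, per-table buffer sizes), but the batching principle is the same.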
### Experiment result generation

I have added support for generating experiment results from the Very Wide Observation Rows. Basically we process data in batches of one day. For each day we first generate a ground truth database, which tells us what we should expect to see by looking at all the other web connectivity control measurements (in the future maybe from other measurements too). Generating the ground truths is actually pretty expensive (it used to be the most expensive task) and takes about 80-90 seconds for a given day.

We then need to efficiently look up the ground truths that are related to a specific measurement so that we can correlate them with what we are seeing in the data. In the beginning I went for the most naive solution of just putting them all in a list and doing a full scan of it for the relevant ground truths. Since the ground truths for a given day can be on the order of hundreds of thousands, this obviously turned out to be incredibly expensive.

I briefly experimented with building some hash maps over the data so that these lookups would be faster, but quickly realised I needed multiple indexes and I was basically re-inventing a database. I obviously could not use ClickHouse for this purpose, because doing many queries per second is not what it's made for. I then realised that I already had a database right inside the Python standard library: SQLite! So I quickly put together an in-memory ground truth database to hold all the ground truths and do the lookups. This made things significantly faster.

Yet this was not enough, because when you are processing a measurement you don't actually care about all the ground truths for the full day, only those relevant to that specific measurement. It's pretty easy to figure out which subset of ground truths you care about, so I implemented a system that does some pre-filtering and reduction of the full day's ground truths into only those related to a particular measurement (see the sketch below). Note: this part of the code was put together very quickly, is currently a bit racy and not so nice to look at, and needs some refactoring (the goal was just to see if it would work at all). After this last improvement, the performance went up by one order of magnitude.

All in all I'm glad to see that it's starting to come together, and it offers the prospect of a much more efficient and iterative way of doing analysis on OONI data. The current state of things is that Experiment Result generation is happening at a rate of 20k results per second, mostly bottlenecked by the database writes.

A significant amount of work still needs to happen on validating the data outputs so that we can check whether the analysis logic is good (I didn't spend much time on this after the big ground truth refactor, so it likely has some bugs). It's nice that the results are explainable: you can easily figure out which part of the analysis code generated a particular outcome through the blocking_meta key.
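To make the ground truth lookup concrete, here is a minimal sketch of the in-memory SQLite approach; the table layout, column names and pre-filtering keys (hostname, ip) are illustrative assumptions, not the actual schema used in the pipeline:

```python
import sqlite3


class GroundTruthDB:
    """Sketch of an in-memory SQLite store for one day of ground truths."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            """
            CREATE TABLE ground_truth (
                hostname TEXT,
                ip TEXT,
                http_success INTEGER,
                tls_success INTEGER,
                dns_success INTEGER
            )
            """
        )
        # Indexes are what make this faster than scanning a python list
        self.conn.execute("CREATE INDEX idx_gt_hostname ON ground_truth(hostname)")
        self.conn.execute("CREATE INDEX idx_gt_ip ON ground_truth(ip)")

    def add(self, rows):
        self.conn.executemany(
            "INSERT INTO ground_truth VALUES (?, ?, ?, ?, ?)", rows
        )

    def lookup(self, hostnames, ips):
        # Pre-filter the full day of ground truths down to only the ones
        # relevant to a single measurement.
        clauses, params = [], []
        if hostnames:
            clauses.append(f"hostname IN ({','.join('?' * len(hostnames))})")
            params += list(hostnames)
        if ips:
            clauses.append(f"ip IN ({','.join('?' * len(ips))})")
            params += list(ips)
        if not clauses:
            return []
        query = "SELECT * FROM ground_truth WHERE " + " OR ".join(clauses)
        return self.conn.execute(query, params).fetchall()


# Hypothetical usage with made-up values:
# gt_db = GroundTruthDB()
# gt_db.add([("example.com", "93.184.216.34", 1, 1, 1)])
# relevant = gt_db.lookup(hostnames=["example.com"], ips=["93.184.216.34"])
```

The idea is simply that the per-measurement pre-filtering happens as an indexed SQL query instead of a full scan of the day's ground truths.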