Open evidence and Wiki scraper #12
base: v3
Conversation
Since extracting text with basic formatting is relatively simple, it is much faster to just unzip the .docx file and parse the document.xml from scratch instead of using external libraries.
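For reference, a minimal sketch of that unzip-and-read step (assuming the jszip package; the actual code may use a different zip library):

```ts
// Minimal sketch, not the actual implementation: a .docx is just a zip
// archive, so pull out word/document.xml and return the raw OOXML string.
import { promises as fs } from 'fs';
import JSZip from 'jszip';

async function readDocumentXml(docxPath: string): Promise<string> {
  const data = await fs.readFile(docxPath);
  const zip = await JSZip.loadAsync(data);
  const entry = zip.file('word/document.xml');
  if (!entry) throw new Error(`document.xml not found in ${docxPath}`);
  return entry.async('string'); // feed this straight into the tokenizer
}
```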
Since the tokenizer is much faster, tokensToMarkup became the bottleneck in extracting cards; rewriting it to build a string instead of a cheerio DOM speeds up card extraction by 3-4x.
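A rough sketch of the string-building approach (the token shape and helper names here are illustrative, not the project's actual types):

```ts
// Illustrative token shape; the project's real tokens likely differ.
interface TextToken {
  text: string;
  format: string[]; // e.g. ['strong', 'mark'] for bold + highlight
}

// Build markup by appending to a plain string instead of constructing and
// serializing a cheerio DOM, which is where most of the time went.
function tokensToMarkup(tokens: TextToken[]): string {
  let html = '';
  for (const token of tokens) {
    const open = token.format.map((tag) => `<${tag}>`).join('');
    const close = [...token.format].reverse().map((tag) => `</${tag}>`).join('');
    html += open + escapeText(token.text) + close;
  }
  return html;
}

function escapeText(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
```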
Some text tokens override styles set by the style name, which wasn't handled properly.
Sometimes documents use the outlineLvl property instead of the heading style to indicate a heading; this is now handled properly.
With roughly 200,000 files, there is a decent chance of a collision between two file ids. Switching to 64-bit ids lowers the chance to around 1/1000.
Switch to using the htmlparser2 library that cheerio uses under the hood for parsing. Around 3 times faster, handles links properly, and the code is probably simpler.
Add full cite field to database
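A minimal sketch of driving htmlparser2's streaming Parser directly (the handler logic here is illustrative only):

```ts
// Rough sketch of extracting text with htmlparser2's streaming Parser
// (the same parser cheerio uses internally).
import { Parser } from 'htmlparser2';

function extractText(xml: string): string {
  let text = '';
  const parser = new Parser(
    {
      ontext(chunk) {
        text += chunk;
      },
      onclosetag(name) {
        // OOXML paragraphs (<w:p>) become line breaks in the output
        if (name === 'w:p') text += '\n';
      },
    },
    { xmlMode: true } // preserve tag case and skip HTML-specific rules
  );
  parser.write(xml);
  parser.end();
  return text;
}
```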
Deduplicator independently fetches evidence by id and creates DedupTasks
… into pr/D0ugins/13
Create db entity to hold groups of similar cards and store frequency
Most of the deduplication time is spent waiting for responses, so concurrent parsing is much faster, especially if ping to the Redis server is high. Locks processing on cards with the same sentences, and on updating the parent of a card whose parent is being updated, to prevent race conditions.
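One possible shape for the per-key locking, shown only as a sketch; `withLock` and the chaining scheme are assumptions, not the actual DedupTask code:

```ts
// Sketch of per-key locking so two tasks touching the same sentence (or the
// same parent card) never run at the same time. Names are illustrative.
const locks = new Map<string, Promise<void>>();

async function withLock<T>(key: string, task: () => Promise<T>): Promise<T> {
  const previous = locks.get(key) ?? Promise.resolve();
  const run = previous.then(() => task());
  // Store a never-rejecting tail so a failed task doesn't poison the key
  const tail = run.then(
    () => undefined,
    () => undefined
  );
  locks.set(key, tail);
  tail.then(() => {
    // Last one out removes the entry so the map doesn't grow forever
    if (locks.get(key) === tail) locks.delete(key);
  });
  return run;
}
```

Processing a card would then be wrapped in withLock keyed on the value being contended (a shared sentence bucket or the parent card id), so unrelated cards still run concurrently.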
With the htmlparser2-based parsing, simplifyTokens takes up around 1/3 of parsing time due to the slow lodash methods.
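As an illustration of the kind of rewrite (the real simplifyTokens logic may differ), merging adjacent tokens with identical formatting using plain loops instead of lodash's generic helpers:

```ts
// Illustrative only: collapse runs of tokens that share the same formatting.
interface TextToken {
  text: string;
  format: string[];
}

// Cheap positional comparison instead of lodash isEqual
function sameFormat(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) return false;
  return true;
}

function simplifyTokens(tokens: TextToken[]): TextToken[] {
  const result: TextToken[] = [];
  for (const token of tokens) {
    const last = result[result.length - 1];
    if (last && sameFormat(last.format, token.format)) last.text += token.text;
    else result.push({ ...token });
  }
  return result;
}
```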
Data about sentences is now stored inside binary strings. This is more compact than the previous approach, and more information is stored. Data is split into buckets so performance stays reasonable. Each bucket contains a sequence of 11-byte blocks of sentence information: the first 5 bytes are the key of the sentence within the bucket, the next 4 bytes are the card id, and the last 2 bytes are the index of the sentence in the card. Still uses the one-pass algorithm from the last implementation, but this method of storage is more flexible and allows for better algorithms.
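A sketch of that 11-byte block layout using Node Buffers; the names are assumptions based on the description above:

```ts
// Pack/unpack the 11-byte blocks: 5-byte sentence key, 4-byte card id,
// 2-byte sentence index, all big-endian.
interface SentenceEntry {
  key: number;    // key of the sentence within the bucket (5 bytes)
  cardId: number; // card id (4 bytes)
  index: number;  // index of the sentence within the card (2 bytes)
}

const BLOCK_SIZE = 11;

function packEntry({ key, cardId, index }: SentenceEntry): Buffer {
  const block = Buffer.alloc(BLOCK_SIZE);
  block.writeUIntBE(key, 0, 5);   // bytes 0-4: sentence key
  block.writeUInt32BE(cardId, 5); // bytes 5-8: card id
  block.writeUInt16BE(index, 9);  // bytes 9-10: sentence index
  return block;
}

function unpackBucket(bucket: Buffer): SentenceEntry[] {
  const entries: SentenceEntry[] = [];
  for (let offset = 0; offset + BLOCK_SIZE <= bucket.length; offset += BLOCK_SIZE) {
    entries.push({
      key: bucket.readUIntBE(offset, 5),
      cardId: bucket.readUInt32BE(offset + 5),
      index: bucket.readUInt16BE(offset + 9),
    });
  }
  return entries;
}
```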
Now takes into account the index of matches in both cards when determining whether a match is real or a coincidence. Quality of matches is now much higher; I don't think it's really worth it to do it in two passes. Maybe just look through EvidenceBucket entities occasionally to fix edge cases.
Restructures scraper to the same format as other modules
Trying to run this, but the application seems to hang (I think while trying to load the list of rounds).
Sorry, should have clarified. Loading the list of rounds to download takes a long time (something like 30 minutes, IIRC).
The wiki was just updated and the API overhauled; the terms now also ban bulk downloads of data. I have a dump of most of the relevant data, though.
Downloads round data and open source documents/cites from the debate wiki. Also downloads files from openev. Uses the wiki's REST API to pull the data. The main limiting factor is the response speed from the server, although it should only take a day or two to run. In total there are around 320k rounds across the wikis, with roughly half having open source documents, plus around 10k openev files.
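The download loop itself is not shown here, but the pacing might look something like this sketch (p-limit and `fetchRound` are assumptions/placeholders, not the scraper's actual API):

```ts
// Keep a handful of requests in flight at once; the server's response
// speed, not local CPU, is the limiting factor.
import pLimit from 'p-limit';

declare function fetchRound(roundId: string): Promise<unknown>; // hypothetical

async function downloadRounds(roundIds: string[]): Promise<unknown[]> {
  const limit = pLimit(8); // illustrative concurrency cap
  return Promise.all(roundIds.map((id) => limit(() => fetchRound(id))));
}
```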
Todo: