Support for multicore systems #32

Open
ghost opened this issue Apr 18, 2021 · 2 comments

Comments


ghost commented Apr 18, 2021

I run the filter on 2 ramdisks, each around 100GB, to speed up processing. Still, my 32-core machine sits at around 5% load and will take 12-16 hours to filter all entries (0.5ms average time per entry).

As I don't know Node.js well, I'm not sure I can add multithreading to this myself, but Node.js can certainly spawn child threads. Is there an easy 2-3 line addition that would spawn more threads? See https://nodejs.org/docs/latest/api/cluster.html
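To make the question concrete, here is a minimal sketch of what spreading the JSON parsing over threads could look like with Node's built-in `worker_threads` module. It is not wikibase-dump-filter's code: `chunkLines`, `filterChunk` and `filterInParallel` are hypothetical names, and it assumes the input has already been split into one entity per line with the dump's surrounding `[`, `]` and trailing commas removed.

```javascript
// Hypothetical sketch, not part of wikibase-dump-filter.
const { Worker } = require('worker_threads');
const os = require('os');

// Code run inside each worker: parse its share of the lines and keep the
// entities that have a claim `property` pointing at the entity `value`.
const workerSource = `
const { parentPort, workerData } = require('worker_threads');
const { lines, property, value } = workerData;
const kept = lines.filter((line) => {
  const entity = JSON.parse(line);
  const claims = (entity.claims && entity.claims[property]) || [];
  return claims.some((claim) =>
    claim.mainsnak && claim.mainsnak.datavalue &&
    claim.mainsnak.datavalue.value.id === value);
});
parentPort.postMessage(kept);
`;

// Split an array into at most `count` contiguous chunks.
function chunkLines(lines, count) {
  const size = Math.ceil(lines.length / count) || 1;
  const chunks = [];
  for (let i = 0; i < lines.length; i += size) {
    chunks.push(lines.slice(i, i + size));
  }
  return chunks;
}

// Run one worker over one chunk and resolve with the lines it kept.
function filterChunk(lines, property, value) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { lines, property, value },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// Filter all lines using one worker per chunk; input order is preserved.
async function filterInParallel(lines, property, value, workerCount = os.cpus().length) {
  const chunks = chunkLines(lines, workerCount);
  const results = await Promise.all(
    chunks.map((chunk) => filterChunk(chunk, property, value))
  );
  return results.flat();
}
```

A call such as `filterInParallel(lines, 'P31', 'Q5')` would resolve with the matching lines. Copying every line into a worker has its own cost, so a real implementation would have to stream the dump in batches and measure whether the threads actually pay off.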

I'm using server boards, but I guess lots of people doing this will be on a Ryzen system or similar.

Multicore decompression of the archive is doable with 'pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 | wikibase-dump-filter', which shows node at exactly 100% and the decompression at ~110%, so node is still the bottleneck. This halves the average to 0.25ms for me, but with "just" 64GB RAM on, say, a rented hosted machine with lots of cores, you could get the filter time down to under 30 minutes with multicore processing, greatly reducing costs for weekly updates.

Thanks for your great work, really sparing me days of processing,
R

Owner

maxlath commented Apr 18, 2021

It could be possible to use threads, but I haven't explored that option yet. I did explore the multi-process option though: see the documentation on parallelization. Note that wikibase-dump-filter will always be the bottleneck because of the operations on JSON (parsing and stringifying), so it's worth pre-filtering out any line you can: see the documentation on pre-filtering.
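As a rough illustration of why pre-filtering helps, here is a minimal sketch of the idea: a cheap substring test discards most lines before the expensive `JSON.parse` ever runs. The names `mayMentionEntity` and `filterLines` are hypothetical, not wikibase-dump-filter's API, and the sketch assumes compact dump JSON in which an item reference is serialized as `"id":"Q5"` with no spaces.

```javascript
// Hypothetical sketch, not part of wikibase-dump-filter.

// Cheap test done on the raw text. It can give false positives (the id may
// appear under another property), but under the compact-JSON assumption it
// never rejects a line that the full check below would keep.
function mayMentionEntity(line, entityId) {
  return line.includes(`"id":"${entityId}"`);
}

// Keep the lines whose entity has a claim `property` pointing at `entityId`.
function filterLines(lines, property, entityId) {
  const kept = [];
  for (const line of lines) {
    // Most lines stop here, without being parsed at all.
    if (!mayMentionEntity(line, entityId)) continue;
    const entity = JSON.parse(line);
    const claims = (entity.claims && entity.claims[property]) || [];
    const matches = claims.some((claim) =>
      claim.mainsnak && claim.mainsnak.datavalue &&
      claim.mainsnak.datavalue.value.id === entityId);
    if (matches) kept.push(line);
  }
  return kept;
}
```

The same test can be done before the data ever reaches node, for instance with a text-matching tool earlier in the pipe, which keeps the single-threaded JSON work for the lines that have a chance of matching.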

Author

ghost commented Apr 18, 2021

Thanks for pointing that out to me, but that actually makes it somewhat slower on my machine, and the CLI output gets weird.

I'm glad I have something to get simple Q-P-Q, maybe with labels, for meta-analysis, so I'm fine, but I'm really surprised your software is pretty much the only one out there doing this task.
