Filter purely by Q-Number #37

Open
ajinnah opened this issue Dec 21, 2021 · 1 comment

ajinnah commented Dec 21, 2021

Hello,

I have a large list of Wikidata IDs (Q-numbers), and I'd like to keep only these entities. Does this already exist, or would it be possible to implement?

Thank you!

Owner

maxlath commented Dec 21, 2021

It's not implemented but could be done fairly easily with grep (which will be much faster, see documentation on prefiltering):

# Create a pattern file with one id per line, each anchored to the start of a dump line
echo "Q1
Q2
Q3" | awk '{print "^{\"type\":\"item\",\"id\":\"" $1 "\","}' > qid_filter

# Filter the dump with that shortlist of ids
gzip -dc latest-all.json.gz | grep -E -f qid_filter | sed 's/,$//' > selected_entities.ndjson
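
For a long list kept in a file rather than typed inline, the same idea can be tried out on a small sample first. This is only a sketch: `qids.txt` and `sample_dump.json` are hypothetical file names with stand-in contents, assumed to mimic the one-entity-per-line layout of the real dump.

```shell
# Hypothetical input: qids.txt, one Q-number per line (stand-in values)
printf 'Q1\nQ3\n' > qids.txt

# Tiny stand-in for the real dump, mimicking its one-entity-per-line layout
cat > sample_dump.json <<'EOF'
[
{"type":"item","id":"Q1","labels":{}},
{"type":"item","id":"Q2","labels":{}},
{"type":"item","id":"Q3","labels":{}},
{"type":"item","id":"Q31","labels":{}}
]
EOF

# Build one anchored pattern per id; the closing quote and comma
# in the pattern keep Q3 from also matching Q31
awk '{print "^{\"type\":\"item\",\"id\":\"" $1 "\","}' qids.txt > qid_filter

# Same filtering step as the command above, minus the decompression
grep -E -f qid_filter sample_dump.json | sed 's/,$//' > selected_entities.ndjson
```

With these stand-in files, `selected_entities.ndjson` should contain only the Q1 and Q3 lines, with the trailing commas stripped.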
