filtering items with deprecated claims #28

Open
rwst opened this issue May 28, 2020 · 2 comments
Comments

rwst commented May 28, 2020

Running the command bzcat latest-all.json.bz2 | wikibase-dump-filter --simplify --claim 'P698' | jq '[.id,.claims.P698,.claims.P921]' -c > PMID.ndjson produces more than 30M lines like these:

["Q94880466",["19484558"],null]
["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]

where the first case is an item with a P698 claim but no P921 claims, and the second has both P698 and P921 claims. However, among these 30M lines at least six are different:
ralf@ark:~/wikidata> grep '\[\]' PMID.ndjson

["Q30573040",["23057853"],[]]
["Q30523792",["22888462"],[]]
["Q48835971",[],null]
["Q50125628",[],null]
["Q58616403",[],null]
["Q31128925",["27613570"],[]]

Note that three have an empty P698 array (which should not happen given the filter), and three have [] instead of null for a missing P921.
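To separate the two kinds of odd lines, a short script can classify each line of the output. This is only a minimal sketch: it assumes every line is a JSON array of the form [id, P698, P921] as produced by the jq expression above, and the `classify` helper and its labels are made up for illustration.

```python
import json

def classify(line):
    """Label one '[id, P698, P921]' line (hypothetical helper, not part of any tool)."""
    entity_id, pmids, topics = json.loads(line)
    if pmids == []:
        return "empty P698"   # matched the --claim filter, yet no value survived
    if topics == []:
        return "empty P921"   # [] where null would be expected for "no P921"
    return "ok"

# A few lines copied from the output quoted in this issue
sample = [
    '["Q94880466",["19484558"],null]',
    '["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]',
    '["Q30573040",["23057853"],[]]',
    '["Q48835971",[],null]',
]

labels = [classify(line) for line in sample]
for line, label in zip(sample, labels):
    print(label, line)
```

Running it over the whole PMID.ndjson file instead of `sample` would give a count per label.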

I'm not claiming there is a bug in wikibase-dump-filter, just that this needs investigating, and the ticket is a start. But maybe you have seen this and have an immediate explanation?

Author

rwst commented May 28, 2020

Ah got it, these were deprecated claims. Should they appear at all?

rwst changed the title from "filtering some wrong items for unknown reasons" to "filtering items with deprecated claims" on May 28, 2020
maxlath added a commit that referenced this issue May 28, 2020
Owner

maxlath commented May 28, 2020

The problem comes from an untested situation, where you use both a --claim filter and --simplify:

  • the claim filter ignored ranks, and so let the deprecated statements through
  • while the simplify function, by default, keeps only the truthy statements.

The latter behavior can be disabled by passing a keepNonTruthy=true flag to the simplify function, but we could also consider having the filter check the simplify option to decide whether it should drop a match that rests only on a non-truthy statement.
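The interaction described above can be modelled in a few lines. This is a simplified sketch, not the actual wikibase-dump-filter or simplify code: the statement shape ({"value", "rank"}), the function names, and the placeholder values are all assumptions, and "truthy" is taken in the usual Wikidata sense (preferred-rank statements if there are any, otherwise normal-rank ones, never deprecated ones).

```python
def truthy(statements):
    """Best-rank statements: preferred if any exist, else normal; deprecated never."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]

def simplify(claims, keep_non_truthy=False):
    """Reduce each property to a list of values (toy stand-in for the simplify step)."""
    out = {}
    for prop, statements in claims.items():
        kept = statements if keep_non_truthy else truthy(statements)
        out[prop] = [s["value"] for s in kept]
    return out

def rank_blind_match(claims, prop):
    """Filter that ignores ranks: any statement for the property counts."""
    return bool(claims.get(prop))

def rank_aware_match(claims, prop):
    """Filter that only counts truthy statements."""
    return bool(truthy(claims.get(prop, [])))

# Hypothetical item whose only P698 statement is deprecated
claims = {"P698": [{"value": "placeholder-pmid", "rank": "deprecated"}]}

print(rank_blind_match(claims, "P698"))        # passes the rank-blind filter
print(simplify(claims))                        # but simplifies to an empty list
print(rank_aware_match(claims, "P698"))        # a rank-aware filter would drop it
print(simplify(claims, keep_non_truthy=True))  # or simplify could keep the value
```

In this model, an item whose only P698 statement is deprecated passes the rank-blind filter and then comes out as [], while a property with no statements at all is simply absent from the simplified output, which jq reports as null. The same reasoning would explain the [] seen for P921.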
