Add number of encodings in blocking metadata #87

Open
joyceyuu opened this issue Jul 21, 2020 · 2 comments
Comments

@joyceyuu
Contributor

Before computing similarity scores and matching, we need to check that the number of encodings in the blocking data is consistent with the number of encodings in the CLK data.

Currently we either load the whole JSON just to get the number of encodings, or use ijson to count them iteratively.

It would be better to store the count in the metadata and simply read it from there whenever the count is needed.
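A minimal sketch of the proposed approach, assuming a hypothetical file layout with a `meta` section (the field names here are illustrative, not the actual clkhash/blocklib format): the count is written once into the metadata, so consumers never have to parse the full encoding list.

```python
import json

# Hypothetical encoding file layout (assumed for illustration): the metadata
# carries the encoding count alongside the encodings themselves.
raw = json.dumps({
    "meta": {"number_of_encodings": 3},
    "clks": ["enc-a", "enc-b", "enc-c"],
})

# Expensive path: parse the whole document just to count the encodings
# (with ijson this could be streamed rather than fully loaded).
count_by_loading = len(json.loads(raw)["clks"])

# Cheap path: read only the metadata field.
count_from_meta = json.loads(raw)["meta"]["number_of_encodings"]

assert count_by_loading == count_from_meta == 3
```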

@hardbyte
Collaborator

hardbyte commented Mar 7, 2021

Anonlink client already does this; I'm not sure any functionality needs to be added to blocklib.

cc @wilko77

@hardbyte hardbyte closed this as completed Mar 7, 2021
@wilko77
Collaborator

wilko77 commented Mar 7, 2021

There are two different counts:

  • The count of encodings as produced by clkhash.
  • The count of encodings as referenced in the blocking data.

Anonlink-client writes the first count into the metadata. I think Joyce was talking about the latter here.

The two counts can differ, depending on the choice of blocking algorithm. Whereas some algorithms map every entity to at least one block, P-Sig does not guarantee that because of its filtering.
For these probabilistic schemes we found a sanity check useful: count the entities that are part of at least one block. This gives an understanding of how aggressively the algorithm filters big blocks.
There is code somewhere in blocklib that does just that. This count allows you to compute the coverage of the blocking scheme (the percentage of entities referenced in at least one block). High coverage is a necessary condition for good linkage results: an entity that is not referenced in any block can never be matched.

As coverage is crucial for linkage success, it makes sense to expose this measure downstream.
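The coverage measure described above can be sketched as follows. Note this is an illustrative helper, not blocklib's actual API; the block structure (a mapping from block key to record indices) is an assumption.

```python
def blocking_coverage(blocks, num_encodings):
    """Fraction of entities referenced in at least one block.

    blocks: mapping from block key to a list of record indices.
    (Illustrative helper; not blocklib's actual API.)
    """
    covered = set()
    for indices in blocks.values():
        covered.update(indices)
    return len(covered) / num_encodings

# Record 3 was filtered out of every block (as P-Sig's filtering may do),
# so only 3 of the 4 encodings are covered.
blocks = {"b1": [0, 1], "b2": [1, 2]}
assert blocking_coverage(blocks, 4) == 0.75
```

A record missing from `covered` can never appear in a candidate pair, which is why coverage upper-bounds the achievable recall of the linkage.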

@hardbyte hardbyte reopened this Mar 7, 2021