Add number of encodings in blocking metadata #87

Open
joyceyuu opened this issue Jul 21, 2020 · 2 comments
Comments

@joyceyuu
Contributor

Before computing similarity scores and matching, we need to check that the number of encodings in the blocking data is consistent with the number of encodings in the CLK data.

Currently we either load the whole JSON just to get the number of encodings, or use ijson to count them iteratively.

It would be better to store the count in the metadata and simply read it from there whenever the count is needed.
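A minimal sketch of the proposed approach, assuming a hypothetical file layout with a `meta` section (the field names here are illustrative, not the actual clkhash/blocklib format): the count is written once into the metadata, so consumers never have to parse the full encoding list.

```python
import json

# Hypothetical encoding file layout (assumed for illustration): the metadata
# carries the encoding count alongside the encodings themselves.
raw = json.dumps({
    "meta": {"number_of_encodings": 3},
    "clks": ["enc-a", "enc-b", "enc-c"],
})

# Expensive path: parse the whole document just to count the encodings
# (with ijson this could be streamed rather than fully loaded).
count_by_loading = len(json.loads(raw)["clks"])

# Cheap path: read only the metadata field.
count_from_meta = json.loads(raw)["meta"]["number_of_encodings"]

assert count_by_loading == count_from_meta == 3
```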

@hardbyte
Collaborator

hardbyte commented Mar 7, 2021

Anonlink client already does this; I'm not sure any functionality needs to be added to blocklib.

cc @wilko77

@hardbyte hardbyte closed this as completed Mar 7, 2021
@wilko77
Collaborator

wilko77 commented Mar 7, 2021

There are two different counts:

  • The count of encodings as produced by clkhash.
  • The count of encodings as referenced in the blocking data.

Anonlink-client writes the first count into the metadata. I think Joyce was talking about the latter here.

The two counts can differ, depending on the choice of blocking algorithm. Whereas some algorithms map every entity to at least one block, P-Sig does not guarantee that because of its filtering.
For these probabilistic schemes we found a sanity check useful: count the entities that are part of at least one block. This gives an understanding of how aggressively the algorithm filters big blocks.
There is code somewhere in blocklib that does just that. This count allows you to compute the coverage of the blocking scheme (the percentage of entities referenced in at least one block). High coverage is a necessary condition for good linkage results: an entity that is not referenced in any block can never be matched.

As coverage is crucial for linkage success, it makes sense to expose this measure downstream.
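The coverage measure described above can be sketched as follows. Note this is an illustrative helper, not blocklib's actual API; the block structure (a mapping from block key to record indices) is an assumption.

```python
def blocking_coverage(blocks, num_encodings):
    """Fraction of entities referenced in at least one block.

    blocks: mapping from block key to a list of record indices.
    (Illustrative helper; not blocklib's actual API.)
    """
    covered = set()
    for indices in blocks.values():
        covered.update(indices)
    return len(covered) / num_encodings

# Record 3 was filtered out of every block (as P-Sig's filtering may do),
# so only 3 of the 4 encodings are covered.
blocks = {"b1": [0, 1], "b2": [1, 2]}
assert blocking_coverage(blocks, 4) == 0.75
```

A record missing from `covered` can never appear in a candidate pair, which is why coverage upper-bounds the achievable recall of the linkage.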

@hardbyte hardbyte reopened this Mar 7, 2021