Before computing similarity scores and matching, we need to check that the number of encodings in the blocking data is consistent with the number of encodings in the CLK data.
Currently we either load the whole JSON just to get the number of encodings, or use ijson to count the encodings iteratively.
It would be better to store the count in the metadata and read only that metadata whenever the count is needed.
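To illustrate the idea, here is a minimal sketch that records the count at write time so it can be read later without parsing the encodings. The sidecar file `<path>.meta.json`, the `clks` key, and the function names are assumptions made for this example only. They are not the service's actual storage layout or API.

```python
import json


def write_encodings(path, encodings):
    """Write encodings to `path` and record their count in a sidecar file."""
    with open(path, "w") as f:
        json.dump({"clks": encodings}, f)
    # Hypothetical layout: <path>.meta.json holds {"count": N}.
    with open(path + ".meta.json", "w") as f:
        json.dump({"count": len(encodings)}, f)


def read_count(path):
    """Return the stored count without opening the (possibly large) data file."""
    with open(path + ".meta.json") as f:
        return json.load(f)["count"]


def counts_consistent(clk_count, blocking_count):
    """Consistency check to run before similarity computation and matching."""
    return clk_count == blocking_count
```

With this in place the check only touches a few bytes of metadata per upload, regardless of how many encodings were uploaded.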
The count of encodings as referenced in the blocking data.
Anonlink-client writes the first count into the metadata. I think Joyce was talking about the latter one here.
Those two counts can differ, depending on the choice of blocking algorithm. Some algorithms map every entity to at least one block, but P-Sig does not guarantee that, because of its filtering.
For these probabilistic schemes we found it useful to run a sanity check: count the entities that are part of at least one block. This gives us an understanding of how aggressively the algorithm filters big blocks.
There is code somewhere in blocklib that does just that. This count allows you to compute the coverage of the blocking scheme (the percentage of entities referenced in at least one block). High coverage is a necessary condition for good linkage results: an entity that is not referenced in any block will never be matched.
As coverage is crucial for linkage success it makes sense to expose this measure downstream.
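As a rough sketch of the coverage measure described above (this is not blocklib's own code), assume the blocking data is a mapping from block identifier to the list of entity indices in that block. The function name and the data layout are hypothetical.

```python
def blocking_coverage(blocks, n_entities):
    """Fraction of entities that are referenced in at least one block.

    `blocks` is assumed to map a block id to a list of entity indices;
    `n_entities` is the total number of encodings in the CLK data.
    """
    if n_entities <= 0:
        raise ValueError("n_entities must be positive")
    covered = set()
    for members in blocks.values():
        covered.update(members)
    return len(covered) / n_entities
```

For example, with four entities of which only entities 0, 1 and 2 appear in some block, `blocking_coverage({"b1": [0, 1], "b2": [1, 2]}, 4)` gives 0.75. `len(covered)` is the second count discussed in this comment, so storing it alongside the total would let downstream code report coverage without re-reading the blocks.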