Add workflow for evaluating predictions #12

cthoyt · 2023-12-05T15:30:39Z

This workflow takes three inputs:

Positive, manually curated mappings
Negative, manually curated mappings
Predicted mappings

It then estimates several metrics for the predictions, such as accuracy, precision, recall, and F1. These are only estimates of the true metrics: the positive and negative manually curated mappings are likely incomplete, so there is some bias in which mappings were curated (e.g., I always curate the easiest first, which skews my manual curations towards positive calls).
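As a rough illustration of how such estimates could be derived from the three inputs, here is a minimal sketch. It is not the code in this PR: the function name `estimate_metrics`, the representation of a mapping as a (subject, object) pair, the confusion-matrix definitions in the comments, and the `ex.a`/`ex.b` identifiers are all assumptions made for illustration. It also assumes the positive and negative sets are disjoint.

```python
from typing import NamedTuple

# Assumption: a mapping is simplified to a (subject CURIE, object CURIE) pair.
Mapping = tuple[str, str]


class Metrics(NamedTuple):
    accuracy: float
    precision: float
    recall: float
    f1: float


def estimate_metrics(
    positive: set[Mapping], negative: set[Mapping], predicted: set[Mapping]
) -> Metrics:
    """Estimate metrics using only predictions that overlap the curated sets."""
    tp = len(predicted & positive)  # predicted and curated as correct
    fp = len(predicted & negative)  # predicted but curated as incorrect
    fn = len(positive - predicted)  # curated as correct but not predicted
    tn = len(negative - predicted)  # curated as incorrect and not predicted
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return Metrics(accuracy, precision, recall, f1)


# Hypothetical identifiers, not real curations.
positive = {("ex.a:1", "ex.b:1"), ("ex.a:2", "ex.b:2"), ("ex.a:3", "ex.b:3")}
negative = {("ex.a:4", "ex.b:9"), ("ex.a:5", "ex.b:8")}
predicted = {
    ("ex.a:1", "ex.b:1"),
    ("ex.a:2", "ex.b:2"),
    ("ex.a:4", "ex.b:9"),
    ("ex.a:6", "ex.b:6"),  # never curated, so it does not affect the metrics
}
metrics = estimate_metrics(positive, negative, predicted)
```

Predictions that were never curated, such as the last one in the example, cannot be scored, which is why the results are estimates rather than true metrics.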

Why is this useful?

Organizers of mapping tool competitions no longer have to write their own evaluation infrastructure. Instead, you do the following:

Curate (or generate) the gold standard correct and incorrect mappings
Ask the competitors to generate their predictions in SSSOM
Load them into this function and get results
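To make the SSSOM step concrete, the sketch below reads predictions from SSSOM TSV text using only the standard library. The helper `read_sssom_pairs` is hypothetical and is not the function added in this PR. It assumes the usual SSSOM TSV layout (optional `#`-prefixed metadata lines followed by a tab-separated table with `subject_id` and `object_id` columns), and the identifiers in the example are made up.

```python
import csv


def read_sssom_pairs(text: str) -> set[tuple[str, str]]:
    """Extract (subject_id, object_id) pairs from SSSOM TSV text."""
    # Skip blank lines and the commented metadata header.
    lines = [
        line
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    return {(row["subject_id"], row["object_id"]) for row in reader}


# Hypothetical competitor submission, built line by line to keep tabs explicit.
example_sssom = "\n".join(
    [
        "#mapping_set_id: https://example.org/predictions",
        "\t".join(["subject_id", "predicate_id", "object_id", "mapping_justification"]),
        "\t".join(["ex.a:1", "skos:exactMatch", "ex.b:1", "semapv:LexicalMatching"]),
        "\t".join(["ex.a:2", "skos:exactMatch", "ex.b:2", "semapv:LexicalMatching"]),
    ]
)
pairs = read_sssom_pairs(example_sssom)
```

This keeps only the subject and object of each mapping. A real evaluation would probably also need to take the predicate into account.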

Demonstration

This also comes with a demonstrator that compares a combination of first-party ontology curations and third-party Biomappings curations against lexical mapping predictions made by Gilda. It reports the following when applied to a small number of OBO Foundry ontologies.

prefix	completion	accuracy	precision	recall	$F_1$
chebi	10.8%	98.0%	98.8%	99.1%	99.0%
cl	28.3%	53.7%	90.8%	47.9%	62.7%
clo	52.6%	34.9%	70.0%	38.9%	50.0%
doid	30.1%	26.8%	92.2%	26.3%	40.9%
go	38.0%	80.0%	81.8%	96.8%	88.7%
maxo	44.6%	86.4%	100.0%	86.4%	92.7%
uberon	6.3%	11.2%	98.5%	11.1%	20.0%
vo	66.4%	79.1%	91.7%	77.2%	83.8%

Completion refers to the percentage of predicted mappings that appear in the curated sets (both positive and negative). A higher completion reduces the impact of curation bias. E.g., a completion of 100% means that the metrics are unbiased.
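The completion described above can be sketched as a set overlap. The function name `completion` and the example identifiers are assumptions for illustration, not the PR's actual code.

```python
Mapping = tuple[str, str]  # assumption: (subject CURIE, object CURIE)


def completion(
    positive: set[Mapping], negative: set[Mapping], predicted: set[Mapping]
) -> float:
    """Fraction of predicted mappings that appear in either curated set."""
    if not predicted:
        return 0.0
    curated = positive | negative
    return len(predicted & curated) / len(predicted)


# Hypothetical identifiers, not real curations.
curated_positive = {("ex.a:1", "ex.b:1"), ("ex.a:2", "ex.b:2")}
curated_negative = {("ex.a:4", "ex.b:9")}
predictions = {
    ("ex.a:1", "ex.b:1"),
    ("ex.a:2", "ex.b:2"),
    ("ex.a:4", "ex.b:9"),
    ("ex.a:6", "ex.b:6"),  # not curated yet
}
ratio = completion(curated_positive, curated_negative, predictions)
```

In this example three of the four predictions have been curated, so the completion is 75%.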

Note that lexical matching has pretty high precision, i.e., most of the predictions it makes are right, but it is more prone to false negatives, so accuracy can vary. Some observations:

The many false negatives lead to the DOID accuracy being pretty low.
ChEBI has no curations outside of Biomappings, so the number of false negatives is zero, meaning that the accuracy is a less useful metric (TBD, how to communicate that in the table).
CLO has a large number of duplicate terms, which results in an artificially low precision.

Caution

Mapping shouldn't be a competition. Make your predictions, curate them, contribute them to Biomappings or directly upstream, then everyone benefits and we don't have to keep playing this game.

matentzn · 2023-12-05T17:53:01Z

Wow this is such a cool idea.. Awesome man!

codecov · 2024-05-02T15:21:36Z

Codecov Report

Attention: Patch coverage is 0%, with 98 lines in your changes missing coverage. Please review.

❗ No coverage uploaded for pull request base (main@8d1d4b4).

Files	Patch %	Lines
src/semra/evaluate_prediction.py	0.00%	98 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #12   +/-   ##
=======================================
  Coverage        ?   28.57%           
=======================================
  Files           ?       32           
  Lines           ?     2390           
  Branches        ?      488           
=======================================
  Hits            ?      683           
  Misses          ?     1666           
  Partials        ?       41


cthoyt added 3 commits December 5, 2023 16:30

Create evaluate_prediction.py

e4a1d67

Incorporate upstream curated stuff

9f0de4e

Update evaluate_prediction.py

969b749

cthoyt added 2 commits January 22, 2024 11:42

Update evaluate_prediction.py

008d457

Merge branch 'main' into evaluate-predictions

39325ee
