Cut down by a lot
zimolzak committed Jul 1, 2024
1 parent f19cb28 commit a42ae74
Showing 1 changed file with 33 additions and 148 deletions.
181 changes: 33 additions & 148 deletions zimolzak-data-quality-2024.md
@@ -23,58 +23,28 @@ improvement, and identify methods to access the data.
2. Understand the general processes used to appraise data quality.




# Introduction

## About me

| Yrs | Research activities | Clinical activities |
|-----|-------------------------------------|------------------------------|
| 3+1 | n/a | Internal medicine residency |
| 2+1 | MMSc biomedical informatics | Outpatient urgent care |
| 4 | VA Boston: Clinical trials | Hospitalist |
| 5 | BCM & VA Houston: Health services research | Hospitalist |
::: columns
:::: column
- Internal medicine residency
- MMSc biomedical informatics
- VA Boston: Clinical trials, hospitalist, urgent care
- BCM & VA Houston: Health services research & hospitalist

What is **Clinical research informatics?**

- I make various clinical research studies "go," using existing data.
- I make various clinical research studies "go," using existing data.[^MIT]
- "Phenotyping" using electronic health record **(EHR)** data


## A detailed reference about secondary use[^MIT]
::::
:::: column
![](img/book.jpg){ height=75% }
::::
:::

[^MIT]: MIT Critical Data. *Secondary Analysis of Electronic Health Records.*
Springer; 2016. [**Click here** for free access!](https://link.springer.com/book/10.1007/978-3-319-43742-2)

![](img/book.jpg){ height=75% }


## Layers of data quality (where things can go wrong)

My informal classification scheme, arranged from "little picture" to "big picture":

1. The data themselves (contents) are flawed:
    - occasional errors, typos, misunderstandings, *etc.*
    - low-fidelity extraction
    - partly or mostly missing
    - rampant errors
2. Data exist but are called 130 different things.
3. Data fields have misleading names (the names don't mean what clinicians think).
4. Data exist only in "free text":
    - Data can be auto-extracted, but we must build that pipeline.
    - Data exist but need human judgment to extract.
5. The data you want aren't in here at all.


## What "data cleaning" means

If someone says "data cleaning," I recommend *having them explain* what they mean!

It's not "just filtering out obvious errors" like height = 6.1 inches.

It's not "throwing away outliers."




@@ -124,19 +94,7 @@ EHR data is **not the only way to do your Inquiry project!** Leave adequate time



# Data quality frameworks

## Examples of simple entry errors (what many people think "data cleaning" is)

::: columns
:::: column
![28 inches is a **plausible** length for a 6--13 month-old, not a retired veteran. Happens to be 72 *centimeters!*](img/72cm.jpg)
::::
:::: column
![Warning: Are you sure you want to correct the obviously wrong BMI of 4484?](img/5-9inch.jpg)
::::
:::

# Data quality domains

## Data quality domains

@@ -159,7 +117,7 @@ quality assessment\ldots."
[^Lewis]: Lewis AE, *et al.* Electronic health record data quality assessment and tools: a systematic review. *J Am Med Inform Assoc.* 2023;30(10):1730--1740. [PMID: 37390812](https://pubmed.ncbi.nlm.nih.gov/37390812/)


## Definitions 1--5
## Definitions 1--5 (the most common dimensions)

Correctness:

@@ -193,22 +151,6 @@ Currency:
: The accuracy of the EHR data for the time at which it was recorded and how up to date the data are. "Timeliness."[^Lewis]


## Summary of papers discussing domains of data quality

Author/yr. $\to$ **Lewis23** Weiskopf13 Kahn16 Feder18 Wang21
-------- ---- ---- ---- ---- ----
Correctness + + +
\ Concordance + + "consistency"
\ **Plausibility** + + + "credibility" +
**Completeness** + + + + +
\ Bias +
Conformance + + +
Currency + + +

**Completeness and plausibility seem to be everyone's favorites**
across this table!




# Domains: Completeness, Bias
@@ -237,19 +179,6 @@ will change, and walk away forever.
[^mini]: Raebel MA, Haynes K, Woodworth TS, *et al.* Electronic clinical laboratory test results data tables: lessons from Mini-Sentinel. *Pharmacoepidemiol Drug Saf.* 2014;23(6):609--618. [PMID: 24677577](https://pubmed.ncbi.nlm.nih.gov/24677577/)


## Missing data in general: Bias

- This phenomenon is under-recognized. People think *missing data* means, "The lab measured the patient's serum sodium, but I can't access the result."

- But *missing* also means "not checked at all." One example: Tests get checked for a reason, and **more frequently for sick patients.** My serum sodium exists, but it was not measured on any day in 2024. Large gaps in time $\to$ "Was this an acute or slow change?"

- Potentially massive threat to validity.

- There is no one right way to handle missing data, but several wrong ways. Detailed methods are out of scope for this talk. Observational data are tricky. Epidemiology and statistics professionals are there for a reason.

**EHR data do not tell the whole story!**
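
One way to make "not checked at all" visible is to look at the gaps between consecutive measurements, not only at empty fields. The sketch below is purely illustrative: the dates, the function name `measurement_gaps`, and the 365-day threshold are all assumptions, not part of any real pipeline.

```python
from datetime import date


def measurement_gaps(dates, max_gap_days=365):
    """Return (start, end, days) for every gap between consecutive
    measurements that is longer than max_gap_days."""
    ordered = sorted(dates)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        days = (later - earlier).days
        if days > max_gap_days:
            gaps.append((earlier, later, days))
    return gaps


# Hypothetical serum sodium draw dates for one patient (nothing in 2024).
sodium_dates = [
    date(2022, 3, 1),
    date(2022, 3, 4),
    date(2023, 1, 10),
    date(2025, 2, 2),
]

long_gaps = measurement_gaps(sodium_dates)
for start, end, days in long_gaps:
    print(f"No sodium measured from {start} to {end} ({days} days)")
```

A gap like this says nothing about *why* the test was not ordered, so it is a prompt for a conversation with an epidemiologist or statistician, not a fix.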


## When data aren't in the medical record at all

You might know\ldots But you don't know\ldots
@@ -261,26 +190,23 @@ The patient's ZIP code. This *individual* patient's income.
![Real prescription fills of 20 patients. What happens during those gaps?](img/statinFills.png){height=50%}


## Missing data in general: Bias

# Domain: Conformance
- This phenomenon is under-recognized. People think *missing data* means, "The lab measured the patient's serum sodium, but I can't access the result."

## One approach (Mini-Sentinel)[^mini]
- But *missing* also means "not checked at all." One example: Tests get checked for a reason, and **more frequently for sick patients.** My serum sodium exists, but it was not measured on any day in 2024. Large gaps in time $\to$ "Was this an acute or slow change?"

- Potentially massive threat to validity.

### It was harder than they expected to "just merge labs" from data partners.
- There is no one right way to handle missing data, but several wrong ways. Detailed methods are out of scope for this talk. Observational data are tricky. Epidemiology and statistics professionals are there for a reason.

**EHR data do not tell the whole story!**

LOINC is a code that is supposed to take care of this, but\ldots

> [S]ome data partners found LOINC associated with essentially all
> results, others had LOINC associated with some results, and others
> found **no LOINC in source data.**
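
A quick way to see this kind of variation is to compute, per data partner, the fraction of results that carry any LOINC code. This is a minimal sketch under stated assumptions: the partner labels, the rows, and the function `loinc_coverage` are hypothetical, and this is not the Mini-Sentinel method.

```python
def loinc_coverage(results):
    """Fraction of lab results per partner with a non-empty LOINC code.

    `results` is an iterable of (partner, loinc_code) pairs; a code of
    None or "" counts as missing.
    """
    totals = {}
    coded = {}
    for partner, loinc in results:
        totals[partner] = totals.get(partner, 0) + 1
        if loinc:
            coded[partner] = coded.get(partner, 0) + 1
    return {p: coded.get(p, 0) / n for p, n in totals.items()}


# Made-up rows: partner A codes everything, B some, C nothing.
rows = [
    ("A", "2951-2"), ("A", "2951-2"),
    ("B", "2951-2"), ("B", None),
    ("C", None), ("C", ""),
]

coverage = loinc_coverage(rows)
for partner, fraction in sorted(coverage.items()):
    print(f"Partner {partner}: {fraction:.0%} of results have a LOINC code")
```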

### How they follow data quality (sounds like "manually"):

> Checks included assessment of variable **completeness,** consistency,
> content, **alignment** with specifications, patterns, and trends. Data
> distributions are **examined** over time within and between [data]
> refreshes.

# Domain: Conformance

## Lab units (Mini-Sentinel): 12 data partners = 67 units!

@@ -294,16 +220,21 @@ LOINC is a code that is supposed to take care of this, but\ldots
[^Nate]: Fillmore N, Do N, Brophy M, Zimolzak A. Interactive Machine Learning for Laboratory Data Integration. *Stud Health Technol Inform.* 2019;264:133--137. [PMID: 31437900](https://pubmed.ncbi.nlm.nih.gov/31437900/)


## Unexpected data naming: I just wanted to find ER discharge against medical advice\ldots

![](img/discharge1.png){width=200px} ![](img/discharge2.png){width=200}\

![I'm sorry that I didn't know to look under `EDISTrackingCode`!](img/discharge3.png){width=300px}

# Domain: Plausibility

## Examples of simple entry errors (what many people think "data cleaning" is)

::: columns
:::: column
![28 inches is a **plausible** length for a 6--13 month-old, not a retired veteran. Happens to be 72 *centimeters!*](img/72cm.jpg)
::::
:::: column
![Warning: Are you sure you want to correct the obviously wrong BMI of 4484?](img/5-9inch.jpg)
::::
:::
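
A plausibility check for the height example might look like the sketch below. The 48--84 inch adult range, the function name, and the messages are assumptions chosen for illustration. The check only *flags* a value and suggests a possible unit mix-up; it deliberately does not overwrite anything.

```python
CM_PER_INCH = 2.54


def classify_adult_height(recorded_in, low=48.0, high=84.0):
    """Classify a recorded adult height that is nominally in inches.

    The low/high bounds are illustrative assumptions, not clinical rules.
    """
    if low <= recorded_in <= high:
        return "plausible"
    # Someone typed inches into a centimeter field, so the stored
    # value was divided by 2.54 (e.g. 72 "cm" shown as about 28 in).
    if low <= recorded_in * CM_PER_INCH <= high:
        return "implausible: maybe inches entered in a centimeter field"
    # Someone typed centimeters into an inch field (e.g. 178).
    if low <= recorded_in / CM_PER_INCH <= high:
        return "implausible: maybe centimeters entered in an inch field"
    return "implausible: no obvious unit explanation"


for height in (70, 28, 178, 6.1):
    print(height, "->", classify_adult_height(height))
```

Flagging instead of correcting keeps a human in the loop, which matters because the "obvious" repair is itself a guess.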

# Domain: Plausibility

## Statistical approach to data quality in the Million Veteran Program[^MVP]

@@ -341,52 +272,6 @@ record data with an application to the VA million veteran program.



# Artificial intelligence for data quality?

## "Let's just do\ldots"

AI, machine learning, natural language processing, *etc.,* for
improving **completeness** by extracting data from text & images:

### Don't assume natural language processing will go according to plan!

- Humans are maddeningly creative at expressing the same concept with many different phrasings.

- Notes have typos, nonstandard abbreviations, and incorrect
information, just like "structured" data.

- Not typos but transcription (or other) errors, nearly undetectable to the untrained: "Intrathecal DepoCyt" $\to$ "Intrathecal etoposide"


## Automated information extraction from text[^Ryu]

**Rules-based and machine learning approaches work!** But the problem was selected carefully. (Don't bite off more than you can chew.)

![](img/ryu.jpg){ height=60% }

[^Ryu]: Ryu JH, Zimolzak AJ. Natural Language Processing of Serum Protein Electrophoresis Reports in the Veterans Affairs Health Care System. *JCO Clin Cancer Inform.* 2020;4:749--756. [PMID: 32813561](https://pubmed.ncbi.nlm.nih.gov/32813561/)


## Machine learning: harder than people think

Labeling data is *expensive!* How did Google/Verily train a convolutional neural net to interpret retinal fundus photographs?[^Gulshan]

![](img/labeling.png)

- **476,000 to 989,000** retinal imaging reads. Assume 44 reads / hour.[^rate]

- That equals 5--11 *working years,* or **\$1.4--2.8
million!** (Before any computing at all. The AI is *just* to score images for
"referable" diabetic retinopathy. It can assess no other features
of the retina whatsoever.)
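
The arithmetic behind those figures can be sketched as follows. The read counts and the 44 reads per hour come from the slide; the 2,000-hour working year and the \$125 hourly rate are assumptions picked because they are consistent with the stated totals, not values taken from the cited papers.

```python
READS_LOW, READS_HIGH = 476_000, 989_000  # from the slide
READS_PER_HOUR = 44                       # from the slide
HOURS_PER_WORKING_YEAR = 2_000            # assumption
DOLLARS_PER_HOUR = 125                    # assumption


def labeling_cost(reads):
    """Return (hours, working_years, dollars) needed to read `reads` images."""
    hours = reads / READS_PER_HOUR
    years = hours / HOURS_PER_WORKING_YEAR
    dollars = hours * DOLLARS_PER_HOUR
    return hours, years, dollars


for reads in (READS_LOW, READS_HIGH):
    hours, years, dollars = labeling_cost(reads)
    print(
        f"{reads:,} reads: {hours:,.0f} hours, "
        f"{years:.1f} working years, ${dollars / 1e6:.1f} million"
    )
```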

[^rate]: Kolomeyer AM *et al.* *Int J Telemed Appl.* 2012;2012:806464. [PMID: 23316224](https://pubmed.ncbi.nlm.nih.gov/23316224/)

[^Gulshan]: Gulshan V, *et al.* *JAMA.* 2016;316(22):2402--2410. [PMID: 27898976](https://pubmed.ncbi.nlm.nih.gov/27898976/)




# Conclusion

## Reusing EHR data is not what you may think\ldots