0101-process-ahrq-del-test.Rmd
---
title: "AHRQ Deliveries process testing"
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, output_dir = "docs") })
---
By **Christian McDonald**, Assistant Professor of Practice\
School of Journalism and Media, Moody College of Communication\
University of Texas at Austin
---
The purpose of this notebook is to process test data from multiple quarters of [THCIC in-patient public use data files](https://www.dshs.texas.gov/thcic/hospitals/Inpatientpudf.shtm) into a single data file of deliveries without complications. This requires importing and applying several filtering options.
The methods in this notebook are used in `01-process-ahrq-del-loop` to process multiple years of data. This notebook cannot support that amount of data, but was used to check various steps in the filtering process for the main script. Please see `01-process-ahrq-del-loop` for additional details.
There is another notebook `00-process-lists` where various AHRQ lists of ICD-10 and other codes are defined separately. Those values are written out to the `procedures-lists` folder as .rds and .csv files and then imported into this notebook and others. See that notebook to inspect the lists.
```{r setup, echo=T, results='hide', message=F, warning=F}
library(fs)
library(tidyverse)
```
## Set up import
We search through the `data-test` folder to build a list of files to import into this notebook. The test data was created using the first 10,000 rows from one quarter in each of four years, 2016-2019.
```{r dirs_test}
# set up test data
test_data_dir <- "data-test"
test_tsv_files <- dir_ls(test_data_dir, recurse = TRUE, regexp = "test_base1")
test_tsv_files
```
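As a minimal sketch of how the recursive search behaves, the example below builds a throwaway folder tree and lists only the files whose path matches `test_base1`. The folder and file names are hypothetical stand-ins, not the real contents of `data-test`.

```r
library(fs)

# Hypothetical layout: one sub-folder per release inside a temporary directory.
demo_dir <- path(tempdir(), "pudf-demo")
dir_create(path(demo_dir, c("2016q1", "2017q1")))
file_create(path(demo_dir, "2016q1", "test_base1_2016q1.txt"))
file_create(path(demo_dir, "2016q1", "test_base2_2016q1.txt"))
file_create(path(demo_dir, "2017q1", "test_base1_2017q1.txt"))

# recurse = TRUE walks the sub-folders; regexp keeps only matching paths.
demo_files <- dir_ls(demo_dir, recurse = TRUE, regexp = "test_base1")

# Show just the file names that were kept.
path_file(demo_files)
```

The `test_base2` file is skipped because the pattern is matched against each path, so the same call picks up new quarters as they are added without editing the chunk.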
## Import the base1 files
At this time, our analysis utilizes only one (PUDF_base1) of several files in the release for each quarter.
Of note:
- There is a trailing tab on each row, which brings in an unnecessary column. This is removed with `col_skip()`. The `EMERGENCY_DEPT_FLAG` col was introduced in 2017, so we have to remove two different "last columns".
- We set the default type to `col_character()` because some cols would otherwise be guessed as logical. We then convert the cols that need it (those with `_CHARGES` in the name) to numbers.
```{r import, echo=T, results='hide', message=F, warning=F}
# warnings are suppressed, so check problems()
# add/remove test_ as necessary
# X167/X168 are the names readr 1.x gives an unnamed trailing column;
# readr 2.0+ names such columns ...167/...168 instead
base1 <- test_tsv_files %>%
  map_dfr(
    read_tsv,
    col_types = cols(
      .default = col_character(),
      X168 = col_skip(),
      X167 = col_skip()
    )
  ) %>%
  mutate_at(
    vars(contains("_CHARGES")), as.numeric
  )

# number of rows
base1 %>% nrow()

# klaxon for import complete
# beepr::beep(3)
```
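The character-to-number step can be sketched on toy data. This version uses `across()`, which is the newer dplyr spelling of `mutate_at()`; it is shown as an alternative, not as what the chunk above runs. The values are made up for illustration.

```r
library(dplyr)

# Toy data: every column arrives as character, as with .default = col_character().
toy <- data.frame(
  RECORD_ID = c("001", "002"),
  TOTAL_CHARGES = c("1500.25", "320"),
  TOTAL_NON_COV_CHARGES = c("0", "12.5"),
  stringsAsFactors = FALSE
)

# Convert only the columns with "_CHARGES" in the name; leave the rest as text.
toy_num <- toy %>%
  mutate(across(contains("_CHARGES"), as.numeric))

str(toy_num)
```

Keeping ID-like columns as character preserves leading zeros, which a numeric guess would silently drop.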
## Filtering for deliveries
### Filtering multiple columns, multiple conditions
The logic here looks through a number of columns for a number of ICD codes.
In this case, we are looking at all columns with "DIAG" in the name (excluding those that start with `POA`) for values in the `delocmd_list`, which comes from "DELOCMD*" in our IQI 33 reference. See `01-process-lists` for details.
Then we import the DELOCMD list and filter for it.
```{r deliveries}
delocmd_list <- read_rds("procedures-lists/ahrq_delocmd.rds") %>% .$delocmd

del <- base1 %>%
  filter_at(
    vars(
      matches("_DIAG"),
      -starts_with("POA")
    ),
    any_vars(
      . %in% delocmd_list
    )
  )

del %>% nrow()
```
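To make the "any of these columns contains any of these codes" logic concrete, here is a small sketch on toy data. It uses `if_any()`, the newer dplyr equivalent of `filter_at()` with `any_vars()`. The two codes and the three column values are hypothetical stand-ins for the real `delocmd_list` and base1 data.

```r
library(dplyr)

# Hypothetical stand-in for delocmd_list.
codes <- c("O80", "O82")

toy <- data.frame(
  PRINC_DIAG_CODE     = c("O80", "J189", "K359", "I10"),
  OTH_DIAG_CODE_1     = c("Z370", "O82", NA, "E119"),
  POA_PRINC_DIAG_CODE = c("Y", "Y", "O80", "N"), # contrived value in row 3
  stringsAsFactors = FALSE
)

# Keep a row when ANY non-POA diagnosis column holds a listed code.
toy_del <- toy %>%
  filter(if_any(c(matches("_DIAG"), -starts_with("POA")), ~ .x %in% codes))

nrow(toy_del)
```

Row 3 is dropped even though its `POA` column contains a listed code, which shows why the `-starts_with("POA")` exclusion matters: `matches("_DIAG")` alone would also select the `POA_*_DIAG_*` columns.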
We peek here at the resulting frame to eyeball codes.
```{r peek_del}
del %>%
  select(
    matches("_DIAG"),
    -starts_with("POA")
  ) %>% head(10)
```
## Exclusions from the deliveries
Some further notebooks need to exclude cases for complications such as abnormal presentation, fetal death, or multiple gestation. Those will be handled in those notebooks as needed.
Here we only filter out missing or bad data.
### Filter out blank cells per Appendix A
"with missing gender (SEX=missing), age (AGE=missing), quarter (DQTR=missing), year (YEAR=missing) or principal diagnosis (DX1=missing)."
In base1, the fields are `SEX_CODE`, `PAT_AGE`, `DISCHARGE` for both quarter and year, and `PRINC_DIAG_CODE`.
```{r clean}
# "`" is treated here as the marker for a missing or invalid value
del_cln <- del %>%
  filter(
    SEX_CODE == "F",
    PAT_AGE != "`",
    RACE != "`",
    !is.na(DISCHARGE),
    !is.na(PRINC_DIAG_CODE)
  )

del_cln %>% nrow()
```
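When checking a filter like this, it can help to count how many rows each rule would remove before dropping them. The sketch below does that in base R on toy data; the rows and the `fails` matrix are hypothetical, and only three of the five conditions are shown.

```r
toy <- data.frame(
  SEX_CODE  = c("F", "M", "F", "F"),
  PAT_AGE   = c("08", "09", "`", "10"),
  DISCHARGE = c("2016Q1", "2016Q1", "2016Q1", NA),
  stringsAsFactors = FALSE
)

# One logical column per exclusion rule; TRUE means the row fails that rule.
fails <- cbind(
  not_female   = !(toy$SEX_CODE %in% "F"),
  bad_age      = is.na(toy$PAT_AGE) | toy$PAT_AGE == "`",
  no_discharge = is.na(toy$DISCHARGE)
)

# Rows removed by each rule.
colSums(fails)

# Rows that pass every rule.
keep <- rowSums(fails) == 0
toy[keep, ]
```

Note that `filter()` also drops rows where a comparison evaluates to `NA`, so a missing `SEX_CODE` or `PAT_AGE` is removed by the chunk above even without an explicit `is.na()` test.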
### Child-bearing age
Researchers at the Office of Health Affairs-Population Health, The University of Texas System work with the THCIC file daily, and they suggest filtering deliveries to women of normal child-bearing age.
Here is how those ages break down in the cleaned file:
```{r peek_age}
del_cln %>%
  count(PAT_AGE)
```
The `PAT_AGE` codes for ages 15-49 are 05-12. For HIV or drug patients, the list also includes 23 (18-44 yrs). I import those codes from `procedures-lists`.
Here we will filter for those values.
```{r age}
age_list <- read_rds("procedures-lists/utoha_age.rds") %>% .$age

del_cln_age <- del_cln %>%
  filter(PAT_AGE %in% age_list)

del_cln_age %>% nrow()
```
Peek at records outside the child-bearing age list to make sure there are none.
```{r age_test}
# set up not in
`%ni%` <- Negate(`%in%`)

del_cln_age %>%
  filter(PAT_AGE %ni% age_list) %>%
  select(PAT_AGE) %>%
  count(PAT_AGE)
```
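A small base R sketch of the "not in" check follows. The `age_codes` vector is rebuilt from the prose above (codes 05-12 plus 23) and is an assumption about what `utoha_age.rds` contains, not a copy of it; the `ages` values are made up.

```r
# Negate() wraps %in% so that TRUE means "not found in the list".
`%ni%` <- Negate(`%in%`)

# Assumed contents of the age list: codes 05 through 12, plus 23.
age_codes <- c(sprintf("%02d", 5:12), "23")

ages <- c("04", "07", "12", "13", "23")

# Codes that fall outside the list.
outside <- ages[ages %ni% age_codes]
outside

# After filtering, nothing should remain outside the list.
kept <- ages[ages %in% age_codes]
stopifnot(!any(kept %ni% age_codes))
```

Turning the eyeball check into a `stopifnot()` makes the notebook fail loudly if the filter and the list ever drift apart.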
## Add convenience columns for dates
```{r add_yr}
del_cln_age_yr <- del_cln_age %>%
  mutate(
    YR = substr(DISCHARGE, 1, 4)
  )
```
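The year extraction relies on the year being the first four characters of `DISCHARGE`. The sketch below assumes values shaped like `2016Q1`; that format is an assumption for illustration, and the quarter extraction is a hypothetical extra that the notebook does not perform.

```r
# Assumed DISCHARGE format: four-digit year, the letter Q, then the quarter.
discharge <- c("2016Q1", "2019Q2")

# Characters 1-4 hold the year, as in the chunk above.
yr <- substr(discharge, 1, 4)

# Hypothetical extra: character 6 would hold the quarter under this format.
qtr <- substr(discharge, 6, 6)

data.frame(discharge, yr, qtr)
```

`substr()` returns character values, which is why the year filter later compares against strings such as `"2016"` rather than numbers.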
## Remove other years
Because of a reporting lag, there are years in the original data that we are not using for our analysis. At some point in 2015 there was a switch from ICD-9 to ICD-10 coding, so going earlier would require some conversions. That is not impossible, but it is out of scope for now to keep things simple.
We are using full years from 2016-2018 and a partial year 2019 through the 2nd quarter release. This is subject to change as new data is released.
```{r filter_yr}
del_cln_age_yr <- del_cln_age_yr %>%
  filter(YR %in% c("2016", "2017", "2018", "2019"))
```
## Write file
```{r write}
del_cln_age_yr %>% nrow()
del_cln_age_yr %>% write_rds("data-test/ahrq_del_all_single_test.rds")
beepr::beep(4)
```
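A quick way to confirm that an `.rds` file holds what was written is a round trip through a temporary file. This sketch uses base R `saveRDS()` and `readRDS()`, which read and write the same format as readr's `write_rds()` and `read_rds()`; the toy frame is made up.

```r
toy <- data.frame(
  YR = c("2016", "2019"),
  PAT_AGE = c("08", "23"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the project folder is touched.
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)

# Read it back and compare with the original object.
toy_back <- readRDS(tmp)
identical(toy, toy_back)
```

Unlike a .csv export, the .rds format keeps column types, so character codes such as `"08"` come back with their leading zeros intact.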