---
title: Analysis walkthrough
subtitle: "for the paper [Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias](https://doi.org/10.1515/lingvan-2022-0114) (_Linguistics Vanguard_, 2024)"
author: "Dan Villarreal (Department of Linguistics, University of Pittsburgh)"
output:
  html_document:
    toc: yes
    toc_float: yes
    number_sections: yes
    df_print: paged
    code_folding: show
    includes:
      in_header: "_includes/head-custom.html"
params:
  extract_metrics: TRUE
  extract_only_metrics: FALSE
---
```{r setup, include=FALSE}
##Start timer
timing <- list()
timing$start <- proc.time()
##Handle parameters
extract_only_metrics <- ifelse(params$extract_metrics, params$extract_only_metrics, FALSE)
##knitr settings
knitr::opts_chunk$set(eval=!extract_only_metrics, echo=TRUE, include=TRUE,
comment=NA, results="hold")
##Packages
library(tidyverse) # tidyverse 'dialect' of R
library(magrittr) # nicer aliases (see ?add)
library(knitr) # combine text, code, and code output
library(ggrepel) # repel plot labels
library(rPref) # Pareto analysis
library(benchmarkme) # get_cpu(), get_ram()
##Don't show column types for read_csv()
options(readr.show_col_types = FALSE)
```
```{r, include=FALSE, fig.alt="Creative Commons Shield: BY-NC-SA", fig.link="http://creativecommons.org/licenses/by-nc-sa/4.0/"}
include_graphics("https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png")
```
# Introduction {#intro}
This tutorial is a companion to the paper "Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias", published open-access in _Linguistics Vanguard_ in 2024: <https://doi.org/10.1515/lingvan-2022-0114>.
It walks readers through the process of running all the code to replicate the paper's analysis.
Check out the README for a higher-level overview of this repository and sociolinguistic auto-coding (_SLAC_) more generally.
You can use this tutorial to...
1. [**Assess fairness**](#RQ2) for sociolinguistic auto-coders
2. [**Mitigate unfairness**](#RQ3) for sociolinguistic auto-coders
<div id="repo-nav">
**Repository navigation:**
- [Homepage](https://djvill.github.io/SLAC-Fairness)
- [Repository code](https://github.com/djvill/SLAC-Fairness)
- [Code for this tutorial](https://github.com/djvill/SLAC-Fairness/tree/main/Analysis-Walkthrough.Rmd)
- [Unfairness mitigation strategy descriptions](UMS-Info.html)
</div>
## Running this code on your own machine {#run-tutorial}
This tutorial was written in [R Markdown](https://rmarkdown.rstudio.com/).
R Markdown is an extension of the [R statistical programming language](https://www.r-project.org/) that allows users to interweave formatted text with R commands and outputs (in other words, a [literate programming](https://en.wikipedia.org/wiki/Literate_programming) approach to R).
The [webpage version](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html) of this tutorial is the output generated by running the [code](https://github.com/djvill/SLAC-Fairness/tree/main/Analysis-Walkthrough.Rmd) with R Markdown.
The tutorial code, in turn, relies on a [larger set of data and scripts](index#whats-in-this-repository) that do the dirty work so this tutorial can be a nice clean summary of the analysis.
To run the tutorial code on your own machine, you'll need a suitable computing environment and software, as described [here](index#running-this-code-on-your-own-machine).
See below for more information about [machine specs](#machine-specs), [how long](#running-time) it took this script to run, and how much [disk space](#disk-space-used) auto-coder files take up.
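If you're setting up R from scratch, you'll also need the packages this document loads (plus `rmarkdown` to knit the .Rmd itself); here's a minimal install sketch, assuming you're happy with CRAN defaults:
```{r, eval=FALSE}
##One-time setup: install the packages this walkthrough uses
install.packages(c("rmarkdown", "tidyverse", "magrittr", "knitr",
                   "ggrepel", "rPref", "benchmarkme"))
```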
In addition, you can customize some of this tutorial's behaviors via the [parameters in its YAML header](https://bookdown.org/yihui/rmarkdown/parameterized-reports.html).
This is especially useful if you want to [adapt this code](index#adapting-this-code-to-your-own-projects) to your own projects.
- `extract_metrics` (default `TRUE`): Extract fairness/performance metrics from auto-coder files and save them to `Outputs/Performance/`? If you use a [two-computer setup](index#a-quick-note-on-the-two-computer-setup), use `TRUE` on the computer where the auto-coder files are stored and `FALSE` on the other computer; you can also use `FALSE` if you've already extracted the metrics.
  - `TRUE`: Run code chunks that extract & save metrics.
  - `FALSE`: Skip these code chunks.
- `extract_only_metrics` (default `FALSE`): Useful if you're using this code to _just_ update your performance files (e.g., you are testing out new UMSs and want to analyze them separately). Note: if `extract_metrics` is `FALSE`, `extract_only_metrics` will be overridden to `FALSE`.
  - `TRUE`: Run _only_ code chunks that extract & save metrics.
  - `FALSE`: Run other code chunks as well.
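For example, here is a minimal sketch of rendering the document from R with non-default parameter values (it assumes your working directory is the repository root):
```{r, eval=FALSE}
##Knit the walkthrough, overriding the default parameters
##(here: only run the metric-extraction chunks)
rmarkdown::render("Analysis-Walkthrough.Rmd",
                  params = list(extract_metrics = TRUE,
                                extract_only_metrics = TRUE))
```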
# RQ2: Assessing fairness for SLAC {#RQ2}
This section assesses gender fairness in the auto-coder reported on in Villarreal et al.'s [2020 _Laboratory Phonology_ article](https://doi.org/10.5334/labphon.216) and ["How to train your classifier" auto-coding tutorial](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html).
Read the auto-coder:
```{r}
##N.B. Copy of https://github.com/nzilbb/How-to-Train-Your-Classifier/blob/main/LabPhonClassifier.Rds,
## but with Gender added to internal representation of training data
## (trainingData element) to facilitate fairness measurement
LabPhonClassifier <- readRDS("Input-Data/LabPhonClassifier.Rds")
```
If you originally ran your auto-coder using the scripts in this repository, then it's ready for fairness measurement.
If not, you may need to manually modify the auto-coder or re-run your auto-coder using these scripts (see [here](index#using-your-own-training-data)).
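For the "manually modify" route, the key requirement is that the auto-coder's `trainingData` element contain a `Gender` column (as in the copy read above). A rough sketch, where `my_autocoder.Rds` and `speaker_genders` are hypothetical stand-ins for your own auto-coder file and a gender vector aligned row-by-row with its training data:
```{r, eval=FALSE}
##Hypothetical example: add Gender to an existing auto-coder's trainingData
my_cls <- readRDS("my_autocoder.Rds")
##speaker_genders must be ordered to match the rows of my_cls$trainingData
my_cls$trainingData$Gender <- speaker_genders
saveRDS(my_cls, "my_autocoder.Rds")
```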
## Measuring fairness: `UMS-Utils.R` functions {#rq2-ums-utils}
Next, we measure fairness using functions in the `UMS-Utils.R` script: `cls_fairness()` and `cls_summary()`.
```{r, eval=TRUE}
source("R-Scripts/UMS-Utils.R", keep.source=TRUE)
```
`cls_fairness()` is useful in exploratory data analysis---
that stage where you're poking around the data, trying to wrap your head around it.
You can customize its output using different arguments.
```{r}
##Overall accuracy
cls_fairness(LabPhonClassifier)
##Class accuracies
cls_fairness(LabPhonClassifier, byClass=TRUE)
##Confusion matrix
cls_fairness(LabPhonClassifier, output="cm")
```
```{r}
##Chi-sq test: Overall accuracy
cls_fairness(LabPhonClassifier, output="chisq")
##Chi-sq tests: Class accuracies
cls_fairness(LabPhonClassifier, output="chisq", byClass=TRUE)
##Raw predictions
head(cls_fairness(LabPhonClassifier, output="pred"))
```
`cls_summary()` returns a one-row dataframe of fairness and (optionally) performance info:
```{r}
cls_summary(LabPhonClassifier)
```
You can shape the outputs of `cls_fairness()` and `cls_summary()` to create nicer-formatted results tables.
For example, this is (roughly) how I generated Table 4 in the _Linguistics Vanguard_ paper, using `cls_summary()`:
```{r}
##Summary
smry <- cls_summary(LabPhonClassifier)
##Get fairness dataframe: Metric in rows, Female/Male/Diff in columns
fmDiff <-
smry %>%
select(matches("_(Female|Male)")) %>%
pivot_longer(everything(), names_to=c("Metric","name"),
names_pattern="(.+)_(.+)") %>%
pivot_wider() %>%
mutate(Difference = Female-Male)
##Get chisq dataframe: Metric in rows, Chisq stat/df/p in columns
chiCols <-
smry %>%
select(contains("Chisq")) %>%
pivot_longer(everything(), names_to=c("Metric","name"),
names_pattern="(.+)_(Chisq.+)") %>%
pivot_wider()
##Join together & recode Metric w/ nicer labels
recodeMetric <- c("Overall accuracy" = "Acc",
"Absent class accuracy" = "ClassAcc_Absent",
"Present class accuracy" = "ClassAcc_Present")
left_join(fmDiff, chiCols, by="Metric") %>%
mutate(across(Metric, ~ fct_recode(.x, !!!recodeMetric)))
```
This is (roughly) how I generated Table 5 in the _Linguistics Vanguard_ paper, using `cls_fairness()` and the auto-coder's `trainingData` element:
```{r}
##Get each token's gender & majority-vote prediction
LabPhonPred <-
##Prediction dataframe (one row per token * resample)
cls_fairness(LabPhonClassifier, "pred") %>%
##Get counts of each token's Rpresent votes (plus tokens' unique Gender values)
count(rowIndex, Gender, Rpresent = Predicted) %>%
##Only take most frequent prediction per token
slice_max(n, by=rowIndex)
##Put together table
list(Actual = LabPhonClassifier$trainingData %>%
rename(Rpresent = .outcome),
Predicted = LabPhonPred) %>%
##Get Gender & Rpresent counts for each data source
imap(~ count(.x, Gender, Rpresent, name=.y)) %>%
##Put into a single dataframe
reduce(left_join, by=c("Gender","Rpresent")) %>%
##Calculate Under/overprediction
mutate("Under/overprediction" = Predicted / Actual - 1)
```
# RQ3: Mitigating SLAC unfairness {#RQ3}
In this section, we attempt to produce a fair auto-coder that does not suffer from the fairness issues in the preceding auto-coder (aka the _LabPhon auto-coder_).
To do this, we run and analyze additional auto-coders under different _unfairness mitigation strategies_ (_UMSs_).
## How to generate auto-coders
To generate auto-coders, use a Bash command line to run the repository's shell scripts.
The shell scripts are written to be compatible with [Slurm](https://slurm.schedmd.com/), the job queue used by Pitt's [CRC](https://crc.pitt.edu/) clusters, using the command `sbatch` to submit jobs to Slurm.
For example, to run `UMS-Round1.sh` with Slurm:
```{bash, eval=FALSE}
##Assuming you are in Shell-Scripts working directory
sbatch UMS-Round1.sh
```
If you don't need to use Slurm, you can submit jobs directly using the command `bash`.
In this case, you should explicitly specify where script outputs & errors should go:
```{bash, eval=FALSE}
##Assuming you are in Shell-Scripts working directory
bash UMS-Round1.sh &> ../Outputs/Shell-Scripts/UMS-Round1.out
```
The following R code returns `TRUE` if the command `sbatch` will work on your system:
```{r}
unname(nchar(Sys.which("sbatch")) > 0)
```
For Slurm users, note that the shell scripts pre-define several additional arguments to `sbatch` (e.g., `--partition=smp`).
Unfortunately, [it's not possible](https://stackoverflow.com/a/36303809) to override these hard-coded defaults by passing arguments to `sbatch` in the command line (e.g., `sbatch UMS-Round1.sh --partition=htc` will still use the hard-coded `--partition=smp`).
If you need different `sbatch` arguments, either hard-code new arguments (if you don't need them to change each time you execute the script) or write a [wrapper script](https://stackoverflow.com/a/36303809).
See the [README](index#shell-scripts) for more info about how the shell scripts work.
## Baseline {#baseline}
The auto-coders that we run for RQ3 will **not** undergo the time-consuming process of _optimization for performance_: [hyperparameter tuning](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-4) and [outlier dropping](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-5).
Applying these steps to each UMS would dramatically increase the amount of time it would take to run this analysis.
Instead of comparing UMSs to the _LabPhon_ auto-coder in [RQ2](#RQ2), which was optimized for performance, we'll run an un-optimized baseline auto-coder so we get an apples-to-apples comparison.
### Run auto-coder {#baseline-run}
Load Bash, navigate to the `Shell-Scripts/` directory, and run one of the following:
```{bash, eval=FALSE}
##Run with Slurm
sbatch Baseline.sh
##OR
##Run directly
bash Baseline.sh &> ../Outputs/Shell-Scripts/Baseline.out
```
Once that script is done running, you should have a new auto-coder file: `Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.0.Rds`.
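A quick way to confirm this from R on that computer (just a sketch):
```{r, eval=FALSE}
##Should return TRUE once Baseline.sh has finished
file.exists("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.0.Rds")
```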
### Extract fairness and performance metrics {#baseline-metrics}
Before proceeding, we'll extract fairness/performance metrics from this auto-coder (using [`cls_summary()`](#rq2-ums-utils)) and save metrics to `Outputs/Performance/`.
This allows us to bridge the [two-computer split](index#a-quick-note-on-the-two-computer-setup).
To extract and save fairness/performance metrics from the baseline auto-coder, switch back to R (on the same computer the auto-coder was run on), and run the following code:
```{r, eval=params$extract_metrics}
##Get list of UMS descriptions
umsList <- read.csv("Input-Data/UMS-List.txt", sep="\t")
##Read auto-coder file
file_baseline <- "Run-UMS_UMS0.0.Rds"
cls_baseline <- readRDS(paste0("Outputs/Diagnostic-Files/Temp-Autocoders/",
file_baseline))
##Extract performance
cls_baseline %>%
cls_summary() %>%
##Add name and long description
mutate(Classifier = str_remove_all(file_baseline, ".+_|\\.Rds"),
.before=1) %>%
left_join(umsList %>%
mutate(Classifier = paste0("UMS", UMS)) %>%
select(-UMS),
by="Classifier") %>%
##Save data
write_csv("Outputs/Performance/Perf_Baseline.csv")
```
We won't analyze the baseline's fairness on its own, because its whole purpose is to serve as a point of comparison for the UMSs (with neither the baseline nor the UMS auto-coders optimized for performance).
However, it's worth noting that there are small differences in fairness/performance between this un-optimized baseline and the _LabPhon_ auto-coder (which _was_ optimized for performance) analyzed for fairness [above](#RQ2):
```{r}
##Read baseline performance
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
##Combine LabPhon & Baseline metrics into a single dataframe
list(LabPhon = cls_summary(LabPhonClassifier),
Baseline = perf_baseline %>%
select(-c(Classifier, Description))) %>%
##One dataframe with just the necessary columns
map_dfr(select, Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
matches("ClassAcc_(Present|Absent)_Diff"),
.id="Version") %>%
##LabPhon/Baseline in separate columns, one row per metric * type
pivot_longer(contains("Acc"), names_to="Metric") %>%
mutate(Type = if_else(str_detect(Metric, "Diff"), "Fairness", "Performance"),
Metric = fct_inorder(if_else(str_detect(Metric, "Absent|Present"),
paste(str_extract(Metric, "Absent|Present"), "class accuracy"),
"Overall accuracy"))) %>%
pivot_wider(names_from=Version) %>%
##Put rows in nicer order
arrange(Metric, desc(Type))
```
## UMS round 1
The auto-coders in UMS round 1 include downsampling, valid predictor selection, and normalization UMSs (see [here](UMS-Info.html) for more info).
### Run auto-coders and extract metrics {#ums1-run}
Load Bash, navigate to `Shell-Scripts/`, and run one of the following:
```{bash, eval=FALSE}
##Run with Slurm
sbatch UMS-Round1.sh
##OR
##Run directly
bash UMS-Round1.sh &> ../Outputs/Shell-Scripts/UMS-Round1.out
```
Once that script is done running, you should have a bunch more files in `Outputs/Diagnostic-Files/Temp-Autocoders/`.
To extract and save fairness/performance metrics from the round 1 auto-coders, switch back to R (staying on the same computer), and run the following code:
```{r, eval=params$extract_metrics}
##Get auto-coder filenames (exclude UMS 0.x precursor auto-coders)
files_round1 <- list.files("Outputs/Diagnostic-Files/Temp-Autocoders/",
"Run-UMS_UMS[1-3]", full.names=TRUE)
##Read auto-coder files
cls_round1 <-
files_round1 %>%
##Better names
set_names(str_remove_all(files_round1, ".+_|\\.Rds")) %>%
map(readRDS)
##Extract performance
cls_round1 %>%
map_dfr(cls_summary, .id="Classifier") %>%
##Add long description
left_join(umsList %>% mutate(Classifier = paste0("UMS", UMS)) %>% select(-UMS),
by="Classifier") %>%
##Save data
write_csv("Outputs/Performance/Perf_UMS-Round1.csv")
```
---
<a id="meas-precursor" />
We'll also extract and save variable importance data from particular auto-coders.
Several UMSs are "valid predictor selection" strategies: they remove acoustic measures that could inadvertently signal gender.
To determine which measures could inadvertently signal gender, we run an auto-coder predicting _speaker gender_ rather than rhoticity and discard the measures that were "too helpful" in predicting gender.
The following code pulls variable importance from these "precursor" auto-coders.
```{r, eval=params$extract_metrics}
readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.1.1.Rds") %>%
pluck("finalModel", "variable.importance") %>%
{tibble(Measure=names(.), Importance=.)} %>%
write.csv("Outputs/Other/Var-Imp_UMS0.1.1.csv", row.names=FALSE)
readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.2.Rds") %>%
map_dfr(~ .x %>%
pluck("finalModel", "variable.importance") %>%
{tibble(Measure=names(.), value=.)},
.id="name") %>%
pivot_wider(names_prefix="Importance_") %>%
write.csv("Outputs/Other/Var-Imp_UMS0.2.csv", row.names=FALSE)
```
This step isn't strictly necessary for the R code in this document, but it allows us to run [`umsData()`](index#r-scripts) for all UMSs without needing to be on the [computer where the auto-coders are saved](index#a-quick-note-on-the-two-computer-setup).
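If you'd like to see which acoustic measures a precursor auto-coder flagged as most gender-informative, you can peek at the saved file; a sketch (the top-10 cutoff here is arbitrary, not the selection criterion itself):
```{r, eval=FALSE}
##Most gender-informative measures from the UMS 0.1.1 precursor auto-coder
read_csv("Outputs/Other/Var-Imp_UMS0.1.1.csv") %>%
  slice_max(Importance, n=10)
```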
### Analyze fairness and performance
Now we can analyze metrics on a user-friendlier system.
Read fairness/performance data for the round 1 auto-coders, and add baseline data:
```{r}
##Read
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
perf_round1 <- rbind(perf_baseline,
read_csv("Outputs/Performance/Perf_UMS-Round1.csv"))
##Decode first digit of UMS code
categories <- c("Baseline", "Downsampling", "Valid pred selection",
"Normalization", "Combination") %>%
set_names(0:4)
##Shape performance dataframe for plotting: Fairness/Performance in separate
## columns, one row per UMS * metric, Category factor, shorter Classifier label
perfPlot_round1 <- perf_round1 %>%
select(Classifier,
Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
matches("ClassAcc_(Present|Absent)_Diff")) %>%
##Fairness/Performance in separate columns, one row per UMS * metric
pivot_longer(contains("Acc")) %>%
mutate(Metric = fct_inorder(if_else(str_detect(name, "Absent|Present"),
paste(str_extract(name, "Absent|Present"), "class accuracy"),
"Overall accuracy")),
name = if_else(str_detect(name, "Diff"), "Fairness", "Performance")) %>%
pivot_wider() %>%
##Add Category column, shorter Classifier label
mutate(Category = recode_factor(str_extract(Classifier, "\\d"), !!!categories),
across(Classifier, ~ str_remove(.x, "UMS")))
```
```{r}
##Plot
perfPlot_round1 %>%
mutate(across(Fairness, abs)) %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=20) +
##Each metric in its own facet
## (N.B. use arg scales="free" to have each facet zoom to fit data)
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw()
```
To make the baseline stand out, we can plot it with separate aesthetics:
```{r}
perfPlot_round1 %>%
mutate(across(Fairness, abs)) %>%
##Exclude Baseline from points & labels
filter(Classifier != "0.0") %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=20) +
##Dotted line for baseline
geom_vline(data=perfPlot_round1 %>% filter(Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot_round1 %>% filter(Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Each metric in its own facet
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw()
```
While numerous UMSs improve fairness relative to the baseline, there is no single obvious winner.
This is often because a UMS will perform well on some metrics but poorly on others.
For example, while UMS 1.3.2 improves fairness and performance for overall accuracy and Absent class accuracy, it has dismal performance for Present class accuracy (under 40%).
In other instances the disparity is more dramatic;
for example, UMS 1.5 is clearly superior for the fairness/performance tradeoff when it comes to Present class accuracy, but its Absent class accuracy fairness is _worse_ than the baseline.
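To put numbers on these tradeoffs, it can help to sort the plotted data for one metric at a time; a quick sketch using the `perfPlot_round1` dataframe built above:
```{r, eval=FALSE}
##Round 1 UMSs ranked by Present class accuracy performance,
##with the absolute fairness gap alongside
perfPlot_round1 %>%
  filter(Metric=="Present class accuracy", Classifier != "0.0") %>%
  mutate(across(Fairness, abs)) %>%
  arrange(desc(Performance)) %>%
  select(Classifier, Category, Performance, Fairness)
```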
## UMS round 2
Since the round 1 results weren't completely satisfactory, I decided to attempt combination strategies: combining downsampling with either valid predictor selection or normalization.
Combining these strategies is feasible because different categories of UMS affect the data in different ways;
downsampling UMSs remove tokens (rows), valid predictor selection UMSs remove acoustic measures (columns), and normalization transforms acoustic measures.
This could theoretically produce better results if the strengths and weaknesses of the combined UMSs hedge against one another (e.g., the improvements in Present class accuracy performance for UMS 2.2 could balance out the decline in Present class accuracy performance for UMS 1.3.2).
I chose 8 combination UMSs based on round 1 results: 2 downsampling UMSs (1.3.1, 1.3.2) $\times$ 4 other UMSs (2.1.1, 2.2, 2.3, 3.1).
I chose these UMSs because they were relatively balanced across all 3 metrics for fairness and performance (i.e., excluding UMSs like 1.5 that performed very poorly on at least one metric).
In your own projects, it may be appropriate to choose different UMSs to combine depending on how the round 1 results shake out.
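For reference, those eight combinations are simply the crossing of the two sets; a small sketch (labels only, following the round 1 UMS numbering):
```{r, eval=FALSE}
##The 8 round-2 combinations (run as the UMS 4.x series):
##each downsampling UMS crossed with each other selected UMS
crossing(Downsampling = c("1.3.1", "1.3.2"),
         Other = c("2.1.1", "2.2", "2.3", "3.1"))
```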
### Run auto-coders and extract metrics {#ums2-run}
Load Bash, navigate to `Shell-Scripts/`, and run one of the following:
```{bash, eval=FALSE}
##Run with Slurm
sbatch UMS-Round2.sh
##OR
##Run directly
bash UMS-Round2.sh &> ../Outputs/Shell-Scripts/UMS-Round2.out
```
Once that script is done running, you should have additional files in `Outputs/Diagnostic-Files/Temp-Autocoders/`.
To extract and save fairness/performance metrics from the round 2 auto-coders, switch back to R (staying on the same computer), and run the following code:
```{r, eval=params$extract_metrics}
##Get auto-coder filenames (all combo UMSs start with the digit 4)
files_round2 <- list.files("Outputs/Diagnostic-Files/Temp-Autocoders/",
"Run-UMS_UMS4", full.names=TRUE)
##Read auto-coder files
cls_round2 <-
files_round2 %>%
##Better names
set_names(str_remove_all(files_round2, ".+_|\\.Rds")) %>%
map(readRDS)
##Extract performance
cls_round2 %>%
map_dfr(cls_summary, .id="Classifier") %>%
##Add long description
left_join(umsList %>% mutate(Classifier = paste0("UMS", UMS)) %>% select(-UMS),
by="Classifier") %>%
##Save data
write_csv("Outputs/Performance/Perf_UMS-Round2.csv")
```
### Analyze fairness and performance
Now we can analyze these metrics on a user-friendlier system.
Read fairness/performance data for the round 2 auto-coders, and add baseline data:
```{r}
##Read
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
perf_round2 <- rbind(perf_baseline,
read_csv("Outputs/Performance/Perf_UMS-Round2.csv"))
##Shape performance dataframe for plotting: Fairness/Performance in separate
## columns, one row per UMS * metric, Category factor, shorter Classifier label
perfPlot_round2 <- perf_round2 %>%
select(Classifier,
Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
matches("ClassAcc_(Present|Absent)_Diff")) %>%
##Fairness/Performance in separate columns, one row per UMS * metric
pivot_longer(contains("Acc")) %>%
mutate(Metric = fct_inorder(if_else(str_detect(name, "Absent|Present"),
paste(str_extract(name, "Absent|Present"), "class accuracy"),
"Overall accuracy")),
name = if_else(str_detect(name, "Diff"), "Fairness", "Performance")) %>%
pivot_wider() %>%
##Add Category column, shorter Classifier label
mutate(Category = recode_factor(str_extract(Classifier, "\\d"), !!!categories),
across(Classifier, ~ str_remove(.x, "UMS")))
```
Plot (using a dashed line for the baseline):
```{r}
perfPlot_round2 %>%
mutate(across(Fairness, abs)) %>%
##Exclude Baseline from points & labels
filter(Classifier != "0.0") %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=20, color="black") +
##Dotted line for baseline
geom_vline(data=perfPlot_round2 %>% filter(Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot_round2 %>% filter(Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Each metric in its own facet
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw() +
##Rotate x-axis labels to avoid clash
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
We can also plot round 1 & 2 together:
```{r}
##Put perfPlot dfs together
perfPlot <- rbind(perfPlot_round1,
perfPlot_round2) %>%
##Remove duplicate Baseline rows
distinct()
perfPlot %>%
mutate(across(Fairness, abs)) %>%
##Exclude Baseline from points
filter(Classifier != "0.0") %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=40) +
##Dotted line for baseline
geom_vline(data=perfPlot %>% filter(Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot %>% filter(Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Each metric in its own facet
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw() +
##Rotate x-axis labels to avoid clash
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
## Identify optimal auto-coder
Now that we've got all our performance data, we need to choose which auto-coder to actually _use_ for auto-coding data that hasn't previously been coded (i.e., to scale up our dataset of coded tokens without more manual coding).
The previous plot tells us that some UMSs are better than others (e.g., we obviously won't be using UMS 1.2), but there isn't any UMS that clearly stands out from the rest.
Furthermore, even if we eliminate the obviously bad options, there seems to be a tradeoff between performance and fairness.
How do we winnow down the space of options?
One technique is to find the UMSs that are _Pareto-optimal_: a given UMS is Pareto-optimal if every other UMS that is _better_ in fairness is _worse_ in performance, or vice versa.
In this sense, the best UMS for our purposes might be neither the fairest nor the best-performing, but the UMS for which there's a good fairness--performance tradeoff.
In R, we can use `psel()` from the `rPref` package to find Pareto-optimal auto-coders.
For example, the following auto-coders are Pareto-optimal for Overall accuracy:
```{r}
perfPlot %>%
mutate(across(Fairness, abs)) %>%
filter(Metric=="Overall accuracy") %>%
psel(high(Performance) * low(Fairness))
```
Here's that same info represented in a plot:
```{r, warning=FALSE}
perfPlot %>%
mutate(across(Fairness, abs)) %>%
filter(Metric=="Overall accuracy",
Classifier != "0.0") %>%
mutate(`Pareto-optimal` = Classifier %in% psel(., high(Performance) * low(Fairness))$Classifier) %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier, alpha=`Pareto-optimal`)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=40) +
##Dotted line for baseline
geom_vline(data=perfPlot %>%
filter(Metric=="Overall accuracy", Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot %>%
filter(Metric=="Overall accuracy", Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw() +
##Rotate x-axis labels to avoid clash
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
In this case, we have 3 metrics, so for each metric we find the UMSs that are Pareto-optimal for the fairness--performance tradeoff:
```{r}
##Get list of dataframes, one per Metric, w/ only Pareto-optimal UMSs
paretoOpt <-
perfPlot %>%
mutate(across(Fairness, abs)) %>%
group_by(Metric) %>%
##Identify Pareto-optimal UMSs
group_map(~ psel(.x, high(Performance) * low(Fairness)),
.keep=TRUE)
##Display as a single dataframe
bind_rows(paretoOpt)
```
In this particular dataset, we get really lucky:
There is one UMS, 4.2.1, that is shared among these 3 sets of Pareto-optimal UMSs.
This is certainly _not_ a guaranteed outcome!
```{r}
##Get Classifier value that is in all 3 dataframes (if any)
paretoOpt %>%
map("Classifier") %>%
reduce(intersect)
```
Incidentally, UMS 4.2.1 also happens to be the fairest UMS for all 3 metrics---
this is _definitely_ not a guaranteed outcome!
```{r, results='hold'}
##Get fairest UMS for each Metric
perfPlot %>%
mutate(across(Fairness, abs)) %>%
group_by(Metric) %>%
filter(Fairness==min(Fairness))
##Could also do
# perfPlot %>%
# mutate(across(Fairness, abs)) %>%
# group_by(Metric) %>%
# group_modify(~ psel(.x, low(Fairness)))
```
Thus, we choose 4.2.1 as the optimal UMS.
In fact, this is the UMS that was used to grow an /r/ dataset fivefold for the 2021 _Language Variation and Change_ article ["Gender separation and the speech community: Rhoticity in early 20th century Southland New Zealand English"](https://doi.org/10.1017/S0954394521000090) by me, Lynn Clark, Jen Hay, and Kevin Watson.
(The auto-coder used for that analysis was optimized for performance, so the fairness we report in that paper is slightly worse than that of UMS 4.2.1.)
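If you want to inspect the winning auto-coder itself, it lives alongside the others in the temp directory; a sketch (this assumes you're on the computer where the auto-coders were run, and that the file follows the `Run-UMS_UMS*.Rds` naming pattern used above):
```{r, eval=FALSE}
##Re-load the chosen auto-coder and double-check its fairness
cls_optimal <- readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS4.2.1.Rds")
cls_fairness(cls_optimal, byClass=TRUE)
```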
# Script meta-info {#script-info}
## R session info
```{r, eval=TRUE}
sessionInfo()
```
## Disk space used
These are only shown if `params$extract_metrics` is `TRUE` (because otherwise it's assumed that you're not working on the same system the auto-coders were run on).
Temporary auto-coders:
```{r, eval=params$extract_metrics}
tmpAuto <-
list.files("Outputs/Diagnostic-Files/Temp-Autocoders/",
# "^(Run-UMS|(Hyperparam-Tuning|Outlier-Dropping)_UMS0\\.0).*Rds$") %>%
"^Run-UMS.*Rds$", full.names=TRUE) %>%
file.info()
# cat("Disk space: ", round(sum(tmpAuto$size)/2^30, 1), " Gb (",
cat("Disk space: ", round(sum(tmpAuto$size)/2^20, 1), " Mb (",
nrow(tmpAuto), " files)", sep="")
```
Complete repository:
```{r, eval=params$extract_metrics}
if (.Platform$OS.type=="windows") {
shell("dir /s", intern=TRUE) %>%
tail(2) %>%
head(1) %>%
str_trim() %>%
str_squish()
}
if (.Platform$OS.type=="unix") {
system2("du", "-sh", stdout=TRUE) %>%
str_remove("\\s.+") %>%
paste0("b")
}
```
## Machine specs
System:
```{r, eval=TRUE}
Sys.info()
```
Processor:
```{r, eval=TRUE}
get_cpu()
```
RAM:
```{r, eval=TRUE}
get_ram()
```
## Running time
Total running time for shell scripts:
```{r timing, eval=TRUE}
##Parse HH:MM:SS and print as nicer time
printDur <- function(x) {
library(lubridate)
library(magrittr)
x %>%
lubridate::hms() %>%
as.duration() %>%
sum() %>%
seconds_to_period()
}
scripts <- c("Baseline", "UMS-Round1", "UMS-Round2")
cat("Total running time for", paste0(scripts, ".sh", collapse=", "), fill=TRUE)
paste0("Outputs/Shell-Scripts/", paste0(scripts, ".out")) %>%
map_chr(~ .x %>%
readLines() %>%
str_subset("RunTime")) %>%
str_extract("[\\d:]{2,}") %>%
printDur()
```
Total running time for R code in this document (with `params$extract_metrics` set to `r params$extract_metrics`), in seconds:
```{r, eval=TRUE}
timing$stop <- proc.time()
timing$stop - timing$start
```
# Acknowledgements
I would like to thank Chris Bartlett, the Southland Oral History Project (Invercargill City Libraries and Archives), and the speakers for sharing their data and their voices.
Thanks are also due to Lynn Clark, Jen Hay, Kevin Watson, and the New Zealand Institute of Language, Brain and Behaviour for supporting this research.
Valuable feedback was provided by audiences at NWAV 49, the Penn Linguistics Conference, Pitt Computer Science, and the Michigan State SocioLab.
Other resources were provided by a Royal Society of New Zealand Marsden Research Grant (16-UOC-058) and the University of Pittsburgh Center for Research Computing (specifically, the H2P cluster supported by NSF award number OAC-2117681).
Any errors are mine entirely.
<!-- Footer: Add CSS styling -->
```{css, echo=FALSE}
/* Add scrollbars to long R input/output blocks */
pre {
max-height: 300px;
overflow-y: auto;
}
/* Repo navigation block */
div#repo-nav {
background-color: #ddd;
width: fit-content;
padding: 10px 20px;
border: 1px solid #bbb;
border-radius: 20px;
}
div#repo-nav * {
margin: 0px;
font-style: italic;
}
```