---
title: Analysis walkthrough
subtitle: "for the paper [Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias](https://doi.org/10.1515/lingvan-2022-0114) (_Linguistics Vanguard_, 2024)"
author: "Dan Villarreal (Department of Linguistics, University of Pittsburgh)"
output:
  html_document:
    toc: yes
    toc_float: yes
    number_sections: yes
    df_print: paged
    code_folding: show
    includes:
      in_header: "_includes/head-custom.html"
params:
  extract_metrics: TRUE
  extract_only_metrics: FALSE
---
```{r setup, include=FALSE}
##Start timer
timing <- list()
timing$start <- proc.time()
##Handle parameters
extract_only_metrics <- ifelse(params$extract_metrics, params$extract_only_metrics, FALSE)
##knitr settings
knitr::opts_chunk$set(eval=!extract_only_metrics, echo=TRUE, include=TRUE,
comment=NA, results="hold")
##Packages
library(tidyverse) # tidyverse 'dialect' of R
library(magrittr) # nicer aliases (see ?add)
library(knitr) # combine text, code, and code output
library(ggrepel) # repel plot labels
library(rPref) # Pareto analysis
library(benchmarkme) # get_cpu(), get_ram()
##Don't show column types for read_csv()
options(readr.show_col_types = FALSE)
```
```{r, include=FALSE, fig.alt="Creative Commons Shield: BY-NC-SA", fig.link="http://creativecommons.org/licenses/by-nc-sa/4.0/"}
include_graphics("https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png")
```
# Introduction {#intro}
This tutorial is a companion to the paper "Sociolinguistic auto-coding has fairness problems too: Measuring and mitigating overlearning bias", published open-access in _Linguistics Vanguard_ in 2024: <https://doi.org/10.1515/lingvan-2022-0114>.
It walks readers through the process of running all the code to replicate the paper's analysis.
Check out the README for a higher-level overview of this repository and sociolinguistic auto-coding (_SLAC_) more generally.
You can use this tutorial to...
1. [**Assess fairness**](#RQ2) for sociolinguistic auto-coders
2. [**Mitigate unfairness**](#RQ3) for sociolinguistic auto-coders
<div id="repo-nav">
**Repository navigation:**
- [Homepage](https://djvill.github.io/SLAC-Fairness)
- [Repository code](https://github.com/djvill/SLAC-Fairness)
- [Code for this tutorial](https://github.com/djvill/SLAC-Fairness/tree/main/Analysis-Walkthrough.Rmd)
- [Unfairness mitigation strategy descriptions](UMS-Info.html)
</div>
## Running this code on your own machine {#run-tutorial}
This tutorial was written in [R Markdown](https://rmarkdown.rstudio.com/).
R Markdown is an extension of the [R statistical programming language](https://www.r-project.org/) that allows users to interweave formatted text with R commands and outputs (in other words, a [literate programming](https://en.wikipedia.org/wiki/Literate_programming) approach to R).
The [webpage version](https://djvill.github.io/SLAC-Fairness/Analysis-Walkthrough.html) of this tutorial is the output generated by running the [code](https://github.com/djvill/SLAC-Fairness/tree/main/Analysis-Walkthrough.Rmd) with R Markdown.
The tutorial code, in turn, relies on a [larger set of data and scripts](index#whats-in-this-repository) that do the dirty work so this tutorial can be a nice clean summary of the analysis.
To run the tutorial code on your own machine, you'll need a suitable computing environment and software, as described [here](index#running-this-code-on-your-own-machine).
See below for more information about [machine specs](#machine-specs), [how long](#running-time) it took this script to run, and how much [disk space](#disk-space-used) auto-coder files take up.
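If you're setting up R from scratch, you'll also need the packages this document loads (plus `rmarkdown` to knit the .Rmd itself); here's a minimal install sketch, assuming you're happy with CRAN defaults:
```{r, eval=FALSE}
##One-time setup: install the packages this walkthrough uses
install.packages(c("rmarkdown", "tidyverse", "magrittr", "knitr",
                   "ggrepel", "rPref", "benchmarkme"))
```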
In addition, you can customize some of this tutorial's behaviors via the [parameters in its YAML header](https://bookdown.org/yihui/rmarkdown/parameterized-reports.html).
This is especially useful if you want to [adapt this code](index#adapting-this-code-to-your-own-projects) to your own projects.
- `extract_metrics` (default `TRUE`): Extract fairness/performance metrics from auto-coder files and save them to `Outputs/Performance/`? If you use a [two-computer setup](index#a-quick-note-on-the-two-computer-setup), use `TRUE` on the computer where the auto-coder files are stored and `FALSE` on the other computer; you can also use `FALSE` if you've already extracted the metrics.
  - `TRUE`: Run code chunks that extract & save metrics.
  - `FALSE`: Skip these code chunks.
- `extract_only_metrics` (default `FALSE`): Useful if you're using this code to _just_ update your performance files (e.g., you are testing out new UMSs and want to analyze them separately). Note: if `extract_metrics` is `FALSE`, `extract_only_metrics` will be overridden to `FALSE`.
  - `TRUE`: Run _only_ code chunks that extract & save metrics.
  - `FALSE`: Run other code chunks as well.
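For example, here is a minimal sketch of rendering the document from R with non-default parameter values (it assumes your working directory is the repository root):
```{r, eval=FALSE}
##Knit the walkthrough, overriding the default parameters
##(here: only run the metric-extraction chunks)
rmarkdown::render("Analysis-Walkthrough.Rmd",
                  params = list(extract_metrics = TRUE,
                                extract_only_metrics = TRUE))
```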
# RQ2: Assessing fairness for SLAC {#RQ2}
This section assesses gender fairness in the auto-coder reported on in Villarreal et al.'s [2020 _Laboratory Phonology_ article](https://doi.org/10.5334/labphon.216) and ["How to train your classifier" auto-coding tutorial](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html).
Read the auto-coder:
```{r}
##N.B. Copy of https://github.com/nzilbb/How-to-Train-Your-Classifier/blob/main/LabPhonClassifier.Rds,
## but with Gender added to internal representation of training data
## (trainingData element) to facilitate fairness measurement
LabPhonClassifier <- readRDS("Input-Data/LabPhonClassifier.Rds")
```
If you originally ran your auto-coder using the scripts in this repository, then it's ready for fairness measurement.
If not, you may need to manually modify the auto-coder or re-run your auto-coder using these scripts (see [here](index#using-your-own-training-data)).
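For the "manually modify" route, the key requirement is that the auto-coder's `trainingData` element contain a `Gender` column (as in the copy read above). A rough sketch, where `my_autocoder.Rds` and `speaker_genders` are hypothetical stand-ins for your own auto-coder file and a gender vector aligned row-by-row with its training data:
```{r, eval=FALSE}
##Hypothetical example: add Gender to an existing auto-coder's trainingData
my_cls <- readRDS("my_autocoder.Rds")
##speaker_genders must be ordered to match the rows of my_cls$trainingData
my_cls$trainingData$Gender <- speaker_genders
saveRDS(my_cls, "my_autocoder.Rds")
```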
## Measuring fairness: `UMS-Utils.R` functions {#rq2-ums-utils}
Next, we measure fairness using functions in the `UMS-Utils.R` script: `cls_fairness()` and `cls_summary()`.
```{r, eval=TRUE}
source("R-Scripts/UMS-Utils.R", keep.source=TRUE)
```
`cls_fairness()` is useful in exploratory data analysis---
that stage where you're poking around the data, trying to wrap your head around it.
You can customize its output using different arguments.
```{r}
##Overall accuracy
cls_fairness(LabPhonClassifier)
##Class accuracies
cls_fairness(LabPhonClassifier, byClass=TRUE)
##Confusion matrix
cls_fairness(LabPhonClassifier, output="cm")
```
```{r}
##Chi-sq test: Overall accuracy
cls_fairness(LabPhonClassifier, output="chisq")
##Chi-sq tests: Class accuracies
cls_fairness(LabPhonClassifier, output="chisq", byClass=TRUE)
##Raw predictions
head(cls_fairness(LabPhonClassifier, output="pred"))
```
`cls_summary()` returns a one-row dataframe of fairness and (optionally) performance info:
```{r}
cls_summary(LabPhonClassifier)
```
You can shape the outputs of `cls_fairness()` and `cls_summary()` to create nicer-formatted results tables.
For example, this is (roughly) how I generated Table 4 in the _Linguistics Vanguard_ paper, using `cls_summary()`:
```{r}
##Summary
smry <- cls_summary(LabPhonClassifier)
##Get fairness dataframe: Metric in rows, Female/Male/Diff in columns
fmDiff <-
smry %>%
select(matches("_(Female|Male)")) %>%
pivot_longer(everything(), names_to=c("Metric","name"),
names_pattern="(.+)_(.+)") %>%
pivot_wider() %>%
mutate(Difference = Female-Male)
##Get chisq dataframe: Metric in rows, Chisq stat/df/p in columns
chiCols <-
smry %>%
select(contains("Chisq")) %>%
pivot_longer(everything(), names_to=c("Metric","name"),
names_pattern="(.+)_(Chisq.+)") %>%
pivot_wider()
##Join together & recode Metric w/ nicer labels
recodeMetric <- c("Overall accuracy" = "Acc",
"Absent class accuracy" = "ClassAcc_Absent",
"Present class accuracy" = "ClassAcc_Present")
left_join(fmDiff, chiCols, by="Metric") %>%
mutate(across(Metric, ~ fct_recode(.x, !!!recodeMetric)))
```
This is (roughly) how I generated Table 5 in the _Linguistics Vanguard_ paper, using `cls_fairness()` and the auto-coder's `trainingData` element:
```{r}
##Get each token's gender & majority-vote prediction
LabPhonPred <-
##Prediction dataframe (one row per token * resample)
cls_fairness(LabPhonClassifier, "pred") %>%
##Get counts of each token's Rpresent votes (plus tokens' unique Gender values)
count(rowIndex, Gender, Rpresent = Predicted) %>%
##Only take most frequent prediction per token
slice_max(n, by=rowIndex)
##Put together table
list(Actual = LabPhonClassifier$trainingData %>%
rename(Rpresent = .outcome),
Predicted = LabPhonPred) %>%
##Get Gender & Rpresent counts for each data source
imap(~ count(.x, Gender, Rpresent, name=.y)) %>%
##Put into a single dataframe
reduce(left_join, by=c("Gender","Rpresent")) %>%
##Calculate Under/overprediction
mutate("Under/overprediction" = Predicted / Actual - 1)
```
# RQ3: Mitigating SLAC unfairness {#RQ3}
In this section, we attempt to produce a fair auto-coder that does not suffer from the fairness issues in the preceding auto-coder (aka the _LabPhon auto-coder_).
To do this, we run and analyze additional auto-coders under different _unfairness mitigation strategies_ (_UMSs_).
## How to generate auto-coders
To generate auto-coders, use a Bash command line to run the repository's shell scripts.
The shell scripts are written to be compatible with [Slurm](https://slurm.schedmd.com/), the job queue used by Pitt's [CRC](https://crc.pitt.edu/) clusters, using the command `sbatch` to submit jobs to Slurm.
For example, to run `UMS-Round1.sh` with Slurm:
```{bash, eval=FALSE}
##Assuming you are in Shell-Scripts working directory
sbatch UMS-Round1.sh
```
If you don't need to use Slurm, you can submit jobs directly using the command `bash`.
In this case, you should explicitly specify where script outputs & errors should go:
```{bash, eval=FALSE}
##Assuming you are in Shell-Scripts working directory
bash UMS-Round1.sh &> ../Outputs/Shell-Scripts/UMS-Round1.out
```
The following R code returns `TRUE` if the command `sbatch` will work on your system:
```{r}
unname(nchar(Sys.which("sbatch")) > 0)
```
For Slurm users, note that the shell scripts pre-define several additional arguments to `sbatch` (e.g., `--partition=smp`).
Unfortunately, [it's not possible](https://stackoverflow.com/a/36303809) to override these hard-coded defaults by passing arguments to `sbatch` in the command line (e.g., `sbatch UMS-Round1.sh --partition=htc` will still use the hard-coded `--partition=smp`).
If you need different `sbatch` arguments, either hard-code new arguments (if you don't need them to change each time you execute the script) or write a [wrapper script](https://stackoverflow.com/a/36303809).
See the [README](index#shell-scripts) for more info about how the shell scripts work.
## Baseline {#baseline}
The auto-coders that we run for RQ3 will **not** undergo the time-consuming process of _optimization for performance_: [hyperparameter tuning](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-4) and [outlier dropping](https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html#step-5).
Applying these steps to each UMS would dramatically increase the amount of time it would take to run this analysis.
Instead of comparing UMSs to the _LabPhon_ auto-coder in [RQ2](#RQ2), which was optimized for performance, we'll run an un-optimized baseline auto-coder so we get an apples-to-apples comparison.
### Run auto-coder {#baseline-run}
Load Bash, navigate to the `Shell-Scripts/` directory, and run one of the following:
```{bash, eval=FALSE}
##Run with Slurm
sbatch Baseline.sh
##OR
##Run directly
bash Baseline.sh &> ../Outputs/Shell-Scripts/Baseline.out
```
Once that script is done running, you should have a new auto-coder file: `Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.0.Rds`.
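A quick way to confirm this from R on that computer (just a sketch):
```{r, eval=FALSE}
##Should return TRUE once Baseline.sh has finished
file.exists("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.0.Rds")
```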
### Extract fairness and performance metrics {#baseline-metrics}
Before proceeding, we'll extract fairness/performance metrics from this auto-coder (using [`cls_summary()`](#rq2-ums-utils)) and save metrics to `Outputs/Performance/`.
This allows us to bridge the [two-computer split](index#a-quick-note-on-the-two-computer-setup).
To extract and save fairness/performance metrics from the baseline auto-coder, switch back to R (on the same computer the auto-coder was run on), and run the following code:
```{r, eval=params$extract_metrics}
##Get list of UMS descriptions
umsList <- read.csv("Input-Data/UMS-List.txt", sep="\t")
##Read auto-coder file
file_baseline <- "Run-UMS_UMS0.0.Rds"
cls_baseline <- readRDS(paste0("Outputs/Diagnostic-Files/Temp-Autocoders/",
file_baseline))
##Extract performance
cls_baseline %>%
cls_summary() %>%
##Add name and long description
mutate(Classifier = str_remove_all(file_baseline, ".+_|\\.Rds"),
.before=1) %>%
left_join(umsList %>%
mutate(Classifier = paste0("UMS", UMS)) %>%
select(-UMS),
by="Classifier") %>%
##Save data
write_csv("Outputs/Performance/Perf_Baseline.csv")
```
We won't analyze the baseline's fairness on its own, because its whole purpose is to serve as a point of comparison for the UMSs (with neither the baseline nor the UMS auto-coders optimized for performance).
However, it's worth noting that there are small differences in fairness/performance between this un-optimized baseline and the _LabPhon_ auto-coder (which _was_ optimized for performance) analyzed for fairness [above](#RQ2):
```{r}
##Read baseline performance
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
##Combine LabPhon & Baseline metrics into a single dataframe
list(LabPhon = cls_summary(LabPhonClassifier),
Baseline = perf_baseline %>%
select(-c(Classifier, Description))) %>%
##One dataframe with just the necessary columns
map_dfr(select, Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
matches("ClassAcc_(Present|Absent)_Diff"),
.id="Version") %>%
##LabPhon/Baseline in separate columns, one row per metric * type
pivot_longer(contains("Acc"), names_to="Metric") %>%
mutate(Type = if_else(str_detect(Metric, "Diff"), "Fairness", "Performance"),
Metric = fct_inorder(if_else(str_detect(Metric, "Absent|Present"),
paste(str_extract(Metric, "Absent|Present"), "class accuracy"),
"Overall accuracy"))) %>%
pivot_wider(names_from=Version) %>%
##Put rows in nicer order
arrange(Metric, desc(Type))
```
## UMS round 1
The auto-coders in UMS round 1 include downsampling, valid predictor selection, and normalization UMSs (see [here](UMS-Info.html) for more info).
### Run auto-coders and extract metrics {#ums1-run}
Load Bash, navigate to `Shell-Scripts/`, and run one of the following:
```{bash, eval=FALSE}
##Run with Slurm
sbatch UMS-Round1.sh
##OR
##Run directly
bash UMS-Round1.sh &> ../Outputs/Shell-Scripts/UMS-Round1.out
```
Once that script is done running, you should have a bunch more files in `Outputs/Diagnostic-Files/Temp-Autocoders/`.
To extract and save fairness/performance metrics from the round 1 auto-coders, switch back to R (staying on the same computer), and run the following code:
```{r, eval=params$extract_metrics}
##Get auto-coder filenames (exclude UMS 0.x precursor auto-coders)
files_round1 <- list.files("Outputs/Diagnostic-Files/Temp-Autocoders/",
"Run-UMS_UMS[1-3]", full.names=TRUE)
##Read auto-coder files
cls_round1 <-
files_round1 %>%
##Better names
set_names(str_remove_all(files_round1, ".+_|\\.Rds")) %>%
map(readRDS)
##Extract performance
cls_round1 %>%
map_dfr(cls_summary, .id="Classifier") %>%
##Add long description
left_join(umsList %>% mutate(Classifier = paste0("UMS", UMS)) %>% select(-UMS),
by="Classifier") %>%
##Save data
write_csv("Outputs/Performance/Perf_UMS-Round1.csv")
```
---
<a id="meas-precursor" />
We'll also extract and save variable importance data from particular auto-coders.
Several UMSs are "valid predictor selection" strategies: they remove acoustic measures that could inadvertently signal gender.
To determine which measures could inadvertently signal gender, we run an auto-coder predicting _speaker gender_ rather than rhoticity and discard the measures that were "too helpful" in predicting gender.
The following code pulls variable importance from these "precursor" auto-coders.
```{r, eval=params$extract_metrics}
readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.1.1.Rds") %>%
pluck("finalModel", "variable.importance") %>%
{tibble(Measure=names(.), Importance=.)} %>%
write.csv("Outputs/Other/Var-Imp_UMS0.1.1.csv", row.names=FALSE)
readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS0.2.Rds") %>%
map_dfr(~ .x %>%
pluck("finalModel", "variable.importance") %>%
{tibble(Measure=names(.), value=.)},
.id="name") %>%
pivot_wider(names_prefix="Importance_") %>%
write.csv("Outputs/Other/Var-Imp_UMS0.2.csv", row.names=FALSE)
```
This step isn't strictly necessary for the R code in this document, but it allows us to run [`umsData()`](index#r-scripts) for all UMSs without needing to be on the [computer where the auto-coders are saved](index#a-quick-note-on-the-two-computer-setup).
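If you'd like to see which acoustic measures a precursor auto-coder flagged as most gender-informative, you can peek at the saved file; a sketch (the top-10 cutoff here is arbitrary, not the selection criterion itself):
```{r, eval=FALSE}
##Most gender-informative measures from the UMS 0.1.1 precursor auto-coder
read_csv("Outputs/Other/Var-Imp_UMS0.1.1.csv") %>%
  slice_max(Importance, n=10)
```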
### Analyze fairness and performance
Now we can analyze metrics on a user-friendlier system.
Read fairness/performance data for the round 1 auto-coders, and add baseline data:
```{r}
##Read
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
perf_round1 <- rbind(perf_baseline,
read_csv("Outputs/Performance/Perf_UMS-Round1.csv"))
##Decode first digit of UMS code
categories <- c("Baseline", "Downsampling", "Valid pred selection",
"Normalization", "Combination") %>%
set_names(0:4)
##Shape performance dataframe for plotting: Fairness/Performance in separate
## columns, one row per UMS * metric, Category factor, shorter Classifier label
perfPlot_round1 <- perf_round1 %>%
select(Classifier,
Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
matches("ClassAcc_(Present|Absent)_Diff")) %>%
##Fairness/Performance in separate columns, one row per UMS * metric
pivot_longer(contains("Acc")) %>%
mutate(Metric = fct_inorder(if_else(str_detect(name, "Absent|Present"),
paste(str_extract(name, "Absent|Present"), "class accuracy"),
"Overall accuracy")),
name = if_else(str_detect(name, "Diff"), "Fairness", "Performance")) %>%
pivot_wider() %>%
##Add Category column, shorter Classifier label
mutate(Category = recode_factor(str_extract(Classifier, "\\d"), !!!categories),
across(Classifier, ~ str_remove(.x, "UMS")))
```
```{r}
##Plot
perfPlot_round1 %>%
mutate(across(Fairness, abs)) %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=20) +
##Each metric in its own facet
## (N.B. use arg scales="free" to have each facet zoom to fit data)
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw()
```
To make the baseline stand out, we can plot it with separate aesthetics:
```{r}
perfPlot_round1 %>%
mutate(across(Fairness, abs)) %>%
##Exclude Baseline from points & labels
filter(Classifier != "0.0") %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=20) +
##Dotted line for baseline
geom_vline(data=perfPlot_round1 %>% filter(Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot_round1 %>% filter(Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Each metric in its own facet
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw()
```
While numerous UMSs improve fairness relative to the baseline, there is no single obvious winner.
This is often because a UMS will perform well on some metrics but poorly on others.
For example, while UMS 1.3.2 improves fairness and performance for overall accuracy and Absent class accuracy, it has dismal performance for Present class accuracy (under 40%).
In other instances the disparity is more dramatic;
for example, UMS 1.5 is clearly superior for the fairness/performance tradeoff when it comes to Present class accuracy, but its Absent class accuracy fairness is _worse_ than the baseline.
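To put numbers on these tradeoffs, it can help to sort the plotted data for one metric at a time; a quick sketch using the `perfPlot_round1` dataframe built above:
```{r, eval=FALSE}
##Round 1 UMSs ranked by Present class accuracy performance,
##with the absolute fairness gap alongside
perfPlot_round1 %>%
  filter(Metric=="Present class accuracy", Classifier != "0.0") %>%
  mutate(across(Fairness, abs)) %>%
  arrange(desc(Performance)) %>%
  select(Classifier, Category, Performance, Fairness)
```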
## UMS round 2
Since the round 1 results weren't completely satisfactory, I decided to attempt combination strategies: combining downsampling with either valid predictor selection or normalization.
Combining these strategies is feasible because different categories of UMS affect the data in different ways;
downsampling UMSs remove tokens (rows), valid predictor selection UMSs remove acoustic measures (columns), and normalization transforms acoustic measures.
This could theoretically produce better results if the strengths and weaknesses of the combined UMSs hedge against one another (e.g., the improvements in Present class accuracy performance for UMS 2.2 could balance out the decline in Present class accuracy performance for UMS 1.3.2).
I chose 8 combination UMSs based on round 1 results: 2 downsampling UMSs (1.3.1, 1.3.2) $\times$ 4 other UMSs (2.1.1, 2.2, 2.3, 3.1).
I chose these UMSs because they were relatively balanced across all 3 metrics for fairness and performance (i.e., excluding UMSs like 1.5 that performed very poorly on at least one metric).
In your own projects, it may be appropriate to choose different UMSs to combine depending on how the round 1 results shake out.
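For reference, those eight combinations are simply the crossing of the two sets; a small sketch (labels only, following the round 1 UMS numbering):
```{r, eval=FALSE}
##The 8 round-2 combinations (run as the UMS 4.x series):
##each downsampling UMS crossed with each other selected UMS
crossing(Downsampling = c("1.3.1", "1.3.2"),
         Other = c("2.1.1", "2.2", "2.3", "3.1"))
```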
### Run auto-coders and extract metrics {#ums2-run}
Load Bash, navigate to `Shell-Scripts/`, and run one of the following:
```{bash, eval=FALSE}
##Run with Slurm
sbatch UMS-Round2.sh
##OR
##Run directly
bash UMS-Round2.sh &> ../Outputs/Shell-Scripts/UMS-Round2.out
```
Once that script is done running, you should have additional files in `Outputs/Diagnostic-Files/Temp-Autocoders/`.
To extract and save fairness/performance metrics from the round 2 auto-coders, switch back to R (staying on the same computer), and run the following code:
```{r, eval=params$extract_metrics}
##Get auto-coder filenames (all combo UMSs start with the digit 4)
files_round2 <- list.files("Outputs/Diagnostic-Files/Temp-Autocoders/",
"Run-UMS_UMS4", full.names=TRUE)
##Read auto-coder files
cls_round2 <-
files_round2 %>%
##Better names
set_names(str_remove_all(files_round2, ".+_|\\.Rds")) %>%
map(readRDS)
##Extract performance
cls_round2 %>%
map_dfr(cls_summary, .id="Classifier") %>%
##Add long description
left_join(umsList %>% mutate(Classifier = paste0("UMS", UMS)) %>% select(-UMS),
by="Classifier") %>%
##Save data
write_csv("Outputs/Performance/Perf_UMS-Round2.csv")
```
### Analyze fairness and performance
Now we can analyze these metrics on a user-friendlier system.
Read fairness/performance data for the round 2 auto-coders, and add baseline data:
```{r}
##Read
perf_baseline <- read_csv("Outputs/Performance/Perf_Baseline.csv")
perf_round2 <- rbind(perf_baseline,
read_csv("Outputs/Performance/Perf_UMS-Round2.csv"))
##Shape performance dataframe for plotting: Fairness/Performance in separate
## columns, one row per UMS * metric, Category factor, shorter Classifier label
perfPlot_round2 <- perf_round2 %>%
select(Classifier,
Acc, Acc_Diff, matches("ClassAcc_(Present|Absent)$"),
matches("ClassAcc_(Present|Absent)_Diff")) %>%
##Fairness/Performance in separate columns, one row per UMS * metric
pivot_longer(contains("Acc")) %>%
mutate(Metric = fct_inorder(if_else(str_detect(name, "Absent|Present"),
paste(str_extract(name, "Absent|Present"), "class accuracy"),
"Overall accuracy")),
name = if_else(str_detect(name, "Diff"), "Fairness", "Performance")) %>%
pivot_wider() %>%
##Add Category column, shorter Classifier label
mutate(Category = recode_factor(str_extract(Classifier, "\\d"), !!!categories),
across(Classifier, ~ str_remove(.x, "UMS")))
```
Plot (using a dashed line for the baseline):
```{r}
perfPlot_round2 %>%
mutate(across(Fairness, abs)) %>%
##Exclude Baseline from points & labels
filter(Classifier != "0.0") %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=20, color="black") +
##Dotted line for baseline
geom_vline(data=perfPlot_round2 %>% filter(Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot_round2 %>% filter(Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Each metric in its own facet
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw() +
##Rotate x-axis labels to avoid clash
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
We can also plot round 1 & 2 together:
```{r}
##Put perfPlot dfs together
perfPlot <- rbind(perfPlot_round1,
perfPlot_round2) %>%
##Remove duplicate Baseline rows
distinct()
perfPlot %>%
mutate(across(Fairness, abs)) %>%
##Exclude Baseline from points
filter(Classifier != "0.0") %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=40) +
##Dotted line for baseline
geom_vline(data=perfPlot %>% filter(Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot %>% filter(Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Each metric in its own facet
facet_wrap(~ Metric, nrow=1) +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw() +
##Rotate x-axis labels to avoid clash
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
## Identify optimal auto-coder
Now that we've got all our performance data, we need to choose which auto-coder to actually _use_ for auto-coding data that hasn't previously been coded (i.e., to scale up our dataset of coded tokens without more manual coding).
The previous plot tells us that some UMSs are better than others (e.g., we obviously won't be using UMS 1.2), but there isn't any UMS that clearly stands out from the rest.
Furthermore, even if we eliminate the obviously bad options, there seems to be a tradeoff between performance and fairness.
How do we winnow down the space of options?
One technique is to find the UMSs that are _Pareto-optimal_: a given UMS is Pareto-optimal if every other UMS that is _better_ in fairness is _worse_ in performance, or vice versa.
In this sense, the best UMS for our purposes might be neither the fairest nor the best-performing, but the UMS for which there's a good fairness--performance tradeoff.
In R, we can use `psel()` from the `rPref` package to find Pareto-optimal auto-coders.
For example, the following auto-coders are Pareto-optimal for Overall accuracy:
```{r}
perfPlot %>%
mutate(across(Fairness, abs)) %>%
filter(Metric=="Overall accuracy") %>%
psel(high(Performance) * low(Fairness))
```
Here's that same info represented in a plot:
```{r, warning=FALSE}
perfPlot %>%
mutate(across(Fairness, abs)) %>%
filter(Metric=="Overall accuracy",
Classifier != "0.0") %>%
mutate(`Pareto-optimal` = Classifier %in% psel(., high(Performance) * low(Fairness))$Classifier) %>%
ggplot(aes(x=Fairness, y=Performance, color=Category, label=Classifier, alpha=`Pareto-optimal`)) +
##Points w/ ggrepel'd labels
geom_point(size=3) +
geom_text_repel(show.legend=FALSE, max.overlaps=40) +
##Dotted line for baseline
geom_vline(data=perfPlot %>%
filter(Metric=="Overall accuracy", Classifier=="0.0"),
aes(xintercept=abs(Fairness)), linetype="dashed") +
geom_hline(data=perfPlot %>%
filter(Metric=="Overall accuracy", Classifier=="0.0"),
aes(yintercept=Performance), linetype="dashed") +
##Lower fairness on the right (so top-right is optimal)
scale_x_reverse() +
##Theme
theme_bw() +
##Rotate x-axis labels to avoid clash
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
In this case, we have 3 metrics, so for each metric we find the UMSs that are Pareto-optimal for the fairness--performance tradeoff:
```{r}
##Get list of dataframes, one per Metric, w/ only Pareto-optimal UMSs
paretoOpt <-
perfPlot %>%
mutate(across(Fairness, abs)) %>%
group_by(Metric) %>%
##Identify Pareto-optimal UMSs
group_map(~ psel(.x, high(Performance) * low(Fairness)),
.keep=TRUE)
##Display as a single dataframe
bind_rows(paretoOpt)
```
In this particular dataset, we get really lucky:
There is one UMS, 4.2.1, that is shared among these 3 sets of Pareto-optimal UMSs.
This is certainly _not_ a guaranteed outcome!
```{r}
##Get Classifier value that is in all 3 dataframes (if any)
paretoOpt %>%
map("Classifier") %>%
reduce(intersect)
```
Incidentally, UMS 4.2.1 also happens to be the fairest UMS for all 3 metrics---
this is _definitely_ not a guaranteed outcome!
```{r, results='hold'}
##Get fairest UMS for each Metric
perfPlot %>%
mutate(across(Fairness, abs)) %>%
group_by(Metric) %>%
filter(Fairness==min(Fairness))
##Could also do
# perfPlot %>%
# mutate(across(Fairness, abs)) %>%
# group_by(Metric) %>%
# group_modify(~ psel(.x, low(Fairness)))
```
Thus, we choose 4.2.1 as the optimal UMS.
In fact, this is the UMS that was used to grow an /r/ dataset fivefold for the 2021 _Language Variation and Change_ article ["Gender separation and the speech community: Rhoticity in early 20th century Southland New Zealand English"](https://doi.org/10.1017/S0954394521000090) by me, Lynn Clark, Jen Hay, and Kevin Watson.
(The auto-coder used for that analysis was optimized for performance, so the fairness we report in that paper is slightly worse than that of UMS 4.2.1.)
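If you want to inspect the winning auto-coder itself, it lives alongside the others in the temp directory; a sketch (this assumes you're on the computer where the auto-coders were run, and that the file follows the `Run-UMS_UMS*.Rds` naming pattern used above):
```{r, eval=FALSE}
##Re-load the chosen auto-coder and double-check its fairness
cls_optimal <- readRDS("Outputs/Diagnostic-Files/Temp-Autocoders/Run-UMS_UMS4.2.1.Rds")
cls_fairness(cls_optimal, byClass=TRUE)
```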
# Script meta-info {#script-info}
## R session info
```{r, eval=TRUE}
sessionInfo()
```
## Disk space used
These are only shown if `params$extract_metrics` is `TRUE` (because otherwise it's assumed that you're not working on the same system the auto-coders were run on).
Temporary auto-coders:
```{r, eval=params$extract_metrics}
tmpAuto <-
list.files("Outputs/Diagnostic-Files/Temp-Autocoders/",
# "^(Run-UMS|(Hyperparam-Tuning|Outlier-Dropping)_UMS0\\.0).*Rds$") %>%
"^Run-UMS.*Rds$", full.names=TRUE) %>%
file.info()
# cat("Disk space: ", round(sum(tmpAuto$size)/2^30, 1), " Gb (",
cat("Disk space: ", round(sum(tmpAuto$size)/2^20, 1), " Mb (",
nrow(tmpAuto), " files)", sep="")
```
Complete repository:
```{r, eval=params$extract_metrics}
if (.Platform$OS.type=="windows") {
shell("dir /s", intern=TRUE) %>%
tail(2) %>%
head(1) %>%
str_trim() %>%
str_squish()
}
if (.Platform$OS.type=="unix") {
system2("du", "-sh", stdout=TRUE) %>%
str_remove("\\s.+") %>%
paste0("b")
}
```
## Machine specs
System:
```{r, eval=TRUE}
Sys.info()
```
Processor:
```{r, eval=TRUE}
get_cpu()
```
RAM:
```{r, eval=TRUE}
get_ram()
```
## Running time
Total running time for shell scripts:
```{r timing, eval=TRUE}
##Parse HH:MM:SS and print as nicer time
printDur <- function(x) {
library(lubridate)
library(magrittr)
x %>%
lubridate::hms() %>%
as.duration() %>%
sum() %>%
seconds_to_period()
}
scripts <- c("Baseline", "UMS-Round1", "UMS-Round2")
cat("Total running time for", paste0(scripts, ".sh", collapse=", "), fill=TRUE)
paste0("Outputs/Shell-Scripts/", paste0(scripts, ".out")) %>%
map_chr(~ .x %>%
readLines() %>%
str_subset("RunTime")) %>%
str_extract("[\\d:]{2,}") %>%
printDur()
```
Total running time for R code in this document (with `params$extract_metrics` set to `r params$extract_metrics`), in seconds:
```{r, eval=TRUE}
timing$stop <- proc.time()
timing$stop - timing$start
```
# Acknowledgements
I would like to thank Chris Bartlett, the Southland Oral History Project (Invercargill City Libraries and Archives), and the speakers for sharing their data and their voices.
Thanks are also due to Lynn Clark, Jen Hay, Kevin Watson, and the New Zealand Institute of Language, Brain and Behaviour for supporting this research.
Valuable feedback was provided by audiences at NWAV 49, the Penn Linguistics Conference, Pitt Computer Science, and the Michigan State SocioLab.
Other resources were provided by a Royal Society of New Zealand Marsden Research Grant (16-UOC-058) and the University of Pittsburgh Center for Research Computing (specifically, the H2P cluster supported by NSF award number OAC-2117681).
Any errors are mine entirely.
<!-- Footer: Add CSS styling -->
```{css, echo=FALSE}
/* Add scrollbars to long R input/output blocks */
pre {
max-height: 300px;
overflow-y: auto;
}
/* Repo navigation block */
div#repo-nav {
background-color: #ddd;
width: fit-content;
padding: 10px 20px;
border: 1px solid #bbb;
border-radius: 20px;
}
div#repo-nav * {
margin: 0px;
font-style: italic;
}
```