ADP is a set of scripts for standardizing free-text datasets in TSV form. ADP uses a three-stage approach that applies standardization at the level of characters, words, and phrases.
Included in this repository are two datasets containing age and data-location free-text data from the Immune Epitope Database (IEDB).
The scripts char_normalizer.py, word_normalizer.py, and phrase_normalizer.py perform the core normalization functions in ADP, but the scripts/ directory contains several other scripts that provide accessory functions and perform data collection and analysis on the outputs of the normalization process.
ADP needs a specific folder structure to work on a dataset. To run ADP on a TSV of data, do the following steps. The linked paths point to the corresponding directories and files for the age dataset as examples.
- Create a main directory for that "style" of data, like the directory age/.
- Create the following directories in the parent directory: input_files/, output_files/.
- Put your TSV file of data in <style>/input_files/.
- Optional: Also create these directories: <style>/analysis/ and <style>/analysis/figures/.
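The folder setup in the steps above can also be scripted. The following is a minimal sketch, not part of ADP: the function name create_style_layout and its parameters are hypothetical, and it only creates the directories named in the steps.

```python
from pathlib import Path


def create_style_layout(root, style, with_analysis=False):
    """Create the folder layout for one "style" and return the style directory.

    Hypothetical helper: `root` is where the style directory should live,
    `style` is the directory name (for example "age").
    """
    style_dir = Path(root) / style
    subdirs = ["input_files", "output_files"]
    if with_analysis:
        # Optional directories from the last step.
        subdirs += ["analysis", "analysis/figures"]
    for sub in subdirs:
        # parents=True also creates the style directory itself if needed.
        (style_dir / sub).mkdir(parents=True, exist_ok=True)
    return style_dir
```

For example, create_style_layout(".", "age", with_analysis=True) prepares age/ in the current directory; the TSV file still has to be copied into age/input_files/ by hand.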
The character normalization script requires 3 arguments:
- Style: the name of the parent directory for that dataset, e.g., "age".
- File name: the name of the TSV of data to be normalized, not including its path, e.g., "age.tsv".
- Target column: the name of the column of data to normalize, e.g., "h_age".
To call this script on the age dataset, try:
python3 scripts/char_normalizer.py age age.tsv h_age
The word and phrase normalization scripts only require 2 arguments:
- Style: the name of the parent directory for that dataset, e.g., "age".
- Target column: the name of the column of data to normalize, e.g., "h_age".
To call these scripts on the age dataset, try:
python3 scripts/word_normalizer.py age h_age
python3 scripts/phrase_normalizer.py age h_age
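If you prefer to drive the three scripts from Python, the calls above can be assembled as follows. This is a sketch under assumptions: build_commands and run_all are hypothetical helpers, the scripts are assumed to live in scripts/ relative to the working directory, and the stages are assumed to run in the order characters, words, phrases. In practice you will likely want to stop between stages to edit the review files.

```python
import subprocess
import sys


def build_commands(style, file_name, column, scripts_dir="scripts"):
    """Return the three command lines in character, word, phrase order."""
    py = sys.executable  # the Python interpreter running this script
    return [
        # The character normalizer takes style, file name, and target column.
        [py, f"{scripts_dir}/char_normalizer.py", style, file_name, column],
        # The word and phrase normalizers take only style and target column.
        [py, f"{scripts_dir}/word_normalizer.py", style, column],
        [py, f"{scripts_dir}/phrase_normalizer.py", style, column],
    ]


def run_all(style, file_name, column):
    """Run each stage in turn, stopping at the first one that fails."""
    for cmd in build_commands(style, file_name, column):
        subprocess.run(cmd, check=True)
```

Calling run_all("age", "age.tsv", "h_age") from the repository root is then equivalent to issuing the three commands shown above one after another.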
Generally, the scripts in this repository attempt to adhere to a convention of requiring arguments in a general-to-specific order, e.g., directory, filename, column.
Running the character normalization script will create two files: a review file and a character-normalized output file. By editing the action columns in the review file, you can create rules that direct the behavior of the character normalization script the next time you run it on the dataset. The action columns are as follows:
- replace_with: Adding text in this column within a row directs the script to replace the character listed in that row with the text in that column.
- remove: Adding any text in this column within a row directs the script to remove the character listed in that row wherever it occurs in the data.
- invalidate: Adding any text in this column within a row directs the script to fail validation on data items that contain the character listed in that row.
- allow: Adding any text in this column within a row directs the script to accept that character as a known & permitted character that will not trigger data items to fail validation.
Numbers and lowercase letters are automatically treated as allowed characters and do not appear in the review file.
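To make the four actions concrete, here is an illustrative sketch of how such rules could be applied to a single data item. It is not the code in char_normalizer.py. The function apply_char_rules, the rule dictionary format, the precedence among the action columns, and the treatment of characters with no decision (assumed here to fail validation) are all assumptions made for the example.

```python
import string

# Digits and lowercase ASCII letters are treated as allowed without a rule.
AUTO_ALLOWED = set(string.digits + string.ascii_lowercase)


def apply_char_rules(text, rules):
    """Apply per-character rules to one data item.

    `rules` maps a character to a dict that may contain the keys
    "replace_with", "remove", "invalidate", and "allow".
    Returns (normalized_text, is_valid).
    """
    out = []
    valid = True
    for ch in text:
        rule = rules.get(ch, {})
        if rule.get("replace_with"):
            out.append(rule["replace_with"])  # substitute the given text
        elif rule.get("remove"):
            continue  # drop the character
        elif rule.get("invalidate"):
            valid = False  # the item fails validation
            out.append(ch)
        elif rule.get("allow") or ch in AUTO_ALLOWED:
            out.append(ch)  # known and permitted
        else:
            # Assumption: a character with no decision fails validation.
            valid = False
            out.append(ch)
    return "".join(out), valid


rules = {
    "-": {"replace_with": " "},
    "!": {"remove": "x"},
    "?": {"invalidate": "x"},
    " ": {"allow": "x"},
}
print(apply_char_rules("25-30 years!", rules))  # ('25 30 years', True)
print(apply_char_rules("25?", rules))           # ('25?', False)
```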
Once you have made an action decision (by adding text to an action column) on at least one row in the review file, re-running the character normalizer script transfers the line(s) with action decisions to a reference file and applies the actions you specified to the output file. You can also edit the reference file to change the behavior of the script in future runs.
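The transfer step can be pictured as splitting the review rows into those with a decision and those without. The sketch below is only an illustration: it assumes the review file is a TSV whose header includes the four action column names, and split_decided_rows and read_tsv are hypothetical helpers rather than functions from ADP.

```python
import csv

ACTION_COLUMNS = ("replace_with", "remove", "invalidate", "allow")


def read_tsv(path):
    """Read a TSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def split_decided_rows(rows):
    """Split review rows into (decided, undecided).

    A row counts as decided when any action column holds non-empty text.
    Values are not stripped, so a replacement consisting of a single
    space still counts as a decision.
    """
    decided, undecided = [], []
    for row in rows:
        has_decision = any(row.get(col) for col in ACTION_COLUMNS)
        (decided if has_decision else undecided).append(row)
    return decided, undecided
```

Under these assumptions, the decided rows are the ones that would move to the reference file, while the undecided rows would stay in the review file for a later pass.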
The word normalization script functions the same way, with words instead of characters.
More coming soon on the phrase normalization script.