This pipeline was built to predict antimicrobial resistance (AMR) to 8 common antibiotics in 5 different Gram-negative species based on whole-genome sequencing data.
This project was tested with all dependencies installed on a Linux-based machine.
DIAMOND is used to align the bacterial sequencing files (fastq format) to a precompiled database of antimicrobial resistance genes. To install DIAMOND, follow the instructions on its GitHub page.
The database to align to is provided in this project and does not need to be compiled.
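For orientation, the alignment step can be pictured as a translated search of the DNA reads against the provided database. This is a minimal sketch and not necessarily the command the pipeline runs: the function name `align_reads`, the use of `blastx`, and the tabular output format are assumptions. `--db`, `--query`, `--out` and `--outfmt` are standard DIAMOND options.

```shell
# Sketch only: the real pipeline may use different DIAMOND options.
# align_reads <diamond binary> <database> <reads file> <output file>
align_reads() {
    local diamond_bin="$1" db="$2" reads="$3" out="$4"
    # blastx translates the DNA reads and searches them against a protein
    # database; --outfmt 6 requests BLAST-style tab-separated output.
    "$diamond_bin" blastx --db "$db" --query "$reads" --out "$out" --outfmt 6
}
```

The DIAMOND binary is passed in as the first argument, mirroring the way the path to `diamond` is handed to the master script.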
Processing of the output created by the DIAMOND aligner and prediction of the AMR are done in R. This project used R 3.5.0, so this or any later version should work (older versions are expected to work as well). After installing R on your machine, install the following packages before running the pipeline:
install.packages(c("tibble", "dplyr", "seqinr", "stringr", "purrr", "DescTools", "xgboost"))
If you would like to download and process publicly available files directly from NCBI's Sequence Read Archive (SRA), you can install the SRA-Tools.
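Before running the pipeline, it can help to confirm that the external tools are reachable. The following is a minimal sketch that assumes the tools are on the `PATH` under the names `diamond`, `Rscript` and `fastq-dump`. The helper name `check_deps` is hypothetical and not part of this project.

```shell
# check_deps <tool>...: report every tool that cannot be found on PATH.
# Returns 0 when all tools are found, 1 otherwise.
check_deps() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing dependency: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_deps diamond Rscript` checks the two required tools. Add `fastq-dump` to the list only if you plan to download files from SRA.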
The models were created to make predictions for the following bacterial species: Acinetobacter baumannii, Enterobacter cloacae, Escherichia coli, Klebsiella aerogenes and Klebsiella pneumoniae. They were trained to be species independent and could theoretically make predictions on other Gram-negative species, but they have not been validated for those thus far.
The input files should be next-generation sequencing files in either fastq or compressed fastq.gz format. This can be a single file or two separate files in the case of paired-end reads. The `testFiles` folder in the `wgs2amr` folder contains dummy files that can be used to test the pipeline once all dependencies are installed.
It is also possible to download and process files directly from NCBI's Sequence Read Archive (SRA) if the SRA-Tools are installed on your machine and you know the SRR number of the sample of interest. The pipeline can process files of different sequencing depths, but note that shallow sequencing is more likely to miss important resistance genes, so the error rate could be higher in such cases.
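Because sequencing depth affects the error rate, it may be worth checking how many reads a file holds before running the pipeline. The sketch below assumes standard four-line FASTQ records and that `gzip` is available. The function name `count_reads` is illustrative and not part of the pipeline.

```shell
# count_reads <file.fastq|file.fastq.gz>: print the number of reads,
# assuming each record spans exactly four lines.
count_reads() {
    local file="$1" lines
    case "$file" in
        *.fastq.gz) lines=$(gzip -cd "$file" | wc -l) ;;
        *.fastq)    lines=$(wc -l < "$file") ;;
        *) echo "unsupported file type: $file" >&2; return 1 ;;
    esac
    echo $((lines / 4))
}
```

The read count alone does not give the coverage, which also depends on read length and genome size, but a very low number is a quick warning sign.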
Download the `wgs2amr` folder from this GitHub page and put it on the same machine where you installed the dependencies.
The `wgs2amr` folder contains the master script `wgs2amr.sh`, which can be called with the following arguments:
- `-d` : Path to the `diamond` script in the diamond folder
- `-r` : Either the path to `Rscript` in the R bin folder, or the name of the R module to load on a Linux machine (e.g. `R/3.5.0`)
- `-f` : The first sequence file. If there is only one sequence file, it should be passed here. Both fastq and fastq.gz are supported
- `-s` : The second sequence file in the case of two paired-end read files. Again, both fastq and fastq.gz are supported. If there is only one file, this argument can be omitted
- `-o` : The location of the output folder where the prediction results in csv format will be stored. If not set, this defaults to the `RESULTS` folder within the `wgs2amr` folder
Here are some examples of a typical `wgs2amr.sh` call using the test data provided in the `wgs2amr` folder:
#Using two paired-end read input files and the path to Rscript
/pathToScript/wgs2amr.sh \
-d '/pathToDiamond/diamond' \
-r '/pathToR/bin/Rscript' \
-f '/pathTo/wgs2amr/testFiles/readsFile1.fastq.gz' \
-s '/pathTo/wgs2amr/testFiles/readsFile2.fastq.gz'
#Using one combined reads file and the name of the R module
/pathToScript/wgs2amr.sh \
-d '/pathToDiamond/diamond' \
-r 'R/3.5.0' \
-f '/pathTo/wgs2amr/testFiles/testFile.fastq.gz'
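To process several paired-end samples in one go, the master script can be called in a loop. This is a sketch under assumptions: the helper `run_batch` is hypothetical, and it assumes the read files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, a naming convention chosen for this example rather than one required by the pipeline.

```shell
# run_batch <wgs2amr.sh> <diamond> <Rscript path or R module> <reads dir>
# Calls the pipeline once per sample found in the reads directory.
run_batch() {
    local script="$1" diamond="$2" rscript="$3" dir="$4" first second
    for first in "$dir"/*_1.fastq.gz; do
        [ -e "$first" ] || continue   # the pattern matched no files
        second="${first%_1.fastq.gz}_2.fastq.gz"
        if [ -e "$second" ]; then
            "$script" -d "$diamond" -r "$rscript" -f "$first" -s "$second"
        else
            echo "skipping $first: no matching second read file" >&2
        fi
    done
}
```

A call could look like `run_batch /pathToScript/wgs2amr.sh /pathToDiamond/diamond R/3.5.0 /pathTo/reads`. Samples are processed one after another, so the total run time grows with the number of samples.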
With the `-n` argument set, the pipeline downloads the read files of the specified SRR from NCBI's SRA and puts them in the `wgs2amr/downloads/` folder. If the files have already been downloaded, this step is skipped to save time and bandwidth. Should a file be corrupt and lead to errors, simply delete both read files from the folder and run the script again, which will start a new download.
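The skip-if-present behaviour described above can be pictured roughly as follows. This is an illustrative sketch, not the code in `wgs2amr.sh`: the function name `fetch_srr` is hypothetical, and the `<SRR>_1.fastq.gz` and `<SRR>_2.fastq.gz` file names assume that `fastq-dump` is run with `--split-files --gzip` on a paired-end run.

```shell
# fetch_srr <fastq-dump binary> <SRR accession> <download dir>
# Downloads the reads unless both read files are already present.
fetch_srr() {
    local dump="$1" srr="$2" dir="$3"
    # -s is true only for files that exist and are not empty.
    if [ -s "$dir/${srr}_1.fastq.gz" ] && [ -s "$dir/${srr}_2.fastq.gz" ]; then
        echo "found existing files for $srr, skipping download"
        return 0
    fi
    mkdir -p "$dir"
    "$dump" --split-files --gzip --outdir "$dir" "$srr"
}
```

A size check like this catches empty files but not truncated or otherwise corrupt ones, which is why deleting the files by hand remains the remedy when a download turns out to be damaged.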
- `-n` : Use NCBI's SRAToolkit's `fastq-dump` to download an SRR file. Set this parameter to the path of the `fastq-dump` script or the name of the SRAToolkit module to be loaded on a Linux machine
- `-d` : Path to the `diamond` script in the diamond folder
- `-r` : Either the path to `Rscript` in the R bin folder, or the name of the R module to load on a Linux machine (e.g. `R/3.5.0`)
- `-f` : The SRR number you would like to download and process
- `-o` : The location of the output folder where the prediction results in csv format will be stored. If not set, this defaults to the `RESULTS` folder within the `wgs2amr` folder
#Downloading data from NCBI's SRA then processing it
/pathToScript/wgs2amr.sh \
-n '/pathToSRAtoolkit/fastq-dump' \
-d '/pathToDiamond/diamond' \
-r 'R/3.5.0' \
-f 'SRR4017839'
Note: If an output folder is specified, make sure the path ends with a forward slash (`/`).
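When the pipeline is called from a wrapper script, the trailing `/` can be added automatically. A small sketch, with `ensure_trailing_slash` as a hypothetical helper name:

```shell
# ensure_trailing_slash <path>: print the path with exactly one "/" appended
# when it does not already end in one. Empty input is rejected so that an
# unset variable never turns into "/".
ensure_trailing_slash() {
    [ -n "$1" ] || { echo "empty path" >&2; return 1; }
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}
```

The result can then be passed on, for example as `-o "$(ensure_trailing_slash "$outdir")"`, where `outdir` is a variable of your own script.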
The output file is named after the first sequence file, in the format `<filename>_AMRpredictions.csv`, and can be found in the default or specified output folder.
The first column contains the names of the 8 antibiotics for which predictions were made, the second column the predicted susceptibility (susceptible or resistant), and the last column the reliability, which indicates how confident the model is that the prediction is correct (0 = very uncertain to 1 = near certain).
Example of the test file output:
antibiotic | prediction | reliability |
---|---|---|
cefepime | resistant | 0.70 |
cefotaxime | resistant | 0.78 |
ceftriaxone | resistant | 0.98 |
ciprofloxacin | resistant | 0.96 |
gentamicin | susceptible | 0.88 |
levofloxacin | resistant | 0.96 |
meropenem | susceptible | 0.40 |
tobramycin | resistant | 0.66 |
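Predictions with a low reliability deserve a second look. The sketch below lists them with `awk`. It assumes the CSV is comma-separated, has a header row, holds the three columns in the order shown above, and may wrap fields in double quotes. The helper name `low_reliability` and the default threshold of 0.5 are illustrative choices, not part of the pipeline.

```shell
# low_reliability <predictions.csv> [threshold]
# Prints every prediction whose reliability is below the threshold.
low_reliability() {
    awk -F',' -v t="${2:-0.5}" '
        NR == 1 { next }      # skip the header row
        {
            gsub(/"/, "")     # drop optional double quotes around fields
            if ($3 + 0 < t + 0) print $1 ": " $2 " (" $3 ")"
        }' "$1"
}
```

Applied to the example output above with the default threshold, only the meropenem prediction (reliability 0.40) would be listed.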
NOTE: Temporary files, such as the DIAMOND alignment files, are stored in the `wgs2amr/temp` folder and can safely be removed after the pipeline has finished.