Pathogenicity prediction for bacterial genomes is a random forest based method for the assessment of the pathogenic potential of a set of reads belonging to a single genome. Its strength lies in the prediction of novel, unknown bacterial pathogens.
The related paper is available at https://www.nature.com/articles/srep39194
*** News
- New forestes were trained with a new version of ranger and are available in the release
- The new section Predicting real data
# install.packages("devtools")
devtools::install_github("crarlus/paprbag")
(Installation might require some time due to the package's dependencies)
Due to changes of the dependencies, some problems occurred when installing paprbag and using the original data. In particular, different ranger versions are not compatible. Since the original data were trained with ranger 0.3 they can only be used when the same ranger version is installed on your system.
The original version of paprbag is still available under the legacy branch.
The data made available in this branch, as well as the release data, should work with recent ranger versions.
For full reproducibility, the packages used for building the data sets are provided as a packrat bundle. The bundle can be downloaded from the release
They can be installed via
# Download the bundle
path2bundle <- "/path/to/bundle/"
packrat::unbundle(bundle=path2bundle, where="/path/to/my/project")
Alternatively you can unpack the bundle, change into the new directory and start an R session. This will trigger the installation of the bundled packages. See here for more details about packrat.
The packages versions are also listed here. They were installed under R version 3.4.4.
The resource of the inferred pathogenicity labels can be found in labels
The label data together with the strain's Bioproject accession can be accessed by data("Labels")
The large classifiers used in the publication are provided in the R package data4paprbag. They were trained with ranger version 0.3. Current data is available in the (release tab)[link]
library(paprbag)
library(ranger) # load ranger package explicitly
data("ReadData")
# Or read in sequence data via Biostrings::readDNAStringSet()
Prediction <- Predict.ReadSet (ForestObject = forest, ReadObject = ReadData, Feature.Configuration = NULL, verbose = T)
# Histogram of pathogenic potential
hist(Prediction$predictions[,"TRUE"])
# Mean pathogenic potential
mean(Prediction$predictions[,"TRUE"])
Predict.ReadSet.fromFiles (Path2Forest = Path2Forest, Path2ReadFiles = Path2ReadFiles, saveLocation = saveLocation, OutputFilename = OutputFilename, Return.Predictions = F, Save.AsTSV = F, verbose = T, num.threads = 1)
In this section, we describe how paprbag can be applied to real data based on the pre-trained forests located in the release.
The following forests are available in the release tab:
- classifier_all.rds: Classifier that was trained on all data described in paprbag; only nucleotide features (faster)
- classifier_all_includingPeptideFeatures.rds: Classifier that was trained on all data described in paprbag; including peptide features (slower)
Furthermore, the forests related to the 5-fold cross validation in (paprbag)[link] can be found here:
- classifier_fold1.rds: Classifier based on training fold 1
- classifier_fold2.rds: Classifier based on training fold 2
- classifier_fold3.rds: Classifier based on training fold 3
- classifier_fold4.rds: Classifier based on training fold 4
- classifier_fold5.rds: Classifier based on training fold 5
library(paprbag)
library(ranger)
# download forest from release
# e.g. using wget in linux:
# wget https://github.com/crarlus/paprbag/releases/download/2.0/classifier_all.rds
# or using your browser
# define your path to the previously saved forest
Path2Forest <- path/to/forest
# define independent prediction data
Path2ReadFiles <- path/to/independent_test_data_in_fasta_format
# define output
OutputFilename <- "prediction_paprbag"
saveLocation <- "Predictions/example"
Predictions_paprbag <- Predict.ReadSet.fromFiles (Path2Forest = Path2Forest, Path2ReadFiles = Path2ReadFiles, saveLocation = saveLocation, OutputFilename = OutputFilename, Return.Predictions = T, Save.AsTSV = F, num.threads = 20, verbose = T)
- Use case: Avoids re-loading of large forests when predicting many fasta files
library(paprbag)
library(ranger)
# download forest from release
# load forest
forest_large <- readRDS(file.path(forest_nuc))
# read file
Readfile <- Readfiles[1]
ReadData_real <- Biostrings::readDNAStringSet(Readfile)
# predict
Prediction_forest_nuc_realdata <- Predict.ReadSet (ForestObject = forest_large <- , ReadObject = ReadData_real, Feature.Configuration = NULL, verbose = T, num.threads = 20)
- note: the prediction function can be called with a number of threads via the num.threads option, however it still predicts read per read in a linear fashion. If a number of cores is available, the prediction process can be sped up by diving the reads in read chunks and joining the prediction results.
hist(Prediction_forest_nuc_realdata$predictions[,2])
mean(Prediction_forest_nuc_realdata$predictions[,2])
ifelse(mean(Prediction_forest_nuc_realdata$predictions[,2])> 0.5, "Pathogenic", "Non-Pathogenic")
You can pass a value of Feature.Configuration
in function Predict.ReadSet
. Choose between:
# load standard configuration; used in toy forest and classifier_all.rds, classifier_fold1.rds, etc.
data("Standard.configuration")
# load standard configuration including peptide features; used in forest classifier_all_includingPeptideFeatures.rds
data("Standard.configuration_peptides")
# infer feature configuration directly from forest
my.configuration <- paprbag:::IdentifyFeatures.FromForest(ForestObject = forest_new)
And set e.g. Feature.Configuration = Standard.configuration
Extract features from a set of reads
data("ReadData")
# Or read in sequence data via Biostrings::readDNAStringSet()
Features <- CreateFeaturesFromReads(Reads=ReadData, NT_kmax = 4, Do.NTMotifs = T)
data("Standard.configuration") # load standard configuration
Standard.configuration$Path2ReadFiles <- Path2ReadFiles # add path to Read files
Standard.configuration$savePath <- "Features" # add save path
do.call(UpdateFeatures,Standard.configuration)
# or
UpdateFeatures (Path2ReadFiles = Path2ReadFiles,savePath = file.path(Path2ReadFiles,"Features") ,Cores = 5, verbose = T)
Join all feature files in a directory and create read labels for every data row by matching to bioproject accessions.
#Labels: A vector containing either 'TRUE' or 'FALSE' together with the name attributes pointing to the Organism identifier (Bioproject ID)
data("Labels")
Create.TrainingDataSet (Path2Files = "Features", pattern="Features",OSlabels = Labels, savePath = "TrainingData")
# patterns: Regex-Pattern that identifies the feature files
Train a pathogenicity classifier from a set of feature data together with read label information
# Please specify paths:
# Path2FeatureFile: path to combined label and feature file (in rds format)
# savePath: path where trained classifier is saved
# example path
Tempdir <- tempdir()
data("trainingData")
saveRDS(trainingData,file.path(Tempdir,"trainingData.rds"))
Path2FeatureFile <- file.path(Tempdir,"trainingData.rds")
savePath <- file.path(Tempdir,"classifier")
# Run with path information
Run.Training (Path2FeatureFile = Path2FeatureFile, savePath = savePath)
# Specify
# Path2FeatureFile: Path to feature data only
# Path2LabelFile: Path to label file with labels for every data row in 'Path2FeatureFile'
forest <- Run.Training (Path2FeatureFile = Path2FeatureFile, Path2LabelFile = Path2LabelFile, SelectedFeatures = NULL, savePath = NULL, ReturnForest = T, verbose = T, min.node.size = 1, num.trees = 100)
As mentioned in the original publication, PaPrBaG is not restricted for classification of pathogens only.
It is rather a general workflow for the classification of labeled genomes, potential further applications range from bacterial host and habitat prediction, taxonomic classification to human and microbial read separation
Basically the same workflows as described here can be applied and also the features need not be re-calculated. It suffices to
- either provide a new path to a label file Path2NewLabelFile with labels for every data row in Path2FeatureFile
Run.Training (Path2FeatureFile = Path2FeatureFile, Path2LabelFile = Path2NewLabelFile, savePath = NewsavePath)
- or manipulate the column "Labels" in the data.frame "Path2FeatureFile". E.g.
trainingData$Labels <- as.factor(c(rep("NewLabel_A",floor(nrow(trainingData)/2)),rep("NewLabel_B",ceiling(floor(nrow(trainingData)/2)) )) )
saveRDS(trainingData, NewPath2FeatureFile) # save
Run.Training (Path2FeatureFile = NewPath2FeatureFile, savePath = NewsavePath)
In case that read labels are not yet available, a new training set based on the new labels can be obtained via the command
Create.TrainingDataSet (Path2Files = "Features", pattern="Features",OSlabels = NewLabels, savePath = "TrainingData")
A proper vector NewLabels containing two (or more) labels (e.g. 'TRUE' or 'FALSE') together with the name attributes pointing to the Organism identifier (Bioproject ID) is required. It outputs a file "ReadLabel_OS.rds" in the savePath which can be used for running the function Run.Training subsequently.
Predictions obtained from the newly trained classifier provide classification according to the new label.