Skip to content

Filtering out false taxonomic hits from shotgun sequencing based on genome coverage

Notifications You must be signed in to change notification settings

biocore/zebra_filter

Repository files navigation

Zebra Filter

Filtering out false taxonomic hits from shotgun sequencing based on aggregated genome coverage of all samples in dataset.

When you hear hoofs, think horse, not zebra.
-Theodore Woodward


Overview

False positives are common in metagenomic sequencing because databases are incomplete and do not represent many of the microbes actually in our samples. Short shotgun sequencing reads will incorrectly map to short common motifs and transposable elements, making it look like an microbe is present when it actually isn't. One way to distinguish these false positives is by looking at genome coverage; microbes that are actually present will have high genome coverage, while mis-mappings to short motifs will have low genome coverage. A problem with this coverage approach is that shotgun sequencing is often shallow for each sample, so coverage of all microbes will be low whether it is acually there or not.

The logic behind this method is that datasets often have many samples from the same environment, and those samples are likely to have very similar microbes. With that assumption, we aggregate the genome coverage of all samples in the dataset to create "artificially deep" shotgun sequencing. Note that only unique coverage is counted, multiple reads covering the same location of the genome will not increase the coverage. At this deeper sequencing, true positives will have higher genome coverage and will be easier to distinguish from false positives, which will still have low genome coverage.


Usage

This repository contains 2 scripts, the first, calculate_coverages.py, calculates the genome coverage from .sam alignments. The second script, filter_sam.py, filters .sam alignments based on a coverage threshhold that can be determined by the user based on the output from calculate_coverages.py.

Note: calculate_coverages.py can be used alone to look at the coverage of each genome in the dataset. You can use this coverage information to filter your abundance table.

calculate_coverages.py

Calculates the aggregated unique coverage of each target genome from all the input alignments together. It is designed to work with alignments generated by bowtie2 or SHOGUN (using the bowtie2 alignment option). The input is a directory of .sam alignment files. Currently only works with the Web of Life database (https://biocore.github.io/wol/). The Web of Life metadata file is also required and provided in this repository at databases/WoL/metadata.tsv.

Usage: calculate_coverages.py [OPTIONS]

Options:
  -i, --input TEXT     Input: Directory of sam files (files must end in .sam).
                       [required]
  -o, --output TEXT    Output: file name for list of coverages.  [required]
  -d, --database TEXT  WoL genome metadata file.  [default:
                       databases/WoL/metadata.tsv]
  --help               Show this message and exit.

The output .tsv contains information about the coverage of each genome in the dataset. This alone can be used to decide which genomes to filter out of your abundance table, without using filter_sam.py. However, this is not recommended as it will decrease the apparent number of reads rather than redistributing these to the other hits when using WoLTKA.

Instead, we recommend that you remove the false positive hits from your alignments, using the filter_sam.py script and then provide these alignments to WoLTKA to regenerate your abundance table.


filter_sam.py

Filters alignments from .sam files based on a user-set coverage threshhold, creating new .sam files. Takes in the output .tsv from calculate_coverages.py, the directory of .sam alignments, a coverage threshhold, and an output folder. Outputs one filtered .sam alignment for each input alignment. The coverage threshhold is based on the coverage_rato column of the input .tsv, which is the portion of the genome with unique coverage in the dataset.

Usage: filter_sam.py [OPTIONS]

Options:
  -i, --input TEXT    Input coverage .tsv file generated from
                      calculate_coverages.py.  [required]
  -s, --sam TEXT      Sam file to filter or directory of sam files to filter.
                      [required]
  -c, --cutoff FLOAT  Minimum % genome coverage.  [default: 0.1]
  -o, --output TEXT   Output directory to write output files. Will be created
                      if does not exist.  [required]
  --help              Show this message and exit.

Example

Example you can run using the test_data provided in this repo (test_data/sample_alignments). The test data is 5 .sam alignment files with 100 alignments in each file.

First run the first script: calcualte_coverages.py

$ python calculate_coverages.py -i test_data/ -o test_data/sample_output.tsv -d databases/WoL/metadata.tsv

View output

$head test_data/sample_output.tsv
gotu	covered_length	total_length	coverage_ratio	strain
G001578205	20963	3878815	0.005404485648323006	Bacillus pumilus SH-B9
G001017485	3366	3795691	0.0008867950526004356	Bacillus pumilus 15.1
G001653905	3292	3670170	0.0008969611761852994	Bacillus safensis S9
G900119345	3218	3706009	0.0008683195318737758	Bacillus sp. ru9509.4 RU9509.4
G000986655	3152	3634690	0.0008671991283988456	Bacillus sp. L_1B0_12 ASM98665v1
G000701305	2848	3718149	0.0007659725309555911	Bacillus sp. UNC125MFCrub1.1 ASM70130v1
G001457015	2775	3615064	0.0007676212647964185	Bacillus cellulasensis NIO-1130
G900094985	2773	3615064	0.0007670680242452139	Bacillus sp. nio-1130 NIO-1130
G000715205	2344	3736192	0.0006273767515159821	Bacillus zhangzhouensis DW5-4

The columns in the output tsv:

  • gotu: genome id of hit organism
  • covered_length: Length of total unique genome coverage in base pairs from all samples combined
  • total_length: Total length of genome in base pairs
  • coverage_ratio: Portion of hit genome covered. Calculated by covered_length / total_length

Since this is a tiny dataset, so coverages are extremely low. Looking at the data we choose 0.0008 as our coverage_ratio cutoff.

We run the second script, filter_sam.py to filter all GOTUs with a coverage_ratio less than 0.008 from our .sam alignments. This will create a new folder of .sam alignments with only hits from genomes with over 0.008 coverage_ratio.

$ python filter_sam.py -i test_data/sample_output.tsv -s test_data/sample_alignments/ -c .0008 -o test_data/filtered_sample_alignments

If we look in the folder that was created, test_data/filtered_sample_alignments, we see that new filtered alignments have been created.

$ ls test_data/filtered_sample_alignments
alignment1_filtered.sam
alignment2_filtered.sam
alignment3_filtered.sam
alignment4_filtered.sam
alignment5_filtered.sam

These alignments can be used to create a biom table using Woltka (https://github.com/qiyunzhu/woltka).

About

Filtering out false taxonomic hits from shotgun sequencing based on genome coverage

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •