Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many "alignment score too low, or score drop too high" during pair-end merge compared with FLASH #526

Closed
billzt opened this issue Jun 21, 2023 · 5 comments

Comments

@billzt
Copy link

billzt commented Jun 21, 2023

Testing data: https://www.ncbi.nlm.nih.gov/sra/?term=DRR126155 6714 pairs

FLASH (v1.2.7), using all default parameter, i.e.

  -m, --min-overlap=NUM  Default: 10
 -x, --max-mismatch-density=NUM Default: 0.25.

The result is:

# [FLASH] Processed 6714 read pairs
# [FLASH]  
# [FLASH] Read combination statistics:
# [FLASH]     Total reads:      6714
# [FLASH]     Combined reads:   6331
# [FLASH]     Uncombined reads: 383
# [FLASH]     Percent combined: 94.30%

The merging rate is very high.

vsearch (v2.22.1_linux_x86_64), command is

vsearch --fastq_mergepairs DRR126155.1.fq.gz -reverse DRR126155.2.fq.gz \
--fastq_allowmergestagger --fastaout DRR126155.merge.vsearch.fa --fastq_maxdiffpct 25 \
--fastq_maxdiffs 100000

That is, making maximum percentage diff. bases in overlap = 25%, equal with FLASH

The result is:

Merging reads 100% 
      6714  Pairs
      5696  Merged (84.8%)
      1018  Not merged (15.2%)

Pairs that failed merging due to various reasons:
        83  too few kmers found on same diagonal
         3  too high percentage of differences
       932  alignment score too low, or score drop too high

Statistics of all reads:
    148.27  Mean read length

Statistics of merged reads:
    225.33  Mean fragment length
     25.95  Standard deviation of fragment length
      0.81  Mean expected error in forward sequences
      0.55  Mean expected error in reverse sequences
      0.61  Mean expected error in merged sequences
      3.50  Mean observed errors in merged region of forward sequences
      2.63  Mean observed errors in merged region of reverse sequences
      6.13  Mean observed errors in merged region

The percentage of not merging is significantly higher than FLASH, due to many reads regarded as "alignment score too low, or score drop too high"

How can I adjust the parameters to get similar results as FLASH?

@frederic-mahe
Copy link
Collaborator

How can I adjust the parameters to get similar results as FLASH?

vsearch's merging algorithm is more conservative than flash's by design. In vsearch, there are three options you can toggle to relax some merging parameters:

  • fastq_minovlen: specify the minimum overlap between the merged reads. The default is 10. Must be at least 5.
  • fastq_maxdiffpct: specify the maximum percentage of non-matching nucleotides allowed in the overlap region. The default value is 100.0%.
  • fastq_maxdiffs: specify the maximum number of non-matching nucleotides allowed in the overlap region. That option has a strong influence on the merging success rate. The default value is 10.

There are other more sophisticated rules in the merging algorithm that will discard read pairs with a high fraction of mismatches, but these rules are not controlled by user-facing variables. So, it is currently not possible for end-users to adjust the vsearch's merging algorithm parameters to get similar results as flash.

@frederic-mahe
Copy link
Collaborator

@billzt using your 12S data sample I tried to show the difference between the two merging approaches, and the benefit-risk of each. To do so, I use taxonomic assignment as our reference metric, with the idea that amplicons with higher similarity percentages with reference sequences are less likely to be false-positives.

To summarize, flash may produce more false-positives, while vsearch may produce more false-negatives. Choosing one method depends on the scientific question at hand (for instance, is assessing precise alpha-diversity values important?)

The testing dataset initially contains 6,714 pairs. Of these, 5,050 are merged in exactly the same way by both methods (75.2% overlap). vsearch merges 8 additional pairs, but these amplicons end-up not assigned (so possible false-positives). flash merges 1,281 additional pairs, most of them with a similarity of 95% or more with reference sequences. However, the similarity distribution of these 1,281 flash-specific amplicons is not the same as the similarity distribution of the 5,050 amplicons common to both methods. The cumulative distribution function of flash-specific amplicons is greater than the cumulative distribution function of common amplicons, with a p-value < 2.2e-16 (conditions of the Kolmogorov-Smirnov test are violated, but I am not sure what other test would work here). My interpretation is that the 1,281 flash-specific amplicons are not a random subset of the general amplicon pool, but a set skewed toward (relatively) lower similarity percentages with reference sequences, indicating a higher false-positive rate or some erroneous merging.

See code below for future reference:

cd ./issue_526_20230704/

mkdir -p data results

## versions
flash --version    # FLASH v1.2.11
vsearch --version  # vsearch v2.22.1


## raw data
(cd ./data/
 wget \
     ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_1.fastq.gz \
     ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_2.fastq.gz
)


## run flash
(cd ./data/
 flash \
     --output-directory=../results/ \
     DRR126155_1.fastq.gz \
     DRR126155_2.fastq.gz 2>&1 | \
     tee ../results/flash.log
)

(cd ./results/
 vsearch \
     --fastq_filter "out.extendedFrags.fastq" \
     --fastq_ascii 33 \
     --fasta_width 0 \
     --fastaout "DRR126155_flash.fasta"

 rm out.*
)


## run vsearch
(cd ./data/
 vsearch \
     --fastq_mergepairs DRR126155_1.fastq.gz \
     --reverse DRR126155_2.fastq.gz \
     --fastq_allowmergestagger \
     --quiet \
     --fasta_width 0 \
     --fastaout ../results/DRR126155_vsearch.fasta 2>&1 | \
     tee ../results/vsearch.log
)

## dereplication (intra)
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 (( $(grep -c "^>" "${FASTA_V}") == $(grep -v "^>" "${FASTA_V}" | sort -u | wc -l) )) && \
     echo "no duplicates in vsearch results"
 (( $(grep -c "^>" "${FASTA_F}") == $(grep -v "^>" "${FASTA_F}" | sort -u | wc -l) )) && \
     echo "no duplicates in flash results"
)


## common (n = 5050) most entries are common!
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 comm -12 \
      <(grep -v "^>" "${FASTA_F}" | sort) \
      <(grep -v "^>" "${FASTA_V}" | sort) | \
     grep \
         --no-group-separator \
         -F -f - \
         -B 1 "${FASTA_V}" > "DRR126155_common.fasta"
)


## entries specific to vsearch (n = 8)
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 comm -13 \
      <(grep -v "^>" "${FASTA_F}" | sort) \
      <(grep -v "^>" "${FASTA_V}" | sort) | \
     grep \
         --no-group-separator \
         -F -f - \
         -B 1 "${FASTA_V}" > "DRR126155_vsearch_exclusive.fasta"
)


## entries specific to flash (n = 1281)
(cd ./results/
 FASTA_V="DRR126155_vsearch.fasta"
 FASTA_F="DRR126155_flash.fasta"
 comm -23 \
      <(grep -v "^>" "${FASTA_F}" | sort) \
      <(grep -v "^>" "${FASTA_V}" | sort) | \
     grep \
         --no-group-separator \
          -F -f - -B 1 "${FASTA_F}" > "DRR126155_flash_exclusive.fasta"
)


## taxonomic assignments with a 12S reference database from:
## https://github.com/zjgold/FishCARD
function taxonomic_assignment() {
    local -ir THREADS=4
    local -ir IDDEF=2
    local -r IDENTITY="0.5"
    local -ir MAXREJECTS=32
    local -r SUBJECTS="../data/GenBank_.fasta"
    local ASSIGNMENTS="${1/\.fasta/.hits}"

    vsearch \
        --usearch_global "${1}" \
        --threads "${THREADS}" \
        --dbmask none \
        --qmask none \
        --rowlen 0 \
        --notrunclabels \
        --userfields query+id${IDDEF}+target \
        --maxaccepts 0 \
        --maxrejects "${MAXREJECTS}" \
        --top_hits_only \
        --output_no_hits \
        --db "${SUBJECTS}" \
        --id "${IDENTITY}" \
        --iddef "${IDDEF}" \
        --userout "${ASSIGNMENTS}"
}

(cd ./results/
 taxonomic_assignment "DRR126155_common.fasta"
 taxonomic_assignment "DRR126155_vsearch_exclusive.fasta"
 taxonomic_assignment "DRR126155_flash_exclusive.fasta"
)
library(tidyverse)

setwd("issue_526_20230704/results/")

colnames <- c("amplicon", "similarity", "reference")
sources <- c("common", "flash_exclusive", "vsearch_exclusive")

sources %>%
    purrr::map_chr(~ str_c("DRR126155_", .x, ".hits")) %>%
    purrr::map_dfr(~ read_tsv(.x,
                              col_names = colnames,
                              show_col_types = FALSE),
                   .id = "source") %>%
    select(-reference) %>%
    distinct() -> d

d %>%  
    ggplot(aes(x = source, y = similarity)) +
    geom_violin(scale = "count") +
    coord_cartesian(ylim = c(95, 100))


ks.test(d %>% filter(source == 2) %>% pull(similarity),
        d %>% filter(source == 1) %>% pull(similarity),
        alternative = "two.sided",
        exact = NULL)

ks.test(d %>% filter(source == 2) %>% pull(similarity),
        d %>% filter(source == 1) %>% pull(similarity),
        alternative = "great",
        exact = NULL)

## D^+ = 0.35122, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies above that of y

@billzt
Copy link
Author

billzt commented Jul 5, 2023

Thank you

@billzt billzt closed this as completed Jul 5, 2023
@torognes
Copy link
Owner

torognes commented Jul 5, 2023

Thanks for explaining and testing, @frederic-mahe !

@torognes
Copy link
Owner

torognes commented Jul 7, 2023

See also issues #524 and #282.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants