Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to map all short peptides (7–30 aa) to reference proteome #929

Open
PominovaMS opened this issue Jan 15, 2025 · 0 comments
Open

Comments

@PominovaMS
Copy link

Expected Behavior

I’m trying to map a query set of short peptides (7–30 residues) to a reference proteome using MMseqs2. These peptides are known to be present in the reference proteome, so I expect to get a 100% match rate.

Which parameters or strategy would you recommend to ensure that all peptides are found?
Are there any additional parameters, scoring matrices, or special workflows recommended for short peptides?

Current Behavior

From 30% to 2% of exact matches are missed.

Steps to Reproduce (for bugs)

I run mmseqs map with the following commands:

mmseqs createdb ./mmseqs2_tmp/proteome.fasta ./mmseqs2_tmp/targetDB/targetDB
mmseqs createdb ./mmseqs2_tmp/query_peptides.fasta ./mmseqs2_tmp/queryDB/queryDB

mmseqs map ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp -a -e inf --cov-mode 1
mmseqs convertalis ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/search_result.m8 --format-output query,target,qaln,taln,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits

An existing issue #391 seems to discuss a similar problem and I tried to apply the search parameters suggested there.
Eventually, I was able to reduce the number of missing peptides from 600 (with default map parameters) to ~50 with the following parameters:
mmseqs map ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp -a -e inf --cov-mode 1 --seed-sub-mat VTML40.out --comp-bias-corr 0 --mask 0 --spaced-kmer-mode 0 -k 5

MMseqs Output (for bugs)

map ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp -a -e inf --cov-mode 1 --seed-sub-mat VTML40.out --comp-bias-corr 0 --mask 0 --spaced-kmer-mode 0 -k 5

MMseqs Version: 747c64c
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix aa:VTML40.out,nucl:VTML40.out
Sensitivity 2
k-mer length 5
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max sequence length 65535
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.95
Coverage mode 1
Compositional bias 0
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Include identical seq. id. false
Spaced k-mers 0
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads 56
Compressed 0
Verbosity 3
Rescore mode 2
Allow wrapped scoring false
Remove hits by seq. id. and coverage false
E-value threshold inf
Add backtrace true
Seq. id. threshold 0.9
Min alignment length 0
Seq. id. mode 0
Sort results 1
Min codons in orf 10
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Start sensitivity 4
Search steps 1
MPI runner
Force restart with latest tmp false
Remove temporary files false

search ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp/2718880745757471168 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML40.out,nucl:VTML40.out' -s 2 -k 5 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.95 --cov-mode 1 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 56 --compressed 0 -v 3 --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e inf -a 1 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --sort-results 1 --min-length 10 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --start-sens 4 --sens-steps 1 --force-reuse 0 --remove-tmp-files 0 --alignment-mode 4

prefilter ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/tmp/2718880745757471168/12632981311954453740/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML40.out,nucl:VTML40.out' -k 5 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.95 --cov-mode 1 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 56 --compressed 0 -v 3 -s 2.0

Query database size: 82609 type: Aminoacid
Estimated memory consumption: 520M
Target database size: 2231 type: Aminoacid
Index table k-mer threshold: 135 at k-mer size 5
Index table: counting k-mers
[=================================================================] 2.23K 0s 2ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 2.23K 0s 2ms
Index statistics
Entries: 20288
DB size: 24 MB
Avg k-mer size: 0.006340
Top 10 k-mers
PGSPF 12
PSLDL 10
TPNSL 10
LEEEQ 10
QPASF 9
VSLNG 9
LNGAK 9
NQPAS 9
SFAVS 9
ELQQL 8
Time for index table init: 0h 0m 0s 33ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 82609
Target db start 1 to 2231
[=================================================================] 82.61K 0s 772ms

2.264198 k-mers per position
13 DB matches per sequence
0 overflows
0 sequences passed prefiltering per query sequence
0 median result list length
55964 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 1s 592ms
Time for processing: 0h 0m 5s 684ms
rescorediagonal ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/tmp/2718880745757471168/12632981311954453740/pref_0 ./mmseqs2_tmp/results/search_result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e inf -c 0.95 -a 1 --cov-mode 1 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 1 --db-load-mode 0 --threads 56 --compressed 0 -v 3

[=================================================================] 82.61K 0s 542ms
Time for merging to search_result: 0h 0m 2s 242ms
Time for processing: 0h 0m 4s 598ms

Context

I attach the query_peptides.fasta and proteome.fasta databases if needed. In my task, I also need the matching process to consider leucine and isoleucine indistinguishable, so I replace all I’ with L’ in both the query and the target databases.
proteome.fasta.txt.zip
query_peptides.fasta.txt

In practice, since query_peptides.fasta is a small database, I run MMseqs search with proteome.fasta as query DB and query_peptides.fasta as target DB. As discussed in #826, when doing in the reversed order (query_peptides.fasta as query, proteome.fasta as target), it leads to a "prefilter died" error.
Can this also impact the number of matches found?

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • MMseqs2 Version: 747c64c
  • Operating system and version: Rocky Linux release 8.10 (Green Obsidian)
  • Server specifications:
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    RAM: 192 GB
    Byte Order: Little Endian
    CPU(s): 64
    On-line CPU(s) list: 0-63
    Thread(s) per core: 1
    Core(s) per socket: 32
    Socket(s): 2
    NUMA node(s): 8
    Vendor ID: AuthenticAMD
    CPU family: 23
    Model: 49
    Model name: AMD EPYC 7452 32-Core Processor
    Stepping: 0
    CPU MHz: 2350.000
    CPU max MHz: 2350.0000
    CPU min MHz: 1500.0000
    BogoMIPS: 4700.12
    Virtualization: AMD-V
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 512K
    L3 cache: 16384K
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant