Unable to map all short peptides (7–30 aa) to reference proteome #929

PominovaMS · 2025-01-15T17:56:01Z

Expected Behavior

I’m trying to map a query set of short peptides (7–30 residues) to a reference proteome using MMseqs2. These peptides are known to be present in the reference proteome, so I expect to get a 100% match rate.

Which parameters or strategy would you recommend to ensure that all peptides are found?
Are there any additional parameters, scoring matrices, or special workflows recommended for short peptides?

Current Behavior

From 30% to 2% of exact matches are missed.

Steps to Reproduce (for bugs)

I run mmseqs map with the following commands:

mmseqs createdb ./mmseqs2_tmp/proteome.fasta ./mmseqs2_tmp/targetDB/targetDB
mmseqs createdb ./mmseqs2_tmp/query_peptides.fasta ./mmseqs2_tmp/queryDB/queryDB

mmseqs map ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp -a -e inf --cov-mode 1
mmseqs convertalis ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/search_result.m8 --format-output query,target,qaln,taln,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits

An existing issue #391 seems to discuss a similar problem and I tried to apply the search parameters suggested there.
Eventually, I was able to reduce the number of missing peptides from 600 (with default map parameters) to ~50 with the following parameters:
mmseqs map ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp -a -e inf --cov-mode 1 --seed-sub-mat VTML40.out --comp-bias-corr 0 --mask 0 --spaced-kmer-mode 0 -k 5

MMseqs Output (for bugs)

map ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp -a -e inf --cov-mode 1 --seed-sub-mat VTML40.out --comp-bias-corr 0 --mask 0 --spaced-kmer-mode 0 -k 5

MMseqs Version: 747c64c
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix aa:VTML40.out,nucl:VTML40.out
Sensitivity 2
k-mer length 5
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max sequence length 65535
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.95
Coverage mode 1
Compositional bias 0
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Include identical seq. id. false
Spaced k-mers 0
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads 56
Compressed 0
Verbosity 3
Rescore mode 2
Allow wrapped scoring false
Remove hits by seq. id. and coverage false
E-value threshold inf
Add backtrace true
Seq. id. threshold 0.9
Min alignment length 0
Seq. id. mode 0
Sort results 1
Min codons in orf 10
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Start sensitivity 4
Search steps 1
MPI runner
Force restart with latest tmp false
Remove temporary files false

search ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/results/search_result ./mmseqs2_tmp/tmp/2718880745757471168 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML40.out,nucl:VTML40.out' -s 2 -k 5 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.95 --cov-mode 1 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 56 --compressed 0 -v 3 --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e inf -a 1 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --sort-results 1 --min-length 10 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --start-sens 4 --sens-steps 1 --force-reuse 0 --remove-tmp-files 0 --alignment-mode 4

prefilter ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/tmp/2718880745757471168/12632981311954453740/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML40.out,nucl:VTML40.out' -k 5 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.95 --cov-mode 1 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 56 --compressed 0 -v 3 -s 2.0

Query database size: 82609 type: Aminoacid
Estimated memory consumption: 520M
Target database size: 2231 type: Aminoacid
Index table k-mer threshold: 135 at k-mer size 5
Index table: counting k-mers
[=================================================================] 2.23K 0s 2ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 2.23K 0s 2ms
Index statistics
Entries: 20288
DB size: 24 MB
Avg k-mer size: 0.006340
Top 10 k-mers
PGSPF 12
PSLDL 10
TPNSL 10
LEEEQ 10
QPASF 9
VSLNG 9
LNGAK 9
NQPAS 9
SFAVS 9
ELQQL 8
Time for index table init: 0h 0m 0s 33ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 82609
Target db start 1 to 2231
[=================================================================] 82.61K 0s 772ms

2.264198 k-mers per position
13 DB matches per sequence
0 overflows
0 sequences passed prefiltering per query sequence
0 median result list length
55964 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 1s 592ms
Time for processing: 0h 0m 5s 684ms
rescorediagonal ./mmseqs2_tmp/targetDB/targetDB ./mmseqs2_tmp/queryDB/queryDB ./mmseqs2_tmp/tmp/2718880745757471168/12632981311954453740/pref_0 ./mmseqs2_tmp/results/search_result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e inf -c 0.95 -a 1 --cov-mode 1 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 1 --db-load-mode 0 --threads 56 --compressed 0 -v 3

[=================================================================] 82.61K 0s 542ms
Time for merging to search_result: 0h 0m 2s 242ms
Time for processing: 0h 0m 4s 598ms

Context

I attach the query_peptides.fasta and proteome.fasta databases if needed. In my task, I also need the matching process to consider leucine and isoleucine indistinguishable, so I replace all I’ with L’ in both the query and the target databases.
proteome.fasta.txt.zip
query_peptides.fasta.txt

In practice, since query_peptides.fasta is a small database, I run MMseqs search with proteome.fasta as query DB and query_peptides.fasta as target DB. As discussed in #826, when doing in the reversed order (query_peptides.fasta as query, proteome.fasta as target), it leads to a "prefilter died" error.
Can this also impact the number of matches found?

Your Environment

Include as many relevant details about the environment you experienced the bug in.

MMseqs2 Version: 747c64c
Operating system and version: Rocky Linux release 8.10 (Green Obsidian)
Server specifications:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
RAM: 192 GB
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 2350.000
CPU max MHz: 2350.0000
CPU min MHz: 1500.0000
BogoMIPS: 4700.12
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to map all short peptides (7–30 aa) to reference proteome #929

Unable to map all short peptides (7–30 aa) to reference proteome #929

PominovaMS commented Jan 15, 2025

Unable to map all short peptides (7–30 aa) to reference proteome #929

Unable to map all short peptides (7–30 aa) to reference proteome #929

Comments

PominovaMS commented Jan 15, 2025

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

MMseqs Output (for bugs)

Context

Your Environment