Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Mmseqs MSA with species header against Uniprot similar to Jackhmmer output #934

Open
abhinavb22 opened this issue Jan 17, 2025 · 0 comments

Comments

@abhinavb22
Copy link

abhinavb22 commented Jan 17, 2025

I want to use Mmseqs to generate MSAs against uniprot and use that with the default alphafold2-multimer pipeline. However, when I run the following script, the header in the sto file is just the uniprot ID without species info. The default alphafold2 with jackhmmer gives sto with header that is parsed by the af2 pipeline to pair sequences for multimer. How can I accomplish the same behavior with mmseqs??
Thank you.

Jackhmmer sample output in sto format:
#=GF ID chain_B-i1

#=GS tr|A0A2J8M6R0|A0A2J8M6R0_PANTR/1-79 DE [subseq from] COA6 isoform 2 OS=Pan troglodytes OX=9598 GN=CK820_G0023150 PE=4 SV=1
#=GS tr|A0A2J8UE17|A0A2J8UE17_PONAB/1-79 DE [subseq from] COA6 isoform 2 OS=Pongo abelii OX=9601 GN=CR201_G0028559 PE=4 SV=1
#=GS sp|Q5JTJ3|COA6_HUMAN/47-125 DE [subseq from] Cytochrome c oxidase assembly factor 6 homolog OS=Homo sapiens OX=9606 GN=COA6 PE=1 SV=1
#=GS tr|A0A2I3SJ14|A0A2I3SJ14_PANTR/47-125 DE [subseq from] Cytochrome c oxidase assembly factor 6 OS=Pan troglodytes OX=9598 GN=COA6 PE=4 SV=1
#=GS tr|A0A6D2WTQ0|A0A6D2WTQ0_PANTR/47-125 DE [subseq from] COA6 isoform 3 OS=Pan troglodytes OX=9598 GN=CK820_G0023150 PE=4 SV=1

Mmseqs sample output in sto format:
Q5JTJ3 MAAPSMKERQVCWGARDEYWKCLDENLEDASQCKKLRSSFESSCPQQWIKYFDKRRDYLKFKEKFEAGQFEPSETTAKS
A0A2J8M6R0 MAAPSMKERQVCWGARDEYWKCLDENLEDASQCKKLRSSFESSCPQQWIKYFDKRRDYLKFKEKFEAGQFEPSETTAKS
A0A2I3SJ14 MAAPSMKERQVCWGARDEYWKCLDENLEDASQCKKLRSSFESSCPQQWIKYFDKRRDYLKFKEKFEAGQFEPSETTAKS
A0A6D2WTQ0 MAAPSMKERQVCWGARDEYWKCLDENLEDASQCKKLRSSFESSCPQQWIKYFDKRRDYLKFKEKFEAGQFEPSETTAKS

Script I ran for Mmseqs:

`MMSEQS="mmseqs" # Path to MMseqs2 binary
QUERY=<path_to_query> # Query file
BASE="./output" # Output directory
UNIPROT_DB=<path_to_uniprot> # UniProt database
EXPAND_EVAL=inf
ALIGN_EVAL=10
QSC=-20.0
MAX_ACCEPT=1000000

SEARCH_PARAM="--num-iterations 3 --db-load-mode 2 -a --k-score 'seq:96,prof:80' -e 0.1 --max-seqs 10000"
FILTER_PARAM="--filter-min-enable 1000 --diff 3000 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95"

mkdir -p "${BASE}"

"${MMSEQS}" createdb "${QUERY}" "${BASE}/qdb" --dbtype 1

"${MMSEQS}" search "${BASE}/qdb" "${UNIPROT_DB}" "${BASE}/res" "${BASE}/tmp" $SEARCH_PARAM

"${MMSEQS}" align "${BASE}/qdb" "${UNIPROT_DB}" "${BASE}/res" "${BASE}/res_aln" --db-load-mode 2 -e ${ALIGN_EVAL} --max-accept ${MAX_ACCEPT} --alt-ali 10 -a

"${MMSEQS}" filterresult "${BASE}/qdb" "${UNIPROT_DB}" "${BASE}/res_aln" "${BASE}/res_filtered" --db-load-mode 2 --qid 0 --qsc $QSC --diff 0 --max-seq-id 1.0 --filter-min-enable 100

"${MMSEQS}" result2msa "${BASE}/qdb" "${UNIPROT_DB}" "${BASE}/res_filtered" "${BASE}/uniprot.sto" --msa-format-mode 4 --db-load-mode 2`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant