Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why Pairwise alignment (--allpairs_global) only support positive strand? #576

Open
billzt opened this issue Oct 15, 2024 · 4 comments
Open

Comments

@billzt
Copy link

billzt commented Oct 15, 2024

I have a set of sequences that are definitely homologous. However some of them are in negative strand and I don't know which of them are. Therefore I cannot use --allpairs_global directly since sequences are aligned on their plus strand only. Any other good suggestions?

@torognes
Copy link
Owner

Perhaps the --orient command may be useful to orient all sequences in the same direction before aligning them with the --allpairs_global command?

If all the sequences are fairly similar, I don't think you'll need many sequences in the database for the orient command to work fine.

@billzt
Copy link
Author

billzt commented Oct 15, 2024

Thank you. I will try --orient. But I still hope --allpairs_global could support --strand both in the future, the same as that in --cluster_fast.

@torognes
Copy link
Owner

We'll consider adding --strand both in the future.

@frederic-mahe
Copy link
Collaborator

hello @billzt

here is a possible workaround, using a tiny dataset for demonstration purpose:

>s1
AAAA
>s2
AAAT
>s3
TTTT
  • s1 has 75% similarity with s2,
  • s1 has 100% similarity with s3 (if s3 is reverse-complemented)
s1 AAAA
   |||
s2 AAAT

s1 AAAA
   ||||
s3 AAAA (reverse-complement)
  • use --fastx_revcomp to reverse-complement the dataset,
  • concatenate both normal and reverse-complement datasets,
  • use --allpairs_global to find matching pairs,
  • (optional: filter out the results)
FASTA_FILE=$(mktemp)
printf ">s1\nAAAA\n>s2\nAAAT\n>s3\nTTTT\n" > "${FASTA_FILE}"

(
    cat "${FASTA_FILE}"
    vsearch \
        --fastx_revcomp "${FASTA_FILE}" \
        --quiet \
        --label_suffix "_rv" \
        --fastaout -
) | \
    vsearch \
        --allpairs_global - \
        --id 0.75 \
        --iddef 1 \
        --quiet \
        --blast6out -

rm "${FASTA_FILE}"

We obtain the expected results, equivalent to a search on both strands:

s1	s3_rv	100.0	4	0	0	1	4	1	4	-1	0
s1	s2	75.0	4	1	0	1	4	1	4	-1	0
s3	s1_rv	100.0	4	0	0	1	4	1	4	-1	0

Warning: doubling the size of a dataset quadruples computation time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants