Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add uchime2_denovo to close #92 #100

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from

Conversation

colinbrislawn
Copy link
Contributor

@colinbrislawn colinbrislawn commented Sep 29, 2024

WIP to close #92
Add internal functions and tests to use 'uchime' 'uchime2' or 'uchime3'
Open questions:

  1. Externally, do we want to present all these as new methods/functions or new settings?
qiime vsearch uchime-denovo
qiime vsearch uchime2-denovo
qiime vsearch uchime3-denovo
# or
qiime vsearch uchime-denovo --p-method 'uchime2'
  1. Internally, do we want to combine some of these? They are very similar, especially uchime2 and 3

The only difference [in --uchime3_denovo] from --uchime2_denovo is that the default minimum abundance
skew (--abskew) is set to 16.0 rather than 2.0.

  1. Do we want to expose --abskew for some or all of these?

@hagenjp
Copy link
Contributor

hagenjp commented Oct 3, 2024

Hi @colinbrislawn,
1./2. We all agree that new parameters would be best vs. new methods. Thank you!
3. We do not have strong feelings about --abskew but making sure that the default is none (would be best so that it can match the algorithm by default)

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Oct 3, 2024

new parameters would be best

Cool!

How does this look for the CLI? (CLI docs for the existing function)

qiime vsearch uchime-denovo \
  --p-method 'uchime2' \
  --p-mindiffs 99 # ignored when running uchime2 and uchime3
  --p-mindiv 0.8 # ignored when running uchime2 and uchime3
  --p-minh 0.99 # ignored when running uchime2 and uchime3
  ...

What should we do if someone passes settings that are not used by uchime2 and uchime3?
vsearch simply ignores them silently, which I don't love for our API

@colinbrislawn
Copy link
Contributor Author

Let's add support for --abskew in a different PR, to keep things tidy ✨ 🧹

@colinbrislawn colinbrislawn marked this pull request as ready for review October 3, 2024 21:18
@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Oct 3, 2024

Is --p-method a good name? I could go back to --p-algorithm or something new.

The methods are tested with 100% coverage, but the results are not, as I'm not sure how they should work with non-trivial examples.

If we want to build a test that shows the difference between these methods, here's commentary on vsearch's implementation:
torognes/vsearch#283
torognes/vsearch#503

@nbokulich, if you have the time and interest, I would appreciate your review!

q2_vsearch/_chimera.py Show resolved Hide resolved
q2_vsearch/plugin_setup.py Outdated Show resolved Hide resolved
q2_vsearch/plugin_setup.py Outdated Show resolved Hide resolved
q2_vsearch/plugin_setup.py Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on adding tests for these new algorithm versions (beyond testing the command string)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a trivial test that shows both working.

I don't have an example in which these methods differ.... Would you like me to try and find one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that there is a test to which we pass "uchime3" as the method; however we technically can't be sure that this method is being implemented by the underlying software without differentiating behavior.

I understand if it's too difficult to contrive input data that shows different expected behavior for the different algorithm methods, but if it is reasonably easy to do so it would be best.

@@ -404,12 +406,17 @@
'abundances).'),
},
parameter_descriptions={
'method': ('Denovo chimera detection based on uchime (Edgar 2011), '
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colinvwood While trying to keep this short and sweet, I've added a little more detail.

How does this look?

I think we should cite Rob's papers too. Let me work on adding those two papers to the .bib file...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good 👍🏻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want me to add these citations to the .bib and link them up, or are we good to go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're referring to each of the citations for each of the algorithm iterations, I think that yes we should do that. If you're referring to "Rob's papers" I'm unsure which ones those are exactly.

@colinvwood
Copy link
Contributor

Hey @colinbrislawn, not sure if you're waiting on any more input from me but everything looks good to me except for the outstanding tests discussion.

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Dec 19, 2024

I'm working to build some good tests, starting from PR2 from the test data.

Note: if I knew these algorithms better, I may be able to make mock tests from first principles. But I don't!

I may need to put this on hold while I work on other projects.

# shorten names
vsearch --fastx_uniques PR2-18S-rRNA-V4.derep.fsa --sizein --sizeout --relabel derep_ --fastaout PR2_short.fsa

# run all 3 methods
vsearch --uchime_denovo  PR2_short.fsa --chimeras chi_v1.fasta --uchimeout nonchi_v1.uc &
vsearch --uchime2_denovo PR2_short.fsa --chimeras chi_v2.fasta --uchimeout nonchi_v2.uc &
vsearch --uchime3_denovo PR2_short.fsa --chimeras chi_v3.fasta --uchimeout nonchi_v3.uc &

# inspect for differences
git diff --no-index --word-diff -U0 nonchi_v1.uc nonchi_v2.uc
git diff --no-index --word-diff -U0 nonchi_v1.uc nonchi_v3.uc
git diff --no-index --word-diff -U0 nonchi_v2.uc nonchi_v3.uc

grep 'derep_33;size=86\s' nonchi_v1.uc | head -n 1
grep 'derep_33;size=86\s' nonchi_v2.uc | head -n 1
grep 'derep_33;size=86\s' nonchi_v3.uc | head -n 1

Possible reads from PR2 to use. Note the short names!

query parent1 parent2 uchime1 uchim2 uchime3
derep_33;size=86 derep_2;size=485 derep_5;size=315 N Y N
derep_34;size=85 derep_2;size=485 derep_5;size=315 N Y N

Testing uchime2 vs uchime3 should be easy.

The only difference from --uchime2_denovo is that the default minimum abundance skew (--abskew) is set to 16.0 rather than 2.0.

This higher default --abskew threshold should lead to fewer called chimeras, which is what I see!

@colinbrislawn
Copy link
Contributor Author

Would this work as a minimal test?

>uchime
0.0239  derep_33;size=86        derep_2;size=485        derep_5;size=315        derep_5;size=315        100.0   98.1    99.4    97.5    99.4    1       00       3       0       0       0.6     N
>uchime2
0.0239  derep_33;size=86        derep_2;size=485        derep_5;size=315        derep_5;size=315        100.0   98.1    99.4    97.5    99.4    1       00       3       0       0       0.6     Y
>ucmime3
0.0000  derep_33;size=86        *       *       *       *       *       *       *       *       0       0       0       0       0       0       *       N

1 and 2 report the same score, but different final decision
and 3 reports nothing

Let me check if this works with only three reads!

@colinbrislawn
Copy link
Contributor Author

vsearch --fastx_uniques PR2-18S-rRNA-V4.derep.fsa --sizein --sizeout --relabel derep_ --fastaout PR2_short.fsa
vsearch --uchime_denovo  PR2_short.fsa --uchimeout nonchi_v1.uc &
vsearch --uchime2_denovo PR2_short.fsa --uchimeout nonchi_v2.uc &
vsearch --uchime3_denovo PR2_short.fsa --uchimeout nonchi_v3.uc &

git diff --no-index --word-diff -U0 nonchi_v1.uc nonchi_v2.uc
git diff --no-index --word-diff -U0 nonchi_v1.uc nonchi_v3.uc
git diff --no-index --word-diff -U0 nonchi_v2.uc nonchi_v3.uc

grep 'derep_16\;' nonchi_v1.uc | head -n 1
grep 'derep_16\;' nonchi_v2.uc | head -n 1
grep 'derep_16\;' nonchi_v3.uc | head -n 1

# Pull derep_2, derep_5, and derep_16 into a new file. Then:
vsearch --uchime_denovo  PR2_pull.fsa --quiet --uchimeout - | grep 'derep_16'
vsearch --uchime2_denovo  PR2_pull.fsa --quiet --uchimeout - | grep 'derep_16'
vsearch --uchime3_denovo  PR2_pull.fsa --quiet --uchimeout - | grep 'derep_16'

@colinbrislawn
Copy link
Contributor Author

colinbrislawn commented Jan 17, 2025

Here's a toy fasta file, which I've named PR2_pull.fsa

>derep_2;size=485
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGCACTGGTCCGGCCGGGCCTTTCCCTCTGTGGAACCCCATACCCTTCACTGGGCGTGGCGGGGAAACAGGACA
TTTACTTTGAAAAAATTAGAGTGCTCCAGGCAGGCCTATGCTCGAATACATTAGCATGGAATAATAAAATAGGACGCGCG
GTTCTATTTTGTTGGTTTATAGGACCGCCGTAATGATTAATAGGGACAGTCGGGGGCATCAGTATTCAACTGTCAGAGGT
GAAATTCTTGGATCAGTTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCA
>derep_5;size=315
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGTACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
TTCA
>derep_16;size=146
AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTC
ACCGCGTGCACTGGTCCGGCCGGTGAAATTCTTGGATTTATTGAAGACTAACTACTGCGAAAGCATTTGCCAAGGATGTT
TTCA

Running this file with uchime*_denovo variants produces different results.

vsearch --uchime_denovo  PR2_pull.fsa --quiet --uchimeout -
vsearch --uchime2_denovo  PR2_pull.fsa --quiet --uchimeout -
vsearch --uchime3_denovo  PR2_pull.fsa --quiet --uchimeout -
program key difference
uchime first column score 0.0239, final vote N
uchime2_denovo first column score 0.0239, final vote Y
uchime3_denovo first column score *, final vote N
vsearch --uchime_denovo  PR2_pull.fsa --quiet --uchimeout -
0.0000  derep_2;size=485        *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N
0.0000  derep_5;size=315        *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N
0.0239  derep_16;size=146       derep_2;size=485        derep_5;size=315        derep_5;size=315        100.0   98.1    99.4    97.5 99.4     1       0       0       3       0       0       0.6     N

vsearch --uchime2_denovo  PR2_pull.fsa --quiet --uchimeout -
0.0000  derep_2;size=485        *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N
0.0000  derep_5;size=315        *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N
0.0239  derep_16;size=146       derep_2;size=485        derep_5;size=315        derep_5;size=315        100.0   98.1    99.4    97.5 99.4     1       0       0       3       0       0       0.6     Y

vsearch --uchime3_denovo  PR2_pull.fsa --quiet --uchimeout -
0.0000  derep_2;size=485        *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N
0.0000  derep_5;size=315        *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N
0.0000  derep_16;size=146       *       *       *       *       *       *       *       *       0       0       0       0       0    0*       N

I've taken a look at the testing framework, but I can't make head or tails of it.
It looks like the input fasta files are built from features tables (??)

@colinvwood, if you are willing to build that tests for this, I would very much appreciate it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

ENH: add support for the different versions of uchime_denovo within vsearch
3 participants