Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added testing for disease predicates #151

Merged
merged 12 commits into from
May 10, 2024
Merged

Added testing for disease predicates #151

merged 12 commits into from
May 10, 2024

Conversation

lnrekerle
Copy link
Collaborator

@lnrekerle lnrekerle commented May 1, 2024

closes #137

I added a disease test, and proof that it works in the RPGRIP1 notebook.
A couple questions before I push this tho:

  • Did we want to run all diseases (like we do currently with phenotypes) or have the user give a specific disease (what we're currently doing)?
  • Should I rewrite compare_disease_vs_genotype so that the user gives a function they would normally see rather than having to give a class that's normally hidden? Or is the way it is right now good for now? This could be a rather large rewrite for that file (_gp_impl.py)

Thank you @pnrobinson @ielis

@lnrekerle lnrekerle self-assigned this May 1, 2024
@lnrekerle lnrekerle marked this pull request as draft May 1, 2024 16:40
@lnrekerle lnrekerle requested review from ielis and pnrobinson May 1, 2024 16:40
@ielis
Copy link
Member

ielis commented May 3, 2024

Hi @lnrekerle

thanks for the code! I added a few improvements/fixes to the code.

Regarding the questions.

Did we want to run all diseases (like we do currently with phenotypes) or have the user give a specific disease (what we're currently doing)?

I think we should let the user to provide a set of "specific diseases" - these are the diseases the user is interested to see. Let me call these query diseases. The query diseases act as a filter. If we have a cohort with 5 diseases and the user provides 2 query diseases, at most 2 query diseases will make it into results. If zero or (perhaps) just one of the query diseases is in fact represented in the cohort, the code should probably explode since this is likely a bug. Otherwise, we would either return an empty result or a result with correlation for one disease only, which IMO is not what the user wants to see.

Besides supporting the query diseases, IMO many times the users will want to see all diseases of the cohort. We should make this easy. One way to do this is to write a function like this:

class CohortAnalysis:

    # bla bla bla

   def compare_genotype_by_diseases(
        self,
        genotype_predicate: GenotypePolyPredicate,
        disease_ids: typing.Optional[typing.Sequence[hpotk.TermId]] = None,
   ) -> GenotypePhenotypeAnalysisResult:
        pass

The function takes a genotype poly predicate, which can be (as you know) anything, and an optional sequence of disease IDs formatted as hpotk.TermId. If disease_ids==None, then the user wants to see all diseases. If len(disease_ids) in {0, 1} we should probably explode because of what I wrote above. Otherwise, we use the disease_ids as a filter.


Should I rewrite compare_disease_vs_genotype so that the user gives a function they would normally see rather than having to give a class that's normally hidden? Or is the way it is right now good for now? This could be a rather large rewrite for that file (_gp_impl.py)

This question is a little hard for me to understand, I'll try my best though.
I think we are approaching a point where we will have to write a function like this:

def compare_genotype_by_phenotype(
    self,
    genotype_predicate: GenotypePolyPredicate,
    phenotype_predicate: PhenotypePolyPredicate,
) -> GenotypePhenotypeAnalysisResult:
    pass

This will become a central point of the entire analysis. All convenience functions that we currently have on CohortAnalysis, such as compare_by_protein_feature_type or compare_by_variant_key will actually boil down to compare_genotype_by_phenotype. So, most of the time, the user will not have to deal with a "class that is normally hidden" (e.g. ExonPredicate or PropagatingPhenotypePredicate) but the user will call one of the convenience function.
However, when the convenience function is not there, the user will have to create genotype/phenotype predicate and run the analysis.

This may seem like a huge change, but we actually +- have this in GpCohortAnalysis._apply_poly_predicate. We may need to discuss the requirements and tweak this a bit at some but it should not be too much work.

This sounds like a good subject for the next meeting. Please think about the requirements. Right now we _apply_poly_predicate takes an iterable of phenotype predicates and single genotype poly predicates. It would be good to think about the cardinalities, especially the corner cases of the phenotype part. E.g. what happens with patients diagnosed with multiple diseases.

@lnrekerle lnrekerle marked this pull request as ready for review May 9, 2024 17:46
@lnrekerle lnrekerle merged commit 1df6274 into develop May 10, 2024
5 checks passed
@lnrekerle lnrekerle deleted the disease_testing branch May 10, 2024 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

By disease/by gene
2 participants