Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at making snakemake workflow for innovation model #11

Open
wants to merge 36 commits into
base: master
Choose a base branch
from

Conversation

marlinfiggins
Copy link

@marlinfiggins marlinfiggins commented Feb 13, 2024

This PR implements a Snakemake workflow for provisioning sequence counts with similar methods used in forecasts-ncov. This is my first time working with Snakemake, so any suggestions, comments, or questions are appreciated.

Major changes include:

  • Specifying several analysis periods with different pivot variants which growth advantages are estimated relative to. (See config.yaml for specification here)
  • Provisioning data sets for each analysis period (See workflow/snakemake_rules/prepare_data.smk)
  • Running the innovation model with the specified period for all qualifying locations (See scripts/run-innovation-model)

This to be accomplished still:

  • Checking Python implementation of mlr-fitness/data/pango-relationships.nb for correctness
  • Testing of workflow in general.
  • Implementing phenotype prediction using DMS data / snakemake.

Note: I've borrowed the files scripts/prepare-data.py and scripts/collapse-lineages.py directly from forecasts-ncov. Let me know if there's a better way of doing this.

@marlinfiggins
Copy link
Author

marlinfiggins commented Feb 14, 2024

I've tried to convert compare_natural notebook to a Python script scripts/compute-phenotype which can be used to generate phenotype files in specified in prepare_data.smk (corresponding rule is compute_phenotype).

There's a couple of remaining tasks for this section:

  • Ensuring pairs for predictor differences are generated with the parent-variant relationships in pango_variant_relationships
  • Figure out best way of specifying which phenotypes to generate. I imagine a section in config.yaml would be ideal?
  • Processing phenotypes to a single (or collection of) predictors files.
  • Testing in general

@trvrb
Copy link
Member

trvrb commented Oct 6, 2024

Thanks so much for diving in here @marlinfiggins. I'm sorry that I didn't notice this PR before you pointed it out to me last month. I just tried working from rewrite and I managed to fix some initial Snakemake bugs that were throwing errors. However, I now encounter

nextstrain build --cpus 1 . data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv
Building DAG of jobs...
MissingInputException in rule prepare_clade_data in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 43:
Missing input files for rule prepare_clade_data:
    output: data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv, data/gisaid/pango_lineages/global/xbb15/prepared_seq_counts.tsv
    wildcards: data_provenance=gisaid, variant_classification=pango_lineages, geo_resolution=global, analysis_period=xbb15
    affected files:
        data/cases/global.tsv.gz
        data/gisaid/pango_lineages/global.tsv.gz

I think that you're assuming that local files like data/gisaid/pango_lineages/global.tsv.gz exist? Can you please add to the README.md what steps need to be taken to provision data before running Snakemake?

I just did almost exactly this over here: https://github.com/blab/fitness-dynamics?tab=readme-ov-file#provision-metadata-locally.

I can continue review once I know how to provision local data.


Also, separate question: can we just drop data/{data_provenance}/{variant_classification}/{geo_resolution}/{analysis_period}/prepared_cases.tsv? I don't think cases feed into any of the MLR analyses.

marlinfiggins and others added 5 commits October 24, 2024 15:23
This copies over logic from https://github.com/blab/fitness-dynamics where the prepare_clade_data rule that calls scripts/prepare-data.py is based on a defined analysis window of min_date and max_date rather than defining included_days. This is significantly cleaner for performing historical analyses.

Additionally, drop references to cases (and the requirement of inputting cases to prepare-data.py). We're not uses cases in the MLR analysis and they just add unused overhead.
The line variant_relationships = pd.DataFrame(parent_map).reset_index() in prepare-pango-relationships.py was throwing an error for me. I've fixed it in this commit.
@trvrb
Copy link
Member

trvrb commented Nov 4, 2024

Hey @marlinfiggins. I was able to provision data locally and fix some errors to get me to

data/gisaid/pango_lineages/global/xbb15/collapsed_seq_counts.tsv
data/gisaid/pango_lineages/global/xbb15/pango_variant_relationships.tsv

However, I can't figure out how to even begin with phenotypes. From prepare_data.smk I see I should be able to ask for predictors/{analysis_period}/{pheno}_clade.csv. I tried this with nextstrain build . -j 1 predictors/xbb15/EVEscape_clade.csv and got the error:

InputFunctionException in rule compute_phenotypes in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 163:
Error:
  AttributeError: 'Wildcards' object has no attribute 'phenotype'
Wildcards:
  analysis_period=xbb15
  pheno=EVEscape
Traceback:
  File "/nextstrain/build/workflow/snakemake_rules/prepare_data.smk", line 165, in <lambda>

I'm pretty sure this was an issue with line 165 of:

input_data = lambda wildcards: phenos_compare_natural.get(wildcards.phenotype),

I just updated this to

input_data = lambda wildcards: phenos_compare_natural.get(wildcards.pheno),

@trvrb
Copy link
Member

trvrb commented Nov 4, 2024

(continued...)

However now calling nextstrain build . -j 1 predictors/xbb15/EVEscape_clade.csv is giving the error

WorkflowError in rule compute_phenotypes in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 163:
Function did not return str or list of str.

Can you please take a look at this? Please add instructions in README.md for how to proceed with provisioning local phenotypes.

@trvrb trvrb self-requested a review November 7, 2024 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants