First pass at making snakemake workflow for innovation model #11

marlinfiggins · 2024-02-13T22:50:56Z

This PR implements a Snakemake workflow for provisioning sequence counts using methods similar to those in forecasts-ncov. This is my first time working with Snakemake, so any suggestions, comments, or questions are appreciated.

Major changes include:

- Specifying several analysis periods, each with its own pivot variant relative to which growth advantages are estimated (see config.yaml for the specification).
- Provisioning data sets for each analysis period (see workflow/snakemake_rules/prepare_data.smk).
- Running the innovation model with the specified period for all qualifying locations (see scripts/run-innovation-model).

Still to be accomplished:

- Checking the Python implementation of mlr-fitness/data/pango-relationships.nb for correctness.
- Testing the workflow in general.
- Implementing phenotype prediction using DMS data / Snakemake.

Note: I've borrowed the files scripts/prepare-data.py and scripts/collapse-lineages.py directly from forecasts-ncov. Let me know if there's a better way of doing this.

marlinfiggins · 2024-02-14T22:15:34Z

I've tried to convert the compare_natural notebook to a Python script, scripts/compute-phenotype, which can be used to generate the phenotype files specified in prepare_data.smk (the corresponding rule is compute_phenotype).

There are a few remaining tasks for this section:

- Ensuring pairs for predictor differences are generated with the parent-variant relationships in pango_variant_relationships.
- Figuring out the best way of specifying which phenotypes to generate. I imagine a section in config.yaml would be ideal?
- Processing phenotypes into a single predictors file (or a collection of them).
- Testing in general.
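As a rough illustration of the first task, here is a minimal sketch of generating predictor differences from parent-variant pairs. Everything in it is an assumption for illustration: the `parent_map` and `phenotype` dictionaries, the `phenotype_differences` helper, and the numeric values are placeholders, not the contents of pango_variant_relationships or of any real phenotype file.

```python
# Hypothetical parent map: variant -> parent lineage (illustrative only).
parent_map = {"XBB.1.5": "XBB.1", "XBB.1": "XBB"}

# Hypothetical phenotype score per lineage (placeholder values, not real data).
phenotype = {"XBB": 0.10, "XBB.1": 0.25, "XBB.1.5": 0.40}


def phenotype_differences(parent_map, phenotype):
    """Return one row per (variant, parent) pair with the phenotype difference.

    Pairs are skipped when either lineage has no phenotype value.
    """
    rows = []
    for variant, parent in parent_map.items():
        if variant in phenotype and parent in phenotype:
            rows.append(
                {
                    "variant": variant,
                    "parent": parent,
                    "delta": phenotype[variant] - phenotype[parent],
                }
            )
    return rows


rows = phenotype_differences(parent_map, phenotype)
for row in rows:
    print(row["variant"], row["parent"], round(row["delta"], 3))
```

Driving the loop from the parent map, rather than from all pairs of lineages, keeps the differences restricted to the relationships the workflow already computes.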

I couldn't run Snakemake due to a few bugs in the Snakefile:

1. run_models.smk was missing a comma.
2. The Snakefile was missing the pandas import.
3. There was some confusion in analysis_periods between iterating over a list and over a dictionary. This should be solved by passing analysis_periods.keys() to expand.

However, I couldn't test the workflow due to missing data/ files, so I'm not 100% sure this fixes things.
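The third fix can be sketched in plain Python. This is only an illustration under assumptions: `expand_targets` is a hypothetical stand-in for Snakemake's `expand` so the snippet runs without Snakemake, and the contents of `analysis_periods` (the `ba2` entry and the `pivot` fields) are placeholders rather than the real config.yaml.

```python
from itertools import product

# Hypothetical config: period name -> settings (placeholder values).
analysis_periods = {
    "xbb15": {"pivot": "XBB.1.5"},
    "ba2": {"pivot": "BA.2"},
}


def expand_targets(pattern, **wildcards):
    """Plain-Python stand-in for Snakemake's expand.

    Fills the pattern with every combination of the supplied wildcard values.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Passing the dictionary's keys makes it explicit that only the period
# names, not their settings, are substituted into the path.
targets = expand_targets(
    "data/gisaid/pango_lineages/global/{analysis_period}/collapsed_seq_counts.tsv",
    analysis_period=analysis_periods.keys(),
)
print(targets)
```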

trvrb · 2024-10-06T02:59:16Z

Thanks so much for diving in here @marlinfiggins. I'm sorry that I didn't notice this PR before you pointed it out to me last month. I just tried working from the rewrite branch and managed to fix some initial Snakemake bugs that were throwing errors. However, I now encounter:

nextstrain build --cpus 1 . data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv
Building DAG of jobs...
MissingInputException in rule prepare_clade_data in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 43:
Missing input files for rule prepare_clade_data:
    output: data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv, data/gisaid/pango_lineages/global/xbb15/prepared_seq_counts.tsv
    wildcards: data_provenance=gisaid, variant_classification=pango_lineages, geo_resolution=global, analysis_period=xbb15
    affected files:
        data/cases/global.tsv.gz
        data/gisaid/pango_lineages/global.tsv.gz

I think that you're assuming that local files like data/gisaid/pango_lineages/global.tsv.gz exist? Can you please add to the README.md what steps need to be taken to provision data before running Snakemake?
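Until the README documents provisioning, a quick pre-flight check along these lines could report which local inputs are absent before invoking Snakemake. This is a sketch, not part of the workflow: `missing_inputs` is a hypothetical helper, and the two paths are simply the ones listed in the error above.

```python
from pathlib import Path

# Inputs reported as missing in the MissingInputException above.
REQUIRED_INPUTS = [
    "data/cases/global.tsv.gz",
    "data/gisaid/pango_lineages/global.tsv.gz",
]


def missing_inputs(paths, root="."):
    """Return the subset of paths that do not exist under root."""
    return [p for p in paths if not (Path(root) / p).exists()]


missing = missing_inputs(REQUIRED_INPUTS)
if missing:
    print("Provision these files before running Snakemake:")
    for path in missing:
        print("  " + path)
else:
    print("All required inputs are present.")
```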

I just did almost exactly this over here: https://github.com/blab/fitness-dynamics?tab=readme-ov-file#provision-metadata-locally.

I can continue review once I know how to provision local data.

Also, separate question: can we just drop data/{data_provenance}/{variant_classification}/{geo_resolution}/{analysis_period}/prepared_cases.tsv? I don't think cases feed into any of the MLR analyses.

This copies over logic from https://github.com/blab/fitness-dynamics, where the prepare_clade_data rule that calls scripts/prepare-data.py is based on a defined analysis window of min_date and max_date rather than on included_days. This is significantly cleaner for performing historical analyses. Additionally, drop references to cases (and the requirement of inputting cases to prepare-data.py). We're not using cases in the MLR analysis and they just add unused overhead.
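The windowing idea can be sketched as follows. This is not the code in scripts/prepare-data.py: `filter_window` is a hypothetical helper, the column names and example rows are placeholders, and treating both bounds as inclusive is an assumption.

```python
from datetime import date


def filter_window(rows, min_date, max_date):
    """Keep rows whose ISO-formatted date falls within [min_date, max_date].

    Both bounds are treated as inclusive (an assumption for this sketch).
    """
    lo = date.fromisoformat(min_date)
    hi = date.fromisoformat(max_date)
    return [r for r in rows if lo <= date.fromisoformat(r["date"]) <= hi]


# Placeholder sequence-count rows (not real data).
rows = [
    {"location": "USA", "variant": "XBB.1.5", "date": "2022-12-31", "sequences": 3},
    {"location": "USA", "variant": "XBB.1.5", "date": "2023-01-01", "sequences": 5},
    {"location": "USA", "variant": "XBB.1.5", "date": "2023-03-01", "sequences": 8},
]

kept = filter_window(rows, min_date="2023-01-01", max_date="2023-02-28")
print([r["date"] for r in kept])
```

A fixed window depends only on the two dates passed in, so a historical analysis gives the same result whenever it is rerun, whereas a day count measured back from the current date would shift.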

The line variant_relationships = pd.DataFrame(parent_map).reset_index() in prepare-pango-relationships.py was throwing an error for me. I've fixed it in this commit.

trvrb · 2024-11-04T18:06:03Z

Hey @marlinfiggins. I was able to provision data locally and fix some errors to get me to

data/gisaid/pango_lineages/global/xbb15/collapsed_seq_counts.tsv
data/gisaid/pango_lineages/global/xbb15/pango_variant_relationships.tsv

However, I can't figure out how to even begin with phenotypes. From prepare_data.smk I see I should be able to ask for predictors/{analysis_period}/{pheno}_clade.csv. I tried this with nextstrain build . -j 1 predictors/xbb15/EVEscape_clade.csv and got the error:

InputFunctionException in rule compute_phenotypes in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 163:
Error:
  AttributeError: 'Wildcards' object has no attribute 'phenotype'
Wildcards:
  analysis_period=xbb15
  pheno=EVEscape
Traceback:
  File "/nextstrain/build/workflow/snakemake_rules/prepare_data.smk", line 165, in <lambda>

I'm pretty sure this was an issue with line 165 of:

input_data = lambda wildcards: phenos_compare_natural.get(wildcards.phenotype),

I just updated this to

input_data = lambda wildcards: phenos_compare_natural.get(wildcards.pheno),
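To illustrate why the attribute name has to match the wildcard in the output pattern, here is a small stand-alone sketch. It uses `SimpleNamespace` as a stand-in for Snakemake's Wildcards object, and the path stored in `phenos_compare_natural` is a placeholder, not the real mapping.

```python
from types import SimpleNamespace

# Hypothetical mapping from phenotype name to its input file (placeholder path).
phenos_compare_natural = {"EVEscape": "data/phenotypes/EVEscape.csv"}

# Stand-in for the wildcards parsed from
# predictors/{analysis_period}/{pheno}_clade.csv
wildcards = SimpleNamespace(analysis_period="xbb15", pheno="EVEscape")


def broken(wc):
    # The pattern defines {pheno}, so there is no .phenotype attribute.
    return phenos_compare_natural.get(wc.phenotype)


def fixed(wc):
    # Uses the same name as the wildcard in the output pattern.
    return phenos_compare_natural.get(wc.pheno)


try:
    broken(wildcards)
except AttributeError as err:
    print("broken:", err)

print("fixed:", fixed(wildcards))
```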

trvrb · 2024-11-04T18:08:51Z

(continued...)

However, calling nextstrain build . -j 1 predictors/xbb15/EVEscape_clade.csv now gives the error:

WorkflowError in rule compute_phenotypes in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 163:
Function did not return str or list of str.

Can you please take a look at this? Please add instructions in README.md for how to proceed with provisioning local phenotypes.
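The cause of this error is not identified in this thread. One possibility, offered purely as an assumption, is that the lookup in phenos_compare_natural returns None or some other non-string value for the requested phenotype. A guarded input function along the following lines would surface that case with a clearer message. The `input_for_pheno` name and the mapping contents are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical mapping from phenotype name to its input file (placeholder path).
phenos_compare_natural = {"EVEscape": "data/phenotypes/EVEscape.csv"}


def input_for_pheno(wildcards):
    """Return the input path for the requested phenotype.

    Raises ValueError when the lookup does not yield a string, instead of
    handing None (or another non-string) back to the workflow engine.
    """
    path = phenos_compare_natural.get(wildcards.pheno)
    if not isinstance(path, str):
        known = ", ".join(sorted(phenos_compare_natural))
        raise ValueError(
            f"No input file configured for phenotype '{wildcards.pheno}'. "
            f"Known phenotypes: {known}"
        )
    return path


print(input_for_pheno(SimpleNamespace(analysis_period="xbb15", pheno="EVEscape")))
```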

marlinfiggins added 2 commits December 21, 2023 11:28

Title and introduction

a9a2134

First pass snakemake workflow

a93002d

marlinfiggins assigned marlinfiggins and trvrb and unassigned marlinfiggins Feb 13, 2024

marlinfiggins added 2 commits February 14, 2024 12:48

Update description of run-innovation-model.py.

80049d1

Refactoring compare_natural.ipynb just to generate phenotypes. Updating `prepare_data.smk`.

b262d4e

marlinfiggins and others added 16 commits February 14, 2024 15:09

Fixing config

766dcda

Fixing syntax errors

3be1176

Passing dry-run

7d3c186

Typo in prepare-pango-relationships.py

613a8eb

Adding in pango_variant_relationships for generating phenotype differences.

c2b02df

Adding windowed observation script

6345511

Allowing users to specify start and end of windows

39020a5

Letting model run script save posteriors.

398d100

Updating run_models.smk.

e6157e5

Updating snakemake workflow

097c29e

Variant-Parents are the same by location

afd8d78

Adding mlr_innovation-regression

ae8079e

Innovation Regression example

d731268

Adding predictors argument

5bb0675

Innovation Regression example

c3b9caa

marlinfiggins and others added 5 commits October 24, 2024 15:23

Manuscript + Methods updates

0ffb1e8

Merge branch 'rewrite' of https://github.com/blab/ncov-escape into rewrite

8465ecf

Updated methods section and additional section on PGLS.

882d54b

Include instructions for provisioning data locally

71d5173

Fix bug in Pango relationships script

80c0d38

Fix wildcard reference

64371af

marlinfiggins added 9 commits November 6, 2024 15:20

Adding script to provision files from forecasts-ncov

1561ac0

Working snakemake pipeline

8d6e5e7

Adding ability to select predictors by name. Saving processed data as intermediates.

80494f1

Clearly specifying min-date, max-date, and pivot for single analyses

3392519

Starting windowed analysis

107f4a6

Adding informed model

b69bb66

Fixing syntax error

d9b63d0

Adding provisioning rule

3b32fd8

Adding provisioning phenotype script

3091e7f

trvrb self-requested a review November 7, 2024 19:27
