Merge pull request #814 from nextstrain/move-filter

Move filter after subsampling
nextstrain · Jan 5, 2022 · cf79e41
2 parents 9d45734 + d04c639
commit cf79e41

Showing 19 changed files with 287 additions and 424 deletions.
diff --git a/.github/workflows/preprocess-gisaid.yml b/.github/workflows/preprocess-gisaid.yml
diff --git a/.github/workflows/preprocess-open.yml b/.github/workflows/preprocess-open.yml
diff --git a/docs/dev_docs.md b/docs/dev_docs.md
@@ -73,27 +73,9 @@ We do not release new minor versions for new features, but you should document n
 The "core" nextstrain builds consist of a global analysis and six regional analyses, performed independently for GISAID data and open data (currently open data is GenBank data).
 Stepping back, the process can be broken into three steps:
 1. Ingest and curation of raw data. This is performed by the [ncov-ingest](https://github.com/nextstrain/ncov-ingest/) repo and resulting files are uploaded to S3 buckets.
-2. Preprocessing of data (alignment, masking and QC filtering). This is performed by the profiles `nextstrain_profiles/nextstrain-open-preprocess` and `nextstrain_profiles/nextstrain-gisaid-preprocess`. The resulting files are uploaded to S3 buckets by the `upload` rule.
-3. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule.
+2. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule. 
 
 
-### Manually running preprocessing
-
-To run these pipelines without uploading the results:
-```sh
-snakemake -pf results/filtered_open.fasta.xz --profile nextstrain_profiles/nextstrain-open-preprocess
-snakemake -pf results/filtered_gisaid.fasta.xz --profile nextstrain_profiles/nextstrain-gisaid-preprocess
-```
-
-If you wish to upload the resulting information, you should run the `upload` rule.
-Optionally, you may wish to define a specific `S3_DST_BUCKET` to avoid overwriting the files already present on the S3 buckets:
-```sh
-snakemake -pf upload --profile nextstrain_profiles/nextstrain-open-preprocess \
-    --config S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME
-snakemake -pf upload --profile nextstrain_profiles/nextstrain-gisaid-preprocess \
-    --config S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME
-```
-
 ### Manually running phylogenetic builds
 
 To run these pipelines locally, without uploading the results:
@@ -111,13 +93,13 @@ You may wish to overwrite these parameters for your local runs to avoid overwrit
 For instance, here are the commands used by the trial builds action (see below):
 ```sh
 snakemake -pf upload deploy \
-    --profile nextstrain_profiles/nextstrain-open-preprocess \
+    --profile nextstrain_profiles/nextstrain-open \
     --config \
         S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME \
         deploy_url=s3://nextstrain-staging/ \
         auspice_json_prefix=ncov_open_trial_TRIAL_NAME
 snakemake -pf upload deploy \
-    --profile nextstrain_profiles/nextstrain-gisaid-preprocess \
+    --profile nextstrain_profiles/nextstrain-gisaid \
     --config \
         S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME \
         deploy_url=s3://nextstrain-staging/ \

diff --git a/docs/src/analysis/orientation-files.md b/docs/src/analysis/orientation-files.md
@@ -26,7 +26,7 @@ We'll walk through all of the files one by one, but here are the most important
 ## Output files and directories
 
   * `auspice/<build_name>.json`: output file for visualization in Auspice where `<build_name>` is the name of your build in the workflow configuration file.
-  * `results/aligned.fasta`, `results/filtered.fasta`, etc.: raw results files (dependencies) that are shared across all builds.
+  * `results/aligned.fasta`, etc.: raw results files (dependencies) that are shared across all builds.
   * `results/<build_name>/`: raw results files (dependencies) that are specific to a single build.
   * `logs/`: Log files with error messages and other information about the run.
   * `benchmarks/`: Run-times (and memory usage on Linux systems) for each rule in the workflow.

diff --git a/docs/src/reference/change_log.md b/docs/src/reference/change_log.md
@@ -3,6 +3,10 @@
 As of April 2021, we use major version numbers (e.g. v2) to reflect backward incompatible changes to the workflow that likely require you to update your Nextstrain installation.
 We also use this change log to document new features that maintain backward compatibility, indicating these features by the date they were added.
 
+## v10 (January 2022)
+
+ - Move filter and diagnostic steps after subsampling. For workflows with subsampling that does not depend on priority calculations, these changes allow the workflow to start subsampling from the metadata, skipping sequence alignment of the full input sequences and only looping through these input sequences once per build when subsampled sequences are extracted. To skip the alignment step, define your input sequences with the `aligned` directive. If you use priority-based subsampling, define your input sequences with the `sequences` directive. This reorganization of the workflow causes a breaking change in that the workflow no longer supports input-specific filtering with the `exclude_where`, `min_date`, and `exclude_ambiguous_dates_by` parameters. The workflow continues to support input-specific filtering by `min_length` and skipping of diagnostic filters with `skip_diagnostics`. [PR #814](https://github.com/nextstrain/ncov/pull/814).
+
 ## New features since last version update
 
 - 20 December 2021: Surface the crowding penalty parameter via the config file: [PR #828](https://github.com/nextstrain/ncov/pull/827), [Issue #708](https://github.com/nextstrain/ncov/issues/708). The crowding penalty, used when calculating `priority scores` during subsampling, decreases the number of identical samples that are included in the tree during random subsampling to provide a broader picture of the viral diversity in your dataset. However, you may wish to set `crowding_penalty = 0.0` (default value = `0.1`) if you are interested in seeing as many samples as possible that are closely related to your `focal` set. You can change this parameter via `config['priorities']['crowding_penalty']`. There is no change to default behavior.
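
Note for readers updating their own configuration: the v10 change log entry in this hunk might translate into a builds file along the following lines. This is only a sketch. The input names and the `data/...` paths are hypothetical placeholders, and the choice between `aligned` and `sequences` follows the wording of the change log rather than anything verified against the workflow code.

```yaml
# Hypothetical builds.yaml fragment for v10.
# Input names and file paths are placeholders, not files from this repository.
inputs:
  # Subsampling that does not depend on priority calculations:
  # per the change log, `aligned` lets the workflow skip the initial
  # alignment of the full input sequences.
  - name: my-prealigned-data
    metadata: data/my_metadata.tsv.xz
    aligned: data/my_aligned.fasta.xz

  # Priority-based subsampling: the change log says to define the
  # input with `sequences` instead.
  - name: my-raw-data
    metadata: data/my_metadata.tsv.xz
    sequences: data/my_sequences.fasta.xz

  # This PR removes `filtered` from the documented input attributes,
  # so an entry such as
  #   filtered: data/my_filtered.fasta.xz
  # would presumably need to be rewritten with `aligned`
  # (an assumption; the diff does not state a migration path).
```

The same change log entry also says that input-specific `exclude_where`, `min_date`, and `exclude_ambiguous_dates_by` are no longer supported, while input-specific `min_length` and `skip_diagnostics` still are, so any per-input filter settings would need a similar review.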

diff --git a/docs/src/reference/configuration.md b/docs/src/reference/configuration.md
@@ -273,7 +273,7 @@ Builds support any named attributes that can be referenced by subsampling scheme
 * required
 	* `name`
 	* `metadata`
-	* `sequences` or `aligned` or `filtered`
+	* `sequences` or `aligned`
 * examples:
 ```yaml
 inputs:
@@ -283,9 +283,6 @@ inputs:
   - name: prealigned-data
     metadata: data/other_metadata.tsv.xz
     aligned: data/other_aligned.fasta.xz
-  - name: prealigned-and-filtered-data
-    metadata: data/other_metadata.tsv.xz
-    filtered: data/other_filtered.fasta.xz
 ```
 
 Valid attributes for list entries in `inputs` are provided below.
@@ -310,7 +307,7 @@ Valid attributes for list entries in `inputs` are provided below.
 
 ### sequences
 * type: string
-* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **_un_aligned and _un_filtered** genome sequences. Sequences can be uncompressed or compressed.
+* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **_un_aligned** genome sequences. Sequences can be uncompressed or compressed.
 * examples:
 	* `data/example_sequences.fasta`
 	* `data/example_sequences.fasta.xz`
@@ -319,22 +316,13 @@ Valid attributes for list entries in `inputs` are provided below.
 
 ### aligned
 * type: string
-* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned and _un_filtered** genome sequences. Sequences can be uncompressed or compressed.
+* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned** genome sequences. Sequences can be uncompressed or compressed.
 * examples:
 	* `data/aligned.fasta`
 	* `data/aligned.fasta.xz`
 	* `s3://your-bucket/aligned.fasta.gz`
 	* `https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz`
 
-### filtered
-* type: string
-* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned and filtered** genome sequences. Sequences can be uncompressed or compressed.
-* examples:
-	* `data/filtered.fasta`
-	* `data/filtered.fasta.xz`
-	* `s3://your-bucket/filtered.fasta.gz`
-	* `https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz`
-
 ## localrules
 * type: string
 * description: Path to a Snakemake file to include in the workflow. This parameter is redundant with `custom_rules` and may be deprecated soon.

diff --git a/docs/src/reference/remote_inputs.md b/docs/src/reference/remote_inputs.md
@@ -41,7 +41,6 @@ A side-effect of this is the creation and upload of processed versions of the en
 
 * `aligned.fasta.xz` alignment via [nextalign](https://github.com/nextstrain/nextclade/tree/master/packages/nextalign_cli). The default reference genome is [MN908947](https://www.ncbi.nlm.nih.gov/nuccore/MN908947) (Wuhan-Hu-1).
 * `mutation-summary.tsv.xz` A summary of the data in `aligned.fasta.xz`.
-* `filtered.fasta.xz` The alignment excluding data with incomplete / invalid dates, unexpected genome lengths, missing metadata etc. We also maintain a [list of sequences to exclude](https://github.com/nextstrain/ncov/blob/master/defaults/exclude.txt) which are removed at this step. These sequences represent duplicates, outliers in terms of divergence or sequences with faulty metadata.
 
 ## Subsampled datasets
 
@@ -71,7 +70,6 @@ This means that the full GenBank metadata and sequences are typically updated a
 | Full GenBank data    | metadata  | https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz            |
 |                      | sequences | https://data.nextstrain.org/files/ncov/open/sequences.fasta.xz         |
 |                      | aligned   | https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz           |
-|                      | filtered  | https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz          |
 | Global sample        | metadata  | https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz     |
 |                      | sequences | https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz  |
 |                      | aligned   | https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.xz    |
@@ -138,8 +136,6 @@ inputs:
 The following starting points are available:
 
 * replace `sequences` with `aligned` (skips alignment)
-* replace `sequences` with `filtered` (skips alignment and basic filtering steps)
-
 
 ## Compressed vs uncompressed starting points
 

diff --git a/nextstrain_profiles/nextstrain-gisaid-preprocess/builds.yaml b/nextstrain_profiles/nextstrain-gisaid-preprocess/builds.yaml
diff --git a/nextstrain_profiles/nextstrain-gisaid-preprocess/config.yaml b/nextstrain_profiles/nextstrain-gisaid-preprocess/config.yaml
diff --git a/nextstrain_profiles/nextstrain-gisaid/builds.yaml b/nextstrain_profiles/nextstrain-gisaid/builds.yaml
@@ -17,13 +17,12 @@ upload:
 genes: ["ORF1a", "ORF1b", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF9b"]
 use_nextalign: true
 
-# Note: we have a separate profile for aligning GISAID sequences. This is triggered
-# as soon as new sequences are available. This workflow is thus intended to be
-# started from the filtered alignment.                        james, sept 2021
+# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
+# as we re-align everything after subsampling.
 inputs:
   - name: gisaid
     metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
-    filtered: "s3://nextstrain-ncov-private/filtered.fasta.xz"
+    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"
 
 # Define locations for which builds should be created.
 # For each build we specify a subsampling scheme via an explicit key.

diff --git a/nextstrain_profiles/nextstrain-open-preprocess/builds.yaml b/nextstrain_profiles/nextstrain-open-preprocess/builds.yaml
diff --git a/nextstrain_profiles/nextstrain-open-preprocess/config.yaml b/nextstrain_profiles/nextstrain-open-preprocess/config.yaml
diff --git a/nextstrain_profiles/nextstrain-open/builds.yaml b/nextstrain_profiles/nextstrain-open/builds.yaml
@@ -14,13 +14,12 @@ S3_DST_ORIGINS: ["open"]
 upload:
   - build-files
 
-# Note: we have a separate profile for aligning open sequences. This is triggered
-# as soon as new sequences are available. This workflow is thus intended to be
-# started from the filtered alignment.                        james, sept 2021
+# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
+# as we re-align everything after subsampling.
 inputs:
   - name: open
     metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.gz"
-    filtered: "s3://nextstrain-data/files/ncov/open/filtered.fasta.xz"
+    aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz"
 
 # Define locations for which builds should be created.
 # For each build we specify a subsampling scheme via an explicit key.