Commit

Merge pull request #814 from nextstrain/move-filter
Move filter after subsampling
huddlej authored Jan 5, 2022
2 parents 9d45734 + d04c639 commit cf79e41
Showing 19 changed files with 287 additions and 424 deletions.
70 changes: 0 additions & 70 deletions .github/workflows/preprocess-gisaid.yml

This file was deleted.

70 changes: 0 additions & 70 deletions .github/workflows/preprocess-open.yml

This file was deleted.

24 changes: 3 additions & 21 deletions docs/dev_docs.md
@@ -73,27 +73,9 @@ We do not release new minor versions for new features, but you should document n
The "core" nextstrain builds consist of a global analysis and six regional analyses, performed independently for GISAID data and open data (currently open data is GenBank data).
Stepping back, the process can be broken into three steps:
1. Ingest and curation of raw data. This is performed by the [ncov-ingest](https://github.com/nextstrain/ncov-ingest/) repo and resulting files are uploaded to S3 buckets.
2. Preprocessing of data (alignment, masking and QC filtering). This is performed by the profiles `nextstrain_profiles/nextstrain-open-preprocess` and `nextstrain_profiles/nextstrain-gisaid-preprocess`. The resulting files are uploaded to S3 buckets by the `upload` rule.
3. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule.
2. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule.


### Manually running preprocessing

To run these pipelines without uploading the results:
```sh
snakemake -pf results/filtered_open.fasta.xz --profile nextstrain_profiles/nextstrain-open-preprocess
snakemake -pf results/filtered_gisaid.fasta.xz --profile nextstrain_profiles/nextstrain-gisaid-preprocess
```

If you wish to upload the resulting information, you should run the `upload` rule.
Optionally, you may wish to define a specific `S3_DST_BUCKET` to avoid overwriting the files already present on the S3 buckets:
```sh
snakemake -pf upload --profile nextstrain_profiles/nextstrain-open-preprocess \
--config S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME
snakemake -pf upload --profile nextstrain_profiles/nextstrain-gisaid-preprocess \
--config S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME
```

### Manually running phylogenetic builds

To run these pipelines locally, without uploading the results:
@@ -111,13 +93,13 @@ You may wish to overwrite these parameters for your local runs to avoid overwrit
For instance, here are the commands used by the trial builds action (see below):
```sh
snakemake -pf upload deploy \
--profile nextstrain_profiles/nextstrain-open-preprocess \
--profile nextstrain_profiles/nextstrain-open \
--config \
S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME \
deploy_url=s3://nextstrain-staging/ \
auspice_json_prefix=ncov_open_trial_TRIAL_NAME
snakemake -pf upload deploy \
--profile nextstrain_profiles/nextstrain-gisaid-preprocess \
--profile nextstrain_profiles/nextstrain-gisaid \
--config \
S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME \
deploy_url=s3://nextstrain-staging/ \
2 changes: 1 addition & 1 deletion docs/src/analysis/orientation-files.md
@@ -26,7 +26,7 @@ We'll walk through all of the files one by one, but here are the most important
## Output files and directories

* `auspice/<build_name>.json`: output file for visualization in Auspice where `<build_name>` is the name of your build in the workflow configuration file.
* `results/aligned.fasta`, `results/filtered.fasta`, etc.: raw results files (dependencies) that are shared across all builds.
* `results/aligned.fasta`, etc.: raw results files (dependencies) that are shared across all builds.
* `results/<build_name>/`: raw results files (dependencies) that are specific to a single build.
* `logs/`: Log files with error messages and other information about the run.
* `benchmarks/`: Run-times (and memory usage on Linux systems) for each rule in the workflow.
4 changes: 4 additions & 0 deletions docs/src/reference/change_log.md
@@ -3,6 +3,10 @@
As of April 2021, we use major version numbers (e.g. v2) to reflect backward incompatible changes to the workflow that likely require you to update your Nextstrain installation.
We also use this change log to document new features that maintain backward compatibility, indicating these features by the date they were added.

## v10 (January 2022)

- Move filter and diagnostic steps after subsampling. For workflows with subsampling that does not depend on priority calculations, these changes allow the workflow to start subsampling from the metadata, skipping sequence alignment of the full input sequences and only looping through these input sequences once per build when subsampled sequences are extracted. To skip the alignment step, define your input sequences with the `aligned` directive. If you use priority-based subsampling, define your input sequences with the `sequences` directive. This reorganization of the workflow causes a breaking change in that the workflow no longer supports input-specific filtering with the `exclude_where`, `min_date`, and `exclude_ambiguous_dates_by` parameters. The workflow continues to support input-specific filtering by `min_length` and skipping of diagnostic filters with `skip_diagnostics`. [PR #814](https://github.com/nextstrain/ncov/pull/814).
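
As a rough sketch of the two cases described in this entry, the `inputs` section of a builds file might look like the following. The input name and file paths are placeholders chosen for illustration; only the `name`, `metadata`, `aligned`, and `sequences` keys come from the workflow documentation.

```yaml
# Hypothetical input definition; the name and paths are placeholders.
inputs:
  - name: example-data
    metadata: data/example_metadata.tsv.xz
    # Subsampling without priority calculations: declare the sequences
    # as `aligned` so the workflow skips aligning the full input.
    aligned: data/aligned.fasta.xz
    # Priority-based subsampling: declare the sequences with `sequences`
    # instead of `aligned`, for example:
    # sequences: data/example_sequences.fasta.xz
```

Per the entry above, input-specific `exclude_where`, `min_date`, and `exclude_ambiguous_dates_by` settings are no longer supported for either form.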

## New features since last version update

- 20 December 2021: Surface the crowding penalty parameter via the config file: [PR #828](https://github.com/nextstrain/ncov/pull/827), [Issue #708](https://github.com/nextstrain/ncov/issues/708). The crowding penalty, used when calculating `priority scores` during subsampling, decreases the number of identical samples that are included in the tree during random subsampling to provide a broader picture of the viral diversity in your dataset. However, you may wish to set `crowding_penalty = 0.0` (default value = `0.1`) if you are interested in seeing as many samples as possible that are closely related to your `focal` set. You can change this parameter via `config['priorities']['crowding_penalty']`. There is no change to default behavior.
18 changes: 3 additions & 15 deletions docs/src/reference/configuration.md
@@ -273,7 +273,7 @@ Builds support any named attributes that can be referenced by subsampling scheme
* required
* `name`
* `metadata`
* `sequences` or `aligned` or `filtered`
* `sequences` or `aligned`
* examples:
```yaml
inputs:
@@ -283,9 +283,6 @@ inputs:
- name: prealigned-data
metadata: data/other_metadata.tsv.xz
aligned: data/other_aligned.fasta.xz
- name: prealigned-and-filtered-data
metadata: data/other_metadata.tsv.xz
filtered: data/other_filtered.fasta.xz
```

Valid attributes for list entries in `inputs` are provided below.
@@ -310,7 +307,7 @@ Valid attributes for list entries in `inputs` are provided below.

### sequences
* type: string
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **_un_aligned and _un_filtered** genome sequences. Sequences can be uncompressed or compressed.
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **_un_aligned** genome sequences. Sequences can be uncompressed or compressed.
* examples:
* `data/example_sequences.fasta`
* `data/example_sequences.fasta.xz`
@@ -319,22 +316,13 @@ Valid attributes for list entries in `inputs` are provided below.

### aligned
* type: string
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned and _un_filtered** genome sequences. Sequences can be uncompressed or compressed.
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned** genome sequences. Sequences can be uncompressed or compressed.
* examples:
* `data/aligned.fasta`
* `data/aligned.fasta.xz`
* `s3://your-bucket/aligned.fasta.gz`
* `https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz`

### filtered
* type: string
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned and filtered** genome sequences. Sequences can be uncompressed or compressed.
* examples:
* `data/filtered.fasta`
* `data/filtered.fasta.xz`
* `s3://your-bucket/filtered.fasta.gz`
* `https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz`

## localrules
* type: string
* description: Path to a Snakemake file to include in the workflow. This parameter is redundant with `custom_rules` and may be deprecated soon.
4 changes: 0 additions & 4 deletions docs/src/reference/remote_inputs.md
@@ -41,7 +41,6 @@ A side-effect of this is the creation and upload of processed versions of the en

* `aligned.fasta.xz` alignment via [nextalign](https://github.com/nextstrain/nextclade/tree/master/packages/nextalign_cli). The default reference genome is [MN908947](https://www.ncbi.nlm.nih.gov/nuccore/MN908947) (Wuhan-Hu-1).
* `mutation-summary.tsv.xz` A summary of the data in `aligned.fasta.xz`.
* `filtered.fasta.xz` The alignment excluding data with incomplete / invalid dates, unexpected genome lengths, missing metadata etc. We also maintain a [list of sequences to exclude](https://github.com/nextstrain/ncov/blob/master/defaults/exclude.txt) which are removed at this step. These sequences represent duplicates, outliers in terms of divergence or sequences with faulty metadata.

## Subsampled datasets

@@ -71,7 +70,6 @@ This means that the full GenBank metadata and sequences are typically updated a
| Full GenBank data | metadata | https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz |
| | sequences | https://data.nextstrain.org/files/ncov/open/sequences.fasta.xz |
| | aligned | https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz |
| | filtered | https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz |
| Global sample | metadata | https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz |
| | sequences | https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz |
| | aligned | https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.xz |
@@ -138,8 +136,6 @@ inputs:
The following starting points are available:
* replace `sequences` with `aligned` (skips alignment)
* replace `sequences` with `filtered` (skips alignment and basic filtering steps)
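
For instance, a build that starts from the aligned open (GenBank) data could declare its input roughly as sketched here. The URLs are the ones listed in the table of open data files; the input name `open-aligned` is a placeholder.

```yaml
# Sketch only: start from the pre-aligned open data to skip alignment.
inputs:
  - name: open-aligned   # placeholder name
    metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
    aligned: https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz
```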


## Compressed vs uncompressed starting points

22 changes: 0 additions & 22 deletions nextstrain_profiles/nextstrain-gisaid-preprocess/builds.yaml

This file was deleted.

10 changes: 0 additions & 10 deletions nextstrain_profiles/nextstrain-gisaid-preprocess/config.yaml

This file was deleted.

7 changes: 3 additions & 4 deletions nextstrain_profiles/nextstrain-gisaid/builds.yaml
@@ -17,13 +17,12 @@ upload:
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF9b"]
use_nextalign: true

# Note: we have a separate profile for aligning GISAID sequences. This is triggered
# as soon as new sequences are available. This workflow is thus intended to be
# started from the filtered alignment. james, sept 2021
# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
# as we re-align everything after subsampling.
inputs:
- name: gisaid
metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
filtered: "s3://nextstrain-ncov-private/filtered.fasta.xz"
aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"

# Define locations for which builds should be created.
# For each build we specify a subsampling scheme via an explicit key.
22 changes: 0 additions & 22 deletions nextstrain_profiles/nextstrain-open-preprocess/builds.yaml

This file was deleted.

10 changes: 0 additions & 10 deletions nextstrain_profiles/nextstrain-open-preprocess/config.yaml

This file was deleted.

7 changes: 3 additions & 4 deletions nextstrain_profiles/nextstrain-open/builds.yaml
@@ -14,13 +14,12 @@ S3_DST_ORIGINS: ["open"]
upload:
- build-files

# Note: we have a separate profile for aligning open sequences. This is triggered
# as soon as new sequences are available. This workflow is thus intended to be
# started from the filtered alignment. james, sept 2021
# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
# as we re-align everything after subsampling.
inputs:
- name: open
metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.gz"
filtered: "s3://nextstrain-data/files/ncov/open/filtered.fasta.xz"
aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz"

# Define locations for which builds should be created.
# For each build we specify a subsampling scheme via an explicit key.