A Snakemake pipeline to prepare, impute and process binary PLINK files on the Michigan/TOPMed Imputation Servers.
- Anaconda or Miniconda Python
- Singularity
- Python3 (The future is now!)
- Snakemake
All other requirements will be installed locally by the pipeline at runtime.
config/config.yaml contains settings for QC and preparation, submission and download, and post-imputation processing.
Please review and edit the configuration file before starting.
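For orientation, a minimal configuration might look like the sketch below. Every key shown is described in the sections that follow; the values are placeholders, not recommendations.

```yaml
# Orientation-only sketch; see config/config.yaml for the full set of options.
directory: .              # where the input PLINK filesets live (placeholder)
out_dir: results          # must already exist (placeholder)
chroms: "1:22,X"          # chromosomes to upload
outputs:
  - stat_report
  - vcf_bycohort
imputation:
  default:
    token: token_here     # your imputation-server API key
    refpanel: topmed-r2
```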
By default, the pipeline looks for PLINK filesets in the base directory of your installation. You can choose a different location by editing directory. Cohort names are inferred from the stems of the PLINK filesets.
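For example (the folder and fileset names below are hypothetical), a directory entry like this would yield the cohorts group1 and study2:

```yaml
# data/plink/ is a hypothetical location containing:
#   group1.bed  group1.bim  group1.fam
#   study2.bed  study2.bim  study2.fam
directory: data/plink     # cohorts "group1" and "study2" are inferred from the file stems
```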
If you do not want all filesets in the directory to be processed and imputed, you can include COHORT: or SAMPLES: with a list of cohorts in the config file, like this:

```yaml
COHORT:
  - group1
  - study2
```

or

```yaml
COHORT: [group1, study2]
```
The output files for the pipeline will be stored in the directory specified by out_dir. This directory must exist. Output file choices can be specified under outputs, and you can delete or comment out any list item you do not want. The options are:
- stat_report: A report of imputation quality
- vcf_bycohort: Bgzipped VCF files for each cohort
- vcf_merged: A bgzipped VCF with all cohorts
- bgen_bycohort: Binary Oxford filesets for each cohort
- bgen_merged: A binary Oxford fileset with all cohorts
- plink_bycohort: Binary PLINK filesets for each cohort
- plink_merged: A binary PLINK fileset with all cohorts
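For example, to generate only the imputation report and per-cohort VCFs, the outputs list could be trimmed to:

```yaml
outputs:
  - stat_report
  - vcf_bycohort
  # - vcf_merged          # commented-out entries are skipped
```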
This pipeline can impute using the NIH (TOPMed) Imputation Server and the Michigan Imputation Server. You must have an account and an API key for any server you plan to use.
Imputation settings are stored under imputation: as a hash/dictionary of cohort names, with a hash of settings underneath each. You can provide default settings under the hash key default, and/or override those settings under your individual cohort names. The hash of settings follows the API specifications here and here, in addition to your API key, stored under token. files and password are set automatically. Additionally, if you are planning to use HRC or TOPMed and you provide just the refpanel or server along with your API key, best-practice defaults will be used.
Options for server are NIH or Michigan. Case does not matter.
For each cohort specified, the pipeline will override the defaults with the specified settings. Settings that are unchanged from the defaults do not need to be repeated.
Here is an example:
```yaml
imputation:
  default:
    token: token_here
    refpanel: topmed-r2
  study2:
    token: other_token_here
    server: Michigan
    refpanel: gasp-v2
    population: ASN
```
This will run study2 on the Michigan server using the GAsP panel with the ASN population, and all other cohorts with TOPMed using all populations.
Select the chromosomes you want to upload by editing chroms. Separate ranges with ":" and listed chromosomes with ","; you can combine both in the same string. Use M for the mitochondrial genome. Valid options are 1:22,X,Y,M.
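For example, to upload the autosomes and the X chromosome only:

```yaml
chroms: "1:22,X"    # ranges use ":", individual chromosomes use ","
```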
The pipeline will reference-align the input files before imputation using the fasta file specified under ref:. This fasta MUST be in the same genome build as the input. Input genome builds must match the builds supported by the chosen server, and you must specify the build under imputation if it does not match the server's default.
This pipeline supports all builds for the downloaded imputed files. Be aware that TOPMed imputation returns GRCh38 genotypes.
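A minimal sketch; the fasta path below is a placeholder and must point to a reference in the same build as your input:

```yaml
ref: /path/to/human_g1k_v37.fasta    # hypothetical GRCh37 reference fasta
```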
The pipeline can both do QC in preparation for upload and filter the files provided by the servers on a per-cohort basis. Pre-imputation QC is under preqc: and post-imputation QC is under postqc:. The options are documented in the configuration file.
The imputation server imputes in 10-kilobase chunks, removing chunks that have below a 50% call rate in any subject. To avoid this, the pipeline can require either that no chromosome has more than 20% missingness (chr_callrate: True) or that no imputation chunk with at least 50 variants has more than 50% missingness, removing subjects who violate the chosen criterion. You can skip both checks by setting both options to false, but you can perform at most one of them. The chunk check is recommended.
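A sketch of the pre-imputation call-rate options: chr_callrate comes from this README, but the nesting under preqc: and the name of the chunk-based key are assumptions, so check config/config.yaml for the actual layout.

```yaml
preqc:
  chr_callrate: false     # drop subjects with >20% missingness on any chromosome
  chunk_callrate: true    # assumed key name: drop subjects with >50% missingness in any 50+ variant chunk
```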
You can also include OR exclude samples by a named list file using one of the following options:

- include_samp: filename of samples to include (plink --keep)
- exclude_samp: filename of samples to exclude (plink --remove)
include_samp accepts a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column, and removes all unlisted samples from the current analysis. exclude_samp accepts the same format but removes all listed samples.
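A sketch of a sample-inclusion filter; the filename is hypothetical, and the file itself is just two whitespace-separated columns (FID, then IID):

```yaml
include_samp: keep_these_samples.txt    # hypothetical file: FID in column 1, IID in column 2
# keep_these_samples.txt might contain, e.g.:
#   FAM001  IND001
#   FAM002  IND002
```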
You must run the pipeline with Anaconda environments and Singularity enabled. You can either use a Snakemake profile to do so or run the pipeline with the following command:
```
snakemake --use-conda --use-singularity
```
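Alternatively, a minimal Snakemake profile (the name and location below are up to you, e.g. ~/.config/snakemake/impute/config.yaml) only needs to enable the same two options, after which snakemake --profile impute would be equivalent:

```yaml
# Hypothetical profile config.yaml; add cluster options as needed.
use-conda: true
use-singularity: true
```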
Make sure your cluster supports the amount of memory required for report generation (528 GiB) and that the memory is being properly requested. If that is not the case, you can edit the resource requirements on the rule stats in the postImpute pipeline and modify the modularization lines in workflow/Snakefile in this pipeline. We have included an lsf.yaml file that ensures those resources are available within this pipeline.