This pipeline performs phasing and local ancestry inference for target admixed samples. It can also call IBD segments.
- Clone the repository.
```
git clone https://github.com/sdtemple/flare-pipeline
```
- Install `java` so that you can run `java -jar` in a terminal.
- Download additional software (requires `wget`).
```
bash get-software.sh software
```
- Make the Python environment.
  - I recommend using `mamba` of some sort (https://mamba.readthedocs.io/en/latest/index.html)
  - In which case: `mamba env create -f conda-env.yml`
  - See `misc/installing-mamba.md` for further support
You need the following inputs:
- GDS files for each chromosome
  - Reference samples
  - Target admixed samples
- A file mapping reference sample IDs to their reference panel
  - Point to this file in your YAML settings (see the sketch after this list)
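For illustration, a hedged sketch of what such a map file might look like is below. The sample IDs, panel labels, and file name are made up, and the exact format flare expects should be checked against its documentation.

```bash
# Hypothetical example of a reference-panel map: one reference sample ID per line,
# followed by the panel it belongs to (IDs, labels, and file name are placeholders).
printf 'NA12878\tEUR\nNA19017\tAFR\nHG00514\tEAS\n' > reference-panel-map.txt
```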
To subset or manipulate VCFs and GDSs:
- `scripts/vcf-to-gds.R`
- `scripts/get-sample-ids-gds.R`
- `scripts/subset-gds.R`: implements subsetting, and minor allele count and missingness filters
  - Use MAC 1 and missingness 0.00 if you only want to subset samples (a hypothetical invocation is sketched below)
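As a hedged illustration, a subsetting call might look like the sketch below; the argument names and order are assumptions, so read `scripts/subset-gds.R` for its actual usage before running it.

```bash
# Hypothetical invocation (arguments are assumptions; check scripts/subset-gds.R for real usage).
# MAC 1 and missingness 0.00 keep all variants, so only the samples are subset.
Rscript scripts/subset-gds.R input.gds output.gds keep-samples.txt 1 0.00
```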
- Activate the environment.
```
mamba activate flare24
```
- Modify the `your.analysis.arguments.yaml` file.
  - See the `change:` settings
  - You need to choose a reference sample!
- Do a dry run of the pipeline to check which rules will run.
```
snakemake -c1 -n --configfile *.yaml
```
- Run the pipeline for real.
```
nohup snakemake -c1 --latency-wait 300 --keep-going --cluster "[options]" --configfile *.yaml --jobs XXX &
```
- Other useful `snakemake` options:
  - `--rerun-incomplete`
  - `--rerun-triggers mtime`
  - `--force-all`
- Commands for `qsub` (a filled-in sketch appears after this list)
  - `--cluster "qsub -q your-queue.q -m e -M [email protected] -pe local XXX -l h_vmem=XXXG -V"`
  - `-V` is important to pass in your conda environment!
  - `-pe local XXX` is how many threads you will use
  - You don't have to send emails to yourself if you don't want to
- Commands for `slurm` (a filled-in sketch appears after this list)
  - `--cluster "sbatch [options]"`
  - `-e ~/your-logs/{rule}.e` and `-o ~/your-logs/{rule}.o` control where stderr and stdout go
  - `--cpus-per-task=XX` says how many CPUs per job
  - `--nodes=XX` says how many nodes per job
  - `--partition=SOMENAME` says which partition to use
  - `--mem=XX` says how much memory in MB
  - `--mail-type=ALL` and `--mail-user=[email protected]` send mail to you when a job finishes
  - `--job-name={rule}`
- You can sign out of the cluster. `nohup ... &` will keep this as an ongoing process until complete.
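As hedged illustrations, fully assembled submission commands might look like the sketches below. The queue/partition names, email, thread counts, memory, log paths, and job counts are placeholders, and the config file name assumes you kept `your.analysis.arguments.yaml`.

```bash
# Hypothetical qsub submission (queue, email, threads, memory, and --jobs are placeholders)
nohup snakemake -c1 --latency-wait 300 --keep-going \
  --cluster "qsub -q your-queue.q -m e -M [email protected] -pe local 8 -l h_vmem=32G -V" \
  --configfile your.analysis.arguments.yaml --jobs 20 &
```

```bash
# Hypothetical slurm submission (partition, resources, log paths, and --jobs are placeholders)
nohup snakemake -c1 --latency-wait 300 --keep-going \
  --cluster "sbatch --partition=SOMENAME --nodes=1 --cpus-per-task=4 --mem=16000 --job-name={rule} -e ~/your-logs/{rule}.e -o ~/your-logs/{rule}.o" \
  --configfile your.analysis.arguments.yaml --jobs 20 &
```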
- Your LAI results are in the `lai/` folder
- Your IBD results are in the `ibdsegs/` folder
  - By default, IBD segments are not detected
For reproducibility, the `arguments.yaml` in the main folder records what you ran. Don't ever change it!
For robustness, you can create different `*.yaml` settings files and see how the results change. Make sure to change the `your-analysis-folder` setting (a sketch follows below).
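For example, a minimal sketch of a robustness check (file names are placeholders; the copied YAML is edited by hand, including `your-analysis-folder`):

```bash
# Hypothetical sketch: copy the settings, edit the copy, and rerun with the new config file
cp your.analysis.arguments.yaml sensitivity-check.yaml
# ... edit sensitivity-check.yaml by hand (at minimum, change your-analysis-folder) ...
nohup snakemake -c1 --latency-wait 300 --keep-going --cluster "[options]" \
  --configfile sensitivity-check.yaml --jobs XXX &
```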
- Running out of memory: increase `cluster-resouces:xmxmem` in the YAML, and pass more memory into the `--cluster` option on the terminal
- Names differ in the CHROM column between the reference and target samples
- JAR file is corrupted: download a fresh version
- Changed `remove-phase.jar` to `remove-phase.py`
- Changed the
There are two phasing strategies:
- Use the reference to phase the target samples (could introduce imputed values)
  - This may result in more markers in the target sample data
  - You can also use imputed markers if you change the flag in Beagle
- Rephase the target and reference samples altogether
Your chromosome files should be named numerically, e.g., `chr1` all the way to `chr22`. If you want to study a sex chromosome or anything else, use a symbolic link to, say, call it `chr23` (a sketch follows).
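For instance, a hedged sketch with made-up file names (use whatever extension your per-chromosome files actually have):

```bash
# Hypothetical example: expose an X-chromosome file under a numeric name the pipeline will pick up
ln -s chrX.gds chr23.gds
```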
In your YAML configuration file, the chromosome files between `chr-low` and `chr-high` will be analyzed. Use a subset of the smallest chromosomes to test the pipeline, e.g., `chr-low: "21"` and `chr-high: "22"`. Then, to run the entire analysis, use `chr-low: "1"` and `chr-high: "22"`.
The `keep-samples` option in the rule `gds_to_vcf_adx` controls memory and time if you have a huge admixed target sample but only want to study a subset of it.
- If you want to study all of the samples, make a one-column text file with the sample IDs of all samples (a sketch of building this file follows)
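A hedged sketch of building that file is below; the file names are placeholders. `bcftools query -l` lists sample IDs from a VCF, and the repo's `scripts/get-sample-ids-gds.R` can do the same from a GDS (check the script for its actual arguments).

```bash
# One sample ID per line, no header (file names are placeholders)
bcftools query -l chr22.vcf.gz > keep-samples.txt
# For GDS input, something like the following (verify the script's arguments first):
# Rscript scripts/get-sample-ids-gds.R chr22.gds > keep-samples.txt
```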
The `exclude-samples` option comes from the Browning Lab software (Beagle, flare, etc.). It is a one-column text file with the sample IDs of samples you do not want to study. Most of the time this should be the empty file `excludesamples.txt` in this repo. If you use a non-empty file, it will subset the VCFs in the phasing step.
Sometimes the recombination map and the VCF/GDS data use different names for the chromosome column, e.g., chr22 versus 22. Choose one of the four text files in the `rename-chrs/` folder of this repo (an illustrative excerpt follows this list).
- The first column is the old name: the name of the chromosome in the VCF file.
- The second column is the new name: the name of the chromosome in the map file.
- There are separate YAML options for the reference and admixed target samples.
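For illustration, a made-up excerpt following the two-column convention above is below; in practice you pick one of the files already shipped in `rename-chrs/` rather than writing your own.

```bash
# Hypothetical rename file: old chromosome name (VCF), then new name (map), one pair per line
printf 'chr21\t21\nchr22\t22\n' > my-rename.txt
```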
You can detect IBD segments by removing the comments in the `record_yaml` rule.
- This may be useful if you want to do:
  - IBD mapping in conjunction with local ancestry analyses
  - Relatedness inference using `IBDkin` (look for its GitHub repo)
- See the Beagle paper about manipulating window size
  - Impacts memory and runtime
  - The greatest concern is memory in large samples
  - https://www.cell.com/ajhg/fulltext/S0002-9297(21)00304-9
- Subset the target samples and implement reference phasing
  - Make subsets of the target samples
  - Make separate configuration YAML files with different `keep-samples` text files
  - Comment out this line:
    - `[macro+'/lai/chr'+str(i)+'.rephased.flare.anc.vcf.gz' for i in range(low,high+1)],`
  - Keep this line:
    - `[macro+'/lai/chr'+str(i)+'.referencephased.flare.anc.vcf.gz' for i in range(low,high+1)],`
- This repo currently uses snakemake 7.25.2
  - May extend to version 8 as I develop familiarity in other repos
- Initial data can be `*.gds` OR `*.vcf` (a conversion sketch follows)
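If you start from VCFs, a hedged sketch of converting one chromosome to GDS with the repo script is below; the argument order is a guess, so check `scripts/vcf-to-gds.R` for its actual usage.

```bash
# Hypothetical invocation; verify the script's real arguments before running
Rscript scripts/vcf-to-gds.R chr22.vcf.gz chr22.gds
```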
It takes time to make these pipelines. I am an early career researcher and appreciate credit. :) Please:
- At the time of publication, check this repo to see if we have a paper introducing the pipeline.
  - (If we publish this pipeline somewhere, I will point out the paper.)
- Acknowledge me in your publication.

You should be citing (see `get-software.sh`):
- `beagle`
- `flare`