This directory contains the pipelines used to convert human data into
tsinfer
input format and run tsinfer. It is based around a Makefile,
which breaks up the various steps and can avoid repeating parts of the
analysis.
We assume that all commands are run within this directory.
The pipelines require:
- bcftools
- tabix
- Python 3
The Python package requirements are listed in the requirements.txt
file
in the repository root.
To build a tree sequence for 1000 Genomes chromosome 20, run:
$ make 1kg_chr20.trees
This will download the data, and create the intermediate files and
output the final tree sequence .trees
file. Converting from
VCF to the .samples
file will take some time.
The number of threads used in steps of the pipeline that support
threading can be controlled using the NUM_THREADS
make
variable, i.e.
$ make NUM_THREADS=20 1kg_chr20.trees
will tell tsinfer
to use 20 threads where appropriate.
The pipeline for 1000 genomes is generic, and so any chromosome can be built
by running, e.g., make 1kg_chr1.trees
.
The pipeline for the SGDP chromosome 20 data can executed using
$ make sgdp_chr20.trees
The pipeline for UKBB can be run in a similar way, but because it is not a public dataset the input data cannot be downloaded automatically. See the Makefile for the details of the required files and paths.
The code for generating visual representations of the tree sequence characterisation of the UK Biobank is in the notebook UKB_plot.ipynb
. This is the code used to plot Figure 5 of the paper. The execution of this code also requires the following files: UK_NUTS_Level_2_January_2018_Names_and_Countries.csv
and UK_NUTS_Level_3_January_2018_Names_and_Countries.csv
.