Skip to content

Latest commit

 

History

History
184 lines (142 loc) · 8.07 KB

Phasing.md

File metadata and controls

184 lines (142 loc) · 8.07 KB

PHASING GENOMIC DATA


Pipeline by Juliana Acosta-Uribe. UCSB 2021


I. PREP YOUR FILES

Remove singletons and keep only biallelic sites. Since we are taking benefits of the family structure, many individuals have Mother and Fathre ID, therefore plink categorizes these individuals as "nonfounders". I recommend to use --nonfounders to include these individulas in the allelic frequency counts.

plink --bfile prefix --mac 2 --snps-only just-acgt --nonfounders --make-bed --out prefix.mac2.snps-only

You should only phase one chromosome at a time. For our example we will be using chromosome 14, but you could do a "for loop" function to pahse all chromosomes in a sequence.

plink --bfile prefix.mac2.snps-only --chr 14 --snps-only just-acgt --make-bed --out prefix.mac2.snps-only.chr14

Make sure your file prefix.mac2.snps-only.chr14.bim has the same variant identifiers as your reference panels e.g. 1000GP_Phase3.hap.legend.sample and 1000GP_Phase3_chr14.hap.gz files

You can find mutiple reference panels here https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference or get directly the 1000GP phase 3 here https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. The cM genetic recombination maps are also available in the same site.

II. PHASE DATA

1. Check your file with SHAPEIT2

shapeit \
-check \
--input-ref 1000GP_Phase3_chr14.hap.gz 1000GP_Phase3_chr14.legend.gz 1000GP_Phase3.sample \
--thread 20 \
--input-bed prefix.mac2.snps-only.chr14 \
--input-map genetic_map_chr14_combined_b37.txt \
--output-log prefix.mac2.snps-only.chr14.shapeit

Output files:

prefix.mac2.snps-only.chr14.shapeit.ind.me : contains the Mendel errors at the individual level (N lines). The columns are: id: The individual id father_id: The father id for this individual father_mendel: The number of Mendel error when considering only the duo father/child mother_id: The mother id for this individual mother_mendel: The number of Mendel error when considering only the duo mother/child father_mother_mendel: The number of Mendel error when considering the full trio father/mother/child famID: An unique id internally assigned by SHAPEIT to each distinct family (useful to determine how many unrelated families there are).

id father_id father_mendel mother_id mother_mendel father_mother_mende  famID
HG01879 0       0       0       0       0       0
HG01880 0       0       0       0       0       1
HG01882 0       0       0       0       0       2
HG01883 0       0       0       0       0       3
HG01885 0       0       0       0       0       4
HG01886 0       0       0       0       0       5
HG01889 0       0       0       0       0       6
HG01890 0       0       0       0       0       7
HG01894 0       0       0       0       0       8

For this file we have 0 mendel errors.

prefix.mac2.snps-only.chr14.shapeit.snp.me : contains the Mendel errors at the SNP level (L lines). The columns are variant site ID and position, number of mendel errors, total number of duos-trios. If you divide the 3rd column by the last, you get the Mendel error rate.

id     			 				position        mendel  total
14:19064660:T:G 				19064660        0       105
rs201332986:19126973:G:T        19126973        0       105
14:19127510:G:T					19127510        0       105
14:19130203:A:C 				19130203        0       105
rs138361741:19130208:A:T        19130208        0       105
14:19130228:A:G 				19130228        0       105
rs138032186:19130371:A:G        19130371        0       105
14:19130399:T:G 				19130399        0       105
rs190014110:19130435:G:A        19130435        0       105

For this file we have 0 mendel errors.

prefix.mac2.snps-only.chr14.shapeit.ind.mm : contains the individual missing rates (N lines) The first column gives the individual ID and the second the number of sites with missing data the individual contains.

id      missing
HG01879 55
HG01880 36
HG01882 61
HG01883 72
HG01885 86
HG01886 67
HG01889 71
HG01890 57
HG01894 72

prefix.mac2.snps-only.chr14.shapeit.snp.mm : contains the SNP missing rates and allele frequencies (L lines) The columns are: id: The SNP id (ex: rs id) position: The SNP physical position in bp A: The first allele (ref allele in VCF) B: The second alleles (alt allele in VCF) missing: The missing genotype count. That is the number of samples with missing data at this site. count_A_main: The allele A count. The number of haplotypes carrying the A allele. count_B_main: The allele B count. The number of haplotypes carrying the B allele. count_A_founder: The allele A count in founder haplotypes only. count_B_founder: The allele B count in founder haplotypes only.

id      position        A       B       missing count_A_main    count_B_main    count_A_founder count_B_founder count_A_ref     count_B_ref  count_A_total    count_B_total
14:19064660:T:G 				19064660        G       T       0       9       4813    9       4603    5       5003    14      9816
rs201332986:19126973:G:T        19126973        T       G       0       28      4794    24      4588    13      4995    41      9789
14:19127510:G:T 				19127510        T       G       0       6       4816    6       4606    3       5005    9       9821
14:19130203:A:C 				19130203        C       A       3       47      4769    46      4560    22      4986    69      9755
rs138361741:19130208:A:T        19130208        T       A       7       274     4534    258     4340    202     4806    476     9340
14:19130228:A:G 				19130228        G       A       0       15      4807    14      4598    14      4994    29      9801
rs138032186:19130371:A:G        19130371        G       A       2       151     4667    142     4466    109     4899    260     9566
14:19130399:T:G 				19130399        G       T       0       29      4793    28      4584    22      4986    51      9779
rs190014110:19130435:G:A        19130435        A       G       0       239     4583    223     4389    240     4768    479     9351

To fix the SNPs with high missingness I used the plink --geno command. Use the --nonfounder flag to do this

plink --bfile prefix.mac2.snps-only.chr14 --geno 0.01  --nonfounders --make-bed --out prefix.mac2.snps-only.chr14.geno

Missalingment or missingness issues will be reported in file.strand and file.strand.exclude

You can exclude the file.strand.exclude snps with --exclude-snp file.exclude Where each line of the file file.exclude corresponds to a physical position that has to be excluded.

Then check again (use the --exclude-snp flag if necessary)

shapeit \
-check \
--input-ref 1000GP_Phase3_chr14.hap.gz 1000GP_Phase3_chr14.legend.gz 1000GP_Phase3.sample \
--thread 20 \
--input-bed prefix.mac2.snps-only.chr14.geno \
--input-map genetic_map_chr14_combined_b37.txt \
--output-log prefix.mac2.snps-only.chr14.geno.shapeit

Reading genetic map in [genetic_map_chr14_combined_b37.txt] * 104393 genetic positions found * #set=101046 / #interpolated=609980 * Physical map [19.06 Mb -> 107.29 Mb] / Genetic map [0.00 cM -> 117.51 cM]

Checking missingness and MAF...

Checking Mendel errors... * Low level of Mendel error in all trios and duos * 0 SNPs with high Mendel error rate (> 5%)

If everything is good, proceed to phase your data.

2. Phase your data with SHAPEIT2

shapeit \
--input-ref 1000GP_Phase3_chr14.hap.gz 1000GP_Phase3_chr14.legend.gz 1000GP_Phase3.sample \
--input-bed prefix.mac2.snps-only.chr14.geno \
--input-map genetic_map_chr14_combined_b37.txt \
--duohmm \ 	#Useful for phasing complex pedigrees
-W 5 \ 		#A larger phasing window improves phasing when your population has significant IBD
--thread 20 \ 	# you can use as many threads as you want
--output-max prefix.mac2.snps-only.chr14.geno.haps.gz prefix.mac2.snps-only.chr14.geno.sample \
--output-log prefix.mac2.snps-only.chr14.geno.shapeit_phased

Output files : prefix.mac2.snps-only.chr14.geno.haps.gz and prefix.mac2.snps-only.chr14.geno.sample

3. Convert the haps.gz and sample into a VCF

unzip the haps.gz file and run:

shapeit -convert \
--input-haps prefix.mac2.snps-only.chr14.geno \
--output-vcf prefixmac2.snps-only.chr14.geno.vcf