From fe5a547501e083f5114519cf5f281a302758e880 Mon Sep 17 00:00:00 2001 From: mconomos Date: Wed, 5 Jun 2024 17:55:33 -0700 Subject: [PATCH] add Rmd and html files --- 01_gds_intro.Rmd | 171 +++++ 01_gds_intro.html | 737 ++++++++++++++++++ 02.A_pop_structure_relatedness.Rmd | 340 +++++++++ 02.A_pop_structure_relatedness.html | 1098 +++++++++++++++++++++++++++ 02_GWAS.Rmd | 365 +++++++++ 02_GWAS.html | 1068 ++++++++++++++++++++++++++ 03.A_GENESIS_model_explorer.Rmd | 142 ++++ 03.A_GENESIS_model_explorer.html | 684 +++++++++++++++++ 03_advanced_GWAS.Rmd | 287 +++++++ 03_advanced_GWAS.html | 854 +++++++++++++++++++++ 04_conditional_analysis.Rmd | 364 +++++++++ 04_conditional_analysis.html | 1054 +++++++++++++++++++++++++ 05_annotation_explorer.Rmd | 96 +++ 05_annotation_explorer.html | 586 ++++++++++++++ 06_aggregate_tests.Rmd | 326 ++++++++ 06_aggregate_tests.html | 1014 +++++++++++++++++++++++++ 07_STAAR.Rmd | 166 ++++ 07_STAAR.html | 722 ++++++++++++++++++ 18 files changed, 10074 insertions(+) create mode 100644 01_gds_intro.Rmd create mode 100644 01_gds_intro.html create mode 100644 02.A_pop_structure_relatedness.Rmd create mode 100644 02.A_pop_structure_relatedness.html create mode 100644 02_GWAS.Rmd create mode 100644 02_GWAS.html create mode 100644 03.A_GENESIS_model_explorer.Rmd create mode 100644 03.A_GENESIS_model_explorer.html create mode 100644 03_advanced_GWAS.Rmd create mode 100644 03_advanced_GWAS.html create mode 100644 04_conditional_analysis.Rmd create mode 100644 04_conditional_analysis.html create mode 100644 05_annotation_explorer.Rmd create mode 100644 05_annotation_explorer.html create mode 100644 06_aggregate_tests.Rmd create mode 100644 06_aggregate_tests.html create mode 100644 07_STAAR.Rmd create mode 100644 07_STAAR.html diff --git a/01_gds_intro.Rmd b/01_gds_intro.Rmd new file mode 100644 index 0000000..bdedc04 --- /dev/null +++ b/01_gds_intro.Rmd @@ -0,0 +1,171 @@ +# 1. 
Introduction to GDS Format
+
+This tutorial introduces Genomic Data Structure (GDS), which is a storage format that can efficiently store genomic data and provide fast access to subsets of the data. For more information on GDS for sequence data, see the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).
+
+## Convert a VCF to GDS
+
+To use the R packages developed at the University of Washington Genetic Analysis Center for analyzing sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [bcftools](https://samtools.github.io/bcftools/bcftools.html) to convert to VCF).
+
+For these tutorials, we use a subset of data from the 1000 Genomes Project phase 3 callset from 2015 (PMID: 26432245) that has been lifted over from GRCh37 to GRCh38 using CrossMap, as described in Byrska-Bishop et al., 2022 (PMID: 36055201). The data is available [here](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/phase3_liftover_nygc_dir/).
+
+```{r vcf2gds, message = FALSE}
+library(SeqArray)
+
+repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+# file path to the VCF file to *read* data from
+vcffile <- "data/1KG_phase3_GRCh38_subset_chr1.vcf.gz"
+if (!file.exists(vcffile)) download.file(file.path(repo_path, vcffile), vcffile)
+
+# file path to *write* the output GDS file to
+gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
+
+# convert the VCF to GDS
+seqVCF2GDS(vcffile, gdsfile, fmt.import="GT")
+```
+
+## Exploring a GDS File
+
+### Open a GDS
+
+We can interact with the GDS file using the [SeqArray R package](https://bioconductor.org/packages/release/bioc/html/SeqArray.html). The first thing we need to do is open a connection to a GDS file on disk using the `seqOpen` function. Note that opening a GDS file does _not_ load all of the data into memory.
+
+```{r seqarray}
+# open a connection to the GDS file
+gds <- seqOpen(gdsfile)
+gds
+```
+
+### Reading Data
+
+The `seqGetData` function is the basic function for reading in data from a GDS file.
+
+```{r seqGetData}
+# the unique sample identifier comes from the VCF header
+sample.id <- seqGetData(gds, "sample.id")
+length(sample.id)
+head(sample.id)
+
+# a unique integer ID is assigned to each variant
+variant.id <- seqGetData(gds, "variant.id")
+length(variant.id)
+head(variant.id)
+
+chr <- seqGetData(gds, "chromosome")
+head(chr)
+
+pos <- seqGetData(gds, "position")
+head(pos)
+
+id <- seqGetData(gds, "annotation/id")
+head(id)
+```
+
+There are additional useful functions for summary level data, such as calculating allele frequencies.
+
+```{r minor_freq}
+# minor allele frequency of each variant
+maf <- seqAlleleFreq(gds, minor = TRUE)
+head(maf)
+summary(maf)
+hist(maf, breaks=50)
+```
+
+#### Data Filters
+
+We can define a filter on the `gds` object. After using the `seqSetFilter` command, all subsequent reads from the `gds` object are restricted to the selected subset of data, until a new filter is defined or `seqResetFilter` is called to clear the filter.
+
+```{r filter}
+seqSetFilter(gds, variant.id=variant.id[1:10], sample.id=sample.id[1:5])
+```
+
+```{r samp.id}
+# only returns data for the filtered samples
+seqGetData(gds, "sample.id")
+```
+
+```{r var.id}
+# only returns data for the filtered variants
+seqGetData(gds, "variant.id")
+seqGetData(gds, "position")
+```
+
+### Genotype Data
+
+Genotype data is stored in a 3-dimensional array, where the first dimension is always length 2 for diploid genotypes. The second and third dimensions are samples and variants, respectively. The values of the array denote alleles: `0` is the reference allele and `1` is the alternate allele. For multiallelic variants, other alternate alleles are represented as integers `> 1`.
+
+```{r genotypes}
+geno <- seqGetData(gds, "genotype")
+dim(geno)
+# print the first two variants
+geno[,,1:2]
+```
+
+The [SeqVarTools R package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files. There are functions providing more intuitive ways to read in genotypes. What does each of the following functions return?
+
+```{r seqvartools_geno}
+library(SeqVarTools)
+
+# return genotypes in matrix format
+getGenotype(gds)
+getGenotypeAlleles(gds)
+refDosage(gds)
+altDosage(gds)
+```
+
+### Variant Information
+
+There are functions to extract variant-level information.
+
+```{r seqvartools_varinfo}
+# look at reference and alternate alleles
+refChar(gds)
+altChar(gds)
+
+# data.frame of variant information
+variantInfo(gds)
+```
+
+We can also return variant information as a `GRanges` object from the [GenomicRanges package](https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf). This format for representing sequence data is common across many Bioconductor packages. Chromosome is stored in the `seqnames` column. The `ranges` column has variant position, which can be a single base pair or a range. We will use `GRanges` objects when we analyze sets of variants (e.g. in genes).
+
+```{r granges}
+# reset the filter to all variants and samples
+seqResetFilter(gds)
+
+gr <- granges(gds)
+gr
+```
+
+### Close a GDS
+
+Always use the `seqClose` command to close your connection to a GDS file when you are done working with it. Trying to open an already opened GDS will result in an error.
+
+```{r intro_close}
+seqClose(gds)
+```
+
+
+
+## Exercise 1.1 (Application)
+
+The Apps on the BioData Catalyst powered by Seven Bridges platform allow you to easily scale up cloud computing to run analyses on all chromosomes genome-wide and with larger samples. Use the `VCF to GDS Converter` app to convert the example 1000 Genomes files into GDS files.
The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+    - Click: Public Resources > Workflows and Tools > Browse
+    - Search for `VCF to GDS Converter`
+    - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+    - Click: Apps > `VCF to GDS Converter` > Run
+    - Specify the Inputs:
+        - Variants Files: `1KG_phase3_GRCh38_subset_chr<CHR>.vcf.gz` (select all 22 chromosomes)
+    - Specify the App Settings:
+        - check GDS: No
+    - Click: Run
+
+The analysis will take several minutes to run. You can find your analysis in the Tasks menu of your Project. Use the "View stats & logs" button to check on the status of your tasks. Click on the bar that says "vcf2gds", click "View Logs", and click one of the "vcf2gds" folders (there's one per chromosome). In here you can see detailed logs of the job; take a look at the `job.out.log` and `job.err.log` -- these can be useful for debugging issues.
+
+The output of this analysis will be a set of 22 GDS files, one per chromosome.
+
+You can find the expected output of this analysis by looking at the existing task `01 Convert VCF to GDS` in the Tasks menu of your Project.
+
+
diff --git a/01_gds_intro.html b/01_gds_intro.html
new file mode 100644
index 0000000..f7987ba
--- /dev/null
+++ b/01_gds_intro.html
@@ -0,0 +1,737 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+01_gds_intro.knit
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ + + + + + + +
+

1. Introduction to GDS Format

+

This tutorial introduces Genomic Data Structure (GDS), which is a +storage format that can efficiently store genomic data and provide fast +access to subsets of the data. For more information on GDS for sequence +data, see the SeqArray +package vignette.

+
+

Convert a VCF to GDS

+

To use the R packages developed at the University of Washington
+Genetic Analysis Center for analyzing sequence data, we first need to
+convert a VCF file to GDS. (If the file is BCF, use bcftools,
+https://samtools.github.io/bcftools/bcftools.html, to
+convert to VCF).

+

For these tutorials, we use a subset of data from the 1000 Genomes
+Project phase 3 callset from 2015 (PMID: 26432245) that has been
+lifted over from GRCh37 to GRCh38 using CrossMap, as described in
+Byrska-Bishop et al., 2022 (PMID: 36055201). The data is available here.

+
library(SeqArray)
+
+repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+# file path to the VCF file to *read* data from 
+vcffile <- "data/1KG_phase3_GRCh38_subset_chr1.vcf.gz"
+if (!file.exists(vcffile)) download.file(file.path(repo_path, vcffile), vcffile)
+
+# file path to *write* the output GDS file to 
+gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
+
+# convert the VCF to GDS
+seqVCF2GDS(vcffile, gdsfile, fmt.import="GT")
+
## Mon Jun  3 17:03:35 2024
+## Variant Call Format (VCF) Import:
+##     file:
+##         1KG_phase3_GRCh38_subset_chr1.vcf.gz (5.4M)
+##     file format: VCFv4.1
+##     genome reference: <unknown>
+##     # of sets of chromosomes (ploidy): 2
+##     # of samples: 1,040
+##     genotype field: GT
+##     genotype storage: bit2
+##     compression method: LZMA_RA
+##     # of samples: 1040
+## Output:
+##     data/1KG_phase3_GRCh38_subset_chr1.gds
+## Parsing '1KG_phase3_GRCh38_subset_chr1.vcf.gz':
+## + genotype/data   { Bit2 2x1040x37409 LZMA_ra(9.30%), 1.8M }
+## Digests:
+##     sample.id  [md5: d115dfc4ae456b619da107e3311ce2a4]
+##     variant.id  [md5: 3dcd734d3c97b971f3a8f80990232e0a]
+##     position  [md5: 4bd0c4e4c7a1fe92251e5b74a16e50c5]
+##     chromosome  [md5: 20c39308ae61b0f84333c6f3c3ad8f8e]
+##     allele  [md5: 20efac1f1e6791274766c86b2deb4a21]
+##     genotype  [md5: 32f9a7a66c0c324612ef56d36ab3cb3d]
+##     phase  [md5: 914e7388fab4b1a257ab8f0528b42e17]
+##     annotation/id  [md5: bb137e041f5d016c364689b4034ff7d3]
+##     annotation/qual  [md5: adf4f7fc5e787427afbac00069fd863e]
+##     annotation/filter  [md5: c1946ec5806fccd769c58e2b84f97363]
+## Done.
+## Mon Jun  3 17:03:41 2024
+## Optimize the access efficiency ...
+## Clean up the fragments of GDS file:
+##     open the file 'data/1KG_phase3_GRCh38_subset_chr1.gds' (2.1M)
+##     # of fragments: 140
+##     save to 'data/1KG_phase3_GRCh38_subset_chr1.gds.tmp'
+##     rename 'data/1KG_phase3_GRCh38_subset_chr1.gds.tmp' (2.1M, reduced: 948B)
+##     # of fragments: 61
+## Mon Jun  3 17:03:41 2024
+
+
+

Exploring a GDS File

+
+

Open a GDS

+

We can interact with the GDS file using the SeqArray +R package. The first thing we need to do is open a connection to a +GDS file on disk using the seqOpen function. Note that +opening a GDS file does not load all of the data into +memory.

+
# open a connection to the GDS file
+gds <- seqOpen(gdsfile)
+gds
+
## Object of class "SeqVarGDSClass"
+## File: /Users/mconomos/Documents/Teaching/SISG_2024/data/1KG_phase3_GRCh38_subset_chr1.gds (2.1M)
+## +    [  ] *
+## |--+ description   [  ] *
+## |--+ sample.id   { Str8 1040 LZMA_ra(9.88%), 829B } *
+## |--+ variant.id   { Int32 37409 LZMA_ra(7.40%), 10.8K } *
+## |--+ position   { Int32 37409 LZMA_ra(37.4%), 54.6K } *
+## |--+ chromosome   { Str8 37409 LZMA_ra(0.22%), 169B } *
+## |--+ allele   { Str8 37409 LZMA_ra(17.0%), 25.9K } *
+## |--+ genotype   [  ] *
+## |  |--+ data   { Bit2 2x1040x37409 LZMA_ra(9.74%), 1.8M } *
+## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
+## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
+## |--+ phase   [  ]
+## |  |--+ data   { Bit1 1040x37409 LZMA_ra(0.02%), 861B } *
+## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
+## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
+## |--+ annotation   [  ]
+## |  |--+ id   { Str8 37409 LZMA_ra(35.9%), 150.9K } *
+## |  |--+ qual   { Float32 37409 LZMA_ra(0.12%), 181B } *
+## |  |--+ filter   { Int32,factor 37409 LZMA_ra(0.12%), 181B } *
+## |  |--+ info   [  ]
+## |  \--+ format   [  ]
+## \--+ sample.annotation   [  ]
+
+
+

Reading Data

+

The seqGetData function is the basic function for
+reading in data from a GDS file.

+
# the unique sample identifier comes from the VCF header
+sample.id <- seqGetData(gds, "sample.id")
+length(sample.id)
+
## [1] 1040
+
head(sample.id)
+
## [1] "HG00096" "HG00097" "HG00099" "HG00100" "HG00101" "HG00102"
+
# a unique integer ID is assigned to each variant
+variant.id <- seqGetData(gds, "variant.id")
+length(variant.id)
+
## [1] 37409
+
head(variant.id)
+
## [1] 1 2 3 4 5 6
+
chr <- seqGetData(gds, "chromosome")
+head(chr)
+
## [1] "1" "1" "1" "1" "1" "1"
+
pos <- seqGetData(gds, "position")
+head(pos)
+
## [1] 631490 736950 800909 814264 868735 903007
+
id <- seqGetData(gds, "annotation/id")
+head(id)
+
## [1] "rs528621909" "rs530588710" "rs79010578"  "rs75909375"  "rs9725068"  
+## [6] "rs4970384"
+

There are additional useful functions for summary level data, such as +calculating allele frequencies.

+
# minor allele frequency of each variant
+maf <- seqAlleleFreq(gds, minor = TRUE)
+head(maf)
+
## [1] 0.05576923 0.12067308 0.13076923 0.20096154 0.07836538 0.20817308
+
summary(maf)
+
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
+## 0.000000 0.000000 0.001442 0.081444 0.117789 0.500000
+
hist(maf, breaks=50)
+

+
+

Data Filters

+

We can define a filter on the gds object. After using +the seqSetFilter command, all subsequent reads from the +gds object are restricted to the selected subset of data, +until a new filter is defined or seqResetFilter is called +to clear the filter.

+
seqSetFilter(gds, variant.id=variant.id[1:10], sample.id=sample.id[1:5])
+
## # of selected samples: 5
+## # of selected variants: 10
+
# only returns data for the filtered samples
+seqGetData(gds, "sample.id")
+
## [1] "HG00096" "HG00097" "HG00099" "HG00100" "HG00101"
+
# only returns data for the filtered variants
+seqGetData(gds, "variant.id")
+
##  [1]  1  2  3  4  5  6  7  8  9 10
+
seqGetData(gds, "position")
+
##  [1] 631490 736950 800909 814264 868735 903007 911085 915148 917063 917859
+
+
+
+

Genotype Data

+

Genotype data is stored in a 3-dimensional array, where the first +dimension is always length 2 for diploid genotypes. The second and third +dimensions are samples and variants, respectively. The values of the +array denote alleles: 0 is the reference allele and +1 is the alternate allele. For multiallelic variants, other +alternate alleles are represented as integers > 1.

+
geno <- seqGetData(gds, "genotype")
+dim(geno)
+
## [1]  2  5 10
+
# print the first two variants
+geno[,,1:2]
+
## , , 1
+## 
+##       sample
+## allele [,1] [,2] [,3] [,4] [,5]
+##   [1,]    0    0    0    0    0
+##   [2,]    0    0    0    0    0
+## 
+## , , 2
+## 
+##       sample
+## allele [,1] [,2] [,3] [,4] [,5]
+##   [1,]    1    0    0    0    1
+##   [2,]    0    1    0    0    0
+

The SeqVarTools R +package has some additional functions for interacting with +SeqArray-format GDS files. There are functions providing more intuitive +ways to read in genotypes. What does each of the following functions +return?

+
library(SeqVarTools)
+
+# return genotypes in matrix format
+getGenotype(gds)
+
##          variant
+## sample    1     2     3     4     5     6     7     8     9     10   
+##   HG00096 "0|0" "1|0" "0|1" "0|0" "1|1" "0|0" "0|0" "0|0" "0|0" "0|0"
+##   HG00097 "0|0" "0|1" "0|0" "1|1" "1|1" "0|1" "0|0" "0|0" "0|0" "0|1"
+##   HG00099 "0|0" "0|0" "0|1" "0|0" "1|1" "0|0" "0|0" "0|0" "0|0" "0|0"
+##   HG00100 "0|0" "0|0" "0|0" "1|1" "1|1" "1|1" "0|0" "0|0" "0|0" "1|0"
+##   HG00101 "0|0" "1|0" "0|0" "1|0" "1|1" "1|0" "0|0" "0|0" "0|0" "0|0"
+
getGenotypeAlleles(gds)
+
##          variant
+## sample    1     2     3     4     5     6     7     8     9     10   
+##   HG00096 "T|T" "C|G" "T|A" "T|T" "A|A" "T|T" "C|C" "A|A" "A|A" "A|A"
+##   HG00097 "T|T" "G|C" "T|T" "C|C" "A|A" "T|C" "C|C" "A|A" "A|A" "A|G"
+##   HG00099 "T|T" "G|G" "T|A" "T|T" "A|A" "T|T" "C|C" "A|A" "A|A" "A|A"
+##   HG00100 "T|T" "G|G" "T|T" "C|C" "A|A" "C|C" "C|C" "A|A" "A|A" "G|A"
+##   HG00101 "T|T" "C|G" "T|T" "C|T" "A|A" "C|T" "C|C" "A|A" "A|A" "A|A"
+
refDosage(gds)
+
##          variant
+## sample    1 2 3 4 5 6 7 8 9 10
+##   HG00096 2 1 1 2 0 2 2 2 2  2
+##   HG00097 2 1 2 0 0 1 2 2 2  1
+##   HG00099 2 2 1 2 0 2 2 2 2  2
+##   HG00100 2 2 2 0 0 0 2 2 2  1
+##   HG00101 2 1 2 1 0 1 2 2 2  2
+
altDosage(gds)
+
##          variant
+## sample    1 2 3 4 5 6 7 8 9 10
+##   HG00096 0 1 1 0 2 0 0 0 0  0
+##   HG00097 0 1 0 2 2 1 0 0 0  1
+##   HG00099 0 0 1 0 2 0 0 0 0  0
+##   HG00100 0 0 0 2 2 2 0 0 0  1
+##   HG00101 0 1 0 1 2 1 0 0 0  0
+
+
+

Variant Information

+

There are functions to extract variant-level information.

+
# look at reference and alternate alleles
+refChar(gds)
+
##  [1] "T" "G" "T" "T" "G" "T" "C" "A" "A" "A"
+
altChar(gds)
+
##  [1] "C" "C" "A" "C" "A" "C" "T" "G" "C" "G"
+
# data.frame of variant information
+variantInfo(gds)
+
##    variant.id chr    pos ref alt
+## 1           1   1 631490   T   C
+## 2           2   1 736950   G   C
+## 3           3   1 800909   T   A
+## 4           4   1 814264   T   C
+## 5           5   1 868735   G   A
+## 6           6   1 903007   T   C
+## 7           7   1 911085   C   T
+## 8           8   1 915148   A   G
+## 9           9   1 917063   A   C
+## 10         10   1 917859   A   G
+

We can also return variant information as a GRanges +object from the GenomicRanges +package. This format for representing sequence data is common across +many Bioconductor packages. Chromosome is stored in the +seqnames column. The ranges column has variant +position, which can be a single base pair or a range. We will use +GRanges objects when we analyze sets of variants (e.g. in +genes).

+
# reset the filter to all variants and samples
+seqResetFilter(gds)
+
## # of selected samples: 1,040
+## # of selected variants: 37,409
+
gr <- granges(gds)
+gr
+
## GRanges object with 37409 ranges and 0 metadata columns:
+##         seqnames    ranges strand
+##            <Rle> <IRanges>  <Rle>
+##       1        1    631490      *
+##       2        1    736950      *
+##       3        1    800909      *
+##       4        1    814264      *
+##       5        1    868735      *
+##     ...      ...       ...    ...
+##   37405        1 248671596      *
+##   37406        1 248682170      *
+##   37407        1 248768291      *
+##   37408        1 248774730      *
+##   37409        1 248870861      *
+##   -------
+##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
+
+
+

Close a GDS

+

Always use the seqClose command to close your connection +to a GDS file when you are done working with it. Trying to open an +already opened GDS will result in an error.

+
seqClose(gds)
+
+
+
+

Exercise 1.1 (Application)

+

The Apps on the BioData Catalyst powered by Seven Bridges platform
+allow you to easily scale up cloud computing to run analyses on all
+chromosomes genome-wide and with larger samples. Use the
+VCF to GDS Converter app to convert the example 1000
+Genomes files into GDS files. The steps to perform this analysis are as
+follows:

+
    +
  • Copy the app to your project if it is not already there: +
      +
    • Click: Public Resources > Workflows and Tools > Browse
    • +
    • Search for VCF to GDS Converter
    • +
    • Click: Copy > Select your project > Copy
    • +
  • +
  • Run the analysis in your project: +
      +
    • Click: Apps > VCF to GDS Converter > Run
    • +
    • Specify the Inputs: +
        +
      • Variants Files: +1KG_phase3_GRCh38_subset_chr<CHR>.vcf.gz (select all +22 chromosomes)
      • +
    • +
    • Specify the App Settings: +
        +
      • check GDS: No
      • +
    • +
    • Click: Run
    • +
  • +
+

The analysis will take several minutes to run. You can find your +analysis in the Tasks menu of your Project. Use the “View stats & +logs” button to check on the status of your tasks. Click on the bar that +says “vcf2gds”, click “View Logs”, and click one of the “vcf2gds” +folders (there’s one per chromosome). In here you can see detailed logs +of the job; take a look at the job.out.log and +job.err.log – these can be useful for debugging issues.

+

The output of this analysis will be a set of 22 GDS files, one per +chromosome.

+

You can find the expected output of this analysis by looking at the +existing task 01 Convert VCF to GDS in the Tasks menu of +your Project.

+
+
+ + + + +
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/02.A_pop_structure_relatedness.Rmd b/02.A_pop_structure_relatedness.Rmd
new file mode 100644
index 0000000..942591e
--- /dev/null
+++ b/02.A_pop_structure_relatedness.Rmd
@@ -0,0 +1,340 @@
+# 2.A. Population Structure and Relatedness Inference
+
+Population structure due to genetic ancestry and genetic correlation due to sample relatedness are important factors to consider when performing association tests, as their presence can lead to mis-calibration of test statistics and false positive associations. This tutorial demonstrates how to perform population structure and relatedness inference using the [GENESIS](https://bioconductor.org/packages/release/bioc/html/GENESIS.html) and [SNPRelate](https://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html) R/Bioconductor packages. We use the results of this inference to perform association tests in the GWAS tutorials.
+
+For this tutorial, we have provided a GDS file combined across all chromosomes and filtered to common variants with minor allele frequency (MAF) $> 5\%$ in the sample.
+
+## LD-pruning
+
+We generally advise that population structure and relatedness inference be performed using a set of (nearly) independent genetic variants. To find this set of variants, we perform linkage-disequilibrium (LD) pruning on the study sample set. We typically use an LD threshold of `r^2 < 0.1` to select variants. We use the [SNPRelate](https://github.com/zhengxwen/SNPRelate) package to perform LD-pruning with GDS files.
+ +```{r ld-pruning, message = FALSE} +library(SeqArray) +repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main" + +# use a GDS file with all chromosomes +if (!dir.exists("data")) dir.create("data") +gdsfile <- "data/1KG_phase3_GRCh38_subset_maf05_ALL_chr.gds" +if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile) +gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open +gds <- seqOpen(gdsfile) + +# run LD pruning +library(SNPRelate) +set.seed(100) # LD pruning has a random element; so make this reproducible +snpset <- snpgdsLDpruning(gds, + maf = 0.01, + ld.threshold=sqrt(0.1)) + +# how many variants on each chr? +sapply(snpset, length) + +# get the full list of LD-pruned variants +pruned <- unlist(snpset, use.names=FALSE) +length(pruned) +``` + +## Computing a GRM + +We can use the [SNPRelate](https://github.com/zhengxwen/SNPRelate) package to compute a Genetic Relationship matrix (GRM) using GDS files. A GRM captures genetic correlation among samples due to both distant ancestry (i.e. population structure) and recent kinship (i.e. familial relatedness) in a single matrix. + +SNPRelate offers several algorithms for computing a GRM, including the commonly-used GCTA [Yang et al 2011](https://www.ncbi.nlm.nih.gov/pubmed/21167468) (`method = "GCTA"`) and allelic matching based estimators described by [Weir and Goudet 2017](https://www.ncbi.nlm.nih.gov/pubmed/28550018) (`method = "IndivBeta"`). + +```{r grm} +# compute the GRM using the GCTA method +library(SNPRelate) +grm <- snpgdsGRM(gds, method="GCTA", snp.id = pruned) +names(grm) +dim(grm$grm) + +# look at the top corner of the matrix +grm$grm[1:5,1:5] +``` + + +## De-convoluting ancestry and relatedness + +To disentangle distant ancestry (i.e. population structure) from recent kinship (i.e. familial relatedness), we implement the analysis described in [Conomos et al., 2016](https://www.cell.com/ajhg/fulltext/S0002-9297(15)00496-6). 
This approach uses the [KING-robust](http://www.ncbi.nlm.nih.gov/pubmed/20926424), [PC-AiR](http://www.ncbi.nlm.nih.gov/pubmed/25810074), and [PC-Relate](http://www.ncbi.nlm.nih.gov/pubmed/26748516) methods.
+
+### KING-robust
+
+Step 1 is to get initial kinship estimates using [KING-robust](http://www.ncbi.nlm.nih.gov/pubmed/20926424), which is robust to discrete population structure but not ancestry admixture. KING-robust will be able to identify close relatives (e.g. 1st and 2nd degree) reliably, but may identify spurious pairs or miss more distant pairs of relatives in the presence of admixture. KING is available as its own software, but the KING-robust algorithm is also available in SNPRelate via the function `snpgdsIBDKING`.
+
+```{r king}
+# run KING-robust
+king <- snpgdsIBDKING(gds, snp.id=pruned)
+names(king)
+
+# extract the kinship estimates
+kingMat <- king$kinship
+colnames(kingMat) <- rownames(kingMat) <- king$sample.id
+dim(kingMat)
+# look at the top corner of the matrix
+kingMat[1:5,1:5]
+```
+
+We extract pairwise kinship estimates and IBS0 values (the proportion of variants for which the pair of individuals share 0 alleles identical by state) to plot. We use a hexbin plot to visualize the relatedness for all pairs of samples.
+
+```{r king_plot}
+kinship <- snpgdsIBDSelection(king)
+head(kinship)
+
+library(ggplot2)
+ggplot(kinship, aes(IBS0, kinship)) +
+    geom_hline(yintercept=2^(-seq(3,9,2)/2), linetype="dashed", color="grey") +
+    geom_hex(bins = 100) +
+    ylab("kinship estimate") +
+    theme_bw()
+```
+
+We see a few parent-offspring, full sibling, 2nd degree, and 3rd degree relative pairs. The abundance of negative estimates represents pairs of individuals who have ancestry from different populations -- the magnitude of the negative relationship is informative of how different their ancestries are; more on this below.
+ +### PC-AiR + +The next step is [PC-AiR](http://www.ncbi.nlm.nih.gov/pubmed/25810074), which provides robust population structure inference in samples with kinship and pedigree structure. PC-AiR is available in the GENESIS package via the function `pcair`. + +First, PC-AiR partitions the full sample set into a set of mutually unrelated samples that is maximally informative about all ancestries in the sample (i.e. the unrelated set) and their relatives (i.e. the related set). We use a 3rd degree kinship threshold (`kin.thresh = 2^(-9/2)`), which corresponds to first cousins -- this defines anyone less related than first cousins as "unrelated". We use the negative KING-robust estimates as "ancestry divergence" measures (`divMat`) to identify pairs of samples with different ancestry -- we preferentially select individuals with many negative estimates for the unrelated set to ensure ancestry representation. For now, we also use the KING-robust estimates as our kinship measures (`kinMat`); more on this below. + +Once the unrelated and related sets are identified, PC-AiR performs a standard Principal Component Analysis (PCA) on the unrelated set of individuals and then projects the relatives onto the PCs. Under the hood, PC-AiR uses the SNPRelate package for efficient PC computation and projection. + +```{r pcair1} +# run PC-AiR +library(GENESIS) +pca <- pcair(gds, + kinobj = kingMat, + kin.thresh = 2^(-9/2), + divobj = kingMat, + div.thresh = -2^(-9/2), + snp.include = pruned) + +names(pca) + +# the unrelated set of samples +length(pca$unrels) +head(pca$unrels) + +# the related set of samples +length(pca$rels) +head(pca$rels) + +# extract the top 12 PCs and make a data.frame +pcs <- data.frame(pca$vectors[,1:12]) +colnames(pcs) <- paste0('PC', 1:12) +pcs$sample.id <- pca$sample.id +dim(pcs) +head(pcs) + +# save output +save(pcs, file="data/pcs.RData") +``` + +We'd like to determine which PCs are ancestry informative. 
To do this we look at the PCs in conjunction with reported population information for the 1000 Genomes samples. This information is stored in the provided `pheno_data.tsv` file. We make a parallel coordinates plot and pairwise scatter plots, color-coding samples by 1000 Genomes population labels. + +```{r pcair1_parcoord, message = FALSE} +phenfile <- "data/pheno_data.tsv" +if (!file.exists(phenfile)) download.file(file.path(repo_path, phenfile), phenfile) +phen <- read.table(phenfile, header = TRUE, sep = "\t", as.is = TRUE) +head(phen) + +pc.df <- merge(pcs, phen[,c("sample.id", "pop")], by = "sample.id") + +library(GGally) +library(RColorBrewer) +pop.cols <- setNames(brewer.pal(12, "Paired"), + c("ACB", "ASW", "YRI", "CEU", "GBR", "TSI", "GIH", "CHB", "JPT", "MXL", "PUR")) +ggparcoord(pc.df, columns=2:13, groupColumn="pop", scale="uniminmax") + + scale_color_manual(values=pop.cols) + + xlab("PC") + ylab("") +``` + +```{r} +ggplot(pc.df, aes(x = PC1, y = PC2)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols) +ggplot(pc.df, aes(x = PC3, y = PC4)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols) +ggplot(pc.df, aes(x = PC5, y = PC6)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols) +ggplot(pc.df, aes(x = PC7, y = PC8)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols) +ggplot(pc.df, aes(x = PC9, y = PC10)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols) +``` + +It appears as though PCs 1-7 separate the populations in our study and sufficiently capture the population structure in our sample. + +### PC-Relate + +The next step is [PC-Relate](http://www.ncbi.nlm.nih.gov/pubmed/26748516), which provides accurate kinship inference, even in the presence of population structure and ancestry admixture, by conditioning on ancestry informative PCs. As we saw above, PCs 1-7 separate populations in our study, so we condition on PCs 1-7 in our PC-Relate analysis. 
PC-Relate can be performed using the `pcrelate` function in GENESIS, which expects a `SeqVarIterator` object for the genotype data. The `training.set` argument allows for specification of which samples to use to "learn" the ancestry adjustment -- we recommend the unrelated set from the PC-AiR analysis.
+
+(NOTE: this will take a few minutes to run).
+
+```{r pcrelate1}
+seqResetFilter(gds, verbose=FALSE)
+library(SeqVarTools)
+seqData <- SeqVarData(gds)
+
+# filter the GDS object to our LD-pruned variants
+seqSetFilter(seqData, variant.id=pruned)
+iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
+
+pcrel <- pcrelate(iterator,
+                  pcs=pca$vectors[,c(1:7)],
+                  training.set=pca$unrels)
+names(pcrel)
+
+# relatedness between pairs of individuals
+dim(pcrel$kinBtwn)
+head(pcrel$kinBtwn)
+
+# self-kinship estimates
+dim(pcrel$kinSelf)
+head(pcrel$kinSelf)
+
+# save output
+save(pcrel, file="data/pcrelate.RData")
+```
+
+We plot the pairwise kinship estimates against the IBD0 (`k0`) estimates (the proportion of variants for which the pair of individuals share 0 alleles identical by descent). We use a hexbin plot to visualize the relatedness for all pairs of samples.
+
+```{r pcrelate1_plot}
+ggplot(pcrel$kinBtwn, aes(k0, kin)) +
+    geom_hline(yintercept=2^(-seq(5,9,2)/2), linetype="dashed", color="grey") +
+    geom_hex(bins = 100) +
+    geom_abline(intercept = 0.25, slope = -0.25) +
+    ylab("kinship estimate") +
+    theme_bw()
+```
+
+We get very similar inference for 1st and 2nd degree relatives, but we also see that the PC-Relate relatedness estimates for unrelated pairs (i.e. kin ~ 0 and k0 ~ 1) are much closer to expectation than those from KING-robust.
+
+
+### Advanced Notes
+
+In small samples (such as this one), we recommend performing a second iteration of PC-AiR and PC-Relate.
Using the first iteration of PC-Relate ancestry-adjusted kinship estimates, we can often better partition our sample into unrelated and related sets, which leads to better ancestry PCs from PC-AiR and better relatedness estimates from PC-Relate. The steps to perform the second iteration are: + +1. Perform a second PC-AiR analysis using the PC-Relate kinship matrix as the kinship estimates. Still use the (original) KING-robust matrix for the ancestry divergence estimates. +2. Perform a second PC-Relate analysis using the new PC-AiR PCs to adjust for population structure. + + +In large samples (such as TOPMed), the KING-robust and PC-Relate analyses can be very time consuming, so we suggest the following alternative procedure for deconvoluting ancestry and relatedness: + +1. Use the [KING-IBDseg method](https://www.kingrelatedness.com/manual.shtml#IBDSEG) to estimate kinship for all pairs of individuals. This method uses a fast algorithm to approximate IBD segments based on identity by state and works well in the presence of ancestry admixture -- in our experience, it gives very similar results to PC-Relate. The software to run KING-IBDseg uses PLINK files, but we've written a `KING IBDseg` application on the BioData Catalyst powered by Seven Bridges platform that will use GDS files as input (it does a conversion from GDS to PLINK). +2. Perform a PC-AiR analysis to infer population structure using the KING-IBDseg estimates as the kinship estimates and using no ancestry divergence measures (in large samples, the PC-AiR algorithm works well even without the ancestry divergence). + + +## Exercise 2.A.1 (Application) + +Use the `LD Pruning` app on the BioData Catalyst powered by Seven Bridges platform to perform LD pruning on the example 1000 Genomes GDS files you created previously. 
The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+  - Click: Public Resources > Workflows and Tools > Browse
+  - Search for `LD Pruning`
+  - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+  - Click: Apps > `LD Pruning` > Run
+  - Specify the Inputs:
+    - GDS Files: `1KG_phase3_GRCh38_subset_chr<CHR>.gds` (select all 22 chromosomes)
+  - Specify the App Settings:
+    - Autosomes only: TRUE
+    - LD |r| threshold: 0.32 ($\approx r^2 = 0.1$)
+    - MAF threshold: 0.05
+    - Output prefix: "1KG_phase3_GRCh38_subset" (or any other string to name the output file)
+  - Click: Run
+
+The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project. Use the "View stats & logs" button to check on the status of your tasks. Click on the bar that says "ld_pruning", click "View Logs", and click one of the "ld_pruning" folders (there's one per chromosome). In here you can see detailed logs of the job; take a look at the `job.out.log` and `job.err.log` -- these can be useful for debugging issues.
+
+The output of this analysis will be a single file (`<output prefix>_pruned.gds`) with the genotype data from the LD pruned variants across all 22 chromosomes.
+
+You can find the expected output of this analysis by looking at the existing task `02 LD Pruning` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.
+
+
+## Exercise 2.A.2 (Application)
+
+Use the `KING robust` app on the BioData Catalyst powered by Seven Bridges platform to perform a KING-robust analysis of the example 1000 Genomes data using the LD pruned variants.
The steps to perform this analysis are as follows: + +- Copy the app to your project if it is not already there: + - Click: Public Resources > Workflows and Tools > Browse + - Search for `KING robust` + - Click: Copy > Select your project > Copy +- Run the analysis in your project: + - Click: Apps > `KING robust` > Run + - Specify the Inputs: + - GDS file: `1KG_phase3_GRCh38_subset_pruned.gds` (has pruned variants from all 22 chromosomes) + - Specify the App Settings: + - kinship_plots > Kinship plotting threshold: 0 + - Output prefix: "1KG_phase3_GRCh38_subset" (or any other string to name the output file) + - Click: Run + +The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed. + +The output of this analysis will be a `_king_robust.gds` file that has the kinship estimates and a `_king_robust_all.pdf` with the plot of estimated kinship vs. IBS0. Look at the kinship plot -- how many 1st degree relative pairs are identified? How many 2nd degree relative pairs are identified? + +You can find the expected output of this analysis by looking at the existing task `03 King robust` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise. + + +### Solution 2.A.2 (Application) + +From the kinship plot, we can see that there are 6 1st degree relative pairs (5 parent-offspring; 1 full sibling) and 3 2nd degree relative pairs identified. + + +## Exercise 2.A.3 (Application) + +Use the `PC-AiR` app on the BioData Catalyst powered by Seven Bridges platform to perform a PC-AiR analysis of the example 1000 Genomes data using the LD pruned variants. Use the KING-robust kinship estimates as both the kinship and divergence measures. 
The steps to perform this analysis are as follows: + +- Copy the app to your project if it is not already there: + - Click: Public Resources > Workflows and Tools > Browse + - Search for `PC-AiR` + - Click: Copy > Select your project > Copy +- Run the analysis in your project: + - Click: Apps > `PC-AiR` > Run + - Specify the Inputs: + - Pruned GDS File: `1KG_phase3_GRCh38_subset_pruned.gds` (has pruned variants from all 22 chromosomes) + - Kinship File: `1KG_phase3_GRCh38_subset_king_robust.gds` + - Divergence File: `1KG_phase3_GRCh38_subset_king_robust.gds` + - Phenotype file: `pheno_annotated.RData` + - Specify the App Settings: + - pca_byrel > Number of PCs: 12 + - pca_plots > Group: pop (sample variable for coloring output PCA) + - pca_plots > Number of PCs: 12 + - PC-variant correlation > Run PC-variant correlation: FALSE (extra diagnostic step that takes a while to run) + - Output prefix: "1KG_phase3_GRCh38_subset" (or any other string to name the output file) + - Click: Run + +The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed. + +The output of this analysis is a "_pca.RData" object with the PC values and several PC plots. Look at the parallel coordinates plot ("`_parcoord.pdf`") generated by the task. How many PCs appear to reflect population structure in the sample? This will determine how many PCs you should use to adjust PC-Relate in the next exercise. + +You can find the expected output of this analysis by looking at the existing task `04 PC-AiR` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise. + +### Solution 2.A.3 (Application) + +From the parallel coordinates plot, we can see that PCs 1-7 appear to reflect population structure. 
We will use the first 7 PCs to adjust for population structure in the PC-Relate analysis in the next exercise.
+
+
+## Exercise 2.A.4 (Application)
+
+Use the `PC-Relate` app on the BioData Catalyst powered by Seven Bridges platform to perform a PC-Relate analysis of the example 1000 Genomes data using the LD pruned variants. Use the PC-AiR PCs to adjust for population structure. The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+  - Click: Public Resources > Workflows and Tools > Browse
+  - Search for `PC-Relate`
+  - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+  - Click: Apps > `PC-Relate` > Run
+  - Specify the Inputs:
+    - GDS File: `1KG_phase3_GRCh38_subset_pruned.gds` (has pruned variants from all 22 chromosomes)
+    - PCA file: `1KG_phase3_GRCh38_subset_pca.RData`
+  - Specify the App Settings:
+    - pcrelate_beta > Number of PCs: 7
+    - pcrelate > Number of PCs: 7
+    - pcrelate > Return IBD probabilities?: TRUE
+    - pcrelate_correct > Sparse threshold: -1 (to keep full dense matrix)
+    - kinship_plots > Kinship plotting threshold: 0
+    - kinship_plots > Return IBD probabilities?: TRUE
+    - Output prefix: "1KG_phase3_GRCh38_subset" (or any other string to name the output file)
+  - Click: Run
+
+The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
+
+The output of this analysis is a `_pcrelate.RData` file that contains the PC-Relate relatedness estimates, a `_pcrelate_Matrix.RData` file that contains a sparse matrix of kinship estimates (more on this in the Advanced GWAS tutorial), and a `_pcrelate_all.pdf` with the plot of estimated kinship vs. k0. Look at the kinship plot -- how many 1st degree relative pairs are identified? How many 2nd degree relative pairs are identified?
+ +You can find the expected output of this analysis by looking at the existing task `05 PC-Relate` in the Tasks menu of your Project. + +### Solution 2.A.4 (Application) + +From the kinship plot, you can see that there are 6 1st degree relative pairs (5 parent-offspring; 1 full sibling) and 2 2nd degree relative pairs identified. diff --git a/02.A_pop_structure_relatedness.html b/02.A_pop_structure_relatedness.html new file mode 100644 index 0000000..1c43ab3 --- /dev/null +++ b/02.A_pop_structure_relatedness.html @@ -0,0 +1,1098 @@ + + + + + + + + + + + + + +02.A_pop_structure_relatedness.knit + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

2.A. Population Structure and Relatedness Inference

+

Population structure due to genetic ancestry and genetic correlation +due to sample relatedness are important factors to consider when +performing association tests as their presence can lead to +mis-calibration of test statistics and false positive associations. This +tutorial demonstrates how to perform population structure and +relatedness inference using the GENESIS +and SNPRelate +R/Bioconductor packages. We use the results of this inference to perform +association tests in the GWAS tutorials.

+

For this tutorial we have provided a GDS file combined across all chromosomes and filtered to common variants with minor allele frequency (MAF) \(> 5\%\) in the sample.

+
+

LD-pruning

+

We generally advise that population structure and relatedness +inference be performed using a set of (nearly) independent genetic +variants. To find this set of variants, we perform +linkage-disequilibrium (LD) pruning on the study sample set. We +typically use an LD threshold of r^2 < 0.1 to select +variants. We use the SNPRelate package to +perform LD-pruning with GDS files.

+
library(SeqArray)
+repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+
+# use a GDS file with all chromosomes
+if (!dir.exists("data")) dir.create("data")
+gdsfile <- "data/1KG_phase3_GRCh38_subset_maf05_ALL_chr.gds"
+if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
+gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
+gds <- seqOpen(gdsfile)
+
+# run LD pruning
+library(SNPRelate)
+set.seed(100) # LD pruning has a random element; so make this reproducible
+snpset <- snpgdsLDpruning(gds, 
+                          maf = 0.01,
+                          ld.threshold=sqrt(0.1))
+
## SNV pruning based on LD:
+## Calculating allele counts/frequencies ...
+## 
+[..................................................]  0%, ETC: ---    
+[==================================================] 100%, completed, 3s
+## # of selected variants: 305,768
+##     # of samples: 1,040
+##     # of SNVs: 305,768
+##     using 1 thread
+##     sliding window: 500,000 basepairs, Inf SNPs
+##     |LD| threshold: 0.316228
+##     method: composite
+## Chromosome 1: 35.23%, 4,412/12,525
+## Chromosome 2: 32.00%, 4,308/13,461
+## Chromosome 3: 25.45%, 4,208/16,534
+## Chromosome 4: 20.79%, 4,047/19,463
+## Chromosome 5: 29.29%, 3,910/13,349
+## Chromosome 6: 24.22%, 3,852/15,902
+## Chromosome 7: 32.08%, 3,794/11,826
+## Chromosome 8: 25.57%, 3,641/14,242
+## Chromosome 9: 26.08%, 3,551/13,615
+## Chromosome 10: 28.30%, 3,731/13,185
+## Chromosome 11: 26.79%, 3,459/12,913
+## Chromosome 12: 25.16%, 3,689/14,663
+## Chromosome 13: 21.49%, 3,160/14,703
+## Chromosome 14: 24.79%, 3,036/12,246
+## Chromosome 15: 19.98%, 3,175/15,887
+## Chromosome 16: 23.86%, 3,238/13,569
+## Chromosome 17: 26.51%, 3,077/11,608
+## Chromosome 18: 18.99%, 3,153/16,603
+## Chromosome 19: 24.71%, 2,908/11,769
+## Chromosome 20: 22.91%, 2,818/12,302
+## Chromosome 21: 17.48%, 2,245/12,844
+## Chromosome 22: 18.56%, 2,331/12,559
+## 75,743 markers are selected in total.
+
# how many variants on each chr?
+sapply(snpset, length)
+
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
+##  4412  4308  4208  4047  3910  3852  3794  3641  3551  3731  3459  3689  3160 
+## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 
+##  3036  3175  3238  3077  3153  2908  2818  2245  2331
+
# get the full list of LD-pruned variants 
+pruned <- unlist(snpset, use.names=FALSE)
+length(pruned)
+
## [1] 75743
+
+
+

Computing a GRM

+

We can use the SNPRelate package to +compute a Genetic Relationship matrix (GRM) using GDS files. A GRM +captures genetic correlation among samples due to both distant ancestry +(i.e. population structure) and recent kinship (i.e. familial +relatedness) in a single matrix.

+

SNPRelate offers several algorithms for computing a GRM, including +the commonly-used GCTA Yang et al 2011 +(method = "GCTA") and allelic matching based estimators +described by Weir +and Goudet 2017 (method = "IndivBeta").

+
# compute the GRM using the GCTA method
+library(SNPRelate)
+grm <- snpgdsGRM(gds, method="GCTA", snp.id = pruned)
+
## Genetic Relationship Matrix (GRM, GCTA):
+## Calculating allele counts/frequencies ...
+## 
+[..................................................]  0%, ETC: ---    
+[==================================================] 100%, completed, 2s
+## # of selected variants: 75,743
+##     # of samples: 1,040
+##     # of SNVs: 75,743
+##     using 1 thread
+## CPU capabilities:
+## Mon Jun  3 19:07:49 2024    (internal increment: 472)
+## 
+[..................................................]  0%, ETC: ---        
+[==================================================] 100%, completed, 41s
+## Mon Jun  3 19:08:30 2024    Done.
+
names(grm)
+
## [1] "sample.id" "snp.id"    "method"    "grm"
+
dim(grm$grm)
+
## [1] 1040 1040
+
# look at the top corner of the matrix
+grm$grm[1:5,1:5]
+
##            [,1]       [,2]       [,3]       [,4]       [,5]
+## [1,] 0.94936208 0.07507070 0.08290501 0.07041950 0.07981302
+## [2,] 0.07507070 0.95193999 0.06952752 0.08517712 0.07612110
+## [3,] 0.08290501 0.06952752 0.94074070 0.09478942 0.08585627
+## [4,] 0.07041950 0.08517712 0.09478942 0.94634467 0.09946362
+## [5,] 0.07981302 0.07612110 0.08585627 0.09946362 0.94186583
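The GRM comes back as a bare numeric matrix. A small sketch (not part of the tutorial's analysis) of attaching the sample IDs for downstream lookups, mirroring what is done for the KING-robust matrix later in this tutorial; it assumes the `grm` object returned by `snpgdsGRM` above:

```r
# sketch: label the GRM rows/columns with sample IDs
# (grm$sample.id is returned by snpgdsGRM alongside grm$grm)
grmMat <- grm$grm
colnames(grmMat) <- rownames(grmMat) <- grm$sample.id
grmMat[1:3, 1:3]
```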
+
+
+

De-convoluting ancestry and relatedness

+

To disentangle distant ancestry (i.e. population structure) from +recent kinship (i.e. familial relatedness), we implement the analysis +described in Conomos +et al., 2016. This approach uses the KING-robust, PC-AiR, and PC-Relate +methods.

+
+

KING-robust

+

Step 1 is to get initial kinship estimates using KING-robust, +which is robust to discrete population structure but not ancestry +admixture. KING-robust will be able to identify close relatives +(e.g. 1st and 2nd degree) reliably, but may identify spurious pairs or +miss more distant pairs of relatives in the presence of admixture. KING +is available as its own software, but the KING-robust algorithm is also +available in SNPRelate via the function snpgdsIBDKING.

+
# run KING-robust
+king <- snpgdsIBDKING(gds, snp.id=pruned)
+
## IBD analysis (KING method of moment) on genotypes:
+## Calculating allele counts/frequencies ...
+## 
+[..................................................]  0%, ETC: ---    
+[==================================================] 100%, completed, 2s
+## # of selected variants: 75,743
+##     # of samples: 1,040
+##     # of SNVs: 75,743
+##     using 1 thread
+## No family is specified, and all individuals are treated as singletons.
+## Relationship inference in the presence of population stratification.
+## CPU capabilities:
+## Mon Jun  3 19:08:33 2024    (internal increment: 15104)
+## 
+[..................................................]  0%, ETC: ---        
+[==================================================] 100%, completed, 4s
+## Mon Jun  3 19:08:37 2024    Done.
+
names(king)
+
## [1] "sample.id" "snp.id"    "afreq"     "IBS0"      "kinship"
+
# extract the kinship estimates
+kingMat <- king$kinship
+colnames(kingMat) <- rownames(kingMat) <- king$sample.id
+dim(kingMat)
+
## [1] 1040 1040
+
# look at the top corner of the matrix
+kingMat[1:5,1:5]
+
##            HG00096    HG00097    HG00099    HG00100    HG00101
+## HG00096 0.50000000 0.01407729 0.01878175 0.01381189 0.02020718
+## HG00097 0.01407729 0.50000000 0.01040445 0.02046941 0.01361374
+## HG00099 0.01878175 0.01040445 0.50000000 0.02828253 0.02300703
+## HG00100 0.01381189 0.02046941 0.02828253 0.50000000 0.03461278
+## HG00101 0.02020718 0.01361374 0.02300703 0.03461278 0.50000000
+

We extract pairwise kinship estimates and IBS0 values (the proportion of variants for which the pair of individuals share 0 alleles identical by state) to plot. We use a hexbin plot to visualize the relatedness for all pairs of samples.

+
kinship <- snpgdsIBDSelection(king)
+head(kinship)
+
##       ID1     ID2       IBS0    kinship
+## 1 HG00096 HG00097 0.04726509 0.01407729
+## 2 HG00096 HG00099 0.04659176 0.01878175
+## 3 HG00096 HG00100 0.04693503 0.01381189
+## 4 HG00096 HG00101 0.04587883 0.02020718
+## 5 HG00096 HG00102 0.04582602 0.01819071
+## 6 HG00096 HG00103 0.04657856 0.01417400
+
library(ggplot2)
+ggplot(kinship, aes(IBS0, kinship)) +
+    geom_hline(yintercept=2^(-seq(3,9,2)/2), linetype="dashed", color="grey") +
+    geom_hex(bins = 100) +
+    ylab("kinship estimate") +
+    theme_bw()
+

+

We see a few parent-offspring, full sibling, 2nd degree, and 3rd degree relative pairs. The many negative estimates correspond to pairs of individuals with ancestry from different populations – the magnitude of the negative estimate reflects how different their ancestries are; more on this below.
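To turn the plot into an explicit list of inferred close relatives, one option (a sketch, not part of the tutorial) is to threshold the KING-robust estimates at the conventional kinship cutoff for 2nd degree relatives, 2^(-7/2) ≈ 0.088; it assumes the `kinship` data frame from `snpgdsIBDSelection` above:

```r
# sketch: pairs inferred as 2nd degree or closer by KING-robust
close <- kinship[kinship$kinship > 2^(-7/2), ]
# sort the flagged pairs from most to least related
close[order(close$kinship, decreasing = TRUE), ]
```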

+
+
+

PC-AiR

+

The next step is PC-AiR, which +provides robust population structure inference in samples with kinship +and pedigree structure. PC-AiR is available in the GENESIS package via +the function pcair.

+

First, PC-AiR partitions the full sample set into a set of mutually +unrelated samples that is maximally informative about all ancestries in +the sample (i.e. the unrelated set) and their relatives (i.e. the +related set). We use a 3rd degree kinship threshold +(kin.thresh = 2^(-9/2)), which corresponds to first cousins +– this defines anyone less related than first cousins as “unrelated”. We +use the negative KING-robust estimates as “ancestry divergence” measures +(divMat) to identify pairs of samples with different +ancestry – we preferentially select individuals with many negative +estimates for the unrelated set to ensure ancestry representation. For +now, we also use the KING-robust estimates as our kinship measures +(kinMat); more on this below.

+

Once the unrelated and related sets are identified, PC-AiR performs a +standard Principal Component Analysis (PCA) on the unrelated set of +individuals and then projects the relatives onto the PCs. Under the +hood, PC-AiR uses the SNPRelate package for efficient PC computation and +projection.

+
# run PC-AiR
+library(GENESIS)
+pca <- pcair(gds,
+            kinobj = kingMat,
+            kin.thresh = 2^(-9/2),
+            divobj = kingMat,
+            div.thresh = -2^(-9/2),
+            snp.include = pruned)
+
## Using kinobj and divobj to partition samples into unrelated and related sets
+
## Working with 1040 samples
+
## Identifying relatives for each sample using kinship threshold 0.0441941738241592
+
## Identifying pairs of divergent samples using divergence threshold -0.0441941738241592
+
## Partitioning samples into unrelated and related sets...
+
## Unrelated Set: 1014 Samples 
+## Related Set: 26 Samples
+
## Performing PCA on the Unrelated Set...
+
## Principal Component Analysis (PCA) on genotypes:
+## Calculating allele counts/frequencies ...
+## 
+[..................................................]  0%, ETC: ---    
+[==================================================] 100%, completed, 2s
+## # of selected variants: 75,743
+##     # of samples: 1,014
+##     # of SNVs: 75,743
+##     using 1 thread
+##     # of principal components: 32
+## CPU capabilities:
+## Mon Jun  3 19:08:42 2024    (internal increment: 484)
+## 
+[..................................................]  0%, ETC: ---        
+[==================================================] 100%, completed, 40s
+## Mon Jun  3 19:09:22 2024    Begin (eigenvalues and eigenvectors)
+## Mon Jun  3 19:09:23 2024    Done.
+
## Predicting PC Values for the Related Set...
+
## SNP Loading:
+##     # of samples: 1,014
+##     # of SNPs: 75,743
+##     using 1 thread
+##     using the top 32 eigenvectors
+## Mon Jun  3 19:09:23 2024    (internal increment: 3876)
+## 
+[..................................................]  0%, ETC: ---        
+[==================================================] 100%, completed, 3s
+## Mon Jun  3 19:09:26 2024    Done.
+## Sample Loading:
+##     # of samples: 26
+##     # of SNPs: 75,743
+##     using 1 thread
+##     using the top 32 eigenvectors
+## Mon Jun  3 19:09:26 2024    (internal increment: 65536)
+## 
+[..................................................]  0%, ETC: ---        
+[==================================================] 100%, completed, 2s
+## Mon Jun  3 19:09:28 2024    Done.
+
names(pca)
+
##  [1] "vectors"    "values"     "rels"       "unrels"     "kin.thresh"
+##  [6] "div.thresh" "sample.id"  "nsamp"      "nsnps"      "varprop"   
+## [11] "call"       "method"
+
# the unrelated set of samples
+length(pca$unrels)
+
## [1] 1014
+
head(pca$unrels)
+
## [1] "HG00096" "HG00097" "HG00099" "HG00100" "HG00101" "HG00102"
+
# the related set of samples
+length(pca$rels)
+
## [1] 26
+
head(pca$rels)
+
## [1] "NA20278" "NA20900" "NA20274" "NA20299" "NA19661" "NA19657"
+
# extract the top 12 PCs and make a data.frame
+pcs <- data.frame(pca$vectors[,1:12])
+colnames(pcs) <- paste0('PC', 1:12)
+pcs$sample.id <- pca$sample.id
+dim(pcs)
+
## [1] 1040   13
+
head(pcs)
+
##                PC1         PC2        PC3         PC4           PC5         PC6
+## HG00096 0.01800549 -0.03361957 0.01541131 -0.02085591 -0.0037719379 -0.04669896
+## HG00097 0.01742309 -0.03361182 0.01304470 -0.01992104 -0.0095947596 -0.04545408
+## HG00099 0.01799001 -0.03418633 0.01152677 -0.01694420  0.0011408340 -0.05178681
+## HG00100 0.01793750 -0.03378008 0.01491137 -0.01954784  0.0045447134 -0.04375806
+## HG00101 0.01759459 -0.03318057 0.01568040 -0.01892110 -0.0009236657 -0.05762760
+## HG00102 0.01796459 -0.03344242 0.01286241 -0.01767438  0.0022747910 -0.06528352
+##                PC7          PC8         PC9          PC10         PC11
+## HG00096 0.01985608  0.008985902 0.026996901  0.0041579044 -0.003803835
+## HG00097 0.01034344  0.025550498 0.011886491 -0.0223288469 -0.020732328
+## HG00099 0.02212758  0.012950846 0.028599645 -0.0046117845 -0.004731816
+## HG00100 0.02055287 -0.012583055 0.003434313  0.0106009372  0.007842334
+## HG00101 0.03443987  0.018017382 0.022045858  0.0003858169 -0.007009917
+## HG00102 0.03392455  0.015166974 0.006720184  0.0019921179  0.001414767
+##                 PC12 sample.id
+## HG00096  0.011771859   HG00096
+## HG00097  0.002096803   HG00097
+## HG00099  0.004180146   HG00099
+## HG00100  0.006986524   HG00100
+## HG00101 -0.005164265   HG00101
+## HG00102  0.003916856   HG00102
+
# save output
+save(pcs, file="data/pcs.RData")
+

We’d like to determine which PCs are ancestry informative. To do this +we look at the PCs in conjunction with reported population information +for the 1000 Genomes samples. This information is stored in the provided +pheno_data.tsv file. We make a parallel coordinates plot +and pairwise scatter plots, color-coding samples by 1000 Genomes +population labels.

+
phenfile <- "data/pheno_data.tsv"
+if (!file.exists(phenfile)) download.file(file.path(repo_path, phenfile), phenfile)
+phen <- read.table(phenfile, header = TRUE, sep = "\t", as.is = TRUE)
+head(phen)
+
##   sample.id pop super_pop    sex age trait_1 trait_2 status
+## 1   HG00096 GBR       EUR   male  56    23.5     0.1      0
+## 2   HG00097 GBR       EUR female  56    23.9     0.1      0
+## 3   HG00099 GBR       EUR female  48    23.8     0.3      0
+## 4   HG00100 GBR       EUR female  54    24.9     0.2      0
+## 5   HG00101 GBR       EUR   male  65    27.6     0.7      1
+## 6   HG00102 GBR       EUR female  53    24.5     0.4      0
+
pc.df <- merge(pcs, phen[,c("sample.id", "pop")], by = "sample.id")
+
+library(GGally)
+library(RColorBrewer)
+pop.cols <- setNames(brewer.pal(12, "Paired"),
+                 c("ACB", "ASW", "YRI", "CEU", "GBR", "TSI", "GIH", "CHB", "JPT", "MXL", "PUR"))
+ggparcoord(pc.df, columns=2:13, groupColumn="pop", scale="uniminmax") +
+    scale_color_manual(values=pop.cols) +
+    xlab("PC") + ylab("")
+

+
ggplot(pc.df, aes(x = PC1, y = PC2)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols)
+

+
ggplot(pc.df, aes(x = PC3, y = PC4)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols)
+

+
ggplot(pc.df, aes(x = PC5, y = PC6)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols)
+

+
ggplot(pc.df, aes(x = PC7, y = PC8)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols)
+

+
ggplot(pc.df, aes(x = PC9, y = PC10)) + geom_point(aes(color = pop)) + scale_color_manual(values = pop.cols)
+

+

It appears as though PCs 1-7 separate the populations in our study +and sufficiently capture the population structure in our sample.
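The eigenvalues can corroborate this choice of PCs. A quick scree-style check (a sketch, not part of the tutorial), using the `varprop` element that `pcair` returns (it appears in the `names(pca)` output above):

```r
# sketch: proportion of variance explained by each of the top PCs
round(pca$varprop[1:12], 4)
plot(pca$varprop[1:12], type = "b", xlab = "PC",
     ylab = "Proportion of variance explained")
```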

+
+
+

PC-Relate

+

The next step is PC-Relate, which +provides accurate kinship inference, even in the presence of population +structure and ancestry admixture, by conditioning on ancestry +informative PCs. As we saw above, PCs 1-7 separate populations in our +study, so we condition on PCs 1-7 in our PC-Relate analysis. PC-Relate +can be performed using the pcrelate function in GENESIS, +which expects a SeqVarIterator object for the genotype +data. The training.set argument allows for specification of +which samples to use to “learn” the ancestry adjustment – we recommend +the unrelated set from the PC-AiR analysis.

+

(NOTE: this will take a few minutes to run).

+
seqResetFilter(gds, verbose=FALSE)
+library(SeqVarTools)
+seqData <- SeqVarData(gds)
+
+# filter the GDS object to our LD-pruned variants
+seqSetFilter(seqData, variant.id=pruned)
+
## # of selected variants: 75,743
+
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
+
+pcrel <- pcrelate(iterator,
+                  pcs=pca$vectors[,c(1:7)],
+                  training.set=pca$unrels)
+
## Using 6 CPU cores
+
## 1040 samples to be included in the analysis...
+
## Betas for 7 PC(s) will be calculated using 1014 samples in training.set...
+
## Running PC-Relate analysis for 1040 samples using 75743 SNPs in 8 blocks...
+
## Performing Small Sample Correction...
+
names(pcrel)
+
## [1] "kinBtwn" "kinSelf"
+
# relatedness between pairs of individuals
+dim(pcrel$kinBtwn)
+
## [1] 540280      6
+
head(pcrel$kinBtwn)
+
##       ID1     ID2          kin        k0            k2  nsnp
+## 1 HG00096 HG00097  0.001588431 0.9932987 -0.0003476095 69500
+## 2 HG00096 HG00099  0.003362302 0.9932786  0.0067278300 69323
+## 3 HG00096 HG00100 -0.003428617 1.0124657 -0.0012487941 69424
+## 4 HG00096 HG00101  0.003034543 0.9845987 -0.0032630923 69416
+## 5 HG00096 HG00102  0.004579173 0.9805721 -0.0011112225 69195
+## 6 HG00096 HG00103  0.003119401 0.9766951 -0.0108272507 69195
+
# self-kinship estimates
+dim(pcrel$kinSelf)
+
## [1] 1040    3
+
head(pcrel$kinSelf)
+
##        ID           f  nsnp
+## 1 HG00096 -0.01454001 69639
+## 2 HG00097 -0.01167319 69917
+## 3 HG00099 -0.02335083 69572
+## 4 HG00100 -0.02149244 69697
+## 5 HG00101 -0.02210617 69798
+## 6 HG00102 -0.01810681 69541
+
# save output
+save(pcrel, file="data/pcrelate.RData")
+

We plot the pairwise kinship estimates against the IBD0 (k0) estimates (the proportion of variants for which the pair of individuals share 0 alleles identical by descent). We use a hexbin plot to visualize the relatedness for all pairs of samples.

+
ggplot(pcrel$kinBtwn, aes(k0, kin)) +
+    geom_hline(yintercept=2^(-seq(5,9,2)/2), linetype="dashed", color="grey") +
+    geom_hex(bins = 100) +
+    geom_abline(intercept = 0.25, slope = -0.25) +
+    ylab("kinship estimate") +
+    theme_bw()
+

+

We get very similar inference for 1st and 2nd degree relatives, but +we also see that the PC-Relate relatedness estimates for unrelated pairs +(i.e. kin ~ 0 and k0 ~ 1) are much closer to expectation than those from +KING-robust.
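To count the inferred relative pairs by degree, one can bin the PC-Relate kinship estimates at the conventional cutpoints 2^(-9/2), 2^(-7/2), 2^(-5/2), and 2^(-3/2), which separate 3rd degree, 2nd degree, 1st degree, and duplicate/monozygotic-twin pairs. A sketch (not part of the tutorial), assuming the `pcrel` object from `pcrelate` above:

```r
# sketch: classify related pairs by kinship degree
rel <- pcrel$kinBtwn[pcrel$kinBtwn$kin > 2^(-9/2), ]
rel$degree <- cut(rel$kin,
                  breaks = c(2^(-9/2), 2^(-7/2), 2^(-5/2), 2^(-3/2), Inf),
                  labels = c("3rd", "2nd", "1st", "dup/MZ"))
table(rel$degree)
```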

+
+
+

Advanced Notes

+

In small samples (such as this one), we recommend performing a second +iteration of PC-AiR and PC-Relate. Using the first iteration of +PC-Relate ancestry-adjusted kinship estimates, we can often better +partition our sample into unrelated and related sets, which leads to +better ancestry PCs from PC-AiR and better relatedness estimates from +PC-Relate. The steps to perform the second iteration are:

+
  1. Perform a second PC-AiR analysis using the PC-Relate kinship matrix as the kinship estimates. Still use the (original) KING-robust matrix for the ancestry divergence estimates.
  2. Perform a second PC-Relate analysis using the new PC-AiR PCs to adjust for population structure.
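The two steps above can be sketched in code roughly as follows (assumptions: the `gds`, `pruned`, `kingMat`, `seqData`, and `pcrel` objects created earlier in this tutorial; `pcrelateToMatrix` is the GENESIS helper that converts `pcrelate` output into a kinship matrix, with `scaleKin = 1` keeping values on the kinship scale):

```r
# sketch of the second iteration; not run in this tutorial
kinMat2 <- pcrelateToMatrix(pcrel, scaleKin = 1)

# 1. second PC-AiR: PC-Relate kinship, original KING-robust divergence
pca2 <- pcair(gds,
              kinobj = kinMat2, kin.thresh = 2^(-9/2),
              divobj = kingMat, div.thresh = -2^(-9/2),
              snp.include = pruned)

# 2. second PC-Relate, conditioning on the new PC-AiR PCs
# (re-check how many PCs reflect structure in pca2 before choosing 7)
seqSetFilter(seqData, variant.id = pruned)
iterator <- SeqVarBlockIterator(seqData, verbose = FALSE)
pcrel2 <- pcrelate(iterator,
                   pcs = pca2$vectors[, 1:7],
                   training.set = pca2$unrels)
```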
+

In large samples (such as TOPMed), the KING-robust and PC-Relate +analyses can be very time consuming, so we suggest the following +alternative procedure for deconvoluting ancestry and relatedness:

+
  1. Use the KING-IBDseg method to estimate kinship for all pairs of individuals. This method uses a fast algorithm to approximate IBD segments based on identity by state and works well in the presence of ancestry admixture – in our experience, it gives very similar results to PC-Relate. The software to run KING-IBDseg uses PLINK files, but we've written a KING IBDseg application on the BioData Catalyst powered by Seven Bridges platform that will use GDS files as input (it does a conversion from GDS to PLINK).
  2. Perform a PC-AiR analysis to infer population structure using the KING-IBDseg estimates as the kinship estimates and using no ancestry divergence measures (in large samples, the PC-AiR algorithm works well even without the ancestry divergence).
+
+
+
+

Exercise 2.A.1 (Application)

+

Use the LD Pruning app on the BioData Catalyst powered +by Seven Bridges platform to perform LD pruning on the example 1000 +Genomes GDS files you created previously. The steps to perform this +analysis are as follows:

+
  • Copy the app to your project if it is not already there:
    • Click: Public Resources > Workflows and Tools > Browse
    • Search for LD Pruning
    • Click: Copy > Select your project > Copy
  • Run the analysis in your project:
    • Click: Apps > LD Pruning > Run
    • Specify the Inputs:
      • GDS Files: 1KG_phase3_GRCh38_subset_chr<CHR>.gds (select all 22 chromosomes)
    • Specify the App Settings:
      • Autosomes only: TRUE
      • LD |r| threshold: 0.32 (\(\approx r^2 = 0.1\))
      • MAF threshold: 0.05
      • Output prefix: “1KG_phase3_GRCh38_subset” (or any other string to name the output file)
    • Click: Run
+

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project. Use the “View stats & logs” button to check on the status of your tasks. Click on the bar that says “ld_pruning”, click “View Logs”, and click one of the “ld_pruning” folders (there’s one per chromosome). In here you can see detailed logs of the job; take a look at the job.out.log and job.err.log – these can be useful for debugging issues.

The output of this analysis will be a single file (<output prefix>_pruned.gds) with the genotype data from the LD pruned variants across all 22 chromosomes.

You can find the expected output of this analysis by looking at the existing task 02 LD Pruning in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.
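Under the hood, this kind of LD pruning can be done in R with SNPRelate. A rough single-chromosome sketch (the app loops over chromosomes and merges the results); the file names follow the exercise but are assumptions, and the thresholds match the App Settings above:

```r
library(SeqArray)
library(SNPRelate)

# open one chromosome's SeqArray GDS file
gds <- seqOpen("1KG_phase3_GRCh38_subset_chr1.gds")

# LD pruning on |r| with a MAF filter; pruning involves random selection
# within windows, so fix the seed for reproducibility
set.seed(100)
pruned <- snpgdsLDpruning(gds,
                          method = "corr",      # |r| correlation
                          ld.threshold = 0.32,  # approximately r^2 = 0.1
                          maf = 0.05,
                          autosome.only = TRUE)
pruned.id <- unlist(pruned, use.names = FALSE)

# write a GDS file with only the pruned variants (the app concatenates
# the per-chromosome results into a single pruned GDS file)
seqSetFilter(gds, variant.id = pruned.id)
seqExport(gds, "1KG_phase3_GRCh38_subset_pruned.gds")
seqClose(gds)
```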

Exercise 2.A.2 (Application)

Use the KING robust app on the BioData Catalyst powered by Seven Bridges platform to perform a KING-robust analysis of the example 1000 Genomes data using the LD pruned variants. The steps to perform this analysis are as follows:

  • Copy the app to your project if it is not already there:
    • Click: Public Resources > Workflows and Tools > Browse
    • Search for KING robust
    • Click: Copy > Select your project > Copy
  • Run the analysis in your project:
    • Click: Apps > KING robust > Run
    • Specify the Inputs:
      • GDS file: 1KG_phase3_GRCh38_subset_pruned.gds (has pruned variants from all 22 chromosomes)
    • Specify the App Settings:
      • kinship_plots > Kinship plotting threshold: 0
      • Output prefix: “1KG_phase3_GRCh38_subset” (or any other string to name the output file)
    • Click: Run
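The steps above are roughly equivalent to a single SNPRelate call in R; a sketch, assuming the pruned GDS file from the previous exercise:

```r
library(SeqArray)
library(SNPRelate)

# KING-robust kinship estimation on the LD pruned genotypes
gds <- seqOpen("1KG_phase3_GRCh38_subset_pruned.gds")
king <- snpgdsIBDKING(gds, type = "KING-robust")
seqClose(gds)

# sample-by-sample kinship matrix, labeled by sample id (useful later as
# input to PC-AiR)
kingMat <- king$kinship
dimnames(kingMat) <- list(king$sample.id, king$sample.id)

# estimated kinship vs. IBS0, the same relationship the app's PDF plots
plot(king$IBS0, king$kinship, xlab = "IBS0", ylab = "kinship estimate")
```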

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be a <output_prefix>_king_robust.gds file that has the kinship estimates and a <output_prefix>_king_robust_all.pdf with the plot of estimated kinship vs. IBS0. Look at the kinship plot – how many 1st degree relative pairs are identified? How many 2nd degree relative pairs are identified?

You can find the expected output of this analysis by looking at the existing task 03 King robust in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.

Solution 2.A.2 (Application)

From the kinship plot, we can see that there are 6 1st degree relative pairs (5 parent-offspring; 1 full sibling) and 3 2nd degree relative pairs identified.

Exercise 2.A.3 (Application)

Use the PC-AiR app on the BioData Catalyst powered by Seven Bridges platform to perform a PC-AiR analysis of the example 1000 Genomes data using the LD pruned variants. Use the KING-robust kinship estimates as both the kinship and divergence measures. The steps to perform this analysis are as follows:

  • Copy the app to your project if it is not already there:
    • Click: Public Resources > Workflows and Tools > Browse
    • Search for PC-AiR
    • Click: Copy > Select your project > Copy
  • Run the analysis in your project:
    • Click: Apps > PC-AiR > Run
    • Specify the Inputs:
      • Pruned GDS File: 1KG_phase3_GRCh38_subset_pruned.gds (has pruned variants from all 22 chromosomes)
      • Kinship File: 1KG_phase3_GRCh38_subset_king_robust.gds
      • Divergence File: 1KG_phase3_GRCh38_subset_king_robust.gds
      • Phenotype file: pheno_annotated.RData
    • Specify the App Settings:
      • pca_byrel > Number of PCs: 12
      • pca_plots > Group: pop (sample variable for coloring output PCA)
      • pca_plots > Number of PCs: 12
      • PC-variant correlation > Run PC-variant correlation: FALSE (extra diagnostic step that takes a while to run)
      • Output prefix: “1KG_phase3_GRCh38_subset” (or any other string to name the output file)
    • Click: Run
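The core computation of this app corresponds to a single `pcair` call in GENESIS. A sketch, assuming `kingMat` is the KING-robust kinship matrix (with sample IDs as dimnames) from the previous exercise:

```r
library(SeqArray)
library(GENESIS)

# PC-AiR on the pruned variants, with the KING-robust estimates used as both
# the kinship measure (kinobj) and the ancestry divergence measure (divobj)
gds <- seqOpen("1KG_phase3_GRCh38_subset_pruned.gds")
pca <- pcair(gds, kinobj = kingMat, divobj = kingMat)
seqClose(gds)

# pca$vectors holds the PCs; pca$unrels and pca$rels give the unrelated and
# related sample sets identified by the algorithm
head(pca$vectors[, 1:7])
```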

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis is a “_pca.RData” object with the PC values and several PC plots. Look at the parallel coordinates plot (“<output_prefix>_parcoord.pdf”) generated by the task. How many PCs appear to reflect population structure in the sample? This will determine how many PCs you should use to adjust PC-Relate in the next exercise.

You can find the expected output of this analysis by looking at the existing task 04 PC-AiR in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.

Solution 2.A.3 (Application)

From the parallel coordinates plot, we can see that PCs 1-7 appear to reflect population structure. We will use the first 7 PCs to adjust for population structure in the PC-Relate analysis in the next exercise.

Exercise 2.A.4 (Application)

Use the PC-Relate app on the BioData Catalyst powered by Seven Bridges platform to perform a PC-Relate analysis of the example 1000 Genomes data using the LD pruned variants. Use the PC-AiR PCs to adjust for population structure. The steps to perform this analysis are as follows:

  • Copy the app to your project if it is not already there:
    • Click: Public Resources > Workflows and Tools > Browse
    • Search for PC-Relate
    • Click: Copy > Select your project > Copy
  • Run the analysis in your project:
    • Click: Apps > PC-Relate > Run
    • Specify the Inputs:
      • GDS File: 1KG_phase3_GRCh38_subset_pruned.gds (has pruned variants from all 22 chromosomes)
      • PCA file: 1KG_phase3_GRCh38_subset_pca.RData
    • Specify the App Settings:
      • pcrelate_beta > Number of PCs: 7
      • pcrelate > Number of PCs: 7
      • pcrelate > Return IBD probabilities?: TRUE
      • pcrelate_correct > Sparse threshold: -1 (to keep full dense matrix)
      • kinship_plots > Kinship plotting threshold: 0
      • kinship_plots > Return IBD probabilities?: TRUE
      • Output prefix: “1KG_phase3_GRCh38_subset” (or any other string to name the output file)
    • Click: Run
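In R, this step corresponds to running `pcrelate` over a variant iterator. A sketch, assuming `pca` is the PC-AiR result loaded from the previous exercise:

```r
library(SeqArray)
library(SeqVarTools)
library(GENESIS)

# set up a block iterator over the pruned variants
gds <- seqOpen("1KG_phase3_GRCh38_subset_pruned.gds")
seqData <- SeqVarData(gds)
iterator <- SeqVarBlockIterator(seqData, verbose = FALSE)

# PC-Relate adjusting for the first 7 PC-AiR PCs; parameters are estimated in
# the PC-AiR unrelated set and applied to all samples
pcrel <- pcrelate(iterator,
                  pcs = pca$vectors[, 1:7],
                  training.set = pca$unrels,
                  ibd.probs = TRUE)
seqClose(gds)

# pairwise estimated kinship vs. k0, as in the app's PDF plot
plot(pcrel$kinBtwn$k0, pcrel$kinBtwn$kin, xlab = "k0", ylab = "kinship estimate")

# dense kinship matrix (the app setting "Sparse threshold: -1" keeps it dense)
kinMat <- pcrelateToMatrix(pcrel, scaleKin = 2, verbose = FALSE)
```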

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis is a <output_prefix>_pcrelate.RData file that contains the PC-Relate relatedness estimates, a <output_prefix>_pcrelate_Matrix.RData file that contains a sparse matrix of kinship estimates (more on this in the Advanced GWAS tutorial), and a <output_prefix>_pcrelate_all.pdf with the plot of estimated kinship vs. k0. Look at the kinship plot – how many 1st degree relative pairs are identified? How many 2nd degree relative pairs are identified?

You can find the expected output of this analysis by looking at the existing task 05 PC-Relate in the Tasks menu of your Project.

Solution 2.A.4 (Application)

From the kinship plot, you can see that there are 6 1st degree relative pairs (5 parent-offspring; 1 full sibling) and 2 2nd degree relative pairs identified.
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/02_GWAS.Rmd b/02_GWAS.Rmd new file mode 100644 index 0000000..6e6d0a7 --- /dev/null +++ b/02_GWAS.Rmd @@ -0,0 +1,365 @@ +# 2. Genome Wide Association Studies (GWAS) + +Single variant association tests are used to identify genetic variants associated with a phenotype of interest. Performing single-variant tests genome-wide is commonly referred to as a Genome Wide Association Study (GWAS). This tutorial demonstrates how to perform single variant association tests using mixed models with the [GENESIS](https://bioconductor.org/packages/release/bioc/html/GENESIS.html) R/Bioconductor package. + +## Prepare the Data + +Before we can begin our association testing procedure, we must prepare our data in the required format. GENESIS requires that phenotype data be provided as an `AnnotatedDataFrame`, which is a special data structure provided by the [Biobase](https://www.bioconductor.org/packages/release/bioc/html/Biobase.html) R/Bioconductor package that contains both data and metadata. You should include a description of each variable in the metadata. + + +### Phenotype Data + +First, we load our phenotype data (i.e. both the outcome and covariate data), which is provided in a tab separated .tsv file. We then create metadata to describe the columns of the phenotype data. Finally, we create an `AnnotatedDataFrame` by pairing the phenotype data with the metadata. 
+
+```{r, message = FALSE}
+library(Biobase)
+```
+
+```{r, pheno_data}
+repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+# load phenotype data
+phenfile <- "data/pheno_data.tsv"
+if (!file.exists(phenfile)) download.file(file.path(repo_path, phenfile), phenfile)
+phen <- read.table(phenfile, header = TRUE, sep = "\t", as.is = TRUE)
+head(phen)
+
+# create metadata
+metadata <- data.frame(labelDescription = c("sample identifier",
+                                            "population",
+                                            "super population",
+                                            "sex",
+                                            "age at measurement",
+                                            "trait 1 values",
+                                            "trait 2 values",
+                                            "case-control status"),
+                       row.names = colnames(phen))
+metadata
+
+# create the AnnotatedDataFrame
+annot <- AnnotatedDataFrame(phen, metadata)
+annot
+```
+
+We use the `pData` and `varMetadata` functions to access the data and metadata in our `AnnotatedDataFrame`, respectively.
+
+```{r}
+# access the data with the pData() function.
+head(pData(annot))
+
+# access the metadata with the varMetadata() function.
+varMetadata(annot)
+```
+
+Save the `AnnotatedDataFrame` for future use.
+
+```{r}
+save(annot, file = "data/pheno_annotated.RData")
+```
+
+#### Sample Identifiers
+
+Note that the GENESIS code to fit the mixed model and perform the association tests requires that the `AnnotatedDataFrame` have a column named `sample.id`, which represents a sample (i.e. sequencing instance) identifier. The values in the `sample.id` column must match the `sample.id` values in the GDS file(s) containing the sequencing data.
+
+When designing a study, we generally advise using separate IDs for samples (sequencing instances) and subjects (individuals with phenotypes) and maintaining a sample to subject mapping file.
This practice can be beneficial for quality control purposes; for example, when sample swaps are detected, the mapping between sequencing (indexed by `sample.id`) and phenotype (indexed by `subject.id`) data can easily be updated, rather than needing to modify and re-write phenotype data or sequencing metrics files. + +However, in this example, the 1000 Genomes sample identifiers (`sample.id`) are used as subject identifiers in our phenotype data -- this goes against our recommendation, but is OK for these exercises. + +### Genetic Ancestry Principal Components (PCs) + +We use genetic ancestry PCs to adjust for potential confounding due to population structure in our sample. The additional tutorial `02.A_population_structure_relatedness.Rmd` shows how to compute the ancestry PCs that are used below. In that tutorial, we find that PCs 1-7 appear to reflect population structure in our sample, so we will use those to adjust for ancestry in our null model. We need to add these PCs to our `AnnotatedDataFrame` with the phenotype data. + +```{r, load_pcs, message = FALSE} +# load the ancestry PCs +pcfile <- "data/pcs.RData" +if (!file.exists(pcfile)) download.file(file.path(repo_path, pcfile), pcfile) +pcs <- get(load(pcfile)) +pcs <- pcs[,c("sample.id", paste0("PC", 1:7))] +head(pcs) + +# merge PCs with the sample annotation +dat <- merge(pData(annot), pcs, by = "sample.id") +head(dat) + +# update the variable metadata +metadata <- data.frame(labelDescription = c(varMetadata(annot)$labelDescription, paste0("ancestry PC", 1:7)), + row.names = colnames(dat)) + +# create an updated AnnotatedDataFrame +annot <- AnnotatedDataFrame(dat, metadata) +annot +``` + +Save the `AnnotatedDataFrame` with PCs for future use. 
+ +```{r} +save(annot, file = "data/pheno_annotated_pcs.RData") +``` + + +### Kinship Matrix (KM) + +In order to perform association testing using a mixed model, we also need a kinship matrix (KM) or genetic relationship matrix (GRM) that captures the genetic correlation among samples. The additional tutorial `02.A_population_structure_relatedness.Rmd` also shows how to compute pairwise kinship estimates using the PC-Relate method. We can create an (n x n) empirical kinship matrix (KM) from the output of `pcrelate` using the `pcrelateToMatrix` function. We set `scaleKin = 2` to multiply the kinship values by 2, which gives values on the same scale as the standard GRM (this is relevant for the interpretation of the variance component estimates). This matrix is represented in R as a symmetric matrix object from the Matrix package. + +```{r, load_kinship} +library(GENESIS) + +# load the pcrelate results +kinfile <- "data/pcrelate.RData" +if (!file.exists(kinfile)) download.file(file.path(repo_path, kinfile), kinfile) +pcrel <- get(load(kinfile)) + +# create the empirical KM +kinship <- pcrelateToMatrix(pcrel, scaleKin=2, verbose=FALSE) +dim(kinship) +kinship[1:5,1:5] +``` + +Save the kinship matrix for future use. + +```{r} +# save the empirical KM +save(kinship, file="data/pcrelate_Matrix.RData") +``` + + +## Null Model + +Now that our data is prepared, we can move on to the association testing procedure. The first step is to fit the "null model" -- i.e., a model fit under the null hypothesis of no individual variant association. Operationally, this is fitting a mixed model with the desired outcome phenotype, fixed effect covariates, and a random effect with covariance proportional to a kinship matrix (KM). + +### Fit the Null Model + +We use the `fitNullModel` function from GENESIS. We need to specify the `AnnotatedDataFrame` with the phenotype data, the outcome variable (trait_1), and the fixed effect covariates (sex, age, and PCs 1-7). 
We also include the kinship matrix in the model with the `cov.mat` (covariance matrix) argument, which is used to specify the random effect(s) in the model with covariance structure(s) proportional to the supplied matrix(s). + +```{r null_model_fit} +# fit the null model +nullmod <- fitNullModel(annot, + outcome="trait_1", + covars=c("sex", "age", paste0("PC", c(1:7))), + cov.mat=kinship, + verbose=FALSE) + +# save the output +save(nullmod, file="data/null_model_trait1.RData") +``` + +The `fitNullModel` function returns a lot of information about the model that was fit. We examine some of that information below; to see all of the components, try `names(nullmod)`. + +```{r assoc_null_model_results} +# description of the model we fit +nullmod$model + +# fixed effect regression estimates +nullmod$fixef + +# variance component estimates +nullmod$varComp + +# model fit: fitted values, residuals +head(nullmod$fit) + +# plot the residuals vs the fitted values +library(ggplot2) +ggplot(nullmod$fit, aes(x = fitted.values, y = resid.marginal)) + + geom_point(alpha = 0.5) + + geom_hline(yintercept = 0) + + geom_smooth(method = 'lm') +``` + +The residuals vs. fitted values diagnostic plot looks good. + +## Single-Variant Association Tests + +After fitting the null model, we use single-variant score tests to test each variant across the genome separately for association with the outcome, accounting for genetic ancestry and genetic relatedness among the samples. We use the `assocTestSingle` function from GENESIS. + +### Prepare the GDS Iterator + +First, we have to create a `SeqVarData` object linking the GDS file containing the sequencing data and the `AnnotatedDataFrame` containing the phenotype data. We then create a `SeqVarBlockIterator` object, which breaks the set of all variants in the `SeqVarData` object into blocks, allowing us to analyze genome-wide in manageable pieces. Note that in this tutorial we are analyzing only a small subset of variants from chromosome 1. 
+ +```{r, message = FALSE} +library(SeqVarTools) + +# open a connection to the GDS file +gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds" +if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile) +gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open +gds <- seqOpen(gdsfile) + +# make the seqVarData object +seqData <- SeqVarData(gds, sampleData=annot) + +# make the iterator object +iterator <- SeqVarBlockIterator(seqData, verbose=FALSE) +iterator +``` + +The `SeqVarBlockIterator` object looks a lot like the GDS objects we've seen before, but with an additional `sample.annotation` field that contains the phenotype data from the linked `AnnotatedDataFrame`. + +### Run the Association Tests + +The `assocTestSingle` function takes the already fitted null model as input, performs score tests by iterating over all blocks of variants in the `SeqVarBlockIterator` object, and then concatenates and returns the results. + +```{r assoc_single, message = FALSE} +# run the single-variant association test +assoc <- assocTestSingle(iterator, + null.model = nullmod, + test = "Score") +dim(assoc) +head(assoc) +``` + +Each row of the results data.frame represents one tested variant and includes: variant information (`variant.id`, `chr`, and `pos`), the number of samples tested (`n.obs`), the minor allele count (`MAC`), the effect allele frequency (`freq`), the score value (`Score`) and its standard error (`Score.SE`), the score test statistic (`Score.Stat`) and $p$-value (`Score.pval`), an approximation of the effect allele effect size (`Est`) and its standard error (`Est.SE`), and an approximation of the proportion of variation explained by the variant (`PVE`). When using a `SeqVarData` object, the effect allele is the alternate allele. 
+ +```{r} +# save for later +save(assoc, file = 'data/assoc_chr1_trait_1.RData') +``` + +#### Examine the results + +A lot of the variants we tested are very rare -- i.e., the alternate allele is not observed for many samples. Single-variant tests do not perform well for very rare variants (we discuss testing rare variants in more detail later). We can use the minor allele count (MAC) observed in the sample to filter out rare variants that we may expect to have unreliable test results (e.g. MAC < 20). The MAC filter you will want to use in practice will depend on your sample size. + +```{r, mac} +summary(assoc$MAC) +sum(assoc$MAC < 20) + +# filter out the rarest variants +assoc <- assoc[assoc$MAC >= 20, ] +dim(assoc) +``` + +We make a QQ plot to examine the distribution of $p$-values. + +```{r, assoc_single_qq} +qqPlot <- function(pval) { + pval <- pval[!is.na(pval)] + n <- length(pval) + x <- 1:n + dat <- data.frame(obs=sort(pval), + exp=x/n, + upper=qbeta(0.025, x, rev(x)), + lower=qbeta(0.975, x, rev(x))) + + ggplot(dat, aes(-log10(exp), -log10(obs))) + + geom_line(aes(-log10(exp), -log10(upper)), color="gray") + + geom_line(aes(-log10(exp), -log10(lower)), color="gray") + + geom_point() + + geom_abline(intercept=0, slope=1, color="red") + + xlab(expression(paste(-log[10], "(expected P)"))) + + ylab(expression(paste(-log[10], "(observed P)"))) + + theme_bw() +} + +qqPlot(assoc$Score.pval) +``` + +We make a Manhattan plot of the $-log_{10}(p)$-values using the `manhattanPlot` fuction from the `GWASTools` package to visualize the association signals. + +```{r, assoc_single_manhattan} +GWASTools::manhattanPlot(assoc$Score.pval, + chromosome = assoc$chr, + thinThreshold = 1e-4, + ylim = c(0, 12)) +``` + +We should expect the majority of variants to fall near the red `y=x` line in the QQ plot. Deviation above the line, commonly referred to as "inflation" is typically indicative of some model issue (e.g. unaccounted for population structure or relatedness). 
In this example, the appearance of inflation is caused by enrichment of association signal due to the fact that we have two genome-wide significant signals (i.e. $p < 5 \times 10^{-8}$) and our dataset only has a small number of variants on chromosome 1.
+
+From looking at the results for the variants that reached genome-wide significance, we see that 6 variants at two different loci have $p < 5 \times 10^{-8}$.
+```{r}
+assoc[assoc$Score.pval < 5e-8, ]
+```
+
+
+## Exercise 2.1 (Application)
+
+Use the `GENESIS Null Model` app on the BioData Catalyst powered by Seven Bridges platform to fit a null model for trait_1, adjusting for sex, age, ancestry, and kinship in the model, using the example 1000 Genomes data. You can use the PCs and kinship matrix computed using the PC-AiR and PC-Relate apps in the additional Population Structure and Relatedness tutorial's exercises as inputs to this analysis. The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+    - Click: Public Resources > Workflows and Tools > Browse
+    - Search for `GENESIS Null Model`
+    - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+    - Click: Apps > `GENESIS Null Model` > Run
+    - Specify the Inputs:
+        - Phenotype file: `pheno_annotated.RData`
+        - PCA file: `1KG_phase3_GRCh38_subset_pca.RData`
+        - Relatedness matrix file: `1KG_phase3_GRCh38_subset_pcrelate_Matrix.RData`
+    - Specify the App Settings:
+        - Covariates: age, sex (each as a different term)
+        - Family: gaussian
+        - Number of PCs to include as covariates: 7
+        - Outcome: trait_1
+        - Two stage model: FALSE
+        - Output prefix: "1KG_trait_1" (or any other string to name the output file)
+    - Click: Run
+
+The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
+ +The output of this analysis will be a `_null_model.RData` file that contains the null model fit, a `_phenotypes.RData` file with the phenotype data used in the analysis, and a `_report.Rmd` and `_report.html` with model diagnostics. Review the .html report -- which covariates have significant ($p < 0.05$) associations with trait_1 in the null model? + +You can find the expected output of this analysis by looking at the existing task `06 Null Model trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise. + +### Solution 2.1 (Application) + +From looking at the .html report, we see that PC1, PC2, PC3, and PC6 have significant associations with trait_1 in our null model. + + +## Exercise 2.2 (Application) + +Use the `GENESIS Single Variant Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform a GWAS for trait_1 using the null model fit in the previous exercise. Use the genotype data in the genome-wide GDS files you created previously. 
The steps to perform this analysis are as follows: + +- Copy the app to your project if it is not already there: + - Click: Public Resources > Workflows and Tools > Browse + - Search for `GENESIS Single Variant Association Testing` + - Click: Copy > Select your project > Copy +- Run the analysis in your project: + - Click: Apps > `GENESIS Single Variant Association Testing` > Run + - Specify the Inputs: + - GDS Files: `1KG_phase3_GRCh38_subset_chr.gds` (select all 22 chromosomes) + - Null model file: `1KG_trait_1_null_model.RData` + - Phenotype file: `1KG_trait_1_phenotypes.RData` (use the phenotype file created by the Null Model app) + - Specify the App Settings: + - MAC threshold: 5 + - Test type: score + - memory GB: 32 (increase to make sure enough available) + - Output prefix: "1KG_trait_1_assoc" (or any other string to name the output file) + - Click: Run + +The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed. + +The output of this analysis will be 22 `_chr.RData` files with the association test results for each chromosome as well as a `_manh.png` file with the Manhattan plot and a `_qq.png` file with the QQ plot. Review the QQ and Manhattan plots -- is there evidence of genomic inflation? + +You can find the expected output of this analysis by looking at the existing task `07 Single Variant Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise. + +### Solution 2.2 (Application) + +From looking at the QQ plot, we see that the genomic control lambda = 1.074 and there is some deviation from the $y=x$ line -- both indicative of moderate inflation in our analysis. This is likely an artifact of looking at rare variants with a small sample size. 
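The genomic control lambda reported on the QQ plot can be computed directly from the score test $p$-values. A quick sketch, assuming `assoc` is the results data.frame returned by `assocTestSingle`:

```r
# genomic control lambda: ratio of the median observed chi-square statistic
# to the median of the null chi-square(1) distribution
chisq <- qchisq(assoc$Score.pval, df = 1, lower.tail = FALSE)
lambda <- median(chisq) / qchisq(0.5, df = 1)
lambda  # values near 1 indicate little inflation
```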
+ + +## Exercise 2.3 (Application) + +Use the `GENESIS Association results plotting` app on the BioData Catalyst powered by Seven Bridges platform to make additional QQ plots of the single variant association results binned by MAF: $0-0.5\%$, $0.5-1\%$, $1-5\%$, $\geq 5\%$. The steps to perform this analysis are as follows: + +- Copy the app to your project if it is not already there: + - Click: Public Resources > Workflows and Tools > Browse + - Search for `GENESIS Association results plotting` + - Click: Copy > Select your project > Copy +- Run the analysis in your project: + - Click: Apps > `GENESIS Association results plotting` > Run + - Specify the Inputs: + - Results from association testing: `1KG_trait_1_assoc_chr.RData` (select all 22 chromosomes) + - Specify the App Settings: + - Association Type: single + - QQ MAF bins: "0.005 0.01 0.05" + - Click: Run + +The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed. + +The output of this analysis will be a `_qq_bymaf.png` file. Look at the QQ plots by MAF bin -- how do they compare to the overall QQ plot of all variants? + +You can find the expected output of this analysis by looking at the existing task `08 Single Variant Association Plots trait_1` in the Tasks menu of your Project, so you do not need to wait for your analysis to finish to look at the output. + +### Solution 2.3 (Application) + +From the binned QQ plots, we see that the common variants (i.e. MAF $\geq 5\%$) have a genomic control lambda = 1.007 and follow along the $y=x$ line. As suspected, the inflation is only present in the rarer variants, likely due to the small sample size. 
diff --git a/02_GWAS.html b/02_GWAS.html new file mode 100644 index 0000000..081b65e --- /dev/null +++ b/02_GWAS.html @@ -0,0 +1,1068 @@ + + + + + + + + + + + + + +02_GWAS.knit + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

2. Genome Wide Association Studies (GWAS)

+

Single variant association tests are used to identify genetic variants associated with a phenotype of interest. Performing single-variant tests genome-wide is commonly referred to as a Genome Wide Association Study (GWAS). This tutorial demonstrates how to perform single variant association tests using mixed models with the GENESIS R/Bioconductor package.

+
+

Prepare the Data

+

Before we can begin our association testing procedure, we must prepare our data in the required format. GENESIS requires that phenotype data be provided as an AnnotatedDataFrame, which is a special data structure provided by the Biobase R/Bioconductor package that contains both data and metadata. You should include a description of each variable in the metadata.

+
+

Phenotype Data

+

First, we load our phenotype data (i.e. both the outcome and covariate data), which is provided in a tab separated .tsv file. We then create metadata to describe the columns of the phenotype data. Finally, we create an AnnotatedDataFrame by pairing the phenotype data with the metadata.

+
library(Biobase)
+
repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+# load phenotype data
+phenfile <- "data/pheno_data.tsv"
+if (!file.exists(phenfile)) download.file(file.path(repo_path, phenfile), phenfile)
+phen <- read.table(phenfile, header = TRUE, sep = "\t", as.is = TRUE)
+head(phen)
+
##   sample.id pop super_pop    sex age trait_1 trait_2 status
+## 1   HG00096 GBR       EUR   male  56    23.5     0.1      0
+## 2   HG00097 GBR       EUR female  56    23.9     0.1      0
+## 3   HG00099 GBR       EUR female  48    23.8     0.3      0
+## 4   HG00100 GBR       EUR female  54    24.9     0.2      0
+## 5   HG00101 GBR       EUR   male  65    27.6     0.7      1
+## 6   HG00102 GBR       EUR female  53    24.5     0.4      0
+
# create metadata
+metadata <- data.frame(labelDescription = c("sample identifier",
+                                            "population",
+                                            "super population",
+                                            "sex",
+                                            "age at measurement",
+                                            "trait 1 values",
+                                            "trait 2 values",
+                                            "case-control status"),
+                       row.names = colnames(phen))
+metadata
+
##              labelDescription
+## sample.id   sample identifier
+## pop                population
+## super_pop    super population
+## sex                       sex
+## age        age at measurement
+## trait_1        trait 1 values
+## trait_2        trait 2 values
+## status    case-control status
+
# create the AnnotatedDataFrame
+annot <- AnnotatedDataFrame(phen, metadata)
+annot
+
## An object of class 'AnnotatedDataFrame'
+##   rowNames: 1 2 ... 1040 (1040 total)
+##   varLabels: sample.id pop ... status (8 total)
+##   varMetadata: labelDescription
+

We use the pData and varMetadata functions to access the data and metadata in our AnnotatedDataFrame, respectively.

+
```r
# access the data with the pData() function
head(pData(annot))
##   sample.id pop super_pop    sex age trait_1 trait_2 status
## 1   HG00096 GBR       EUR   male  56    23.5     0.1      0
## 2   HG00097 GBR       EUR female  56    23.9     0.1      0
## 3   HG00099 GBR       EUR female  48    23.8     0.3      0
## 4   HG00100 GBR       EUR female  54    24.9     0.2      0
## 5   HG00101 GBR       EUR   male  65    27.6     0.7      1
## 6   HG00102 GBR       EUR female  53    24.5     0.4      0

# access the metadata with the varMetadata() function
varMetadata(annot)
##              labelDescription
## sample.id   sample identifier
## pop                population
## super_pop    super population
## sex                       sex
## age        age at measurement
## trait_1        trait 1 values
## trait_2        trait 2 values
## status    case-control status
```

Save the AnnotatedDataFrame for future use.

```r
save(annot, file = "data/pheno_annotated.RData")
```

### Sample Identifiers

Note that the GENESIS code to fit the mixed model and perform the association tests requires that the `AnnotatedDataFrame` have a column named `sample.id`, which represents a sample (i.e. sequencing instance) identifier. The values in the `sample.id` column must match the `sample.id` values in the GDS file(s) containing the sequencing data.

+
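As a quick sanity check (a sketch, not part of the original tutorial code, assuming the `annot` object created above and the GDS file from the first tutorial), we can verify that every phenotyped sample is present in the GDS file before fitting any models:

```r
library(SeqArray)

# open the GDS file created in the first tutorial
gds <- seqOpen("data/1KG_phase3_GRCh38_subset_chr1.gds")

# every sample.id in the phenotype annotation should appear in the GDS
stopifnot(all(annot$sample.id %in% seqGetData(gds, "sample.id")))

seqClose(gds)
```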

When designing a study, we generally advise using separate IDs for samples (sequencing instances) and subjects (individuals with phenotypes) and maintaining a sample-to-subject mapping file. This practice can be beneficial for quality control purposes; for example, when sample swaps are detected, the mapping between sequencing (indexed by `sample.id`) and phenotype (indexed by `subject.id`) data can easily be updated, rather than needing to modify and re-write phenotype data or sequencing metrics files.

However, in this example, the 1000 Genomes sample identifiers (`sample.id`) are used as subject identifiers in our phenotype data; this goes against our recommendation, but is OK for these exercises.

+
+
+
+
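To illustrate the recommended design, here is a minimal sketch (the IDs and values are invented for illustration) of a sample-to-subject mapping table and how it links sequencing samples to phenotyped subjects:

```r
# hypothetical sample-to-subject mapping: one row per sequencing instance
map <- data.frame(sample.id  = c("SAMP001", "SAMP002", "SAMP003"),
                  subject.id = c("SUBJ_A", "SUBJ_B", "SUBJ_A"))

# phenotypes indexed by subject
pheno <- data.frame(subject.id = c("SUBJ_A", "SUBJ_B"),
                    trait      = c(23.5, 27.6))

# attach phenotypes to sequencing samples via the mapping;
# fixing a sample swap only requires editing `map`, not `pheno`
dat <- merge(map, pheno, by = "subject.id")
```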

### Genetic Ancestry Principal Components (PCs)

We use genetic ancestry PCs to adjust for potential confounding due to population structure in our sample. The additional tutorial `02.A_population_structure_relatedness.Rmd` shows how to compute the ancestry PCs that are used below. In that tutorial, we find that PCs 1-7 appear to reflect population structure in our sample, so we will use those to adjust for ancestry in our null model. We need to add these PCs to our `AnnotatedDataFrame` with the phenotype data.

```r
# load the ancestry PCs
pcfile <- "data/pcs.RData"
if (!file.exists(pcfile)) download.file(file.path(repo_path, pcfile), pcfile)
pcs <- get(load(pcfile))
pcs <- pcs[,c("sample.id", paste0("PC", 1:7))]
head(pcs)
##         sample.id        PC1         PC2        PC3         PC4           PC5
## HG00096   HG00096 0.01800549 -0.03361957 0.01541131 -0.02085591 -0.0037719379
## HG00097   HG00097 0.01742309 -0.03361182 0.01304470 -0.01992104 -0.0095947596
## HG00099   HG00099 0.01799001 -0.03418633 0.01152677 -0.01694420  0.0011408340
## HG00100   HG00100 0.01793750 -0.03378008 0.01491137 -0.01954784  0.0045447134
## HG00101   HG00101 0.01759459 -0.03318057 0.01568040 -0.01892110 -0.0009236657
## HG00102   HG00102 0.01796459 -0.03344242 0.01286241 -0.01767438  0.0022747910
##                 PC6        PC7
## HG00096 -0.04669896 0.01985608
## HG00097 -0.04545408 0.01034344
## HG00099 -0.05178681 0.02212758
## HG00100 -0.04375806 0.02055287
## HG00101 -0.05762760 0.03443987
## HG00102 -0.06528352 0.03392455

# merge PCs with the sample annotation
dat <- merge(pData(annot), pcs, by = "sample.id")
head(dat)
##   sample.id pop super_pop    sex age trait_1 trait_2 status        PC1
## 1   HG00096 GBR       EUR   male  56    23.5     0.1      0 0.01800549
## 2   HG00097 GBR       EUR female  56    23.9     0.1      0 0.01742309
## 3   HG00099 GBR       EUR female  48    23.8     0.3      0 0.01799001
## 4   HG00100 GBR       EUR female  54    24.9     0.2      0 0.01793750
## 5   HG00101 GBR       EUR   male  65    27.6     0.7      1 0.01759459
## 6   HG00102 GBR       EUR female  53    24.5     0.4      0 0.01796459
##           PC2        PC3         PC4           PC5         PC6        PC7
## 1 -0.03361957 0.01541131 -0.02085591 -0.0037719379 -0.04669896 0.01985608
## 2 -0.03361182 0.01304470 -0.01992104 -0.0095947596 -0.04545408 0.01034344
## 3 -0.03418633 0.01152677 -0.01694420  0.0011408340 -0.05178681 0.02212758
## 4 -0.03378008 0.01491137 -0.01954784  0.0045447134 -0.04375806 0.02055287
## 5 -0.03318057 0.01568040 -0.01892110 -0.0009236657 -0.05762760 0.03443987
## 6 -0.03344242 0.01286241 -0.01767438  0.0022747910 -0.06528352 0.03392455

# update the variable metadata
metadata <- data.frame(labelDescription = c(varMetadata(annot)$labelDescription, paste0("ancestry PC", 1:7)),
                       row.names = colnames(dat))

# create an updated AnnotatedDataFrame
annot <- AnnotatedDataFrame(dat, metadata)
annot
## An object of class 'AnnotatedDataFrame'
##   rowNames: 1 2 ... 1040 (1040 total)
##   varLabels: sample.id pop ... PC7 (15 total)
##   varMetadata: labelDescription
```

Save the AnnotatedDataFrame with PCs for future use.

```r
save(annot, file = "data/pheno_annotated_pcs.RData")
```

### Kinship Matrix (KM)

In order to perform association testing using a mixed model, we also need a kinship matrix (KM) or genetic relationship matrix (GRM) that captures the genetic correlation among samples. The additional tutorial `02.A_population_structure_relatedness.Rmd` also shows how to compute pairwise kinship estimates using the PC-Relate method. We can create an (n x n) empirical kinship matrix (KM) from the output of `pcrelate` using the `pcrelateToMatrix` function. We set `scaleKin = 2` to multiply the kinship values by 2, which gives values on the same scale as the standard GRM (this is relevant for the interpretation of the variance component estimates). This matrix is represented in R as a symmetric `Matrix` object from the Matrix package.

```r
library(GENESIS)

# load the pcrelate results
kinfile <- "data/pcrelate.RData"
if (!file.exists(kinfile)) download.file(file.path(repo_path, kinfile), kinfile)
pcrel <- get(load(kinfile))

# create the empirical KM
kinship <- pcrelateToMatrix(pcrel, scaleKin=2, verbose=FALSE)
dim(kinship)
## [1] 1040 1040

kinship[1:5,1:5]
## 5 x 5 Matrix of class "dsyMatrix"
##              HG00096       HG00097      HG00099      HG00100       HG00101
## HG00096  0.985459993  0.0031768617  0.006724604 -0.006857234  0.0060690853
## HG00097  0.003176862  0.9883268089 -0.007300965  0.012888182 -0.0009002438
## HG00099  0.006724604 -0.0073009649  0.976649166  0.022746299  0.0118873586
## HG00100 -0.006857234  0.0128881819  0.022746299  0.978507564  0.0265278283
## HG00101  0.006069085 -0.0009002438  0.011887359  0.026527828  0.9778938334
```

Save the kinship matrix for future use.

```r
# save the empirical KM
save(kinship, file="data/pcrelate_Matrix.RData")
```

## Null Model

Now that our data is prepared, we can move on to the association testing procedure. The first step is to fit the "null model", i.e., a model fit under the null hypothesis of no individual variant association. Operationally, this is fitting a mixed model with the desired outcome phenotype, fixed effect covariates, and a random effect with covariance proportional to a kinship matrix (KM).

### Fit the Null Model

We use the `fitNullModel` function from GENESIS. We need to specify the `AnnotatedDataFrame` with the phenotype data, the outcome variable (trait_1), and the fixed effect covariates (sex, age, and PCs 1-7). We also include the kinship matrix in the model with the `cov.mat` (covariance matrix) argument, which is used to specify the random effect(s) in the model with covariance structure(s) proportional to the supplied matrices.

```r
# fit the null model
nullmod <- fitNullModel(annot,
                        outcome="trait_1",
                        covars=c("sex", "age", paste0("PC", 1:7)),
                        cov.mat=kinship,
                        verbose=FALSE)

# save the output
save(nullmod, file="data/null_model_trait1.RData")
```

The `fitNullModel` function returns a lot of information about the model that was fit. We examine some of that information below; to see all of the components, try `names(nullmod)`.

```r
# description of the model we fit
nullmod$model
## $hetResid
## [1] FALSE
## 
## $family
## 
## Family: gaussian 
## Link function: identity 
## 
## 
## $outcome
## [1] "trait_1"
## 
## $covars
## [1] "sex" "age" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7"
## 
## $formula
## [1] "trait_1 ~ sex + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + (1|A)"

# fixed effect regression estimates
nullmod$fixef
##                      Est          SE         Stat         pval
## (Intercept) 25.185899964 0.315517207 6371.8985072 0.000000e+00
## sexmale      0.097399076 0.082950262    1.3787139 2.403203e-01
## age          0.005314007 0.005724947    0.8615915 3.532937e-01
## PC1         -9.160539707 1.332592217   47.2549794 6.232745e-12
## PC2          7.381926355 1.341147398   30.2960768 3.708753e-08
## PC3         -6.609404517 1.334624625   24.5248522 7.335753e-07
## PC4         -2.416695874 1.330245386    3.3005056 6.925856e-02
## PC5         -1.173732431 1.348177786    0.7579558 3.839690e-01
## PC6          3.623595390 1.341374618    7.2975861 6.904731e-03
## PC7          0.982631984 1.344912866    0.5338183 4.650060e-01

# variance component estimates
nullmod$varComp
##         V_A V_resid.var 
##   0.3214518   1.4719584

# model fit: fitted values, residuals
head(nullmod$fit)
##         outcome workingY fitted.values resid.marginal resid.conditional
## HG00096    23.5     23.5      24.97103    -1.47102942       -1.27022740
## HG00097    23.9     23.9      24.89440    -0.99440313       -0.85372610
## HG00099    23.8     23.8      24.82133    -1.02132686       -0.85776070
## HG00100    24.9     24.9      24.86416     0.03583687       -0.02486229
## HG00101    27.6     27.6      24.99079     2.60920776        2.13706428
## HG00102    24.5     24.5      24.80791    -0.30791251       -0.23232014
##         linear.predictor    resid.PY resid.cholesky sample.id
## HG00096         24.77023 -0.86295061    -1.15371693   HG00096
## HG00097         24.75373 -0.57999336    -0.77627756   HG00097
## HG00099         24.65776 -0.58273434    -0.77883646   HG00099
## HG00100         24.92486 -0.01689062    -0.02609604   HG00100
## HG00101         25.46294  1.45185101     1.93686871   HG00101
## HG00102         24.73232 -0.15783064    -0.21111386   HG00102

# plot the residuals vs the fitted values
library(ggplot2)
ggplot(nullmod$fit, aes(x = fitted.values, y = resid.marginal)) +
    geom_point(alpha = 0.5) +
    geom_hline(yintercept = 0) +
    geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
```

The residuals vs. fitted values diagnostic plot looks good.

## Single-Variant Association Tests

After fitting the null model, we use single-variant score tests to test each variant across the genome separately for association with the outcome, accounting for genetic ancestry and genetic relatedness among the samples. We use the `assocTestSingle` function from GENESIS.

### Prepare the GDS Iterator

First, we have to create a `SeqVarData` object linking the GDS file containing the sequencing data and the `AnnotatedDataFrame` containing the phenotype data. We then create a `SeqVarBlockIterator` object, which breaks the set of all variants in the `SeqVarData` object into blocks, allowing us to analyze genome-wide in manageable pieces. Note that in this tutorial we are analyzing only a small subset of variants from chromosome 1.

```r
library(SeqVarTools)

# open a connection to the GDS file
gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# make the SeqVarData object
seqData <- SeqVarData(gds, sampleData=annot)

# make the iterator object
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
iterator
## SeqVarBlockIterator object; on iteration 1 of 4 
##  | GDS:
## File: /Users/mconomos/Documents/Teaching/SISG_2024/data/1KG_phase3_GRCh38_subset_chr1.gds (2.1M)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 1040 LZMA_ra(9.88%), 829B } *
## |--+ variant.id   { Int32 37409 LZMA_ra(7.40%), 10.8K } *
## |--+ position   { Int32 37409 LZMA_ra(37.4%), 54.6K } *
## |--+ chromosome   { Str8 37409 LZMA_ra(0.22%), 169B } *
## |--+ allele   { Str8 37409 LZMA_ra(17.0%), 25.9K } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x1040x37409 LZMA_ra(9.74%), 1.8M } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 1040x37409 LZMA_ra(0.02%), 861B } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 37409 LZMA_ra(35.9%), 150.9K } *
## |  |--+ qual   { Float32 37409 LZMA_ra(0.12%), 181B } *
## |  |--+ filter   { Int32,factor 37409 LZMA_ra(0.12%), 181B } *
## |  |--+ info   [  ]
## |  \--+ format   [  ]
## \--+ sample.annotation   [  ]
##  | sampleData:
## An object of class 'AnnotatedDataFrame'
##   rowNames: 1 2 ... 1040 (1040 total)
##   varLabels: sample.id pop ... PC7 (15 total)
##   varMetadata: labelDescription
##  | variantData:
## An object of class 'AnnotatedDataFrame': none
```

The `SeqVarBlockIterator` object looks a lot like the GDS objects we've seen before, but with an additional `sampleData` field that contains the phenotype data from the linked `AnnotatedDataFrame`.

### Run the Association Tests

The `assocTestSingle` function takes the already fitted null model as input, performs score tests by iterating over all blocks of variants in the `SeqVarBlockIterator` object, and then concatenates and returns the results.

```r
# run the single-variant association test
assoc <- assocTestSingle(iterator,
                         null.model = nullmod,
                         test = "Score")
## # of selected samples: 1,040

dim(assoc)
## [1] 27621    14

head(assoc)
##   variant.id chr    pos allele.index n.obs       freq MAC       Score  Score.SE
## 1          1   1 631490            1  1040 0.05576923 116   8.1875979  7.770130
## 2          2   1 736950            1  1040 0.12067308 251 -16.3995051 10.327695
## 3          3   1 800909            1  1040 0.13076923 272   0.6883871 11.167501
## 4          4   1 814264            1  1040 0.20096154 418  -2.2268280 15.118906
## 5          5   1 868735            1  1040 0.92163462 163   3.0830853  9.009819
## 6          6   1 903007            1  1040 0.20817308 433 -15.1272838 13.839850
##   Score.Stat Score.pval          Est     Est.SE          PVE
## 1  1.0537273  0.2920078  0.135612569 0.12869798 1.078001e-03
## 2 -1.5879153  0.1123055 -0.153753123 0.09682703 2.448034e-03
## 3  0.0616420  0.9508479  0.005519767 0.08954555 3.689064e-06
## 4 -0.1472876  0.8829050 -0.009741951 0.06614235 2.106179e-05
## 5  0.3421917  0.7322066  0.037979865 0.11099002 1.136846e-04
## 6 -1.0930237  0.2743834 -0.078976554 0.07225512 1.159903e-03
```

Each row of the results data.frame represents one tested variant and includes: variant information (`variant.id`, `chr`, and `pos`), the number of samples tested (`n.obs`), the minor allele count (`MAC`), the effect allele frequency (`freq`), the score value (`Score`) and its standard error (`Score.SE`), the score test statistic (`Score.Stat`) and \(p\)-value (`Score.pval`), an approximation of the effect allele effect size (`Est`) and its standard error (`Est.SE`), and an approximation of the proportion of variation explained by the variant (`PVE`). When using a `SeqVarData` object, the effect allele is the alternate allele.

```r
# save for later
save(assoc, file = 'data/assoc_chr1_trait_1.RData')
```
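The `Est` and `Est.SE` columns are derived from the score statistics. As a numeric check (a sketch using the first row of the results shown above, assuming the standard score-test approximations `Est = Score / Score.SE^2` and `Est.SE = 1 / Score.SE`):

```r
# values from the first row of the results above
Score    <- 8.1875979
Score.SE <- 7.770130

# score-test approximations of the effect size and its standard error
Est    <- Score / Score.SE^2   # compare to the reported Est, 0.135612569
Est.SE <- 1 / Score.SE         # compare to the reported Est.SE, 0.12869798
```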

### Examine the results

A lot of the variants we tested are very rare, i.e., the alternate allele is not observed for many samples. Single-variant tests do not perform well for very rare variants (we discuss testing rare variants in more detail later). We can use the minor allele count (MAC) observed in the sample to filter out rare variants that we may expect to have unreliable test results (e.g. MAC < 20). The MAC filter you will want to use in practice will depend on your sample size.

```r
summary(assoc$MAC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.0    36.0   229.4   396.0  1040.0

sum(assoc$MAC < 20)
## [1] 12785

# filter out the rarest variants
assoc <- assoc[assoc$MAC >= 20, ]
dim(assoc)
## [1] 14836    14
```

We make a QQ plot to examine the distribution of \(p\)-values.

```r
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

qqPlot(assoc$Score.pval)
```

We make a Manhattan plot of the \(-log_{10}(p)\)-values using the `manhattanPlot` function from the GWASTools package to visualize the association signals.

```r
GWASTools::manhattanPlot(assoc$Score.pval,
                         chromosome = assoc$chr,
                         thinThreshold = 1e-4,
                         ylim = c(0, 12))
```

We should expect the majority of variants to fall near the red \(y=x\) line in the QQ plot. Deviation above the line, commonly referred to as "inflation", is typically indicative of some model issue (e.g. unaccounted-for population structure or relatedness). In this example, the appearance of inflation is caused by enrichment of association signal due to the fact that we have two genome-wide significant signals (i.e. \(p < 5 \times 10^{-8}\)) and our dataset only has a small number of variants on chromosome 1.

+
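Inflation is often summarized by the genomic control lambda: the median of the observed association chi-squared statistics divided by the median of the null \(\chi^2_1\) distribution. A minimal sketch (not part of the original tutorial code) computing it from a vector of \(p\)-values:

```r
# genomic control lambda from a vector of p-values
gcLambda <- function(pval) {
    chisq <- qchisq(pval, df = 1, lower.tail = FALSE)
    median(chisq) / qchisq(0.5, df = 1)
}

# under the null (uniform p-values), lambda should be close to 1
set.seed(42)
gcLambda(runif(1e5))
```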

From looking at the results for the variants that reached genome-wide significance, we see that 6 variants at two different loci have \(p < 5 \times 10^{-8}\):

```r
assoc[assoc$Score.pval < 5e-8, ]
##       variant.id chr       pos allele.index n.obs      freq MAC     Score
## 1938        2370   1  25044607            1  1040 0.1692308 352  71.07880
## 2003        2458   1  25046749            1  1040 0.1500000 312  69.35044
## 2011        2473   1  25047225            1  1040 0.1927885 401  71.95763
## 23169      31443   1 212951423            1  1040 0.1091346 227  63.18412
## 23185      31476   1 212952357            1  1040 0.5254808 987 108.91489
## 23271      31614   1 212956321            1  1040 0.4043269 841 110.53196
##       Score.SE Score.Stat   Score.pval       Est     Est.SE        PVE
## 1938  12.46025   5.704443 1.167240e-08 0.4578113 0.08025520 0.03159288
## 2003  11.99336   5.782402 7.364128e-09 0.4821336 0.08337946 0.03246230
## 2011  13.06207   5.508899 3.610844e-08 0.4217478 0.07655754 0.02946405
## 23169 10.23968   6.170516 6.806767e-10 0.6026081 0.09765927 0.03696627
## 23185 16.86561   6.457808 1.062305e-10 0.3828979 0.05929224 0.04048862
## 23271 16.11184   6.860292 6.871974e-12 0.4257919 0.06206614 0.04569282
```

## Exercise 2.1 (Application)

Use the `GENESIS Null Model` app on the BioData Catalyst powered by Seven Bridges platform to fit a null model for trait_1, adjusting for sex, age, ancestry, and kinship in the model, using the example 1000 Genomes data. You can use the PCs and kinship matrix computed using the PC-AiR and PC-Relate apps in the additional Population Structure and Relatedness tutorial's exercises as inputs to this analysis. The steps to perform this analysis are as follows:

- Copy the app to your project if it is not already there:
  - Click: Public Resources > Workflows and Tools > Browse
  - Search for GENESIS Null Model
  - Click: Copy > Select your project > Copy
- Run the analysis in your project:
  - Click: Apps > GENESIS Null Model > Run
  - Specify the Inputs:
    - Phenotype file: `pheno_annotated.RData`
    - PCA file: `1KG_phase3_GRCh38_subset_pca.RData`
    - Relatedness matrix file: `1KG_phase3_subset_GRCh38_pcrelate_Matrix.RData`
  - Specify the App Settings:
    - Covariates: age, sex (each as a different term)
    - Family: gaussian
    - Number of PCs to include as covariates: 7
    - Outcome: trait_1
    - Two stage model: FALSE
    - Output prefix: "1KG_trait_1" (or any other string to name the output file)
  - Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be a `<output_prefix>_null_model.RData` file that contains the null model fit, a `<output_prefix>_phenotypes.RData` file with the phenotype data used in the analysis, and `<output_prefix>_report.Rmd` and `<output_prefix>_report.html` files with model diagnostics. Review the .html report; which covariates have significant (\(p < 0.05\)) associations with trait_1 in the null model?

You can find the expected output of this analysis by looking at the existing task `06 Null Model trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.

## Solution 2.1 (Application)

From looking at the .html report, we see that PC1, PC2, PC3, and PC6 have significant associations with trait_1 in our null model.

## Exercise 2.2 (Application)

Use the `GENESIS Single Variant Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform a GWAS for trait_1 using the null model fit in the previous exercise. Use the genotype data in the genome-wide GDS files you created previously. The steps to perform this analysis are as follows:

- Copy the app to your project if it is not already there:
  - Click: Public Resources > Workflows and Tools > Browse
  - Search for GENESIS Single Variant Association Testing
  - Click: Copy > Select your project > Copy
- Run the analysis in your project:
  - Click: Apps > GENESIS Single Variant Association Testing > Run
  - Specify the Inputs:
    - GDS Files: `1KG_phase3_GRCh38_subset_chr<CHR>.gds` (select all 22 chromosomes)
    - Null model file: `1KG_trait_1_null_model.RData`
    - Phenotype file: `1KG_trait_1_phenotypes.RData` (use the phenotype file created by the Null Model app)
  - Specify the App Settings:
    - MAC threshold: 5
    - Test type: score
    - memory GB: 32 (increase to make sure enough memory is available)
    - Output prefix: "1KG_trait_1_assoc" (or any other string to name the output file)
  - Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be 22 `<output_prefix>_chr<CHR>.RData` files with the association test results for each chromosome, as well as a `<output_prefix>_manh.png` file with the Manhattan plot and a `<output_prefix>_qq.png` file with the QQ plot. Review the QQ and Manhattan plots; is there evidence of genomic inflation?

You can find the expected output of this analysis by looking at the existing task `07 Single Variant Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.

## Solution 2.2 (Application)

From looking at the QQ plot, we see that the genomic control lambda = 1.074 and there is some deviation from the \(y=x\) line, both indicative of moderate inflation in our analysis. This is likely an artifact of looking at rare variants with a small sample size.

## Exercise 2.3 (Application)

Use the `GENESIS Association results plotting` app on the BioData Catalyst powered by Seven Bridges platform to make additional QQ plots of the single variant association results binned by MAF: \(0-0.5\%\), \(0.5-1\%\), \(1-5\%\), \(\geq 5\%\). The steps to perform this analysis are as follows:

- Copy the app to your project if it is not already there:
  - Click: Public Resources > Workflows and Tools > Browse
  - Search for GENESIS Association results plotting
  - Click: Copy > Select your project > Copy
- Run the analysis in your project:
  - Click: Apps > GENESIS Association results plotting > Run
  - Specify the Inputs:
    - Results from association testing: `1KG_trait_1_assoc_chr<CHR>.RData` (select all 22 chromosomes)
  - Specify the App Settings:
    - Association Type: single
    - QQ MAF bins: "0.005 0.01 0.05"
  - Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be a `<output_prefix>_qq_bymaf.png` file. Look at the QQ plots by MAF bin; how do they compare to the overall QQ plot of all variants?

+
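Locally, the same MAF-binned QQ plots could be sketched with the `qqPlot` helper defined earlier in this tutorial (a sketch assuming the `assoc` results data.frame from above; the MAF is derived from the effect allele frequency):

```r
# derive minor allele frequency from the effect allele frequency
maf <- pmin(assoc$freq, 1 - assoc$freq)

# assign each variant to a MAF bin matching the app settings
bins <- cut(maf, breaks = c(0, 0.005, 0.01, 0.05, 0.5),
            labels = c("0-0.5%", "0.5-1%", "1-5%", ">=5%"))

# one QQ plot per MAF bin
plots <- lapply(split(assoc$Score.pval, bins), qqPlot)
```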

You can find the expected output of this analysis by looking at the existing task `08 Single Variant Association Plots trait_1` in the Tasks menu of your Project, so you do not need to wait for your analysis to finish to look at the output.

## Solution 2.3 (Application)

From the binned QQ plots, we see that the common variants (i.e. MAF \(\geq 5\%\)) have a genomic control lambda = 1.007 and follow along the \(y=x\) line. As suspected, the inflation is only present in the rarer variants, likely due to the small sample size.

diff --git a/03.A_GENESIS_model_explorer.Rmd b/03.A_GENESIS_model_explorer.Rmd
new file mode 100644
index 0000000..cf6bd47
--- /dev/null
+++ b/03.A_GENESIS_model_explorer.Rmd
@@ -0,0 +1,142 @@

# 3.A. Exploring Association Results

In this tutorial, we will learn how to use the [GENESIS Model Explorer App](https://genesis-model-explorer-app.bdc.sb-webapp.com/?project=smgogarten/uw-gac-commit), which is an "Interactive Browser" built with [R Shiny](https://shiny.rstudio.com/) on the NHLBI BioData Catalyst powered by Seven Bridges cloud platform.

## GENESIS Model Explorer App

The [GENESIS Model Explorer App](https://genesis-model-explorer-app.bdc.sb-webapp.com/?project=smgogarten/uw-gac-commit) is an interactive tool that enables users to make figures to visualize and explore the results of a GENESIS null model, paired with phenotype and genotype data on the same samples. It is meant to provide an intuitive interface for researchers to easily select, visualize, and explore phenotypes, genotypes, and a fitted GENESIS model interactively with no prior R programming knowledge. The app takes three inputs:

- **Null Model File:** The null model file should be any fitted GENESIS null model saved in .RData format. The null model could have been created interactively using the `fitNullModel` function in an R session (e.g. in Data Studio or on your local machine), or it could be the output from the `GENESIS Null Model` application.
- **Phenotype File:** The phenotype file should be a data.frame or `AnnotatedDataFrame` saved in .RData format. The data.frame must contain all of the samples included in your null model file in a column named `sample.id`, with additional columns containing phenotype variables of interest. If you used the `GENESIS Null Model` application to fit your null model, we recommend using the `_phenotypes.RData` output file, which contains all of the phenotype data from all of the samples used in the analysis.
Alternatively, you can use the same phenotype file used as input to fit your null model, or an entirely new file where you have added additional columns with phenotype variables of interest.
- **Genotype File (Optional):** Providing an optional genotype file allows the user to make figures looking at the relationships of variants of interest with null model variables and phenotypes of interest. The genotype file should be a data.frame saved in .rds format. The data.frame must contain all of the samples included in your null model file in a column named `sample.id`, with additional columns containing variant allele counts or dosages. Conveniently, this file can be generated from an existing GDS file with the `GDS Genotype Extractor` application (see below).

We will now use the [GENESIS Model Explorer](https://genesis-model-explorer-app.bdc.sb-webapp.com/?project=smgogarten/uw-gac-commit) to make some figures exploring the data:

- Launch the interactive browser
  - From the top menu, click "Public Resources" > "Interactive Web Apps"
  - Click: "Open" on the GENESIS Model Explorer App
  - Click: "Yes" to proceed
  - Click: "Get Started"
- Load Data
  - Null Model File
    - Project: select your SISG project (should be chosen by default if you launched the app as described)
    - Current File: select `1KG_trait_1_null_model_reportonly.RData` (much smaller file without extra matrices required for computing association test statistics)
  - Phenotype File:
    - Project: select your SISG project
    - Current File: select `1KG_trait_1_phenotypes.RData` (this is the phenotype file that was created by the null model application)
  - Click: Load Data

Once you load the data, you will be taken to a "Plot setup" screen, where you can select what is plotted. We will make a few different plots. Once you've selected your variables, click "Generate Plot" to render the figure. To make a new plot, change the parameters and click "Generate Plot" again.
- Outcome Histogram
  - x-axis: Model: outcome
- Outcome Density Plot
  - x-axis: Model: outcome
  - plot type: density plot
- Scatterplot of Residuals vs Fitted Values
  - x-axis: Model: fitted.values
  - y-axis: Model: resid.marginal
  - plot type: scatterplot
  - Additional Options
    - Add y = 0 line
    - Add smooth line
- Boxplot of trait_1 by sex
  - x-axis: Phenotype: sex
  - y-axis: Phenotype: trait_1
  - plot type: boxplot
- Scatterplot of trait_1 vs age, grouped by sex (sex indicated by color)
  - x-axis: Phenotype: age
  - y-axis: Phenotype: trait_1
  - plot type: scatterplot
  - group by: sex
- Scatterplot of trait_1 vs age, faceted by sex (each sex in its own panel)
  - x-axis: Phenotype: age
  - y-axis: Phenotype: trait_1
  - plot type: scatterplot
  - facet by: Phenotype: sex

## Extracting Sample Genotypes from a GDS

Perhaps we want to look at the relationship between the genotype values of our association study "hits" and our phenotypes or model residuals. The GENESIS Model Explorer can do this as well if we provide the optional Genotype file with sample genotype values for the variants of interest. Conveniently, this file can be generated from an existing GDS file with the `GDS Genotype Extractor` application.

First, let's identify a few variants to use for this demonstration. After running an Application (e.g. GENESIS Single Variant Association Testing) on the BioData Catalyst Powered by Seven Bridges platform, the output files are saved in the directory `/sbgenomics/project-files/`. You can load these files into RStudio to explore them interactively; here, we load the chromosome 19 single variant association test results from the task `07 Single Variant Association Test trait_1` and identify the most significant variant.
+ +```{r, eval = FALSE} +# load the association results +assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_assoc_chr19.RData')) + +# variant with minimum p-value +assoc[which.min(assoc$Score.pval), ] +``` + +We need to create a "variant include file" with the `variant.id` of this variant as input for the `GDS Genotype Extractor` application. The variant include file should be saved as an .rds file using the `saveRDS` function. + +```{r, eval = FALSE} +varid <- '1070473' +saveRDS(varid, file = '/sbgenomics/output-files/1KG_trait_1_chr19_variant_include.rds') +``` + +We also need to create a "sample include file" with the `sample.id` of all the samples included in our analysis as input for the `GDS Genotype Extractor` application. We can get these `sample.id` values from our fitted null model -- we can use the `_null_model_reportonly.RData` file, which is much smaller than the `_null_model.RData` file by excluding some large matrices only needed for computing association test results. The sample include file should also be saved as an .rds file using the `saveRDS` function. + +```{r, eval = FALSE} +nullmod <- get(load('/sbgenomics/project-files/1KG_trait_1_null_model_reportonly.RData')) + +# the sample.id are stored in the "fit" data.frame +head(nullmod$fit) + +sampid <- nullmod$fit$sample.id +length(sampid) +``` + +```{r, eval = FALSE} +saveRDS(sampid, file = '/sbgenomics/output-files/1KG_trait_1_sample_include.rds') +``` + +**Note about Directories:** The working directory for the Data Studio is `sbgenomics/workspace/` (you can see this by going to the Terminal in RStudio and typing `pwd`). This directory is accessible in the Data Studio, but the Applications in your Project can **not** see files here. Applications only see the `sbgenomics/project-files/` directory (which is read-only from the Data Studio). In order to make our variant include file visible to the `GDS Genotype Extractor` application, we save our file to the `/sbgenomics/output-files/` directory.
When we stop our RStudio session (and only then), new files in the `/sbgenomics/output-files/` directory will be copied over to the `/sbgenomics/project-files/` directory, making them available to Applications. More details can be found in the platform [documentation](https://sb-biodatacatalyst.readme.io/docs/about-files-in-a-data-cruncher-analysis). + + +## Exercise 3.A.1 (Application) + +Use the `GDS Genotype Extractor` app on the BioData Catalyst powered by Seven Bridges platform to create an .rds file with genotype values for all samples in our `1KG_phase3_GRCh38_subset` GDS files at the lead variant on chromosome 19 we identified above. The steps to perform this analysis are as follows: + +- Copy the app to your project if it is not already there: + - Click: Public Resources > Workflows and Tools > Browse + - Search for `GDS Genotype Extractor` + - Click: Copy > Select your project > Copy +- Run the analysis in your project: + - Click: Apps > `GDS Genotype Extractor` > Run + - Specify the Inputs: + - GDS file: `1KG_phase3_GRCh38_subset_chr19.gds` + - Sample include file: `1KG_trait_1_sample_include.rds` + - Variant include file: `1KG_trait_1_chr19_variant_include.rds` + - Output prefix: "1KG_trait_1_chr19" (or any other string to name the output file) + - Click: Run + +The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed. + +The output of this analysis will be a `_genotypes.rds` file that contains a column of `sample.id` and then one column per variant with the genotype values for each sample, and a `_variant_info.rds` file with one row per variant and columns providing details such as variant identifiers, chromosome, position, and ref and alt alleles. + +You can find the expected output of this analysis by looking at the existing task `GDS Genotype Extractor Chr19` in the Tasks menu of your Project. 
The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise. + + +## Exercise 3.A.2 (GENESIS Model Explorer) + +Use the GENESIS Model Explorer to make a boxplot of the Cholesky residuals (`resid.cholesky`) from the 1KG trait_1 null model by genotype value of the variant at chr19:45084084 A>G. The Cholesky residuals are a transformation of the marginal residuals computed using the estimated model covariance structure to remove the correlation among samples. The correlation of these residuals with the genotype values is essentially what the score statistic is measuring when we perform our association tests. What do you observe in the boxplot? + +### Solution 3.A.2 (GENESIS Model Explorer) + +- In the GENESIS Model Explorer window, click back on the "Load Data" tab and add the following file: + - Genotype File: + - Project: select your SISG project + - Current File: select `1KG_trait_1_chr19_genotypes.rds` (this is the genotype file we created in Exercise 3.A.1) + - Click: Load Data +- Set the plotting parameters as follows: + - x-axis: Genotype: chr19:45084084_A_G + - y-axis: Model: resid.cholesky + - plot type: boxplot + - Additional Options + - Add y = 0 line +- Click "Generate plot" + +From the boxplot, we can see that there is an "upward" trend in the median residual value across genotypes. The values 0/1/2 of the genotype value correspond to the number of copies of the alternate allele (in this case, the G allele), so we observe that having more copies of the G allele is associated with higher values of trait_1, after adjusting for the covariates in our model. This is consistent with the `Score` value for this variant from our association test, which also has a positive value.
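
If you prefer to work in RStudio rather than the app, the same boxplot can be sketched directly with ggplot2. This is a hypothetical sketch, not part of the app workflow: it assumes the `_reportonly` null model and `_genotypes.rds` files created above are available in the project files, that `nullmod$fit` contains a `resid.cholesky` column, and that the genotype column is named `chr19:45084084_A_G` (check the actual name with `names(geno)` after loading).

```{r, eval = FALSE}
# a sketch reproducing the Solution 3.A.2 boxplot in R
# (file paths and the genotype column name are assumptions -- verify them)
library(ggplot2)

nullmod <- get(load('/sbgenomics/project-files/1KG_trait_1_null_model_reportonly.RData'))
geno <- readRDS('/sbgenomics/project-files/1KG_trait_1_chr19_genotypes.rds')

# merge model residuals with genotype values on sample.id
pdat <- merge(nullmod$fit, geno, by = 'sample.id')

# boxplot of Cholesky residuals by genotype value (0/1/2 copies of the G allele)
ggplot(pdat, aes(x = factor(`chr19:45084084_A_G`), y = resid.cholesky)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  labs(x = 'chr19:45084084 A>G genotype (copies of G)', y = 'Cholesky residuals')
```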
diff --git a/03.A_GENESIS_model_explorer.html b/03.A_GENESIS_model_explorer.html new file mode 100644 index 0000000..54a3419 --- /dev/null +++ b/03.A_GENESIS_model_explorer.html @@ -0,0 +1,684 @@ + + + + + + + + + + + + + +03.A_GENESIS_model_explorer.knit + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

3.A. Exploring Association Results

+

In this tutorial, we will learn how to use the GENESIS +Model Explorer App, which is an “Interactive Browser” built with R Shiny on the NHLBI BioData +Catalyst powered by Seven Bridges cloud platform.

+
+

GENESIS Model Explorer App

+

The GENESIS +Model Explorer App is an interactive tool that enables users to make +figures to visualize and explore the results of a GENESIS null model, +paired with phenotype and genotype data on the same samples. It is meant +to provide an intuitive interface for researchers to easily select, +visualize, and explore phenotypes, genotypes, and a fitted GENESIS model +interactively with no prior R programming knowledge. The app takes three +inputs:

+
    +
  • Null Model File: The null model file should be any +fitted GENESIS null model saved in .RData format. The null model could +have been created interactively using the fitNullModel +function in an R session (e.g. in Data Studio or on your local machine), +or it could be the output from the GENESIS Null Model +application.
  • +
  • Phenotype File: The phenotype file should be a +data.frame or AnnotatedDataFrame saved in .RData format. +The data.frame must contain all of the samples included in your null +model file in a column named sample.id, with additional +columns containing phenotype variables of interest. If you used the +GENESIS Null Model application to fit your null model, we +recommend using the <output_prefix>_phenotypes.RData +output file, which contains all of the phenotype data from all of the +samples used in the analysis. Alternatively, you can use the same +phenotype file used as input to fit your null model, or an entirely new +file where you have added additional columns with phenotype variables of +interest.
  • +
  • Genotype File (Optional): Providing an optional +genotype file allows the user to make figures looking at the +relationships of variants of interest with null model variables and +phenotypes of interest. The genotype file should be a data.frame saved +in .rds format. The data.frame must contain all of the samples included +in your null model file in a column named sample.id, with +additional columns containing variant allele counts or dosages. +Conveniently, this file can be generated from an existing GDS file with +the GDS Genotype Extractor application (see below).
  • +
+

We will now use the GENESIS +Model Explorer to make some figures exploring the data:

+
    +
  • Launch the interactive browser +
      +
    • From the top menu, click “Public Resources” > “Interactive Web +Apps”
    • +
    • Click: “Open” on the GENESIS Model Explorer App
    • +
    • Click: “Yes” to proceed
    • +
    • Click: “Get Started”
    • +
  • +
  • Load Data +
      +
    • Null Model File +
        +
      • Project: select your SISG project (should be chosen by default if +you launched the app as described)
      • +
      • Current File: select +1KG_trait_1_null_model_reportonly.RData (much smaller file +without extra matrices required for computing association test +statistics)
      • +
    • +
    • Phenotype File: +
        +
      • Project: select your SISG project
      • +
      • Current File: select 1KG_trait_1_phenotypes.RData (this +is the phenotype file that was created by the null model +application)
      • +
    • +
    • Click: Load Data
    • +
  • +
+

Once you load the data, you will be taken to a “Plot setup” screen, +where you can select what is plotted. We will make a few different +plots. Once you’ve selected your variables, click “Generate Plot” to +render the figure. To make a new plot, change the parameters and click +“Generate Plot” again.

+
    +
  • Outcome Histogram +
      +
    • x-axis: Model: outcome
    • +
  • +
  • Outcome Density Plot +
      +
    • x-axis: Model: outcome
    • +
    • plot type: density plot
    • +
  • +
  • Scatterplot of Residuals vs Fitted Values +
      +
    • x-axis: Model: fitted.values
    • +
    • y-axis: Model: resid.marginal
    • +
    • plot type: scatterplot
    • +
    • Additional Options +
        +
      • Add y = 0 line
      • +
      • Add smooth line
      • +
    • +
  • +
  • Boxplot of trait_1 by sex +
      +
    • x-axis: Phenotype: sex
    • +
    • y-axis: Phenotype: trait_1
    • +
    • plot type: boxplot
    • +
  • +
  • Scatterplot of trait_1 vs age, grouped by sex (sex indicated by +color) +
      +
    • x-axis: Phenotype: age
    • +
    • y-axis: Phenotype: trait_1
    • +
    • plot type: scatterplot
    • +
    • group by: sex
    • +
  • +
  • Scatterplot of trait_1 vs age, faceted by sex (each sex in its own +panel) +
      +
    • x-axis: Phenotype: age
    • +
    • y-axis: Phenotype: trait_1
    • +
    • plot type: scatterplot
    • +
    • facet by: Phenotype: sex
    • +
  • +
+
+
+

Extracting Sample Genotypes from a GDS

+

Perhaps we want to look at the relationship between the genotype +values of our association study “hits” and our phenotypes or model +residuals. The GENESIS Model Explorer can do this as well if we provide +the optional Genotype file with sample genotype values for the variants +of interest. Conveniently, this file can be generated from an existing +GDS file with the GDS Genotype Extractor application.

+

First, let’s identify a few variants to use for this demonstration. +After running an Application (e.g. GENESIS Single Variant Association +Testing) on the BioData Catalyst Powered by Seven Bridges platform, the +output files are saved in the directory +/sbgenomics/project-files/. You can load these files into +RStudio to explore them interactively – we load the chromosome 19 single +variant association test results from the task +07 Single Variant Association Test trait_1, and identify +the most significant variant.

+
# load the association results
+assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_assoc_chr19.RData'))
+
+# variant with minimum p-value
+assoc[which.min(assoc$Score.pval), ]
+

We need to create a “variant include file” with the +variant.id of this variant as input for the +GDS Genotype Extractor application. The variant include file +should be saved as an .rds file using the saveRDS +function.

+
varid <- '1070473'
+saveRDS(varid, file = '/sbgenomics/output-files/1KG_trait_1_chr19_variant_include.rds')
+

We also need to create a “sample include file” with the +sample.id of all the samples included in our analysis as +input for the GDS Genotype Extractor application. We can +get these sample.id values from our fitted null model – we +can use the +<output_prefix>_null_model_reportonly.RData file, +which is much smaller than the +<output_prefix>_null_model.RData file by excluding +some large matrices only needed for computing association test results. +The sample include file should also be saved as an .rds file using the +saveRDS function.

+
nullmod <- get(load('/sbgenomics/project-files/1KG_trait_1_null_model_reportonly.RData'))
+
+# the sample.id are stored in the "fit" data.frame
+head(nullmod$fit)
+
+sampid <- nullmod$fit$sample.id
+length(sampid)
+
saveRDS(sampid, file = '/sbgenomics/output-files/1KG_trait_1_sample_include.rds')
+

Note about Directories: The working directory for +the Data Studio is sbgenomics/workspace/ (you can see this +by going to the Terminal in RStudio and typing pwd). This +directory is accessible in the Data Studio, but the Applications in your +Project can not see files here. Applications only see +the sbgenomics/project-files/ directory (which is read-only +from the Data Studio). In order to make our variant include file visible +to the GDS Genotype Extractor application, we save our file +to the /sbgenomics/output-files/ directory. When we stop +our RStudio session (and only then), new files in the +/sbgenomics/output-files/ directory will be copied over to +the /sbgenomics/project-files/ directory, making them +available to Applications. More details can be found in the platform documentation.

+
+
+

Exercise 3.A.1 (Application)

+

Use the GDS Genotype Extractor app on the BioData +Catalyst powered by Seven Bridges platform to create an .rds file with +genotype values for all samples in our +1KG_phase3_GRCh38_subset GDS files at the lead variant on +chromosome 19 we identified above. The steps to perform this analysis +are as follows:

+
    +
  • Copy the app to your project if it is not already there: +
      +
    • Click: Public Resources > Workflows and Tools > Browse
    • +
    • Search for GDS Genotype Extractor
    • +
    • Click: Copy > Select your project > Copy
    • +
  • +
  • Run the analysis in your project: +
      +
    • Click: Apps > GDS Genotype Extractor > Run
    • +
    • Specify the Inputs: +
        +
      • GDS file: 1KG_phase3_GRCh38_subset_chr19.gds
      • +
      • Sample include file: +1KG_trait_1_sample_include.rds
      • +
      • Variant include file: +1KG_trait_1_chr19_variant_include.rds
      • +
      • Output prefix: “1KG_trait_1_chr19” (or any other string to name the +output file)
      • +
    • +
    • Click: Run
    • +
  • +
+

The analysis will take a few minutes to run. You can find your +analysis in the Tasks menu of your Project to check on its progress and +see the results once it has completed.

+

The output of this analysis will be a +<output_prefix>_genotypes.rds file that contains a +column of sample.id and then one column per variant with +the genotype values for each sample, and a +<output_prefix>_variant_info.rds file with one row +per variant and columns providing details such as variant identifiers, +chromosome, position, and ref and alt alleles.

+

You can find the expected output of this analysis by looking at the +existing task GDS Genotype Extractor Chr19 in the Tasks +menu of your Project. The output files are available in the Project, so +you do not need to wait for your analysis to finish to move to the next +exercise.

+
+
+

Exercise 3.A.2 (GENESIS Model Explorer)

+

Use the GENESIS Model Explorer to make a boxplot of the Cholesky +residuals (resid.cholesky) from the 1KG trait_1 null model +by genotype value of the variant at chr19:45084084 A>G. The Cholesky +residuals are a transformation of the marginal residuals computed using +the estimated model covariance structure to remove the correlation among +samples. The correlation of these residuals with the genotype values is +essentially what the score statistic is measuring when we perform our +association tests. What do you observe in the boxplot?

+
+

Solution 3.A.2 (GENESIS Model Explorer)

+
    +
  • In the GENESIS Model Explorer window, click back on the “Load Data” +tab and add the following file: +
      +
    • Genotype File: +
        +
      • Project: select your SISG project
      • +
      • Current File: select 1KG_trait_1_chr19_genotypes.rds +(this is the genotype file we created in Exercise 3.A.1)
      • +
    • +
    • Click: Load Data
    • +
  • +
  • Set the plotting parameters as follows: +
      +
    • x-axis: Genotype: chr19:45084084_A_G
    • +
    • y-axis: Model: resid.cholesky
    • +
    • plot type: boxplot
    • +
    • Additional Options +
        +
      • Add y = 0 line
      • +
    • +
  • +
  • Click “Generate plot”
  • +
+

From the boxplot, we can see that there is an “upward” trend in the +median residual value across genotypes. The values 0/1/2 of the genotype +value correspond to the number of copies of the alternate allele (in +this case, the G allele), so we observe that having more copies of the G +allele is associated with higher values of trait_1, after adjusting for +the covariates in our model. This is consistent with the +Score value for this variant from our association test, +which also has a positive value.

+
+
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/03_advanced_GWAS.Rmd b/03_advanced_GWAS.Rmd new file mode 100644 index 0000000..7e06b78 --- /dev/null +++ b/03_advanced_GWAS.Rmd @@ -0,0 +1,287 @@ +# 3. Advanced GWAS Models + +This tutorial extends what was previously introduced in the `02_GWAS.Rmd` tutorial to more advanced models using the [GENESIS](https://bioconductor.org/packages/release/bioc/html/GENESIS.html) R/Bioconductor package. + +## Sparse Kinship Matrix + +Recall that fitting the null model uses a kinship matrix (KM) that captures the genetic correlation among samples. In the `02_GWAS.Rmd` tutorial, we used a dense KM that has non-zero values for all entries in the matrix. This works well with small sample sizes (like we have here), but can require a lot of memory and be very computationally demanding with large samples. + +In large population based samples, we can make an empirical KM sparse by zeroing out small values that are near 0. When creating a PC-Relate KM with `pcrelateToMatrix` this can be done by setting the `thresh` parameter equal to the smallest non-zero value to keep in the matrix. Alternatively, we can use the `makeSparseMatrix` function from the GENESIS package to sparsify an existing matrix into a sparse block-diagonal matrix. + +```{r, sparse} +repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main" +if (!dir.exists("data")) dir.create("data") + +library(GENESIS) + +# load the PC-Relate KM +kmfile <- "data/pcrelate_Matrix.RData" +if (!file.exists(kmfile)) download.file(file.path(repo_path, kmfile), kmfile) +km <- get(load(kmfile)) +dim(km) +km[1:5,1:5] + +# make the KM sparse at 4th degree relatedness +skm <- makeSparseMatrix(km, thresh = 2*2^(-11/2)) +dim(skm) +skm[1:5,1:5] +``` +This sparse KM can be used in the null model in place of the original dense KM that we used in the `02_GWAS.Rmd` tutorial -- exercise left to the reader. 
+ + +## Two Stage Model + +As discussed in the lecture, we recommend a fully adjusted two-stage inverse Normalization procedure for fitting the null model for quantitative traits, particularly when the outcome has a non-Normal distribution. See [Sofer et al. (2019)](https://onlinelibrary.wiley.com/doi/10.1002/gepi.22188) for more information on the fully adjusted two-stage model. + +### Phenotype Data + +First we load the `AnnotatedDataFrame` with the phenotype data that we prepared previously. + +```{r, message = FALSE} +library(Biobase) +``` + +```{r} +# pheno data +annotfile <- "data/pheno_annotated_pcs.RData" +if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile) +annot <- get(load(annotfile)) +head(pData(annot)) +``` + +For this section of the tutorial, we will be analyzing `trait_2`. Make a histogram of the trait_2 values -- what do you notice about the distribution? + +```{r} +library(ggplot2) +ggplot(pData(annot), aes(x = trait_2)) + geom_histogram() +``` + +### Standard Analysis + +First, let's run the GWAS using the standard LMM. Recall that we first fit the null model and then perform single variant score tests. + +#### Null Model + +We use the `fitNullModel` function from GENESIS. We need to specify the `AnnotatedDataFrame` with the phenotype data (annot), the outcome variable (trait_2), and the fixed effect covariates (sex, age, and PCs 1-7). We also include the kinship matrix to specify the covariance structure of the polygenic random effect in the model. + +```{r} +nullmod <- fitNullModel(annot, + outcome = "trait_2", + covars=c("sex", "age", paste0("PC", c(1:7))), + cov.mat=km, + verbose=TRUE) +``` + +#### Association Tests + +After fitting the null model, we use single-variant score tests to test each variant across the genome separately for association with the outcome, accounting for genetic ancestry and genetic relatedness among the samples. 
Recall that we make a `SeqVarData` iterator object that links the genotype data in the GDS file with the phenotype data in our `AnnotatedDataFrame` and use the `assocTestSingle` function from GENESIS to perform the association tests. + +```{r, message = FALSE} +library(SeqVarTools) + +# open a connection to the GDS file +gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds" +if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile) +gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open +gds <- seqOpen(gdsfile) + +# make the seqVarData object +seqData <- SeqVarData(gds, sampleData=annot) + +# make the iterator object +iterator <- SeqVarBlockIterator(seqData, verbose=FALSE) + +# run the single-variant association test +assoc <- assocTestSingle(iterator, + null.model = nullmod, + test = "Score") +dim(assoc) +``` + +We make a QQ plot of all variants with MAC >= 5. What do you notice about this QQ plot? + +```{r} +qqPlot <- function(pval) { + pval <- pval[!is.na(pval)] + n <- length(pval) + x <- 1:n + dat <- data.frame(obs=sort(pval), + exp=x/n, + upper=qbeta(0.025, x, rev(x)), + lower=qbeta(0.975, x, rev(x))) + + ggplot(dat, aes(-log10(exp), -log10(obs))) + + geom_line(aes(-log10(exp), -log10(upper)), color="gray") + + geom_line(aes(-log10(exp), -log10(lower)), color="gray") + + geom_point() + + geom_abline(intercept=0, slope=1, color="red") + + xlab(expression(paste(-log[10], "(expected P)"))) + + ylab(expression(paste(-log[10], "(observed P)"))) + + theme_bw() +} + +# make a QQ plot +qqPlot(assoc$Score.pval[assoc$MAC >= 5]) +``` + +### Fully-Adjusted Two-Stage Model Analysis + +Now, let's run the GWAS using the fully-adjusted two-stage LMM. + +#### Null Model + +To run the fully-adjusted two-stage null model, we simply set the `two.stage` option to `TRUE`. 
+ +```{r} +# fit the two stage null model +nullmod.twostage <- fitNullModel(annot, + outcome = "trait_2", + covars=c("sex", "age", paste0("PC", c(1:7))), + cov.mat=km, + two.stage = TRUE, + verbose=TRUE) +``` + +Notice that the messages from the function show that the model was fit twice -- first with the original outcome and second with the inverse-Normal transformed residuals. + +```{r} +# description of the model we fit +nullmod.twostage$model +``` + +From the model description, we can see this is a two stage model because the formula element has `rankInvNorm(resid(trait_2))` as the outcome variable. + +#### Association Tests + +After fitting the two-stage null model, the association testing procedure is exactly the same. Since we've already created our `SeqVarData` iterator, we do not need to create it again; however, we do need to "reset" it to the first block of data. + +```{r, message = FALSE} +# reset the filter to the first block +seqResetFilter(iterator, verbose = FALSE) + +# run the single-variant association test +assoc.twostage <- assocTestSingle(iterator, + null.model = nullmod.twostage, + test = "Score") +dim(assoc.twostage) +``` + +We make a QQ plot of all variants with MAC >= 5. What do you notice about this QQ plot compared to the one from the standard null model? + +```{r} +# make a QQ plot +qqPlot(assoc.twostage$Score.pval[assoc.twostage$MAC >= 5]) +``` + +From the QQ plot, we see that using the fully-adjusted two-stage model substantially decreased the amount of inflation in the test statistics. 
We know that this is because of the non-Normality / skewness of trait_2, but we can compare the marginal residuals from both null models to see how their distributions look after covariate adjustment. + +```{r null_model_two_stage} +# merge the data for plotting +pdat <- merge(nullmod$fit, nullmod.twostage$fit, + by = 'sample.id', suffixes = c('.orig', '.twostage')) +head(pdat, 2) + +# distribution of residuals - original null model +ggplot(pdat, aes(x = resid.marginal.orig)) + geom_histogram() + +# distribution of residuals - two stage null model +ggplot(pdat, aes(x = resid.marginal.twostage)) + geom_histogram() + +# compare residuals +ggplot(pdat, aes(x = resid.marginal.orig, y = resid.marginal.twostage)) + + geom_point() + + geom_abline(intercept = 0, slope = 1) + +``` + +As expected, the residuals from the original model are very skewed, while the residuals from the two-stage model are much closer to Normally distributed. The skewness of the residuals in the original model leads to inflation in variant test statistics (as seen in the QQ plots), particularly for rare or low frequency variants, where effect allele carriers with extreme residuals in the tails of the distribution can have high leverage on the score statistics. + + + +## Binary Traits + +GENESIS also supports testing binary (e.g. case/control) outcomes. + +### Phenotype Data + +The outcome `status` in the annotated phenotype data is a binary case/control variable. Look at the prevalence of this outcome: + +```{r} +table(pData(annot)$status) +``` + +### Logistic Mixed Model Analysis + +GENESIS implements the GMMAT method of approximate logistic mixed models for association testing with binary outcomes. See [Chen et al. (2016)](https://www.sciencedirect.com/science/article/pii/S000292971600063X) for more information on the GMMAT method. + +#### Null Model + +We can fit a null model using a logistic mixed model by specifying the argument `family=binomial` in the `fitNullModel` function.
As before, we need to specify the `AnnotatedDataFrame` with the phenotype data (annot), the outcome variable (status), and the fixed effect covariates (sex, age, and PCs 1-7). We still include the kinship matrix to specify the covariance structure of the polygenic random effect in the model. + +```{r} +# fit the null model with logistic mixed model +nullmod.status <- fitNullModel(annot, + outcome="status", + covars=c("sex", "age", paste0("PC", 1:7)), + cov.mat = km, + family=binomial, + verbose=TRUE) +``` + +#### Association Tests + +After fitting the logistic null model, the association testing procedure is still exactly the same. Since we've already created our `SeqVarData` iterator, we do not need to create it again; however, we do need to "reset" it to the first block of data. + +```{r, message = FALSE} +# reset the filter to the first block +seqResetFilter(iterator, verbose = FALSE) + +# run the single-variant association test +assoc.status <- assocTestSingle(iterator, + null.model = nullmod.status, + test = "Score") +dim(assoc.status) +``` + +As usual, we make a QQ plot of all variants with MAC >= 5. What do you notice about the tail of this QQ plot? + +```{r} +# make a QQ plot +qqPlot(assoc.status$Score.pval[assoc.status$MAC >= 5]) +``` + +#### SPA test + +In samples with highly imbalanced case:control ratios, the Score test can be inflated for rare and low frequency variants. Saddlepoint approximation (SPA) can be used to improve $p$-value calculations, and is available in GENESIS by setting the argument `test=Score.SPA` in `assocTestSingle`. See [Dey et al. (2017)](https://www.cell.com/ajhg/fulltext/S0002-9297(17)30201-X) and [Zhou et al. (2018)](https://www.nature.com/articles/s41588-018-0184-y) for details on using SPA in GWAS. Re-run the analysis using the SPA $p$-value calculation. 
+ +```{r, message = FALSE} +# reset the filter to the first block +seqResetFilter(iterator, verbose = FALSE) + +# run the single-variant association test +assoc.spa <- assocTestSingle(iterator, + null.model = nullmod.status, + test = "Score.SPA") +dim(assoc.spa) +head(assoc.spa) +``` +Notice that the results of this test include two new columns: `SPA.pval`, which are the $p$-values after the SPA adjustment, and `SPA.converged` which indicates if the SPA adjustment was successful (`TRUE/FALSE`); if the value is `NA` then the SPA adjustment was not applied and the original score test $p$-value is returned -- for computational efficiency, SPA is only applied when the original score test $p$-value $< 0.05$. Note that the `Score`, `Score.SE`, and `Score.Stat` are exactly the same for all variants as when using the standard score test, as the SPA adjustment only alters the $p$-value. + +```{r} +table(assoc.spa$SPA.converged, exclude = NULL) +``` + +Make a new QQ plot of all variants with MAC >= 5. How does using the SPA $p$-value adjustment affect the results? + +```{r} +# make a QQ plot +qqPlot(assoc.spa$SPA.pval[assoc.spa$MAC >= 5]) +``` + +Using the SPA $p$-value adjustment has resolved the test statistic inflation! + + +```{r} +# close the GDS file! +seqClose(seqData) +``` diff --git a/03_advanced_GWAS.html b/03_advanced_GWAS.html new file mode 100644 index 0000000..df2e8b0 --- /dev/null +++ b/03_advanced_GWAS.html @@ -0,0 +1,854 @@ + + + + + + + + + + + + + +03_advanced_GWAS.knit + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

3. Advanced GWAS Models

+

This tutorial extends what was previously introduced in the +02_GWAS.Rmd tutorial to more advanced models using the GENESIS +R/Bioconductor package.

+
+

Sparse Kinship Matrix

+

Recall that fitting the null model uses a kinship matrix (KM) that +captures the genetic correlation among samples. In the +02_GWAS.Rmd tutorial, we used a dense KM that has non-zero +values for all entries in the matrix. This works well with small sample +sizes (like we have here), but can require a lot of memory and be very +computationally demanding with large samples.

+

In large population based samples, we can make an empirical KM sparse +by zeroing out small values that are near 0. When creating a PC-Relate +KM with pcrelateToMatrix this can be done by setting the +thresh parameter equal to the smallest non-zero value to +keep in the matrix. Alternatively, we can use the +makeSparseMatrix function from the GENESIS package to +sparsify an existing matrix into a sparse block-diagonal matrix.

+
repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+library(GENESIS)
+
+# load the PC-Relate KM
+kmfile <- "data/pcrelate_Matrix.RData"
+if (!file.exists(kmfile)) download.file(file.path(repo_path, kmfile), kmfile)
+km <- get(load(kmfile))
+dim(km)
+
## [1] 1040 1040
+
km[1:5,1:5]
+
## 5 x 5 Matrix of class "dsyMatrix"
+##              HG00096       HG00097      HG00099      HG00100       HG00101
+## HG00096  0.985459993  0.0031768617  0.006724604 -0.006857234  0.0060690853
+## HG00097  0.003176862  0.9883268089 -0.007300965  0.012888182 -0.0009002438
+## HG00099  0.006724604 -0.0073009649  0.976649166  0.022746299  0.0118873586
+## HG00100 -0.006857234  0.0128881819  0.022746299  0.978507564  0.0265278283
+## HG00101  0.006069085 -0.0009002438  0.011887359  0.026527828  0.9778938334
+
# make the KM sparse at 4th degree relatedness
+skm <- makeSparseMatrix(km, thresh = 2*2^(-11/2))
+
## Using 1040 samples provided
+
## Identifying clusters of relatives...
+
##     57 relatives in 26 clusters; largest cluster = 4
+
## Creating block matrices for clusters...
+
## 983 samples with no relatives included
+
## Putting all samples together into one block diagonal matrix
+
dim(skm)
+
## [1] 1040 1040
+
skm[1:5,1:5]
+
## 5 x 5 sparse Matrix of class "dsCMatrix"
+##            HG00112    HG00123   HG00116   HG00120   HG00238
+## HG00112 0.98518404 0.09728113 .         .         .        
+## HG00123 0.09728113 0.98731569 .         .         .        
+## HG00116 .          .          0.9874224 0.1704253 .        
+## HG00120 .          .          0.1704253 0.9778078 .        
+## HG00238 .          .          .         .         0.9927059
+

This sparse KM can be used in the null model in place of the original +dense KM that we used in the 02_GWAS.Rmd tutorial – +exercise left to the reader.

+
+
+

Two Stage Model

+

As discussed in the lecture, we recommend a fully adjusted two-stage +inverse Normalization procedure for fitting the null model for +quantitative traits, particularly when the outcome has a non-Normal +distribution. See Sofer et +al. (2019) for more information on the fully adjusted two-stage +model.

+
+

Phenotype Data

+

First we load the AnnotatedDataFrame with the phenotype +data that we prepared previously.

+
library(Biobase)
+
# pheno data
+annotfile <- "data/pheno_annotated_pcs.RData"
+if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
+annot <- get(load(annotfile))
+head(pData(annot))
+
##   sample.id pop super_pop    sex age trait_1 trait_2 status        PC1
+## 1   HG00096 GBR       EUR   male  56    23.5     0.1      0 0.01800549
+## 2   HG00097 GBR       EUR female  56    23.9     0.1      0 0.01742309
+## 3   HG00099 GBR       EUR female  48    23.8     0.3      0 0.01799001
+## 4   HG00100 GBR       EUR female  54    24.9     0.2      0 0.01793750
+## 5   HG00101 GBR       EUR   male  65    27.6     0.7      1 0.01759459
+## 6   HG00102 GBR       EUR female  53    24.5     0.4      0 0.01796459
+##           PC2        PC3         PC4           PC5         PC6        PC7
+## 1 -0.03361957 0.01541131 -0.02085591 -0.0037719379 -0.04669896 0.01985608
+## 2 -0.03361182 0.01304470 -0.01992104 -0.0095947596 -0.04545408 0.01034344
+## 3 -0.03418633 0.01152677 -0.01694420  0.0011408340 -0.05178681 0.02212758
+## 4 -0.03378008 0.01491137 -0.01954784  0.0045447134 -0.04375806 0.02055287
+## 5 -0.03318057 0.01568040 -0.01892110 -0.0009236657 -0.05762760 0.03443987
+## 6 -0.03344242 0.01286241 -0.01767438  0.0022747910 -0.06528352 0.03392455
+

For this section of the tutorial, we will be analyzing +trait_2. Make a histogram of the trait_2 values – what do +you notice about the distribution?

+
library(ggplot2)
+ggplot(pData(annot), aes(x = trait_2)) + geom_histogram()
+
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+

+
+
+

Standard Analysis

+

First, let’s run the GWAS using the standard LMM. Recall that we +first fit the null model and then perform single variant score +tests.

+
+

Null Model

+

We use the fitNullModel function from GENESIS. We need +to specify the AnnotatedDataFrame with the phenotype data +(annot), the outcome variable (trait_2), and the fixed effect covariates +(sex, age, and PCs 1-7). We also include the kinship matrix to specify +the covariance structure of the polygenic random effect in the +model.

+
nullmod <- fitNullModel(annot, 
+                        outcome = "trait_2", 
+                        covars=c("sex", "age", paste0("PC", c(1:7))), 
+                        cov.mat=km, 
+                        verbose=TRUE)
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.7674395     0.7674395 -1642.1166605     0.9161778
+## [1]     0.6850887     0.7089644 -1642.1162960     1.0084433
+## [1]     0.6923555     0.7134190 -1642.1162873     1.0000701
+
+
+

Association Tests

+

After fitting the null model, we use single-variant score tests to +test each variant across the genome separately for association with the +outcome, accounting for genetic ancestry and genetic relatedness among +the samples. Recall that we make a SeqVarData iterator +object that links the genotype data in the GDS file with the phenotype +data in our AnnotatedDataFrame and use the +assocTestSingle function from GENESIS to perform the +association tests.

+
library(SeqVarTools)
+
+# open a connection to the GDS file
+gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
+if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
+gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
+gds <- seqOpen(gdsfile)
+
+# make the seqVarData object
+seqData <- SeqVarData(gds, sampleData=annot)
+
+# make the iterator object
+iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
+
+# run the single-variant association test
+assoc <- assocTestSingle(iterator, 
+                         null.model = nullmod,
+                         test = "Score")
+
## # of selected samples: 1,040
+
dim(assoc)
+
## [1] 27621    14
+

We make a QQ plot of all variants with MAC >= 5. What do you +notice about this QQ plot?

+
qqPlot <- function(pval) {
+    pval <- pval[!is.na(pval)]
+    n <- length(pval)
+    x <- 1:n
+    dat <- data.frame(obs=sort(pval),
+                      exp=x/n,
+                      upper=qbeta(0.025, x, rev(x)),
+                      lower=qbeta(0.975, x, rev(x)))
+
+    ggplot(dat, aes(-log10(exp), -log10(obs))) +
+        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
+        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
+        geom_point() +
+        geom_abline(intercept=0, slope=1, color="red") +
+        xlab(expression(paste(-log[10], "(expected P)"))) +
+        ylab(expression(paste(-log[10], "(observed P)"))) +
+        theme_bw()
+}
+
+# make a QQ plot
+qqPlot(assoc$Score.pval[assoc$MAC >= 5])
+

+
+
+
+

Fully-Adjusted Two-Stage Model Analysis

+

Now, let’s run the GWAS using the fully-adjusted two-stage LMM.

+
+

Null Model

+

To run the fully-adjusted two-stage null model, we simply set the +two.stage option to TRUE.

+
# fit the two stage null model
+nullmod.twostage <- fitNullModel(annot, 
+                                 outcome = "trait_2", 
+                                 covars=c("sex", "age", paste0("PC", c(1:7))), 
+                                 cov.mat=km, 
+                                 two.stage = TRUE,
+                                 verbose=TRUE)
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.7674395     0.7674395 -1642.1166605     0.9161778
+## [1]     0.6850887     0.7089644 -1642.1162960     1.0084433
+## [1]     0.6923555     0.7134190 -1642.1162873     1.0000701
+
## Computing Variance Component Estimates...
+## Sigma^2_A     log-lik     RSS
+
## [1]     0.6923555     0.7134190 -1623.4392525     0.9644513
+## [1]     0.4614019     0.8850879 -1623.2976713     1.0020121
+## [1]     0.4804507     0.8692495 -1623.2964101     1.0000090
+## [1]     0.4800435     0.8696570 -1623.2964095     1.0000000
+

Notice that the messages from the function show that the model was +fit twice – first with the original outcome and second with the +inverse-Normal transformed residuals.

+
# description of the model we fit
+nullmod.twostage$model
+
## $hetResid
+## [1] FALSE
+## 
+## $family
+## 
+## Family: gaussian 
+## Link function: identity 
+## 
+## 
+## $outcome
+## [1] "trait_2"
+## 
+## $covars
+## [1] "sex" "age" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7"
+## 
+## $formula
+## [1] "rankInvNorm(resid(trait_2)) ~ sex + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + (1|A)"
+

From the model description, we can see this is a two stage model +because the formula element has rankInvNorm(resid(trait_2)) +as the outcome variable.

+
+
+

Association Tests

+

After fitting the two-stage null model, the association testing +procedure is exactly the same. Since we’ve already created our +SeqVarData iterator, we do not need to create it again; +however, we do need to “reset” it to the first block of data.

+
# reset the filter to the first block
+seqResetFilter(iterator, verbose = FALSE)
+
+# run the single-variant association test
+assoc.twostage <- assocTestSingle(iterator, 
+                                  null.model = nullmod.twostage,
+                                  test = "Score")
+
## # of selected samples: 1,040
+
dim(assoc.twostage)
+
## [1] 27621    14
+

We make a QQ plot of all variants with MAC >= 5. What do you +notice about this QQ plot compared to the one from the standard null +model?

+
# make a QQ plot
+qqPlot(assoc.twostage$Score.pval[assoc.twostage$MAC >= 5])
+

+

From the QQ plot, we see that using the fully-adjusted two-stage +model substantially decreased the amount of inflation in the test +statistics. We know that this is because of the non-Normality / skewness +of trait_2, but we can compare the marginal residuals from both null +models to see how their distributions look after covariate +adjustment

+
# merge the data for plotting
+pdat <- merge(nullmod$fit, nullmod.twostage$fit,
+              by = 'sample.id', suffixes = c('.orig', '.twostage'))
+head(pdat, 2)
+
##   sample.id outcome.orig workingY.orig fitted.values.orig resid.marginal.orig
+## 1   HG00096          0.1           0.1          0.9522246          -0.8522246
+## 2   HG00097          0.1           0.1          0.9098002          -0.8098002
+##   resid.conditional.orig linear.predictor.orig resid.PY.orig
+## 1             -0.4713968             0.5713968    -0.6607573
+## 2             -0.4203074             0.5203074    -0.5891452
+##   resid.cholesky.orig outcome.twostage workingY.twostage fitted.values.twostage
+## 1          -0.7784371        -1.144867         -1.144867             0.10910933
+## 2          -0.6957929        -1.035960         -1.035960             0.08932232
+##   resid.marginal.twostage resid.conditional.twostage linear.predictor.twostage
+## 1               -1.253977                 -0.8467285                -0.2981388
+## 2               -1.125283                 -0.7524428                -0.2835175
+##   resid.PY.twostage resid.cholesky.twostage
+## 1        -0.9736350               -1.126569
+## 2        -0.8652179               -1.002910
+
# distribution of residuals - original null model
+ggplot(pdat, aes(x = resid.marginal.orig)) + geom_histogram()
+
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+

+
# distribution of residuals - two stage null model
+ggplot(pdat, aes(x = resid.marginal.twostage)) + geom_histogram()
+
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+

+
# compare residuals
+ggplot(pdat, aes(x = resid.marginal.orig, y = resid.marginal.twostage)) +
+    geom_point() +
+    geom_abline(intercept = 0, slope = 1)
+

+

As expected, the residuals from the original model are very skewed, +while the residuals from the two-stage model are much closer to Normally +distributed. The skewness of the residuals in the original model leads +to inflation in variant test statistics (as seen in the QQ plots), +particularly for rare or low frequency variants, where effect allele +carriers with extreme residuals in the tails of the distribution can +have high leverage on the score statistics.

+
+
+
+
+

Binary Traits

+

GENESIS also supports testing binary (e.g. case/control) +outcomes.

+
+

Phenotype Data

+

The outcome status in the annotated phenotype data is a +binary case/control variable. Look at the prevalence of this outcome

+
table(pData(annot)$status)
+
## 
+##   0   1 
+## 996  44
+
+
+

Logistic Mixed Model Analysis

+

GENESIS implements the GMMAT method of approximate logistic mixed +models for association testing with binary outcomes. See Chen +et al. (2016) for more information on the GMMAT method.

+
+

Null Model

+

We can fit a null model using a logistic mixed model by specifying +the argument family=binomial in the +fitNullModel function. As before, we need to specify the +AnnotatedDataFrame with the phenotype data (annot), the +outcome variable (status), and the fixed effect covariates (sex, age, +and PCs 1-7). We still include the kinship matrix to specify the +covariance structure of the polygenic random effect in the model.

+
# fit the null model with logistic mixed model
+nullmod.status <- fitNullModel(annot,
+                               outcome="status",
+                               covars=c("sex", "age", paste0("PC", 1:7)),
+                               cov.mat = km,
+                               family=binomial,
+                               verbose=TRUE)
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.0100000 -3429.7752484     0.8869329
+## [1]     0.1904589 -3429.2871878     0.8800663
+## [1]     0.1933888 -3429.2794483     0.8799562
+
## Updating WorkingY Vector...
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.1933888 -3350.5100738     0.7552805
+## [1]     0.1661252 -3350.6802051     0.7563057
+## [1]     0.1690691 -3350.6617913     0.7561948
+## [1]     0.1687592 -3350.6637295     0.7562065
+
## Updating WorkingY Vector...
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.1687592 -3360.7997975     0.7710210
+## [1]     0.1778610 -3360.7472975     0.7706784
+## [1]     0.1770193 -3360.7521482     0.7707101
+
## Updating WorkingY Vector...
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.1770193 -3357.5234919     0.7658698
+## [1]     0.1775156 -3357.5205527     0.7658512
+
## Updating WorkingY Vector...
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.1775156 -3357.3274708     0.7655623
+
+
+

Association Tests

+

After fitting the logistic null model, the association testing +procedure is still exactly the same. Since we’ve already created our +SeqVarData iterator, we do not need to create it again; +however, we do need to “reset” it to the first block of data.

+
# reset the filter to the first block
+seqResetFilter(iterator, verbose = FALSE)
+
+# run the single-variant association test
+assoc.status <- assocTestSingle(iterator, 
+                                null.model = nullmod.status,
+                                test = "Score")
+
## # of selected samples: 1,040
+
dim(assoc.status)
+
## [1] 27621    14
+

As usual, we make a QQ plot of all variants with MAC >= 5. What do +you notice about the tail of this QQ plot?

+
# make a QQ plot
+qqPlot(assoc.status$Score.pval[assoc.status$MAC >= 5])
+

+
+
+

SPA test

+

In samples with highly imbalanced case:control ratios, the Score test
+can be inflated for rare and low frequency variants. Saddlepoint
+approximation (SPA) can be used to improve \(p\)-value calculations, and is available in
+GENESIS by setting the argument test = "Score.SPA" in
+assocTestSingle. See Dey et
+al. (2017) and Zhou et
+al. (2018) for details on using SPA in GWAS. Re-run the analysis
+using the SPA \(p\)-value
+calculation.

+
# reset the filter to the first block
+seqResetFilter(iterator, verbose = FALSE)
+
+# run the single-variant association test
+assoc.spa <- assocTestSingle(iterator, 
+                             null.model = nullmod.status,
+                             test = "Score.SPA")
+
## # of selected samples: 1,040
+
dim(assoc.spa)
+
## [1] 27621    15
+
head(assoc.spa)
+
##   variant.id chr    pos allele.index n.obs       freq MAC       Score Score.SE
+## 1          1   1 631490            1  1040 0.05576923 116 -0.31231254 1.701762
+## 2          2   1 736950            1  1040 0.12067308 251  1.31337450 2.774703
+## 3          3   1 800909            1  1040 0.13076923 272 -2.78881539 2.774633
+## 4          4   1 814264            1  1040 0.20096154 418  5.93502599 4.447559
+## 5          5   1 868735            1  1040 0.92163462 163  0.08006921 2.090673
+## 6          6   1 903007            1  1040 0.20817308 433 -2.80060466 3.583768
+##    Score.Stat  SPA.pval         Est    Est.SE          PVE SPA.converged
+## 1 -0.18352305 0.8543876 -0.10784296 0.5876262 4.271333e-05            NA
+## 2  0.47333876 0.6359715  0.17059078 0.3603989 2.841361e-04            NA
+## 3 -1.00511162 0.3148431 -0.36225036 0.3604081 1.281182e-03            NA
+## 4  1.33444576 0.1820578  0.30004005 0.2248424 2.258313e-03            NA
+## 5  0.03829829 0.9694499  0.01831864 0.4783148 1.860120e-06            NA
+## 6 -0.78146924 0.4345266 -0.21805797 0.2790359 7.744725e-04            NA
+

Notice that the results of this test include two new columns: +SPA.pval, which are the \(p\)-values after the SPA adjustment, and +SPA.converged which indicates if the SPA adjustment was +successful (TRUE/FALSE); if the value is NA +then the SPA adjustment was not applied and the original score test +\(p\)-value is returned – for +computational efficiency, SPA is only applied when the original score +test \(p\)-value \(< 0.05\). Note that the +Score, Score.SE, and Score.Stat +are exactly the same for all variants as when using the standard score +test, as the SPA adjustment only alters the \(p\)-value.

+
table(assoc.spa$SPA.converged, exclude = NULL)
+
## 
+##  TRUE  <NA> 
+##  1238 26383
+

Make a new QQ plot of all variants with MAC >= 5. How does using +the SPA \(p\)-value adjustment affect +the results?

+
# make a QQ plot
+qqPlot(assoc.spa$SPA.pval[assoc.spa$MAC >= 5])
+

+

Using the SPA \(p\)-value adjustment +has resolved the test statistic inflation!

+
# close the GDS file!
+seqClose(seqData)
+
+
+
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/04_conditional_analysis.Rmd b/04_conditional_analysis.Rmd new file mode 100644 index 0000000..737fc0f --- /dev/null +++ b/04_conditional_analysis.Rmd @@ -0,0 +1,364 @@ +# 4. Conditional Analysis + +In this tutorial, we will learn how to investigate our association signals and perform conditional analyses to search for secondary signals. We will utilize the same association testing applications as well as the [LocusZoom Shiny App](https://locuszoom-shiny-app.bdc.sb-webapp.com/), which is an "Interactive Browser" built with [R Shiny](https://shiny.rstudio.com/) on the NHLBI BioData Catalyst powered by Seven Bridges cloud platform. + +## Original Association Test Results + +In the `02_GWAS.Rmd` tutorial, we found two loci with significant associations on chromosome 1. Let's take a quick look at the association test results to remind ourselves what we found + +```{r} +# load the association test results +repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main" +if (!dir.exists("data")) dir.create("data") + +assocfile <- "data/assoc_chr1_trait_1.RData" +if (!file.exists(assocfile)) download.file(file.path(repo_path, assocfile), assocfile) +assoc <- get(load(assocfile)) +``` + +We make a Manhattan plot of the $-log_{10}(p)$-values using the `manhattanPlot` function from the `GWASTools` package to visualize the association signals. + +```{r, assoc_single_manhattan} +GWASTools::manhattanPlot(assoc$Score.pval, + chromosome = assoc$chr, + thinThreshold = 1e-4, + ylim = c(0, 12)) +``` + +Filter to just the genome-wide significant variants + +```{r} +# genome-wide significant +assoc[assoc$Score.pval < 5e-8, ] + +# extract the variant.id of these hits for later +hits <- assoc$variant.id[assoc$Score.pval < 5e-8] +``` + +We see that 6 variants at two different loci have $p < 5 \times 10^{-8}$. The most significant variant has $p = 5.8 \times 10^{-12}$ and is at position 212956321. Let's explore this variant further. 
+ +## Locus Zoom Plots + +The [Locus Zoom Shiny App](https://locuszoom-shiny-app.bdc.sb-webapp.com/) is an interactive tool built on the [LocusZoom.js library](https://statgen.github.io/locuszoom/) that enables users to make LocusZoom plots of association results produced with the `GENESIS Single Variant Association Testing` app. We will now use the LocusZoom Shiny App to make a LocusZoom plot of our association hit on chromosome 1. + +- Launch the interactive browser + - From the top menu, click "Public Resources" > "Interactive Web Apps" + - Click: "Open" on the LocusZoom Shiny App + - Click: "Yes" to proceed + +The application requires data to be stored as a JSON file. There is a `GENESIS Data JSONizer` tool that converts single-variant association test results .RData file as output by the `GENESIS Single Variant Association Testing` app into the required JSON file. This tool also calculates the linkage disequilibrium (LD) measures required to make the LocusZoom plot for the selected variants. + +- Click the "GENESIS Data JSONizer" tab at the top of the screen +- Select Input Files + - GDS file: `1KG_phase3_GRCh38_subset_chr1.gds` + - .RData file: `1KG_trait_1_chr1.RData` +- JSONizer parameters + - Check: "Specify variant and a flanking region around it" + - Select the position of the variant of interest: 212956321 + - Specify flanking region: 50000 (i.e. 50kb in each direction). + - Select test type: score +- Click: JSONize + +You have the option to download the JSON file to your local environment or upload it to the BioData Catalyst platform and save it for later, if you desire. + +- Expand: JSON File - Download and Export Form +- Set a file name (e.g. "1KG_trait_1_chr1_212956321") +- Choose extension: `.json` +- Click: Export JSON file to platform +- Select your Project and Click: Confirm +- Click: Upload + +There are several optional data layers you can add to your LocusZoom plot. 
The most likely layer that you will want to adjust is the Linkage Disequilibrium (LD) layer. The tool gives you the option to either compute LD measures using your sample genotype data stored in the GDS file (the default), or use the University of Michigan (UM) database. + +- Expand: Option Data Layers +- Expand: Linkage Disequilibrium +- Select Data Source: Compute LD Data +- Select reference variant: 1:212956321_?/? (our variant of interest) +- Click: Calculate LD + +You can expand the Linkage Disequilibrium Data Overview tab to see a preview of the calculated LD data, and you can download the data as a JSON file to your local environment or upload it to the BioData Catalyst platform and save it for later, if you desire. + +- Expand: JSON File - Download and Export Form +- Set a file name (e.g. "1KG_trait_1_chr1_212956321_LD") +- Choose extension: `.json` +- Click: Export JSON file to platform +- Select your Project and Click: Confirm +- Click: Upload + +You need to select the Genome Build that matches your data: + +- Change the Genome Build to GRCh38 for this dataset + +You can review the Initial Plot State Info to make sure everything looks as expected, and then make the plot! + +- Click: Generate plot + +The generated plot is interactive. You can hover over variants to see their chromosome, position, alleles, and association p-value. You can drag the figure left or right to see different sections of the plotted region. You can save the current figure as a .png or .svg file either locally or on the BioData Catalyst platform. + +If you've saved your .json association results file and your .json LD statistics file to your Project, you can come back later and recreate your LocusZoom plot by selecting the "Use Your Own Data Sources" tab at the top of the LocusZoom Shiny App page. This time, rather than JSONizing the data, you can select the .json files as input, and set the plotting parameters the same as we did above. 
+
+
+## Conditional Analysis
+
+One of the most common post-GWAS analyses we routinely perform is a conditional analysis, to explore whether there are any secondary hits at loci (regions) with significant variant associations. Conditional analyses include genetic variants in the null model (i.e. the conditional variants) to adjust for their effects on the trait, just like the other fixed effect covariates in the model. The idea is to see if other association signals remain after accounting for (i.e. conditioning on) the effect(s) of the conditional variant(s).
+
+### Selecting Conditional Variants
+
+Conditional variants are usually selected either (1) as the top hits from your initial GWAS analysis, or (2) as variants known to be associated with the trait from prior publications. When performing conditional analyses using top hits from an initial GWAS analysis, the conditioning procedure is often performed iteratively in a step-wise fashion to get a set of (roughly) independent association signals:
+
+- add the top hit (i.e. the variant with the smallest $p$-value) to the set of conditional variants
+- perform the conditional association test, conditioning on the set of conditional variants
+- check if any significant variants remain
+    - if yes, repeat
+    - if no, stop
+
+When performing genome-wide conditional analyses, it is typically OK to add the top hit from each locus (i.e. region of the genome) to the set of conditional variants at each iteration. Note that the definition of a locus here is not precise -- it is typically based on some measure of genetic distance, whether that be physical distance (e.g. within 500kb, 1Mb, etc.) or genetic distance based on linkage disequilibrium (LD). If you want to err on the side of caution, you can add only the top hit from each chromosome to your set of conditional variants at each iteration.
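The stepwise procedure described above can be sketched in a few lines of R. This is a toy illustration only: simple linear models (`lm`) on simulated dosage data stand in for the GENESIS mixed-model workflow, and the sample size, effect sizes, and significance threshold are all made up for the simulation.

```r
# Toy sketch of the stepwise conditioning loop: simple linear models on
# simulated dosages stand in for the GENESIS mixed-model workflow.
set.seed(42)
n <- 1000; m <- 50
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), n, m)   # dosage matrix
trait <- 0.5 * geno[, 7] + 0.4 * geno[, 23] + rnorm(n)      # two true signals

threshold <- 1e-4   # toy significance threshold for this small simulation
cond <- integer(0)  # indices of the conditional variants

repeat {
    # test each variant, adjusting for the current set of conditional variants
    pvals <- sapply(seq_len(m), function(j) {
        if (j %in% cond) return(NA)
        X <- cbind(geno[, j], geno[, cond, drop = FALSE])
        summary(lm(trait ~ X))$coefficients[2, 4]   # p-value for variant j
    })
    if (min(pvals, na.rm = TRUE) >= threshold) break  # no significant hits remain
    cond <- c(cond, which.min(pvals))                 # add the top hit and repeat
}
cond
```

With the fixed seed, the loop should recover the two simulated causal variants (columns 7 and 23) and then stop once no remaining variant passes the threshold.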
+
+```{r}
+assoc[assoc$Score.pval < 5e-8, ]
+```
+
+In our original association analysis, we found 6 genome-wide significant variants at two distinct loci. In this particular example, it is clear that the hits form two distinct loci: they are at opposite ends of the chromosome, separated by a physical distance of ~188Mb. Therefore, we identify our conditional variants as those at `1:212956321` and `1:25046749`.
+
+### Conditional Null Model
+
+When preparing our data to run the conditional null model, we need to actually extract the genotype values from the GDS file. It is easiest to use the `variant.id` values from the GDS file, but remember that these are unique to your GDS file.
+
+```{r}
+library(SeqArray)
+library(SeqVarTools)
+
+# open the GDS file
+gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
+if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
+gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
+gds <- seqOpen(gdsfile)
+
+# variants to condition on
+cond <- c(2458, 31614)
+
+# set a filter to the conditional variants
+seqSetFilter(gds, variant.id = cond)
+
+# read in the genotype data
+geno <- altDosage(gds)
+head(geno)
+```
+
+First we load the `AnnotatedDataFrame` with the phenotype data that we prepared previously and merge in the genotypes for the conditional variants.
+ +```{r, message = FALSE} +library(Biobase) + +# pheno data +annotfile <- "data/pheno_annotated_pcs.RData" +if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile) +annot <- get(load(annotfile)) + +# merge +dat <- merge(pData(annot), + data.frame('sample.id' = rownames(geno), g = geno), + by = 'sample.id') +head(dat) + +# updated AnnotatedDataFrame +annot <- AnnotatedDataFrame(dat) +annot +``` + +We also need to load our kinship matrix + +```{r} +# load the full PC-Relate KM +kmfile <- "data/pcrelate_Matrix.RData" +if (!file.exists(kmfile)) download.file(file.path(repo_path, kmfile), kmfile) +km <- get(load(kmfile)) +dim(km) +km[1:5,1:5] +``` +Now we can fit our null model, including the conditional variants as covariates. The rest of the model should remain the same as in the original association analysis. + +```{r} +library(GENESIS) + +# fit the conditional model +nullmod.cond <- fitNullModel(annot, + outcome = "trait_1", + covars=c("g.2458", "g.31614", "sex", "age", paste0("PC", c(1:7))), + cov.mat=km, + verbose=TRUE) +``` + +Look at the conditional null model output: + +```{r} +# description of the model we fit +nullmod.cond$model + +# fixed effects +nullmod.cond$fixef +``` +Note that, as expected, the two conditional variants have very significant $p$-values in the null model. The $p$-values aren't *exactly* the same as what we calculated in the original association score tests, but they are quite close -- good validation that the score test procedure is working well! + +### Conditional Association Test + +Now that we have our conditional null model, we can perform the conditional association tests. The procedure is exactly the same as what we've seen before, just using this new null model. 
+
+```{r}
+# reset the filter to all variants
+seqResetFilter(gds)
+
+# make the seqVarData object
+seqData <- SeqVarData(gds, sampleData=annot)
+
+# make the iterator object
+iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
+
+# run the single-variant association test
+assoc.cond <- assocTestSingle(iterator, 
+                              null.model = nullmod.cond, 
+                              test = "Score")
+dim(assoc.cond)
+
+# remove the conditional variants
+assoc.cond <- assoc.cond[!(assoc.cond$variant.id %in% cond),]
+dim(assoc.cond)
+```
+
+Note that we removed the variants we conditioned on from our association test results. The `assocTestSingle` function does not know you conditioned on those variants; it will return statistics for them, but they will be nonsense -- the test statistic will blow up to $\pm \infty$ and you will get $p$-values very near 0 or 1.
+
+#### Examine the results
+
+Let's look at the conditional association results. We make a Manhattan plot of the $-log_{10}(p)$-values using the `manhattanPlot` function from the `GWASTools` package to visualize the association signals.
+
+```{r}
+GWASTools::manhattanPlot(assoc.cond$Score.pval, 
+                         chromosome = assoc.cond$chr, 
+                         thinThreshold = 1e-4,
+                         ylim = c(0, 12))
+```
+
+We see that, after conditioning, the signal from the locus at the beginning of the chromosome is completely removed, but there is still some signal from the locus at the end of the chromosome. Filter to just the genome-wide significant variants to see the statistics
+
+```{r}
+# genome-wide significant
+assoc.cond[assoc.cond$Score.pval < 5e-8, ]
+```
+
+There is now just one genome-wide significant variant, `1:212951423`. Prior to conditioning, this variant had $p = 6.2 \times 10^{-10}$, and after conditioning it has $p = 3.1 \times 10^{-9}$. The signal is reduced *very slightly*, but we would conclude that the association signal at this variant is independent of the association signals at the other variants we conditioned on.
Given variant `1:212951423`'s proximity to variant `1:212956321`, which we conditioned on -- they are only about 5kb apart -- this may seem surprising. This is an example of a secondary signal at this locus.
+
+We can print the conditional association statistics for all of the original hits to see that the signal at the rest of those variants has in fact gone away (recall that the two variants we conditioned on are removed from the output).
+
+```{r}
+assoc.cond[assoc.cond$variant.id %in% hits, ]
+```
+
+If we wanted to continue iterating, we would run a second conditional analysis, conditioning on both variants `1:212956321` and `1:212951423`, to look for a tertiary signal at this locus. However, with only one genome-wide significant variant remaining, that is unnecessary in this situation.
+
+#### LD calculation
+
+To understand this secondary signal, we can use the `snpgdsLDpair` function from the SNPRelate package to compute the LD between the top hits at this locus from our primary and secondary signals.
+
+```{r}
+library(SNPRelate)
+
+# filter the GDS to the two variants
+seqSetFilter(gds, variant.id = c(31614, 31443))
+
+# read in the genotype values
+geno <- altDosage(gds)
+
+# compute the LD r^2 value (note that the function returns the correlation, not squared)
+snpgdsLDpair(snp1 = geno[,1], snp2 = geno[,2])^2
+```
+
+The LD $r^2$ value between these two variants is quite small, nearly 0, which explains why the secondary signal at variant `1:212951423` remains after conditioning on variant `1:212956321`.
+
+
+## Exercise 4.1 (Application)
+
+The `GENESIS Null Model` app on the BioData Catalyst powered by Seven Bridges platform makes it quite simple to perform a conditional analysis. In addition to the inputs provided for the standard analysis, we need to provide the GDS files that contain the genotype data for the conditional variants and an .RData file that specifies their chromosome and variant.id values.
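As a rough sketch, such a conditional variant file might be created in R as below. The column names and layout are an assumption based on the description above (one chromosome and one variant.id per conditional variant) -- check the app's documentation for the exact format it expects. The variant.id values are the two chromosome 1 top hits used earlier in this tutorial.

```r
# Sketch (assumed format): a data.frame giving the chromosome and
# variant.id of each variant to condition on, saved as an .RData file.
cond_vars <- data.frame(chromosome = c(1, 1),
                        variant.id = c(2458, 31614))
save(cond_vars, file = "conditional_vars_trait_1_chr1.RData")
```

Remember that `variant.id` values are specific to your GDS file, so they must be looked up in the same GDS file the app will use.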
+
+Use the `GENESIS Null Model` to fit a null model for trait_1, conditioning on the top hit from each locus on chromosome 1 in our original GWAS analysis. The rest of the model parameters should be the same as the original GWAS -- adjust for sex, age, ancestry, and kinship in the model. The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+    - Click: Public Resources > Workflows and Tools > Browse
+    - Search for `GENESIS Null Model`
+    - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+    - Click: Apps > `GENESIS Null Model` > Run
+    - Specify the Inputs:
+        - Phenotype file: `pheno_annotated.RData`
+        - PCA file: `1KG_phase3_GRCh38_subset_pca.RData`
+        - Relatedness matrix file: `1KG_phase3_subset_GRCh38_pcrelate_Matrix.RData`
+        - GDS files: `1KG_phase3_GRCh38_subset_chr1.gds`
+        - Conditional variant file: `conditional_vars_trait_1_chr1.RData`
+    - Specify the App Settings:
+        - Covariates: age, sex (each as a different term)
+        - Family: gaussian
+        - Number of PCs to include as covariates: 7
+        - Outcome: trait_1
+        - Two stage model: FALSE
+        - Output prefix: "1KG_trait_1_cond" (or any other string to name the output file)
+    - Click: Run
+
+The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
+
+The output of this analysis will be a `_null_model.RData` file that contains the null model fit, a `_phenotypes.RData` file with the phenotype data used in the analysis, and a `_report.Rmd` and `_report.html` with model diagnostics. Review the .html report -- which covariates have significant ($p < 0.05$) associations with trait_1 in the null model? What do you notice about the boxplots of the trait_1 values by the conditional variants? 
+
+You can find the expected output of this analysis by looking at the existing task `09 Conditional Null Model trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.
+
+### Solution 4.1 (Application)
+
+From looking at the .html report, we see that our conditional variants (var_2458 and var_31614), PC1, PC2, PC3, and PC6 have significant associations with trait_1 in our conditional null model. From the boxplots, we can see the positive trend between the trait_1 values and the number of copies of the effect allele at each conditional variant.
+
+
+## Exercise 4.2 (Application)
+
+Use the `GENESIS Single Variant Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform conditional association tests for trait_1 using the null model fit in the previous exercise. To speed things up, we will restrict this analysis to chromosome 1. Use the genotype data in the genome-wide GDS files you created previously. The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+    - Click: Public Resources > Workflows and Tools > Browse
+    - Search for `GENESIS Single Variant Association Testing`
+    - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+    - Click: Apps > `GENESIS Single Variant Association Testing` > Run
+    - Specify the Inputs:
+        - GDS Files: `1KG_phase3_GRCh38_subset_chr1.gds`
+        - Null model file: `1KG_trait_1_cond_null_model.RData`
+        - Phenotype file: `1KG_trait_1_cond_phenotypes.RData` (use the phenotype file created by the Null Model app)
+    - Specify the App Settings:
+        - MAC threshold: 5
+        - Test type: score
+        - memory GB: 32 (increase if needed to ensure enough memory is available)
+        - Output prefix: "1KG_trait_1_assoc_cond" (or any other string to name the output file)
+    - Click: Run
+
+The analysis will take a few minutes to run. 
You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
+
+The output of this analysis will be a `_chr1.RData` file with the association test results for chromosome 1, as well as a `_manh.png` file with the Manhattan plot and a `_qq.png` file with the QQ plot. Review the truncated Manhattan plot -- what do you find?
+
+You can find the expected output of this analysis by looking at the existing task `10 Conditional Single Variant Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.
+
+### Solution 4.2 (Application)
+
+From looking at the truncated Manhattan plot, we see that the signal from the locus at the beginning of the chromosome has been removed, but there is still a genome-wide significant variant from the locus at the end of the chromosome. We also see that there is a truncated variant at the top of the figure for each locus -- these are the variants we conditioned on. The code in the application does not know which variants we conditioned on, so it does not know to remove them from the plots.
+
+
+## Exercise 4.3 (LocusZoom Shiny App)
+
+Return to the LocusZoom Shiny App and make LocusZoom plots indexed by our secondary hit at position 212951423, using both the original and conditional association analysis results. For the original analysis results, you can use the data you JSONized before. For the conditional analysis results, you will need to JSONize the association statistics from that analysis. What do you observe in these LocusZoom plots?
+
+
+
+## Exercise 4.4 (LocusZoom Shiny App)
+
+Notably, the LocusZoom plot we generated with the example data is fairly sparse, which is not representative of what a LocusZoom plot would actually look like in practice. 
There are example data sets available in the tool via the [University of Michigan database](https://portaldev.sph.umich.edu/docs/api/v1/#introduction). Select the "Explore UM Database" tab at the top of the LocusZoom Shiny App page and generate a LocusZoom plot using the GIANT Consortium BMI meta-analysis (PMID: 20935630) data for variant chr16:53803574, using a flanking region of 100kb. What is the p-value of the variant chr16:53803574_T/A? What gene is this variant located in? Change the LD reference population to EUR (European ancestry) -- what do you observe? Change the LD reference population to AFR (African ancestry) -- what do you observe? + +### Solution 4.4 (Locus Zoom Shiny App) + +- The p-value of variant chr16:53803574_T/A is reported as $2.05 x 10^{-62}$ +- This variant is located in the FTO gene, which is well established to be associated with BMI. +- Using the EUR LD reference panel, many of the variants in this region with similar p-values have very high LD with variant chr16:53803574_T/A (indicated by the red color). +- Using the AFR LD reference panel, many of the variants in this region with similar p-values no longer have high LD with variant chr16:53803574_T/A (indicated by the blue color). + diff --git a/04_conditional_analysis.html b/04_conditional_analysis.html new file mode 100644 index 0000000..0c87f12 --- /dev/null +++ b/04_conditional_analysis.html @@ -0,0 +1,1054 @@ + + + + + + + + + + + + + +04_conditional_analysis.knit + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

4. Conditional Analysis

+

In this tutorial, we will learn how to investigate our association +signals and perform conditional analyses to search for secondary +signals. We will utilize the same association testing applications as +well as the LocusZoom Shiny +App, which is an “Interactive Browser” built with R Shiny on the NHLBI BioData +Catalyst powered by Seven Bridges cloud platform.

+
+

Original Association Test Results

+

In the 02_GWAS.Rmd tutorial, we found two loci with
+significant associations on chromosome 1. Let’s take a quick look at the
+association test results to remind ourselves what we found:

+
# load the association test results
+repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+assocfile <- "data/assoc_chr1_trait_1.RData"
+if (!file.exists(assocfile)) download.file(file.path(repo_path, assocfile), assocfile)
+assoc <- get(load(assocfile))
+

We make a Manhattan plot of the \(-log_{10}(p)\)-values using the +manhattanPlot function from the GWASTools +package to visualize the association signals.

+
GWASTools::manhattanPlot(assoc$Score.pval, 
+                         chromosome = assoc$chr, 
+                         thinThreshold = 1e-4,
+                         ylim = c(0, 12))
+

+

Filter to just the genome-wide significant variants

+
# genome-wide significant
+assoc[assoc$Score.pval < 5e-8, ]
+
##       variant.id chr       pos allele.index n.obs      freq MAC     Score
+## 1938        2370   1  25044607            1  1040 0.1692308 352  71.07880
+## 2003        2458   1  25046749            1  1040 0.1500000 312  69.35044
+## 2011        2473   1  25047225            1  1040 0.1927885 401  71.95763
+## 23169      31443   1 212951423            1  1040 0.1091346 227  63.18412
+## 23185      31476   1 212952357            1  1040 0.5254808 987 108.91489
+## 23271      31614   1 212956321            1  1040 0.4043269 841 110.53196
+##       Score.SE Score.Stat   Score.pval       Est     Est.SE        PVE
+## 1938  12.46025   5.704443 1.167240e-08 0.4578113 0.08025520 0.03159288
+## 2003  11.99336   5.782402 7.364128e-09 0.4821336 0.08337946 0.03246230
+## 2011  13.06207   5.508899 3.610844e-08 0.4217478 0.07655754 0.02946405
+## 23169 10.23968   6.170516 6.806767e-10 0.6026081 0.09765927 0.03696627
+## 23185 16.86561   6.457808 1.062305e-10 0.3828979 0.05929224 0.04048862
+## 23271 16.11184   6.860292 6.871974e-12 0.4257919 0.06206614 0.04569282
+
# extract the variant.id of these hits for later
+hits <- assoc$variant.id[assoc$Score.pval < 5e-8]
+

We see that 6 variants at two different loci have \(p < 5 \times 10^{-8}\). The most
+significant variant has \(p = 6.9 \times
+10^{-12}\) and is at position 212956321. Let’s explore this
+variant further.
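As a quick sanity check, the reported p-value can be recovered from the normal approximation underlying the score test, using the Score and Score.SE values printed in the table above (a base-R sketch):

```r
# the score statistic Score / Score.SE is approximately standard normal
# under the null, so the two-sided p-value is 2 * P(Z > |Score.Stat|)
score.stat <- 110.53196 / 16.11184   # Score / Score.SE for variant 1:212956321
pval <- 2 * pnorm(-abs(score.stat))
round(score.stat, 2)
signif(pval, 3)   # matches Score.pval for this variant in the table above
```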

+
+
+

Locus Zoom Plots

+

The Locus +Zoom Shiny App is an interactive tool built on the LocusZoom.js library +that enables users to make LocusZoom plots of association results +produced with the +GENESIS Single Variant Association Testing app. We will now +use the LocusZoom Shiny App to make a LocusZoom plot of our association +hit on chromosome 1.

+
    +
  • Launch the interactive browser +
      +
    • From the top menu, click “Public Resources” > “Interactive Web +Apps”
    • +
    • Click: “Open” on the LocusZoom Shiny App
    • +
    • Click: “Yes” to proceed
    • +
  • +
+

The application requires data to be stored as a JSON file. There is a
+GENESIS Data JSONizer tool that converts a single-variant
+association test results .RData file, as output by the
+GENESIS Single Variant Association Testing app, into the
+required JSON file. This tool also calculates the linkage disequilibrium
+(LD) measures required to make the LocusZoom plot for the selected
+variants.

+
    +
  • Click the “GENESIS Data JSONizer” tab at the top of the screen
  • +
  • Select Input Files +
      +
    • GDS file: 1KG_phase3_GRCh38_subset_chr1.gds
    • +
    • .RData file: 1KG_trait_1_chr1.RData
    • +
  • +
  • JSONizer parameters +
      +
    • Check: “Specify variant and a flanking region around it”
    • +
    • Select the position of the variant of interest: 212956321
    • +
    • Specify flanking region: 50000 (i.e. 50kb in each direction).
    • +
    • Select test type: score
    • +
  • +
  • Click: JSONize
  • +
+

You have the option to download the JSON file to your local +environment or upload it to the BioData Catalyst platform and save it +for later, if you desire.

+
    +
  • Expand: JSON File - Download and Export Form
  • +
  • Set a file name (e.g. “1KG_trait_1_chr1_212956321”)
  • +
  • Choose extension: .json
  • +
  • Click: Export JSON file to platform
  • +
  • Select your Project and Click: Confirm
  • +
  • Click: Upload
  • +
+

There are several optional data layers you can add to your LocusZoom +plot. The most likely layer that you will want to adjust is the Linkage +Disequilibrium (LD) layer. The tool gives you the option to either +compute LD measures using your sample genotype data stored in the GDS +file (the default), or use the University of Michigan (UM) database.

+
    +
  • Expand: Option Data Layers
  • +
  • Expand: Linkage Disequilibrium
  • +
  • Select Data Source: Compute LD Data
  • +
  • Select reference variant: 1:212956321_?/? (our variant of +interest)
  • +
  • Click: Calculate LD
  • +
+

You can expand the Linkage Disequilibrium Data Overview tab to see a +preview of the calculated LD data, and you can download the data as a +JSON file to your local environment or upload it to the BioData Catalyst +platform and save it for later, if you desire.

+
    +
  • Expand: JSON File - Download and Export Form
  • +
  • Set a file name (e.g. “1KG_trait_1_chr1_212956321_LD”)
  • +
  • Choose extension: .json
  • +
  • Click: Export JSON file to platform
  • +
  • Select your Project and Click: Confirm
  • +
  • Click: Upload
  • +
+

You need to select the Genome Build that matches your data:

+
    +
  • Change the Genome Build to GRCh38 for this dataset
  • +
+

You can review the Initial Plot State Info to make sure everything +looks as expected, and then make the plot!

+
    +
  • Click: Generate plot
  • +
+

The generated plot is interactive. You can hover over variants to see +their chromosome, position, alleles, and association p-value. You can +drag the figure left or right to see different sections of the plotted +region. You can save the current figure as a .png or .svg file either +locally or on the BioData Catalyst platform.

+

If you’ve saved your .json association results file and your .json LD +statistics file to your Project, you can come back later and recreate +your LocusZoom plot by selecting the “Use Your Own Data Sources” tab at +the top of the LocusZoom Shiny App page. This time, rather than +JSONizing the data, you can select the .json files as input, and set the +plotting parameters the same as we did above.

+
+
+

Conditional Analysis

+

One of the most common post-GWAS analyses we routinely perform is to +run conditional analyses to explore if there are any secondary hits at +loci (regions) with significant variant associations. Conditional +analyses include genetic variants in the null model (i.e. the +conditional variants) to adjust for their effects on the trait, just +like the other fixed effect covariates in the model. The idea is to see +if other association signals remain after accounting for +(i.e. conditioning on) the effect(s) of the conditional variant(s).

+
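Written out (a sketch in the notation of this tutorial, with \(X\) the matrix of covariates and \(\Phi\) the kinship matrix), conditioning on variants \(g_1, \ldots, g_k\) means fitting the null model

\[
y = X\beta + \sum_{j=1}^{k} \gamma_j g_j + b + \epsilon, \qquad b \sim N(0, \sigma^2_A \Phi), \qquad \epsilon \sim N(0, \sigma^2_\epsilon I),
\]

so that the fitted effects \(\gamma_j\) of the conditional variants are absorbed into the null model before any other variant is tested.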
+

Selecting Conditional Variants

+

Conditional variants are usually selected either (1) as the top hits +from your initial GWAS analysis, or (2) as variants known to be +associated with the trait from prior publications. When performing +conditional analyses using top hits from an initial GWAS analysis, the +conditioning procedure is often performed step-wise iteratively to get a +set of (roughly) independent association signals.

+
    +
  • add the top hit (i.e. variant with the smallest \(p\)-value) to the set of conditional +variants
  • +
  • perform the conditional association test, conditioning on the set of +conditional variants
  • +
  • check if any significant variants remain +
      +
    • if yes, repeat
    • +
    • if no, stop
    • +
  • +
+
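In code, the iteration just described might look like the following sketch (pseudocode: `fit_conditional_null` and `run_assoc` are hypothetical wrappers around the `fitNullModel` and `assocTestSingle` steps shown elsewhere in this tutorial):

```r
# pseudocode sketch of step-wise conditional variant selection
cond <- c()
repeat {
  top  <- assoc$variant.id[which.min(assoc$Score.pval)]   # current top hit
  cond <- c(cond, top)                                    # grow conditional set
  nullmod <- fit_conditional_null(cond)  # hypothetical wrapper: fitNullModel
  assoc   <- run_assoc(nullmod)          # hypothetical wrapper: assocTestSingle
  assoc   <- assoc[!(assoc$variant.id %in% cond), ]       # drop conditioned variants
  if (min(assoc$Score.pval) >= 5e-8) break                # stop: no hits remain
}
```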

When performing genome-wide conditional analyses, it is typically OK
+to add the top hit from each locus (i.e. region of the genome) to the
+set of conditional variants at each iteration. Note that the definition
+of locus here is not precise – it is typically based on some measure of
+genetic distance, whether that be physical distance (e.g. within 500kb,
+1Mb, etc.) or genetic distance based on linkage disequilibrium (LD). If
+you want to err on the side of caution, you can add only the top hit
+from each chromosome to your set of conditional variants at each
+iteration.

+
assoc[assoc$Score.pval < 5e-8, ]
+
##       variant.id chr       pos allele.index n.obs      freq MAC     Score
+## 1938        2370   1  25044607            1  1040 0.1692308 352  71.07880
+## 2003        2458   1  25046749            1  1040 0.1500000 312  69.35044
+## 2011        2473   1  25047225            1  1040 0.1927885 401  71.95763
+## 23169      31443   1 212951423            1  1040 0.1091346 227  63.18412
+## 23185      31476   1 212952357            1  1040 0.5254808 987 108.91489
+## 23271      31614   1 212956321            1  1040 0.4043269 841 110.53196
+##       Score.SE Score.Stat   Score.pval       Est     Est.SE        PVE
+## 1938  12.46025   5.704443 1.167240e-08 0.4578113 0.08025520 0.03159288
+## 2003  11.99336   5.782402 7.364128e-09 0.4821336 0.08337946 0.03246230
+## 2011  13.06207   5.508899 3.610844e-08 0.4217478 0.07655754 0.02946405
+## 23169 10.23968   6.170516 6.806767e-10 0.6026081 0.09765927 0.03696627
+## 23185 16.86561   6.457808 1.062305e-10 0.3828979 0.05929224 0.04048862
+## 23271 16.11184   6.860292 6.871974e-12 0.4257919 0.06206614 0.04569282
+

In our original association analysis, we found that there were 6 +genome-wide significant variants at two distinct loci. In the particular +example here, it is pretty clear that we can consider our hits as two +distinct loci, as they are at opposite ends of the chromosome and the +physical distance between them is ~188Mb. Therefore, we identify our +conditional variants as those at 1:212956321 and +1:25046749.

+
+
+

Conditional Null Model

+

When preparing our data to run the conditional null model, we need to +actually extract the genotype values from the GDS file. It is easiest to +use the variant.id values from the GDS file, but remember +that these are unique to your GDS file.

+
library(SeqArray)
+
## Loading required package: gdsfmt
+
library(SeqVarTools)
+
+# open the GDS file
+gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
+if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
+gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
+gds <- seqOpen(gdsfile)
+
+# variants to condition on
+cond <- c(2458, 31614)
+
+# set a filter to the conditional variants
+seqSetFilter(gds, variant.id = cond)
+
## # of selected variants: 2
+
# read in the genotype data
+geno <- altDosage(gds)
+head(geno)
+
##          variant
+## sample    2458 31614
+##   HG00096    0     0
+##   HG00097    0     1
+##   HG00099    0     0
+##   HG00100    0     1
+##   HG00101    1     2
+##   HG00102    1     0
+

First we load the AnnotatedDataFrame with the phenotype +data that we prepared previously and merge in the genotypes for the +conditional variants.

+
library(Biobase)
+
+# pheno data
+annotfile <- "data/pheno_annotated_pcs.RData"
+if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
+annot <- get(load(annotfile))
+
+# merge
+dat <- merge(pData(annot), 
+             data.frame('sample.id' = rownames(geno), g = geno),
+             by = 'sample.id')
+head(dat)
+
##   sample.id pop super_pop    sex age trait_1 trait_2 status        PC1
+## 1   HG00096 GBR       EUR   male  56    23.5     0.1      0 0.01800549
+## 2   HG00097 GBR       EUR female  56    23.9     0.1      0 0.01742309
+## 3   HG00099 GBR       EUR female  48    23.8     0.3      0 0.01799001
+## 4   HG00100 GBR       EUR female  54    24.9     0.2      0 0.01793750
+## 5   HG00101 GBR       EUR   male  65    27.6     0.7      1 0.01759459
+## 6   HG00102 GBR       EUR female  53    24.5     0.4      0 0.01796459
+##           PC2        PC3         PC4           PC5         PC6        PC7
+## 1 -0.03361957 0.01541131 -0.02085591 -0.0037719379 -0.04669896 0.01985608
+## 2 -0.03361182 0.01304470 -0.01992104 -0.0095947596 -0.04545408 0.01034344
+## 3 -0.03418633 0.01152677 -0.01694420  0.0011408340 -0.05178681 0.02212758
+## 4 -0.03378008 0.01491137 -0.01954784  0.0045447134 -0.04375806 0.02055287
+## 5 -0.03318057 0.01568040 -0.01892110 -0.0009236657 -0.05762760 0.03443987
+## 6 -0.03344242 0.01286241 -0.01767438  0.0022747910 -0.06528352 0.03392455
+##   g.2458 g.31614
+## 1      0       0
+## 2      0       1
+## 3      0       0
+## 4      0       1
+## 5      1       2
+## 6      1       0
+
# updated AnnotatedDataFrame
+annot <- AnnotatedDataFrame(dat)
+annot
+
## An object of class 'AnnotatedDataFrame'
+##   rowNames: 1 2 ... 1040 (1040 total)
+##   varLabels: sample.id pop ... g.31614 (17 total)
+##   varMetadata: labelDescription
+

We also need to load our kinship matrix

+
# load the full PC-Relate KM
+kmfile <- "data/pcrelate_Matrix.RData"
+if (!file.exists(kmfile)) download.file(file.path(repo_path, kmfile), kmfile)
+km <- get(load(kmfile))
+dim(km)
+
## [1] 1040 1040
+
km[1:5,1:5]
+
## 5 x 5 Matrix of class "dsyMatrix"
+##              HG00096       HG00097      HG00099      HG00100       HG00101
+## HG00096  0.985459993  0.0031768617  0.006724604 -0.006857234  0.0060690853
+## HG00097  0.003176862  0.9883268089 -0.007300965  0.012888182 -0.0009002438
+## HG00099  0.006724604 -0.0073009649  0.976649166  0.022746299  0.0118873586
+## HG00100 -0.006857234  0.0128881819  0.022746299  0.978507564  0.0265278283
+## HG00101  0.006069085 -0.0009002438  0.011887359  0.026527828  0.9778938334
+

Now we can fit our null model, including the conditional variants as +covariates. The rest of the model should remain the same as in the +original association analysis.

+
library(GENESIS)
+
+# fit the conditional model
+nullmod.cond <- fitNullModel(annot, 
+                             outcome = "trait_1", 
+                             covars=c("g.2458", "g.31614", "sex", "age", paste0("PC", c(1:7))), 
+                             cov.mat=km, 
+                             verbose=TRUE)
+
## Computing Variance Component Estimates...
+
## Sigma^2_A     log-lik     RSS
+
## [1]     0.9867937     0.9867937 -1734.9472441     0.8491269
+## [1]     0.1413764     1.4541284 -1734.4486589     1.0376070
+## [1]     0.3457546     1.3111795 -1734.3346307     1.0017077
+## [1]     0.3575102     1.3025230 -1734.3343188     1.0000041
+## [1]     0.3573107     1.3027243 -1734.3343187     1.0000000
+

Look at the conditional null model output:

+
# description of the model we fit
+nullmod.cond$model
+
## $hetResid
+## [1] FALSE
+## 
+## $family
+## 
+## Family: gaussian 
+## Link function: identity 
+## 
+## 
+## $outcome
+## [1] "trait_1"
+## 
+## $covars
+##  [1] "g.2458"  "g.31614" "sex"     "age"     "PC1"     "PC2"     "PC3"    
+##  [8] "PC4"     "PC5"     "PC6"     "PC7"    
+## 
+## $formula
+## [1] "trait_1 ~ g.2458 + g.31614 + sex + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + (1|A)"
+
# fixed effects
+nullmod.cond$fixef
+
##                      Est         SE         Stat         pval
+## (Intercept)  24.46095512 0.31339306 6092.1145879 0.000000e+00
+## g.2458        0.47316236 0.08020750   34.8008955 3.652054e-09
+## g.31614       0.42061411 0.05971255   49.6176852 1.868220e-12
+## sexmale       0.08723529 0.07982865    1.1941720 2.744896e-01
+## age           0.00989107 0.00552822    3.2012225 7.358325e-02
+## PC1         -11.17836708 1.32080000   71.6279657 2.598476e-17
+## PC2           6.66887441 1.29724439   26.4278281 2.735706e-07
+## PC3          -6.22638671 1.28668083   23.4169602 1.304236e-06
+## PC4          -1.39282849 1.29019425    1.1654270 2.803429e-01
+## PC5          -1.54938754 1.29940760    1.4217699 2.331123e-01
+## PC6           4.09586998 1.29357154   10.0256241 1.543774e-03
+## PC7           0.63123074 1.29588712    0.2372695 6.261852e-01
+

Note that, as expected, the two conditional variants have very +significant \(p\)-values in the null +model. The \(p\)-values aren’t +exactly the same as what we calculated in the original +association score tests, but they are quite close – good validation that +the score test procedure is working well!

+
+
+

Conditional Association Test

+

Now that we have our conditional null model, we can perform the +conditional association tests. The procedure is exactly the same as what +we’ve seen before, just using this new null model.

+
# reset the filter to all variants
+seqResetFilter(gds)
+
## # of selected samples: 1,040
+## # of selected variants: 37,409
+
# make the seqVarData object
+seqData <- SeqVarData(gds, sampleData=annot)
+
+# make the iterator object
+iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
+
+# run the single-variant association test
+assoc.cond <- assocTestSingle(iterator, 
+                              null.model = nullmod.cond,
+                              test = "Score")
+
## # of selected samples: 1,040
+
## Using 6 CPU cores
+
dim(assoc.cond)
+
## [1] 27621    14
+
# remove the conditional variants
+assoc.cond <- assoc.cond[!(assoc.cond$variant.id %in% cond),]
+dim(assoc.cond)
+
## [1] 27619    14
+

Note that we removed the variants we conditioned on from our
+association test results. The assocTestSingle function does
+not know you conditioned on those variants; it will return statistics,
+but they will be nonsense – the test statistic will blow up to \(\pm\infty\)
+and you will get \(p\)-values very near to 0 or 1.

+
+

Examine the results

+

Let’s look at the conditional association results. We make a +Manhattan plot of the \(-log_{10}(p)\)-values using the +manhattanPlot function from the GWASTools +package to visualize the association signals.

+
GWASTools::manhattanPlot(assoc.cond$Score.pval, 
+                         chromosome = assoc.cond$chr, 
+                         thinThreshold = 1e-4,
+                         ylim = c(0, 12))
+

+

We see that, after conditioning, the signal from the locus at the
+beginning of the chromosome is completely removed, but there is still
+some signal from the locus at the end of the chromosome. Filter to just
+the genome-wide significant variants to see the statistics:

+
# genome-wide significant
+assoc.cond[assoc.cond$Score.pval < 5e-8, ]
+
##       variant.id chr       pos allele.index n.obs      freq MAC    Score
+## 23169      31443   1 212951423            1  1040 0.1091346 227 62.85738
+##       Score.SE Score.Stat   Score.pval       Est    Est.SE        PVE
+## 23169 10.60252   5.928532 3.056555e-09 0.5591625 0.0943172 0.03419016
+

There is now just one genome-wide significant variant,
+1:212951423. Prior to conditioning, this variant had \(p = 6.8 \times 10^{-10}\), and after
+conditioning it has \(p = 3.1 \times
+10^{-9}\). The signal is reduced very slightly, but we
+would conclude that the association signal at this variant is
+independent of the association signals at the other variants we
+conditioned on. Given the proximity of variant 1:212951423 to
+variant 1:212956321 that we conditioned on – only about 5kb
+apart – this may seem surprising. This is an example of a secondary
+signal at this locus.

+

We can print the conditional association statistics for all of the
+original hits to see that the signal at the rest of those variants has
+in fact gone away (recall that the two variants we conditioned on are
+removed from the output):

+
assoc.cond[assoc.cond$variant.id %in% hits, ]
+
##       variant.id chr       pos allele.index n.obs      freq MAC     Score
+## 1938        2370   1  25044607            1  1040 0.1692308 352  4.229163
+## 2011        2473   1  25047225            1  1040 0.1927885 401  6.116060
+## 23169      31443   1 212951423            1  1040 0.1091346 227 62.857376
+## 23185      31476   1 212952357            1  1040 0.5254808 987 35.897551
+##        Score.SE Score.Stat   Score.pval       Est     Est.SE          PVE
+## 1938   4.889797  0.8648955 3.870962e-01 0.1768776 0.20450746 0.0007276695
+## 2011   6.902432  0.8860731 3.755781e-01 0.1283711 0.14487647 0.0007637408
+## 23169 10.602520  5.9285316 3.056555e-09 0.5591625 0.09431720 0.0341901629
+## 23185 13.669947  2.6260197 8.638983e-03 0.1921017 0.07315317 0.0067081510
+

If we wanted to continue iterating, we would run a second conditional +analysis, conditioning on both variants 1:212956321 and +1:212951423 to look for a tertiary signal at this locus. +However, with only one genome-wide significant variant remaining, that +is unnecessary in this situation.

+
+
+

LD calculation

+

To understand this secondary signal, we can use the
+snpgdsLDpair function from the SNPRelate package to compute
+the LD between the top hits at this locus from our primary and secondary
+signals.

+
library(SNPRelate)
+
## SNPRelate
+
# filter the GDS to the two variants
+seqSetFilter(gds, variant.id = c(31614, 31443))
+
## # of selected variants: 2
+
# read in the genotype values
+geno <- altDosage(gds)
+
+# compute the LD r^2 value (note that the function returns the correlation, not squared)
+snpgdsLDpair(snp1 = geno[,1], snp2 = geno[,2])^2
+
##          ld 
+## 0.000630134
+

The LD \(r^2\) value between these
+two variants is quite small – nearly 0 – which explains why the secondary
+signal at variant 1:212951423 remains after conditioning on
+variant 1:212956321.
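For intuition, an LD \(r^2\) between two variants can be thought of as a squared correlation of their allele dosages. A toy base-R illustration with made-up genotype vectors (note that the default composite measure of snpgdsLDpair may differ slightly from a plain Pearson correlation of dosages):

```r
# toy allele dosages (0/1/2 copies of the alternate allele) for two variants
g1 <- c(0, 1, 2, 0, 1, 2, 0, 0, 1, 2)
g2 <- c(0, 1, 1, 0, 0, 2, 1, 0, 1, 2)

# squared Pearson correlation of dosages as a simple LD r^2 measure
r2 <- cor(g1, g2)^2
r2
```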

+
+
+
+
+

Exercise 4.1 (Application)

+

The GENESIS Null Model app on the BioData Catalyst +powered by Seven Bridges platform makes it quite simple to perform a +conditional analysis. In addition to the inputs provided for the +standard analysis, we need to provide the GDS files that contain the +genotype data for the variants we want to condition on and an .RData +file that specifies the chromosome and variant.id values of the variants +we want to condition on.

+

Use the GENESIS Null Model to fit a null model for +trait_1, conditioning on the top hit from each locus on chromosome 1 in +our original GWAS analysis. The rest of the model parameters should be +the same as the original GWAS – adjust for sex, age, ancestry, and +kinship in the model. The steps to perform this analysis are as +follows:

+
    +
  • Copy the app to your project if it is not already there: +
      +
    • Click: Public Resources > Workflows and Tools > Browse
    • +
    • Search for GENESIS Null Model
    • +
    • Click: Copy > Select your project > Copy
    • +
  • +
  • Run the analysis in your project: +
      +
    • Click: Apps > GENESIS Null Model > Run
    • +
    • Specify the Inputs: +
        +
      • Phenotype file: pheno_annotated.RData
      • +
      • PCA file: 1KG_phase3_GRCh38_subset_pca.RData
      • +
      • Relatedness matrix file: +1KG_phase3_subset_GRCh38_pcrelate_Matrix.RData
      • +
      • GDS files: 1KG_phase3_GRCh38_subset_chr1.gds
      • +
      • Conditional variant file: +conditional_vars_trait_1_chr1.RData
      • +
    • +
    • Specify the App Settings: +
        +
      • Covariates: age, sex (each as a different term)
      • +
      • Family: gaussian
      • +
      • Number of PCs to include as covariates: 7
      • +
      • Outcome: trait_1
      • +
      • Two stage model: FALSE
      • +
      • Output prefix: “1KG_trait_1_cond” (or any other string to name the +output file)
      • +
    • +
    • Click: Run
    • +
  • +

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be a <output_prefix>_null_model.RData file that contains the null model fit, a <output_prefix>_phenotypes.RData file with the phenotype data used in the analysis, and a <output_prefix>_report.Rmd and <output_prefix>_report.html with model diagnostics. Review the .html report – which covariates have significant (\(p < 0.05\)) associations with trait_1 in the null model? What do you notice about the boxplots of the trait_1 values by the conditional variants?

You can find the expected output of this analysis by looking at the existing task 09 Conditional Null Model trait_1 in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.

Solution 4.1 (Application)

From looking at the .html report, we see that our conditional variants (var_2458 and var_31614), PC1, PC2, PC3, and PC6 have significant associations with trait_1 in our conditional null model. From the boxplots, we can see the positive trend between the trait_1 values and the number of copies of the effect allele at each conditional variant.
Exercise 4.2 (Application)

Use the GENESIS Single Variant Association Testing app on the BioData Catalyst powered by Seven Bridges platform to perform conditional association tests for trait_1 using the null model fit in the previous exercise. To speed things up, we will restrict this analysis to chromosome 1. Use the genotype data in the genome-wide GDS files you created previously. The steps to perform this analysis are as follows:

  • Copy the app to your project if it is not already there:
      • Click: Public Resources > Workflows and Tools > Browse
      • Search for GENESIS Single Variant Association Testing
      • Click: Copy > Select your project > Copy
  • Run the analysis in your project:
      • Click: Apps > GENESIS Single Variant Association Testing > Run
      • Specify the Inputs:
          • GDS Files: 1KG_phase3_GRCh38_subset_chr1.gds
          • Null model file: 1KG_trait_1_cond_null_model.RData
          • Phenotype file: 1KG_trait_1_cond_phenotypes.RData (use the phenotype file created by the Null Model app)
      • Specify the App Settings:
          • MAC threshold: 5
          • Test type: score
          • memory GB: 32 (increase to make sure enough is available)
          • Output prefix: “1KG_trait_1_assoc_cond” (or any other string to name the output file)
      • Click: Run
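The test the app runs can also be sketched locally with GENESIS, in the same style as the earlier tutorials. The `seqData` object (a `SeqVarData` for the chromosome 1 GDS) and the conditional null model `nullmod.cond` are assumed to already exist; this is an illustrative sketch, not the app's exact pipeline:

```r
# Sketch of a conditional single-variant score test using GENESIS directly.
# ASSUMPTIONS: 'seqData' is a SeqVarData object for the chr 1 GDS file and
# 'nullmod.cond' is the conditional null model from the previous exercise.
library(GENESIS)
library(SeqVarTools)

iterator <- SeqVarBlockIterator(seqData, verbose = FALSE)
assoc <- assocTestSingle(iterator, null.model = nullmod.cond, test = "Score")

# mimic the app's MAC threshold of 5
assoc <- assoc[assoc$MAC >= 5, ]
head(assoc)
```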

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be an <output_prefix>_chr1.RData file with the association test results for chromosome 1, as well as a <output_prefix>_manh.png file with the Manhattan plot and a <output_prefix>_qq.png file with the QQ plot. Review the truncated Manhattan plot – what do you find?

You can find the expected output of this analysis by looking at the existing task 10 Conditional Single Variant Association Test trait_1 in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.

Solution 4.2 (Application)

From looking at the truncated Manhattan plot, we see that the signal from the locus at the beginning of the chromosome has been removed, but there is still a genome-wide significant variant from the locus at the end of the chromosome. We also see that there is a truncated variant at the top of the figure for each locus – these are the variants we conditioned on. The code in the application does not know which variants we conditioned on, so it does not know to remove them from the plots.
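If you want cleaner plots, one option is to drop the conditioned-on variants from the results before re-plotting. A hedged sketch, assuming `assoc` is the results data.frame loaded from the app's output RData file and that its `variant.id` values correspond to the identifiers from Solution 4.1 (adjust to match how IDs are coded in your file):

```r
# Drop the variants we conditioned on before re-plotting (illustrative;
# the IDs below assume variant.id matches the var_2458/var_31614 labels).
cond.ids <- c(2458, 31614)
assoc.plot <- assoc[!(assoc$variant.id %in% cond.ids), ]

# then re-make the Manhattan and QQ plots from 'assoc.plot' with your usual code
```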
Exercise 4.3 (LocusZoom Shiny App)

Return to the LocusZoom Shiny App and make locus zoom plots indexed by our secondary hit at position 212951423, using both the original and conditional association analysis results. For the original analysis results, you can use the data you JSONized before. For the conditional analysis results, you will need to JSONize the association statistics from that analysis. What do you observe in these locus zoom plots?

Exercise 4.4 (LocusZoom Shiny App)

Notably, the LocusZoom plot we generated with the example data is fairly sparse, which is not representative of what a LocusZoom plot would actually look like in practice. There are example data sets available in the tool via the University of Michigan database. Select the “Explore UM Database” tab at the top of the LocusZoom Shiny App page and generate a LocusZoom plot using the GIANT Consortium BMI meta-analysis (PMID: 20935630) data for variant chr16:53803574, using a flanking region of 100kb. What is the p-value of the variant chr16:53803574_T/A? What gene is this variant located in? Change the LD reference population to EUR (European ancestry) – what do you observe? Change the LD reference population to AFR (African ancestry) – what do you observe?
Solution 4.4 (LocusZoom Shiny App)

  • The p-value of variant chr16:53803574_T/A is reported as \(2.05 \times 10^{-62}\).
  • This variant is located in the FTO gene, which is well established to be associated with BMI.
  • Using the EUR LD reference panel, many of the variants in this region with similar p-values have very high LD with variant chr16:53803574_T/A (indicated by the red color).
  • Using the AFR LD reference panel, many of the variants in this region with similar p-values no longer have high LD with variant chr16:53803574_T/A (indicated by the blue color).
diff --git a/05_annotation_explorer.Rmd b/05_annotation_explorer.Rmd
new file mode 100644
index 0000000..fdeeb84
--- /dev/null
+++ b/05_annotation_explorer.Rmd
@@ -0,0 +1,96 @@
# 5. Annotation Explorer

In this tutorial, we will learn how to use [Annotation Explorer](https://platform.sb.biodatacatalyst.nhlbi.nih.gov/u/biodatacatalyst/annotation-explorer/), an open tool available on the NHLBI BioData Catalyst powered by Seven Bridges cloud platform that eliminates the challenges of working with very large variant-level annotated datasets. Annotation Explorer has an interactive graphical user interface built on high-performance databases and does not require any programming experience. We will learn how to explore and interactively query variant annotations and integrate them into GWAS analyses.

For these exercises, we will be using the open-access "GenomeWide" dataset, which contains annotations for all possible combinations of SNVs at each position in the genome, as well as INDELs submitted to dbSNP.

- Launch the Annotation Explorer
    - From the top menu, click "Data" > "Annotation Explorer"
    - Click: "Query Dataset" on the "GenomeWide" Dataset
    - Choose the SISG Billing Group
    - Choose the XSMALL Instance type
    - Click: "Select"

## Post-association testing

Annotation Explorer can be used **post-association testing** -- for example, to explore annotations of variants in a novel GWAS signal. Suppose we performed a GWAS and our top hit was the C>T variant on chromosome 19 at position 44908822. In the Annotation Explorer:

- Click: "Add filter"
    - Select "CHROM"
    - Select value "19"
    - Click: "Add"
- Click: "Add filter"
    - Select "POS"
    - Select "Equals" and type in value "44908822"
    - Click: "Add"
- Click: "Run query" in the top right

This may take a minute to run. Once the query finishes, any variants that match the filtering criteria will be displayed.
We can click the "+" icon on the right of the results to "add additional annotations" to the results:

- Click: "+"
    - Use the search to find and select "KGP_AF" (allele frequency from 1000 Genomes Project)
    - Use the search to find and select "GWAS_catalog_rs", "GWAS_catalog_trait", and "GWAS_catalog_OR"
    - Use the search to find and select "CADD_phred" (variants with larger scores are more likely damaging)
    - Click: "Add" to update the query

We can see that the C>T variant at 19:44908822

- is rs7412
- has a T allele frequency of 7.5\% in 1000 Genomes
- has been associated with systolic blood pressure in the GWAS Catalog
- has a CADD_phred = 26 (variants with CADD_phred > 20 are in the top 1\% of scores)


### Exercise 5.1 (Annotation Explorer)

Use the Annotation Explorer to find the rsID, the allele frequency in the 1000 Genomes Project, and the CADD_phred score for the T>A variant on chromosome 16 at position 53786615. Also, what trait is this variant associated with in the GWAS Catalog?

#### Solution 5.1 (Annotation Explorer)

Running Annotation Explorer, we see that the T>A variant at 16:53786615

- is rs9939609
- has an A allele frequency of 34.0\% in 1000 Genomes
- has been associated with BMI in the GWAS Catalog
- has a CADD_phred = 10.05


## Pre-association testing

Annotation Explorer can also be used **pre-association testing** -- for example, to generate annotation-informed variant filtering and grouping files for aggregate testing (more on this in the Aggregate Association Tests tutorial). In the Annotation Explorer:

- Click: "Add filter"
    - Select "CHROM"
    - Select value "22"
    - Click: "Add"
- Click: "Add filter"
    - Select "CADD_phred"
    - Select "Greater than" and type in value "20"
    - Click: "Add"
- Click: "Add filter"
    - Select "MetaSVM_score"
    - Select "Greater than" and type in value "0.5"
    - Click: "Add"
- Click: "Run query" in the top right

This may take a couple of minutes to run.
Once the query finishes, any variants that match the filtering criteria will be displayed. We can see that there are 138,831 matching results. We may want to group (aggregate) variants for multi-variant association tests (this is particularly useful for rare variants). A common approach is to aggregate variants by gene:

- Click: "Aggregate manually"
    - Select "Ensembl_geneid"

Once the aggregation finishes, we see a histogram of the number of variants that meet our filtering criteria in each aggregation unit (i.e. unique Ensembl_geneid units). We see that 19,825 (98.26% of) aggregation units have 0 variants after our filtering (it's so many because we restricted to chromosome 22). There are 151 aggregation units with $> 100$ and $\leq 1000$ variants.


### Exercise 5.2 (Annotation Explorer)

Use the Annotation Explorer to identify all variants on chromosome 22 with CADD_phred score > 30 (i.e. the top 0.1% of most likely damaging variants) and MetaSVM_score > 0.5. How many variants are selected? Aggregate the results by Ensembl_geneid. How many aggregation units have $> 100$ and $\leq 1000$ variants?

#### Solution 5.2 (Annotation Explorer)

There are 18,411 variants that meet the filtering criteria. There are 41 aggregation units defined by Ensembl gene ID with $> 100$ and $\leq 1000$ variants.


## Multi-variant association testing

In the `06_aggregate_tests.Rmd` tutorial, we will learn about aggregate multi-variant association tests. Variant annotation can be very useful for pre-selecting which variants to test in aggregate. In general, a more stringent filtering approach (e.g. CADD_phred > 30 vs. CADD_phred > 20) will reduce the number of aggregation units which have at least one variant. Often, there is not a "correct" pre-determined cut-off to implement for an annotation field to optimize association tests.
Annotation Explorer enables the user to play with varying filtering criteria, which can help visualize their effects on the aggregation unit characteristics and may assist in choosing a filtering criterion in an informed way.

diff --git a/05_annotation_explorer.html b/05_annotation_explorer.html
new file mode 100644
index 0000000..e7c95c3
--- /dev/null
+++ b/05_annotation_explorer.html
@@ -0,0 +1,586 @@
05_annotation_explorer.knit
5. Annotation Explorer

In this tutorial, we will learn how to use Annotation Explorer, an open tool available on the NHLBI BioData Catalyst powered by Seven Bridges cloud platform that eliminates the challenges of working with very large variant-level annotated datasets. Annotation Explorer has an interactive graphical user interface built on high-performance databases and does not require any programming experience. We will learn how to explore and interactively query variant annotations and integrate them into GWAS analyses.

For these exercises, we will be using the open-access “GenomeWide” dataset, which contains annotations for all possible combinations of SNVs at each position in the genome, as well as INDELs submitted to dbSNP.

  • Launch the Annotation Explorer
      • From the top menu, click “Data” > “Annotation Explorer”
      • Click: “Query Dataset” on the “GenomeWide” Dataset
      • Choose the SISG Billing Group
      • Choose the XSMALL Instance type
      • Click: “Select”
Post-association testing

Annotation Explorer can be used post-association testing – for example, to explore annotations of variants in a novel GWAS signal. Suppose we performed a GWAS and our top hit was the C>T variant on chromosome 19 at position 44908822. In the Annotation Explorer:

  • Click: “Add filter”
      • Select “CHROM”
      • Select value “19”
      • Click: “Add”
  • Click: “Add filter”
      • Select “POS”
      • Select “Equals” and type in value “44908822”
      • Click: “Add”
  • Click: “Run query” in the top right

This may take a minute to run. Once the query finishes, any variants that match the filtering criteria will be displayed. We can click the “+” icon on the right of the results to “add additional annotations” to the results:

  • Click: “+”
      • Use the search to find and select “KGP_AF” (allele frequency from 1000 Genomes Project)
      • Use the search to find and select “GWAS_catalog_rs”, “GWAS_catalog_trait”, and “GWAS_catalog_OR”
      • Use the search to find and select “CADD_phred” (variants with larger scores are more likely damaging)
      • Click: “Add” to update the query

We can see that the C>T variant at 19:44908822

  • is rs7412
  • has a T allele frequency of 7.5% in 1000 Genomes
  • has been associated with systolic blood pressure in the GWAS Catalog
  • has a CADD_phred = 26 (variants with CADD_phred > 20 are in the top 1% of scores)

Exercise 5.1 (Annotation Explorer)

Use the Annotation Explorer to find the rsID, the allele frequency in the 1000 Genomes Project, and the CADD_phred score for the T>A variant on chromosome 16 at position 53786615. Also, what trait is this variant associated with in the GWAS Catalog?

Solution 5.1 (Annotation Explorer)

Running Annotation Explorer, we see that the T>A variant at 16:53786615

  • is rs9939609
  • has an A allele frequency of 34.0% in 1000 Genomes
  • has been associated with BMI in the GWAS Catalog
  • has a CADD_phred = 10.05
Pre-association testing

Annotation Explorer can also be used pre-association testing – for example, to generate annotation-informed variant filtering and grouping files for aggregate testing (more on this in the Aggregate Association Tests tutorial). In the Annotation Explorer:

  • Click: “Add filter”
      • Select “CHROM”
      • Select value “22”
      • Click: “Add”
  • Click: “Add filter”
      • Select “CADD_phred”
      • Select “Greater than” and type in value “20”
      • Click: “Add”
  • Click: “Add filter”
      • Select “MetaSVM_score”
      • Select “Greater than” and type in value “0.5”
      • Click: “Add”
  • Click: “Run query” in the top right

This may take a couple of minutes to run. Once the query finishes, any variants that match the filtering criteria will be displayed. We can see that there are 138,831 matching results. We may want to group (aggregate) variants for multi-variant association tests (this is particularly useful for rare variants). A common approach is to aggregate variants by gene:

  • Click: “Aggregate manually”
      • Select “Ensembl_geneid”

Once the aggregation finishes, we see a histogram of the number of variants that meet our filtering criteria in each aggregation unit (i.e. unique Ensembl_geneid units). We see that 19,825 (98.26% of) aggregation units have 0 variants after our filtering (it’s so many because we restricted to chromosome 22). There are 151 aggregation units with \(> 100\) and \(\leq 1000\) variants.

Exercise 5.2 (Annotation Explorer)

Use the Annotation Explorer to identify all variants on chromosome 22 with CADD_phred score > 30 (i.e. the top 0.1% of most likely damaging variants) and MetaSVM_score > 0.5. How many variants are selected? Aggregate the results by Ensembl_geneid. How many aggregation units have \(> 100\) and \(\leq 1000\) variants?

Solution 5.2 (Annotation Explorer)

There are 18,411 variants that meet the filtering criteria. There are 41 aggregation units defined by Ensembl gene ID with \(> 100\) and \(\leq 1000\) variants.
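Conceptually, the point-and-click filtering and aggregation above is a simple filter-and-count operation. A minimal R sketch of the same logic, assuming a hypothetical data.frame `annot.df` of variant annotations with the column names used in the GUI (Annotation Explorer itself runs these queries on its own backend databases):

```r
# Filter variants the same way as in the GUI (column names are assumptions)
keep <- annot.df$CHROM == "22" &
    annot.df$CADD_phred > 30 &
    annot.df$MetaSVM_score > 0.5
filtered <- annot.df[which(keep), ]
nrow(filtered)  # number of variants passing the filters

# aggregate by gene and summarize unit sizes, as in the histogram
counts <- table(filtered$Ensembl_geneid)
sum(counts > 100 & counts <= 1000)  # units with > 100 and <= 1000 variants
```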
Multi-variant association testing

In the 06_aggregate_tests.Rmd tutorial, we will learn about aggregate multi-variant association tests. Variant annotation can be very useful for pre-selecting which variants to test in aggregate. In general, a more stringent filtering approach (e.g. CADD_phred > 30 vs. CADD_phred > 20) will reduce the number of aggregation units which have at least one variant. Often, there is not a “correct” pre-determined cut-off to implement for an annotation field to optimize association tests. Annotation Explorer enables the user to play with varying filtering criteria, which can help visualize their effects on the aggregation unit characteristics and may assist in choosing a filtering criterion in an informed way.
diff --git a/06_aggregate_tests.Rmd b/06_aggregate_tests.Rmd
new file mode 100644
index 0000000..e915932
--- /dev/null
+++ b/06_aggregate_tests.Rmd
@@ -0,0 +1,326 @@
# 6. Aggregate Association Tests

Multi-variant association tests, which are commonly used for testing rare variants in aggregate, can be used to identify when variants in a genomic region (e.g. a gene), potentially with certain properties defined by variant annotation, are associated with a phenotype of interest. Under certain assumptions, these aggregate tests can improve statistical power to detect association when single variant tests are under-powered and/or poorly calibrated. This tutorial demonstrates how to perform aggregate multi-variant association tests using the [GENESIS](https://bioconductor.org/packages/release/bioc/html/GENESIS.html) R/Bioconductor package.

## Aggregation Units for Association Testing

In this tutorial, we will be using a subset of genes from chromosome 8 as our aggregation units. We use Gencode v38 gene boundaries in genome build GRCh38/hg38 and label genes by their Ensembl gene IDs. It is important to use aggregation units based on the genome build consistent with your sample genotype data. The gene boundaries are provided in a `GRanges` object, which is constructed with the GenomicRanges R/Bioconductor package.

```{r, message = FALSE}
repo_path <- "https://github.com/UW-GAC/SISG_2022/raw/main"
if (!dir.exists("data")) dir.create("data")

library(GenomicRanges)

genefile <- "data/gencode.v38.hg38_ENSG_GRanges_subset_chr8.RData"
if (!file.exists(genefile)) download.file(file.path(repo_path, genefile), genefile)
genes <- get(load(genefile))
genes

# number of genes
length(genes)
```

In the `GRanges` object, the `seqnames` field provides the chromosome value and the `ranges` field provides the gene boundaries. The metadata also includes `strand` direction and `gene` names.
Each entry in the object is labeled by the Ensembl gene ID (e.g. ENSG00000253764).

## Aggregate Association Tests

As we saw in the lecture, there are many different types of multi-variant association tests. We can perform burden, SKAT, SKAT-O, fastSKAT, or SMMAT tests using the same `assocTestAggregate` function from GENESIS. When performing multi-variant association tests with GENESIS, the process is *very* similar to performing single variant association tests.

### Prepare the Data

First, we load the `AnnotatedDataFrame` with the phenotype data, open a connection to the GDS file with the genotype data, and create our `SeqVarData` object linking the two. This is exactly the same as in the previous tutorials.

```{r, message = FALSE}
# open the GDS file
library(SeqVarTools)

gdsfile <- "data/1KG_phase3_GRCh38_subset_chr8.gds"
if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
gdsfmt::showfile.gds(closeall=TRUE) # make sure the file is not already open
gds <- seqOpen(gdsfile)

# sample annotation file
annotfile <- "data/pheno_annotated_pcs.RData"
if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
annot <- get(load(annotfile))

# make the SeqVarData object
seqData <- SeqVarData(gds, sampleData=annot)
```

When performing aggregate tests using gene boundaries in a `GRanges` object, we define a `SeqVarRangeIterator` object where each list element is a gene aggregation unit. This is the only difference in the data preparation process from what we saw in previous tutorials.

```{r}
# construct the iterator using the SeqVarRangeIterator function
iterator <- SeqVarRangeIterator(seqData, variantRanges=genes, verbose=FALSE)
iterator
```

### Null Model

As with the single variant association tests, multi-variant association tests require that we first fit a null model. In most cases, you will want to use *exactly* the same null model for both single and multi-variant tests.
We load the same null model for trait_1 that we fit and saved in the `02_GWAS.Rmd` tutorial.

```{r}
# load the null model
nullmodfile <- "data/null_model_trait1.RData"
if (!file.exists(nullmodfile)) download.file(file.path(repo_path, nullmodfile), nullmodfile)
nullmod <- get(load(nullmodfile))

# summary
nullmod$model
```


### Burden Test

First, we perform a burden test. We restrict the test to variants with alternate allele frequency < 0.01. We use a uniform weighting scheme -- i.e. every variant gets the same weight (a Beta(1,1) distribution is a uniform distribution). The `assocTestAggregate` function iterates over all aggregation units (i.e. genes) in the `SeqVarRangeIterator` object.

```{r assoc_burden}
# run the burden test
library(GENESIS)
assoc.burden <- assocTestAggregate(iterator,
                                   null.model = nullmod,
                                   test = "Burden",
                                   AF.max = 0.01,
                                   weight.beta = c(1,1))
names(assoc.burden)
```

The function returns the primary results for each aggregation unit in one table (`results`). It also returns a list of tables that contain the variant details for each aggregation unit tested (`variantInfo`).

```{r}
# results for each aggregation unit
class(assoc.burden$results)
dim(assoc.burden$results)
head(assoc.burden$results)
```

Each row of the `results` data.frame represents one tested aggregation unit and includes: the number of variants/sites included (`n.site`), the total number of alternate alleles observed across all samples at those variants (`n.alt`), the total number of samples with at least one alternate allele observed at some variant (`n.sample.alt`), the burden score value (`Score`) and its standard error (`Score.SE`), the burden score test statistic (`Score.Stat`) and $p$-value (`Score.pval`), an approximation of the burden effect size (`Est`) and its standard error (`Est.SE`), and an approximation of the proportion of variation explained by the burden (`PVE`).

```{r}
# variant info per aggregation unit
class(assoc.burden$variantInfo)
head(assoc.burden$variantInfo[[1]])
```

The `variantInfo` for each aggregation unit includes: variant information (`variant.id`, `chr`, and `pos`), the number of samples included (`n.obs`), the minor allele count (`MAC`), the effect allele frequency (`freq`), and the weight assigned to that variant (`weight`).

When performing aggregate tests, we usually want to filter out aggregation units where the cumulative number of minor alleles (i.e. cumulative MAC) across all samples and variants is below some threshold. Similarly to how single variant tests are not well calibrated when a variant is very rare, these aggregate tests are not well calibrated when the cumulative MAC is very small. The `n.alt` value in the `assocTestAggregate` output gives the total number of alternate alleles observed across all samples and variants in the aggregation unit. Filter the output to only genes with at least 5 alternate alleles observed across all samples and variants.

```{r}
burden <- assoc.burden$results[assoc.burden$results$n.alt >= 5, ]
dim(burden)
```

When performing aggregate tests, we typically use a Bonferroni correction for the number of aggregation units tested to account for multiple testing. In other words, we use a less stringent $p$-value threshold than the genome-wide significance threshold used for single variant GWAS. Check for significant burden associations.

```{r}
burden[burden$Score.pval < 0.05/nrow(burden), ]
```

We have one significant burden association at ENSG00000251354 with $p = 1.6 \times 10^{-5}$.
We can also make a QQ plot of the burden $p$-values from the main results table.

```{r}
library(ggplot2)
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

qqPlot(burden$Score.pval)
```

Note: QQ plots for multi-variant tests are often not as clean as for single variant GWAS, particularly in the lower part of the plot (i.e. insignificant $p$-values near $-log_{10}(p) = 0$). However, the QQ plot can still be useful to assess egregious issues.

### SKAT Test

We can also perform a SKAT test. This time, we will use the Wu weights (i.e. drawn from a Beta(1,25) distribution), which give larger weights to rarer variants (note the different weight values in the `variantInfo` output).

```{r assoc_skat, message = FALSE}
# reset the iterator to the first window
resetIterator(iterator, verbose = FALSE)

# run the SKAT test
assoc.skat <- assocTestAggregate(iterator,
                                 null.model = nullmod,
                                 test = "SKAT",
                                 AF.max = 0.01,
                                 weight.beta = c(1,25))
```

```{r}
# results for each aggregation unit
head(assoc.skat$results)
```

Again, each row of the `results` data.frame represents one tested aggregation unit. Some of the columns are the same as for the burden test; new columns include: the SKAT statistic (`Q`), the SKAT $p$-value (`pval`), the $p$-value method used (`pval.method`), and an indicator of whether any error was detected in computing the $p$-value (`err`).
If any aggregation units indicate an error (value = 1), they should be dropped from the results. Note that there is no effect size provided, as there is no such concept for SKAT.

```{r}
table(assoc.skat$results$pval.method, assoc.skat$results$err, exclude = NULL)
```

```{r}
# variant info per aggregation unit
head(assoc.skat$variantInfo[[3]])
```

The `variantInfo` for each aggregation unit includes the same information as for the burden test, but note the different variant weight values due to using the Wu weights instead of uniform weights.

```{r}
# filter based on cumulative MAC
skat <- assoc.skat$results[assoc.skat$results$n.alt >= 5, ]

# significant genes
skat[skat$pval < 0.05/nrow(skat), ]

# make a QQ plot of the SKAT test p-values
qqPlot(skat$pval)
```

We have one significant SKAT association at ENSG00000253184 with $p = 1.1 \times 10^{-4}$.

### SMMAT Test

We can also perform a SMMAT test, which efficiently combines the $p$-values from the burden test and an asymptotically independent adjusted "SKAT-type" test (it's essentially a SKAT test conditional on the burden) using Fisher's method. This method is conceptually similar to the SKAT-O test but much faster computationally.

```{r assoc_smmat, message = FALSE}
# reset the iterator to the first window
resetIterator(iterator, verbose = FALSE)

# run the SMMAT test
assoc.smmat <- assocTestAggregate(iterator,
                                  null.model = nullmod,
                                  test = "SMMAT",
                                  AF.max = 0.01,
                                  weight.beta = c(1,25))
```

```{r}
# results for each aggregation unit
head(assoc.smmat$results)
```

Again, each row of the `results` data.frame represents one tested aggregation unit. Some of the columns are the same as for the burden and SKAT tests; new columns include the SMMAT combined $p$-value (`pval_SMMAT`).
Note that the burden score value (`Score_burden`) and its standard error (`Score.SE_burden`), and the burden score test statistic (`Stat_burden`) and $p$-value (`pval_burden`) are included -- these are the same values you would get from running the burden test. There are also columns for the SKAT-type test statistic (`Q_theta`), $p$-value (`pval_theta`), $p$-value method (`pval_theta.method`), and error indicator (`err`) -- these are *not* the same values you would get from running SKAT because the "theta" component of the test has been adjusted for the burden test. Again, there is no effect size provided, as there is no such concept for the overall SMMAT test.

```{r}
# variant info per aggregation unit
head(assoc.smmat$variantInfo[[3]])
```

Again, the `variantInfo` for each aggregation unit includes the same information as the other tests.

The function returns the $p$-values from the burden test (`pval_burden`), the adjusted SKAT-type test (`pval_theta`), and the combined $p$-value (`pval_SMMAT`). The combined $p$-value is the one to use for assessing significance. The burden and theta $p$-values may be of secondary interest for further exploring results.

```{r}
# filter based on cumulative MAC
smmat <- assoc.smmat$results[assoc.smmat$results$n.alt >= 5, ]

# significant genes
smmat[smmat$pval_SMMAT < 0.05/nrow(smmat), ]

# make a QQ plot of the SMMAT test p-values
qqPlot(smmat$pval_SMMAT)
```

The SMMAT test found two significant genes, ENSG00000253184 and ENSG00000251354, which were the genes that the SKAT and burden tests found, respectively. For ENSG00000253184, the SMMAT $p = 4.6 \times 10^{-4}$, while the SKAT $p = 1.1 \times 10^{-4}$ (see above) was slightly more significant.
For ENSG00000251354, the SMMAT $p = 4.3 \times 10^{-6}$ was more significant than the burden $p = 1.0 \times 10^{-5}$ (the burden $p$-value is a bit different from earlier because we used the Wu weights instead of uniform weights) -- as seen here, the combined SMMAT $p$-value may be more significant than either burden or SKAT separately.


## Exercise 6.1 (Application)

Use the `GENESIS Aggregate Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform gene-based burden tests for trait_1 using the null model previously fit in the `02_GWAS.Rmd` tutorial. Only include variants with alternate allele frequency < 1% and use the Wu weights to upweight rarer variants. Use the genotype data in the genome-wide GDS files you created previously.

The `GENESIS Aggregate Association Testing` app currently requires variant group files that are RData data.frames (i.e. our GRanges objects with gene definitions will not work). Fortunately, it is easy to transform our GRanges object into the required data.frame. The files you need to run the application are already in the project files on the SBG platform.
+
+```{r}
+# look at the GRanges object
+genes
+
+# convert to the required data.frame
+genes.df <- data.frame("group_id" = names(genes),
+                       chr = seqnames(genes),
+                       start = start(genes),
+                       end = end(genes))
+head(genes.df)
+```
+
+The steps to perform this analysis are as follows:
+
+- Copy the app to your project if it is not already there:
+    - Click: Public Resources > Workflows and Tools > Browse
+    - Search for `GENESIS Aggregate Association Testing`
+    - Click: Copy > Select your project > Copy
+- Run the analysis in your project:
+    - Click: Apps > `GENESIS Aggregate Association Testing` > Run
+    - Specify the Inputs:
+        - GDS files: `1KG_phase3_GRCh38_subset_chr<CHR>.gds` (select all 22 chromosomes)
+        - Null model file: `1KG_trait_1_null_model.RData`
+        - Phenotype file: `1KG_trait_1_phenotypes.RData` (use the phenotype file created by the Null Model app)
+        - Variant group files: `gencode.v38.hg38_ENSG_VarGroups_subset_chr<CHR>.RData` (select all 22 chromosomes)
+    - Specify the App Settings:
+        - define_segments > Genome build: hg38
+        - aggregate_list > Aggregate type: position
+        - assoc_aggregate > Alt Freq Max: 0.01
+        - assoc_aggregate > Memory GB: 32 (increase to make sure enough is available)
+        - assoc_aggregate > Test: burden
+        - assoc_aggregate > Weight Beta: "1 25"
+        - Output prefix: "1KG_trait_1_burden" (or any other string to name the output file)
+        - GENESIS Association results plotting > Plot MAC threshold: 5
+    - Click: Run
+
+The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
+
+The output of this analysis will be 22 `<output_prefix>_chr<CHR>.RData` files with the association test results for each chromosome as well as a `<output_prefix>_manh.png` file with the Manhattan plot and a `<output_prefix>_qq.png` file with the QQ plot. Review the Manhattan plot -- are there any significant gene associations?
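Once the per-chromosome result files are available, they can be stacked into a single table for genome-wide review. A minimal sketch (the file paths are hypothetical examples following the output prefix above; adjust them to match your task's actual outputs):

```r
# stack per-chromosome assocTestAggregate results into one data.frame
combine_results <- function(files) {
  do.call(rbind, lapply(files, function(f) get(load(f))$results))
}

# e.g., with the "1KG_trait_1_burden" output prefix (hypothetical paths):
# files <- sprintf("1KG_trait_1_burden_chr%d.RData", 1:22)
# all_burden <- combine_results(files)
```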
+ +You can find the expected output of this analysis by looking at the existing task `11 Burden Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output. + + +## Exercise 6.2 (Data Studio) + +After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chr 8 burden results into RStudio and find the significant genes. +```{r} +# your solution here +# +# +# +# +# +# +# +# +# +``` + +### Solution 6.2 (Data Studio) + +After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chr 8 burden results into RStudio and find the significant genes. + +```{r, eval = FALSE} +# load +assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_burden_chr8.RData')) +names(assoc) + +head(assoc$results) + +# filter to cumulative MAC >= 5 +burden <- assoc$results[assoc$results$n.alt >= 5, ] + +# significant genes +burden[burden$Score.pval < 0.05/nrow(burden), ] +``` + +Gene ENSG00000186510.7 has the smallest burden p-value ($p = 4.8x10^{-4}$). + diff --git a/06_aggregate_tests.html b/06_aggregate_tests.html new file mode 100644 index 0000000..9c9de54 --- /dev/null +++ b/06_aggregate_tests.html @@ -0,0 +1,1014 @@ + + + + + + + + + + + + + +06_aggregate_tests.knit + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

6. Aggregate Association Tests

+

Multi-variant association tests, which are commonly used for testing +rare variants in aggregate, can be used to identify when variants in a +genomic region (e.g. a gene), potentially with certain properties +defined by variant annotation, are associated with a phenotype of +interest. Under certain assumptions, these aggregate tests can improve +statistical power to detect association when single variant tests are +under-powered and/or poorly calibrated. This tutorial demonstrates how +to perform aggregate multi-variant association tests using the GENESIS +R/Bioconductor package.

+
+

Aggregation Units for Association Testing

+

In this tutorial, we will be using a subset of genes from chromosome +8 as our aggregation units. We use Gencode v38 gene boundaries in genome +build GRCh38/hg38 and label genes by their Ensembl gene IDs. It is +important to use aggregation units based on the genome build consistent +with your sample genotype data. The gene boundaries are provided in a +GRanges object, which is constructed with the GenomicRanges +R/Bioconductor package.

+
repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
+if (!dir.exists("data")) dir.create("data")
+
+library(GenomicRanges)
+
+genefile <- "data/gencode.v38.hg38_ENSG_GRanges_subset_chr8.RData"
+if (!file.exists(genefile)) download.file(file.path(repo_path, genefile), genefile)
+genes <- get(load(genefile))
+genes
+
## GRanges object with 50 ranges and 1 metadata column:
+##                   seqnames              ranges strand |          gene
+##                      <Rle>           <IRanges>  <Rle> |   <character>
+##   ENSG00000253764        8     1971153-1974637      - | RP11-439C15.4
+##   ENSG00000215373        8     7287392-7290402      + |      FAM90A5P
+##   ENSG00000176782        8     7836436-7841242      + |      DEFB104A
+##   ENSG00000248538        8     9151695-9425524      + | RP11-115J16.1
+##   ENSG00000206950        8   14332555-14332663      + |         Y_RNA
+##               ...      ...                 ...    ... .           ...
+##   ENSG00000223697        8 132838117-132844298      - |    AF230666.2
+##   ENSG00000226807        8 141433829-141507230      - |         MROH5
+##   ENSG00000198576        8 142611049-142614479      - |           ARC
+##   ENSG00000253196        8 142763116-142766427      + | RP11-706C16.7
+##   ENSG00000255343        8 143833270-143834063      - | RP11-299M14.2
+##   -------
+##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
+
# number of genes
+length(genes)
+
## [1] 50
+

In the GRanges object, the seqnames field +provides the chromosome value and the ranges field provides +the gene boundaries. The metadata also includes strand +direction and gene names. Each entry in the object is +labeled by the Ensembl gene ID (e.g. ENSG00000253764).

+
+
+

Aggregate Association Tests

+

As we saw in the lecture, there are many different types of +multi-variant association tests. We can perform burden, SKAT, SKAT-O, +fastSKAT, or SMMAT tests using the same assocTestAggregate +function from GENESIS. When performing multi-variant association tests +with GENESIS, the process is very similar to performing single +variant association tests.

+
+

Prepare the Data

+

First, we load the AnnotatedDataFrame with the phenotype +data, open a connection to the GDS file with the genotype data, and +create our SeqVarData object linking the two. This is +exactly the same as the previous tutorials.

+
# open the GDS file
+library(SeqVarTools)
+
+gdsfile <- "data/1KG_phase3_GRCh38_subset_chr8.gds"
+if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
+gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
+gds <- seqOpen(gdsfile)
+
+# sample annotation file
+annotfile <- "data/pheno_annotated_pcs.RData"
+if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
+annot <- get(load(annotfile))
+
+# make the seqVarData object
+seqData <- SeqVarData(gds, sampleData=annot)
+

When performing aggregate tests using gene boundaries in a +GRanges object, we define a +SeqVarRangeIterator object where each list element is a +gene aggregation unit. This is the only difference in the data +preparation process from what we saw in previous tutorials.

+
# construct the iterator using the SeqVarRangeIterator function
+iterator <- SeqVarRangeIterator(seqData, variantRanges=genes, verbose=FALSE)
+iterator
+
## SeqVarRangeIterator object; on iteration 1 of 50 
+##  | GDS:
+## File: /Users/mconomos/Documents/Teaching/SISG_2024/data/1KG_phase3_GRCh38_subset_chr8.gds (2.3M)
+## +    [  ] *
+## |--+ description   [  ] *
+## |--+ sample.id   { Str8 1040 LZMA_ra(9.88%), 829B } *
+## |--+ variant.id   { Int32 62056 LZMA_ra(15.1%), 36.6K } *
+## |--+ position   { Int32 62056 LZMA_ra(32.3%), 78.4K } *
+## |--+ chromosome   { Str8 62056 LZMA_ra(0.14%), 177B } *
+## |--+ allele   { Str8 62056 LZMA_ra(16.3%), 40.9K } *
+## |--+ genotype   [  ] *
+## |  |--+ data   { Bit2 2x1040x62056 LZMA_ra(6.23%), 1.9M } *
+## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
+## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
+## |--+ phase   [  ]
+## |  |--+ data   { Bit1 1040x62056 LZMA_ra(0.02%), 1.3K } *
+## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
+## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
+## |--+ annotation   [  ]
+## |  |--+ id   { Str8 62056 LZMA_ra(35.4%), 250.4K } *
+## |  |--+ qual   { Float32 62056 LZMA_ra(0.08%), 197B } *
+## |  |--+ filter   { Int32,factor 62056 LZMA_ra(0.07%), 193B } *
+## |  |--+ info   [  ]
+## |  \--+ format   [  ]
+## \--+ sample.annotation   [  ]
+##  | sampleData:
+## An object of class 'AnnotatedDataFrame'
+##   rowNames: 1 2 ... 1040 (1040 total)
+##   varLabels: sample.id pop ... PC7 (15 total)
+##   varMetadata: labelDescription
+##  | variantData:
+## An object of class 'AnnotatedDataFrame': none
+##  | variantRanges:
+## GRanges object with 50 ranges and 1 metadata column:
+##                   seqnames              ranges strand |          gene
+##                      <Rle>           <IRanges>  <Rle> |   <character>
+##   ENSG00000253764        8     1971153-1974637      - | RP11-439C15.4
+##   ENSG00000215373        8     7287392-7290402      + |      FAM90A5P
+##   ENSG00000176782        8     7836436-7841242      + |      DEFB104A
+##   ENSG00000248538        8     9151695-9425524      + | RP11-115J16.1
+##   ENSG00000206950        8   14332555-14332663      + |         Y_RNA
+##               ...      ...                 ...    ... .           ...
+##   ENSG00000223697        8 132838117-132844298      - |    AF230666.2
+##   ENSG00000226807        8 141433829-141507230      - |         MROH5
+##   ENSG00000198576        8 142611049-142614479      - |           ARC
+##   ENSG00000253196        8 142763116-142766427      + | RP11-706C16.7
+##   ENSG00000255343        8 143833270-143834063      - | RP11-299M14.2
+##   -------
+##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
+
+
+

Null Model

+

As with the single variant association tests, multi-variant +association tests require that we first fit a null model. In most cases, +you will want to use exactly the same null model for both +single and multi-variant tests. We load the same null model for trait_1 +that we fit and saved in the 02_GWAS.Rmd tutorial.

+
# load the null model
+nullmodfile <- "data/null_model_trait1.RData"
+if (!file.exists(nullmodfile)) download.file(file.path(repo_path, nullmodfile), nullmodfile)
+nullmod <- get(load(nullmodfile))
+
+# summary
+nullmod$model
+
## $hetResid
+## [1] FALSE
+## 
+## $family
+## 
+## Family: gaussian 
+## Link function: identity 
+## 
+## 
+## $outcome
+## [1] "trait_1"
+## 
+## $covars
+## [1] "sex" "age" "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7"
+## 
+## $formula
+## [1] "trait_1 ~ sex + age + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + (1|A)"
+
+
+

Burden Test

+

First, we perform a burden test. We restrict the test to variants +with alternate allele frequency < 0.01. We use a uniform weighting +scheme – i.e. every variant gets the same weight (a Beta(1,1) +distribution is a uniform distribution). The +assocTestAggregate function iterates over all aggregation +units (i.e. genes) in the SeqVarRangeIterator object.

+
# run the burden test
+library(GENESIS)
+assoc.burden <- assocTestAggregate(iterator, 
+                                   null.model = nullmod,
+                                   test = "Burden", 
+                                   AF.max = 0.01, 
+                                   weight.beta = c(1,1))
+
## # of selected samples: 1,040
+
## Using 6 CPU cores
+
names(assoc.burden)
+
## [1] "results"     "variantInfo"
+

The function returns the primary results for each aggregation unit in +one table (results). It also returns a list of tables that +contain the variant details for each aggregation unit tested +(variantInfo).

+
# results for each aggregation unit
+class(assoc.burden$results)
+
## [1] "data.frame"
+
dim(assoc.burden$results)
+
## [1] 50 10
+
head(assoc.burden$results)
+
##                 n.site n.alt n.sample.alt       Score   Score.SE Score.Stat
+## ENSG00000253764     53   173          132   -6.818734  11.067040 -0.6161298
+## ENSG00000215373      0     0            0          NA         NA         NA
+## ENSG00000176782     41   110           97    2.199517   8.321782  0.2643085
+## ENSG00000248538   5755 20042         1038 -213.148649 350.819169 -0.6075741
+## ENSG00000206950      0     0            0          NA         NA         NA
+## ENSG00000253184   1209  4240          873  -18.428450 145.780031 -0.1264127
+##                 Score.pval          Est      Est.SE          PVE
+## ENSG00000253764  0.5378088 -0.055672503 0.090358396 3.685591e-04
+## ENSG00000215373         NA           NA          NA           NA
+## ENSG00000176782  0.7915422  0.031761043 0.120166572 6.782423e-05
+## ENSG00000248538  0.5434700 -0.001731873 0.002850471 3.583944e-04
+## ENSG00000206950         NA           NA          NA           NA
+## ENSG00000253184  0.8994052 -0.000867147 0.006859650 1.551473e-05
+

Each row of the results data.frame represents one tested +aggregation unit and includes: the number of variants/sites included +(n.site), the total number of alternate alleles observed +across all samples in those variants (n.alt), the total +number of samples with at least one alternate allele observed at some +variant (n.sample.alt), the burden score value +(Score) and its standard error (Score.SE), the +burden score test statistic (Score.Stat) and \(p\)-value (Score.pval), an +approximation of the burden effect size (Est) and its +standard error (Est.SE), and an approximation of the +proportion of variation explained by the burden (PVE).
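As a sanity check on how these columns fit together (a sketch, not part of the original tutorial): for a score test, the test statistic is the score divided by its standard error, the approximate effect size is the score divided by the score variance, and the p-value follows from a normal approximation. The first row of the table above (ENSG00000253764) reproduces this:

```r
# values from the first row of assoc.burden$results above
score <- -6.818734
se    <- 11.067040

score / se                   # Score.Stat: -0.6161298
score / se^2                 # Est (approximate effect size): -0.0556725
2 * pnorm(-abs(score / se))  # Score.pval: 0.5378088
```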

+
# variant info per aggregation unit
+class(assoc.burden$variantInfo)
+
## [1] "list"
+
head(assoc.burden$variantInfo[[1]])
+
##    variant.id chr     pos allele.index n.obs         freq MAC weight
+## 4       87697   8 1971284            1  1040 0.0004807692   1      1
+## 5       87698   8 1971336            1  1040 0.0004807692   1      1
+## 6       87699   8 1971340            1  1040 0.0004807692   1      1
+## 7       87700   8 1971341            1  1040 0.0004807692   1      1
+## 8       87701   8 1971346            1  1040 0.0038461538   8      1
+## 10      87703   8 1971407            1  1040 0.0004807692   1      1
+

The variantInfo for each aggregation unit includes: +variant information (variant.id, chr, and +pos), the number of samples included (n.obs), +the minor allele count (MAC), the effect allele frequency +(freq), and the weight assigned to that variant +(weight).

+

When performing aggregate tests, we usually want to filter out
+aggregation units where the cumulative number of minor alleles
+(i.e. cumulative MAC) across all samples and variants is below some
+threshold. Similarly to how single variant tests are not well calibrated
+when a variant is very rare, these aggregate tests are not well
+calibrated when the cumulative MAC is very small. The n.alt
+values in the assocTestAggregate output give the total number
+of alternate alleles observed across all samples and variants in the
+aggregation unit. Filter the output to only genes with at least 5
+alternate alleles observed across all samples and variants.

+
burden <- assoc.burden$results[assoc.burden$results$n.alt >= 5, ]
+dim(burden)
+
## [1] 45 10
+

When performing aggregate tests, we typically use a Bonferroni +correction for the number of aggregation units tested to account for +multiple testing. In other words, we use a less stringent \(p\)-value threshold than the genome-wide +significant threshold used for single variant GWAS. Check for +significant burden associations.

+
burden[burden$Score.pval < 0.05/nrow(burden), ]
+
##                 n.site n.alt n.sample.alt    Score Score.SE Score.Stat
+## ENSG00000251354     25    53           50 23.78183 5.504684    4.32029
+##                   Score.pval      Est    Est.SE        PVE
+## ENSG00000251354 1.558241e-05 0.784839 0.1816635 0.01812127
+

We have one significant burden association at ENSG00000251354 with
+\(p = 1.6 \times 10^{-5}\). We can also
+make a QQ plot of the burden p-values from the main results table.

+
library(ggplot2)
+qqPlot <- function(pval) {
+    pval <- pval[!is.na(pval)]
+    n <- length(pval)
+    x <- 1:n
+    dat <- data.frame(obs=sort(pval),
+                      exp=x/n,
+                      upper=qbeta(0.025, x, rev(x)),
+                      lower=qbeta(0.975, x, rev(x)))
+    
+    ggplot(dat, aes(-log10(exp), -log10(obs))) +
+        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
+        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
+        geom_point() +
+        geom_abline(intercept=0, slope=1, color="red") +
+        xlab(expression(paste(-log[10], "(expected P)"))) +
+        ylab(expression(paste(-log[10], "(observed P)"))) +
+        theme_bw()
+}    
+
+qqPlot(burden$Score.pval)
+

+

Note: QQ plots for multi-variant tests are often not as clean as for +single variant GWAS, particularly in the lower part of the plot +(i.e. insignificant \(p\)-values near +to \(-log_{10}(p) = 0\)). However, the +QQ plot can still be useful to assess egregious issues.
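One numeric companion to the QQ plot (not shown in this tutorial, but commonly reported) is the genomic inflation factor lambda: the median observed chi-square statistic divided by its expected value under the null. Values near 1 suggest reasonable calibration, while values well above 1 suggest inflation. A sketch using only base R:

```r
# genomic inflation factor from a vector of p-values
lambdaGC <- function(pval) {
  pval <- pval[!is.na(pval)]
  obs <- qchisq(pval, df = 1, lower.tail = FALSE)  # observed chi-square stats
  median(obs) / qchisq(0.5, df = 1, lower.tail = FALSE)
}

# under the null, uniform p-values give lambda close to 1
set.seed(42)
lambdaGC(runif(5000))
```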

+
+
+

SKAT Test

+

We can also perform a SKAT test. This time, we will use the Wu +weights (i.e. drawn from a Beta(1,25) distribution), which give larger +weights to rarer variants (note the different weight values in the +variantInfo output).
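The weights in the variantInfo output are simply the chosen Beta density evaluated at each variant's allele frequency (the freq column). A quick check against two freq values taken from the output below (assuming this dbeta convention, which matches the printed weights):

```r
# two allele frequencies taken from the variantInfo output below
freq <- c(0.0038461538, 0.0004807692)

dbeta(freq, 1, 1)   # uniform weights: every variant gets weight 1
dbeta(freq, 1, 25)  # Wu weights: 22.79156 24.71313 -- rarer variants get more weight
```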

+
# reset the iterator to the first aggregation unit
+resetIterator(iterator, verbose = FALSE)
+
+# run the SKAT test
+assoc.skat <- assocTestAggregate(iterator, 
+                                 null.model = nullmod, 
+                                 test = "SKAT", 
+                                 AF.max = 0.01, 
+                                 weight.beta = c(1,25))
+
## # of selected samples: 1,040
+
# results for each aggregation unit
+head(assoc.skat$results)
+
##                 n.site n.alt n.sample.alt          Q         pval err
+## ENSG00000253764     53   173          132   59644.82 0.2378400464   0
+## ENSG00000215373      0     0            0         NA           NA  NA
+## ENSG00000176782     41   110           97   37082.94 0.2580691317   0
+## ENSG00000248538   5755 20042         1038 6014266.17 0.2735962312   0
+## ENSG00000206950      0     0            0         NA           NA  NA
+## ENSG00000253184   1209  4240          873 2351077.86 0.0001106246   0
+##                 pval.method
+## ENSG00000253764 integration
+## ENSG00000215373        <NA>
+## ENSG00000176782 integration
+## ENSG00000248538 integration
+## ENSG00000206950        <NA>
+## ENSG00000253184 integration
+

Again, each row of the results data.frame represents one +tested aggregation unit. Some of the columns are the same as for the +burden test; new columns include: the SKAT statistic (Q), +the SKAT \(p\)-value +(pval), the \(p\)-value +method used (pval.method), and an indicator if there was +any error detected in computing the \(p\)-value (err). If any +aggregation units indicate an error (value = 1), they should be dropped +from the results. Note that there is no effect size provided, as there +is no such concept for SKAT.
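The text above says error-flagged units should be dropped; here is a minimal sketch of that step using a toy table (the real results would come from assoc.skat$results):

```r
# toy results: one untested unit (NA) and one error-flagged unit (err = 1)
res <- data.frame(pval = c(0.238, NA, 1.1e-4, 0.03),
                  err  = c(0, NA, 0, 1))

# keep only units that were tested and computed without error
res[which(res$err == 0), ]
```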

+
table(assoc.skat$results$pval.method, assoc.skat$results$err, exclude = NULL)
+
##              
+##                0 <NA>
+##   integration 44    0
+##   saddlepoint  4    0
+##   <NA>         0    2
+
# variant info per aggregation unit 
+head(assoc.skat$variantInfo[[3]])
+
##    variant.id chr     pos allele.index n.obs         freq MAC   weight
+## 4      408422   8 7836645            1  1040 0.0038461538   8 22.79156
+## 6      408424   8 7836646            1  1040 0.0004807692   1 24.71313
+## 7      408425   8 7836662            1  1040 0.0024038462   5 23.59687
+## 13     408431   8 7836835            1  1040 0.0004807692   1 24.71313
+## 15     408433   8 7836873            1  1040 0.0009615385   2 24.42941
+## 17     408435   8 7836914            1  1040 0.0004807692   1 24.71313
+

The variantInfo for each aggregation unit includes the +same information as for the burden test, but note the different variant +weight values due to using the Wu weights instead of Uniform +weights.

+
# filter based on cumulative MAC
+skat <- assoc.skat$results[assoc.skat$results$n.alt >= 5, ]
+
+# significant genes
+skat[skat$pval < 0.05/nrow(skat), ]
+
##                 n.site n.alt n.sample.alt       Q         pval err pval.method
+## ENSG00000253184   1209  4240          873 2351078 0.0001106246   0 integration
+
# make a QQ plot of the SKAT test p-values
+qqPlot(skat$pval)
+

+

We have one significant SKAT association at ENSG00000253184 with +\(p = 1.1 \times 10^{-4}\).

+
+
+

SMMAT Test

+

We can also perform a SMMAT test, which efficiently combines the +\(p\)-values from the burden test and +an asymptotically independent adjusted “SKAT-type” test (it’s +essentially a SKAT test conditional on the burden) using Fisher’s +method. This method is conceptually similar to the SKAT-O test but much +faster computationally.

+
# reset the iterator to the first aggregation unit
+resetIterator(iterator, verbose = FALSE)
+
+# run the SMMAT test
+assoc.smmat <- assocTestAggregate(iterator, 
+                                  null.model = nullmod, 
+                                  test = "SMMAT", 
+                                  AF.max = 0.01, 
+                                  weight.beta = c(1,25))
+
## # of selected samples: 1,040
+
# results for each aggregation unit
+head(assoc.smmat$results)
+
##                 n.site n.alt n.sample.alt Score_burden Score.SE_burden
+## ENSG00000253764     53   173          132   -136.60540        254.8157
+## ENSG00000215373      0     0            0           NA              NA
+## ENSG00000176782     41   110           97     44.25059        193.0924
+## ENSG00000248538   5755 20042         1038  -5318.01023       7919.9428
+## ENSG00000206950      0     0            0           NA              NA
+## ENSG00000253184   1209  4240          873   -395.78019       3314.4992
+##                 Stat_burden pval_burden    Q_theta   pval_theta   pval_SMMAT
+## ENSG00000253764  -0.5360948   0.5918930   58453.89 2.044668e-01 0.3765952026
+## ENSG00000215373          NA          NA         NA           NA           NA
+## ENSG00000176782   0.2291679   0.8187384   37127.27 2.015127e-01 0.4622737914
+## ENSG00000248538  -0.6714708   0.5019207 6027887.29 2.220929e-01 0.3560418110
+## ENSG00000206950          NA          NA         NA           NA           NA
+## ENSG00000253184  -0.1194087   0.9049515 2349643.83 4.583332e-05 0.0004599947
+##                 err pval_theta.method
+## ENSG00000253764   0       integration
+## ENSG00000215373  NA              <NA>
+## ENSG00000176782   0       integration
+## ENSG00000248538   0       integration
+## ENSG00000206950  NA              <NA>
+## ENSG00000253184   0       integration
+

Again, each row of the results data.frame represents one +tested aggregation unit. Some of the columns are the same as for the +burden and SKAT tests; new columns include the SMMAT combined \(p\)-value (pval_SMMAT). Note +that the burden score value (Score_burden) and its standard +error (Score.SE_burden), and the burden score test +statistic (Stat_burden) and \(p\)-value (pval_burden) are +included – these are the same values you would get from running the +burden test. There are also columns for the SKAT-type test statistic +(Q_theta), \(p\)-value +(pval_theta), \(p\)-value +method (pval_theta.method), and error indicator +(err) – these are not the same values you would +get from running SKAT because the “theta” component of the test has been +adjusted for the burden test. Again, there is no effect size provided, +as there is no such concept for the overall SMMAT test.
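Assuming the combination is Fisher's method as described above (minus two times the sum of the log p-values, compared to a chi-square distribution with 4 degrees of freedom), we can reproduce pval_SMMAT for ENSG00000253184 from its two component p-values in the table above:

```r
# Fisher's method for combining two independent p-values
fisher_combine <- function(p1, p2) {
  pchisq(-2 * (log(p1) + log(p2)), df = 4, lower.tail = FALSE)
}

# pval_burden and pval_theta for ENSG00000253184 (from the results above)
fisher_combine(0.9049515, 4.583332e-05)  # ~4.6e-04, matching pval_SMMAT
```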

+
# variant info per aggregation unit 
+head(assoc.smmat$variantInfo[[3]])
+
##    variant.id chr     pos allele.index n.obs         freq MAC   weight
+## 4      408422   8 7836645            1  1040 0.0038461538   8 22.79156
+## 6      408424   8 7836646            1  1040 0.0004807692   1 24.71313
+## 7      408425   8 7836662            1  1040 0.0024038462   5 23.59687
+## 13     408431   8 7836835            1  1040 0.0004807692   1 24.71313
+## 15     408433   8 7836873            1  1040 0.0009615385   2 24.42941
+## 17     408435   8 7836914            1  1040 0.0004807692   1 24.71313
+

Again, the variantInfo for each aggregation unit +includes the same information as the other tests.

+

The function returns the \(p\)-values from the burden test +(pval_burden), the adjusted SKAT-type test +(pval_theta), and the combined \(p\)-value (pval_SMMAT). The +combined \(p\)-value is the one to use +for assessing significance. The burden and theta \(p\)-values may be of secondary interest for +further exploring results.

+
# filter based on cumulative MAC
+smmat <- assoc.smmat$results[assoc.smmat$results$n.alt >= 5, ]
+
+# significant genes
+smmat[smmat$pval_SMMAT < 0.05/nrow(smmat), ]
+
##                 n.site n.alt n.sample.alt Score_burden Score.SE_burden
+## ENSG00000253184   1209  4240          873    -395.7802       3314.4992
+## ENSG00000251354     25    53           50     578.9664        131.3365
+##                 Stat_burden  pval_burden    Q_theta   pval_theta   pval_SMMAT
+## ENSG00000253184  -0.1194087 9.049515e-01 2349643.83 4.583332e-05 4.599947e-04
+## ENSG00000251354   4.4082673 1.042009e-05   28639.65 2.561739e-02 4.307339e-06
+##                 err pval_theta.method
+## ENSG00000253184   0       integration
+## ENSG00000251354   0       integration
+
# make a QQ plot of the SMMAT test p-values
+qqPlot(smmat$pval_SMMAT)
+

+

The SMMAT test found two significant genes, ENSG00000253184 and +ENSG00000251354, which were the genes that the SKAT and burden tests +found respectively. For ENSG00000253184, the SMMAT \(p = 4.6 \times 10^{-4}\), while the SKAT +\(p = 1.1 \times 10^{-4}\) (see above) +was slightly more significant. For ENSG00000251354, the SMMAT \(p = 4.3 \times 10^{-6}\) was more +significant than the burden \(p = 1.0 \times +10^{-5}\) (the burden \(p\)-value is a bit different from earlier +because we used the Wu weights instead of Uniform weights) – as seen +here, the combined SMMAT \(p\)-value +may be more significant than either burden or SKAT separately.

+
+
+
+

Exercise 6.1 (Application)

+

Use the GENESIS Aggregate Association Testing app on the +BioData Catalyst powered by Seven Bridges platform to perform gene-based +burden tests for trait_1 using the null model previously fit in the +02_GWAS.Rmd tutorial. Only include variants with alternate +allele frequency < 1% and use the Wu weights to upweight rarer +variants. Use the genotype data in the genome-wide GDS files you created +previously.

+

The GENESIS Aggregate Association Testing app currently
+requires Variant group files that are RData data.frames (i.e. our
+GRanges objects with gene definitions will not work). Fortunately, it is
+easy to transform our GRanges object to the required data.frame. The
+files you need to run the application are already in the project files
+on the SBG platform.

+
# look at the GRanges object
+genes
+
## GRanges object with 50 ranges and 1 metadata column:
+##                   seqnames              ranges strand |          gene
+##                      <Rle>           <IRanges>  <Rle> |   <character>
+##   ENSG00000253764        8     1971153-1974637      - | RP11-439C15.4
+##   ENSG00000215373        8     7287392-7290402      + |      FAM90A5P
+##   ENSG00000176782        8     7836436-7841242      + |      DEFB104A
+##   ENSG00000248538        8     9151695-9425524      + | RP11-115J16.1
+##   ENSG00000206950        8   14332555-14332663      + |         Y_RNA
+##               ...      ...                 ...    ... .           ...
+##   ENSG00000223697        8 132838117-132844298      - |    AF230666.2
+##   ENSG00000226807        8 141433829-141507230      - |         MROH5
+##   ENSG00000198576        8 142611049-142614479      - |           ARC
+##   ENSG00000253196        8 142763116-142766427      + | RP11-706C16.7
+##   ENSG00000255343        8 143833270-143834063      - | RP11-299M14.2
+##   -------
+##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
+
# convert to the required data.frame
+genes.df <- data.frame("group_id" = names(genes),
+                       chr = seqnames(genes),
+                       start = start(genes),
+                       end = end(genes))
+head(genes.df)
+
##          group_id chr    start      end
+## 1 ENSG00000253764   8  1971153  1974637
+## 2 ENSG00000215373   8  7287392  7290402
+## 3 ENSG00000176782   8  7836436  7841242
+## 4 ENSG00000248538   8  9151695  9425524
+## 5 ENSG00000206950   8 14332555 14332663
+## 6 ENSG00000253184   8 16604255 16664879
+

The steps to perform this analysis are as follows:

+
    +
  • Copy the app to your project if it is not already there: +
      +
    • Click: Public Resources > Workflows and Tools > Browse
    • +
    • Search for GENESIS Aggregate Association Testing
    • +
    • Click: Copy > Select your project > Copy
    • +
  • +
  • Run the analysis in your project: +
      +
    • Click: Apps > GENESIS Aggregate Association Testing +> Run
    • +
    • Specify the Inputs: +
        +
      • GDS files: 1KG_phase3_GRCh38_subset_chr<CHR>.gds +(select all 22 chromosomes)
      • +
      • Null model file: 1KG_trait_1_null_model.RData
      • +
      • Phenotype file: 1KG_trait_1_phenotypes.RData (use the +phenotype file created by the Null Model app)
      • +
      • Variant group files: +gencode.v38.hg38_ENSG_VarGroups_subset_chr<CHR>.RData +(select all 22 chromosomes)
      • +
    • +
    • Specify the App Settings: +
        +
      • define_segments > Genome build: hg38
      • +
      • aggregate_list > Aggregate type: position
      • +
      • assoc_aggregate > Alt Freq Max: 0.01
      • +
• assoc_aggregate > Memory GB: 32 (increase to make sure enough is
+available)
      • +
      • assoc_aggregate > Test: burden
      • +
      • assoc_aggregate > Weight Beta: “1 25”
      • +
      • Output prefix: “1KG_trait_1_burden” (or any other string to name the +output file)
      • +
      • GENESIS Association results plotting > Plot MAC threshold: 5
      • +
    • +
    • Click: Run
    • +
  • +
+

The analysis will take a few minutes to run. You can find your +analysis in the Tasks menu of your Project to check on its progress and +see the results once it has completed.

+

The output of this analysis will be 22 +<output_prefix>_chr<CHR>.RData files with the +association test results for each chromosome as well as a +<output_prefix>_manh.png file with the Manhattan plot +and a <output_prefix>_qq.png file with the QQ plot. +Review the Manhattan plot – are there any significant gene +associations?

+

You can find the expected output of this analysis by looking at the +existing task 11 Burden Association Test trait_1 in the +Tasks menu of your Project. The output files are available in the +Project, so you do not need to wait for your analysis to finish to look +at the output.

+
+
+

Exercise 6.2 (Data Studio)

After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory /sbgenomics/project-files/. Load the chr 8 burden results into RStudio and find the significant genes.

# your solution here
Solution 6.2 (Data Studio)

After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory /sbgenomics/project-files/. Load the chr 8 burden results into RStudio and find the significant genes.

# load the association results object
assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_burden_chr8.RData'))
names(assoc)

head(assoc$results)

# filter to cumulative MAC >= 5 (the n.alt column)
burden <- assoc$results[assoc$results$n.alt >= 5, ]

# significant genes at a Bonferroni-corrected threshold
burden[burden$Score.pval < 0.05/nrow(burden), ]

Gene ENSG00000186510.7 has the smallest burden p-value (\(p = 4.8 \times 10^{-4}\)).
diff --git a/07_STAAR.Rmd b/07_STAAR.Rmd
new file mode 100644
index 0000000..c9bc400
--- /dev/null
+++ b/07_STAAR.Rmd
@@ -0,0 +1,166 @@
# 7. STAAR Pipeline

The STAAR pipeline provides practical options for rare variant analysis of whole-genome and whole-exome sequencing data, including when related participants are included in an analysis. STAAR makes it easy to incorporate functional annotation from the FAVOR database into multiple-variant aggregate association tests. You can also provide your own annotations and customize the pipeline for your tissue of interest; here we use the standard FAVOR database annotation strategy, as described in the accompanying lecture and the published paper.

## STAAR Pipeline Applications

These apps run phenotype-genotype association analyses for biobank-scale whole-genome/whole-exome sequencing data.

The first app, `STAARpipeline`, will:

1. Fit the null model. This fits your model with your outcome, adjustments, and kinship/genetic relatedness matrix, but does not use the genotypes.
2. Take the null model object from the first step and run your association analysis, while dynamically incorporating multiple functional annotations to empower rare variant (set) association analysis using the STAAR method.

The same null model can be used for single variant or aggregate tests.

The second app, `STAARpipelineSummary VarSet`, takes the single variant or aggregate test results generated by the `STAARpipeline` app, and will:

1. Summarize these results across all chromosomes and create a unified list of results.
2. Perform conditional analysis for (unconditionally) significant single variants or variant sets by adjusting for a given list of known variants.

The third app, `STAARpipelineSummary IndVar`, will:

1. Extract information (summary statistics) on individual variants from a user-specified variant set (gene category or genetic region) in the analytical follow-up of `STAARpipeline`.
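Outside the platform, the null model step corresponds roughly to the STAAR package's `fit_null_glmmkin` function. A minimal sketch, assuming a phenotype data frame `pheno` with columns `phenotype`, `age`, and `sex`, a sample ID column `id`, and a sparse kinship matrix `sgrm` (all of these names are illustrative, not files from this tutorial):

```r
library(STAAR)

# fit the null model: outcome ~ covariates, with a random effect for kinship;
# the genotypes are not used at this stage
nullmod <- fit_null_glmmkin(phenotype ~ age + sex, data = pheno,
                            kins = sgrm, use_sparse = TRUE, id = "id",
                            family = gaussian(link = "identity"))
```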
This pipeline is described in detail in PMID 36303018. It has also been applied in a number of papers, including PMID 36220816 and PMID 37253714.

## FAVOR

The FAVOR reference files used to annotate a given GDS file into an aGDS file are located under "FAVOR Essential Database" here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1VGTJI

These files have already been provided to you in the BioData Catalyst project, but you may need them in the future. Public tutorial/manual files are also available at https://docs.google.com/document/d/1l-fCmuey7HnrUxx2U67_bgwRrtyWyuSu98I3YXoTZoE/edit and https://github.com/xihaoli/STAARpipeline-Tutorial.


## Sliding Window Tests

We will use the `STAARpipeline` apps on the BioData Catalyst powered by Seven Bridges platform to perform sliding window tests for variants with an allele frequency $<1\%$. The steps to perform this analysis are as follows:

- Copy the relevant apps to your project if they are not already there:
    - Click: Public Resources > Workflows and Tools > Browse
    - Search for `STAAR`. You need all 4 apps listed below:
        - FAVORannotator
        - STAARpipeline
        - STAARpipelineSummary VarSet
        - STAARpipelineSummary IndVar
    - Click: Copy > Select your project > Copy

## Exercise 7.1 (Application)

First, run the `FAVORannotator` app on an example GDS file -- perhaps chromosome 19. This app runs one chromosome at a time.
- Run the analysis in your project:
    - Click: Apps > `FAVORannotator` > Run
    - Specify the Inputs:
        - GDS file: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned.gds`
        - FAVOR database for specific chromosome: `FAVOR_chr19.tar.gz` (provided at the link above by the STAAR package creators)
        - FAVORdatabase_chrsplit CSV file: `FAVORdatabase_chrsplit.csv` (provided at the link above by the STAAR package creators)
    - Specify the App Settings:
        - Chromosome: 19
        - Output file prefix: "1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor" (or any other string to name the output file)

Note: the other app setting defaults will not need to be changed for our example, but could be altered depending on cohort size, etc.

As this task would take about half an hour to run, do not run it. The output of this task would be an annotated GDS file named `<output_prefix>.gds`. You can find the expected output of this task by looking at the existing task `11. STAARexercise_FAVORannotator_chr19_example` in the Tasks menu of your Project. We will use the pre-provided output file available in the Project for the next steps.

## Exercise 7.2 (Application)

Next, run `STAARpipeline`. We will focus on a sliding window test for this exercise, but options also exist for gene-centric coding and noncoding tests and tests per ncRNA. This single application (`STAARpipeline`) includes options for running an initial null model, as well as multiple types of aggregate tests ("Gene_Centric_Coding", "Gene_Centric_Noncoding", "ncRNA", "Sliding_Window"). Note that there is an option to run single variant tests within STAAR and summarize them, but this is very similar to the GENESIS pipeline already covered, so it is not a part of this exercise.
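Before fitting the null model, you can sanity-check the annotated GDS produced in Exercise 7.1 by opening it with SeqArray in RStudio. A quick sketch (the exact annotation node names written by FAVORannotator may vary):

```r
library(SeqArray)

agds <- seqOpen("1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds")
seqSummary(agds)  # the FAVOR annotations appear alongside the usual GDS nodes
seqClose(agds)
```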
- First, generate an appropriate null model:
    - Click: Apps > `STAARpipeline` > Run
    - Specify the Inputs:
        - GDS files: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds` (in a real analysis, you would select all 22 chromosomes)
        - Annotation name catalog: `Annotation_name_catalog.csv`
        - Phenotype file: `mock_phenotype_SISG.csv`
    - Specify the App Settings:
        - Phenotype: phenotype
        - Covariates: age, sex (again, in a real analysis, you would want to include a kinship matrix and ancestry principal components)
        - Test type: Null
        - Output file prefix: "chr19_region_null" (or any other string to name the output file)
    - Click: Run

Note: you do not need to provide a variant grouping file; the variant annotations are already included in the annotated GDS.

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output file for this null model task is `<output_prefix>.Rdata`, a null model data object.

You can find the expected output of this analysis by looking at the existing task `12. STAARexercise_nullmodel_chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Exercise 7.3 (Application)

- Next, run a sliding window aggregate test (5 kb window).
    - Click: Apps > `STAARpipeline` > Run
    - Specify the Inputs:
        - GDS files: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds`
        - Annotation name catalog: `Annotation_name_catalog.csv`
        - Null model: `chr19_region_null.Rdata`
    - Specify App Settings:
        - Sliding window size (bp) to be used in sliding window test: 5000
        - Output file prefix: "region_sliding_5kb_chr19" (or any other string to name the output file)
        - Test type: Sliding_Window
    - Click: Run

Note: we will use the default allele frequency setting (max MAF $1\%$), and we are running the pipeline on single nucleotide variants (SNVs) only (indels would likely be included too in a real analysis). Mean values are used for missing genotypes by default.

The analysis will take ~10 minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output from this task is `<output_prefix>.Rdata`, which contains the STAAR association results.

You can find the expected output of this analysis by looking at the existing task `13. STAARexercise_STAARpipeline_run_sliding_5kb` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Exercise 7.4 (Application)

The output file from these STAAR association tests is an `.RData` file. To convert it to more human-readable formats, and to split different types of tests (e.g., LOF vs. missense for coding tests) into different `.Rdata` objects, use the `STAARpipelineSummary VarSet` app.
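Before summarizing, it may help to see what the app computes for each window: roughly, each 5 kb window's test corresponds to a call to the STAAR package's `STAAR` function on that window's genotypes. A sketch, where `geno` is a sample-by-variant genotype matrix for one window and `nullmod` is the null model from Exercise 7.2 (both names are illustrative):

```r
library(STAAR)

# rare_maf_cutoff = 0.01 matches the app's max MAF 1% setting
res <- STAAR(genotype = geno, obj_nullmodel = nullmod,
             rare_maf_cutoff = 0.01)
res$results_STAAR_O  # omnibus STAAR-O p-value for this window
```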
- Click: Apps > `STAARpipelineSummary VarSet` > Run
- Specify the Inputs:
    - Annotation name catalog: `Annotation_name_catalog.csv`
    - Input array results: `region_sliding_5kb_chr19.Rdata`
- Specify App Settings:
    - Output file prefix: "result_sliding_5kb_chr19" (or any other string to name the output file)
    - Prefix of input results: "region_sliding_5kb_chr"
    - Test type: Sliding_Window
- Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

You can find the expected output of this analysis by looking at the existing task `14. STAARexercise_STAARpipelineSummary_VarSet_sliding5kb` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.

The output from this task is `<output_prefix>_results_sliding_window_genome.Rdata` and `<output_prefix>_results_sliding_window_genome_sig.csv`. Note that the second ("sig", i.e. significant test) file is empty in this analysis - which is good/anticipated with a randomly generated phenotype!

We can also use the `STAARpipelineSummary VarSet` app to adjust for a list of known variants we would like to condition our analysis on (i.e. include as covariates), in order to determine whether our identified rare variant signals are independent of such known single variants. Files of significant "cond" results will then appear. If you use this option, note that you need to input `null_obj_file`, `agds_file_name`, and `agds_files` across all chromosomes at once (the app is not intended to be run on a single chromosome and will fail if a file for each autosome is not found).

## Exercise 7.5 (Application)

You can also use the `STAARpipelineSummary IndVar` app to examine the individual variants that contribute to an aggregate test. Here we examine one of the regions tested in the 5 kb sliding window test.
Note that individual variant $p$-values with a very low minor allele count (say, less than 5) are likely quite unstable and may not be useful. This can nevertheless be a useful type of annotation for your results.

- Click: Apps > `STAARpipelineSummary IndVar` > Run
- Specify the Inputs:
    - Annotation name catalog: `Annotation_name_catalog.csv`
    - AGDS file: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds`
    - Null model: `chr19_region_null.Rdata`
- Specify App Settings:
    - Chromosome: 19
    - End location: 45668803
    - Output file prefix: "slidingwindow_indvar_19_45663804" (or any other string to name the output file)
    - Start location: 45663804
    - Test type: Sliding_Window
- Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output from this analysis will be a `<output_prefix>.csv` file.

You can find the expected output of this analysis by looking at the existing task `15. STAARexercise_STAARpipelineSummary_IndVar` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Gene-centric Tests

While it takes a bit too long to run as an in-class exercise, you can also check out results for a gene-centric coding variant mask (similar gene-centric noncoding variant masks could also be run) using a less sparse genotype file than we have usually used in this class. The example tasks are `STAARexercise_example_STAARpipeline_gene_centric_coding` and `STAARexercise_example_STAARpipelineSummary_VarSet_genebased`. The output files in the Project have the prefix "output_full_gene_centric_coding_chr19". We also used the IndVar application to explore variants contributing to the coding gene-based test for DNMT1. See task `STAARexercise_example_STAARpipelineSummary_IndVar_gene_centric`.
These example tasks use a provided annotated GDS file, `test_1000G_chr19.gds`, produced with FAVORannotator using a pipeline similar to the one above.

diff --git a/07_STAAR.html b/07_STAAR.html
new file mode 100644
index 0000000..ca11043
--- /dev/null
+++ b/07_STAAR.html
@@ -0,0 +1,722 @@
07_STAAR.knit
7. STAAR Pipeline

The STAAR pipeline provides practical options for rare variant analysis of whole-genome and whole-exome sequencing data, including when related participants are included in an analysis. STAAR makes it easy to incorporate functional annotation from the FAVOR database into multiple-variant aggregate association tests. You can also provide your own annotations and customize the pipeline for your tissue of interest; here we use the standard FAVOR database annotation strategy, as described in the accompanying lecture and the published paper.

STAAR Pipeline Applications

These apps run phenotype-genotype association analyses for biobank-scale whole-genome/whole-exome sequencing data.

The first app, STAARpipeline, will:

  1. Fit the null model. This fits your model with your outcome, adjustments, and kinship/genetic relatedness matrix, but does not use the genotypes.
  2. Take the null model object from the first step and run your association analysis, while dynamically incorporating multiple functional annotations to empower rare variant (set) association analysis using the STAAR method.

The same null model can be used for single variant or aggregate tests.

The second app, STAARpipelineSummary VarSet, takes the single variant or aggregate test results generated by the STAARpipeline app, and will:

  1. Summarize these results across all chromosomes and create a unified list of results.
  2. Perform conditional analysis for (unconditionally) significant single variants or variant sets by adjusting for a given list of known variants.

The third app, STAARpipelineSummary IndVar, will:

  1. Extract information (summary statistics) on individual variants from a user-specified variant set (gene category or genetic region) in the analytical follow-up of STAARpipeline.

This pipeline is described in detail in PMID 36303018. It has also been applied in a number of papers, including PMID 36220816 and PMID 37253714.
FAVOR

The FAVOR reference files used to annotate a given GDS file into an aGDS file are located under “FAVOR Essential Database” here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1VGTJI

These files have already been provided to you in the BioData Catalyst project, but you may need them in the future. Public tutorial/manual files are also available at https://docs.google.com/document/d/1l-fCmuey7HnrUxx2U67_bgwRrtyWyuSu98I3YXoTZoE/edit and https://github.com/xihaoli/STAARpipeline-Tutorial.
Sliding Window Tests

We will use the STAARpipeline apps on the BioData Catalyst powered by Seven Bridges platform to perform sliding window tests for variants with an allele frequency \(<1\%\). The steps to perform this analysis are as follows:

  • Copy the relevant apps to your project if they are not already there:
    • Click: Public Resources > Workflows and Tools > Browse
    • Search for STAAR. You need all 4 apps listed below:
      • FAVORannotator
      • STAARpipeline
      • STAARpipelineSummary VarSet
      • STAARpipelineSummary IndVar
    • Click: Copy > Select your project > Copy
Exercise 7.1 (Application)

First, run the FAVORannotator app on an example GDS file – perhaps chromosome 19. This app runs one chromosome at a time.

  • Run the analysis in your project:
    • Click: Apps > FAVORannotator > Run
    • Specify the Inputs:
      • GDS file: 1KG_phase3_subset_1KG_phase3_subset_chr19_pruned.gds
      • FAVOR database for specific chromosome: FAVOR_chr19.tar.gz (provided at the link above by the STAAR package creators)
      • FAVORdatabase_chrsplit CSV file: FAVORdatabase_chrsplit.csv (provided at the link above by the STAAR package creators)
    • Specify the App Settings:
      • Chromosome: 19
      • Output file prefix: “1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor” (or any other string to name the output file)

Note: the other app setting defaults will not need to be changed for our example, but could be altered depending on cohort size, etc.

As this task would take about half an hour to run, do not run it. The output of this task would be an annotated GDS file named <output_prefix>.gds. You can find the expected output of this task by looking at the existing task 11. STAARexercise_FAVORannotator_chr19_example in the Tasks menu of your Project. We will use the pre-provided output file available in the Project for the next steps.
Exercise 7.2 (Application)

Next, run STAARpipeline. We will focus on a sliding window test for this exercise, but options also exist for gene-centric coding and noncoding tests and tests per ncRNA. This single application (STAARpipeline) includes options for running an initial null model, as well as multiple types of aggregate tests (“Gene_Centric_Coding”, “Gene_Centric_Noncoding”, “ncRNA”, “Sliding_Window”). Note that there is an option to run single variant tests within STAAR and summarize them, but this is very similar to the GENESIS pipeline already covered, so it is not a part of this exercise.

  • First, generate an appropriate null model:
    • Click: Apps > STAARpipeline > Run
    • Specify the Inputs:
      • GDS files: 1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds (in a real analysis, you would select all 22 chromosomes)
      • Annotation name catalog: Annotation_name_catalog.csv
      • Phenotype file: mock_phenotype_SISG.csv
    • Specify the App Settings:
      • Phenotype: phenotype
      • Covariates: age, sex (again, in a real analysis, you would want to include a kinship matrix and ancestry principal components)
      • Test type: Null
      • Output file prefix: “chr19_region_null” (or any other string to name the output file)
    • Click: Run

Note: you do not need to provide a variant grouping file; the variant annotations are already included in the annotated GDS.

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output file for this null model task is <output_prefix>.Rdata, a null model data object.

You can find the expected output of this analysis by looking at the existing task 12. STAARexercise_nullmodel_chr19 in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
Exercise 7.3 (Application)

  • Next, run a sliding window aggregate test (5 kb window).
    • Click: Apps > STAARpipeline > Run
    • Specify the Inputs:
      • GDS files: 1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds
      • Annotation name catalog: Annotation_name_catalog.csv
      • Null model: chr19_region_null.Rdata
    • Specify App Settings:
      • Sliding window size (bp) to be used in sliding window test: 5000
      • Output file prefix: “region_sliding_5kb_chr19” (or any other string to name the output file)
      • Test type: Sliding_Window
    • Click: Run

Note: we will use the default allele frequency setting (max MAF \(1\%\)), and we are running the pipeline on single nucleotide variants (SNVs) only (indels would likely be included too in a real analysis). Mean values are used for missing genotypes by default.

The analysis will take ~10 minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output from this task is <output_prefix>.Rdata, which contains the STAAR association results.

You can find the expected output of this analysis by looking at the existing task 13. STAARexercise_STAARpipeline_run_sliding_5kb in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
Exercise 7.4 (Application)

The output file from these STAAR association tests is an .RData file. To convert it to more human-readable formats, and to split different types of tests (e.g., LOF vs. missense for coding tests) into different .Rdata objects, use the STAARpipelineSummary VarSet app.

  • Click: Apps > STAARpipelineSummary VarSet > Run
  • Specify the Inputs:
    • Annotation name catalog: Annotation_name_catalog.csv
    • Input array results: region_sliding_5kb_chr19.Rdata
  • Specify App Settings:
    • Output file prefix: “result_sliding_5kb_chr19” (or any other string to name the output file)
    • Prefix of input results: “region_sliding_5kb_chr”
    • Test type: Sliding_Window
  • Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

You can find the expected output of this analysis by looking at the existing task 14. STAARexercise_STAARpipelineSummary_VarSet_sliding5kb in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.

The output from this task is <output_prefix>_results_sliding_window_genome.Rdata and <output_prefix>_results_sliding_window_genome_sig.csv. Note that the second (“sig”, i.e. significant test) file is empty in this analysis - which is good/anticipated with a randomly generated phenotype!

We can also use the STAARpipelineSummary VarSet app to adjust for a list of known variants we would like to condition our analysis on (i.e. include as covariates), in order to determine whether our identified rare variant signals are independent of such known single variants. Files of significant “cond” results will then appear. If you use this option, note that you need to input null_obj_file, agds_file_name, and agds_files across all chromosomes at once (the app is not intended to be run on a single chromosome and will fail if a file for each autosome is not found).
Exercise 7.5 (Application)

You can also use the STAARpipelineSummary IndVar app to examine the individual variants that contribute to an aggregate test. Here we examine one of the regions tested in the 5 kb sliding window test. Note that individual variant \(p\)-values with a very low minor allele count (say, less than 5) are likely quite unstable and may not be useful. This can nevertheless be a useful type of annotation for your results.

  • Click: Apps > STAARpipelineSummary IndVar > Run
  • Specify the Inputs:
    • Annotation name catalog: Annotation_name_catalog.csv
    • AGDS file: 1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds
    • Null model: chr19_region_null.Rdata
  • Specify App Settings:
    • Chromosome: 19
    • End location: 45668803
    • Output file prefix: “slidingwindow_indvar_19_45663804” (or any other string to name the output file)
    • Start location: 45663804
    • Test type: Sliding_Window
  • Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output from this analysis will be a <output_prefix>.csv file.

You can find the expected output of this analysis by looking at the existing task 15. STAARexercise_STAARpipelineSummary_IndVar in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
Gene-centric Tests

While it takes a bit too long to run as an in-class exercise, you can also check out results for a gene-centric coding variant mask (similar gene-centric noncoding variant masks could also be run) using a less sparse genotype file than we have usually used in this class. The example tasks are STAARexercise_example_STAARpipeline_gene_centric_coding and STAARexercise_example_STAARpipelineSummary_VarSet_genebased. The output files in the Project have the prefix “output_full_gene_centric_coding_chr19”. We also used the IndVar application to explore variants contributing to the coding gene-based test for DNMT1. See task STAARexercise_example_STAARpipelineSummary_IndVar_gene_centric. These example tasks use a provided annotated GDS file, test_1000G_chr19.gds, produced with FAVORannotator using a pipeline similar to the one above.