Pipeline for Genome-wide Association Studies in Cassava
Authors: Vianey Barrera-Enriquez and Camilo E. Sanchez
The pipeline has six main steps:
- Quality control (in progress)
- Imputation (in progress)
- LD Decay
- Population structure and covariable selection (in progress)
- GWAS analysis using GAPIT3
- Annotation of results (in progress)
- Boxplot of significant markers: genotypes vs. phenotype
- Customizable Manhattan plots (in progress)
In progress
In progress
In progress
In progress
# Initial requeriments
inputFile <- "/path/to/inputFile.vcf.gz"
prefix <- "gwas_results"
# Get lisf of samples from VCF
bcftools query -l ${inputFile} > ${prefix}.list.samples.txt
# Get stats
bcftools stats ${inputFile} > ${prefix}.stats.txt
# Calculate allele frequency
vcftools --gzvcf ${inputFile} --freq2 --out ${prefix}.frequency
# Stats mean depth per individual
vcftools --gzvcf ${inputFile} --depth --out ${prefix}.depth
# Mean depth per site
vcftools --gzvcf ${inputFile} --site-mean-depth --out ${prefix}.depth_site
# Site quality
vcftools --gzvcf ${inputFile} --site-quality --out ${prefix}.quality
# Missing data per individual
vcftools --gzvcf ${inputFile} --missing-indv --out ${prefix}.missing
# Missing data per site
vcftools --gzvcf ${inputFile} --missing-site --out ${prefix}.missing_site
# Heterozygosity and inbreeding coefficients per individual
vcftools --gzvcf ${inputFile} --het --out ${prefix}.heterozygosity
# Hardy-Weinberg p-value
vcftools --gzvcf ${inputFile} --hardy --out ${prefix}.hwe
In progress
In progress
In progress
In progress
This script contains the customized functions to perform four different multivariate methods: (1) Non-Metric Multidimensional Scaling (NDMS), (2) Multidimensional Scaling (MDS), (3) Principal Component Analysis (PCA), and (4) Discriminant Analysis of Principal Components (DAPC).
# NDMS, MDS, and PCA
NDMS(dir, phenofile, dist = "bray", groups = F)
MDS(dir, phenofile, dist = "gower", groups = F)
PCA(dir, type, groups = F, phenofile = NULL, labels = NULL, genofile = NULL, vcf = NULL, gds = NULL, PC.retain = F)
# DAPC
library(adegenet)
grp <- find.clusters(my_genind)
dapc <- dapc(my_genind, grp$grp)
- For NDMS:
dir
: Directory where data is located.phenofile
: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.dist
: Dissimilarity index/measure to use (default = "bray").groups
: Boolean value indicating whether the data includes different treatments/groups (default = F). IfTRUE
is selected then the second column of thephenofile
must be the groups/treatments.
- For MDS:
dir
: Directory where data is located.phenofile
: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.dist
: Dissimilarity index/measure to use (default = "gower").groups
: Boolean value indicating whether the data includes different treatments/groups (default = F). IfTRUE
is selected then the second column of thephenofile
must be the groups/treatments.
- For PCA:
dir
: Directory where data is located.type
: A character string indicating wheter data is genotypic ("geno") of phenotypic ("pheno").groups
: Boolean value indicating whether the data includes different treatments/groups (default = F). IfTRUE
is selected then the second column of thephenofile
must be the groups/treatments.phenofile
: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.genofile
: An object of class SNPGDSFileClass (GDS file read with thesnpgdsOpen
function fromSNPRelate
package).labels
: When provide agenofile
andgroups
argument isTRUE
, please provide a dataframe with genotypes/individuals in the first column and groups/treatment data in the second column.gds
: A Genomic Data Structures (GDS) file (a reformatted VCF file with thesnpgdsVCF2GDS
function fromvcfR
package).vcf
: A Variant Call Format (VCF) file containing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.PC.retain
: Boolean value indicating whether to analyze how many PCs retain (default = F).
- For DAPC:
my_genind
: An object of class genind (vcf file read with thevcfR2genind
function fromvcfR
package).
This series of functions retrieves the required pacakges and data to perform four multivariate methods, potentially used to study population structure and select covariable for next analyses. Especifically, this script includes the functions for Non-Metric Multidimensional Scaling (NDMS
), Multidimensional Scaling (MDS
), Principal Component Analysis (PCA
), and Discriminant Analysis of Principal Components (DAPC
). The first two functions are built only for phenotypic data while the PCA can handle both data type (phenotypic or genotypic), and the DAPC only uses genotypic data. For all functions interactive plots (made using plotly
package) in html format. The DAPC part is not built as a function give the nature of the functions used in which prompts are necessary to continue. Please run this methods line by line.
# NDMS example(s)
# Set arguments
# Phenotypic with groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/02_CTS_Drought_Family/01_Phenotype_Preliminar_Analysis/"
phenofile <- read.csv(paste0(dir, "Prueba.csv"), header = T)
groups <- T
# Phenotypic without groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/02_CTS_Drought_Family/01_Phenotype_Preliminar_Analysis/"
phenofile <- read.csv(paste0(dir, "Prueba.csv"), header = T)
phenofile <- phenofile[-2]
# Run function
NDMS(dir, phenofile, dist, groups)
# MDS example(s)
# Set arguments
# Phenotypic with groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/02_CTS_Drought_Family/01_Phenotype_Preliminar_Analysis/"
phenofile <- read.csv(paste0(dir, "Prueba.csv"), header = T)
groups <- T
# Phenotypic without groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/02_CTS_Drought_Family/01_Phenotype_Preliminar_Analysis/"
phenofile <- read.csv(paste0(dir, "Prueba.csv"), header = T)
phenofile <- phenofile[-2]
# Run function
MDS(dir, phenofile, dist, groups)
# PCA example(s)
# Set arguments
# Phenotypic with groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/02_CTS_Drought_Family/01_Phenotype_Preliminar_Analysis/"
phenofile <- read.csv(paste0(dir, "Prueba.csv"), header = T)
groups <- T
# Phenotypic without groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/02_CTS_Drought_Family/01_Phenotype_Preliminar_Analysis/"
phenofile <- read.csv(paste0(dir, "Prueba.csv"), header = T)
phenofile <- phenofile[-2]
# Genotypic with groups
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/03_GWAS_PPD_Populations/02_PCA/"
labels <- read.csv(paste0(dir, "GWAS_PPD.labels.csv"))
vcf <- "GWAS_PPD.snps.filter_info.missing_0.10.imputation.vcf.gz"
type <- "Geno"
groups <- T
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/01_ACWP_F2_Phenotype/01_Population_Structure/"
labels <- read.csv(paste0(dir, "AM1588_labels.csv"))
vcf <- "AM1588_MAP.miss0.05.recode.vcf"
type <- "Geno"
groups <- T
# Run function
PCA(dir, phenofile, genofile, gds, vcf, type = "Geno", groups = T, PC.retain = F)
# DAPC example(s)
# Set arguments
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/01_ACWP_F2_Phenotype/01_Population_Structure/"
vcf <- "AM1588_MAP.miss0.05.recode.vcf"
dir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/03_GWAS_PPD_Populations/02_PCA/"
vcf <- "GWAS_PPD.snps.filter_info.missing_0.10.imputation.vcf.gz"
# Load libraries
library(adegenet)
library(grDevices)
library(vcfR)
library(tidyverse)
library(plotly)
library(htmlwidgets)
# Load VCF files and convert them into a genind object
setwd(dir)
vcf <- read.vcfR(vcf, verbose = T)
my_genind <- vcfR2genind(vcf)
my_genind
# Identify clusters
# Shows a graph with the accumulated variance explained by the eigenvalues of the PCA
grp <- find.clusters(my_genind)
# Performs the DAPC
dapc <- dapc(my_genind, grp$grp)
Interactive plots (made using plotly
package) in html format.
vegan
tidyverse
plotly
htmlwidgets
SNPRelate
gdsfmt
psych
adegenet
grDevices
vcfR
This R script runs a Genome-Wide Association Study (GWAS) analysis using the GAPIT3 R package. It saves the results of each trait in an individual folder.
GAPIT3(phenofile, genofile, wdir, trait_list = NULL)
phenofile
: A database of genotypes/individuals (rows) and the trait's measurements or BLUPs (columns).genofile
: Genotype data in hapmap format.wdir
: A working directory where the function will create the folders for each trait.trait_list
: A vector with the trait's name. The default isNULL
, which will use the full phenotype dataset.
This function loads the required packages and data, then performs a GWAS analysis using the GAPIT
function from the GAPIT3
package. It loops through each trait in trait_list
, creating a new folder for each trait in the working directory and saving the results of the GWAS analysis for that trait in that folder.
GAPIT3(phenofile = "my_phenotypes.csv", genofile = "my_genotypes.hmp", wdir = "my_results_folder", trait_list = c("Trait1", "Trait2", "Trait3"))
This will perform a GWAS analysis using the phenotype data in my_phenotypes.csv
and the genotype data in my_genotypes.hmp
. It will create a new folder for each trait in the trait_list
vector in the my_results_folder
directory.
The function will create a folder for each trait in the trait_list
vector in the working directory. In each folder, it will save the results of the GWAS analysis for that trait. For more information about GAPIT
function outputs, please check the official GAPIT3
package documentation.
devtools
GAPIT3
This code annotates the gen containing the significat SNPs from the GAPIT3
results. Additional, it retrieves the closest genes downstream and upstream.
GWAS_Annotation(Mdir, pat, mod, wdyw, annot, GFF)
pat
: Location path of the files.wdyw
: Gene feature to annotate.Mdir
: Name of the directory that contains theGAPIT3
results.mod
: Names of the models to filter (options: )gff3
: gff3 file from the genome version used for alignment.annotationFile
: Annotation details of the genes. txt file from the genome version used for alignment.version
: (Options: 6.1 or 8.1. Default = 6.1).
In progress
In progress
A single CSV file containing relevant gene information plus SNPs' P-values, traits, models, and effects.
tidyverse
This R function generates a boxplot for a given SNP. The function takes as inputs a CSV file with the phenotype values and a list of SNPs. The function can also receive an optional CSV file to add extra information about the samples to the plot (categories, family, ect.) The output of the function is a PDF file with the plot.
GWAS_Boxplot(outputname, dir, phenofile, genofile, snp_list_file, labelfile = NULL)
outputname
: (required) character string with the base name for the output file.dir
: (required) character string with the directory where the output file will be saved.phenofile
: (required) character string with the name of the phenotype file in tabular format. The first column should contain the sample names, and the rest of the columns should contain the phenotypes.genofile
: (required) character string with the name of the genotype file in hapmap format.snp_list_file
: (required) character string with the name of the CSV file with three columns:- Column 01 - Name: SNPS. List of SNPS to plot. The name should be the same as in the
genofile
data. - Column 02 - Name: trait. Name of the trait as in the
phenofile
data. - Column 03 - Name: xlabel. Name of the trait to be included as a label.
- Column 01 - Name: SNPS. List of SNPS to plot. The name should be the same as in the
labelfile
: (optional) character string with the name of the CSV file with two columns:- Column 01 - Name: Taxa. Sample names.
- Column 02 - Name: label. Label or category to add to the plot.
The function loads all the necessary packages and data files. It then checks for the presence of an optional file with sample labels and prepares the data for the boxplot. The function generates a boxplot for each SNP specified in the input CSV file. It uses ggplot2
package to generate the plot, with the genotypes on the x-axis and the phenotype on the y-axis. The function adds a label to the x-axis to indicate the trait being plotted. If labelfile
is provided, it adds extra information about the samples, coloring the data by the levels in the provided file.
# Generate boxplot without extra labels
GWAS_Boxplot("outputname", ".path/to/save/plots/", "phenotype.csv", "genotype.hmp", "snp_list.csv")
# Generate boxplot with extra labels
GWAS_Boxplot("outputname", ".path/to/save/plots/", "phenotype.csv", "genotype.hmp", "snp_list.csv", "labelfile.csv")
A single PDF file containing the boxplot of the SNPs.
tidyverse
tibble
dplyr
janitor
ggplot2
Biostrings
hrbrthemes
forcats
ggsignif
RColorBrewer
This R function generates a Manhattan plot(s) for a given set of GWAS results. The function scans whithin directories and takes as inputs the CSV file that contain the results of previous GWAS analyses. The output of the function is a plot in html format made using plotly
package.
Manhattan(Mdir, pat, mod, wtd)
Mdir
: Name of the directory that contains the GAPIT results. For example: home/user/folder.pat
: Enter the path of file names to look for whithin directories. For example: QTL_LOD_Intervals. The path must finish with a point (.).mod
: Enter the model(s) of interest. Options: BLINK, GLM, MLM, FarmCPU.wtd
: How many traits do you want to plot. Options: One, Several, All.colors
: (Optional) Colors of the chromosomes in Manhattan plots. If you want to change the colors, provide 2 or more. It can be color names o hex codes (Default: grey and skyblue).
In progress
# Set arguments
Mdir <- "D:/OneDrive - CGIAR/Cassava_Bioinformatics_Team/01_ACWP_F1_Metabolomics/07_GWAS"
pat <- "GAPIT.Association.GWAS_Results."
mod <- c("BLINK", "FarmCPU", "MLM")
wtd <- "One"
# Run function
Manhattan(Mdir, pat, mod, wtd)
Manhattan plots for the traits(s) selected.
tidyverse
ggtext
plotly
-
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
-
Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, 1000 Genomes Project Analysis Group, The variant call format and VCFtools, Bioinformatics, Volume 27, Issue 15, August 2011, Pages 2156–2158, https://doi.org/10.1093/bioinformatics/btr330
-
B L Browning, X Tian, Y Zhou, and S R Browning (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 108(10):1880-1890. doi:10.1016/j.ajhg.2021.08.005
-
B L Browning, Y Zhou, and S R Browning (2018). A one-penny imputed genome from next generation reference panels. Am J Hum Genet 103(3):338-348. doi:10.1016/j.ajhg.2018.07.015
-
Chi Zhang, Shan-Shan Dong, Jun-Yang Xu, Wei-Ming He, Tie-Lin Yang, PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files, Bioinformatics, Volume 35, Issue 10, May 2019, Pages 1786–1788, https://doi.org/10.1093/bioinformatics/bty875
-
Thibaut Jombart, adegenet: a R package for the multivariate analysis of genetic markers, Bioinformatics, Volume 24, Issue 11, June 2008, Pages 1403–1405, https://doi.org/10.1093/bioinformatics/btn129
-
Xiuwen Zheng, David Levine, Jess Shen, Stephanie M. Gogarten, Cathy Laurie, Bruce S. Weir, A high-performance computing toolset for relatedness and principal component analysis of SNP data, Bioinformatics, Volume 28, Issue 24, December 2012, Pages 3326–3328, https://doi.org/10.1093/bioinformatics/bts606
-
Wang J., Zhang Z., GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction, Genomics, Proteomics & Bioinformatics (2021), doi: https://doi.org/10.1016/j.gpb.2021.08.005.
-
Huang M, Liu X, Zhou Y, Summers RM, Zhang Z. BLINK: A package for the next level of genome-wide association studies with both individuals and markers in the millions. Gigascience. https://doi.org/10.1093/gigascience/giy154.
-
Liu X., Huang M., Fan B., Buckler E. S., Zhang Z., 2016 Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLoS Genet. 12: e1005767. https://doi.org/10.1371/journal.pgen.1005767.
-
David M. Goodstein, Shengqiang Shu, Russell Howson, Rochak Neupane, Richard D. Hayes, Joni Fazo, Therese Mitros, William Dirks, Uffe Hellsten, Nicholas Putnam, and Daniel S. Rokhsar, Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 2012 40 (D1): D1178-D1186
-
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
For questions or feedback about this pipeline, please contact Vianey Barrera-Enriquez at [email protected].