pyGEMMA: A Fast, User-Friendly Python/R Implementation of Linear Mixed Models for Genome-Wide Association Studies

Software Requirements

The current implementation of pyGEMMA was tested on the following configuration:

Operating System and Compilers

Ubuntu 18.04.6 LST (Bionic Beaver)
gcc/g++ 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

Python Environment

Installation

The installation of pyGEMMA is straightforward and can be performed using Python's pip package manager. Here, we detail the installation process using a virtualenv Python enviroment. This has been tested with the configuration listed in the Software Requirements section. While installation may be possible with other configurations, we it has only been tested with the configuration we list.

Create your Python environment and activate it. Using virtualenv, this can be done by running

pip3 install virtualenv
python3 -m virtualenv pygemma_env
source pygemma_env/bin/activate

Note: If pip3 install virtualenv fails because it can't find pip3, you can try running python3 -m pip install virtualenv instead. This looks for the pip module directly if pip3 isn't in your PATH.

Ensure the Numpy and Cython packages are both installed prior to installing pyGEMMA (they will not be installed automatically). This can be done by running

pip install numpy Cython

Clone this repository.
Install pyGEMMA's dependencies. From the pygemma directory, this can be done by running

pip install -r requirements.txt

Ensure that you have a valid C/C++ compiler loaded. pyGEMMA has been tested using gcc/g++.
Install pyGEMMA. From the pygemma directory, this can be done by running

python setup.py install

Usage

The pyGEMMA package contains both high-level and low-level functions for fitting the linear mixed model outlined in the original GEMMA paper by Zhou et al. (Nat Gen 2012).

General Usage

The pyGEMMA package is designed to fit the same model as GEMMA. That is, it fits

$$ \mathbf{y} = \mathbf{W} \mathbf{\alpha} + \mathbf{x} \mathbf{\beta} + \mathbf{Z} \mathbf{u} + \mathbf{\varepsilon} $$

$$ \mathbf{u} \sim \mathcal{\text{MVN}}_{m}(\mathbf{0}, \lambda \tau^{-1} \mathbf{K}) $$

$$ \mathbf{\varepsilon} \sim \mathcal{\text{MVN}}_{n} \left(\mathbf{0},\tau^{-1} \mathbf{I}_n \right) $$

where $\mathbf{y}$ is $n \times 1$ is the vector phenotype, $\mathbf{W}$ is $n \times c$ is the matrix of fixed effect covariates (including the intercept), $\mathbf{\alpha}$ is the $c \times 1$ vector of coefficients for the covariates, $\mathbf{x}$ is the $n \times 1$ vector of genotypes, $beta$ is the effect size of the genotype, $\mathbf{Z}$ is the $n \times m$ loading matrix, $\mathbf{u}$ is the $m \times 1$ vector of random effects, $\mathbf{\varepsilon}$ is the $n \times 1$ vector of errors, $\tau^{-1}$ is the variance of the resitual errors, $\lambda$ is the ratio between the two variances components, and $\mathbf{K}$ is the relatedness matrix. (Description adapted from Zhou et al. (Nat Gen 2012))

This model can be fit using the function pygemma.lmm.pygemma.

from pygemma import lmm
lmm.pygemma(Y, X, W, K, snps=snps, verbose=1)

Note that snps is a list of SNP names that will be used to label the pandas DataFrame returned by the function. verbose controls whether to output run progress.

pyGEMMA in R

We have also developed an R interface for pyGEMMA, enabling its use within the R programming environment. A comprehensive tutorial for this integration can be found here

Examples

UK Biobank Benchmark

We provide our benchmarks for the UK Biobank data in the experiments/benchmarks directory. This benchmarking consisted of running random subsets of 50,000 individuals of European ancestry from the UK Biobank data. Subsets were taken from 50 to 10,000 samples and 20 to 100,000 SNPs.

We use this data to compare the runtime of pyGEMMA to competing methods, namely GEMMA, GCAT, and fastGWA. Regenie is not represented here due to the long runtime of its Stage I phase.

Speedup is calculated as $\frac{t_{\text{Method}}}{t_{\text{pyGEMMA}}}$.

Runtime	Speedup

Methods: pyGEMMA(■), pyGEMMA - Grid Search(■), GEMMA(■), GCTA(■), fastGWA(■), Linear Regression(■)

Based on this, we see that pyGEMMA is at $\approx$ 15-20 times faster than GCTA and over 10 times faster than GEMMA. While fastGWA is faster than pyGEMMA for small datasets, pyGEMMA is faster for larger datasets.

GEMMA Mouse Data

This dataset consisted of 12,226 SNPs, 1940 mice, and 4 phenotypes. The data was taken from GEMMA's test dataset. The data can be found here: GEMMA Mouse Data

We use this dataset to benchmark scaling with the number of covariates in the model. We compared pyGEMMA against GEMMA and GCTA.

Runtime	Speedup

Methods: pyGEMMA(■), GEMMA(■), GCTA(■)

Based on this, we see that pyGEMMA is significantly faster than both GEMMA and GCTA. It also exhibits better scaling behavior as the number of covariates increases.

Wellcome Trust Case Control Consortium (WTCCC) Data

This dataset consists of $\approx$ 390,000 SNPs and 6 binary phenotypes, with around 2,000 case and 3,000 control samples. The data was taken from the WTCCC study.

We use this dataset to demonstrate that pyGEMMA and GEMMA produce identical results.

Beta	-log10 of p-values

These plots show that the effect sizes and p-values produced by pyGEMMA and GEMMA are identical.

Contact

@rlangefe - Robert Langefeld (Department of Biostatistics - University of Michigan)

If you have any questions or comments, please feel free to contact me.

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
data		data
experiments		experiments
graphics/tests		graphics/tests
lmm		lmm
plotting		plotting
pygemma_model		pygemma_model
tests		tests
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
pyGEMMA.html		pyGEMMA.html
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pyGEMMA: A Fast, User-Friendly Python/R Implementation of Linear Mixed Models for Genome-Wide Association Studies

Table of Contents

Software Requirements

Operating System and Compilers

Python Environment

Installation

Usage

General Usage

pyGEMMA in R

Examples

UK Biobank Benchmark

GEMMA Mouse Data

Wellcome Trust Case Control Consortium (WTCCC) Data

Contact

About

Releases

Packages

Contributors 2

Languages

License

rlangefe/pygemma

Folders and files

Latest commit

History

Repository files navigation

pyGEMMA: A Fast, User-Friendly Python/R Implementation of Linear Mixed Models for Genome-Wide Association Studies

Table of Contents

Software Requirements

Operating System and Compilers

Python Environment

Installation

Usage

General Usage

pyGEMMA in R

Examples

UK Biobank Benchmark

GEMMA Mouse Data

Wellcome Trust Case Control Consortium (WTCCC) Data

Contact

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages