pyGEMMA: A Fast, User-Friendly Python/R Implementation of Linear Mixed Models for Genome-Wide Association Studies
The current implementation of pyGEMMA was tested on the following configuration:
- Ubuntu 18.04.6 LTS (Bionic Beaver)
- gcc/g++ 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
- Python 3.8.10
- Numpy 1.24.4
- Cython 3.0.2
- Pandas 2.0.3
- Scipy 1.10.1
- Scikit-learn 0.24.2
- Matplotlib 3.7.1
- Plotly 5.19.0
- Seaborn 0.12.2
- rich 13.5.2
- qnorm 0.8.1
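To see how a local environment compares with the tested configuration, a small check like the following may help. It is a sketch that relies only on importlib.metadata from the Python standard library; the TESTED_VERSIONS dictionary and the compare_versions helper are names introduced here for illustration and are not part of pyGEMMA.

```python
from importlib import metadata

# Versions copied from the tested configuration listed above.
TESTED_VERSIONS = {
    "numpy": "1.24.4",
    "Cython": "3.0.2",
    "pandas": "2.0.3",
    "scipy": "1.10.1",
    "scikit-learn": "0.24.2",
    "matplotlib": "3.7.1",
    "plotly": "5.19.0",
    "seaborn": "0.12.2",
    "rich": "13.5.2",
    "qnorm": "0.8.1",
}


def compare_versions(tested=TESTED_VERSIONS):
    """Return {package: (tested, installed)}; installed is None if missing."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions().items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name:<14} tested={wanted:<8} installed={installed} {status}")
```

A mismatch does not necessarily mean pyGEMMA will fail; it only indicates that the environment differs from the one that was tested.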
The installation of pyGEMMA is straightforward and can be performed using Python's pip package manager. Here, we detail the installation process using a virtualenv Python environment. This has been tested with the configuration listed in the Software Requirements section. While installation may be possible with other configurations, it has only been tested with the configuration listed there.
- Create your Python environment and activate it. Using virtualenv, this can be done by running
pip3 install virtualenv
python3 -m virtualenv pygemma_env
source pygemma_env/bin/activate
Note: If pip3 install virtualenv fails because it can't find pip3, you can try running python3 -m pip install virtualenv instead. This looks for the pip module directly if pip3 isn't in your PATH.
- Ensure the Numpy and Cython packages are both installed prior to installing pyGEMMA (they will not be installed automatically). This can be done by running
pip install numpy Cython
- Clone this repository.
- Install pyGEMMA's dependencies. From the pygemma directory, this can be done by running
pip install -r requirements.txt
- Ensure that you have a valid C/C++ compiler loaded. pyGEMMA has been tested using gcc/g++.
- Install pyGEMMA. From the pygemma directory, this can be done by running
python setup.py install
The pyGEMMA package contains both high-level and low-level functions for fitting the linear mixed model outlined in the original GEMMA paper by Zhou et al. (Nat Gen 2012).
The pyGEMMA package is designed to fit the same model as GEMMA. That is, it fits
where
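For reference, the univariate linear mixed model described by Zhou and Stephens (2012) can be written as follows. The notation is that of the GEMMA paper; how these symbols map onto the Y, X, W, and K arguments of pyGEMMA is an assumption based on the argument names, not something confirmed by the text here.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{W}\boldsymbol{\alpha} + \mathbf{x}\beta + \mathbf{u} + \boldsymbol{\epsilon}, \\
\mathbf{u} &\sim \mathrm{MVN}_n\!\left(\mathbf{0},\ \lambda \tau^{-1} \mathbf{K}\right), \\
\boldsymbol{\epsilon} &\sim \mathrm{MVN}_n\!\left(\mathbf{0},\ \tau^{-1} \mathbf{I}_n\right).
\end{aligned}
```

In this notation, y is the n-vector of phenotypes, W is the n x c matrix of covariates (including a column of ones for the intercept) with coefficients alpha, x is the n-vector of genotypes for a single SNP with effect size beta, u is the n-vector of random effects, epsilon is the n-vector of residual errors, tau^{-1} is the residual variance, lambda is the ratio between the two variance components, K is the n x n relatedness matrix, and I_n is the n x n identity matrix.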
This model can be fit using the function pygemma.lmm.pygemma.
from pygemma import lmm
lmm.pygemma(Y, X, W, K, snps=snps, verbose=1)
Note that snps is a list of SNP names that will be used to label the pandas DataFrame returned by the function. The verbose argument controls whether run progress is printed.
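The call above needs four arrays. The following is a minimal sketch of how such inputs might be simulated with NumPy for a quick trial run. The shapes used here (Y as n x 1, X as n x p with one column per SNP, W as n x c including an intercept, K as n x n) and the centered relatedness matrix are assumptions made for illustration; they are not taken from the pyGEMMA documentation, so check them against the package before relying on them. The fit itself is attempted only if pyGEMMA is importable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, c = 200, 50, 2  # individuals, SNPs, covariates (illustrative sizes)

# Genotypes coded as 0/1/2 allele counts, one column per SNP (assumed layout).
maf = rng.uniform(0.1, 0.5, size=p)
X = rng.binomial(2, maf, size=(n, p)).astype(float)

# Covariates: an intercept column plus one simulated covariate.
W = np.column_stack([np.ones(n), rng.normal(size=n)])

# Centered relatedness matrix K = Xc Xc^T / p (one common choice, assumed here).
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T / p

# Phenotype with a small effect from the first SNP plus noise (assumed n x 1).
Y = (0.5 * X[:, 0] + rng.normal(size=n)).reshape(-1, 1)

snps = [f"rs{i}" for i in range(p)]  # hypothetical SNP labels

try:
    from pygemma import lmm
except ImportError:
    lmm = None  # pyGEMMA not installed; only the inputs are built

if lmm is not None:
    # Same call signature as shown in the text above.
    results = lmm.pygemma(Y, X, W, K, snps=snps, verbose=1)
    print(results.head())
```

Simulated data of this kind is only useful for checking that the installation runs end to end; it says nothing about statistical performance on real cohorts.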
We have also developed an R interface for pyGEMMA, enabling its use within the R programming environment. A comprehensive tutorial for this integration can be found here.
We provide our benchmarks for the UK Biobank data in the experiments/benchmarks directory. The benchmarks were run on random subsets drawn from 50,000 individuals of European ancestry in the UK Biobank data. Subset sizes ranged from 50 to 10,000 samples and from 20 to 100,000 SNPs.
We use this data to compare the runtime of pyGEMMA to competing methods, namely GEMMA, GCTA, and fastGWA. Regenie is not represented here due to the long runtime of its Stage I phase.
Speedup is calculated as
Runtime | Speedup |
Methods: pyGEMMA (■), pyGEMMA - Grid Search (■), GEMMA (■), GCTA (■), fastGWA (■), Linear Regression (■)
Based on this, we see that pyGEMMA is at GCTA and over 10 times faster than GEMMA. While fastGWA is faster than pyGEMMA for small datasets, pyGEMMA is faster for larger datasets.
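Speedup figures like those discussed above can be derived from raw timings. The sketch below assumes the conventional definition, the runtime of a competing method divided by the runtime of pyGEMMA, so that values above 1 mean pyGEMMA was faster. This definition is an assumption rather than a statement of how the benchmarks here were computed, and the timings are made-up placeholders, not benchmark results.

```python
def speedup(baseline_seconds, pygemma_seconds):
    """Ratio of a competing method's runtime to pyGEMMA's runtime.

    Assumed definition: values above 1 mean pyGEMMA was faster.
    """
    if pygemma_seconds <= 0:
        raise ValueError("pygemma_seconds must be positive")
    return baseline_seconds / pygemma_seconds


# Made-up timings in seconds, for illustration only (not benchmark results).
example_runtimes = {"method_a": 120.0, "method_b": 30.0}
example_pygemma = 10.0

example_speedups = {
    name: speedup(seconds, example_pygemma)
    for name, seconds in example_runtimes.items()
}
print(example_speedups)
```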
This dataset consists of 12,226 SNPs, 1,940 mice, and 4 phenotypes. It was taken from GEMMA's test dataset and can be found here: GEMMA Mouse Data
We use this dataset to benchmark scaling with the number of covariates in the model. We compared pyGEMMA against GEMMA and GCTA.
Runtime | Speedup |
Methods: pyGEMMA (■), GEMMA (■), GCTA (■)
Based on this, we see that pyGEMMA is significantly faster than both GEMMA and GCTA. It also exhibits better scaling behavior as the number of covariates increases.
This dataset consists of
We use this dataset to demonstrate that pyGEMMA and GEMMA produce identical results.
Beta | -log10 of p-values |
These plots show that the effect sizes and p-values produced by pyGEMMA and GEMMA are identical.
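A numerical check can complement the visual comparison. The following sketch compares two sets of effect sizes and p-values on the same scales as the plots (beta and -log10 p). The compare_results helper and its tolerance defaults are introduced here for illustration; how the values are extracted from each tool's output is left to the reader, since the output column names are not given above.

```python
import numpy as np


def compare_results(beta_a, beta_b, pval_a, pval_b, rtol=1e-5, atol=1e-8):
    """Summarize agreement between two sets of per-SNP association results."""
    beta_a = np.asarray(beta_a, dtype=float)
    beta_b = np.asarray(beta_b, dtype=float)
    # Compare p-values on the -log10 scale, as in a Manhattan-style plot.
    logp_a = -np.log10(np.asarray(pval_a, dtype=float))
    logp_b = -np.log10(np.asarray(pval_b, dtype=float))
    return {
        "max_abs_beta_diff": float(np.max(np.abs(beta_a - beta_b))),
        "max_abs_logp_diff": float(np.max(np.abs(logp_a - logp_b))),
        "betas_match": bool(np.allclose(beta_a, beta_b, rtol=rtol, atol=atol)),
        "logp_match": bool(np.allclose(logp_a, logp_b, rtol=rtol, atol=atol)),
    }


# Toy values for illustration only (not real association results).
toy_betas = [0.1, -0.2, 0.3]
toy_pvals = [0.5, 1e-4, 0.03]
print(compare_results(toy_betas, toy_betas, toy_pvals, toy_pvals))
```

In practice, small floating-point differences between implementations are expected, so the tolerances may need adjusting to the precision each tool reports.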
@rlangefe - Robert Langefeld (Department of Biostatistics - University of Michigan)
If you have any questions or comments, please feel free to contact me.