eGST package: Leveraging eQTLs to identify individual-level tissue of interest for a complex trait

Arunabha Majumdar, Tanushree Haldar, Bogdan Pasaniuc

2019-06-30

Introduction

Genetic predisposition to a complex trait is often manifested through multiple tissues of interest at different time points during development. For example, the genetic predisposition to obesity could be manifested through inherited variants that control metabolism by regulating genes expressed in the brain, and/or through variants that control fat storage by dysregulating genes expressed in adipose tissue. We present eGST, a method that integrates tissue-specific eQTLs with GWAS data for a complex trait to probabilistically assign a tissue of interest to the phenotype of each individual in the study. eGST estimates the posterior probability that an individual’s phenotype can be assigned to a tissue, based on individual-level genotype data at tissue-specific eQTLs and marginal phenotype data in a GWAS cohort. Under a Bayesian mixture-model framework, eGST employs a maximum a posteriori (MAP) expectation-maximization (EM) algorithm to estimate the tissue-specific posterior probability across individuals.

The package consists of the following function:

  1. eGST: It estimates the posterior probability that the genetic susceptibility of the phenotype of an individual in the study is mediated through eQTLs specific to a tissue of interest. The phenotype across individuals can be classified into tissues under consideration based on the estimated tissue-specific posterior probability across individuals.
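
To make the idea of a tissue-specific posterior probability concrete, the following is a minimal sketch of the posterior membership calculation in a generic two-component Gaussian mixture with fixed parameters. It is not eGST's implementation: the phenotype values, mixing weights, component means and standard deviations below are all hypothetical, and eGST itself estimates its parameters with the MAP-EM algorithm.

```r
# Generic illustration only; every value below is a hypothetical assumption.
y      <- c(-1.2, 0.3, 2.5)   # toy phenotype values for three individuals
w      <- c(0.5, 0.5)         # assumed prior mixing weights of the two components
mu     <- c(0, 2)             # assumed component means
sd_vec <- c(1, 1)             # assumed component standard deviations

# Weighted density of each observation under each component (3 x 2 matrix)
weighted_dens <- sapply(1:2, function(k) w[k] * dnorm(y, mean = mu[k], sd = sd_vec[k]))

# Bayes' rule: normalize each row so the two posterior probabilities sum to 1
post <- weighted_dens / rowSums(weighted_dens)
colnames(post) <- c("component1", "component2")
round(post, 3)
```

Each row of post plays the same role for one individual as a row of posterior probabilities across tissues.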

Installation

You can install eGST from CRAN.

#install.packages("eGST")
#library("eGST")

How to run eGST for two tissues

Get the path to the data.

library("eGST")
# Load the phenotype data vector
phenofile <- system.file("extdata", "ExamplePhenoData.rda", package = "eGST")
load(phenofile)
head(ExamplePhenoData)
## [1]  0.6676993  7.2554983  1.1536855  4.4182458 -2.8580514 -2.1392170

Here ExamplePhenoData is the phenotype data vector for 1000 individuals.

library("eGST")
# Load the list containing genotype matrices of tissue-specific eQTLs. 
genofile <- system.file("extdata", "ExampleEQTLgenoData.rda", package = "eGST")
load(genofile)
ExampleEQTLgenoData[[1]][1:5, 1:5]
##             V1         V2        V3          V4         V5
## [1,]  0.310475  0.5849763 0.2366187  0.04890556 -1.0796018
## [2,]  0.310475  0.5849763 0.2366187 -1.38949328 -1.0796018
## [3,]  0.310475 -0.9265903 1.6706716 -1.38949328 -1.0796018
## [4,]  1.721725  0.5849763 0.2366187  1.48730440  0.3541323
## [5,] -1.100775  0.5849763 1.6706716 -1.38949328  0.3541323

Here ExampleEQTLgenoData is a list with two elements, one per tissue, each containing a 1000 by 100 ordered genotype matrix. Each matrix provides the genotype data of the 1000 individuals at the 100 eQTLs specific to that tissue. To create sets of tissue-specific eQTLs in your context, please see our manuscript: A Majumdar, C Giambartolomei, N Cai, MK Freund, T Haldar, T Schwarz, J Flint, B Pasaniuc (2019) Leveraging eQTLs to identify tissue-specific genetic subtype of complex trait. bioRxiv. Above we have displayed the genotypes of the first 5 individuals at the first 5 eQTLs in the first tissue-specific eQTL set. We normalize each SNP’s genotype data across all individuals in the sample before running eGST.
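
As a sketch of the normalization step, the code below standardizes each SNP (column) of a small genotype matrix to mean 0 and standard deviation 1 using base R's scale(). The matrix raw_geno and its allele counts are hypothetical, and using the sample standard deviation is an assumption about how the normalization is done.

```r
# Hypothetical raw genotypes: 6 individuals (rows) by 3 SNPs (columns),
# coded as minor-allele counts 0/1/2. These values are made up for illustration.
raw_geno <- cbind(V1 = c(0, 1, 2, 0, 1, 1),
                  V2 = c(2, 0, 1, 1, 0, 0),
                  V3 = c(0, 0, 1, 2, 1, 0))

# Center each SNP at 0 and scale it to unit (sample) standard deviation
norm_geno <- scale(raw_geno, center = TRUE, scale = TRUE)

round(colMeans(norm_geno), 10)   # column means after normalization
apply(norm_geno, 2, sd)          # column standard deviations after normalization
```

A SNP with no variation in the sample would produce NaN values under this approach, so such SNPs would need to be removed beforehand.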

Next we specify the names of the tissues.

# Specify the names of the tissues.
tissues <- paste0("tissue", 1:2)
tissues
## [1] "tissue1" "tissue2"

In this simulated example dataset, we have considered two tissues, each with a corresponding set of 100 tissue-specific eQTLs. The phenotypes of the first 500 of the 1000 individuals were simulated to have a genetic effect from the first tissue-specific eQTLs but no effect from the second tissue-specific eQTLs. Hence the phenotypes of the first 500 individuals were assigned to the first tissue. Similarly, the phenotypes of the remaining 500 individuals were simulated to have a genetic effect from the second tissue-specific eQTLs.
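
A dataset with this structure could be generated along the following lines. This is only a sketch of the design described above, not the procedure used to create the packaged example data: the allele frequencies, the variance explained h2, the effect-size distribution and all object names are hypothetical assumptions.

```r
set.seed(2019)
n  <- 1000   # individuals
m  <- 100    # eQTLs per tissue
h2 <- 0.2    # hypothetical share of phenotypic variance explained by the eQTLs

# One standardized n x m genotype matrix per tissue (assumed allele frequencies)
sim_geno <- lapply(1:2, function(k) {
  maf <- runif(m, min = 0.1, max = 0.5)
  counts <- sapply(maf, function(p) rbinom(n, size = 2, prob = p))
  scale(counts)
})

# First 500 individuals belong to tissue 1, the remaining 500 to tissue 2
true_tissue <- rep(1:2, each = n / 2)

# Each individual's phenotype depends only on the eQTLs of its own tissue
sim_pheno <- numeric(n)
for (k in 1:2) {
  idx <- which(true_tissue == k)
  beta_k <- rnorm(m, mean = 0, sd = sqrt(h2 / m))   # assumed effect sizes
  sim_pheno[idx] <- drop(sim_geno[[k]][idx, ] %*% beta_k) +
    rnorm(length(idx), sd = sqrt(1 - h2))
}

table(true_tissue)
```

The objects sim_pheno and sim_geno have the same shapes as ExamplePhenoData and ExampleEQTLgenoData, and true_tissue records the simulated assignment.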

Next, for this toy example dataset, we run eGST for 10 iterations. However, we recommend at least 50 iterations in your application. The function accepts further arguments (see the Arguments section of eGST in the eGST manual).

# Run eGST to estimate the tissue-specific posterior probability across individuals.
result <- eGST(ExamplePhenoData, ExampleEQTLgenoData, tissues, nIter = 10)
## -------MAPEM in eGST starting--------
## 2019-06-30 23:55:20
## Iteration 1:
## logL improvement:10
## Iteration 2:
## logL improvement:0.0853372767560936
## Iteration 3:
## logL improvement:0.0825853667259242
## Iteration 4:
## logL improvement:0.0441753339905757
## Iteration 5:
## logL improvement:0.0322214448923188
## Iteration 6:
## logL improvement:0.0242762373635395
## Iteration 7:
## logL improvement:0.0186074438217965
## Iteration 8:
## logL improvement:0.014311619964968
## Iteration 9:
## logL improvement:0.0110271792233498
## Iteration 10:
## logL improvement:0.00854399985377663
## 2019-06-30 23:55:20
## -------MAPEM finished--------

At each iteration, eGST prints the average improvement in the log-likelihood of the data. Next we display an overall summary of the results obtained by eGST.

# Overall summary of the results produced by eGST.
str(result)
## List of 7
##  $ gamma  : num [1:1000, 1:2] 0.3118 0.0193 0.311 0.5934 0.9991 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "tissue1" "tissue2"
##  $ alfa   : num [1:2] 0.744 1.048
##  $ beta   :List of 2
##   ..$ : Named num [1:100] -0.0077 -0.1415 -0.1141 -0.104 -0.4558 ...
##   .. ..- attr(*, "names")= chr [1:100] "V1" "V2" "V3" "V4" ...
##   ..$ : Named num [1:100] -0.198 0.2617 -0.0706 0.315 0.3254 ...
##   .. ..- attr(*, "names")= chr [1:100] "V1" "V2" "V3" "V4" ...
##  $ sigma_g: num [1:2] 0.26 0.271
##  $ sigma_e: num [1:2] 2 2.01
##  $ m      : Named int [1:2] 100 100
##   ..- attr(*, "names")= chr [1:2] "tissue1" "tissue2"
##  $ logL   : num -2.3

The main output of interest is the matrix result$gamma, which provides the estimated tissue-specific subtype posterior probabilities across individuals. The first (second) column of result$gamma gives the posterior probability that an individual’s phenotype belongs to the first (second) tissue-specific genetic subtype. Individuals can be classified into a tissue-specific genetic subtype of the trait based on a posterior probability threshold, e.g. 65%, 70%, etc. For example, individuals whose posterior probability for the first tissue is > 65% can be assigned to the first tissue-specific genetic subtype. The list ‘result’ contains other outputs from eGST. For more details, please see the ‘Value’ section in the eGST manual.
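
The threshold-based classification described above can be sketched as follows. To keep the example self-contained, toy_gamma is a small hypothetical stand-in for result$gamma; the 65% threshold is one of the example values mentioned above, and leaving individuals below the threshold unassigned (NA) is an assumption of this sketch.

```r
# Hypothetical stand-in for result$gamma: 4 individuals by 2 tissues
toy_gamma <- matrix(c(0.31, 0.69,
                      0.02, 0.98,
                      0.59, 0.41,
                      0.99, 0.01),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("tissue1", "tissue2")))

threshold <- 0.65

# Tissue with the largest posterior probability for each individual
top_tissue <- max.col(toy_gamma, ties.method = "first")
top_prob   <- toy_gamma[cbind(seq_len(nrow(toy_gamma)), top_tissue)]

# Assign the tissue only if its posterior probability exceeds the threshold
assigned <- ifelse(top_prob > threshold, colnames(toy_gamma)[top_tissue], NA_character_)

assigned
## [1] "tissue2" "tissue2" NA        "tissue1"
table(assigned, useNA = "ifany")
```

Replacing toy_gamma with result$gamma applies the same rule to the eGST output. A higher threshold gives fewer but more confident assignments.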

For any questions, please send an email to statgen.arunabha@gmail.com or pasaniuc@ucla.edu. See our manuscript for more details: A Majumdar, C Giambartolomei, N Cai, MK Freund, T Haldar, T Schwarz, J Flint, B Pasaniuc (2019) Leveraging eQTLs to identify tissue-specific genetic subtype of complex trait. bioRxiv.