Corrected Credible Set

Anna Hutchinson

This vignette shows how the corrcoverage R package can be used to obtain a new credible set of variants that contains the true causal variant with some specified coverage whilst including as few variants as possible.


1. Simulate GWAS summary statistics


As in the corrected coverage vignette, let’s begin by simulating some GWAS data using the simGWAS package. This package is not available on CRAN, so the example data generated by simGWAS are included in this package instead.

set.seed(18)
library(corrcoverage)
# library(simGWAS)

data <- system.file("extdata", "", package="corrcoverage")

#  Simulate reference haplotypes
nsnps <- 200
nhaps <- 1000
lag <- 30  # genotypes are correlated between neighbouring variants
maf <- runif(nsnps + lag, 0.05, 0.5)  # common SNPs
laghaps <- do.call("cbind", lapply(maf, function(f) rbinom(nhaps, 1, f)))
haps <- laghaps[, 1:nsnps]
for (j in 1:lag) haps <- haps + laghaps[, (1:nsnps) + j]
haps <- round(haps/matrix(apply(haps, 2, max), nhaps, nsnps, byrow = TRUE))
snps <- colnames(haps) <- paste0("s", 1:nsnps)
freq <- as.data.frame(haps + 1)
freq$Probability <- 1/nrow(freq)
sum(freq$Probability)
## [1] 1
MAF <- colMeans(freq[, snps] - 1)  # minor allele frequencies
CV <- sample(snps[which(colMeans(haps) > 0.1)], 1)
iCV <- sub("s", "", CV)  # index of the CV
LD <- cor2(haps) # correlation between SNPs
OR <- 1.1 # odds ratios
N0 <- 10000 # number of controls
N1 <- 10000 # number of cases
  
#z0 <- simulated_z_score(N0 = N0, # number of controls
#                        N1 = N1, # number of cases
#                        snps = snps, # column names in freq
#                        W = CV, # causal variants, subset of snps
#                        gamma.W = log(OR), # log odds ratios
#                        freq = freq) # reference haplotypes

z0 <- readRDS(paste0(data,"/z0_2.RDS"))

To calculate \(V\), the variance of the estimated effect size, we use the Var.data.cc function.

varbeta <- Var.data.cc(f = MAF, N = N1+N0, s = N1/(N0+N1)) # variance of 
                                                       # estimated effect size
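Var.data.cc implements the standard approximation used in the coloc package, \(V = \frac{1}{2\,N\,f\,(1-f)\,s\,(1-s)}\), where \(f\) is the minor allele frequency, \(N = N_0 + N_1\) is the total sample size and \(s = N_1/(N_0+N_1)\) is the proportion of samples that are cases.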

We can then use the ppfunc function to calculate the posterior probabilities of causality for each variant.

postprobs <- ppfunc(z = z0, V = varbeta) 
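For reference, here is a minimal sketch of the calculation ppfunc performs, assuming it uses Wakefield’s approximate Bayes factors with the default prior standard deviation \(W = 0.2\) for the effect size (as in the coloc package):

W <- 0.2  # assumed prior standard deviation of the effect size
r <- W^2/(W^2 + varbeta)
lABF <- 0.5 * (log(1 - r) + r * z0^2)  # log approximate Bayes factors
pp.manual <- exp(lABF - max(lABF))/sum(exp(lABF - max(lABF)))  # normalise
all.equal(as.numeric(pp.manual), as.numeric(postprobs))  # TRUE if the assumptions hold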

We use the est_mu function to obtain an estimate of the true effect at the causal variant.

muhat <- est_mu(z0, MAF, N0, N1)
muhat
## [1] 4.970273
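As a rough, illustrative sanity check, this estimate should be of a similar magnitude to the largest absolute \(Z\)-score in the region:

max(abs(z0))  # compare with muhat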

2. Derive corrected coverage estimate


The corrected_cov function is used to find the corrected coverage of a credible set with specified threshold, say 0.9.

Note that this function is similar to corrcov, as explained in the “Corrected Coverage” vignette, which requires \(Z\)-scores, minor allele frequencies and sample sizes. Here, we have already calculated some of the intermediate quantities that corrcov computes internally (muhat, varbeta etc.), so we can use corrected_cov directly.

thr <- 0.9
corrcov <- corrected_cov(pp0 = postprobs, mu = muhat, V = varbeta, 
                         Sigma = LD, thr = thr, nrep = 1000)
cs <- credset(pp = postprobs, thr = thr)
data.frame(claimed.cov = cs$claimed.cov, corr.cov =  corrcov, nvar = cs$nvar)
##   claimed.cov corr.cov nvar
## 1   0.9307395 0.967282    9
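For intuition, credset follows the standard construction: sort the variants by posterior probability and add them to the set until the cumulative sum first exceeds the threshold. A minimal sketch of this logic:

o <- order(postprobs, decreasing = TRUE)  # sort variants by posterior probability
n <- which(cumsum(postprobs[o]) >= thr)[1]  # smallest set whose total exceeds thr
sum(postprobs[o][1:n])  # claimed coverage (~0.93 here)
snps[o][1:n]  # the variants in the 90% credible set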

Using the Bayesian approach for statistical fine-mapping we obtain a 90% credible set consisting of 9 variants. The claimed coverage of this credible set is ~0.93, yet the corrected coverage estimate is ~0.97, suggesting that we can afford to be ‘more confident’ that we have captured the causal variant in our credible set.


3. Evaluate accuracy of estimate


Again, for the purpose of this vignette we can investigate how accurate this estimate is by simulating many credible sets from the same system and finding the proportion of these that contain the true causal variant. Note that nrep is relatively low in this example to keep the data stored in the R package within CRAN size guidelines; increase nrep to improve the accuracy of the estimated empirical coverage.

# z0.tmp <- simulated_z_score(N0 = N0, # number of controls
#                            N1 = N1, # number of cases
#                            snps = snps, # column names in freq
#                            W = CV, # causal variants, subset of snps
#                            gamma.W = log(OR), # log odds ratios
#                            freq = freq, # reference haplotypes
#                            nrep = 200) # increase nrep to increase accuracy of estimate

z0.tmp <- readRDS(paste0(data,"/z0.tmp_2.RDS"))

pps <- ppfunc.mat(zstar = z0.tmp, V = varbeta)  # find pps 
cs.cov <- apply(pps, 1, function(x) credset(x, CV = iCV, thr = thr)$cov)
true.cov.est <- mean(cs.cov)
data.frame(claimed.cov = cs$claimed.cov, corr.cov =  corrcov, 
           true.cov = true.cov.est, nvar = cs$nvar)
##   claimed.cov corr.cov true.cov nvar
## 1   0.9307395 0.967282     0.97    9

We find that our corrected coverage value is very close to the empirical coverage of the credible set.


4. Obtain a corrected credible set


Our results suggest that we may be able to remove some variants from the credible set, whilst still achieving the desired coverage of 90%.

The corrected_cs function uses GWAS summary statistics and some user-defined parameters to find the smallest credible set such that the coverage estimate is within some accuracy of the desired coverage.

The function requires the following parameters to be specified: z (the marginal \(Z\)-scores), f (the minor allele frequencies), N0 and N1 (the numbers of controls and cases respectively), Sigma (the SNP correlation matrix) and desired.cov (the desired coverage of the causal variant in the credible set).

In addition, there are a number of optional parameters, including lower, upper, acc and max.iter, which control the root-finding algorithm and are described below.

The function uses the bisection root-finding method to converge to the smallest threshold such that the corrected coverage is larger than the desired coverage. The lower and upper parameters define the boundaries of the threshold values for the root-finding method, with default values of 0 and 1 respectively. The acc parameter has a default value of 0.005, meaning that the algorithm keeps searching (until the number of iterations reaches max.iter) for a corrected credible set whose coverage is within 0.005 of the desired coverage (e.g. between 0.895 and 0.905 for a 90% credible set).

The function reports the threshold values tested and their corresponding corrected coverage values at each iteration. The maximum number of iterations for the bisection root-finding algorithm is an optional parameter, with default value 20. The function stops when either the number of iterations reaches this maximum or the corrected coverage is within the specified accuracy of the desired coverage.
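For example, a tighter tolerance could be requested by tuning these optional parameters (not run here; the acc and max.iter values are illustrative):

# res.tight <- corrected_cs(z = z0, f = MAF, N0, N1, Sigma = LD,
#                           lower = 0.5, upper = 1, desired.cov = 0.9,
#                           acc = 0.001, max.iter = 50)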

res <- corrected_cs(z = z0, f = MAF, N0, N1, 
                    Sigma = LD, lower = 0.5, upper = 1, desired.cov = 0.9)
## [1] "thr:  0.75 , cov:  0.918448964618699"
## [1] "thr:  0.625 , cov:  0.870505410370508"
## [1] "thr:  0.6875 , cov:  0.893551309455069"
## [1] "thr:  0.71875 , cov:  0.906172863191769"
## [1] "thr:  0.703125 , cov:  0.899644208037482"
## [1] "thr:  0.7109375 , cov:  0.901677186379133"
res
## $credset
## [1] "s54" "s58" "s57" "s67"
## 
## $req.thr
## [1] 0.7109375
## 
## $corr.cov
## [1] 0.9016772
## 
## $size
## [1] 0.7343958

The first threshold value tested is the midpoint of the lower (0.5) and upper (1) parameter values, here 0.75. The corrected coverage of the resulting credible set is too high (\(>0.9\)), so a smaller threshold is tested next: the midpoint of the lower value (0.5) and the threshold just tried (0.75). This credible set has coverage that is too low, so a higher threshold is tested, and so on.
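This search can be written as a plain bisection loop over the threshold; the sketch below illustrates the logic only (corrected_cs handles the stopping rules and accuracy check internally):

lo.thr <- 0.5  # the lower parameter
hi.thr <- 1    # the upper parameter
for (i in 1:6) {
  thr.try <- (lo.thr + hi.thr)/2  # midpoint of the current interval
  cov.try <- corrected_cov(pp0 = postprobs, mu = muhat, V = varbeta,
                           Sigma = LD, thr = thr.try, nrep = 1000)
  if (cov.try > 0.9) hi.thr <- thr.try else lo.thr <- thr.try  # bisect
}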

In this example we see that a much smaller threshold value (~0.71 rather than 0.9) is needed to obtain a credible set with 90% corrected coverage of the causal variant, and this set contains only 4 variants. In the standard Bayesian approach, the 90% threshold leads to over-coverage.


Finally, we can compare this coverage estimate of the corrected credible set to an empirical estimate of the true coverage.

new.cs.sims <- apply(pps, 1, function(x) credset(x, CV = iCV, thr = res$req.thr)$cov)
true.cov.est2 <- mean(new.cs.sims)

Original 90% credible set:

df1 <- data.frame(claimed.cov = round(cs$claimed.cov, 3), corr.cov =  round(corrcov, 3), true.cov = round(true.cov.est, 3), nvar = cs$nvar)
print(df1, row.names = FALSE)
##  claimed.cov corr.cov true.cov nvar
##        0.931    0.967     0.97    9

New 90% credible set:

df2 <- data.frame(claimed.cov = round(res$size, 3), corr.cov = round(res$corr.cov, 3), true.cov = round(true.cov.est2, 3), nvar = length(res$credset))
print(df2, row.names = FALSE)
##  claimed.cov corr.cov true.cov nvar
##        0.734    0.902      0.9    4

This vignette has shown how the corrcoverage R package can be used to improve the resolution of a credible set from Bayesian genetic fine-mapping, without the use of any additional data.