Vignette for clipp

The R package clipp provides a fast and general implementation of the Elston-Stewart algorithm.

The main function is pedigree_loglikelihood, which calculates the pedigree log-likelihood for almost any choice of genetic model. Helper functions are provided that specify commonly used genetic models, though users are free to define and use more advanced ones. Combining clipp with an optimisation function like mle allows the user to perform maximum-likelihood estimation of model parameters, as illustrated below.

clipp also provides a function, genotype_probabilities, that calculates genotype probabilities for a target person within a family, given the family’s observed data.

Pedigree likelihoods

The function pedigree_loglikelihood calculates (the logarithm of) any pedigree likelihood that can be written in Ott’s form, as given on page 117 of (Lange, 2002). In this formulation, the likelihood \(L\) for a given family is \[ L = \sum_{g_1} \dots \sum_{g_n} \prod_{i=1}^n P(x_i \mid g_i) P(g_i \mid g_{m(i)}, g_{f(i)}),\] where: there are \(n\) members of the family, who are labeled by identifiers \(i = 1, \dots, n\); \(g_i\) and \(x_i\) are the genotype and observed data for person \(i\), respectively; \(P(x_i \mid g_i)\) is the conditional probability of person \(i\)’s observed data give his or her genotype; \(m(i)\) and \(f(i)\) are the identifiers of person \(i\)’s mother and father, respectively, or missing if person \(i\) is a founder (i.e. if person \(i\) does not have parents in the pedigree); and \(P(g_i \mid g_{m(i)}, g_{f(i)})\) is either the population prevalence of genotype \(g_i\) (if person \(i\) is a founder) or the conditional probability of offspring genotype \(g_i\) given parental genotypes \(g_{m(i)}\) and \(g_{f(i)}\) (if person \(i\) is a non-founder). Each sum is over all possible genotypes at the genetic locus under consideration (or over all possible multi-locus genotypes if we are modeling two or more genetic loci).

The terms in this likelihood correspond to the main arguments of the function pedigree_loglikelihood. The argument dat specifies the family structure, i.e. \(m(i)\) and \(f(i)\) for all family members \(i\). The argument geno_freq specifies the population genotype prevalences, i.e. \(P(g_i \mid g_{m(i)}, g_{f(i)})\) when person \(i\) is a founder. The argument trans specifies the transmission probabilities, i.e. \(P(g_i \mid g_{m(i)}, g_{f(i)})\) when person \(i\) is a non-founder. Lastly, the penetrance matrix penet specifies the relationship between the observed data and the genotypes, i.e. \(P(x_i \mid g_i)\).

Specifying the genetic model

The transmission probabilities trans and population genotype prevalences geno_freq determine the joint probability for any combination of genotypes within a family, so we say that trans and geno_freq define the genetic model of the analysis. These terms are usually based on known genetic laws (such as Mendel’s laws) and user-specified genotype or allele frequencies. clipp provides helper functions to calculate trans and geno_freq for common genetic models. For example, the following code specifies a genetic model consisting of a single biallelic locus with a minor allele frequency of 10%.

library(clipp)
MAF <- 0.1
geno_freq <- geno_freq_monogenic(p_alleles = c(1 - MAF, MAF))
trans <- trans_monogenic(n_alleles = 2)

We can view a more user-friendly version of the genotype frequencies by using the option annotate = TRUE.

geno_freq_monogenic(p_alleles = c(1 - MAF, MAF), annotate = TRUE)
#>  1/1  1/2  2/2 
#> 0.81 0.18 0.01

This shows that the alleles at the genetic locus have been given the default names 1 and 2, and that the possible genotypes are 1/1, 1/2 and 2/2. In this example, the population prevalence of the heterozygous genotype 1/2 is 18%.

The option annotate = TRUE also reveals a more user-friendly version of the transmission probabilities.

trans_monogenic(n_alleles = 2, annotate = TRUE)
#>    gm  gf  1/1 1/2  2/2
#> 1 1/1 1/1 1.00 0.0 0.00
#> 2 1/1 1/2 0.50 0.5 0.00
#> 3 1/1 2/2 0.00 1.0 0.00
#> 4 1/2 1/1 0.50 0.5 0.00
#> 5 1/2 1/2 0.25 0.5 0.25
#> 6 1/2 2/2 0.00 0.5 0.50
#> 7 2/2 1/1 0.00 1.0 0.00
#> 8 2/2 1/2 0.00 0.5 0.50
#> 9 2/2 2/2 0.00 0.0 1.00

Here, the rows correspond to the nine possible joint parental genotypes and the last three columns correspond to the three possible offspring genotypes. Each number is the conditional probability of the offspring genotype, given the parental genotypes. For example, row 3 says that when the mother has genotype 1/1 and the father has genotype 2/2 then the offspring will have genotype 1/2 (with probability 1). The probabilities in this table are given by Mendel’s laws of genetics. Note that the probabilities in each row sum to 1, since the offspring must always have one of the three possible genotypes, regardless of the parental genotypes.

Calculating the pedigree likelihood

Now that we’ve defined a genetic model, we can use the function pedigree_loglikelihood to calculate the pedigree log-likelihoods of some sample families.

data("dat_small", "penet_small", "dat_large", "penet_large")
head(dat_small)
#>   family  indiv mother father sex aff age geno
#> 1    ora ora001 ora009 ora010   1   1  42  2/2
#> 2    ora ora002   <NA>   <NA>   2   0   6  1/1
#> 3    ora ora003 ora002 ora001   1   0  22     
#> 4    ora ora004 ora002 ora001   2   0  17  1/2
#> 5    ora ora005 ora002 ora001   1   0  22     
#> 6    ora ora006 ora002 ora001   1   0  71
head(penet_small)
#>           [,1]      [,2]      [,3]
#> [1,] 0.9988662 0.9966026 0.9966026
#> [2,] 0.9999994 0.9999982 0.9999982
#> [3,] 0.9999676 0.9999028 0.9999028
#> [4,] 0.9999866 0.9999598 0.9999598
#> [5,] 0.9999676 0.9999028 0.9999028
#> [6,] 0.9561271 0.8740714 0.8740714

Each row of dat_small corresponds to a person in one of 10 families. Here, dat_small specifies the family structure for these 10 families, meaning that dat_small specifies the mother and father of each person in each family. Some people, such as person ora002, have missing parental identifiers because they are founders (i.e. their parents are not included in the pedigree). Each person has either both or no parents included in their pedigree, i.e. if the mother identifier is missing then so is the father identifier, and vice versa. Also, each person who is mentioned as a parent has a corresponding row, e.g. person ora009 is listed as the mother of person ora001, so there must be a row later in the pedigree with indiv equal to ora009. The columns sex, aff, age and geno are ignored by the function pedigree_loglikelihood, though data like this will usually contribute to the corresponding pedigree likelihood via the penetrance matrix. We will soon give examples of calculating penetrance matrices from this sort of observed data, but for now we simply use the sample penetrance matrix penet_small.

If there are any identical twins or triplets in the family then we can specify them using an optional argument monozyg. For example, to indicate that ora024 and ora027 are identical twins, and so are aey063 and aey064, then we can use the following as the monozyg argument:

monozyg_small <- list(c("ora024", "ora027"), c("aey063", "aey064"))

We can now calculate the log-likelihoods of the 10 families.

pedigree_loglikelihood(dat_small, geno_freq, trans, penet_small, 
                       monozyg = monozyg_small, sum_loglik = FALSE, ncores = 2)
#>       ora       ltk       aey       fwj       tqo       ibz       shd       vtc 
#> -224.0860 -170.4035 -204.6862 -198.4982 -193.2291 -179.5359 -246.8078 -160.9817 
#>       jil       mak 
#> -163.6852 -158.1138

By default, pedigree_loglikelihood will return the sum of the pedigree log-likelihoods of all of the families. However, the option sum_loglik = FALSE above instructs pedigree_loglikelihood to return the separate log-likelihoods, e.g. the above output shows that family ora has a log-likelihood of -224.0860. The option ncores = 2 instructs pedigree_loglikelihood to perform this calculation in parallel on two cores, with the log-likelihoods of five families calculated on one core and the remaining five on another core.

The function pedigree_loglikelihood can also handle very large families. For example, the following code takes much less than one minute on a standard desktop computer to calculate the log-likelihood of a family with approximately 10,000 family members.

system.time(ll <- pedigree_loglikelihood(dat_large, geno_freq, trans, penet_large))
#>    user  system elapsed 
#>   10.64    0.15   10.83
ll
#> [1] -18020.99

Maximum likelihood estimation

The argument penet of pedigree_loglikelihood specifies the penetrance matrix, which usually depends on the observed data and some unknown parameters that we would like to estimate. In this section, we give a simple example of such a penetrance matrix and then use this to estimate its unknown parameters.

Let’s take dat_small as our sample data, and suppose we’re interested in estimating the log-odds of disease for the three possible genotypes from previous sections. Here, the log-odds of disease is log(p/(1-p)) when p is the probability of disease.

head(dat_small)
#>   family  indiv mother father sex aff age geno
#> 1    ora ora001 ora009 ora010   1   1  42  2/2
#> 2    ora ora002   <NA>   <NA>   2   0   6  1/1
#> 3    ora ora003 ora002 ora001   1   0  22     
#> 4    ora ora004 ora002 ora001   2   0  17  1/2
#> 5    ora ora005 ora002 ora001   1   0  22     
#> 6    ora ora006 ora002 ora001   1   0  71

As a toy example, we ignore age and most other data, and take the observed data to be just the disease affected status aff. For simplicity, we also assume that allele 1 is dominant to allele 2, meaning that the log-odds of disease are the same for people with genotypes 1/1 and 1/2. The parameters that we would like to estimate are therefore the log-odds of disease for genotypes 1/1 and 1/2 (combined) and for genotype 2/2. We denote these parameters by logodds1 and logodds2, respectively.

A penetrance matrix penet is designed to give a probabilistic connection between a peron’s genotype and his or her observed data. Each row of penet corresponds to a person and each column corresponds to one of the possible genotypes. (Row i of penet and row i of dat should correspond to the same person, and the order of the genotypes should be the same for penet, geno_freq and trans.) The (i, j)^th entry of penet gives the conditional probability of person i’s observed data, given that he or she has the j^th genotype. In our example, these conditional probabilities are determined by the parameters logodds1 and logodds2. For instance, if person i has genotype 1/1 then his or her probability of disease is prob1 = 1/(1 + exp(-logodds1)), so if person i is affected then the conditional probability of his or her observed data is prob1, and if person i is unaffected then this conditional probability is 1 - prob1. Therefore, the (i, 1)^th entry of penet, corresponding to person i and the first genotype (i.e. 1/1), is prob1 if person i is affected and 1 - prob1 if person i is unaffected. The following function uses this and similar reasoning to calculate row i of penet.

penet.fn <- function(i, logodds1, logodds2) {
  prob1 <- 1/(1 + exp(-logodds1))
  prob2 <- 1/(1 + exp(-logodds2))
  penet.i <- c(prob1, prob1, prob2)
  if (dat_small$aff[i] == 0)  penet.i <- 1 - penet.i
  return(penet.i)
}

We can now estimate our parameters of interest, logodds1 and logodds2, using maximum likelihood estimation, as implemented in the mle function of the stats4 package.

library(stats4)
minusll <- function(logodds1 = 0, logodds2 = 0) {
  penet <- t(sapply(1:nrow(dat_small), penet.fn, logodds1, logodds2))
  loglik <- pedigree_loglikelihood(dat_small, geno_freq, trans, penet,
                                   monozyg = monozyg_small, ncores = 2)
  return(-loglik)
}
minusll()
#> [1] 705.6238
fit <- mle(minusll)
summary(fit)
#> Maximum likelihood estimation
#> 
#> Call:
#> mle(minuslogl = minusll)
#> 
#> Coefficients:
#>           Estimate Std. Error
#> logodds1 -1.359962  0.1054336
#> logodds2 -1.313467  6.8597364
#> 
#> -2 log L: 1030.901

The function minusll above first calculates the penetrance matrix penet corresponding to any given values of the parameters logodds1 and logodds2, and then uses this matrix and pedigree_loglikelihood to calculate the corresponding log-likelihood. The above code then uses mle to perform maximum likelihood estimation, giving estimates of -1.36 and -1.31 for logodds1 and logodds2, respectively.

Incorporating known genotypes

We will now extend the analysis of the previous section to include known genotypes.

The dataset dat_small has a column geno corresponding to known genotypes. As is often the case with family data, many people in dat_small do not have measured genotypes, and so have a blank ("") in the geno column.

head(dat_small)
#>   family  indiv mother father sex aff age geno
#> 1    ora ora001 ora009 ora010   1   1  42  2/2
#> 2    ora ora002   <NA>   <NA>   2   0   6  1/1
#> 3    ora ora003 ora002 ora001   1   0  22     
#> 4    ora ora004 ora002 ora001   2   0  17  1/2
#> 5    ora ora005 ora002 ora001   1   0  22     
#> 6    ora ora006 ora002 ora001   1   0  71

The standard method for incorporating known genotypes into the Elston-Stewart algorithm is to first calculate the penetrance matrix penet while ignoring all genotype data, and then change penet for people who have measured genotypes according to the following simple rule. If person i is known to have the j^th genotype then penet[i, j] is unchanged but penet[i, k] is set to 0 for all k != j. See later for brief justification for this rule.

For example, to incorporate known genotypes into the analysis of the previous section, we modify penet.fn by adding the three lines of code below that are marked with \(\color{darkred}{\text{###}}\).

penet.fn <- function(i, logodds1, logodds2) {
  prob1 <- 1/(1 + exp(-logodds1))
  prob2 <- 1/(1 + exp(-logodds2))
  penet.i <- c(prob1, prob1, prob2)
  if (dat_small$aff[i] == 0)  penet.i <- 1 - penet.i
  if (dat_small$geno[i] == "1/1")  penet.i[-1] <- 0     ###
  if (dat_small$geno[i] == "1/2")  penet.i[-2] <- 0     ###
  if (dat_small$geno[i] == "2/2")  penet.i[-3] <- 0     ###
  return(penet.i)
}

We can now use the same code as in the previous section to estimate the log-odds of disease, though our estimates are now based on the known genotypes as well as disease affected statuses.

minusll()
#> [1] 788.5003
fit <- mle(minusll)
summary(fit)
#> Maximum likelihood estimation
#> 
#> Call:
#> mle(minuslogl = minusll)
#> 
#> Coefficients:
#>           Estimate Std. Error
#> logodds1 -1.357596 0.08405912
#> logodds2 -1.395321 0.61632248
#> 
#> -2 log L: 1196.65

The above method for incorporating known genotypes can be justified as follows. Measured genotypes can differ from actual genotypes due to genotyping errors, and even when genotyping errors are negligible, we can make a conceptual distinction between the actual genotype \(g_i\) of person i, and any measured genotype of person i, which is part of his or her observed data \(x_i\). The (i, k)^th component penet[i, k] of the penetrance matrix is the conditional probability of person i’s observed data \(x_i\), given that his or her actual genotype \(g_i\) is the k^th genotype (e.g. 2/2 is the third genotype, in our running example). When genotyping errors are negligible, the chance of observing a genotype different from the actual genotype is 0. So if person i is observed to have the j^th genotype then penet[i, k] is 0 for all k != j. Using similar reasoning, non-negligible genotyping errors and partial genotyping information can also be incorporated in an analysis.

Calculating carrier probabilities

Given a genetic model and a penetrance matrix, clipp can also use the Elston-Stewart algorithm to calculate genotype probabilities for a person of interest, using the function genotype_probabilities. For example, we can calculate the genotype probabilities for individual ora008 in the family ora, as follows.

head(dat_small)
#>   family  indiv mother father sex aff age geno
#> 1    ora ora001 ora009 ora010   1   1  42  2/2
#> 2    ora ora002   <NA>   <NA>   2   0   6  1/1
#> 3    ora ora003 ora002 ora001   1   0  22     
#> 4    ora ora004 ora002 ora001   2   0  17  1/2
#> 5    ora ora005 ora002 ora001   1   0  22     
#> 6    ora ora006 ora002 ora001   1   0  71
genotype_probabilities(target = "ora008", dat_small, geno_freq, trans, 
                       penet_small, monozyg_small)
#> [1] 0.19832320 0.74111557 0.06056123

This shows that person ora008 has a 19.8% chance of having the first genotype (1/1), given the inputs (family structure, genetic model and penetrance matrix). In this calculation, the penetrance matrix penet_small depends on all of the observed data. If we instead wanted genotype probabilities that are based only on the family structure and observed genotypes then we could use the following code.

penet.fn <- function(i) {
  penet.i <- rep(1, 3)
  if (dat_small$geno[i] == "1/1")  penet.i[-1] <- 0
  if (dat_small$geno[i] == "1/2")  penet.i[-2] <- 0
  if (dat_small$geno[i] == "2/2")  penet.i[-3] <- 0
  return(penet.i)
}
penet <- t(sapply(1:nrow(dat_small), penet.fn))
genotype_probabilities(target = "ora008", dat_small, geno_freq, trans, 
                       penet, monozyg_small)
#> [1] 0.45 0.50 0.05