Distribution fitting

The analysis of multivariate count data arises in numerous fields including genomics, image analysis, text mining, and sport analytics. The multinomial logit model is limiting due to its restrictive mean-variance structure. Moreover, it assumes that counts of different categories are negatively correlated. Models that allow over-dispersion and possess more flexible positive and/or negative correlation structures offer more realism. We implement four models in the R package MGLM: multinomial logit (MN), Dirichlet multinomial (DM), generalized Dirichlet multinomial (GDM), and negative mutinomial (NegMN). Distribution fitting, regression, hypothesis testing, and variable selection are treated in a unified framework. The multivariate count data we analyze here has $d$ categories.

Distribution fitting

The function MGLMfit fits various multivariate discrete distributions and outputs a list with the maximum likelihood estimate (MLE) and relevant statistics.

When fitting distributions, i.e. no covariates involved, MN is a sub-model of DM, and DM is a sub-model of GDM. <!–%We can perform the likelihood ratio test (LRT) to make comparison between the three models.–> MGLMfit outputs the p-value of the likelihood ratio test (LRT) for comparing the fitted model with the most commonly used multinomial model. The NegMN model does not have a nesting relationship with any of the other three models. Therefore, no LRT is performed when fitting a NegMN distribution.

Multinomial (MN)

We first generate data from a multinomial distribution. Note the multinomial parameter (must be positive) supplied to the rmn function is automatically scaled to be a probability vector.

library(MGLM)
set.seed(123)
n <- 200
d <- 4
alpha <- rep(1, d)
m <- 50
Y <- rmn(n, m, alpha)

Multinomial distribution fitting, although trivial, is implemented.

mnFit <- MGLMfit(Y, dist="MN")
show(mnFit)

##         estimate       SE
## alpha_1   0.2568 0.030891
## alpha_2   0.2467 0.030483
## alpha_3   0.2451 0.030416
## alpha_4   0.2514 0.030676
## 
## Distribution: Multinomial
## Log-likelihood: -1457.788
## BIC: 2931.471
## AIC: 2921.576
## LRT test p value: 
## Iterations:

As a comparison, we fit the DM distribution to the same data set. The results indicate that using a more complex model on the multinomial data shows no advantage.

compareFit <- MGLMfit(Y, dist="DM")
show(compareFit)

##         estimate        SE
## alpha_1  5270901 786946676
## alpha_2  5063596 755995891
## alpha_3  5030755 751092797
## alpha_4  5160065 770398732
## 
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1457.788
## BIC: 2936.769
## AIC: 2923.576
## LRT test p value: 1.000
## Iterations: 35

Both the DM parameter estimates and their standard errors are large, indicating possible overfitting by the DM model. This is confirmed by the fact that the p-value of the LRT for comparing MN to DM is close to 1.

Dirichlet-multinomial (DM)

DM is a Dirichlet mixture of multinomials and allows over-dispersion. Similar to the MN model, it assumes that the counts of any two different categories are negatively correlated. We generate the data from the DM model and fit the DM distribution.

set.seed(123)
n <- 200
d <- 4
alpha <- rep(1, d)
m <- 50
Y <- rdirmn(n, m, alpha)

dmFit <- MGLMfit(Y, dist="DM")
show(dmFit)

##         estimate       SE
## alpha_1 0.976670 0.076589
## alpha_2 0.995142 0.079255
## alpha_3 1.006120 0.080033
## alpha_4 0.904500 0.072547
## 
## Distribution: Dirichlet Multinomial
## Log-likelihood: -2011.225
## BIC: 4043.644
## AIC: 4030.451
## LRT test p value: <0.0001
## Iterations: 4

The estimate is very close to the true value with small standard errors. The LRT shows that the DM model is significantly better than the MN model.

Generalized Dirichlet-multinomial (GDM)

GDM model uses $d-2$ more parameters than the DM model and allows both positive and negative correlations among different categories. DM is a sub-model of GDM. Here we fit a GDM model to the above DM sample.

gdmFit <- MGLMfit(Y, dist="GDM")
show(gdmFit)

##         estimate       SE
## alpha_1 1.158474 0.123403
## alpha_2 0.993293 0.437239
## alpha_3 0.839967 0.106374
## beta_1  3.706863 0.226414
## beta_2  1.979389 0.094645
## beta_3  0.759641 0.084403
## 
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -2007.559
## BIC: 4046.907
## AIC: 4027.117
## LRT test p value: <0.0001
## Iterations: 27

GDM yields a slightly larger log-likelihood value but a larger BIC, suggesting DM as a preferred model. p-value indiciates GDM is still significantly better thant the MN model. <!–%This is confirmed by a formal LRT % <>= % lrtGdmvsdm <- pchisq(2*(gdmFit$loglikelihood - dmFit$loglikelihood), d-2, lower.tail=FALSE) % show(lrtGdmvsdm) % @–> Now we simulate data from GDM and fit the GDM distribution.

set.seed(124)
n <- 200
d <- 4
alpha <- rep(1, d-1)
beta <- rep(1, d-1)
m <- 50
Y <- rgdirmn(n, m, alpha, beta)
gdmFit <- MGLMfit(Y, dist="GDM")
gdmFit

##         estimate       SE
## alpha_1 1.019835 0.103480
## alpha_2 0.826125 0.109316
## alpha_3 0.774387 0.092568
## beta_1  1.062112 0.095566
## beta_2  0.846216 0.108725
## beta_3  0.924559 0.135008
## 
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1820.616
## BIC: 3673.021
## AIC: 3653.231
## LRT test p value: <0.0001
## Iterations: 24

Negative multinomial (NegMN)

NegMN model is a multivariate analog of the negative binomial model. It assumes positive correlation among the counts. The following code generates data from the NegMN model and fit the NegMN distribution,

set.seed(1220)
n <- 100
d <- 4
p <- 5
prob <- rep(0.2, d)
beta <- 10
Y <- rnegmn(n, beta, prob)
negmnFit <- MGLMfit(Y, dist="NegMN")
show(negmnFit)

##      estimate       SE
## p_1  0.188151 0.009584
## p_2  0.194311 0.009837
## p_3  0.191511 0.009722
## p_4  0.196177 0.009914
## phi 12.313935 2.266097
## 
## Distribution: Negative Multinomial
## Log-likelihood: -1104.579
## BIC: 2232.184
## AIC: 2219.158
## LRT test p value: 
## Iterations: 4

Because MN is not a sub-model of NegMN, no LRT is performed here.

Regression

In regression, the $n \times p$ covariate matrix $X$ is similar to that used in the glm function. The response should be a $n \times d$ count matrix. Unlike estimating a parameter vector $\beta$ in GLM, we need to estimate a parameter matrix $B$ when the responses are multivariate. The dimension of the parameter matrix depends on the model:

MN: $p\times (d-1)$
DM: $p\times d$
GDM: $p\times 2(d-1)$
NegMN: $p\times (d+1)$

The GDM model provides the most flexibility, but also requires most parameters. In the function MGLMreg for regression, the option dist="MN", "DM", "GDM" or "NegMN" specifies the model.

The rows $B_{j,\cdot}$ of the parameter matrix correspond to covariates. By default, the function output the Wald test statistics and p-values for testing $H_0: B_{j,\cdot}={\bf 0}$ vs $H_a: B_{j, \cdot}\neq {\bf 0}$. If specifying the option LRT=TRUE, the function also outputs LRT statistics and p-values.

Next, we demonstrate that model mis-specification results in failure of hypothesis testing. We simulate response data from the GDM model. Covariates $X_1$ and $X_2$ have no effect while $X_3$, $X_4$, $X_5$ have different effect sizes.

set.seed(1234)
n <- 200
p <- 5
d <- 4
X <- matrix(runif(p * n), n, p)
alpha <- matrix(c(0.6, 0.8, 1), p, d - 1, byrow=TRUE)
alpha[c(1, 2),] <- 0
Alpha <- exp(X %*% alpha)
beta <- matrix(c(1.2, 1, 0.6), p, d - 1, byrow=TRUE)
beta[c(1, 2),] <- 0
Beta <- exp(X %*% beta)
m <- runif(n, min=0, max=25) + 25
Y <- rgdirmn(n, m, Alpha, Beta)

We fit various regression models and test significance of covariates.

Multinomial regression

mnReg <- MGLMreg(Y~0+X, dist="MN")
show(mnReg)

## Call: MGLMreg(formula = Y ~ 0 + X, dist = "MN")
## 
## Coefficients:
##          Col_1     Col_2     Col_3
## [1,]  0.277063 -0.182760 -0.123204
## [2,]  0.543064  0.422730  0.246523
## [3,]  0.333252  0.517605  0.221851
## [4,]  0.356843  0.486722  0.565427
## [5,] -0.302455  0.192508  0.323713
## 
## Hypothesis test: 
##    wald value    Pr(>wald)
## X1   24.63244 1.842859e-05
## X2   21.99680 6.533133e-05
## X3   23.10908 3.832310e-05
## X4   25.07475 1.489470e-05
## X5   49.37327 1.086326e-10
## 
## Distribution: Multinomial
## Log-likelihood: -2194.448
## BIC: 4468.371
## AIC: 4418.896
## Iterations: 5

The Wald test shows that all predictors, including the null predictors $X_1$ and $X_2$, are significant.

Dirichlet-multinomial regression

dmReg <- MGLMreg(Y~0+X, dist="DM")
show(dmReg)

## Call: MGLMreg(formula = Y ~ 0 + X, dist = "DM")
## 
## Coefficients:
##         Col_1     Col_2     Col_3     Col_4
## [1,] 0.154137 -0.118264 -0.188339 -0.013172
## [2,] 0.183264  0.142034 -0.183394 -0.333886
## [3,] 1.143146  1.254828  1.092635  0.811259
## [4,] 0.392703  0.545421  0.590015  0.131132
## [5,] 0.226350  0.660108  0.939599  0.487034
## 
## Hypothesis test: 
##    wald value Pr(>wald)
## X1   3.349794  0.501082
## X2   7.845339  0.097411
## X3  25.497386  0.000040
## X4   8.735121  0.068072
## X5  23.136042  0.000119
## 
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1683.961
## BIC: 3473.889
## AIC: 3407.922
## Iterations: 7

Wald test declares that $X1$, $X2$ and $X4$ have not effects, but $X3$ and $X5$ are significant.

Generalized Dirichlet-multinomial Regression

gdmReg <- MGLMreg(Y~0+X, dist="GDM")
show(gdmReg)

## Call: MGLMreg(formula = Y ~ 0 + X, dist = "GDM")
## 
## Coefficients:
##      alpha_Col_2 alpha_Col_1 alpha_Col_3 beta_Col_2 beta_Col_1 beta_Col_3
## [1,]   -0.283917    0.190503    0.315706  -0.400203   0.684651   0.467592
## [2,]   -0.209171    0.395541    0.014426  -0.508254   0.552645  -0.171410
## [3,]    1.090140    1.243785    1.171786   1.304955   1.537526   0.916033
## [4,]    0.296819    0.405333    0.809087   0.487838   0.573695   0.305889
## [5,]    0.601841    0.030123    1.281219   1.430931   0.260197   0.853498
## 
## Hypothesis test: 
##    wald value    Pr(>wald)
## X1   9.424379 1.510802e-01
## X2   4.865683 5.611523e-01
## X3  26.828718 1.559064e-04
## X4  12.607034 4.971848e-02
## X5  38.637877 8.427803e-07
## 
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1676.399
## BIC: 3511.748
## AIC: 3412.798
## Iterations: 21

When using the correct model GDM, the Wald test is able to differentiate the null effects from the significant ones. GDM regression yields the highest log-likelihood and smallest BIC.

Negative multinomial regression

negReg <- MGLMreg(Y~0+X, dist="NegMN", regBeta=FALSE)
show(negReg)

## Call: MGLMreg(formula = Y ~ 0 + X, dist = "NegMN", regBeta = FALSE)
## 
## Coefficients:
## $alpha
##        Col_1     Col_2     Col_3     Col_4
## X1  0.243606 -0.216368 -0.156525 -0.031371
## X2  0.061896 -0.055885 -0.232635 -0.479780
## X3 -0.160914  0.022688 -0.271238 -0.495127
## X4 -0.176181 -0.043828  0.034781 -0.530126
## X5 -0.607974 -0.115829  0.012937 -0.315857
## 
## $phi
##      phi 
## 13.77531 
## 
## 
## Hypothesis test: 
##    wald value    Pr(>wald)
## X1   24.99903 5.033252e-05
## X2   24.90077 5.267448e-05
## X3   29.32830 6.704198e-06
## X4   28.18693 1.143075e-05
## X5   60.50179 2.275444e-12
## 
## Distribution: Negative Multinomial
## Log-likelihood: -2908.896
## BIC: 5929.056
## AIC: 5859.792
## Iterations: 15

Again, the Wald test declares all predictors to be significant.