Using RMixtComp with mixed and missing data

Quentin Grimonprez

2023-06-17

Unsupervised classification is illustrated on the titanic dataset, a data.frame with 1309 observations and 8 variables describing the passengers of the Titanic. Each observation represents a passenger described by two real variables: age in years (age) and ticket price in pounds (fare); two count variables: number of siblings/spouses aboard (sibsp) and number of parents/children aboard (parch); and four categorical variables: sex, ticket class (pclass), port of embarkation (embarked) and a binary variable indicating whether the passenger survived (survived). The dataset contains missing values for three variables: age, fare and embarked.

library(RMixtComp)
data(titanic)
print(titanic[c(1, 16, 38, 169, 285, 1226),])
##      pclass survived    sex  age sibsp parch     fare embarked
## 1       1st        1 female 29.0     0     0 211.3375        S
## 16      1st        0   male   NA     0     0  25.9250        S
## 38      1st        1   male   NA     0     0  26.5500        S
## 169     1st        1 female 38.0     0     0  80.0000     <NA>
## 285     1st        1 female 62.0     0     0  80.0000     <NA>
## 1226    3rd        0   male 60.5     0     0       NA        S

Step 1: Data Preparation

First, the dataset must be converted to the MixtComp format. Categorical variables must be coded as integers from 1 to the number of categories (e.g. 3 for embarked). This can be done with the refactorCategorical function, which takes as arguments the vector containing the data, the old labels and the new labels. Completely missing values must be indicated with a ?.

titanicMC <- titanic
titanicMC$sex <- refactorCategorical(titanic$sex, c("male", "female"), c(1, 2))
titanicMC$pclass <- refactorCategorical(titanic$pclass, c("1st", "2nd", "3rd"), c(1, 2, 3))
titanicMC$embarked <- refactorCategorical(titanic$embarked, c("C", "Q", "S"), c(1, 2, 3))
titanicMC$survived <- refactorCategorical(titanic$survived, c(0, 1), c(1, 2))
titanicMC[is.na(titanicMC)] <- "?"
head(titanicMC)
##   pclass survived sex    age sibsp parch     fare embarked
## 1      1        2   2     29     0     0 211.3375        3
## 2      1        2   1 0.9167     1     2   151.55        3
## 3      1        1   2      2     1     2   151.55        3
## 4      1        1   1     30     1     2   151.55        3
## 5      1        1   2     25     1     2   151.55        3
## 6      1        2   1     48     0     0    26.55        3
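The recoding step can be pictured with a small base R sketch. The toy data frame and the helper recode are hypothetical stand-ins written for illustration; recode only mimics the label-to-integer mapping described above and is not the implementation of refactorCategorical.

```r
# Toy data frame (hypothetical values) with the same kinds of columns as titanic
toy <- data.frame(
  sex      = c("male", "female", "female", "male"),
  embarked = c("S", NA, "C", "Q"),
  age      = c(29, NA, 38, 60.5),
  stringsAsFactors = FALSE
)

# Hypothetical base R stand-in for the recoding step: map old labels to new codes.
# match() returns NA for values that are missing or not listed in oldLabels.
recode <- function(x, oldLabels, newLabels) {
  newLabels[match(as.character(x), oldLabels)]
}

toyMC <- toy
toyMC$sex <- recode(toy$sex, c("male", "female"), c(1, 2))
toyMC$embarked <- recode(toy$embarked, c("C", "Q", "S"), c(1, 2, 3))

# Replace every remaining NA with "?"; columns that contained an NA become character
toyMC[is.na(toyMC)] <- "?"
print(toyMC)
```

Columns that receive a "?" are coerced to character, which is why the values of age and fare in the output above are no longer printed with a common number of decimals.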

The dataset is split into two datasets to illustrate learning and prediction.

indTrain <- sample(nrow(titanicMC), floor(0.8 * nrow(titanicMC)))
titanicMCTrain <- titanicMC[indTrain, ]
titanicMCTest <- titanicMC[-indTrain, ]
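Because sample draws at random, the split above changes from one session to the next. A minimal sketch of a repeatable split, assuming a fixed seed is acceptable (the seed value 42 is arbitrary and not taken from the original analysis, so it will not reproduce the outputs shown in this document):

```r
set.seed(42)  # hypothetical seed; any fixed value makes the split repeatable

n <- 1309  # number of rows in titanic
indTrain <- sample(n, floor(0.8 * n))
indTest <- setdiff(seq_len(n), indTrain)

length(indTrain)  # size of the training set
length(indTest)   # size of the test set
```

With 1309 rows, the training set holds floor(0.8 * 1309) = 1047 passengers, which is the number of individuals reported by the summary in Step 3, and the test set holds the remaining 262.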

Then, since all variables are stored as character in the data.frame, a model object indicating which model to use for each variable is created. In this example, a Gaussian model is used for age and fare, a multinomial model for sex, pclass, embarked and survived, and a Poisson model for sibsp and parch.

model <- list(fare = "Gaussian", age = "Gaussian", pclass = "Multinomial", survived = "Multinomial",
              sex = "Multinomial", embarked = "Multinomial", sibsp = "Poisson", parch = "Poisson")
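Since the model list is keyed by variable name, a quick consistency check against the column names can catch typos before learning. This is a sketch in base R only; the vector dataCols is typed in by hand from the printed data above and stands in for colnames(titanicMC).

```r
# Column names of the titanic data, as shown in the printed output above
dataCols <- c("pclass", "survived", "sex", "age", "sibsp", "parch", "fare", "embarked")

model <- list(fare = "Gaussian", age = "Gaussian", pclass = "Multinomial", survived = "Multinomial",
              sex = "Multinomial", embarked = "Multinomial", sibsp = "Poisson", parch = "Poisson")

# Variables that have a model but no column, and columns that have no model
missingInData <- setdiff(names(model), dataCols)
missingInModel <- setdiff(dataCols, names(model))
stopifnot(length(missingInData) == 0, length(missingInModel) == 0)

# Group the variables by model for a quick overview
byModel <- split(names(model), unlist(model))
print(byModel)
```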

Step 2: Learning

We choose to run the clustering analysis for 1 to 20 clusters with 3 runs for every number of clusters. These runs can be parallelized using the nCore parameter.

resTitanic <- mixtCompLearn(titanicMCTrain, model, nClass = 1:20, nRun = 3, nCore = 1)
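As a rough guide to choosing nCore, the sketch below counts the model fits requested and derives a core count from the machine. It assumes that leaving one core free is desirable, which is a common habit rather than a MixtComp recommendation; the mixtCompLearn call is left as a comment so the block runs without the package.

```r
library(parallel)  # ships with base R

nClass <- 1:20
nRun <- 3
totalFits <- length(nClass) * nRun  # number of model fits requested

# detectCores() can return NA on some platforms, hence the fallback to 1
available <- detectCores()
nCore <- if (is.na(available)) 1L else max(1L, available - 1L)

# The learning call above would then become:
# resTitanic <- mixtCompLearn(titanicMCTrain, model, nClass = nClass, nRun = nRun, nCore = nCore)
```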

Step 3: Interpretation and Visualization

The summary and plot functions give an overview of the results for the best number of classes according to the chosen criterion (BIC or ICL). If this number is not the one desired by the user, it can be changed via the nClass parameter.

The summary displays the number of clusters chosen and some outputs such as the discriminative power, which indicates the variables that contribute most to class separation, and the parameters associated with the 3 most discriminant variables.

summary(resTitanic)
## ############### MixtCompLearn Run ###############
## nClass: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
## Criterion used: BIC 
##            1         2         3         4         5         6         7
## BIC -14420.7 -12832.51 -12315.06 -11847.27 -11678.29 -11601.63 -11523.19
## ICL -14420.7 -12860.87 -12341.07 -11904.55 -11741.10 -11661.26 -11596.42
##             8         9        10        11        12        13        14
## BIC -11555.02 -11547.94 -11514.92 -11405.14 -11489.13 -11446.26 -11431.12
## ICL -11613.51 -11605.37 -11588.10 -11501.77 -11559.88 -11567.44 -11554.02
##            15        16        17        18        19        20
## BIC -11515.67 -11481.53 -11539.71 -11598.12 -11629.26 -11624.77
## ICL -11603.53 -11571.33 -11664.16 -11695.05 -11753.95 -11718.26
## Best model: 11 clusters 
## ########### MixtComp Run ###########
## Number of individuals: 1047 
## Number of variables: 8 
## Number of clusters: 11 
## Mode: learn 
## Time: 0.915 s
## SEM burn-in iterations done: 50/50 
## SEM run iterations done: 50/50 
## Observed log-likelihood: -10911.43 
## BIC: -11405.14 
## ICL: -11501.77 
## Discriminative power:
##     fare   pclass survived    parch    sibsp      age embarked      sex 
##    0.599    0.387    0.201    0.181    0.180    0.147    0.127    0.127 
## Proportions of the mixture:
## 0.177 0.117 0.12 0.034 0.099 0.169 0.059 0.057 0.065 0.069 0.032 
## Parameters of the most discriminant variables:
## - fare: Gaussian 
##          mean      sd
## k: 1    7.908   0.811
## k: 2   20.081   9.435
## k: 3   22.681  10.145
## k: 4   44.187  16.900
## k: 5   12.461   1.365
## k: 6    7.845   0.113
## k: 7   28.083   2.167
## k: 8  189.115 107.997
## k: 9   66.192  17.439
## k: 10  36.376  25.719
## k: 11 102.602  64.425
## - pclass: Multinomial 
##       modality 1 modality 2 modality 3
## k: 1       0.000      0.000      1.000
## k: 2       0.000      0.190      0.810
## k: 3       0.000      0.540      0.460
## k: 4       0.000      0.000      1.000
## k: 5       0.000      1.000      0.000
## k: 6       0.000      0.000      1.000
## k: 7       1.000      0.000      0.000
## k: 8       1.000      0.000      0.000
## k: 9       1.000      0.000      0.000
## k: 10      0.439      0.368      0.193
## k: 11      1.000      0.000      0.000
## - survived: Multinomial 
##       modality 1 modality 2
## k: 1       0.866      0.134
## k: 2       1.000      0.000
## k: 3       0.000      1.000
## k: 4       1.000      0.000
## k: 5       0.723      0.277
## k: 6       0.691      0.309
## k: 7       0.542      0.458
## k: 8       0.082      0.918
## k: 9       0.000      1.000
## k: 10      0.869      0.131
## k: 11      1.000      0.000
## ####################################
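The model choice reported in this summary can be checked by hand. The sketch below copies the criterion values from the output, picks the number of classes that maximizes each criterion, and recomputes the BIC of the 11-class model. It assumes the BIC is the observed log-likelihood minus half the number of free parameters times log(n), and that each class has 2 free parameters per Gaussian variable, 1 per Poisson variable and (number of modalities - 1) per multinomial variable; these assumptions are not stated in this document, but they reproduce the reported value.

```r
# Criterion values copied from the summary output above (1 to 20 classes)
bic <- c(-14420.7, -12832.51, -12315.06, -11847.27, -11678.29, -11601.63, -11523.19,
         -11555.02, -11547.94, -11514.92, -11405.14, -11489.13, -11446.26, -11431.12,
         -11515.67, -11481.53, -11539.71, -11598.12, -11629.26, -11624.77)
icl <- c(-14420.7, -12860.87, -12341.07, -11904.55, -11741.10, -11661.26, -11596.42,
         -11613.51, -11605.37, -11588.10, -11501.77, -11559.88, -11567.44, -11554.02,
         -11603.53, -11571.33, -11664.16, -11695.05, -11753.95, -11718.26)

bestBIC <- which.max(bic)  # number of classes with the highest BIC
bestICL <- which.max(icl)  # number of classes with the highest ICL

# Free parameters per class (assumed counts, see the text above):
# fare 2, age 2, pclass 2, survived 1, sex 1, embarked 2, sibsp 1, parch 1
K <- 11
perClass <- 2 + 2 + 2 + 1 + 1 + 2 + 1 + 1
nParam <- (K - 1) + K * perClass  # K - 1 free mixture proportions

logLik <- -10911.43  # observed log-likelihood from the summary
n <- 1047            # number of individuals from the summary
bicByHand <- logLik - nParam / 2 * log(n)

print(c(bestBIC = bestBIC, bestICL = bestICL, nParam = nParam))
print(round(bicByHand, 2))  # compare with the BIC reported above
```

Both criteria point to 11 classes here, and the hand-computed value agrees with the reported BIC to within rounding.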

The plot function displays the values of the criteria, the discriminative power of the variables and the parameters of the three most discriminative variables. More variables can be displayed using the nVarMaxToPlot parameter.

plot(resTitanic)
[Figures: $criteria, $discrimPowerVar, $proportion, $fare, $pclass, $survived]

The most discriminant variables for clustering are fare and pclass. The similarity between variables is shown with the following code:

heatmapVar(resTitanic)

round(computeSimilarityVar(resTitanic), 2)
##          fare  age pclass survived  sex embarked sibsp parch
## fare     1.00 0.37   0.43     0.35 0.36     0.39  0.37  0.37
## age      0.37 1.00   0.59     0.69 0.73     0.72  0.70  0.72
## pclass   0.43 0.59   1.00     0.58 0.59     0.60  0.55  0.55
## survived 0.35 0.69   0.58     1.00 0.78     0.70  0.67  0.68
## sex      0.36 0.73   0.59     0.78 1.00     0.74  0.71  0.73
## embarked 0.39 0.72   0.60     0.70 0.74     1.00  0.69  0.70
## sibsp    0.37 0.70   0.55     0.67 0.71     0.69  1.00  0.71
## parch    0.37 0.72   0.55     0.68 0.73     0.70  0.71  1.00

The greatest similarity is between survived and sex. This relation is well known in the dataset: a much larger proportion of women survived than men. In contrast, there is little similarity between fare and the other variables.
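These two readings of the similarity matrix can be recovered programmatically. The sketch below types in the rounded matrix printed above (in practice it would come from computeSimilarityVar), finds the most similar pair of distinct variables, and ranks the variables by their mean similarity with the others.

```r
vars <- c("fare", "age", "pclass", "survived", "sex", "embarked", "sibsp", "parch")
simil <- matrix(c(
  1.00, 0.37, 0.43, 0.35, 0.36, 0.39, 0.37, 0.37,
  0.37, 1.00, 0.59, 0.69, 0.73, 0.72, 0.70, 0.72,
  0.43, 0.59, 1.00, 0.58, 0.59, 0.60, 0.55, 0.55,
  0.35, 0.69, 0.58, 1.00, 0.78, 0.70, 0.67, 0.68,
  0.36, 0.73, 0.59, 0.78, 1.00, 0.74, 0.71, 0.73,
  0.39, 0.72, 0.60, 0.70, 0.74, 1.00, 0.69, 0.70,
  0.37, 0.70, 0.55, 0.67, 0.71, 0.69, 1.00, 0.71,
  0.37, 0.72, 0.55, 0.68, 0.73, 0.70, 0.71, 1.00
), nrow = 8, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal and the duplicated lower triangle
offDiag <- simil
offDiag[lower.tri(offDiag, diag = TRUE)] <- NA
best <- which(offDiag == max(offDiag, na.rm = TRUE), arr.ind = TRUE)
bestPair <- c(rownames(simil)[best[1, "row"]], colnames(simil)[best[1, "col"]])

# Mean similarity of each variable with the seven others (diagonal removed)
meanSim <- (rowSums(simil) - 1) / (ncol(simil) - 1)
leastSimilar <- names(which.min(meanSim))

print(bestPair)
print(round(sort(meanSim), 2))
```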

Getters are available to easily access some results: getBIC, getICL, getCompletedData, getParam, getProportion, getTik, getPartition, … All these functions use the model maximizing the requested criterion. If results for another number of classes are desired, the extractMixtCompObject function can be used. For example:

getProportion(resTitanic)
##       k: 1       k: 2       k: 3       k: 4       k: 5       k: 6       k: 7 
## 0.17669532 0.11747851 0.12034384 0.03438395 0.09933142 0.16905444 0.05921681 
##       k: 8       k: 9      k: 10      k: 11 
## 0.05730659 0.06494747 0.06876791 0.03247373
resK2 <- extractMixtCompObject(resTitanic, 2)
getProportion(resK2)
##      k: 1      k: 2 
## 0.4918816 0.5081184

Step 4: Prediction

Once a model is learnt, one can use it to predict the clusters of new individuals.

resPred <- mixtCompPredict(titanicMCTest, resLearn = resTitanic, nClass = 5, nRun = 3, nCore = 1)

The probabilities of belonging to the different classes (returned on the log scale, as the output below shows) and the associated partition are given by:

tik <- getTik(resPred)
head(tik)
##              [,1] [,2]          [,3] [,4] [,5]
## [1,]         -Inf -Inf  0.000000e+00 -Inf -Inf
## [2,] -14.60520000 -Inf -4.539859e-07 -Inf -Inf
## [3,]  -0.04038069 -Inf -3.229526e+00 -Inf -Inf
## [4,]         -Inf -Inf  0.000000e+00 -Inf -Inf
## [5,]         -Inf -Inf  0.000000e+00 -Inf -Inf
## [6,]         -Inf -Inf  0.000000e+00 -Inf -Inf
partition <- getPartition(resPred)
head(partition)
## [1] 3 3 1 3 3 3
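The values printed by head(tik) are non-positive and include -Inf, which indicates logarithms of probabilities. The sketch below types in those six rows, converts them back to probabilities with exp, and takes the most probable class of each row. Treating the partition as the row-wise maximum is an assumption made for illustration; on these six rows it coincides with the output of getPartition.

```r
# First six rows of the log-scale membership matrix, copied from the output above
logTik <- matrix(c(
  -Inf,         -Inf,  0.000000e+00, -Inf, -Inf,
  -14.60520000, -Inf, -4.539859e-07, -Inf, -Inf,
  -0.04038069,  -Inf, -3.229526e+00, -Inf, -Inf,
  -Inf,         -Inf,  0.000000e+00, -Inf, -Inf,
  -Inf,         -Inf,  0.000000e+00, -Inf, -Inf,
  -Inf,         -Inf,  0.000000e+00, -Inf, -Inf
), ncol = 5, byrow = TRUE)

# Back to the probability scale: exp(-Inf) is 0, exp(0) is 1
tikProb <- exp(logTik)
print(round(tikProb, 4))
print(rowSums(tikProb))  # each row should sum to (approximately) 1

# Most probable class for each individual
mapClass <- apply(logTik, 1, which.max)
print(mapClass)
```

The third individual is the only one of the six with noticeable uncertainty: about 96% of its probability mass is on class 1 and about 4% on class 3.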