This document provides a very quick introduction to the
R
code needed to estimate the quality of a typology in a
subsequent regression or when the relationship between the typology and
the covariate is of key interest. Readers interested in the methods and
the exact interpretation of the results are referred to:
You are kindly asked to cite the above reference if you use the methods presented in this document.
Let’s start by setting the seed for reproducible results.
Time for this code chunk to run: 0.06 seconds
For this example, we use the mvad
dataset. Let’s start
with the creation of the state sequence object.
## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)
## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")
## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)
Time for this code chunk to run: 0.26 seconds
We will now create a typology using cluster analysis. Readers
interested in more detail are referred to the
WeightedCluster
library manual (also available as a
vignette), which goes into the details of the creation and computation
of cluster quality measures.
We start by computing dissimilarities with the seqdist
function using the Hamming distance. We then use Ward clustering to
create a typology of the trajectories. For this step, we recommend the
use of the fastcluster
library (Müllner 2024), which considerably speed up the
computations.
## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")
Time for this code chunk to run: 4.78 seconds
We can now compute several cluster quality indices using
as.clustrange
function from two to ten groups.
# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.59 0.75 0.74 0.43 0.43 216.72 0.23 431.41 0.38 0.13
## cluster3 0.43 0.51 0.50 0.25 0.25 175.61 0.33 335.58 0.49 0.25
## cluster4 0.52 0.67 0.67 0.29 0.30 164.63 0.41 352.31 0.60 0.17
## cluster5 0.52 0.69 0.68 0.31 0.31 153.46 0.46 326.41 0.65 0.16
## cluster6 0.57 0.79 0.78 0.34 0.35 151.17 0.52 372.95 0.73 0.11
## cluster7 0.58 0.83 0.83 0.37 0.37 144.38 0.55 383.10 0.77 0.09
## cluster8 0.51 0.78 0.78 0.32 0.33 140.00 0.58 358.85 0.78 0.12
## cluster9 0.52 0.85 0.85 0.34 0.35 137.85 0.61 386.02 0.81 0.09
## cluster10 0.51 0.87 0.87 0.35 0.36 131.67 0.63 379.81 0.83 0.08
Time for this code chunk to run: 0.44 seconds
clustassoc
functionIn this example, we will focus on the association between father
unemployment status (funemp
variable) and our
school-to-work trajectories. The clustassoc
function
provides several indicators of the quality of typology to study this
association.
The function takes a clustrange
object as the first
argument. The diss
argument specifies the distance matrix
used for clustering, covar
the covariate of association of
interest, and weights
an optional case weights vector.
## Unaccounted Remaining BIC numcluster
## No Clustering 1.0000000 0.006135499 642.7743 1
## cluster2 0.6804065 0.004174633 639.6969 2
## cluster3 0.4544939 0.002788547 642.3764 3
## cluster4 0.3432326 0.002105903 646.4923 4
## cluster5 0.3445621 0.002114060 652.8668 5
## cluster6 0.2225184 0.001365262 657.7167 6
## cluster7 0.2269752 0.001392606 662.9861 7
## cluster8 0.2192153 0.001344995 667.0654 8
## cluster9 0.1944971 0.001193337 671.3423 9
## cluster10 0.1970281 0.001208866 675.5324 10
Time for this code chunk to run: 19.77 seconds The resulting object
presents three indicators. The Unaccounted
column shows the
share of the direct association between the trajectories and the
covariates that is by the typology. This computation are based on the
discrepancy analysis framework (Studer et al.
2011). A low value means that the typology carries most of the
information that is relevant to study the association between our
covariate and the trajectories.
The Remaining
column presents the share of the overall
variability of the trajectories that is by the typology. A low value
indicates that there is no variation left not explained by the typology.
Warning, this is usually a very low value. The value presented in the
“No clustering” row (the first) is equivalent to the pseudo-\(R^2\) of a discrepancy analysis between the
trajectories and the covariates.
The BIC
column presents the Bayesian Information
Criterion for the association between the typology and the covariate
(again the lower the better). While the first column provides the most
reliable information, the BIC
might be useful when
parcimony is of key interest.
The general idea is to select a cluster solution with low values on
Unaccounted
and BIC
(only if relevant).
The results can be plotted to make it easier to find the minimum.
Time for this code chunk to run: 0.09 seconds
According to the plot, at least 6 groups are required. However, around one fifth of the association is left un-reproduced by the clustering. It might be interesting to compare the 5 and 6 clusters solutions to better understand the association.
Time for this code chunk to run: 0.68 seconds
Time for this code chunk to run: 0.5 seconds
We can notice that the 6 cluster solutions contains a new joblessness cluster, which is found to be important to study the association between father unemployment and son school-to-work trajectories.
The presented might lead to different recommendations than the usual cluster quality indices, because it focuses on a relationship with a covariate. The method often suggests a higher number of groups.