Short R Tutorial: Sequence Analysis Typologies for Large Databases

Matthias Studer

Introduction

This document provides an introduction to the R code to create typologies of trajectories in large databases using the algorithm provided by the WeightedCluster R library (Studer 2013). It also presents methods to evaluate the quality of the resulting clustering in large databases. Readers interested by the methods and looking for more information on the interpretation of the results are referred to:

You are kindly asked to cite the above reference if you use the methods presented in this document.

Warning!! To avoid lengthy computations (and overloading the CRAN server), we restricted the number of iterations and the sample size. We strongly recommend using much higher values.

We start by loading the required library and setting the seed.

set.seed(1)
library(WeightedCluster)

Time for this code chunk to run: 1.38 seconds

Data Preparation

We rely on the biofam dataset to illustrate the use of the WeightedCluster package. This public dataset is distributed with the TraMineR R package and code family trajectories of a subsample of the Swiss Household Panel (SHP Group 2024).

First, we need to prepare the data by creating a state sequence object using the seqdef command (Gabadinho et al. 2011). This object stores all the information about our trajectories, including the data and associated characteristics, such as state labels. During this step, we further define proper state labels as the original data are coded using numerical values. We then plot the sequences using, for instance, a chronogram.

data(biofam) #load illustrative data
## Defining the new state labels 
statelab <- c("Parent", "Left", "Married", "Left/Married",  "Child", 
            "Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object,
biofam.seq <- seqdef(biofam[,10:25], alphabet=0:7, states=statelab)
seqdplot(biofam.seq, legend.prop=0.2)

Time for this code chunk to run: 0.19 seconds

CLARA Clustering

CLARA for SA is available in the seqclararange function. It is used as follows:

bfclara <- seqclararange(biofam.seq, R = 50, sample.size = 100, kvals = 2:10, 
                         seqdist.args = list(method = "HAM"), parallel=TRUE, 
                         stability=TRUE)

Time for this code chunk to run: 56.59 seconds

The function requires us to specify the sequence to cluster (our biofam.seq sequence object), the number of iterations (argument R, here set to 50 to avoid long computation time, larger values are recommended) and the subsample size (argument sample.size, here again set to the low value of 100 for illustrative purpose). The number of groups in our typology is set using the kvals arguments, here between 2 and ten groups solutions. We directly specify a range of values that will be considered later on. Finally, we need to specify how to compute the distance between sequences through the seqdist.args argument as a list object. All the arguments specified here will be directly passed to the seqdist function. Therefore, any distance measures available in seqdist can be used here. Finally, we set stability=TRUE to estimate the stability of the clustering among the subsamples.

Setting parallel=TRUE, a default parallel back-end is set up using the future framework (Bengtsson 2021). However, any parallel back-end previously defined with the plan function will be used when parallel=FALSE. The parallel protocol can then be adapted to specific environments, for instance some High Performance Computing (HPC) server relies on specific protocols (MPI,…). As implied by the name, setting progressbar=TRUE shows information (and estimated computation time) on the progress of the computations.

Cluster Quality Indices

The values of the medoid-based CQI are shown when the result is printed, or can be plotted using the plot command. When plotting the CQI, standardizing the values makes it easier to identify the best solution (Studer 2013).

bfclara
##           Avg dist  PBM   DB   XB  AMS ARI>0.8 JC>0.8 Best iter
## cluster2      5.18 1.13 0.89 0.47 0.50      20     27        46
## cluster3      4.29 0.73 1.09 0.61 0.48      21     17        18
## cluster4      3.69 0.56 0.93 0.53 0.53      24     17        12
## cluster5      3.21 0.47 0.79 0.46 0.58      16      7        36
## cluster6      2.94 0.54 0.84 0.42 0.55      11      3        50
## cluster7      2.71 0.55 0.92 0.68 0.55       4      1         9
## cluster8      2.59 0.45 0.88 0.52 0.55       4      1        21
## cluster9      2.49 0.34 0.98 0.62 0.53       1      1         3
## cluster10     2.35 0.31 1.09 0.78 0.52       1      1         3
plot(bfclara, norm="range")

Time for this code chunk to run: 0.2 seconds

Except the PBM index, all indices favor a five-cluster solution. The resulting clustering is stored in the clustering element of the results. It can for instance be used to represent the sequences in each cluster as follows.

The stability of the clustering can also be plotted using either the average value, or the number of recoveries of a similar solution.

plot(bfclara, stat="stabmean")

Time for this code chunk to run: 0.07 seconds

Here again, the five-cluster solution shows the highest CQI values. However, the absolute number of recoveries is low for more than seven groups. A higher number of iterations is therefore recommended. This is not surprising as we used a low number of iterations.

plot(bfclara, stat="stability")

Time for this code chunk to run: 0.06 seconds

The bootclustrange function can be used to bootstrap the cluster quality measures. It has the following arguments. The first object is the clustering to be evaluated, as a seqclararange object, a data.frame or a vector. Second, the sequences that were used to create the typology. We should further specify the distance measure to use (as before), the number of bootstraps (R argument), and the subsample size (here 100). We would generally use higher values for this last two arguments. Finally, the parallel and progressbar arguments could be used as before. The resulting object is then printed and the standardized values of the CQI are plotted. The results lead to the same conclusion as for the medoid-based CQI.

    bCQI <- bootclustrange(bfclara, biofam.seq, seqdist.args = list(method = "HAM"), R = 50, sample.size = 100,  parallel=TRUE)
    bCQI
##            PBC   HG HGSD  ASW ASWw    CH   R2  CHsq R2sq   HC
## cluster2  0.45 0.57 0.55 0.31 0.32 20.72 0.17 37.21 0.27 0.21
## cluster3  0.55 0.69 0.67 0.34 0.36 23.17 0.32 45.78 0.48 0.16
## cluster4  0.58 0.76 0.74 0.36 0.39 21.89 0.41 45.61 0.59 0.13
## cluster5  0.59 0.81 0.79 0.39 0.42 21.47 0.47 46.60 0.66 0.11
## cluster6  0.62 0.87 0.86 0.40 0.43 20.91 0.53 49.45 0.72 0.08
## cluster7  0.59 0.87 0.86 0.38 0.42 20.28 0.57 49.00 0.76 0.08
## cluster8  0.60 0.90 0.88 0.38 0.43 19.01 0.59 47.86 0.78 0.07
## cluster9  0.56 0.87 0.86 0.34 0.40 17.67 0.61 42.74 0.79 0.09
## cluster10 0.54 0.89 0.88 0.33 0.40 17.46 0.63 43.96 0.81 0.08
  plot(bCQI, norm="zscore")

Time for this code chunk to run: 22.93 seconds

Plotting the Typology

Once a clustering in a given number of groups has been selected, we can plot the sequences by cluster to give a better interpretation.

seqdplot(biofam.seq, group=bfclara$clustering$cluster5)

Time for this code chunk to run: 0.23 seconds

Fuzzy Clustering

By setting method="fuzzy" in the seqclararange function, the fuzzy version of the algorithm is used. It should be noted that the computations are generally longer than for crisp clustering. The CQI values are printed and plotted using the same commands.

bfclaraf <- seqclararange(biofam.seq, R = 50, sample.size = 100, kvals = 2:10, method="fuzzy",
                            seqdist.args = list(method = "HAM"), parallel=TRUE)
bfclaraf
##           Avg dist  PBM   DB   XB  AMS ARI>0.8 JC>0.8 Best iter
## cluster2      4.36 0.71 1.20 0.44 0.62      NA     NA         1
## cluster3      3.22 0.50 1.28 0.46 0.72      NA     NA        17
## cluster4      2.62 0.33 1.31 0.37 0.73      NA     NA        32
## cluster5      2.27 0.19 1.60 0.45 0.74      NA     NA        18
## cluster6      2.02 0.17 1.49 0.40 0.75      NA     NA         7
## cluster7      1.81 0.14 1.70 0.45 0.75      NA     NA        15
## cluster8      1.67 0.13 1.66 0.42 0.74      NA     NA         4
## cluster9      1.56 0.11 1.98 0.52 0.75      NA     NA         4
## cluster10     1.46 0.09 1.96 0.49 0.76      NA     NA         4
plot(bfclaraf, norm="zscore")

Time for this code chunk to run: 58.76 seconds

The XB and AMS index favor a four or six cluster solution. Several plots are available to describe fuzzy clustering of trajectories, see Studer (2018) and the fuzzyseqplot function. Here, we use sequence index plot ordered by membership strength, with typical trajectories of each cluster represented at the top, leaving aside trajectories with low membership. In this kind of graphic, hybrid trajectories are represented at the bottom.

fuzzyseqplot(biofam.seq, group=bfclaraf$clustering$cluster4, type="I", sortv="membership", membership.threashold=0.4)

Time for this code chunk to run: 2.99 seconds

Bengtsson, Henrik. 2021. “A Unifying Framework for Parallel and Distributed Processing in r Using Futures.” The R Journal 13 (2): 208. https://doi.org/10.32614/rj-2021-048.
Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller, and Matthias Studer. 2011. “Analyzing and Visualizing State Sequences in R with TraMineR.” Journal of Statistical Software 40 (4): 1–37. https://doi.org/https://doi.org/10.18637/jss.v040.i04.
SHP Group. 2024. “Living in Switzerland Waves 1-24 (Including a Long File) + Covid 19 Data.” FORS data service. https://doi.org/10.48573/58NW-6A50.
Studer, Matthias. 2013. “WeightedCluster Library Manual: A Practical Guide to Creating Typologies of Trajectories in the Social Sciences with R.” {LIVES} Working Papers 24. Switzerland: NCCR LIVES. https://doi.org/10.12682/lives.2296-1658.2013.24.
———. 2018. “Divisive Property-Based and Fuzzy Clustering for Sequence Analysis.” In Sequence Analysis and Related Approaches: Innovative Methods and Applications, edited by Gilbert Ritschard and Matthias Studer, 10:223–39. Life Course Research and Social Policies. Springer. https://doi.org/10.1007/978-3-319-95420-2_13.