This package “ocp” provides code implementing Bayesian online changepoint detection. This tutorial will show how to use the main functions in this package.

1 Running Online Bayesian Changepoint Detection

1.1 Generate Univariate Gaussian Data

The figure below shows the simulated univariate gaussian data used for this tutorial, and the locations of the true changepoints.

 # the true changepoint locations including the first and last point
truecps<- c(1, 51, 71, 121)
#simulate the data
set.seed(1)
uvg<- c(rnorm(n=diff(truecps)[1], mean=0, sd=2), 
        rnorm(n=diff(truecps)[2], mean=20, sd=4),
        rnorm(n=diff(truecps)[3], mean=10, sd=3))

# view the data
plot(uvg, main = "Simulated Univariate Gaussian Data with Changepoints", 
     ylab = "data values", xlab="time point", type= "l", col = "black", cex=0.5)
# show the changepoints on the graph
for(cp in truecps){
  abline(v=cp, col = "green", lwd= 2)
}

1.2 Run Basic Online Changepoint Detection

This section will run the basic onlineCPD function to output an ocpd object from the simulated data.

# running the basic function with all the default settings
ocpd1<- onlineCPD(uvg) 

# view results
ocpd1
#> [1] "1 -variate data."
#> [1] "Attributes returned:"
#>  [1] "R"                 "prevR"             "prevRprod"        
#>  [4] "prevRsum"          "prevDataPt"        "time"             
#>  [7] "ocpd_settings"     "threshcps"         "max"              
#> [10] "update_paramsT"    "update_params0"    "init_params"      
#> [13] "logprobmaxes"      "logprobcps"        "currmu"           
#> [16] "changepoint_lists"
#> [1] "Changepoints:"
#> [[1]]
#> [1]  1 51 71

Printing the results shows a list of the attributes returned, and the main finding which is the output list of changepoints.

Following is some explanation of each attribute returned.

1.2.1 Main Outputs

The main output from the OCPD objects are the changepoints list and the R matrix if the it is chosen to be returned.

  • Changepoints List: list of changepoints computed from different approaches
    • colmaxes: the list of changepoints taken from the maximum probability in each column of the R matrix
    • threshchps: the list of changepoints that have run length probability above a threshold value, e.g. 0.5
    • maxCPs: the list of changepoints with the highest overall probability

The list of possible changepoints can be accessed from the attribute “changepoint_lists”. The code and output from this example is shown below.

cpdf<- data.frame(method=names(ocpd1$changepoint_lists))
cpdf$changepoints<- unlist(ocpd1$changepoint_lists, recursive = FALSE)
kable(cpdf)
method changepoints
colmaxes c(1, 51, 71)
threshcps c(1, 51, 71)
maxCPs c(1, 51, 71)

1.2.2 Additional Outputs

Additional outputs are also returned - these are needed for running the algorithm online since this algorithm can work being called point by point to build on a pre-existing ocpd object, but not needed for interpreting the final results. These outputs are:

  • ocpd_settings: stores the settings used in running the ocpd function
  • prevR: the column of the R matrix for the previous step
  • prevRprod, prevRsum: versions of the R matrix containing cumulative product or sum along diagonals
  • prevDataPt: the previous data point processed - needed for replacing missing data options
  • time: the total number of points processed within this ocpd object
  • update_paramsT: the current parameters used in the UPM update functions
  • update_params0: the initial parameters used in the UPM update functions
  • init_params: the params used to initialize the UMP update_params0
  • logprobcps, logprobmaxes: the list of changepoints with the highest probability at each step, and the probability associated respectively
  • currmu: the current mu compute after each timepoint

1.3 Online Changepoint Detection Function Options

This section will explain the various options possible to set when running the online changepoint detction function. The main choices to be made in running the function are:

  • speed:
    • Fastest method (\(O(n)\)): do not return any unnecessary information and truncate the R vector, set \(getR=FALSE\), \(optionalOutputs=FALSE\), \(truncRlim = 10^{(-4)}\)
    • Running without truncation (\(O(n^2)\)): to prevent truncation, set \(truncRlim =0\)
    • Returning the R matrix (\(O(n^3)\)): set \(getR=TRUE\) to end up storing the computed R matrix in the ocpd object
  • truncation:
    • truncation can be done by probability threshold (truncRlim)
    • hard constraints on length can also be set: maxRlength (max length allowed before truncating), minRlength (min length required before starting to truncate)
  • probability model:
    • model choice specified by name or letter, e.g. \(probModel=list("g")\), or \(probModel=list("gaussian")\)
    • the initialization settings for the probability model: \(init\_params=list(list(m=0, k=0.01, a=0.01, b=0.0001))\)
  • mutlivariate: set \(multivariate=TRUE\) if the data is multivariate

The following sections show how to run the algorithm with various configurations.

1.3.1 Settings affecting speed: getR, truncRlim, optionalOutputs

The following code shows some examples how to run the function with different configurations.

# slowest mode: saving R matrix and optional outputs, no truncation
ocpd2<- onlineCPD(uvg, getR=TRUE, optionalOutputs = TRUE,
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})

# plot the results, the R matrix will be visible
plot(ocpd2, cplistID = 3)

# faster mode: not saving R matrix, with optional outputs, with truncation based on probability
ocpd3<- onlineCPD(uvg, getR=FALSE, optionalOutputs = TRUE, truncRlim = 10^(-4),
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})

# plot the results, the R matrix will not be visible
plot(ocpd3, cplistID = 3)

# faster mode: not saving R matrix, without optional outputs, with truncation based on length
ocpd4<- onlineCPD(uvg, getR=FALSE, optionalOutputs = FALSE, maxRlength = 200,
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})

# plot the results, the R matrix will not be visible,
# also the data will not be visible, because its stored in optional outputs
plot(ocpd4, cplistID = 3)

1.3.2 Settings affecting truncation: truncRlim, maxRlength, minRlength

# truncating with min and max length, and probability threshold
ocpd5<- onlineCPD(uvg, getR=FALSE, optionalOutputs = FALSE, 
                  maxRlength = 100, minRlength = 2, truncRlim = 10^(-2),
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})

# plot the results, the R matrix will not be visible,
# also the data will not be visible, because its stored in optional outputs
plot(ocpd5, cplistID = 3)

1.3.3 Settings affecting probability model: probModel, init_params

1.3.3.1 Gaussian Univariate

# example with gaussian
ocpd6<- onlineCPD(uvg, getR=FALSE, optionalOutputs = FALSE, 
                  probModel=list("gaussian"), init_params=list(list(m=0, k=0.01, a=0.01, b=0.0001)),
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})
#note: probModel=list("g") also works

# plot the results, the R matrix will not be visible,
# also the data will not be visible, because its stored in optional outputs
plot(ocpd6, cplistID = 3)

1.3.3.2 Gaussian Multivariate

mvg<- matrix(ncol = 2, nrow = length(uvg))
mvg[,1]<-uvg
mvg[,2]<-uvg*3
# example with gaussian multivariate, set: multivariate = TRUE
ocpd6<- onlineCPD(mvg, getR=FALSE, optionalOutputs = FALSE, multivariate = TRUE,
                  probModel=list("g"), # note: can input one specification of probability model if is the same for every dimension
                  init_params=list(list(m=0, k=0.01, a=0.01, b=0.0001)),
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})
#note: probModel=list("g") also works

# plot the results, the R matrix will not be visible,
# also the data will be visible, because its being input manually as data=mvg
plot(ocpd6, data=mvg, cplistID = 3)

1.3.3.3 Poisson Univariate

# example with poisson

# sim poisson data:
truecps<- c(1, 51, 71, 121)
#simulate the data
set.seed(1)
uvp<- c(rpois(n=diff(truecps)[1], lambda = 1), 
        rpois(n=diff(truecps)[2], lambda =20),
        rpois(n=diff(truecps)[3], lambda =10))

ocpd7<- onlineCPD(uvp, getR=FALSE, optionalOutputs = FALSE, 
                  probModel=list("p"), init_params=list(list(a=1, b=1)),
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})

# plot the results, the R matrix will not be visible,
# also the data willbe visible, because its being input manually as data=uvp
plot(ocpd7, data=uvp, cplistID = 3)

1.3.3.4 Gaussian and Poisson Multivariate

mvgp<- matrix(ncol = 3, nrow = length(uvg))
mvgp[,1]<-uvg
mvgp[,2]<-uvg*3
mvgp[,3]<-uvp
# example with gaussian multivariate, set: multivariate = TRUE
ocpd6<- onlineCPD(mvgp, getR=FALSE, optionalOutputs = FALSE, multivariate = TRUE,
                  probModel=list("g", "g", "p"), 
                  init_params=list(list(m=0, k=0.01, a=0.01, b=0.0001), list(m=0, k=0.01, a=0.01, b=0.0001), list(a=1, b=1)),
                  hazard_func=function(x, lambda){const_hazard(x, lambda=100)})

# plot the results, the R matrix will not be visible,
# also the data will be visible, because its being input manually as data=mvgp
plot(ocpd6, data=mvgp, cplistID = 3)

1.4 Missing Data Options

There are a few missing data options that can be specified when running the onlineCPD algorithm, by setting the option “missPts”. The possible options for this are:

  • “none”: will bypass checking for skipped points
  • “mean”: will replace a missing point with the mean since the most likely changepoint
  • “prev”: simply replaces with the previous point
  • “skip”: will skip the point, ie by not updating the parameters in the distributions functions
  • numeric value: a custom replacement choice can be input as a numeric value