Using familiar prospectively

Alex Zwanenburg

2022-12-16

library(familiar)
library(data.table)

set.seed(19)

The summon_familiar function is used to generate models for a given dataset and provide a broad analysis afterwards. This creates a number of files. Many of the files resulting from this analysis can also be used outside of summon_familiar. For example, models and ensembles can be used prospectively to assess new datasets. Likewise, data and collection objects can be used to customise plotting and export to tables.

In this example we will use the birthweight dataset collected in 1986 at Baystate Medical Center, Springfield, Massachusetts. This dataset contains birth weight data from 189 newborn children, some of which have low birth weight (less then 2.5 kg). Potential risk factors were also collected. We will try to predict the low birth weight indicator using these risk factors.

We will first randomly split the data into development and validation datasets:

# Load the birth weight data set
data <- data.table::as.data.table(MASS::birthwt)

# Add sample and batch identifiers.
data[,":="("sample_id"=.I)]

# Generate training and validation samples
train_samples <- sample(data$sample_id, size=120, replace=FALSE)
valid_samples <- setdiff(data$sample_id, train_samples)

# Assign batch identifiers.
data[sample_id %in% train_samples, "batch_id":="development"]
data[sample_id %in% valid_samples, "batch_id":="validation"]

Next we will prepare the dataset further. We drop the bwt column as this directly contains the indicator we are trying to predict. Other columns are encoded as categorical variables.

# Drop the bwt column.
data[, "bwt":=NULL]

# Encode the low outcome column: we ensure that "low" is now the so-called
# positive class.
data$low <- factor(data$low, levels=c(0, 1), labels=c("normal", "low"))

# Encode race. smoke, ht, and ui columns as categorical variables.
data$race <- factor(data$race, levels=c(1, 2, 3), labels=c("white", "black", "other"))
data$smoke <- factor(data$smoke, levels=c(0, 1), labels=c("no", "yes"))
data$ht <- factor(data$ht, levels=c(0, 1), labels=c("no", "yes"))
data$ui <- factor(data$ui, levels=c(0, 1), labels=c("no", "yes"))

# Rename columns to make them clearer.
data.table::setnames(
  data,
  old=c("low", "age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv"),
  new=c("birth_weight", "age_mother", "weight_mother_before_pregnancy",
        "ethnicity", "smoking_during_pregnancy", "previous_premature_labours",
        "hypertension_history", "uterine_irritability", "physician_visits_first_trimester"))

Then we call summon_familiar create models for the data and assess these. We will create an ensemble of five penalised logistic regression models to predict the low birth weight indicator, based on bootstraps of the development dataset. You may notice that we here write to the temporary R directory using the tempdir() function. In practice you will want to use a different directory, as the temporary R directory will be deleted once your R session closes. For speed, we will also only compute point estimates during evaluation (see the evaluation and explanation vignette for other options).

familiar::summon_familiar(
  data=data,
  project_dir=tempdir(),
  sample_id_column="sample_id",
  batch_id_column="batch_id",
  development_batch_id="development",
  outcome_type="binomial",
  outcome_column="birth_weight",
  experimental_design="bs(fs+mb,5) + ev",
  cluster_method="none",
  fs_method="none",
  learner="lasso",
  parallel=FALSE,
  estimation_type="point")

Using models prospectively

Models generated by familiar are stored in subdirectories of the trained_models folder:

# Create path to the directory containing the models.
model_directory_path <- file.path(tempdir(), "trained_models", "lasso", "none")

# List files present in the directory.
list.files(model_directory_path)
#>  [1] "20221216133548_hyperparameters_lasso_none_2_1.RDS" "20221216133548_hyperparameters_lasso_none_2_2.RDS" "20221216133548_hyperparameters_lasso_none_2_3.RDS" "20221216133548_hyperparameters_lasso_none_2_4.RDS"
#>  [5] "20221216133548_hyperparameters_lasso_none_2_5.RDS" "20221216133548_lasso_none_1_1_ensemble.RDS"        "20221216133548_lasso_none_2_1_model.RDS"           "20221216133548_lasso_none_2_2_model.RDS"          
#>  [9] "20221216133548_lasso_none_2_3_model.RDS"           "20221216133548_lasso_none_2_4_model.RDS"           "20221216133548_lasso_none_2_5_model.RDS"

There are 5 models in the directory, which are stored in RDS format in files ending with *_model.RDS. We can inspect the first model in the directory.

# Create path to the model.
model_path <- file.path(model_directory_path, list.files(model_directory_path, pattern="model")[1])

# Load the model.
model <- readRDS(model_path)
model
#> A lasso model (class: familiarGLMnetLasso; v1.4.1) trained using glmnet (v4.1.6) package.
#> 
#> --------------- Model details ---------------
#> 
#> 
#> ---------------------------------------------
#> 
#> The following outcome was modelled:
#> birth_weight (binomial), with classes: normal (reference) and low.
#> 
#> The model was trained using the following hyperparameters:
#>   sign_size: 8
#>   family: binomial
#>   lambda_min: lambda.min
#>   n_folds: 7
#>   normalise: FALSE
#>   sample_weighting: inverse_number_of_samples
#>   sample_weighting_beta: -2
#> 
#> Variable importance was determined using the none variable importance method.
#> 
#> The following features were used in the model:
#> age_mother (numeric):
#>   transformation (yeo_johnson_robust) with λ = -0.5.
#>   normalisation (standardisation_robust) with shift = 1.58921828040236 and scale = 0.0489241429455001.
#> weight_mother_before_pregnancy (numeric):
#>   transformation (yeo_johnson_robust) with λ = -1.2.
#>   normalisation (standardisation_robust) with shift = 0.830775375081118 and scale = 0.000576197086994417.
#> ethnicity (categorical), with levels: white (reference), black and other.
#> smoking_during_pregnancy (categorical), with levels: no (reference) and yes.
#> physician_visits_first_trimester (numeric):
#>   transformation (yeo_johnson_robust) with λ = -0.8.
#>   normalisation (standardisation_robust) with shift = 0.310658008586882 and scale = 0.392221755361211.
#> previous_premature_labours (numeric).
#> hypertension_history (categorical), with levels: no (reference) and yes.
#> uterine_irritability (categorical), with levels: no (reference) and yes.
#> 
#> A novelty detector was trained using the model features.

This model can then be used to predict values for a given dataset, among other things. The predict method used by familiar is in many ways similar to other predict methods. However, familiar requires that the newdata argument is set. It does not store development data with its models to limit model size and prevent leaking sensitive information. Predictions can be made as follows:

predict(object=model, newdata=data)
#>      predicted_class_probability_normal predicted_class_probability_low predicted_class
#>   1:                          0.5313997                       0.4686003          normal
#>   2:                          0.7986109                       0.2013891          normal
#>   3:                          0.3311981                       0.6688019             low
#>   4:                          0.3672995                       0.6327005             low
#>   5:                          0.2523367                       0.7476633             low
#>  ---                                                                                   
#> 185:                          0.4933279                       0.5066721             low
#> 186:                          0.4287156                       0.5712844             low
#> 187:                          0.3519040                       0.6480960             low
#> 188:                          0.4746111                       0.5253889             low
#> 189:                          0.3753937                       0.6246063             low

In addition to default predictions, familiar allows for several different types of prediction by setting the type argument. These are:

For example, we can predict novelty of the samples as follows:

predict(object=model, newdata=data, type="novelty")
#>        novelty
#>   1: 0.5368978
#>   2: 0.5241009
#>   3: 0.4273222
#>   4: 0.5396562
#>   5: 0.5263258
#>  ---          
#> 185: 0.5095876
#> 186: 0.5713185
#> 187: 0.4789844
#> 188: 0.5232316
#> 189: 0.5382750

More powerful however, is the ability to perform any of the evaluation and explanation steps for a new dataset, including new settings. By default, all evaluation and explanation steps are conducted using the settings defined when running summon_familiar. In this case that means that point estimates will be computed.

Let us for example compute and plot model performance AUC-ROC and accuracy for the model.

plots <- familiar::plot_model_performance(
  object=model,
  draw=TRUE,
  facet_by="metric",
  data=data[batch_id=="validation"],
  metric=c("auc", "accuracy"))
#> Warning in (new("standardGeneric", .Data = function (object, draw = FALSE, : Creating a violinplot requires bias-corrected estimates or bootstrap confidence interval estimates instead of point estimates.

You may notice that no plot is produced. This is because the type of plot violin_plot and the estimation_type inherited from the model are incompatible. However, nothing prevents us from changing the estimation type to bootstrap confidence intervals bci.

# Draw model performance plots with bootstrap confidence intervals.
# familiar_data_names argument specifies the name that appears below the plot.
# The default is rather long.
plots <- familiar::plot_model_performance(
  object=model,
  draw=TRUE,
  facet_by="metric",
  data=data[batch_id=="validation"],
  estimation_type="bci",
  metric=c("auc", "accuracy"),
  familiar_data_names="lasso")

plot of chunk prospective-model-performance-plot

Note that we can achieve the same result without explicitly importing the model. Providing the path to the model as the object argument suffices:

plots <- familiar::plot_model_performance(
  object=model_path,
  draw=TRUE,
  facet_by="metric",
  data=data[batch_id=="validation"],
  estimation_type="bci",
  metric=c("auc", "accuracy"),
  familiar_data_names="lasso")

plot of chunk prospective-model-performance-plot-implicit

Using ensembles of models prospectively

The five models created in the example form an ensemble. Instead of investigating the models separately, we can also evaluate the model ensemble. Model ensembles generated by familiar are stored in the same subdirectory of the trained_models folder as their constituent models:

# List files present in the directory.
list.files(model_directory_path)
#>  [1] "20221216133548_hyperparameters_lasso_none_2_1.RDS" "20221216133548_hyperparameters_lasso_none_2_2.RDS" "20221216133548_hyperparameters_lasso_none_2_3.RDS" "20221216133548_hyperparameters_lasso_none_2_4.RDS"
#>  [5] "20221216133548_hyperparameters_lasso_none_2_5.RDS" "20221216133548_lasso_none_1_1_ensemble.RDS"        "20221216133548_lasso_none_2_1_model.RDS"           "20221216133548_lasso_none_2_2_model.RDS"          
#>  [9] "20221216133548_lasso_none_2_3_model.RDS"           "20221216133548_lasso_none_2_4_model.RDS"           "20221216133548_lasso_none_2_5_model.RDS"

In this case there is only one ensemble in the directory, which is stored in RDS format in a file ending with *_ensemble.RDS. Some alternative experiment designs, e.g. ones involving cross-validation, can lead to multiple ensembles being formed. We can inspect the ensemble in the directory.

# Create path to the model.
ensemble_path <- file.path(
  model_directory_path,
  list.files(model_directory_path, pattern="ensemble")[1])

# Load the model.
ensemble <- readRDS(ensemble_path)
ensemble
#> An ensemble of 5 lasso models (v1.4.1).
#> 
#> The following outcome was modelled:
#> birth_weight (binomial), with classes: normal (reference) and low.
#> 
#> Variable importance was determined using the none variable importance method.
#> 
#> The following features were used in the ensemble:
#> age_mother (numeric).
#> weight_mother_before_pregnancy (numeric).
#> ethnicity (categorical), with levels: white (reference), black and other.
#> smoking_during_pregnancy (categorical), with levels: no (reference) and yes.
#> physician_visits_first_trimester (numeric).
#> previous_premature_labours (numeric).
#> hypertension_history (categorical), with levels: no (reference) and yes.
#> uterine_irritability (categorical), with levels: no (reference) and yes.

One can use ensembles of models for prediction:

predict(object=ensemble, newdata=data)
#>      predicted_class_probability_normal predicted_class_probability_low predicted_class
#>   1:                          0.5313997                       0.4686003          normal
#>   2:                          0.7194464                       0.2805536          normal
#>   3:                          0.3809461                       0.6190539             low
#>   4:                          0.4149866                       0.5850134             low
#>   5:                          0.3574783                       0.6425217             low
#>  ---                                                                                   
#> 185:                          0.4661910                       0.5338090             low
#> 186:                          0.4287156                       0.5712844             low
#> 187:                          0.3519040                       0.6480960             low
#> 188:                          0.4746111                       0.5253889             low
#> 189:                          0.4795694                       0.5204306             low

Ensembles behave similarly to models during evaluation and explanation steps:

plots <- familiar::plot_model_performance(
  object=ensemble,
  draw=TRUE,
  facet_by="metric",
  data=data[batch_id=="validation"],
  estimation_type="bci",
  metric=c("auc", "accuracy"),
  familiar_data_names="lasso")

plot of chunk prospective-model-performance-plot-ensemble

There is one important limitation to using ensembles. Normally, when loading an ensemble, the models are not attached to the ensemble. Instead, the model_list attribute of the ensemble object contains a list of paths to the location of the model files at creation. Thus, if you move these files, the ensemble can no longer find and attach the models. There are two ways to avoid this issue.

The first way is to use the update_model_dir_path() method to point the ensemble to the new directory. The second, more generic way, is to create an ensemble on the fly from the underlying models. To do so, we supply a list of models, or paths to these models as the object argument for plot methods and tabular export methods.

# Generate paths to the model objects.
model_paths <- sapply(list.files(model_directory_path, pattern="model"), function(x) (file.path(model_directory_path, x)))

# Generate plot using an ad-hoc ensemble.
plots <- familiar::plot_model_performance(
  object=model_paths,
  draw=TRUE,
  facet_by="metric",
  data=data[batch_id=="validation"],
  estimation_type="bci",
  metric=c("auc", "accuracy"),
  familiar_data_names="lasso")

plot of chunk prospective-model-performance-plot-new-directory

Customising exports and plots

Familiar also produces data and collection objects. Data objects hold the processed evaluation data derived from a particular data set and are found in the familiar_data folder.

list.files(file.path(tempdir(), "familiar_data"))
#> [1] "20221216133548_lasso_none_1_1_ensemble_1_1_validation_data.RDS" "20221216133548_lasso_none_1_1_pool_1_1_development_data.RDS"    "20221216133548_lasso_none_1_1_pool_1_1_validation_data.RDS"

In our example, we generated three data objects: one for internal development, one for internal validation, and one for external validation data. These separate data objects are collected in a collection object, which is found in the familiar_collections folder.

list.files(file.path(tempdir(), "familiar_collections"))
#> [1] "pooled_data.RDS"

There is typically only one collection in this location, but more may exist if summon_familiar is called with evaluate_top_level_only=FALSE.

Note that data and collection objects are static. We cannot use them as flexibly as models and ensembles. For example, we cannot assess different performance metrics using the data stored in familiar data and collection objects or use these objects to predict outcomes for new datasets. Below are exceptions to this rule:

The primary use of data and collection objects is for customising plotting. For example, the AUC-ROC curves for each species are plotted using the default palette in familiar, and a custom theme based on cowplot::theme_cowplot. We can re-create the plot using the standard R palette, and a different theme to alter its appearance.

collection <- file.path(tempdir(), "familiar_collections", "pooled_data.RDS")

plots <- plot_auc_roc_curve(object=collection,
                            ggtheme=ggplot2::theme_dark(),
                            discrete_palette="R4",
                            draw=TRUE)

plot of chunk prospective-auc_roc-plot

Exporting collections

Generating plot data can take a non-trivial amount of time. Hence it may be preferable to have the collection object available so that plots can be altered and created more quickly. To do so, we can call an export or plot method with export_collection=TRUE.

# Generate paths to the model objects.
model_paths <- sapply(
  list.files(model_directory_path, pattern="model"),
  function(x) (file.path(model_directory_path, x)))

# Generate plot data (plots + collection) using an ad-hoc ensemble.
plot_data <- familiar::plot_auc_roc_curve(
  object=model_paths,
  draw=FALSE,
  data=data[batch_id=="validation"],
  export_collection=TRUE)

The plot_data variable is a list that contains (here) two items: collection and plot_list. The collection object can be used to alter plot elements, such as the theme and palette.

plots <- familiar::plot_auc_roc_curve(
  object=plot_data$collection,
  ggtheme=ggplot2::theme_dark(),
  discrete_palette="R4",
  draw=TRUE)

plot of chunk prospective-auc_roc-plot-from-exported-collection