Introduction to Lift Chart, ROC Curve and Word Cloud

Peter Hurford, Thakur Raj Anand, Chester Ismay

2024-03-13

The V2.7 release of the DataRobot API added the following model insights: Lift Chart, ROC Curve, and Word Cloud.

Insights provided by Lift Chart and ROC Curves are helpful in checking the performance of machine learning models. Word clouds are helpful for understanding useful words and phrases generated after applying different NLP techniques to unstructured data. We will explore each one of these in detail.

Connecting to DataRobot

To access the DataRobot modeling engine, you must establish an authenticated connection, which can be done in one of two ways. Both require the same two pieces of information: an endpoint, which is the URL of the specific DataRobot server being used, and a token, which is a previously validated access token.

The token is unique for each DataRobot modeling engine account and can be found in the account profile section of the DataRobot webapp.

The endpoint depends on the DataRobot modeling engine installation (cloud-based vs. on-premise) you are using. If you do not know which endpoint to use, contact your DataRobot admin. The endpoint for DataRobot cloud accounts is https://app.datarobot.com/api/v2.

The first access method uses a YAML configuration file with these two elements - labeled token and endpoint - located at $HOME/.config/datarobot/drconfig.yaml. If this file exists when the datarobot package is loaded, a connection to the DataRobot modeling engine is automatically established during library(datarobot). It is also possible to establish a connection using this YAML file via the ConnectToDataRobot() function, by specifying the configPath parameter.
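As a minimal sketch of the first method, the snippet below writes a configuration file containing the two keys named above and prints it back. The endpoint and token values are placeholders, and the file goes to a temporary path (an assumption made for illustration) rather than $HOME/.config/datarobot/drconfig.yaml, so nothing on your machine is overwritten.

```r
# Write a two-key YAML config to a temporary location (placeholder values).
configPath <- file.path(tempdir(), "drconfig.yaml")
configLines <- c(
  "endpoint: https://app.datarobot.com/api/v2",
  "token: <YOUR API TOKEN>"
)
writeLines(configLines, configPath)

# Inspect the file that was written.
cat(readLines(configPath), sep = "\n")

# With the datarobot package loaded, the file could then be used like this:
# ConnectToDataRobot(configPath = configPath)
```

The ConnectToDataRobot call is left commented out because it needs a reachable server and a real token.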

The second method of establishing a connection to the DataRobot modeling engine is to call the function ConnectToDataRobot with the endpoint and token parameters.

library(datarobot)
ConnectToDataRobot(endpoint = "http://<YOUR DR SERVER>/api/v2", token = "<YOUR API TOKEN>")

Data

We will be using the Lending Club dataset, a sample dataset related to credit scoring open-sourced by LendingClub. We can create a project with this dataset like this:

lendingClubURL <- "https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv"
project <- StartProject(dataSource = lendingClubURL,
                        projectName = "AdvancedModelInsightsVignette",
                        mode = "auto",
                        target = "is_bad",
                        workerCount = "max",
                        wait = TRUE)

Once the modeling process has completed, the ListModels function returns an S3 object of class listOfModels that characterizes all of the models in a specified DataRobot project. It is important to call WaitForAutopilot before calling ListModels, because ListModels returns only a partial list (and a warning) if Autopilot is not yet complete.

results <- as.data.frame(ListModels(project))
saveRDS(results, "resultsModelInsights.rds")
library(knitr)
kable(head(results), longtable = TRUE, booktabs = TRUE, row.names = TRUE)
|   | modelType | expandedModel | modelId | blueprintId | featurelistName | featurelistId | samplePct | validationMetric |
|---|---|---|---|---|---|---|---|---|
| 1 | Gradient Boosted Trees Classifier with Early Stopping | Gradient Boosted Trees Classifier with Early Stopping::Ordinal encoding of categorical variables::Converter for Text Mining::Auto-Tuned Word N-Gram Text Modeler using token occurrences::Missing Values Imputed | 5efa1dcfe157256402b66684 | 76406c9c52dc3f6a3a0ba8442fa17601 | Informative Features | 5efa1bd3f0f49455b0ccd765 | 64 | 0.36472 |
| 2 | eXtreme Gradient Boosted Trees Classifier with Early Stopping | eXtreme Gradient Boosted Trees Classifier with Early Stopping::Ordinal encoding of categorical variables::Converter for Text Mining::Auto-Tuned Word N-Gram Text Modeler using token occurrences::Missing Values Imputed | 5efa1dd0e157256402b66694 | 5964b39390e51b69a82d9a8dab7b2675 | Informative Features | 5efa1bd3f0f49455b0ccd765 | 64 | 0.36562 |
| 3 | ENET Blender | ENET Blender::Elastic-Net Classifier (L2 / Binomial Deviance) | 5efa23c020433938c72a0153 | 81092c05cb849904f6b737b767799660 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36564 |
| 4 | AVG Blender | AVG Blender::Average Blender | 5efa23be20433938c72a014f | c294bee1a436f6f034fd680aa752b9d5 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36566 |
| 5 | ENET Blender | ENET Blender::Elastic-Net Classifier (L2 / Binomial Deviance) | 5efa23c020433938c72a0155 | 83d1a0ca93741bd8ef06bfc47c75ac33 | Multiple featurelists | Multiple featurelist ids | 64 | 0.36567 |
| 6 | Advanced AVG Blender | Advanced AVG Blender::Average Blender | 5efa23c020433938c72a0151 | c40db7cd1b9d3ee12d17c0369639cb3a | Multiple featurelists | Multiple featurelist ids | 64 | 0.36639 |

Lift Chart

Lift chart data can be retrieved for a specific data partition (validation, cross-validation, or holdout) using GetLiftChart, or for all data partitions using ListLiftCharts. The holdout partition must be unlocked before its data can be retrieved.

Let’s retrieve the validation partition data for the top model using GetLiftChart. By default, GetLiftChart returns data for the validation partition; to retrieve data for a different partition, pass a value to the source parameter.

project <- GetProject("5eed0d790ef80408ae212f09")
allModels <- ListModels(project)
saveRDS(allModels, "modelsModelInsights.rds")
modelFrame <- as.data.frame(allModels)
metric <- modelFrame$validationMetric
if (project$metric %in% c('AUC', 'Gini Norm')) {
  bestIndex <- which.max(metric)
} else {
  bestIndex <- which.min(metric)
}
bestModel <- allModels[[bestIndex]]
bestModel$modelType

[1] "Gradient Boosted Greedy Trees Classifier with Early Stopping"

This selects a Gradient Boosted Greedy Trees Classifier with Early Stopping model.

The lift chart data we retrieve from the server includes the mean of the model prediction and the mean of the actual target values, sorted by the prediction values in ascending order and split into up to 60 bins.

lc <- GetLiftChart(bestModel)
saveRDS(lc, "liftChartModelInsights.rds")
head(lc)
      actual  predicted binWeight
1 0.00000000 0.01877918        27
2 0.03703704 0.02476968        27
3 0.00000000 0.02867826        26
4 0.00000000 0.03207965        27
5 0.07407407 0.03540244        27
6 0.03846154 0.03865136        26
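To make the structure of this output concrete, here is a rough sketch of how data of this shape could be built locally: sort rows by prediction, split them into equal-sized bins, and average the predictions and actuals within each bin. The data are simulated, the bin count of 10 is an arbitrary choice, and this is not necessarily how the DataRobot server computes its bins.

```r
set.seed(42)
n <- 600

# Simulated predicted probabilities and binary outcomes (illustrative only).
predicted <- runif(n)
actual <- rbinom(n, size = 1, prob = predicted)

# Sort rows by prediction in ascending order.
ord <- order(predicted)
predicted <- predicted[ord]
actual <- actual[ord]

# Assign the sorted rows to equal-sized bins (10 here, chosen arbitrarily).
nBins <- 10
bin <- ceiling(seq_len(n) * nBins / n)

# Per-bin mean of actuals and predictions, plus the row count per bin.
liftSketch <- data.frame(
  actual = as.numeric(tapply(actual, bin, mean)),
  predicted = as.numeric(tapply(predicted, bin, mean)),
  binWeight = as.numeric(table(bin))
)
print(liftSketch)
```

For a well-calibrated model, the actual and predicted columns should track each other closely from the lowest bin to the highest.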

ValidationLiftChart <- GetLiftChart(bestModel, source = "validation")
dr_dark_blue <- "#08233F"
dr_blue <- "#1F77B4"
dr_orange <- "#FF7F0E"

# Function to plot lift chart
library(data.table)
LiftChartPlot <- function(ValidationLiftChart, bins = 10) {
  if (60 %% bins == 0) {
    ValidationLiftChart$bins <- rep(seq(bins), each = 60 / bins)
    ValidationLiftChart <- data.table(ValidationLiftChart)
    ValidationLiftChart[, actual := mean(actual), by = bins]
    ValidationLiftChart[, predicted := mean(predicted), by = bins]
    unique(ValidationLiftChart[, -"binWeight"])
  } else {
    stop("bins must be a divisor of 60")
  }
}
LiftChartData <- LiftChartPlot(ValidationLiftChart)
saveRDS(LiftChartData, "LiftChartDataVal.rds")
par(bg = dr_dark_blue)
# Column names are lowercase (actual, predicted), matching GetLiftChart output
plot(LiftChartData$actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$predicted, col = dr_blue, pch = 20, type = "b")
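The LiftChartPlot helper above takes a plain mean over the bins it merges, which ignores the fact that bins can hold different numbers of rows (26 versus 27 in the output shown earlier). As a hedged alternative, the sketch below merges bins using binWeight as the weight. RebinWeighted is a hypothetical helper written for this example, not part of the datarobot package, and the toy values are made up.

```r
# Toy lift chart with the same columns as GetLiftChart output (values made up).
toyLift <- data.frame(
  actual    = c(0.00, 0.10, 0.20, 0.40),
  predicted = c(0.05, 0.10, 0.25, 0.35),
  binWeight = c(10, 30, 20, 20)
)

# Hypothetical helper: merge consecutive bins, weighting each by binWeight.
RebinWeighted <- function(liftChart, bins) {
  stopifnot(nrow(liftChart) %% bins == 0)
  group <- rep(seq_len(bins), each = nrow(liftChart) / bins)
  pieces <- split(liftChart, group)
  merged <- lapply(pieces, function(piece) {
    data.frame(
      actual = weighted.mean(piece$actual, piece$binWeight),
      predicted = weighted.mean(piece$predicted, piece$binWeight),
      binWeight = sum(piece$binWeight)
    )
  })
  do.call(rbind, merged)
}

rebinned <- RebinWeighted(toyLift, bins = 2)
print(rebinned)
```

When all bins have nearly equal weights, as in the output above, the weighted and unweighted results differ only slightly.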

All the available lift chart data can be retrieved using ListLiftCharts. Here is an example retrieving data for all the available partitions, followed by plotting the cross validation partition:

AllLiftChart <- ListLiftCharts(bestModel)
LiftChartData <- LiftChartPlot(AllLiftChart[["crossValidation"]])
saveRDS(LiftChartData, "LiftChartDataCV.rds")
par(bg = dr_dark_blue)
# Column names are lowercase (actual, predicted), matching GetLiftChart output
plot(LiftChartData$actual, col = dr_orange, pch = 20, type = "b",
     main = "Lift Chart", xlab = "Bins", ylab = "Value")
lines(LiftChartData$predicted, col = dr_blue, pch = 20, type = "b")

We can also plot the lift chart using ggplot2:

library(ggplot2)
# actual and predicted are already per-bin means, so they are not divided by binWeight
lc <- lc[order(lc$predicted), ]
lc$binWeight <- NULL
lc <- data.frame(value = c(lc$actual, lc$predicted),
                 variable = c(rep("Actual", length(lc$actual)),
                              rep("Predicted", length(lc$predicted))),
                 id = rep(seq_along(lc$actual), 2))
ggplot(lc) + geom_line(aes(x = id, y = value, color = variable))

ROC Curve Data

The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
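The definition above can be illustrated with a small sketch that computes the TPR and FPR by hand at a series of thresholds. The scores and labels are made up, and the rule "classify as positive when the score is at least the threshold" is an assumption of this example.

```r
# Toy scores and labels (made up for illustration).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

# Candidate thresholds: every distinct score, plus the two extremes.
thresholds <- c(1, sort(unique(scores), decreasing = TRUE), 0)

# Classify as positive when score >= threshold, then compute both rates.
rocSketch <- do.call(rbind, lapply(thresholds, function(th) {
  predictedPositive <- scores >= th
  data.frame(
    threshold = th,
    truePositiveRate = sum(predictedPositive & labels == 1) / sum(labels == 1),
    falsePositiveRate = sum(predictedPositive & labels == 0) / sum(labels == 0)
  )
}))
print(rocSketch)
```

Lowering the threshold can only add predicted positives, so both rates move from 0 toward 1 as the threshold decreases.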

ROC curve data can be retrieved for a specific data partition (validation, cross-validation, or holdout) using GetRocCurve, or for all data partitions using ListRocCurves.

To retrieve ROC curve information, use GetRocCurve:

roc <- GetRocCurve(bestModel)
saveRDS(roc, "ROCCurveModelInsights.rds")

You can then plot the results:

dr_dark_blue <- "#08233F"
dr_roc_green <- "#03c75f"
ValidationRocCurve <- GetRocCurve(bestModel)
ValidationRocPoints <- ValidationRocCurve[["rocPoints"]]
saveRDS(ValidationRocPoints, "ValidationRocPoints.rds")
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(ValidationRocPoints$falsePositiveRate, ValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0,1), xlim = c(0,1),
     pch = 20, type = "b")

All the available ROC curve data can be retrieved using ListRocCurves. Here again is an example to retrieve data for all the available partitions, followed by plotting the cross validation partition:

AllRocCurve <- ListRocCurves(bestModel)
CrossValidationRocPoints <- AllRocCurve[['crossValidation']][['rocPoints']]
saveRDS(CrossValidationRocPoints, 'CrossValidationRocPoints.rds')
par(bg = dr_dark_blue, xaxs = "i", yaxs = "i")
plot(CrossValidationRocPoints$falsePositiveRate, CrossValidationRocPoints$truePositiveRate,
     main = "ROC Curve",
     xlab = "False Positive Rate (Fallout)", ylab = "True Positive Rate (Sensitivity)",
     col = dr_roc_green,
     ylim = c(0, 1), xlim = c(0, 1),
     pch = 20, type = "b")

You can also plot the ROC curve using ggplot2:

ggplot(
  ValidationRocPoints, 
  aes(x = falsePositiveRate, y = truePositiveRate)
) + geom_line()
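Given a set of ROC points, the area under the curve can be approximated with the trapezoid rule. The sketch below uses a made-up data frame with the same two column names used in the plots above; it is an illustration of the calculation, not a statement of how DataRobot computes its AUC metric.

```r
# Toy ROC points (made up) using the column names from the plots above.
toyRocPoints <- data.frame(
  falsePositiveRate = c(0, 0, 0.5, 0.5, 1),
  truePositiveRate  = c(0, 0.5, 0.5, 1, 1)
)

# Order points from left to right along the curve.
pts <- toyRocPoints[order(toyRocPoints$falsePositiveRate,
                          toyRocPoints$truePositiveRate), ]

# Trapezoid rule: segment width times the average height of its two endpoints.
widths <- diff(pts$falsePositiveRate)
heights <- (head(pts$truePositiveRate, -1) + tail(pts$truePositiveRate, -1)) / 2
auc <- sum(widths * heights)
print(auc)
```

The same calculation could be applied to ValidationRocPoints, with the caveat that the result depends on how densely the thresholds sample the curve.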

Threshold operations

You can retrieve the recommended threshold, which is the one that maximizes the F1 score. This is the same threshold that is preselected in DataRobot when you open the “ROC curve” tab.

threshold <- ValidationRocPoints$threshold[which.max(ValidationRocPoints$f1Score)]
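For readers who want to see where an F1 score comes from, the sketch below computes it from confusion-matrix counts at each threshold and then selects the maximizing threshold in the same way as the line above. The data frame and its count column names (truePositives, falsePositives, falseNegatives) are hypothetical and chosen for this example only.

```r
# Toy confusion counts per threshold (values and count column names are made up).
toyPoints <- data.frame(
  threshold      = c(0.8, 0.6, 0.4, 0.2),
  truePositives  = c(10, 25, 35, 40),
  falsePositives = c(2, 8, 30, 60),
  falseNegatives = c(30, 15, 5, 0)
)

# Precision and recall at each threshold.
precision <- toyPoints$truePositives /
  (toyPoints$truePositives + toyPoints$falsePositives)
recall <- toyPoints$truePositives /
  (toyPoints$truePositives + toyPoints$falseNegatives)

# F1 is the harmonic mean of precision and recall.
toyPoints$f1Score <- 2 * precision * recall / (precision + recall)

# Pick the threshold with the highest F1 score.
bestThreshold <- toyPoints$threshold[which.max(toyPoints$f1Score)]
print(toyPoints)
print(bestThreshold)
```

Note that which.max returns the first maximum, so if two thresholds tie on F1, the one listed first is selected.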

You can also estimate metrics for different threshold values. This will produce the same results as updating the threshold on the DataRobot “ROC curve” tab.

ValidationRocPoints[ValidationRocPoints$threshold == tail(Filter(function(x) x > threshold,
                                                                 ValidationRocPoints$threshold),
                                                          1), ]
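If you have a specific threshold in mind rather than a neighbor of the recommended one, a simpler lookup is to take the row whose threshold is closest to your target. This is a sketch on made-up data; targetThreshold is a hypothetical value chosen for illustration.

```r
# Toy ROC points (made up) with a threshold column, as in ValidationRocPoints.
toyRocPoints <- data.frame(
  threshold         = c(0.9, 0.7, 0.5, 0.3, 0.1),
  truePositiveRate  = c(0.2, 0.5, 0.7, 0.9, 1.0),
  falsePositiveRate = c(0.0, 0.1, 0.3, 0.6, 1.0)
)

# Hypothetical target threshold that is not itself present in the data.
targetThreshold <- 0.45

# Select the row whose threshold is closest to the target.
nearestIndex <- which.min(abs(toyRocPoints$threshold - targetThreshold))
nearest <- toyRocPoints[nearestIndex, ]
print(nearest)
```

This approach does not depend on the ordering of the rows, unlike taking the last element of a filtered vector.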

Word Cloud

The word cloud is a type of insight available for some text-processing models for datasets containing text columns. You can get information about how the appearance of each ngram (word or sequence of words) in the text field affects the predicted target value.

This example will show you how to obtain word cloud data and visualize it, similar to how DataRobot visualizes the word cloud in the “Model Insights” tab interface.

The visualization example here uses the modelwordcloud package.

Now let’s find our word cloud:

# Find word-based models by looking for "word" modelType
wordModels <- allModels[grep("Word", lapply(allModels, `[[`, "modelType"))]
wordModel <- wordModels[[1]]
# Get word cloud
wordCloud <- GetWordCloud(project, wordModel$modelId)
saveRDS(wordCloud, "wordCloudModelInsights.rds")

Now we plot it!

# Remove stop words
wordCloud <- wordCloud[!wordCloud$isStopword, ]

# Specify colors similar to what DataRobot produces for 
# a wordcloud in Insights
colors <- readRDS("colors.rds")

# Make word cloud (wordcloud() comes from the modelwordcloud package)
library(modelwordcloud)
suppressWarnings(
  wordcloud(words = wordCloud$ngram,
            freq = wordCloud$frequency,
            coefficients = wordCloud$coefficient,
            colors = colors,
            scale = c(3, 0.3))
)
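Alongside the plot, it can be useful to list the ngrams with the strongest coefficients. The sketch below does this on a made-up data frame that reuses the column names from the code above (ngram, coefficient, frequency, isStopword). It assumes that a larger positive coefficient indicates a stronger association with higher predicted target values, and a larger negative one with lower values.

```r
# Toy word cloud data (values made up) with the columns used above.
toyWordCloud <- data.frame(
  ngram       = c("the", "late payment", "consolidate", "wedding", "business"),
  coefficient = c(0.01, 0.80, -0.30, -0.65, 0.45),
  frequency   = c(1.00, 0.20, 0.35, 0.05, 0.15),
  isStopword  = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Drop stop words, then rank the remaining ngrams by coefficient.
toyWordCloud <- toyWordCloud[!toyWordCloud$isStopword, ]
ranked <- toyWordCloud[order(toyWordCloud$coefficient, decreasing = TRUE), ]

# Strongest positive and strongest negative ngrams.
topPositive <- head(ranked$ngram, 2)
topNegative <- head(rev(ranked$ngram), 2)
print(topPositive)
print(topNegative)
```

The same ranking could be applied to the wordCloud object retrieved earlier, for example to label the most influential terms next to the plot.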