Multi Class vtreat

John Mount

2024-06-12

vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Let’s work a specific example: trying to model multi-class y as a function of x1, x2, and x3.

library("vtreat")

## Loading required package: wrapr

# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_along(sym_bonuses3))
n_row <- 1000
d <- data.frame(
  x1 = rnorm(n_row),
  x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
  x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
  y = "NoInfo",
  stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] > 
      pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] > 
      pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"

knitr::kable(head(d))

x1	x2	x3	y
0.8178292	a	2	NoInfo
0.5867139	c	1	NoInfo
-0.6711920	a	3	Large2
0.1033166	a	2	Large1
-0.3182176	c	3	Large2
-0.5914308	c	2	NoInfo

We define the problem controls and use mkCrossFrameMExperiment() to build both a cross-frame and a treatment plan.

# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"

# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)

The cross-frame is the safest data to train on (unless you have made a separate data split for the treatment design step). It uses cross-validation to reduce nested model bias. Some notes on this issue are available here, and here.

# look at the data we would train models on
str(cfe_m$cross_frame)

## 'data.frame':    1000 obs. of  16 variables:
##  $ x1            : num  0.818 0.587 -0.671 0.103 -0.318 ...
##  $ x2_catP       : num  0.333 0.334 0.333 0.333 0.334 0.334 0.333 0.333 0.333 0.333 ...
##  $ x3_catP       : num  0.35 0.321 0.329 0.35 0.329 0.35 0.321 0.321 0.321 0.35 ...
##  $ x2_lev_x_a    : num  1 0 1 1 0 0 0 0 1 1 ...
##  $ x2_lev_x_b    : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ x2_lev_x_c    : num  0 1 0 0 1 1 0 0 0 0 ...
##  $ x3_lev_x_1    : num  0 1 0 0 0 0 1 1 1 0 ...
##  $ x3_lev_x_2    : num  1 0 0 1 0 1 0 0 0 1 ...
##  $ x3_lev_x_3    : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ Large1_x2_catB: num  1.23 -10.72 1.15 1.16 -10.53 ...
##  $ Large1_x3_catB: num  0.7025 0.0903 -10.4833 0.6238 -10.529 ...
##  $ Large2_x2_catB: num  0.17979 0.19661 -0.00379 -0.09818 0.00627 ...
##  $ Large2_x3_catB: num  -13.12 -13.05 4.49 -4.03 4.71 ...
##  $ NoInfo_x2_catB: num  -0.48752 -0.00254 -0.27947 -0.26155 0.15195 ...
##  $ NoInfo_x3_catB: num  2.05 2.43 -4.34 1.79 -4.55 ...
##  $ y             : chr  "NoInfo" "NoInfo" "Large2" "Large1" ...
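To make the train-on-the-cross-frame advice concrete, here is a minimal sketch of one possible workflow. It is an illustration under stated assumptions rather than a recommended recipe: `make_data()` is a hypothetical stand-in data generator (not the example data above), and `nnet::multinom()` is just one convenient multinomial learner, assumed to be installed.

```r
library("vtreat")

# Hypothetical stand-in data generator (an assumption for this sketch).
make_data <- function(n) {
  d <- data.frame(
    x1 = rnorm(n),
    x2 = sample(c("a", "b", "c"), n, replace = TRUE),
    stringsAsFactors = FALSE)
  noise <- rnorm(n)
  d$y <- "NoInfo"
  d$y[d$x2 == "a" & (d$x1 + noise) > 0] <- "Large1"
  d$y[d$x2 == "b" & (d$x1 + noise) < 0] <- "Large2"
  d
}

set.seed(2024)
d_train <- make_data(500)
d_new <- make_data(200)

# Design the treatments and get the cross-frame in one step.
cfe <- mkCrossFrameMExperiment(d_train, c("x1", "x2"), "y")

# Fit on the cross-frame, not on prepare(cfe$treat_m, d_train).
train <- cfe$cross_frame
train$y <- factor(train$y)
model <- nnet::multinom(y ~ ., data = train, trace = FALSE)

# Later data is treated with the plan and then scored.
new_treated <- prepare(cfe$treat_m, d_new)
pred <- predict(model, newdata = new_treated, type = "class")
print(table(truth = d_new$y, prediction = pred))
```

The point of the sketch is the division of labor: the cross-frame is used once, for fitting, and the treatment plan is used for all data that arrives afterwards.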

prepare() can apply the designed treatments to new data. Here we are simulating new data by re-using our design data.

# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))

## 'data.frame':    1000 obs. of  16 variables:
##  $ x1            : num  0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
##  $ x2_catP       : num  0.0005 0.334 0.333 0.333 0.334 0.334 0.333 0.333 0.333 0.333 ...
##  $ x3_catP       : num  0.0005 0.321 0.329 0.35 0.329 0.35 0.321 0.321 0.321 0.35 ...
##  $ x2_lev_x_a    : num  0 0 1 1 0 0 0 0 1 1 ...
##  $ x2_lev_x_b    : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ x2_lev_x_c    : num  0 1 0 0 1 1 0 0 0 0 ...
##  $ x3_lev_x_1    : num  0 1 0 0 0 0 1 1 1 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 0 0 1 ...
##  $ x3_lev_x_3    : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ Large1_x2_catB: num  0 -10.58 1.18 1.18 -10.58 ...
##  $ Large1_x3_catB: num  0 0.284 -10.584 0.529 -10.584 ...
##  $ Large2_x2_catB: num  0 0.1 0.0242 0.0242 0.1 ...
##  $ Large2_x3_catB: num  0 -13.08 4.72 -4.43 4.72 ...
##  $ NoInfo_x2_catB: num  0 0.0685 -0.3392 -0.3392 0.0685 ...
##  $ NoInfo_x3_catB: num  0 2.39 -4.55 2.05 -4.55 ...
##  $ y             : chr  "NoInfo" "NoInfo" "Large2" "Large1" ...

Obvious issues include computing variable importance, and the blow-up and co-dependency of the produced columns. We leave these for the next modeling step to deal with (this is our philosophy with most issues that involve joint distributions of variables).
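As a small illustration of the co-dependency in question, the following base R sketch rebuilds indicator columns like the `x2_lev_x_*` columns by hand (the values and the `design` matrix are made up for the example, not taken from vtreat) and shows that, together with an intercept, they are linearly dependent.

```r
# Hand-built level indicators, mimicking the x2_lev_x_* columns.
x2 <- c("a", "c", "a", "a", "c", "c", "b", "b", "a", "a")
levs <- c("a", "b", "c")
ind <- sapply(levs, function(lev) as.numeric(x2 == lev))
colnames(ind) <- paste0("x2_lev_x_", levs)

# Every row has exactly one indicator set.
stopifnot(all(rowSums(ind) == 1))

# With an intercept there are 4 columns but only 3 independent directions.
design <- cbind(intercept = 1, ind)
print(ncol(design))
print(qr(design)$rank)
```

A downstream learner that regularizes or otherwise tolerates collinear inputs can absorb this; an unregularized linear fit would have to drop or alias a column.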

We also have per-outcome variable importance.

knitr::kable(
  cfe_m$score_frame[, 
                    c("varName", "rsq", "sig", "outcome_level"), 
                    drop = FALSE])

varName	rsq	sig	outcome_level
x1	0.0427675	0.0002015	Large1
x2_catP	0.0979334	0.0000000	Large1
x2_lev_x_a	0.2681130	0.0000000	Large1
x2_lev_x_b	0.0975700	0.0000000	Large1
x2_lev_x_c	0.0979334	0.0000000	Large1
x3_catP	0.0125618	0.0439536	Large1
x3_lev_x_1	0.0053772	0.1874933	Large1
x3_lev_x_2	0.0266092	0.0033678	Large1
x3_lev_x_3	0.0961219	0.0000000	Large1
x1	0.0003984	0.4784542	Large2
x2_catP	0.0008969	0.2875322	Large2
x2_lev_x_a	0.0000512	0.7994128	Large2
x2_lev_x_b	0.0013961	0.1845435	Large2
x2_lev_x_c	0.0008969	0.2875322	Large2
x3_catP	0.0574052	0.0000000	Large2
x3_lev_x_1	0.2546121	0.0000000	Large2
x3_lev_x_2	0.2659830	0.0000000	Large2
x3_lev_x_3	0.9308590	0.0000000	Large2
x1	0.0035420	0.0312177	NoInfo
x2_catP	0.0004091	0.4641054	NoInfo
x2_lev_x_a	0.0108027	0.0001684	NoInfo
x2_lev_x_b	0.0072297	0.0020855	NoInfo
x2_lev_x_c	0.0004091	0.4641054	NoInfo
x3_catP	0.0416046	0.0000000	NoInfo
x3_lev_x_1	0.1848006	0.0000000	NoInfo
x3_lev_x_2	0.1796720	0.0000000	NoInfo
x3_lev_x_3	0.7228777	0.0000000	NoInfo
Large1_x2_catB	0.2679354	0.0000000	Large1
Large1_x3_catB	0.0835409	0.0000002	Large1
Large2_x2_catB	0.0002176	0.6004146	Large2
Large2_x3_catB	0.9064823	0.0000000	Large2
NoInfo_x2_catB	0.0080585	0.0011565	NoInfo
NoInfo_x3_catB	0.7143906	0.0000000	NoInfo
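One possible use of the per-outcome scores is to pick candidate variables separately for each outcome level. The sketch below is only an illustration: it copies a few rows from the table above into a small stand-in data frame (`score_small`), and the 0.05 threshold is an arbitrary assumption, not a vtreat default.

```r
# A few rows copied from the score frame above (stand-in for cfe_m$score_frame).
score_small <- data.frame(
  varName = c("x1", "x2_lev_x_a", "x3_lev_x_3",
              "x1", "x2_lev_x_a", "x3_lev_x_3"),
  sig = c(0.0002015, 0.0000000, 0.0000000,
          0.4784542, 0.7994128, 0.0000000),
  outcome_level = c("Large1", "Large1", "Large1",
                    "Large2", "Large2", "Large2"),
  stringsAsFactors = FALSE)

# Keep variables whose significance passes the (assumed) threshold.
threshold <- 0.05
keep <- score_small[score_small$sig < threshold, , drop = FALSE]

# One character vector of variable names per outcome level.
selected <- split(keep$varName, keep$outcome_level)
print(selected)
```

With these rows, all three variables are kept for Large1, while only `x3_lev_x_3` is kept for Large2.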

One can relate these per-target and per-treatment performances back to original columns by aggregating.

tapply(cfe_m$score_frame$rsq, 
       cfe_m$score_frame$origName, 
       max)

##         x1         x2         x3 
## 0.04276746 0.26811298 0.93085900

tapply(cfe_m$score_frame$sig, 
       cfe_m$score_frame$origName, 
       min)

##            x1            x2            x3 
##  2.015164e-04  1.315559e-20 2.777723e-257