Data for Titanic survival

Let’s walk through an example of using the DALEX package with a classification model for the Titanic survival problem. We use the titanic_imputed dataset available in the DALEX package. Note that this data was copied from the stablelearner package and modified for practicality.

library("DALEX")
head(titanic_imputed)
#>   gender age class    embarked  fare sibsp parch survived
#> 1   male  42   3rd Southampton  7.11     0     0        0
#> 2   male  13   3rd Southampton 20.05     0     2        0
#> 3   male  16   3rd Southampton 20.05     1     1        0
#> 4 female  39   3rd Southampton 20.05     1     1        1
#> 5 female  16   3rd Southampton  7.13     0     0        1
#> 6   male  25   3rd Southampton  7.13     0     0        1

Model for Titanic survival

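Now it’s time to create a model. Let’s fit a random forest with the ranger package, setting probability = TRUE so that the model returns survival probabilities rather than hard class labels.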

# prepare model
library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked +
                           fare + sibsp + parch,
                           data = titanic_imputed, probability = TRUE)
model_titanic_rf
#> Ranger result
#> 
#> Call:
#>  ranger(survived ~ gender + age + class + embarked + fare + sibsp +      parch, data = titanic_imputed, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2207 
#> Number of independent variables:  7 
#> Mtry:                             2 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.1422968
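
To see what the fitted forest returns, the sketch below asks it for the prediction for one hypothetical passenger. The passenger is made up for illustration: it copies the first row of titanic_imputed (so that factor levels match the training data) and changes only the age. With probability = TRUE, predict() from ranger returns one probability per class; the second value appears to correspond to survived = 1, but it is worth checking this against your ranger version.

```r
library("DALEX")   # provides the titanic_imputed data
library("ranger")

# Refit only if this snippet is run outside the session above.
if (!exists("model_titanic_rf")) {
  model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                             probability = TRUE)
}

# Hypothetical passenger (an assumption for illustration): start from an
# existing row so that factor levels match, then change only the age.
new_passenger <- titanic_imputed[1, -8]
new_passenger$age <- 8

# For a probability forest, $predictions holds one probability per class.
pred <- predict(model_titanic_rf, data = new_passenger)
pred$predictions
```

The two class probabilities for a single passenger sum to 1, which is a quick sanity check that the model was fitted as a probability forest.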

Explainer for Titanic survival

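The third step (optional but useful) is to create a DALEX explainer for the random forest model. The explainer wraps the model together with the data and the target variable, which gives the functions used later a uniform interface.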

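library("DALEX")
# Column 8 of titanic_imputed is the target (survived): it is dropped from
# `data` and passed separately as `y`.
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "Random Forest")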
#> Preparation of a new explainer is initiated
#>   -> model label       :  Random Forest 
#>   -> data              :  2207  rows  7  cols 
#>   -> target variable   :  2207  values 
#>   -> predict function  :  yhat.ranger  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.12.1 , task classification (  default  ) 
#>   -> predicted values  :  numerical, min =  0.01164526 , mean =  0.3215481 , max =  0.9899436  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -0.7923093 , mean =  0.0006086512 , max =  0.8905081  
#>   A new explainer has been created! 

Model Level Feature Importance

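Use the feature_importance() function on the explainer to present the importance of particular features. The importance of a feature is the model loss measured after the values of that feature are permuted (the dropout loss). The call below uses the default type = "raw", so raw losses are reported. Setting type = "difference" instead subtracts the loss of the full model, so that all values are measured from 0.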

library("ingredients")

fi_rf <- feature_importance(explain_titanic_rf)
head(fi_rf)
#>       variable mean_dropout_loss         label
#> 1 _full_model_         0.3408062 Random Forest
#> 2        parch         0.3520488 Random Forest
#> 3        sibsp         0.3520933 Random Forest
#> 4     embarked         0.3527842 Random Forest
#> 5          age         0.3760269 Random Forest
#> 6         fare         0.3848921 Random Forest
plot(fi_rf)
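
The type argument controls how the dropout loss is reported. The sketch below reuses the explainer built above (and rebuilds it if run on its own) with type = "difference", so that the reported losses are relative to the full model. Because the importance is permutation based, results vary between runs; the seed value is arbitrary and the object name fi_rf_diff is chosen only for this example.

```r
library("DALEX")
library("ranger")
library("ingredients")

# Rebuild the explainer only if this snippet is run outside the session above.
if (!exists("explain_titanic_rf")) {
  model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                             probability = TRUE)
  explain_titanic_rf <- explain(model_titanic_rf,
                                data = titanic_imputed[, -8],
                                y = titanic_imputed[, 8],
                                label = "Random Forest", verbose = FALSE)
}

set.seed(1)  # arbitrary seed: the permutations are random
# Report each loss as the difference from the loss of the full model.
fi_rf_diff <- feature_importance(explain_titanic_rf, type = "difference")
plot(fi_rf_diff)
```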

Feature effects

As the plot shows, the most important feature is gender. The next three most important features are class, age and fare. Let’s look at the relationship between the model response and these features.

Such univariate relationships can be calculated with partial_dependence().

age

Children aged 5 and younger have a much higher predicted survival probability.

Partial Dependence Profiles

pp_age  <- partial_dependence(explain_titanic_rf, variables =  c("age", "fare"))
head(pp_age)
#> Top profiles    : 
#>   _vname_       _label_       _x_    _yhat_ _ids_
#> 1    fare Random Forest 0.0000000 0.3630884     0
#> 2     age Random Forest 0.1666667 0.5347603     0
#> 3     age Random Forest 2.0000000 0.5536098     0
#> 4     age Random Forest 4.0000000 0.5595259     0
#> 5    fare Random Forest 6.1793080 0.3100674     0
#> 6     age Random Forest 7.0000000 0.5159751     0
plot(pp_age)
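
To make the idea behind a partial dependence profile concrete, here is a minimal hand-rolled sketch. It is not the implementation used by ingredients: for each value on a small, arbitrarily chosen age grid, every passenger's age is replaced by that value and the model predictions are averaged. partial_dependence() uses a finer grid and may work on a sample of observations, so its numbers will differ somewhat. The names age_grid and pd_manual are introduced only for this example.

```r
library("DALEX")
library("ranger")

# Rebuild the explainer only if this snippet is run outside the session above.
if (!exists("explain_titanic_rf")) {
  model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                             probability = TRUE)
  explain_titanic_rf <- explain(model_titanic_rf,
                                data = titanic_imputed[, -8],
                                y = titanic_imputed[, 8],
                                label = "Random Forest", verbose = FALSE)
}

age_grid <- c(2, 10, 30, 60)  # arbitrary grid chosen for illustration

pd_manual <- sapply(age_grid, function(a) {
  data_mod <- explain_titanic_rf$data
  data_mod$age <- a  # give every passenger the same age
  # average prediction over all passengers for this age
  mean(explain_titanic_rf$predict_function(explain_titanic_rf$model, data_mod))
})

data.frame(age = age_grid, mean_prediction = pd_manual)
```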

Conditional Dependence Profiles

cp_age  <- conditional_dependence(explain_titanic_rf, variables =  c("age", "fare"))
plot(cp_age)

Accumulated Local Effect Profiles

ap_age  <- accumulated_dependence(explain_titanic_rf, variables =  c("age", "fare"))
plot(ap_age)
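
The three profile types can be easier to compare when drawn on one chart. The sketch below recomputes them for age only and overwrites the _label_ column of each result so that the curves can be told apart. The new label strings and object names are arbitrary, and passing several profile objects to plot() is assumed to overlay them, as in the ingredients examples.

```r
library("DALEX")
library("ranger")
library("ingredients")

# Rebuild the explainer only if this snippet is run outside the session above.
if (!exists("explain_titanic_rf")) {
  model_titanic_rf <- ranger(survived ~ ., data = titanic_imputed,
                             probability = TRUE)
  explain_titanic_rf <- explain(model_titanic_rf,
                                data = titanic_imputed[, -8],
                                y = titanic_imputed[, 8],
                                label = "Random Forest", verbose = FALSE)
}

pd_age <- partial_dependence(explain_titanic_rf, variables = "age")
cd_age <- conditional_dependence(explain_titanic_rf, variables = "age")
al_age <- accumulated_dependence(explain_titanic_rf, variables = "age")

# All three share the explainer label, so relabel them before plotting.
pd_age$`_label_` <- "Partial"
cd_age$`_label_` <- "Conditional"
al_age$`_label_` <- "Accumulated"

plot(pd_age, cd_age, al_age)
```

Where the curves diverge, age is likely correlated with other features, since conditional and accumulated profiles account for such dependence while partial dependence profiles do not.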