This article covers core features of the `aorsf`

package.

The oblique random forest (RF) is an extension of the traditional (axis-based) RF. Instead of using a single variable to split data and grow new branches, trees in the oblique RF use a weighted combination of multiple variables.

The purpose of `aorsf`

(‘a’ is short for accelerated) is
to provide a unifying framework to fit oblique RFs that can scale
adequately to large data sets. The fastest algorithms available in the
package are used by default because they often have equivalent
prediction accuracy to more computational approaches.

The center piece of `aorsf`

is the `orsf()`

function. In the initial versions of `aorsf`

, the
`orsf()`

function only fit `o`

blique
`r`

andom `s`

urvival `f`

orests, but now
it allows for classification, regression, and survival forests. (I may
introduce an `orf()`

function in the future if the name
`orsf()`

is misleading to users.)

For classification, we fit an oblique RF to predict penguin species
using `penguin`

data from the magnificent
`palmerpenguins`

R package

```
# An oblique classification RF
penguin_fit <- orsf(data = penguins_orsf, formula = species ~ .)
penguin_fit
#> ---------- Oblique random classification forest
#>
#> Linear combinations: Accelerated Logistic regression
#> N observations: 333
#> N classes: 3
#> N trees: 500
#> N predictors total: 7
#> N predictors per node: 3
#> Average leaves per tree: 5.574
#> Min observations in leaf: 5
#> OOB stat value: 1.00
#> OOB stat type: AUC-ROC
#> Variable importance: anova
#>
#> -----------------------------------------
```

For regression, we use the same data but predict bill length of penguins:

```
# An oblique regression RF
bill_fit <- orsf(data = penguins_orsf, formula = bill_length_mm ~ .)
bill_fit
#> ---------- Oblique random regression forest
#>
#> Linear combinations: Accelerated Linear regression
#> N observations: 333
#> N trees: 500
#> N predictors total: 7
#> N predictors per node: 3
#> Average leaves per tree: 49.936
#> Min observations in leaf: 5
#> OOB stat value: 0.82
#> OOB stat type: RSQ
#> Variable importance: anova
#>
#> -----------------------------------------
```

My personal favorite is the oblique survival RF with accelerated Cox regression because it has a great combination of prediction accuracy and computational efficiency (see JCGS paper). Here, we predict mortality risk following diagnosis of primary biliary cirrhosis:

```
# An oblique survival RF
pbc_fit <- orsf(data = pbc_orsf,
n_tree = 5,
formula = Surv(time, status) ~ . - id)
pbc_fit
#> ---------- Oblique random survival forest
#>
#> Linear combinations: Accelerated Cox regression
#> N observations: 276
#> N events: 111
#> N trees: 5
#> N predictors total: 17
#> N predictors per node: 5
#> Average leaves per tree: 20.8
#> Min observations in leaf: 5
#> Min events in leaf: 1
#> OOB stat value: 0.80
#> OOB stat type: Harrell's C-index
#> Variable importance: anova
#>
#> -----------------------------------------
```

you may notice that the first input of `aorsf`

is
`data`

. This is a design choice that makes it easier to use
`orsf`

with pipes (i.e., `%>%`

or
`|>`

). For instance,

`aorsf`

includes several functions dedicated to
interpretation of ORSFs, both through estimation of partial dependence
and variable importance.

There are multiple methods to compute variable importance, and each can be applied to any type of oblique forest.

To compute

*negation*importance, ORSF multiplies each coefficient of that variable by -1 and then re-computes the out-of-sample (sometimes referred to as out-of-bag) accuracy of the ORSF model.`orsf_vi_negate(pbc_fit) #> bili copper protime stage albumin age #> 0.074843858 0.073523557 0.036650250 0.032580793 0.030044474 0.025459972 #> sex chol ast alk.phos trt ascites #> 0.017996486 0.017985702 0.013128951 0.012594186 0.010617402 0.009458633 #> spiders edema platelet trig hepato #> 0.006791345 0.006689677 0.005991995 -0.001229404 -0.006160397`

You can also compute variable importance using

*permutation*, a more classical approach that noises up a predictor and then assigned the resulting degradation in prediction accuracy to be the importance of that predictor.A faster alternative to permutation and negation importance is ANOVA importance, which computes the proportion of times each variable obtains a low p-value (p < 0.01) while the forest is grown.

Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable adjusted estimate of the model’s prediction.

For more on PD, see the vignette

Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor.

For more on ICE, see the vignette

The original ORSF (i.e., `obliqueRSF`

) used
`glmnet`

to find linear combinations of inputs.
`aorsf`

allows users to implement this approach using the
`orsf_control_survival(method = 'net')`

function:

```
orsf_net <- orsf(data = pbc_orsf,
formula = Surv(time, status) ~ . - id,
control = orsf_control_survival(method = 'net'))
```

`net`

forests fit a lot faster than the original ORSF
function in `obliqueRSF`

. However, `net`

forests
are still much slower than `cph`

ones.

The unique feature of `aorsf`

is its fast algorithms to
fit ORSF ensembles. `RLT`

and `obliqueRSF`

both
fit oblique random survival forests, but `aorsf`

does so
faster. `ranger`

and `randomForestSRC`

fit
survival forests, but neither package supports oblique splitting.
`obliqueRF`

fits oblique random forests for classification
and regression, but not survival. `PPforest`

fits oblique
random forests for classification but not survival.

Note: The default prediction behavior for `aorsf`

models
is to produce predicted risk at a specific prediction horizon, which is
not the default for `ranger`

or `randomForestSRC`

.
I think this will change in the future, as computing time independent
predictions with `aorsf`

could be helpful.

`aorsf`

began as a dedicated package for oblique random
survival forests, and so most papers published so far have focused on
survival analysis and risk prediction. However, the routines for
regression and classification oblique RFs in `aorsf`

have
high overlap with the survival ones.