Random Forest Classification Using Parsnip

parsnip tidymodels machine-learning random-forest random-code-snippets

How to make sure you get a classification fit and not a probability fit from a random forest model using the tidymodels framework.

Robert M Flight
08-30-2021

I’ve been working on a “machine learning” project, and in the process I’ve been learning to use the tidymodels framework (“Tidymodels” 2021), which helps keep you from leaking information from the test data into the training data, and lets you create workflows in a consistent way across methods.

However, I got tripped up recently by one issue. When I’ve previously used Random Forests (“Random Forest Wiki Page” 2021), I’ve found that for classification problems, the reported out-of-bag (OOB) error is a good proxy for the area-under-the-curve (AUC), or an estimate of how well any other machine learning technique will do (see Flight (2015) for an example using actual random data). Therefore, I like to put my data through a Random Forest algorithm and check the OOB error, and then maybe reach for a tuned boosted tree to squeeze out every last bit of performance.

With the ranger engine, the tidymodels default is to fit a probability forest, even for classification problems. This isn’t a problem for most people, because you will have a train and test set, and estimate performance on the test set using AUC. However, it is a problem if you just want to see the OOB error from the random forest, because it is reported differently in the two cases: as a Brier score for a probability forest, and as a misclassification percentage for a classification forest.
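To make the difference concrete, here is a minimal sketch in base R that computes both kinds of error from the same set of predicted probabilities. The five probabilities and labels are made-up illustrative values, not output from the cells data. The Brier score shown is the usual binary form (mean squared difference between the predicted probability and the 0/1 outcome), and ranger's internal calculation may differ in detail.

```r
# Hypothetical true classes and predicted probabilities (illustrative values only)
truth   <- factor(c("PS", "PS", "WS", "WS", "PS"), levels = c("PS", "WS"))
prob_ps <- c(0.9, 0.6, 0.4, 0.2, 0.45)  # predicted P(class == "PS")

# Misclassification rate: threshold at 0.5, then count the wrong labels
pred_class    <- factor(ifelse(prob_ps >= 0.5, "PS", "WS"), levels = levels(truth))
misclass_rate <- mean(pred_class != truth)

# Brier score: mean squared difference between probability and 0/1 outcome
brier_score <- mean((prob_ps - as.numeric(truth == "PS"))^2)

misclass_rate
brier_score
```

For these hypothetical values the misclassification rate is 0.2 and the Brier score is 0.1345, so the same predictions give two numbers on two different scales.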

Let’s run an example using the tidymodels cells data set.

library(tidymodels)
library(modeldata)
library(skimr)
data(cells, package = "modeldata")
library(ranger)
tidymodels_prefer()

cells$case = NULL
set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE)
Ranger result

Call:
 ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE) 

Type:                             Classification 
Number of trees:                  500 
Sample size:                      2019 
Number of independent variables:  56 
Mtry:                             7 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error:             17.29 % 

Here we can see that we get an OOB error of 17%, which isn’t too shabby. Now, let’s set up a workflow to do the same thing via tidymodels parsnip.

rf_spec = rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

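# The selector `class, -class` appears to match no columns, so step_dummy()
# has nothing to encode here (the fit still reports 56 predictors).
rf_recipe = recipe(class ~ ., data = cells) %>%
  step_dummy(class, -class)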

set.seed(1234)
workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec) %>%
  fit(data = cells)
══ Workflow [trained] ════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ──────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ─────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  500 
Sample size:                      2019 
Number of independent variables:  56 
Mtry:                             7 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.1198456 

Here the reported OOB error is 0.12, which looks lower than the 17% above. However, it is a Brier score rather than a misclassification rate, so the two numbers are not on the same scale. Also, the “Type” shows “Probability estimation” instead of “Classification.”

If we run ranger again with “probability” instead of “classification,” do we match up with the result above?

set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE)
Ranger result

Call:
 ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  500 
Sample size:                      2019 
Number of independent variables:  56 
Mtry:                             7 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.119976 

That is much closer to the tidymodels result! Great! Except it’s a misestimate of the true OOB error for classification. How do we get what we want while using the tidymodels framework?

I couldn’t find the answer, and the above looked like a bug, so I filed an issue on the parsnip GitHub (“Ranger "Classification" Mode Still Looks Like "Probability"” 2021). Julia Silge helpfully provided the solution to my problem.

rf_spec_class = rand_forest() %>%
  set_engine("ranger", probability = FALSE) %>%
  set_mode("classification")

set.seed(1234)
workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec_class) %>%
  fit(data = cells)
══ Workflow [trained] ════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ──────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ─────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, probability = ~FALSE,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Classification 
Number of trees:                  500 
Sample size:                      2019 
Number of independent variables:  56 
Mtry:                             7 
Target node size:                 1 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error:             16.54 % 

Aha! Now we are much closer to the original value of 17%, and the “Type” is “Classification.”
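The direct ranger call above used min.node.size = 10, while the workflow output reports a target node size of 1, which may account for some of the remaining gap. The sketch below is one way to line the two up and to pull the OOB error out as a number rather than reading it off the printout. It assumes a tidymodels installation recent enough to provide extract_fit_engine(), swaps the recipe for a plain formula via add_formula() to keep the example short, and uses the hypothetical name rf_spec_min10. min_n is parsnip's argument for the minimum node size.

```r
library(tidymodels)
library(ranger)
data(cells, package = "modeldata")
cells$case <- NULL

# min_n = 10 is an assumption, chosen to mirror min.node.size = 10
# in the direct ranger call
rf_spec_min10 <- rand_forest(min_n = 10) %>%
  set_engine("ranger", probability = FALSE) %>%
  set_mode("classification")

set.seed(1234)
rf_fit <- workflow() %>%
  add_formula(class ~ .) %>%
  add_model(rf_spec_min10) %>%
  fit(data = cells)

# Pull out the underlying ranger object and read its fields directly
ranger_fit <- extract_fit_engine(rf_fit)
ranger_fit$treetype          # forest type, as shown in the "Type" line
ranger_fit$min.node.size     # target node size used for the fit
ranger_fit$prediction.error  # OOB error as a fraction, not a percentage
```

The seed reaches ranger differently through the workflow, so an exact match with the direct call is not expected even with the node size aligned.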

I know that in this case the OOB errors are honestly not that different, but in my recent project they differed by 20 percentage points: I had 45% using classification and 25% using probability. I was therefore being fooled during my initial investigation with the tidymodels framework, and then wondering why my final AUC on a tuned model was only just above 55%.

So remember, this isn’t how I would run the model for final classification and estimation of AUC on a test set, but if you want the OOB errors for a quick “feel” of your data, it’s very useful.
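If a probability forest has already been fit, a rough classification-style OOB error can still be recovered from it. The sketch below assumes that the predictions element of a ranger probability forest holds the out-of-bag class probabilities with one named column per class, and assigns each sample to its most probable class. It is an approximation for a quick look, not the number a classification forest would report.

```r
library(ranger)
data(cells, package = "modeldata")
cells$case <- NULL

set.seed(1234)
prob_fit <- ranger(class ~ ., data = cells, probability = TRUE)

# Out-of-bag class probabilities: one row per sample, one column per class
# (assumed layout of $predictions for a probability forest)
oob_prob <- prob_fit$predictions

# Assign each sample to the class with the highest OOB probability
oob_class <- colnames(oob_prob)[max.col(oob_prob, ties.method = "first")]

# Fraction of samples whose OOB class disagrees with the true class
oob_misclass <- mean(oob_class != as.character(cells$class))
oob_misclass
```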

References

Flight, Robert M. 2015. “Deciphering Life: One Bit at a Time: Random Forest Vs PLS on Random Data.” https://rmflight.github.io/posts/2015-12-12-random-forest-vs-pls-on-random-data/.
“Random Forest Wiki Page.” 2021. https://en.wikipedia.org/wiki/Random_forest.
“Ranger "Classification" Mode Still Looks Like "Probability".” 2021. https://github.com/tidymodels/parsnip/issues/546.
“Tidymodels.” 2021. https://www.tidymodels.org/.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Flight (2021, Aug. 30). Deciphering Life: One Bit at a Time: Random Forest Classification Using Parsnip. Retrieved from https://rmflight.github.io/posts/2021-08-30-random-forest-classification-using-parsnip/

BibTeX citation

@misc{flight2021random,
  author = {Flight, Robert M},
  title = {Deciphering Life: One Bit at a Time: Random Forest Classification Using Parsnip},
  url = {https://rmflight.github.io/posts/2021-08-30-random-forest-classification-using-parsnip/},
  year = {2021}
}