Tidymodels Course

knitr::opts_chunk$set(eval = F)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.5      ✔ recipes      1.0.10
✔ dials        1.2.1      ✔ rsample      1.2.1 
✔ dplyr        1.1.4      ✔ tibble       3.2.1 
✔ ggplot2      3.5.1      ✔ tidyr        1.3.1 
✔ infer        1.0.7      ✔ tune         1.2.1 
✔ modeldata    1.3.0      ✔ workflows    1.1.4 
✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
✔ purrr        1.0.2      ✔ yardstick    1.3.1 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

Splitting

The rsample package is designed to create training and test datasets. Creating a test dataset is important for estimating how a trained model will likely perform on new data. It also guards against overfitting, where a model memorizes patterns that exist only in the training data and performs poorly on new data.

  • Create an rsample object, home_split, that contains the instructions for randomly splitting the home_sales data into a training and test dataset.

  • Allocate 70% of the data into training and stratify the results by selling_price.

# Create a data split object
home_split <- initial_split(home_sales, 
                  prop = 0.7, 
                  strata = selling_price)
# Create the training data
home_training <- home_split %>%
 training()

# Create the test data
home_test <- home_split %>% 
  testing()

# Check number of rows in each dataset
nrow(home_training)
nrow(home_test)

Excellent work! The minimum and maximum selling prices in both datasets are the same. The mean and standard deviation are also similar. Stratifying by the outcome variable ensures the model fitting process is performed on a representative sample of the original data.

home_training %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
# A tibble: 1 × 4
  min_sell_price max_sell_price mean_sell_price sd_sell_price
           <dbl>          <dbl>           <dbl>         <dbl>
1         350000         650000         479156.        81201.
# Distribution of selling_price in test data
home_test %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
# A tibble: 1 × 4
  min_sell_price max_sell_price mean_sell_price sd_sell_price
           <dbl>          <dbl>           <dbl>         <dbl>
1         350000         650000         478920.        80552.

Linear Regression

# Initialize a linear regression object, linear_model
linear_model <- linear_reg() %>% 
  # Set the model engine
  set_engine("lm") %>% 
  # Set the model mode
  set_mode("regression")

# Fit the model using the training data
lm_fit <- linear_model %>% 
  fit(selling_price ~ home_age + sqft_living,
      data = home_training)

# Print lm_fit to view model information
lm_fit

Output:

parsnip model object

Fit time:  1ms 

Call:
stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)

Coefficients:
(Intercept)     home_age  sqft_living  
   291102.4      -1577.4        104.1  

Prediction on test data

# Predict selling_price
home_predictions <- predict(lm_fit,
                        new_data = home_test)

# View predicted selling prices
home_predictions

Output:

# A tibble: 450 × 1
     .pred
     <dbl>
 1 539701.
 2 633473.
 3 429715.
 4 409971.
 5 489075.
 6 532008.
 7 402084.
 8 484440.
 9 449362.
10 535602.
# … with 440 more rows
# Combine test data with predictions
home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  bind_cols(home_predictions)

# View results
home_test_results

Output

# A tibble: 450 × 4
   selling_price home_age sqft_living   .pred
           <dbl>    <dbl>       <dbl>   <dbl>
 1        487000       10        2540 539701.
 2        635000        4        3350 633473.
 3        495000       21        1650 429715.
 4        355000       19        1430 409971.
 5        464950       19        2190 489075.
 6        535000        3        2360 532008.
 7        356000       24        1430 402084.
 8        525000       16        2100 484440.
 9        552321       29        1960 449362.
10        485000        6        2440 535602.
# … with 440 more rows

Evaluating Performance

Using yardstick package. Requires tibble with model results.

Use RMSE

rmse(truth = , estimate = .pred)

R² = cor(actual, predicted)²

rsq(truth = , estimate = .pred)

R² plot: x = actual, y = predicted

  • the line y = x is the R² = 1 line

  • helpful for showing nonlinear patterns, regions where the model predicts poorly, etc.

last_fit() takes a model specification, model formula, and data split object to:

  1. create the training and test datasets
  2. fit the model to the training data
  3. calculate goodness-of-fit metrics and predictions on the test data
  4. return an object with all the results

e.g.

lm_last_fit <- lm_model %>% last_fit(hwy ~ cty, split = mpg_split)

lm_last_fit %>% collect_metrics()

collect_metrics() returns tibble of results (default are RMSE and rsq)

lm_last_fit %>% collect_predictions() returns tibble with test data predictions (.pred)

Example

# Print home_test_results
home_test_results

# Calculate the RMSE metric
home_test_results %>% 
  rmse(truth = selling_price, estimate = .pred)

# Calculate the R squared metric
home_test_results %>% 
  rsq(truth = selling_price, estimate = .pred)

Output

home_test_results %>% 
  rmse(truth = selling_price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      47680.

home_test_results %>% 
  rsq(truth = selling_price, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.651
# Define a linear regression model
linear_model <- linear_reg() %>% 
  set_engine('lm') %>% 
  set_mode('regression')

# Train linear_model with last_fit()
linear_fit <- linear_model %>% 
  last_fit(selling_price ~ ., split = home_split)

# Collect predictions and view results
predictions_df <- linear_fit %>% collect_predictions()
predictions_df

Output

# Collect predictions and view results
predictions_df <- linear_fit %>% collect_predictions()
predictions_df
# A tibble: 375 × 5
   id                 .pred  .row selling_price .config             
   <chr>              <dbl> <int>         <dbl> <chr>               
 1 train/test split 528823.     1        487000 Preprocessor1_Model1
 2 train/test split 437591.     6        495000 Preprocessor1_Model1
 3 train/test split 401497.     7        355000 Preprocessor1_Model1
 4 train/test split 477671.     8        464950 Preprocessor1_Model1
 5 train/test split 480365.    10        535000 Preprocessor1_Model1
 6 train/test split 443872.    11        356000 Preprocessor1_Model1
 7 train/test split 502003.    13        525000 Preprocessor1_Model1
 8 train/test split 390503.    18        381000 Preprocessor1_Model1
 9 train/test split 479451.    21        450000 Preprocessor1_Model1
10 train/test split 628777.    22        624000 Preprocessor1_Model1
# … with 365 more rows
# Make an R squared plot using predictions_df
ggplot(predictions_df, aes(x = selling_price, y = .pred)) + 
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred() +
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

Classification Models

Goal: create distinct, non-overlapping regions along the range of predictor variable values, i.e. predict the same categorical outcome within each region (on a graph)

Ex: logistic regression, a popular classification algorithm that creates a linear separation between outcome categories

First step is again to resample and split data into training and testing data

initial_split(df, prop = , strata = )

stratify by the outcome variable (0/1, yes/no, etc.), e.g. purchased

then pass to training() and testing() functions

logistic_reg() function, then set_engine("glm") and set_mode("classification")

then fit(y ~ x1 + x2, data = training_data) to train the model

for prediction, pass the test data to the new_data argument and set type to "class" (classification):

predict(fitted_model, new_data = the_testing_data, type = "class")

when type is 'class', returns a factor variable called .pred_class

setting type to 'prob' estimates probabilities for each outcome category for each row (observation) in the test data

  • named .pred_{outcome_category}

using yardstick to evaluate performance, need a tibble of results

results <- test_df %>% 
  select(outcome) %>%
  bind_cols(class_preds, prob_preds)

Example

telecom_df
# A tibble: 975 × 9
   canceled_service cellular_service avg_data_gb avg_call_mins avg_intl_mins
   <fct>            <fct>                  <dbl>         <dbl>         <dbl>
 1 yes              single_line             7.78           497           127
 2 yes              single_line             9.04           336            88
 3 no               single_line            10.3            262            55
 4 yes              multiple_lines          5.08           250           107
 5 no               multiple_lines          8.05           328           122
 6 no               single_line             9.3            326           114
 7 yes              multiple_lines          8.01           525            97
 8 no               multiple_lines          9.4            312           147
 9 yes              single_line             5.29           417            96
10 no               multiple_lines          9.96           340           136
# … with 965 more rows, and 4 more variables: internet_service <fct>,
#   contract <fct>, months_with_company <dbl>, monthly_charges <dbl>

Code

# Create data split object
telecom_split <- initial_split(telecom_df, prop = 0.75,
                     strata = canceled_service)

# Create the training data
telecom_training <- telecom_split %>% 
  training()

# Create the test data
telecom_test <- telecom_split %>% 
  testing()

# Check the number of rows
nrow(telecom_training)
nrow(telecom_test)

Output

nrow(telecom_training)
[1] 731
nrow(telecom_test)
[1] 244 

Create Model

# Specify a logistic regression model
logistic_model <- logistic_reg() %>% 
  # Set the engine
  set_engine("glm") %>% 
  # Set the mode
  set_mode("classification")

# Print the model specification
logistic_model

Output

# Print the model specification
logistic_model
Logistic Regression Model Specification (classification)

Computational engine: glm 

Fit model

# Fit to training data
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
      data = telecom_training)

# Print model fit object
logistic_fit

Output

# Print model fit object
logistic_fit
parsnip model object

Fit time:  5ms 

Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
    monthly_charges, family = stats::binomial, data = data)

Coefficients:
    (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
       1.837242        -0.010424         0.022114         0.002802  

Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
Null Deviance:      932.4 
Residual Deviance: 806.7    AIC: 814.7

Predictions

# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = telecom_test,
                       type = "class")

# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit, new_data = telecom_test, 
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

# View results tibble
telecom_results

Output

telecom_results
# A tibble: 243 × 4
   canceled_service .pred_class .pred_yes .pred_no
   <fct>            <fct>           <dbl>    <dbl>
 1 no               no              0.376    0.624
 2 yes              yes             0.755    0.245
 3 yes              yes             0.550    0.450
 4 yes              yes             0.567    0.433
 5 yes              no              0.362    0.638
 6 no               no              0.206    0.794
 7 no               no              0.172    0.828
 8 no               no              0.117    0.883
 9 yes              yes             0.515    0.485
10 yes              no              0.194    0.806
# … with 233 more rows

Assessing Model Fit

With yardstick and a binary outcome, the outcome must be a factor whose first level is the positive ("yes") class. Check the level order with levels().
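
A minimal sketch (using a hypothetical outcome vector, not from the course) of checking and reordering the levels with base R's relevel():

# Hypothetical binary outcome with "no" as the first level
outcome <- factor(c("no", "yes", "no", "yes"), levels = c("no", "yes"))
levels(outcome)
# [1] "no"  "yes"   -- yardstick would treat "no" as the positive class

# Move "yes" to the first position so it is treated as the positive class
outcome <- relevel(outcome, ref = "yes")
levels(outcome)
# [1] "yes" "no"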

Confusion Matrix compares combinations of actual and predicted outcome values

Correct predictions:

  • True positive (TP)

  • True negative (TN)

Classification errors:

  • False positive (FP)

  • False negative (FN)

Creating a confusion matrix with yardstick’s conf_mat() needs tibble of model results with

  • true outcome values

  • predicted outcome categories .pred_class

  • estimated probabilities of each category .pred_yes and .pred_no

conf_mat(df_results, truth = , estimate = .pred_class)

the accuracy() function takes the same inputs as conf_mat() and calculates the proportion of all estimates that are correct classifications:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy may not be the best metric, especially when the outcome classes are imbalanced.

Sensitivity measures the proportion of all positive cases that were correctly classified:

Sensitivity = TP / (TP + FN)

uses sens() function with same inputs as accuracy().

Specificity measures the proportion of all negative cases that were correctly classified:

Specificity = TN / (TN + FP)

uses spec() function

(1 - specificity is the “false positive rate”: the proportion of actual negatives incorrectly classified as positive)

Instead of calculating these one by one, can use the metric_set() function and specify what metrics we want (by name), e.g.:

custom_metrics <- metric_set(accuracy, sens, spec)

Also, passing conf_mat() results into summary() will calculate all of these at once!

Example

telecom_results
# A tibble: 243 × 4
   canceled_service .pred_class .pred_yes .pred_no
   <fct>            <fct>           <dbl>    <dbl>
 1 no               no              0.376    0.624
 2 yes              yes             0.755    0.245
 3 yes              yes             0.550    0.450
 4 yes              yes             0.567    0.433
 5 yes              no              0.362    0.638
 6 no               no              0.206    0.794
 7 no               no              0.172    0.828
 8 no               no              0.117    0.883
 9 yes              yes             0.515    0.485
10 yes              no              0.194    0.806
# … with 233 more rows
# Calculate the confusion matrix
conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)

Output

conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)
          Truth
Prediction yes  no
       yes  34  17
       no   47 145
# Calculate the accuracy
accuracy(telecom_results, truth = canceled_service,
    estimate = .pred_class)

Output

accuracy(telecom_results, truth = canceled_service,
    estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.737
# Calculate the sensitivity
sens(telecom_results, truth = canceled_service,
    estimate = .pred_class)

Output

sens(telecom_results, truth = canceled_service,
    estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.420
# Calculate the specificity
spec(telecom_results, truth = canceled_service,
    estimate = .pred_class)

Output

spec(telecom_results, truth = canceled_service,
    estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 spec    binary         0.895

The specificity of your logistic regression model is 0.895, which is more than double the sensitivity of 0.42. This indicates that your model is much better at detecting customers who will not cancel their telecommunications service versus the ones who will.
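
As a check, these metrics follow directly from the confusion matrix above, where TP = 34, FP = 17, FN = 47, and TN = 145:

  • Accuracy = (34 + 145) / 243 ≈ 0.737

  • Sensitivity = 34 / (34 + 47) ≈ 0.420

  • Specificity = 145 / (145 + 17) ≈ 0.895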

# Create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)

# Calculate metrics using model results tibble
telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Pass to the summary() function
  summary()

output

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.737
 2 kap                  binary         0.347
 3 sens                 binary         0.420
 4 spec                 binary         0.895
 5 ppv                  binary         0.667
 6 npv                  binary         0.755
 7 mcc                  binary         0.364
 8 j_index              binary         0.315
 9 bal_accuracy         binary         0.657
10 detection_prevalence binary         0.210
11 precision            binary         0.667
12 recall               binary         0.420
13 f_meas               binary         0.515

Visualizing Model Performance

Can plot conf_mat() results with autoplot(), setting type = 'heatmap'. Setting type = 'mosaic' creates a mosaic plot that visualizes sensitivity and specificity.

Probability thresholds: the default in binary classification is 0.5. If the estimated probability of the positive (1, yes) outcome for a case is greater than or equal to 0.5, it is classified as positive (1, yes).

How does a model perform across a range of thresholds? Treat each unique estimated probability in the .pred_yes column of the test results as a candidate threshold, and calculate the sensitivity and specificity at each one.
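
For example, a minimal sketch (reusing the telecom_results tibble from above; the 0.3 cutoff and the column name are made up for illustration) of reclassifying at a lower threshold and recomputing the class metrics:

# Classify as "yes" whenever the estimated probability of canceling is at least 0.3
results_30 <- telecom_results %>% 
  mutate(pred_class_30 = factor(if_else(.pred_yes >= 0.3, "yes", "no"),
                                levels = levels(canceled_service)))

# Lowering the threshold trades specificity for sensitivity
sens(results_30, truth = canceled_service, estimate = pred_class_30)
spec(results_30, truth = canceled_service, estimate = pred_class_30)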

The Receiver Operating Characteristic (ROC) curve visualizes performance across thresholds. It plots (1 - specificity) on the x-axis and sensitivity on the y-axis.

For each unique threshold, a point that represents sensitivity vs 1-specificity is plotted. i.e. proportion correct among actual positives vs. proportion incorrect among actual negatives.

A step function

The optimal point is (0, 1). The diagonal line (y = x, i.e. sensitivity = 1 - specificity) indicates poor performance - the model does no better than a random coin flip

Area under ROC curve (AUC) summarizes performance in a single number. Useful interpretation as a letter grade:

  • A: [0.9, 1]

  • B: [0.8, 0.9)

  • C: [0.7, 0.8)

  • D: [0.6, 0.7)

  • F: [0.5, 0.6)

roc_curve(truth = , .pred_yes) (the second argument is the estimated probability of the yes category from the results tibble)

returns a tibble with specificity and sensitivity for all unique thresholds in .pred_yes.

Then pass roc_curve(…) %>% autoplot(). Then can calculate roc_auc(results, truth = , .pred_yes).

Example

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a heat map
  autoplot(type = 'heatmap')

image

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a mosaic plot
  autoplot(type = 'mosaic')

image

Great job! The mosaic plot clearly shows that your logistic regression model performs much better in terms of specificity than sensitivity. You can see that in the yes column, a large proportion of outcomes were incorrectly predicted as no.

Create a tibble, threshold_df, which contains the sensitivity and specificity of your classification model across the unique probability thresholds in telecom_results.

# Calculate metrics across thresholds
threshold_df <- telecom_results %>% 
  roc_curve(truth = canceled_service, .pred_yes)

# View results
threshold_df

Output

# A tibble: 245 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1  -Inf          0                 1
 2     0.0220     0                 1
 3     0.0439     0.00617           1
 4     0.0538     0.0123            1
 5     0.0548     0.0185            1
 6     0.0608     0.0247            1
 7     0.0706     0.0309            1
 8     0.0756     0.0370            1
 9     0.0774     0.0432            1
10     0.0783     0.0494            1
# … with 235 more rows
# Plot ROC curve
threshold_df %>% 
  autoplot()

image

# Calculate ROC AUC
roc_auc(telecom_results,
    truth = canceled_service, 
    .pred_yes)

output

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.760

Nice work! The area under the ROC curve is 0.76. This indicates that your model gets a C in terms of overall performance. This is mainly due to the low sensitivity of the model.

Streamlining Workflow

last_fit() function speeds up modeling process; fits model to training data and produces predictions on test data.

We need a data split object and a model specification; pass the model formula and split = our split object to last_fit(). Then pipe the result into collect_metrics() to see how the model performed on the test data. The default classification metrics are accuracy and ROC AUC.

Can also pipe into collect_predictions().

When creating a custom metric set that includes roc_auc, note that class metrics like spec() and sens() need truth = and estimate = .pred_class, while roc_auc() needs truth = and the probability column .pred_yes. So when calling the custom metric function, supply estimate = .pred_class followed by .pred_yes so every metric gets the input it needs.

Example

# Train model with last_fit()
telecom_last_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

# View test set metrics
telecom_last_fit %>% 
  collect_metrics()

output

telecom_last_fit %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.737 Preprocessor1_Model1
2 roc_auc  binary         0.760 Preprocessor1_Model1
# Collect predictions
last_fit_results <- telecom_last_fit %>% 
  collect_predictions()

# View results
last_fit_results

output

# A tibble: 243 × 7
   id         .pred_yes .pred_no  .row .pred_class canceled_service .config     
   <chr>          <dbl>    <dbl> <int> <fct>       <fct>            <chr>       
 1 train/tes…     0.376    0.624     3 no          no               Preprocesso…
 2 train/tes…     0.755    0.245     7 yes         yes              Preprocesso…
 3 train/tes…     0.550    0.450     9 yes         yes              Preprocesso…
 4 train/tes…     0.567    0.433    12 yes         yes              Preprocesso…
 5 train/tes…     0.362    0.638    17 no          yes              Preprocesso…
 6 train/tes…     0.206    0.794    25 no          no               Preprocesso…
 7 train/tes…     0.172    0.828    26 no          no               Preprocesso…
 8 train/tes…     0.117    0.883    28 no          no               Preprocesso…
 9 train/tes…     0.515    0.485    31 yes         yes              Preprocesso…
10 train/tes…     0.194    0.806    38 no          yes              Preprocesso…
# … with 233 more rows
# Custom metrics function
last_fit_metrics <- metric_set(accuracy, sens,
                               spec, roc_auc)

# Calculate metrics
last_fit_metrics(last_fit_results,
                 truth = canceled_service,
                 estimate = .pred_class,
                 .pred_yes)

results

# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.737
2 sens     binary         0.420
3 spec     binary         0.895
4 roc_auc  binary         0.760

Complete model workflow (with new x variable)

# Train a logistic regression model
logistic_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company, 
           split = telecom_split)
# Collect metrics
logistic_fit %>% 
  collect_metrics()

## RESULTS
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.790 Preprocessor1_Model1
2 roc_auc  binary         0.837 Preprocessor1_Model1

# Collect model predictions
logistic_fit %>% 
  collect_predictions() %>% 
  # Plot ROC curve
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()

image

Excellent work! The ROC curve shows that the logistic regression model performs better than a model that guesses at random (the dashed line in the plot). Adding the months_with_company predictor variable increased your area under the ROC curve from 0.76 in your previous model to 0.837!

Feature Engineering

Using recipes package.

Process of transforming data into a format suitable for machine learning algorithms. Examples:

  1. assigning variable roles to columns in data
  2. defining preprocessing tasks & data transformations
  3. training the preprocessing steps on our data
  4. applying the trained steps to new data sources

First step: assigning variable roles using recipe(). For each column, the role is either:

  • outcome

  • predictor

And determine the type of data

  • numeric

  • categorical

Next, define preprocessing steps with step_*() functions, for purposes such as:

  • missing data imputation

  • centering/transforming numeric variables

  • creating new variables (e.g. ratios of existing variables)

Then train the recipe on data using prep(), which estimates quantities like the mean and SD of numeric columns (for centering and scaling) and stores the formulas for creating new columns

Finally, apply these preprocessing transformations to the data with bake():

  • training and testing data for modeling

  • new data for prediction

    • machine learning algorithms require new data to be in the same format as the training data

Example

Log-transforming total_time variable

  • Log compresses the range of data values and reduces variability

# specify purchased is the response and all else are predictors
leads_log_rec <- recipe(purchased ~ .,
                        # determines data types
                        data = leads_training) %>% 
  # now transform variable
  step_log(total_time, base = 10) # log base 10

printing leads_log_rec will display the number of outcome and predictor variables, as well as the preprocessing operations

When piped into summary(), returns a tibble with variable information

  • type: nominal (categorical) or numeric

  • role: outcome or predictor

  • source

Now train recipe on training data:

leads_log_rec_prep <- leads_log_rec %>%
  prep(training = leads_training)

Printing it shows which operations were successfully trained

To apply recipe to new or existing data, use bake()

leads_log_rec_prep %>% 
  bake(new_data = NULL) # which data to apply trained recipe

Since we used leads_training to train our recipe, the preprocessed training data was retained by default during prep(). Setting new_data = NULL will return this preprocessed training data.

To transform the test data, pass it to the new_data argument

leads_log_rec_prep %>% 
  bake(new_data = leads_test)

Example

telecom_rec <- recipe(canceled_service ~ .,
                      data = telecom_df) %>% 
  step_log(avg_call_mins, base = 10)

# print the recipe
telecom_rec

output

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Operations:

Log transformation on avg_call_mins

# summarize recipe
telecom_rec %>% summary()
# A tibble: 9 × 4
  variable            type    role      source  
  <chr>               <chr>   <chr>     <chr>   
1 cellular_service    nominal predictor original
2 avg_data_gb         numeric predictor original
3 avg_call_mins       numeric predictor original
4 avg_intl_mins       numeric predictor original
5 internet_service    nominal predictor original
6 contract            nominal predictor original
7 months_with_company numeric predictor original
8 monthly_charges     numeric predictor original
9 canceled_service    nominal outcome   original
# Specify feature engineering recipe
telecom_log_rec <- recipe(canceled_service ~ ., 
                          data = telecom_training) %>%
  # Add log transformation step for numeric predictors
  step_log(avg_call_mins, avg_intl_mins, base = 10)

# Print recipe object
telecom_log_rec

output

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Operations:

Log transformation on avg_call_mins, avg_intl_mins
# View variable roles and data types
telecom_log_rec %>%
  summary()

output

# A tibble: 9 × 4
  variable            type    role      source  
  <chr>               <chr>   <chr>     <chr>   
1 cellular_service    nominal predictor original
2 avg_data_gb         numeric predictor original
3 avg_call_mins       numeric predictor original
4 avg_intl_mins       numeric predictor original
5 internet_service    nominal predictor original
6 contract            nominal predictor original
7 months_with_company numeric predictor original
8 monthly_charges     numeric predictor original
9 canceled_service    nominal outcome   original

Now it’s time to train your recipe and apply it to new data!

# Train the telecom_log_rec object
telecom_log_rec_prep <- telecom_log_rec %>% 
  prep(training = telecom_training)

# View results
telecom_log_rec_prep

output

# View results
telecom_log_rec_prep
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Training data contained 731 data points and no missing data.

Operations:

Log transformation on avg_call_mins, avg_intl_mins [trained]
# Apply to training data
telecom_log_rec_prep %>% 
  bake(new_data = NULL)

output

# A tibble: 731 × 9
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line            10.3           2.42          1.74 fiber_optic     
 2 multiple_lines          8.05          2.52          2.09 digital         
 3 multiple_lines          9.4           2.49          2.17 fiber_optic     
 4 multiple_lines          9.96          2.53          2.13 fiber_optic     
 5 multiple_lines         10.2           2.60          2.06 fiber_optic     
 6 single_line             6.69          2.55          1.96 digital         
 7 single_line             9.37          2.58          1.94 fiber_optic     
 8 multiple_lines          4.11          2.57          1.81 digital         
 9 multiple_lines          5.17          2.53          2.08 digital         
10 multiple_lines          7.86          2.58          2.21 digital         
# … with 721 more rows, and 4 more variables: contract <fct>,
#   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
# Apply to test data
telecom_log_rec_prep %>% 
  bake(new_data = telecom_test)

output

# A tibble: 244 × 9
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line             7.78          2.70          2.10 fiber_optic     
 2 single_line             9.04          2.53          1.94 fiber_optic     
 3 multiple_lines          5.08          2.40          2.03 digital         
 4 single_line             9.3           2.51          2.06 fiber_optic     
 5 single_line             7.07          2.40          1.97 fiber_optic     
 6 multiple_lines         10.6           2.45          2.17 fiber_optic     
 7 single_line             8.67          1.97          2.12 fiber_optic     
 8 multiple_lines         11.0           2.59          1.89 fiber_optic     
 9 single_line             5.03          2.54          2.14 digital         
10 single_line             7.96          2.52          1.71 fiber_optic     
# … with 234 more rows, and 4 more variables: contract <fct>,
#   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>

Great work! You successfully trained your recipe to be able to transform new data sources and applied it to the training and test datasets. Notice that the avg_call_mins and avg_intl_mins variables have been log transformed in the test dataset!

Numeric Predictors

Correlated predictor variables provide redundant information

Multicollinearity causes instability in machine learning optimization algorithms and can lead to model fitting errors

To preprocess correlated predictor variables, specify a recipe using the step_corr() function (note the two r's!): add the numeric predictors by name, separated by commas, and set a threshold (e.g. 0.9) on the absolute correlation

  • this removes variables whose pairwise correlation with another predictor exceeds ±0.9

Alternatively instead of selecting columns by name, we can use special functions

  • all_outcomes() selects the outcome variable

  • all_numeric() selects all numeric vars (including outcome if it is numeric)

can just do step_corr(all_numeric(), threshold = 0.9)

Another task: centering/scaling (“normalization”) - subtract mean and divide by sd (Z-score)

use step_normalize()

can have multiple steps, but order matters! For example, log-transform before normalizing: applying step_log() after step_normalize() would try to take the log of negative Z-scores

When we apply the recipe to the test data, the same steps trained on the training data are applied (e.g. step_corr, step_normalize, etc.)

Example

telecom_training %>% 
  # Select numeric columns
  select_if(is.numeric) %>% 
  # Calculate correlation matrix
  cor()

output

                    avg_data_gb avg_call_mins avg_intl_mins months_with_company
avg_data_gb           1.0000000    0.17885326    0.17028613          0.42756454
avg_call_mins         0.1788533    1.00000000    0.07508974          0.03392502
avg_intl_mins         0.1702861    0.07508974    1.00000000          0.24932929
months_with_company   0.4275645    0.03392502    0.24932929          1.00000000
monthly_charges       0.9573573    0.17455819    0.17021135          0.46235948
                    monthly_charges
avg_data_gb               0.9573573
avg_call_mins             0.1745582
avg_intl_mins             0.1702114
months_with_company       0.4623595
monthly_charges           1.0000000
# Specify a recipe object
telecom_cor_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>%
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8)

# Train the recipe
telecom_cor_rec_prep <- telecom_cor_rec %>% 
  prep(training = telecom_training)

# Apply to training data
telecom_cor_rec_prep %>% 
  bake(new_data = NULL)

output

# A tibble: 731 × 8
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line            10.3            262            55 fiber_optic     
 2 multiple_lines          8.05           328           122 digital         
 3 multiple_lines          9.4            312           147 fiber_optic     
 4 multiple_lines          9.96           340           136 fiber_optic     
 5 multiple_lines         10.2            402           116 fiber_optic     
 6 single_line             6.69           352            91 digital         
 7 single_line             9.37           382            87 fiber_optic     
 8 multiple_lines          4.11           371            64 digital         
 9 multiple_lines          5.17           341           119 digital         
10 multiple_lines          7.86           378           164 digital         
# … with 721 more rows, and 3 more variables: contract <fct>,
#   months_with_company <dbl>, canceled_service <fct>
# Apply to test data
telecom_cor_rec_prep %>% 
  bake(new_data = telecom_test)

output

# A tibble: 244 × 8
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line             7.78           497           127 fiber_optic     
 2 single_line             9.04           336            88 fiber_optic     
 3 multiple_lines          5.08           250           107 digital         
 4 single_line             9.3            326           114 fiber_optic     
 5 single_line             7.07           249            94 fiber_optic     
 6 multiple_lines         10.6            281           147 fiber_optic     
 7 single_line             8.67            93           131 fiber_optic     
 8 multiple_lines         11.0            390            78 fiber_optic     
 9 single_line             5.03           343           138 digital         
10 single_line             7.96           330            51 fiber_optic     
# … with 234 more rows, and 3 more variables: contract <fct>,
#   months_with_company <dbl>, canceled_service <fct>

Excellent! You have trained your recipe to remove all correlated predictors that exceed the 0.8 correlation threshold. Notice that your recipe found the high correlation between monthly_charges and avg_data_gb in the training data and when applied to the telecom_test data, it removed the monthly_charges column.

Multiple feature engineering steps

# Specify a recipe object
telecom_norm_rec <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric())

# Train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>% 
  prep(training = telecom_training)

# Apply to test data
telecom_norm_rec_prep %>% 
  bake(new_data = telecom_test)

output

# A tibble: 244 × 8
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line           -0.258        1.99          0.604  fiber_optic     
 2 single_line            0.398       -0.159        -0.629  fiber_optic     
 3 multiple_lines        -1.66        -1.31         -0.0283 digital         
 4 single_line            0.533       -0.292         0.193  fiber_optic     
 5 single_line           -0.628       -1.32         -0.439  fiber_optic     
 6 multiple_lines         1.24        -0.892         1.24   fiber_optic     
 7 single_line            0.205       -3.40          0.730  fiber_optic     
 8 multiple_lines         1.44         0.562        -0.945  fiber_optic     
 9 single_line           -1.69        -0.0651        0.951  digital         
10 single_line           -0.165       -0.239        -1.80   fiber_optic     
# … with 234 more rows, and 3 more variables: contract <fct>,
#   months_with_company <dbl>, canceled_service <fct>

Great job! When you applied your trained recipe to the telecom_test data, it removed the monthly_charges column, due to its large correlation with avg_data_gb, and normalized the numeric predictor variables!

Nominal Predictors

Nominal data are variables that have labels for categories with no set order (e.g. department within a company; native language; car you drive)

Nominal must often be transformed into numeric data for ML algorithms

“One-hot encoding” maps categorical values to a sequence of [0/1] indicator variables

  • each unique value gets its own indicator variable (e.g. category dummies)

  • having an indicator for the final value is redundant! (if a season is not Summer, Fall, or Winter, it must be Spring)

“Dummy variable encoding” excludes one value from the original set (e.g. omit Spring); you get n - 1 indicator variables (see the toy sketch after this list)

  • is the default in tidymodels
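
A minimal toy sketch (hypothetical season data, not from the course) contrasting the two encodings; step_dummy()'s one_hot argument switches between them:

toy_df <- tibble(season = factor(c("Spring", "Summer", "Fall", "Winter")),
                 y = c(1, 2, 3, 4))

# Dummy encoding (default): n - 1 indicator columns; the first factor level is dropped
recipe(y ~ season, data = toy_df) %>% 
  step_dummy(season) %>% 
  prep() %>% 
  bake(new_data = NULL)

# One-hot encoding: one indicator column per category
recipe(y ~ season, data = toy_df) %>% 
  step_dummy(season, one_hot = TRUE) %>% 
  prep() %>% 
  bake(new_data = NULL)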

Start by specifying a recipe() with the model formula, then pipe into step_dummy() where we select the variables; then pass to prep(training = training_data) and then to bake(new_data = test_data).

More robust way is to select columns by type, e.g. all_nominal() and all_outcomes(); can exclude outcome variable with -all_outcomes().

telecom_recipe_1 <- 
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training)  %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric())  %>% 
  # Create dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes())

# Train and apply telecom_recipe_1 on the test data
telecom_recipe_1 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)

output

# A tibble: 244 × 4
   avg_data_gb canceled_service contract_one_year contract_two_year
         <dbl> <fct>                        <dbl>             <dbl>
 1      -0.258 yes                              0                 0
 2       0.398 yes                              0                 0
 3      -1.66  yes                              1                 0
 4       0.533 no                               0                 0
 5      -0.628 yes                              0                 0
 6       1.24  no                               0                 1
 7       0.205 no                               0                 1
 8       1.44  no                               0                 1
 9      -1.69  no                               0                 1
10      -0.165 yes                              0                 0
# … with 234 more rows

Another recipe:

telecom_recipe_2 <- 
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training)  %>% 
  # Create dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes())  %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric(), -all_outcomes())

# Train and apply telecom_recipe_2 on the test data
telecom_recipe_2 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)

output

# A tibble: 244 × 4
   avg_data_gb canceled_service contract_one_year contract_two_year
         <dbl> <fct>                        <dbl>             <dbl>
 1      -0.258 yes                         -0.510            -0.471
 2       0.398 yes                         -0.510            -0.471
 3      -1.66  yes                          1.96             -0.471
 4       0.533 no                          -0.510            -0.471
 5      -0.628 yes                         -0.510            -0.471
 6       1.24  no                          -0.510             2.12 
 7       0.205 no                          -0.510             2.12 
 8       1.44  no                          -0.510             2.12 
 9      -1.69  no                          -0.510             2.12 
10      -0.165 yes                         -0.510            -0.471
# … with 234 more rows

Great job! Notice that telecom_recipe_1 produced [0, 1] values in the dummy variable columns while telecom_recipe_2 produced dummy variables which were then normalized! The predictor contract_two_year created by telecom_recipe_2 is -0.471 instead of 0 and 2.12 instead of 1 due to normalization. For model interpretation, it’s best to normalize variables before creating dummy variables. Also notice that since you only specified two predictor variables in your model formula, the rest of the columns are ignored by your recipe objects when transforming new data sources.
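
A quick back-calculation shows where those numbers come from: contract_two_year is a 0/1 indicator, and since 0 maps to about -0.47 and 1 to about 2.12, its training mean must be about 0.18 and its standard deviation about 1 / (2.12 + 0.47) ≈ 0.39, i.e. roughly 18% of training customers hold a two-year contract.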

# Create a recipe that predicts canceled_service using the training data
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  # Remove correlated predictors
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

# Train your recipe and apply it to the test data
telecom_recipe %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)

output

# A tibble: 244 × 9
   avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
         <dbl>         <dbl>         <dbl>               <dbl> <fct>           
 1      -0.258        1.99          0.604               -1.07  yes             
 2       0.398       -0.159        -0.629               -0.947 yes             
 3      -1.66        -1.31         -0.0283               0.771 yes             
 4       0.533       -0.292         0.193               -0.347 no              
 5      -0.628       -1.32         -0.439               -0.587 yes             
 6       1.24        -0.892         1.24                 0.811 no              
 7       0.205       -3.40          0.730                0.851 no              
 8       1.44         0.562        -0.945                1.45  no              
 9      -1.69        -0.0651        0.951                0.212 no              
10      -0.165       -0.239        -1.80                -1.31  yes             
# … with 234 more rows, and 4 more variables:
#   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
#   contract_one_year <dbl>, contract_two_year <dbl>

A Full Workflow

Preprocessing

telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  # Remove correlated predictors
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Log transform numeric predictors
  step_log(all_numeric(), base = 10) %>%
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>%
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

# Train recipe
telecom_recipe_prep <- telecom_recipe %>% 
    prep(training = telecom_training)

# Transform training data
telecom_training_prep <- telecom_recipe_prep %>% 
  bake(new_data = NULL)

# Transform test data
telecom_test_prep <- telecom_recipe_prep %>% 
  bake(new_data = telecom_test)

telecom_test_prep

output

# A tibble: 244 × 9
   avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
         <dbl>         <dbl>         <dbl>               <dbl> <fct>           
 1     -0.125         1.63           0.657            -0.748   yes             
 2      0.466        -0.0388        -0.490            -0.483   yes             
 3     -1.80         -1.30           0.121             0.757   yes             
 4      0.577        -0.167          0.319             0.198   no              
 5     -0.501        -1.31          -0.284            -0.00569 yes             
 6      1.11         -0.800          1.11              0.770   no              
 7      0.301        -5.51           0.754             0.784   no              
 8      1.26          0.596         -0.867             0.963   no              
 9     -1.84          0.0490         0.916             0.529   no              
10     -0.0346       -0.115         -2.20             -2.19    yes             
# … with 234 more rows, and 4 more variables:
#   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
#   contract_one_year <dbl>, contract_two_year <dbl>

Model Fitting

# Train logistic model
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ ., data = telecom_training_prep)

# Obtain class predictions
class_preds <- predict(logistic_fit, new_data = telecom_test_prep,
                   type = 'class')

# Obtain estimated probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test_prep,
                   type = 'prob')

# Combine test set results
telecom_results <- telecom_test_prep %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

telecom_results

output

# A tibble: 243 × 4
   canceled_service .pred_class .pred_yes .pred_no
   <fct>            <fct>           <dbl>    <dbl>
 1 no               no             0.222     0.778
 2 yes              yes            0.793     0.207
 3 yes              no             0.387     0.613
 4 yes              yes            0.500     0.500
 5 yes              no             0.407     0.593
 6 no               no             0.0473    0.953
 7 no               no             0.321     0.679
 8 no               no             0.0116    0.988
 9 yes              no             0.229     0.771
10 yes              no             0.312     0.688
# … with 233 more rows

Evaluate Performance

# Create a confusion matrix
telecom_results %>% 
  conf_mat(truth = canceled_service, estimate = .pred_class)

output

          Truth
Prediction yes  no
       yes  46  16
       no   35 146
# Calculate sensitivity
telecom_results %>% 
  sens(truth = canceled_service, estimate = .pred_class)

output

  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.568
# Calculate specificity
telecom_results %>% 
  spec(truth = canceled_service, estimate = .pred_class)

output

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 spec    binary         0.901
# Plot ROC curve
telecom_results %>% 
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()

Fantastic work! You have really come a long way in developing your modeling skills with tidymodels! From the results of your metric calculations, using feature engineering and incorporating all predictor variables increased your model’s sensitivity to 0.57, up from 0.42, and specificity to 0.901, up from 0.895!

Machine Learning Workflows (Decision Trees)

Combine models and recipes into a single workflow

Decision trees can segment predictor space into rectangular regions

A popular algorithm is recursive binary splitting, which segments the predictor space into non-overlapping regions

  • makes a series of horizontal or vertical cut points (“splits”) iteratively

  • decision tree will predict the majority class in each region to classify it

may be better than the linear decision boundaries of a logistic regression model

Can visualize with tree diagrams

  • interior nodes: the splits of a decision tree (dark boxes)

  • terminal nodes: model predictions at end of branches

Specified with decision_tree(); a common engine is rpart, and the mode can be classification or regression

dt_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

Previously we’ve set up a recipe to transform data for preprocessing (feature engineering). Now we have 2 R objects to manage:

  • parsnip model

  • recipe specification

The workflows package can combine these into a single workflow object. Initialize with

wkfl <- workflow() %>%
  add_model(dt_model) %>%
  add_recipe(our_recipe)

To train it, pipe it into last_fit() and provide a data split object

wkfl %>% 
  last_fit(split = split_df)

Then we can also collect_metrics() and collect_predictions()

Example

loans_df
# A tibble: 872 × 8
   loan_default loan_purpose       missed_payment_2_yr loan_amount interest_rate
   <fct>        <fct>              <fct>                     <int>         <dbl>
 1 no           debt_consolidation no                        25000          5.47
 2 yes          medical            no                        10000         10.2 
 3 no           small_business     no                        13000          6.22
 4 no           small_business     no                        36000          5.97
 5 yes          small_business     yes                       12000         11.8 
 6 yes          medical            no                        13000         13.2 
 7 no           debt_consolidation no                        10000         10.5 
 8 no           debt_consolidation no                        40000          7.97
 9 no           debt_consolidation yes                       16000          4.72
10 no           small_business     no                        12000          7.47
# … with 862 more rows, and 3 more variables: installment <dbl>,
#   annual_income <dbl>, debt_to_income <dbl>
# Create data split object
loans_split <- initial_split(loans_df, 
                  # prop not set, so the default 3/4 split is used
                  strata = loan_default)

# Build training data
loans_training <- loans_split %>% 
  training()

# Build test data
loans_test <- loans_split %>% 
  testing()
# Check for correlated predictors
loans_training %>% 
  # Select numeric columns
  select_if(is.numeric) %>%
  # Calculate correlation matrix
  cor()

output

               loan_amount interest_rate   installment annual_income
loan_amount      1.0000000 -0.0381763995  0.9361654779    0.32657454
interest_rate   -0.0381764  1.0000000000 -0.0001948258   -0.08485041
installment      0.9361655 -0.0001948258  1.0000000000    0.27947698
annual_income    0.3265745 -0.0848504133  0.2794769753    1.00000000
debt_to_income   0.1327948  0.1635109451  0.1943611185   -0.21036016
               debt_to_income
loan_amount         0.1327948
interest_rate       0.1635109
installment         0.1943611
annual_income      -0.2103602
debt_to_income      1.0000000

Great work! You have created your training and test datasets and discovered that loan_amount and installment are highly correlated predictor variables. To remove one of these predictors, you will have to incorporate step_corr() into your feature engineering pipeline for this data.

dt_model <- decision_tree() %>% 
  # Specify the engine
  set_engine("rpart") %>% 
  # Specify the mode
  set_mode("classification")
# Build feature engineering pipeline
loans_recipe <- recipe(loan_default ~ .,
                        data = loans_training) %>% 
  # Correlation filter
  step_corr(all_numeric(), threshold = 0.85) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>% 
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

loans_recipe

output

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          7

Operations:

Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()

Nice work! Now that you have your model and feature engineering steps specified, you can create a workflow object for model training.

# Create a workflow
loans_dt_wkfl <- workflow() %>% 
  # Include the model object
  add_model(dt_model) %>% 
  # Include the recipe object
  add_recipe(loans_recipe)

# View workflow specification
loans_dt_wkfl

output

loans_dt_wkfl
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

step_corr()
step_normalize()
step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Computational engine: rpart 
# Train the workflow
loans_dt_wkfl_fit <- loans_dt_wkfl %>% 
  last_fit(split = loans_split)
# Calculate performance metrics on test data
loans_dt_wkfl_fit %>% 
  collect_metrics()

output

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.744 Preprocessor1_Model1
2 roc_auc  binary         0.790 Preprocessor1_Model1

Good job! You have trained a workflow with last_fit() that created training and test datasets, trained and applied your recipe, fit your decision tree model to the training data and calculated performance metrics on the test data all with just a few lines of code! The model performed pretty well, with an area under the ROC curve of 0.79.

Improving Performance with Cross-Validation

We’ve been creating training and test datasets as the first step of the modeling process, which guards against overfitting.

  • training data used for model fitting

  • testing data used for model evaluation

One downside of this method is we only get one estimate of model performance

k-fold cross validation is a technique that provides K estimates of model performance, and is typically used to compare different model types (e.g. logistic regression and decision trees)

Training data is randomly partitioned into K sets of roughly equal size (“folds”), which are used to perform K iterations of model fitting and evaluation

Test data is left out of this process, so it can provide a final, independent estimate of model performance once a model is chosen

Ex: if we have 5 folds, we will have 5 iterations of model training and evaluation.

  • In 1st iteration, fold 1 reserved for model evaluation while others are for training

  • In 2nd iteration, fold 2 reserved for evaluation while others are for training

  • etc.

Use vfold_cv() to do this

set.seed(25) # randomize reproducibly
ex_folds <- vfold_cv(training_df, # training df
                     v = 10, # number of folds
                     strata = outcome_var) # Y variable

This returns a tibble with a list column named splits and an id column that identifies each fold. Each row holds a data split object with the instructions for partitioning that fold’s data into an analysis (training) set and an assessment (evaluation) set
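
A minimal sketch of inspecting a single fold, assuming the ex_folds object created above (analysis() and assessment() are rsample helpers that return the two partitions of a split):

# Pull the first fold's split object
first_split <- ex_folds$splits[[1]]

# Rows used for model fitting in this fold
first_split %>% analysis() %>% nrow()

# Rows held out for evaluation in this fold
first_split %>% assessment() %>% nrow()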

The fit_resamples() function trains a parsnip model or workflow object across the cross validation folds passed to its resamples argument, optionally using a custom metric set

ex_rs_fit <- our_wkfl %>% 
  fit_resamples(resamples = ex_folds,
                metrics = custom_metrics)

This estimates metrics for each fold. Piping ex_rs_fit into collect_metrics() returns the average of these estimates in the mean column, while collect_metrics(summarize = FALSE) provides the detailed per-fold metrics, which are convenient to summarize with dplyr:

rs_metrics <- ex_rs_fit %>% 
  collect_metrics(summarize = FALSE)

rs_metrics %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            max = max(.estimate),
            mean = mean(.estimate)) # etc...

Note that fit_resamples() cannot make predictions on new data sources. The purpose of cross validation is not to fit a final model, but to compare the performance of different model types and discover which works best for our data

Example

loans_training
# A tibble: 655 × 8
   loan_default loan_purpose       missed_payment_2_yr loan_amount interest_rate
   <fct>        <fct>              <fct>                     <int>         <dbl>
 1 yes          medical            no                        10000         10.2 
 2 no           small_business     no                        13000          6.22
 3 no           small_business     no                        36000          5.97
 4 yes          small_business     yes                       12000         11.8 
 5 yes          medical            no                        13000         13.2 
 6 no           debt_consolidation no                        10000         10.5 
 7 no           debt_consolidation no                        40000          7.97
 8 no           debt_consolidation yes                       16000          4.72
 9 no           home_improvement   no                         5000          9.97
10 yes          medical            no                        19000         11   
# … with 645 more rows, and 3 more variables: installment <dbl>,
#   annual_income <dbl>, debt_to_income <dbl>
# Create cross validation folds
set.seed(290)
loans_folds <- vfold_cv(loans_training, v = 5,
                   strata = loan_default)

loans_folds

output

loans_folds
#  5-fold cross-validation using stratification 
# A tibble: 5 × 2
  splits            id   
  <list>            <chr>
1 <split [523/132]> Fold1
2 <split [523/132]> Fold2
3 <split [524/131]> Fold3
4 <split [525/130]> Fold4
5 <split [525/130]> Fold5
# Create custom metrics function
loans_metrics <- metric_set(roc_auc, sens, spec)

# Fit resamples
loans_dt_rs <- loans_dt_wkfl %>% 
  fit_resamples(resamples = loans_folds,
                metrics = loans_metrics)

# View performance metrics
loans_dt_rs %>% 
  collect_metrics()

output

# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 roc_auc binary     0.846     5 0.00586 Preprocessor1_Model1
2 sens    binary     0.672     5 0.0156  Preprocessor1_Model1
3 spec    binary     0.876     5 0.00890 Preprocessor1_Model1

Excellent work! You have used cross validation to evaluate the performance of your decision tree workflow. Across the 5 cross validation folds, the average area under the ROC curve was 0.846. The average sensitivity and specificity were 0.672 and 0.876, respectively.

Cross Validation with Logistic Regression

# Create logistic model
logistic_model <- logistic_reg() %>% 
  # Specify the engine
  set_engine("glm") %>% 
  # Specify the mode
  set_mode("classification")

# Create workflow
loans_logistic_wkfl <- workflow() %>% 
  # Add model
  add_model(logistic_model) %>% 
  # Add recipe
  add_recipe(loans_recipe)

# Fit resamples
loans_logistic_rs <- loans_logistic_wkfl %>% 
  fit_resamples(resamples = loans_folds,
      metrics = loans_metrics)

# View performance metrics
loans_logistic_rs %>% 
  collect_metrics()

output

# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 roc_auc binary     0.848     5  0.0184 Preprocessor1_Model1
2 sens    binary     0.648     5  0.0312 Preprocessor1_Model1
3 spec    binary     0.873     5  0.0198 Preprocessor1_Model1

Great job! For logistic regression, across the 5 cross validation folds, the average area under the ROC curve was 0.848. The average sensitivity and specificity were 0.648 and 0.873, respectively. ROC AUC and specificity are very close to the decision tree cross validation results. However, the decision tree model performed slightly better on sensitivity, with an average value of 0.672.

Comparing model performance profiles

# Detailed cross validation results
dt_rs_results <- loans_dt_rs %>% 
  collect_metrics(summarize = FALSE)

# Explore model performance for decision tree
dt_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))

output

# A tibble: 3 × 4
  .metric   min median   max
  <chr>   <dbl>  <dbl> <dbl>
1 roc_auc 0.830  0.842 0.865
2 sens    0.64   0.66  0.725
3 spec    0.85   0.875 0.901
# Detailed cross validation results
logistic_rs_results <- loans_logistic_rs %>% 
  collect_metrics(summarize = FALSE)

# Explore model performance for logistic regression
logistic_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))

output

# A tibble: 3 × 4
  .metric   min median   max
  <chr>   <dbl>  <dbl> <dbl>
1 roc_auc 0.810  0.836 0.898
2 sens    0.549  0.68  0.725
3 spec    0.838  0.862 0.95 

Great job! Both models have similar average values across all metrics. However, logistic regression tends to have a wider range of values on all metrics. This provides evidence that a decision tree model may produce more stable prediction accuracy on the loans dataset.
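
To put a number on that spread, one option is to compute the range of each metric for both models; a hedged sketch using the dt_rs_results and logistic_rs_results objects above:

# Range (max - min) of each metric across folds, per model
bind_rows(
  dt_rs_results %>% mutate(model = "decision_tree"),
  logistic_rs_results %>% mutate(model = "logistic_reg")
) %>% 
  group_by(model, .metric) %>% 
  summarize(range = max(.estimate) - min(.estimate),
            .groups = "drop")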

Hyperparameter Tuning

Hyperparameters are model parameters that are set prior to model training and control model complexity

In parsnip, decision trees have 3 hyperparameters:

  1. cost_complexity: a penalty applied to trees with a large number of terminal nodes
    • by default: 0.01
  2. tree_depth: controls how long the path from the root to any terminal node can be
    • by default: 30
  3. min_n: the minimum number of data points required in a node for further splitting
    • by default: 20

Changing these values might improve overall performance; hyperparameter tuning uses cross validation to find an optimal set of hyperparameter values
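
For reference, a minimal sketch of setting these hyperparameters manually in a parsnip specification (the values below are hypothetical, purely for illustration):

# Hypothetical manual hyperparameter values (not tuned or recommended)
dt_manual_model <- decision_tree(cost_complexity = 0.001,
                                 tree_depth = 10,
                                 min_n = 30) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")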

  • tune() function (from tune package) is used to label model hyperparameters

  • in a parsnip model, set the hyperparameter values to tune()

dt_tune_model <- decision_tree(cost_complexity = tune(),
                               tree_depth = tune(),
                               min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

To create a tuning workflow, update the previous workflow (which contains the feature engineering steps and the decision tree model with default parameters) by piping it into update_model() and providing the new decision tree specification with tuning parameters:

my_wkflw %>% update_model(dt_tune_model)

Hyperparameter tuning is accomplished using grid search, a method where a grid of hyperparameter values is generated

For each combination, cross validation is used to estimate model performance, and the best performing combination is chosen. The parameters() function from the dials package can be used to identify the hyperparameters in a parsnip model object.
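
For instance, a minimal sketch using the dt_tune_model specification above:

# List the tunable hyperparameters in the model specification
# (newer tidymodels releases also offer extract_parameter_set_dials())
parameters(dt_tune_model)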

A popular approach is to generate random combinations of values: since there can be an infinite number of combinations, choosing at random gives us a greater chance of discovering a near-optimal one.

To create a random grid, use grid_random(parameters(dt_tune_model), size = 5). The first argument is the result of the parameters() function; size is the number of random combinations to generate

Save the results as dt_grid, then pipe the tuning workflow into tune_grid(resamples = our_folds, grid = dt_grid, metrics = our_metrics)

This returns a tibble of results with a .metrics list column that holds the results for each fold

Use collect_metrics() to summarize the results; by default, the estimated values are averaged across all folds for each combination

Each row is the average of performance across the n folds

Example

# Set tuning hyperparameters
dt_tune_model <- decision_tree(cost_complexity = tune(),
                               tree_depth = tune(),
                               min_n = tune()) %>% 
  # Specify engine
  set_engine("rpart") %>% 
  # Specify mode
  set_mode("classification")

dt_tune_model

output

Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart 
# Create a tuning workflow
loans_tune_wkfl <- loans_dt_wkfl %>% 
  # Replace model
  update_model(dt_tune_model)

loans_tune_wkfl

output

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

step_corr()
step_normalize()
step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart 

Random grid search

# Hyperparameter tuning with grid search
set.seed(214)
dt_grid <- grid_random(parameters(dt_tune_model),
               size = 5)

dt_grid

output

# A tibble: 5 × 3
  cost_complexity tree_depth min_n
            <dbl>      <int> <int>
1    0.0000000758         14    39
2    0.0243                5    34
3    0.00000443           11     8
4    0.000000600           3     5
5    0.00380               5    36
# Hyperparameter tuning
dt_tuning <- loans_tune_wkfl %>% 
  tune_grid(resamples = loans_folds,
            grid = dt_grid,
            metrics = loans_metrics)

# View summarized tuning results (averaged across folds)
dt_tuning %>% 
  collect_metrics()

output

# A tibble: 15 × 9
   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
 1    0.0000000758         14    39 roc_auc binary     0.840     5  0.0130
 2    0.0000000758         14    39 sens    binary     0.672     5  0.0281
 3    0.0000000758         14    39 spec    binary     0.848     5  0.0185
 4    0.0243                5    34 roc_auc binary     0.743     5  0.0227
 5    0.0243                5    34 sens    binary     0.585     5  0.0375
 6    0.0243                5    34 spec    binary     0.908     5  0.0222
 7    0.00000443           11     8 roc_auc binary     0.791     5  0.0122
 8    0.00000443           11     8 sens    binary     0.664     5  0.0136
 9    0.00000443           11     8 spec    binary     0.848     5  0.0200
10    0.000000600           3     5 roc_auc binary     0.792     5  0.0230
11    0.000000600           3     5 sens    binary     0.589     5  0.0423
12    0.000000600           3     5 spec    binary     0.920     5  0.0138
13    0.00380               5    36 roc_auc binary     0.838     5  0.0121
14    0.00380               5    36 sens    binary     0.652     5  0.0321
15    0.00380               5    36 spec    binary     0.883     5  0.0115
# … with 1 more variable: .config <chr>

Good work! Since you have 5 random hyperparameter combinations and 3 performance metrics, there are 15 results in your summarized tuning results. Each row shows the average of the 5 cross validation estimates of each metric and hyperparameter combination.

Exploring tuning results

# Collect detailed tuning results
dt_tuning_results <- dt_tuning %>% 
  collect_metrics(summarize = FALSE)

dt_tuning_results

output

# A tibble: 75 × 8
   id    cost_complexity tree_depth min_n .metric .estimator .estimate .config
   <chr>           <dbl>      <int> <int> <chr>   <chr>          <dbl> <chr>  
 1 Fold1    0.0000000758         14    39 sens    binary         0.765 Model1 
 2 Fold1    0.0000000758         14    39 spec    binary         0.877 Model1 
 3 Fold1    0.0000000758         14    39 roc_auc binary         0.880 Model1 
 4 Fold2    0.0000000758         14    39 sens    binary         0.647 Model1 
 5 Fold2    0.0000000758         14    39 spec    binary         0.864 Model1 
 6 Fold2    0.0000000758         14    39 roc_auc binary         0.822 Model1 
 7 Fold3    0.0000000758         14    39 sens    binary         0.647 Model1 
 8 Fold3    0.0000000758         14    39 spec    binary         0.788 Model1 
 9 Fold3    0.0000000758         14    39 roc_auc binary         0.804 Model1 
10 Fold4    0.0000000758         14    39 sens    binary         0.6   Model1 
# … with 65 more rows
# Explore detailed ROC AUC results for each fold
dt_tuning_results %>% 
  filter(.metric == 'roc_auc') %>% 
  group_by(id) %>% 
  summarize(min_roc_auc = min(.estimate),
            median_roc_auc = median(.estimate),
            max_roc_auc = max(.estimate))

output

# A tibble: 5 × 4
  id    min_roc_auc median_roc_auc max_roc_auc
  <chr>       <dbl>          <dbl>       <dbl>
1 Fold1       0.831          0.868       0.880
2 Fold2       0.708          0.803       0.822
3 Fold3       0.714          0.792       0.805
4 Fold4       0.736          0.777       0.848
5 Fold5       0.725          0.770       0.851

Excellent work! You have now had the chance to explore the detailed results of your decision tree hyperparameter tuning. The next step will be selecting the best combination and finalizing your workflow object!

We want to check whether the estimated values are fairly consistent across folds; wild fluctuations are an indicator of model overfitting!

To make exploring tuning results easier, use the show_best() function to display the top n performing combinations, ranked by the average value of the chosen metric: dt_tuning %>% show_best(metric = 'roc_auc', n = 5). The .config column identifies which model had the largest average ROC AUC.

The select_best() function selects the best performing hyperparameters: dt_tuning %>% select_best(metric = "roc_auc"). It returns a one-row tibble with the hyperparameter values that produced the largest average value of the metric (e.g. Model1).

These results can be used to finalize a workflow with finalize_workflow(): pipe the tuning workflow into finalize_workflow(), passing the selected hyperparameters:

leads_tune_wkfl %>% finalize_workflow(best_dt_model)

Once the workflow is finalized, it can be trained with last_fit() and the original data split object. Behind the scenes:

  1. training and test datasets created
  2. recipe trained and applied
  3. tuned decision tree trained with entire training dataset
  4. predictions and metrics on test dataset

In general, having similar performance between cross-validation and the test dataset implies model will perform similarly on new data sources.

# Display 5 best performing models
dt_tuning %>% 
  show_best(metric = "roc_auc", n = 5)

output

# A tibble: 5 × 9
  cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
            <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
1    0.0000000758         14    39 roc_auc binary     0.840     5  0.0130
2    0.00380               5    36 roc_auc binary     0.838     5  0.0121
3    0.000000600           3     5 roc_auc binary     0.792     5  0.0230
4    0.00000443           11     8 roc_auc binary     0.791     5  0.0122
5    0.0243                5    34 roc_auc binary     0.743     5  0.0227
# … with 1 more variable: .config <chr>
# Select based on best performance
best_dt_model <- dt_tuning %>% 
  # Choose the best model based on roc_auc
  select_best(metric = 'roc_auc')

best_dt_model

output

# A tibble: 1 × 4
  cost_complexity tree_depth min_n .config
            <dbl>      <int> <int> <chr>  
1    0.0000000758         14    39 Model1 
# Finalize your workflow
final_loans_wkfl <- loans_tune_wkfl %>% 
  finalize_workflow(best_dt_model)

final_loans_wkfl

output

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

step_corr()
step_normalize()
step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 7.58290839567418e-08
  tree_depth = 14
  min_n = 39

Computational engine: rpart 

Good job! When you printed your finalized workflow object, the optimal hyperparameter combination is displayed in the main arguments section of the output. Your workflow is now ready for model fitting and prediction on new data sources!

# Train finalized decision tree workflow
loans_final_fit <- final_loans_wkfl %>% 
  last_fit(split = loans_split)

# View performance metrics
loans_final_fit %>% 
  collect_metrics()

output

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.802 Preprocessor1_Model1
2 roc_auc  binary         0.875 Preprocessor1_Model1
# Create an ROC curve
loans_final_fit %>% 
  # Collect predictions
  collect_predictions() %>%
  # Calculate ROC curve metrics
  roc_curve(truth = loan_default, .pred_yes) %>%
  # Plot the ROC curve
  autoplot()

[Figure: ROC curve for the tuned decision tree model on the test data]

Great job! You were able to train your finalized workflow with last_fit() and generate predictions on the test data. The tuned decision tree model produced an area under the ROC curve of 0.875. That’s a great model! The ROC curve shows that the sensitivity and specificity remain high across a wide range of probability threshold values.
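
As a complement to the ROC curve, a minimal sketch of a confusion matrix at the default 0.5 probability threshold, using the .pred_class column returned by collect_predictions():

# Confusion matrix on the test data at the default classification threshold
loans_final_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = loan_default, estimate = .pred_class)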