── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
Splitting
The rsample package is designed to create training and test datasets. Creating a test dataset is important for estimating how a trained model will likely perform on new data. It also guards against overfitting, where a model memorizes patterns that exist only in the training data and performs poorly on new data.
Create an rsample object, home_split, that contains the instructions for randomly splitting the home_sales data into a training and test dataset.
Allocate 70% of the data into training and stratify the results by selling_price.
# Create a data split object
home_split <- initial_split(home_sales, prop = 0.7, strata = selling_price)
# Create the training data
home_training <- home_split %>% training()

# Create the test data
home_test <- home_split %>% testing()

# Check number of rows in each dataset
nrow(home_training)
nrow(home_test)
Excellent work! The minimum and maximum selling prices in both datasets are the same. The mean and standard deviation are also similar. Stratifying by the outcome variable ensures the model fitting process is performed on a representative sample of the original data.
home_training %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))

# A tibble: 1 × 4
  min_sell_price max_sell_price mean_sell_price sd_sell_price
           <dbl>          <dbl>           <dbl>         <dbl>
1         350000         650000         479156.        81201.

# Distribution of selling_price in test data
home_test %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))

# A tibble: 1 × 4
  min_sell_price max_sell_price mean_sell_price sd_sell_price
           <dbl>          <dbl>           <dbl>         <dbl>
1         350000         650000         478920.        80552.
Linear Regression
# Initialize a linear regression object, linear_model
linear_model <- linear_reg() %>% 
  # Set the model engine
  set_engine("lm") %>% 
  # Set the model mode
  set_mode("regression")

# Fit the model using the training data
lm_fit <- linear_model %>% 
  fit(selling_price ~ home_age + sqft_living,
      data = home_training)

# Print lm_fit to view model information
lm_fit
Output:
parsnip model object

Fit time:  1ms 

Call:
stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)

Coefficients:
(Intercept)     home_age  sqft_living  
   291102.4      -1577.4        104.1  
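The predictions tibble below isn't preceded by its code in these notes; presumably it came from calling predict() on the test data, roughly as in this sketch (assumes lm_fit and home_test from above):

# Predict selling prices for the test data
home_predictions <- predict(lm_fit, new_data = home_test)

home_predictions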
# A tibble: 450 × 1
     .pred
     <dbl>
 1 539701.
 2 633473.
 3 429715.
 4 409971.
 5 489075.
 6 532008.
 7 402084.
 8 484440.
 9 449362.
10 535602.
# … with 440 more rows
# Combine test data with predictions
home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  bind_cols(home_predictions)

# View results
home_test_results
Output
# A tibble: 450 × 4
   selling_price home_age sqft_living   .pred
           <dbl>    <dbl>       <dbl>   <dbl>
 1        487000       10        2540 539701.
 2        635000        4        3350 633473.
 3        495000       21        1650 429715.
 4        355000       19        1430 409971.
 5        464950       19        2190 489075.
 6        535000        3        2360 532008.
 7        356000       24        1430 402084.
 8        525000       16        2100 484440.
 9        552321       29        1960 449362.
10        485000        6        2440 535602.
# … with 440 more rows
Evaluating Performance
Use the yardstick package; it requires a tibble of model results.
Use RMSE (root mean squared error) and R-squared:
rmse(home_test_results, truth = selling_price, estimate = .pred)
rsq(home_test_results, truth = selling_price, estimate = .pred)
An R-squared plot puts the actual values on the x-axis and the predictions on the y-axis.
The diagonal reference line is the y = x line (perfect prediction).
Helpful for spotting nonlinear patterns, regions where the model is predicting poorly, etc.
last_fit() takes a model specification, model formula, and data split object to (see the sketch after this list):
create training and test datasets
fit the model to the training data
calculate goodness-of-fit metrics and predictions on the test data
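A minimal sketch of that pattern for the home sales model, assuming the linear_model specification and home_split object from above (lm_last_fit is an illustrative name; predictions_df is the tibble used in the plot code below):

# Fit on the training data and evaluate on the test data in one step
lm_last_fit <- linear_model %>% 
  last_fit(selling_price ~ home_age + sqft_living, split = home_split)

# Goodness-of-fit metrics on the test data
lm_last_fit %>% collect_metrics()

# Test set predictions for plotting
predictions_df <- lm_last_fit %>% collect_predictions()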
# Make an R squared plot using predictions_df
ggplot(predictions_df, aes(x = selling_price, y = .pred)) + 
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) + 
  coord_obs_pred() + 
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')
Classification Models
Goal: create distinct, non-overlapping regions along the range of predictor variable values, i.e. predict the same categorical outcome within each region (on a graph)
Ex: logistic regression, a popular classification algorithm that creates a linear separation between outcome categories
First step is again to resample and split data into training and testing data
initial_split(df, prop = , strata = )
stratify by the outcome variable (0/1, yes/no, etc.), e.g. purchased
then pass to training() and testing() functions
logistic_reg() function, with set_engine("glm") and set_mode("classification")
then fit(y ~ x1 + x2, data = training_data) to train on the training data
for prediction, pass the test data to the new_data argument of predict() and set type to "class" (classification):
predict(model_fit, new_data = the_testing_data, type = "class")
when type is 'class', returns a tibble with a factor column called .pred_class
setting type = 'prob' returns the estimated probability of each outcome category for each row (observation) in the test data
columns are named .pred_{outcome_category} (e.g. .pred_yes, .pred_no)
using yardstick to evaluate performance, need a tibble of results
telecom_df
# A tibble: 975 × 9
   canceled_service cellular_service avg_data_gb avg_call_mins avg_intl_mins
   <fct>            <fct>                  <dbl>         <dbl>         <dbl>
 1 yes              single_line             7.78           497           127
 2 yes              single_line             9.04           336            88
 3 no               single_line            10.3            262            55
 4 yes              multiple_lines          5.08           250           107
 5 no               multiple_lines          8.05           328           122
 6 no               single_line             9.33           261            14
 7 yes              multiple_lines          8.01           525            97
 8 no               multiple_lines          9.43           121            47
 9 yes              single_line             5.29           417            96
10 no               multiple_lines          9.96           340           136
# … with 965 more rows, and 4 more variables: internet_service <fct>,
#   contract <fct>, months_with_company <dbl>, monthly_charges <dbl>
Code
# Create data split object
telecom_split <- initial_split(telecom_df, prop = 0.75,
                               strata = canceled_service)

# Create the training data
telecom_training <- telecom_split %>% training()

# Create the test data
telecom_test <- telecom_split %>% testing()

# Check the number of rows
nrow(telecom_training)
nrow(telecom_test)
# Specify a logistic regression model
logistic_model <- logistic_reg() %>% 
  # Set the engine
  set_engine("glm") %>% 
  # Set the mode
  set_mode("classification")

# Print the model specification
logistic_model
Output
# Print the model specification
logistic_model

Logistic Regression Model Specification (classification)

Computational engine: glm
Fit model
# Fit to training data
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
      data = telecom_training)

# Print model fit object
logistic_fit
Output
# Print model fit object
logistic_fit

parsnip model object

Fit time:  5ms 

Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
    monthly_charges, family = stats::binomial, data = data)

Coefficients:
    (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
       1.837242        -0.010424         0.022114         0.002802  

Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
Null Deviance:     932.4 
Residual Deviance: 806.7 	AIC: 814.7
Predictions
# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = telecom_test,
                       type = "class")

# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit, new_data = telecom_test, 
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

# View results tibble
telecom_results
Output
telecom_results
# A tibble: 243 × 4
   canceled_service .pred_class .pred_yes .pred_no
   <fct>            <fct>           <dbl>    <dbl>
 1 no               no              0.376    0.624
 2 yes              yes             0.755    0.245
 3 yes              yes             0.550    0.450
 4 yes              yes             0.567    0.433
 5 yes              no              0.362    0.638
 6 no               no              0.206    0.794
 7 no               no              0.172    0.828
 8 no               no              0.117    0.883
 9 yes              yes             0.515    0.485
10 yes              no              0.194    0.806
# … with 233 more rows
Assessing Model Fit
Yardstick: for a binary outcome, the truth column must be a factor whose first level is the positive (“yes”) class. Can check with levels().
Confusion Matrix compares combinations of actual and predicted outcome values
Correct predictions:
True positive (TP)
True negative (TN)
Classification errors:
False positive (FP)
False negative (FN)
Creating a confusion matrix with yardstick’s conf_mat() needs tibble of model results with
true outcome values
predicted outcome categories .pred_class
estimated probabilities of each category .pred_yes and .pred_no
conf_mat(df_results, truth = outcome, estimate = .pred_class)
the accuracy() function takes the same inputs as conf_mat() and calculates the proportion of all estimates that are correct classifications:
Accuracy may not be the best metric, especially when the outcome classes are imbalanced
sensitivity measures proportion of all positive cases that were correctly classified
uses sens() function with same inputs as accuracy().
Specificity measures proportion of all negative cases that were correctly classified
uses spec() function
(1 - specificity is the “false positive rate”: the proportion of actual negatives that were incorrectly classified as positive)
Instead of calculating these one by one, can use the metric_set() function and specify which metrics we want (by name), e.g. metric_set(accuracy, sens, spec)
Also, passing conf_mat() results into summary() will calculate all of these at once!
Example
telecom_results
# A tibble: 243 × 4
   canceled_service .pred_class .pred_yes .pred_no
   <fct>            <fct>           <dbl>    <dbl>
 1 no               no              0.376    0.624
 2 yes              yes             0.755    0.245
 3 yes              yes             0.550    0.450
 4 yes              yes             0.567    0.433
 5 yes              no              0.362    0.638
 6 no               no              0.206    0.794
 7 no               no              0.172    0.828
 8 no               no              0.117    0.883
 9 yes              yes             0.515    0.485
10 yes              no              0.194    0.806
# … with 233 more rows
# Calculate the confusion matrix
conf_mat(telecom_results, truth = canceled_service,
         estimate = .pred_class)
Output
conf_mat(telecom_results, truth = canceled_service,
         estimate = .pred_class)

          Truth
Prediction yes  no
       yes  34  17
       no   47 145
# Calculate the accuracy
accuracy(telecom_results, truth = canceled_service,
         estimate = .pred_class)
Output
accuracy(telecom_results, truth = canceled_service,
         estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.737
# Calculate the sensitivity
sens(telecom_results, truth = canceled_service,
     estimate = .pred_class)
Output
sens(telecom_results, truth = canceled_service,
     estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.420
# Calculate the specificity
spec(telecom_results, truth = canceled_service,
     estimate = .pred_class)
Output
spec(telecom_results, truth = canceled_service,
     estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 spec    binary         0.895
The specificity of your logistic regression model is 0.895, which is more than double the sensitivity of 0.42. This indicates that your model is much better at detecting customers who will not cancel their telecommunications service versus the ones who will.
# Create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)

# Calculate metrics using model results tibble
telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Pass to the summary() function
  summary()
Can plot a conf_mat() object with autoplot(), setting type = 'heatmap'. Setting type = 'mosaic' creates a mosaic plot that visualizes sensitivity and specificity.
Probability thresholds: the default in binary classification is 0.5. If the estimated probability of the positive (1, yes) outcome for a case exceeds the threshold, that case is classified as positive (1, yes).
How does a model perform across a range of thresholds? Treat each unique probability in the .pred_yes column of the test results as a candidate threshold and calculate the specificity and sensitivity at each one.
The Receiver Operating Characteristic (ROC) curve visualizes performance across thresholds. It plots 1 - specificity on the x-axis and sensitivity on the y-axis.
For each unique threshold, a point representing sensitivity vs. 1 - specificity is plotted, i.e. the proportion of actual positives classified correctly vs. the proportion of actual negatives classified incorrectly.
A step function
Optimal point is (0,1). Diagonal line (y=x, i.e. sensitivity = 1 - specificity) indicates poor performance - predicts outcomes based on random coin flip
Area under ROC curve (AUC) summarizes performance in a single number. Useful interpretation as a letter grade:
A: [0.9, 1]
B: [0.8, 0.9)
C: [0.7, 0.8)
D: [0.6, 0.7)
F: [0.5, 0.6)
roc_curve(results, truth = outcome, .pred_yes) (the argument after truth is the estimated probability of the yes category from the results)
returns a tibble with the specificity and sensitivity at every unique threshold in .pred_yes.
Then pipe roc_curve(…) %>% autoplot(). Then calculate roc_auc(results, truth = outcome, .pred_yes).
Example
# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a heat map
  autoplot(type = 'heatmap')
[image: confusion matrix heat map]
# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a mosaic plot
  autoplot(type = 'mosaic')
[image: confusion matrix mosaic plot]
Great job! The mosaic plot clearly shows that your logistic regression model performs much better in terms of specificity than sensitivity. You can see that in the yes column, a large proportion of outcomes were incorrectly predicted as no.
Create a tibble, threshold_df, which contains the sensitivity and specificity of your classification model across the unique probability thresholds in telecom_results.
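A sketch of the roc_curve() call that would produce this tibble, assuming the telecom_results tibble from above:

# Calculate sensitivity and specificity across all unique thresholds
threshold_df <- telecom_results %>% 
  roc_curve(truth = canceled_service, .pred_yes)

threshold_df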
# A tibble: 245 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1   -Inf         0                 1
 2      0.0220    0                 1
 3      0.0439    0.00617           1
 4      0.0538    0.0123            1
 5      0.0548    0.0185            1
 6      0.0608    0.0247            1
 7      0.0706    0.0309            1
 8      0.0756    0.0370            1
 9      0.0774    0.0432            1
10      0.0783    0.0494            1
# … with 235 more rows
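The ROC curve and AUC referenced below can then be produced along these lines (a sketch):

# Plot the ROC curve
threshold_df %>% autoplot()

# Area under the ROC curve
roc_auc(telecom_results, truth = canceled_service, .pred_yes)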
Nice work! The area under the ROC curve is 0.76. This indicates that your model gets a C in terms of overall performance. This is mainly due to the low sensitivity of the model.
Streamlining Workflow
The last_fit() function speeds up the modeling process: it fits the model to the training data and produces predictions on the test data.
We need a data split object and a specified model, then provide the model formula and split = our split object. Then pipe the result into collect_metrics() to see how the model performed on the test data. The default metrics are accuracy and roc_auc.
Can also pipe into collect_predictions().
When making a custom metric set, it must include roc_auc. Note that most functions like spec(), sens(), etc. need truth = and estimate = .pred_class, while roc_auc needs truth = and the probability column .pred_yes. So when calling the custom metric set on a results tibble, pass both: custom_metrics(results, truth = outcome, estimate = .pred_class, .pred_yes) works for all of them.
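A minimal sketch of that pattern, reusing the telecom names from earlier (the metric set name here is illustrative, not the course's exact code):

# Custom metric set for last_fit(); must include roc_auc
last_fit_metrics <- metric_set(roc_auc, sens, spec)

logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split,
           metrics = last_fit_metrics) %>% 
  collect_metrics()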
Example
# Train model with last_fit()
telecom_last_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

# View test set metrics
telecom_last_fit %>% collect_metrics()
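The tibble shown next is the collected predictions rather than the metrics output; presumably it was produced with:

# Collect test set predictions from the last_fit() object
telecom_last_fit %>% collect_predictions()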
# A tibble: 243 × 7
   id         .pred_yes .pred_no  .row .pred_class canceled_service .config     
   <chr>          <dbl>    <dbl> <int> <fct>       <fct>            <chr>       
 1 train/tes…     0.376    0.624     3 no          no               Preprocesso…
 2 train/tes…     0.755    0.245     7 yes         yes              Preprocesso…
 3 train/tes…     0.550    0.450     9 yes         yes              Preprocesso…
 4 train/tes…     0.567    0.433    12 yes         yes              Preprocesso…
 5 train/tes…     0.362    0.638    17 no          yes              Preprocesso…
 6 train/tes…     0.206    0.794    25 no          no               Preprocesso…
 7 train/tes…     0.172    0.828    26 no          no               Preprocesso…
 8 train/tes…     0.117    0.883    28 no          no               Preprocesso…
 9 train/tes…     0.515    0.485    31 yes         yes              Preprocesso…
10 train/tes…     0.194    0.806    38 no          yes              Preprocesso…
# … with 233 more rows
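The ROC curve referenced below can be drawn from these collected predictions; a sketch (the 0.837 AUC mentioned next comes from a refit that also includes months_with_company):

# Plot the ROC curve from the test set predictions
telecom_last_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()

# Area under the ROC curve
telecom_last_fit %>% 
  collect_predictions() %>% 
  roc_auc(truth = canceled_service, .pred_yes)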
Excellent work! The ROC curve shows that the logistic regression model performs better than a model that guesses at random (the dashed line in the plot). Adding the months_with_company predictor variable increased your area under the ROC curve from 0.76 in your previous model to 0.837!
Feature Engineering
Using recipes package.
Process of transforming data into a format suitable for machine learning algorithms. Examples:
assigning variable roles to columns in data
defining preprocessing tasks & data transformations
training these preprocessing steps on our data
applying them to new data sources
First step: assigning variable roles using recipe(). Each column is either:
outcome
predictor
And determine the type of data
numeric
categorical
Next, define preprocessing steps using step_*() functions, for purposes such as:
missing data imputation
centering/transforming numeric variables
creating new variables (e.g. ratios of existing variables)
Then train the recipe on the training data using prep(), which estimates quantities like the mean and sd of numeric columns (for centering and scaling) and stores the formulas for new columns
Finally, apply these preprocessing transformations to the data with bake():
training and testing data for modeling
predict using new data
machine learning algorithms require data to be in same format as the training data
Example
Log-transforming total_time variable
Log compresses the range of data values and reduces variability
# specify purchased is response and all else is predictor
leads_log_rec <- recipe(purchased ~ .,
                        # determines data types
                        data = leads_training) %>% 
  # now transform variable
  step_log(total_time, base = 10)  # log base 10
printing leads_log_rec will display number of outcome and predictor vars, as well as the preprocessing operations
When piped into summary(), returns a tibble with variable information
Printing the prepped recipe shows which operations were successfully trained
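The prepped recipe referenced below was presumably created with prep(); a sketch, assuming the leads_training data from above:

# Train the recipe on the training data
leads_log_rec_prep <- leads_log_rec %>% 
  prep(training = leads_training)

leads_log_rec_prep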
To apply recipe to new or existing data, use bake()
leads_log_rec_prep %>% 
  bake(new_data = NULL)  # which data to apply the trained recipe to
Since we used leads_training to train the recipe, the transformed training data is retained by default by prep(). Setting new_data = NULL returns this preprocessed training data.
To transform the test data, pass it to the new_data argument
leads_log_rec_prep %>% 
  bake(new_data = leads_test)
Example
telecom_rec <- recipe(canceled_service ~ .,
                      data = telecom_df) %>% 
  step_log(avg_call_mins, base = 10)

# print the recipe
telecom_rec
output
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Operations:

Log transformation on avg_call_mins

# summarize recipe
telecom_rec %>% summary()
# A tibble: 9 × 4
  variable            type    role      source  
  <chr>               <chr>   <chr>     <chr>   
1 cellular_service    nominal predictor original
2 avg_data_gb         numeric predictor original
3 avg_call_mins       numeric predictor original
4 avg_intl_mins       numeric predictor original
5 internet_service    nominal predictor original
6 contract            nominal predictor original
7 months_with_company numeric predictor original
8 monthly_charges     numeric predictor original
9 canceled_service    nominal outcome   original
# Specify feature engineering recipe
telecom_log_rec <- recipe(canceled_service ~ ., 
                          data = telecom_training) %>% 
  # Add log transformation step for numeric predictors
  step_log(avg_call_mins, avg_intl_mins, base = 10)

# Print recipe object
telecom_log_rec
output
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Operations:

Log transformation on avg_call_mins, avg_intl_mins
# View variable roles and data types
telecom_log_rec %>% summary()
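The trained recipe printed below was presumably created with prep(); a sketch:

# Train the recipe on the training data
telecom_log_rec_prep <- telecom_log_rec %>% 
  prep(training = telecom_training)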
# View results
telecom_log_rec_prep

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          8

Training data contained 731 data points and no missing data.

Operations:

Log transformation on avg_call_mins, avg_intl_mins [trained]
# Apply to training data
telecom_log_rec_prep %>% 
  bake(new_data = NULL)
output
# A tibble: 731 × 9
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line            10.3           2.42          1.74 fiber_optic     
 2 multiple_lines          8.05          2.52          2.09 digital         
 3 multiple_lines          9.4           2.49          2.17 fiber_optic     
 4 multiple_lines          9.96          2.53          2.13 fiber_optic     
 5 multiple_lines         10.2           2.60          2.06 fiber_optic     
 6 single_line             6.69          2.55          1.96 digital         
 7 single_line             9.37          2.58          1.94 fiber_optic     
 8 multiple_lines          4.11          2.57          1.81 digital         
 9 multiple_lines          5.17          2.53          2.08 digital         
10 multiple_lines          7.86          2.58          2.21 digital         
# … with 721 more rows, and 4 more variables: contract <fct>,
#   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
# Apply to test data
telecom_log_rec_prep %>% 
  bake(new_data = telecom_test)
output
# A tibble: 244 × 9
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line             7.78          2.70          2.10 fiber_optic     
 2 single_line             9.04          2.53          1.94 fiber_optic     
 3 multiple_lines          5.08          2.40          2.03 digital         
 4 single_line             9.3           2.51          2.06 fiber_optic     
 5 single_line             7.07          2.40          1.97 fiber_optic     
 6 multiple_lines         10.6           2.45          2.17 fiber_optic     
 7 single_line             8.67          1.97          2.12 fiber_optic     
 8 multiple_lines         11.0           2.59          1.89 fiber_optic     
 9 single_line             5.03          2.54          2.14 digital         
10 single_line             7.96          2.52          1.71 fiber_optic     
# … with 234 more rows, and 4 more variables: contract <fct>,
#   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
Great work! You successfully trained your recipe to be able to transform new data sources and applied it to the training and test datasets. Notice that the avg_call_mins and avg_intl_mins variables have been log transformed in the test dataset!
Numeric Predictors
Correlated predictor variables provide redundant information
Multicollinearity causes instability in machine learning optimization algorithms and can lead to model fitting errors
To preprocess correlated predictor variables, add the step_corr() function to a recipe (note the two r’s!): list the numeric predictors by name, separated by commas, and set a threshold (e.g. 0.9) on the absolute correlation.
this removes predictors whose correlation with another predictor exceeds the threshold.
Alternatively instead of selecting columns by name, we can use special functions
all_outcomes() selects the outcome variable
all_numeric() selects all numeric vars (including outcome if it is numeric)
can just do step_corr(all_numeric(), threshold = 0.9)
Another task: centering/scaling (“normalization”) - subtract mean and divide by sd (Z-score)
use step_normalize()
can have multiple steps, but order matters!
When we apply the recipe to the testing data, the same recipe used on training data will be used (e.g. step corr, step normalize, etc)
# Specify a recipe object
telecom_cor_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8)

# Train the recipe
telecom_cor_rec_prep <- telecom_cor_rec %>% 
  prep(training = telecom_training)

# Apply to training data
telecom_cor_rec_prep %>% 
  bake(new_data = NULL)
output
# A tibble: 731 × 8
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line            10.3            262            55 fiber_optic     
 2 multiple_lines          8.05           328           122 digital         
 3 multiple_lines          9.4            312           147 fiber_optic     
 4 multiple_lines          9.96           340           136 fiber_optic     
 5 multiple_lines         10.2            402           116 fiber_optic     
 6 single_line             6.69           352            91 digital         
 7 single_line             9.37           382            87 fiber_optic     
 8 multiple_lines          4.11           371            64 digital         
 9 multiple_lines          5.17           341           119 digital         
10 multiple_lines          7.86           378           164 digital         
# … with 721 more rows, and 3 more variables: contract <fct>,
#   months_with_company <dbl>, canceled_service <fct>
# Apply to test data
telecom_cor_rec_prep %>% 
  bake(new_data = telecom_test)
output
# A tibble: 244 × 8
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line             7.78           497           127 fiber_optic     
 2 single_line             9.04           336            88 fiber_optic     
 3 multiple_lines          5.08           250           107 digital         
 4 single_line             9.3            326           114 fiber_optic     
 5 single_line             7.07           249            94 fiber_optic     
 6 multiple_lines         10.6            281           147 fiber_optic     
 7 single_line             8.67            93           131 fiber_optic     
 8 multiple_lines         11.0            390            78 fiber_optic     
 9 single_line             5.03           343           138 digital         
10 single_line             7.96           330            51 fiber_optic     
# … with 234 more rows, and 3 more variables: contract <fct>,
#   months_with_company <dbl>, canceled_service <fct>
Excellent! You have trained your recipe to remove all correlated predictors that exceed the 0.8 correlation threshold. Notice that your recipe found the high correlation between monthly_charges and avg_data_gb in the training data and when applied to the telecom_test data, it removed the monthly_charges column.
Multiple feature engineering steps
# Specify a recipe object
telecom_norm_rec <- recipe(canceled_service ~ ., 
                           data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric())

# Train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>% 
  prep(training = telecom_training)

# Apply to test data
telecom_norm_rec_prep %>% 
  bake(new_data = telecom_test)
output
# A tibble: 244 × 8
   cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
   <fct>                  <dbl>         <dbl>         <dbl> <fct>           
 1 single_line           -0.258        1.99         0.604   fiber_optic     
 2 single_line            0.398       -0.159       -0.629   fiber_optic     
 3 multiple_lines        -1.66        -1.31        -0.0283  digital         
 4 single_line            0.533       -0.292        0.193   fiber_optic     
 5 single_line           -0.628       -1.32        -0.439   fiber_optic     
 6 multiple_lines         1.24        -0.892        1.24    fiber_optic     
 7 single_line            0.205       -3.40         0.730   fiber_optic     
 8 multiple_lines         1.44         0.562       -0.945   fiber_optic     
 9 single_line           -1.69        -0.0651       0.951   digital         
10 single_line           -0.165       -0.239       -1.80    fiber_optic     
# … with 234 more rows, and 3 more variables: contract <fct>,
#   months_with_company <dbl>, canceled_service <fct>
Great job! When you applied your trained recipe to the telecom_test data, it removed the monthly_charges column, due to its large correlation with avg_data_gb, and normalized the numeric predictor variables!
Nominal Predictors
Nominal data are variables that have labels for categories with no set order (e.g. department within a company; native language; car you drive)
Nominal must often be transformed into numeric data for ML algorithms
“One-hot encoding” maps categorical values to a sequence of [0/1] indicator variables
each unique value gets its own indicator variable (e.g. category dummies)
including an indicator for the final value is redundant! (if it is not Summer, Winter, or Fall, it must be Spring!)
“Dummy variable encoding” excludes one value from the original set (e.g. omits Spring) and creates indicator variables for the rest
is the default in tidymodels
Start by specifying a recipe() with the model formula, then pipe into step_dummy() where we select the variables; then pass to prep(training = training_data) and then to bake(new_data = test_data).
More robust way is to select columns by type, e.g. all_nominal() and all_outcomes(); can exclude outcome variable with -all_outcomes().
telecom_recipe_1 <- recipe(canceled_service ~ avg_data_gb + contract, 
                           data = telecom_training) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>% 
  # Create dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes())

# Train and apply telecom_recipe_1 on the test data
telecom_recipe_1 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
output
# A tibble: 244 × 4
   avg_data_gb canceled_service contract_one_year contract_two_year
         <dbl> <fct>                        <dbl>             <dbl>
 1      -0.258 yes                              0                 0
 2       0.398 yes                              0                 0
 3      -1.66  yes                              1                 0
 4       0.533 no                               0                 0
 5      -0.628 yes                              0                 0
 6       1.24  no                               0                 1
 7       0.205 no                               0                 1
 8       1.44  no                               0                 1
 9      -1.69  no                               0                 1
10      -0.165 yes                              0                 0
# … with 234 more rows
Another recipe:
telecom_recipe_2 <- recipe(canceled_service ~ avg_data_gb + contract, 
                           data = telecom_training) %>% 
  # Create dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric(), -all_outcomes())

# Train and apply telecom_recipe_2 on the test data
telecom_recipe_2 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
output
# A tibble: 244 × 4
   avg_data_gb canceled_service contract_one_year contract_two_year
         <dbl> <fct>                        <dbl>             <dbl>
 1      -0.258 yes                         -0.510            -0.471
 2       0.398 yes                         -0.510            -0.471
 3      -1.66  yes                          1.96             -0.471
 4       0.533 no                          -0.510            -0.471
 5      -0.628 yes                         -0.510            -0.471
 6       1.24  no                          -0.510             2.12 
 7       0.205 no                          -0.510             2.12 
 8       1.44  no                          -0.510             2.12 
 9      -1.69  no                          -0.510             2.12 
10      -0.165 yes                         -0.510            -0.471
# … with 234 more rows
Great job! Notice that telecom_recipe_1 produced [0, 1] values in the dummy variable columns while telecom_recipe_2 produced dummy variables which were then normalized! The predictor contract_two_year created by telecom_recipe_2 is -0.471 instead of 0 and 2.12 instead of 1 due to normalization. For model interpretation, it’s best to normalize variables before creating dummy variables. Also notice that since you only specified two predictor variables in your model formula, the rest of the columns are ignored by your recipe objects when transforming new data sources.
# Create a recipe that predicts canceled_service using the training data
telecom_recipe <- recipe(canceled_service ~ ., 
                         data = telecom_training) %>% 
  # Remove correlated predictors
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

# Train your recipe and apply it to the test data
telecom_recipe %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
output
# A tibble: 244 × 9
   avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
         <dbl>         <dbl>         <dbl>               <dbl> <fct>           
 1     -0.258         1.99          0.604               -1.07  yes             
 2      0.398        -0.159        -0.629               -0.947 yes             
 3     -1.66         -1.31         -0.0283               0.771 yes             
 4      0.533        -0.292         0.193               -0.347 no              
 5     -0.628        -1.32         -0.439               -0.587 yes             
 6      1.24         -0.892         1.24                 0.811 no              
 7      0.205        -3.40          0.730                0.851 no              
 8      1.44          0.562        -0.945                1.45  no              
 9     -1.69         -0.0651        0.951                0.212 no              
10     -0.165        -0.239        -1.80                -1.31  yes             
# … with 234 more rows, and 4 more variables:
#   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
#   contract_one_year <dbl>, contract_two_year <dbl>
# A tibble: 244 × 9
   avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
         <dbl>         <dbl>         <dbl>               <dbl> <fct>           
 1     -0.125         1.63          0.657              -0.748  yes             
 2      0.466        -0.0388       -0.490              -0.483  yes             
 3     -1.80         -1.30          0.121               0.757  yes             
 4      0.577        -0.167         0.319               0.198  no              
 5     -0.501        -1.31         -0.284              -0.00569 yes            
 6      1.11         -0.800         1.11                0.770  no              
 7      0.301        -5.51          0.754               0.784  no              
 8      1.26          0.596        -0.867               0.963  no              
 9     -1.84          0.0490        0.916               0.529  no              
10     -0.0346       -0.115        -2.20               -2.19   yes             
# … with 234 more rows, and 4 more variables:
#   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
#   contract_one_year <dbl>, contract_two_year <dbl>
Model Fitting
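The telecom_training_prep and telecom_test_prep datasets used below are not created in these notes; presumably they come from training the full feature engineering recipe and baking both datasets, roughly as in this sketch (telecom_recipe_prep is an illustrative name):

# Train the recipe (assumes telecom_recipe from above)
telecom_recipe_prep <- telecom_recipe %>% 
  prep(training = telecom_training)

# Preprocessed training data
telecom_training_prep <- telecom_recipe_prep %>% 
  bake(new_data = NULL)

# Preprocessed test data
telecom_test_prep <- telecom_recipe_prep %>% 
  bake(new_data = telecom_test)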
# Train logistic model
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ ., data = telecom_training_prep)

# Obtain class predictions
class_preds <- predict(logistic_fit, new_data = telecom_test_prep,
                       type = 'class')

# Obtain estimated probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test_prep,
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test_prep %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

telecom_results
output
# A tibble: 243 × 4
   canceled_service .pred_class .pred_yes .pred_no
   <fct>            <fct>           <dbl>    <dbl>
 1 no               no             0.222     0.778
 2 yes              yes            0.793     0.207
 3 yes              no             0.387     0.613
 4 yes              yes            0.500     0.500
 5 yes              no             0.407     0.593
 6 no               no             0.0473    0.953
 7 no               no             0.321     0.679
 8 no               no             0.0116    0.988
 9 yes              no             0.229     0.771
10 yes              no             0.312     0.688
# … with 233 more rows
Fantastic work! You have really come a long way in developing your modeling skills with tidymodels! From the results of your metric calculations, using feature engineering and incorporating all predictor variables increased your model’s sensitivity to 0.57, up from 0.42, and specificity to 0.901, up from 0.895!
Machine Learning Workflows (Decision Trees)
Combine models and recipes into a single workflow
Decision trees can segment predictor space into rectangular regions
Popular algorithm is recursive binary splitting - segments predictor space into non-overlapping regions
makes a series of horizontal or vertical cut points (“splits”) iteratively
decision tree will predict the majority class in each region to classify it
may be better than the linear decision boundaries of a logistic regression model
Can visualize with tree diagrams
interior nodes: the splits of a decision tree (dark boxes)
terminal nodes: model predictions at end of branches
Specified with decision_tree(); a common engine is rpart, and the mode can be classification or regression
To train a workflow, pipe it into last_fit() and provide a data split object
wkfl %>% last_fit(split = split_df)
Then we can also collect_metrics() and collect_predictions()
Example
loans_df
# A tibble: 872 × 8
   loan_default loan_purpose       missed_payment_2_yr loan_amount interest_rate
   <fct>        <fct>              <fct>                     <int>         <dbl>
 1 no           debt_consolidation no                        25000          5.47
 2 yes          medical            no                        10000         10.2 
 3 no           small_business     no                        13000          6.22
 4 no           small_business     no                        36000          5.97
 5 yes          small_business     yes                       12000         11.8 
 6 yes          medical            no                        13000         13.2 
 7 no           debt_consolidation no                        10000         10.5 
 8 no           debt_consolidation no                        40000          7.97
 9 no           debt_consolidation yes                       16000          4.72
10 no           small_business     no                        12000          7.47
# … with 862 more rows, and 3 more variables: installment <dbl>,
#   annual_income <dbl>, debt_to_income <dbl>
# Create data split object
loans_split <- initial_split(loans_df, 
                             # prop = 0.7, # removed?
                             strata = loan_default)

# Build training data
loans_training <- loans_split %>% training()

# Build test data
loans_test <- loans_split %>% testing()
Great work! You have created your training and test datasets and discovered that loan_amount and installment are highly correlated predictor variables. To remove one of these predictors, you will have to incorporate step_corr() into your feature engineering pipeline for this data.
dt_model <- decision_tree() %>% 
  # Specify the engine
  set_engine("rpart") %>% 
  # Specify the mode
  set_mode("classification")
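The loans_recipe object added to the workflow below is not shown in these notes; based on the recipe printout that follows, it was presumably built roughly like this sketch (the correlation threshold is an assumed value):

# Feature engineering recipe for the loans data (sketch)
loans_recipe <- recipe(loan_default ~ ., 
                       data = loans_training) %>% 
  # Remove highly correlated numeric predictors (threshold assumed)
  step_corr(all_numeric(), threshold = 0.85) %>% 
  # Center and scale numeric predictors
  step_normalize(all_numeric()) %>% 
  # Create dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes())

# Print the recipe
loans_recipe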
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          7

Operations:

Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()
Nice work! Now that you have your model and feature engineering steps specified, you can create a workflow object for model training.
# Create a workflow
loans_dt_wkfl <- workflow() %>% 
  # Include the model object
  add_model(dt_model) %>% 
  # Include the recipe object
  add_recipe(loans_recipe)

# View workflow specification
loans_dt_wkfl
output
loans_dt_wkfl
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Computational engine: rpart
# Train the workflow
loans_dt_wkfl_fit <- loans_dt_wkfl %>% 
  last_fit(split = loans_split)
# Calculate performance metrics on test data
loans_dt_wkfl_fit %>% 
  collect_metrics()
Good job! You have trained a workflow with last_fit() that created training and test datasets, trained and applied your recipe, fit your decision tree model to the training data and calculated performance metrics on the test data all with just a few lines of code! The model performed pretty well, with an area under the ROC curve of 0.79.
Improving Performance with Cross-Validation
So far we have been creating training and test datasets, the first step of the modeling process; it guards against overfitting.
training data used for model fitting
testing data used for model evaluation
One downside of this method is we only get one estimate of model performance
k-fold cross validation is a technique that provides K estimates of model performance, and is typically used to compare different model types (e.g. logistic regression and decision trees)
Training data is randomly partitioned into K sets of roughly equal size (“folds”), which are used to perform K iterations of model fitting and evaluation
Test data is left out of this process, so it can provide a final, independent estimate of model performance once a model is chosen
Ex: if we have 5 folds, we will have 5 iterations of model training and evaluation.
In 1st iteration, fold 1 reserved for model evaluation while others are for training
In 2nd iteration, fold 2 reserved for evaluation while others are for training
etc.
Use vfold_cv() to do this
set.seed(25) # randomize reproduciblyex_folds <-vfold_cv(training_df, # training dfv =10, # number of foldsstrata = outcome_var) # Y variable
The result is a tibble with a list column named splits and an id column identifying each fold. Each row holds a data split object with the instructions for assigning that fold's rows to the training and evaluation sets.
The fit_resamples() function trains a parsnip model or workflow object; provide it the cross validation folds via resamples and, optionally, a custom metrics function.
It estimates the metrics for each fold, and the average of these estimates appears in the mean column. Piping the fit results (e.g. ex_rs_fit) into collect_metrics(summarize = FALSE) returns the fold-by-fold metrics, which are convenient to summarize with dplyr (see the detailed results example further below).
fit_resamples() cannot make predictions on new data sources. The purpose of cross validation is not to fit a final model, but to compare the performance of different model types and discover which works best for our data.
Example
loans_training
# A tibble: 655 × 8
   loan_default loan_purpose       missed_payment_2_yr loan_amount interest_rate
   <fct>        <fct>              <fct>                     <int>         <dbl>
 1 yes          medical            no                        10000         10.2 
 2 no           small_business     no                        13000          6.22
 3 no           small_business     no                        36000          5.97
 4 yes          small_business     yes                       12000         11.8 
 5 yes          medical            no                        13000         13.2 
 6 no           debt_consolidation no                        10000         10.5 
 7 no           debt_consolidation no                        40000          7.97
 8 no           debt_consolidation yes                       16000          4.72
 9 no           home_improvement   no                         5000          9.97
10 yes          medical            no                        19000         11   
# … with 645 more rows, and 3 more variables: installment <dbl>,
#   annual_income <dbl>, debt_to_income <dbl>
# Create cross validation folds
set.seed(290)
loans_folds <- vfold_cv(loans_training, v = 5,
                        strata = loan_default)

loans_folds
output
loans_folds
#  5-fold cross-validation using stratification 
# A tibble: 5 × 2
  splits            id   
  <list>            <chr>
1 <split [523/132]> Fold1
2 <split [523/132]> Fold2
3 <split [524/131]> Fold3
4 <split [525/130]> Fold4
5 <split [525/130]> Fold5
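The cross-validation summary below was presumably produced with fit_resamples(); a sketch, assuming a metric set of roc_auc, sens, and spec (loans_metrics is an illustrative name):

# Metrics to estimate on each fold (assumed metric set)
loans_metrics <- metric_set(roc_auc, sens, spec)

# Fit the decision tree workflow to the cross validation folds
loans_dt_rs <- loans_dt_wkfl %>% 
  fit_resamples(resamples = loans_folds,
                metrics = loans_metrics)

# Average performance across folds
loans_dt_rs %>% collect_metrics()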
# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 roc_auc binary     0.846     5 0.00586 Preprocessor1_Model1
2 sens    binary     0.672     5 0.0156  Preprocessor1_Model1
3 spec    binary     0.876     5 0.00890 Preprocessor1_Model1
Excellent work! You have used cross validation to evaluate the performance of your decision tree workflow. Across the 5 cross validation folds, the average area under the ROC curve was 0.846. The average sensitivity and specificity were 0.672 and 0.876, respectively.
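The logistic regression comparison below comes from resampling a second workflow; a sketch (loans_logistic_wkfl is an assumed name — only loans_logistic_rs appears later in these notes):

# Workflow with the same recipe but a logistic regression model
loans_logistic_wkfl <- workflow() %>% 
  add_model(logistic_model) %>% 
  add_recipe(loans_recipe)

# Fit to the cross validation folds
loans_logistic_rs <- loans_logistic_wkfl %>% 
  fit_resamples(resamples = loans_folds,
                metrics = loans_metrics)

loans_logistic_rs %>% collect_metrics()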
# A tibble: 3 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 roc_auc binary     0.848     5 0.0184  Preprocessor1_Model1
2 sens    binary     0.648     5 0.0312  Preprocessor1_Model1
3 spec    binary     0.873     5 0.0198  Preprocessor1_Model1
Great job! For logistic regression, across the 5 cross validation folds, the average area under the ROC curve was 0.848. The average sensitivity and specificity were 0.648 and 0.873, respectively. ROC AUC and specificity are very close to the decision tree cross validation results. However, the decision tree model performed slightly better on sensitivity, with an average value of 0.672.
Comparing model performance profiles
# Detailed cross validation results
dt_rs_results <- loans_dt_rs %>% 
  collect_metrics(summarize = FALSE)

# Explore model performance for decision tree
dt_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))
output
# A tibble: 3 × 4
  .metric   min median   max
  <chr>   <dbl>  <dbl> <dbl>
1 roc_auc 0.830  0.842 0.865
2 sens    0.64   0.66  0.725
3 spec    0.85   0.875 0.901
# Detailed cross validation results
logistic_rs_results <- loans_logistic_rs %>% 
  collect_metrics(summarize = FALSE)

# Explore model performance for logistic regression
logistic_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))
output
# A tibble: 3 × 4
  .metric   min median   max
  <chr>   <dbl>  <dbl> <dbl>
1 roc_auc 0.810  0.836 0.898
2 sens    0.549  0.68  0.725
3 spec    0.838  0.862 0.95 
Great job! Both models have similar average values across all metrics. However, logistic regression tends to have a wider range of values on all metrics. This provides evidence that a decision tree model may produce more stable prediction accuracy on the loans dataset.
Hyperparameter Tuning
Hyperparameters are model parameters that are set prior to model training and control model complexity
In parsnip, decision trees have 3 hyperparameters:
cost_complexity: number used to penalize trees with large number of terminal nodes
by default: 0.01
tree_depth: controls how long the path from root to any terminal node can be
by default: 30
min_n: minimum data points required in a node for further splitting
by default: 20
Changing these values might improve overall performance — hyperparameter tuning uses cross validation to find optimal set of these hyperparameter values
tune() function (from tune package) is used to label model hyperparameters
in a parsnip model, set the hyperparameter values to tune()
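A sketch of a tunable decision tree specification (dt_tune_model is the name used in the grid search discussion below):

# Decision tree with all three hyperparameters flagged for tuning
dt_tune_model <- decision_tree(cost_complexity = tune(),
                               tree_depth = tune(),
                               min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

dt_tune_model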
First create a workflow by updating a previous workflow (the feature engineering steps & decision tree model with default parameters): pipe it into update_model() and provide the new decision tree with tuning parameters
my_wkflw %>% update_model(dt_tune_model)
Hyperparameter tuning is accomplished using grid search, a method where a grid of hyperparameter values is generated
For each combination, cross validation is used to estimate model performance, and best performing parameters are chosen. The parameters() function from the dials package can be used to identify the hyperparameters in a parsnip model object.
A popular method for grid search is to generate random combinations of values; since there can be an infinite number of combinations, choosing at random gives us a greater chance of discovering an optimal combination.
To create a random grid, pipe the results of parameters() into grid_random(): grid_random(parameters(dt_tune_model), size = 5). The first argument is the result of parameters(), and size is the number of random combinations to generate.
Save the result as dt_grid, then pipe the workflow into tune_grid(resamples = our_folds, grid = dt_grid, metrics = our_metrics).
Returns a tibble of results with .metrics column which is a list column with the results for each fold
Can use collect_metrics() to summarize results by default the average estimated values are used across all folds per combination
Each row is the average of performance across the n folds
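Putting those pieces together, a sketch of the tuning code that would produce the summary below (the workflow update, seed, and grid size are assumptions based on the surrounding notes):

# Workflow with the tunable decision tree
loans_tune_wkfl <- loans_dt_wkfl %>% 
  update_model(dt_tune_model)

# Random grid of 5 hyperparameter combinations
set.seed(214)  # assumed seed
dt_grid <- grid_random(parameters(dt_tune_model), size = 5)

# Tune with cross validation
dt_tuning <- loans_tune_wkfl %>% 
  tune_grid(resamples = loans_folds,
            grid = dt_grid,
            metrics = loans_metrics)

# Average performance for each combination
dt_tuning %>% collect_metrics()

# Fold-by-fold results used in the detailed exploration below
dt_tuning_results <- dt_tuning %>% 
  collect_metrics(summarize = FALSE)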
# A tibble: 15 × 9
   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
 1    0.0000000758         14    39 roc_auc binary     0.840     5  0.0130
 2    0.0000000758         14    39 sens    binary     0.672     5  0.0281
 3    0.0000000758         14    39 spec    binary     0.848     5  0.0185
 4    0.0243                5    34 roc_auc binary     0.743     5  0.0227
 5    0.0243                5    34 sens    binary     0.585     5  0.0375
 6    0.0243                5    34 spec    binary     0.908     5  0.0222
 7    0.00000443           11     8 roc_auc binary     0.791     5  0.0122
 8    0.00000443           11     8 sens    binary     0.664     5  0.0136
 9    0.00000443           11     8 spec    binary     0.848     5  0.0200
10    0.000000600           3     5 roc_auc binary     0.792     5  0.0230
11    0.000000600           3     5 sens    binary     0.589     5  0.0423
12    0.000000600           3     5 spec    binary     0.920     5  0.0138
13    0.00380               5    36 roc_auc binary     0.838     5  0.0121
14    0.00380               5    36 sens    binary     0.652     5  0.0321
15    0.00380               5    36 spec    binary     0.883     5  0.0115
# … with 1 more variable: .config <chr>
Good work! Since you have 5 random hyperparameter combinations and 3 performance metrics, there are 15 results in your summarized tuning results. Each row shows the average of the 5 cross validation estimates of each metric and hyperparameter combination.
# Explore detailed ROC AUC results for each fold
dt_tuning_results %>% 
  filter(.metric == 'roc_auc') %>% 
  group_by(id) %>% 
  summarize(min_roc_auc = min(.estimate),
            median_roc_auc = median(.estimate),
            max_roc_auc = max(.estimate))
output
# A tibble: 5 × 4
  id    min_roc_auc median_roc_auc max_roc_auc
  <chr>       <dbl>          <dbl>       <dbl>
1 Fold1       0.831          0.868       0.880
2 Fold2       0.708          0.803       0.822
3 Fold3       0.714          0.792       0.805
4 Fold4       0.736          0.777       0.848
5 Fold5       0.725          0.770       0.851
Excellent work! You have now had the chance to explore the detailed results of your decision tree hyperparameter tuning. The next step will be selecting the best combination and finalizing your workflow object!
We want to look to see if estimated values are fairly consistent across folds. Wild fluctuations are an indicator of model overfitting!
To make exploring tuning results easier, use the show_best() function to display the top n performing combinations: dt_tuning %>% show_best(metric = 'roc_auc', n = 5), ranked by the average value of the metric. The model listed in the .config column indicates which one has the largest average roc_auc.
The select_best() function selects the best-performing hyperparameters: dt_tuning %>% select_best(metric = 'roc_auc'). It returns a tibble with the hyperparameter values that produced the largest average performance metric value (e.g. Model1).
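A sketch of selecting the best combination (best_dt_model is the name passed to finalize_workflow() below):

# Select the best hyperparameter combination by ROC AUC
best_dt_model <- dt_tuning %>% 
  select_best(metric = 'roc_auc')

best_dt_model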
The result can be used to finalize a workflow with finalize_workflow(): pipe the workflow into finalize_workflow() and pass the selected hyperparameters:
# Finalize your workflow
final_loans_wkfl <- loans_tune_wkfl %>% 
  finalize_workflow(best_dt_model)

final_loans_wkfl
output
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 7.58290839567418e-08
  tree_depth = 14
  min_n = 39

Computational engine: rpart
Good job! When you printed your finalized workflow object, the optimal hyperparameter combination is displayed in the main arguments section of the output. Your workflow is now ready for model fitting and prediction on new data sources!
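The final training and test set evaluation described below were presumably done with last_fit() on the finalized workflow; a sketch (loans_final_fit is an illustrative name):

# Train the finalized workflow and score the test data
loans_final_fit <- final_loans_wkfl %>% 
  last_fit(split = loans_split)

# Test set metrics (accuracy and roc_auc by default)
loans_final_fit %>% collect_metrics()

# ROC curve for the test set predictions
loans_final_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = loan_default, .pred_yes) %>% 
  autoplot()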
Great job! You were able to train your finalized workflow with last_fit() and generate predictions on the test data. The tuned decision tree model produced an area under the ROC curve of 0.875. That’s a great model! The ROC curve shows that the sensitivity and specificity remain high across a wide range of probability threshold values.