AE 11: Cross validation
Go to the course GitHub organization and locate your ae-11 repo to get started.
Render, commit, and push your responses to GitHub by the end of class. The responses are due in your GitHub repo no later than Thursday, October 26 at 11:59pm.
Model statistics function
You will use this function to calculate model fit statistics (adjusted R-squared, AIC, and BIC) for the models in this AE.
calc_model_stats <- function(x) {
  glance(extract_fit_parsnip(x)) |>
    select(adj.r.squared, AIC, BIC)
}
Packages
library(tidyverse)
library(tidymodels)
library(knitr)
Load data and relevel factors
tips <- read_csv("data/tip-data.csv")
tips <- tips |>
  mutate(
    Age = factor(Age, levels = c("Yadult", "Middle", "SenCit")),
    Meal = factor(Meal, levels = c("Lunch", "Dinner", "Late Night"))
  )
Split data into training and testing
Split your data into testing and training sets.
set.seed(10232023)
tips_split <- initial_split(tips)
tips_train <- training(tips_split)
tips_test <- testing(tips_split)
Specify model
Specify a linear regression model. Call it tips_spec.
tips_spec <- linear_reg() |>
  set_engine("lm")
tips_spec
Linear Regression Model Specification (regression)
Computational engine: lm
Model 1
Create recipe
Create a recipe to use Party, Age, and Meal to predict Tip. Call it tips_rec1.
tips_rec1 <- recipe(Tip ~ Party + Age + Meal,
                    data = tips_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())
tips_rec1
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 3
── Operations
• Dummy variables from: all_nominal_predictors()
• Zero variance filter on: all_predictors()
Preview recipe
prep(tips_rec1) |>
  bake(tips_train) |>
  glimpse()
Rows: 126
Columns: 6
$ Party <dbl> 3, 2, 2, 4, 2, 7, 4, 3, 2, 4, 1, 2, 2, 1, 2, 1, 2, 3, …
$ Tip <dbl> 4.00, 4.92, 5.09, 8.84, 3.09, 15.00, 8.00, 4.00, 5.00,…
$ Age_Middle <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, …
$ Age_SenCit <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
$ Meal_Dinner <dbl> 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, …
$ Meal_Late.Night <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, …
Create workflow
Create the workflow that brings together the model specification and recipe. Call it tips_wflow1.
tips_wflow1 <- workflow() |>
  add_model(tips_spec) |>
  add_recipe(tips_rec1)
tips_wflow1
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Cross validation
Create folds
Create 5 folds.
# make 5 folds
set.seed(10232023)
folds <- vfold_cv(tips_train, v = 5)
Conduct cross validation
Conduct cross validation on the 5 folds.
# Fit model and performance statistics for each iteration
tips_fit_rs1 <- tips_wflow1 |>
  fit_resamples(resamples = folds,
                control = control_resamples(extract = calc_model_stats))
Take a look at tips_fit_rs1
tips_fit_rs1
# Resampling results
# 5-fold cross-validation
# A tibble: 5 × 5
splits id .metrics .notes .extracts
<list> <chr> <list> <list> <list>
1 <split [100/26]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
2 <split [101/25]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
3 <split [101/25]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
4 <split [101/25]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
5 <split [101/25]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
Summarize assessment CV metrics
Summarize assessment metrics from your CV iterations. These statistics are calculated using the assessment set.
collect_metrics(tips_fit_rs1, summarize = TRUE)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 2.09 5 0.265 Preprocessor1_Model1
2 rsq standard 0.673 5 0.0519 Preprocessor1_Model1
Set summarize = FALSE to see the individual metrics from each fold.
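For example, the per-fold (unsummarized) assessment metrics can be pulled like this, assuming tips_fit_rs1 from the cross validation above:

```r
# One rmse and one rsq row per fold (5 folds x 2 metrics = 10 rows)
collect_metrics(tips_fit_rs1, summarize = FALSE)
```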
Summarize model fit CV metrics
Summarize model fit statistics from your CV iterations. These statistics are calculated using the analysis set.
map_df(tips_fit_rs1$.extracts, ~ .x[[1]][[1]]) |>
  summarise(mean_adj_rsq = mean(adj.r.squared),
            mean_aic = mean(AIC),
            mean_bic = mean(BIC))
# A tibble: 1 × 3
mean_adj_rsq mean_aic mean_bic
<dbl> <dbl> <dbl>
1 0.670 434. 453.
Run the first line of code, map_df(tips_fit_rs1$.extracts, ~ .x[[1]][[1]]), to see the individual model fit statistics from each fold.
Another model - Model 2
Create the recipe for a new model that includes Party, Age, Meal, and Alcohol (an indicator for whether the party ordered alcohol with the meal). Conduct 5-fold cross validation using the folds created earlier and summarize the metrics.
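As a sketch of the Model 2 pipeline outlined in the sections below, assuming the alcohol indicator column is named Alcohol and reusing tips_spec, tips_train, folds, and calc_model_stats from above (the names tips_rec2, tips_wflow2, and tips_fit_rs2 are illustrative):

```r
# Recipe: Model 1 predictors plus the Alcohol indicator
tips_rec2 <- recipe(Tip ~ Party + Age + Meal + Alcohol,
                    data = tips_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# Workflow: same linear regression spec, new recipe
tips_wflow2 <- workflow() |>
  add_model(tips_spec) |>
  add_recipe(tips_rec2)

# Cross validation on the same folds used for Model 1
tips_fit_rs2 <- tips_wflow2 |>
  fit_resamples(resamples = folds,
                control = control_resamples(extract = calc_model_stats))

# Assessment metrics (RMSE and R-squared), averaged over folds
collect_metrics(tips_fit_rs2, summarize = TRUE)

# Model fit statistics (adjusted R-squared, AIC, BIC), averaged over folds
map_df(tips_fit_rs2$.extracts, ~ .x[[1]][[1]]) |>
  summarise(mean_adj_rsq = mean(adj.r.squared),
            mean_aic = mean(AIC),
            mean_bic = mean(BIC))
```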
Model 2: Recipe
# add code here
Model 2: Model building workflow
# add code here
Model 2: Conduct CV
We will use the same folds as the ones used for Model 1. Why should we use the same folds to evaluate and compare both models?
# add code here
Model 2: Summarize assessment CV metrics
# add code here
Model 2: Summarize model fit CV metrics
# add code here
Compare and choose a model
Describe how the two models compare to each other based on cross validation metrics.
Which model do you choose for the final model? Why?
Fit the selected model
Fit the selected model using the entire training set.
# add code here
See notes for example code.
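As a minimal sketch, assuming Model 1's workflow is the one selected (swap in the Model 2 workflow otherwise; the name tips_fit is illustrative):

```r
# Fit the chosen workflow on the entire training set
tips_fit <- tips_wflow1 |>
  fit(data = tips_train)
```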
Evaluate the performance of the selected model on the testing data
Calculate predicted values
# add code here
Calculate
# add code here
See notes for example code.
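One possible sketch for evaluating on the testing set, assuming a fitted workflow named tips_fit from the previous step (tips_test_pred is an illustrative name):

```r
# Predicted values on the testing set, joined back to the observed data
tips_test_pred <- predict(tips_fit, tips_test) |>
  bind_cols(tips_test)

# RMSE and R-squared on the testing set
rmse(tips_test_pred, truth = Tip, estimate = .pred)
rsq(tips_test_pred, truth = Tip, estimate = .pred)
```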
How does the model performance on the testing data compare to its performance on the training data?
Is this what you expected? Why or why not?
Submission
To submit the AE:
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your ae-11 repo on GitHub. (You do not submit AEs on Gradescope.)