Feature engineering

Model workflow

Prof. Maria Tackett

Oct 11, 2023

Announcements

Lab 05 due October 20
Prof. Tackett will not have office hours on Friday
- Email to schedule an appointment if you need to meet
- All other office hours on regular schedule

🍁 Have a good Fall Break! 🍁

Topics

Feature engineering with recipes
Workflows to bring together models and recipes
RMSE and \(R^2\) for model evaluation on training and tests sets

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(gghighlight)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

The Office

Data & goal

Data: The data come from data.world, by way of TidyTuesday
Goal: Predict imdb_rating from other variables in the dataset

# A tibble: 188 × 6
   season episode title             imdb_rating total_votes air_date  
    <dbl>   <dbl> <chr>                   <dbl>       <dbl> <date>    
 1      1       1 Pilot                     7.6        3706 2005-03-24
 2      1       2 Diversity Day             8.3        3566 2005-03-29
 3      1       3 Health Care               7.9        2983 2005-04-05
 4      1       4 The Alliance              8.1        2886 2005-04-12
 5      1       5 Basketball                8.4        3179 2005-04-19
 6      1       6 Hot Girl                  7.8        2852 2005-04-26
 7      2       1 The Dundies               8.7        3213 2005-09-20
 8      2       2 Sexual Harassment         8.2        2736 2005-09-27
 9      2       3 Office Olympics           8.4        2742 2005-10-04
10      2       4 The Fire                  8.4        2713 2005-10-11
# ℹ 178 more rows

Modeling

Train / test

Step 1: Create an initial split:

set.seed(123)
office_split <- initial_split(office_ratings) # prop = 3/4 by default

Step 2: Save training data

office_train <- training(office_split)
dim(office_train)

[1] 141   6

Step 3: Save testing data

office_test  <- testing(office_split)
dim(office_test)

[1] 47  6

Training data

office_train

# A tibble: 141 × 6
   season episode title               imdb_rating total_votes air_date  
    <dbl>   <dbl> <chr>                     <dbl>       <dbl> <date>    
 1      8      18 Last Day in Florida         7.8        1429 2012-03-08
 2      9      14 Vandalism                   7.6        1402 2013-01-31
 3      2       8 Performance Review          8.2        2416 2005-11-15
 4      9       5 Here Comes Treble           7.1        1515 2012-10-25
 5      3      22 Beach Games                 9.1        2783 2007-05-10
 6      7       1 Nepotism                    8.4        1897 2010-09-23
 7      3      15 Phyllis' Wedding            8.3        2283 2007-02-08
 8      9      21 Livin' the Dream            8.9        2041 2013-05-02
 9      9      18 Promos                      8          1445 2013-04-04
10      8      12 Pool Party                  8          1612 2012-01-19
# ℹ 131 more rows

Recap: Feature engineering

We prefer simple (parsimonious) models when possible, but parsimony does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity
Variables that go into the model and how they are represented are just as critical to success of the model
Feature engineering allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance)

Recap: Modeling workflow, revisited

Create a recipe for feature engineering steps to be applied to the training data
Fit the model to the training data after these steps have been applied
Using the model estimates from the training data, predict outcomes for the test data
Evaluate the performance of the model on the test data

Building recipes

Initiate a recipe

office_rec <- recipe(
  imdb_rating ~ .,    # formula
  data = office_train # data for cataloguing names and types of variables
  )

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 5

Step 1: Alter roles

title isn’t a predictor, but we might want to keep it around as an ID

office_rec <- office_rec |>
  update_role(title, new_role = "ID")

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

Step 2: Add features

New features for day of week and month

office_rec <- office_rec |>
  step_date(air_date, features = c("dow", "month"))

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

── Operations

• Date features from: air_date

Working with recipes

When building recipes you in a pipeline, you don’t get to see the effect of the recipe on your data, which can be unsettling
You can take a peek at what will happen when you ultimately apply the recipe to your data at the time of fitting the model
This requires two functions: prep() to train the recipe and bake() to apply it to your data

Note

This is optional, we’ll show the results for demonstrative purposes. It doesn’t need to be part of your modeling pipeline, but it can be assuring to see the effects of the recipe steps as you build the recipe.

Step 2: Prep and bake

# determine required parameters to be estimated
office_rec |>
  prep() |>
# apply recipe computations to data
  bake(office_train) |>
  glimpse()

Rows: 141
Columns: 8
$ season         <dbl> 8, 9, 2, 9, 3, 7, 3, 9, 9, 8, 5, 5, 9, 6, 7, 6, 5, 2, 2…
$ episode        <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26, 12, 1, 20,…
$ title          <fct> "Last Day in Florida", "Vandalism", "Performance Review…
$ total_votes    <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2041, 1445, 1…
$ air_date       <date> 2012-03-08, 2013-01-31, 2005-11-15, 2012-10-25, 2007-0…
$ imdb_rating    <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0, 8.0, 8.7, …
$ air_date_dow   <fct> Thu, Thu, Tue, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, …
$ air_date_month <fct> Mar, Jan, Nov, Oct, May, Sep, Feb, May, Apr, Jan, May, …

Step 3: Add more features

Identify holidays in air_date, then remove air_date

office_rec <- office_rec |>
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  )

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

── Operations

• Date features from: air_date

• Holiday features from: air_date

Step 3: Prep and bake

office_rec |>
  prep() |>
  bake(office_train) |>
  glimpse()

Rows: 141
Columns: 11
$ season                     <dbl> 8, 9, 2, 9, 3, 7, 3, 9, 9, 8, 5, 5, 9, 6, 7…
$ episode                    <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26…
$ title                      <fct> "Last Day in Florida", "Vandalism", "Perfor…
$ total_votes                <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2…
$ imdb_rating                <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0…
$ air_date_dow               <fct> Thu, Thu, Tue, Thu, Thu, Thu, Thu, Thu, Thu…
$ air_date_month             <fct> Mar, Jan, Nov, Oct, May, Sep, Feb, May, Apr…
$ air_date_USThanksgivingDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USChristmasDay    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USNewYearsDay     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USIndependenceDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Step 4: Convert numbers to factors

Convert season to factor

office_rec <- office_rec |>
  step_num2factor(season, levels = as.character(1:9))

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

── Operations

• Date features from: air_date

• Holiday features from: air_date

• Factor variables from: season

Step 4: Prep and bake

office_rec |>
  prep() |>
  bake(office_train) |>
  glimpse()

Rows: 141
Columns: 11
$ season                     <fct> 8, 9, 2, 9, 3, 7, 3, 9, 9, 8, 5, 5, 9, 6, 7…
$ episode                    <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26…
$ title                      <fct> "Last Day in Florida", "Vandalism", "Perfor…
$ total_votes                <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2…
$ imdb_rating                <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0…
$ air_date_dow               <fct> Thu, Thu, Tue, Thu, Thu, Thu, Thu, Thu, Thu…
$ air_date_month             <fct> Mar, Jan, Nov, Oct, May, Sep, Feb, May, Apr…
$ air_date_USThanksgivingDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USChristmasDay    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USNewYearsDay     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USIndependenceDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Step 5: Make dummy variables

Convert all nominal (categorical) predictors to factors

office_rec <- office_rec |>
  step_dummy(all_nominal_predictors())

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

── Operations

• Date features from: air_date

• Holiday features from: air_date

• Factor variables from: season

• Dummy variables from: all_nominal_predictors()

Step 5: Prep and bake

# determine required parameters to be estimated
office_rec |>
  prep() |>
# apply recipe computations to data
  bake(office_train) |>
  glimpse()

Rows: 141
Columns: 33
$ episode                    <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26…
$ title                      <fct> "Last Day in Florida", "Vandalism", "Perfor…
$ total_votes                <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2…
$ imdb_rating                <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0…
$ air_date_USThanksgivingDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USChristmasDay    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USNewYearsDay     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_USIndependenceDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X2                  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X3                  <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X4                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ season_X5                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
$ season_X6                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ season_X7                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ season_X8                  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ season_X9                  <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0…
$ air_date_dow_Mon           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Tue           <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Wed           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Thu           <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ air_date_dow_Fri           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_dow_Sat           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Feb         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Mar         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Apr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1…
$ air_date_month_May         <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0…
$ air_date_month_Jun         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Jul         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Aug         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Sep         <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ air_date_month_Oct         <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Nov         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ air_date_month_Dec         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Step 6: Remove zero variance predictors

Remove all predictors that contain only a single value

office_rec <- office_rec |>
  step_zv(all_predictors())

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

── Operations

• Date features from: air_date

• Holiday features from: air_date

• Factor variables from: season

• Dummy variables from: all_nominal_predictors()

• Zero variance filter on: all_predictors()

Step 6: Prep and bake

office_rec |>
  prep() |>
  bake(office_train) |>
  glimpse()

Rows: 141
Columns: 22
$ episode            <dbl> 18, 14, 8, 5, 22, 1, 15, 21, 18, 12, 25, 26, 12, 1,…
$ title              <fct> "Last Day in Florida", "Vandalism", "Performance Re…
$ total_votes        <dbl> 1429, 1402, 2416, 1515, 2783, 1897, 2283, 2041, 144…
$ imdb_rating        <dbl> 7.8, 7.6, 8.2, 7.1, 9.1, 8.4, 8.3, 8.9, 8.0, 8.0, 8…
$ season_X2          <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ season_X3          <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ season_X4          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ season_X5          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, …
$ season_X6          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
$ season_X7          <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ season_X8          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ season_X9          <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, …
$ air_date_dow_Tue   <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ air_date_dow_Thu   <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ air_date_month_Feb <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ air_date_month_Mar <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ air_date_month_Apr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, …
$ air_date_month_May <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
$ air_date_month_Sep <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
$ air_date_month_Oct <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ air_date_month_Nov <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ air_date_month_Dec <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Putting it all together

office_rec <- recipe(imdb_rating ~ ., data = office_train) |>
  # make title's role ID
  update_role(title, new_role = "ID") |>
  # extract day of week and month of air_date
  step_date(air_date, features = c("dow", "month")) |>
  # identify holidays and add indicators
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) |>
  # turn season into factor
  step_num2factor(season, levels = as.character(1:9)) |>
  # make dummy variables
  step_dummy(all_nominal_predictors()) |>
  # remove zero variance predictors
  step_zv(all_predictors())

Putting it all together

office_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 4
ID:        1

── Operations

• Date features from: air_date

• Holiday features from: air_date

• Factor variables from: season

• Dummy variables from: all_nominal_predictors()

• Zero variance filter on: all_predictors()

Recipe workflow

Artwork by @allison_horst

Application exercise

📋 sta210-fa23.netlify.app/ae/ae-10-model-workflow

Building workflows

Specify model

office_spec <- linear_reg() |>
  set_engine("lm")

office_spec

Linear Regression Model Specification (regression)

Computational engine: lm

Build workflow

Workflows bring together models and recipes so that they can be easily applied to both the training and test data.

office_wflow <- workflow() |>
  add_model(office_spec) |>
  add_recipe(office_rec)

See next slide for workflow…

View workflow

office_wflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_date()
• step_holiday()
• step_num2factor()
• step_dummy()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

Fit model to training data

office_fit <- office_wflow |>
  fit(data = office_train)

tidy(office_fit)

# A tibble: 21 × 5
   term         estimate std.error statistic  p.value
   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)  6.40     0.510        12.5   1.51e-23
 2 episode     -0.00393  0.0171       -0.230 8.18e- 1
 3 total_votes  0.000375 0.0000414     9.07  2.75e-15
 4 season_X2    0.811    0.327         2.48  1.44e- 2
 5 season_X3    1.04     0.343         3.04  2.91e- 3
 6 season_X4    1.09     0.295         3.70  3.32e- 4
 7 season_X5    1.08     0.348         3.11  2.34e- 3
 8 season_X6    1.00     0.367         2.74  7.18e- 3
 9 season_X7    1.02     0.352         2.89  4.52e- 3
10 season_X8    0.497    0.348         1.43  1.55e- 1
# ℹ 11 more rows

So many predictors!

Model fit summary

tidy(office_fit) |> print(n = 21)

# A tibble: 21 × 5
   term                estimate std.error statistic  p.value
   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)         6.40     0.510        12.5   1.51e-23
 2 episode            -0.00393  0.0171       -0.230 8.18e- 1
 3 total_votes         0.000375 0.0000414     9.07  2.75e-15
 4 season_X2           0.811    0.327         2.48  1.44e- 2
 5 season_X3           1.04     0.343         3.04  2.91e- 3
 6 season_X4           1.09     0.295         3.70  3.32e- 4
 7 season_X5           1.08     0.348         3.11  2.34e- 3
 8 season_X6           1.00     0.367         2.74  7.18e- 3
 9 season_X7           1.02     0.352         2.89  4.52e- 3
10 season_X8           0.497    0.348         1.43  1.55e- 1
11 season_X9           0.621    0.345         1.80  7.41e- 2
12 air_date_dow_Tue    0.382    0.422         0.904 3.68e- 1
13 air_date_dow_Thu    0.284    0.389         0.731 4.66e- 1
14 air_date_month_Feb -0.0597   0.132        -0.452 6.52e- 1
15 air_date_month_Mar -0.0752   0.156        -0.481 6.31e- 1
16 air_date_month_Apr  0.0954   0.177         0.539 5.91e- 1
17 air_date_month_May  0.156    0.213         0.734 4.64e- 1
18 air_date_month_Sep -0.0776   0.223        -0.348 7.28e- 1
19 air_date_month_Oct -0.176    0.174        -1.01  3.13e- 1
20 air_date_month_Nov -0.156    0.149        -1.05  2.98e- 1
21 air_date_month_Dec  0.170    0.149         1.14  2.55e- 1

Application exercise

📋 sta210-fa23.netlify.app/ae/ae-10-model-workflow

Evaluate model

Make predictions for training data

office_train_pred <- predict(office_fit, office_train) |>
  bind_cols(office_train)

office_train_pred

# A tibble: 141 × 7
   .pred season episode title               imdb_rating total_votes air_date  
   <dbl>  <dbl>   <dbl> <chr>                     <dbl>       <dbl> <date>    
 1  7.57      8      18 Last Day in Florida         7.8        1429 2012-03-08
 2  7.77      9      14 Vandalism                   7.6        1402 2013-01-31
 3  8.31      2       8 Performance Review          8.2        2416 2005-11-15
 4  7.67      9       5 Here Comes Treble           7.1        1515 2012-10-25
 5  8.84      3      22 Beach Games                 9.1        2783 2007-05-10
 6  8.33      7       1 Nepotism                    8.4        1897 2010-09-23
 7  8.46      3      15 Phyllis' Wedding            8.3        2283 2007-02-08
 8  8.14      9      21 Livin' the Dream            8.9        2041 2013-05-02
 9  7.87      9      18 Promos                      8          1445 2013-04-04
10  7.74      8      12 Pool Party                  8          1612 2012-01-19
# ℹ 131 more rows

R-squared

Percentage of variability in the IMDB ratings explained by the model.

rsq(office_train_pred, truth = imdb_rating, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.670

Are models with high or low \(R^2\) more preferable?

RMSE

An alternative model performance statistic: root mean square error.

\[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

rmse(office_train_pred, truth = imdb_rating, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.302

Are models with high or low RMSE are more preferable?

Interpreting RMSE

Is this RMSE considered low or high?

rmse(office_train_pred, truth = imdb_rating, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.302

Depends…

office_train |>
  summarise(min = min(imdb_rating), max = max(imdb_rating))

# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1   6.7   9.7

But, really…

who cares about predictions on training data?

Make predictions for testing data

office_test_pred <- predict(office_fit, office_test) |>
  bind_cols(office_test)

office_test_pred

# A tibble: 47 × 7
   .pred season episode title               imdb_rating total_votes air_date  
   <dbl>  <dbl>   <dbl> <chr>                     <dbl>       <dbl> <date>    
 1  8.03      1       2 Diversity Day               8.3        3566 2005-03-29
 2  7.98      1       3 Health Care                 7.9        2983 2005-04-05
 3  8.41      2       4 The Fire                    8.4        2713 2005-10-11
 4  8.35      2       5 Halloween                   8.2        2561 2005-10-18
 5  8.35      2       9 E-Mail Surveillance         8.4        2527 2005-11-22
 6  8.68      2      12 The Injury                  9          3282 2006-01-12
 7  8.32      2      14 The Carpet                  7.9        2342 2006-01-26
 8  8.93      2      22 Casino Night                9.3        3644 2006-05-11
 9  8.80      3       1 Gay Witch Hunt              8.9        3087 2006-09-21
10  8.37      3       5 Initiation                  8.2        2254 2006-10-19
# ℹ 37 more rows

Evaluate performance for testing data

RMSE of model fit to testing data

rmse(office_test_pred, truth = imdb_rating, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.411

R-sq of model fit to testing data

rsq(office_test_pred, truth = imdb_rating, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.468

Training vs. testing

metric	train	test	comparison
RMSE	0.302	0.411	RMSE lower for training
R-squared	0.67	0.468	R-squared higher for training

Evaluating performance on training data

The training set does not have the capacity to be a good arbiter of performance.
It is not an independent piece of information; predicting the training set can only reflect what the model already knows.
Suppose you give a class a test, then give them the answers, then provide the same test. The student scores on the second test do not accurately reflect what they know about the subject; these scores would probably be higher than their results on the first test.

Application exercise

📋 sta210-fa23.netlify.app/ae/ae-10-model-workflow