AE 10: Model workflow

Peer-to-peer lender

Published

October 11, 2023

Important

Go to the course GitHub organization and locate your ae-10 repo to get started.

Render, commit, and push your responses to GitHub by the end of class. The responses are due in your GitHub repo no later than Saturday, October 14 at 11:59pm.

Packages + data

library(tidyverse)
library(tidymodels)
library(knitr)
library(openintro)

The data for this AE is from the loan50 data set in the openintro R package. We will focus on the following variables:

Predictors

annual_income: Annual income (in US dollars)
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Response

interest_rate: Interest rate for the loan (0- 100)

Analysis goal

The goals of this analysis are to build a recipe to fit a linear regression model on the training data that has the following features:

annual_income rescaled to thousands of dollars
- Do not include the original variable annual_income in the model
Mean-centered quantitative variables
Indicator (dummy) variables for the categorical predictor
Interaction term between rescaled annual_income and verified_income

and (2) use prep() and bake() to check the recipe

Relevel `verified_income`

Make Verified the baseline level for the model.

loan50 <- loan50 |>
  mutate(verified_income = factor(verified_income,
                                  levels = c("Verified", "Not Verified", 
                                             "Source Verified")))

Test/train split

Split the data into 90% training, 10% testing.

set.seed(123)

loan_split <- initial_split(loan50, prop = 0.9)
loan_train <- training(loan_split)
loan_test  <- testing(loan_split)

Build (and troubleshoot) recipe

Use step_mutate() to create a new variable annual_income_th that is annual_income rescaled to thousands of dollars
Use step_rm() to remove annual_income from the model
Use step_center() to mean-center quantitative variables
Use step_dummy() to create indicator variables for the categorical predictor
Use step_interact() to create interaction between annual_income_th and verified_income

loan_rec <-  recipe(interest_rate ~ annual_income + debt_to_income + verified_income, 
                    data = loan_train) |>
  step_mutate(annual_income_th = annual_income / 1000) |>
  step_rm(annual_income) |>
  step_center(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(terms = ~ annual_income_th:verified_income)

loan_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 3

── Operations

• Variable mutation for: annual_income / 1000

• Variables removed: annual_income

• Centering for: all_numeric_predictors()

• Dummy variables from: all_nominal_predictors()

• Interactions with: annual_income_th:verified_income

Check recipe using `prep()` and `bake()`

Once you’ve corrected the code, remove #| eval: false before rendering the document.

loan_rec |>
  prep() |>
  bake(loan_train) |>
  glimpse()

In which step do we have an error?
Click here to access the recipes reference page. Find the reference page for the relevant step_ function.
See the examples at the bottom of the reference page. Which model most closely aligns the interaction we’re trying to create?
Use example to help you fix the code. Then, use prep() and bake() to see the updated results.

Workflows and model fitting

Specify model

loan_spec <- linear_reg() |>
  set_engine("lm")

loan_spec

Linear Regression Model Specification (regression)

Computational engine: lm

Build workflow

loan_wflow <- workflow() |>
  add_model(loan_spec) |>
  add_recipe(loan_rec)

loan_wflow

Fit model to training data

Remove #| eval: false before rendering the document.

loan_fit <- loan_wflow |>
  fit(data = loan_train)

tidy(loan_fit) |>
  kable(digits = 3)

Evaluate model on training data

Make predictions

Fill in the code and remove #| eval: false before rendering the document.

loan_train_pred <- predict(loan_fit, ______) |>
  bind_cols(_____)

Calculate $R^{2}$

Fill in the code and remove #| eval: false before rendering the document.

rsq(loan_train_pred, truth = _____, estimate = _____)

Calculate RMSE

Fill in the code and remove #| eval: false before rendering the document.

rmse(______, ________, ________)

Is this RMSE considered high or low? Hint: Consider the range of the response variable to answer this question.

loan_train |>
  summarise(min = min(interest_rate), max = max(interest_rate))

# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1  5.31  26.3

Evaluate model on testing data

Make predictions

# fill in code to make predictions from testing data

Calculate $R^{2}$

# fill in code to calculate R-sq for testing data

Calculate RMSE

# fill in code to calculate RMSE for testing data

Compare training and testing data results

Compare the $R^{2}$ for the training and testing data. Is this what you expected?
Compare the RMSE for the training and testing data. Is this what you expected?

To submit the AE

Important

Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your ae-10 repo on GitHub. (You do not submit AEs on Gradescope).

Packages + data

Analysis goal

Relevel verified_income

Test/train split

Build (and troubleshoot) recipe

Check recipe using prep() and bake()

Workflows and model fitting

Specify model

Build workflow

Fit model to training data

Evaluate model on training data

Make predictions

Calculate R2

Calculate RMSE

Evaluate model on testing data

Make predictions

Calculate R2

Calculate RMSE

Compare training and testing data results

To submit the AE

Relevel `verified_income`

Check recipe using `prep()` and `bake()`

Calculate $R^{2}$

Calculate $R^{2}$