library(tidyverse)
library(tidymodels)
library(knitr)
library(openintro)
AE 10: Model workflow
Peer-to-peer lender
Go to the course GitHub organization and locate your ae-10 repo to get started.
Render, commit, and push your responses to GitHub by the end of class. The responses are due in your GitHub repo no later than Saturday, October 14 at 11:59pm.
Packages + data
The data for this AE is from the loan50 data set in the openintro R package. We will focus on the following variables:
Predictors
- annual_income: Annual income (in US dollars)
- debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
- verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
Response
- interest_rate: Interest rate for the loan (0-100)
Analysis goal
The goals of this analysis are to (1) build a recipe for fitting a linear regression model on the training data that has the following features:
- annual_income rescaled to thousands of dollars
  - Do not include the original variable annual_income in the model
- Mean-centered quantitative variables
- Indicator (dummy) variables for the categorical predictor
- Interaction term between rescaled annual_income and verified_income
and (2) use prep() and bake() to check the recipe.
Relevel verified_income
Make Verified the baseline level for the model.
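The first level of a factor is the one R treats as the baseline when it builds indicator variables, which is why the order of levels matters here. The sketch below shows this with base R only, on a made-up vector (`status`, `default_fct`, and `releveled_fct` are hypothetical names for illustration, not part of loan50).

```r
# Toy vector (made up for illustration, not taken from loan50)
status <- c("Not Verified", "Verified", "Source Verified", "Verified")

# By default, factor() sorts the levels alphabetically,
# so "Not Verified" would be the baseline
default_fct <- factor(status)
levels(default_fct)

# Listing the levels explicitly makes the first one listed the baseline
releveled_fct <- factor(status,
                        levels = c("Verified", "Not Verified",
                                   "Source Verified"))
levels(releveled_fct)

# Treatment (dummy) coding creates no column for the baseline level
dummy_cols <- colnames(model.matrix(~ releveled_fct))
dummy_cols
```

With this ordering, coefficients for the other two levels are interpreted relative to Verified.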
loan50 <- loan50 |>
  mutate(verified_income = factor(verified_income,
                                  levels = c("Verified", "Not Verified",
                                             "Source Verified")))
Test/train split
Split the data into 90% training, 10% testing.
set.seed(123)
loan_split <- initial_split(loan50, prop = 0.9)
loan_train <- training(loan_split)
loan_test <- testing(loan_split)
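As a quick check on what a 90/10 split produces, the sketch below applies initial_split() to a made-up 50-row data frame and counts the rows on each side. The names `toy_df`, `toy_split`, `toy_train`, and `toy_test` are hypothetical and not part of the AE. It assumes the rsample package (loaded as part of tidymodels) is installed.

```r
library(rsample)

# Toy data frame (made up for illustration): 50 rows identified by id
toy_df <- data.frame(id = 1:50)

# Setting the seed makes the random assignment reproducible
set.seed(123)
toy_split <- initial_split(toy_df, prop = 0.9)

toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Roughly 90% of the rows land in training, the rest in testing
nrow(toy_train)
nrow(toy_test)

# No row appears in both sets
intersect(toy_train$id, toy_test$id)
```

Re-running with the same seed gives the same split, which is why set.seed() comes before initial_split().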
Build (and troubleshoot) recipe
- Use step_mutate() to create a new variable annual_income_th that is annual_income rescaled to thousands of dollars
- Use step_rm() to remove annual_income from the model
- Use step_center() to mean-center quantitative variables
- Use step_dummy() to create indicator variables for the categorical predictor
- Use step_interact() to create interaction between annual_income_th and verified_income
loan_rec <- recipe(interest_rate ~ annual_income + debt_to_income + verified_income,
                   data = loan_train) |>
  step_mutate(annual_income_th = annual_income / 1000) |>
  step_rm(annual_income) |>
  step_center(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(terms = ~ annual_income_th:verified_income)
loan_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 3
── Operations
• Variable mutation for: annual_income / 1000
• Variables removed: annual_income
• Centering for: all_numeric_predictors()
• Dummy variables from: all_nominal_predictors()
• Interactions with: annual_income_th:verified_income
Check recipe using prep() and bake()
Once you’ve corrected the code, remove #| eval: false before rendering the document.
loan_rec |>
  prep() |>
  bake(loan_train) |>
  glimpse()
- In which step do we have an error?
- Open the recipes reference page. Find the reference page for the relevant step_ function.
- See the examples at the bottom of the reference page. Which example most closely aligns with the interaction we’re trying to create?
- Use that example to help you fix the code. Then, use prep() and bake() to see the updated results.
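The prep() and bake() pattern is easier to follow on a smaller problem. The sketch below uses a made-up data frame (`toy`, `toy_rec`, `toy_baked`, and the columns `y` and `income` are hypothetical) and deliberately leaves out the dummy and interaction steps, so it is not a solution to the exercise. It assumes the recipes package (loaded as part of tidymodels) is installed.

```r
library(recipes)

# Toy data (made up for illustration; not the loan50 data)
toy <- data.frame(
  y = c(10, 12, 15, 11),
  income = c(40000, 60000, 80000, 100000)
)

# Rescale income to thousands, drop the original column,
# then mean-center the remaining numeric predictor
toy_rec <- recipe(y ~ income, data = toy) |>
  step_mutate(income_th = income / 1000) |>
  step_rm(income) |>
  step_center(all_numeric_predictors())

# prep() estimates what each step needs from the data
# (here, the mean of income_th); bake() applies the trained steps
toy_baked <- toy_rec |>
  prep() |>
  bake(new_data = toy)

toy_baked
```

Because the centered column is income_th minus its mean, its values should average to zero, and the original income column should no longer appear.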
Workflows and model fitting
Specify model
loan_spec <- linear_reg() |>
  set_engine("lm")

loan_spec
Linear Regression Model Specification (regression)
Computational engine: lm
Build workflow
loan_wflow <- workflow() |>
  add_model(loan_spec) |>
  add_recipe(loan_rec)

loan_wflow
Fit model to training data
Remove #| eval: false before rendering the document.
loan_fit <- loan_wflow |>
  fit(data = loan_train)

tidy(loan_fit) |>
  kable(digits = 3)
Evaluate model on training data
Make predictions
Fill in the code and remove #| eval: false before rendering the document.
loan_train_pred <- predict(loan_fit, ______) |>
  bind_cols(_____)
Calculate \(R^2\)
Fill in the code and remove #| eval: false before rendering the document.
rsq(loan_train_pred, truth = _____, estimate = _____)
Calculate RMSE
Fill in the code and remove #| eval: false before rendering the document.
rmse(______, ________, ________)
Is this RMSE considered high or low? Hint: Consider the range of the response variable to answer this question.
loan_train |>
  summarise(min = min(interest_rate), max = max(interest_rate))
# A tibble: 1 × 2
    min   max
  <dbl> <dbl>
1  5.31  26.3
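To make the two metrics concrete, the sketch below computes them by hand in base R on made-up vectors (`truth` and `estimate` are hypothetical values, not model output from this AE). It relies on yardstick's rsq() being the squared correlation between the observed and predicted values, and on RMSE being the square root of the mean squared error. The last line shows one possible way to put RMSE on the scale of the response, in the spirit of the hint above.

```r
# Toy observed and predicted values (made up for illustration)
truth <- c(8, 10, 12, 14)
estimate <- c(9, 10, 11, 15)

# RMSE: square the errors, average them, take the square root
rmse_by_hand <- sqrt(mean((truth - estimate)^2))
rmse_by_hand

# R-squared as the squared correlation of truth and estimate
rsq_by_hand <- cor(truth, estimate)^2
rsq_by_hand

# RMSE as a fraction of the range of the observed response
rmse_by_hand / (max(truth) - min(truth))
```

RMSE is in the same units as the response, so it only reads as high or low relative to how much the response itself varies.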
Evaluate model on testing data
Make predictions
# fill in code to make predictions from testing data
Calculate \(R^2\)
# fill in code to calculate R-sq for testing data
Calculate RMSE
# fill in code to calculate RMSE for testing data
Compare training and testing data results
Compare the \(R^2\) for the training and testing data. Is this what you expected?
Compare the RMSE for the training and testing data. Is this what you expected?
To submit the AE
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your ae-10 repo on GitHub. (You do not submit AEs on Gradescope).