AE 09: Feature engineering with recipes

Peer-to-peer lender

Published

October 9, 2023

Important

Go to the course GitHub organization and locate your ae-09 repo to get started.

Render, commit, and push your responses to GitHub by the end of class. The responses are due in your GitHub repo no later than Thursday, October 12 at 11:59pm.

Packages + data

library(tidyverse)
library(tidymodels)
library(knitr)
library(openintro)

The data for this AE is from the loan50 data set in the openintro R package. We will focus on the following variables:

Predictors

  • annual_income: Annual income (in US dollars)
  • debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
  • verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Response

  • interest_rate: Interest rate for the loan (0- 100)

Analysis goal

The goals of this analysis are to build a recipe to fit a linear regression model on the training data that has the following features:

  • annual_income rescaled to thousands of dollars
  • Mean-centered quantitative variables
  • Indicator (dummy) variables for the categorical predictor
  • Interaction term between rescaled annual_income and verified_income

and (2) use prep() and bake() to check the recipe

Test/train split

Fill in the code to split the data into 90% training, 10% testing.

Important

Remove #| eval: false from the code chunk.

set.seed(123)

loans_split <- initial_split(loan50, prop = _____)
loan_train <- training(_____)
loan_test  <- _____(loan_split)

Build a recipe

  • Use step_mutate() to create a new variable annual_income_th that is annual_income rescaled to thousands of dollars

  • Use step_center() to mean-center quantitative variables

  • Use step_dummy() to create indicator variables for the categorical predictor

  • Use step_interact() to create interaction between annual_income_th and verified_income

Important

Remove #| eval: false from the code chunk.

# use original variables when specifying recipe
loan_rec <-  recipe(interest_rate ~ annaul_income + debt_to_income + verified_income, 
                    data = loan_train) |>
  # add recipe steps
loan_rec

Check recipe using prep() and bake()

Remove #| eval: false from the code chunk

# determine required parameters to be estimated
loan_rec_trained <- prep(loan_rec)

# apply recipe computations to data
bake(loan_rec_trained, loan_train) |>
  glimpse()

To submit the AE

Important
  • Render the document to produce the PDF with all of your work from today’s class.
  • Push all your work to your ae-09 repo on GitHub. (You do not submit AEs on Gradescope).