Oct 09, 2023
Group labs resume this week
Prof. Tackett will not have office hours on Friday
Email to schedule an appointment if you need to meet
All other office hours on regular schedule
Dr. Rafael Irizarry is a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health and Professor of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute. He earned a Bachelor of Science degree in Mathematics from the University of Puerto Rico at Rio Piedras and a PhD from the University of California, Berkeley in Statistics. Dr. Irizarry’s work is highly cited, and he has been given many prestigious awards including the Presidents’ Award given by the Committee of Presidents of Statistical Societies.
Article: Kishore, N., Marqués, D., Mahmud, A., Kiang, M. V., Rodriguez, I., Fuller, A., ... & Buckee, C. O. (2018). Mortality in Puerto Rico after Hurricane Maria. New England journal of medicine, 379(2), 162-170.
GitHub repo: github.com/c2-d2/pr_mort_official
Understanding categorical predictors and interaction terms
Feature engineering
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50
data frame in the openintro R package.
# A tibble: 50 × 4
annual_income debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59000 0.558 Not Verified 10.9
2 60000 1.31 Not Verified 9.92
3 75000 1.06 Verified 26.3
4 75000 0.574 Not Verified 9.92
5 254000 0.238 Not Verified 9.43
6 67000 1.08 Source Verified 9.92
7 28800 0.0997 Source Verified 17.1
8 80000 0.351 Not Verified 6.08
9 34000 0.698 Not Verified 7.97
10 80000 0.167 Source Verified 12.6
# ℹ 40 more rows
Predictors:
annual_income
: Annual incomedebt_to_income
: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total incomeverified_income
: Whether borrower’s income source and amount have been verified (Not Verified
, Source Verified
, Verified
)Response: interest_rate
: Interest rate for the loan
interest_rate
min | median | max | iqr |
---|---|---|---|
5.31 | 9.93 | 26.3 | 5.755 |
verified_income
The lines are not parallel indicating there is an interaction effect. The slope of annual income differs based on the income verification.
Defining the interaction variable in the model formula as verified_income * annual_income_th_cent
is an implicit data manipulation step as well
Rows: 50
Columns: 9
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent <dbl> -0.16511719, 0.…
$ annual_income_th_cent <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified` <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified` <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified` <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified` <dbl> 0.00, 0.00, -11…
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.484 | 0.989 | 9.586 | 0.000 |
debt_inc_cent | 0.691 | 0.685 | 1.009 | 0.319 |
verified_incomeSource Verified | 2.157 | 1.418 | 1.522 | 0.135 |
verified_incomeVerified | 7.181 | 1.870 | 3.840 | 0.000 |
annual_income_th_cent | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified:annual_income_th_cent | -0.016 | 0.026 | -0.643 | 0.523 |
verified_incomeVerified:annual_income_th_cent | -0.032 | 0.033 | -0.979 | 0.333 |
annual_income
for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023% (-0.007 + -0.016) for each additional thousand dollars in annual income, holding all else constant.\[ \begin{aligned} \hat{interest\_rate} &= 9.484 + 0.691 \times debt\_inc\_cent\\ &- 0.007 \times annual\_income\_th\_cent \\ &+ 2.157 \times SourceVerified + 7.181 \times Verified \\ &- 0.016 \times annual\_inc\_th\_cent \times SourceVerified\\ &- 0.032 \times annual\_inc\_th\_cent \times Verified \end{aligned} \]
The data come from data.world, by way of TidyTuesday
# A tibble: 188 × 6
season episode title imdb_rating total_votes air_date
<dbl> <dbl> <chr> <dbl> <dbl> <date>
1 1 1 Pilot 7.6 3706 2005-03-24
2 1 2 Diversity Day 8.3 3566 2005-03-29
3 1 3 Health Care 7.9 2983 2005-04-05
4 1 4 The Alliance 8.1 2886 2005-04-12
5 1 5 Basketball 8.4 3179 2005-04-19
6 1 6 Hot Girl 7.8 2852 2005-04-26
7 2 1 The Dundies 8.7 3213 2005-09-20
8 2 2 Sexual Harassment 8.2 2736 2005-09-27
9 2 3 Office Olympics 8.4 2742 2005-10-04
10 2 4 The Fire 8.4 2713 2005-10-11
# ℹ 178 more rows
There are several steps to create a useful model: parameter estimation, model selection, performance assessment, etc.
Doing all of this on the entire data we have available leaves us with no other data to assess our choices
We can allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount to the model parameter estimation only (which is what we’ve done so far)
Step 1: Create an initial split:
# A tibble: 141 × 6
season episode title imdb_rating total_votes air_date
<dbl> <dbl> <chr> <dbl> <dbl> <date>
1 8 18 Last Day in Florida 7.8 1429 2012-03-08
2 9 14 Vandalism 7.6 1402 2013-01-31
3 2 8 Performance Review 8.2 2416 2005-11-15
4 9 5 Here Comes Treble 7.1 1515 2012-10-25
5 3 22 Beach Games 9.1 2783 2007-05-10
6 7 1 Nepotism 8.4 1897 2010-09-23
7 3 15 Phyllis' Wedding 8.3 2283 2007-02-08
8 9 21 Livin' the Dream 8.9 2041 2013-05-02
9 9 18 Promos 8 1445 2013-04-04
10 8 12 Pool Party 8 1612 2012-01-19
# ℹ 131 more rows
We prefer simple (parsimonious) models when possible, but parsimony does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity
Variables that go into the model and how they are represented are just as critical to success of the model
Feature engineering allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance and improve interpretability)
office_train |>
mutate(
season = as_factor(season),
month = lubridate::month(air_date),
wday = lubridate::wday(air_date)
)
# A tibble: 141 × 8
season episode title imdb_rating total_votes air_date month wday
<fct> <dbl> <chr> <dbl> <dbl> <date> <dbl> <dbl>
1 8 18 Last Day in Flo… 7.8 1429 2012-03-08 3 5
2 9 14 Vandalism 7.6 1402 2013-01-31 1 5
3 2 8 Performance Rev… 8.2 2416 2005-11-15 11 3
4 9 5 Here Comes Treb… 7.1 1515 2012-10-25 10 5
5 3 22 Beach Games 9.1 2783 2007-05-10 5 5
6 7 1 Nepotism 8.4 1897 2010-09-23 9 5
# ℹ 135 more rows
Can you identify any potential problems with this approach?
Create a recipe for feature engineering steps to be applied to the training data
Fit the model to the training data after these steps have been applied
Using the model estimates from the training data, predict outcomes for the test data
Evaluate the performance of the model on the test data
title
isn’t a predictor, but we might want to keep it around as an ID
New features for day of week and month
Identify holidays in air_date
, then remove air_date
office_rec <- office_rec |>
step_holiday(
air_date,
holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"),
keep_original_cols = FALSE
)
office_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 4
ID: 1
── Operations
• Date features from: air_date
• Holiday features from: air_date
Convert season
to factor
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 4
ID: 1
── Operations
• Date features from: air_date
• Holiday features from: air_date
• Factor variables from: season
Convert all nominal (categorical) predictors to factors
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 4
ID: 1
── Operations
• Date features from: air_date
• Holiday features from: air_date
• Factor variables from: season
• Dummy variables from: all_nominal_predictors()
Remove all predictors that contain only a single value
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 4
ID: 1
── Operations
• Date features from: air_date
• Holiday features from: air_date
• Factor variables from: season
• Dummy variables from: all_nominal_predictors()
• Zero variance filter on: all_predictors()
office_rec <- recipe(imdb_rating ~ ., data = office_train) |>
# make title's role ID
update_role(title, new_role = "ID") |>
# extract day of week and month of air_date
step_date(air_date, features = c("dow", "month")) |>
# identify holidays and add indicators
step_holiday(
air_date,
holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"),
keep_original_cols = FALSE
) |>
# turn season into factor
step_num2factor(season, levels = as.character(1:9)) |>
# make dummy variables
step_dummy(all_nominal_predictors()) |>
# remove zero variance predictors
step_zv(all_predictors())
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 4
ID: 1
── Operations
• Date features from: air_date
• Holiday features from: air_date
• Factor variables from: season
• Dummy variables from: all_nominal_predictors()
• Zero variance filter on: all_predictors()
We will complete the workflow to fit a model predicting IMDB ratings that includes the following predictors:
episode
total_votes
season
air_date
)air_date
)What feature will not end up in the final model? Why is it not included?
When building recipes you in a pipeline, you don’t get to see the effect of the recipe on your data, which can be unsettling
You can take a peek at what will happen when you ultimately apply the recipe to your data at the time of fitting the model
This requires two functions: prep()
to train the recipe and bake()
to apply it to your data
Note
This is optional, we’ll show the results for demonstrative purposes. It doesn’t need to be part of your modeling pipeline, but it can be assuring to see the effects of the recipe steps as you build the recipe.