Sep 27, 2023
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50
data frame in the openintro R package.
# A tibble: 50 × 4
annual_income debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59000 0.558 Not Verified 10.9
2 60000 1.31 Not Verified 9.92
3 75000 1.06 Verified 26.3
4 75000 0.574 Not Verified 9.92
5 254000 0.238 Not Verified 9.43
6 67000 1.08 Source Verified 9.92
7 28800 0.0997 Source Verified 17.1
8 80000 0.351 Not Verified 6.08
9 34000 0.698 Not Verified 7.97
10 80000 0.167 Source Verified 12.6
# ℹ 40 more rows
Predictors:
annual_income
: Annual incomedebt_to_income
: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total incomeverified_income
: Whether borrower’s income source and amount have been verified (Not Verified
, Source Verified
, Verified
)Outcome: interest_rate
: Interest rate for the loan
interest_rate
min | median | max | iqr |
---|---|---|---|
5.31 | 9.93 | 26.3 | 5.755 |
So far we’ve used a single predictor variable to understand variation in a quantitative response variable
Now we want to use multiple predictor variables to understand variation in a quantitative response variable
Based on the analysis goals, we will use a multiple linear regression model of the following form
\[ \begin{aligned}\hat{\text{interest_rate}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{debt_to_income} \\ & + \hat{\beta}_2 \text{verified_income} \\ &+ \hat{\beta}_3 \text{annual_income_th} \end{aligned} \]
Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values interest_rate
follow a Normal distribution.
Recall: The simple linear regression model assumes
\[ Y|X\sim N(\beta_0 + \beta_1 X, \sigma_{\epsilon}^2) \]
Similarly: The multiple linear regression model assumes
\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_{\epsilon}^2) \]
At any combination of the predictors, the mean value of the response \(Y\), is
\[ \mu_{Y|X_1, \ldots, X_p} = \beta_0 + \beta_1 X_{1} + \beta_2 X_2 + \dots + \beta_p X_p \]
Using multiple linear regression, we can estimate the mean response for any combination of predictors
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1} + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_{p} \]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 |
\[ \begin{align}\hat{\text{interest_rate}} = 10.726 &+0.671 \times \text{debt_to_income}\\ &+ 2.211 \times \text{source_verified}\\ &+ 6.880 \times \text{verified}\\ & -0.021 \times \text{annual_income_th} \end{align} \]
Note
We will talk about why there are two terms in the model for verified_income
shortly!
debt_to_income
is 0.671. This means for each point in an borrower’s debt to income ratio, the interest rate on the loan is expected to be greater by 0.671%, holding annual income and income verification constant.What is the predicted interest rate for an borrower with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000?
The predicted interest rate for an borrower with with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000 is 9.86%.
Just like with simple linear regression, we can use the predict()
function in R to calculate the appropriate intervals for our predicted values:
new_borrower <- tibble(
debt_to_income = 0.558,
verified_income = "Not Verified",
annual_income_th = 59
)
predict(int_fit, new_borrower)
# A tibble: 1 × 1
.pred
<dbl>
1 9.89
Note
Difference in predicted value due to rounding the coefficients on the previous slide.
Calculate a 90% confidence interval for the estimated mean interest rate for borrowers with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000.
Calculate a 90% confidence interval for the predicted interest rate for an individual appllicant with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?
If we are interested in interpreting the intercept, we can mean-center the quantitative predictors in the model.
We can mean-center a quantitative predictor \(X_j\) using the following:
\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]
If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.
How do you expect the model to change if we use the debt_inc_cent
and annual_income_cent
in the model?
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 |
debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
term | estimate |
---|---|
(Intercept) | 10.726 |
debt_to_income | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th | -0.021 |
term | estimate |
---|---|
(Intercept) | 9.444 |
debt_inc_cent | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th_cent | -0.021 |
Suppose there is a categorical variable with \(K\) categories (levels)
We can make \(K\) indicator variables - one indicator for each category
An indicator variable takes values 1 or 0
verified_income
# A tibble: 3 × 4
verified_income not_verified source_verified verified
<fct> <dbl> <dbl> <dbl>
1 Not Verified 1 0 0
2 Verified 0 0 1
3 Source Verified 0 1 0
verified_income
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 |
debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
Not verified
.Interpret the coefficient of Verified
in the context of the data.
The lines are not parallel indicating there is an interaction effect. The slope of annual income differs based on the income verification.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.484 | 0.989 | 9.586 | 0.000 |
debt_inc_cent | 0.691 | 0.685 | 1.009 | 0.319 |
verified_incomeSource Verified | 2.157 | 1.418 | 1.522 | 0.135 |
verified_incomeVerified | 7.181 | 1.870 | 3.840 | 0.000 |
annual_income_th_cent | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified:annual_income_th_cent | -0.016 | 0.026 | -0.643 | 0.523 |
verified_incomeVerified:annual_income_th_cent | -0.032 | 0.033 | -0.979 | 0.333 |
annual_income
for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023% (-0.007 + -0.016) for each additional thousand dollars in annual income, holding all else constant.Defining the interaction variable in the model formula as verified_income * annual_income_th_cent
is an implicit data manipulation step as well
Rows: 50
Columns: 9
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent <dbl> -0.16511719, 0.…
$ annual_income_th_cent <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified` <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified` <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified` <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified` <dbl> 0.00, 0.00, -11…
Introduced multiple linear regression
Interpreted coefficients in the multiple linear regression model
Calculated predictions and associated intervals for multiple linear regression models
Mean-centered quantitative predictors
Used indicator variables for categorical predictors
Used interaction terms
Data manipulation, with dplyr (from tidyverse):
loan50 |>
select(interest_rate, annual_income, debt_to_income, verified_income) |>
mutate(
# 1. rescale income
annual_income_th = annual_income / 1000,
# 2. mean-center quantitative predictors
debt_inc_cent = debt_to_income - mean(debt_to_income),
annual_income_th_cent = annual_income_th - mean(annual_income_th),
# 3. create dummy variables for verified_income
source_verified = if_else(verified_income == "Source Verified", 1, 0),
verified = if_else(verified_income == "Verified", 1, 0),
# 4. create interaction variables
`annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
`annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
)
Feature engineering, with recipes (from tidymodels):
loan_rec <- recipe( ~ ., data = loan50) |>
# 1. rescale income
step_mutate(annual_income_th = annual_income / 1000) |>
# 2. mean-center quantitative predictors
step_center(all_numeric_predictors()) |>
# 3. create dummy variables for verified_income
step_dummy(verified_income) |>
# 4. create interaction variables
step_interact(terms = ~ annual_income_th:verified_income)
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
predictor: 24
── Operations
• Variable mutation for: annual_income / 1000
• Centering for: all_numeric_predictors()
• Dummy variables from: verified_income
• Interactions with: annual_income_th:verified_income