Multiple linear regression (MLR)

Prof. Maria Tackett

Sep 27, 2023

Announcements

  • HW 02 due Mon, Oct 2 at 11:59pm.
  • All lecture recordings available until Wed, Oct 4 at 9am.
    • Click here for link to videos. You can also find the link in the navigation bar of the course website.
  • Lab groups start this week. You will get your assigned group when you go to lab.
  • Submit questions about SLR by Thu, Sep 28. These questions will be used to make the Exam Review. Click here for more info.
  • Exam 01: Wed, Oct 4 (in-class + take-home)
  • Exam 01 review: Mon, Oct 2

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(colorblindr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Considering multiple variables

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 4
   annual_income debt_to_income verified_income interest_rate
           <dbl>          <dbl> <fct>                   <dbl>
 1         59000         0.558  Not Verified            10.9 
 2         60000         1.31   Not Verified             9.92
 3         75000         1.06   Verified                26.3 
 4         75000         0.574  Not Verified             9.92
 5        254000         0.238  Not Verified             9.43
 6         67000         1.08   Source Verified          9.92
 7         28800         0.0997 Source Verified         17.1 
 8         80000         0.351  Not Verified             6.08
 9         34000         0.698  Not Verified             7.97
10         80000         0.167  Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

  • annual_income: Annual income
  • debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
  • verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Outcome: interest_rate: Interest rate for the loan

Outcome: interest_rate

min median max iqr
5.31 9.93 26.3 5.755

Predictors

Data manipulation 1: Rescale income

loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000)

ggplot(loan50, aes(x = annual_income_th)) +
  geom_histogram(binwidth = 20) +
  labs(title = "Annual income (in $1000s)", 
       x = "")

Outcome vs. predictors

Single vs. multiple predictors

So far we’ve used a single predictor variable to understand variation in a quantitative response variable

Now we want to use multiple predictor variables to understand variation in a quantitative response variable

Multiple linear regression

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

\[ \begin{aligned}\hat{\text{interest_rate}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{debt_to_income} \\ & + \hat{\beta}_2 \text{verified_income} \\ &+ \hat{\beta}_3 \text{annual_income_th} \end{aligned} \]

Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values interest_rate follow a Normal distribution.

Multiple linear regression

Recall: The simple linear regression model assumes

\[ Y|X\sim N(\beta_0 + \beta_1 X, \sigma_{\epsilon}^2) \]

Similarly: The multiple linear regression model assumes

\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_{\epsilon}^2) \]

Multiple linear regression

At any combination of the predictors, the mean value of the response \(Y\), is

\[ \mu_{Y|X_1, \ldots, X_p} = \beta_0 + \beta_1 X_{1} + \beta_2 X_2 + \dots + \beta_p X_p \]

Using multiple linear regression, we can estimate the mean response for any combination of predictors

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1} + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_{p} \]

Model fit

term estimate std.error statistic p.value
(Intercept) 10.726 1.507 7.116 0.000
debt_to_income 0.671 0.676 0.993 0.326
verified_incomeSource Verified 2.211 1.399 1.581 0.121
verified_incomeVerified 6.880 1.801 3.820 0.000
annual_income_th -0.021 0.011 -1.804 0.078

Model equation

\[ \begin{align}\hat{\text{interest_rate}} = 10.726 &+0.671 \times \text{debt_to_income}\\ &+ 2.211 \times \text{source_verified}\\ &+ 6.880 \times \text{verified}\\ & -0.021 \times \text{annual_income_th} \end{align} \]

Note

We will talk about why there are two terms in the model for verified_income shortly!

Interpreting \(\hat{\beta}_j\)

  • The estimated coefficient \(\hat{\beta}_j\) is the expected change in the mean of \(y\) when \(x_j\) increases by one unit, holding the values of all other predictor variables constant.
  • Example: The estimated coefficient for debt_to_income is 0.671. This means for each point in an borrower’s debt to income ratio, the interest rate on the loan is expected to be greater by 0.671%, holding annual income and income verification constant.

Prediction

What is the predicted interest rate for an borrower with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000?


10.726 + 0.671 * 0.558 + 2.211 * 0 + 6.880 * 0 - 0.021 * 59
[1] 9.861418

The predicted interest rate for an borrower with with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000 is 9.86%.

Prediction, revisited

Just like with simple linear regression, we can use the predict() function in R to calculate the appropriate intervals for our predicted values:

new_borrower <- tibble(
  debt_to_income  = 0.558, 
  verified_income = "Not Verified", 
  annual_income_th = 59
)

predict(int_fit, new_borrower)
# A tibble: 1 × 1
  .pred
  <dbl>
1  9.89

Note

Difference in predicted value due to rounding the coefficients on the previous slide.

Confidence interval for \(\hat{\mu}_y\)

Calculate a 90% confidence interval for the estimated mean interest rate for borrowers with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000.


predict(int_fit, new_borrower, type = "conf_int", level = 0.90)
# A tibble: 1 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1        8.16        11.6

Prediction interval for \(\hat{y}\)

Calculate a 90% confidence interval for the predicted interest rate for an individual appllicant with an debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000.


predict(int_fit, new_borrower, type = "pred_int", level = 0.90)
# A tibble: 1 × 2
  .pred_lower .pred_upper
        <dbl>       <dbl>
1        2.18        17.6

Cautions

  • Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
  • The multiple regression model only shows association, not causality
    • To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study

Types of predictors

Interpreting results

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 10.726 1.507 7.116 0.000 7.690 13.762
debt_to_income 0.671 0.676 0.993 0.326 -0.690 2.033
verified_incomeSource Verified 2.211 1.399 1.581 0.121 -0.606 5.028
verified_incomeVerified 6.880 1.801 3.820 0.000 3.253 10.508
annual_income_th -0.021 0.011 -1.804 0.078 -0.043 0.002


Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

Mean-centered variables

Mean-centering

If we are interested in interpreting the intercept, we can mean-center the quantitative predictors in the model.

We can mean-center a quantitative predictor \(X_j\) using the following:

\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]

If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.

Data manipulation 2: Mean-center numeric predictors

loan50 <- loan50 |>
  mutate(
    debt_inc_cent = debt_to_income - mean(debt_to_income), 
    annual_income_th_cent = annual_income_th - mean(annual_income_th)
    )

Visualize mean-centered predictors

Using mean-centered variables in the model

How do you expect the model to change if we use the debt_inc_cent and annual_income_cent in the model?

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 9.444 0.977 9.663 0.000 7.476 11.413
debt_inc_cent 0.671 0.676 0.993 0.326 -0.690 2.033
verified_incomeSource Verified 2.211 1.399 1.581 0.121 -0.606 5.028
verified_incomeVerified 6.880 1.801 3.820 0.000 3.253 10.508
annual_income_th_cent -0.021 0.011 -1.804 0.078 -0.043 0.002

Original vs. mean-centered model

term estimate
(Intercept) 10.726
debt_to_income 0.671
verified_incomeSource Verified 2.211
verified_incomeVerified 6.880
annual_income_th -0.021
term estimate
(Intercept) 9.444
debt_inc_cent 0.671
verified_incomeSource Verified 2.211
verified_incomeVerified 6.880
annual_income_th_cent -0.021

Indicator variables

Indicator variables

  • Suppose there is a categorical variable with \(K\) categories (levels)

  • We can make \(K\) indicator variables - one indicator for each category

  • An indicator variable takes values 1 or 0

    • 1 if the observation belongs to that category
    • 0 if the observation does not belong to that category

Data manipulation 3: Create indicator variables for verified_income

loan50 <- loan50 |>
  mutate(
    not_verified = if_else(verified_income == "Not Verified", 1, 0),
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0)
  )
# A tibble: 3 × 4
  verified_income not_verified source_verified verified
  <fct>                  <dbl>           <dbl>    <dbl>
1 Not Verified               1               0        0
2 Verified                   0               0        1
3 Source Verified            0               1        0

Indicators in the model

  • We will use \(K-1\) of the indicator variables in the model.
  • The baseline is the category that doesn’t have a term in the model.
  • The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
  • This approach is also called dummy coding.
loan50 |>
  select(verified_income, source_verified, verified) |>
  slice(1, 3, 6)
# A tibble: 3 × 3
  verified_income source_verified verified
  <fct>                     <dbl>    <dbl>
1 Not Verified                  0        0
2 Verified                      0        1
3 Source Verified               1        0

Interpreting verified_income

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 9.444 0.977 9.663 0.000 7.476 11.413
debt_inc_cent 0.671 0.676 0.993 0.326 -0.690 2.033
verified_incomeSource Verified 2.211 1.399 1.581 0.121 -0.606 5.028
verified_incomeVerified 6.880 1.801 3.820 0.000 3.253 10.508
annual_income_th_cent -0.021 0.011 -1.804 0.078 -0.043 0.002
  • The baseline category is Not verified.
  • People with source verified income are expected to take a loan with an interest rate that is 2.211% higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.

Interpret the coefficient of Verified in the context of the data.

Interaction terms

Interaction terms

  • Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
  • This is an interaction effect.
  • To account for this, we can include interaction terms in the model.

Interest rate vs. annual income

The lines are not parallel indicating there is an interaction effect. The slope of annual income differs based on the income verification.

Interaction term in model

int_cent_int_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(interest_rate ~ debt_inc_cent + verified_income + annual_income_th_cent + verified_income * annual_income_th_cent,
      data = loan50)
term estimate std.error statistic p.value
(Intercept) 9.484 0.989 9.586 0.000
debt_inc_cent 0.691 0.685 1.009 0.319
verified_incomeSource Verified 2.157 1.418 1.522 0.135
verified_incomeVerified 7.181 1.870 3.840 0.000
annual_income_th_cent -0.007 0.020 -0.341 0.735
verified_incomeSource Verified:annual_income_th_cent -0.016 0.026 -0.643 0.523
verified_incomeVerified:annual_income_th_cent -0.032 0.033 -0.979 0.333

Interpreting interaction terms

  • What the interaction means: The effect of annual income on the interest rate differs by -0.016 when the income is source verified compared to when it is not verified, holding all else constant.
  • Interpreting annual_income for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023% (-0.007 + -0.016) for each additional thousand dollars in annual income, holding all else constant.

Data manipulation 4: Create interaction variables

Defining the interaction variable in the model formula as verified_income * annual_income_th_cent is an implicit data manipulation step as well

Rows: 50
Columns: 9
$ `(Intercept)`                                          <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent                                          <dbl> -0.16511719, 0.…
$ annual_income_th_cent                                  <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified`                          <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified`                       <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified                                <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified`    <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified`        <dbl> 0.00, 0.00, -11…

Wrap up

Recap

  • Introduced multiple linear regression

  • Interpreted coefficients in the multiple linear regression model

  • Calculated predictions and associated intervals for multiple linear regression models

  • Mean-centered quantitative predictors

  • Used indicator variables for categorical predictors

  • Used interaction terms

Looking backward

Data manipulation, with dplyr (from tidyverse):

loan50 |>
  select(interest_rate, annual_income, debt_to_income, verified_income) |>
  mutate(
    # 1. rescale income
    annual_income_th = annual_income / 1000,
    # 2. mean-center quantitative predictors
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th),
    # 3. create dummy variables for verified_income
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0),
    # 4. create interaction variables
    `annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
    `annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
  )

Looking forward (after Exam 01)

Feature engineering, with recipes (from tidymodels):

loan_rec <- recipe( ~ ., data = loan50) |>
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) |>
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) |>
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) |>
  # 4. create interaction variables
  step_interact(terms = ~ annual_income_th:verified_income)

Recipe

loan_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
predictor: 24
── Operations 
• Variable mutation for: annual_income / 1000
• Centering for: all_numeric_predictors()
• Dummy variables from: verified_income
• Interactions with: annual_income_th:verified_income