SLR: Model evaluation

Prof. Maria Tackett

Sep 25, 2023

Announcements

  • HW 02 due Mon, Oct 2 at 11:59pm. (Released after Section 002)

  • All lecture recordings available until Wed, Oct 4 at 9am.

    • The link to the videos is in the navigation bar of the course website.
  • Lab groups start this week. You will get your assigned group when you go to lab.

  • Looking ahead: Exam 01

    • Closed note in-class: Wed, Oct 4

    • Open note take-home: Wed, Oct 4 - Fri, Oct 6

      • Released after Section 002
    • Exam review: Mon, Oct 2

Statistician of the day: Robert Santos

Robert Santos received an MA in Statistics from the University of Michigan, Ann Arbor. He served as president of the American Statistical Association in 2021. As a survey researcher, he worked at the National Opinion Research Center (NORC, University of Chicago) and the Urban Institute in Washington, DC.

As a Mexican-American, he is the first non-white person to serve as the Director of the US Census Bureau (appointed by Joe Biden and approved by the US Senate in 2022).

Source: hardin47.github.io/CURV/scholars/santos.html

Robert Santos

Santos is a survey researcher, with much of his recent focus on the US Census. In particular, he has written extensively about the miscounting of particular groups of people in the Census and about the relationship between race and ethnicity in surveys.

From his article “Is It Time to Postpone the 2020 Census?” (written during his time at the Urban Institute):

“This would create a worst-case scenario when it comes to political representation and allocation of federal resources…And the 2020 counts would then be baked in to population projections used to calibrate federal statistics and surveys, thus informing federal funds allocations and eligibility thresholds for the next 10 years.”

Related work: interactive feature “Who’s at Risk of Being Miscounted?”

Questions from last class?

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(openintro)   # for the duke_forest dataset
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Model conditions

  1. Linearity: There is a linear relationship between the outcome and predictor variable
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other
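
These four conditions restate the error assumptions of the simple linear regression model, which can be written compactly (standard notation, not from the original slides) as

\[ Y = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim N(0, \sigma^2_\epsilon) \]

with the errors independent across observations: linearity corresponds to the mean function \(\beta_0 + \beta_1 X\), and constant variance and normality to the \(N(0, \sigma^2_\epsilon)\) distribution of the errors.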

Augmented data frame

# fit a simple linear regression of price on area
df_fit <- linear_reg() |>
  fit(price ~ area, data = duke_forest)

# augment() appends fitted values (.fitted), residuals (.resid),
# and other case-level diagnostics to the data used to fit the model
df_aug <- augment(df_fit$fit)

head(df_aug)
# A tibble: 6 × 8
    price  area  .fitted  .resid   .hat  .sigma  .cooksd .std.resid
    <dbl> <dbl>    <dbl>   <dbl>  <dbl>   <dbl>    <dbl>      <dbl>
1 1520000  6040 1079931. 440069. 0.133  162605. 0.604         2.80 
2 1030000  4475  830340. 199660. 0.0435 168386. 0.0333        1.21 
3  420000  1745  394951.  25049. 0.0226 169664. 0.000260      0.150
4  680000  2091  450132. 229868. 0.0157 168011. 0.0150        1.37 
5  428500  1772  399257.  29243. 0.0220 169657. 0.000345      0.175
6  456000  1950  427645.  28355. 0.0182 169659. 0.000266      0.170
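
The columns added by augment() are exactly what we need to check the model conditions from the previous slide. A minimal sketch of two standard diagnostic plots (using ggplot2 and patchwork, both loaded above):

# residuals vs. fitted values: look for no pattern (linearity)
# and even vertical spread (constant variance)
p1 <- ggplot(df_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted value", y = "Residual")

# distribution of residuals: look for approximate normality
p2 <- ggplot(df_aug, aes(x = .resid)) +
  geom_histogram(bins = 20) +
  labs(x = "Residual", y = "Count")

p1 + p2  # patchwork places the two plots side by side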

Application exercise

Model evaluation

Two statistics

  • R-squared, \(R^2\): Percentage of variability in the outcome explained by the regression model (in the context of SLR, by the predictor)

    \[ R^2 = \text{Cor}(x,y)^2 = \text{Cor}(y, \hat{y})^2 \]

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

    \[ \text{RMSE} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)

  • Has no units

  • Calculate with rsq() using the augmented data:

rsq(df_aug, truth = price, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.445
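
As a sanity check of the formula above (a quick sketch, not from the original slides), \(R^2\) in SLR can also be computed directly as a squared correlation:

# for SLR, R^2 = Cor(x, y)^2
cor(duke_forest$price, duke_forest$area)^2  # 0.4451945, matching rsq() above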

Interpreting \(R^2\)

🗳️ Discussion

The \(R^2\) of the model for price from area of houses in Duke Forest is 44.5%. Which of the following is the correct interpretation of this value?

  1. Area correctly predicts 44.5% of price for houses in Duke Forest.
  2. 44.5% of the variability in price for houses in Duke Forest can be explained by area.
  3. 44.5% of the variability in area for houses in Duke Forest can be explained by price.
  4. 44.5% of the time price for houses in Duke Forest can be predicted by area.

Alternative approach for \(R^2\)

Alternatively, use glance() to construct a single-row summary of the model fit, including \(R^2\):

glance(df_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared   sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl>   <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.445         0.439 168798.      77.0 6.29e-14     1 -1318. 2641. 2649.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>


glance(df_fit)$r.squared
[1] 0.4451945
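
Note that sigma in the glance() output is the residual standard error, which is related to but not identical to RMSE: it divides the sum of squared residuals by the residual degrees of freedom \(n - 2\) rather than by \(n\). A minimal sketch of the relationship:

# residual standard error: like RMSE, but with n - 2 in the denominator
sqrt(sum(df_aug$.resid^2) / (nrow(df_aug) - 2))  # ~168798, glance(df_fit)$sigma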

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Calculate with rmse() using the augmented data:

    rmse(df_aug, truth = price, estimate = .fitted)
    # A tibble: 1 × 3
      .metric .estimator .estimate
      <chr>   <chr>          <dbl>
    1 rmse    standard     167067.
  • The value of RMSE is not very meaningful on its own, but it’s useful for comparing across models (more on this when we get to regression with multiple predictors)
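
Because RMSE is the square root of the mean squared residual, it can also be computed by hand from the augmented data, which is a useful check on the formula above (a minimal sketch):

# RMSE by hand: square root of the mean squared residual
sqrt(mean(df_aug$.resid^2))  # ~167067, matching rmse() above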

Obtaining \(R^2\) and RMSE

  • Use rsq() and rmse(), respectively

    rsq(df_aug, truth = price, estimate = .fitted)
    rmse(df_aug, truth = price, estimate = .fitted)
  • First argument: data frame containing truth and estimate columns

  • Second argument: name of the column containing truth (observed outcome)

  • Third argument: name of the column containing estimate (predicted outcome)
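
If you want both metrics in a single table, one option (a sketch using yardstick's metric_set(), available via tidymodels) is to bundle them into one metric function:

# bundle the two metrics, then compute both in one call
df_metrics <- metric_set(rsq, rmse)
df_metrics(df_aug, truth = price, estimate = .fitted)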

Application exercise

Questions about SLR?

We have officially completed simple linear regression! What remaining questions do you have? Please submit all questions by Thu, Sep 28. These questions will be used to make the Exam 01 review.

Note: Questions must be specific. For example:

  • ❌ How do you do simulation-based hypothesis testing?
  • ✅ Why does a small p-value correspond to rejecting the null hypothesis?