SLR: Model evaluation

Prof. Maria Tackett

Sep 25, 2023

Announcements

  • HW 02 due Mon, Oct 2 at 11:59pm. (Released after Section 002)

  • All lecture recordings available until Wed, Oct 4 at 9am.

    • The link to the videos is in the navigation bar of the course website.
  • Lab groups start this week. You will get your assigned group when you go to lab.

  • Looking ahead: Exam 01

    • Closed note in-class: Wed, Oct 4

    • Open note take-home: Wed, Oct 4 - Fri, Oct 6

      • Released after Section 002
    • Exam review: Mon, Oct 2

Statistician of the day: Robert Santos

Robert Santos received an MA in Statistics from the University of Michigan, Ann Arbor. He served as president of the American Statistical Association in 2021. As a survey researcher, he worked at the National Opinion Research Center (NORC, University of Chicago) and the Urban Institute in Washington, DC.

As a Mexican-American, he is the first non-white person to serve as the Director of the US Census Bureau (appointed by Joe Biden and approved by the US Senate in 2022).

Source: hardin47.github.io/CURV/scholars/santos.html

Robert Santos

Santos is a survey researcher, with much of his recent focus on the US Census. In particular, he has written extensively about the miscounting of particular groups of people in the Census and about the relationship between race and ethnicity in surveys.

From his article “Is It Time to Postpone the 2020 Census?” (written during his time at the Urban Institute):

“This would create a worst-case scenario when it comes to political representation and allocation of federal resources…And the 2020 counts would then be baked in to population projections used to calibrate federal statistics and surveys, thus informing federal funds allocations and eligibility thresholds for the next 10 years.”

Related work: interactive feature “Who’s at Risk of Being Miscounted?”

Questions from last class?

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(openintro)   # for the duke_forest dataset
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Model conditions

  1. Linearity: There is a linear relationship between the outcome and predictor variable
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other
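
These four conditions restate the error assumptions of the simple linear regression model, which can be written compactly (standard notation, not from the original slides) as

\[ Y = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim N(0, \sigma^2_\epsilon) \]

with the errors independent across observations: linearity corresponds to the mean function \(\beta_0 + \beta_1 X\), and constant variance and normality to the \(N(0, \sigma^2_\epsilon)\) distribution of the errors.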

Augmented data frame

# fit a simple linear regression of price on area
df_fit <- linear_reg() |>
  fit(price ~ area, data = duke_forest)

# augment() appends fitted values (.fitted), residuals (.resid),
# and other case-level diagnostics to the data used to fit the model
df_aug <- augment(df_fit$fit)

head(df_aug)
# A tibble: 6 × 8
    price  area  .fitted  .resid   .hat  .sigma  .cooksd .std.resid
    <dbl> <dbl>    <dbl>   <dbl>  <dbl>   <dbl>    <dbl>      <dbl>
1 1520000  6040 1079931. 440069. 0.133  162605. 0.604         2.80 
2 1030000  4475  830340. 199660. 0.0435 168386. 0.0333        1.21 
3  420000  1745  394951.  25049. 0.0226 169664. 0.000260      0.150
4  680000  2091  450132. 229868. 0.0157 168011. 0.0150        1.37 
5  428500  1772  399257.  29243. 0.0220 169657. 0.000345      0.175
6  456000  1950  427645.  28355. 0.0182 169659. 0.000266      0.170
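
The columns added by augment() are exactly what we need to check the model conditions from the previous slide. A minimal sketch of two standard diagnostic plots (using ggplot2 and patchwork, both loaded above):

# residuals vs. fitted values: look for no pattern (linearity)
# and even vertical spread (constant variance)
p1 <- ggplot(df_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted value", y = "Residual")

# distribution of residuals: look for approximate normality
p2 <- ggplot(df_aug, aes(x = .resid)) +
  geom_histogram(bins = 20) +
  labs(x = "Residual", y = "Count")

p1 + p2  # patchwork places the two plots side by side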

Application exercise

Model evaluation

Two statistics

  • R-squared, \(R^2\): Percentage of variability in the outcome explained by the regression model (in the context of SLR, by the predictor)

    \[ R^2 = \text{Cor}(x,y)^2 = \text{Cor}(y, \hat{y})^2 \]

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

    \[ \text{RMSE} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)

  • Has no units

  • Calculate with rsq() using the augmented data:

rsq(df_aug, truth = price, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.445
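
As a sanity check of the formula above (a quick sketch, not from the original slides), \(R^2\) in SLR can also be computed directly as a squared correlation:

# for SLR, R^2 = Cor(x, y)^2
cor(duke_forest$price, duke_forest$area)^2  # 0.4451945, matching rsq() above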

Interpreting \(R^2\)

🗳️ Discussion

The \(R^2\) of the model for price from area of houses in Duke Forest is 44.5%. Which of the following is the correct interpretation of this value?

  1. Area correctly predicts 44.5% of price for houses in Duke Forest.
  2. 44.5% of the variability in price for houses in Duke Forest can be explained by area.
  3. 44.5% of the variability in area for houses in Duke Forest can be explained by price.
  4. 44.5% of the time price for houses in Duke Forest can be predicted by area.

Alternative approach for \(R^2\)

Alternatively, use glance() to construct a single-row summary of the model fit, including \(R^2\):

glance(df_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared   sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl>   <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.445         0.439 168798.      77.0 6.29e-14     1 -1318. 2641. 2649.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>


glance(df_fit)$r.squared
[1] 0.4451945
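
Note that sigma in the glance() output is the residual standard error, which is related to but not identical to RMSE: it divides the sum of squared residuals by the residual degrees of freedom \(n - 2\) rather than by \(n\). A minimal sketch of the relationship:

# residual standard error: like RMSE, but with n - 2 in the denominator
sqrt(sum(df_aug$.resid^2) / (nrow(df_aug) - 2))  # ~168798, glance(df_fit)$sigma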

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Calculate with rmse() using the augmented data:

    rmse(df_aug, truth = price, estimate = .fitted)
    # A tibble: 1 × 3
      .metric .estimator .estimate
      <chr>   <chr>          <dbl>
    1 rmse    standard     167067.
  • The value of RMSE is not very meaningful on its own, but it’s useful for comparing across models (more on this when we get to regression with multiple predictors)
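
Because RMSE is the square root of the mean squared residual, it can also be computed by hand from the augmented data, which is a useful check on the formula above (a minimal sketch):

# RMSE by hand: square root of the mean squared residual
sqrt(mean(df_aug$.resid^2))  # ~167067, matching rmse() above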

Obtaining \(R^2\) and RMSE

  • Use rsq() and rmse(), respectively

    rsq(df_aug, truth = price, estimate = .fitted)
    rmse(df_aug, truth = price, estimate = .fitted)
  • First argument: data frame containing truth and estimate columns

  • Second argument: name of the column containing truth (observed outcome)

  • Third argument: name of the column containing estimate (predicted outcome)
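
If you want both metrics in a single table, one option (a sketch using yardstick's metric_set(), available via tidymodels) is to bundle them into one metric function:

# bundle the two metrics, then compute both in one call
df_metrics <- metric_set(rsq, rmse)
df_metrics(df_aug, truth = price, estimate = .fitted)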

Application exercise

Questions about SLR?

We have officially completed simple linear regression! What remaining questions do you have? Please submit all questions by Thu, Sep 28. These questions will be used to make the Exam 01 review.

Note: Questions must be specific. For example:

  • ❌ How do you do simulation-based hypothesis testing?
  • ✅ Why does a small p-value correspond to rejecting the null hypothesis?