SLR: Conditions + Model evaluation

Prof. Maria Tackett

Sep 20, 2022

Announcements

  • HW 01: due TODAY at 11:59pm

  • Lab 03:

    • due Fri at 11:59pm (Tue labs)

    • due Sun at 11:59pm (Thu labs)

  • Looking ahead: Exam 01:

    • Closed note in-class: Wed, Oct 4

    • Open note take-home: Wed, Oct 4 - Fri, Oct 6

      • Released after Section 002

    • More about the exam next week

Questions from last class?

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for modeling
library(openintro)   # for the duke_forest dataset
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Regression model, revisited

df_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 3)
term          estimate  std.error  statistic  p.value
(Intercept)  116652.325  53302.463      2.188    0.031
area            159.483     18.171      8.777    0.000
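
Based on this output, the estimated regression equation is

\[ \widehat{price} = 116{,}652.325 + 159.483 \times area \]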

Mathematical representation, visualized

\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

Image source: Introduction to the Practice of Statistics (5th ed)
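
Equivalently, the model can be written with the error term shown explicitly; this is the form that the conditions on the following slides refer to:

\[ Y = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2) \]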

Model conditions

Model conditions

  1. Linearity: There is a linear relationship between the outcome and predictor variables
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other

Linearity

  • If the linear model, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\), adequately describes the relationship between \(X\) and \(Y\), then the residuals should reflect random (chance) error

  • To assess this, we can look at a plot of the residuals vs. the fitted values

  • Linearity satisfied if there is no distinguishable pattern in the residuals plot, i.e. the residuals should be randomly scattered

  • A non-random pattern (e.g. a parabola) suggests a linear model does not adequately describe the relationship between \(X\) and \(Y\)

Linearity

✅ The residuals vs. fitted values plot should show a random scatter of residuals (no distinguishable pattern or structure)

Residuals vs. fitted values (code)

# create a data frame with the fitted value and residual for each observation
df_aug <- augment(df_fit$fit)

ggplot(df_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ylim(-1000000, 1000000) +
  labs(
    x = "Fitted value", y = "Residual",
    title = "Residuals vs. fitted values"
  )

Non-linear relationships

Constant variance

  • If the spread of the distribution of \(Y\) is equal for all values of \(X\), then the spread of the residuals should be approximately equal for each value of \(X\)

  • To assess this, we can look at a plot of the residuals vs. the fitted values

  • Constant variance satisfied if the vertical spread of the residuals is approximately equal as you move from left to right (i.e. there is no “fan” pattern)

  • A fan pattern suggests the constant variance assumption is not satisfied and transformation or some other remedy is required (more on this later in the semester)
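
If the fan pattern is hard to judge from the raw residuals, plotting the square root of the absolute residuals against the fitted values can make changes in spread easier to see. A minimal sketch (not part of the original slides), reusing the df_aug data frame created in the residual plot code:

# sketch: spread of residuals vs. fitted values
# an approximately flat trend suggests constant variance
ggplot(df_aug, aes(x = .fitted, y = sqrt(abs(.resid)))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    x = "Fitted value", y = "sqrt(|Residual|)",
    title = "Spread of residuals vs. fitted values"
  )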

Constant variance

✅ The vertical spread of the residuals is relatively constant across the plot

Non-constant variance

Normality

  • The linear model assumes that the distribution of \(Y\) is Normal for every value of \(X\)

  • This is impossible to check in practice, so we will look at the overall distribution of the residuals to assess if the normality assumption is satisfied

  • Normality satisfied if a histogram of the residuals is approximately normal

    • Can also check that the points on a normal QQ-plot fall along a diagonal line

  • Most inferential methods for regression are robust to some departures from normality, so we can proceed with inference if the sample size is sufficiently large, \(n > 30\)

Normality

Check normality using a QQ-plot

Code
ggplot(df_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 50000, color = "white") +
  labs(
    x = "Residual",
    y = "Count",
    title = "Histogram of residuals"
  )

Code
ggplot(df_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical quantile", 
       y = "Observed quantile", 
       title = "Normal QQ-plot of residuals")

  • Assess whether the residuals lie along the diagonal line of the normal quantile-quantile plot (QQ-plot).

  • If they do, the residuals are approximately normally distributed.

Normality

❌ The residuals do not appear to follow a normal distribution, because the points do not lie on the diagonal line, so normality is not satisfied.

✅ The sample size \(n = 98 > 30\), so the sample size is large enough to relax this condition and proceed with inference.

Independence

  • We can often check the independence assumption based on the context of the data and how the observations were collected

  • Two common violations of the independence assumption:

    • Serial Effect: If the data were collected over time, plot the residuals in time order to see if there is a pattern (serial correlation); see the sketch after this list

    • Cluster Effect: If there are subgroups represented in the data that are not accounted for in the model (e.g., type of house), you can color the points in the residual plots by group to see if the model systematically over or under predicts for a particular subgroup
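
For illustration, a minimal sketch of the serial-effect check, using row order as a stand-in for time order (the duke_forest data are not time-stamped, so this only demonstrates the plotting approach):

# sketch: residuals plotted in the order the observations appear in the data
# trends or cycles in this plot would suggest serial correlation
df_aug |>
  mutate(obs_order = row_number()) |>
  ggplot(aes(x = obs_order, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    x = "Observation order", y = "Residual",
    title = "Residuals vs. observation order"
  )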

Independence

Recall the description of the data:

  • Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020

  • Scraped from Zillow


✅ Based on the information we have, we can reasonably treat this as a random sample of Duke Forest houses and assume the error for one house does not tell us anything about the error for another house.

Recap

Used residual plots to check conditions for SLR:

  • Linearity
  • Constant variance
  • Normality
  • Independence

Which of these conditions are required for fitting an SLR? Which for simulation-based inference for the slope of an SLR? Which for inference with mathematical models?

Ed Discussion [Section 001][Section 002]


Comparing inferential methods

  • What are the advantages of using simulation-based inference methods? What are the advantages of using inference methods based on mathematical models?

  • Under what scenario(s) would you prefer to use simulation-based methods? Under what scenario(s) would you prefer to use methods based on mathematical models?


Application exercise

Model evaluation

Two statistics

  • R-squared, \(R^2\): Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor)

    \[ R^2 = Cor(x,y)^2 = Cor(y, \hat{y})^2 \]

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)

    \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)

  • Has no units

  • Calculate with rsq() using the augmented data:

rsq(df_aug, truth = price, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.445
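
As a check on the formula above, \(R^2\) can also be computed directly as the squared correlation between the observed and fitted values (a quick sketch using df_aug):

# squared correlation between observed prices and fitted values
# should match the rsq() output above
cor(df_aug$price, df_aug$.fitted)^2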

Interpreting \(R^2\)

🗳️ Discussion

The \(R^2\) of the model for price from area of houses in Duke Forest is 44.5%. Which of the following is the correct interpretation of this value?

  1. Area correctly predicts 44.5% of price for houses in Duke Forest.
  2. 44.5% of the variability in price for houses in Duke Forest can be explained by area.
  3. 44.5% of the variability in area for houses in Duke Forest can be explained by price.
  4. 44.5% of the time price for houses in Duke Forest can be predicted by area.

Alternative approach for \(R^2\)

Alternatively, use glance() to construct a single row summary of the model fit, including \(R^2\):

glance(df_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared   sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl>   <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.445         0.439 168798.      77.0 6.29e-14     1 -1318. 2641. 2649.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>


glance(df_fit)$r.squared
[1] 0.4451945

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Calculate with rmse() using the augmented data:

    rmse(df_aug, truth = price, estimate = .fitted)
    # A tibble: 1 × 3
      .metric .estimator .estimate
      <chr>   <chr>          <dbl>
    1 rmse    standard     167067.
  • The value of RMSE is not very meaningful on its own, but it’s useful for comparing across models (more on this when we get to regression with multiple predictors)
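
Similarly, RMSE can be computed directly from the residuals as a check on the formula (a quick sketch using df_aug):

# square root of the mean squared residual
# should match the rmse() output above (about 167,067)
sqrt(mean(df_aug$.resid^2))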

Obtaining \(R^2\) and RMSE

  • Use rsq() and rmse(), respectively

    rsq(df_aug, truth = price, estimate = .fitted)
    rmse(df_aug, truth = price, estimate = .fitted)
  • First argument: data frame containing truth and estimate columns

  • Second argument: name of the column containing truth (observed outcome)

  • Third argument: name of the column containing estimate (predicted outcome)

Application exercise