Oct 25, 2023
Project proposal due
Friday, October 27 (Tuesday labs)
Sunday, October 29 (Thursday labs)
HW 03 due Wednesday, November 1
Split data into training and test sets.
Use cross validation on the training set to fit, evaluate, and compare candidate models. Choose a final model based on a summary of the cross validation results.
Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).
Use the model fit on the entire training set for inference and prediction.
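A minimal tidymodels sketch of this workflow (the seed, split proportion, and object names are assumptions, not from the notes):

library(tidymodels)

set.seed(1234)  # for reproducibility (value assumed)

# split the data into training and test sets (75/25 split assumed)
rt_split <- initial_split(rail_trail, prop = 0.75)
rt_train <- training(rt_split)
rt_test  <- testing(rt_split)

# create cross validation folds from the training set
rt_folds <- vfold_cv(rt_train, v = 10)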
rail_trail
# A tibble: 90 × 7
volume hightemp avgtemp season cloudcover precip day_type
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 501 83 66.5 Summer 7.60 0 Weekday
2 419 73 61 Summer 6.30 0.290 Weekday
3 397 74 63 Spring 7.5 0.320 Weekday
4 385 95 78 Summer 2.60 0 Weekend
5 200 44 48 Spring 10 0.140 Weekday
6 375 69 61.5 Spring 6.60 0.0200 Weekday
7 417 66 52.5 Spring 2.40 0 Weekday
8 629 66 52 Spring 0 0 Weekend
9 533 80 67.5 Summer 3.80 0 Weekend
10 547 79 62 Summer 4.10 0 Weekday
# ℹ 80 more rows
Source: Pioneer Valley Planning Commission via the mosaicData package.
Outcome:

volume
: estimated number of trail users that day (number of breaks recorded)

Predictors:

hightemp
: daily high temperature (in degrees Fahrenheit)

avgtemp
: average of daily low and daily high temperature (in degrees Fahrenheit)

season
: one of “Fall”, “Spring”, or “Summer”

cloudcover
: measure of cloud cover (in oktas)

precip
: measure of precipitation (in inches)

day_type
: one of “weekday” or “weekend”

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -17.08 | 59.40 | -0.29 | 0.77 |
hightemp | 5.70 | 0.85 | 6.72 | 0.00 |
rt_mlr_main_fit <- linear_reg() |>
set_engine("lm") |>
fit(volume ~ hightemp + season, data = rail_trail)
tidy(rt_mlr_main_fit) |> kable(digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
The multiple linear regression model assumes \[Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)\]
For a given observation \((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\), we can rewrite the previous statement as
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma_{\epsilon}^2)\]
For a given observation \((x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)\) the residual is \[e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})\]
The estimated value of the regression standard error, \(\sigma_{\epsilon}\), is
\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n-p-1}}\]
As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE_{\hat{\beta}_j}\), the standard error of each coefficient. See Matrix Form of Linear Regression for more detail.
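In practice, \(\hat{\sigma}_\epsilon\) is reported as sigma in the model's glance output; a one-line sketch using the fit from above (assuming glance() from broom, loaded via tidymodels, works on the parsnip fit):

glance(rt_mlr_main_fit)$sigma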
season is in the model

season = Spring
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 1 - 76.84 \times 0 \\ &= -120.10 + 7.54 \times \texttt{hightemp} \end{aligned} \]
season = Summer
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 1 \\ &= -202.07 + 7.54 \times \texttt{hightemp} \end{aligned} \]
season = Fall
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 0 \\ &= -125.23 + 7.54 \times \texttt{hightemp} \end{aligned} \]
Same slope, different intercepts
season = Spring
: \(-120.10 + 7.54 \times \texttt{hightemp}\)

season = Summer
: \(-202.07 + 7.54 \times \texttt{hightemp}\)

season = Fall
: \(-125.23 + 7.54 \times \texttt{hightemp}\)

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.53 | 166.80 | -0.06 | 0.95 |
hightemp | 5.48 | 2.95 | 1.86 | 0.07 |
seasonSpring | -293.95 | 190.33 | -1.54 | 0.13 |
seasonSummer | 354.18 | 255.08 | 1.39 | 0.17 |
hightemp:seasonSpring | 4.88 | 3.26 | 1.50 | 0.14 |
hightemp:seasonSummer | -4.54 | 3.75 | -1.21 | 0.23 |
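For reference, a sketch of how the interaction model above could be fit (the object name rt_mlr_int_fit is an assumption):

rt_mlr_int_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(volume ~ hightemp * season, data = rail_trail)

tidy(rt_mlr_int_fit) |> kable(digits = 2)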
Do the data provide evidence of a significant interaction effect? Comment on the significance of the interaction terms.
The \(C\%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).
In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, holding all else constant.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 | -267.68 | 17.22 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 | 5.21 | 9.87 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 | -63.10 | 73.36 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 | -171.68 | 18.00 |
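A sketch of how a table like the one above can be produced: tidy() accepts a conf.int argument that appends the interval columns.

tidy(rt_mlr_main_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits = 2)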
hightemp
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 | -267.68 | 17.22 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 | 5.21 | 9.87 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 | -63.10 | 73.36 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 | -171.68 | 18.00 |
We are 95% confident that for every degree Fahrenheit the day is warmer, the number of riders increases by 5.21 to 9.87, on average, holding season constant.
seasonSpring
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 | -267.68 | 17.22 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 | 5.21 | 9.87 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 | -63.10 | 73.36 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 | -171.68 | 18.00 |
We are 95% confident that the number of riders on a Spring day is lower by 63.1 to higher by 73.4 compared to a Fall day, on average, holding high temperature for the day constant.
Is season a significant predictor of the number of riders, after accounting for high temperature?
Caution
If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even if \(x_j\) has a very small effect on \(y\).
Consider the practical significance of the result not just the statistical significance.
Use the confidence interval to draw conclusions instead of relying only on p-values.
Caution
If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).
When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.
There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.
Including all available predictors
Fit:
Summarize:
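A sketch of the code implied by the two labels above, consistent with the output below (the object name rt_full_fit and the use of . for "all other variables" are assumptions):

rt_full_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(volume ~ ., data = rail_trail)

tidy(rt_full_fit)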
# A tibble: 8 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 17.6 76.6 0.230 0.819
2 hightemp 7.07 2.42 2.92 0.00450
3 avgtemp -2.04 3.14 -0.648 0.519
4 seasonSpring 35.9 33.0 1.09 0.280
5 seasonSummer 24.2 52.8 0.457 0.649
6 cloudcover -7.25 3.84 -1.89 0.0627
7 precip -95.7 42.6 -2.25 0.0273
8 day_typeWeekend 35.9 22.4 1.60 0.113
Linearity: There is a linear relationship between the response and predictor variables.
Constant Variance: The variability about the least squares line is generally constant.
Normality: The distribution of the residuals is approximately normal.
Independence: The residuals are independent from each other.
Look at a plot of the residuals vs. predicted values
Look at a plot of the residuals vs. each predictor
Linearity is met if there is no discernible pattern in each of these plots
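A sketch of the first of these plots, assuming the full-model fit rt_full_fit from the earlier sketch (augment() and extract_fit_engine() are loaded with tidymodels):

# residuals and fitted values from the underlying lm object
rt_full_aug <- augment(extract_fit_engine(rt_full_fit))

ggplot(rt_full_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted volume", y = "Residual")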
The plot of the residuals vs. predicted values looked OK
The plots of residuals vs. hightemp and avgtemp appear to have a parabolic pattern.
The linearity condition does not appear to be satisfied given these plots.
Given this conclusion, what might be a next step in the analysis?
Does the constant variance condition appear to be satisfied?
The vertical spread of the residuals is not constant across the plot.
The constant variance condition is not satisfied.
We will talk about how to address this later in the notes.
The distribution of the residuals is approximately unimodal and symmetric, so the normality condition is satisfied. The sample size of 90 is also sufficiently large to relax this condition even if it were not satisfied.
We can often check the independence condition based on the context of the data and how the observations were collected.
If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected.
If there is a grouping variable lurking in the background, check the residuals based on that grouping variable.
Residuals vs. order of data collection:
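A sketch of this plot, assuming the row order of rail_trail reflects the order of collection and reusing rt_full_aug from the residual-plot sketch:

rt_full_aug |>
  mutate(obs_order = row_number()) |>
  ggplot(aes(x = obs_order, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Order of data collection", y = "Residual")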
No clear pattern in the residuals vs. order of data collection plot.
Independence condition appears to be satisfied, as far as we can evaluate it.
Multicollinearity occurs when two or more predictor variables are strongly correlated with one another.
Let’s assume the true population regression equation is \(y = 3 + 4x\)
Suppose we try estimating that equation using a model with variables \(x\) and \(z = x/10\)
\[ \begin{aligned}\hat{y}&= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2z\\ &= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2\frac{x}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x \end{aligned} \]
\[\hat{y} = \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x\]
We can set \(\hat{\beta}_1\) and \(\hat{\beta}_2\) to any two numbers such that \(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\)
Therefore, we are unable to choose the “best” combination of \(\hat{\beta}_1\) and \(\hat{\beta}_2\)
When we have perfect collinearities, we are unable to get estimates for the coefficients
When we have almost perfect collinearities (i.e. highly correlated predictor variables), the standard errors for our regression coefficients inflate
In other words, we lose precision in our estimates of the regression coefficients
This impedes our ability to use the model for inference
It is also difficult to interpret the model coefficients
Multicollinearity may occur when…
One (or more) predictor variables is an almost perfect linear combination of the others
There is a quadratic term in the model without mean-centering the variable first
There are interactions between two or more continuous variables
There is a categorical predictor with very few observations in the baseline level
Variance Inflation Factor (VIF): Measure of multicollinearity in the regression model
\[VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}\]
where \(R^2_{X_j|X_{-j}}\) is the proportion of variation in \(X_j\) that is explained by the linear combination of the other explanatory variables in the model.
Typically \(VIF > 10\) indicates concerning multicollinearity
Variables with similar values of VIF are typically the ones correlated with each other
Use the vif() function in the rms R package to calculate VIF.
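A sketch of the call that could produce the output below, assuming the full-model fit rt_full_fit from the earlier sketch (vif() is applied here to the underlying lm object):

library(rms)

vif(extract_fit_engine(rt_full_fit))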
hightemp avgtemp seasonSpring seasonSummer cloudcover
10.259978 13.086175 2.751577 5.841985 1.587485
precip day_typeWeekend
1.295352 1.125741
hightemp and avgtemp are correlated. We need to remove one of these variables and refit the model.
hightemp
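# refit the model without hightemp (all other predictors retained)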
m1 <- linear_reg() |>
set_engine("lm") |>
fit(volume ~ . - hightemp, data = rail_trail)
m1 |>
tidy() |>
kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 76.071 | 77.204 | 0.985 | 0.327 |
avgtemp | 6.003 | 1.583 | 3.792 | 0.000 |
seasonSpring | 34.555 | 34.454 | 1.003 | 0.319 |
seasonSummer | 13.531 | 55.024 | 0.246 | 0.806 |
cloudcover | -12.807 | 3.488 | -3.672 | 0.000 |
precip | -110.736 | 44.137 | -2.509 | 0.014 |
day_typeWeekend | 48.420 | 22.993 | 2.106 | 0.038 |
avgtemp
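# refit the model without avgtemp (all other predictors retained)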
m2 <- linear_reg() |>
set_engine("lm") |>
fit(volume ~ . - avgtemp, data = rail_trail)
m2 |>
tidy() |>
kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.421 | 74.992 | 0.112 | 0.911 |
hightemp | 5.696 | 1.164 | 4.895 | 0.000 |
seasonSpring | 31.239 | 32.082 | 0.974 | 0.333 |
seasonSummer | 9.424 | 47.504 | 0.198 | 0.843 |
cloudcover | -8.353 | 3.435 | -2.431 | 0.017 |
precip | -98.904 | 42.137 | -2.347 | 0.021 |
day_typeWeekend | 37.062 | 22.280 | 1.663 | 0.100 |
Model without hightemp:
adj.r.squared | AIC | BIC |
---|---|---|
0.42 | 1087.5 | 1107.5 |
Model without avgtemp:
adj.r.squared | AIC | BIC |
---|---|---|
0.47 | 1079.05 | 1099.05 |
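A sketch of how these summaries could be computed, assuming glance() works on the parsnip fits m1 and m2:

glance(m1) |> select(adj.r.squared, AIC, BIC) |> kable(digits = 2)
glance(m2) |> select(adj.r.squared, AIC, BIC) |> kable(digits = 2)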
Based on Adjusted \(R^2\), AIC, and BIC, the model without avgtemp is a better fit. Therefore, we choose to remove avgtemp from the model and leave hightemp in the model to deal with the multicollinearity.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.421 | 74.992 | 0.112 | 0.911 |
hightemp | 5.696 | 1.164 | 4.895 | 0.000 |
seasonSpring | 31.239 | 32.082 | 0.974 | 0.333 |
seasonSummer | 9.424 | 47.504 | 0.198 | 0.843 |
cloudcover | -8.353 | 3.435 | -2.431 | 0.017 |
precip | -98.904 | 42.137 | -2.347 | 0.021 |
day_typeWeekend | 37.062 | 22.280 | 1.663 | 0.100 |
Log transformation on the response variable
Log transformation on the predictor variable
The constant variance condition is not satisfied. We can transform the response variable to address the violation in this condition.
\[ \log(Y) = \beta_0+ \beta_1 X_1 + \dots +\beta_pX_p + \epsilon, \hspace{10mm} \epsilon \sim N(0,\sigma^2_\epsilon) \]
\[\widehat{\log(Y)} = \hat{\beta}_0+ \hat{\beta}_1 X_1 + \dots + \hat{\beta}_pX_p\]
We want to interpret the model in terms of the original variable \(Y\), not \(\log(Y)\), so we need to write the regression equation in terms of \(Y\)
\[\begin{align}\hat{Y} &= \exp\{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_pX_p\}\\ &= \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1X_1\}\dots\exp\{\hat{\beta}_pX_p\}\end{align}\]
Note
The predicted value \(\hat{Y}\) is the predicted median of \(Y\). Note, when the distribution of \(Y|X_1, \ldots, X_p\) is symmetric, then the median equals the mean. See the slides in the appendix for more detail.
\[\begin{align}\hat{Y} &= \exp\{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_pX_p\}\\ &= \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1X_1\}\dots\exp\{\hat{\beta}_pX_p\}\end{align}\]
Intercept: When \(X_1 = \dots = X_p =0\), \(Y\) is expected to be \(\exp\{\hat{\beta}_0\}\)
Slope: For every one unit increase in \(X_j\), \(Y\) is expected to multiply by a factor of \(\exp\{\hat{\beta}_j\}\), holding all else constant
Why is the interpretation in terms of a multiplicative change?
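One way to see it: take the ratio of predicted values at \(X_j = x + 1\) and \(X_j = x\), holding all other predictors fixed. Every factor cancels except the one involving \(X_j\):

\[\frac{\hat{Y}_{X_j = x+1}}{\hat{Y}_{X_j = x}} = \frac{\exp\{\hat{\beta}_j(x+1)\}}{\exp\{\hat{\beta}_j x\}} = \exp\{\hat{\beta}_j\}\]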
# fit model with log(volume) as the response
log_rt_fit <- linear_reg() |>
set_engine("lm") |>
fit(log(volume) ~ hightemp + season + cloudcover + precip + day_type, data = rail_trail)
tidy(log_rt_fit) |>
kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.738 | 0.219 | 21.667 | 0.000 |
hightemp | 0.018 | 0.003 | 5.452 | 0.000 |
seasonSpring | 0.026 | 0.094 | 0.283 | 0.778 |
seasonSummer | -0.047 | 0.139 | -0.338 | 0.736 |
cloudcover | -0.025 | 0.010 | -2.452 | 0.016 |
precip | -0.294 | 0.123 | -2.397 | 0.019 |
day_typeWeekend | 0.064 | 0.065 | 0.987 | 0.327 |
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.738 | 0.219 | 21.667 | 0.000 |
hightemp | 0.018 | 0.003 | 5.452 | 0.000 |
seasonSpring | 0.026 | 0.094 | 0.283 | 0.778 |
seasonSummer | -0.047 | 0.139 | -0.338 | 0.736 |
cloudcover | -0.025 | 0.010 | -2.452 | 0.016 |
precip | -0.294 | 0.123 | -2.397 | 0.019 |
day_typeWeekend | 0.064 | 0.065 | 0.987 | 0.327 |
Interpret the intercept in terms of (1) log(volume) and (2) volume.

Interpret the coefficient of hightemp in terms of (1) log(volume) and (2) volume.
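A sketch for getting the multiplicative (volume-scale) quantities: exponentiate the estimates (the added column name is illustrative):

tidy(log_rt_fit) |>
  mutate(estimate_exp = exp(estimate)) |>
  kable(digits = 3)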
Try a transformation on \(X\) if the scatterplot shows some curvature but the variance is constant for all values of \(X\)
A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a “high” rate, we first want to understand the relationship between a child’s age and their respiratory rate.
The data contain the respiratory rate for 618 children ages 15 days to 3 years. They were obtained from the Sleuth3 R package and are originally from a 1994 publication, “Reference Values for Respiratory Rate in the First 3 Years of Life”.
Variables:

Age
: age in months

Rate
: respiratory rate (breaths per minute)

Suppose we have the following regression equation:
\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \log(X)\]
Intercept: When \(X = 1\) \((\log(X) = 0)\), \(Y\) is expected to be \(\hat{\beta}_0\) (i.e. the mean of \(Y\) is \(\hat{\beta}_0\))
Slope: When \(X\) is multiplied by a factor of \(\mathbf{C}\), the mean of \(Y\) is expected to increase by \(\boldsymbol{\hat{\beta}_1}\mathbf{\log(C)}\) units
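The respiratory-rate model summarized below could be fit with a call like this sketch (the data frame name respiratory is an assumption):

resp_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Rate ~ log(Age), data = respiratory)

tidy(resp_fit) |> kable(digits = 3)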
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 50.135 | 0.632 | 79.330 | 0 |
log(Age) | -5.982 | 0.263 | -22.781 | 0 |
\[\hat{\text{Rate}} = 50.135 - 5.982 \times \log\text{(Age)}\]
Interpret the intercept in the context of the data.
Interpret the slope in terms of age multiplying by 2 in the context of the data.
See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.
Suppose we have a set of values \(x_1, x_2, \ldots, x_n\).
Note: \(\overline{\log(x)} \neq \log(\bar{x})\)
Note: \(\text{Median}(\log(x)) = \log(\text{Median}(x))\)
\[\overline{\log(x)} \neq \log(\bar{x})\]
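A quick numeric check of both facts:

x <- c(1, 5, 10, 50, 100)

mean(log(x))    # 2.486
log(mean(x))    # log(33.2) = 3.503 -- not equal

median(log(x))  # 2.303
log(median(x))  # log(10) = 2.303 -- equal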
Recall that \(\beta_0 + \beta_1 X\) is the mean value of the response at the given value of the predictor \(X\). This doesn't hold when we log-transform the response variable.
Mathematically, the mean of the logged values is not necessarily equal to the log of the mean value. Therefore at a given value of \(X\)
\[ \begin{aligned}\exp\{\text{Mean}(\log(Y|X))\} \neq \text{Mean}(Y|X) \\[5pt] \Rightarrow \exp\{\beta_0 + \beta_1 X\} \neq \text{Mean}(Y|X) \end{aligned} \]
\[\exp\{\text{Median}(\log(Y|X))\} = \text{Median}(Y|X)\]