Logistic regression

Introduction

Prof. Maria Tackett

Oct 30, 2023

Announcements

  • HW 03 due Wed, Nov 1 at 11:59pm

  • See Ed Discussion for

    • Code to reformat Theme variable in HW 03 data set
    • Explanation on interpreting models with quadratic terms

Spring 2024 statistics courses

  • STA 211: Mathematics of Regression
    • Prereqs: MATH 216/218/221, STA 210
  • STA 230 or STA 240: Probability
    • Prereqs: MATH 22/112/122/202/212/219/222
  • STA 310: Generalized Linear Models
    • Prereqs: STA 210 and STA 230/240
  • STA 313: Advanced Data Visualization
    • Prereqs: STA 198 or STA 199 or STA 210
  • STA 323: Statistical Computing
    • Prereqs: STA 210 and STA 230/240
  • STA 360: Bayesian Inference and Modern Statistical Methods
    • Prereqs: STA 210 and STA 230/240 and MATH 202/212/219/222 and CS 101/102/201 and MATH 216/218/221, 211 (co-req)

Interpreting models with log-transformed variables

term             estimate  std.error  statistic  p.value
(Intercept)         4.738      0.219     21.667    0.000
hightemp            0.018      0.003      5.452    0.000
seasonSpring        0.026      0.094      0.283    0.778
seasonSummer       -0.047      0.139     -0.338    0.736
cloudcover         -0.025      0.010     -2.452    0.016
precip             -0.294      0.123     -2.397    0.019
day_typeWeekend     0.064      0.065      0.987    0.327
  • Interpret the intercept in terms of (1) log(volume) and (2) volume.

  • Interpret the coefficient of hightemp in terms of (1) log(volume) and (2) volume.
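A minimal R sketch of the back-transformation from the log scale, assuming volume is the response modeled on the log scale (as the questions indicate); this is only a hint at the arithmetic, not a full written interpretation:

# intercept: predicted log(volume) when hightemp, cloudcover, and precip are 0
# and season and day_type are at their baseline levels
exp(4.738)  # ~ 114.2 -- predicted volume at that baseline
# hightemp: each additional degree multiplies predicted volume by exp(0.018)
exp(0.018)  # ~ 1.018 -- about a 1.8% increase in volume per degree, holding other predictors constant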

Logistic regression

Topics

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

  • Use logistic regression model to calculate predicted odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(Stat2Data) # contains the YouthRisk2009 data set

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Predicting categorical outcomes

Types of outcome variables

Quantitative outcome variable:

  • Sales price of a house in Duke Forest
  • Model: Expected sales price given the number of bedrooms, lot size, etc.

Categorical outcome variable:

  • Indicator of being at high risk of getting coronary heart disease in the next 10 years
  • Model: Probability an adult is at high risk of heart disease in the next 10 years given their age, total cholesterol, etc.

Models for categorical outcomes

Logistic regression

  • 2 outcomes, e.g., 1: Yes, 0: No

Multinomial logistic regression

  • 3+ outcomes, e.g., 1: Democrat, 2: Republican, 3: Independent

2022 election forecasts

Source: FiveThirtyEight 2022 Election Forecasts

2020 NBA finals predictions

Source: FiveThirtyEight 2019-20 NBA Predictions

Do teenagers get 7+ hours of sleep?

Students in grades 9-12 were surveyed about health risk behaviors, including whether they usually get 7 or more hours of sleep.

Sleep7:

  • 1: yes
  • 0: no

data(YouthRisk2009) #from Stat2Data package
sleep <- YouthRisk2009 |>
  as_tibble() |>
  filter(!is.na(Age), !is.na(Sleep7))
sleep |>
  relocate(Age, Sleep7)
# A tibble: 446 × 6
     Age Sleep7 Sleep           SmokeLife SmokeDaily MarijuaEver
   <int>  <int> <fct>           <fct>     <fct>            <int>
 1    16      1 8 hours         Yes       Yes                  1
 2    17      0 5 hours         Yes       Yes                  1
 3    18      0 5 hours         Yes       Yes                  1
 4    17      1 7 hours         Yes       No                   1
 5    15      0 4 or less hours No        No                   0
 6    17      0 6 hours         No        No                   0
 7    17      1 7 hours         No        No                   0
 8    16      1 8 hours         Yes       No                   0
 9    16      1 8 hours         No        No                   0
10    18      0 4 or less hours Yes       Yes                  1
# ℹ 436 more rows

Plot the data

ggplot(sleep, aes(x = Age, y = Sleep7)) +
  geom_point() + 
  labs(y = "Getting 7+ hours of sleep")

Let’s fit a linear regression model

Outcome: Y = 1: yes, 0: no

Let’s use proportions

Outcome: Probability of getting 7+ hours of sleep
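The proportions plotted here (and in "The code" slide below) come from a data frame called sleep_age that isn't created in this excerpt. A minimal sketch of how it could be built (the name and construction are assumptions): group by Age and compute the observed proportion reporting 7+ hours of sleep.

# assumed construction of sleep_age: observed proportion of Sleep7 = 1 at each age
sleep_age <- sleep |>
  group_by(Age) |>
  summarise(prop = mean(Sleep7))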

What happens if we zoom out?

Outcome: Probability of getting 7+ hours of sleep

🛑 This model produces predictions outside the 0 to 1 range.

Let’s try another model

✅ This model (called a logistic regression model) only produces predictions between 0 and 1.

The code

ggplot(sleep_age, aes(x = Age, y = prop)) +
  geom_point() +
  geom_hline(yintercept = c(0, 1), lty = 2) +
  stat_smooth(method = "glm", method.args = list(family = "binomial"),
              fullrange = TRUE, se = FALSE) +
  labs(y = "P(7+ hours of sleep)") +
  xlim(1, 40) +
  ylim(-0.5, 1.5)

Different types of models

Method                           Outcome       Model
Linear regression                Quantitative  $Y = \beta_0 + \beta_1 X$
Linear regression (transform Y)  Quantitative  $\log(Y) = \beta_0 + \beta_1 X$
Logistic regression              Binary        $\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$
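For reference, each row of the table corresponds to a standard tidymodels fitting call; a sketch with placeholder data frame d, response y, and predictor x:

linear_reg() |> fit(y ~ x, data = d)                         # quantitative Y
linear_reg() |> fit(log(y) ~ x, data = d)                    # log-transformed Y
logistic_reg() |> set_engine("glm") |> fit(y ~ x, data = d)  # binary Y (y must be a factor)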

Linear vs. logistic regression

State whether a linear regression model or logistic regression model is more appropriate for each scenario.

  1. Use age and education to predict if a randomly selected person will vote in the next election.

  2. Use budget and run time (in minutes) to predict a movie’s total revenue.

  3. Use age and sex to calculate the probability a randomly selected adult will visit Duke Health in the next year.

Submit your responses on Ed Discussion.

Odds and probabilities

Binary response variable

  • $Y = 1$: yes, $Y = 0$: no
  • $\pi$: probability that $Y = 1$, i.e., $P(Y = 1)$
  • $\frac{\pi}{1-\pi}$: odds that $Y = 1$
  • $\log\left(\frac{\pi}{1-\pi}\right)$: log odds
  • Go from $\pi$ to $\log\left(\frac{\pi}{1-\pi}\right)$ using the logit transformation

Odds

Suppose there is a 70% chance it will rain tomorrow

  • Probability it will rain is p=0.7
  • Probability it won’t rain is 1−p=0.3
  • Odds it will rain are 7 to 3, i.e., 7:3, or $\frac{0.7}{0.3} \approx 2.33$
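The same conversion in R, going from the probability to the odds and back:

p <- 0.7
odds <- p / (1 - p)  # 7/3, approximately 2.33
odds / (1 + odds)    # back to the probability, 0.7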

Are teenagers getting enough sleep?

sleep |>
  count(Sleep7) |>
  mutate(p = round(n / sum(n), 3))
# A tibble: 2 × 3
  Sleep7     n     p
   <int> <int> <dbl>
1      0   150 0.336
2      1   296 0.664

$P(\text{7+ hours of sleep}) = P(Y = 1) = p = 0.664$

$P(\text{< 7 hours of sleep}) = P(Y = 0) = 1 - p = 0.336$

$\text{odds of 7+ hours of sleep} = \frac{0.664}{0.336} \approx 1.976$

From odds to probabilities

odds

$\omega = \frac{\pi}{1 - \pi}$

probability

$\pi = \frac{\omega}{1 + \omega}$

Logistic regression

From odds to probabilities

  1. Logistic model: log odds $= \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$
  2. Odds $= \exp\left\{\log\left(\frac{\pi}{1-\pi}\right)\right\} = \frac{\pi}{1-\pi}$
  3. Combining (1) and (2) with what we saw earlier:

$\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1 X\}}{1 + \exp\{\beta_0 + \beta_1 X\}}$

Logistic regression model

Logit form: $\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$

Probability form:

$\pi = \frac{\exp\{\beta_0 + \beta_1 X\}}{1 + \exp\{\beta_0 + \beta_1 X\}}$
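In R, the probability form is the inverse logit, available as plogis(); a quick check with placeholder coefficient values:

beta0 <- -2; beta1 <- 0.5; x <- 1    # placeholder values for illustration
log_odds <- beta0 + beta1 * x
exp(log_odds) / (1 + exp(log_odds))  # probability form
plogis(log_odds)                     # same value via R's built-in inverse logit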

Risk of coronary heart disease

This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use age to predict if a randomly selected adult is at high risk of having coronary heart disease in the next 10 years.


high_risk:

  • 1: High risk of having heart disease in next 10 years
  • 0: Not high risk of having heart disease in next 10 years

age: Age at exam time (in years)

Data: heart_disease

# A tibble: 4,240 × 2
     age high_risk
   <dbl> <fct>    
 1    39 0        
 2    46 0        
 3    48 0        
 4    61 1        
 5    46 0        
 6    43 0        
 7    63 1        
 8    45 0        
 9    52 0        
10    43 0        
# ℹ 4,230 more rows

High risk vs. age

ggplot(heart_disease, aes(x = high_risk, y = age)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "High risk - 1: yes, 0: no",
       y = "Age", 
       title = "Age vs. High risk of heart disease")

Let’s fit the model

heart_disease_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(high_risk ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.561      0.284    -19.599        0
age             0.075      0.005     14.178        0

The model

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.561      0.284    -19.599        0
age             0.075      0.005     14.178        0


$\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -5.561 + 0.075 \times \text{age}$, where $\hat{\pi}$ is the predicted probability of being at high risk of having heart disease in the next 10 years
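For example, observation 1 in the data is a 39-year-old, so plugging age = 39 into the fitted equation gives the predicted log odds (using the rounded coefficients from the table above):

-5.561 + 0.075 * 39  # ~ -2.64; augment() below reports -2.65 because it uses unrounded coefficients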

Predicted log odds

augment(heart_disease_fit$fit) |> select(.fitted, .resid)
# A tibble: 4,240 × 2
   .fitted .resid
     <dbl>  <dbl>
 1  -2.65  -0.370
 2  -2.13  -0.475
 3  -1.98  -0.509
 4  -1.01   1.62 
 5  -2.13  -0.475
 6  -2.35  -0.427
 7  -0.858  1.56 
 8  -2.20  -0.458
 9  -1.68  -0.585
10  -2.35  -0.427
# ℹ 4,230 more rows

For observation 1

$\text{predicted odds} = \hat{\omega} = \frac{\hat{\pi}}{1-\hat{\pi}} = \exp\{-2.650\} = 0.071$
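A quick check of this arithmetic in R, using the predicted log odds from augment() above:

exp(-2.650)  # ~ 0.071, the predicted odds for observation 1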

Predicted probabilities

predict(heart_disease_fit, new_data = heart_disease, type = "prob")
# A tibble: 4,240 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1   0.934  0.0660
 2   0.894  0.106 
 3   0.878  0.122 
 4   0.733  0.267 
 5   0.894  0.106 
 6   0.913  0.0870
 7   0.702  0.298 
 8   0.900  0.0996
 9   0.843  0.157 
10   0.913  0.0870
# ℹ 4,230 more rows

For observation 1

$\text{predicted probability} = \hat{\pi} = \frac{\exp\{-2.650\}}{1 + \exp\{-2.650\}} = 0.066$
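The same arithmetic in R; plogis() gives the identical result:

exp(-2.650) / (1 + exp(-2.650))  # ~ 0.066, matching .pred_1 for observation 1
plogis(-2.650)                   # same value via the built-in inverse logit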

Predicted classes

predict(heart_disease_fit, new_data = heart_disease, type = "class")
# A tibble: 4,240 × 1
   .pred_class
   <fct>      
 1 0          
 2 0          
 3 0          
 4 0          
 5 0          
 6 0          
 7 0          
 8 0          
 9 0          
10 0          
# ℹ 4,230 more rows

Default prediction

For a logistic regression, the default prediction is the class.

predict(heart_disease_fit, new_data = heart_disease)
# A tibble: 4,240 × 1
   .pred_class
   <fct>      
 1 0          
 2 0          
 3 0          
 4 0          
 5 0          
 6 0          
 7 0          
 8 0          
 9 0          
10 0          
# ℹ 4,230 more rows

Observed vs. predicted

What does the following table show?

predict(heart_disease_fit, new_data = heart_disease) |>
  bind_cols(heart_disease) |>
  count(high_risk, .pred_class)
# A tibble: 2 × 3
  high_risk .pred_class     n
  <fct>     <fct>       <int>
1 0         0            3596
2 1         0             644

The .pred_class is the class with the highest predicted probability. What is a limitation to using this method to determine the predicted class?
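With a 0.5 cutoff, no observation here is classified as high risk. A hedged sketch of classifying with a different cutoff instead, using the predicted probabilities (the 0.25 threshold is an arbitrary illustration, not a recommendation):

predict(heart_disease_fit, new_data = heart_disease, type = "prob") |>
  bind_cols(heart_disease) |>
  mutate(pred_high_risk = if_else(.pred_1 >= 0.25, "1", "0")) |>
  count(high_risk, pred_high_risk)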

Application exercise

📋 AE 12: Logistic Regression Intro

Recap

  • Introduced logistic regression for binary response variable

  • Described relationship between odds and probabilities

  • Used logistic regression model to calculate predicted odds and probabilities

🔗 STA 210 - Fall 2023 - Schedule
