Nov 08, 2023
Project draft due in your GitHub repo at 9am on
Team Feedback #1 due Friday, November 10 at 11:5pm
HW 04 due Wednesday, November 15 at 11:59pm
Comparing logistic regression models using
Drop-in-deviance test
AIC
BIC
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.
high_risk
:
age
: Age at exam time (in years)
education
: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College
currentSmoker
: 0 = nonsmoker, 1 = smoker
heart_disease <- read_csv(here::here("slides", "data/framingham.csv")) |>
select(age, education, TenYearCHD, totChol, currentSmoker) |>
drop_na() |> #consider the limitations of doing this
mutate(
high_risk = as.factor(TenYearCHD),
education = as.factor(education),
currentSmoker = as.factor(currentSmoker)
)
heart_disease
# A tibble: 4,086 × 6
age education TenYearCHD totChol currentSmoker high_risk
<dbl> <fct> <dbl> <dbl> <fct> <fct>
1 39 4 0 195 0 0
2 46 2 0 250 0 0
3 48 1 0 245 1 0
4 61 3 1 225 1 1
5 46 3 0 285 1 0
6 43 2 0 228 0 0
7 63 1 1 205 0 1
8 45 2 0 313 1 0
9 52 1 0 260 0 0
10 43 1 0 225 1 0
# ℹ 4,076 more rows
Using age
and education
:
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -5.508 | 0.311 | -17.692 | 0.000 | -6.125 | -4.904 |
age | 0.076 | 0.006 | 13.648 | 0.000 | 0.065 | 0.087 |
education2 | -0.245 | 0.113 | -2.172 | 0.030 | -0.469 | -0.026 |
education3 | -0.236 | 0.135 | -1.753 | 0.080 | -0.504 | 0.024 |
education4 | -0.024 | 0.150 | -0.161 | 0.872 | -0.323 | 0.264 |
\[ \small{\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.508 + 0.076 ~ \text{age} - 0.245 ~ \text{ed2} - 0.236 ~ \text{ed3} - 0.024 ~ \text{ed4}} \]
Should we add currentSmoker
to this model?
\[ \log L = \sum\limits_{i=1}^n[y_i \log(\hat{\pi}_i) + (1 - y_i)\log(1 - \hat{\pi}_i)] \]
Measure of how well the model fits the data
Higher values of \(\log L\) are better
Deviance = \(-2 \log L\)
Suppose there are two models:
We want to test the hypotheses
\[ \begin{aligned} H_0&: \beta_{q+1} = \dots = \beta_p = 0 \\ H_A&: \text{ at least one }\beta_j \text{ is not } 0 \end{aligned} \]
To do so, we will use the Drop-in-deviance test, also known as the Nested Likelihood Ratio test
Hypotheses:
\[ \begin{aligned} H_0&: \beta_{q+1} = \dots = \beta_p = 0 \\ H_A&: \text{ at least 1 }\beta_j \text{ is not } 0 \end{aligned} \]
Test Statistic: \[G = (-2 \log L_{reduced}) - (-2 \log L_{full})\]
P-value: \(P(\chi^2 > G)\), calculated using a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters in the full and reduced models
currentSmoker
to the model?First model, reduced:
currentSmoker
to the model?Calculate deviance for each model:
[1] 3244.187
[1] 3221.901
currentSmoker
to the model?Calculate the p-value using a pchisq()
, with degrees of freedom equal to the number of new model terms in the second model:
Conclusion: The p-value is very small, so we reject \(H_0\). The data provide sufficient evidence that the coefficient of currentSmoker
is not equal to 0. Therefore, we should add it to the model.
We can use the anova
function to conduct this test
Add test = "Chisq"
to conduct the drop-in-deviance test
Use AIC or BIC for model selection
\[ \begin{align} &AIC = - 2 * \log L - {\color{purple}n\log(n)}+ 2(p +1)\\[5pt] &BIC =- 2 * \log L - {\color{purple}n\log(n)} + log(n)\times(p+1) \end{align} \]
glance()
functionLet’s look at the AIC for the model that includes age
, education
, and currentSmoker
Let’s compare the full and reduced models using AIC.
Based on AIC, which model would you choose?
Let’s compare the full and reduced models using BIC
Based on BIC, which model would you choose?