Multinomial Logistic Regression

Prof. Maria Tackett

Nov 13, 2023

Announcements

  • Due dates

    • Draft report due in GitHub repo on 9am on your lab day
    • HW 04 due TODAY at 11:59pm
    • Statistics experience due Mon, Nov 20 at 11:59pm
  • Next week:

    • (Optional) Project meetings Nov 20 & 21. Click here to sign up. Must sign up by Fri, Nov 17
    • No class meetings next week
  • Click here to access lecture recordings. Available until Mon, Dec 04 at 9am

🍂 Have a good Thanksgiving break! 🍂

Topics

  • Conditions for logistic regression AE

  • Multinomial logistic regression

  • Interpret model coefficients

  • Inference for a coefficient βjk

Application exercise

📋 AE 16: Conditions for logistic regression

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(NHANES) #data set
library(knitr)
library(patchwork)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))

Generalized Linear Models

Generalized Linear Models (GLMs)

  • In practice, there are many different types of outcome variables:

    • Binary: Win or Lose
    • Nominal: Democrat, Republican or Third Party candidate
    • Ordered: Movie rating (1 - 5 stars)
    • and others…
  • Predicting each of these outcomes requires a generalized linear model, a broader class of models that generalize the multiple linear regression model

Note

Recommended reading for more details about GLMs: Generalized Linear Models: A Unifying Theory.

Binary outcome (Logistic)

  • Given P(yi=1|xi)=π^i and P(yi=0|xi)=1−π^i

    log⁡(π^i1−π^i)=β^0+β^1xi

  • We can calculate π^i by solving the logit equation:

    π^i=exp⁡{β^0+β^1xi}1+exp⁡{β^0+β^1xi}

Binary outcome (Logistic)

  • Suppose we consider y=0 the baseline category such that

    P(yi=1|xi)=π^i1 and P(yi=0|xi)=π^i0

  • Then the logistic regression model is

    log⁡(π^i11−π^i1)=log⁡(π^i1π^i0)=β^0+β^1xi

  • Slope, β^1: When x increases by one unit, the odds of y=1 versus the baseline y=0 are expected to multiply by a factor of exp⁡{β^1}

  • Intercept, β^0: When x=0, the predicted odds of y=1 versus the baseline y=0 are exp⁡{β^0}

Multinomial outcome variable

  • Suppose the outcome variable y is categorical and can take values 1,2,…,K such that (K>2)

  • Multinomial Distribution:

    P(y=1)=π1,P(y=2)=π2,…,P(y=K)=πK

    such that ∑k=1Kπk=1

Multinomial Logistic Regression

  • If we have an explanatory variable x, then we want to fit a model such that P(y=k)=πk is a function of x

  • Choose a baseline category. Let’s choose y=1. Then,

    log⁡(πikπi1)=β0k+β1kxi

  • In the multinomial logistic model, we have a separate equation for each category of the outcome relative to the baseline category

    • If the outcome has K possible categories, there will be K−1 equations as part of the multinomial logistic model

Multinomial Logistic Regression

  • Suppose we have a outcome variable y that can take three possible outcomes that are coded as “A”, “B”, “C”

  • Let “A” be the baseline category. Then

    log⁡(πiBπiA)=β0B+β1Bxilog⁡(πiCπiA)=β0C+β1Cxi

Data

NHANES Data

  • National Health and Nutrition Examination Survey is conducted by the National Center for Health Statistics (NCHS)

  • The goal is to “assess the health and nutritional status of adults and children in the United States”

  • This survey includes an interview and a physical examination

NHANES Data

  • We will use the data from the NHANES R package

  • Contains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years

  • The data in this package is modified for educational purposes and should not be used for research

  • Original data can be obtained from the NCHS website for research purposes

  • Type ?NHANES in console to see list of variables and definitions

Variables

Goal: Use a person’s age and whether they do regular physical activity to predict their self-reported health rating.

  • Outcome:

    • HealthGen: Self-reported rating of participant’s health in general. Excellent, Vgood, Good, Fair, or Poor.
  • Predictors:

    • Age: Age at time of screening (in years). Participants 80 or older were recorded as 80.
    • PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities.

The data

nhanes_adult <- NHANES |>
  filter(Age >= 18) |>
  select(HealthGen, Age, PhysActive) |>
  filter(!(is.na(HealthGen))) |>
  mutate(obs_num = 1:n())
glimpse(nhanes_adult)
Rows: 6,710
Columns: 4
$ HealthGen  <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood, …
$ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56,…
$ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, …
$ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

Exploratory data analysis

Exploratory data analysis

Fitting a multinomial logistic regression model

Model in R

Use the multinom_reg() function with the "nnet" engine:

health_fit <- multinom_reg() |>
  set_engine("nnet") |>
  fit(HealthGen ~ Age + PhysActive, data = nhanes_adult)

Model result

health_fit
parsnip model object

Call:
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = data, 
    trace = FALSE)

Coefficients:
      (Intercept)           Age PhysActiveYes
Vgood   1.2053460  0.0009101848    -0.3209047
Good    1.9476261 -0.0023686122    -1.0014925
Fair    0.9145492  0.0030462534    -1.6454297
Poor   -1.5211414  0.0221905681    -2.6556343

Residual Deviance: 17588.88 
AIC: 17612.88 

Model output

What function do we use to get the model summary, i.e., coefficient estimates.

tidy(health_fit)
# A tibble: 12 × 6
   y.level term           estimate std.error statistic  p.value
   <chr>   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
 1 Vgood   (Intercept)    1.21       0.145       8.33  8.42e-17
 2 Vgood   Age            0.000910   0.00246     0.369 7.12e- 1
 3 Vgood   PhysActiveYes -0.321      0.0929     -3.45  5.51e- 4
 4 Good    (Intercept)    1.95       0.141      13.8   1.39e-43
 5 Good    Age           -0.00237    0.00242    -0.977 3.29e- 1
 6 Good    PhysActiveYes -1.00       0.0901    -11.1   1.00e-28
 7 Fair    (Intercept)    0.915      0.164       5.57  2.61e- 8
 8 Fair    Age            0.00305    0.00288     1.06  2.90e- 1
 9 Fair    PhysActiveYes -1.65       0.107     -15.3   5.69e-53
10 Poor    (Intercept)   -1.52       0.290      -5.24  1.62e- 7
11 Poor    Age            0.0222     0.00491     4.52  6.11e- 6
12 Poor    PhysActiveYes -2.66       0.236     -11.3   1.75e-29

Model output, with CI

tidy(health_fit, conf.int = TRUE)
# A tibble: 12 × 8
   y.level term         estimate std.error statistic  p.value conf.low conf.high
   <chr>   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
 1 Vgood   (Intercept)   1.21e+0   0.145       8.33  8.42e-17  0.922     1.49   
 2 Vgood   Age           9.10e-4   0.00246     0.369 7.12e- 1 -0.00392   0.00574
 3 Vgood   PhysActiveY… -3.21e-1   0.0929     -3.45  5.51e- 4 -0.503    -0.139  
 4 Good    (Intercept)   1.95e+0   0.141      13.8   1.39e-43  1.67      2.22   
 5 Good    Age          -2.37e-3   0.00242    -0.977 3.29e- 1 -0.00712   0.00238
 6 Good    PhysActiveY… -1.00e+0   0.0901    -11.1   1.00e-28 -1.18     -0.825  
 7 Fair    (Intercept)   9.15e-1   0.164       5.57  2.61e- 8  0.592     1.24   
 8 Fair    Age           3.05e-3   0.00288     1.06  2.90e- 1 -0.00260   0.00869
 9 Fair    PhysActiveY… -1.65e+0   0.107     -15.3   5.69e-53 -1.86     -1.43   
10 Poor    (Intercept)  -1.52e+0   0.290      -5.24  1.62e- 7 -2.09     -0.952  
11 Poor    Age           2.22e-2   0.00491     4.52  6.11e- 6  0.0126    0.0318 
12 Poor    PhysActiveY… -2.66e+0   0.236     -11.3   1.75e-29 -3.12     -2.19   

Model output, with CI

y.level term estimate std.error statistic p.value conf.low conf.high
Vgood (Intercept) 1.205 0.145 8.325 0.000 0.922 1.489
Vgood Age 0.001 0.002 0.369 0.712 -0.004 0.006
Vgood PhysActiveYes -0.321 0.093 -3.454 0.001 -0.503 -0.139
Good (Intercept) 1.948 0.141 13.844 0.000 1.672 2.223
Good Age -0.002 0.002 -0.977 0.329 -0.007 0.002
Good PhysActiveYes -1.001 0.090 -11.120 0.000 -1.178 -0.825
Fair (Intercept) 0.915 0.164 5.566 0.000 0.592 1.237
Fair Age 0.003 0.003 1.058 0.290 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.000 -1.856 -1.435
Poor (Intercept) -1.521 0.290 -5.238 0.000 -2.090 -0.952
Poor Age 0.022 0.005 4.522 0.000 0.013 0.032
Poor PhysActiveYes -2.656 0.236 -11.275 0.000 -3.117 -2.194

Fair vs. Excellent Health

The baseline category for the model is Excellent.

The model equation for the log-odds a person rates themselves as having “Fair” health vs. “Excellent” is

log⁡(π^Fairπ^Excellent)=0.915+0.003 age−1.645 PhysActive

Interpretations

log⁡(π^Fairπ^Excellent)=0.915+0.003 age−1.645 PhysActive

For each additional year in age, the odds a person rates themselves as having fair health versus excellent health are expected to multiply by 1.003 (exp(0.003)), holding physical activity constant.


The odds a person who does physical activity will rate themselves as having fair health versus excellent health are expected to be 0.193 (exp(-1.645)) times the odds for a person who doesn’t do physical activity, holding age constant.

Interpretations

log⁡(π^Fairπ^Excellent)=0.915+0.003 age−1.645 PhysActive

The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)).

Warning

Need to mean-center age for the intercept to have a meaningful interpretation!

Hypothesis test for βjk

The test of significance for the coefficient βjk is

Hypotheses: H0:βjk=0 vs Ha:βjk≠0

Test Statistic: z=β^jk−0SE(βjk^)

P-value: P(|Z|>|z|),

where Z∼N(0,1), the Standard Normal distribution

Confidence interval for βjk

  • We can calculate the C% confidence interval for βjk using β^jk±z∗SE(β^jk), where z∗ is calculated from the N(0,1) distribution.

  • Interpretation: We are C% confident that for every one unit change in xj, the odds of y=k versus the baseline will multiply by a factor of exp⁡{β^jk−z∗SE(β^jk)} to exp⁡{β^jk+z∗SE(β^jk)}, holding all else constant.

Interpreting CIs for βjk

tidy(health_fit, conf.int = TRUE) |>
  filter(y.level == "Fair") |>
  kable(digits = 3)
y.level term estimate std.error statistic p.value conf.low conf.high
Fair (Intercept) 0.915 0.164 5.566 0.00 0.592 1.237
Fair Age 0.003 0.003 1.058 0.29 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.00 -1.856 -1.435


We are 95% confident, that for each additional year in age, the odds a person rates themselves as having fair health versus excellent health will multiply by 0.997 (exp(-0.003)) to 1.009 (exp(0.009)) , holding physical activity constant.

Interpreting CIs for βjk

tidy(health_fit, conf.int = TRUE) |>
  filter(y.level == "Fair") |>
  kable(digits = 3)
y.level term estimate std.error statistic p.value conf.low conf.high
Fair (Intercept) 0.915 0.164 5.566 0.00 0.592 1.237
Fair Age 0.003 0.003 1.058 0.29 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.00 -1.856 -1.435


We are 95% confident that the odds a person who does physical activity will rate themselves as having fair health versus excellent health are 0.156 (exp(-1.856 )) to 0.238 (exp(-1.435)) times the odds for a person who doesn’t do physical activity, holding age constant.

Recap

  • Introduce multinomial logistic regression

  • Interpret model coefficients

  • Inference for a coefficient βjk

🔗 STA 210 - Fall 2023 - Schedule

Multinomial Logistic Regression Prof. Maria Tackett Nov 13, 2023

  1. Slides

  2. Tools

  3. Close
  • Multinomial Logistic Regression
  • Announcements
  • Topics
  • Application exercise
  • Computational setup
  • Generalized Linear Models
  • Generalized Linear Models (GLMs)
  • Binary outcome (Logistic)
  • Binary outcome (Logistic)
  • Multinomial outcome variable
  • Multinomial Logistic Regression
  • Multinomial Logistic Regression
  • Data
  • NHANES Data
  • NHANES Data
  • Variables
  • The data
  • Exploratory data analysis
  • Exploratory data analysis
  • Fitting a multinomial logistic regression model
  • Model in R
  • Model result
  • Model output
  • Model output, with CI
  • Model output, with CI
  • Fair vs. Excellent Health
  • Interpretations
  • Interpretations
  • Hypothesis test for βjk
  • Confidence interval for βjk
  • Interpreting CIs for βjk
  • Interpreting CIs for βjk
  • Recap
  • f Fullscreen
  • s Speaker View
  • o Slide Overview
  • e PDF Export Mode
  • b Toggle Chalkboard
  • c Toggle Notes Canvas
  • d Download Drawings
  • ? Keyboard Help