HW 04: Logistic regression

Important

Due Wednesday, November 15 at 11:59pm

Introduction

In this assignment, you’ll analyze data from an online Ipsos survey that was conducted for the FiveThirtyEight article “Why Many Americans Don’t Vote” using logistic regression for interpretation and prediction. You can read more about the polling design and respondents in the README of the GitHub repo for the data.

Learning goals

By the end of the assignment you will be able to…

Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables
Conduct exploratory data analysis for logistic regression
Interpret coefficients of logistic regression model
Use statistics to help choose the best fit model
Use the logistic regression model for prediction and classification

Getting started

The repo for this assignment is available on GitHub at github.com/sta210-fa23 and starts with the prefix hw-04. See Lab 01 for more detailed instructions on getting started.

Packages

The following packages will be used for this assignment.

library(tidyverse)
library(tidymodels)
library(knitr)

# add other packages as needed

Data: “Why Many Americans Don’t Vote”

The data from the article “Why Many Americans Don’t Vote” includes information from polling done by Ipsos for FiveThirtyEight. Respondents were asked a variety of questions about their political beliefs, thoughts on multiple issues, and voting behavior. We will focus on using the demographic variables and someone’s party identification to understand whether an eligible voter is a “frequent” voter.

The codebook for the variable definitions can be found in the GitHub repo for the data. The variables we’ll focus on are:

ppage: Age of respondent
educ: Highest educational attainment category.
race: Race of respondent, census categories. Note: all categories except Hispanic are non-Hispanic.
gender: Gender of respondent
income_cat: Household income category of respondent
Q30: Response to the question “Generally speaking, do you think of yourself as a…”
- 1:Republican
- 2: Democrat
- 3: Independent
- 4: Another party, please specify
- 5: No preference
- -1: No response
voter_category: past voting behavior:
- always: respondent voted in all or all-but-one of the elections they were eligible in
- sporadic: respondent voted in at least two, but fewer than all-but-one of the elections they were eligible in
- rarely/never: respondent voted in 0 or 1 of the elections they were eligible in

You can read in the data directly from the GitHub repo:

voter_data <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv')

Note that the authors use weighting to make the final sample more representative on the US population for their article. We will not use the weighting in this assignment, so we should treat the sample as a convenience sample rather than a random sample of the population.

Exercises

The goal of this analysis is use the polling data to examine the relationship between U.S. adults’ political party identification and voting behavior.

Exercise 1

Why do you think the authors chose to only include data from people who were eligible to vote for at least four election cycles?

Exercise 2

Let’s prepare the data for analysis and modeling.

Create a new variable called frequent_voter that takes the value 1 if the voter_category is “always” and 0 otherwise.
Make a table of the distribution of frequent_voter.
What percentage of the respondents in the data say they voted “in all or all-but-one of the elections they were eligible in”?

Exercise 3

The variable Q30 contains the respondent’s political party identification. Make a new variable, party_id, that simplifies Q30 into three categories: “Democrat”, “Republican”, “Independent/Neither”, The category “Independent/Neither” will also include respondents who did not answer the question. Make party_id a factor and relevel it so that it is consistent with the ordering of the responses in Question 30 of the survey.

Make a plot of the distribution of party_id.
Which category of party_id occurs most frequently in this data set?

Exercise 4

In the FiveThirtyEight article, the authors include visualizations of the relationship between the voter category and demographic variables such as race, age, education, etc.

Make a segmented barplot (also known as a stacked barplot) displaying the distribution of frequent_voter for each category of party_id. Make the plot such that the percentages (instead of counts) are displayed.
Use the plot to describe the relationship between these two variables.

Tip

See the plots of demographic information by voting history in the FiveThirtyEight article for examples of segmented bar plots.

Exercise 5

Let’s start by fitting a model using the demographic factors - ppage, educ, race, gender, income_cat - to predict the odds a person is a frequent voter.

Split the data into training (75%) and testing sets (25%). Use a seed of 29.
Fit the model on the training data. Display the model using 3 digits.
Consider the relationship between ppage and one’s voting behavior. Interpret the coefficient of ppage in the context of the data in terms of the odds a person is a frequent voter.

Exercise 6

Should party identification be added to the model? Use a drop-in-deviance test to determine if party identification should be added to the model fit in the previous exercise. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.

Exercise 7

Display the model chosen from the previous exercise using 3 digits.

Then use the model selected to write a short paragraph (2 - 5 sentences) describing the effect (or lack of effect) of political party on the odds a person is a frequent voter. The paragraph should include an indication of which levels (if any) are statistically significant along with specifics about the differences in the odds between the political parties, as appropriate.

Exercise 8

In the article, the authors write

“Nonvoters were more likely to have lower incomes; to be young; to have lower levels of education; and to say they don’t belong to either political party, which are all traits that square with what we know about people less likely to engage with the political system.”

Consider the model you selected in Exercise 6. Is it consistent with this statement? Briefly explain why or why not.

Exercise 9

Use the testing data to produce the ROC curve and calculate the area under curve (AUC) for the model selected in Exercise 6. Write 1 - 2 sentences describing how well the model fits the data.

Exercise 10

You have been tasked by a local political organization to identify adults in the community who are frequent voters. These adults will receive targeted political mailings that will be different from the mailings sent to adults who are not frequent voters. You will use the model selected in Exercise 6 to identify the frequent voters.

Make a confusion matrix based on the cut-off probability of 0.25. Use the confusion matrix to calculate the following:

Sensitivity
Specificity
False negative rate
False positive rate

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading (50 pts)

Component	Points
Ex 1	2
Ex 2	3
Ex 3	4
Ex 4	4
Ex 5	6
Ex 6	8
Ex 7	5
Ex 8	4
Ex 9	5
Ex 10	6
Workflow & formatting	3¹

Footnotes

The “Workflow & formatting” grade is to assess the reproducible workflow and document format. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date in the YAML.↩︎↩︎