library(tidyverse)
library(tidymodels)
library(knitr)
# add other packages as needed
HW 04: Logistic regression
Due Wednesday, November 15 at 11:59pm
Introduction
In this assignment, you’ll analyze data from an online Ipsos survey that was conducted for the FiveThirtyEight article “Why Many Americans Don’t Vote” using logistic regression for interpretation and prediction. You can read more about the polling design and respondents in the README of the GitHub repo for the data.
Learning goals
By the end of the assignment you will be able to…
- Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables
- Conduct exploratory data analysis for logistic regression
- Interpret the coefficients of a logistic regression model
- Use statistics to help choose the best-fit model
- Use the logistic regression model for prediction and classification
Getting started
The repo for this assignment is available on GitHub at github.com/sta210-fa23 and starts with the prefix hw-04. See Lab 01 for more detailed instructions on getting started.
Packages
This assignment uses the tidyverse, tidymodels, and knitr packages, which are loaded in the code at the top of this document.
Data: “Why Many Americans Don’t Vote”
The data from the article “Why Many Americans Don’t Vote” includes information from polling done by Ipsos for FiveThirtyEight. Respondents were asked a variety of questions about their political beliefs, thoughts on multiple issues, and voting behavior. We will focus on using the demographic variables and someone’s party identification to understand whether an eligible voter is a “frequent” voter.
The codebook for the variable definitions can be found in the GitHub repo for the data. The variables we’ll focus on are:
- ppage: Age of respondent
- educ: Highest educational attainment category
- race: Race of respondent, census categories. Note: all categories except Hispanic are non-Hispanic.
- gender: Gender of respondent
- income_cat: Household income category of respondent
- Q30: Response to the question “Generally speaking, do you think of yourself as a…”
  - 1: Republican
  - 2: Democrat
  - 3: Independent
  - 4: Another party, please specify
  - 5: No preference
  - -1: No response
- voter_category: Past voting behavior:
  - always: respondent voted in all or all-but-one of the elections they were eligible in
  - sporadic: respondent voted in at least two, but fewer than all-but-one of the elections they were eligible in
  - rarely/never: respondent voted in 0 or 1 of the elections they were eligible in
You can read in the data directly from the GitHub repo:
voter_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv")
Note that the authors use weighting to make the final sample more representative of the U.S. population for their article. We will not use the weighting in this assignment, so we should treat the sample as a convenience sample rather than a random sample of the population.
Exercises
The goal of this analysis is to use the polling data to examine the relationship between U.S. adults’ political party identification and voting behavior.
Exercise 1
Why do you think the authors chose to only include data from people who were eligible to vote for at least four election cycles?
Exercise 2
Let’s prepare the data for analysis and modeling.
- Create a new variable called frequent_voter that takes the value 1 if the voter_category is “always” and 0 otherwise.
- Make a table of the distribution of frequent_voter.
- What percentage of the respondents in the data say they voted “in all or all-but-one of the elections they were eligible in”?
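The recoding and tabulation steps above can be sketched on a small made-up data frame. This is only an illustration: `toy` and `toy_dist` are hypothetical names, the five rows are invented, and the same pattern would need to be applied to `voter_data` in your own analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for the voter_category column
toy <- tibble(
  voter_category = c("always", "sporadic", "rarely/never", "always", "sporadic")
)

# Indicator variable: 1 if the category is "always", 0 otherwise
toy <- toy |>
  mutate(frequent_voter = if_else(voter_category == "always", 1, 0))

# Distribution table with counts and proportions
toy_dist <- toy |>
  count(frequent_voter) |>
  mutate(prop = n / sum(n))

toy_dist
```

Multiplying `prop` by 100 gives the percentage in each group.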
Exercise 3
The variable Q30 contains the respondent’s political party identification. Make a new variable, party_id, that simplifies Q30 into three categories: “Democrat”, “Republican”, and “Independent/Neither”. The category “Independent/Neither” will also include respondents who did not answer the question. Make party_id a factor and relevel it so that it is consistent with the ordering of the responses in Question 30 of the survey.
- Make a plot of the distribution of party_id.
- Which category of party_id occurs most frequently in this data set?
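One possible way to collapse a numeric code into a small set of labeled categories is `case_when()` followed by `factor()` with explicit levels. The sketch below uses an invented vector of Q30 codes (`toy` is a hypothetical name) and assumes the level order Republican, Democrat, Independent/Neither to mirror the order of the survey responses.

```r
library(dplyr)

# Hypothetical toy data covering every Q30 code, including -1 (no response)
toy <- tibble(Q30 = c(1, 2, 3, 4, 5, -1, 2))

toy <- toy |>
  mutate(
    # Codes 1 and 2 keep their party label; everything else is grouped together
    party_id = case_when(
      Q30 == 1 ~ "Republican",
      Q30 == 2 ~ "Democrat",
      TRUE ~ "Independent/Neither"
    ),
    # Explicit levels control the order used in tables, plots, and models
    party_id = factor(
      party_id,
      levels = c("Republican", "Democrat", "Independent/Neither")
    )
  )

levels(toy$party_id)
count(toy, party_id)
```

The first level of the factor becomes the baseline category when the variable is later used as a predictor in a model.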
Exercise 4
In the FiveThirtyEight article, the authors include visualizations of the relationship between the voter category and demographic variables such as race, age, education, etc.
- Make a segmented barplot (also known as a stacked barplot) displaying the distribution of frequent_voter for each category of party_id. Make the plot such that the percentages (instead of counts) are displayed.
- Use the plot to describe the relationship between these two variables.
See the plots of demographic information by voting history in the FiveThirtyEight article for examples of segmented bar plots.
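A segmented bar plot that shows percentages rather than counts can be drawn with `geom_bar(position = "fill")`. The following is a minimal sketch on invented data (the data frame `toy` and the axis labels are assumptions, not part of the assignment); `scales::label_percent()` formats the axis as percentages.

```r
library(ggplot2)

# Hypothetical toy data: party identification and a 0/1 frequent-voter flag
toy <- data.frame(
  party_id = factor(
    c("Republican", "Republican", "Democrat", "Democrat", "Democrat",
      "Independent/Neither", "Independent/Neither", "Independent/Neither"),
    levels = c("Republican", "Democrat", "Independent/Neither")
  ),
  frequent_voter = factor(c(1, 0, 1, 1, 0, 0, 0, 1))
)

# position = "fill" rescales each bar to total 1, so segments show proportions
p <- ggplot(toy, aes(x = party_id, fill = frequent_voter)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    x = "Party identification",
    y = "Percentage of respondents",
    fill = "Frequent voter"
  )

p
```

The fill variable is stored as a factor here so that ggplot2 treats it as categorical rather than as a continuous color scale.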
Exercise 5
Let’s start by fitting a model using the demographic factors (ppage, educ, race, gender, income_cat) to predict the odds a person is a frequent voter.
- Split the data into training (75%) and testing sets (25%). Use a seed of 29.
- Fit the model on the training data. Display the model using 3 digits.
- Consider the relationship between ppage and one’s voting behavior. Interpret the coefficient of ppage in the context of the data in terms of the odds a person is a frequent voter.
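The split, fit, and interpret workflow described above might look like the sketch below. It uses simulated data with a single predictor, so the object names (`toy`, `toy_split`, `toy_fit`, `age_or`) and the simulated coefficients are purely illustrative; the assignment's model includes more predictors and uses the real survey data.

```r
library(tidymodels)

set.seed(29)

# Hypothetical simulated data: age and a binary outcome stored as a factor
toy <- tibble(ppage = round(runif(400, min = 22, max = 90))) |>
  mutate(
    frequent_voter = factor(
      rbinom(n(), size = 1, prob = plogis(-2 + 0.03 * ppage))
    )
  )

# 75% training / 25% testing split
toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Logistic regression fit on the training data only
toy_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(frequent_voter ~ ppage, data = toy_train)

# Coefficients on the log-odds scale, displayed with 3 digits
toy_coefs <- tidy(toy_fit)
knitr::kable(toy_coefs, digits = 3)

# Exponentiating a slope gives the multiplicative change in the odds
# for a one-unit (here, one-year) increase in the predictor
age_or <- exp(toy_coefs$estimate[toy_coefs$term == "ppage"])
age_or
```

In a model with several predictors, the same interpretation applies with the other predictors held constant.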
Exercise 6
Should party identification be added to the model? Use a drop-in-deviance test to determine if party identification should be added to the model fit in the previous exercise. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.
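As a reminder of the mechanics, a drop-in-deviance test compares the deviance of a reduced model with that of a larger model containing the additional terms. The difference is compared to a chi-square distribution whose degrees of freedom equal the number of added coefficients. The sketch below uses simulated data and base R's `glm()`; the variables `age`, `group`, and `y` are invented and do not correspond to the survey variables.

```r
set.seed(1)
n <- 500

# Hypothetical simulated data with a numeric and a three-level categorical predictor
toy <- data.frame(
  age = round(runif(n, 22, 90)),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(-2 + 0.03 * toy$age + 0.5 * (toy$group == "B")))

# Reduced model omits the categorical predictor; full model adds it
reduced <- glm(y ~ age, data = toy, family = binomial)
full    <- glm(y ~ age + group, data = toy, family = binomial)

# Test statistic: drop in deviance; df: number of added coefficients
G       <- deviance(reduced) - deviance(full)
df_diff <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(G, df = df_diff, lower.tail = FALSE)

c(G = G, df = df_diff, p_value = p_value)

# The same test via anova()
test_table <- anova(reduced, full, test = "Chisq")
test_table
```

A three-level factor adds two coefficients, so the test here has 2 degrees of freedom. The null hypothesis is that all added coefficients equal zero.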
Exercise 7
Display the model chosen from the previous exercise using 3 digits.
Then use the model selected to write a short paragraph (2 - 5 sentences) describing the effect (or lack of effect) of political party on the odds a person is a frequent voter. The paragraph should include an indication of which levels (if any) are statistically significant along with specifics about the differences in the odds between the political parties, as appropriate.
Exercise 8
In the article, the authors write
“Nonvoters were more likely to have lower incomes; to be young; to have lower levels of education; and to say they don’t belong to either political party, which are all traits that square with what we know about people less likely to engage with the political system.”
Consider the model you selected in Exercise 6. Is it consistent with this statement? Briefly explain why or why not.
Exercise 9
Use the testing data to produce the ROC curve and calculate the area under the curve (AUC) for the model selected in Exercise 6. Write 1 - 2 sentences describing how well the model fits the data.
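For reference, the yardstick package (loaded with tidymodels) provides `roc_curve()` and `roc_auc()`. The sketch below uses simulated predicted probabilities rather than output from a fitted model; the data frame `toy_pred` and its column `.pred_1` are assumed names. Because the outcome levels are ordered 0 then 1, `event_level = "second"` tells yardstick that "1" is the event of interest.

```r
library(yardstick)

set.seed(2)

# Hypothetical simulated truth and predicted probabilities of the "1" class
truth <- rbinom(200, 1, 0.4)
toy_pred <- data.frame(
  frequent_voter = factor(truth, levels = c(0, 1)),
  .pred_1 = plogis(rnorm(200, mean = ifelse(truth == 1, 1, -1)))
)

# Sensitivity and specificity at every possible threshold
toy_roc <- roc_curve(toy_pred, frequent_voter, .pred_1, event_level = "second")

# Area under the ROC curve
toy_auc <- roc_auc(toy_pred, frequent_voter, .pred_1, event_level = "second")
toy_auc

# ggplot2 version of the ROC curve
roc_plot <- ggplot2::autoplot(toy_roc)
```

An AUC near 0.5 indicates predictions no better than random guessing, while values closer to 1 indicate better separation of the two classes.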
Exercise 10
You have been tasked by a local political organization to identify adults in the community who are frequent voters. These adults will receive targeted political mailings that will be different from the mailings sent to adults who are not frequent voters. You will use the model selected in Exercise 6 to identify the frequent voters.
Make a confusion matrix based on the cut-off probability of 0.25. Use the confusion matrix to calculate the following:
- Sensitivity
- Specificity
- False negative rate
- False positive rate
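The four quantities above all come from the cells of a 2 x 2 confusion matrix. A small worked sketch with ten invented observations is shown below (`toy_pred`, `cm`, and the probabilities are hypothetical); it classifies an observation as a frequent voter when its predicted probability exceeds the cut-off.

```r
# Hypothetical true classes and predicted probabilities
toy_pred <- data.frame(
  truth = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  prob  = c(0.90, 0.60, 0.30, 0.10, 0.40, 0.26, 0.20, 0.15, 0.05, 0.02)
)

# Classify as 1 when the predicted probability exceeds the cut-off
threshold <- 0.25
toy_pred$pred <- ifelse(toy_pred$prob > threshold, 1, 0)

# Rows are the true classes, columns are the predicted classes
cm <- table(
  truth = factor(toy_pred$truth, levels = c(0, 1)),
  predicted = factor(toy_pred$pred, levels = c(0, 1))
)
cm

tp <- cm["1", "1"]  # true positives
fn <- cm["1", "0"]  # false negatives
tn <- cm["0", "0"]  # true negatives
fp <- cm["0", "1"]  # false positives

sens <- tp / (tp + fn)  # sensitivity: 3 / 4
spec <- tn / (tn + fp)  # specificity: 4 / 6
fnr  <- fn / (tp + fn)  # false negative rate = 1 - sensitivity
fpr  <- fp / (tn + fp)  # false positive rate = 1 - specificity
```

Lowering the cut-off generally raises sensitivity at the cost of a higher false positive rate, which is worth keeping in mind when thinking about who would receive which mailing.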
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading (50 pts)
| Component | Points |
|---|---|
| Ex 1 | 2 |
| Ex 2 | 3 |
| Ex 3 | 4 |
| Ex 4 | 4 |
| Ex 5 | 6 |
| Ex 6 | 8 |
| Ex 7 | 5 |
| Ex 8 | 4 |
| Ex 9 | 5 |
| Ex 10 | 6 |
| Workflow & formatting | 3 |