AE 17: Exam 02 Review
Go to the course GitHub organization and locate your ae-17 repo to get started.
Render, commit, and push your responses to GitHub by the end of class. The responses are due in your GitHub repo no later than Saturday, December 2 at 11:59pm.
Note: This in-class review is not exhaustive. Use lecture notes, application exercises, labs, and homework for a comprehensive exam review.
Packages
library(tidyverse)
library(tidymodels)
library(knitr)
Part 1: Logistic regression
Data
The data for this analysis is about credit card customers. It can be found in the file credit.csv. The following variables are in the data set:
- income: Income in $1,000’s
- limit: Credit limit
- rating: Credit rating
- cards: Number of credit cards
- age: Age in years
- education: Number of years of education
- own: A factor with levels No and Yes indicating whether the individual owns their home
- student: A factor with levels No and Yes indicating whether the individual was a student
- married: A factor with levels No and Yes indicating whether the individual was married
- region: A factor with levels South, East, and West indicating the region of the US the individual is from
- balance: Average credit card balance in $.
The objective of this analysis is to predict whether a person has maxed out their credit card, i.e., had $0 average card balance.
credit <- read_csv("data/credit.csv") |>
  mutate(maxed = factor(if_else(balance == 0, 1, 0)))
Exercise 1
- Why is logistic regression the best modeling approach for this analysis?
Fill in the blanks. In logistic regression, we
use log-odds to …
use odds to …
use probabilities to …
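As a reminder of how the three scales relate to one another, here is a minimal sketch in base R. The probability 0.2 is an arbitrary illustration, not a value from the credit data.

```r
# Convert between probability, odds, and log-odds.
p <- 0.2                     # probability of the event (illustrative value)
odds <- p / (1 - p)          # odds = p / (1 - p)
log_odds <- log(odds)        # log-odds (the logit of p)

# Back-transform: log-odds -> odds -> probability
odds_back <- exp(log_odds)
p_back <- odds_back / (1 + odds_back)

# Base R shortcuts: qlogis() is the logit, plogis() is its inverse
all.equal(log_odds, qlogis(p))
all.equal(p_back, plogis(log_odds))
```

The logistic regression model is linear on the log-odds scale, so these conversions are what connect fitted coefficients to odds and predicted probabilities.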
Exercise 2
Split the data into 80% training and 20% test sets. Use seed 210.
set.seed(210)
credit_split <- initial_split(credit, prop = 0.8)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)
Specify the logistic regression model. Call it credit_spec.
#add code here
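One possible specification is sketched here. It assumes the "glm" engine, which is the parsnip default for logistic_reg(); the exercise itself does not name an engine.

```r
library(tidymodels)

# Logistic regression specification; the "glm" engine is an assumption
credit_spec <- logistic_reg() |>
  set_engine("glm")

credit_spec
```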
Exercise 3
Create a recipe called credit_rec that does the following:
- Predict maxed using income, age, and student.
- Mean center the quantitative predictors.
- Create dummy variables where needed and drop any zero variance variables.
credit_rec <- recipe(maxed ~ income + age + student,
                     data = credit_train)
# add recipe steps
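The three recipe steps could look like the sketch below. It runs on simulated data so that it is self-contained: toy_train and toy_rec are hypothetical stand-ins for credit_train and credit_rec, and the simulated values are not the course data.

```r
library(tidymodels)

set.seed(210)
n <- 200
# Simulated stand-in for credit_train (hypothetical values)
toy_train <- tibble(
  income  = runif(n, 10, 150),
  age     = round(runif(n, 20, 80)),
  student = factor(rep(c("No", "Yes"), times = c(150, 50))),
  maxed   = factor(rbinom(n, 1, plogis(1 - 0.03 * income)))
)

toy_rec <- recipe(maxed ~ income + age + student, data = toy_train) |>
  step_center(all_numeric_predictors()) |>   # mean center quantitative predictors
  step_dummy(all_nominal_predictors()) |>    # dummy variables for factors
  step_zv(all_predictors())                  # drop zero variance predictors

# Inspect the processed training data
toy_baked <- toy_rec |>
  prep() |>
  bake(new_data = NULL)
glimpse(toy_baked)
```

After baking, the centered predictors should have mean approximately zero, and student should appear as a single indicator column.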
Exercise 4
Create the workflow that brings together the model specification and recipe. Call it credit_wflow.
# add code here
Exercise 5
Conduct 5-fold cross validation. Use seed 210 to create the folds.
# add code here
Then, summarize the metrics from your CV resamples.
# add code here
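A sketch of the workflow, the folds, and the metric summary together, again on simulated data. All toy_ names are hypothetical stand-ins for the credit_ objects, and the "glm" engine is an assumption.

```r
library(tidymodels)

set.seed(210)
n <- 200
# Simulated stand-in for credit_train (hypothetical values)
toy_train <- tibble(
  income  = runif(n, 10, 150),
  age     = round(runif(n, 20, 80)),
  student = factor(rep(c("No", "Yes"), times = c(150, 50))),
  maxed   = factor(rbinom(n, 1, plogis(1 - 0.03 * income)))
)

toy_spec <- logistic_reg() |>
  set_engine("glm")

toy_rec <- recipe(maxed ~ income + age + student, data = toy_train) |>
  step_center(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# Workflow = recipe + model specification
toy_wflow <- workflow() |>
  add_recipe(toy_rec) |>
  add_model(toy_spec)

# 5-fold cross validation, with the seed set just before creating the folds
set.seed(210)
toy_folds <- vfold_cv(toy_train, v = 5)

toy_cv <- toy_wflow |>
  fit_resamples(resamples = toy_folds)

# Average each metric across the five folds
toy_metrics <- collect_metrics(toy_cv)
toy_metrics
```

The same pattern applies to a second workflow with a different recipe; the two collect_metrics() summaries can then be compared to choose between the models.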
Exercise 6
Create a workflow for another model with a new recipe that includes the variable region along with all the variables and recipe steps from Exercise 3. Conduct cross validation, then select a model between the two.
# add code here
Exercise 7
Fit the model you selected in the previous exercise to the entire training set.
# add code here
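Fitting a workflow to the full training set is a single call to fit(). The sketch below uses simulated data and the three-predictor recipe purely for illustration; the toy_ names are hypothetical, and in the exercise you would use whichever workflow you selected.

```r
library(tidymodels)

set.seed(210)
n <- 250
# Simulated stand-in for the credit data (hypothetical values)
toy <- tibble(
  income  = runif(n, 10, 150),
  age     = round(runif(n, 20, 80)),
  student = factor(rep(c("No", "Yes"), times = c(200, 50))),
  maxed   = factor(rbinom(n, 1, plogis(1 - 0.03 * income)))
)
toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

toy_wflow <- workflow() |>
  add_recipe(
    recipe(maxed ~ income + age + student, data = toy_train) |>
      step_center(all_numeric_predictors()) |>
      step_dummy(all_nominal_predictors()) |>
      step_zv(all_predictors())
  ) |>
  add_model(logistic_reg() |> set_engine("glm"))

# Fit to the entire training set
toy_fit <- fit(toy_wflow, data = toy_train)
tidy(toy_fit)

# Predicted probabilities for the test set, kept next to the true outcome
toy_pred <- predict(toy_fit, new_data = toy_test, type = "prob") |>
  bind_cols(toy_test |> select(maxed))
```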
Exercise 8
Now let’s evaluate the performance of the selected model using the testing data.
Create a confusion matrix using a cutoff probability of 0.5.
# add code here
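To illustrate the mechanics of applying a cutoff, here is a small sketch with ten made-up predicted probabilities. The tibble toy_pred is hypothetical; in the exercise the probabilities would come from predicting on the test set with type = "prob".

```r
library(tidymodels)

# Hypothetical true classes and predicted probabilities of maxed = 1
toy_pred <- tibble(
  maxed   = factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1)),
  .pred_1 = c(0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.2, 0.1, 0.1)
) |>
  # Classify as maxed (1) when the predicted probability exceeds 0.5
  mutate(.pred_class = factor(if_else(.pred_1 > 0.5, 1, 0), levels = c(0, 1)))

# Rows are predictions, columns are the truth
toy_cm <- conf_mat(toy_pred, truth = maxed, estimate = .pred_class)
toy_cm
```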
What is the sensitivity? What does it mean in the context of the data?
What is the specificity? What does it mean in the context of the data?
What is the false positive rate? What does it mean in the context of the data?
What is the false negative rate? What does it mean in the context of the data?
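All four quantities follow directly from the cells of the confusion matrix. A minimal sketch in base R, using made-up counts and treating maxed = 1 as the event:

```r
# Hypothetical confusion matrix counts (event: maxed = 1)
tp <- 3  # maxed, predicted maxed
fn <- 1  # maxed, predicted not maxed
fp <- 1  # not maxed, predicted maxed
tn <- 5  # not maxed, predicted not maxed

sensitivity <- tp / (tp + fn)  # P(predict maxed | actually maxed)
specificity <- tn / (tn + fp)  # P(predict not maxed | actually not maxed)
fpr <- fp / (fp + tn)          # false positive rate = 1 - specificity
fnr <- fn / (fn + tp)          # false negative rate = 1 - sensitivity

c(sensitivity = sensitivity, specificity = specificity, fpr = fpr, fnr = fnr)
```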
Exercise 9
Produce the ROC curve.
# add code here
- Describe how you can use this curve to select a cutoff probability (rather than just going with 0.5).
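The sketch below draws an ROC curve from made-up predictions and picks the threshold that maximizes sensitivity + specificity - 1 (Youden's J). That criterion is only one possible choice and is an assumption here; the right trade-off depends on the relative cost of false positives and false negatives. The tibble toy_pred is hypothetical.

```r
library(tidymodels)

# Hypothetical true classes and predicted probabilities of maxed = 1
toy_pred <- tibble(
  maxed   = factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1)),
  .pred_1 = c(0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.2, 0.1, 0.1)
)

# event_level = "second" because the event (1) is the second factor level
toy_roc <- roc_curve(toy_pred, truth = maxed, .pred_1, event_level = "second")

# Plot sensitivity against 1 - specificity
autoplot(toy_roc)

# One possible rule: the threshold with the largest Youden's J
toy_best <- toy_roc |>
  mutate(j = sensitivity + specificity - 1) |>
  slice_max(j, n = 1, with_ties = FALSE)
toy_best
```

Each point on the curve corresponds to one cutoff, so reading across the curve shows how much specificity is given up for each gain in sensitivity.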
Exercise 10
Questions about checking conditions for logistic regression:
Do we assess conditions on the training or testing set?
Why do we not consider categorical predictors when checking linearity?
Why do we not need to check constant variance for logistic regression?
Submission
To submit the AE:
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your ae-17 repo on GitHub. (You do not submit AEs on Gradescope.)