Cross validation application

Prof. Maria Tackett

Oct 23, 2023

Announcements

  • See Ed Discussion for upcoming events and internship opportunities

  • Statistics Experience due Mon, Nov 20 at 11:59pm

  • Prof. Tackett office hours Fridays 1:30 - 3:30pm for the rest of the semester

  • Start the final project in lab this week - start thinking about the data your team wants to use

Mid-semester survey

Thank you to everyone who filled out the mid-semester survey!

Aspect of class most helpful with learning

  • Application exercises
  • Lectures
  • Discussing content with others

Something to do in class to better help with learning

  • Zooming out more / reminder of the big picture
  • Taking time to finish AEs (perhaps do some of this in lab)
  • More conceptual questions on assignments, specifically HW

Things you do that are helpful with learning

  • Attend office hours!
  • Review course materials
  • Lots of practice - review AEs, HW, labs

Mid-semester survey

Why we do in-class exams

  • Opportunity to demonstrate understanding of concepts and how they apply in practice

    • This is what will make you stand out as a statistician/data scientist!
  • In-class provides the most “level” playing field to demonstrate conceptual understanding, given all the online resources available now

  • Lots of other opportunities to demonstrate application skills through labs, HW, final project, and take-home portion of exam

Statistician of the day: Felicity Enders

Photo of Dr. Felicity Enders

Dr. Felicity Enders received her PhD from Johns Hopkins Bloomberg School of Public Health. She is a Professor of Biostatistics at the Mayo Clinic. With close to 200 publications, she has worked closely with clinicians, with particular focus on women’s health and psychology. Across the medical spectrum, Dr. Enders has provided advanced statistical modeling collaboration in clinical trials.

She is also passionate about biostatistics education and works to dissolve the hidden curriculum for research, particularly statistical knowledge needed for non-statisticians.

Source: hardin47.github.io/CURV/scholars/enders

Felicity Enders

Dr. Enders was a statistician on an interdisciplinary research team that used logistic regression to identify demographic, clinical, and laboratory variables associated with the presence (or absence) of advanced fibrosis with the aim to create a scoring system that could be used by clinicians.

“Data from each of the 4 countries were randomly separated into 2/3 and 1/3 of patients for model building and model validation, respectively. Hence, data on 480 patients were used to build a model, whereas data on 253 patients were used to validate the model.”

“…cross-validation was used with 20 subgroups, so that at most 5% of the data under consideration was excluded at any one time. By employing cross-validation, the possibility of an unusually positive or negative validation subset could be assessed.”

Angulo, Paul, et al. “The NAFLD fibrosis score: a noninvasive system that identifies liver fibrosis in patients with NAFLD.” Hepatology 45.4 (2007): 846-854.
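
The "20 subgroups, at most 5%" arithmetic in the quote can be checked with a short base R sketch. It assumes, purely for illustration, that the cross-validated data are the 480 model-building patients; the paper excerpt does not say which subset was used, and the fold assignment below is our own construction rather than the authors' procedure.

```r
# Illustrative check: with 20 folds, how much data is held out at once?
# ASSUMPTION: n = 480 (the model-building sample size quoted above).
set.seed(1)
n <- 480
v <- 20

# Randomly assign each observation to one of v roughly equal-sized folds
fold_id <- sample(rep(1:v, length.out = n))
fold_sizes <- table(fold_id)

# Largest share of the data excluded in any single round
max_share <- max(fold_sizes) / n
max_share
```

When n is not an exact multiple of 20, the largest fold holds slightly more than 5% of the observations.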

Topics

  • Cross validation application exercise

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(patchwork)
library(knitr)
library(kableExtra)
library(countdown)
library(rms)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Introduction

Data: Restaurant tips

Which variables help us predict the amount customers tip at a restaurant?

# A tibble: 169 × 5
     Tip Party Meal   Age    Alcohol
   <dbl> <dbl> <chr>  <chr>  <chr>  
 1  2.99     1 Dinner Yadult No     
 2  2        1 Dinner Yadult No     
 3  5        1 Dinner SenCit No     
 4  4        3 Dinner Middle No     
 5 10.3      2 Dinner SenCit No     
 6  4.85     2 Dinner Middle No     
 7  5        4 Dinner Yadult No     
 8  4        3 Dinner Middle No     
 9  5        2 Dinner Middle No     
10  1.58     1 Dinner SenCit No     
# ℹ 159 more rows

Variables

Predictors:

  • Party: Number of people in the party
  • Meal: Time of day (Lunch, Dinner, Late Night)
  • Age: Age category of person paying the bill (Yadult, Middle, SenCit)
  • Alcohol: Whether the party ordered alcohol with the meal (Yes, No)

Outcome: Tip: Amount of tip

Outcome: Tip

Predictors

Outcome vs. predictors

Analysis goal

Use cross validation to evaluate and select a model to predict the tip amount

v-fold cross validation – commonly used resampling technique:

  • Randomly split your training data into v partitions
  • Use v-1 partitions for analysis, and the remaining 1 partition for assessment
  • Repeat v times, updating which partition is used for assessment each time
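
The three steps above can be sketched in base R. This is a minimal illustration, not the course's solution: the data frame `toy` is simulated (it is not the restaurant tips data), and the choices v = 5, the single predictor `Party`, and RMSE as the error measure are all assumptions made for the example.

```r
# Minimal base R sketch of v-fold cross validation.
# ASSUMPTION: `toy` is simulated stand-in data, not the real tips data.
set.seed(210)
n <- 120
v <- 5
toy <- data.frame(Party = sample(1:6, n, replace = TRUE))
toy$Tip <- 1.5 + 1.8 * toy$Party + rnorm(n, sd = 2)

# Step 1: randomly split the rows into v partitions (folds)
fold_id <- sample(rep(1:v, length.out = n))

# Steps 2-3: each fold is used once for assessment while the
# remaining v - 1 folds are used for analysis (model fitting)
fold_rmse <- sapply(1:v, function(k) {
  analysis   <- toy[fold_id != k, ]
  assessment <- toy[fold_id == k, ]
  fit  <- lm(Tip ~ Party, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$Tip - pred)^2))
})

fold_rmse        # one RMSE per assessment fold
mean(fold_rmse)  # cross-validated estimate of prediction error
```

Averaging the v assessment errors gives a single summary that can be compared across candidate models.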

Application exercise

Inference for multiple linear regression

Modeling workflow

  • Split data into training and test sets.

  • Use cross validation on the training set to fit, evaluate, and compare candidate models. Choose a final model based on summary of cross validation results.

  • Refit the model using the entire training set and do “final” evaluation on the test set (make sure you have not overfit the model).

    • Adjust as needed if there is evidence of overfitting.
  • Use the model fit on the training set for inference and prediction.
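
One way the workflow above might look with tidymodels is sketched below. The data set `tips_sim` is simulated so the example is self-contained; the 75/25 split, v = 5, the two candidate formulas, and the choice of the second model as "final" are all illustrative assumptions, not the application exercise's answer.

```r
library(tidymodels)

# ASSUMPTION: simulated stand-in for the restaurant tips data
set.seed(210)
n <- 169
tips_sim <- tibble(
  Party   = sample(1:6, n, replace = TRUE),
  Alcohol = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Tip     = 1.5 + 1.8 * Party + 1 * (Alcohol == "Yes") + rnorm(n, sd = 2)
)

# 1. Split data into training and test sets
tips_split <- initial_split(tips_sim, prop = 0.75)
tips_train <- training(tips_split)
tips_test  <- testing(tips_split)

# 2. Cross validation on the training set to compare candidate models
folds <- vfold_cv(tips_train, v = 5)
spec  <- linear_reg() |> set_engine("lm")

wflow_1 <- workflow() |> add_model(spec) |> add_formula(Tip ~ Party)
wflow_2 <- workflow() |> add_model(spec) |> add_formula(Tip ~ Party + Alcohol)

cv_1 <- fit_resamples(wflow_1, resamples = folds)
cv_2 <- fit_resamples(wflow_2, resamples = folds)

metrics_1 <- collect_metrics(cv_1)  # mean RMSE and R-squared across folds
metrics_2 <- collect_metrics(cv_2)

# 3. Refit the chosen model on the entire training set, then do the
#    "final" evaluation on the test set (here wflow_2, chosen arbitrarily)
final_fit <- fit(wflow_2, data = tips_train)
test_pred <- augment(final_fit, new_data = tips_test)
test_rmse <- rmse(test_pred, truth = Tip, estimate = .pred)

# 4. Use the model fit on the training set for inference
tidy(final_fit)
```

A test-set RMSE much larger than the cross-validated RMSE would be one sign of overfitting.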