Oct 23, 2023
See Ed Discussion for upcoming events and internship opportunities
Statistics Experience due Mon, Nov 20 at 11:59pm
Prof. Tackett office hours Fridays 1:30 - 3:30pm for the rest of the semester
The final project starts in lab this week. Start thinking about the data your team wants to use.
Thank you to everyone who filled out the mid-semester survey!
Aspect of class most helpful with learning
Something to do in class to better help with learning
Things you do that are helpful with learning
Why we do in-class exams
Opportunity to demonstrate understanding of concepts and how they are applied
An in-class exam provides the most “level” playing field to demonstrate conceptual understanding, given all the online resources available now
Lots of other opportunities to demonstrate application skills through labs, HW, final project, and take-home portion of exam
Dr. Felicity Enders received her PhD from Johns Hopkins Bloomberg School of Public Health. She is a Professor of Biostatistics at the Mayo Clinic. With close to 200 publications, she has worked closely with clinicians, with particular focus on women’s health and psychology. Across the medical spectrum, Dr. Enders has provided advanced statistical modeling collaboration in clinical trials.
She is also passionate about biostatistics education and works to dissolve the hidden curriculum for research, particularly statistical knowledge needed for non-statisticians.
Dr. Enders was a statistician on an interdisciplinary research team that used logistic regression to identify demographic, clinical, and laboratory variables associated with the presence (or absence) of advanced fibrosis. The aim was to create a scoring system that clinicians could use.
“Data from each of the 4 countries were randomly separated into 2/3 and 1/3 of patients for model building and model validation, respectively. Hence, data on 480 patients were used to build a model, whereas data on 253 patients were used to validate the model.”
“…cross-validation was used with 20 subgroups, so that at most 5% of the data under consideration was excluded at any one time. By employing cross-validation, the possibility of an unusually positive or negative validation subset could be assessed.”
Angulo, Paul, et al. “The NAFLD fibrosis score: a noninvasive system that identifies liver fibrosis in patients with NAFLD.” Hepatology 45.4 (2007): 846-854.
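The arithmetic behind the two quotes can be checked with a short sketch. It uses only the counts quoted above (480 and 253 patients, 20 subgroups) and assumes, for illustration, that the 20 subgroups were formed from the 480 model-building patients; the quote does not say exactly which data were cross-validated. The helper `make_folds` is a hypothetical name, not something from the paper.

```python
import random

def make_folds(n, v, seed=0):
    """Randomly assign indices 0..n-1 to v folds whose sizes differ by at most 1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Taking every v-th index starting at i gives v near-equal groups.
    return [idx[i::v] for i in range(v)]

n_build, n_valid = 480, 253  # counts quoted from the paper

# ASSUMPTION: the 20 subgroups partition the 480 model-building patients.
folds = make_folds(n_build, v=20)
fold_sizes = [len(f) for f in folds]

# Largest share of the data held out at any one time.
max_held_out = max(fold_sizes) / n_build

# Share of all patients used for model building (roughly 2/3).
build_share = n_build / (n_build + n_valid)

print(fold_sizes[0], max_held_out, round(build_share, 3))
```

With 20 folds, each fold holds out 1/20 of the data, which is where the "at most 5%" in the quote comes from.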
Which variables help us predict the amount customers tip at a restaurant?
# A tibble: 169 × 5
     Tip Party Meal   Age    Alcohol
   <dbl> <dbl> <chr>  <chr>  <chr>
 1  2.99     1 Dinner Yadult No
 2  2        1 Dinner Yadult No
 3  5        1 Dinner SenCit No
 4  4        3 Dinner Middle No
 5 10.3      2 Dinner SenCit No
 6  4.85     2 Dinner Middle No
 7  5        4 Dinner Yadult No
 8  4        3 Dinner Middle No
 9  5        2 Dinner Middle No
10  1.58     1 Dinner SenCit No
# ℹ 159 more rows
Predictors:
Party: Number of people in the party
Meal: Time of day (Lunch, Dinner, Late Night)
Age: Age category of person paying the bill (Yadult, Middle, SenCit)
Alcohol: Whether the party ordered alcohol with the meal (Yes, No)
Outcome:
Tip: Amount of tip
Use cross validation to evaluate and select a model to predict the tip amount
v-fold cross validation – commonly used resampling technique:
Split data into training and test sets.
Use cross validation on the training set to fit, evaluate, and compare candidate models. Choose a final model based on summary of cross validation results.
Refit the selected model on the entire training set and do a “final” evaluation on the test set to check that the model has not overfit.
Use the model fit on the training set for inference and prediction.
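The workflow above can be sketched end to end in a few lines. This is a minimal illustration in plain Python, not the course's own code: the data are SIMULATED (they are not the tips data shown earlier), the two candidate models (`fit_mean`, `fit_line`) are hypothetical stand-ins for whatever candidates a team would compare, and the 75/25 split and v = 10 are assumed values.

```python
import random
import statistics

rng = random.Random(210)

# SIMULATED data (not the real tips data): tip grows with party size plus noise.
party = [rng.randint(1, 6) for _ in range(169)]
tip = [1.0 + 1.8 * p + rng.gauss(0, 1.5) for p in party]
rows = list(zip(party, tip))

def fit_mean(train):
    """Candidate 1: intercept-only model, predicts the mean training tip."""
    m = statistics.fmean(y for _, y in train)
    return lambda x: m

def fit_line(train):
    """Candidate 2: simple least-squares line, tip ~ party."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in train) / sxx
    return lambda x: ybar + slope * (x - xbar)

def rmse(model, data):
    """Root mean squared prediction error of a fitted model on data."""
    return (sum((y - model(x)) ** 2 for x, y in data) / len(data)) ** 0.5

def cv_rmse(fit, train, v=10):
    """Mean RMSE over v folds: fit on v-1 folds, assess on the held-out fold."""
    folds = [train[i::v] for i in range(v)]
    scores = []
    for k in range(v):
        analysis = [r for j, f in enumerate(folds) if j != k for r in f]
        scores.append(rmse(fit(analysis), folds[k]))
    return statistics.fmean(scores)

# Step 1: split into training (75%) and test (25%) sets.
rng.shuffle(rows)
n_train = int(0.75 * len(rows))
train, test = rows[:n_train], rows[n_train:]

# Step 2: cross-validate each candidate on the training set only.
candidates = {"mean_only": fit_mean, "party_line": fit_line}
cv_results = {name: cv_rmse(fit, train) for name, fit in candidates.items()}
best = min(cv_results, key=cv_results.get)

# Step 3: refit the chosen model on the whole training set, then test once.
final_model = candidates[best](train)
test_rmse = rmse(final_model, test)

print(cv_results, best, round(test_rmse, 2))
```

The test set is touched only once, after the model has been chosen. A test RMSE much larger than the cross-validated RMSE would be a sign of overfitting.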