Oct 18, 2023
See Ed Discussion for upcoming events and internship opportunities
Statistics Experience due Mon, Nov 20 at 11:59pm
Please submit mid-semester feedback by Friday
Prof. Tackett office hours Fridays 1:30 - 3:30pm for the rest of the semester
Which variables help us predict the amount customers tip at a restaurant?
# A tibble: 169 × 4
Tip Party Meal Age
<dbl> <dbl> <chr> <chr>
1 2.99 1 Dinner Yadult
2 2 1 Dinner Yadult
3 5 1 Dinner SenCit
4 4 3 Dinner Middle
5 10.3 2 Dinner SenCit
6 4.85 2 Dinner Middle
7 5 4 Dinner Yadult
8 4 3 Dinner Middle
9 5 2 Dinner Middle
10 1.58 1 Dinner SenCit
# ℹ 159 more rows
Predictors:

- Party: Number of people in the party
- Meal: Time of day (Lunch, Dinner, Late Night)
- Age: Age category of person paying the bill (Yadult, Middle, SenCit)

Outcome:

- Tip: Amount of tip
# Fit a linear model Tip ~ Party + Age and display the coefficient estimates
tip_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Tip ~ Party + Age, data = tips)

tidy(tip_fit) |>
  kable(digits = 3)
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | -0.170   | 0.366     | -0.465    | 0.643   |
| Party       | 1.837    | 0.124     | 14.758    | 0.000   |
| AgeMiddle   | 1.009    | 0.408     | 2.475     | 0.014   |
| AgeSenCit   | 1.388    | 0.485     | 2.862     | 0.005   |
Is this the best model to explain variation in tips?
- the variation that can be explained by each of the variables in the model
- the variation that can't be explained by the model (left in the residuals)
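One way to see this breakdown is through an ANOVA table. A minimal sketch of how the table below might be produced, assuming the tip_fit object from above and the tidymodels/knitr packages already loaded (note that anova() returns the Party, Age, and Residuals rows; the Total row is their sum and would be added separately):

```r
# ANOVA table: sum of squares explained by each term in the model,
# plus the residual (unexplained) sum of squares
anova(extract_fit_engine(tip_fit)) |>
  tidy() |>
  kable(digits = 2)
```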
| term      | df  | sumsq   | meansq  | statistic | p.value |
|-----------|-----|---------|---------|-----------|---------|
| Party     | 1   | 1188.64 | 1188.64 | 285.71    | 0       |
| Age       | 2   | 38.03   | 19.01   | 4.57      | 0.01    |
| Residuals | 165 | 686.44  | 4.16    |           |         |
| Total     | 168 | 1913.11 |         |           |         |
| term      | df  | sumsq   |
|-----------|-----|---------|
| Party     | 1   | 1188.64 |
| Age       | 2   | 38.03   |
| Residuals | 165 | 686.44  |
| Total     | 168 | 1913.11 |
\(SS_{Total}\): Total sum of squares, variability of outcome
\(\sum_{i = 1}^n (y_i - \bar{y})^2\) = 1913.11
\(SS_{Error}\): Residual sum of squares, variability of residuals
\(\sum_{i = 1}^n (y_i - \hat{y}_i)^2\) = 686.44
\(SS_{Model}\): Variability explained by the model
\(SS_{Total} - SS_{Error}\) = 1226.67
Recall: \(R^2\) is the proportion of the variation in the response variable explained by the regression model.
\[ R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}} = 1 - \frac{686.44}{1913.11} = 0.641 \]
\[R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}}\]
\[R^2_{adj} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)}\]
where
\(n\) is the number of observations used to fit the model
\(p\) is the number of terms (not including the intercept) in the model
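Both quantities can be pulled directly from the fitted model via broom's glance(); a minimal sketch using the tip_fit object from above:

```r
# R-squared and adjusted R-squared for the fitted model
glance(tip_fit) |>
  select(r.squared, adj.r.squared)
```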
Use a recipe() and workflow to fit Model 1 or Model 2? Vote on Ed Discussion [10:05am lecture] [1:25pm lecture]
Estimators of prediction error and relative quality of models:
Akaike’s Information Criterion (AIC): \[AIC = n\log(SS_\text{Error}) - n \log(n) + 2(p+1)\]
Schwarz’s Bayesian Information Criterion (BIC): \[BIC = n\log(SS_\text{Error}) - n\log(n) + \log(n)\times(p+1)\]
\[ \begin{aligned} & AIC = \color{blue}{n\log(SS_\text{Error})} - n \log(n) + 2(p+1) \\ & BIC = \color{blue}{n\log(SS_\text{Error})} - n\log(n) + \log(n)\times(p+1) \end{aligned} \]
First Term: Decreases as p increases
\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - \color{blue}{n \log(n)} + 2(p+1) \\ & BIC = n\log(SS_\text{Error}) - \color{blue}{n\log(n)} + \log(n)\times(p+1) \end{aligned} \]
Second Term: Fixed for a given sample size n
\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - n\log(n) + \color{blue}{2(p+1)} \\ & BIC = n\log(SS_\text{Error}) - n\log(n) + \color{blue}{\log(n)\times(p+1)} \end{aligned} \]
Third Term: Increases as p increases
\[ \begin{aligned} & AIC = n\log(SS_\text{Error}) - n \log(n) + \color{red}{2(p+1)} \\ & BIC = n\log(SS_\text{Error}) - n\log(n) + \color{red}{\log(n)\times(p+1)} \end{aligned} \]
Choose model with the smaller value of AIC or BIC
If \(n \geq 8\), the penalty for BIC is larger than that of AIC, so BIC tends to favor more parsimonious models (i.e. models with fewer terms)
Which model would we choose based on AIC?
Which model would we choose based on BIC?
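As a sketch of how these could be computed, glance() also returns AIC and BIC for each fitted model. The Model 2 formula below (adding Meal) is only a hypothetical second candidate for illustration:

```r
# AIC and BIC for two candidate models (Model 2's formula is hypothetical)
model1 <- linear_reg() |> set_engine("lm") |> fit(Tip ~ Party + Age, data = tips)
model2 <- linear_reg() |> set_engine("lm") |> fit(Tip ~ Party + Age + Meal, data = tips)

bind_rows(
  glance(model1) |> mutate(model = "Model 1"),
  glance(model2) |> mutate(model = "Model 2")
) |>
  select(model, AIC, BIC)
```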
The principle of parsimony is attributed to William of Occam (early 14th-century English nominalist philosopher), who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation1
Called Occam’s razor because he “shaved” his explanations down to the bare minimum
Parsimony in modeling:
Occam’s razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected
Model selection follows this principle
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model
In other words, we prefer the simplest best model, i.e., the most parsimonious model
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.
Radford Neal - Bayesian Learning for Neural Networks1
split our data into testing and training sets
“train” the model on the training data and pick a few models we’re genuinely considering as potentially good models
test those models on the testing set
…and repeat this process multiple times
Resampling is only conducted on the training set. The test set is not involved. For each iteration of resampling, the data are partitioned into two subsamples:

- The model is fit with the analysis set
- The model is evaluated with the assessment set
Image source: Kuhn and Silge. Tidy modeling with R.
More specifically, v-fold cross validation is a commonly used resampling technique.
Let's give an example where v = 3…
Split data into training and test sets
Create recipe
Specify model
Create workflow
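A minimal sketch of these steps with tidymodels: the seed, the object names tips_split, tips_train, tips_test, tips_rec1, and tips_spec, and the Model 1 formula (Tip ~ Party + Age) are assumptions, while tips_wflow1 matches the workflow used for resampling below.

```r
library(tidymodels)

# Split data into training and test sets (default prop = 3/4)
set.seed(123)                       # seed value is an assumption
tips_split <- initial_split(tips)
tips_train <- training(tips_split)
tips_test  <- testing(tips_split)

# Create recipe (no preprocessing steps, matching the printout below)
tips_rec1 <- recipe(Tip ~ Party + Age, data = tips_train)

# Specify model
tips_spec <- linear_reg() |>
  set_engine("lm")

# Create workflow
tips_wflow1 <- workflow() |>
  add_recipe(tips_rec1) |>
  add_model(tips_spec)

tips_wflow1
```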
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Randomly split your training data into 3 partitions:
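A sketch of creating the folds and fitting the workflow to each resample; the seed is an assumption, and the folds object name matches the code used later in this section:

```r
# Create 3 folds from the training data
set.seed(456)                       # seed value is an assumption
folds <- vfold_cv(tips_train, v = 3)

# Fit the workflow to each resample; the printout below shows the result
tips_fit_rs1 <- tips_wflow1 |>
  fit_resamples(folds)

tips_fit_rs1
```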
# Resampling results
# 3-fold cross-validation
# A tibble: 3 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [84/42]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [84/42]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [84/42]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
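The averaged metrics below come from collect_metrics(); a minimal sketch:

```r
# RMSE and R-squared averaged across the three assessment sets
collect_metrics(tips_fit_rs1)
```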
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 2.10 3 0.243 Preprocessor1_Model1
2 rsq standard 0.591 3 0.0728 Preprocessor1_Model1
Note: These are calculated using the assessment data
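The metrics for each individual fold can be obtained by turning off summarization; a minimal sketch:

```r
# Metrics computed on each fold's assessment set separately
collect_metrics(tips_fit_rs1, summarize = FALSE)
```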
# A tibble: 6 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold1 rmse standard 2.04 Preprocessor1_Model1
2 Fold1 rsq standard 0.736 Preprocessor1_Model1
3 Fold2 rmse standard 1.71 Preprocessor1_Model1
4 Fold2 rsq standard 0.509 Preprocessor1_Model1
5 Fold3 rmse standard 2.54 Preprocessor1_Model1
6 Fold3 rsq standard 0.528 Preprocessor1_Model1
Cross validation RMSE stats:
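A sketch of how such summary statistics might be computed from the fold-level metrics (the particular statistics reported here are an assumption):

```r
# Minimum, maximum, mean, and standard deviation of RMSE across folds
collect_metrics(tips_fit_rs1, summarize = FALSE) |>
  filter(.metric == "rmse") |>
  summarise(
    min_rmse  = min(.estimate),
    max_rmse  = max(.estimate),
    mean_rmse = mean(.estimate),
    sd_rmse   = sd(.estimate)
  )
```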
# Function to get Adj R-sq, AIC, BIC
calc_model_stats <- function(x) {
  glance(extract_fit_parsnip(x)) |>
    select(adj.r.squared, AIC, BIC)
}

# Fit model and calculate statistics for each fold
tips_fit_rs1 <- tips_wflow1 |>
  fit_resamples(resamples = folds,
                control = control_resamples(extract = calc_model_stats))

tips_fit_rs1
# Resampling results
# 3-fold cross-validation
# A tibble: 3 × 5
splits id .metrics .notes .extracts
<list> <chr> <list> <list> <list>
1 <split [84/42]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
2 <split [84/42]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
3 <split [84/42]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
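The table below gathers the extracted statistics from each fold. One way to do this is by unnesting the .extracts column; this is a sketch, not necessarily the code used in lecture:

```r
# Pull the per-fold adjusted R-squared, AIC, and BIC out of .extracts
tips_fit_rs1 |>
  select(id, .extracts) |>
  unnest(.extracts) |>
  unnest(.extracts) |>
  select(adj.r.squared, AIC, BIC, Fold = id)
```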
# A tibble: 3 × 4
adj.r.squared AIC BIC Fold
<dbl> <dbl> <dbl> <chr>
1 0.550 369. 386. Fold1
2 0.659 377. 394. Fold2
3 0.718 337. 354. Fold3
Note: These are based on the model fit from the analysis data
To illustrate how CV works, we used v = 3:

- This was useful for illustrative purposes, but v = 3 is a poor choice in practice
- Values of v are most often 5 or 10; we generally prefer 10-fold cross-validation as a default
Cross validation for:

- model evaluation
- model comparison