library(tidyverse)
library(tidymodels)
library(fivethirtyeight)
library(knitr)
# add other packages as needed
Lab 05: Candy Competition
Feature engineering + Model comparison
Due:
- Friday, October 20, 11:59pm (Tuesday labs)
- Friday, October 20, 11:59pm (Thursday labs)
Introduction
In today’s lab you will analyze data about candy that was collected from an online experiment conducted at FiveThirtyEight.
Learning goals
By the end of the lab you will be able to
- describe the components of a recipe
- fit a model using recipes
- compare models
- continue developing a collaborative workflow with your teammates
Getting started
- A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
- Go to the sta210-fa23 organization on GitHub. Click on the repo with the prefix lab-05. It contains the starter documents you need to complete the lab.
- Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
- Do not make any changes to the
.qmd
file until the instructions tell you do to so.
Workflow: Using Git and GitHub as a team
There are no Team Member markers in this lab; however, you should use a similar workflow as in Lab 04. Only one person should type in the group’s .qmd file at a time. Once that person has finished typing the group’s responses, they should render, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.
Every teammate must have at least one commit in the lab. Everyone is expected to contribute to discussion even when they are not typing.
Packages
The following packages are used in the lab.
Data: Candy
The data from this lab comes from the the article FiveThirtyEight The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect data, Hickey and collaborators at FiveThirtyEight set up an experiment people could vote on a series of randomly generated candy matchups (e.g. Reeses vs. Skittles). Click here to check out some of the match ups.
The data set contains the characteristics and win percentage from 85 candies in the experiment. The variables are
Variable | Description |
---|---|
chocolate |
Does it contain chocolate? |
fruity |
Is it fruit flavored? |
caramel |
Is there caramel in the candy? |
peanutalmondy |
Does it contain peanuts, peanut butter or almonds? |
nougat |
Does it contain nougat? |
crispedricewafer |
Does it contain crisped rice, wafers, or a cookie component? |
hard |
Is it a hard candy? |
bar |
Is it a candy bar? |
pluribus |
Is it one of many candies in a bag or box? |
sugarpercent |
The percentile of sugar it falls under within the data set. Values 0 - 1. |
pricepercent |
The unit price percentile compared to the rest of the set. Values 0 - 1. |
winpercent |
The overall win percentage according to 269,000 matchups. Values 0 - 100. |
Use the code below to get a glimpse of the candy_rankings
data frame in the fivethirtyeight R package.
glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
Exercises
The goal of this analysis is to use multiple linear regression to understand the factors that make a good candy.
Exercise 1
Use
ggplot
to make a graph of your choice exploring the relationship between at least 3 variables in the data set.Write two observations from your graph.
Exercise 2
Split the data into training (80%) and testing sets (20%). Call the training set candy_train
and the testing set candy_test
. Use a seed of 2
.
Exercise 3
Below is a recipe for a model that uses the characteristics of candy to understand variability in the win percentage. The lines of the recipe
code are labeled Line 1 - Line 9. Describe what each line of code does. The explanation should be written comprehensively and specifically enough that someone could replicate the data manipulation steps based on your description.
For example, if a line of code is step_center(X)
, a comprehensive and specific explanation something similar to the following: “This line of code mean centers the variable \(X\) by subtracting \(\bar{X}\) from each value of \(X\) in the training data.”
Use the Recipes Function Reference page as a resource to learn more about the step_
functions.
#Line 1
<- recipe(winpercent ~ ., data = candy_train) |>
candy_rec #Line 2
update_role(competitorname, new_role = "ID") |>
# Line 3
step_cut(sugarpercent, breaks = c(0, 0.25, 0.5, 0.75, 1)) |>
#Line 4
step_mutate(pricepercent = pricepercent * 100) |>
#Line 5
step_dummy(all_nominal_predictors()) |>
#Line 6
step_interact(terms =~ pricepercent:chocolate) |>
#Line 7
step_interact(terms =~ peanutyalmondy:chocolate) |>
#Line 8
step_rm(fruity, caramel, hard, pluribus, bar, nougat, crispedricewafer) |>
#Line 9
step_zv(all_predictors())
Exercise 4
Fill in the code to use prep
and bake
for a preview of what will happen when the recipe in Exercise 3 is applied.
|>
candy_rec prep() |>
bake(_____) |>
glimpse()
How many terms (not including the intercept) will be in the model produced by this recipe?
Exercise 5
Fill in the code to specify model, build the model workflow using the recipe in Exercise 2, and fit the model to the training data. Then, neatly display the model using 3 digits.
#specify the model
<- linear_reg() |>
candy_spec set_engine("lm")
#build model workflow
<- workflow() |>
candy_workflow add_model(_____) |>
add_recipe(_____)
# fit the model
<- candy_workflow |>
candy_fit fit(data = _____)
Exercise 6
Interpret the following in the context of the data:
Intercept
Coefficient of
sugarpercent_X.0.75.1.
Coefficient of
pricepercent_x_chocolateTRUE
Effect of
pricepercent
for chocolate candy
Exercise 7
Let’s consider another model. Use the recipe
workflow to fit a new model that meets the following criteria:
Includes variables
chocolate
,pricepercent
,crispedricewafer
,peanutyalmondy
,sugarpercent
Update
pricepercent
so it ranges from 0 to 100 (instead of 0 to 1)Makes
sugarpercent
a factor where the levels equal the four quartiles: 0 - 0.25, 0.25 - 0.50, 0.50 - 0.75, 0.75 - 1Includes the interaction between
pricepercent
andpeanutyalmondy
Neatly display the model using 3 digits.
See the Function Reference page on recipes.tidymodels.org to find the appropriate recipe functions.
Exercise 8
Consider the model from Exercise 5 “Model 1” and the model fit in Exercise 7 “Model 2”. Use the
glance()
function to calculate \(R^2\) for both of these models.Which model would you choose based on \(R^2\)? Briefly explain your choice.
Exercise 9
We will use RMSE to evalulate the predictive performance of each model on the testing data.
- Use the code below to calculate predicted values and RMSE for Model 1 on the testing data. Then get the RMSE for Model 2 on the testing data.
<- predict(candy_fit, candy_test) |>
predict_test1 bind_cols(candy_test)
<- rmse(predict_test1, truth = winpercent, estimate = .pred) rmse1
- Which model would you choose based on RMSE on the testing data? Briefly explain your choice.
Exercise 10
- Use the model you selected in Exercise 9 to describe what generally makes a good candy, i.e., one with a high win percentage.
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
One team member submit the assignment:
- Go to http://www.gradescope.com and log in using your NetID credentials.
- Click on your STA 210 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Select all team members’ names, so they receive credit on the assignment. Click here for video on adding team members to assignment on Gradescope.
- Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.
Grading
Total points available: 50 points.
Component | Points |
---|---|
Ex 1 | 6 |
Ex 2 | 3 |
Ex 3 | 5 |
Ex 4 | 1 |
Ex 5 | 2 |
Ex 6 | 8 |
Ex 7 | 8 |
Ex 8 | 4 |
Ex 9 | 4 |
Ex 10 | 4 |
Workflow & formatting | 51 |
Footnotes
The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.↩︎