HW 03: Multiple linear regression, Part 2

due Wednesday, November 1 at 11:59pm

Introduction

In this analysis you will use multiple linear regression to fit and evaluate models using characteristics of LEGO sets to understand variability in the price.

Learning goals

In this assignment, you will…

  • Use exploratory data analysis to inform feature engineering steps
  • Use cross validation to evaluate and compare models
  • Assess model conditions and multicollinearity
  • Use inference to draw conclusions
  • Continue developing a workflow for reproducible data analysis.

Getting started

The repo for this assignment is available on GitHub at github.com/sta210-fa23 and starts with the prefix hw-03. See Lab 01 for more detailed instructions on getting started.

Packages

The following packages will be used in this assignment:

library(tidyverse)
library(tidymodels) 
library(knitr) 
library(rms)
library(patchwork)
# add other packages as needed
Important

All narrative should be written in complete sentences, and all visualizations should have informative titles and axis labels.

Data: LEGOs

The data for this analysis includes information about LEGO sets from themes produced January 1, 2018 and September 11, 2020. The data were originally scraped from Brickset.com, an online LEGO set guide and were obtained for this assignment from Peterson and Ziegler (2021).

You will work with data on about 400 randomly selected LEGO sets produced during this time period. The primary variables are interest in this analysis are

  • Pieces: Number of pieces in the set from brickset.com.
  • Minifigures: Number of minifigures (LEGO people) in the set scraped from brickset.com. LEGO sets with no minifigures have been coded as NA. NA’s also represent missing data. This is due to how brickset.com reports their data.
  • Amazon_Price: Amazon price of the set scraped from brickset.com (in U.S. dollars)
  • Size: General size of the interlocking bricks (Large = LEGO Duplo sets - which include large brick pieces safe for children ages 1 to 5, Small = LEGO sets which- include the traditional smaller brick pieces created for age groups 5 and - older, e.g., City, Friends)
  • Theme: Theme of the LEGO set
  • Year : Year the LEGO set was produced
  • Pages: Number of pages in the instruction booklet

The data are contained in lego-sample.csv. Use the code below to read in the data, replace the NAs in Minifigure with 0, and remove any observations that have missing values for the relevant variables.

legos <- read_csv("data/lego-sample.csv") |>
  select(Size, Pieces, Theme, Amazon_Price, Year, Pages, Minifigures) |>
  mutate(Minifigures = replace_na(Minifigures, 0)) |>
  drop_na()

Exercises

Important

All narrative should be written in complete sentences, and all visualizations should have informative titles and axis labels.

Exercise 1

In this analysis, we dropped observations that have missing values for some of the relevant variables. What is a disadvantage of dropping observations that have missing values, instead of using a method to impute, i.e., fill in, the missing data? How might dropping these observations impact the generalizability of conclusions? What is the disadvantage of replacing the NAs in Minifigures to 0?

Exercise 2

Visualize the distributions of the predictor variables Pieces, Size, Year, and Pages. Neatly arrange the plots using the patchwork package.

Exercise 3

What are some feature engineering steps you might use to prepare the variables in the previous exercise for the model? Describe the steps and the function you would use. The list should incorporate at least three different step_ functions.

Tip

Use the recipes reference page for a list of step_ functions.

Exercise 4

The distribution of Theme is shown below. The bars are ordered by the frequency they occur in the data set.

legos |>
  count(Theme) |>
ggplot(aes(x = fct_reorder(Theme, n), y = n)) +
  geom_col() + 
    labs(title = "Lego Set Theme", 
         x = "Theme", 
         y = "Number of LEGO sets") + 
  coord_flip()

What is one reason we should avoid putting the variable Theme in a model as is?

Exercise 5

We will use v-fold cross validation to compare two models. We’ll start by preparing the data, creating the folds, and defining the model specification that will be used for both models.

  • Split the data into training (75%) and testing (25%) sets. Use a seed of 6.

  • Split the training data into 12 folds. Set the seed to 6 again.

  • Specify the model.

Exercise 6

Use the training data to create a recipe for the first model. The model uses the variables Size, Theme, and Pages to predict Amazon_Price. Conduct the following feature engineering steps:

  • Use step_other() to collapse Theme into fewer categories. Define the threshold such that any levels of Theme with fewer than 20 observations is defined as “Other”.
  • Mean-center Pages.
  • Make dummy variables for all categorical predictors.
  • Remove any predictors with zero variance.

Then create the workflow that brings together this recipe and the model specification from the previous exercise.

Exercise 7

Conduct 12-fold cross validation using the workflow from the previous exercise. Calculate and display mean RMSE across the 12 folds.

Note

We will just use RMSE to compare models for this assignment; however, in practice, it is best to take into account multiple model fit statistics to get a more holistic evaluation and comparison of the models.

Exercise 8

Now let’s consider a new model that includes all the variables used in model from Exercise 6 along with Year, Pieces and Minifigures.

  • Use the training data to create a recipe that uses all the feature engineering steps in Exercise 6 with the addition of the following steps:

    • Create a new variable called since2018 that calculates the number of years since 2018.
    • Remove Year as a potential predictor.
    • Mean-center Pieces.
  • Create the workflow that brings together this recipe and the model specification from Exercise 5.

  • Conduct 12-fold cross validation using this model workflow. Calculate and display mean RMSE across the 12 folds.

Exercise 9

Compare the cross validation results from Exercises 7 and 8. Which model do you select based on RMSE? Briefly explain your choice.

Exercise 10

Refit the selected model on the entire training data. Neatly display the model using 3 digits.

Then, calculate VIF for the model and use it to comment on whether there are potential issues with multicollinearity.

Tip

When we fit a model using recipe and workflow, we need to extract the model object before using augment or vif functions. Fill in the name of the selected model in both blanks in the code below to extract the model object and calculate VIF

legos_fit_model <- extract_fit_parsnip(______)
vif(legos_fit_model$fit)

Exercise 11

Calculate RMSE on the training data and on the testing data. Use it to comment on how well the model performs on new data and whether there are signs of model overfit.

Exercise 12

Though we do not check the model conditions in this assignment, complete this exercise assuming the model conditions are met.

Describe the effect of Theme on the price of LEGO sets, including an indication of which levels are statistically significant. Use a threshold of 0.05 to determine significance.

Before submitting, make sure you render your document and commit (with a meaningful commit message) and push all updates.

Exercise 13

The World Bank collects “world development indicators” about the past and current development of countries. These data are made available on the World Bank’s website. It can be used to understand the relationships between these various factors and trends over time.

For this analysis, we focus on indicators from 2011 on 165 countries. The variables of interest are:

  • gdp.per.capita: gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.

  • sanit.access.factor: Population access to sanitation facilities (Low, High)

  • edu.expend: Government expenditure on education, total (% of government expenditure)

  • life.expect: Life expectancy at birth (in years)

You fit a model using sanitation access, education expenditures, and life expectancy to understand variability in GDP. The model takes the form

\[ \begin{aligned}\widehat{\log(GDP)} = \hat{\beta}_0 &+ \hat{\beta}_1 ~ sanit.access.factor + \hat{\beta}_2 ~ edu.expend + \hat{\beta}_3 ~life.expect \\ &+ \hat{\beta}_4 ~ sanit.access.factor \times life.expect\end{aligned} \]

The output from R is shown below.

term estimate std.error statistic p.value
(Intercept) 4.491 1.638 2.742 0.007
sanit.access.factorhigh -6.993 1.971 -3.548 0.001
edu.expend 0.097 0.038 2.550 0.012
life.expect 0.030 0.029 1.061 0.291
sanit.access.factorhigh:life.expect 0.122 0.032 3.853 0.000
  • Interpret the coefficient of edu.expend in the context of the data.

  • Interpret the coefficient of life.expect for countries with high sanitation access in the context of the data.

    Important

    Write all interpretations in terms of the original units, not the log-transformed units.

Exercise 14

Below are plots from the exploratory data analysis of the relationships between the predictor variables. Based on these plots, what appears to be a potential issue with the model from Exercise 13? Briefly explain your response.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
  • Click on your STA 210 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 points)

Component Points
Ex 1 - 14 47
Workflow & formatting 31

References

Peterson, Anna D., and Laura Ziegler. 2021. “Building a Multiple Linear Regression Model With LEGO Brick Data.” Journal of Statistics and Data Science Education 29 (3): 297–303. https://doi.org/10.1080/26939169.2021.1946450.

Footnotes

  1. The “Workflow & formatting” grade is to assess the reproducible workflow and document format. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date in the YAML.↩︎↩︎