library(tidyverse)
library(tidymodels)
library(knitr)
library(patchwork) #arrange plots in a grid
AE 07: Model evaluation
Songs on Spotify
Go to the course GitHub organization and locate your ae-07 repo to get started.
Render, commit, and push your responses to GitHub by the end of class. The responses are due in your GitHub repo no later than Thursday, September 28 at 11:59pm.
Data
The data set for this assignment is a subset from the Spotify Songs Tidy Tuesday data set. The data were originally obtained from Spotify using the spotifyr R package.
It contains numerous characteristics for each song. You can see the full list of variables and definitions here. This analysis will focus specifically on the following variables:
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
spotify <- read_csv("data/spotify-popular.csv")
Are high energy songs more positive? To answer this question, we’ll analyze data on some of the most popular songs on Spotify, i.e. those with track_popularity >= 80. We’ll use linear regression to fit a model to predict a song’s positiveness (valence) based on its energy level (energy).
Below are plots as part of the exploratory data analysis.
p1 <- ggplot(data = spotify, aes(x = valence)) +
  geom_histogram() +
  labs(title = "Distribution of Valence",
       subtitle = " for Popular songs on Spotify",
       x = "Valence")
p2 <- ggplot(data = spotify, aes(x = energy)) +
  geom_histogram() +
  labs(title = "Distribution of Energy",
       subtitle = "for Popular songs on Spotify",
       x = "Energy")
p1 + p2
ggplot(data = spotify, aes(x = energy, y = valence)) +
  geom_point() +
  labs(title = "Valence vs. Energy",
       subtitle = "Popular songs on Spotify",
       x = "Energy",
       y = "Valence")
Model with 90% CI for coefficients
spotify_fit <- linear_reg() |>
  fit(valence ~ energy, data = spotify)
tidy(spotify_fit, conf.int = TRUE, conf.level = 0.9) |>
  kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.121 | 0.035 | 3.401 | 0.001 | 0.062 | 0.179 |
energy | 0.614 | 0.054 | 11.321 | 0.000 | 0.525 | 0.703 |
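The conf.low and conf.high columns have the usual form estimate ± t* × std.error. For the energy row, 0.614 ± t* × 0.054 with t* close to 1.65 roughly reproduces (0.525, 0.703). The sketch below illustrates that calculation on simulated data, not the Spotify data, and uses base R's lm() in place of the tidymodels workflow so that it runs on its own. The data frame toy and its coefficients are made up for illustration.

```r
# Simulated stand-in for the Spotify subset (hypothetical values, not course data)
set.seed(1)
toy <- data.frame(energy = runif(50))
toy$valence <- 0.1 + 0.6 * toy$energy + rnorm(50, sd = 0.2)

toy_lm <- lm(valence ~ energy, data = toy)
coefs <- summary(toy_lm)$coefficients

# Critical value for a 90% interval: 5% in each tail, n - 2 degrees of freedom
t_star <- qt(0.95, df = df.residual(toy_lm))

# estimate +/- t* x std.error for the intercept and the slope
manual_ci <- cbind(
  conf.low  = coefs[, "Estimate"] - t_star * coefs[, "Std. Error"],
  conf.high = coefs[, "Estimate"] + t_star * coefs[, "Std. Error"]
)

manual_ci
confint(toy_lm, level = 0.90)  # built-in interval, for comparison
```

The manual interval and confint() should agree, which is a quick way to check that the confidence level and degrees of freedom are what you intended.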
Create augmented data frame
spotify_aug <- augment(spotify_fit$fit)
Create residual plots
ggplot(data = spotify_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(x = "Fitted values",
       y = "Residuals",
       title = "Residuals vs. fitted")
resid_hist <- ggplot(data = spotify_aug, aes(x = .resid)) +
  geom_histogram() +
  labs(x = "Residuals",
       title = "Distribution of residuals")
resid_qq <- ggplot(data = spotify_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical quantiles",
       y = "Observed residuals",
       title = "Normal QQ-plot")
resid_hist + resid_qq
Exercise 1
Assess the four model conditions. State whether the condition is satisfied and briefly explain your response.
- Linearity
- Constant variance
- Normality
- Independence
Exercise 2
Calculate \(R^2\) and interpret this value in the context of the data.
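As a reminder of what \(R^2\) measures, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. It uses simulated data and base R's lm(), so the numbers are not the answer to this exercise. The data frame toy is hypothetical.

```r
# Simulated stand-in for the Spotify subset (hypothetical values, not course data)
set.seed(2)
toy <- data.frame(energy = runif(50))
toy$valence <- 0.1 + 0.6 * toy$energy + rnorm(50, sd = 0.2)

toy_lm <- lm(valence ~ energy, data = toy)

# Variation left unexplained by the model
ss_resid <- sum(residuals(toy_lm)^2)

# Total variation in the response around its mean
ss_total <- sum((toy$valence - mean(toy$valence))^2)

# Proportion of variation in valence explained by energy
r_sq <- 1 - ss_resid / ss_total

r_sq
summary(toy_lm)$r.squared  # value reported by lm(), for comparison
```

With a single predictor, this value also equals the squared correlation between the predictor and the response.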
Exercise 3
Calculate RMSE and interpret this value in the context of the data.
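RMSE is the square root of the average squared residual, so it is in the same units as the response. The sketch below shows the calculation on simulated data with base R's lm(). It is an illustration under made-up values, not the answer for the Spotify model, and the data frame toy is hypothetical.

```r
# Simulated stand-in for the Spotify subset (hypothetical values, not course data)
set.seed(3)
toy <- data.frame(energy = runif(50))
toy$valence <- 0.1 + 0.6 * toy$energy + rnorm(50, sd = 0.2)

toy_lm <- lm(valence ~ energy, data = toy)

# Root mean squared error: square the residuals, average over n, take the root
rmse <- sqrt(mean(residuals(toy_lm)^2))

# Residual standard error from lm() divides by n - 2 instead of n,
# so it is slightly larger than the RMSE above
sigma_hat <- summary(toy_lm)$sigma

rmse
sigma_hat
```

The two quantities are close for moderate sample sizes, but they are not the same number, so it is worth stating which one you are reporting.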
Evaluate response from ChatGPT
I typed the following prompt into ChatGPT (“an AI powered language model”): “How do you check the model conditions for simple linear regression?”
The response from ChatGPT shown below would receive partial credit. For each condition, state whether the response is (1) correct; (2) incorrect, and if so, what is wrong; or (3) incomplete, and if so, what is missing.
In simple linear regression, you’re trying to establish a linear relationship between two variables: a dependent variable (Y) and an independent variable (X). The model conditions or assumptions for simple linear regression are important to check to ensure that your model is valid and reliable. Here are the key assumptions and how to check them:
1. Linearity Assumption:
- Assumption: The relationship between the independent variable (X) and the dependent variable (Y) is linear.
- How to Check: Create a scatterplot of Y against X. If the points roughly form a straight line, the linearity assumption is likely met.
2. Independence Assumption:
- Assumption: The residuals (the differences between observed Y and predicted Y) are independent of each other.
- How to Check: Plot the residuals against the independent variable (X) or the predicted values. Look for patterns or trends in the residuals. There should be no clear structure or autocorrelation in the residuals.
3. Homoscedasticity [Constant variance] Assumption:
- Assumption: The variance of the residuals is constant across all levels of the independent variable.
- How to Check: Plot the residuals against the predicted values or the independent variable (X). The spread of the residuals should remain roughly constant across the range of X.
4. Normality Assumption:
- Assumption: The residuals follow a normal distribution.
- How to Check: Create a histogram or a normal probability plot (Q-Q plot) of the residuals. If the residuals are approximately normally distributed, they should follow a bell-shaped curve in the histogram and lie along a straight line in the Q-Q plot.
It’s important to note that these assumptions may not always hold perfectly in real-world data, and minor violations may not necessarily invalidate your model. However, if you detect significant violations of these assumptions, you may need to consider alternative modeling techniques or transformations of your data to address the issues.
You will start to think about the answer in class and come up with a final answer with your group in Lab 04.
To submit the AE:
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your ae-07 repo on GitHub. (You do not submit AEs on Gradescope).