Lab 04: Alumni jobs

Collaborative workflows

Important

Due:

  • Friday, September 29, 11:59pm (Tuesday labs)
  • Sunday, October 1, 11:59pm (Thursday labs)

In today’s lab, you’ll use simple linear regression to analyze the relationship between the median early career salary and percent of alumni who perceive their job as making the world a better place for colleges and universities in the United States. You will also start working with your lab teams, write a team agreement, and be introduced to using GitHub for collaboration.

Learning goals

By the end of the lab you will be able to…

  • assess model conditions for simple linear regression.
  • calculate and interpret predicted values from the model.
  • assess model fit using \(R^2\) and RMSE
  • collaborate with others using a single Github repo

Meet your team!

Click here to see the team assignments for STA 210. This will be your team for labs and the final project. Before you get started on the lab, complete the following:

✅ Come up with a team name. You can’t use the same name as another team, so I encourage you to be creative! Your TA will get your team name by the end of lab.

✅ Fill out the team agreement. This will help you figure out a plan for communication,and working together during labs and outside of lab times. You can find the team agreement in the GitHub repo team-agreement-[github_team_name].

  • Have one person from the team clone the repo and start a new RStudio project. This person will type the team’s responses as you discuss the sections of the agreement. No one else in the team should type at this point but should be contributing to the discussion.

  • Be sure to push the completed agreement to GitHub. Each team member can refer to the document in this repo or download the PDF of the agreement for future reference. You do not need to submit the agreement on Gradescope.

Getting started

  • A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
  • Go to the sta210-fa23 organization on GitHub. Click on the repo with the prefix lab-04. It contains the starter documents you need to complete the lab.
  • Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
  • Do not make any changes to the .qmd file until the instructions tell you do to so.

Workflow: Using Git and GitHub as a team

Important

Assign each person on your team a number 1 through 4. For teams of three, Team Member 1 can take on the role of Team Member 4.

The following exercises must be done in order. Only one person should type in the .qmd file, commit, and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

⌨️ Team Member 1: Hands on the keyboard.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!1

Change the author to your team name and include each team member’s name in the author field of the YAML in the following format: Team Name: Member 1, Member 2, Member 3, Member 4.

Team Member 1: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub.


Team Members 2, 3, 4: Once Team Member 1 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the updated name in your .qmd file.

Packages

We’ll use the following packages in this lab.

library(tidyverse)
library(tidymodels)
library(knitr)
library(patchwork)
# add more packages as needed

Data: Alumni jobs

Today’s data set is part of the TidyTuesday College tuition, diversity, and pay.

The information in this data set was collected from the PayScale College Salary Report.

variable class description
rank double Potential salary rank within state
name character Name of school
state_name character state name
early_career_pay double Median salary for alumni with 0 - 5 years experience (in US dollars)
mid_career_pay double Median salary for alumni with 0 - 5 years experience (in US dollars)
make_world_better_percent double Percent of alumni who think they are making the world a better place
stem_percent double Percent of degrees awarded in science, technology, engineering, or math subjects
alumni <- read_csv("data/alumni-salaries.csv")

Exercises

Is there a relationship the typical early career pay for alumni and the percent of alumni who receive a STEM degree? To answer this question, we will use regression to predict the early career pay using the percent of STEM degrees.

Team Member 1: Type the team’s responses to exercises 1 - 2.


Exercise 1

Fit a linear model that can be used to predict the typical (median) early career pay based on the percent of STEM degrees. Display the model output using 3 digits for numerical values.

Exercise 2

Calculate the predicted values and residuals from your model and save these results in a data frame. Print the first five rows of the new data frame.

Team Member 1: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 1- 2.

Team Member 2: It’s your turn! Type the team’s response to exercise 3.

Exercise 3

Before using the model for prediction, let’s check the model conditions. To figure out how to do so, we type the following prompt into ChatGPT (“an AI powered language model”) “How do you check the model conditions for simple linear regression?”

The response from ChatGPT shown below would receive partial credit. For each condition, state whether it is correct or incorrect/ incomplete. If it incorrect/incomplete, state what is wrong or missing, then write the correct response.

In simple linear regression, you’re trying to establish a linear relationship between two variables: a dependent variable (Y) and an independent variable (X). The model conditions or assumptions for simple linear regression are important to check to ensure that your model is valid and reliable. Here are the key assumptions and how to check them:

1. Linearity Assumption:

  • Assumption: The relationship between the independent variable (X) and the dependent variable (Y) is linear.
  • How to Check: Create a scatterplot of Y against X. If the points roughly form a straight line, the linearity assumption is likely met.

2. Independence Assumption:

  • Assumption: The residuals (the differences between observed Y and predicted Y) are independent of each other.
  • How to Check: Plot the residuals against the independent variable (X) or the predicted values. Look for patterns or trends in the residuals. There should be no clear structure or autocorrelation in the residuals.

3. Homoscedasticity [Constant variance] Assumption:

  • Assumption: The variance of the residuals is constant across all levels of the independent variable.
  • How to Check: Plot the residuals against the predicted values or the independent variable (X). The spread of the residuals should remain roughly constant across the range of X.

4. Normality Assumption:

  • Assumption: The residuals follow a normal distribution.
  • How to Check: Create a histogram or a normal probability plot (Q-Q plot) of the residuals. If the residuals are approximately normally distributed, they should follow a bell-shaped curve in the histogram and lie along a straight line in the Q-Q plot.

It’s important to note that these assumptions may not always hold perfectly in real-world data, and minor violations may not necessarily invalidate your model. However, if you detect significant violations of these assumptions, you may need to consider alternative modeling techniques or transformations of your data to address the issues.

✅ ⬆️ Team Member 2: Render, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercise 3.

Team Member 3: It’s your turn! Type the team’s response to exercises 4 - 5.

Important

You can proceed assuming all the model conditions are satisfied.

Exercise 4

  • 31 percent of the alumni at Johns Hopkins University received a STEM degree. Calculate the predicted early career pay at this university and the appropriate 95% interval. Interpret the interval in the context of the data.
  • The actual early career pay at this university is $67,200. Is this what you expected based on your analysis? Briefly explain your reasoning.

Exercise 5

Next, let’s consider how well the model fits the relationship between the early career pay and percent with a STEM degree.

  • Calculate \(R^2\) and interpret it in the context of the data.

  • Calculate RMSE and interpret it in the context of the data.

Team Member 3: Render, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercise 4 - 5.

Team Member 4: It’s your turn! Type the team’s response to exercise 6.

Exercise 6

Do you think the model is useful for understanding and predicting the typical early career pay for alumni at a university? Briefly explain your reasoning.

Team Member 4: Render, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercise 6.

Team Member 1: It’s your turn! Type the team’s response to exercise 7.

Exercise 7

Fit another model which predicts the typical early career pay using only the percentage of alumni that believe their career will make the world a better place.

  • Neatly display this model with 3 digits.
  • Which model do you think is better - this model or the model fit in Exercise 1? Explain your response showing any analysis used to make the decision.

Team Member 1: Render, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and the rest of the team can see the completed lab.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the team’s completed lab!

Wrapping up

Important

⌨️ Team Member 3: Hands on the keyboard. Make any edits as needed.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed!

Team Member 3: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

Team Members 1, 2, 4: Once Team Member 3 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the final version of your .qmd file.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

One team member submit the assignment:

  • Go to http://www.gradescope.com and log in using your NetID credentials.
  • Click on your STA 210 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Select all team members’ names, so they receive credit on the assignment. Click here for video on adding team members to assignment on Gradescope.
  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading (50pts)

Component Points
Ex 1 3
Ex 2 4
Ex 3 9
Ex 4 7
Ex 5 6
Ex 6 4
Ex 7 6
Workflow & formatting 42
Complete team agreement 5
Submitted team name 2

Footnotes

  1. Don’t trust yourself to keep your hands off the keyboard? Put them in your pocket or cross your arms. No matter how silly it might feel, resist the urge to touch your keyboard until otherwise instructed!↩︎

  2. The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.↩︎↩︎