Goals

Run and interpret linear regression models using a tidy framework.
Use models for explanation and prediction.
Practice inference for linear regression using a CLT approach.

Getting started

Every team member should go to the course GitHub organization and locate their team's lab08 repository (its name begins with lab08). Copy the repository URL and clone the remote repo in RStudio.

As you work on this lab, merge conflicts may arise. Refer back to Lab #05 for how to fix them. You and your team are free to divide up the work however you think is best. However, everyone should understand all of the code in the lab’s final submission.

Packages

library(tidyverse)
library(broom)
library(viridis)

Data

In this lab you will work with demographic data on Midwestern counties from the midwest dataset, which ships with ggplot2 (loaded as part of the tidyverse). You can learn more about this dataset by typing ?midwest into the console.
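
As a quick orientation, the sketch below previews the columns most relevant to this lab. It assumes the column names county, state, percollege, and percbelowpoverty listed in the ggplot2 documentation for midwest; the object name midwest_small is made up for illustration. Check ?midwest for the full variable descriptions.

```r
library(tidyverse)

# Preview every column and its type
glimpse(midwest)

# Keep only the columns used in the exercises
# (column names assumed from ?midwest; midwest_small is an illustrative name)
midwest_small <- midwest %>%
  select(county, state, percollege, percbelowpoverty)

# Numeric summary of the two percentage variables
summary(midwest_small %>% select(percollege, percbelowpoverty))
```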

Exercises

Do Midwestern counties with a higher percentage of people with a college degree have a lower poverty rate? Using ggplot, make a scatterplot with percentage of people with a college degree as the explanatory variable and the percentage of the total population below the poverty line as the response variable. Make sure to label your axes and give the plot a title. Discuss what your scatterplot shows and comment on the linearity assumption.
Run a linear regression with percentage with a college degree as the explanatory variable and poverty rate as the response variable. Write out the model and interpret both the slope and intercept in the context of the problem.
Assess the model fit by obtaining the \(R^2\). What does your value mean? Is it high or low?
Construct and interpret a tidy 95% confidence interval for \(\beta_{\text{college}}\).
In Summit County, Ohio, 24.7% of the population has a college degree. What does the model predict the poverty rate will be in Summit County? What is the actual poverty rate? What is the difference between these two values called and what is its value? Hint: use augment().
Does the state a county is located in matter in terms of predicting the poverty rate? Run a model with the poverty rate as the response variable. Consider carefully how you will include state in the model. Interpret your results and write out the model. Hint: use factor().
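
The mechanics behind the exercises above can be sketched as follows. This is one possible workflow, not a model answer: the object names (p, fit, coefs, fit_stats, fit_aug, summit, fit_state, state_coefs) are illustrative, the column names are assumed from ?midwest, and the state model shown is only one way to include state. The interpretation of each output is left to you.

```r
library(tidyverse)
library(broom)

# Scatterplot: explanatory variable on x, response on y
p <- ggplot(midwest, aes(x = percollege, y = percbelowpoverty)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "Percent with a college degree",
    y = "Percent below the poverty line",
    title = "Poverty rate vs. college education in Midwestern counties"
  )
p

# Simple linear regression of poverty rate on percent college-educated
fit <- lm(percbelowpoverty ~ percollege, data = midwest)

# Tidy coefficient table with 95% confidence intervals
coefs <- tidy(fit, conf.int = TRUE, conf.level = 0.95)

# Model-level summaries; the r.squared column holds R^2
fit_stats <- glance(fit)

# Fitted values (.fitted) and residuals (.resid) for every county;
# passing data = midwest keeps the county and state columns
fit_aug <- augment(fit, data = midwest)

# Assumption: county names are stored in upper case and states as
# two-letter abbreviations; verify with count(midwest, state)
summit <- fit_aug %>%
  filter(county == "SUMMIT", state == "OH") %>%
  select(county, state, percollege, percbelowpoverty, .fitted, .resid)

# One possible specification with state as a categorical predictor;
# factor() makes R create an indicator for each non-baseline state
fit_state <- lm(percbelowpoverty ~ percollege + factor(state), data = midwest)
state_coefs <- tidy(fit_state, conf.int = TRUE)
```

Printing coefs, fit_stats, summit, and state_coefs in your R Markdown document displays the numbers you need to interpret. In the state model, each factor(state) coefficient is a difference relative to the baseline state, which R chooses as the first level alphabetically unless you relevel the factor.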

Submission

Upload your team’s PDF to Gradescope. Include every team member’s name in the Gradescope submission and identify which problems are on each page. Associate the “Overall” section with the first page of your PDF.

Include all team members’ names with the team name in the author portion of the YAML header.

You must have at least three meaningful commits.

There should only be one submission per team on Gradescope.

References

Midwest Demographics. Dataset in ggplot2. https://ggplot2.tidyverse.org/reference/midwest.html

Lab #08: Linear Regression