Do not attempt to facet by every player.
For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.
For your first commit, update your author name in the YAML header of the template R Markdown file.
All plots should follow the best visualization practices discussed in lecture, including an informative title, labeled axes, and careful consideration of aesthetic choices.
All code should follow the tidyverse style guidelines, including not exceeding the 80 character limit.
These are large datasets so do not print out the full tibbles. Doing so will cause problems when you knit to PDF.
For every join
function you should explicitly specify the by
argument
We will work with the tidyverse
package as usual.
library(tidyverse)
In this assignment you will work with four related datasets to answer research questions about baseball. The datasets are available in the data folder and are adapted from the Lahman
package in R
. This package contains a huge amount of information about batting, pitching, fielding, teams, etc. We will only examine a small portion, but I encourage you to explore on your own when you have finished the assignment.
No knowledge of baseball is required to complete this homework other than the definitions provided below. If you want more information check out the link here: https://www.youtube.com/watch?v=skOsApsF0jQ.
The pitcher throws the baseball from the pitcher’s mound towards the catcher, and the batter tries to hit it. If the pitch is successful and the batter misses (or doesn’t swing) it is called a strike. Three times is a strikeout.
An earned run occurs when a player advances around all of the bases without errors by the defense. The earned run average (era) is the average of earned runs given up by a pitcher per nine innings pitched (the standard length of a game of baseball). Low values indicate a good pitcher.
earned run average=9×earned runs allowedinnings pitched
A brief description of the datasets and how they are related to each other is provided below.
The pitching
dataset contains information on Major League Baseball pitchers. Observations are uniquely identified by player_id
, year_id
, and stint
.
Variable | Description |
---|---|
player_id |
player ID code |
year_id |
year |
stint |
player’s order of appearances within a season on different teams |
team_id |
team |
league_id |
league |
earned_runs |
earned runs |
strike_outs |
strikeouts |
innings_pitched |
innings pitched |
The people
dataset contains player characteristics. Observations are uniquely identified by player_id
.
Variable | Description |
---|---|
player_id |
player ID code |
birth_year |
birth year of player |
first_name |
player first name |
last_name |
player last name |
The salaries
dataset contains salaries of players. Observations are uniquely identified by year_id
, team_id
, and player_id
.
Variable | Description |
---|---|
year_id |
year |
team_id |
team |
player_id |
player ID code |
league_id |
league |
salary |
salary |
The teams
dataset contains yearly statistics and standings for teams. Observations are uniquely identified by year_id
and team_id
.
Variable | Description |
---|---|
year_id |
year |
team_id |
team |
games |
games played |
ws_win |
World Series winner (Y or N) |
wins |
wins |
team_name |
team name |
Use the teams
data to find each team’s win percentage for the years 2011 through 2016. Store the results in team_stats
. This dataset should have three columns (year_id
, team_id
, and win_pct
) and 180 rows (6 years × 30 teams = 180 rows).
Use an appropriate join function to add salary information to players in the pitching
data. You should only include observations that appear in both pitching
and salaries
. Then, using a group_by()
paired with a summarize()
, create a new column giving the total amount of money (sum()
) spent on pitching salaries by each team in each year. Store the results in team_spending
. This dataset should have three columns (year_id
, team_id
, and pitching_salaries
) and 918 rows.
Use an appropriate function to join team_stats
and team_spending
and use faceting to create subplots of win percentage versus pitching spending for each year.
strike_outs
in the pitching
dataset) in each year, by league, from 1901 to the most recent year in the dataset. Store the results in top_so_by_year
and glimpse()
the dataset. Your final dataset should have 243 rows (there are some ties) and the columns year_id
, league_id
,first_name
, and last_name
, where the name information is from the people
dataset.Some pitchers play for more than one team in a single year. For example, Aroldis Chapman (chapmar01
) played with both the World Series winning 2016 Chicago Cubs and the New York Yankees in 2016. During his first stint with the Yankees he had 7 earned runs and 31.1 innings pitched, and in his second stint with the Cubs he had 3 earned runs and 26.2 innings pitched.
player_era
. You should have columns player_id
, year_id
, total_strike_outs
, and era
, where total_strike_outs
refers to the total number of strikeouts for that year and era
refers to the earned run average for that year. Chapman should have a single row for 2016 and his 2016 earned run average (era
) should be 1.57.9×7+331.1+26.2≈1.57
Join the people
data to the player_era
data and mutate()
a new variable with each player’s approximate age. Call this dataset player_era_age
. Then, create a new dataset called pitching_stats_by_age
with the median earned run average, median number of strike outs, and count of players for all possible ages using group_by()
and summarize()
.
Construct a plot of median strike outs versus age with points sized by how many pitchers are that age. Describe what you observe.
Choosing the top pitchers of all time is a contentious topic, but a good argument can be made for the following players: Sandy Koufax, Clayton Kershaw, Bob Gibson, Roger Clemens, Greg Maddux, Cy Young, Walter Johnson, Randy Johnson, and Christy Mathewson. The player_id
for each of these players is provided below.
<- c("koufasa01", "kershcl01", "gibsobo01",
top_players "clemero02", "maddugr01", "youngcy01",
"johnswa01", "johnsra05", "mathech01")
player_era_age
data so that it only includes the top pitchers above and use faceting to create subplots of era
versus age
for each player. Label the subplots by “Last Name, First Name” instead of player_id
(the paste()
function will be helpful here). Briefly describe what you observe.Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.