Homework #02: Data Wrangling and Joins

due Thu, Feb 11 11:59 PM

Do not attempt to facet by every player.

Goals

For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.

For your first commit, update your author name in the YAML header of the template R Markdown file.

All plots should follow the best visualization practices discussed in lecture, including an informative title, labeled axes, and careful consideration of aesthetic choices.

All code should follow the tidyverse style guidelines, including not exceeding the 80 character limit.

These are large datasets so do not print out the full tibbles. Doing so will cause problems when you knit to PDF.

For every join function you should explicitly specify the by argument

Setup

Pitching Analysis

We will work with the tidyverse package as usual.

library(tidyverse)

In this assignment you will work with four related datasets to answer research questions about baseball. The datasets are available in the data folder and are adapted from the Lahman package in R. This package contains a huge amount of information about batting, pitching, fielding, teams, etc. We will only examine a small portion, but I encourage you to explore on your own when you have finished the assignment.

No knowledge of baseball is required to complete this homework other than the definitions provided below. If you want more information check out the link here: https://www.youtube.com/watch?v=skOsApsF0jQ.

The pitcher throws the baseball from the pitcher’s mound towards the catcher, and the batter tries to hit it. If the pitch is successful and the batter misses (or doesn’t swing) it is called a strike. Three times is a strikeout.

An earned run occurs when a player advances around all of the bases without errors by the defense. The earned run average (era) is the average of earned runs given up by a pitcher per nine innings pitched (the standard length of a game of baseball). Low values indicate a good pitcher.

earned run average=9×earned runs allowedinnings pitched

A brief description of the datasets and how they are related to each other is provided below.

The pitching dataset contains information on Major League Baseball pitchers. Observations are uniquely identified by player_id, year_id, and stint.

Variable Description
player_id player ID code
year_id year
stint player’s order of appearances within a season on different teams
team_id team
league_id league
earned_runs earned runs
strike_outs strikeouts
innings_pitched innings pitched

The people dataset contains player characteristics. Observations are uniquely identified by player_id.

Variable Description
player_id player ID code
birth_year birth year of player
first_name player first name
last_name player last name

The salaries dataset contains salaries of players. Observations are uniquely identified by year_id, team_id, and player_id.

Variable Description
year_id year
team_id team
player_id player ID code
league_id league
salary salary

The teams dataset contains yearly statistics and standings for teams. Observations are uniquely identified by year_id and team_id.

Variable Description
year_id year
team_id team
games games played
ws_win World Series winner (Y or N)
wins wins
team_name team name

Win Percentage and Spending

  1. Use the teams data to find each team’s win percentage for the years 2011 through 2016. Store the results in team_stats. This dataset should have three columns (year_id, team_id, and win_pct) and 180 rows (6 years × 30 teams = 180 rows).

  2. Use an appropriate join function to add salary information to players in the pitching data. You should only include observations that appear in both pitching and salaries. Then, using a group_by() paired with a summarize(), create a new column giving the total amount of money (sum()) spent on pitching salaries by each team in each year. Store the results in team_spending. This dataset should have three columns (year_id, team_id, and pitching_salaries) and 918 rows.

  3. Use an appropriate function to join team_stats and team_spending and use faceting to create subplots of win percentage versus pitching spending for each year.

Top Pitchers

Some pitchers play for more than one team in a single year. For example, Aroldis Chapman (chapmar01) played with both the World Series winning 2016 Chicago Cubs and the New York Yankees in 2016. During his first stint with the Yankees he had 7 earned runs and 31.1 innings pitched, and in his second stint with the Cubs he had 3 earned runs and 26.2 innings pitched.

  1. Create a new dataset containing the yearly earned run average for each player, summarizing across all teams played within each year. Name your dataset player_era. You should have columns player_id, year_id, total_strike_outs, and era, where total_strike_outs refers to the total number of strikeouts for that year and era refers to the earned run average for that year. Chapman should have a single row for 2016 and his 2016 earned run average (era) should be 1.57.

9×7+331.1+26.21.57

  1. Join the people data to the player_era data and mutate() a new variable with each player’s approximate age. Call this dataset player_era_age. Then, create a new dataset called pitching_stats_by_age with the median earned run average, median number of strike outs, and count of players for all possible ages using group_by() and summarize().

  2. Construct a plot of median strike outs versus age with points sized by how many pitchers are that age. Describe what you observe.

Choosing the top pitchers of all time is a contentious topic, but a good argument can be made for the following players: Sandy Koufax, Clayton Kershaw, Bob Gibson, Roger Clemens, Greg Maddux, Cy Young, Walter Johnson, Randy Johnson, and Christy Mathewson. The player_id for each of these players is provided below.

top_players <- c("koufasa01", "kershcl01", "gibsobo01", 
                 "clemero02", "maddugr01", "youngcy01", 
                 "johnswa01", "johnsra05", "mathech01")
  1. Filter the player_era_age data so that it only includes the top pitchers above and use faceting to create subplots of era versus age for each player. Label the subplots by “Last Name, First Name” instead of player_id (the paste() function will be helpful here). Briefly describe what you observe.

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.