Lab #03: Data Wrangling

due Sun, Feb 7 11:59 PM

Goals

Getting started

Go to https://classroom.github.com/a/jQxDe3IE to create your private repository for lab #03 on GitHub. Then follow the steps covered in lab and lecture to clone the repo and create a new project in RStudio.

Then, configure git in the console as we have done in previous labs, using your GitHub username and email address.

library(usethis)
use_git_config(user.name = "GitHub username", user.email = "your email")

Open the lab03.Rmd template and update the YAML header with your name and today’s date. Then, knit the document and make sure the resulting PDF file has the correct date. Stage, commit, and push your changes.

Write your answers in the lab03.Rmd template file. Your assignment should have at least three meaningful commits and all code chunks should have informative names.

Lego Analysis

We will examine a dataset from the Brickset website containing characteristics of lego sets manufactured between 1961 and 2019. The variables in the dataset are described below.

Variable        Description
id              set id
name            name of set
themegroup      themegroup of set
theme           theme of set
subtheme        subtheme of set
year            year released
pieces          number of pieces
minifigs        number of minifigs
package         type of packaging
retail_price    recommended retail price in dollars

We first load the tidyverse as usual.

library(tidyverse)

And then read in the data.

lego <- read_csv("data/lego.csv")

  1. Some sets have missing information for retail_price, pieces, or both. This could be because the sets are free (giveaways), because they aren’t traditional lego sets (comic books, etc.), or simply because the information was not recorded. Filter the lego dataset based on the specifications below and store the results in lego using <-. In addition, describe the implications of removing these sets.

Your resulting dataset should have:

  2. Arrange the dataset in descending order of retail_price and print the first three rows. Report in words the names of the three most expensive lego sets, their prices, and how many pieces each has.

  3. It appears that the most expensive sets generally have more pieces. Use mutate() to create a new variable price_per_piece, representing the price in dollars per piece for each of the sets. Store the new variable in lego using <-.

  4. Arrange the lego dataset in descending order of price_per_piece and return only the columns name, themegroup, theme, pieces, and price_per_piece, and only the first five rows. What do you notice about these sets?

  5. Return a new dataset containing the cheapest and most expensive lego sets (based on retail_price) in each subtheme, considering only sets with the Lord of the Rings theme.

  6. Use group_by() and summarize() to create a new dataset with one row for each year, and columns for the year, the number of sets released in that year, and the median price per piece for sets from that year. Name this dataset yearly_trends.

  7. Create a plot of the median price per piece over time using the yearly_trends data. Size points according to the number of sets produced in that year. Adjust transparency, color, etc. as appropriate and remember the principles of effective data visualization. Comment on what you observe.
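
The verbs used in the exercises above can be sketched on a small made-up tibble. This is only an illustration under stated assumptions: the tibble toy and its columns (item, year, price, units) are hypothetical and are not part of the lego data. The filtering rule shown, dropping rows with missing values, is one possible specification and not necessarily the one required in the exercises.

```r
library(dplyr)
library(ggplot2)

# Made-up data for illustration only; this is NOT the lego dataset.
toy <- tibble(
  item  = c("a", "b", "c", "d", "e", "f"),
  year  = c(2001, 2001, 2002, 2002, 2002, 2003),
  price = c(10, NA, 30, 12, 45, 8),
  units = c(100, 50, 150, NA, 300, 20)
)

# Keep rows where both price and units are recorded (an assumed rule),
# then add a ratio column with mutate() and store the result with <-.
toy <- toy %>%
  filter(!is.na(price), !is.na(units)) %>%
  mutate(price_per_unit = price / units)

# Sort from most to least expensive, keep a few columns and the top two rows.
top_two <- toy %>%
  arrange(desc(price)) %>%
  select(item, price, units) %>%
  slice(1:2)

# Cheapest and most expensive item within each year.
extremes <- toy %>%
  group_by(year) %>%
  filter(price == min(price) | price == max(price)) %>%
  ungroup()

# One row per year: number of items and median ratio.
by_year <- toy %>%
  group_by(year) %>%
  summarize(n_items = n(), median_ratio = median(price_per_unit))

# Map point size to the number of items per year; alpha sets transparency.
p <- ggplot(by_year, aes(x = year, y = median_ratio, size = n_items)) +
  geom_point(alpha = 0.6) +
  labs(x = "Year", y = "Median price per unit", size = "Number of items")
```

Printing p draws the plot. Because size is mapped inside aes(), ggplot2 adds a legend for the point sizes automatically, while alpha is set outside aes() so that it applies uniformly to all points.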

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark where the answer to each exercise is located. If an answer spans multiple pages, mark all of those pages. Associate the “Overall” section with the first page.