filter()
: pick rows matching criteria
select()
: pick columns by name
mutate()
: add new variables
slice()
: pick rows using indices
arrange()
: reorder rows
group_by()
: for grouped operations
summarize()
: calculate summary statistics

Click the link https://classroom.github.com/a/jQxDe3IE to create your private repository for lab #03 on GitHub. Follow the steps provided in lab and lecture to clone the repo and create a new project in RStudio.
Then, configure git in the console as we have done in previous labs, using your GitHub username and email address.
library(usethis)
use_git_config(user.name = "GitHub username", user.email="your email")
Open the lab03.Rmd template and update the YAML header with your name and today’s date. Then, knit the document and make sure the resulting PDF file has the correct date. Stage, commit, and push your changes.
Write your answers in the lab03.Rmd template file. Your assignment should have at least three meaningful commits and all code chunks should have informative names.
We will examine a dataset containing characteristics of lego sets manufactured between 1961 and 2019 from the BRICKSET website. Variables in the dataset are described below.
Variable | Description |
---|---|
id | set id |
name | name of set |
themegroup | themegroup of set |
theme | theme of set |
subtheme | subtheme of set |
year | year released |
pieces | number of pieces |
minifigs | number of minifigs |
package | type of packaging |
retail_price | recommended retail price in dollars |
We first load the tidyverse as usual.
library(tidyverse)
And then read in the data.
lego <- read_csv("data/lego.csv")
Some sets have no meaningful value for retail_price or pieces or both. This could be because the sets are free (giveaways), they aren’t traditional lego sets (comic books, etc.), or just because the information is missing. Filter the lego dataset based on the specifications below and store the results in lego using <-. In addition, describe the implications of removing these sets.
Your resulting dataset should have:
pieces
pieces
retail_price
retail_price
year
Arrange the dataset in descending order of retail_price and print the first three rows. Report in words the names of the three most expensive lego sets, their prices, and how many pieces each has.
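The arrange-then-print pattern can be sketched on a small made-up table. The tibble toy_sets and its values are invented for illustration and are not taken from lego.csv; the same verbs should carry over to the real data.

```r
library(tidyverse)

# Made-up toy data for illustration only (not from lego.csv)
toy_sets <- tibble(
  name = c("A", "B", "C", "D"),
  pieces = c(100, 2500, 40, 900),
  retail_price = c(9.99, 199.99, 4.99, 79.99)
)

top_three <- toy_sets %>%
  arrange(desc(retail_price)) %>%  # highest price first
  slice(1:3)                       # keep the first three rows

top_three
```

desc() reverses the sort order for one column; without it, arrange() sorts from smallest to largest.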
It appears that the most expensive sets generally have more pieces. Use mutate() to create a new variable price_per_piece, representing the price in dollars per piece for each of the sets. Store the new variable in lego using <-.
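As a minimal sketch of how mutate() adds a column computed from existing ones, the example below uses a made-up tibble, toy_sets, whose values are invented for illustration.

```r
library(tidyverse)

# Made-up toy data for illustration only (not from lego.csv)
toy_sets <- tibble(
  name = c("A", "B", "C"),
  pieces = c(100, 200, 50),
  retail_price = c(10, 30, 20)
)

# mutate() returns a new tibble, so the result must be assigned
# back with <- for the new column to be kept
toy_sets <- toy_sets %>%
  mutate(price_per_piece = retail_price / pieces)

toy_sets
```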
Arrange the lego dataset in descending order of price_per_piece and return only the first five rows, keeping the columns name, themegroup, theme, pieces, and price_per_piece. What do you notice about these sets?
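Several verbs can be chained in one pipeline. The sketch below sorts, keeps a subset of columns, and then keeps the first five rows, again on a made-up tibble (toy_sets, invented values, fewer columns than the real data).

```r
library(tidyverse)

# Made-up toy data for illustration only (not from lego.csv)
toy_sets <- tibble(
  name = c("A", "B", "C", "D", "E", "F"),
  theme = c("T1", "T1", "T2", "T2", "T3", "T3"),
  pieces = c(10, 20, 50, 100, 200, 400),
  retail_price = c(5, 4, 15, 15, 20, 100)
) %>%
  mutate(price_per_piece = retail_price / pieces)

top_five <- toy_sets %>%
  arrange(desc(price_per_piece)) %>%              # largest ratio first
  select(name, theme, pieces, price_per_piece) %>% # keep only these columns
  slice(1:5)                                       # first five rows

top_five
```

The order of select() and slice() does not matter here, but arrange() has to come before slice() so that the rows kept are the top ones.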
Return a new dataset containing the cheapest and most expensive lego sets (based on retail_price) in each subtheme, considering only sets with the Lord of the Rings theme.
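One possible approach, sketched here on made-up data, is to filter to a single theme, group by subtheme, and keep rows whose price equals the group minimum or maximum. The tibble toy_sets and the theme labels "X" and "Y" are placeholders; how the theme is actually spelled in lego.csv should be checked in the data.

```r
library(tidyverse)

# Made-up toy data for illustration only (not from lego.csv)
toy_sets <- tibble(
  name = c("a", "b", "c", "d", "e", "f"),
  theme = c("X", "X", "X", "X", "X", "Y"),
  subtheme = c("s1", "s1", "s1", "s2", "s2", "s3"),
  retail_price = c(10, 20, 30, 5, 15, 99)
)

extremes <- toy_sets %>%
  filter(theme == "X") %>%  # placeholder theme label
  group_by(subtheme) %>%
  # inside a grouped filter(), min() and max() are computed per group
  filter(retail_price == min(retail_price) |
           retail_price == max(retail_price)) %>%
  ungroup()

extremes
```

With this approach, ties are all kept, and a subtheme containing a single set returns that set once.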
Use group_by() and summarize() to create a new dataset with one row for each year, and columns for the year, the number of sets released in that year, and the median price per piece for sets from that year. Name this dataset yearly_trends.
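The group_by() and summarize() pattern collapses each group to one row. A minimal sketch on made-up data follows; the tibble toy_sets and the column names n_sets and median_ppp are hypothetical choices, not names required by the lab.

```r
library(tidyverse)

# Made-up toy data for illustration only (not from lego.csv)
toy_sets <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001),
  price_per_piece = c(0.1, 0.2, 0.3, 0.4, 0.6)
)

toy_trends <- toy_sets %>%
  group_by(year) %>%
  summarize(
    n_sets = n(),                         # number of rows in each group
    median_ppp = median(price_per_piece)  # median within each group
  )

toy_trends
```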
Create a plot of the median price per piece over time using the yearly_trends data. Size points according to the number of sets produced in that year. Adjust transparency, color, etc. as appropriate and remember the principles of effective data visualization. Comment on what you observe.
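Mapping a variable to point size can be sketched as below. The tibble toy_trends, its values, and its column names n_sets and median_ppp are made up for illustration; the color, transparency, and labels are only one possible choice.

```r
library(tidyverse)

# Made-up summary data for illustration only
toy_trends <- tibble(
  year = c(2000, 2001, 2002, 2003),
  n_sets = c(3, 10, 25, 40),
  median_ppp = c(0.20, 0.18, 0.15, 0.12)
)

p <- ggplot(toy_trends, aes(x = year, y = median_ppp)) +
  geom_line(color = "gray50") +
  # size is mapped inside aes() because it varies with the data;
  # alpha is set outside aes() because it is a fixed value
  geom_point(aes(size = n_sets), alpha = 0.6, color = "steelblue") +
  labs(
    x = "Year",
    y = "Median price per piece (dollars)",
    size = "Number of sets"
  )

p
```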
Knit the document to PDF. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark where the answer to each exercise is. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.