Homework #01: Data Visualization

due Thu, Feb 04 11:59 PM

Goals

For this assignment, you must make at least three commits, and every code chunk must have a meaningful name.

For your first commit, update your author name in the YAML header of the template R Markdown file.

Clone assignment repo + start new project

Packages

We will work with the tidyverse package as usual.

library(tidyverse)

Diamond Prices

In this assignment, you will investigate diamond prices using a sample of 1,000 diamonds. Build effective and well-labeled visualizations to answer the questions below. For each question, show your code and output, and write your answers in complete sentences.

All plots should follow the best visualization practices discussed in lecture. Plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
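As a reminder of what a labeled plot looks like, here is a minimal sketch. It deliberately uses the built-in mpg data rather than the assignment data, and the title, axis text, and theme are illustrative choices, not required wording.

```r
library(ggplot2)

# Illustration only: mpg ships with ggplot2 and is not the assignment data.
labeled_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +  # some transparency to reduce overplotting
  labs(
    title = "Highway mileage tends to fall as engine size grows",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (miles per gallon)"
  ) +
  theme_minimal()

labeled_plot
```

Note that the axis labels state units, and the title says what the reader should take away rather than restating the variable names.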

In addition, lines of code and narrative should not exceed the 80-character limit. See the Lab #01 instructions for setting a vertical line at 80 characters in your R Markdown file.

We will only examine a subset of the data, so include the code below in a code chunk at the start of your R Markdown file.

set.seed(1)
diamonds_subset <- diamonds %>%
  filter(carat <= 2.5) %>%
  slice_sample(n = 1000)
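
The call to set.seed() is what makes the random sample reproducible, so everyone works with the same 1,000 diamonds. The sketch below illustrates this by drawing the subset twice with the same seed and comparing the results. The helper draw_subset() is a hypothetical name introduced only for this illustration; it is not part of the assignment template.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Hypothetical helper: repeats the subsetting code for a given seed.
draw_subset <- function(seed) {
  set.seed(seed)
  diamonds %>%
    filter(carat <= 2.5) %>%
    slice_sample(n = 1000)
}

first_draw <- draw_subset(1)
second_draw <- draw_subset(1)

# Same seed, same rows in the same order.
identical(first_draw, second_draw)
```

Changing the seed would, in general, give a different set of rows, which is why the seed should be left as given.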

  1. How many rows are in the diamonds_subset dataset? How many columns?

  2. Examine the documentation of the diamonds dataset by running ?diamonds in the console. What is the meaning of clarity? What is the worst clarity? What is the meaning of color? What is the best color? Note that we are investigating a subset of the data (1,000 diamonds rather than the more than 50,000 in the full dataset).

  3. Construct a scatterplot of price versus carat. Describe the relationship.

  4. Color the points in the price versus carat scatterplot by the diamond’s color. Describe the relationship.

  5. Add a geom_smooth() for each color and add the argument se = FALSE to omit the bands surrounding the smooth.

  6. Examine the relationship between price and carat by clarity, using a separate scatterplot for each clarity.

  7. Create a bar chart showing all of the colors, with the count of diamonds on the y-axis.

  8. Create a segmented bar chart showing one bar per color, each bar going from 0 to 1, with the fill determined by cut.

  9. Create a segmented bar chart showing one bar per color, each bar going from 0 to 1, with the fill determined by price. Does this plot work? Why or why not?

  10. Create side-by-side boxplots of price for each color and comment on the relationship. Then construct a violin plot using geom_violin(). What do the violin plots reveal that boxplots do not? What do boxplots reveal that violin plots do not?

  11. Come up with a research question based on these data and write it down. Then, create an effective data visualization that answers the question and write a brief paragraph explaining how your visualization answers the question. Your plot should be substantially and noticeably different from the plots you created above. Do not simply switch variables or make a minor modification. Be creative and have fun!
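
For the se = FALSE argument mentioned in exercise 5, the following sketch shows the general pattern on the built-in mpg data, not on the assignment data. The choice of variables and labels is illustrative only.

```r
library(ggplot2)

# Illustration only: one smooth per drive type, without the
# confidence bands that geom_smooth() draws by default.
smooth_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Highway mileage versus engine size, by drive type",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (miles per gallon)",
    color = "Drive type"
  )

smooth_plot
```

Because color is mapped inside the top-level aes(), both the points and the smooths inherit it, which is what produces a separate curve per group.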

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit, mark the pages where the answer to each exercise appears. If an answer spans multiple pages, mark all of them. Associate the “Overall” section with the first page.