class: center, middle, inverse, title-slide

# Data Wrangling II

---

Click the link below to create the repository for lecture notes #05

- https://classroom.github.com/a/cy3k00Il

Follow the steps to clone the repo, make a new RStudio project, and configure git.

Change the author name in the YAML header of `lecture05.Rmd` to your name and update the date to today's date.

<br>

Complete the Lab Team Formation Survey by 2-09 11:59 PM. Do this now if you have time.

- [https://forms.gle/mZMZ53Zfy3yPHwdJ6](https://forms.gle/mZMZ53Zfy3yPHwdJ6)

---

class: center, middle

# Data Wrangling Quiz

---

### Question #1

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
```
]

#### Wrangled Data

.reallytiny[
```
## # A tibble: 234 x 3
##     year   cty   hwy
##    <int> <int> <int>
##  1  1999    18    29
##  2  1999    21    29
##  3  2008    20    31
##  4  2008    21    30
##  5  1999    16    26
##  6  1999    18    26
##  7  2008    18    27
##  8  1999    18    26
##  9  1999    16    25
## 10  2008    20    28
## # … with 224 more rows
```
]

---

### Question #2

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
```
]

#### Wrangled Data

.reallytiny[
```
## # A tibble: 45 x 11
##    manufacturer model  displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>  <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4       1.8  1999     4 manual… f        21    29 p     compact
##  2 audi         a4       2    2008     4 auto(a… f        21    30 p     compact
##  3 chevrolet    malibu   2.4  2008     4 auto(l… f        22    30 r     midsize
##  4 honda        civic    1.6  1999     4 manual… f        28    33 r     subcom…
##  5 honda        civic    1.6  1999     4 auto(l… f        24    32 r     subcom…
##  6 honda        civic    1.6  1999     4 manual… f        25    32 r     subcom…
##  7 honda        civic    1.6  1999     4 manual… f        23    29 p     subcom…
##  8 honda        civic    1.6  1999     4 auto(l… f        24    32 r     subcom…
##  9 honda        civic    1.8  2008     4 manual… f        26    34 r     subcom…
## 10 honda        civic    1.8  2008     4 auto(l… f        25    36 r     subcom…
## # … with 35 more rows
```
]

---

### Question #3

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
```
]

#### Wrangled Data

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 volkswagen   jetta     1.9  1999     4 manual… f        33    44 d     compa…
##  2 volkswagen   new be…   1.9  1999     4 manual… f        35    44 d     subco…
##  3 volkswagen   new be…   1.9  1999     4 auto(l… f        29    41 d     subco…
##  4 toyota       corolla   1.8  2008     4 manual… f        28    37 r     compa…
##  5 honda        civic     1.8  2008     4 auto(l… f        25    36 r     subco…
##  6 honda        civic     1.8  2008     4 auto(l… f        24    36 c     subco…
##  7 toyota       corolla   1.8  1999     4 manual… f        26    35 r     compa…
##  8 toyota       corolla   1.8  2008     4 auto(l… f        26    35 r     compa…
##  9 honda        civic     1.8  2008     4 manual… f        26    34 r     subco…
## 10 honda        civic     1.6  1999     4 manual… f        28    33 r     subco…
## # … with 224 more rows
```
]

---

### Question #4

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
```
]

#### Wrangled Data

.reallytiny[
```
## # A tibble: 22 x 3
##    manufacturer drv       n
##    <chr>        <chr> <int>
##  1 audi         4        11
##  2 audi         f         7
##  3 chevrolet    4         4
##  4 chevrolet    f         5
##  5 chevrolet    r        10
##  6 dodge        4        26
##  7 dodge        f        11
##  8 ford         4        13
##  9 ford         r        12
## 10 honda        f         9
## # … with 12 more rows
```
]

---

### Question #5

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
```
]

#### Wrangled Data

.reallytiny[
```
## # A tibble: 234 x 12
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p
##    class   mpg_ratio
##    <chr>       <dbl>
##  1 compact     0.621
##  2 compact     0.724
##  3 compact     0.645
##  4 compact     0.7
##  5 compact     0.615
##  6 compact     0.692
##  7 compact     0.667
##  8 compact     0.692
##  9 compact     0.64
## 10 compact     0.714
## # … with 224 more rows
```
]

---

### Question #6

.reallytiny[
```
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…     1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…     1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…     2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
```
]

#### Wrangled Data

.reallytiny[
```
## # A tibble: 7 x 5
##   class      min_mpg mean_mpg median_mpg max_mpg
##   <chr>        <int>    <dbl>      <dbl>   <int>
## 1 2seater         23     24.8       25        26
## 2 compact         23     28.3       27        44
## 3 midsize         23     27.3       27        32
## 4 minivan         17     22.4       23        24
## 5 pickup          12     16.9       17        22
## 6 subcompact      20     28.1       26        44
## 7 suv             12     18.1       17.5      27
```
]

---

## Joining Demo

.pull-left[
```r
x
```

```
## # A tibble: 3 x 2
##   value xcol
##   <dbl> <chr>
## 1     1 x1
## 2     2 x2
## 3     3 x3
```
]

.pull-right[
```r
y
```

```
## # A tibble: 3 x 2
##   value ycol
##   <dbl> <chr>
## 1     1 y1
## 2     2 y2
## 3     4 y4
```
]

---

## `inner_join()`

.pull-left[
```r
inner_join(x, y)
```

```
## # A tibble: 2 x 3
##   value xcol  ycol
##   <dbl> <chr> <chr>
## 1     1 x1    y1
## 2     2 x2    y2
```
]

.pull-right[
<img src="img/05/inner-join.gif" style="display: block; margin: auto;" />
]

---

## `left_join()`

.pull-left[
```r
left_join(x, y)
```

```
## # A tibble: 3 x 3
##   value xcol  ycol
##   <dbl> <chr> <chr>
## 1     1 x1    y1
## 2     2 x2    y2
## 3     3 x3    <NA>
```
]

.pull-right[
<img src="img/05/left-join.gif" style="display: block; margin: auto;" />
]

---

## `right_join()`

.pull-left[
```r
right_join(x, y)
```

```
## # A tibble: 3 x 3
##   value xcol  ycol
##   <dbl> <chr> <chr>
## 1     1 x1    y1
## 2     2 x2    y2
## 3     4 <NA>  y4
```
]

.pull-right[
<img src="img/05/right-join.gif" style="display: block; margin: auto;" />
]

---

## `full_join()`

.pull-left[
```r
full_join(x, y)
```

```
## # A tibble: 4 x 3
##   value xcol  ycol
##   <dbl> <chr> <chr>
## 1     1 x1    y1
## 2     2 x2    y2
## 3     3 x3    <NA>
## 4     4 <NA>  y4
```
]

.pull-right[
<img src="img/05/full-join.gif" style="display: block; margin: auto;" />
]

---

## `semi_join()`

.pull-left[
```r
semi_join(x, y)
```

```
## # A tibble: 2 x 2
##   value xcol
##   <dbl> <chr>
## 1     1 x1
## 2     2 x2
```
]

.pull-right[
<img src="img/05/semi-join.gif" style="display: block; margin: auto;" />
]

---

## `anti_join()`

.pull-left[
```r
anti_join(x, y)
```

```
## # A tibble: 1 x 2
##   value xcol
##   <dbl> <chr>
## 1     3 x3
```
]

.pull-right[
<img src="img/05/anti-join.gif" style="display: block; margin: auto;" />
]
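
---

## Joining Demo: Trying It Yourself

The join slides print `x` and `y` but not the code that builds them. Here is a minimal sketch for reproducing the demo, assuming the two tables were created with `tibble()` (the construction code is not shown in the slides, so this is a guess that matches the printed output). The `by = "value"` argument is an addition that makes the join key explicit; without it, dplyr joins on the shared column `value` and prints a message saying so.

```r
library(dplyr)

# Assumed construction of the demo tables (not shown in the slides)
x <- tibble(value = c(1, 2, 3), xcol = c("x1", "x2", "x3"))
y <- tibble(value = c(1, 2, 4), ycol = c("y1", "y2", "y4"))

# Mutating joins: add the columns of y to x
inner_join(x, y, by = "value") # only rows whose value appears in both tables
left_join(x, y, by = "value")  # every row of x; ycol is NA where y has no match
right_join(x, y, by = "value") # every row of y; xcol is NA where x has no match
full_join(x, y, by = "value")  # every row of both tables

# Filtering joins: keep or drop rows of x, never add columns from y
semi_join(x, y, by = "value")  # rows of x that have a match in y
anti_join(x, y, by = "value")  # rows of x that have no match in y
```

Running each call and comparing the result with the matching slide is a quick way to check which rows each join keeps.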
---

## Code Style

>"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
>
>Hadley Wickham

- The style guide for this course is based on the Tidyverse style guide
- [http://style.tidyverse.org/](http://style.tidyverse.org/)
- There's more to it than what we'll cover today. We'll mention more as we introduce more functionality throughout the semester.

---

### File names and code chunk labels

- Do not use spaces in file names; use `-` or `_` to separate words.
- Use all lowercase letters.

```r
# Good
ucb-admit.csv

# Bad
UCB Admit.csv
```

---

### Assignment, object creation

Use `<-`, not `=`.

```r
# Good
x <- 2

# Bad
x = 2
```

--

In an `R` chunk, Windows users can press Alt and - (the hyphen key) together as a shortcut for `<-`. Mac users can press Option and -.

---

## Object names

- Objects should be nouns (functions should be verbs).
- Use an `_` to separate words in object names.
- Use informative but short object names.
- Do not reuse object names within an analysis.
- Do not use existing function names or names that have special meaning in R, such as `NA`, `T`, `NaN`, `pi`, etc.

```r
# Good
acs_employed

# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
mean
NA
log
```

---

## Spacing

- Put a space before and after all infix operators (`=`, `+`, `-`, `<-`, etc.), including the `=` used to name arguments in function calls.
- Always put a space after a comma, and never before (just like in regular English).

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---

## `ggplot2`

- Always end a line with `+` (never start a line with it).
- Always indent the next line (this should happen automatically).

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
```

---

## Quotes

Use `" "`, not `' '`, for quoting text. The only exception is when the text already contains double quotes and no single quotes.
```r
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  labs(title = "Shine bright like a diamond", # Good
       x = "Diamond prices",                  # Good
*      y = 'Frequency')                       # Bad
```
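
The exception named on this slide can be sketched as follows. The title text is made up for illustration; single quotes are used only because the text itself contains double quotes and no single quotes.

```r
# Hypothetical title text that itself contains double quotes
plot_title <- 'The "four Cs" of diamonds'

# The same string written with double quotes needs escaped inner quotes,
# which is harder to read
plot_title_escaped <- "The \"four Cs\" of diamonds"

# Both spellings produce the same string
identical(plot_title, plot_title_escaped) # TRUE
```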