class: center, middle, inverse, title-slide

# Welcome to Intro to Data Science

---
class: center, middle

# Welcome!

---

## What is Data Science?

Data science is an emerging discipline that builds on tools from mathematics, statistics, and computer science to extract knowledge from data.

--

## Course Objectives

- explore, visualize, and analyze data in a reproducible manner
- investigate patterns, model outcomes, and make predictions
- gain experience in data wrangling and munging, exploratory data analysis, predictive modeling, and data visualization
- work on problems and case studies inspired by real-world questions and data
- effectively communicate results

---

## Where to find information

#### Course Website: https://sta199-sp21-002.netlify.app/

.pull-left[
- slides and notes
- schedule
]

.pull-right[
- syllabus and course policies
- links to other resources
]

#### Sakai: https://sakai.duke.edu

- Gradebook
- Zoom links to live lecture, office hours, and labs
- recorded videos

#### GitHub: https://github.com/sta199-sp21-002/

- assignment repos

---

## Activities and Assessments

- **Homework (25%):** Five individual assignments combining conceptual and computational skills.
- **Labs (15%):** Nine individual or team assignments focusing on computing. Designed to be completed during the official lab session.
- **Exams (40%):** Two individual take-home exams.
- **Final Project (15%):** Team final project in which you use the data science tools to answer a data-based research question.
- **Participation and Teamwork (2.5%):** Primarily completion of lecture notes. Due one week following the lecture date.
- **Statistics Experiences (2.5%):** Engage with statistics outside of the classroom and reflect on your experience.
---

## Course Structure

#### Lecture

- Focus on concepts behind data analysis
- Recorded and posted online
- A lecture notes R Markdown file will be created for you for each lecture

#### Lab

- Focus on computing in R using `tidyverse` syntax
- Apply concepts from lecture to case study scenarios
- Work on labs individually or in teams of 3-4
- Designed to be completed during the scheduled lab time
- Introductory portion will be recorded and posted online

---

## Some of what you will learn

.pull-left[
- Fundamentals of `R`
- Data visualization and wrangling with `ggplot2` and `dplyr` from the `tidyverse`
- Web scraping
- Web based applications with `RShiny`
- Spatial data visualization
]

.pull-right[
- Data types and functions
- Version control with `GitHub`
- Reproducible reports with `R Markdown`
- Regression and classification
- Statistical inference
]

---

## Textbooks

- **[OpenIntro Statistics](https://www.openintro.org/book/os/)**
  - Free online
  - Hard copies available for purchase
  - Assigned readings on statistical content

- **[R for Data Science](https://r4ds.had.co.nz/)**
  - Free online
  - Hard copies available for purchase
  - Assigned readings on R coding using `tidyverse` syntax

- **Occasional other readings**
  - Will be posted on the course webpage

---

## Where to find help in the course

- Attend **office hours** to meet with a member of the teaching team.
  - office hours begin Thursday Jan 21
  - stay after lecture if you have questions
- Use **Piazza** (free) for general questions about course content and/or assignments, since other students may benefit from the response.
- Use **chat** for questions during lecture.
- Use **email** for questions regarding personal matters and/or grades.

--

## Technical help this week

- Jan 21 6:30 - 8:30 PM
- Jan 22 8:00 - 10:00 AM

Links will be posted after class.

---

## Technology set-up

Complete the intro survey at the link below if you haven't already.
https://forms.gle/M6XLB7BoWcCLFbGa9

--

Accept the invitation to the GitHub organization when you receive it after class.

---

### How are you feeling right now?

<img src="img/01/emos.png" width="70%" style="display: block; margin: auto;" />

---

## Zoom Mechanics

- Display your first and last name

--

- Raise your hand (under "Participants")

--

- Yes, No, Go slower, Go faster (under "Participants")

--

- Applause, Thumbs up (under "Reactions")

--

- Chat the last book you read.

--

- ask questions during lecture
- answer questions during lecture
- share links

---
class: center, middle

# Toolkit

---

## What are R and RStudio?

- R is a statistical programming language
- RStudio is a convenient interface for R (an integrated development environment, IDE)
- At its simplest:<sup>*</sup>
  - R is like a car’s engine
  - RStudio is like a car’s dashboard

<img src="img/01/engine-dashboard.png" width="70%" style="display: block; margin: auto;" />

.footnote[
*Source: [Modern Dive](https://moderndive.com/)
]

---

## tidyverse

<img src="img/01/tidyverse-packages.png" width="60%" style="display: block; margin: auto;" />

- The [tidyverse](https://www.tidyverse.org/) is an **opinionated**\* collection of R packages designed for data science.
- All packages share an underlying philosophy and a common grammar.

.footnote[
Image from [Teaching in the Tidyverse 2020](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/)
]

---

## RStudio

<img src="img/01/rstudio-without-labels.png" width="718" style="display: block; margin: auto;" />

---

## RStudio

<img src="img/01/rstudio-with-labels.png" width="718" style="display: block; margin: auto;" />

---

## Accessing RStudio

- Link: https://vm-manage.oit.duke.edu/containers/rstudio
  - also on Sakai
  - also on the course webpage

--

# Let's try!

---

## Reproducibility checklist

What does it mean for a data analysis to be "reproducible"?
--

**Near-term goals:**

`\(\checkmark\)` Are the tables and figures reproducible from the code and data?

`\(\checkmark\)` Does the code actually do what you think it does?

`\(\checkmark\)` In addition to what was done, is it clear **why** it was done?

<br>

--

**Long-term goals:**

`\(\checkmark\)` Can the code be used for other data?

`\(\checkmark\)` Can you extend the code to do other things?

---
class: center, middle

# R Markdown

---

## R Markdown

- Fully reproducible reports -- the analysis is run from the beginning each time you knit
- Simple [Markdown syntax](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf) for text
- Code goes in chunks, defined by three backticks; narrative goes outside of chunks

--

## How will we use R Markdown?

- Every assignment / lab / project / etc. is an R Markdown document
- You'll always have a template R Markdown document to start with
- The amount of scaffolding in the template will decrease over the semester

---

<img src="img/01/rmarkdown-no-labels.png" width="200%" style="display: block; margin: auto;" />

---

<img src="img/01/rmarkdown-labels.png" width="200%" style="display: block; margin: auto;" />

---
class: center, middle

# Let's try!

---
class: center, middle

# Git and GitHub

---

## Version control

We will use GitHub as a platform for collaboration and version control.

##### Why do we need version control?

<img src="img/01/phd_comics_vc.gif" width="40%" style="display: block; margin: auto;" />

---

## What is versioning?

<br><br>

<img src="img/01/lego-steps.png" width="80%" style="display: block; margin: auto;" />

---

## What is versioning?

with human readable messages

<img src="img/01/lego-steps-commit-messages.png" width="80%" style="display: block; margin: auto;" />

---

<br>

<img src="img/01/git-github.png" width="80%" style="display: block; margin: auto;" />

--

- **Git** is a version control system -- like the “Track Changes” feature in Microsoft Word.
--

- **GitHub** is the home for your Git-based projects on the internet (like Dropbox but much better).

--

- There are a lot of Git commands and very few people know them all. 99% of the time you will use Git to stage, commit, push, and pull.

---
class: center, middle

# Let's try!

---

## Outline of steps

- **repository (repo)**: contains files associated with a particular project and each file's version history
- **pull**: update a local repository from a remote repo (GitHub)
- **stage**: prepare file for commit
- **commit**: save changes to local repository
- **push**: upload files to remote repository (GitHub)

- Step #1. Create / navigate to the repository on GitHub
- Step #2. Clone the GitHub repo & make a new RStudio project
- Step #3. Configure Git
- Step #4. Change a file locally
- Step #5. Stage and commit
- Step #6. Push these changes to the repo on GitHub.

---

### Step #1. Create a repository on GitHub

.vocab[repository]: contains files associated with a particular project and each file's version history

(a) Click the link below to create the repository for lecture notes #1.

- https://classroom.github.com/a/2g5MiMkx

--

(b) When prompted, select "Accept this assignment".

--

(c) Refresh the page.

--

(d) Click the link following "Your assignment repository has been created:".

**A private repository was just created for you.**

---

### Step #1. Create a repository on GitHub

<img src="img/01/private-repo.png" width="70%" style="display: block; margin: auto;" />

---

### Step #2. Clone a GitHub repo & Make a new RStudio project

(a) In the repository that was just created, click the green "Code" button, select "Use HTTPS", and click the clipboard icon to copy the repo URL.

<img src="img/01/clone-repo.png" width="70%" style="display: block; margin: auto;" />

---

### Step #2. Clone a GitHub repo & Make a new RStudio project

(b) In RStudio, click "File" `\(\rightarrow\)` "New Project" `\(\rightarrow\)` "Version Control" `\(\rightarrow\)` "Git".
--

(c) Copy and paste the URL of your assignment repo in the dialog box "Repository URL:".

--

(d) Click "Create Project" and enter your GitHub username and password when prompted.

--

(e) The files from your GitHub repo should now be displayed in the "Files" pane in RStudio.

--

The "Project" drop-down menu in the upper-right should show the project you are currently working on.

---

### Step #3. Configure Git

We need to make sure RStudio can communicate with GitHub.

(a) Type or paste the code below into the **console**.

```r
library(usethis)
use_git_config(
  user.name = "GitHub username",
  user.email = "your email"
)
```

Replace the placeholders with your GitHub username and the email associated with your GitHub account (mine is below). Press **ENTER** to run the code in the console.

```r
library(usethis)
use_git_config(
  user.name = "rdeisinger",
  user.email = "robert.eisinger@duke.edu"
)
```

---

### Step #3. Configure Git

If it worked, your console should look like the screenshot below.

<img src="img/01/config.png" width="70%" style="display: block; margin: auto;" />

---

### Step #4. Change a file locally

(a) Click the R Markdown file named lecture01.Rmd in the lower-right pane.

(b) Change the author name to your name.

(c) Knit the document.

--

Examine this file. What changed?

---

### Step #5. Stage and Commit

.vocab[stage]: prepare file for commit

.vocab[commit]: save changes to local repository

(a) Open the "Git" pane in the top-right panel.

--

(b) Click on the .Rmd file.

--

(c) Click "Diff" to see the difference between the last commit and the current state.

--

(d) If you're happy with the changes, stage them by checking the box next to the file.

--

(e) Write a meaningful commit message, such as "changed author", in the commit message box.

--

(f) Click "Commit".

---

### Step #6. Push these changes to the repo on GitHub.

- .vocab[push]: upload files to remote repository (GitHub)

(a) Click "Push".

--

- Now go to the repo on GitHub. What do you notice?
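---

### Steps #5 and #6 from the console (optional sketch)

The buttons in the "Git" pane run ordinary Git commands behind the scenes. The sketch below is one way to see stage, commit, and push as code. It is an illustration, not part of the course workflow: it assumes the `git` command-line tool is installed and a Unix-like shell (such as an RStudio container), the `git()` helper is a hypothetical wrapper written for this slide, and a throwaway "bare" repository in a temporary folder stands in for the repo on GitHub so nothing real is touched.

```r
# Hypothetical helpers (not from the slides): call the git command-line tool.
run_git <- function(...) {
  invisible(system2("git", c(...), stdout = TRUE, stderr = TRUE))
}
git <- function(dir, ...) run_git("-C", shQuote(dir), ...)

# Throwaway folders: a bare repository plays the role of the repo on GitHub.
root       <- tempfile("git-demo-")
dir.create(root)
remote     <- file.path(root, "remote.git")
local_repo <- file.path(root, "local")

run_git("init", "--bare", "--quiet", shQuote(remote))
run_git("clone", "--quiet", shQuote(remote), shQuote(local_repo))  # like Step #2

# Identity for this demo repository only (placeholder values, like Step #3).
git(local_repo, "config", "user.name", shQuote("Example Student"))
git(local_repo, "config", "user.email", shQuote("student@example.com"))

# Step #4: change a file locally.
writeLines('author: "Your Name"', file.path(local_repo, "lecture01.Rmd"))

# Step #5: stage, then commit with a message.
git(local_repo, "add", "lecture01.Rmd")
git(local_repo, "commit", "--quiet", "-m", shQuote("changed author"))

# Step #6: push the commit to the remote.
git(local_repo, "push", "--quiet", "origin", "HEAD")

# The commit message should now be visible in the remote's history.
remote_log <- git(remote, "log", "--all", "--format=%s")
print(remote_log)
```

For coursework, the "Git" pane in RStudio does the same job; the sketch only shows which command each button corresponds to.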
---

## Vocabulary

- **pull**: update a local repository from a remote repo (GitHub)
- **stage**: prepare file for commit
- **commit**: save changes to local repository
- **push**: upload files to remote repository (GitHub)

As you work on a data science project, you should periodically knit, stage, commit, and push.

If you are working on a team, there may be updates to your GitHub repo that aren't in your local repo. To make sure you are starting with the most up-to-date files, click "Pull" to update your local repo before adding any new work.

Submit your lecture notes for credit by pushing before the deadline.

---

## Review of steps

- Step #1. Create / navigate to the repository on GitHub
- Step #2. Clone the GitHub repo & make a new RStudio project
- Step #3. Configure Git
- Step #4. Change a file locally
- Step #5. Stage and commit
- Step #6. Push these changes to the repo on GitHub.
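---

## Why pull first? (optional sketch)

The advice to pull before adding new work can be seen in a small simulation. This is a sketch under stated assumptions, not a course requirement: it assumes the `git` command-line tool and a Unix-like shell, the helper functions and all folder, file, and user names are made up for the demo, and a bare repository in a temporary folder stands in for GitHub. Two clones play the roles of you and a teammate.

```r
# Hypothetical helpers (not from the slides): call the git command-line tool.
run_git <- function(...) {
  invisible(system2("git", c(...), stdout = TRUE, stderr = TRUE))
}
git <- function(dir, ...) run_git("-C", shQuote(dir), ...)
set_identity <- function(dir, name) {
  git(dir, "config", "user.name", shQuote(name))
  git(dir, "config", "user.email", shQuote("student@example.com"))
}

# Throwaway folders: the bare repo stands in for GitHub.
root     <- tempfile("pull-demo-")
dir.create(root)
remote   <- file.path(root, "remote.git")
mine     <- file.path(root, "mine")
teammate <- file.path(root, "teammate")
run_git("init", "--bare", "--quiet", shQuote(remote))

# You: clone, create a file, then stage, commit, and push it.
run_git("clone", "--quiet", shQuote(remote), shQuote(mine))
set_identity(mine, "Example Student")
writeLines("# Team notes", file.path(mine, "notes.Rmd"))
git(mine, "add", "notes.Rmd")
git(mine, "commit", "--quiet", "-m", shQuote("start notes"))
git(mine, "push", "--quiet", "-u", "origin", "HEAD")

# Teammate: clone, add a line, then stage, commit, and push.
run_git("clone", "--quiet", shQuote(remote), shQuote(teammate))
set_identity(teammate, "Example Teammate")
cat("Added by a teammate\n", file = file.path(teammate, "notes.Rmd"), append = TRUE)
git(teammate, "add", "notes.Rmd")
git(teammate, "commit", "--quiet", "-m", shQuote("add teammate line"))
git(teammate, "push", "--quiet", "origin", "HEAD")

# Your local copy is out of date until you pull.
before <- readLines(file.path(mine, "notes.Rmd"))
git(mine, "pull", "--quiet", "--ff-only")
after  <- readLines(file.path(mine, "notes.Rmd"))

print(before)
print(after)
```

Comparing `before` and `after` shows the teammate's line arriving only once the pull has run, which is why pulling comes first when you sit down to work.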