Functions and Automation

Song of the Day

Rainbow Connection - Kermit the Frog

Main ideas

Make functions to automate tasks
Discuss automation using map_ functions
Introduce clean coding

Coming up

Complete Peer review via GitHub issues
- assignment available
- complete by Friday’s lab
Homework #05 is available
Statistical Experience due tomorrow
Complete course evaluation in DukeHub
T.A. Evaluation
Project work

Functions

library(tidyverse)

Functions allow you to automate tasks.

Automating tasks with functions is more powerful, general, and reproducible than copy-pasting.
You can give the function an informative name that makes the code easier to understand.
If requirements change, you only need to update the code in one place.
You reduce your chances of making an error.
You can write functions and packages that others use.

Let’s make an example dataset to work with.

Question: What does the function rnorm() do?

Generates random draws from a normal distribution.

set.seed(2320)

ex_data <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5)
)
ex_data

## # A tibble: 5 x 4
##        a      b      c      d
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1  0.356 -0.404 -1.21   1.28 
## 2  0.544 -0.628 -0.862 -1.77 
## 3 -0.936  1.36  -0.117 -0.491
## 4  0.915  0.759 -0.644 -0.644
## 5 -0.507  0.641 -0.272  0.664

Suppose we want to rescale each column so that its values lie between 0 and 1.

\[\dfrac{x_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)}\]
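As a quick check of the formula, here is a small worked example on a made-up vector (the values 2, 4, and 10 are illustrative only, not part of the course data). The minimum maps to 0, the maximum maps to 1, and everything else lands proportionally in between.

```r
# Hypothetical toy vector, chosen so the arithmetic is easy to follow
x <- c(2, 4, 10)

# (x - min) / (max - min) = (x - 2) / 8
rescaled <- (x - min(x)) / (max(x) - min(x))
rescaled
## [1] 0.00 0.25 1.00
```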

ex_data <- ex_data %>% mutate(
  a = (a - min(a))/(max(a) - min(a)),
  b = (b - min(b))/(max(b) - min(b)),
  c = (c - min(c))/(max(c) - min(c)),
  d = (d - min(d))/(max(d) - min(d)))

ex_data

## # A tibble: 5 x 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1 0.698 0.113 0     1    
## 2 0.800 0     0.320 0    
## 3 0     1     1     0.419
## 4 1     0.697 0.518 0.368
## 5 0.232 0.638 0.858 0.797

Don’t write code from scratch - start from working code.

(a - min(a))/(max(a) - min(a))

Question: How many inputs should this function have?

Just one: the vector to rescale.

Choose an informative name.

rescale01 <-

Use function to define a function.

rescale01 <- function

Specify the inputs (arguments) inside function. Multiple arguments can be included and separated by commas (function(x, y, z)).

rescale01 <- function(x)

Create the body of the function using a {} block immediately following function.

rescale01 <- function(x){
  
}

Place your code in the body of the function.

rescale01 <- function(x){
  
  (x - min(x)) / (max(x) - min(x))

}

Now let’s test rescale01!

x1 <- 1:10
rescale01(x1)

##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000

x2 <- c(1:10, NA)
rescale01(x2)

##  [1] NA NA NA NA NA NA NA NA NA NA NA

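Question: What’s going on here? Address this issue in the code chunk below.

min() and max() return NA when the input contains a missing value, so every rescaled value becomes NA. Setting na.rm = TRUE drops the missing values before the range is computed.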

rescale01a <- function(x) {
  rangex <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
  (x - min(x, na.rm = TRUE)) / rangex
}

x1 <- 1:10
rescale01a(x1)

##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000

x2 <- c(1:10, NA)
rescale01a(x2)

##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000        NA
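With an NA-safe function in hand, the four copy-pasted lines from the start of this section can be collapsed into a single call. The sketch below assumes dplyr's across() is available (dplyr 1.0.0 or later) and builds a small tibble so that it runs on its own. The tibble name raw_data and its values are made up for illustration.

```r
library(tidyverse)

# NA-safe rescaling function, as defined in the notes
rescale01a <- function(x) {
  rangex <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
  (x - min(x, na.rm = TRUE)) / rangex
}

# Hypothetical data, not the ex_data generated earlier
raw_data <- tibble(
  a = c(1, 2, 3),
  b = c(10, NA, 30)
)

# Apply the same function to every column instead of repeating the formula
scaled_data <- raw_data %>%
  mutate(across(everything(), rescale01a))

scaled_data
```

If the requirements change (say, a different way of handling missing values), only the body of rescale01a needs to be updated.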

Functions take inputs defined in the function definition.
By default they return the last value computed in the function.
You can define more outputs to be returned in a list as well as nice print methods, but we won’t go there for now.

do_something <- function(x, y, z){
  # do a bunch of stuff with the inputs...
  
  # return a tibble
  tibble(...)
}
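To illustrate the point about returning several outputs, here is a minimal sketch of a function that bundles its results in a named list. The name rescale_summary and the choice of list elements are hypothetical, made up for this example.

```r
# Hypothetical helper: returns the rescaled values together with
# the minimum and maximum that were used to compute them
rescale_summary <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)

  # The list is the last value computed, so it is what gets returned
  list(
    scaled = (x - x_min) / (x_max - x_min),
    min = x_min,
    max = x_max
  )
}

result <- rescale_summary(c(5, 10, 15))
result$scaled
## [1] 0.0 0.5 1.0
result$max
## [1] 15
```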

Question: Does the function defined below behave as you expect? Why or why not?

The add_2() function returns 1000 every time. The value of x + 2 is computed but then discarded, because 1000 is the last value computed in the function.

add_2 <- function(x){
  x + 2
  1000
}

add_2(998)

## [1] 1000

add_2(2)

## [1] 1000

add_2(100)

## [1] 1000

add_2(24)

## [1] 1000
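One possible fix, assuming the intent is to return the input plus two, is to make x + 2 the last expression in the body. The name add_2_fixed is introduced here only to keep it distinct from the version above.

```r
add_2_fixed <- function(x) {
  # x + 2 is now the last value computed, so it is returned
  x + 2
}

add_2_fixed(998)
## [1] 1000
add_2_fixed(2)
## [1] 4
```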

Automation: Mapping

Mapping allows us to apply a function to each element of an object and return a specific type of value.

Suppose we have exam 1 and exam 2 scores of 4 students stored in a list.

exam_scores <- list(
  c(80, 90, 70, 50),  # exam 1
  c(85, 83, 45, 60)   # exam 2
)
exam_scores

## [[1]]
## [1] 80 90 70 50
## 
## [[2]]
## [1] 85 83 45 60

We can use map() to find the mean score for each exam.

map(exam_scores, mean)

## [[1]]
## [1] 72.5
## 
## [[2]]
## [1] 68.25

Suppose we want the results as a numeric (double) vector.

map_dbl(exam_scores, mean)

## [1] 72.50 68.25
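For intuition, map_dbl() here does roughly the same work as the for loop below: it visits each element, applies the function, and stores one double per element. This is a sketch of the idea, not a description of how purrr is implemented internally. The name exam_means is introduced for this example.

```r
exam_scores <- list(
  c(80, 90, 70, 50),
  c(85, 83, 45, 60)
)

# Pre-allocate a double vector with one slot per list element
exam_means <- vector("double", length(exam_scores))

for (i in seq_along(exam_scores)) {
  exam_means[i] <- mean(exam_scores[[i]])
}

exam_means
## [1] 72.50 68.25
```

The map_ version expresses the same computation in one line and makes the output type explicit in the function name.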

What if we want the results as a character string?

map_chr(exam_scores, mean)

## [1] "72.500000" "68.250000"
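Recent versions of purrr (1.0.0 and later) deprecate this automatic conversion from numbers to strings, so the call above may produce a warning depending on the installed version. A hedged alternative is to convert explicitly inside the mapped function, which also avoids the padded zeros. The name mean_labels is made up for this sketch.

```r
library(tidyverse)

exam_scores <- list(
  c(80, 90, 70, 50),
  c(85, 83, 45, 60)
)

# Convert to character explicitly so map_chr() receives strings
mean_labels <- map_chr(exam_scores, ~ as.character(mean(.x)))
mean_labels
## [1] "72.5"  "68.25"
```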

map(): returns a list
map_lgl(): returns a logical vector
map_int(): returns an integer vector
map_dbl(): returns a double vector
map_chr(): returns a character vector
map_df() / map_dfr(): returns a data frame by row binding
map_dfc(): returns a data frame by column binding
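A short sketch of two of the typed variants on the exam data. The passing cutoff of 50 and the names n_students and all_passed are assumptions made up for this example.

```r
library(tidyverse)

exam_scores <- list(
  c(80, 90, 70, 50),
  c(85, 83, 45, 60)
)

# length() returns an integer, so map_int() fits
n_students <- map_int(exam_scores, length)
n_students
## [1] 4 4

# One TRUE/FALSE answer per exam: did every student score at least 50?
all_passed <- map_lgl(exam_scores, ~ all(.x >= 50))
all_passed
## [1]  TRUE FALSE
```

By contrast, map_int(exam_scores, mean) would stop with an error, since the means are not whole numbers. The suffix acts as a check that the function returns the type you expect.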

map_dbl(ex_data, mean)

##         a         b         c         d 
## 0.5460050 0.4894938 0.5392597 0.5168666

map_dbl(ex_data, median)

##         a         b         c         d 
## 0.6982264 0.6377430 0.5182620 0.4185009

map_dbl(ex_data, sd)

##         a         b         c         d 
## 0.4154975 0.4205473 0.4042266 0.3908414

Question: How many distinct values are there in each column of mtcars? Use an appropriate map_ function to answer.

mtcars %>% map_int(n_distinct)

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   25    3   27   22   22   29   30    2    2    3    6

Clean Coding

Code should express intent, use the correct parts of speech, have the length correspond to scope, and contain no disinformation (Martin).

For variables, what is it? For functions, what does it do? These should be expressed in the name of the variable or function.

Variables are nouns, functions are verbs, and predicates (functions that return TRUE or FALSE) should read like yes/no questions, as is.na() and is.numeric() do. They should be named as such.

Variables with a small scope should have short names, and variables with a large scope should have long names.

The opposite is true for functions. Functions with a small scope should have long, descriptive names, and functions with a large scope should have short names.

Multiword names should be separated by underscores (snake_case not CamelCase).
Families of functions should be named similarly (str_trim, str_sub, str_remove).
Use consistent naming conventions.
Don’t overwrite existing functions or variables.
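A minimal sketch of these naming guidelines in practice. All names here (passing_cutoff, is_passing, count_passing) and the cutoff of 60 are hypothetical, chosen only to illustrate nouns for variables, verbs for functions, and predicate-style names for TRUE/FALSE results.

```r
# Variable: a noun that says what it is
passing_cutoff <- 60

# Predicate: reads like a yes/no question and returns TRUE/FALSE
is_passing <- function(score) {
  score >= passing_cutoff
}

# Function: a verb phrase that says what it does
count_passing <- function(scores) {
  sum(is_passing(scores))
}

count_passing(c(80, 90, 45, 60))
## [1] 3
```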

Question: Why is the code below bad?

mean <- function(x){
  sum(x)
}

T <- FALSE

c <- 25
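One way to see the problem is to run the definitions inside local(), which keeps them from leaking into the rest of the session. This is only a demonstration sketch (the name demo is made up). The takeaway is that mean, T, and c already have meanings that readers and other code rely on.

```r
demo <- local({
  # Masks base::mean with something that is not a mean at all
  mean <- function(x) {
    sum(x)
  }

  # T is an ordinary variable that defaults to TRUE, so it can be reassigned
  T <- FALSE

  # c now also names a number; R still finds the function c() when it is
  # called, but the code becomes confusing to read
  c <- 25

  list(
    masked_mean = mean(c(1, 2, 3)),
    true_mean = base::mean(c(1, 2, 3)),
    t_value = T
  )
})

demo$masked_mean
## [1] 6
demo$true_mean
## [1] 2
demo$t_value
## [1] FALSE
```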

Sources and Additional Information

R for Data Science Chapter 19: Functions
R for Data Science Chapter 21: Iteration
Clean Code by Robert Martin