Data Wrangling

class: center, middle, inverse, title-slide

# Data Wrangling

---

### Goal

.large[
.question[
Learn/review some functions for manipulating and summarizing data in R.
]
]

---

### Penguins data

.large[
Data on 344 penguins near Palmer Station, Antarctica.
]

.center[
<img src="penguin_art.png" width="600">
]

.footnote[
Artwork by @allison_horst
]

---

### Penguins data

.large[
Data on 344 penguins near Palmer Station, Antarctica. Variables include:

* `species`: penguin's species (Adelie, Chinstrap, Gentoo)
* `island`: island where the penguin was measured (Biscoe, Dream, Torgersen)
* `bill_length_mm`: penguin's bill length (mm)
* `bill_depth_mm`: penguin's bill depth (mm)
* `flipper_length_mm`: penguin's flipper length (mm)
* `body_mass_g`: penguin's body mass (g)
* `sex`: penguin's sex (female, male)
* `year`: year the data were recorded (2007, 2008, 2009)
]

.large[
.question[
We get this data set -- where do we start?
]
]

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
* Plots, summary statistics, and missing data
* Create new variables

.question[
**Data wrangling:** Manipulating, summarizing, and transforming data.
]
]

---

### Tools for data wrangling

.large[
.pull-left[
.center[
<img src="dplyr_logo.png" width = "200px">
]
]

.pull-right[
* `dplyr` is part of the tidyverse
* It provides a "grammar of data manipulation": useful verbs (functions) for manipulating data
* We will cover a few key `dplyr` functions
]
]

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]
]

.large[
**Step 1:** What data do I start with?

```r
penguins
```

```
## # A tibble: 344 × 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
##  1 Adelie    Dream               38.8          20                190        3950
##  2 Gentoo    Biscoe              47.5          15                218        4950
##  3 Adelie    Dream               36.2          17.3              187        3300
##  4 Gentoo    Biscoe              45.1          14.5              207        5050
##  5 Chinstrap Dream               45.2          16.6              191        3250
##  6 Adelie    Torgersen           36.2          17.2              187        3150
##  7 Gentoo    Biscoe              49.3          15.7              217        5850
##  8 Adelie    Biscoe              41.1          18.2              192        4050
##  9 Adelie    Torgersen           44.1          18                210        4000
## 10 Chinstrap Dream               52            20.7              210        4800
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```
]

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

**Step 2:** What do I do to that data?

```r
penguins %>%
  filter(species == "Chinstrap")
```
]

```
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            45.2          16.6               191        3250
##  2 Chinstrap Dream            52            20.7               210        4800
##  3 Chinstrap Dream            54.2          20.8               201        4300
##  4 Chinstrap Dream            42.5          17.3               187        3350
##  5 Chinstrap Dream            45.5          17                 196        3500
##  6 Chinstrap Dream            50.2          18.8               202        3800
##  7 Chinstrap Dream            50.8          18.5               201        4450
##  8 Chinstrap Dream            50.5          18.4               200        3400
##  9 Chinstrap Dream            50.5          19.6               201        4050
## 10 Chinstrap Dream            46.5          17.9               192        3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
```

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

**Step 2:** What do I do to that data?

```r
penguins %>%
  filter(species == "Chinstrap")
```

* `%>%` is called the *pipe*. It means "take `<this>`, THEN do `<that>`"
* `filter` keeps only the rows that satisfy a condition (here, `species == "Chinstrap"`)
]
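
---

### Making a subset of the data

A minimal sketch of what the pipe and `filter` do, assuming `dplyr` is installed. `toy_penguins` and its values are made up for illustration; they are not the real penguins data.

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
  body_mass_g = c(3400, 3500, 3900, 5500)
)

# With the pipe: take toy_penguins, THEN keep the Chinstrap rows
piped <- toy_penguins %>%
  filter(species == "Chinstrap")

# Without the pipe: the data frame is the first argument to filter()
unpiped <- filter(toy_penguins, species == "Chinstrap")

# Conditions separated by commas must all be true for a row to be kept
heavy_chinstraps <- toy_penguins %>%
  filter(species == "Chinstrap", body_mass_g > 3600)
```

Here `piped` and `unpiped` contain the same two rows, and `heavy_chinstraps` keeps only the 3900 g Chinstrap.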

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

**Step 3:** What do I do with the result?

```r
chinstrap_penguins <- penguins %>% 
  filter(species == "Chinstrap")
```

* `<-` is the *assignment* operator. It means "save the result under this name, so we can use it later"
  * Here we create a new data frame called `chinstrap_penguins` that contains just the Chinstraps
]
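
---

### Making a subset of the data

A small sketch showing that assignment saves a new data frame and leaves the original unchanged. The names `toy_penguins` and `chinstrap_toy` and the values are made up for illustration.

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
  body_mass_g = c(3400, 3500, 3900, 5500)
)

# Save the filtered rows under a new name
chinstrap_toy <- toy_penguins %>%
  filter(species == "Chinstrap")

nrow(toy_penguins)   # 4: the original data still has every row
nrow(chinstrap_toy)  # 2: the saved subset has only the Chinstraps
```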
  
---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

```r
chinstrap_penguins <- penguins %>% 
  filter(species == "Chinstrap")

chinstrap_penguins
```

]

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
* Plots, summary statistics, and missing data
* Create new variables
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Chinstrap penguins?
]
]

.large[
**Step 1:** What data do I start with?

```r
chinstrap_penguins
```
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Chinstrap penguins?
]

**Step 2:** What do I do to that data?

```r
chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3733.
```

.large[
* `%>%` is called the *pipe*. It means "take `<this>`, THEN do `<that>`"
* `summarize` is used to calculate summary statistics
]
---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Chinstrap penguins?
]

**Step 3:** (optional) Do I want to save the result?

```r
chinstrap_summary <- chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Chaining pipes together

.large[
If we don't care about the intermediate results, we can chain pipes (`%>%`) together. These two chunks calculate the same summary statistics.

**Option 1:**

```r
chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")

chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
```

**Option 2:**

```r
penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]
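
---

### Chaining pipes together

One way to check that the two options agree, sketched on made-up data (`toy_penguins`, `option_1`, and `option_2` are hypothetical names, not part of the penguins data).

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
  body_mass_g = c(3400, 3500, 3900, 5500)
)

# Option 1: save the intermediate subset, then summarize it
chinstrap_toy <- toy_penguins %>%
  filter(species == "Chinstrap")

option_1 <- chinstrap_toy %>%
  summarize(avg_mass = mean(body_mass_g))

# Option 2: one chain, no intermediate object
option_2 <- toy_penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))

identical(option_1, option_2)  # TRUE: both give avg_mass = 3700
```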

---

### Calculating summary statistics

.large[
.question[
What if I want the average body mass for Adelie penguins?
]

```r
penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Calculating summary statistics

.large[
.question[
What if I want the average body mass for Adelie penguins?
]

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Adelie penguins?
]

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1       NA
```

.large[
.question[
What does a result of `NA` mean?
]
]

.large[
* `NA` means "Not Available": the value is missing
* `mean` returns `NA` when any of the values it averages is missing
]
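
---

### Calculating summary statistics

A minimal sketch of how one missing value leads to an `NA` mean, and how to count the missing values. The data and the names `toy_penguins`, `na_result`, and `missing_summary` are made up for illustration.

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Adelie"),
  body_mass_g = c(3400, 4000, NA)
)

# One missing body mass is enough to make the mean NA
na_result <- toy_penguins %>%
  summarize(avg_mass = mean(body_mass_g))

# Count the missing body masses alongside the number of rows
missing_summary <- toy_penguins %>%
  summarize(n_missing = sum(is.na(body_mass_g)),
            n_rows = n())
```

In this toy example, `na_result$avg_mass` is `NA` and `missing_summary` reports 1 missing value out of 3 rows.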

---

### Handling missing values

.large[
.question[
What is the average body mass for Adelie penguins?
]

**Option 1:**

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g,
                            na.rm=TRUE))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3701.
```

.large[
* Use `filter` to focus only on Adelie penguins
* `summarize` is used to calculate summary statistics
* `na.rm=TRUE` means "ignore missing values"
]
---

### Handling missing values

.large[
.question[
What is the average body mass for Adelie penguins?
]

**Option 2:**

```r
penguins %>%
  filter(species == "Adelie") %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3706.
```

.large[
* `drop_na` (from the `tidyr` package) means "remove every row that has a missing value in any column"
]
---

### Handling missing values

.large[

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
```

`3701.`

```r
penguins %>%
  filter(species == "Adelie") %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))
```

`3706.`

.question[
Why do these chunks give different numbers?
]
]

---

### Handling missing values

.large[
* `drop_na` removes *all* rows with missing values (not just missing values in `body_mass_g`)
* This is reasonable when only a small number of rows are removed
* When you have missing values, check how much data is missing
]
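
---

### Handling missing values

A sketch of how to check how much data is missing, and of why `drop_na()` and `na.rm = TRUE` can disagree. It assumes `dplyr` and `tidyr` are installed; the data and object names are made up for illustration.

```r
library(dplyr)
library(tidyr)  # drop_na() comes from tidyr

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Adelie"),
  sex = c("female", "male", NA, NA),
  body_mass_g = c(3400, 4000, 3800, NA)
)

# How many values are missing in each column?
missing_counts <- toy_penguins %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# drop_na() with no arguments drops rows 3 and 4 (a missing value anywhere)
all_dropped <- toy_penguins %>% drop_na()

# drop_na(body_mass_g) drops only row 4 (missing body mass)
mass_dropped <- toy_penguins %>% drop_na(body_mass_g)

mean_all_dropped <- mean(all_dropped$body_mass_g)    # mean of 3400, 4000
mean_mass_dropped <- mean(mass_dropped$body_mass_g)  # mean of 3400, 4000, 3800
mean_na_rm <- mean(toy_penguins$body_mass_g, na.rm = TRUE)  # same as mean_mass_dropped
```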
---

### Calculating summary statistics

.large[
.question[
What is the average body mass for *each* species of penguin?
]
]

.large[

```r
penguins %>%
  group_by(species) %>%
  summarize(avg_mass = mean(body_mass_g,
                            na.rm=TRUE))
```

```
## # A tibble: 3 × 2
##   species   avg_mass
##   <fct>        <dbl>
## 1 Adelie       3701.
## 2 Chinstrap    3733.
## 3 Gentoo       5076.
```
]

.large[
* `group_by` groups together the rows that share a value of a variable (here, `species`)
* We often use `group_by` before `summarize`, which then returns one row per group
]
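
---

### Calculating summary statistics

A minimal sketch of `group_by` followed by `summarize`, on made-up data (`toy_penguins` and `by_species` are hypothetical names). `n()` counts the rows in each group, including rows where the body mass is missing.

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
  body_mass_g = c(3400, 4000, NA, 3500, 3900, 5500)
)

# One row per species: the average mass and the number of rows in the group
by_species <- toy_penguins %>%
  group_by(species) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE),
            n = n())

by_species$avg_mass  # 3700 3700 5500 (Adelie, Chinstrap, Gentoo)
by_species$n         # 3 2 1
```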
---

### Calculating summary statistics

.large[
.question[
What are the mean and standard deviation of body mass for each species and sex?
]
]

.large[

```r
penguins %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE),
            sd_mass = sd(body_mass_g, na.rm=TRUE))
```

```
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
```

```
## # A tibble: 8 × 4
## # Groups:   species [3]
##   species   sex    avg_mass sd_mass
##   <fct>     <fct>     <dbl>   <dbl>
## 1 Adelie    female    3369.    269.
## 2 Adelie    male      4043.    347.
## 3 Adelie    <NA>      3540     477.
## 4 Chinstrap female    3527.    285.
## 5 Chinstrap male      3939.    362.
## 6 Gentoo    female    4680.    282.
## 7 Gentoo    male      5485.    313.
## 8 Gentoo    <NA>      4588.    338.
```
]
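
---

### Calculating summary statistics

The message about grouped output refers to the `.groups` argument of `summarize`. Below is a small sketch of the difference, on made-up data (`toy_penguins`, `still_grouped`, and `ungrouped` are hypothetical names).

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap"),
  sex = c("female", "male", NA, "female", "male"),
  body_mass_g = c(3400, 4000, 3600, 3500, 3900)
)

# Default: the result stays grouped by species (dplyr may print the message)
still_grouped <- toy_penguins %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE))

# .groups = "drop": the result has no grouping left
ungrouped <- toy_penguins %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE),
            .groups = "drop")

group_vars(still_grouped)  # "species"
group_vars(ungrouped)      # character(0)
```

Dropping the groups is a reasonable default when the summary table is the final result; keeping them matters only if later steps should still work species by species.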

---

### Calculating summary statistics

.large[
.question[
How many penguins of each species and sex are there?
]
]

.large[

```r
penguins %>%
  count(species, sex)
```

```
## # A tibble: 8 × 3
##   species   sex        n
##   <fct>     <fct>  <int>
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Adelie    <NA>       6
## 4 Chinstrap female    34
## 5 Chinstrap male      34
## 6 Gentoo    female    58
## 7 Gentoo    male      61
## 8 Gentoo    <NA>       5
```
]
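
---

### Calculating summary statistics

`count` can be thought of as a shortcut for `group_by` plus `summarize` with `n()`. A sketch on made-up data (`toy_penguins`, `counted`, and `by_hand` are hypothetical names):

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap"),
  sex = c("female", "female", NA, "female", "male")
)

# Shortcut: count the rows for each species and sex
counted <- toy_penguins %>%
  count(species, sex)

# The same counts written out by hand
by_hand <- toy_penguins %>%
  group_by(species, sex) %>%
  summarize(n = n(), .groups = "drop")

counted$n       # 2 1 1 1
sum(counted$n)  # 5: the counts add up to the number of rows
```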

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
* Plots, summary statistics, and missing data
* Create new variables
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]
]

.large[
**Step 1:** Create a new variable

```r
penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
```
]

.large[

```
...
##    species   island    body_mass_g flipper_length_mm bf_ratio
##    <fct>     <fct>           <int>             <int>    <dbl>
##  1 Adelie    Dream            3950               190     20.8
##  2 Gentoo    Biscoe           4950               218     22.7
##  3 Adelie    Dream            3300               187     17.6
##  4 Gentoo    Biscoe           5050               207     24.4
##  5 Chinstrap Dream            3250               191     17.0
##  6 Adelie    Torgersen        3150               187     16.8
##  7 Gentoo    Biscoe           5850               217     27.0
...
```
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]

**Step 1:** Create a new variable

```r
penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
```

* `mutate` adds a new column to the data, computed from existing columns
]
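
---

### Creating new variables

A minimal sketch of `mutate` on made-up data. `toy_penguins`, `toy_new`, and the extra column `body_mass_kg` are hypothetical and only there for illustration. Several columns can be created in one call, and a missing input gives a missing result.

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3800, 5000, NA),
  flipper_length_mm = c(190, 200, 195)
)

# Add two new columns; the existing columns are kept
toy_new <- toy_penguins %>%
  mutate(bf_ratio = body_mass_g / flipper_length_mm,
         body_mass_kg = body_mass_g / 1000)

toy_new$bf_ratio      # 20 25 NA
toy_new$body_mass_kg  # 3.8 5.0 NA
```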

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]

**Step 2:** Save the data with the new column

```r
penguins_new <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
```
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]

**Step 3:** Exploratory data analysis
]

.large[

```r
penguins_new <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)

penguins_new %>%
  group_by(species) %>%
  summarize(mean_ratio = mean(bf_ratio,
                              na.rm=TRUE))
```

```
## # A tibble: 3 × 2
##   species   mean_ratio
##   <fct>          <dbl>
## 1 Adelie          19.5
## 2 Chinstrap       19.0
## 3 Gentoo          23.3
```
]
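
---

### Creating new variables

The mean is only one feature of a distribution. As a sketch, the same `group_by` and `summarize` pattern can report the minimum, median, and maximum. The data and the names `toy_penguins` and `ratio_summary` are made up for illustration.

```r
library(dplyr)

# Toy data for illustration only (made-up values, not the real penguins data)
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3600, 3800, 4200, 5000, 5720),
  flipper_length_mm = c(180, 190, 200, 200, 220)
)

# Create the ratio, then summarize its distribution within each species
ratio_summary <- toy_penguins %>%
  mutate(bf_ratio = body_mass_g / flipper_length_mm) %>%
  group_by(species) %>%
  summarize(min_ratio = min(bf_ratio, na.rm = TRUE),
            median_ratio = median(bf_ratio, na.rm = TRUE),
            max_ratio = max(bf_ratio, na.rm = TRUE))
```

In this toy example the Adelie ratios are 20, 20, and 21, and the Gentoo ratios are 25 and 26.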

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
  * `filter`
* Plots, summary statistics, and missing data
  * `summarize`, `group_by`, `count`, `drop_na`
* Create new variables
  * `mutate`

.question[
**Activity:** practice with data wrangling functions on the building energy efficiency data.
]
]