
Data Wrangling

1 / 32

Goal

Learn/review some functions for manipulating and summarizing data in R.

2 / 32

Penguins data

Data on 344 penguins near Palmer Station, Antarctica.

Artwork by @allison_horst

3 / 32

Penguins data

Data on 344 penguins near Palmer Station, Antarctica. Variables include:

  • species: penguin's species (Adelie, Chinstrap, Gentoo)
  • island: island where penguin measured (Biscoe, Dream, Torgersen)
  • bill_length_mm: penguin's bill length (mm)
  • bill_depth_mm: penguin's bill depth (mm)
  • flipper_length_mm: penguin's flipper length (mm)
  • body_mass_g: penguin's body mass (g)
  • sex: penguin's sex (female, male)
  • year: year when data recorded (2007, 2008, 2009)

We get this data set -- where do we start?

4 / 32

Starting points for data analysis

  • Work with a smaller, manageable subset of the data
  • Plots, summary statistics, and missing data
  • Create new variables

Data wrangling: Manipulating, summarizing, and transforming data.

5 / 32

Tools for data wrangling

  • dplyr: part of the tidyverse
  • provides a "grammar of data manipulation": useful verbs (functions) for manipulating data
  • we will cover a few key dplyr functions
6 / 32

Making a subset of the data

We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.

Step 1: What data do I start with?

penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 38.8 20 190 3950
## 2 Gentoo Biscoe 47.5 15 218 4950
## 3 Adelie Dream 36.2 17.3 187 3300
## 4 Gentoo Biscoe 45.1 14.5 207 5050
## 5 Chinstrap Dream 45.2 16.6 191 3250
## 6 Adelie Torgersen 36.2 17.2 187 3150
## 7 Gentoo Biscoe 49.3 15.7 217 5850
## 8 Adelie Biscoe 41.1 18.2 192 4050
## 9 Adelie Torgersen 44.1 18 210 4000
## 10 Chinstrap Dream 52 20.7 210 4800
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
7 / 32

Making a subset of the data

We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.

Step 2: What do I do to that data?

penguins %>%
  filter(species == "Chinstrap")
## # A tibble: 68 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 45.2 16.6 191 3250
## 2 Chinstrap Dream 52 20.7 210 4800
## 3 Chinstrap Dream 54.2 20.8 201 4300
## 4 Chinstrap Dream 42.5 17.3 187 3350
## 5 Chinstrap Dream 45.5 17 196 3500
## 6 Chinstrap Dream 50.2 18.8 202 3800
## 7 Chinstrap Dream 50.8 18.5 201 4450
## 8 Chinstrap Dream 50.5 18.4 200 3400
## 9 Chinstrap Dream 50.5 19.6 201 4050
## 10 Chinstrap Dream 46.5 17.9 192 3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
  • %>% is called the pipe. It means "take <this>, THEN do <that>"
  • filter keeps only the rows which satisfy a specific condition
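
A minimal sketch of what the pipe does, using a small made-up data frame (`toy` and its values are hypothetical, not the real penguins data). Piping a data frame into filter is the same as passing it as the first argument:

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo", "Chinstrap"),
  body_mass_g = c(3700, 3250, 5000, 4800)
)

# With the pipe: take toy, THEN keep the Chinstrap rows
piped <- toy %>%
  filter(species == "Chinstrap")

# Without the pipe: the data frame is the first argument to filter
nested <- filter(toy, species == "Chinstrap")

# Both approaches keep the same two rows
identical(piped, nested)
```

The piped form reads top to bottom in the order the steps happen, which is why it is used throughout these slides.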
9 / 32

Making a subset of the data

We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.

Step 3: What do I do with the result?

chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")
  • <- is the assignment operator. It means "save the result under this name"
    • Here we create a new data frame called chinstrap_penguins that contains just the Chinstraps
10 / 32

Making a subset of the data

We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.

chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")

chinstrap_penguins

## # A tibble: 68 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 45.2 16.6 191 3250
## 2 Chinstrap Dream 52 20.7 210 4800
## 3 Chinstrap Dream 54.2 20.8 201 4300
## 4 Chinstrap Dream 42.5 17.3 187 3350
## 5 Chinstrap Dream 45.5 17 196 3500
## 6 Chinstrap Dream 50.2 18.8 202 3800
## 7 Chinstrap Dream 50.8 18.5 201 4450
## 8 Chinstrap Dream 50.5 18.4 200 3400
## 9 Chinstrap Dream 50.5 19.6 201 4050
## 10 Chinstrap Dream 46.5 17.9 192 3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
11 / 32

Starting points for data analysis

  • Work with a smaller, manageable subset of the data
  • Plots, summary statistics, and missing data
  • Create new variables
12 / 32

Calculating summary statistics

What is the average body mass for Chinstrap penguins?

Step 1: What data do I start with?

chinstrap_penguins
## # A tibble: 68 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 45.2 16.6 191 3250
## 2 Chinstrap Dream 52 20.7 210 4800
## 3 Chinstrap Dream 54.2 20.8 201 4300
## 4 Chinstrap Dream 42.5 17.3 187 3350
## 5 Chinstrap Dream 45.5 17 196 3500
## 6 Chinstrap Dream 50.2 18.8 202 3800
## 7 Chinstrap Dream 50.8 18.5 201 4450
## 8 Chinstrap Dream 50.5 18.4 200 3400
## 9 Chinstrap Dream 50.5 19.6 201 4050
## 10 Chinstrap Dream 46.5 17.9 192 3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
13 / 32

Calculating summary statistics

What is the average body mass for Chinstrap penguins?

Step 2: What do I do to that data?

chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
## # A tibble: 1 × 1
## avg_mass
## <dbl>
## 1 3733.
  • %>% is called the pipe. It means "take <this>, THEN do <that>"
  • summarize is used to calculate summary statistics
14 / 32

Calculating summary statistics

What is the average body mass for Chinstrap penguins?

Step 3: (optional) Do I want to save the result?

chinstrap_summary <- chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
15 / 32

Chaining pipes together

If we don't need the intermediate results, we can chain pipes (%>%) together. These two chunks calculate the same summary statistic.

Option 1:

chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")

chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))

Option 2:

penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))
16 / 32

Calculating summary statistics

What if I want the average body mass for Adelie penguins? Start from the Chinstrap code:

penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))
17 / 32

Calculating summary statistics

What if I want the average body mass for Adelie penguins? Change the species in filter:

penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g))
18 / 32

Calculating summary statistics

What is the average body mass for Adelie penguins?

penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g))
## # A tibble: 1 × 1
## avg_mass
## <dbl>
## 1 NA

What does a result of NA mean?

  • NA means "Not Available"
  • mean() returns NA when any of the values it is given are missing
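
As a small illustration in base R (the vector `masses` is made up for this sketch), a single missing value is enough to make the mean NA, and is.na shows where the missing values are:

```r
# Hypothetical body masses with one missing measurement
masses <- c(3700, 3800, NA, 3600)

# One unknown value makes the average unknown
mean(masses)

# TRUE marks the missing entries
is.na(masses)

# Number of missing values
sum(is.na(masses))
```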
19 / 32

Handling missing values

What is the average body mass for Adelie penguins?

Option 1:

penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
## # A tibble: 1 × 1
## avg_mass
## <dbl>
## 1 3701.
  • Use filter to focus only on Adelie penguins
  • summarize is used to calculate summary statistics
  • na.rm=TRUE means "ignore missing values"
20 / 32

Handling missing values

What is the average body mass for Adelie penguins?

Option 2:

penguins %>%
  filter(species == "Adelie") %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))
## # A tibble: 1 × 1
## avg_mass
## <dbl>
## 1 3706.
  • drop_na means "remove any rows with missing values in any columns"
21 / 32

Handling missing values

penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))

3701.

penguins %>%
  filter(species == "Adelie") %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))

3706.

Why do these chunks give different numbers?

22 / 32

Handling missing values

  • drop_na removes all rows with missing values (not just missing values in body_mass_g)
  • Reasonable if this is a small number of rows
  • When you have missing values, check how much data is missing
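
One way to check how much data is missing, sketched on a made-up data frame (`toy` and its values are hypothetical). It counts missing values per column, then compares drop_na() with no arguments against drop_na(body_mass_g), which drops rows only when that one column is missing:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: row 2 is missing body mass, row 3 is missing sex
toy <- data.frame(
  body_mass_g = c(3700, NA, 3600, 3900),
  sex = c("female", "male", NA, "male")
)

# Count the missing values in every column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# No arguments: drop rows with a missing value in any column (rows 2 and 3)
all_complete <- toy %>%
  drop_na()

# Naming a column: drop rows only where that column is missing (row 2)
mass_known <- toy %>%
  drop_na(body_mass_g)

# The two averages are based on different numbers of rows
nrow(all_complete)
nrow(mass_known)
mean(all_complete$body_mass_g)
mean(mass_known$body_mass_g)
```

In this toy example the row with a known body mass but unknown sex is kept by drop_na(body_mass_g) and discarded by drop_na(), so the two averages differ.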
23 / 32

Calculating summary statistics

What is the average body mass for each species of penguin?

penguins %>%
  group_by(species) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
## # A tibble: 3 × 2
## species avg_mass
## <fct> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
  • group_by is used to group rows together
  • We often use group_by before summarize
24 / 32

Calculating summary statistics

What are the mean and standard deviation of body mass for each species and sex?

penguins %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE),
            sd_mass = sd(body_mass_g, na.rm=TRUE))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 8 × 4
## # Groups: species [3]
## species sex avg_mass sd_mass
## <fct> <fct> <dbl> <dbl>
## 1 Adelie female 3369. 269.
## 2 Adelie male 4043. 347.
## 3 Adelie <NA> 3540 477.
## 4 Chinstrap female 3527. 285.
## 5 Chinstrap male 3939. 362.
## 6 Gentoo female 4680. 282.
## 7 Gentoo male 5485. 313.
## 8 Gentoo <NA> 4588. 338.
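
The message about grouped output appears because, after grouping by two variables, summarize keeps the result grouped by the first one. A minimal sketch on made-up data (`toy` is hypothetical) of using the `.groups` argument to return an ungrouped result instead:

```r
library(dplyr)

# Hypothetical toy data with one penguin per species/sex combination
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex = c("female", "male", "female", "male"),
  body_mass_g = c(3400, 4000, 4700, 5500)
)

# .groups = "drop" removes all grouping from the result,
# so no message about grouped output is printed
by_group <- toy %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g), .groups = "drop")

by_group

# Check whether the result is still grouped
is_grouped_df(by_group)
```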
25 / 32

Calculating summary statistics

How many penguins of each species and sex are there?

penguins %>%
  count(species, sex)
## # A tibble: 8 × 3
## species sex n
## <fct> <fct> <int>
## 1 Adelie female 73
## 2 Adelie male 73
## 3 Adelie <NA> 6
## 4 Chinstrap female 34
## 5 Chinstrap male 34
## 6 Gentoo female 58
## 7 Gentoo male 61
## 8 Gentoo <NA> 5
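
count can be thought of as a shortcut for group_by followed by summarize with n(). A small sketch on made-up data (`toy` is hypothetical) comparing the two:

```r
library(dplyr)

# Hypothetical toy data: two female Adelies and one male Gentoo
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  sex = c("female", "female", "male")
)

# Shortcut: count the rows for each species/sex combination
counted <- toy %>%
  count(species, sex)

# Longer form: group the rows, then count them with n()
by_hand <- toy %>%
  group_by(species, sex) %>%
  summarize(n = n(), .groups = "drop")

counted
by_hand
```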
26 / 32

Starting points for data analysis

  • Work with a smaller, manageable subset of the data
  • Plots, summary statistics, and missing data
  • Create new variables
27 / 32

Creating new variables

What's the distribution of the ratio of body mass to flipper length?

Step 1: Create a new variable

penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
...
## species island body_mass_g flipper_length_mm bf_ratio
## <fct> <fct> <int> <int> <dbl>
## 1 Adelie Dream 3950 190 20.8
## 2 Gentoo Biscoe 4950 218 22.7
## 3 Adelie Dream 3300 187 17.6
## 4 Gentoo Biscoe 5050 207 24.4
## 5 Chinstrap Dream 3250 191 17.0
## 6 Adelie Torgersen 3150 187 16.8
## 7 Gentoo Biscoe 5850 217 27.0
...
  • mutate returns the data with a new column added (here, bf_ratio)
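
A quick sketch, on a made-up data frame (`toy` and its values are hypothetical), of why the result needs to be saved: mutate returns a new data frame and leaves the original as it was.

```r
library(dplyr)

# Hypothetical toy data with two penguins
toy <- data.frame(
  body_mass_g = c(3800, 5000),
  flipper_length_mm = c(190, 200)
)

# This prints a data frame with the new column, but does not save it
toy %>%
  mutate(bf_ratio = body_mass_g / flipper_length_mm)

# The original still has only its two columns
names(toy)

# Assigning the result keeps the new column
toy_new <- toy %>%
  mutate(bf_ratio = body_mass_g / flipper_length_mm)

names(toy_new)
```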
29 / 32

Creating new variables

What's the distribution of the ratio of body mass to flipper length?

Step 2: Save the data with the new column

penguins_new <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
30 / 32

Creating new variables

What's the distribution of the ratio of body mass to flipper length?

Step 3: Exploratory data analysis

penguins_new <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)

penguins_new %>%
  group_by(species) %>%
  summarize(mean_ratio = mean(bf_ratio, na.rm=TRUE))
## # A tibble: 3 × 2
## species mean_ratio
## <fct> <dbl>
## 1 Adelie 19.5
## 2 Chinstrap 19.0
## 3 Gentoo 23.3
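
The mean is only one feature of a distribution. As a possible next step, sketched on made-up ratios (`toy_new` and its values are hypothetical), summarize can compute several statistics at once to describe the spread as well as the center:

```r
library(dplyr)

# Hypothetical ratios for two species, three penguins each
toy_new <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  bf_ratio = c(18, 19, 23, 22, 24, 26)
)

# Several summary statistics per species in a single summarize call
ratio_summary <- toy_new %>%
  group_by(species) %>%
  summarize(min_ratio = min(bf_ratio, na.rm = TRUE),
            median_ratio = median(bf_ratio, na.rm = TRUE),
            max_ratio = max(bf_ratio, na.rm = TRUE))

ratio_summary
```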
31 / 32

Starting points for data analysis

  • Work with a smaller, manageable subset of the data
    • filter
  • Plots, summary statistics, and missing data
    • summarize, group_by, count, drop_na
  • Create new variables
    • mutate

Activity: practice with data wrangling functions on the building energy efficiency data.

32 / 32
