Learn/review some functions for manipulating and summarizing data in R.
Data on 344 penguins near Palmer Station, Antarctica.
Artwork by @allison_horst
Variables include:
species: penguin's species (Adelie, Chinstrap, Gentoo)
island: island where penguin measured (Biscoe, Dream, Torgersen)
bill_length_mm: penguin's bill length (mm)
bill_depth_mm: penguin's bill depth (mm)
flipper_length_mm: penguin's flipper length (mm)
body_mass_g: penguin's body mass (g)
sex: penguin's sex (female, male)
year: year when data recorded (2007, 2008, 2009)
We get this data set. Where do we start?
Data wrangling: Manipulating, summarizing, and transforming data.
dplyr functions
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
Step 1: What data do I start with?
penguins
## # A tibble: 344 × 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
##  1 Adelie    Dream               38.8          20                190        3950
##  2 Gentoo    Biscoe              47.5          15                218        4950
##  3 Adelie    Dream               36.2          17.3              187        3300
##  4 Gentoo    Biscoe              45.1          14.5              207        5050
##  5 Chinstrap Dream               45.2          16.6              191        3250
##  6 Adelie    Torgersen           36.2          17.2              187        3150
##  7 Gentoo    Biscoe              49.3          15.7              217        5850
##  8 Adelie    Biscoe              41.1          18.2              192        4050
##  9 Adelie    Torgersen           44.1          18                210        4000
## 10 Chinstrap Dream               52            20.7              210        4800
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Step 2: What do I do to that data?
penguins %>% filter(species == "Chinstrap")
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            45.2          16.6               191        3250
##  2 Chinstrap Dream            52            20.7               210        4800
##  3 Chinstrap Dream            54.2          20.8               201        4300
##  4 Chinstrap Dream            42.5          17.3               187        3350
##  5 Chinstrap Dream            45.5          17                 196        3500
##  6 Chinstrap Dream            50.2          18.8               202        3800
##  7 Chinstrap Dream            50.8          18.5               201        4450
##  8 Chinstrap Dream            50.5          18.4               200        3400
##  9 Chinstrap Dream            50.5          19.6               201        4050
## 10 Chinstrap Dream            46.5          17.9               192        3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
%>% is called the pipe. It means "take <this>, THEN do <that>"
filter keeps only the rows which satisfy a specific condition
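A minimal sketch of what the pipe does, using a small made-up data frame (`toy` and its values are invented for illustration and are not the penguins data). Piping `toy` into `filter` should give the same result as passing `toy` as the first argument.

```r
library(dplyr)

# Hypothetical toy data, not the real penguins data
toy <- tibble(
  species     = c("Adelie", "Chinstrap", "Gentoo", "Chinstrap"),
  body_mass_g = c(3700, 3500, 5000, 3900)
)

# "take toy, THEN keep only the Chinstrap rows"
piped <- toy %>% filter(species == "Chinstrap")

# The same call written without the pipe
direct <- filter(toy, species == "Chinstrap")

identical(piped, direct)  # do both versions agree?
nrow(piped)               # number of rows kept
```

The piped version reads left to right in the order the steps happen, which is why the slides use it.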
Step 3: What do I do with the result?
chinstrap_penguins <- penguins %>% filter(species == "Chinstrap")
<- is the assignment operator. It means "save the result in R"
This creates a new object, chinstrap_penguins, that contains just the Chinstraps
chinstrap_penguins <- penguins %>% filter(species == "Chinstrap")
chinstrap_penguins
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            45.2          16.6               191        3250
##  2 Chinstrap Dream            52            20.7               210        4800
##  3 Chinstrap Dream            54.2          20.8               201        4300
##  4 Chinstrap Dream            42.5          17.3               187        3350
##  5 Chinstrap Dream            45.5          17                 196        3500
##  6 Chinstrap Dream            50.2          18.8               202        3800
##  7 Chinstrap Dream            50.8          18.5               201        4450
##  8 Chinstrap Dream            50.5          18.4               200        3400
##  9 Chinstrap Dream            50.5          19.6               201        4050
## 10 Chinstrap Dream            46.5          17.9               192        3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
What is the average body mass for Chinstrap penguins?
Step 1: What data do I start with?
chinstrap_penguins
Step 2: What do I do to that data?
chinstrap_penguins %>% summarize(avg_mass = mean(body_mass_g))
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3733.
summarize is used to calculate summary statistics
Step 3: (optional) Do I want to save the result?
chinstrap_summary <- chinstrap_penguins %>% summarize(avg_mass = mean(body_mass_g))
If we don't care about the intermediate results, we can chain pipes (%>%) together. These two chunks calculate the same summary statistics.
Option 1:
chinstrap_penguins <- penguins %>% filter(species == "Chinstrap")
chinstrap_penguins %>% summarize(avg_mass = mean(body_mass_g))
Option 2:
penguins %>% filter(species == "Chinstrap") %>% summarize(avg_mass = mean(body_mass_g))
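As a quick check that the two options agree, here is a sketch on a tiny made-up data frame (`toy`, `toy_chinstrap`, `option1`, and `option2` are names invented for this example).

```r
library(dplyr)

# Hypothetical toy data, not the real penguins data
toy <- tibble(
  species     = c("Adelie", "Chinstrap", "Chinstrap"),
  body_mass_g = c(3700, 3500, 3900)
)

# Option 1: save the intermediate subset, then summarize it
toy_chinstrap <- toy %>% filter(species == "Chinstrap")
option1 <- toy_chinstrap %>% summarize(avg_mass = mean(body_mass_g))

# Option 2: chain both steps in one pipeline
option2 <- toy %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))

option1$avg_mass
option2$avg_mass
```

Option 1 is handy when the subset will be reused later. Option 2 avoids creating an object that is only needed once.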
What if I want the average body mass for Adelie penguins?
penguins %>% filter(species == "Adelie") %>% summarize(avg_mass = mean(body_mass_g))
What is the average body mass for Adelie penguins?
penguins %>% filter(species == "Adelie") %>% summarize(avg_mass = mean(body_mass_g))
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1       NA
What does a result of NA mean?
NA means "Not Available"
The result is NA when there are missing values
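A small base R sketch of how a single missing value affects a mean (the vector `mass` holds made-up numbers, not values from the penguins data).

```r
# Made-up body masses with one missing value
mass <- c(3250, NA, 4800)

is.na(mass)       # TRUE marks the missing entry
sum(is.na(mass))  # how many values are missing
mean(mass)        # the mean of anything that includes NA is itself NA
```

Counting the missing values first, as with `sum(is.na(mass))`, is one way to see how much data is affected before deciding how to handle it.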
Option 1:
penguins %>% filter(species == "Adelie") %>% summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3701.
Use filter to focus only on Adelie penguins
summarize is used to calculate summary statistics
na.rm=TRUE means "ignore missing values"
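One possible habit when using `na.rm = TRUE` is to report how many values were ignored alongside the average. A sketch on made-up data (`toy`, `toy_summary`, `n_missing`, and `n_used` are names invented for this example):

```r
library(dplyr)

# Made-up measurements with one missing body mass
toy <- tibble(body_mass_g = c(3000, NA, 4000))

toy_summary <- toy %>%
  summarize(
    avg_mass  = mean(body_mass_g, na.rm = TRUE),  # average of the observed values only
    n_missing = sum(is.na(body_mass_g)),          # how many values were ignored
    n_used    = sum(!is.na(body_mass_g))          # how many values went into the mean
  )

toy_summary
```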
Option 2:
penguins %>% filter(species == "Adelie") %>% drop_na() %>% summarize(avg_mass = mean(body_mass_g))
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3706.
drop_na means "remove any rows with missing values in any column"
penguins %>% filter(species == "Adelie") %>% summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
3701.
penguins %>% filter(species == "Adelie") %>% drop_na() %>% summarize(avg_mass = mean(body_mass_g))
3706.
Why do these chunks give different numbers?
drop_na removes all rows with missing values (not just missing values in body_mass_g)
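The difference can be reproduced on a tiny made-up data frame where one row has a body mass but no recorded sex. This sketch assumes `drop_na` is loaded from the tidyr package. Naming a column in `drop_na` restricts the check to that column. All data values and object names below are invented for illustration.

```r
library(dplyr)
library(tidyr)  # drop_na() comes from tidyr

# Made-up data: row 2 is missing sex, row 3 is missing body mass
toy <- tibble(
  body_mass_g = c(3000, 4200, NA, 5000),
  sex         = c("female", NA, "male", "male")
)

# Ignore only the missing body mass (3 values used)
with_narm <- toy %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE))

# Drop rows with a missing value in ANY column (2 rows left)
with_drop_all <- toy %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))

# Drop rows only where body_mass_g is missing (3 rows left)
with_drop_col <- toy %>%
  drop_na(body_mass_g) %>%
  summarize(avg_mass = mean(body_mass_g))

with_narm$avg_mass
with_drop_all$avg_mass
with_drop_col$avg_mass
```

Here the second row is thrown away by `drop_na()` even though its body mass is known, so the average changes.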
What is the average body mass for each species of penguin?
penguins %>% group_by(species) %>% summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
## # A tibble: 3 × 2
##   species   avg_mass
##   <fct>        <dbl>
## 1 Adelie       3701.
## 2 Chinstrap    3733.
## 3 Gentoo       5076.
group_by is used to group rows together
Use group_by before summarize
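To see what `group_by` changes, compare the same `summarize` call with and without it. This is a sketch on made-up data (`toy`, `overall`, and `by_species` are names invented for this example).

```r
library(dplyr)

# Hypothetical toy data, not the real penguins data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3000, 4000, 5000, 5200)
)

# Without group_by: one row summarizing the whole data set
overall <- toy %>%
  summarize(avg_mass = mean(body_mass_g))

# With group_by: one row per species
by_species <- toy %>%
  group_by(species) %>%
  summarize(avg_mass = mean(body_mass_g))

overall
by_species
```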
What are the mean and standard deviation of body mass, for each species and sex?
penguins %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE),
            sd_mass = sd(body_mass_g, na.rm = TRUE))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   species [3]
##   species   sex    avg_mass sd_mass
##   <fct>     <fct>     <dbl>   <dbl>
## 1 Adelie    female    3369.    269.
## 2 Adelie    male      4043.    347.
## 3 Adelie    <NA>      3540     477.
## 4 Chinstrap female    3527.    285.
## 5 Chinstrap male      3939.    362.
## 6 Gentoo    female    4680.    282.
## 7 Gentoo    male      5485.    313.
## 8 Gentoo    <NA>      4588.    338.
How many penguins of each species and sex are there?
penguins %>% count(species, sex)
## # A tibble: 8 × 3
##   species   sex        n
##   <fct>     <fct>  <int>
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Adelie    <NA>       6
## 4 Chinstrap female    34
## 5 Chinstrap male      34
## 6 Gentoo    female    58
## 7 Gentoo    male      61
## 8 Gentoo    <NA>       5
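`count` can be thought of as a shortcut for grouping and then counting rows. A sketch on made-up data showing both forms (the data values and the names `toy`, `with_count`, and `with_summarize` are invented for illustration). The `.groups = "drop"` argument returns an ungrouped result, which should also avoid the grouping message that `summarize` prints.

```r
library(dplyr)

# Hypothetical toy data with one missing sex
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  sex     = c("female", "female", NA, "male")
)

# Shortcut: count rows for each combination of species and sex
with_count <- toy %>% count(species, sex)

# Longer form: group, then count rows in each group with n()
with_summarize <- toy %>%
  group_by(species, sex) %>%
  summarize(n = n(), .groups = "drop")

with_count
with_summarize
```

Missing values form their own group in both versions, just like the `<NA>` rows in the penguin counts.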
What's the distribution of the ratio of body mass to flipper length?
Step 1: Create a new variable
penguins %>% mutate(bf_ratio = body_mass_g/flipper_length_mm)
...
##   species   island    body_mass_g flipper_length_mm bf_ratio
##   <fct>     <fct>           <int>             <int>    <dbl>
## 1 Adelie    Dream            3950               190     20.8
## 2 Gentoo    Biscoe           4950               218     22.7
## 3 Adelie    Dream            3300               187     17.6
## 4 Gentoo    Biscoe           5050               207     24.4
## 5 Chinstrap Dream            3250               191     17.0
## 6 Adelie    Torgersen        3150               187     16.8
## 7 Gentoo    Biscoe           5850               217     27.0
...
mutate creates a new column in your dataset
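A sketch on made-up data showing that `mutate` returns a new data frame and leaves the original untouched unless the result is assigned (`toy` and `toy_new` are names invented for this example).

```r
library(dplyr)

# Hypothetical toy data, not the real penguins data
toy <- tibble(
  body_mass_g       = c(3800, 5000),
  flipper_length_mm = c(190, 200)
)

# mutate returns a copy with the extra column
toy_new <- toy %>%
  mutate(bf_ratio = body_mass_g / flipper_length_mm)

names(toy)        # original still has only the two measured columns
names(toy_new)    # new data frame also has bf_ratio
toy_new$bf_ratio
```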
Step 2: Save the data with the new column
penguins_new <- penguins %>% mutate(bf_ratio = body_mass_g/flipper_length_mm)
Step 3: Exploratory data analysis
penguins_new <- penguins %>% mutate(bf_ratio = body_mass_g/flipper_length_mm)
penguins_new %>% group_by(species) %>% summarize(mean_ratio = mean(bf_ratio, na.rm=TRUE))
## # A tibble: 3 × 2
##   species   mean_ratio
##   <fct>          <dbl>
## 1 Adelie          19.5
## 2 Chinstrap       19.0
## 3 Gentoo          23.3
filter
summarize, group_by, count, drop_na
mutate
Activity: practice with data wrangling functions on the building energy efficiency data.