class: center, middle, inverse, title-slide

# Data Wrangling

---

### Goal

.large[
.question[
Learn/review some functions for manipulating and summarizing data in R.
]
]

---

### Penguins data

.large[
Data on 344 penguins near Palmer Station, Antarctica.
]

.center[
<img src="penguin_art.png" width="600">
]

.footnote[
Artwork by @allison_horst
]

---

### Penguins data

.large[
Data on 344 penguins near Palmer Station, Antarctica. Variables include:

* `species`: penguin's species (Adelie, Chinstrap, Gentoo)
* `island`: island where the penguin was measured (Biscoe, Dream, Torgersen)
* `bill_length_mm`: penguin's bill length (mm)
* `bill_depth_mm`: penguin's bill depth (mm)
* `flipper_length_mm`: penguin's flipper length (mm)
* `body_mass_g`: penguin's body mass (g)
* `sex`: penguin's sex (female, male)
* `year`: year when the data were recorded (2007, 2008, 2009)
]

--

.large[
.question[
We get this data set -- where do we start?
]
]

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
* Plots, summary statistics, and missing data
* Create new variables

.question[
**Data wrangling:** Manipulating, summarizing, and transforming data.
]
]

---

### Tools for data wrangling

.large[
.pull-left[
.center[
<img src="dplyr_logo.png" width = "200px">
]
]

.pull-right[
* part of the tidyverse
* provides a "grammar of data manipulation": useful verbs (functions) for manipulating data
* we will cover a few key `dplyr` functions
]
]

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]
]

--

.large[
**Step 1:** What data do I start with?
```r
penguins
```

```
## # A tibble: 344 × 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
##  1 Adelie    Dream               38.8          20                190        3950
##  2 Gentoo    Biscoe              47.5          15                218        4950
##  3 Adelie    Dream               36.2          17.3              187        3300
##  4 Gentoo    Biscoe              45.1          14.5              207        5050
##  5 Chinstrap Dream               45.2          16.6              191        3250
##  6 Adelie    Torgersen           36.2          17.2              187        3150
##  7 Gentoo    Biscoe              49.3          15.7              217        5850
##  8 Adelie    Biscoe              41.1          18.2              192        4050
##  9 Adelie    Torgersen           44.1          18                210        4000
## 10 Chinstrap Dream               52            20.7              210        4800
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```
]

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

**Step 2:** What do I do to that data?

```r
penguins %>%
  filter(species == "Chinstrap")
```
]

```
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            45.2          16.6               191        3250
##  2 Chinstrap Dream            52            20.7               210        4800
##  3 Chinstrap Dream            54.2          20.8               201        4300
##  4 Chinstrap Dream            42.5          17.3               187        3350
##  5 Chinstrap Dream            45.5          17                 196        3500
##  6 Chinstrap Dream            50.2          18.8               202        3800
##  7 Chinstrap Dream            50.8          18.5               201        4450
##  8 Chinstrap Dream            50.5          18.4               200        3400
##  9 Chinstrap Dream            50.5          19.6               201        4050
## 10 Chinstrap Dream            46.5          17.9               192        3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
```

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

**Step 2:** What do I do to that data?

```r
penguins %>%
  filter(species == "Chinstrap")
```

* `%>%` is called the *pipe*.
It means "take `<this>`, THEN do `<that>`"
* `filter` keeps only the rows which satisfy a specific condition
]

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

**Step 3:** What do I do with the result?

```r
chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")
```

* `<-` is the *assignment* operator. It means "save the result in R"
* Here we create a new data frame called `chinstrap_penguins` that contains just the Chinstraps
]

---

### Making a subset of the data

.large[
.question[
We have three species of penguin (Adelie, Chinstrap, Gentoo). Let's make a subset with just the Chinstrap penguins.
]

```r
chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")

chinstrap_penguins
```
]

```
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            45.2          16.6               191        3250
##  2 Chinstrap Dream            52            20.7               210        4800
##  3 Chinstrap Dream            54.2          20.8               201        4300
##  4 Chinstrap Dream            42.5          17.3               187        3350
##  5 Chinstrap Dream            45.5          17                 196        3500
##  6 Chinstrap Dream            50.2          18.8               202        3800
##  7 Chinstrap Dream            50.8          18.5               201        4450
##  8 Chinstrap Dream            50.5          18.4               200        3400
##  9 Chinstrap Dream            50.5          19.6               201        4050
## 10 Chinstrap Dream            46.5          17.9               192        3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
```

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
* Plots, summary statistics, and missing data
* Create new variables
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Chinstrap penguins?
]
]

--

.large[
**Step 1:** What data do I start with?
```r
chinstrap_penguins
```

```
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            45.2          16.6               191        3250
##  2 Chinstrap Dream            52            20.7               210        4800
##  3 Chinstrap Dream            54.2          20.8               201        4300
##  4 Chinstrap Dream            42.5          17.3               187        3350
##  5 Chinstrap Dream            45.5          17                 196        3500
##  6 Chinstrap Dream            50.2          18.8               202        3800
##  7 Chinstrap Dream            50.8          18.5               201        4450
##  8 Chinstrap Dream            50.5          18.4               200        3400
##  9 Chinstrap Dream            50.5          19.6               201        4050
## 10 Chinstrap Dream            46.5          17.9               192        3500
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
```
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Chinstrap penguins?
]

**Step 2:** What do I do to that data?

```r
chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3733.
```

.large[
* `%>%` is called the *pipe*. It means "take `<this>`, THEN do `<that>`"
* `summarize` is used to calculate summary statistics
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Chinstrap penguins?
]

**Step 3:** (optional) Do I want to save the result?

```r
chinstrap_summary <- chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Chaining pipes together

.large[
If we don't care about the intermediate results, we can chain pipes (`%>%`) together. These two chunks calculate the same summary statistics.

**Option 1:**

```r
chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")

chinstrap_penguins %>%
  summarize(avg_mass = mean(body_mass_g))
```

**Option 2:**

```r
penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Calculating summary statistics

.large[
.question[
What if I want the average body mass for Adelie penguins?
]

```r
penguins %>%
  filter(species == "Chinstrap") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Calculating summary statistics

.large[
.question[
What if I want the average body mass for Adelie penguins?
]

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for Adelie penguins?
]

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1       NA
```

.large[
.question[
What does a result of `NA` mean?
]
]

--

.large[
* `NA` means "Not Available"
* We get `NA` when there are missing values
]

---

### Handling missing values

.large[
.question[
What is the average body mass for Adelie penguins?
]

**Option 1:**

```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3701.
```

.large[
* Use `filter` to focus only on Adelie penguins
* `summarize` is used to calculate summary statistics
* `na.rm=TRUE` means "ignore missing values"
]

---

### Handling missing values

.large[
.question[
What is the average body mass for Adelie penguins?
]

**Option 2:**

```r
penguins %>%
  filter(species == "Adelie") %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))
```
]

```
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    3706.
```

.large[
* `drop_na` means "remove any rows with missing values in any columns"
]

---

### Handling missing values

.large[
```r
penguins %>%
  filter(species == "Adelie") %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
```

`3701.`

```r
penguins %>%
  filter(species == "Adelie") %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))
```

`3706.`

.question[
Why do these chunks give different numbers?
]
]

---

### Handling missing values

.large[
* `drop_na` removes *all* rows with a missing value in any column (not just rows missing `body_mass_g`)
* Dropping rows this way is reasonable if only a small number of rows are removed
* When you have missing values, check how much data is missing
]

---

### Calculating summary statistics

.large[
.question[
What is the average body mass for *each* species of penguin?
]
]

--

.large[
```r
penguins %>%
  group_by(species) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))
```

```
## # A tibble: 3 × 2
##   species   avg_mass
##   <fct>        <dbl>
## 1 Adelie       3701.
## 2 Chinstrap    3733.
## 3 Gentoo       5076.
```
]

--

.large[
* `group_by` is used to group rows together
* We often use `group_by` before `summarize`
]

---

### Calculating summary statistics

.large[
.question[
What are the mean and standard deviation of body mass, for each species and sex?
]
]

.large[
```r
penguins %>%
  group_by(species, sex) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE),
            sd_mass = sd(body_mass_g, na.rm=TRUE))
```

```
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
```

```
## # A tibble: 8 × 4
## # Groups:   species [3]
##   species   sex    avg_mass sd_mass
##   <fct>     <fct>     <dbl>   <dbl>
## 1 Adelie    female    3369.    269.
## 2 Adelie    male      4043.    347.
## 3 Adelie    <NA>      3540     477.
## 4 Chinstrap female    3527.    285.
## 5 Chinstrap male      3939.    362.
## 6 Gentoo    female    4680.    282.
## 7 Gentoo    male      5485.    313.
## 8 Gentoo    <NA>      4588.    338.
```
]

---

### Calculating summary statistics

.large[
.question[
How many penguins of each species and sex are there?
]
]

.large[
```r
penguins %>%
  count(species, sex)
```

```
## # A tibble: 8 × 3
##   species   sex        n
##   <fct>     <fct>  <int>
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Adelie    <NA>       6
## 4 Chinstrap female    34
## 5 Chinstrap male      34
## 6 Gentoo    female    58
## 7 Gentoo    male      61
## 8 Gentoo    <NA>       5
```
]

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
* Plots, summary statistics, and missing data
* Create new variables
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]
]

--

.large[
**Step 1:** Create a new variable

```r
penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
```
]

.large[
```
...
##   species   island    body_mass_g flipper_length_mm bf_ratio
##   <fct>     <fct>           <int>             <int>    <dbl>
## 1 Adelie    Dream            3950               190     20.8
## 2 Gentoo    Biscoe           4950               218     22.7
## 3 Adelie    Dream            3300               187     17.6
## 4 Gentoo    Biscoe           5050               207     24.4
## 5 Chinstrap Dream            3250               191     17.0
## 6 Adelie    Torgersen        3150               187     16.8
## 7 Gentoo    Biscoe           5850               217     27.0
...
```
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]

**Step 1:** Create a new variable

```r
penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
```

* `mutate` creates a new column in your dataset
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]

**Step 2:** Save the data with the new column

```r
penguins_new <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
```
]

---

### Creating new variables

.large[
.question[
What's the distribution of the ratio of body mass to flipper length?
]

**Step 3:** Exploratory data analysis
]

.large[
```r
penguins_new <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)

penguins_new %>%
  group_by(species) %>%
  summarize(mean_ratio = mean(bf_ratio, na.rm=TRUE))
```

```
## # A tibble: 3 × 2
##   species   mean_ratio
##   <fct>          <dbl>
## 1 Adelie          19.5
## 2 Chinstrap       19.0
## 3 Gentoo          23.3
```
]

---

### Starting points for data analysis

.large[
* Work with a smaller, manageable subset of the data
    * `filter`
* Plots, summary statistics, and missing data
    * `summarize`, `group_by`, `count`, `drop_na`
* Create new variables
    * `mutate`

.question[
**Activity:** practice with data wrangling functions on the building energy efficiency data.
]
]
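
---

### Handling missing values: a toy example

.large[
To see why `na.rm=TRUE` and `drop_na` can give different averages, here is a minimal sketch on a small made-up data frame. The data frame `toy` and its values are hypothetical (they are not the penguins data), and the sketch assumes the `dplyr` and `tidyr` packages are installed.
]

```r
library(dplyr)  # summarize() and the %>% pipe
library(tidyr)  # drop_na()

# Hypothetical toy data (made-up values, NOT the Palmer penguins data):
# row 2 is missing sex but has a body mass; row 4 is missing body mass.
toy <- data.frame(
  sex         = c("female", NA, "male", "female"),
  body_mass_g = c(3000, 3900, 4200, NA)
)

# Option 1: ignore missing values in body_mass_g only
mean_na_rm <- toy %>%
  summarize(avg_mass = mean(body_mass_g, na.rm=TRUE))

# Option 2: first drop every row with a missing value in any column
mean_drop_na <- toy %>%
  drop_na() %>%
  summarize(avg_mass = mean(body_mass_g))

mean_na_rm    # averages rows 1, 2, 3: 3700
mean_drop_na  # averages rows 1 and 3 only: 3600
```

.large[
* Option 1 keeps row 2, because its `body_mass_g` is present
* Option 2 drops row 2, because its `sex` is missing, so the average changes
]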