Activity 3A: Introduction to Data Wrangling
The Goal:
We have been working on ways to explore the data using R. In Activity 1, we learned about summaries and tables. In Activity 2, we learned about creating visualizations using ggplot2. In this activity, we are going to focus on data wrangling: manipulating, summarizing, and transforming data for exploratory data analysis and statistical modeling.
In this activity, we will cover several important skills that are useful in data wrangling:
- Creating a subset of the data. By focusing initially on a subset, we can make exploration more manageable. Then we can explore the rest of the data later.
- Checking for missing data. Missing data can be a problem for statistical analyses, so we need to know whether any data is missing.
- Calculating summary statistics. Summary statistics are useful for describing the variables in our data. We usually begin with the variables that seem most important.
- Creating new variables. Sometimes, the question we’re interested in requires us to create new variables from some combination of the existing variables.
This activity will introduce data wrangling functions from the dplyr and tidyr R packages.
Setup
R packages
For this activity you will need the dplyr and tidyr packages. Each of these packages is part of the tidyverse, a collection of R packages with a common philosophy for data analysis. If you don’t have the tidyverse package installed, go to your R console and install it first with install.packages("tidyverse") (remember you only need to do this once).
Once the tidyverse package is installed, load it into R with library(tidyverse).
Data
In this activity we will work with data on 344 penguins recorded near Palmer Station, Antarctica. Variables include:
- species: penguin’s species (Adelie, Chinstrap, Gentoo)
- island: island where penguin measured (Biscoe, Dream, Torgersen)
- bill_length_mm: penguin’s bill length (mm)
- bill_depth_mm: penguin’s bill depth (mm)
- flipper_length_mm: penguin’s flipper length (mm)
- body_mass_g: penguin’s body mass (g)
- sex: penguin’s sex (female, male)
- year: year when data recorded (2007, 2008, 2009)
This information is contained in the penguins dataset, which is part of the palmerpenguins R package. If you don’t have the palmerpenguins package installed, go to your R console and install it first with install.packages("palmerpenguins") (remember you only need to do this once).
Once the palmerpenguins package is installed, load it into R with library(palmerpenguins).
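The setup steps above can be collected into one short script. This is only a sketch: it assumes you are happy to install from your default CRAN mirror, and it uses requireNamespace() to check whether a package is already installed so that install.packages() runs only when needed.

```r
# Install each package only if it is not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins")
}

# Load the packages (this part is needed in every new R session)
library(tidyverse)
library(palmerpenguins)
```

Installing is a one-time step, but the two library() calls must be run again each time you restart R.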
Activity
Looking at the data
To begin, let’s look at our data. The glimpse function is useful for taking a peek at a dataset.
Code
In your R console, run the following:
glimpse(penguins)
What does this output tell us?
- The number of rows (i.e., the number of penguins): 344
- The number of columns (i.e., the number of variables recorded for each penguin): 8
- The names of each variable (e.g., species, island, etc.), and the first few observations in each variable (Adelie, Torgersen, etc.)
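If you want to double-check the numbers that glimpse reports, base R has functions that return each piece separately. The following is a small sketch, assuming the palmerpenguins package is installed.

```r
library(palmerpenguins)  # provides the penguins dataset

# Number of rows and number of columns, in that order
dim(penguins)

# The variable names
names(penguins)
```

The first value from dim() should match the number of penguins, and the second the number of variables.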
Question 1
What is the bill length for the first penguin in the dataset? What species is that penguin?
Now look at the bill length for the fourth penguin. Instead of a number, you get NA. What does NA mean?
In R, NA stands for “Not Available”, and it means that this value is missing in our data. Missing data can be a problem, because many R functions need to be told what to do with missing values. For example, by default the mean function returns NA if any of the values it is given are missing.
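To see how missing values behave, here is a small sketch on a made-up dataset. The toy data and its values are hypothetical and exist only for illustration; is.na(), mean(), and across() are standard R and dplyr functions.

```r
library(dplyr)

# Hypothetical toy data: three measurements, one of them missing
toy <- tibble(
  id = c(1, 2, 3),
  bill_length_mm = c(39.1, NA, 40.3)
)

# With a missing value present, mean() returns NA by default
mean(toy$bill_length_mm)

# is.na() flags missing values; summing the flags counts them
sum(is.na(toy$bill_length_mm))

# Count the missing values in every column at once
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# One alternative: ask mean() to skip missing values
mean(toy$bill_length_mm, na.rm = TRUE)
```

Counting missing values per column is a quick way to check whether any data is missing before deciding how to handle it.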
Handling missing data
A simple way of dealing with missing data is to remove any rows which contain missing values. How do we do this?
Code
In your R console, run the following code:
penguins <- penguins %>%
  drop_na()
What’s going on in this code? First, we take the penguins data:
penguins
Next, we want to remove missing values:
penguins %>%
  drop_na()
- The drop_na() function says “remove any rows with missing values”
- The %>% means “Take <THIS>, then do <THAT>”. So penguins %>% drop_na() means “take penguins, then remove rows with missing values”
Finally, we need to save the result. The <- is like “Save As…”, and means “take what is on the right hand side, and save it as the left hand side”. So penguins <- penguins %>% drop_na() means “modify penguins to remove rows with missing values”.
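The three pieces above (the data, the pipe, and the assignment) can be tried out on a tiny made-up dataset, where it is easy to see which rows disappear. The toy data here is hypothetical; drop_na() with a column name is a tidyr feature that restricts the check to that column.

```r
library(dplyr)
library(tidyr)  # provides drop_na()

# Hypothetical toy data with missing values in two different columns
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, NA, 46.1, 50.0),
  sex = c("male", "female", NA, "female")
)

# Remove every row that has a missing value in any column,
# and save the result under a new name
toy_complete <- toy %>%
  drop_na()

# The pipe is just another way of writing the function call
identical(toy %>% drop_na(), drop_na(toy))

# Comparing row counts shows how many rows were dropped
nrow(toy) - nrow(toy_complete)

# Only drop rows where bill_length_mm is missing; a missing sex is kept
toy_bill_known <- toy %>%
  drop_na(bill_length_mm)
nrow(toy_bill_known)
```

Saving under a new name, as in toy_complete, keeps the original data available for comparison. Overwriting, as the activity does with penguins, is simpler but discards the original rows for the rest of the session.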
Question 2
How many rows did we remove by dropping missing values?
Making a subset of data
We have three species of penguin: Adelie, Chinstrap, and Gentoo. Ultimately, we might want to compare characteristics between these different species, but having groups in the data can make initial exploration more challenging. A good strategy is to begin by focusing on just one group. In this case, we will focus on just one species of penguin for now.
Let’s focus on the Chinstrap penguins. How do we do that? We can use the filter function, which keeps only the rows which satisfy a specified condition.
Code
In your R console, run the following code:
chinstrap_penguins <- penguins %>%
  filter(species == "Chinstrap")
What’s going on in this code? Remember that the pipe %>% means “take <THIS>, then do <THAT>”. So we take the penguins data, then filter so that only the Chinstrap penguins are left (species == "Chinstrap"). Finally, we save the result as a new dataset, which we call chinstrap_penguins.
- Note that we use two equals signs (==) when we are checking whether species is Chinstrap
- Because Chinstrap is a word (rather than a number), we need to put it in quotes ("Chinstrap") for R to interpret the code correctly.
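The filter function accepts other kinds of conditions too. The sketch below uses a hypothetical toy dataset, and the 4000 g cutoff is an arbitrary value chosen for illustration. Notice that the number is written without quotes, while the species names are written with them.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
  body_mass_g = c(3750, 3500, 4100, 5000)
)

# One condition: keep only the Chinstrap rows
chinstraps <- toy %>%
  filter(species == "Chinstrap")

# Two conditions separated by a comma must both be true
heavy_chinstraps <- toy %>%
  filter(species == "Chinstrap", body_mass_g > 4000)

# %in% keeps rows whose value matches any entry of a vector
adelie_or_gentoo <- toy %>%
  filter(species %in% c("Adelie", "Gentoo"))

# nrow() counts the rows that were kept
nrow(chinstraps)
```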
Question 3
How many Chinstrap penguins are in our data (after removing missing values)?
Summary statistics
Now that we have a subset of data, let’s explore some variables! One part of exploring variables is summary statistics, like the mean, standard deviation, median, and IQR. Let’s start with the mean bill length, which is calculated with the mean function in R.
Code
In your R console, run the following code:
chinstrap_penguins %>%
  summarize(mean_bill_length = mean(bill_length_mm))
This tells us that the mean bill length for Chinstrap penguins in the data is 48.8 mm. How does the code work?
- The summarize function means “calculate summary statistics”
- Inside the summarize function, we calculate the statistics we want
- In this case, we want the mean bill length (mean(bill_length_mm))
- Finally, we give our summary statistic a meaningful name: mean_bill_length (we could call it whatever we want)
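A single call to summarize can compute several statistics at once, each with its own name. Here is a minimal sketch on hypothetical toy data, using the base R functions sd() and IQR() and the dplyr function n(), which counts rows. The column names on the left of each = are our own choice.

```r
library(dplyr)

# Hypothetical toy data: five flipper lengths
toy <- tibble(flipper_length_mm = c(190, 195, 200, 205, 210))

flipper_summary <- toy %>%
  summarize(
    mean_flipper = mean(flipper_length_mm),  # mean
    sd_flipper = sd(flipper_length_mm),      # standard deviation
    iqr_flipper = IQR(flipper_length_mm),    # interquartile range
    n_penguins = n()                         # number of rows
  )
flipper_summary
```

The result is a one-row table with one column per statistic, so it is easy to read several summaries side by side.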
Question 4
Modify the code above to calculate the mean body mass for Chinstrap penguins.
Question 5
Modify the code above to calculate the median body mass for Chinstrap penguins. Hint: the median function calculates medians in R
Question 6
Calculate the mean bill length for Adelie penguins.
Comparing groups
In Question 6, you calculated the mean bill length for Adelie penguins. But to do that, you first had to subset the data to pull out the Adelies. Is there a way to compare statistics between groups, without lots of subsetting?
Fortunately, there is! The group_by function allows us to group our data before calculating summary statistics.
Code
In your R console, run the following code:
penguins %>%
  group_by(species) %>%
  summarize(mean_bill_length = mean(bill_length_mm))
Now we get the average bill length for each species. What’s going on in this code?
- First, notice that we can chain our pipes %>% together. The output of one line is the input to the next line. This code means “take our penguins, THEN group by species, THEN calculate summary statistics”
- The group_by function creates groups based on the values of one or more variables. Here, our groups are defined by the species variable, so we get one group per species. After grouping, summary statistics are calculated separately for each group.
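To convince ourselves that grouping really does calculate the statistics separately for each group, we can compare a grouped summary with the same statistic computed the long way. This sketch uses hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data: two penguins from each of two species
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  flipper_length_mm = c(180, 190, 210, 220)
)

# Grouped summary: one row per species
by_species <- toy %>%
  group_by(species) %>%
  summarize(
    mean_flipper = mean(flipper_length_mm),
    n_penguins = n()
  )
by_species

# The Adelie mean computed the long way, by subsetting first
adelie_only <- toy %>%
  filter(species == "Adelie") %>%
  summarize(mean_flipper = mean(flipper_length_mm))
adelie_only
```

The Adelie row of by_species matches adelie_only, but the grouped version gives every species in one step.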
Question 7
Calculate the mean body mass for each species of penguin.
Question 8
Calculate the mean body mass for each sex.
Question 9
Calculate the mean body mass for each species and sex (so we get the mass for male Adelies, female Adelies, etc.).
Creating new variables
Finally, in exploring our data we may want to create new variables. For example, suppose we care about the ratio of body mass to flipper length. We can use the mutate function to create new variables.
Code
In your R console, run the following code:
penguins <- penguins %>%
  mutate(bf_ratio = body_mass_g/flipper_length_mm)
Now glimpse your penguins data, and confirm that the new bf_ratio variable appears.
What’s going on in this code?
- mutate creates a new variable
- We call this new variable bf_ratio
- bf_ratio is defined as body_mass_g/flipper_length_mm
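A single mutate call can create more than one variable, and a later variable can use one created earlier in the same call. The sketch below uses hypothetical toy data; the variable names body_mass_kg and is_heavy and the 4 kg cutoff are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3700, 5000)
)

toy_kg <- toy %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,  # convert grams to kilograms
    is_heavy = body_mass_kg > 4         # uses the variable created just above
  )
toy_kg
```

As with drop_na and filter, mutate does not change the original data by itself. The new columns exist only in the result, so we save it with <- if we want to keep them.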
Question 10
Calculate the median ratio between body mass and flipper length for each species.
Question 11
Suppose instead we are interested in the difference between bill length and bill depth. What is the maximum difference between bill length and bill depth, in each species? Hint: the max function calculates the maximum
Summary
- Work with a smaller, manageable subset of the data: filter
- Remove missing data: drop_na
- Calculate summary statistics: summarize and group_by
- Create new variables: mutate
This work was created by Ciaran Evans and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 March 26.