Activity 3A: Introduction to Data Wrangling
The Goal:
We have been working on ways to explore the data using R. In Activity 1, we learned about summaries and tables. In Activity 2, we learned about creating visualizations using ggplot2. In this activity, we are going to focus on data wrangling: manipulating, summarizing, and transforming data for exploratory data analysis and statistical modeling.
In this activity, we will cover several important skills that are useful in data wrangling:
- Creating a subset of the data. By focusing initially on a subset, we can make exploration more manageable. Then we can explore the rest of the data later.
- Checking for missing data. Missing data can be a problem for statistical analyses, so we need to know whether any data is missing.
- Calculating summary statistics. Summary statistics are useful for describing the variables in our data. We usually begin with the variables that seem most important.
- Creating new variables. Sometimes, the question we’re interested in requires us to create new variables from some combination of the existing variables.
This activity will introduce data wrangling functions from the dplyr
and tidyr
R packages.
Setup
R packages
For this activity you will need the dplyr
and tidyr
packages. Each of these packages is part of the tidyverse, a collection of R packages with a common philosophy for data analysis. If you don’t have the tidyverse
package installed, go to your R console and install it first with install.packages("tidyverse")
(remember you only need to do this once).
Once the tidyverse
package is installed, load it into R with library(tidyverse)
.
Data
In this activity we will work with data on 344 penguins recorded near Palmer Station, Antarctica. Variables include
species
: penguin’s species (Adelie, Chinstrap, Gentoo)island
: island where penguin measured (Biscoe, Dream, Torgersen)bill_length_mm
: penguin’s bill length (mm)bill_depth_mm
: penguin’s bill depth (mm)flipper_length_mm
: penguin’s flipper length (mm)body_mass_g
: penguin’s body mass (g)sex
: penguin’s sex (female, male)year
: year when data recorded (2007, 2008, 2009)
This information is contained in the penguins
dataset, which is part of the palmerpenguins
R package. If you don’t have the palmerpenguins
package installed, go to your R console and install it first with install.packages("palmerpenguins")
(remember you only need to do this once).
Once the palmerpenguins
package is installed, load it into R with library(palmerpenguins)
.
Activity
Looking at the data
To begin, let’s look at our data. The glimpse
function is useful for taking a peek at a dataset.
Code
In your R console, run the following:
glimpse(penguins)
What does this output tell us?
- The number of rows (i.e., the number of penguins): 344
- The number of columns (i.e., the number of variables recorded for each penguin): 8
- The names of each variable (e.g.,
species
,island
, etc.), and the first few observations in each variable (Adelie, Torgersen, etc.)
Question 1
What is the bill length for the first penguin in the dataset? What species is that penguin?
Now look at the bill length for the fourth penguin. Instead of a number, you get NA
. What does NA
mean?
In R, NA
stands for “Not Available”, and it means that this value is missing in our data. Missing data can be a problem, because R doesn’t know how to handle missing values when we calculate summary statistics or fit models.
Handling missing data
A simple way of dealing with missing data is to remove any rows which contain missing values. How do we do this?
Code
In your R console, run the following code:
<- penguins %>%
penguins drop_na()
What’s going on in this code? First, we take the penguins data:
penguins
Next, we want to remove missing values:
%>%
penguins drop_na()
- The
drop_na()
function says “remove any rows with missing values” - The
%>%
means “Take<THIS>
, then do<THAT>
”. Sopenguins %>% drop_na()
means “takepenguins
, then remove rows with missing values”
Finally, we need to save the result. The <-
is like “Save As…”, and means “take what is on the right hand side, and save it as the left hand side”. So penguins <- penguins %>% drop_na()
means “modify penguins
to remove rows with missing values”.
Question 2
How many rows did we remove by dropping missing values?
Making a subset of data
We have three species of penguin: Adelie, Chinstrap, and Gentoo. Ultimately, we might want to compare characteristics between these different species, but having groups in the data can make initial exploration more challenging. A good strategy is to begin by focusing on just one group. In this case, we will focus on just one species of penguin for now.
Let’s focus on the Chinstrap penguins. How do we do that? We can use the filter
function, which keeps only the rows which satisfy a specified condition.
Code
In your R console, run the following code:
<- penguins %>%
chinstrap_penguins filter(species == "Chinstrap")
What’s going on in this code? Remember that the pipe %>%
means “take <THIS>
, then do <THAT>
”. So we take the penguins
data, then filter so that only the Chinstrap penguins are left (species == "Chinstrap"
). Finally, we save the result as a new dataset, which we call chinstrap_penguins
.
- Note that we use two equals signs (
==
) when we are checking whether species is Chinstrap - Because Chinstrap is a word (rather than a number), we need to put it in quotes (
"Chinstrap"
) for R to interpret the code correctly.
Question 3
How many Chinstrap penguins are in our data (after removing missing values)?
Summary statistics
Now that we have a subset of data, let’s explore some variables! One part of exploring variables is summary statistics, like the mean, standard deviation, median, and IQR. Let’s start with the mean bill length, which is calculated with the mean
function in R.
Code
In your R console, run the following code:
%>%
chinstrap_penguins summarize(mean_bill_length = mean(bill_length_mm))
This tells us that the mean bill length for Chinstrap penguins in the data is 48.8 mm. How does the code work?
- The
summarize
function mean “calculate summary statistics” - Inside the
summarize
function, we calculate the statistics we want - In this case, we want the mean bill length (
mean(bill_length_mm)
) - Finally, we give our summary statistic a meaningful name:
mean_bill_length
(we could call it whatever we want)
Question 4
Modify the code above to calculate the mean body mass for Chinstrap penguins.
Question 5
Modify the code above to calculate the median body mass for Chinstrap penguins. Hint: the median
function calculates medians in R
Question 6
Calculate the mean bill length for Adelie penguins.
Comparing groups
In Question 6, you calculated the mean bill length for Adelie penguins. But to do that, we had to first subset the data to pull out the Adelies. Is there a way to compare statistics between groups, without lots of subsetting?
Fortunately, there is! The group_by
function allows us to group our data before calculating summary statistics.
Code
In your R console, run the following code:
%>%
penguins group_by(species) %>%
summarize(mean_bill_length = mean(bill_length_mm))
Now we get the average bill length for each species. What’s going on in this code?
- First, notice that we can chain our pipes
%>%
together. The output of one line is the input to the next line. This code means “take our penguins, THEN group by species, THEN calculate summary statistics” - The
group_by
function creates groups based on the values of one or more variables. Here, our groups are defined by thespecies
variable, so we get one group per species. After grouping, summary statistics are calculated separately for each group.
Question 7
Calculate the mean body mass for each species of penguin.
Question 8
Calculate the mean body mass for each sex.
Question 9
Calculate the mean body mass for each species and sex (so we get the mass for male Adelies, female Adelies, etc.).
Creating new variables
Finally, in exploring our data we may want to create new variables. For example, suppose we care about the ratio of body mass to flipper length. We can use the mutate
function to create new variables.
Code
In your R console, run the following code:
<- penguins %>%
penguins mutate(bf_ratio = body_mass_g/flipper_length_mm)
Now glimpse
your penguins
data, and confirm that the new bf_ratio
variable appears.
What’s going on in this code?
mutate
creates a new variable- We call this new variable
bf_ratio
bf_ratio
is defined asbody_mass_g/flipper_length_mm
Question 10
Calculate the median ratio between body mass and flipper length for each species.
Question 11
Suppose instead we are interested in the difference between bill length and bill depth. What is the maximum difference between bill length and bill depth, in each species? Hint: the max
function calculates the maximum
Summary
- Work with a smaller, manageable subset of the data
filter
- Remove missing data
drop_na
- Calculate summary statistics
summarize
andgroup_by
- Create new variables
mutate
This work was created by Ciaran Evans and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 March 26.