Data Exploration Cheat Sheet
Use: The purpose of this cheat sheet is to serve as a refresher on different types of variables, plots we can use to visualize them, and code for making those plots. We also discuss some summary statistics.
Types of variables
Here we will deal mainly with categorical and quantitative variables.
- Categorical variables are variables which take on one of several fixed values. These values generally do not have a numeric interpretation.
- Examples: gender, favorite food, brand of laptop
- Binary categorical variables have exactly 2 possible values
- Quantitative variables are variables which take on a numeric value, and which have a numeric interpretation.
- Examples: number of pets, height, weight, age
- Discrete quantitative variables only take on discrete values (e.g., number of pets)
- Continuous quantitative variables can take on an entire range of values (e.g., height is continuous if we allow heights like 60.323 inches)
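As a rough illustration of how these variable types usually appear in R, the sketch below builds a small made-up data frame (the name survey and all of its values are hypothetical, not from any dataset used in this cheat sheet). Categorical variables are typically stored as factors, discrete counts as integers, and continuous measurements as doubles.

```r
# Hypothetical toy data: one column per variable type discussed above
survey <- data.frame(
  laptop_brand = factor(c("Dell", "Apple", "Lenovo", "Apple")),  # categorical
  owns_car     = factor(c("yes", "no", "no", "yes")),            # binary categorical
  num_pets     = c(0L, 2L, 1L, 3L),                              # discrete quantitative
  height_in    = c(60.323, 71.5, 65.25, 68.0)                    # continuous quantitative
)

# class() shows how R stores each column
col_classes <- sapply(survey, class)
print(col_classes)

# nlevels() counts the distinct categories of a factor
n_brand_levels <- nlevels(survey$laptop_brand)
n_car_levels <- nlevels(survey$owns_car)
```

Checking classes like this before plotting is a quick way to catch a categorical variable that was accidentally read in as numbers, or the reverse.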
Univariate exploratory data analysis
Here we discuss how to summarize, visualize, and describe the distributions of categorical and quantitative variables.
Categorical variables
Summarize
The number of observations in each group can be summarized in a frequency table:
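# packages assumed for every example here: tidyverse, plus palmerpenguins for the penguins data
library(tidyverse)
library(palmerpenguins)
penguins %>%
  count(species)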
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
Visualize
The same information can be visualized with bar charts, which display the number of observations in each group as the height of the bar:
penguins %>%
  ggplot(aes(x = species)) +
  geom_bar()
Other visualization options include pie charts.
Describe
- Which category has the most observations? The least?
- Are observations spread relatively evenly across categories, or do one or two categories have the majority of the observations?
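To judge how evenly observations are spread, it can help to turn counts into proportions. The sketch below does this in base R, with the species counts typed in by hand from the frequency table above; the object names counts and props are our own.

```r
# Species counts copied from the frequency table above
counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

# Proportion of all observations that fall in each category
props <- counts / sum(counts)
print(round(props, 3))

# Categories with the most and the fewest observations
most_common <- names(which.max(counts))
least_common <- names(which.min(counts))
```

With a data frame, a similar result could be obtained by adding a proportion column after count(), but the arithmetic is the same.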
Quantitative variables
Summarize
Many summary statistics can be calculated for quantitative variables. We often calculate the mean or median to summarize the center of the distribution, and the standard deviation or IQR to summarize the spread of the distribution. If the data are highly skewed, the median and IQR are often more appropriate measures of center and spread.
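As a small illustration of why the median and IQR are often preferred for skewed data, the sketch below uses a made-up vector with one very large value (the numbers are hypothetical and chosen only to make the effect obvious).

```r
# Hypothetical right-skewed data: one value is far larger than the rest
x <- c(1, 2, 2, 3, 3, 4, 50)

center_mean <- mean(x)      # pulled upward by the large value
center_median <- median(x)  # stays near the bulk of the data

spread_sd <- sd(x)          # inflated by the large value
spread_iqr <- IQR(x)        # based only on the middle half of the data

print(c(mean = center_mean, median = center_median,
        sd = spread_sd, iqr = spread_iqr))
```

Here the mean sits well above every value except the largest one, while the median still describes a typical observation.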
Note that if NAs (missing values) are present in the data, we need to handle them before calculating summary statistics; otherwise functions such as mean() return NA. This can be done by removing all rows with NAs (drop_na()), or by ignoring NAs when we calculate the summary statistics (na.rm = TRUE).
penguins %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE),
            median_mass = median(body_mass_g, na.rm = TRUE),
            sd_mass = sd(body_mass_g, na.rm = TRUE),
            iqr_mass = IQR(body_mass_g, na.rm = TRUE))
## # A tibble: 1 × 4
## mean_mass median_mass sd_mass iqr_mass
## <dbl> <dbl> <dbl> <dbl>
## 1 4202. 4050 802. 1200
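To see what na.rm changes, the following base R sketch uses a short made-up vector containing one missing value (the numbers are hypothetical, not taken from penguins).

```r
# Hypothetical masses in grams, with one missing value
mass <- c(3750, 3800, NA, 3450)

# How many values are missing?
n_missing <- sum(is.na(mass))

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(mass)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(mass, na.rm = TRUE)

print(c(n_missing = n_missing,
        mean_with_na = mean_with_na,
        mean_without_na = mean_without_na))
```

Counting the NAs first is a useful habit, since a summary computed from far fewer observations than expected may need a closer look.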
Visualize
A good choice for visualizing the distribution of a quantitative variable is a histogram. A histogram divides the range of the data into evenly spaced bins, and then displays the number of observations which fall into each bin. Since the number of bins affects how the histogram looks, it is good practice to experiment with several different numbers of bins. This can be specified with the bins argument in geom_histogram.
penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 20)
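To make the binning step concrete, the sketch below counts observations per bin by hand in base R, using made-up masses and two different bin widths. This only mimics the idea behind a histogram; the exact bin boundaries chosen by geom_histogram may differ.

```r
# Hypothetical body masses in grams
mass <- c(2900, 3200, 3400, 3550, 3700, 3800, 4100, 4500, 5000, 5600)

# Four wide bins (width 1000) and eight narrow bins (width 500)
wide_breaks <- seq(2000, 6000, by = 1000)
narrow_breaks <- seq(2000, 6000, by = 500)

# cut() assigns each value to a bin; table() counts the values per bin
wide_counts <- table(cut(mass, breaks = wide_breaks))
narrow_counts <- table(cut(mass, breaks = narrow_breaks))

print(wide_counts)
print(narrow_counts)
```

The wide bins suggest a single peak, while the narrow bins reveal empty bins and more detail, which is why trying several bin settings is worthwhile.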
Another common option for visualization is the boxplot. A boxplot doesn’t show the whole distribution, but rather a summary of it. In particular, it displays the median, first and third quartiles, the smallest and largest non-outlier values, and any outliers.
penguins %>%
  ggplot(aes(y = body_mass_g)) +
  geom_boxplot()
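The quantities a boxplot displays can also be computed directly. The sketch below does so in base R for a made-up vector, assuming the common rule that flags points more than 1.5 × IQR beyond the quartiles as outliers (this is the documented default for geom_boxplot; the data and object names are hypothetical).

```r
# Hypothetical body masses in grams, with one unusually large value
mass <- c(3300, 3500, 3600, 3700, 3800, 3900, 4100, 6500)

# Quartiles and median
q <- quantile(mass, probs = c(0.25, 0.5, 0.75))
q1 <- q[["25%"]]
med <- q[["50%"]]
q3 <- q[["75%"]]
iqr <- q3 - q1

# Fences: points beyond these are treated as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
is_outlier <- mass < lower_fence | mass > upper_fence
outliers <- mass[is_outlier]

# Whiskers end at the most extreme non-outlier values
whisker_low <- min(mass[!is_outlier])
whisker_high <- max(mass[!is_outlier])

print(c(q1 = q1, median = med, q3 = q3,
        whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```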
Other tools include density plots (geom_density) and violin plots (geom_violin).
Describe
- Shape (symmetric vs. skewed, number of modes, location of modes)
- Center (usually mean or median)
- Spread (usually standard deviation or IQR)
- Any unusual features?
- Any potential outliers?
Bivariate exploratory data analysis
What if we want to look at the relationship between two variables?
Two categorical variables
Summarize
We can count the number of observations in each group:
penguins %>%
  count(species, island)
## # A tibble: 5 × 3
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
Sometimes, it is nice to display the result as a two-way table, where categories for one variable are in the rows, and categories for the second variable are in the columns:
penguins %>%
  count(species, island) %>%
  pivot_wider(names_from = island, values_from = n)
## # A tibble: 3 × 4
## species Biscoe Dream Torgersen
## <fct> <int> <int> <int>
## 1 Adelie 44 56 52
## 2 Chinstrap NA 68 NA
## 3 Gentoo 124 NA NA
(Note that here, NA means that this combination of values did not appear in the dataset. For example, there were no Chinstrap penguins from Biscoe island.)
Visualize
A common way to visualize the relationship between two categorical variables is with a stacked bar graph:
penguins %>%
  ggplot(aes(x = species, fill = island)) +
  geom_bar()
Other options include mosaic plots.
Describe
- Which combination of categories has the most observations? The least?
- Are there any combinations which do not appear in the data?
- Is the distribution for the second variable the same for each level of the first variable? (E.g., in the penguins example above, there appears to be a relationship between species and island, because the distribution across islands differs among the three species. Adelie penguins are found on all three islands, whereas Chinstrap and Gentoo penguins are each found on only one.)
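One way to compare the distribution of the second variable across levels of the first is to convert the two-way table into row proportions. The sketch below does this in base R, with the counts typed in by hand from the two-way table above and combinations that did not appear entered as 0 rather than NA; the object names are our own.

```r
# Counts copied from the two-way table above (missing combinations entered as 0)
counts <- matrix(c( 44, 56, 52,
                     0, 68,  0,
                   124,  0,  0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
                                 island = c("Biscoe", "Dream", "Torgersen")))

# margin = 1 divides each row by its row total,
# giving the distribution of island within each species
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

If the two variables were unrelated, the rows of this table would look roughly alike; here they clearly do not.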
Two quantitative variables
Visualize
To visualize the relationship between two quantitative variables, we can use a scatterplot:
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) +
  geom_point()
Summarize
If the relationship looks linear, we can calculate the sample correlation coefficient, \(r\), to summarize the strength of the linear relationship. Recall that \(r\) takes values between -1 and 1, with \(r = -1\) indicating a perfect negative linear relationship, \(r = 0\) no linear relationship, and \(r = 1\) a perfect positive linear relationship.
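For reference, the sample (Pearson) correlation coefficient, which is what cor() computes by default, can be written as follows. The notation is our own: \(x_i\) and \(y_i\) denote the paired observations, \(\bar{x}\) and \(\bar{y}\) their sample means, and \(n\) the number of pairs.

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

The numerator is positive when large values of one variable tend to occur with large values of the other, and the denominator rescales the result so that it always lies between -1 and 1.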
When calculating the correlation, we have to handle NAs if missing values are present in the data. This can be done either by removing all rows with NAs beforehand (drop_na()), or by ignoring NAs when computing the correlation (use = "complete.obs").
penguins %>%
  summarize(r = cor(flipper_length_mm,
                    body_mass_g, use = "complete.obs"))
## # A tibble: 1 × 1
## r
## <dbl>
## 1 0.871
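The effect of use = "complete.obs" can be checked on a small example. The base R sketch below uses two made-up vectors, each with one missing value (all numbers are hypothetical).

```r
# Hypothetical flipper lengths (mm) and body masses (g), each with one NA
flipper <- c(180, 190, NA, 210, 220)
mass <- c(3500, 3800, 4000, NA, 5200)

# By default, any NA makes the correlation NA
r_default <- cor(flipper, mass)

# "complete.obs" keeps only the pairs where both values are present
r_complete <- cor(flipper, mass, use = "complete.obs")

# Number of pairs actually used
n_pairs_used <- sum(complete.cases(flipper, mass))

print(c(r_default = r_default, r_complete = r_complete,
        n_pairs_used = n_pairs_used))
```

Only three of the five pairs contribute here, so it is worth reporting how many observations a correlation is based on when values are missing.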
Describe
- Does there appear to be a relationship?
- If so, does the relationship appear to be positive or negative?
- What is the general shape of the relationship? Does it look linear?
- If the relationship looks linear, report the sample correlation coefficient
One categorical, one quantitative
Visualize
There are several options for visualizing the relationship between a categorical and a quantitative variable. A common choice is to make a boxplot for each level of the categorical variable:
penguins %>%
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_boxplot()
While boxplots are just summaries of a distribution, they are very handy for comparing across groups.
Another option, if the number of categories isn’t too large, is to create a histogram faceted by the categorical variable:
penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 20) +
  facet_wrap(~species)
Summarize
To summarize the relationship, we can calculate summary statistics for the quantitative variable at each level of the categorical variable. The group_by function is very helpful here.
penguins %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE),
            median_mass = median(body_mass_g, na.rm = TRUE))
## # A tibble: 3 × 3
## species mean_mass median_mass
## <fct> <dbl> <dbl>
## 1 Adelie 3701. 3700
## 2 Chinstrap 3733. 3700
## 3 Gentoo 5076. 5000
Describe
Is the distribution of the quantitative variable different across levels of the categorical variable? If so, how? (e.g., differences in shape, center, spread)
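To compare spread as well as center across groups, the same grouped calculation can include the IQR or standard deviation. The sketch below is a base R counterpart using tapply() on made-up data (the values and the two-group setup are hypothetical), so that it runs without any packages.

```r
# Hypothetical masses in grams for two groups
mass <- c(3500, 3700, 3900, 4800, 5000, 5400)
species <- factor(c("Adelie", "Adelie", "Adelie",
                    "Gentoo", "Gentoo", "Gentoo"))

# tapply() applies a function to mass separately within each species
group_median <- tapply(mass, species, median)
group_iqr <- tapply(mass, species, IQR)

print(group_median)
print(group_iqr)
```

In this toy example the second group has both a higher center and a wider spread, which is the kind of difference the question above asks us to describe.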
More than two variables
With more than two variables, the number of possible combinations grows quickly. Here are just a couple of examples. Using additional aesthetics and faceting is helpful for visualization. Using grouping is helpful for summary statistics.
Quantitative, quantitative, categorical
penguins %>%
  ggplot(aes(x = bill_depth_mm,
             y = body_mass_g,
             color = species)) +
  geom_point()
penguins %>%
  group_by(species) %>%
  summarize(r = cor(bill_depth_mm,
                    body_mass_g, use = "complete.obs"))
## # A tibble: 3 × 2
## species r
## <fct> <dbl>
## 1 Adelie 0.576
## 2 Chinstrap 0.604
## 3 Gentoo 0.719
Quantitative, categorical, categorical
penguins %>%
  ggplot(aes(x = island,
             y = body_mass_g)) +
  geom_boxplot() +
  facet_wrap(~species)
penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 15) +
  facet_grid(island ~ species)
penguins %>%
  group_by(species, island) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE),
            median_mass = median(body_mass_g, na.rm = TRUE))
## # A tibble: 5 × 4
## # Groups: species [3]
## species island mean_mass median_mass
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Biscoe 3710. 3750
## 2 Adelie Dream 3688. 3575
## 3 Adelie Torgersen 3706. 3700
## 4 Chinstrap Dream 3733. 3700
## 5 Gentoo Biscoe 5076. 5000
This work was created by Ciaran Evans and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 March 25.