Data Exploration Cheat Sheet

Use: This cheat sheet is a refresher on the different types of variables, the plots we can use to visualize them, and the code for making those plots. It also covers some summary statistics.

Types of variables

Here we will deal mainly with categorical and quantitative variables.

Categorical variables are variables which take on one of several fixed values. These values generally do not have a numeric interpretation.
- Examples: gender, favorite food, brand of laptop
- Binary categorical variables have exactly 2 possible values
Quantitative variables are variables which take on a numeric value, and which have a numeric interpretation.
- Examples: number of pets, height, weight, age
- Discrete quantitative variables only take on discrete values (e.g., number of pets)
- Continuous quantitative variables can take on an entire range of values (e.g., height is continuous if we allow heights like 60.323 inches)
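
As a rough illustration of these definitions, the sketch below builds a small made-up data frame and labels each column as categorical or quantitative based on how R stores it. The data frame `people`, its column names, and its values are hypothetical. The sketch also assumes categorical variables are stored as factors, which is a common convention but not guaranteed for every dataset.

```r
library(dplyr)

# Hypothetical toy data: one column for each kind of variable described above
people <- tibble(
  laptop_brand = factor(c("A", "B", "A", "C")),        # categorical
  owns_pet     = factor(c("yes", "no", "yes", "yes")), # binary categorical
  num_pets     = c(1L, 0L, 2L, 3L),                    # discrete quantitative
  height_in    = c(60.323, 71.5, 65.2, 68)             # continuous quantitative
)

# Label each column by how it is stored:
# numeric -> quantitative, anything else -> categorical
var_types <- sapply(people, function(col) {
  if (is.numeric(col)) "quantitative" else "categorical"
})
var_types

# A binary categorical variable has exactly two levels
nlevels(people$owns_pet)
```

Storage type is only a heuristic. A zip code or ID number is stored as a number but has no numeric interpretation, so it is still a categorical variable.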

Univariate exploratory data analysis

Here we discuss how to summarize, visualize, and describe the distributions of categorical and quantitative variables.

Categorical variables

Summarize

The number of observations in each group can be summarized in a frequency table:

library(tidyverse)
library(palmerpenguins)  # assumed source of the penguins data
penguins %>%
  count(species)

## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

Visualize

The same information can be visualized with bar charts, which display the number of observations in each group as the height of the bar:

penguins %>%
  ggplot(aes(x = species)) +
  geom_bar()

Other visualization options include pie charts.

Describe

Which category has the most observations? The least?
Are observations spread relatively evenly across categories, or do one or two categories have the majority of the observations?
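
One way to answer these questions is to add proportions to the frequency table and sort it. The sketch below assumes the `penguins` data come from the palmerpenguins package; the names `species_props` and `prop` are made up for this example.

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins comes from this package

# Add the share of observations in each category next to the raw count,
# then sort so the largest category comes first
species_props <- penguins %>%
  count(species) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(n))

species_props
```

Comparing the `prop` column across rows shows whether the observations are spread evenly or concentrated in one or two categories.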

Quantitative variables

Summarize

Many summary statistics can be calculated for quantitative variables. We often calculate the mean or median to summarize the center of the distribution, and the standard deviation or IQR to summarize the spread of the distribution. If the data are highly skewed, the median and IQR are often more appropriate measures of center and spread.

Note that if NAs (missing values) are present in the data, then we need to handle them when calculating summary statistics; otherwise functions like mean() and sd() return NA. This can be done by removing all rows with NAs beforehand (drop_na()), or by ignoring NAs when we calculate the summary statistics (na.rm=TRUE).

penguins %>%
  summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
            median_mass = median(body_mass_g, na.rm=TRUE),
            sd_mass = sd(body_mass_g, na.rm=TRUE),
            iqr_mass = IQR(body_mass_g, na.rm=TRUE))

## # A tibble: 1 × 4
##   mean_mass median_mass sd_mass iqr_mass
##       <dbl>       <dbl>   <dbl>    <dbl>
## 1     4202.        4050    802.     1200
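
To see the two ways of handling NAs side by side, here is a minimal sketch on a made-up vector. The tibble `toy` and its values are hypothetical and not part of the penguins data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing value
toy <- tibble(mass = c(3000, 3500, NA, 4000))

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(toy$mass)

# Option 1: ignore NAs inside the summary function
mean_na_rm <- mean(toy$mass, na.rm = TRUE)

# Option 2: drop rows with an NA in the chosen column first
mean_dropped <- toy %>%
  drop_na(mass) %>%
  summarize(mean_mass = mean(mass)) %>%
  pull(mean_mass)

c(mean_with_na, mean_na_rm, mean_dropped)
```

Both options give the same answer here. Note that drop_na() called with no column names drops every row that has an NA in any column, which can remove more rows than na.rm=TRUE applied to a single variable.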

Visualize

A good choice for visualizing the distribution of a quantitative variable is a histogram. A histogram divides the range of the data into evenly spaced bins, and then displays the number of observations which fall into each bin. Since the number of bins affects how the histogram looks, it is good practice to experiment with several different numbers of bins. The number of bins can be specified with the bins argument in geom_histogram.

penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 20)

Another common option for visualization is the boxplot. A boxplot doesn’t show the whole distribution, but rather a summary of it. In particular, it displays the median, first and third quartiles, the smallest and largest non-outlier values, and any outliers.

penguins %>%
  ggplot(aes(y = body_mass_g)) +
  geom_boxplot()
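
The text above does not say how a boxplot decides which points are outliers. A common convention, and the default in geom_boxplot, is to flag points that lie more than 1.5 times the IQR beyond the first or third quartile. The sketch below applies that rule by hand to a made-up vector; `x` and the fence names are hypothetical.

```r
# Hypothetical toy data with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# First and third quartiles, and the IQR between them
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1   # same value as IQR(x)

# Fences 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

This is a rule of thumb for spotting points worth a closer look, not a test of whether a value is an error.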

Other tools include density plots (geom_density) and violin plots (geom_violin).
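
As a sketch of these two alternatives, assuming the `penguins` data come from the palmerpenguins package (the object names `density_plot` and `violin_plot` are made up for this example):

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)  # assumption: penguins comes from this package

# Density plot: a smoothed version of the histogram
density_plot <- penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_density(na.rm = TRUE)

# Violin plot: a mirrored density curve; it needs an x position,
# so an empty string is used as a single dummy category
violin_plot <- penguins %>%
  ggplot(aes(x = "", y = body_mass_g)) +
  geom_violin(na.rm = TRUE)

density_plot
violin_plot
```

Here na.rm = TRUE only silences the warning about rows with missing body mass; those rows are dropped from the plot either way.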

Describe

Shape (symmetric vs. skewed, number of modes, location of modes)
Center (usually mean or median)
Spread (usually standard deviation or IQR)
Any unusual features?
Any potential outliers?

Bivariate exploratory data analysis

What if we want to look at the relationship between two variables?

Two categorical variables

Summarize

We can count the number of observations in each combination of categories:

penguins %>%
  count(species, island)

## # A tibble: 5 × 3
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    52
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      124

Sometimes, it is nice to display the result as a two-way table, where categories for one variable are in the rows, and categories for the second variable are in the columns:

penguins %>%
  count(species, island) %>%
  pivot_wider(names_from = island, values_from = n)

## # A tibble: 3 × 4
##   species   Biscoe Dream Torgersen
##   <fct>      <int> <int>     <int>
## 1 Adelie        44    56        52
## 2 Chinstrap     NA    68        NA
## 3 Gentoo       124    NA        NA

(Note that here, NA means that this combination of values did not appear in the dataset. For example, there were no Chinstrap penguins from Biscoe island.)
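
If a count of zero reads more naturally than NA for the missing combinations, one option is to fill them in while reshaping. This sketch uses pivot_wider from tidyr (a newer alternative to spread) with its values_fill argument, and assumes the `penguins` data come from the palmerpenguins package. The name `two_way` is made up for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)  # assumption: penguins comes from this package

# Two-way table of counts, with 0 for combinations that never occur
two_way <- penguins %>%
  count(species, island) %>%
  pivot_wider(names_from = island,
              values_from = n,
              values_fill = 0)

two_way
```

Filling with zero is appropriate here because the cells hold counts. For other summaries, such as a mean, an absent combination is better left as NA.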

Visualize

A common way to visualize the relationship between two categorical variables is with a stacked bar graph:

penguins %>%
  ggplot(aes(x = species, fill = island)) +
  geom_bar()

Other options include mosaic plots.

Describe

Which combination of categories has the most observations? The least?
Are there any combinations which do not appear in the data?
Is the distribution for the second variable the same for each level of the first variable? (E.g., in the penguins example above, there appears to be a relationship between species and island, because the distribution of penguins across islands differs among the three species. Adelie penguins are found on all three islands, whereas Chinstrap and Gentoo penguins are each found on only one.)
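
Comparing distributions across levels is easier with proportions than with raw counts, because the groups have different sizes. The sketch below computes the share of each species found on each island and draws the matching bar chart with bars rescaled to the same height. It assumes the `penguins` data come from the palmerpenguins package; `island_by_species`, `prop`, and `prop_plot` are names made up for this example.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)  # assumption: penguins comes from this package

# Within each species, the proportion of penguins on each island
island_by_species <- penguins %>%
  count(species, island) %>%
  group_by(species) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

island_by_species

# position = "fill" rescales every bar to height 1,
# so the bars show proportions instead of counts
prop_plot <- penguins %>%
  ggplot(aes(x = species, fill = island)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

prop_plot
```

If the two variables were unrelated, the bars would be split into roughly the same proportions for every species.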

Two quantitative variables

Visualize

To visualize the relationship between two quantitative variables, we can use a scatterplot:

penguins %>%
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) +
  geom_point()

Summarize

If the relationship looks linear, we can calculate the sample correlation coefficient, \(r\), to summarize the strength of the linear relationship. Recall that \(r\) takes values between -1 and 1: \(r = -1\) indicates a perfect negative linear relationship, \(r = 0\) indicates no linear relationship, and \(r = 1\) indicates a perfect positive linear relationship. Values close to -1 or 1 indicate a strong linear relationship.

When calculating the correlation, we have to handle NAs if missing values are present in the data. This can be done either by removing all rows with NAs beforehand (drop_na()), or by ignoring NAs when computing the correlation (use = "complete.obs").

penguins %>%
  summarize(r = cor(flipper_length_mm, 
                    body_mass_g, 
                    use="complete.obs"))

## # A tibble: 1 × 1
##       r
##   <dbl>
## 1 0.871
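
To make the definition of \(r\) concrete, the sketch below computes it by hand from the usual formula (the sum of products of deviations from the means, divided by \((n - 1) s_x s_y\)) and compares the result with cor(). The vectors `x` and `y` are made-up toy data.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Sum of products of deviations, scaled by (n - 1) and both standard deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in version for comparison
r_builtin <- cor(x, y)

c(r_manual, r_builtin)
```

Because the deviations are divided by the standard deviations, \(r\) has no units and does not change if either variable is rescaled, for example from grams to kilograms.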

Describe

Does there appear to be a relationship?
If so, does the relationship appear to be positive or negative?
What is the general shape of the relationship? Does it look linear?
If the relationship looks linear, report the sample correlation coefficient.
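
Checking the shape before reporting \(r\) matters because the correlation only measures linear association. In the made-up example below, `y` is completely determined by `x`, yet the correlation is zero because the relationship is a symmetric curve rather than a line.

```r
# Hypothetical toy data: y depends on x exactly, but not linearly
x <- -3:3
y <- x^2

# Correlation is zero even though the relationship is perfect
r <- cor(x, y)
r
```

A scatterplot of these points would show a clear U shape, which is why the plot comes before the summary statistic.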

One categorical, one quantitative

Visualize

There are several options for visualizing the relationship between a categorical and a quantitative variable. A common choice is to make a boxplot for each level of the categorical variable:

penguins %>%
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_boxplot()

While boxplots are just summaries of a distribution, they are very handy for comparing across groups.

Another option, if the number of categories isn’t too large, is to create a histogram faceted by the categorical variable:

penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 20) +
  facet_wrap(~species)

Summarize

To summarize the relationship, we can calculate summary statistics for the quantitative variable at each level of the categorical variable. The group_by function is very helpful here.

penguins %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
            median_mass = median(body_mass_g, na.rm=TRUE))

## # A tibble: 3 × 3
##   species   mean_mass median_mass
##   <fct>         <dbl>       <dbl>
## 1 Adelie        3701.        3700
## 2 Chinstrap     3733.        3700
## 3 Gentoo        5076.        5000
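
The same pattern extends to spread and group size, which help when describing how the groups differ. This sketch assumes the `penguins` data come from the palmerpenguins package; the name `mass_by_species` and the summary column names are made up for this example.

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins comes from this package

# Number of non-missing values, standard deviation, and IQR of body mass
# for each species
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarize(n = sum(!is.na(body_mass_g)),
            sd_mass = sd(body_mass_g, na.rm = TRUE),
            iqr_mass = IQR(body_mass_g, na.rm = TRUE))

mass_by_species
```

Counting non-missing values with sum(!is.na(...)) rather than n() gives the number of observations that actually went into each statistic.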

Describe

Is the distribution of the quantitative variable different across levels of the categorical variable? If so, how? (e.g., differences in shape, center, spread)

More than two variables

With more than two variables, there are many possible combinations. Here are just a couple of examples. Using additional aesthetics and faceting is helpful for visualization. Using grouping is helpful for summary statistics.

Quantitative, quantitative, categorical

penguins %>%
  ggplot(aes(x = bill_depth_mm, 
             y = body_mass_g, 
             color = species)) +
  geom_point()

penguins %>%
  group_by(species) %>%
  summarize(r = cor(bill_depth_mm, 
                    body_mass_g, 
                    use="complete.obs"))

## # A tibble: 3 × 2
##   species       r
##   <fct>     <dbl>
## 1 Adelie    0.576
## 2 Chinstrap 0.604
## 3 Gentoo    0.719
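
It can be instructive to compare the within-species correlations with the correlation computed on all penguins at once. The sketch below assumes the `penguins` data come from the palmerpenguins package; `overall` and `by_species` are names made up for this example.

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins comes from this package

# Correlation ignoring species
overall <- penguins %>%
  summarize(r = cor(bill_depth_mm,
                    body_mass_g,
                    use = "complete.obs"))

# Correlation within each species
by_species <- penguins %>%
  group_by(species) %>%
  summarize(r = cor(bill_depth_mm,
                    body_mass_g,
                    use = "complete.obs"))

overall
by_species
```

In these data the overall correlation is negative even though every within-species correlation is positive. The groups sit in different parts of the scatterplot, so ignoring species reverses the apparent direction of the relationship. Coloring points by species, as in the plot above, is what reveals this.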

Quantitative, categorical, categorical

penguins %>%
  ggplot(aes(x = island, 
             y = body_mass_g)) +
  geom_boxplot() +
  facet_wrap(~species)

penguins %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins=15) +
  facet_grid(island~species)

penguins %>%
  group_by(species, island) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
            median_mass = median(body_mass_g, na.rm=TRUE))

## # A tibble: 5 × 4
## # Groups:   species [3]
##   species   island    mean_mass median_mass
##   <fct>     <fct>         <dbl>       <dbl>
## 1 Adelie    Biscoe        3710.        3750
## 2 Adelie    Dream         3688.        3575
## 3 Adelie    Torgersen     3706.        3700
## 4 Chinstrap Dream         3733.        3700
## 5 Gentoo    Biscoe        5076.        5000
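
The "# Groups: species [3]" line in the output above shows that the result is still grouped by species, because summarize() removes only the last grouping variable by default. If an ungrouped result is wanted, one option is the .groups argument of summarize(), sketched below under the assumption that the `penguins` data come from the palmerpenguins package. The name `mass_summary` is made up for this example.

```r
library(dplyr)
library(palmerpenguins)  # assumption: penguins comes from this package

# Same summary as above, but with all grouping dropped from the result
mass_summary <- penguins %>%
  group_by(species, island) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE),
            median_mass = median(body_mass_g, na.rm = TRUE),
            .groups = "drop")

# FALSE: the result is a plain tibble with no groups
is_grouped_df(mass_summary)
```

Dropping the groups avoids surprises later, since any further mutate() or summarize() on a grouped result would be applied within each species rather than to the whole table.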

This work was created by Ciaran Evans and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 March 25.