Activity 3B: Data Wrangling Practice

The Goal

In the last activity we covered some useful data wrangling functions from the dplyr package. In this activity, you will practice using these functions to explore the energy efficiency data we have been using for the rest of our activities. This allows you to try out these techniques on a larger data set.

Setup

R packages

For this activity you will need the dplyr, ggplot2, and tidyr packages. Each of these packages is part of the tidyverse. If you don’t have tidyverse installed, go to your R console and install it first with install.packages("tidyverse") (remember you only need to do this once).

Once the tidyverse package is installed, load it into R with library(tidyverse).

Data

We will continue using the energy efficiency data from last week. Load the train.csv dataset into R (see Activity 1 for full instructions).

Activity

The first part of the activity contains some guided practice with the dplyr functions discussed in the previous activity, as well as some more visualizations with ggplot. The second part is more open-ended. If you feel very comfortable with data wrangling functions, you can start with Part II.

Part I

There are two main groups of buildings in the data (specified by the building_class variable): Commercial buildings and Residential buildings. There is much more variety to commercial buildings than residential buildings, so let’s make the problem more manageable by focusing on the residential buildings to start.

Question 1

Create a subset of the data called train_residential, which only contains the residential buildings.

In addition to the building_class variable, the facility_type variable gives more detailed information about building type. Let’s see what different facility types are included, and how common they are.

Question 2

Fill in the ... in the code below to use the count function to determine how many residential buildings belong to each facility type.

train_residential %>%
  count(...)

What’s the least common type?

One facility type has only 9 residential buildings in the data. If we think facility type is an important variable, this may be too few. One option is to remove residential buildings of that facility type.

Question 3

Fill in the ... in the code below to change train_residential so we exclude the facility type with only 9 observations.

train_residential <- train_residential %>%
  filter(facility_type ...)

Hint: in Question 1, you used the logical operator == in your filter. For Question 3 we can use a different logical operator – pick the appropriate one from the table below:

operator definition operator definition
< less than x | y x OR y
<= less than or equal to is.na(x) test if x is NA
> greater than !is.na(x) test if x is not NA
>= greater than or equal to x %in% y test if x is in y
== exactly equal to !(x %in% y) test if x is not in y
!= not equal to !x not x
x & y x AND y

Of course, we’re not interested in building types for their own sake. In this data, we care about site_eui, the energy efficiency of the building. The building type could be important if energy efficiency differs across building types.

Question 4

Let’s start by plotting the distribution of site_eui for each facility type for residential buildings. Fill in the code below to create a boxplot of site_eui for each facility type, and add appropriate labels, a title, and a caption.

train_residential %>%
  ggplot(aes(x = ..., y = ...)) +
  geom_...() +
  labs(x = "...", 
       y = "...",
       title = "...",
       caption = "...") +
  theme_bw()

Does it look like energy efficiency differs for different facility types?

We can also use summary statistics to quantify what we see visually in the plot.

Question 5

Fill in the code below to calculate a measure of center and spread for site_eui in each residential building type. Choose appropriate measures of center and spread based on the boxplots you made in Question 4.

train_residential %>%
  group_by(...) %>%
  summarize(... = ...(site_eui),
            ... = ...(side_eui))

We can make this table of summary statistics look nicer. If we’re knitting to R Markdown, the kable function makes the knitted table pretty:

train_residential %>%
  group_by(...) %>%
  summarize(... = ...(site_eui),
            ... = ...(side_eui)) %>%
  knitr::kable(col.names = c("Residential building type", 
                             "Median energy efficiency",
                             "IQR energy efficiency"))

Part II

Building type is one variable that might be associated with energy efficiency. Another important variable is building age: we probably expect that older buildings will be less efficient than newer buildings. However, we don’t have a column for building age in the data.

Question 6

Using the mutate function, create a new column which records the (current) age of each building.

Now that we’ve made a column for building age, we can explore this new variable.

Question 7

How many buildings in the data have a missing value for age? (Hint: check out the is.na function in the table of logical operators by Question 3).

Question 8

Do the residential buildings with missing age tend to belong to a specific facility type?

Let’s look at the relationship between residential building age, facility type, and energy efficiency.

Question 9

Fill in the code below to create a scatterplot showing the relationship between age and energy efficiency. Change the color and shape of the points to reflect facility type. Add good labels, a title, and a caption. Feel free to change the theme too.

train_residential %>%
  ggplot(aes(x = ...,
             y = ...,
             color = ...,
             shape = ...)) +
  geom_...() +
  labs(x = "...",
       y = "...",
       color = "...",
       shape = "...",
       title = "...",
       caption = "...") +
  theme_bw()

Question 10

Now calculate the correlation between age and energy efficiency for each residential facility type. Think carefully about how you will handle missing data. Note: look at the documentation for the cor function in R (?cor) to see the options for handling missing values.

Question 11

When calculating the age of each building, we compared the year in which it was built to 2022. While we don’t know in which year the buildings were inspected, this is fine as long as all buildings were measured in the same year. However, the buildings in this dataset were actually measured across seven different years. How would you handle this when analyzing building age?

Citation

This activity uses data from:

Climate Change AI (CCAI) and Lawrence Berkeley National Laboratory (Berkeley Lab). (2022 January). WiDS Datathon 2022, Version 1. Retrieved January 10, 2022 from https://www.kaggle.com/competitions/widsdatathon2022/overview.

Creative Commons License
This work was created by Ciaran Evans is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 May 17.