The Goal

At the start of DataFest, you will be given a data set and provided with the research questions your client would like you to explore - and that’s it! How you approach the problem, what model you choose…everything is completely up to you.

In the last activity, we learned that one way to start to explore a data set involves using tables and summaries. In this activity, we are going to explore a data set using visualizations (graphs) in R. We will also see some best practices for visualizations (like adding appropriate labels and titles to graphs) that your team should use for DataFest and indeed for any formal report or presentation!

First Steps: The Type of Task

We will continue working with our same data set. We have a lot of rows, and a LOT of possible features in this data set. The data at DataFest may be even bigger. Sometimes the hardest part is figuring out how to get started, so let’s talk about that.

The first question to ask yourselves: What is the goal? Are we building a model to help predict something? If so, then our goal is prediction. Is our goal to understand relationships among some of the features (explanatory variables) and our response variable? If so, then our goal is association.

There are other types of goals, like establishing causal relationships, but generally with DataFest we have either prediction or association.

Our client is interested in exploring how existing buildings can be updated to improve their energy efficiency and reduce their carbon footprint. They are interested in identifying traits associated with high energy usage, and they would like recommendations on how buildings can improve their energy efficiency.

Question 1

Based on this, is our challenge a prediction or an association task?

Why does this matter? Well, if our goal is only prediction, we can use really complicated models, and as long as they predict well, we are probably fine with that. However, if our goal is association, we have to choose a model that allows us to actually interpret the relationships going on in the data. Really complex models may not allow for this. This means we need to be aware of the goal of the analysis from the very start.

Now that we know the goal, the next step is to start digging into the data. Characteristics of the data themselves (yes, data is a plural term!) can also help us decide on modeling techniques. In the real competition, this is a good time to divide the features (explanatory variables) you might want to visualize among your team members, because sharing the work is more efficient.

To make visualizations, we will be working with a group of functions in the ggplot2 package in R. If you have already installed ggplot2, skip the next section.

Installing ggplot2

The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics. A package is a collection of related R functions. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this course.

The first step to using ggplot2 is to install the ggplot2 package. Go to the top of your RStudio window and find “Tools”. From there, click on “Install Packages.” In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.

Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that’s okay.

library(rlang)

Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don’t have to teach it again.
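If you prefer typing to clicking, the same one-time installation can be done from the R console. The lines below are a minimal sketch rather than a required step: they check whether ggplot2 is already available and install it only if it is not, so running them a second time does nothing.

```r
# Install ggplot2 only if it is not already available on this computer.
# requireNamespace() returns FALSE when the package cannot be found.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
```

The same pattern works for any other package by swapping in its name.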

Once you have installed the ggplot2 package, you need to tell R that you would like to begin using its functions by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell R we want it to use those skills. To do this, create a chunk in your RMarkdown, copy and paste the following, and hit play.

suppressMessages(library(ggplot2))

Note that this process of loading a library is one you have to do ONCE each time you start a lab or project. This tells R “Hey, remember those skills we taught you? Use them.”

You are now ready to begin EDA.

EDA: One Numeric Variable

To start off with, let’s create a visual to explore the distribution of the response variable, site EUI. What are we looking for? Things like modality, outliers, spread, etc.

Question 2

What type of variable is site EUI, and based on that, what kinds of plots could we create to explore its distribution? List at least two.

We are going to start off with a histogram. To create the histogram of site EUI, paste the following code in a chunk and hit play.

ggplot(train, aes(x=site_eui)) + 
 geom_histogram(bins = 20, fill='blue', col = 'black')

Welcome to building plots in ggplot2! Let’s talk about what we just did.

The creation of plots in ggplot2 requires building the plot in layers.

First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code.
Once we have built the background, we are ready to plot our data. The command we will use for this depends upon the data type we are working with. In this case, we want a histogram. The command we use to build a histogram is geom_histogram().
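In this same layer, we specify that we want the bars to be filled (fill) in blue and we want them outlined (col) in black. The only other part of the command is bins=20. We include this because the number of bins strongly affects how a histogram looks; if we leave it out, ggplot2 falls back on a default of 30 bins and prints a message suggesting that we pick a better value.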

Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the bars to make the final graph. Let’s break that down in more detail:

ggplot(train, aes(x=site_eui)): This part of the code creates the background of the plot. The two arguments are the data set we are using (train) and the variable(s) that will be used to define the axis/axes, wrapped in aes() (short for aesthetics). In this case, we defined only that the x-axis would contain information on site EUI (aes(x=site_eui)).
geom_histogram(bins = 20, fill='blue', col = 'black'): Once the axes are set, we are adding on (+) the actual data. In this case, we want a histogram, so we add bars (geom_histogram). We also specify that we want 20 bins, and that the bars should be colored blue and outlined in black.
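To see the layers one at a time, it can help to store the background under a name and then add the bars to it. The sketch below assumes a small made-up data frame called toy (a hypothetical stand-in for train, so the code runs on its own) and uses 5 bins only because the toy data has so few rows.

```r
library(ggplot2)

# Hypothetical stand-in for `train`, so this sketch runs on its own
toy <- data.frame(site_eui = c(35, 42, 58, 61, 77, 80, 95, 120, 150, 210))

# Layer 1: the background only (axes and grid, no data drawn yet)
background <- ggplot(toy, aes(x = site_eui))

# Layer 2: add the bars on top of the background with a plus sign
hist_plot <- background +
  geom_histogram(bins = 5, fill = 'blue', col = 'black')

background  # shows an empty grid with site_eui on the x-axis
hist_plot   # shows the same grid with the histogram bars added
```

Running the two names one after the other shows exactly what each layer contributes to the final graph.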

Let’s try it.

Question 3

Create a histogram of average temperatures in March, using 15 bins. Make the bars of the histogram cyan and outline them in white. Show your result.

First Rule of Graphs: Labels and Titles

We mentioned at the start of this lab that we would cover some best practices for visualizations. Here is the first - always label your graphs appropriately.

What does that mean? It means every graph you create needs a title (like Figure 1, Figure 2, etc.) and your axes need to be labelled in such a way that a person looking at your graph can tell what information is being displayed on each axis.

The command labs, which stands for labels, is used for this.

ggplot(train, aes(x=site_eui)) + 
   geom_histogram(fill='blue', col = 'black', bins = 20)+
   labs(title="Figure 1:", x = "Site EUI", y = "Frequency")

This ggplot syntax actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add a label. Thinking through the steps in this manner will help you understand the syntax of this package.

You can also add captions to the bottom of graphs:

ggplot(train, aes(x=site_eui)) +
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Site EUI", y = "Frequency",
       caption = "A histogram of site EUI.")

Question 4

Copy the code you used to make the graph from Question 3. Now, add the title “Figure 2:”, add appropriate labels to the x and y axes, and add a caption.

One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like “march_avg_temp”. Instead, we want clear labels like “March Average Temperature”. This is important for any plots you make in statistics - label your axes appropriately and title your graphs.

Now let’s try a new plot: a box plot.

Question 5

Create a box plot of site EUI. Fill the plot in gold and outline it in black. Title your plot Figure 3, and label the x axis “Site EUI”. The y axis does not matter in this box plot (it literally gives us no information), so we want a blank axis label (y = ""). Hint: Instead of a geom_histogram, we want geom_boxplot, and box plots do not have bins.

Now that we have our graphs, let’s think about what information they yield.

Question 6

Based on Figure 1 and Figure 3, describe the distribution of site EUI. Do you see any outliers? Does the distribution seem unimodal or multimodal? Symmetric or skewed? Etc.

All of these questions can help us determine the type of model we might want to consider.

Plotting Two Numeric Variables

Once we have explored our response variable, we will likely want to get some idea of how some of the features relate to the response variable. This means that we want to create a graph to examine the relationship between two variables.

Let’s start with X = the floor area of the building, in square feet. This means we have two numeric variables, so we need a scatter plot to explore their relationship. We could create the necessary plot using the following code:

ggplot(train, aes(x=floor_area, y = site_eui)) + geom_point()

Just as before, we have two layers. The first draws the background and the second adds on the graph. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.

geom_point() tells R to make a scatter plot.

Question 7

Start with the code above to create a scatter plot. Adapt it so that X = the year the building was constructed. Keep Y = site EUI. Color the dots purple. (Hint: This time we are not specifying a fill, but a color.) Title your plot Figure 4, and label the x axis and the y axis.

In the figure you have just constructed, you should notice something unusual. There are 5 buildings which were built in year 0!

which(train$year_built == 0)

This is part of why we do EDA when we work with real data. We can see data quality issues, like missing or incorrect data.
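Once a plot has flagged something odd, a quick count and a look at the affected rows help confirm what you are dealing with. The sketch below assumes a small made-up data frame called toy (a hypothetical stand-in for train, with invented values) so that it runs on its own.

```r
# Hypothetical stand-in for `train`, with invented values
toy <- data.frame(
  year_built = c(1950, 0, NA, 2001, 0),
  site_eui   = c(60, 85, 40, 72, 110)
)

# How many buildings have a year of 0? na.rm = TRUE skips missing years.
n_zero <- sum(toy$year_built == 0, na.rm = TRUE)
n_zero

# Which rows are they? which() returns the row positions.
zero_rows <- which(toy$year_built == 0)

# Look at the full rows to see whether anything else about them is unusual
toy[zero_rows, ]
```

Looking at the complete rows, not just the one suspicious column, often gives clues about where the odd values came from.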

Question 8

What do you think a 0 for year built is supposed to indicate? And how would you suggest dealing with these 5 buildings? Explain your choice.

Stacking Graphs

Okay, so we have seen how to make a few different kinds of graphs. Great. However, we have a lot of features to explore, so we have the potential for a lot of plots. That starts to take up a lot of space in a report. One nice way to present multiple graphs is to stack them.

Suppose we want to make a histogram of site EUI (as we have already done). The code we need for that is:

ggplot(train, aes(x=site_eui)) +
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Site EUI", y = "Frequency")

If we put that in a chunk and press play, one histogram will appear on our screen. Let’s suppose we want to show a box plot along with this histogram. We can create both graphs and have them both print out separately in our knit document. However, we can also tell the computer to print more than one graph at once to save space. Try the following:

# Create the Histogram
g1 <- ggplot(train, aes(x=site_eui)) +
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Site EUI", y = "Frequency")

# Create the box plot 
g2 <- ggplot(train, aes(x=site_eui)) +
  geom_boxplot(fill='gold') + labs(title="Figure 2:", x = "Site EUI", y = "")

# Show the plots side by side
gridExtra::grid.arrange(g1,g2, ncol = 2)

If you get a warning message about not having gridExtra, this means you will need to install the package using the same steps we did above to install ggplot2. We have to install packages all the time with R, depending on the version of R you have and your computer system.

What you will notice is that we have stored each of the two graphs under a name. Our histogram is stored under g1 and our box plot is stored under g2. Then, we use a function called gridExtra::grid.arrange to help us arrange the graphs in a grid. In our case, we have two graphs, and we want them side by side. This means we want the graphs in a 1 (row) by 2 (column) grid.

To create the 1 by 2 grid, we feed the computer our two graphs, and then tell it we want the figures in 2 columns (next to each other) by specifying ncol = 2. In other words, the number of columns we want is 2!
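The same argument controls other layouts too. As a minimal sketch, assuming a small made-up data frame called toy (a hypothetical stand-in for train), asking for a single column places the two graphs one on top of the other instead of side by side.

```r
library(ggplot2)

# Hypothetical stand-in for `train`, so this sketch runs on its own
toy <- data.frame(site_eui = c(35, 42, 58, 61, 77, 80, 95, 120, 150, 210))

# Create the histogram
g1 <- ggplot(toy, aes(x = site_eui)) +
  geom_histogram(fill = 'blue', col = 'black', bins = 5) +
  labs(title = "Figure 1:", x = "Site EUI", y = "Frequency")

# Create the box plot
g2 <- ggplot(toy, aes(x = site_eui)) +
  geom_boxplot(fill = 'gold') +
  labs(title = "Figure 2:", x = "Site EUI", y = "")

# One column means a 2 (row) by 1 (column) grid: g1 on top, g2 below
gridExtra::grid.arrange(g1, g2, ncol = 1)
```

The graphs are placed in the order you list them, filling the grid one row at a time.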

Question 9

Create a (1) box plot for site EUI, (2) histogram for site EUI, (3) box plot for the average temperature in March and (4) histogram for the average temperature in March. Stack 4 graphs in a 2 x 2 grid (2 rows and 2 columns). You have already made some of the graphs you need!

Changing Themes

When using ggplot2, you also have an option to change the theme of your graphs. Consider these:

The Light Theme

ggplot(train, aes(x=site_eui)) +
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Site EUI", y = "Frequency",
       caption = "A histogram of site EUI.") +
  theme_light()

The BW Theme

ggplot(train, aes(x=site_eui)) +
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Site EUI", y = "Frequency",
       caption = "A histogram of site EUI.") +
  theme_bw()

The Minimal Theme

ggplot(train, aes(x=site_eui)) +
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Site EUI", y = "Frequency",
       caption = "A histogram of site EUI.") +
  theme_minimal()

Take a look and, as a team, decide what you like!
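Once your team has picked a theme, you do not have to add it to every graph by hand. The sketch below uses the ggplot2 function theme_set() to change the default theme for the rest of the session; the data frame toy is a hypothetical stand-in for train, and theme_bw() is just one possible choice.

```r
library(ggplot2)

# Make theme_bw() the default for plots drawn after this line.
# theme_set() hands back the previous default, so we keep it in case we
# want to switch back later.
old_theme <- theme_set(theme_bw())

# Hypothetical stand-in for `train`, so this sketch runs on its own
toy <- data.frame(site_eui = c(35, 42, 58, 61, 77, 80, 95, 120, 150, 210))

# No theme layer is added here, yet the plot is drawn with theme_bw()
ggplot(toy, aes(x = site_eui)) +
  geom_histogram(fill = 'blue', col = 'black', bins = 5) +
  labs(title = "Figure 1:", x = "Site EUI", y = "Frequency")

# Restore the earlier default if you change your mind
theme_set(old_theme)
```

Setting the theme once near the top of a report is a simple way to keep every figure consistent.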

Making Tables

While we are thinking about EDA, we don’t only focus on numeric variables. What if we have categories? With categorical variables, we can make tables or bar plots. We make bar plots just like we made histograms, except there are no bins and we use geom_bar instead of geom_histogram.

Question 10

Make a bar plot to display the building class. Make sure you title your graphs appropriately!

There are several ways to make tables in R, but we will discuss two. The first is very direct. We tell R we want to use the train data set and the variable building_class, by using the code train$building_class (dataset$variable). Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table.

table(train$building_class)

However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:

knitr::kable(table(train$building_class), 
             col.names=c("Building Type", "Count") )

The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?

Luckily, you only have to adapt a few things in this code to make different tables. You need to specify what you want to make a table of (train$building_class) and you need to change the column names accordingly (“Building Type”).
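If you are wondering why col.names takes exactly two names, it helps to look at the table as a data frame. The sketch below assumes a small made-up data frame called toy (a hypothetical stand-in for train, with invented categories) and uses only base R.

```r
# Hypothetical stand-in for `train`, with invented categories
toy <- data.frame(
  building_class = c("Commercial", "Residential", "Residential",
                     "Commercial", "Residential")
)

# Count how many times each category appears
counts <- table(toy$building_class)
counts

# As a data frame, the table has two columns: the category and its count.
# R names them Var1 and Freq by default, which is not very reader friendly.
as.data.frame(counts)
```

Those two default column names are what the col.names argument replaces with labels a reader can understand.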

Question 11

Create a table, using the second way to make a table, for the facility type. Make sure your columns are labelled Facility Type and Count. (Hint: The col.names part of the code above will help with that.)

Next Steps

Now that we know some basics of visualization, think about if this was the real DataFest. What would you do now?

Question 12

Discuss as a team how you would proceed with exploring and visualizing this data set. This is called making a plan, and is critical to success in applied statistics/data science work.

Question 13

Take a look at the variables (features) you are given. Are there any that you think it would not make sense to graph? If so, explain why.

We will keep working with this data set for future activities.

Citation

This activity uses data from:

Climate Change AI (CCAI) and Lawrence Berkeley National Laboratory (Berkeley Lab). (2022 January). WiDS Datathon 2022, Version 1. Retrieved January 10, 2022 from https://www.kaggle.com/competitions/widsdatathon2022/overview.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 May 17.

Activity 2: Data Visualization