Activity 1 Part 2: Essentials

The Goal

Now that we have opened our data set, let’s explore it a little. At DataFest, it is very important to take your time looking at the data and really understanding its structure. Doing so will help you make decisions about how best to approach the challenge!

The Data and the client

Our data today is provided by a client who is interested in exploring how existing buildings can be updated to improve their energy efficiency and reduce their carbon foot print. They are interested in identifying traits associated with high energy usage, and recommendations on how buildings can improve their energy efficiency.They have provided us with information on over 70,000 buildings.

Our response variable (outcome variable) is:

  • site_eui - Site Energy Usage Intensity (Site EUI). This is the amount of heat and electricity consumed by a building. This is recorded from utility bills for the building over the last year. It has no units.

There are other columns in the data set. These represent identifying information about the building and possible features that we can use to help us build a model for Site EUI. A feature is an explanatory variable, meaning a variable we can use when we build our model for Y = Site EUI.

  • id - building id.
  • Year_Factor - anonymized year in which the weather and energy usage factors were observed.
  • State_Factor - anonymized state in which the building is located.
  • building_class - building classification.
  • facility_type- building usage type.
  • floor_area - floor area (in square feet) of the building.
  • year_built - year in which the building was constructed.
  • energy_star_rating - the energy star rating of the building.
  • ELEVATION - elevation of the building location.
  • january_min_temp - minimum temperature in January (in Fahrenheit) at the location of the building. NOTE: The data contain information for additional months, such as february_min_temp.
  • january_avg_temp - average temperature in January (in Fahrenheit) at the location of the building. NOTE: The data contain information for additional months, such as february_avg_temp.
  • january_max_temp - maximum temperature in January (in Fahrenheit) at the location of the building. NOTE: The data contain information for additional months, such as february_max_temp.
  • cooling_degree_days - cooling degree day for a given day is the number of degrees where the daily average temperature exceeds 65 degrees Fahrenheit. Each month is summed to produce an annual total at the location of the building.
  • heating_degree_days - heating degree day for a given day is the number of degrees where the daily average temperature falls under 65 degrees Fahrenheit. Each month is summed to produce an annual total at the location of the building.
  • precipitation_inches - annual precipitation in inches at the location of the building.
  • snowfall_inches - annual snowfall in inches at the location of the building.
  • snowdepth_inches - annual snow depth in inches at the location of the building.
  • avg_temp - average temperature over a year at the location of the building.
  • days_below_30F - total number of days below 30 degrees Fahrenheit at the location of the building.
  • days_below_20F - total number of days below 20 degrees Fahrenheit at the location of the building.
  • days_below_10F - total number of days below 10 degrees Fahrenheit at the location of the building.
  • days_below_0F - total number of days below 0 degrees Fahrenheit at the location of the building.
  • days_above_80F - total number of days above 80 degrees Fahrenheit at the location of the building.
  • days_above_90F - total number of days above 90 degrees Fahrenheit at the location of the building.
  • days_above_100F - total number of days above 100 degrees Fahrenheit at the location of the building.
  • days_above_110F - total number of days above 110 degrees Fahrenheit at the location of the building.
  • direction_max_wind_speed - wind direction for maximum wind speed at the location of the building. Given in 360-degree compass point directions (e.g. 360 = north, 180 = south, etc.).
  • direction_peak_wind_speed - wind direction for peak wind gust speed at the location of the building. Given in 360-degree compass point directions (e.g. 360 = north, 180 = south, etc.).
  • max_wind_speed - maximum wind speed at the location of the building.
  • days_with_fog - number of days with fog at the location of the building.

Data Structure

To open your data set, look in the upper right hand corner of your RStudio screen. You should see train listed in that quadrant of your screen. To open the data, just click on the word train (not the blue arrow next to it!)

You will also notice that some additional information is provided in this quadrant of your screen.

Question 1

Based on the information in this quadrant, how many rows are in the data set? How many columns (variables)?


Data at DataFest are generally large. This can mean that the data set has a lot of rows, a lot of columns, or both. The data can even come in more than one data set that you might have to merge with other data sets before your analysis can begin.

Because the data are so large, it is important to learn how to focus on individual pieces of the data. For instance, suppose I only want to look at the third column in the data set. To do this, we create a chunk in our Markdown file and run the following:

train[,3]

This tells R to look at the data set called train and access the third column([, 3]).

It turns out there is a second way to access the third column. The third column in the data set is called building_class, and it tells us whether each building is commercial (for business) or residential (for living in). This means we can also look at the third column by using

train$building_class

This tells R to look at the data set called train and access ($) the column called building_class.

Now, both of these codes produce a VERY large amount of output; 75757 pieces of output to be exact!! This is too much for us to really look at. Instead, what we want to do is look at a table that summarizes what is in the column. To do this, we can use the table command.

table(train$building_class)

This tells R to create a table (table) of the building class column in the train data (train$building_class)

Question 2

Based on the table, how many commercial buildings are in this data set? How many residential buildings?


Why is important that we can answer questions like Question 2? Well, at DataFest, you may be asked to look at only the commercial buildings, or only the residential. Using a table allows you to see how much data you have in any particular category.

Question 3

How many buildings in this data set are colleges or universities? Hint: That information is contained in the facility_type variable in the data set.


Tables are a really powerful tool for categorical data, but what about numeric data? Our response variable (Y) of interest in this data set is:

  • site_eui - Site Energy Usage Intensity (Site EUI). This is the amount of heat and electricity consumed by a building. This is recorded from utility bills for the building over the last year.

This is not a categorical variable; it’s numeric. This means that making a table is not a great plan for exploring this variable. Why? There are too many unique value of site EUI! This means we would get a really big table.

Instead of using a table, for numeric variables we use a summary.

summary(train$site_eui)

This produces 6 numbers.

  • Min: The smallest value of variable.
  • 1st Qu.: The first quartile of the variable. 25% of the values for site EUI are below this value.
  • Median: The median/second quartile of the variable. 50% of the values for site EUI are below this value.
  • Mean: The average value of site EUI.
  • 3rd Qu.: The third quartile of the variable. 75% of the values for site EUI are below this value.
  • Max: The largest value of site EUI.

In addition to the values above, the summary can also help us see things in the data that are unusual.

Question 4

Take a look at the maximum value of site EUI versus all the other numbers. Based on this, do you think there are (A) some unusually small values in the site EUI, (B) some unusually large values of site EUI, (C) no unusual values of site EUI?


To confirm your theory, you can make a box plot of site EUI:

boxplot(train$site_eui, horizontal = TRUE)

Question 5

Was your answer to Question 4 correct?

Looking at data summaries can help us identify unusual features of the data, including skew or very large/very small values. We can also find missing data and other things that will be important for when modeling.

The bottom line: Spending the time to look at summaries or tables of the variables you are interested in can make a big difference in your ability to answer the client’s questions at DataFest.

However…there are 64 variables in this data set! It would be a lot of work for one person to explore them all. That’s why we work in teams for DataFest.

Question 6

Have each member of your team explore 2 different variables in this data set. When you are done, show your results to your teammates and discuss what you have noticed about your chosen variables.

Singling Out Unusual Points

Let’s explore the year that the buildings in the data were built (year_built).

Question 7

Make an appropriate summary to explore the year_built variable.


When you do this, you will notice that the minimum value for year built is 0, and there are 1837 values that are recorded as NA.

NA stands for “not available”, and this means that we do not have information on the year that 1837 of our buildings were built. We’ll come back to this later.

For now, let’s think about that house that was built in year 0. This seems…highly unlikely. What if I want to look at just that one row in the data and see what is going on? Can I do that?

To find the rows in the data that has the smallest value of year, we use

which.min(train$year_built)

This returns 354, which tells us the 354th building in our data set has year recorded as 0.

Question 8

Which building in the data set has the largest value of site EUI? Hint: In the code above, we were looking for a minimum (min). Now, we are looking for a maximum (max).


Now that we know we want to look at the 354th building to find the smallest year built, how do we actually look at that row? Use the code below.

train[354,]

Note: We saw before that train[,3] pulled the third column. Now we see that train[354,] pulls the 354th row. This means that when you use train[ , ], anything that comes BEFORE the comma indicates a row. Anything that comes AFTER the comma indicates a column.

Now, look at the results of running this code.

train[354,]

This provides a lot of information about this building!! The rest of the variables in this data set look okay, so we then consider this 0 again. What does this mean? Well, it could mean the building was built this year (literally the year we are in now), and so someone recorded building age rather than building year. It could also mean the building year was not known, and someone used a 0 to record that. We just don’t know.

This is a reality of real data! It can be messy, and we will encounter things like this at DataFest.

One common way to handle things like this is to replace the 0 with NA. In other words, since we don’t know the year, let’s replace it with something that actually means “we don’t know.”

train$year_built[354] <- "NA"

This code combines working with rows and columns. We choose our column (train$year_built) and we look at a specific building ([354]). We then replace the 354th year built with NA.

Next Steps

This may seem like a lot of code right now, but remember, you don’t have to memorize it!! Part of the point of these activities is for you to have them to refer back to during DataFest. The key right now is to start to see what kinds of things you can do with data in R and why those things might be useful.

In the next part of this activity, we will add on to this exploration of the data with some visualizations.

Citation

This activity uses data from:

Climate Change AI (CCAI) and Lawrence Berkeley National Laboratory (Berkeley Lab). (2022 January). WiDS Datathon 2022, Version 1. Retrieved January 10, 2022 from https://www.kaggle.com/competitions/widsdatathon2022/overview.

Creative Commons License
This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 May 17.