Joining data frames

Another common task is joining or merging two different data sets. For example, in the flights data, the variable carrier lists the carrier code for the different flights. While "UA" and "AA" might be somewhat easy to guess for some (United and American Airlines), what are "VX", "HA", and "B6"? This information is provided in a separate data frame airlines.

head(airlines)
# A tibble: 6 x 2
  carrier name                    
  <chr>   <chr>                   
1 9E      Endeavor Air Inc.       
2 AA      American Airlines Inc.  
3 AS      Alaska Airlines Inc.    
4 B6      JetBlue Airways         
5 DL      Delta Air Lines Inc.    
6 EV      ExpressJet Airlines Inc.

We see that in airports, carrier is the carrier code while name is the full name of the airline. Using this table, we can see that "VX", "HA", and "B6" correspond to Virgin America, Hawaiian Airlines, and JetBlue respectively. However, will we have to continually look up the carrier's name for each flight in the airlines data set? No! Instead of having to do this manually, we can have R automatically do the "looking up" for us.

Note that the values in the variable carrier in flights match the values in the variable carrier in airlines. In this case, we can use the variable carrier as a key variable to join/merge/match the two data frames by. Key variables are almost always identification variables that uniquely identify the observational units as we saw back in the "Identification vs Measurement Variable" section. This ensures that rows in both data frames are appropriate matched during the join.

This diagram helps us understand how the different data sets are linked by various key variables:

Data relationships in nycflights13 from R for Data Science, Hadley and Garrett (2016)

Figure 4: Data relationships in nycflights13 from R for Data Science, Hadley and Garrett (2016)

Joining by "key" variables

In both flights and airlines, the key variable we want to join/merge/match the two data frames with has the same name in both data sets: carriers. We make use of the inner_join() function to join by the variable carrier.

flights_joined <- flights %>% 
  inner_join(airlines, by = "carrier")
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"      "gain"          
[21] "hours"          "gain_per_hour" 
names(flights_joined)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"      "gain"          
[21] "hours"          "gain_per_hour"  "name"          
flights_joined %>% select(flight, carrier, name) 
# A tibble: 336,776 x 3
   flight carrier name                    
    <int> <chr>   <chr>                   
 1   1545 UA      United Air Lines Inc.   
 2   1714 UA      United Air Lines Inc.   
 3   1141 AA      American Airlines Inc.  
 4    725 B6      JetBlue Airways         
 5    461 DL      Delta Air Lines Inc.    
 6   1696 UA      United Air Lines Inc.   
 7    507 B6      JetBlue Airways         
 8   5708 EV      ExpressJet Airlines Inc.
 9     79 B6      JetBlue Airways         
10    301 AA      American Airlines Inc.  
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows

We observed that the flights and flights_joined are identical except that flights_joined has an additional variable name whose values were drawn from airlines.

A visual representation of the inner_join is given below:

Diagram of inner join from R for Data Science

Figure 5: Diagram of inner join from R for Data Science

There are more complex joins available, but the inner_join will solve nearly all of the problems you'll face in our experience.

Joining by "key" variables with different names

Say instead, you are interested in all the destinations of flights from NYC in 2013 and ask yourself:

  • "What cities are these airports in?"
  • "Is "ORD" Orlando?"
  • "Where is "FLL"?

The airports data frame contains airport codes:

head(airports)
# A tibble: 6 x 8
  faa   name                             lat   lon   alt    tz dst   tzone      
  <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne~
2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch~
3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch~
4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne~
5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne~
6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne~

However, looking at both the airports and flights and the visual representation of the relations between the data frames in the figure above, we see that in:

  • airports the airport code is in the variable faa
  • flights the departure airport code is in the variable origin

So to join these two data sets, our inner_join operation involves a by argument that accounts for the different names:

flights %>% 
  inner_join(airports, by = c("dest" = "faa"))

Let's construct the sequence of commands that computes the number of flights from NYC to each destination, but also includes information about each destination airport:

named_dests <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n()) %>%
  arrange(desc(num_flights)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
named_dests
# A tibble: 101 x 9
   dest  num_flights airport_name             lat    lon   alt    tz dst   tzone
   <chr>       <int> <chr>                  <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
 1 ORD         17283 Chicago Ohare Intl      42.0  -87.9   668    -6 A     Amer~
 2 ATL         17215 Hartsfield Jackson At~  33.6  -84.4  1026    -5 A     Amer~
 3 LAX         16174 Los Angeles Intl        33.9 -118.    126    -8 A     Amer~
 4 BOS         15508 General Edward Lawren~  42.4  -71.0    19    -5 A     Amer~
 5 MCO         14082 Orlando Intl            28.4  -81.3    96    -5 A     Amer~
 6 CLT         14064 Charlotte Douglas Intl  35.2  -80.9   748    -5 A     Amer~
 7 SFO         13331 San Francisco Intl      37.6 -122.     13    -8 A     Amer~
 8 FLL         12055 Fort Lauderdale Holly~  26.1  -80.2     9    -5 A     Amer~
 9 MIA         11728 Miami Intl              25.8  -80.3     8    -5 A     Amer~
10 DCA          9705 Ronald Reagan Washing~  38.9  -77.0    15    -5 A     Amer~
# ... with 91 more rows
# i Use `print(n = ...)` to see more rows

In case you didn't know, "ORD" is the airport code of Chicago O'Hare airport and "FLL" is the main airport in Fort Lauderdale, Florida, which we can now see in our named_dests data frame.

Joining by multiple "key" variables

Say instead we are in a situation where we need to join by multiple variables. For example, in the first figure in this section we see that in order to join the flights and weather data frames, we need more than one key variable: year, month, day, hour, and origin. This is because the combination of these 5 variables act to uniquely identify each observational unit in the weather data frame: hourly weather recordings at each of the 3 NYC airports.

We achieve this by specifying a vector of key variables to join by using the c() concatenate function. Note the individual variables need to be wrapped in quotation marks.

flights_weather_joined <- flights %>%
  inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_joined[,c(1:4,10:11,22:32)])
# A tibble: 6 x 17
   year month   day dep_time carrier flight gain_per~1  temp  dewp humid wind_~2
  <int> <int> <int>    <int> <chr>    <int>      <dbl> <dbl> <dbl> <dbl>   <dbl>
1  2013     1     1      517 UA        1545      -2.38  39.0  28.0  64.4     260
2  2013     1     1      533 UA        1714      -4.23  39.9  25.0  54.8     250
3  2013     1     1      542 AA        1141     -11.6   39.0  27.0  61.6     260
4  2013     1     1      544 B6         725       5.57  39.0  27.0  61.6     260
5  2013     1     1      554 DL         461       9.83  39.9  25.0  54.8     260
6  2013     1     1      554 UA        1696      -6.4   39.0  28.0  64.4     260
# ... with 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
#   pressure <dbl>, visib <dbl>, time_hour.y <dttm>, and abbreviated variable
#   names 1: gain_per_hour, 2: wind_dir
# i Use `colnames()` to see all variable names


Task Looking at the first figure in this section, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour?

Because hour is simply a value between 0 and 23; to identify a specific hour, we need to also know which year, month, day and at which airport.