Joining data frames
Another common task is joining or merging two different data sets. For example, in the flights
data, the variable carrier
lists the carrier code for the different flights. While "UA"
and "AA"
might be somewhat easy to guess for some (United and American Airlines), what are "VX", "HA", and "B6"? This information is provided in a separate data frame airlines
.
head(airlines)
# A tibble: 6 x 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
We see that in airports
, carrier
is the carrier code while name
is the full name of the airline. Using this table, we can see that "VX", "HA", and "B6" correspond to Virgin America, Hawaiian Airlines, and JetBlue respectively. However, will we have to continually look up the carrier's name for each flight in the airlines
data set? No! Instead of having to do this manually, we can have R automatically do the "looking up" for us.
Note that the values in the variable carrier
in flights
match the values in the variable carrier
in airlines
. In this case, we can use the variable carrier
as a key variable to join/merge/match the two data frames by. Key variables are almost always identification variables that uniquely identify the observational units as we saw back in the "Identification vs Measurement Variable" section. This ensures that rows in both data frames are appropriate matched during the join.
This diagram helps us understand how the different data sets are linked by various key variables:

Figure 4: Data relationships in nycflights13 from R for Data Science, Hadley and Garrett (2016)
Joining by "key" variables
In both flights
and airlines
, the key variable we want to join/merge/match the two data frames with has the same name in both data sets: carriers
. We make use of the inner_join()
function to join by the variable carrier
.
<- flights %>%
flights_joined inner_join(airlines, by = "carrier")
names(flights)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour" "gain"
[21] "hours" "gain_per_hour"
names(flights_joined)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour" "gain"
[21] "hours" "gain_per_hour" "name"
%>% select(flight, carrier, name) flights_joined
# A tibble: 336,776 x 3
flight carrier name
<int> <chr> <chr>
1 1545 UA United Air Lines Inc.
2 1714 UA United Air Lines Inc.
3 1141 AA American Airlines Inc.
4 725 B6 JetBlue Airways
5 461 DL Delta Air Lines Inc.
6 1696 UA United Air Lines Inc.
7 507 B6 JetBlue Airways
8 5708 EV ExpressJet Airlines Inc.
9 79 B6 JetBlue Airways
10 301 AA American Airlines Inc.
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
We observed that the flights
and flights_joined
are identical except that flights_joined
has an additional variable name
whose values were drawn from airlines
.
A visual representation of the inner_join
is given below:

Figure 5: Diagram of inner join from R for Data Science
There are more complex joins available, but the inner_join
will solve nearly all of the problems you'll face in our experience.
Joining by "key" variables with different names
Say instead, you are interested in all the destinations of flights from NYC in 2013 and ask yourself:
- "What cities are these airports in?"
- "Is
"ORD"
Orlando?" - "Where is
"FLL"
?
The airports
data frame contains airport codes:
head(airports)
# A tibble: 6 x 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne~
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch~
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch~
4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne~
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne~
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne~
However, looking at both the airports
and flights
and the visual representation of the relations between the data frames in the figure above, we see that in:
airports
the airport code is in the variablefaa
flights
the departure airport code is in the variableorigin
So to join these two data sets, our inner_join
operation involves a by
argument that accounts for the different names:
%>%
flights inner_join(airports, by = c("dest" = "faa"))
Let's construct the sequence of commands that computes the number of flights from NYC to each destination, but also includes information about each destination airport:
<- flights %>%
named_dests group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
rename(airport_name = name)
named_dests
# A tibble: 101 x 9
dest num_flights airport_name lat lon alt tz dst tzone
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ORD 17283 Chicago Ohare Intl 42.0 -87.9 668 -6 A Amer~
2 ATL 17215 Hartsfield Jackson At~ 33.6 -84.4 1026 -5 A Amer~
3 LAX 16174 Los Angeles Intl 33.9 -118. 126 -8 A Amer~
4 BOS 15508 General Edward Lawren~ 42.4 -71.0 19 -5 A Amer~
5 MCO 14082 Orlando Intl 28.4 -81.3 96 -5 A Amer~
6 CLT 14064 Charlotte Douglas Intl 35.2 -80.9 748 -5 A Amer~
7 SFO 13331 San Francisco Intl 37.6 -122. 13 -8 A Amer~
8 FLL 12055 Fort Lauderdale Holly~ 26.1 -80.2 9 -5 A Amer~
9 MIA 11728 Miami Intl 25.8 -80.3 8 -5 A Amer~
10 DCA 9705 Ronald Reagan Washing~ 38.9 -77.0 15 -5 A Amer~
# ... with 91 more rows
# i Use `print(n = ...)` to see more rows
In case you didn't know, "ORD"
is the airport code of Chicago O'Hare airport and "FLL"
is the main airport in Fort Lauderdale, Florida, which we can now see in our named_dests
data frame.
Joining by multiple "key" variables
Say instead we are in a situation where we need to join by multiple variables. For example, in the first figure in this section we see that in order to join the flights
and weather
data frames, we need more than one key variable: year
, month
, day
, hour
, and origin
. This is because the combination of these 5 variables act to uniquely identify each observational unit in the weather
data frame: hourly weather recordings at each of the 3 NYC airports.
We achieve this by specifying a vector of key variables to join by using the c()
concatenate function. Note the individual variables need to be wrapped in quotation marks.
<- flights %>%
flights_weather_joined inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
head(flights_weather_joined[,c(1:4,10:11,22:32)])
# A tibble: 6 x 17
year month day dep_time carrier flight gain_per~1 temp dewp humid wind_~2
<int> <int> <int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 UA 1545 -2.38 39.0 28.0 64.4 260
2 2013 1 1 533 UA 1714 -4.23 39.9 25.0 54.8 250
3 2013 1 1 542 AA 1141 -11.6 39.0 27.0 61.6 260
4 2013 1 1 544 B6 725 5.57 39.0 27.0 61.6 260
5 2013 1 1 554 DL 461 9.83 39.9 25.0 54.8 260
6 2013 1 1 554 UA 1696 -6.4 39.0 28.0 64.4 260
# ... with 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
# pressure <dbl>, visib <dbl>, time_hour.y <dttm>, and abbreviated variable
# names 1: gain_per_hour, 2: wind_dir
# i Use `colnames()` to see all variable names
Task
Looking at the first figure in this section, when joining flights
and weather
(or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year
, month
, day
, hour
, and origin
, and not just hour
?
Because hour
is simply a value between 0 and 23; to identify a specific hour, we need to also know which year
, month
, day
and at which airport
.