Other verbs
Select variables using select

Figure 6: Select diagram from Data Wrangling with dplyr and tidyr cheatsheet
We've seen that the flights
data frame in the nycflights13
package contains many different variables. The names
function gives a listing of all the columns in a data frame; in our case you would run names(flights)
. You can also identify these variables by running the glimpse
function in the dplyr
package:
glimpse(flights)
Rows: 336,776
Columns: 22
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
$ gain <dbl> -9, -16, -31, 17, 19, -16, -24, 11, 5, -10, 0, 1, -9, 1~
$ hours <dbl> 3.7833333, 3.7833333, 2.6666667, 3.0500000, 1.9333333, ~
$ gain_per_hour <dbl> -2.3788546, -4.2290749, -11.6250000, 5.5737705, 9.82758~
However, say you only want to consider two of these variables, say carrier
and flight
. You can select
these:
%>%
flights select(carrier, flight)
# A tibble: 336,776 x 2
carrier flight
<chr> <int>
1 UA 1545
2 UA 1714
3 AA 1141
4 B6 725
5 DL 461
6 UA 1696
7 B6 507
8 EV 5708
9 B6 79
10 AA 301
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
This function makes navigating data sets with a very large number of variables easier for humans by restricting consideration to only those of interest, like carrier
and flight
above. So for example, this might make viewing the data set using the View()
spreadsheet viewer more digestible. However, as far as the computer is concerned it doesn't care how many variables additional variables are in the data set in question, so long as carrier
and flight
are included.
Another example involves the variable year
. If you remember the original description of the flights
data frame (or by running ?flights
), you'll remember that this data correspond to flights in 2013 departing New York City. The year
variable isn't really a variable here in that it doesn't vary... flights
actually comes from a larger data set that covers many years. We may want to remove the year
variable from our data set since it won't be helpful for analysis in this case. We can deselect year
by using the -
sign:
<- flights %>%
flights_no_year select(-year)
names(flights_no_year)
[1] "month" "day" "dep_time" "sched_dep_time"
[5] "dep_delay" "arr_time" "sched_arr_time" "arr_delay"
[9] "carrier" "flight" "tailnum" "origin"
[13] "dest" "air_time" "distance" "hour"
[17] "minute" "time_hour" "gain" "hours"
[21] "gain_per_hour"
Or we could specify a ranges of columns:
<- flights %>%
flight_arr_times select(month:dep_time, arr_time:sched_arr_time)
flight_arr_times
# A tibble: 336,776 x 5
month day dep_time arr_time sched_arr_time
<int> <int> <int> <int> <int>
1 1 1 517 830 819
2 1 1 533 850 830
3 1 1 542 923 850
4 1 1 544 1004 1022
5 1 1 554 812 837
6 1 1 554 740 728
7 1 1 555 913 854
8 1 1 557 709 723
9 1 1 557 838 846
10 1 1 558 753 745
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
The select
function can also be used to reorder columns in combination with the everything
helper function. Let's suppose we'd like the hour
, minute
, and time_hour
variables, which appear at the end of the flights
data set, to actually appear immediately after the day
variable:
<- flights %>%
flights_reorder select(month:day, hour:time_hour, everything())
names(flights_reorder)
[1] "month" "day" "hour" "minute"
[5] "time_hour" "year" "dep_time" "sched_dep_time"
[9] "dep_delay" "arr_time" "sched_arr_time" "arr_delay"
[13] "carrier" "flight" "tailnum" "origin"
[17] "dest" "air_time" "distance" "gain"
[21] "hours" "gain_per_hour"
in this case everything()
picks up all remaining variables. Lastly, the helper functions starts_with
, ends_with
, and contains
can be used to choose variables/column names that match those conditions:
<- flights %>%
flights_begin_a select(starts_with("a"))
head(flights_begin_a)
# A tibble: 6 x 3
arr_time arr_delay air_time
<int> <dbl> <dbl>
1 830 11 227
2 850 20 227
3 923 33 160
4 1004 -18 183
5 812 -25 116
6 740 12 150
<- flights %>%
flights_delays select(ends_with("delay"))
head(flights_delays)
# A tibble: 6 x 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
<- flights %>%
flights_time select(contains("time"))
head(flights_time)
# A tibble: 6 x 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
Rename variables using rename
Another useful function is rename
, which as you may suspect renames one column to another name. Suppose we wanted dep_time
and arr_time
to be departure_time
and arrival_time
instead in the flights_time
data frame:
<- flights %>%
flights_time select(contains("time")) %>%
rename(departure_time = dep_time,
arrival_time = arr_time)
names(flights_time)
[1] "departure_time" "sched_dep_time" "arrival_time" "sched_arr_time"
[5] "air_time" "time_hour"
Note that in this case we used a single =
sign with the rename()
. e.g.. departure_time = dep_time
. This is because we are not testing for equality like we would using ==
, but instead we want to assign a new variable departure_time
to have the same values as dep_time
and then delete the variable dep_time
.
It's easy to forget if the new name comes before or after the equals sign. I usually remember this as "New Before, Old After" or NBOA. You'll receive an error if you try to do it the other way:
Error: Unknown variables: departure_time, arrival_time.
Find the top number of values using top_n
We can also use the top_n
function which automatically tells us the most frequent num_flights
. We specify the top 10 airports here:
%>%
named_dests top_n(n = 10, wt = num_flights)
# A tibble: 10 x 9
dest num_flights airport_name lat lon alt tz dst tzone
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 MIA 11728 Miami Intl 25.8 -80.3 8 -5 A Amer~
2 FLL 12055 Fort Lauderdale Holly~ 26.1 -80.2 9 -5 A Amer~
3 MCO 14082 Orlando Intl 28.4 -81.3 96 -5 A Amer~
4 ATL 17215 Hartsfield Jackson At~ 33.6 -84.4 1026 -5 A Amer~
5 LAX 16174 Los Angeles Intl 33.9 -118. 126 -8 A Amer~
6 CLT 14064 Charlotte Douglas Intl 35.2 -80.9 748 -5 A Amer~
7 SFO 13331 San Francisco Intl 37.6 -122. 13 -8 A Amer~
8 DCA 9705 Ronald Reagan Washing~ 38.9 -77.0 15 -5 A Amer~
9 ORD 17283 Chicago Ohare Intl 42.0 -87.9 668 -6 A Amer~
10 BOS 15508 General Edward Lawren~ 42.4 -71.0 19 -5 A Amer~
We'll still need to arrange this by num_flights
though:
%>%
named_dests top_n(n = 10, wt = num_flights) %>%
arrange(desc(num_flights))
# A tibble: 10 x 9
dest num_flights airport_name lat lon alt tz dst tzone
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ORD 17283 Chicago Ohare Intl 42.0 -87.9 668 -6 A Amer~
2 ATL 17215 Hartsfield Jackson At~ 33.6 -84.4 1026 -5 A Amer~
3 LAX 16174 Los Angeles Intl 33.9 -118. 126 -8 A Amer~
4 BOS 15508 General Edward Lawren~ 42.4 -71.0 19 -5 A Amer~
5 MCO 14082 Orlando Intl 28.4 -81.3 96 -5 A Amer~
6 CLT 14064 Charlotte Douglas Intl 35.2 -80.9 748 -5 A Amer~
7 SFO 13331 San Francisco Intl 37.6 -122. 13 -8 A Amer~
8 FLL 12055 Fort Lauderdale Holly~ 26.1 -80.2 9 -5 A Amer~
9 MIA 11728 Miami Intl 25.8 -80.3 8 -5 A Amer~
10 DCA 9705 Ronald Reagan Washing~ 38.9 -77.0 15 -5 A Amer~
Note: Remember that I didn't pull the n
and wt
arguments out of thin air. They can be found by using the ?
function on top_n
.
We can go one step further and tie together the group_by
and summarize
functions we used to find the most frequent flights:
<- flights %>%
ten_freq_dests group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights)) %>%
top_n(n = 10)
ten_freq_dests
# A tibble: 10 x 2
dest num_flights
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
Task
What are some ways to select all three of the dest
, air_time
, and distance
variables from flights
? Give the code showing how to do this in at least three different ways.
%>%
flights select(dest, air_time, distance)
# A tibble: 336,776 x 3
dest air_time distance
<chr> <dbl> <dbl>
1 IAH 227 1400
2 IAH 227 1416
3 MIA 160 1089
4 BQN 183 1576
5 ATL 116 762
6 ORD 150 719
7 FLL 158 1065
8 IAD 53 229
9 MCO 140 944
10 ORD 138 733
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
%>%
flights select(dest:distance)
# A tibble: 336,776 x 3
dest air_time distance
<chr> <dbl> <dbl>
1 IAH 227 1400
2 IAH 227 1416
3 MIA 160 1089
4 BQN 183 1576
5 ATL 116 762
6 ORD 150 719
7 FLL 158 1065
8 IAD 53 229
9 MCO 140 944
10 ORD 138 733
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
%>%
flights select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
-sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin,
-hour, -minute, -time_hour)
# A tibble: 336,776 x 6
dest air_time distance gain hours gain_per_hour
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 IAH 227 1400 -9 3.78 -2.38
2 IAH 227 1416 -16 3.78 -4.23
3 MIA 160 1089 -31 2.67 -11.6
4 BQN 183 1576 17 3.05 5.57
5 ATL 116 762 19 1.93 9.83
6 ORD 150 719 -16 2.5 -6.4
7 FLL 158 1065 -24 2.63 -9.11
8 IAD 53 229 11 0.883 12.5
9 MCO 140 944 5 2.33 2.14
10 ORD 138 733 -10 2.3 -4.35
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
Task
How could one use starts_with
, ends_with
, and contains
to select columns from the flights
data frame? Provide three different examples in total: one for starts_with
, one for ends_with
, and one for contains
.
# Anything that starts with "d"
%>%
flights select(starts_with("d"))
# A tibble: 336,776 x 5
day dep_time dep_delay dest distance
<int> <int> <dbl> <chr> <dbl>
1 1 517 2 IAH 1400
2 1 533 4 IAH 1416
3 1 542 2 MIA 1089
4 1 544 -1 BQN 1576
5 1 554 -6 ATL 762
6 1 554 -4 ORD 719
7 1 555 -5 FLL 1065
8 1 557 -3 IAD 229
9 1 557 -3 MCO 944
10 1 558 -2 ORD 733
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
# Anything related to delays:
%>%
flights select(ends_with("delay"))
# A tibble: 336,776 x 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
# Anything related to departures:
%>%
flights select(contains("dep"))
# A tibble: 336,776 x 3
dep_time sched_dep_time dep_delay
<int> <int> <dbl>
1 517 515 2
2 533 529 4
3 542 540 2
4 544 545 -1
5 554 600 -6
6 554 558 -4
7 555 600 -5
8 557 600 -3
9 557 600 -3
10 558 600 -2
# ... with 336,766 more rows
# i Use `print(n = ...)` to see more rows
Task Create a new data frame that shows the top 5 airports with the largest average arrival delays from NYC in 2013.
<- flights %>%
top5_arr_delays group_by(dest) %>%
summarize(mean_arr_delay = mean(arr_delay, na.rm=T)) %>%
arrange(desc(mean_arr_delay)) %>%
top_n(n = 5)
%>%
top5_arr_delays inner_join(airports %>% select(faa, name), by = c("dest" = "faa") )
# A tibble: 5 x 3
dest mean_arr_delay name
<chr> <dbl> <chr>
1 CAE 41.8 Columbia Metropolitan
2 TUL 33.7 Tulsa Intl
3 OKC 30.6 Will Rogers World
4 JAC 28.1 Jackson Hole Airport
5 TYS 24.1 Mc Ghee Tyson