Filter observations using filter
The filter
function allows you to specify criteria about values of a variable in your data set and then chooses only those rows that match that criteria. We begin by returning to the flights
data frame in the nycflights13
package, focusing only on flights from New York City to Portland, Oregon. The dest
code (or airport code) for Portland, Oregon is "PDX"
. Run the following and look at the resulting spreadsheet to ensure that only flights heading to Portland are chosen here:
<- flights %>%
portland_flights filter(dest == "PDX")
head(portland_flights[,seq(-6,-12)])
# A tibble: 6 x 12
year month day dep_time sched_d~1 origin dest air_t~2 dista~3 hour minute
<int> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 1739 1740 JFK PDX 341 2454 17 40
2 2013 1 1 1805 1757 EWR PDX 336 2434 17 57
3 2013 1 1 2052 2029 JFK PDX 331 2454 20 29
4 2013 1 2 804 805 EWR PDX 310 2434 8 5
5 2013 1 2 1552 1550 JFK PDX 305 2454 15 50
6 2013 1 2 1727 1720 EWR PDX 351 2434 17 20
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
# 1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
#We leave out columns 6-11 from the display so we can see the "dest" variable
Note the following:
- The ordering of the commands:
- Take the data frame
flights
then filter
the data frame so that only those where thedest
equals"PDX"
are included.
- Take the data frame
- The double equal sign
==
for testing for equality, and not=
. (You are almost guaranteed to make the mistake of only including one equals sign at least once!!)
You can combine multiple criteria together using operators that make comparisons:
|
corresponds to "or"&
corresponds to "and"
We can often skip the use of &
and just separate our conditions with a comma. You'll see this in the example below.
In addition, you can use other mathematical checks (similar to ==
):
>
corresponds to "greater than"<
corresponds to "less than">=
corresponds to "greater than or equal to"<=
corresponds to "less than or equal to"!=
corresponds to "not equal to"
To see many of these in action, let's select all flights that left JFK airport heading to Burlington, Vermont ("BTV"
) or Seattle, Washington ("SEA"
) in the months of October, November, or December. This can be done with the following code:
<- flights %>%
btv_sea_flights_fall filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
head(btv_sea_flights_fall[,-6:-12])
# A tibble: 6 x 12
year month day dep_time sched_d~1 origin dest air_t~2 dista~3 hour minute
<int> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 10 1 729 735 JFK SEA 352 2422 7 35
2 2013 10 1 853 900 JFK SEA 362 2422 9 0
3 2013 10 1 916 925 JFK BTV 48 266 9 25
4 2013 10 1 1216 1221 JFK BTV 49 266 12 21
5 2013 10 1 1452 1459 JFK BTV 46 266 14 59
6 2013 10 1 1459 1500 JFK SEA 348 2422 15 0
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
# 1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
#We leave out columns 6-11 from the display so we can see the "origin" and "dest" variables
Note: even though colloquially speaking one might say "all flights leaving for Burlington, Vermont and Seattle, Washington," in terms of computer logical operations, we really mean "all flights leaving for Burlington, Vermont or Seattle, Washington." For a given row in the data, dest
can be "BTV", "SEA", or something else, but not "BTV" and "SEA" at the same time.
Another example uses the !
to pick rows that don't match a condition. The !
can be read as "not". Here we are selecting rows corresponding to flights that didn't go to Burlington, VT or Seattle, WA.
<- flights %>%
not_BTV_SEA filter(!(dest == "BTV" | dest == "SEA"))
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 x 12
year month day dep_time sched_d~1 origin dest air_t~2 dista~3 hour minute
<int> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 JFK MIA 160 1089 5 40
4 2013 1 1 544 545 JFK BQN 183 1576 5 45
5 2013 1 1 554 600 LGA ATL 116 762 6 0
6 2013 1 1 554 558 EWR ORD 150 719 5 58
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
# 1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
#We leave out columns 6-11 from the display so we can see the "origin" and "dest" variables)
As a final note we point out that filter()
should often be the first verb you'll apply to your data. This narrows down the data to just the observations you are interested in.
Task
What's another way, using the "not" operator !
, we could filter only the rows that are not going to Burlington nor Seattle in the flights
data frame?
<- flights %>%
not_BTV_SEA filter(!dest == "BTV" & !dest == "SEA")
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 x 12
year month day dep_time sched_d~1 origin dest air_t~2 dista~3 hour minute
<int> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 JFK MIA 160 1089 5 40
4 2013 1 1 544 545 JFK BQN 183 1576 5 45
5 2013 1 1 554 600 LGA ATL 116 762 6 0
6 2013 1 1 554 558 EWR ORD 150 719 5 58
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
# 1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
# Alternative way
<- flights %>%
not_BTV_SEA filter(dest != "BTV" & dest != "SEA")
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 x 12
year month day dep_time sched_d~1 origin dest air_t~2 dista~3 hour minute
<int> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 JFK MIA 160 1089 5 40
4 2013 1 1 544 545 JFK BQN 183 1576 5 45
5 2013 1 1 554 600 LGA ATL 116 762 6 0
6 2013 1 1 554 558 EWR ORD 150 719 5 58
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
# 1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names