Filter observations using filter

The filter function allows you to specify criteria about values of a variable in your data set and then chooses only those rows that match that criteria. We begin by returning to the flights data frame in the nycflights13 package, focusing only on flights from New York City to Portland, Oregon. The dest code (or airport code) for Portland, Oregon is "PDX". Run the following and look at the resulting spreadsheet to ensure that only flights heading to Portland are chosen here:

portland_flights <- flights %>% 
  filter(dest == "PDX")
head(portland_flights[,seq(-6,-12)]) 
# A tibble: 6 × 12
   year month   day dep_time sched_dep_time origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int> <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     1     1     1739           1740 JFK    PDX        341     2454    17
2  2013     1     1     1805           1757 EWR    PDX        336     2434    17
3  2013     1     1     2052           2029 JFK    PDX        331     2454    20
4  2013     1     2      804            805 EWR    PDX        310     2434     8
5  2013     1     2     1552           1550 JFK    PDX        305     2454    15
6  2013     1     2     1727           1720 EWR    PDX        351     2434    17
# ℹ 2 more variables: minute <dbl>, time_hour <dttm>
#We leave out columns 6-11 from the display so we can see the "dest" variable

Note the following:

  • The ordering of the commands:
    • Take the data frame flights then
    • filter the data frame so that only those where the dest equals "PDX" are included.
  • The double equal sign == for testing for equality, and not =. (You are almost guaranteed to make the mistake of only including one equals sign at least once!!)

You can combine multiple criteria together using operators that make comparisons:

  • | corresponds to "or"
  • & corresponds to "and"

We can often skip the use of & and just separate our conditions with a comma. You'll see this in the example below.

In addition, you can use other mathematical checks (similar to ==):

  • > corresponds to "greater than"
  • < corresponds to "less than"
  • >= corresponds to "greater than or equal to"
  • <= corresponds to "less than or equal to"
  • != corresponds to "not equal to"

To see many of these in action, let's select all flights that left JFK airport heading to Burlington, Vermont ("BTV") or Seattle, Washington ("SEA") in the months of October, November, or December. This can be done with the following code:

btv_sea_flights_fall <- flights %>% 
  filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
head(btv_sea_flights_fall[,-6:-12])
# A tibble: 6 × 12
   year month   day dep_time sched_dep_time origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int> <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013    10     1      729            735 JFK    SEA        352     2422     7
2  2013    10     1      853            900 JFK    SEA        362     2422     9
3  2013    10     1      916            925 JFK    BTV         48      266     9
4  2013    10     1     1216           1221 JFK    BTV         49      266    12
5  2013    10     1     1452           1459 JFK    BTV         46      266    14
6  2013    10     1     1459           1500 JFK    SEA        348     2422    15
# ℹ 2 more variables: minute <dbl>, time_hour <dttm>
#We leave out columns 6-11 from the display so we can see the "origin" and "dest" variables

Note: even though colloquially speaking one might say "all flights leaving for Burlington, Vermont and Seattle, Washington," in terms of computer logical operations, we really mean "all flights leaving for Burlington, Vermont or Seattle, Washington." For a given row in the data, dest can be "BTV", "SEA", or something else, but not "BTV" and "SEA" at the same time.

Another example uses the ! to pick rows that don't match a condition. The ! can be read as "not". Here we are selecting rows corresponding to flights that didn't go to Burlington, VT or Seattle, WA.

not_BTV_SEA <- flights %>% 
  filter(!(dest == "BTV" | dest == "SEA"))
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 × 12
   year month   day dep_time sched_dep_time origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int> <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     1     1      517            515 EWR    IAH        227     1400     5
2  2013     1     1      533            529 LGA    IAH        227     1416     5
3  2013     1     1      542            540 JFK    MIA        160     1089     5
4  2013     1     1      544            545 JFK    BQN        183     1576     5
5  2013     1     1      554            600 LGA    ATL        116      762     6
6  2013     1     1      554            558 EWR    ORD        150      719     5
# ℹ 2 more variables: minute <dbl>, time_hour <dttm>
#We leave out columns 6-11 from the display so we can see the "origin" and "dest" variables)

As a final note we point out that filter() should often be the first verb you'll apply to your data. This narrows down the data to just the observations you are interested in.


Task What's another way, using the "not" operator !, we could filter only the rows that are not going to Burlington nor Seattle in the flights data frame?

not_BTV_SEA <- flights %>% 
  filter(!dest == "BTV" & !dest == "SEA")
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 × 12
   year month   day dep_time sched_dep_time origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int> <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     1     1      517            515 EWR    IAH        227     1400     5
2  2013     1     1      533            529 LGA    IAH        227     1416     5
3  2013     1     1      542            540 JFK    MIA        160     1089     5
4  2013     1     1      544            545 JFK    BQN        183     1576     5
5  2013     1     1      554            600 LGA    ATL        116      762     6
6  2013     1     1      554            558 EWR    ORD        150      719     5
# ℹ 2 more variables: minute <dbl>, time_hour <dttm>
# Alternative way
not_BTV_SEA <- flights %>% 
  filter(dest != "BTV" & dest != "SEA")
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 × 12
   year month   day dep_time sched_dep_time origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int> <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     1     1      517            515 EWR    IAH        227     1400     5
2  2013     1     1      533            529 LGA    IAH        227     1416     5
3  2013     1     1      542            540 JFK    MIA        160     1089     5
4  2013     1     1      544            545 JFK    BQN        183     1576     5
5  2013     1     1      554            600 LGA    ATL        116      762     6
6  2013     1     1      554            558 EWR    ORD        150      719     5
# ℹ 2 more variables: minute <dbl>, time_hour <dttm>