Filter observations using filter

The filter function allows you to specify criteria about values of a variable in your data set and then chooses only those rows that match that criteria. We begin by returning to the flights data frame in the nycflights13 package, focusing only on flights from New York City to Portland, Oregon. The dest code (or airport code) for Portland, Oregon is "PDX". Run the following and look at the resulting spreadsheet to ensure that only flights heading to Portland are chosen here:

portland_flights <- flights %>% 
  filter(dest == "PDX")
head(portland_flights[,seq(-6,-12)]) 
# A tibble: 6 x 12
   year month   day dep_time sched_d~1 origin dest  air_t~2 dista~3  hour minute
  <int> <int> <int>    <int>     <int> <chr>  <chr>   <dbl>   <dbl> <dbl>  <dbl>
1  2013     1     1     1739      1740 JFK    PDX       341    2454    17     40
2  2013     1     1     1805      1757 EWR    PDX       336    2434    17     57
3  2013     1     1     2052      2029 JFK    PDX       331    2454    20     29
4  2013     1     2      804       805 EWR    PDX       310    2434     8      5
5  2013     1     2     1552      1550 JFK    PDX       305    2454    15     50
6  2013     1     2     1727      1720 EWR    PDX       351    2434    17     20
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
#   1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
#We leave out columns 6-11 from the display so we can see the "dest" variable

Note the following:

  • The ordering of the commands:
    • Take the data frame flights then
    • filter the data frame so that only those where the dest equals "PDX" are included.
  • The double equal sign == for testing for equality, and not =. (You are almost guaranteed to make the mistake of only including one equals sign at least once!!)

You can combine multiple criteria together using operators that make comparisons:

  • | corresponds to "or"
  • & corresponds to "and"

We can often skip the use of & and just separate our conditions with a comma. You'll see this in the example below.

In addition, you can use other mathematical checks (similar to ==):

  • > corresponds to "greater than"
  • < corresponds to "less than"
  • >= corresponds to "greater than or equal to"
  • <= corresponds to "less than or equal to"
  • != corresponds to "not equal to"

To see many of these in action, let's select all flights that left JFK airport heading to Burlington, Vermont ("BTV") or Seattle, Washington ("SEA") in the months of October, November, or December. This can be done with the following code:

btv_sea_flights_fall <- flights %>% 
  filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
head(btv_sea_flights_fall[,-6:-12])
# A tibble: 6 x 12
   year month   day dep_time sched_d~1 origin dest  air_t~2 dista~3  hour minute
  <int> <int> <int>    <int>     <int> <chr>  <chr>   <dbl>   <dbl> <dbl>  <dbl>
1  2013    10     1      729       735 JFK    SEA       352    2422     7     35
2  2013    10     1      853       900 JFK    SEA       362    2422     9      0
3  2013    10     1      916       925 JFK    BTV        48     266     9     25
4  2013    10     1     1216      1221 JFK    BTV        49     266    12     21
5  2013    10     1     1452      1459 JFK    BTV        46     266    14     59
6  2013    10     1     1459      1500 JFK    SEA       348    2422    15      0
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
#   1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
#We leave out columns 6-11 from the display so we can see the "origin" and "dest" variables

Note: even though colloquially speaking one might say "all flights leaving for Burlington, Vermont and Seattle, Washington," in terms of computer logical operations, we really mean "all flights leaving for Burlington, Vermont or Seattle, Washington." For a given row in the data, dest can be "BTV", "SEA", or something else, but not "BTV" and "SEA" at the same time.

Another example uses the ! to pick rows that don't match a condition. The ! can be read as "not". Here we are selecting rows corresponding to flights that didn't go to Burlington, VT or Seattle, WA.

not_BTV_SEA <- flights %>% 
  filter(!(dest == "BTV" | dest == "SEA"))
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 x 12
   year month   day dep_time sched_d~1 origin dest  air_t~2 dista~3  hour minute
  <int> <int> <int>    <int>     <int> <chr>  <chr>   <dbl>   <dbl> <dbl>  <dbl>
1  2013     1     1      517       515 EWR    IAH       227    1400     5     15
2  2013     1     1      533       529 LGA    IAH       227    1416     5     29
3  2013     1     1      542       540 JFK    MIA       160    1089     5     40
4  2013     1     1      544       545 JFK    BQN       183    1576     5     45
5  2013     1     1      554       600 LGA    ATL       116     762     6      0
6  2013     1     1      554       558 EWR    ORD       150     719     5     58
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
#   1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
#We leave out columns 6-11 from the display so we can see the "origin" and "dest" variables)

As a final note we point out that filter() should often be the first verb you'll apply to your data. This narrows down the data to just the observations you are interested in.


Task What's another way, using the "not" operator !, we could filter only the rows that are not going to Burlington nor Seattle in the flights data frame?

not_BTV_SEA <- flights %>% 
  filter(!dest == "BTV" & !dest == "SEA")
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 x 12
   year month   day dep_time sched_d~1 origin dest  air_t~2 dista~3  hour minute
  <int> <int> <int>    <int>     <int> <chr>  <chr>   <dbl>   <dbl> <dbl>  <dbl>
1  2013     1     1      517       515 EWR    IAH       227    1400     5     15
2  2013     1     1      533       529 LGA    IAH       227    1416     5     29
3  2013     1     1      542       540 JFK    MIA       160    1089     5     40
4  2013     1     1      544       545 JFK    BQN       183    1576     5     45
5  2013     1     1      554       600 LGA    ATL       116     762     6      0
6  2013     1     1      554       558 EWR    ORD       150     719     5     58
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
#   1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names
# Alternative way
not_BTV_SEA <- flights %>% 
  filter(dest != "BTV" & dest != "SEA")
head(not_BTV_SEA[,-6:-12])
# A tibble: 6 x 12
   year month   day dep_time sched_d~1 origin dest  air_t~2 dista~3  hour minute
  <int> <int> <int>    <int>     <int> <chr>  <chr>   <dbl>   <dbl> <dbl>  <dbl>
1  2013     1     1      517       515 EWR    IAH       227    1400     5     15
2  2013     1     1      533       529 LGA    IAH       227    1416     5     29
3  2013     1     1      542       540 JFK    MIA       160    1089     5     40
4  2013     1     1      544       545 JFK    BQN       183    1576     5     45
5  2013     1     1      554       600 LGA    ATL       116     762     6      0
6  2013     1     1      554       558 EWR    ORD       150     719     5     58
# ... with 1 more variable: time_hour <dttm>, and abbreviated variable names
#   1: sched_dep_time, 2: air_time, 3: distance
# i Use `colnames()` to see all variable names