Group rows using group_by

It's often of interest to summarize a variable based on the groupings of another variable. Let's say, for example, we are interested in the mean and standard deviation of temperatures in each month. We can produce this by running the following code:

summary_monthly_temp <- weather %>% 
  group_by(month) %>% 
  summarize(mean = mean(temp, na.rm = TRUE), 
            std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
month mean std_dev
1 35.63566 10.224635
2 34.27060 6.982378
3 39.88007 6.249278
4 51.74564 8.786168
5 61.79500 9.681644
6 72.18400 7.546371
7 80.06622 7.119898
8 74.46847 5.191615
9 67.37129 8.465902
10 60.07113 8.846035
11 44.99043 10.443805
12 38.44180 9.982432

This code is identical to the previous code that created summary_temp, with an extra group_by(month) added. Grouping the weather data set by month and then passing this new data frame into summarize yields a data frame that shows the mean and standard deviation of temperature for each month in New York City. Note: Since each row in summary_monthly_temp represents a summary of different rows in weather, the observational units have changed.

It is important to note that group_by doesn't change the data frame. It sets meta-data (data about the data), specifically the group structure of the data. It is only after we apply the summarize function that the data frame changes.

If we would like to remove this group structure meta-data, we can pipe the resulting data frame into the ungroup() function. For example, say the group structure meta-data is set to be by month via group_by(month), all future summaries will be reported on a month-by-month basis. If however, we would like to no longer have this and have all summaries be for all data in a single group (in this case over the entire year of 2013), then pipe the data frame in question through and ungroup() to remove this.

summary_monthly_temp <- weather %>% 
  group_by(month) %>% 
  ungroup() %>%
  summarize(mean = mean(temp, na.rm = TRUE), 
            std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
mean std_dev
55.26039 17.78785

We now revisit the n() counting summary function we introduced in the previous section. For example, suppose we'd like to get a sense for how many flights departed each of the three airports in New York City:

by_origin <- flights %>% 
  group_by(origin) %>% 
  summarize(count = n())
by_origin
origin count
EWR 120835
JFK 111279
LGA 104662

We see that Newark ("EWR") had the most flights departing in 2013 followed by "JFK" and lastly by LaGuardia ("LGA"). Note there is a subtle but important difference between sum() and n(). While sum() simply adds up a large set of numbers, the latter counts the number of times each of many different values occur.

Grouping by more than one variable

You are not limited to grouping by one variable! Say you wanted to know the number of flights leaving each of the three New York City airports for each month, we can also group by a second variable month: group_by(origin, month).

by_origin_monthly <- flights %>% 
  group_by(origin, month) %>% 
  summarize(count = n())
by_origin_monthly
# A tibble: 36 x 3
# Groups:   origin [3]
   origin month count
   <chr>  <int> <int>
 1 EWR        1  9893
 2 EWR        2  9107
 3 EWR        3 10420
 4 EWR        4 10531
 5 EWR        5 10592
 6 EWR        6 10175
 7 EWR        7 10475
 8 EWR        8 10359
 9 EWR        9  9550
10 EWR       10 10104
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows

We see there are 36 rows to by_origin_monthly because there are 12 months times 3 airports (EWR, JFK, and LGA). Let's now pose two questions. First, what if we reverse the order of the grouping i.e. we group_by(month, origin)?

by_monthly_origin <- flights %>% 
  group_by(month, origin) %>% 
  summarize(count = n())
by_monthly_origin
# A tibble: 36 x 3
# Groups:   month [12]
   month origin count
   <int> <chr>  <int>
 1     1 EWR     9893
 2     1 JFK     9161
 3     1 LGA     7950
 4     2 EWR     9107
 5     2 JFK     8421
 6     2 LGA     7423
 7     3 EWR    10420
 8     3 JFK     9697
 9     3 LGA     8717
10     4 EWR    10531
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows

In by_monthly_origin the month column is now first and the rows are sorted by month instead of origin. If you compare the values of count in by_origin_monthly and by_monthly_origin using the View() function, you'll see that the values are actually the same, just presented in a different order.

Second, why do we group_by(origin, month) and not group_by(origin) and then group_by(month)? Let's investigate:

by_origin_monthly_incorrect <- flights %>% 
  group_by(origin) %>% 
  group_by(month) %>% 
  summarize(count = n())
by_origin_monthly_incorrect
# A tibble: 12 x 2
   month count
   <int> <int>
 1     1 27004
 2     2 24951
 3     3 28834
 4     4 28330
 5     5 28796
 6     6 28243
 7     7 29425
 8     8 29327
 9     9 27574
10    10 28889
11    11 27268
12    12 28135

What happened here is that the second group_by(month) overrode the first group_by(origin), so that in the end we are only grouping by month. The lesson here, is if you want to group_by() two or more variables, you should include all these variables in a single group_by() function call.


Task Recall from Week 1 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year?

The standard deviation is a quantification of spread or variability.

We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days.


Task Write code to produce the mean and standard deviation temperature for each day in 2013 for NYC?

Make sure to group_by the appropriate variables first, before summarizeing the mean and standard deviation.

summary_temp_by_day <- weather %>% 
  group_by(year, month, day) %>% 
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp_by_day
# A tibble: 364 x 5
# Groups:   year, month [12]
    year month   day  mean std_dev
   <int> <int> <int> <dbl>   <dbl>
 1  2013     1     1  37.0    4.00
 2  2013     1     2  28.7    3.45
 3  2013     1     3  30.0    2.58
 4  2013     1     4  34.9    2.45
 5  2013     1     5  37.2    4.01
 6  2013     1     6  40.1    4.40
 7  2013     1     7  40.6    3.68
 8  2013     1     8  40.1    5.77
 9  2013     1     9  43.2    5.40
10  2013     1    10  43.8    2.95
# ... with 354 more rows
# i Use `print(n = ...)` to see more rows


Task Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting data set?

by_monthly_origin <- flights %>% 
  group_by(month, origin) %>% 
  summarize(count = n())
by_monthly_origin
# A tibble: 36 x 3
# Groups:   month [12]
   month origin count
   <int> <chr>  <int>
 1     1 EWR     9893
 2     1 JFK     9161
 3     1 LGA     7950
 4     2 EWR     9107
 5     2 JFK     8421
 6     2 LGA     7423
 7     3 EWR    10420
 8     3 JFK     9697
 9     3 LGA     8717
10     4 EWR    10531
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows

The difference is they are organized/sorted by month first, then origin


Task How could we identify how many flights left each of the three airports for each carrier?

Summarize the count from each airport using the n() function, which counts rows.

count_flights_by_airport <- flights %>% 
  group_by(origin, carrier) %>% 
  summarize(count=n())
count_flights_by_airport
# A tibble: 35 x 3
# Groups:   origin [3]
   origin carrier count
   <chr>  <chr>   <int>
 1 EWR    9E       1268
 2 EWR    AA       3487
 3 EWR    AS        714
 4 EWR    B6       6557
 5 EWR    DL       4342
 6 EWR    EV      43939
 7 EWR    MQ       2276
 8 EWR    OO          6
 9 EWR    UA      46087
10 EWR    US       4405
# ... with 25 more rows
# i Use `print(n = ...)` to see more rows

Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain numerical variable VARIABLE_NAME.


Task How does the filter operation differ from a group_by followed by a summarize?

filter() picks out rows from the original data set without modifying them, whereas group_by %>% summarize computes summaries of numerical variables, and hence reports new values.