Group rows using group_by

It's often of interest to summarize a variable based on the groupings of another variable. Let's say, for example, we are interested in the mean and standard deviation of temperatures in each month. We can produce this by running the following code:
summary_monthly_temp <- weather %>%
group_by(month) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp| month | mean | std_dev |
|---|---|---|
| 1 | 35.63566 | 10.224635 |
| 2 | 34.27060 | 6.982378 |
| 3 | 39.88007 | 6.249278 |
| 4 | 51.74564 | 8.786168 |
| 5 | 61.79500 | 9.681644 |
| 6 | 72.18400 | 7.546371 |
| 7 | 80.06622 | 7.119898 |
| 8 | 74.46847 | 5.191615 |
| 9 | 67.37129 | 8.465902 |
| 10 | 60.07113 | 8.846035 |
| 11 | 44.99043 | 10.443805 |
| 12 | 38.44180 | 9.982432 |
This code is identical to the previous code that created summary_temp, with an extra group_by(month) added. Grouping the weather data set by month and then passing this new data frame into summarize yields a data frame that shows the mean and standard deviation of temperature for each month in New York City. Note: Since each row in summary_monthly_temp represents a summary of different rows in weather, the observational units have changed.
It is important to note that group_by doesn't change the data frame. It sets meta-data (data about the data), specifically the group structure of the data. It is only after we apply the summarize function that the data frame changes.
If we would like to remove this group structure meta-data, we can pipe the resulting data frame into the ungroup() function. For example, say the group structure meta-data is set to be by month via group_by(month), all future summaries will be reported on a month-by-month basis. If however, we would like to no longer have this and have all summaries be for all data in a single group (in this case over the entire year of 2013), then pipe the data frame in question through and ungroup() to remove this.
summary_monthly_temp <- weather %>%
group_by(month) %>%
ungroup() %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp| mean | std_dev |
|---|---|
| 55.26039 | 17.78785 |
We now revisit the n() counting summary function we introduced in the previous section. For example, suppose we'd like to get a sense for how many flights departed each of the three airports in New York City:
by_origin <- flights %>%
group_by(origin) %>%
summarize(count = n())
by_origin| origin | count |
|---|---|
| EWR | 120835 |
| JFK | 111279 |
| LGA | 104662 |
We see that Newark ("EWR") had the most flights departing in 2013 followed by "JFK" and lastly by LaGuardia ("LGA"). Note there is a subtle but important difference between sum() and n(). While sum() simply adds up a large set of numbers, the latter counts the number of times each of many different values occur.
Grouping by more than one variable
You are not limited to grouping by one variable! Say you wanted to know the number of flights leaving each of the three New York City airports for each month, we can also group by a second variable month: group_by(origin, month).
by_origin_monthly <- flights %>%
group_by(origin, month) %>%
summarize(count = n())
by_origin_monthly# A tibble: 36 x 3
# Groups: origin [3]
origin month count
<chr> <int> <int>
1 EWR 1 9893
2 EWR 2 9107
3 EWR 3 10420
4 EWR 4 10531
5 EWR 5 10592
6 EWR 6 10175
7 EWR 7 10475
8 EWR 8 10359
9 EWR 9 9550
10 EWR 10 10104
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows
We see there are 36 rows to by_origin_monthly because there are 12 months times 3 airports (EWR, JFK, and LGA). Let's now pose two questions. First, what if we reverse the order of the grouping i.e. we group_by(month, origin)?
by_monthly_origin <- flights %>%
group_by(month, origin) %>%
summarize(count = n())
by_monthly_origin# A tibble: 36 x 3
# Groups: month [12]
month origin count
<int> <chr> <int>
1 1 EWR 9893
2 1 JFK 9161
3 1 LGA 7950
4 2 EWR 9107
5 2 JFK 8421
6 2 LGA 7423
7 3 EWR 10420
8 3 JFK 9697
9 3 LGA 8717
10 4 EWR 10531
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows
In by_monthly_origin the month column is now first and the rows are sorted by month instead of origin. If you compare the values of count in by_origin_monthly and by_monthly_origin using the View() function, you'll see that the values are actually the same, just presented in a different order.
Second, why do we group_by(origin, month) and not group_by(origin) and then group_by(month)? Let's investigate:
by_origin_monthly_incorrect <- flights %>%
group_by(origin) %>%
group_by(month) %>%
summarize(count = n())
by_origin_monthly_incorrect# A tibble: 12 x 2
month count
<int> <int>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
What happened here is that the second group_by(month) overrode the first group_by(origin), so that in the end we are only grouping by month. The lesson here, is if you want to group_by() two or more variables, you should include all these variables in a single group_by() function call.
Task
Recall from Week 1 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year?
The standard deviation is a quantification of spread or variability.
We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days.
Task Write code to produce the mean and standard deviation temperature for each day in 2013 for NYC?
Make sure to group_by the appropriate variables first, before summarizeing the mean and standard deviation.
summary_temp_by_day <- weather %>%
group_by(year, month, day) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_temp_by_day# A tibble: 364 x 5
# Groups: year, month [12]
year month day mean std_dev
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 37.0 4.00
2 2013 1 2 28.7 3.45
3 2013 1 3 30.0 2.58
4 2013 1 4 34.9 2.45
5 2013 1 5 37.2 4.01
6 2013 1 6 40.1 4.40
7 2013 1 7 40.6 3.68
8 2013 1 8 40.1 5.77
9 2013 1 9 43.2 5.40
10 2013 1 10 43.8 2.95
# ... with 354 more rows
# i Use `print(n = ...)` to see more rows
Task
Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting data set?
by_monthly_origin <- flights %>%
group_by(month, origin) %>%
summarize(count = n())
by_monthly_origin# A tibble: 36 x 3
# Groups: month [12]
month origin count
<int> <chr> <int>
1 1 EWR 9893
2 1 JFK 9161
3 1 LGA 7950
4 2 EWR 9107
5 2 JFK 8421
6 2 LGA 7423
7 3 EWR 10420
8 3 JFK 9697
9 3 LGA 8717
10 4 EWR 10531
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows
The difference is they are organized/sorted by month first, then origin
Task
How could we identify how many flights left each of the three airports for each carrier?
Summarize the count from each airport using the n() function, which counts rows.
count_flights_by_airport <- flights %>%
group_by(origin, carrier) %>%
summarize(count=n())
count_flights_by_airport# A tibble: 35 x 3
# Groups: origin [3]
origin carrier count
<chr> <chr> <int>
1 EWR 9E 1268
2 EWR AA 3487
3 EWR AS 714
4 EWR B6 6557
5 EWR DL 4342
6 EWR EV 43939
7 EWR MQ 2276
8 EWR OO 6
9 EWR UA 46087
10 EWR US 4405
# ... with 25 more rows
# i Use `print(n = ...)` to see more rows
Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain numerical variable VARIABLE_NAME.
Task
How does the filter operation differ from a group_by followed by a summarize?
filter() picks out rows from the original data set without modifying them, whereas group_by %>% summarize computes summaries of numerical variables, and hence reports new values.