Group rows using group_by
It's often of interest to summarize a variable based on the groupings of another variable. Let's say, for example, we are interested in the mean and standard deviation of temperatures in each month. We can produce this by running the following code:
<- weather %>%
summary_monthly_temp group_by(month) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
month | mean | std_dev |
---|---|---|
1 | 35.63566 | 10.224635 |
2 | 34.27060 | 6.982378 |
3 | 39.88007 | 6.249278 |
4 | 51.74564 | 8.786168 |
5 | 61.79500 | 9.681644 |
6 | 72.18400 | 7.546371 |
7 | 80.06622 | 7.119898 |
8 | 74.46847 | 5.191615 |
9 | 67.37129 | 8.465902 |
10 | 60.07113 | 8.846035 |
11 | 44.99043 | 10.443805 |
12 | 38.44180 | 9.982432 |
This code is identical to the previous code that created summary_temp
, with an extra group_by(month)
added. Grouping the weather
data set by month
and then passing this new data frame into summarize
yields a data frame that shows the mean and standard deviation of temperature for each month in New York City. Note: Since each row in summary_monthly_temp
represents a summary of different rows in weather
, the observational units have changed.
It is important to note that group_by
doesn't change the data frame. It sets meta-data (data about the data), specifically the group structure of the data. It is only after we apply the summarize
function that the data frame changes.
If we would like to remove this group structure meta-data, we can pipe the resulting data frame into the ungroup()
function. For example, say the group structure meta-data is set to be by month via group_by(month)
, all future summaries will be reported on a month-by-month basis. If however, we would like to no longer have this and have all summaries be for all data in a single group (in this case over the entire year of 2013), then pipe the data frame in question through and ungroup()
to remove this.
<- weather %>%
summary_monthly_temp group_by(month) %>%
ungroup() %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
mean | std_dev |
---|---|
55.26039 | 17.78785 |
We now revisit the n()
counting summary function we introduced in the previous section. For example, suppose we'd like to get a sense for how many flights departed each of the three airports in New York City:
<- flights %>%
by_origin group_by(origin) %>%
summarize(count = n())
by_origin
origin | count |
---|---|
EWR | 120835 |
JFK | 111279 |
LGA | 104662 |
We see that Newark ("EWR"
) had the most flights departing in 2013 followed by "JFK"
and lastly by LaGuardia ("LGA"
). Note there is a subtle but important difference between sum()
and n()
. While sum()
simply adds up a large set of numbers, the latter counts the number of times each of many different values occur.
Grouping by more than one variable
You are not limited to grouping by one variable! Say you wanted to know the number of flights leaving each of the three New York City airports for each month, we can also group by a second variable month
: group_by(origin, month)
.
<- flights %>%
by_origin_monthly group_by(origin, month) %>%
summarize(count = n())
by_origin_monthly
# A tibble: 36 x 3
# Groups: origin [3]
origin month count
<chr> <int> <int>
1 EWR 1 9893
2 EWR 2 9107
3 EWR 3 10420
4 EWR 4 10531
5 EWR 5 10592
6 EWR 6 10175
7 EWR 7 10475
8 EWR 8 10359
9 EWR 9 9550
10 EWR 10 10104
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows
We see there are 36 rows to by_origin_monthly
because there are 12 months times 3 airports (EWR
, JFK
, and LGA
). Let's now pose two questions. First, what if we reverse the order of the grouping i.e. we group_by(month, origin)
?
<- flights %>%
by_monthly_origin group_by(month, origin) %>%
summarize(count = n())
by_monthly_origin
# A tibble: 36 x 3
# Groups: month [12]
month origin count
<int> <chr> <int>
1 1 EWR 9893
2 1 JFK 9161
3 1 LGA 7950
4 2 EWR 9107
5 2 JFK 8421
6 2 LGA 7423
7 3 EWR 10420
8 3 JFK 9697
9 3 LGA 8717
10 4 EWR 10531
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows
In by_monthly_origin
the month
column is now first and the rows are sorted by month
instead of origin. If you compare the values of count
in by_origin_monthly
and by_monthly_origin
using the View()
function, you'll see that the values are actually the same, just presented in a different order.
Second, why do we group_by(origin, month)
and not group_by(origin)
and then group_by(month)
? Let's investigate:
<- flights %>%
by_origin_monthly_incorrect group_by(origin) %>%
group_by(month) %>%
summarize(count = n())
by_origin_monthly_incorrect
# A tibble: 12 x 2
month count
<int> <int>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
What happened here is that the second group_by(month)
overrode the first group_by(origin)
, so that in the end we are only grouping by month
. The lesson here, is if you want to group_by()
two or more variables, you should include all these variables in a single group_by()
function call.
Task
Recall from Week 1 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp
data frame tell us about temperatures in New York City throughout the year?
The standard deviation is a quantification of spread or variability.
We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days.
Task Write code to produce the mean and standard deviation temperature for each day in 2013 for NYC?
Make sure to group_by
the appropriate variables first, before summarize
ing the mean and standard deviation.
<- weather %>%
summary_temp_by_day group_by(year, month, day) %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
summary_temp_by_day
# A tibble: 364 x 5
# Groups: year, month [12]
year month day mean std_dev
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 37.0 4.00
2 2013 1 2 28.7 3.45
3 2013 1 3 30.0 2.58
4 2013 1 4 34.9 2.45
5 2013 1 5 37.2 4.01
6 2013 1 6 40.1 4.40
7 2013 1 7 40.6 3.68
8 2013 1 8 40.1 5.77
9 2013 1 9 43.2 5.40
10 2013 1 10 43.8 2.95
# ... with 354 more rows
# i Use `print(n = ...)` to see more rows
Task
Recreate by_monthly_origin
, but instead of grouping via group_by(origin, month)
, group variables in a different order group_by(month, origin)
. What differs in the resulting data set?
<- flights %>%
by_monthly_origin group_by(month, origin) %>%
summarize(count = n())
by_monthly_origin
# A tibble: 36 x 3
# Groups: month [12]
month origin count
<int> <chr> <int>
1 1 EWR 9893
2 1 JFK 9161
3 1 LGA 7950
4 2 EWR 9107
5 2 JFK 8421
6 2 LGA 7423
7 3 EWR 10420
8 3 JFK 9697
9 3 LGA 8717
10 4 EWR 10531
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows
The difference is they are organized/sorted by month
first, then origin
Task
How could we identify how many flights left each of the three airports for each carrier
?
Summarize the count from each airport using the n()
function, which counts rows.
<- flights %>%
count_flights_by_airport group_by(origin, carrier) %>%
summarize(count=n())
count_flights_by_airport
# A tibble: 35 x 3
# Groups: origin [3]
origin carrier count
<chr> <chr> <int>
1 EWR 9E 1268
2 EWR AA 3487
3 EWR AS 714
4 EWR B6 6557
5 EWR DL 4342
6 EWR EV 43939
7 EWR MQ 2276
8 EWR OO 6
9 EWR UA 46087
10 EWR US 4405
# ... with 25 more rows
# i Use `print(n = ...)` to see more rows
Note: the n()
function counts rows, whereas the sum(VARIABLE_NAME)
function sums all values of a certain numerical variable VARIABLE_NAME
.
Task
How does the filter
operation differ from a group_by
followed by a summarize
?
filter()
picks out rows from the original data set without modifying them, whereas group_by %>% summarize
computes summaries of numerical variables, and hence reports new values.