Summarize variables using summarize
The next common task when working with data is summarizing it: taking a large number of values and reducing them to a single value. While this may seem like a very abstract idea, something as simple as the sum, the smallest value, or the largest value is a summary of a large number of values.
We can calculate the standard deviation and mean of the temperature variable temp in the weather data frame of nycflights13 in one step using the summarize (or equivalently, using the UK spelling, summarise) function in dplyr.
summary_temp <- weather %>%
  summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp
mean | std_dev |
---|---|
NA | NA |
We've created a small data frame here called summary_temp that includes both the mean and the std_dev of the temp variable in weather. Notice that the data frame weather went from many rows to a single row of just the summary values in the data frame summary_temp.
But why are the values returned NA? NA stands for "not available or not applicable" and is how R encodes missing values; if no value exists for a particular row and column of a data frame, NA is stored instead. Furthermore, by default, any time you try to summarize a set of values that contains one or more missing values (using mean() and sd(), for example), NA is returned.
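This default behavior is easy to see on a tiny example. The sketch below uses a made-up vector named temps (not taken from weather) that contains a single missing value:

```r
# A small made-up vector with one missing value (illustrative data only)
temps <- c(39.0, 41.5, NA, 44.1)

is.na(temps)  # flags which entries are missing
mean(temps)   # NA: a single missing value makes the whole summary missing
sd(temps)     # NA for the same reason
```

A single NA among the inputs is enough to make both summaries NA.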
Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult? Perhaps there was an erroneous value that someone entered that has been changed to read as missing? You'll often encounter issues with missing values.
You can summarize all non-missing values by setting the na.rm argument to TRUE (rm is short for "remove"). This removes any NA missing values and returns the summary value computed from the non-missing values only. So the code below computes the mean and standard deviation of all non-missing values. Notice that na.rm = TRUE is set as an argument to the mean() and sd() functions, and not to the summarize() function.
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
summary_temp
mean | std_dev |
---|---|
55.26039 | 17.78785 |
It is not good practice to include na.rm = TRUE in your summary commands by default; you should first run the code without this argument, as this will alert you to the presence of missing data. Only after you've identified where missing values occur and have thought about the potential causes of this missingness should you consider using na.rm = TRUE. In the upcoming Tasks we'll consider the possible ramifications of blindly sweeping rows with missing values under the rug.
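One possible way to identify where missing values occur is to count them before computing any other summary. The sketch below is only an illustration: toy_weather is a hypothetical stand-in for a data frame like weather, and its values are made up.

```r
library(dplyr)

# Hypothetical stand-in for a data frame like weather (values are made up)
toy_weather <- data.frame(
  hour = 1:5,
  temp = c(39.0, NA, 41.5, 44.1, NA)
)

# Count the missing values and their share before deciding how to handle them
missing_summary <- toy_weather %>%
  summarize(
    n_rows = n(),
    n_missing = sum(is.na(temp)),
    prop_missing = mean(is.na(temp))
  )
missing_summary

# Look at the rows where temp is missing to think about why they are missing
toy_weather %>%
  filter(is.na(temp))
```

Here is.na() turns each value into TRUE or FALSE, so sum() counts the missing values and mean() gives their proportion.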
What other summary functions can we use inside the summarize() verb? Any function in R that takes a vector of values and returns just one. Here are just a few:
- mean(): the mean or average
- sd(): the standard deviation, which is a measure of spread
- min() and max(): the minimum and maximum values, respectively
- IQR(): the interquartile range
- sum(): the sum
- n(): a count of the number of rows/observations in each group. This particular summary function will make more sense when group_by() is used in the next section.
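As a rough illustration, the functions listed above can all be used together in a single summarize() call. The data frame toy_weather and the column names on the left-hand side are made up for this sketch:

```r
library(dplyr)

# Hypothetical data frame with no missing values (values are made up)
toy_weather <- data.frame(temp = c(30, 40, 50, 60, 70))

# Each function takes the whole temp column and returns a single value
toy_summary <- toy_weather %>%
  summarize(
    mean_temp = mean(temp),
    sd_temp = sd(temp),
    min_temp = min(temp),
    max_temp = max(temp),
    iqr_temp = IQR(temp),
    sum_temp = sum(temp),
    count = n()
  )
toy_summary
```

The result is a one-row data frame with one column per summary.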
Task Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor's approach?
The missing patients may have died of lung cancer! So to ignore them might seriously bias your results!
It is very important to think about the consequences for your analysis of ignoring missing data! Ask yourself:
- Is there a systematic reason why certain values are missing?
- If yes, you might be biasing your results!
- If there isn't, then it might be OK to "sweep missing values under the rug."
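To make the difference concrete, here is a small sketch with made-up numbers (they do not come from any real study). In one vector the missing value is not tied to being high or low; in the other, only the largest values are missing:

```r
# Made-up values for illustration only
true_values <- c(10, 20, 30, 40, 50)

# Missingness not tied to the size of the value: the middle value is lost
unrelated_missing <- c(10, 20, NA, 40, 50)

# Systematic missingness: the largest values are the ones that are lost
systematic_missing <- c(10, 20, 30, NA, NA)

mean(true_values)
mean(unrelated_missing, na.rm = TRUE)
mean(systematic_missing, na.rm = TRUE)
```

In this toy case the first two means agree, while the mean computed after dropping the systematically missing values is lower than the mean of the complete data.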
Task
Modify the code above that creates summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to?
weather %>%
  summarize(count = n())
# It corresponds to a count of the number of observations/rows.
# A tibble: 1 x 1
count
<int>
1 26115
#We can check this using the dim() function which returns the dimensions (rows and columns)
dim(weather)
[1] 26115 15
Task Why doesn't the following code work?
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE)) %>%
  summarize(std_dev = sd(temp, na.rm = TRUE))
Run the code line by line instead of all at once, and then look at the data. In other words, run weather %>% summarize(mean = mean(temp, na.rm = TRUE)) first and see what it produces.
Consider the output of only running the first two lines:
weather %>%
  summarize(mean = mean(temp, na.rm = TRUE))
mean |
---|
55.26039 |
The code fails because after the first summarize(), the variable temp disappears: it has been collapsed to the single value mean. So when we try to run the second summarize(), it can't find the variable temp to compute the standard deviation of.
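One way to repair the pipeline is to compute both summaries inside a single summarize() call, as was done earlier in this section. The sketch below shows this on toy_weather, a hypothetical data frame with made-up values standing in for weather:

```r
library(dplyr)

# Hypothetical stand-in for weather (values are made up)
toy_weather <- data.frame(temp = c(39.0, NA, 41.5, 44.1))

# After one summarize(), only the summary column is left; temp is gone
after_first <- toy_weather %>%
  summarize(mean = mean(temp, na.rm = TRUE))
names(after_first)

# Computing both summaries in one summarize() keeps temp available to both
summary_temp <- toy_weather %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp
```

Checking names() after each step is a quick way to see which columns are still available to the next verb in the pipeline.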