Boxplots

Another way to look at the distribution of a variable is using a boxplot. A boxplot makes use of the standard five-number summary, that is

  • the minimum value;
  • the first quartile or 25th percentile;
  • the median (or 2nd quartile / 50th percentile);
  • the third quartile or 75th percentile; and
  • the maximum value.

Keeping with the hourly temperature data, the five-number summary can be obtained in R using the following command:

summary(weather$temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.94   39.92   55.40   55.26   69.98  100.04       1

This provides us with the five-number summary, as well as the mean hourly temperature. There is one missing value in the hourly temperature data, which is represented in R by NA.


What does having similar median and mean values say about the distribution of a variable?


The boxplot of the hourly temperature data is plotted below.

Figure 12: Boxplot of hourly temperature recordings from NYC in 2013.

Figure 10: Figure 12: Boxplot of hourly temperature recordings from NYC in 2013.

The elements of the boxplot relating to the five-number summary have also been labelled. Other features of the boxplot are:

  • the interquartile range (IQR), that is, the height of the box (3rd - 1st quartiles), and is a measure of the spread of the middle 50% of values. Longer boxes are indicative of greater variability.
  • The dashed lines are known as whiskers and represent values less than the 25th percentile, and greater than the 75th percentile, respectively. The longer the whiskers, the more variability there is outside of the middle 50%.

Boxplots are useful visualisations when comparing the distribution of a numerical variable split across groups (or a categorical variable). For example, we could look at how the hourly temperature changes by month, where month is our categorical, or grouping, variable.

To create boxplots using ggplot we use the geom_boxplot function. If we want to look at boxplots of a variable separately for a categorical variable then we need to introduce the factor function. This converts a numerical variable into a categorical one, essentially creating labels or categories from the numeric values. For example, the month variable within the weather data set is a numerical variable taking on the values 1,\(\ldots\),12, for each month. However, it makes more sense to convert this into a categorical variable using the factor function, such that:

weather$month
##  [1] 1 1 1 1 1 1 1 1 1 1

becomes

factor(weather$month)
##  [1] 1 1 1 1 1 1 1 1 1 1
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12

with levels, or categories, 1,\(\ldots\),12 for each month. Hence, the boxplots can be produced using ggplot as follows:

ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Month", y = "Temperature (Hourly)",
       title = "Hourly temperatures from NYC in 2013 by month")  +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
Figure 13: Boxplots of hourly temperature recordings from NYC in 2013 by month.

Figure 11: Figure 13: Boxplots of hourly temperature recordings from NYC in 2013 by month.

Note, we have introduced a new function scale_x_discrete, which is used to rename the labels of the boxplots. This function is used as our categorical variables are discrete in nature.

After splitting the hourly temperatures by month, we now see points extending beyond the whiskers of the boxplots. These are known as outliers, and may be thought of as unusually small or large values. However, the definition of an outlier here is somewhat arbitrary as they are defined by the length of the whiskers, which are no more than 1.5 x IQR.


Which month has the highest median hourly temperature?

Remember that the horizontal line within the boxplot is the median value.


Which month has the largest amount of spread within the middle 50% of the hourly temperature?

Remember that the box of the boxplot contains the middle 50% of data, and as such the larger the box, the more spread or variability within the middle 50% of values.