Viewing The Data

Before visualising a data set, we first need to know what it contains. For example, the contents of the flights data set from the nycflights13 package can be inspected using the following command:

head(flights, n = 3)

This prints the first n = 3 rows of the flights data set to the R console, along with the names of its variables. From this output we can see that the data set contains 19 variables and what they are called. A quick check on the size of a data set can be obtained using:

dim(flights)

## [1] 336776     19

which displays the dimensions of the data set: here we have 336776 rows and 19 columns.
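If only one of the two dimensions is needed, base R's nrow and ncol functions return them separately, and names lists the variable names. The following is a minimal sketch, assuming the nycflights13 package is installed; the object names n_rows and n_cols are chosen here purely for illustration.

```r
# Assumes the nycflights13 package is installed.
library(nycflights13)

n_rows <- nrow(flights)   # number of rows (one per flight)
n_cols <- ncol(flights)   # number of columns (variables)
n_rows
n_cols

# Variable names, in column order
names(flights)
```

The two values should agree with the output of dim(flights) shown above.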

To reduce the amount of data we will be working with and make things a little easier, let's only look at Alaska Airlines flights leaving New York City in 2013. This can be done by subsetting the data so that we keep only the flights whose carrier code is AS (Alaska Airlines), as follows:

Alaska <- flights[flights$carrier == "AS", ]

This picks out all of the rows of the flights data set for which the carrier code is AS and discards the rest, creating a new data set called Alaska. Leaving the position after the comma empty keeps every column.
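To see how this kind of subsetting works step by step, here is a minimal sketch on a tiny hand-made data frame. The object toy and its values are made up for illustration and are not taken from flights.

```r
# Toy stand-in for flights: the values below are made up for illustration.
toy <- data.frame(
  carrier   = c("AS", "UA", "AS", "DL"),
  dep_delay = c(-1, 4, -7, 12)
)

# The comparison returns one TRUE/FALSE value per row.
is_alaska <- toy$carrier == "AS"
is_alaska

# Rows where the condition is TRUE are kept; the empty slot after the
# comma means "keep all columns".
toy_alaska <- toy[is_alaska, ]
toy_alaska
dim(toy_alaska)
```

One caveat worth knowing: if the column being compared contained missing values (NA), this style of indexing would return rows filled with NA for them. Wrapping the condition in which(), as in toy[which(is_alaska), ], is one way to avoid that.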

Task: Write code to observe the first 5 rows of the Alaska data.

Hint: You may want to use the head function.

head(Alaska, n = 5)

## # A tibble: 5 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      724            725        -1     1020           1030
## 2  2013     1     1     1808           1815        -7     2111           2130
## 3  2013     1     2      722            725        -3      949           1030
## 4  2013     1     2     1818           1815         3     2131           2130
## 5  2013     1     3      724            725        -1     1012           1030
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

What are the dimensions of the Alaska data set?

a) 714 19
b) 500 19
c) 19 500
d) 19 714

Hint: Check the dimensions using the dim function.

Next week we will look at more sophisticated ways of manipulating data sets. Now, let us go on to look at different visualisations of our Alaska data set using ggplot2, starting with scatterplots.