Observational units

Recall the nycflights13 package with data about all domestic flights departing from New York City in 2013 that we used in Week 1 to create visualizations. In particular, let's revisit the flights data frame:

dim(flights)  #Returns the dimensions of a dataframe

[1] 336776     19

head(flights) #Returns the first 6 rows of the object

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

glimpse(flights) #Lists the variables in an object with their first few values

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

We see that flights has a rectangular shape with each row corresponding to a different flight and each column corresponding to a characteristic of that flight. This matches exactly with the first two properties of tidy data, namely:

Each variable forms a column.
Each observation forms a row.

But what about the third property?

Each type of observational unit forms a table.

The observational unit in the flights data set is an individual flight and we can see above that this data set consists of 336,776 flights with 19 variables. In other words, rows of this data set don't refer to a measurement on an airline or on an airport; they refer to characteristics/measurements on a given flight from New York City in 2013. This illustrates the 3rd property of tidy data, i.e. each observational unit is fully described by a single data set.

Not that there is only one observational unit of interest in any analysis. For example, also included in the nycflights13 package are data sets with different observational units*:

airlines
planes
weather
airports

The organization of this data follows the third "tidy" data property: observations corresponding to the same observational unit are saved in the same data frame.

Task For each of the data sets listed above (other than flights), identify the observational unit and how many of these are described in each of the data sets.

Use names() and dim() functions.

names(airlines) #Obs Unit: individual airlines

[1] "carrier" "name"

dim(airlines) #16 airlines are described

[1] 16  2

names(planes) # Obs Unit: different makes/models of planes

[1] "tailnum"      "year"         "type"         "manufacturer" "model"       
[6] "engines"      "seats"        "speed"        "engine"

dim(planes) #3322 different makes/models of planes are described

[1] 3322    9

names(weather) #Obs Unit: weather conditions at different airports at different times

 [1] "origin"     "year"       "month"      "day"        "hour"      
 [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
[11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"

dim(weather) #26115 weather conditions are described

[1] 26115    15

names(airports) # Obs Unit: individual airports

[1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"

dim(airports) # 1458 airports are described

[1] 1458    8

How many different types of planes are represented in the nycflights13 package?

3322 16 26115 1458

* You can get basic information on R packages using help(package = "packagename"), which can be applied to this library using help(package = "nycflights13").