Observational units

Recall the nycflights13 package with data about all domestic flights departing from New York City in 2013 that we used in Week 1 to create visualizations. In particular, let's revisit the flights data frame:

library(nycflights13) #Provides the flights data frame
library(dplyr) #Provides glimpse()
dim(flights)  #Returns the dimensions of a dataframe
[1] 336776     19
head(flights) #Returns the first 6 rows of the object
# A tibble: 6 x 19
   year month   day dep_time sched_dep~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     1     1      517         515       2     830     819      11 UA     
2  2013     1     1      533         529       4     850     830      20 UA     
3  2013     1     1      542         540       2     923     850      33 AA     
4  2013     1     1      544         545      -1    1004    1022     -18 B6     
5  2013     1     1      554         600      -6     812     837     -25 DL     
6  2013     1     1      554         558      -4     740     728      12 UA     
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names 1: sched_dep_time,
#   2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
# i Use `colnames()` to see all variable names
glimpse(flights) #Lists the variables in an object with their first few values 
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~

We see that flights has a rectangular shape with each row corresponding to a different flight and each column corresponding to a characteristic of that flight. This matches exactly with the first two properties of tidy data, namely:

  1. Each variable forms a column.
  2. Each observation forms a row.

But what about the third property?

  3. Each type of observational unit forms a table.

The observational unit in the flights data set is an individual flight, and we can see above that this data set describes 336,776 flights using 19 variables. In other words, the rows of this data set don't refer to measurements on an airline or an airport; they refer to characteristics of a given flight from New York City in 2013. This illustrates the third property of tidy data: all observations on one type of observational unit (here, flights) are stored together in a single data set.
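As a small sketch of this idea, the number of observational units and the number of variables can be read directly from the row and column counts. The object names n_flights and n_variables are illustrative only.

```r
library(nycflights13)

n_flights   <- nrow(flights)  # one row per flight (the observational unit)
n_variables <- ncol(flights)  # one column per variable measured on a flight

cat(n_flights, "flights described by", n_variables, "variables\n")
```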

This is not to say that an analysis involves only one observational unit of interest. For example, the nycflights13 package also includes data sets with different observational units*:

  • airlines
  • planes
  • weather
  • airports

The organization of these data follows the third "tidy" data property: observations corresponding to the same observational unit are saved in the same data frame.
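To compare the data frames side by side, one option is to collect their dimensions in a single table. This is only a sketch; the names tables and shapes are illustrative and not part of the package.

```r
library(nycflights13)

# Gather the five data frames in a named list
tables <- list(
  flights  = flights,
  airlines = airlines,
  planes   = planes,
  weather  = weather,
  airports = airports
)

# One row per data frame: number of rows (observational units)
# and number of columns (variables)
shapes <- t(sapply(tables, dim))
colnames(shapes) <- c("rows", "columns")
shapes
```

Each row of shapes corresponds to one data frame, so a different row count signals a different set of observational units.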


Task: For each of the data sets listed above (other than flights), identify the observational unit and how many of these are described in each data set.

Use the names() and dim() functions.

names(airlines) #Obs Unit: individual airlines
[1] "carrier" "name"   
dim(airlines) #16 airlines are described 
[1] 16  2
names(planes) # Obs Unit: individual planes (one row per tail number)
[1] "tailnum"      "year"         "type"         "manufacturer" "model"       
[6] "engines"      "seats"        "speed"        "engine"      
dim(planes) #3322 individual planes are described 
[1] 3322    9
names(weather) #Obs Unit: weather conditions at different airports at different times
 [1] "origin"     "year"       "month"      "day"        "hour"      
 [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
[11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour" 
dim(weather) #26115 weather conditions are described
[1] 26115    15
names(airports) # Obs Unit: individual airports
[1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"
dim(airports) # 1458 airports are described
[1] 1458    8
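One way to check a proposed observational unit is to confirm that the column identifying it never repeats, so that each unit appears in exactly one row. The sketch below assumes that carrier, tailnum and faa are the identifying columns, which is a guess based on the column names printed above; the helper is_one_row_per_unit() is hypothetical and not part of any package.

```r
library(nycflights13)

# Hypothetical helper: TRUE when no value of the key column is repeated,
# i.e. the data frame has exactly one row per unit identified by that key
is_one_row_per_unit <- function(key) anyDuplicated(key) == 0

airlines_ok <- is_one_row_per_unit(airlines$carrier)  # one row per airline?
planes_ok   <- is_one_row_per_unit(planes$tailnum)    # one row per tail number?
airports_ok <- is_one_row_per_unit(airports$faa)      # one row per airport code?

c(airlines = airlines_ok, planes = planes_ok, airports = airports_ok)
```

If a check returns FALSE, the chosen column does not identify the observational unit on its own, and a combination of columns may be needed instead.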


How many different types of planes are represented in the nycflights13 package?

* You can get basic information on an R package using help(package = "packagename"); for this package, use help(package = "nycflights13").