Observational units
Recall the nycflights13 package, which contains data about all domestic flights departing from New York City in 2013 and which we used in Week 1 to create visualizations. In particular, let's revisit the flights data frame:
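library(nycflights13) # Provides the flights data frame
library(dplyr) # Provides glimpse()
dim(flights) # Returns the dimensions of a data frame (rows, then columns)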
[1] 336776 19
head(flights) #Returns the first 6 rows of the object
# A tibble: 6 x 19
year month day dep_time sched_dep~1 dep_d~2 arr_t~3 sched~4 arr_d~5 carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names 1: sched_dep_time,
# 2: dep_delay, 3: arr_time, 4: sched_arr_time, 5: arr_delay
# i Use `colnames()` to see all variable names
glimpse(flights) #Lists the variables in an object with their first few values
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
We see that flights has a rectangular shape, with each row corresponding to a different flight and each column corresponding to a characteristic of that flight. This matches the first two properties of tidy data exactly, namely:
- Each variable forms a column.
- Each observation forms a row.
But what about the third property?
- Each type of observational unit forms a table.
The observational unit in the flights data set is an individual flight, and we can see above that this data set consists of 336,776 flights with 19 variables. In other words, rows of this data set don't refer to a measurement on an airline or on an airport; they refer to characteristics/measurements of a given flight departing from New York City in 2013. This illustrates the third property of tidy data: each type of observational unit is fully described by a single data set.
This is not to say that an analysis has only one observational unit of interest. For example, the nycflights13 package also includes data sets with different observational units*:
- airlines
- planes
- weather
- airports
The organization of this data follows the third "tidy" data property: observations corresponding to the same observational unit are saved in the same data frame.
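To see which data sets a package ships with, one option is base R's data() function. The sketch below assumes that nycflights13 is installed; the variable name pkg_data is only illustrative.

```r
# Assumes the nycflights13 package is installed; the guard skips the
# listing otherwise instead of raising an error.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  pkg_data <- data(package = "nycflights13")
  # pkg_data$results is a character matrix; its "Item" column holds
  # the data set names and "Title" a short description of each.
  print(pkg_data$results[, c("Item", "Title")])
}
```

Each listed data set can then be inspected with names() and dim() to work out what a single row represents.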
Task
For each of the data sets listed above (other than flights), identify the observational unit and how many of these units are described in each data set. Use the names() and dim() functions.
names(airlines) #Obs Unit: individual airlines
[1] "carrier" "name"
dim(airlines) #16 airlines are described
[1] 16 2
names(planes) # Obs Unit: individual planes, each identified by its tail number
[1] "tailnum" "year" "type" "manufacturer" "model"
[6] "engines" "seats" "speed" "engine"
dim(planes) # 3322 individual planes are described
[1] 3322 9
names(weather) # Obs Unit: hourly weather conditions at an origin airport
[1] "origin" "year" "month" "day" "hour"
[6] "temp" "dewp" "humid" "wind_dir" "wind_speed"
[11] "wind_gust" "precip" "pressure" "visib" "time_hour"
dim(weather) # 26115 hourly weather records are described
[1] 26115 15
names(airports) # Obs Unit: individual airports
[1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
dim(airports) # 1458 airports are described
[1] 1458 8
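One way to sanity-check a guess about the observational unit is to test whether a candidate identifier column takes a different value in every row. The sketch below uses small made-up data frames so that it runs without any packages; identifies_rows(), toy_airlines and toy_flights are hypothetical names introduced here, not part of nycflights13.

```r
# Hypothetical helper (not from nycflights13 or the course materials):
# returns TRUE when the columns named in `cols` take a distinct
# combination of values in every row of `df`.
identifies_rows <- function(df, cols) {
  # anyDuplicated() returns 0 when no row repeats an earlier one
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

# Toy table with one row per airline
toy_airlines <- data.frame(
  carrier = c("AA", "DL", "UA"),
  name = c("American Airlines Inc.", "Delta Air Lines Inc.",
           "United Air Lines Inc.")
)

# Toy table with one row per flight; a carrier can appear many times
toy_flights <- data.frame(
  carrier = c("UA", "UA", "AA"),
  flight = c(1545, 1714, 1141)
)

airline_check <- identifies_rows(toy_airlines, "carrier") # TRUE
flight_check <- identifies_rows(toy_flights, "carrier")   # FALSE
print(c(airlines = airline_check, flights = flight_check))
```

With nycflights13 loaded, the same helper could be applied to, say, identifies_rows(planes, "tailnum"). A TRUE result is consistent with each row describing one distinct unit, although it does not by itself prove what that unit is.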
* You can get basic information on an R package using help(package = "packagename"); for the package used here, that is help(package = "nycflights13").