# Dates and times {#dates-and-times .r4ds-section}
## Introduction {#introduction-10 .r4ds-section}
```{r message=FALSE,cache=FALSE}
library("tidyverse")
library("lubridate")
library("nycflights13")
```
## Creating date/times {#creating-datetimes .r4ds-section}
The following code from the chapter is needed by the exercises.
```{r}
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
```
### Exercise 16.2.1 {.unnumbered .exercise data-number="16.2.1"}
<div class="question">
What happens if you parse a string that
contains invalid dates?
</div>
<div class="answer">
```{r}
ret <- ymd(c("2010-10-10", "bananas"))
print(class(ret))
ret
```
It produces an `NA` and a warning message.
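If the failure is expected and handled elsewhere, the warning can be silenced. The following is a minimal sketch, assuming the `quiet` argument of `ymd()`; the name `parsed_quietly` is only for illustration.

```{r}
library(lubridate)

# quiet = TRUE suppresses the parse-failure warning;
# the unparseable string still becomes NA.
parsed_quietly <- ymd(c("2010-10-10", "bananas"), quiet = TRUE)
is.na(parsed_quietly)
```

Silencing the warning does not change the result, so the `NA` still needs to be checked for afterwards.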
</div>
### Exercise 16.2.2 {.unnumbered .exercise data-number="16.2.2"}
<div class="question">
What does the `tzone` argument to `today()` do?
Why is it important?
</div>
<div class="answer">
It determines the time-zone of the date.
Since different time-zones can have different dates, the value of `today()` can vary depending on the time-zone specified.
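As a small illustration, the sketch below asks for today's date in two time zones. The zones are arbitrary choices and the variable names are hypothetical. Auckland is 12 or 13 hours ahead of UTC, so depending on when the code runs the two dates are either equal or one day apart.

```{r}
library(lubridate)

# The same instant, viewed from two time zones (illustrative choices).
today_utc <- today(tzone = "UTC")
today_auckland <- today(tzone = "Pacific/Auckland")

# Difference in days: 0 or 1, depending on the time of day.
as.numeric(today_auckland - today_utc)
```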
</div>
### Exercise 16.2.3 {.unnumbered .exercise data-number="16.2.3"}
<div class="question">
Use the appropriate lubridate function to parse each of the following dates:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14"
```
</div>
<div class="answer">
```{r}
mdy(d1)
ymd(d2)
dmy(d3)
mdy(d4)
mdy(d5)
```
</div>
## Date-time components {#date-time-components .r4ds-section}
The following code from the chapter is used by the exercises in this section.
```{r}
sched_dep <- flights_dt %>%
mutate(minute = minute(sched_dep_time)) %>%
group_by(minute) %>%
summarise(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
```
In the previous code, the difference between rounded and un-rounded dates provides the within-period time.
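To make the rounding idea concrete, here is a minimal sketch on a single made-up timestamp (not taken from the flights data). Subtracting the date-time rounded down to the day from the original leaves the time elapsed within that day.

```{r}
library(lubridate)

# Toy timestamp, chosen only for illustration.
toy_time <- ymd_hms("2013-07-04 14:35:00")

# floor_date() rounds down to midnight; the remainder is the
# time elapsed since the start of the day.
within_day <- toy_time - floor_date(toy_time, "day")
as.numeric(within_day, units = "mins")
```

Here 14:35 is 14 * 60 + 35 = 875 minutes after midnight.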
### Exercise 16.3.1 {.unnumbered .exercise data-number="16.3.1"}
<div class="question">
How does the distribution of flight times
within a day change over the course of the year?
</div>
<div class="answer">
Let's try plotting this by month:
```{r}
flights_dt %>%
filter(!is.na(dep_time)) %>%
mutate(dep_hour = update(dep_time, yday = 1)) %>%
mutate(month = factor(month(dep_time))) %>%
ggplot(aes(dep_hour, color = month)) +
geom_freqpoly(binwidth = 60 * 60)
```
This will look better if everything is normalized within groups. The reason
that February is lower is that there are fewer days and thus fewer flights.
```{r}
flights_dt %>%
filter(!is.na(dep_time)) %>%
mutate(dep_hour = update(dep_time, yday = 1)) %>%
mutate(month = factor(month(dep_time))) %>%
ggplot(aes(dep_hour, color = month)) +
geom_freqpoly(aes(y = after_stat(density)), binwidth = 60 * 60)
```
At least to me, there doesn't appear to be much difference in the within-day distribution over the year, but I may be thinking about it incorrectly.
</div>
### Exercise 16.3.2 {.unnumbered .exercise data-number="16.3.2"}
<div class="question">
Compare `dep_time`, `sched_dep_time` and `dep_delay`. Are they consistent? Explain your findings.
</div>
<div class="answer">
If they are consistent, then `dep_time = sched_dep_time + dep_delay`.
```{r}
flights_dt %>%
mutate(dep_time_ = sched_dep_time + dep_delay * 60) %>%
filter(dep_time_ != dep_time) %>%
select(dep_time_, dep_time, sched_dep_time, dep_delay)
```
There are discrepancies. It looks like there are mistakes in the dates. These
are flights in which the actual departure time is on the *next* day relative to
the scheduled departure time. We forgot to account for this when creating the
date-times using the `make_datetime_100()` function in [16.2.2 From individual components](https://r4ds.had.co.nz/dates-and-times.html#from-individual-components). The code would have had to check whether the departure time is less than
the scheduled departure time plus the departure delay (in minutes). Alternatively, simply adding the departure delay to the scheduled departure time is a more robust way to construct the departure time, because it automatically accounts for crossing into the next day.
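The sketch below illustrates the problem with one made-up flight, scheduled for 23:50 and delayed by 25 minutes. All `toy_*` names and values are hypothetical. Building the departure from the calendar date and the clock reading 0015 lands on the wrong day, while adding the delay to the scheduled time rolls over correctly.

```{r}
library(lubridate)

# Toy flight: scheduled at 23:50 on 2013-01-01, delayed by 25 minutes.
toy_sched <- ymd_hm("2013-01-01 23:50")
toy_delay <- 25

# Naive construction: same calendar date plus the HHMM clock reading (0015).
toy_naive <- make_datetime(2013, 1, 1, 15 %/% 100, 15 %% 100)

# Robust construction: scheduled time plus the delay in minutes.
toy_fixed <- toy_sched + minutes(toy_delay)

toy_naive
toy_fixed
```

The two results differ by exactly one day, which is the pattern seen in the discrepancies.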
</div>
### Exercise 16.3.3 {.unnumbered .exercise data-number="16.3.3"}
<div class="question">
Compare `air_time` with the duration between the departure and arrival.
Explain your findings.
</div>
<div class="answer">
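```{r}
flights_dt %>%
  mutate(
    # Elapsed time between departure and arrival, explicitly in minutes
    flight_duration = as.numeric(arr_time - dep_time, units = "mins"),
    air_time_mins = air_time,
    diff = flight_duration - air_time_mins
  ) %>%
  select(origin, dest, flight_duration, air_time_mins, diff)
```
The two measures should not be expected to match. `dep_time` and `arr_time` are local clock times, so a flight that crosses time zones is off by whole hours. Arrivals after midnight were given the departure date, which makes the computed duration negative. The elapsed time also includes taxiing, which `air_time` should not.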
</div>
### Exercise 16.3.4 {.unnumbered .exercise data-number="16.3.4"}
<div class="question">
How does the average delay time change over the course of a day? Should you use `dep_time` or `sched_dep_time`? Why?
</div>
<div class="answer">
Use `sched_dep_time` because that is the relevant metric for someone scheduling a flight. Also, using `dep_time` will always bias delays to later in the day since delays will push flights later.
```{r}
flights_dt %>%
mutate(sched_dep_hour = hour(sched_dep_time)) %>%
group_by(sched_dep_hour) %>%
summarise(dep_delay = mean(dep_delay)) %>%
ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
geom_point() +
geom_smooth()
```
</div>
### Exercise 16.3.5 {.unnumbered .exercise data-number="16.3.5"}
<div class="question">
On what day of the week should you leave if you want to minimize the chance of a delay?
</div>
<div class="answer">
Saturday has the lowest average departure delay time and the lowest average arrival delay time.
```{r 16.3.5.tbl}
flights_dt %>%
mutate(dow = wday(sched_dep_time)) %>%
group_by(dow) %>%
summarise(
dep_delay = mean(dep_delay),
arr_delay = mean(arr_delay, na.rm = TRUE)
) %>%
print(n = Inf)
```
```{r 16.3.5.fig1}
flights_dt %>%
mutate(wday = wday(dep_time, label = TRUE)) %>%
group_by(wday) %>%
summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = wday, y = ave_dep_delay)) +
geom_bar(stat = "identity")
```
```{r 16.3.5.fig2}
flights_dt %>%
mutate(wday = wday(dep_time, label = TRUE)) %>%
group_by(wday) %>%
summarize(ave_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = wday, y = ave_arr_delay)) +
geom_bar(stat = "identity")
```
</div>
### Exercise 16.3.6 {.unnumbered .exercise data-number="16.3.6"}
<div class="question">
What makes the distribution of `diamonds$carat` and `flights$sched_dep_time` similar?
</div>
<div class="answer">
```{r}
ggplot(diamonds, aes(x = carat)) +
geom_density()
```
In both `carat` and `sched_dep_time`, abnormally many values fall on "nice" human numbers. In `sched_dep_time` these are 00 and 30 minutes. In `carat` they are 0, 1/3, 1/2, and 2/3.
```{r}
ggplot(diamonds, aes(x = carat %% 1 * 100)) +
geom_histogram(binwidth = 1)
```
In scheduled departure times it is 00 and 30 minutes, and minutes
ending in 0 and 5.
```{r}
ggplot(flights_dt, aes(x = minute(sched_dep_time))) +
geom_histogram(binwidth = 1)
```
</div>
### Exercise 16.3.7 {.unnumbered .exercise data-number="16.3.7"}
<div class="question">
Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight was delayed.
</div>
<div class="answer">
First, I create a binary variable `early` that is `TRUE` if a flight leaves early, and `FALSE` if it does not.
Then, I group flights by the minute of departure and take the mean of `early`, which gives the proportion of early departures.
This shows that the proportion of flights that are early departures is highest between minutes 20--30 and 50--60.
```{r}
flights_dt %>%
mutate(minute = minute(dep_time),
early = dep_delay < 0) %>%
group_by(minute) %>%
summarise(
early = mean(early, na.rm = TRUE),
n = n()) %>%
ggplot(aes(minute, early)) +
geom_line()
```
</div>
## Time spans {#time-spans .r4ds-section}
### Exercise 16.4.1 {.unnumbered .exercise data-number="16.4.1"}
<div class="question">
Why is there `months()` but no `dmonths()`?
</div>
<div class="answer">
There is no unambiguous value of months in terms of seconds since months have differing numbers of days.
- 31 days: January, March, May, July, August, October, December
- 30 days: April, June, September, November
- 28 or 29 days: February
The month is not a duration of time defined independently of when it occurs, but a special interval between two dates.
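A short sketch of the point above, using 2015 as an arbitrary non-leap year. The variable names are only for illustration. `days_in_month()` reports the length of each month, and the same one-month step covers a different number of seconds in January than in February.

```{r}
library(lubridate)

# Number of days in each month of 2015.
month_lengths <- days_in_month(ymd("2015-01-01") + months(0:11))
month_lengths

# Seconds from the first of one month to the first of the next.
jan_secs <- as.numeric(ymd("2015-02-01") - ymd("2015-01-01"), units = "secs")
feb_secs <- as.numeric(ymd("2015-03-01") - ymd("2015-02-01"), units = "secs")
c(jan_secs, feb_secs)
```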
</div>
### Exercise 16.4.2 {.unnumbered .exercise data-number="16.4.2"}
<div class="question">
Explain `days(overnight * 1)` to someone who has just started learning R.
How does it work?
</div>
<div class="answer">
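The variable `overnight` is a logical value, either `TRUE` or `FALSE`.
Multiplying it by 1 converts it to a number: `TRUE * 1` is 1 and `FALSE * 1` is 0.
For an overnight flight, `days(overnight * 1)` is therefore a period of one day. Otherwise it is a period of zero days, and nothing is added to the date.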
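A minimal sketch with made-up values shows the coercion step by step. The vector `overnight` here is a toy stand-in for the column of the same name, and `shifted` is a hypothetical name.

```{r}
library(lubridate)

# Toy values: the first flight is overnight, the second is not.
overnight <- c(TRUE, FALSE)

# Multiplying a logical by 1 converts TRUE to 1 and FALSE to 0.
overnight * 1

# Only the overnight flight is moved to the next day.
shifted <- ymd("2013-01-01") + days(overnight * 1)
shifted
```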
</div>
### Exercise 16.4.3 {.unnumbered .exercise data-number="16.4.3"}
<div class="question">
Create a vector of dates giving the first day of every month in 2015.
Create a vector of dates giving the first day of every month in the current year.
</div>
<div class="answer">
A vector of the first day of the month for every month in 2015:
```{r}
ymd("2015-01-01") + months(0:11)
```
To get the vector of the first day of the month for *this* year, we first need to figure out what this year is, and get January 1st of it.
I can do that by taking `today()` and truncating it to the year using `floor_date()`:
```{r}
floor_date(today(), unit = "year") + months(0:11)
```
</div>
### Exercise 16.4.4 {.unnumbered .exercise data-number="16.4.4"}
<div class="question">
Write a function that given your birthday (as a date), returns how old you are in years.
</div>
<div class="answer">
```{r}
age <- function(bday) {
(bday %--% today()) %/% years(1)
}
age(ymd("1990-10-12"))
```
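Because `age()` depends on `today()`, its output changes over time. For a reproducible check, the sketch below uses a hypothetical variant, `age_on()`, that takes the reference date as a second argument. It is an illustration, not part of the original answer.

```{r}
library(lubridate)

# Hypothetical variant of age() with an explicit reference date.
age_on <- function(bday, on) {
  (bday %--% on) %/% years(1)
}

# The day before and the day of the 30th birthday.
age_on(ymd("1990-10-12"), ymd("2020-10-11"))
age_on(ymd("1990-10-12"), ymd("2020-10-12"))
```

The age should increase by one exactly on the birthday.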
</div>
### Exercise 16.4.5 {.unnumbered .exercise data-number="16.4.5"}
<div class="question">
Why can’t `(today() %--% (today() + years(1)) / months(1)` work?
</div>
<div class="answer">
The code in the question is missing a parenthesis.
So, I will assume that the correct code is,
```{r}
(today() %--% (today() + years(1))) / months(1)
```
While this code will not display a warning or message, it does not work exactly as
expected. The problem is discussed in the [Intervals](https://r4ds.had.co.nz/dates-and-times.html#intervals) section.
The numerator of the expression, `(today() %--% (today() + years(1)))`, is an *interval*, which includes both a duration of time and a starting point. The interval has an exact number of seconds.
The denominator of the expression, `months(1)`, is a period, which is meaningful to humans but not defined in terms of an exact number of seconds.
Months can be 28, 29, 30, or 31 days, so it is not clear what length `months(1)` should have as a divisor.
The code does not produce a warning message, but it will not always produce the correct result.
To find the number of months within an interval use `%/%` instead of `/`,
```{r}
(today() %--% (today() + years(1))) %/% months(1)
```
Alternatively, we could define a "month" as 30 days, and run
```{r}
(today() %--% (today() + years(1))) / days(30)
```
None of these approaches work when `today()` is February 29th of a leap year, because `today() + years(1)` is then not a valid date and evaluates to `NA`:
```{r}
as.Date("2016-02-29") + years(1)
```
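One possible workaround, offered as a sketch rather than as part of the original answer, is lubridate's `%m+%` operator, which rolls back to the last valid day of the month instead of returning `NA`. The name `leap_day` is only for illustration.

```{r}
library(lubridate)

leap_day <- ymd("2016-02-29")

# Plain period addition gives NA because 2017-02-29 does not exist.
leap_day + years(1)

# %m+% rolls back to the last valid day of the month.
leap_day %m+% years(1)
```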
</div>
## Time zones {#time-zones .r4ds-section}
`r no_exercises()`