forked from clauswilke/dataviz
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtime_series.Rmd
450 lines (380 loc) · 28.2 KB
/
time_series.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
```{r echo = FALSE, message = FALSE, warning = FALSE}
# run setup script
source("_common.R")
library(ggridges)
library(lubridate)
library(ggrepel)
```
# Visualizing time series and other functions of an independent variable {#time-series}
The preceding chapter discussed scatter plots, where we plot one quantitative variable against another. A special case arises when one of the two variables can be thought of as time, because time imposes additional structure on the data. Now the data points have an inherent order; we can arrange the points in order of increasing time and define a predecessor and successor for each data point. We frequently want to visualize this temporal order and we do so with line graphs. Line graphs are not limited to time series, however. They are appropriate whenever one variable imposes an ordering on the data. This scenario arises also, for example, in a controlled experiment where a treatment variable is purposefully set to a range of different values. If we have multiple variables that depend on time, we can either draw separate line plots or we can draw a regular scatter plot and then draw lines to connect the neighboring points in time.
## Individual time series
As a first demonstration of a time series, we will consider the pattern of monthly preprint submissions in biology. Preprints are scientific articles that researchers post online before formal peer review and publication in a scientific journal. The preprint server bioRxiv, which was founded in November 2013 specifically for researchers working in the biological sciences, has seen substantial growth in monthly submissions since. We can visualize this growth by making a form of scatter plot (Chapter \@ref(visualizing-associations)) where we draw dots representing the number of submissions in each month (Figure \@ref(fig:biorxiv-dots)).
(ref:biorxiv-dots) Monthly submissions to the preprint server bioRxiv, from its inception in November 2014 until April 2018. Each dot represents the number of submissions in one month. There has been a steady increase in submission volume throughout the entire 4.5-year period. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r biorxiv-dots, fig.cap = '(ref:biorxiv-dots)'}
preprint_growth %>% filter(archive == "bioRxiv") %>%
filter(count > 0) -> biorxiv_growth
ggplot(biorxiv_growth, aes(date, count)) +
#geom_point(color = "#0072B2") +
geom_point(color = "white", fill = "#0072B2", shape = 21, size = 2) +
scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
name = "preprints / month") +
scale_x_date(name = "year") +
theme_dviz_open() +
theme(plot.margin = margin(7, 7, 3, 1.5))
```
There is an important difference however between Figure \@ref(fig:biorxiv-dots) and the scatter plots discussed in Chapter \@ref(visualizing-associations). In Figure \@ref(fig:biorxiv-dots), the dots are spaced evenly along the *x* axis, and there is a defined order among them. Each dot has exactly one left and one right neighbor (except the leftmost and rightmost points which have only one neighbor each). We can visually emphasize this order by connecting neighboring points with lines (Figure \@ref(fig:biorxiv-dots-line)). Such a plot is called a *line graph*.
(ref:biorxiv-dots-line) Monthly submissions to the preprint server bioRxiv, shown as dots connected by lines. The lines do not represent data but are only meant as a guide to the eye. By connecting the individual dots with lines, we emphasize that there is an order between the dots, each dot has exactly one neighbor that comes before and one that comes after. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r biorxiv-dots-line, fig.cap = '(ref:biorxiv-dots-line)'}
ggplot(biorxiv_growth, aes(date, count)) +
geom_line(size = 0.5, color = "#0072B2") +
geom_point(color = "white", fill = "#0072B2", shape = 21, size = 2) +
scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
name = "preprints / month") +
scale_x_date(name = "year") +
theme_dviz_open() +
theme(plot.margin = margin(7, 7, 3, 1.5))
```
Some people object to drawing lines between points because the lines do not represent observed data. In particular, if there are only a few observations spaced far apart, had observations been made at intermediate times they would probably not have fallen exactly onto the lines shown. Thus, in a sense, the lines correspond to made-up data. Yet they may help with perception when the points are spaced far apart or are unevenly spaced. We can somewhat resolve this dilemma by pointing it out in the figure caption, for example by writing "lines are meant as a guide to the eye" (see caption of Figure \@ref(fig:biorxiv-dots-line)).
Using lines to represent time series is generally accepted practice, however, and frequently the dots are omitted altogether (Figure \@ref(fig:biorxiv-line)). Without dots, the figure places more emphasis on the overall trend in the data and less on individual observations. A figure without dots is also visually less busy. In general, the denser the time series, the less important it is to show individual observations with dots. For the preprint dataset shown here, I think omitting the dots is fine.
(ref:biorxiv-line) Monthly submissions to the preprint server bioRxiv, shown as a line graph without dots. Omitting the dots emphasizes the overall temporal trend while de-emphasizing individual observations at specific time points. It is particularly useful when the time points are spaced very densely. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r biorxiv-line, fig.cap = '(ref:biorxiv-line)'}
ggplot(biorxiv_growth, aes(date, count)) + geom_line(color = "#0072B2", size = 1) +
scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
name = "preprints / month") +
scale_x_date(name = "year") +
theme_dviz_open() +
theme(plot.margin = margin(7, 7, 3, 1.5))
```
We can also fill the area under the curve with a solid color (Figure \@ref(fig:biorxiv-line-area)). This choice further emphasizes the overarching trend in the data, because it visually separates the area above the curve from the area below. However, this visualization is only valid if the *y* axis starts at zero, so that the height of the shaded area at each time point represents the data value at that time point.
(ref:biorxiv-line-area) Monthly submissions to the preprint server bioRxiv, shown as a line graph with filled area underneath. By filling the area under the curve, we put even more emphasis on the overarching temporal trend than if we just draw a line (Figure \@ref(fig:biorxiv-line)). Data source: Jordan Anaya, http://www.prepubmed.org/
```{r biorxiv-line-area, fig.cap = '(ref:biorxiv-line-area)'}
ggplot(biorxiv_growth, aes(date, height = count, y = 0)) +
geom_ridgeline(color = "#0072B2", fill = "#0072B240", size = 0.75) +
scale_y_continuous(limits = c(0, 1600), expand = c(0, 0),
name = "preprints / month") +
scale_x_date(name = "year") +
theme_dviz_open() +
theme(plot.margin = margin(7, 7, 3, 1.5))
```
## Multiple time series and dose--response curves
We often have multiple time courses that we want to show at once. In this case, we have to be more careful in how we plot the data, because the figure can become confusing or difficult to read. For example, if we want to show the monthly submissions to multiple preprint servers, a scatter plot is not a good idea, because the individual time courses run into each other (Figure \@ref(fig:bio-preprints-dots)). Connecting the dots with lines alleviates this issue (Figure \@ref(fig:bio-preprints-lines)).
(ref:bio-preprints-dots) Monthly submissions to three preprint servers covering biomedical research: bioRxiv, the q-bio section of arXiv, and PeerJ Preprints. Each dot represents the number of submissions in one month to the respective preprint server. This figure is labeled "bad" because the three time courses visually interfere with each other and are difficult to read. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r bio-preprints-dots, fig.cap = '(ref:bio-preprints-dots)'}
preprint_growth %>% filter(archive %in% c("bioRxiv", "arXiv q-bio", "PeerJ Preprints")) %>%
filter(count > 0) %>%
mutate(archive = factor(archive, levels = c("bioRxiv", "arXiv q-bio", "PeerJ Preprints")))-> preprints
p <- ggplot(preprints, aes(date, count, color = archive, fill = archive, shape = archive)) +
geom_point(color = "white", size = 2) +
scale_shape_manual(values = c(21, 22, 23),
name = NULL) +
scale_y_continuous(limits = c(0, 600), expand = c(0, 0),
name = "preprints / month") +
scale_x_date(name = "year",
limits = c(min(biorxiv_growth$date), ymd("2017-01-01"))) +
scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = NULL) +
scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = NULL) +
theme_dviz_open() +
theme(legend.title.align = 0.5,
legend.position = c(0.1, .9),
legend.just = c(0, 1),
plot.margin = margin(14, 7, 3, 1.5))
stamp_bad(p)
```
(ref:bio-preprints-lines) Monthly submissions to three preprint servers covering biomedical research. By connecting the dots in Figure \@ref(fig:bio-preprints-dots) with lines, we help the viewer follow each individual time course. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r bio-preprints-lines, fig.cap = '(ref:bio-preprints-lines)'}
ggplot(preprints, aes(date, count, color = archive, fill = archive, shape = archive)) +
geom_line(size = 0.75) + geom_point(color = "white", size = 2) +
scale_y_continuous(limits = c(0, 600), expand = c(0, 0),
name = "preprints / month") +
scale_x_date(name = "year",
limits = c(min(biorxiv_growth$date), ymd("2017-01-01"))) +
scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = NULL) +
scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = NULL) +
scale_shape_manual(values = c(21, 22, 23),
name = NULL) +
theme_dviz_open() +
theme(legend.title.align = 0.5,
legend.position = c(0.1, .9),
legend.just = c(0, 1),
plot.margin = margin(14, 7, 3, 1.5))
```
Figure \@ref(fig:bio-preprints-lines) represents an acceptable visualization of the preprints dataset. However, the separate legend creates unnecessary cognitive load. We can reduce this cognitive load by labeling the lines directly (Figure \@ref(fig:bio-preprints-direct-label)). We have also eliminated the individual dots in this figure, for a result that is much more streamlined and easy to read than the original starting point, Figure \@ref(fig:bio-preprints-dots).
(ref:bio-preprints-direct-label) Monthly submissions to three preprint servers covering biomedical research. By direct labeling the lines instead of providing a legend, we have reduced the cognitive load required to read the figure. And the elimination of the legend removes the need for points of different shapes. Thus, we could streamline the figure further by eliminating the dots. Data source: Jordan Anaya, http://www.prepubmed.org/
```{r bio-preprints-direct-label, fig.cap = '(ref:bio-preprints-direct-label)'}
preprints_final <- filter(preprints, date == ymd("2017-01-01"))
ggplot(preprints) +
aes(date, count, color = archive, fill = archive, shape = archive) +
geom_line(size = 1) +
#geom_point(color = "white", size = 2) +
scale_y_continuous(
limits = c(0, 600), expand = c(0, 0),
name = "preprints / month",
sec.axis = dup_axis(
breaks = preprints_final$count,
labels = c("arXiv\nq-bio", "PeerJ\nPreprints", "bioRxiv"),
name = NULL)
) +
scale_x_date(name = "year",
limits = c(min(biorxiv_growth$date), ymd("2017-01-01")),
expand = expand_scale(mult = c(0.02, 0))) +
scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = NULL) +
scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = NULL) +
scale_shape_manual(values = c(21, 22, 23),
name = NULL) +
coord_cartesian(clip = "off") +
theme_dviz_open() +
theme(legend.position = "none") +
theme(axis.line.y.right = element_blank(),
axis.ticks.y.right = element_blank(),
axis.text.y.right = element_text(margin = margin(0, 0, 0, 0)),
plot.margin = margin(14, 7, 3, 1.5))
```
Line graphs are not limited to time series. They are appropriate whenever the data points have a natural order that is reflected in the variable shown along the *x* axis, so that neighboring points can be connected with a line. This situation arises, for example, in dose--response curves, where we measure how changing some numerical parameter in an experiment (the dose) affects an outcome of interest (the response). Figure \@ref(fig:oats-yield) shows a classic experiment of this type, measuring oat yield in response to increasing amounts of fertilization. The line-graph visualization highlights how the dose--response curve has a similar shape for the three oat varieties considered but differs in the starting point in the absence of fertilization (i.e., some varieties have naturally higher yield than others).
(ref:oats-yield) Dose--response curve showing the mean yield of oats varieties after fertilization with manure. The manure serves as a source of nitrogen, and oat yields generally increase as more nitrogen is available, regardless of variety. Here, manure application is measured in cwt (hundredweight) per acre. The hundredweight is an old imperial unit equal to 112 lbs or 50.8 kg. Data source: @Yates1935
```{r oats-yield, fig.cap = '(ref:oats-yield)'}
MASS::oats %>%
# 1 long (UK) cwt == 112 lbs == 50.802345 kg
mutate(N = 1*as.numeric(sub("cwt", "", N, fixed = TRUE))) %>%
group_by(N, V) %>%
summarize(mean = 20 * mean(Y)) %>% # factor 20 converts units to lbs/acre
mutate(variety = ifelse(V == "Golden.rain", "Golden Rain", as.character(V))) ->
oats_df
oats_df$variety <- factor(oats_df$variety, levels = c("Marvellous", "Golden Rain", "Victory"))
ggplot(oats_df,
aes(N, mean, color = variety, shape = variety, fill = variety)) +
geom_line(size = 0.75) + geom_point(color = "white", size = 2.5) +
scale_y_continuous(name = "mean yield (lbs/acre)") +
scale_x_continuous(name = "manure treatment (cwt/acre)") +
scale_shape_manual(values = c(21, 22, 23),
name = "oat variety") +
scale_color_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = "oat variety") +
scale_fill_manual(values = c("#0072b2", "#D55E00", "#009e73"),
name = "oat variety") +
coord_cartesian(clip = "off") +
theme_dviz_open() +
theme(legend.title.align = 0.5)
```
## Time series of two or more response variables {#time-series-connected-scatter}
In the preceding examples we dealt with time courses of only a single response variable (e.g., preprint submissions per month or oat yield). It is not unusual, however, to have more than one response variable. Such situations arise commonly in macroeconomics. For example, we may be interested in the change in house prices from the previous 12 months as it relates to the unemployment rate. We may expect that house prices rise when the unemployment rate is low and vice versa.
Given the tools from the preceding subsections, we can visualize such data as two separate line graphs stacked on top of each other (Figure \@ref(fig:house-price-unemploy)). This plot directly shows the two variables of interest, and it is straightforward to interpret. However, because the two variables are shown as separate line graphs, drawing comparisons between them can be cumbersome. If we want to identify temporal regions when both variables move in the same or in opposite directions, we need to switch back and forth between the two graphs and compare the relative slopes of the two curves.
(ref:house-price-unemploy) 12-month change in house prices (a) and unemployment rate (b) over time, from Jan. 2001 through Dec. 2017. Data sources: Freddie Mac House Prices Index, U.S. Bureau of Labor Statistics.
```{r house-price-unemploy, fig.width = 5*6/4.2, fig.cap = '(ref:house-price-unemploy)'}
# prepare dataset already for next figure
CA_house_prices <-
filter(house_prices, state == "California", year(date) > 2000) %>%
mutate(
label = ifelse(
date %in% c(ymd("2005-01-01"), ymd("2007-07-01"),
ymd("2010-01-01"), ymd("2012-07-01"), ymd("2015-01-01")),
format(date, "%b %Y"), ""),
nudge_x = case_when(
label == "Jan 2005" ~ -0.003,
TRUE ~ 0.003
),
nudge_y = case_when(
label == "Jan 2005" ~ 0.01,
label %in% c("Jul 2007", "Jul 2012") ~ 0.01,
TRUE ~ -0.01
),
hjust = case_when(
label == "Jan 2005" ~ 1,
TRUE ~ 0
)
)
p1 <- ggplot(CA_house_prices, aes(date, house_price_perc)) +
geom_line(size = 1, color = "#0072b2") +
scale_y_continuous(
limits = c(-0.3, .32), expand = c(0, 0),
breaks = c(-.3, -.15, 0, .15, .3),
name = "12-month change\nin house prices", labels = scales::percent_format(accuracy = 1)
) +
scale_x_date(name = "", expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(
axis.line = element_blank(),
plot.margin = margin(12, 1.5, 0, 1.5)
)
p2 <- ggplot(CA_house_prices, aes(date, unemploy_perc/100)) +
geom_line(size = 1, color = "#0072b2") +
scale_y_continuous(
limits = c(0.037, 0.143),
name = "unemployment\nrate", labels = scales::percent_format(accuracy = 1),
expand = c(0, 0)
) +
scale_x_date(name = "year", expand = c(0, 0)) +
theme_dviz_grid() +
theme(
axis.line = element_blank(),
plot.margin = margin(6, 1.5, 3, 1.5)
)
plot_grid(p1, p2, align = 'v', ncol = 1, labels = "auto")
```
As an alternative to showing two separate line graphs, we can plot the two variables against each other, drawing a path that leads from the earliest time point to the latest (Figure \@ref(fig:house-price-path)). Such a visualization is called a *connected scatter plot*, because we are technically making a scatter plot of the two variables against each other and then are connecting neighboring points. Physicists and engineers often call this a *phase portrait*, because in their disciplines it is commonly used to represent movement in phase space. We have previously encountered connected scatter plots in Chapter \@ref(coordinate-systems-axes), where we plotted the daily temperature normals in Houston, TX, versus those in San Diego, CA (Figure \@ref(fig:temperature-normals-Houston-San-Diego)).
(ref:house-price-path) 12-month change in house prices versus unemployment rate, from Jan. 2001 through Dec. 2017, shown as a connected scatter plot. Darker shades represent more recent months. The anti-correlation seen in Figure \@ref(fig:house-price-unemploy) between the change in house prices and the unemployment rate causes the connected scatter plot to form two counter-clockwise circles. Data sources: Freddie Mac House Price Index, U.S. Bureau of Labor Statistics. Original figure concept: Len Kiefer
```{r house-price-path, fig.asp = 3/4, fig.cap = '(ref:house-price-path)'}
ggplot(CA_house_prices) +
aes(unemploy_perc/100, house_price_perc, colour = as.numeric(date)) +
geom_path(size = 1, lineend = "round") +
geom_text_repel(
aes(label = label), point.padding = .2, color = "black",
min.segment.length = 0, size = 12/.pt,
hjust = CA_house_prices$hjust,
nudge_x = CA_house_prices$nudge_x,
nudge_y = CA_house_prices$nudge_y,
direction = "y",
family = dviz_font_family
) +
scale_x_continuous(
limits = c(0.037, 0.143),
name = "unemployment rate", labels = scales::percent_format(accuracy = 1),
expand = c(0, 0)
) +
scale_y_continuous(
limits = c(-0.315, .315), expand = c(0, 0),
breaks = c(-.3, -.15, 0, .15, .3),
name = "12-month change in house prices", labels = scales::percent_format(accuracy = 1)
) +
scale_colour_gradient(low = "#E7F0FF", high = "#035B8F") + #"#0072b2") +
guides(colour = FALSE) +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(
axis.ticks.length = unit(0, "pt"),
plot.margin = margin(21, 14, 3.5, 1.5))
```
In a connected scatter plot, lines going in the direction from the lower left to the upper right represent correlated movement between the two variables (as one variable grows, so does the other), and lines going in the perpendicular direction, from the upper left to the lower right, represent anti-correlated movement (as one variable grows, the other shrinks). If the two variables have a somewhat cyclic relationship, we will see circles or spirals in the connected scatter plot. In Figure \@ref(fig:house-price-path), we see one small circle from 2001 through 2005 and one large circle for the remainder of the time course.
When drawing a connected scatter plot, it is important that we indicate both the direction and the temporal scale of the data. Without such hints, the plot can turn into meaningless scribble (Figure \@ref(fig:house-price-path-bad)). I am using here (in Figure \@ref(fig:house-price-path)) a gradual darkening of the color to indicate direction. Alternatively, one could draw arrows along the path.
(ref:house-price-path-bad) 12-month change in house prices versus unemployment rate, from Jan. 2001 through Dec. 2017. This figure is labeled "bad" because without the date markers and color shading of Figure \@ref(fig:house-price-path), we can see neither the direction nor the speed of change in the data. Data sources: Freddie Mac House Prices Index, U.S. Bureau of Labor Statistics.
```{r house-price-path-bad, fig.asp = 3/4, fig.cap = '(ref:house-price-path-bad)'}
p <- ggplot(CA_house_prices) +
aes(unemploy_perc/100, house_price_perc) +
geom_path(size = 1, lineend = "round", color = "#0072b2") +
scale_x_continuous(
limits = c(0.037, 0.143),
name = "unemployment rate", labels = scales::percent_format(accuracy = 1),
expand = c(0, 0)
) +
scale_y_continuous(
limits = c(-0.315, .315), expand = c(0, 0),
breaks = c(-.3, -.15, 0, .15, .3),
name = "12-month change in house prices", labels = scales::percent_format(accuracy = 1)
) +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(
axis.ticks.length = unit(0, "pt"),
plot.margin = margin(21, 14, 3.5, 1.5))
stamp_bad(p)
```
Is it better to use a connected scatter plot or two separate line graphs? Separate line graphs tend to be easier to read, but once people are used to connected scatter plots they may be able to extract certain patterns (such as cyclical behavior with some irregularity) that can be difficult to spot in line graphs. In fact, to me the cyclical relationship between change in house prices and unemployment rate is hard to spot in Figure \@ref(fig:house-price-unemploy), but the counter-clockwise spiral in Figure \@ref(fig:house-price-path) clearly shows it. Research reports that readers are more likely to confuse order and direction in a connected scatter plot than in line graphs and less likely to report correlation [@Haroz_et_al_2016]. On the flip side, connected scatter plots seem to result in higher engagement, and thus such plots may be a effective tools to draw readers into a story [@Haroz_et_al_2016].
Even though connected scatter plots can show only two variables at a time, we can also use them to visualize higher-dimensional datasets. The trick is to apply dimension reduction first (see Chapter \@ref(visualizing-associations)). We can then draw a connected scatterplot in the dimension-reduced space. As an example of this approach, we will visualize a database of monthly observations of over 100 macroeconomic indicators, provided by the Federal Reserve Bank of St. Louis. We perform a principal components analysis (PCA) of all indicators and then draw a connected scatter plot of PC 2 versus PC 1 (Figure \@ref(fig:fred-md-PCA)a) and versus PC 3 (Figure \@ref(fig:fred-md-PCA)b).
(ref:fred-md-PCA) Visualizing a high-dimensional time series as a connected scatter plot in principal components space. The path indicates the joint movement of over 100 macroeconomic indicators from January 1990 to December 2017. Times of recession and recovery are indicated via color, and the end points of the three recessions (March 1991, November 2001, and June 2009) are also labeled. (a) PC 2 versus PC 1. (b) PC 2 versus PC 3. Data source: M. W. McCracken, St. Louis Fed
```{r fred-md-PCA, fig.width = 5*6/4.2, fig.asp = 2*0.618, fig.cap = '(ref:fred-md-PCA)'}
fred_md %>%
select(-date, -sasdate) %>%
scale() %>%
prcomp() -> pca
pca_data <- data.frame(date = fred_md$date, pca$x) %>%
mutate(
type = ifelse(
(ymd("1990-07-01") <= date & date < ymd("1991-03-01")) |
(ymd("2001-03-01") <= date & date < ymd("2001-11-01")) |
(ymd("2007-12-01") <= date & date < ymd("2009-06-01")),
"recession",
"recovery"
)
)
pca_labels <-
mutate(pca_data,
label = ifelse(
date %in% c(ymd("1990-01-01"), ymd("1991-03-01"), ymd("2001-11-01"),
ymd("2009-06-01"), ymd("2017-12-01")),
format(date, "%b %Y"), ""
)
) %>%
filter(label != "") %>%
mutate(
nudge_x = c(.2, -.2, -.2, -.2, .2),
nudge_y = c(.2, -.2, -.2, -.2, .2),
hjust = c(0, 1, 1, 1, 0),
vjust = c(0, 1, 1, 1, 0),
nudge_x2 = c(.2, .2, .2, -.2, .2),
nudge_y2 = c(.2, -.2, -1, .2, .2),
hjust2 = c(0, 0, .2, 1, 0),
vjust2 = c(0, 1, 1, 1, 0)
)
colors = darken(c("#D55E00", "#009E73"), c(0.1, 0.1))
p1 <- ggplot(filter(pca_data, date >= ymd("1990-01-01"))) +
aes(x=PC1, y=PC2, color=type, alpha = date, group = 1) +
geom_path(size = 1, lineend = "butt") +
geom_text_repel(
data = pca_labels,
aes(label = label),
alpha = 1,
point.padding = .2, color = "black",
min.segment.length = 0, size = 12/.pt,
family = dviz_font_family,
nudge_x = pca_labels$nudge_x,
nudge_y = pca_labels$nudge_y,
hjust = pca_labels$hjust,
vjust = pca_labels$vjust
) +
scale_color_manual(
values = colors,
name = NULL
) +
scale_alpha_date(range = c(0.45, 1), guide = "none") +
scale_x_continuous(limits = c(-2, 15.5), name = "PC 1") +
scale_y_continuous(limits = c(-5, 5), name = "PC 2") +
theme_dviz_grid(14) +
theme(
legend.position = c(1, 1),
legend.justification = c(1, 1),
legend.direction = "horizontal",
legend.box.background = element_rect(color = NA, fill = "white"),
legend.box.margin = margin(6, 0, 6, 0),
plot.margin = margin(3, 1.5, 6, 1.5)
)
p2 <- ggplot(filter(pca_data, date >= ymd("1990-01-01"))) +
aes(x=PC3, y=PC2, color=type, alpha = date, group = 1) +
geom_path(size = 1, lineend = "butt") +
geom_text_repel(
data = pca_labels,
aes(label = label),
alpha = 1,
point.padding = .2, color = "black",
min.segment.length = 0, size = 12/.pt,
family = dviz_font_family,
nudge_x = pca_labels$nudge_x2,
nudge_y = pca_labels$nudge_y2,
hjust = pca_labels$hjust2,
vjust = pca_labels$vjust2
) +
scale_color_manual(
values = colors,
name = NULL
) +
scale_alpha_date(range = c(0.45, 1), guide = "none") +
scale_x_continuous(limits = c(-6, 8.5), name = "PC 3") +
scale_y_continuous(limits = c(-5, 5), name = "PC 2") +
theme_dviz_grid(14) +
theme(
legend.position = c(1, 1),
legend.justification = c(1, 1),
legend.direction = "horizontal",
legend.box.background = element_rect(color = NA, fill = "white"),
legend.box.margin = margin(6, 0, 6, 0),
plot.margin = margin(6, 1.5, 3, 1.5)
)
plot_grid(p1, p2, labels = "auto", ncol = 1)
```
Notably, Figure \@ref(fig:fred-md-PCA)a looks almost like a regular line plot, with time running from left to right. This pattern is caused by a common feature of PCA: The first component often measures the overall size of the system. Here, PC 1 approximately measures the overall size of the economy, which rarely decreases over time.
By coloring the connected scatter plot by times of recession and recovery, we can see that recessions are associated with a drop in PC 2 whereas recoveries do not correspond to a clear feature in either PC 1 or PC 2 (Figure \@ref(fig:fred-md-PCA)a). The recoveries do, however, seem to correspond to a drop in PC 3 (Figure \@ref(fig:fred-md-PCA)b). Moreover, in the PC 2 versus PC 3 plot, we see that the line follows the shape of a clockwise spiral. This pattern emphasizes the cyclical nature of the economy, with recessions following recoveries and vice versa.