```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(lubridate)
library(forcats)
library(tidyr)
library(ggrepel)
```
# Coordinate systems and axes {#coordinate-systems-axes}
To make any sort of data visualization, we need to define position scales, which determine where in a graphic different data values are located. We cannot visualize data without placing different data points at different locations, even if we just arrange them next to each other along a line. For regular 2d visualizations, two numbers are required to uniquely specify a point, and therefore we need two position scales. These two scales are usually but not necessarily the *x* and *y* axis of the plot. We also have to specify the relative geometric arrangement of these scales. Conventionally, the *x* axis runs horizontally and the *y* axis vertically, but we could choose other arrangements. For example, we could have the *y* axis run at an acute angle relative to the *x* axis, or we could have one axis run in a circle and the other run radially. The combination of a set of position scales and their relative geometric arrangement is called a *coordinate system.*
## Cartesian coordinates
The most widely used coordinate system for data visualization is the 2d *Cartesian coordinate system*, where each location is uniquely specified by an *x* and a *y* value. The *x* and *y* axes run orthogonally to each other, and data values are placed in an even spacing along both axes (Figure \@ref(fig:cartesian-coord)). The two axes are continuous position scales, and they can represent both positive and negative real numbers. To fully specify the coordinate system, we need to specify the range of numbers each axis covers. In Figure \@ref(fig:cartesian-coord), the *x* axis runs from -2.2 to 3.2 and the *y* axis runs from -2.2 to 2.2. Any data values between these axis limits are placed at the respective location in the plot. Any data values outside the axis limits are discarded.
(ref:cartesian-coord) Standard Cartesian coordinate system. The horizontal axis is conventionally called *x* and the vertical axis *y*. The two axes form a grid with equidistant spacing. Here, both the *x* and the *y* grid lines are separated by units of one. The point (2, 1) is located two *x* units to the right and one *y* unit above the origin (0, 0). The point (-1, -1) is located one *x* unit to the left and one *y* unit below the origin.
```{r cartesian-coord, fig.asp = 0.8, fig.cap = '(ref:cartesian-coord)'}
df_points <- data.frame(x = c(-1, 0, 2),
y = c(-1, 0, 1),
label = c("(–1, –1)", "(0, 0)", "(2, 1)"),
vjust = c(1.4, -.8, -.8),
hjust = c(1.1, 1.1, -.1))
df_segments <- data.frame(x0 = c(0, 2, 0, -1),
x1 = c(2, 2, -1, -1),
y0 = c(1, 0, -1, 0),
y1 = c(1, 1, -1, -1))
df_labels <- data.frame(x = c(-1, -.5, 1, 2),
y = c(-.5, -1, 1, 0.5),
vjust = c(.5, 1.3, -.3, .5),
hjust = c(1.1, .5, .5, -.1),
label = c("y = –1", "x = –1", "x = 2", "y = 1"))
ggplot(df_points, aes(x, y)) +
geom_hline(yintercept = 0, color = "gray50") +
geom_vline(xintercept = 0, color = "gray50") +
geom_segment(data = df_segments, aes(x = x0, xend = x1, y = y0, yend = y1),
linetype = 2) +
geom_point(size = 3, color = "#0072B2") +
geom_text(aes(label = label, vjust = vjust, hjust = hjust),
size = 12/.pt, family = dviz_font_family) +
geom_text(data = df_labels, aes(label = label, hjust = hjust, vjust = vjust),
size = 12/.pt, family = dviz_font_family) +
coord_fixed(xlim = c(-2.2, 3.2), ylim = c(-2.2, 2.2), expand = FALSE) +
xlab("x axis") +
ylab("y axis") +
theme_dviz_grid() +
theme(axis.ticks.length = grid::unit(0, "pt"))
```
Data values usually aren't just numbers, however. They come with units. For example, if we're measuring temperature, the values may be measured in degrees Celsius or Fahrenheit. Similarly, if we're measuring distance, the values may be measured in kilometers or miles, and if we're measuring duration, the values may be measured in minutes, hours, or days. In a Cartesian coordinate system, the spacing between grid lines along an axis corresponds to discrete steps in these data units. In a temperature scale, for example, we may have a grid line every 10 degrees Fahrenheit, and in a distance scale, we may have a grid line every 5 kilometers.
A Cartesian coordinate system can have two axes representing two different units. This situation arises quite commonly whenever we're mapping two different types of variables to *x* and *y*. For example, in Figure \@ref(fig:temp-normals-vs-time), we plotted temperature vs. days of the year. The *y* axis of Figure \@ref(fig:temp-normals-vs-time) is measured in degrees Fahrenheit, with a grid line every 20 degrees, and the *x* axis is measured in months, with a grid line at the first of every third month. Whenever the two axes are measured in different units, we can stretch or compress one relative to the other and maintain a valid visualization of the data (Figure \@ref(fig:temperature-normals-Houston)). Which version is preferable may depend on the story we want to convey. A tall and narrow figure emphasizes change along the *y* axis and a short and wide figure does the opposite. Ideally, we want to choose an aspect ratio that ensures that any important differences in position are noticeable.
(ref:temperature-normals-Houston) Daily temperature normals for Houston, TX. Temperature is mapped to the *y* axis and day of the year to the *x* axis. Parts (a), (b), and (c) show the same figure in different aspect ratios. All three parts are valid visualizations of the temperature data. Data source: NOAA.
```{r temperature-normals-Houston, fig.width = 5*6/4.2, fig.asp = 3/4, fig.cap = '(ref:temperature-normals-Houston)'}
temps_wide <- filter(ncdc_normals,
station_id %in% c(
"USW00014819", # Chicago, IL 60638
"USC00516128", # Honolulu, HI 96813
"USW00027502", # Barrow, AK 99723, coldest point in the US
"USC00042319", # Death Valley, CA 92328 hottest point in the US
"USW00093107", # San Diego, CA 92145
"USW00012918", # Houston, TX 77061
"USC00427606" # Salt Lake City, UT 84103
)) %>%
mutate(location = fct_recode(factor(station_id),
"Chicago" = "USW00014819",
"Honolulu" = "USC00516128",
"Barrow, AK" = "USW00027502",
"Death Valley" = "USC00042319",
"San Diego" = "USW00093107",
"Houston" = "USW00012918",
"Salt Lake City, UT" = "USC00427606")) %>%
select(-station_id, -flag) %>%
spread(location, temperature) %>%
arrange(date)
temps_wide_label <- mutate(
temps_wide,
label = ifelse(
date %in% c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"), ymd("0000-10-01")),
format(date, "%b 1st"),
""
),
nudge_x = ifelse(
date %in% c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"), ymd("0000-10-01")),
c(-1, -2, -2, 1)[round(month(date)/3)+1],
0
),
nudge_y = ifelse(
date %in% c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"), ymd("0000-10-01")),
c(-2, 1, 0.5, -2)[round(month(date)/3)+1],
0
)
)
temp_plot <- ggplot(temps_wide_label, aes(x = date, y = `Houston`)) +
geom_line(size = 1, color = "#0072B2") +
scale_x_date(name = "month", limits = c(ymd("0000-01-01"), ymd("0001-01-03")),
breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
ymd("0000-10-01"), ymd("0001-01-01")),
labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(2/366, 0)) +
scale_y_continuous(limits = c(50, 90),
name = "temperature (°F)") +
theme_dviz_grid(12) +
theme(plot.margin = margin(3, 5, 3, 1.5))
plot_grid(
plot_grid(
temp_plot, NULL, temp_plot, rel_widths = c(1, 0.06, 2), labels = c("a", "", "b"), nrow = 1
),
NULL, temp_plot,
rel_heights = c(1.5, 0.06, 1), labels = c("", "", "c"), label_y = c(1, 1, 1.03), ncol = 1
)
```
On the other hand, if the *x* and the *y* axes are measured in the same units, then the grid spacings for the two axes should be equal, such that the same distance along the *x* or *y* axis corresponds to the same number of data units. As an example, we can plot the temperature in Houston, TX against the temperature in San Diego, CA, for every day of the year (Figure \@ref(fig:temperature-normals-Houston-San-Diego)a). Since the same quantity is plotted along both axes, we need to make sure that the grid lines form perfect squares, as is the case in Figure \@ref(fig:temperature-normals-Houston-San-Diego).
(ref:temperature-normals-Houston-San-Diego) Daily temperature normals for Houston, TX, plotted versus the respective temperature normals of San Diego, CA. The first days of the months January, April, July, and October are highlighted to provide a temporal reference. (a) Temperatures are shown in degrees Fahrenheit. (b) Temperatures are shown in degrees Celsius. Data source: NOAA.
```{r temperature-normals-Houston-San-Diego, fig.width = 5.5*6/4.2, fig.asp = 0.5, fig.cap = '(ref:temperature-normals-Houston-San-Diego)'}
tempsplot_F <- ggplot(temps_wide_label, aes(x = `San Diego`, y = `Houston`)) +
geom_path(size = 1, color = "#0072B2") +
geom_text_repel(
aes(label = label), point.padding = .4, color = "black",
min.segment.length = 0, size = 12/.pt,
family = dviz_font_family,
nudge_x = (9/5)*temps_wide_label$nudge_x,
nudge_y = (9/5)*temps_wide_label$nudge_y
) +
coord_fixed(
xlim = c(45, 85), ylim = c(48, 88),
expand = FALSE
) +
scale_color_continuous_qualitative(guide = "none") +
scale_x_continuous(breaks = c(10*(5:8))) +
xlab("temperature in San Diego (°F)") +
ylab("temperature in Houston (°F)") +
theme_dviz_grid() +
theme(plot.margin = margin(3, 1.5, 3, 1.5))
# Fahrenheit to Celsius conversion
F2C <- function(t) {(t-32)*5/9}
tempsplot_C <- ggplot(temps_wide_label, aes(x = F2C(`San Diego`), y = F2C(`Houston`))) +
geom_path(size = 1, color = "#0072B2") +
geom_text_repel(
aes(label = label), point.padding = .4, color = "black",
min.segment.length = 0, size = 12/.pt,
family = dviz_font_family,
nudge_x = temps_wide_label$nudge_x,
nudge_y = temps_wide_label$nudge_y
) +
coord_fixed(
xlim = F2C(c(45, 85)), ylim = F2C(c(48, 88)),
expand = FALSE
) +
scale_color_continuous_qualitative(guide = "none") +
scale_x_continuous(breaks = c(5*(2:6))) +
xlab("temperature in San Diego (°C)") +
ylab("temperature in Houston (°C)") +
theme_dviz_grid() +
theme(plot.margin = margin(3, 1.5, 3, 1.5))
plot_grid(
tempsplot_F, NULL, tempsplot_C,
labels = c("a", "", "b"), nrow = 1, rel_widths = c(1, .04, 1)
)
```
You may wonder what happens if you change the units of your data. After all, units are arbitrary, and your preferences might be different from somebody else's. A change in units is a linear transformation, where we add or subtract a number to or from all data values and/or multiply all data values by another number. Fortunately, Cartesian coordinate systems are invariant under such linear transformations. Therefore, you can change the units of your data and the resulting figure will not change as long as you change the axes accordingly. As an example, compare Figures \@ref(fig:temperature-normals-Houston-San-Diego)a and \@ref(fig:temperature-normals-Houston-San-Diego)b. Both show the same data, but in part (a) the temperature units are degrees Fahrenheit and in part (b) they are degrees Celsius. Even though the grid lines are in different locations and the numbers along the axes are different, the two data visualizations look exactly the same.
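The invariance can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `rel_pos()` that computes where a value falls between the two axis limits (0 = lower limit, 1 = upper limit). The temperature values and limits are made up for illustration and are not taken from the NOAA data.

```r
# Fahrenheit to Celsius: a linear transformation (subtract 32, multiply by 5/9)
F2C <- function(t) (t - 32) * 5 / 9

# Hypothetical helper: relative position of a value between the axis limits
rel_pos <- function(value, limits) {
  (value - limits[1]) / (limits[2] - limits[1])
}

temps_F  <- c(50, 62.5, 75, 90)  # illustrative values only
limits_F <- c(45, 95)            # illustrative axis limits

# Positions in the plot when working in Fahrenheit
pos_F <- rel_pos(temps_F, limits_F)

# Positions when both the data and the axis limits are converted to Celsius
pos_C <- rel_pos(F2C(temps_F), F2C(limits_F))

all.equal(pos_F, pos_C)  # TRUE
```

Because the data and the axis limits are transformed together, the offset and the scale factor cancel, and every point lands in the same place.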
## Nonlinear axes
In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both in data units and in the resulting visualization. We refer to the position scales in these coordinate systems as *linear*. While linear scales generally provide an accurate representation of the data, there are scenarios where nonlinear scales are preferred. In a nonlinear scale, even spacing in data units corresponds to uneven spacing in the visualization, or conversely even spacing in the visualization corresponds to uneven spacing in data units.
The most commonly used nonlinear scale is the *logarithmic scale* or *log scale* for short. Log scales are linear in multiplication, such that a unit step on the scale corresponds to multiplication with a fixed value. To create a log scale, we need to log-transform the data values while exponentiating the numbers that are shown along the axis grid lines. This process is demonstrated in Figure \@ref(fig:linear-log-scales), which shows the numbers 1, 3.16, 10, 31.6, and 100 placed on linear and log scales. The numbers 3.16 and 31.6 may seem a strange choice, but they were chosen because they are exactly half-way between 1 and 10 and between 10 and 100 on a log scale. We can see this by observing that $10^{0.5} = \sqrt{10} \approx 3.16$ and equivalently $3.16 \times 3.16 \approx 10$. Similarly, $10^{1.5} = 10\times10^{0.5} \approx 31.6$.
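As a quick check of these numbers, here is a small sketch in base R. The variable names are chosen only for illustration.

```r
x <- c(1, 3.16, 10, 31.6, 100)

# After a log10 transformation the values are evenly spaced, in steps of 0.5
round(log10(x), 2)   # 0.0 0.5 1.0 1.5 2.0

# Conversely, evenly spaced exponents give back the original values
signif(10^seq(0, 2, by = 0.5), 3)

# Each half-unit step on the log scale multiplies by sqrt(10)
sqrt(10)             # approximately 3.16
```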
(ref:linear-log-scales) Relationship between linear and logarithmic scales. The dots correspond to data values 1, 3.16, 10, 31.6, 100, which are evenly spaced numbers on a logarithmic scale. We can display these data points on a linear scale, we can log-transform them and then show them on a linear scale, or we can show them on a logarithmic scale. Importantly, the correct axis title for a logarithmic scale is the name of the variable shown, not the logarithm of that variable.
```{r linear-log-scales, fig.width = 6, fig.asp = 3/4, fig.cap = '(ref:linear-log-scales)'}
df <- data.frame(x = c(1, 3.16, 10, 31.6, 100))
xaxis_lin <- ggplot(df, aes(x, y = 1)) +
geom_point(size = 3, color = "#0072B2") +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
theme_dviz_grid(14, rel_large = 1) +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(face = "plain"),
plot.margin = margin(3, 14, 3, 1.5))
xaxis_log <- ggplot(df, aes(log10(x), y = 1)) +
geom_point(size = 3, color = "#0072B2") +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
theme_dviz_grid(14, rel_large = 1) +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(face = "plain"),
plot.margin = margin(3, 14, 3, 1.5))
plotlist <-
align_plots(xaxis_lin + scale_x_continuous(limits = c(0, 100)) +
ggtitle("original data, linear scale"),
xaxis_log + scale_x_continuous(limits = c(0, 2)) +
xlab(expression(paste("log"["10"], "(x)"))) +
ggtitle("log-transformed data, linear scale"),
xaxis_lin + scale_x_log10(limits = c(1, 100), breaks = c(1, 3.16, 10, 31.6, 100),
labels = c("1", "3.16", "10", "31.6", "100")) +
ggtitle("original data, logarithmic scale"),
xaxis_lin + scale_x_log10(limits = c(1, 100), breaks = c(1, 3.16, 10, 31.6, 100),
labels = c("1", "3.16", "10", "31.6", "100")) +
xlab(expression(paste("log"["10"], "(x)"))) +
ggtitle("logarithmic scale with incorrect axis title"),
align = 'vh')
plot_grid(plotlist[[1]], plotlist[[2]], plotlist[[3]], stamp_wrong(plotlist[[4]]), ncol = 1)
```
Mathematically, there is no difference between plotting the log-transformed data on a linear scale or plotting the original data on a logarithmic scale (Figure \@ref(fig:linear-log-scales)). The only difference lies in the labeling for the individual axis ticks and for the axis as a whole. In most cases, the labeling for a logarithmic scale is preferable, because it places less mental burden on the reader to interpret the numbers shown as the axis tick labels. There is also less of a risk of confusion about the base of the logarithm. When working with log-transformed data, we can get confused about whether the data were transformed using the natural logarithm or the logarithm to base 10. And it's not uncommon for labeling to be ambiguous, e.g. "log(x)", which doesn't specify a base at all. I recommend that you always verify the base when working with log-transformed data. When plotting log-transformed data, always specify the base in the labeling of the axis.
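The ambiguity about the base is easy to run into in R itself, where `log()` without a `base` argument is the natural logarithm. A brief illustration:

```r
x <- 100

log(x)             # natural logarithm (base e), approximately 4.605
log10(x)           # base 10: 2
log2(x)            # base 2, approximately 6.644
log(x, base = 10)  # base given explicitly: 2
```

The same value of 100 thus turns into three quite different numbers depending on the base, which is why an axis labeled only "log(x)" cannot be read reliably.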
Because multiplication on a log scale looks like addition on a linear scale, log scales are the natural choice for any data that have been obtained by multiplication or division. In particular, ratios should generally be shown on a log scale. As an example, I have taken the number of inhabitants in each county in Texas and have divided it by the median number of inhabitants across all Texas counties. The resulting ratio is a number that can be larger or smaller than 1. A ratio of exactly 1 implies that the corresponding county has the median number of inhabitants. When visualizing these ratios on a log scale, we can see clearly that the population numbers in Texas counties are symmetrically distributed around the median, and that the most populous counties have over 100 times more inhabitants than the median while the least populous counties have over 100 times fewer inhabitants (Figure \@ref(fig:texas-counties-pop-ratio-log)). By contrast, for the same data, a linear scale obscures the differences between a county with median population number and a county with a much smaller population number than median (Figure \@ref(fig:texas-counties-pop-ratio-lin)).
(ref:texas-counties-pop-ratio-log) Population numbers of Texas counties relative to their median value. Select counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to a county with median population number. The most populous counties have approximately 100 times more inhabitants than the median county, and the least populous counties have approximately 100 times fewer inhabitants than the median county. Data source: 2010 Decennial U.S. Census.
```{r texas-counties-pop-ratio-log, fig.width = 5*6/4.2, fig.asp = 0.6, fig.cap = '(ref:texas-counties-pop-ratio-log)'}
set.seed(3878)
US_census %>% filter(state == "Texas") %>%
select(name, pop2010) %>%
extract(name, "county", regex = "(.+) County") %>%
mutate(popratio = pop2010/median(pop2010)) %>%
arrange(desc(popratio)) %>%
mutate(index = 1:n(),
label = ifelse(index <= 3 | index > n()-3 | runif(n()) < .04, county, ""),
label_large = ifelse(index <= 6, county, "")) -> tx_counties
ggplot(tx_counties, aes(x = index, y = popratio)) +
geom_hline(yintercept = 1, linetype = 2, color = "grey40") +
geom_point(size = 0.5, color = "#0072B2") +
geom_text_repel(aes(label = label), point.padding = .4, color = "black",
min.segment.length = 0, family = dviz_font_family) +
scale_y_log10(breaks = c(.01, .1, 1, 10, 100),
name = "population number / median",
labels = label_log10) +
scale_x_continuous(limits = c(.5, nrow(tx_counties) + .5), expand = c(0, 0),
breaks = NULL, #c(1, 50*(1:5)),
name = "Texas counties, from most to least populous") +
theme_dviz_hgrid() +
theme(axis.line = element_blank(),
plot.margin = margin(3, 7, 3, 1.5))
```
(ref:texas-counties-pop-ratio-lin) Population sizes of Texas counties relative to their median value. By displaying a ratio on a linear scale, we have overemphasized ratios > 1 and have obscured ratios < 1. As a general rule, ratios should not be displayed on a linear scale. Data source: 2010 Decennial U.S. Census.
```{r texas-counties-pop-ratio-lin, fig.width = 5*6/4.2, fig.asp = 0.6, fig.cap = '(ref:texas-counties-pop-ratio-lin)'}
counties_lin <- ggplot(tx_counties, aes(x = index, y = popratio)) +
geom_point(size = 0.5, color = "#0072B2") +
geom_text_repel(aes(label = label_large), point.padding = .4, color = "black",
min.segment.length = 0, family = dviz_font_family) +
scale_y_continuous(name = "population number / median") +
scale_x_continuous(limits = c(.5, nrow(tx_counties) + .5), expand = c(0, 0),
breaks = NULL, #c(1, 50*(1:5)),
name = "Texas counties, from most to least populous") +
theme_dviz_hgrid() +
theme(axis.line = element_blank(),
plot.margin = margin(3, 7, 3, 1.5))
stamp_bad(counties_lin)
```
On a log scale, the value 1 is the natural midpoint, similar to the value 0 on a linear scale. We can think of values greater than 1 as representing multiplications and values less than 1 divisions. For example, we can write $10 = 1\times 10$ and $0.1 = 1/10$. The value 0, on the other hand, can never appear on a log scale. It lies infinitely far from 1. One way to see this is to consider that $\log(0) = -\infty$. Or, alternatively, consider that to go from 1 to 0, it takes either an infinite number of divisions by a finite value (e.g., $1/10/10/10/10/10/10\dots = 0$) or alternatively one division by infinity (i.e., $1/\infty = 0$).
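A short numeric sketch of these statements, using illustrative ratios rather than the census data:

```r
ratios <- c(0.01, 0.1, 1, 10, 100)

# On a log scale the ratios sit symmetrically around 1 (which maps to 0)
log10(ratios)   # -2 -1 0 1 2

# On a linear scale the distances from 1 are highly asymmetric
ratios - 1      # -0.99 -0.90 0.00 9.00 99.00

# Zero has no finite position on a log scale
log10(0)        # -Inf
```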
Log scales are frequently used when the data set contains numbers of very different magnitudes. For the Texas counties shown in Figures \@ref(fig:texas-counties-pop-ratio-log) and \@ref(fig:texas-counties-pop-ratio-lin), the most populous one (Harris) had 4,092,459 inhabitants in the 2010 U.S. Census while the least populous one (Loving) had 82. So a log scale would be appropriate even if we hadn't divided population numbers by their median to turn them into ratios. But what would we do if there were a county with 0 inhabitants? This county could not be shown on the logarithmic scale, because it would lie at minus infinity. In this situation, the recommendation is sometimes to use a square-root scale, which uses a square root transformation instead of a log transformation (Figure \@ref(fig:sqrt-scales)). Just like a log scale, a square-root scale compresses larger numbers into a smaller range, but unlike a log scale, it allows for the presence of 0.
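To make the contrast concrete, the following sketch applies both transformations to the two population numbers quoted above plus a hypothetical county with 0 inhabitants (the zero is an assumption added for illustration).

```r
# Hypothetical empty county, Loving County, Harris County
pop <- c(0, 82, 4092459)

# Log transformation: the empty county ends up at minus infinity
round(log10(pop), 2)   # -Inf 1.91 6.61

# Square-root transformation: the empty county sits at 0, and the large
# value is still compressed relative to the linear scale
round(sqrt(pop))       # 0 9 2023
```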
(ref:sqrt-scales) Relationship between linear and square-root scales. The dots correspond to data values 0, 1, 4, 9, 16, 25, 36, 49, which are evenly spaced numbers on a square-root scale, since they are the squares of the integers from 0 to 7. We can display these data points on a linear scale, we can square-root-transform them and then show them on a linear scale, or we can show them on a square-root scale.
```{r sqrt-scales, fig.width = 6, fig.asp = 3*(3/4)/4, fig.cap = '(ref:sqrt-scales)'}
df <- data.frame(x = c(0, 1, 4, 9, 16, 25, 36, 49))
xaxis_lin <- ggplot(df, aes(x, y = 1)) +
geom_point(size = 3, color = "#0072B2") +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
theme_dviz_grid(14, rel_large = 1) +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(face = "plain"),
plot.margin = margin(3, 14, 3, 1.5))
xaxis_sqrt <- ggplot(df, aes(sqrt(x), y = 1)) +
geom_point(size = 3, color = "#0072B2") +
scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
theme_dviz_grid(14, rel_large = 1) +
theme(axis.ticks.length = grid::unit(0, "pt"),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(face = "plain"),
plot.margin = margin(3, 14, 3, 1.5))
plotlist <-
align_plots(xaxis_lin + scale_x_continuous(limits = c(0, 50)) +
ggtitle("original data, linear scale"),
xaxis_sqrt + scale_x_continuous(limits = c(0, 7.07)) +
xlab(expression(sqrt(x))) +
ggtitle("square-root-transformed data, linear scale"),
xaxis_sqrt + scale_x_continuous(limits = c(0, 7.07), breaks = c(0, 1, sqrt(5), sqrt(10*(1:5))),
labels = c(0, 1, 5, 10*(1:5)), name = "x") +
expand_limits(expand = c(0, 1)) +
ggtitle("original data, square-root scale"),
align = 'vh')
plot_grid(plotlist[[1]], plotlist[[2]], plotlist[[3]], ncol = 1)
```
I see two problems with square-root scales. First, while on a linear scale one unit step corresponds to addition or subtraction of a constant value and on a log scale it corresponds to multiplication with or division by a constant value, no such rule exists for a square-root scale. The meaning of a unit step on a square-root scale depends on the scale value at which we're starting. Second, it is unclear how to best place axis ticks on a square-root scale. To obtain evenly spaced ticks, we would have to place them at squares, but axis ticks at, for example, positions 0, 4, 16, 36, 64 (every second square) would be highly unintuitive. Alternatively, we could place them at linear intervals (10, 20, 30, etc.), but this would result in either too few axis ticks near the low end of the scale or too many near the high end. In Figure \@ref(fig:sqrt-scales), I have placed the axis ticks at positions 0, 1, 5, 10, 20, 30, 40, and 50 on the square-root scale. These values are arbitrary but provide a reasonable covering of the data range.
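The first problem can be seen directly from a few numbers. This is a minimal sketch with illustrative variable names: `s` holds evenly spaced positions on a square-root scale, and `x` holds the data values found at those positions.

```r
s <- 1:7   # evenly spaced positions on a square-root scale
x <- s^2   # data values at those positions: 1 4 9 16 25 36 49

# A unit step does not add a constant amount ...
diff(x)                 # 3 5 7 9 11 13

# ... and it does not multiply by a constant factor either
step_ratio <- x[-1] / x[-length(x)]
round(step_ratio, 2)    # starts at 4 and shrinks towards 1
```

Algebraically, a unit step from position $s$ changes the data value by $(s+1)^2 - s^2 = 2s + 1$, which depends on the starting point.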
Despite these problems with square-root scales, they are valid position scales and I do not discount the possibility that they have appropriate applications. For example, just like a log scale is the natural scale for ratios, one could argue that the square-root scale is the natural scale for data that come in squares. One scenario in which data are naturally squares is in the context of geographic regions. If we show the areas of geographic regions on a square-root scale, we are highlighting the regions' linear extent from East to West or North to South. These extents could be relevant, for example, if we are wondering how long it might take to drive across a region. Figure \@ref(fig:northeast-state-areas) shows the areas of states in the U.S. Northeast on both a linear and a square-root scale. Even though the areas of these states are quite different (Figure \@ref(fig:northeast-state-areas)a), the time it will take to drive across each state will more closely resemble the figure on the square-root scale (Figure \@ref(fig:northeast-state-areas)b) than the figure on the linear scale (Figure \@ref(fig:northeast-state-areas)a).
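As a rough illustration of this argument, the sketch below compares the largest and smallest states in the figure, using the same area values (in square miles) as the figure code. Treating the square root of an area as a stand-in for linear extent is a simplifying assumption, since real states are not squares.

```r
# Areas in square miles, as used for the figure
area <- c(NY = 54556, RI = 1212)

# Ratio of areas: New York is about 45 times larger than Rhode Island
area_ratio <- unname(area["NY"] / area["RI"])
round(area_ratio, 1)     # 45

# Ratio of approximate linear extents: only about 6.7 times larger
extent_ratio <- sqrt(area_ratio)
round(extent_ratio, 1)   # 6.7
```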
(ref:northeast-state-areas) Areas of Northeastern U.S. states. (a) Areas shown on a linear scale. (b) Areas shown on a square-root scale. Data source: Google.
```{r northeast-state-areas, fig.width = 5.5*6/4.2, fig.asp = 0.4, fig.cap = '(ref:northeast-state-areas)'}
# areas in square miles
# source: Google, 01/07/2018
northeast_areas <- read.csv(text = "state_abr,area
NY,54556
PA,46055
ME,35385
MA,10565
VT,9616
NH,9349
NJ,8723
CT,5543
RI,1212")
northeast_areas$state_abr <- factor(northeast_areas$state_abr, levels = northeast_areas$state_abr)
areas_base <- ggplot(northeast_areas, aes(x = state_abr, y = area)) +
geom_col(fill = "#56B4E9") +
ylab("area (square miles)") +
xlab("state") +
theme_dviz_hgrid() +
theme(plot.margin = margin(3, 1.5, 3, 1.5))
p1 <- areas_base + scale_y_sqrt(limits = c(0, 55000), breaks = c(0, 1000, 5000, 10000*(1:5)),
expand = c(0, 0))
p2 <- areas_base + scale_y_continuous(limits = c(0, 55000), breaks = 10000*(0:6), expand = c(0, 0))
plot_grid(
p2, NULL, p1,
labels = c("a", "", "b"), nrow = 1, rel_widths = c(1, .04, 1)
)
```
## Coordinate systems with curved axes
All coordinate systems we have encountered so far used two straight axes positioned at a right angle to each other, even if the axes themselves established a non-linear mapping from data values to positions. There are other coordinate systems, however, where the axes themselves are curved. In particular, in the *polar* coordinate system, we specify positions via an angle and a radial distance from the origin, and therefore the angle axis is circular (Figure \@ref(fig:polar-coord)).
(ref:polar-coord) Relationship between Cartesian and polar coordinates. (a) Three data points shown in a Cartesian coordinate system. (b) The same three data points shown in a polar coordinate system. We have taken the *x* coordinates from part (a) and used them as angular coordinates and the *y* coordinates from part (a) and used them as radial coordinates. The circular axis runs from 0 to 4 in this example, and therefore *x* = 0 and *x* = 4 are the same locations in this coordinate system.
```{r polar-coord, fig.width = 5*6/4.2, fig.asp = 0.5, fig.cap = '(ref:polar-coord)'}
df_points <- data.frame(x = c(1, 3.5, 0),
y = c(3, 4, 0),
label = c("(1, 3)", "(3.5, 4)", "(0, 0)"),
vjust_polar = c(1.6, 1, 1.6),
hjust_polar = c(.5, -.1, 0.5),
vjust_cart = c(1.6, 1.6, -.6),
hjust_cart = c(0.5, 1.1, -.1))
df_segments <- data.frame(x0 = c(0, 1, 2, 3, 0, 0, 0, 0),
x1 = c(0, 1, 2, 3, 4, 4, 4, 4),
y0 = c(0, 0, 0, 0, 1, 2, 3, 4),
y1 = c(4, 4, 4, 4, 1, 2, 3, 4))
p_cart <- ggplot(df_points, aes(x, y)) +
geom_point(size = 2, color = "#0072B2") +
geom_text(aes(label = label, vjust = vjust_cart, hjust = hjust_cart),
size = 12/.pt, family = dviz_font_family) +
scale_x_continuous(limits = c(-0.5, 4.5), expand = c(0, 0)) +
scale_y_continuous(limits = c(-0.5, 4.5), expand = c(0, 0)) +
coord_fixed() +
xlab("x axis") +
ylab("y axis") +
theme_dviz_grid(12) +
theme(axis.ticks = element_blank(),
axis.ticks.length = grid::unit(0, "pt"),
plot.margin = margin(3, 1.5, 3, 1.5))
p_polar <- ggplot(df_points, aes(x, y)) +
geom_segment(
data = df_segments,
aes(x = x0, xend = x1, y = y0, yend = y1),
size = theme_dviz_grid()$panel.grid$size,
color = theme_dviz_grid()$panel.grid$colour,
inherit.aes = FALSE
) +
geom_point(size = 2, color = "#0072B2") +
geom_text(
aes(label = label, vjust = vjust_polar, hjust = hjust_polar),
size = 12/.pt, family = dviz_font_family
) +
scale_x_continuous(limits = c(0, 4)) +
scale_y_continuous(limits = c(0, 4)) +
coord_polar() +
xlab("x values (circular axis)") +
ylab("y values (radial axis)") +
theme_dviz_grid(12) +
background_grid(major = "none") +
theme(axis.line.x = element_blank(),
axis.ticks = element_line(color = "black"),
plot.margin = margin(3, 1.5, 3, 1.5))
plot_grid(
p_cart, NULL, p_polar,
labels = c("a", "", "b"), nrow = 1, rel_widths = c(1, .04, 1)
)
```
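
The mapping shown in part (b) can be written out explicitly. The following is a minimal sketch in base R, not the code that ggplot2 uses internally. It assumes that the circular axis spans 0 to 4 and that angles are measured clockwise from the 12 o'clock position; the helper name `polar_to_screen()` is hypothetical, and the result is not rescaled to fit a plot panel.

```r
# Hypothetical helper: convert (x, y) data values into unscaled screen
# positions for a polar plot whose circular axis runs from x_min to x_max.
polar_to_screen <- function(x, y, x_min = 0, x_max = 4) {
  # fraction of a full turn, expressed as an angle in radians
  theta <- 2 * pi * (x - x_min) / (x_max - x_min)
  # clockwise from 12 o'clock: horizontal uses sin, vertical uses cos
  data.frame(
    theta = theta,
    h = y * sin(theta),
    v = y * cos(theta)
  )
}

# the three points from the figure
pts <- polar_to_screen(x = c(1, 3.5, 0), y = c(3, 4, 0))
pts

# x = 0 and x = 4 land on the same spot, up to floating-point error
wrap <- polar_to_screen(x = c(0, 4), y = c(2, 2))
all.equal(wrap$h[1], wrap$h[2])
all.equal(wrap$v[1], wrap$v[2])
```

Under these assumptions, the point (1, 3) sits a quarter turn from the top, at the 3 o'clock position, and the point (0, 0) sits at the center regardless of its angle.
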
Polar coordinates can be useful for data of a periodic nature, such that data values at one end of the scale can be logically joined to data values at the other end. For example, consider the days in a year. December 31st is the last day of the year, but it is also one day before the first day of the year. If we want to show how some quantity varies over the year, it can be appropriate to use polar coordinates with the angle coordinate specifying each day. Let's apply this concept to the temperature normals of Figure \@ref(fig:temp-normals-vs-time). Because temperature normals are average temperatures that are not tied to any specific year, Dec. 31st can be thought of as 365 days later than Jan. 1st (temperature normals include Feb. 29) and also one day earlier. By plotting the temperature normals in a polar coordinate system, we emphasize their cyclical nature (Figure \@ref(fig:temperature-normals-polar)). In comparison to Figure \@ref(fig:temp-normals-vs-time), the polar version highlights how similar the temperatures are in Death Valley, Houston, and San Diego from late fall to early spring. In the Cartesian coordinate system, this fact is obscured because the temperature values in late December and in early January are shown in opposite parts of the figure and therefore don't form a single visual unit.
(ref:temperature-normals-polar) Daily temperature normals for four selected locations in the U.S., shown in polar coordinates. The radial distance from the center point indicates the daily temperature in Fahrenheit, and the days of the year are arranged counter-clockwise starting with Jan. 1st at the 6:00 position.
```{r temperature-normals-polar, fig.width = 6, fig.cap = '(ref:temperature-normals-polar)'}
temps_long <- gather(temps_wide, location, temperature, -month, -day, -date) %>%
filter(location %in% c("Chicago",
"Death Valley",
"Houston",
"San Diego")) %>%
mutate(location = factor(location, levels = c("Death Valley",
"Houston",
"San Diego",
"Chicago")))
ggplot(temps_long, aes(x = date, y = temperature, color = location)) +
geom_line(size = 1) +
scale_x_date(name = "date", expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 105), expand = c(0, 0),
breaks = seq(-30, 90, by = 30),
name = "temperature (°F)") +
scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
coord_polar(theta = "x", start = pi, direction = -1) +
theme_dviz_grid()
```
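
The wrap-around logic behind this figure can be made concrete with a small calculation. This is a sketch under the assumption of a 366-day cycle (the temperature normals include Feb. 29); `circular_day_distance()` is a hypothetical helper, not part of any package used here.

```r
# Hypothetical helper: shortest distance between two days of the year
# when the year is treated as a closed cycle of `period` days.
circular_day_distance <- function(day1, day2, period = 366) {
  d <- abs(day1 - day2) %% period
  # going around the circle the other way may be shorter
  pmin(d, period - d)
}

# Jan. 1st is day 1 and Dec. 31st is day 366 in a cycle that includes Feb. 29
circular_day_distance(1, 366)  # neighbors on the circle
circular_day_distance(1, 184)  # half a cycle apart
```

On a linear axis these two pairs of days would sit 365 and 183 units apart. On the circular axis the first pair becomes adjacent, which is why late December and early January form a single visual unit in the polar plot.
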
A second setting in which we encounter curved axes is in the context of geospatial data, i.e., maps. Locations on the globe are specified by their longitude and latitude. But because the earth is a sphere, drawing latitude and longitude as Cartesian axes is misleading and not recommended (Figure \@ref(fig:worldmap-four-projections)). Instead, we use various types of non-linear projections that attempt to minimize artifacts and that strike different balances between conserving areas and conserving angles relative to the true shapes on the globe (Figure \@ref(fig:worldmap-four-projections)).
(ref:worldmap-four-projections) Map of the world, shown in four different projections. The Cartesian longitude and latitude system maps the longitude and latitude of each location onto a regular Cartesian coordinate system. This mapping causes substantial distortions in both areas and angles relative to their true values on the 3D globe. The interrupted Goode homolosine projection perfectly represents true surface areas, at the cost of dividing some land masses into separate pieces, most notably Greenland and Antarctica. The Robinson projection and the Winkel tripel projection both strike a balance between angular and area distortions, and they are commonly used for maps of the entire globe.
```{r worldmap-four-projections, fig.width = 5.5*6/4.2, fig.cap = '(ref:worldmap-four-projections)'}
library(sf)
world_sf <- sf::st_as_sf(rworldmap::getMap(resolution = "low"))
## world in long-lat
crs_longlat <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
p_longlat <- ggplot(world_sf) +
geom_sf(fill = "#E69F00B0", color = "black", size = 0.5/.pt) +
coord_sf(expand = FALSE, crs = crs_longlat) +
scale_x_continuous(
name = "longitude",
breaks = seq(-160, 160, by = 20),
labels = parse(text = c("NA", "NA", "120*degree*W", "NA", "NA", "60*degree*W", "NA", "NA", "0*degree", "NA", "NA", "60*degree*E", "NA", "NA", "120*degree*E", "NA", "NA"))
) +
scale_y_continuous(
name = "latitude",
breaks = seq(-80, 80, by = 20),
labels = parse(text = c("80*degree*S", "NA", "40*degree*S", "NA", "0*degree", "NA", "40*degree*N", "NA", "80*degree*N"))
) +
theme_dviz_grid(12) +
theme(
panel.background = element_rect(fill = "#56B4E950", color = "grey30", size = 0.5),
panel.grid.major = element_line(color = "gray30", size = 0.25),
axis.ticks = element_line(color = "gray30", size = 0.5/.pt),
plot.margin = margin(5, 10, 1.5, 1.5)
)
## Interrupted Goode homolosine
crs_goode <- "+proj=igh"
# projection outline in long-lat coordinates
lats <- c(
90:-90, # right side down
-90:0, 0:-90, # third cut bottom
-90:0, 0:-90, # second cut bottom
-90:0, 0:-90, # first cut bottom
-90:90, # left side up
90:0, 0:90, # cut top
90 # close
)
longs <- c(
rep(180, 181), # right side down
rep(c(80.01, 79.99), each = 91), # third cut bottom
rep(c(-19.99, -20.01), each = 91), # second cut bottom
rep(c(-99.99, -100.01), each = 91), # first cut bottom
rep(-180, 181), # left side up
rep(c(-40.01, -39.99), each = 91), # cut top
180 # close
)
goode_outline <-
list(cbind(longs, lats)) %>%
st_polygon() %>%
st_sfc(
crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
) %>%
st_transform(crs = crs_goode)
# bounding box in transformed coordinates
xlim_goode <- c(-21945470, 21963330)
ylim_goode <- c(-9538022, 9266738)
goode_bbox <-
list(
cbind(
c(xlim_goode[1], xlim_goode[2], xlim_goode[2], xlim_goode[1], xlim_goode[1]),
c(ylim_goode[1], ylim_goode[1], ylim_goode[2], ylim_goode[2], ylim_goode[1])
)
) %>%
st_polygon() %>%
st_sfc(crs = crs_goode)
# area outside the earth outline
goode_without <- st_difference(goode_bbox, goode_outline)
p_goode <- ggplot(world_sf) +
geom_sf(fill = "#E69F00B0", color = "black", size = 0.5/.pt) +
geom_sf(data = goode_without, fill = "white", color = NA) +
geom_sf(data = goode_outline, fill = NA, color = "grey30", size = 0.5/.pt) +
scale_x_continuous(
name = NULL,
breaks = seq(-160, 160, by = 20)
) +
scale_y_continuous(
name = NULL,
breaks = seq(-80, 80, by = 20)
) +
coord_sf(xlim = 0.95*xlim_goode, ylim = 0.95*ylim_goode, expand = FALSE, crs = crs_goode, ndiscr = 1000) +
theme_dviz_grid(12, rel_small = 1) +
theme(
panel.background = element_rect(fill = "#56B4E950", color = "white", size = 1),
panel.grid.major = element_line(color = "gray30", size = 0.25),
plot.margin = margin(1.5, 1.5, 24, 1.5)
)
## Robinson projection
crs_robin <- "+proj=robin +lat_0=0 +lon_0=0 +x_0=0 +y_0=0"
# projection outline in long-lat coordinates
lats <- c(90:-90, -90:90, 90)
longs <- c(rep(c(180, -180), each = 181), 180)
robin_outline <-
list(cbind(longs, lats)) %>%
st_polygon() %>%
st_sfc(
crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
) %>%
st_transform(crs = crs_robin)
# bounding box in transformed coordinates
xlim_robin <- c(-18494733, 18613795)
ylim_robin <- c(-9473396, 9188587)
robin_bbox <-
list(
cbind(
c(xlim_robin[1], xlim_robin[2], xlim_robin[2], xlim_robin[1], xlim_robin[1]),
c(ylim_robin[1], ylim_robin[1], ylim_robin[2], ylim_robin[2], ylim_robin[1])
)
) %>%
st_polygon() %>%
st_sfc(crs = crs_robin)
# area outside the earth outline
robin_without <- st_difference(robin_bbox, robin_outline)
p_robin <- ggplot(world_sf) +
geom_sf(fill = "#E69F00B0", color = "black", size = 0.5/.pt) +
geom_sf(data = robin_without, fill = "white", color = NA) +
geom_sf(data = robin_outline, fill = NA, color = "grey30", size = 0.5/.pt) +
scale_x_continuous(
name = NULL,
breaks = seq(-160, 160, by = 20)
) +
scale_y_continuous(
name = NULL,
breaks = seq(-80, 80, by = 20)
) +
coord_sf(xlim = 0.95*xlim_robin, ylim = 0.95*ylim_robin, expand = FALSE, crs = crs_robin, ndiscr = 1000) +
theme_dviz_grid(12, rel_small = 1) +
theme(
panel.background = element_rect(fill = "#56B4E950", color = "white", size = 1),
panel.grid.major = element_line(color = "gray30", size = 0.25),
plot.margin = margin(6, 1.5, 1.5, 1.5)
)
## Winkel tripel
# The Winkel tripel projection is not supported by sf, so the transformation is done manually via lwgeom.
crs_wintri <- "+proj=wintri +datum=WGS84 +no_defs +over"
# world
world_wintri <- lwgeom::st_transform_proj(world_sf, crs = crs_wintri)
# graticule
grat_wintri <- sf::st_graticule(lat = c(-89.9, seq(-80, 80, 20), 89.9))
grat_wintri <- lwgeom::st_transform_proj(grat_wintri, crs = crs_wintri)
# earth outline
lats <- c(90:-90, -90:90, 90)
longs <- c(rep(c(180, -180), each = 181), 180)
wintri_outline <-
list(cbind(longs, lats)) %>%
st_polygon() %>%
st_sfc(
crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
) %>%
lwgeom::st_transform_proj(crs = crs_wintri)
p_wintri <- ggplot() +
geom_sf(data = wintri_outline, fill = "#56B4E950", color = NA) +
geom_sf(data = grat_wintri, color = "gray30", size = 0.25/.pt) +
geom_sf(data = world_wintri, fill = "#E69F00B0", color = "black", size = 0.5/.pt) +
geom_sf(data = wintri_outline, fill = NA, color = "grey30", size = 0.5/.pt) +
coord_sf(datum = NA, expand = FALSE) +
theme_dviz_grid(12, rel_small = 1) +
theme(
plot.margin = margin(6, 1.5, 3, 1.5)
)
p <- plot_grid(
p_longlat, p_goode, p_robin, p_wintri,
labels = c(
"Cartesian longitude and latitude", "Interrupted Goode homolosine",
"Robinson", "Winkel tripel"
)
)
p + theme(plot.margin = margin(1.5, 0, 0, 0))
```
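
To get a feel for how strongly the Cartesian longitude and latitude system distorts areas, we can compute the east-west stretching at a few latitudes. The sketch below assumes a perfectly spherical earth and uses a hypothetical helper, `longlat_stretch()`. It relies on the fact that a circle of latitude shrinks in proportion to the cosine of the latitude, whereas the Cartesian plot draws every such circle at the same width.

```r
# Hypothetical helper: factor by which a plain longitude-latitude plot
# stretches east-west distances at a given latitude (spherical earth assumed).
longlat_stretch <- function(lat_deg) {
  1 / cos(lat_deg * pi / 180)
}

lats <- c(0, 30, 60, 80)
data.frame(latitude = lats, stretch = round(longlat_stretch(lats), 2))
```

Because north-south distances are drawn to the same scale everywhere in this system, the same factor also describes how much areas are inflated. At 60 degrees latitude the factor is 2, and it grows without bound toward the poles, which is consistent with the oversized polar regions in the Cartesian panel.
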