forked from clauswilke/dataviz
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathoverlapping_points.Rmd
380 lines (314 loc) · 25.1 KB
/
overlapping_points.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(lubridate)
```
# Handling overlapping points {#overlapping-points}
When we want to visualize large or very large datasets, we often experience the challenge that simple *x*--*y* scatter plots do not work very well because many points lie on top of each other and partially or fully overlap. And similar problems can arise even in small datasets if data values were recorded with low precision or rounded, such that multiple observations have exactly the same numeric values. The technical term commonly used to describe this situation is "overplotting", i.e., plotting many points on top of each other. Here I describe several strategies you can pursue when encountering this challenge.
## Partial transparency and jittering
We first consider a scenario with only a moderate number of data points but with extensive rounding. Our dataset contains fuel economy during city driving and engine displacement for 234 popular car models released between 1999 and 2008 (Figure \@ref(fig:mpg-cty-displ-solid)). In this dataset, fuel economy is measured in miles per gallon (mpg) and is rounded to the nearest integer value. Engine displacement is measured in liters and is rounded to the nearest deciliter. Due to this rounding, many car models have exactly identical values. For example, there are 21 cars total with 2.0 liter engine displacement, and as a group they have only four different fuel economy values, 19, 20, 21, or 22 mpg. Therefore, in Figure \@ref(fig:mpg-cty-displ-solid) these 21 cars are represented by only four distinct points, so that 2.0 liter engines appear much less popular than they actually are. Moreover, the dataset contains two four-wheel drive cars with 2.0 liter engines, which are represented by black dots. However, these black dots are fully occluded by yellow dots, so that it looks like there are no four-wheel drive cars with a 2.0 liter engine.
(ref:mpg-cty-displ-solid) City fuel economy versus engine displacement, for popular cars released between 1999 and 2008. Each point represents one car. The point color encodes the drive train: front-wheel drive (FWD), rear-wheel drive (RWD), or four-wheel drive (4WD). The figure is labeled "bad" because many points are plotted on top of others and obscure them.
```{r mpg-cty-displ-solid, fig.width=5.5, fig.asp=.7416, fig.cap='(ref:mpg-cty-displ-solid)'}
p_mpg_solid <- ggplot(mpg, aes(y = cty, x = displ, color = drv, fill = drv)) +
geom_point(size = 3, shape = 21) +
ylab("fuel economy (mpg)") +
xlab("displacement (l)") +
scale_color_manual(values=c("#202020", "#E69F00", "#56B4E9"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
scale_fill_manual(values=c("#202020", "#E69F00", "#56B4E9"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
theme_dviz_open() +
theme(legend.position = c(.7, .8),
plot.margin = margin(3, 7, 3, 1.5))
stamp_bad(p_mpg_solid)
```
One way to ameliorate this problem is to use partial transparency. If we make individual points partially transparent, then overplotted points appear as darker points and thus the shade of the points reflects the density of points in that location of the graph (Figure \@ref(fig:mpg-cty-displ-transp)).
(ref:mpg-cty-displ-transp) City fuel economy versus engine displacement. Because points have been made partially transparent, points that lie on top of other points can now be identified by their darker shade.
```{r mpg-cty-displ-transp, fig.width=5.5, fig.asp=.7416, fig.cap='(ref:mpg-cty-displ-transp)'}
p_mpg_transp <- ggplot(mpg, aes(y = cty, x = displ, color = drv, fill = drv)) +
geom_point(size = 3, shape = 21) +
ylab("fuel economy (mpg)") +
xlab("displacement (l)") +
scale_color_manual(values=c("#202020", "#E69F00", "#56B4E9"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
scale_fill_manual(values=c("#20202080", "#E69F0080", "#56B4E980"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
theme_dviz_open() +
theme(legend.position = c(.7, .8),
plot.margin = margin(3, 7, 3, 1.5))
p_mpg_transp
```
However, making points partially transparent is not always sufficient to solve the issue of overplotting. For example, even though we can see in Figure \@ref(fig:mpg-cty-displ-transp) that some points have a darker shade than others, it is difficult to estimate how many points were plotted on top of each other in each location. In addition, while the differences in shading are clearly visible, they are not self-explanatory. A reader who sees this figure for the first time will likely wonder why some points are darker than others and will not realize that those points are in fact multiple points stacked on top of each other. A simple trick that helps in this situation is to apply a small amount of jitter to the points, i.e., to displace each point randomly by a small amount in either the *x* or the *y* direction or both. With jitter, it is immediately apparent that the darker areas arise from points that are plotted on top of each other (Figure \@ref(fig:mpg-cty-displ-jitter)). Also, now, for the first time the black dots representing four-wheel drive cars with 2.0 liter engines are clearly visible.
(ref:mpg-cty-displ-jitter) City fuel economy versus engine displacement. By adding a small amount of jitter to each point, we can make the overplotted points more clearly visible without substantially distorting the message of the plot.
```{r mpg-cty-displ-jitter, fig.width=5.5, fig.asp=.7416, fig.cap='(ref:mpg-cty-displ-jitter)'}
p_mpg_jitter <- ggplot(mpg, aes(y = cty, x = displ, color = drv, fill = drv)) +
geom_point(size = 3, shape = 21,
position = position_jitter(width = 0.01 * diff(range(mpg$displ)),
height = 0.01 * diff(range(mpg$cty)),
seed = 7384)) +
ylab("fuel economy (mpg)") +
xlab("displacement (l)") +
scale_color_manual(values=c("#202020", "#E69F00", "#56B4E9"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
scale_fill_manual(values=c("#20202080", "#E69F0080", "#56B4E980"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
theme_dviz_open() +
theme(legend.position = c(.7, .8),
plot.margin = margin(3, 7, 3, 1.5))
p_mpg_jitter
```
One downside of jittering is that it does change the data and therefore has to be performed with care. If we jitter too much, we end up placing points in locations that are not representative of the underlying dataset. The result is a misleading visualization of the data. See Figure \@ref(fig:mpg-cty-displ-jitter-extreme) as an example.
(ref:mpg-cty-displ-jitter-extreme) City fuel economy versus engine displacement. By adding too much jitter to the points, we have created a visualization that does not accurately reflect the underlying dataset.
```{r mpg-cty-displ-jitter-extreme, fig.width=5.5, fig.asp=.7416, fig.cap='(ref:mpg-cty-displ-jitter-extreme)'}
p_mpg_jitter_extreme <- ggplot(mpg, aes(y = cty, x = displ, color = drv, fill = drv)) +
geom_point(size = 3, shape = 21,
position = position_jitter(width = 0.1 * diff(range(mpg$displ)),
height = 0.1 * diff(range(mpg$cty)))) +
scale_x_continuous(breaks = 2:7) +
ylab("fuel economy (mpg)") +
xlab("displacement (l)") +
scale_color_manual(values=c("#202020", "#E69F00", "#56B4E9"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
scale_fill_manual(values=c("#20202080", "#E69F0080", "#56B4E980"),
name="drive train",
breaks=c("f", "r", "4"),
labels=c("FWD", "RWD", "4WD")) +
theme_dviz_open() +
theme(legend.position = c(.7, .8),
plot.margin = margin(3, 7, 3, 1.5))
stamp_bad(p_mpg_jitter_extreme)
```
## 2D histograms
When the number of individual points gets very large, partial transparency (with or without jittering) will not be sufficient to resolve the overplotting issue. What will typically happen is that areas with high point density will appear as uniform blobs of dark color while in areas with low point density the individual points are barely visible (Figure \@ref(fig:nycflights-points)). And changing the transparency level of individual points will either ameliorate one or the other of these problems while worsening the other; no transparency setting can address both at the same time.
(ref:nycflights-points) Departure delay in minutes versus the flight departure time, for all flights departing Newark airport (EWR) in 2013. Each dot represents one departure.
```{r nycflights-points, fig.asp = 0.75, fig.cap = '(ref:nycflights-points)'}
# break points along the x axis
breaks_x <- c("0:00", "6:00", "12:00", "18:00", "24:00")
p_flights_base <- ggplot(flight_delays, aes(`departure time`, `departure delay (minutes)`)) +
geom_abline(slope = 0, intercept = 0, color="grey70") +
scale_x_time(
name = "departure time",
breaks = hm(breaks_x),
labels = breaks_x
) +
scale_y_continuous(
name = "departure delay (minutes)"
) +
theme_dviz_open() +
theme(plot.margin = margin(3, 7, 3, 1.5))
p_flights_scatter <- p_flights_base + geom_point(alpha = 0.1, stroke = 0, size = 2)
stamp_bad(p_flights_scatter)
```
Figure \@ref(fig:nycflights-points) shows departure delays for over 100,000 individual flights, with each dot representing one flight departure. Even though we have made the individual dots fairly transparent, the majority of them just forms a black band between 0 and 300 minutes departure delay. This band obscures whether most flights depart approximately on time or with substantial delay (say 50 minutes or more). At the same time, the most delayed flights (with delays of 400 minutes or more) are barely visible due to the transparency of the dots.
In such cases, instead of plotting individual points, we can make a 2D histogram. A 2D histogram is conceptually similar to a 1D histogram as discussed in Chapter \@ref(histograms-density-plots), but now we bin the data in two dimensions. We subdivide the entire *x*--*y* plane into small rectangles, count how many observations fall into each one, and then color the rectangles by that count. Figure \@ref(fig:nycflights-2d-bins) shows the result of this approach for the departure-delay data. This visualization clearly highlights several important features of the flight-departure data. First, the vast majority of departures during the day (6am to about 9pm) actually depart without delay or even early (negative delay). However, a modest number of departures has a substantial delay. Moreover, the later a plane departs in the day the more of a delay it can have. Importantly, the departure time is the actual time of departure, not the scheduled time of departure. So this figure does not necessarily tell us that planes scheduled to depart early never experience delay. What it does tell us, though, is that if a plane departs early it either has little delay or, in very rare cases, a delay of around 900 minutes.
(ref:nycflights-2d-bins) Departure delay in minutes versus the flight departure time. Each colored rectangle represents all flights departing at that time with that departure delay. Coloring represents the number of flights represented by that rectangle.
```{r nycflights-2d-bins, fig.asp = 0.75, fig.cap = '(ref:nycflights-2d-bins)'}
p_flights_2d_bins <- p_flights_base +
geom_bin2d(bins=50) +
#scale_fill_continuous_sequential(palette = "Blue-Yellow", l2 = 90, c2 = 20) +
scale_fill_continuous_sequential(
h1 = -83, h2 = 20, c1 = 30, cmax = 40, c2 = 0, l1 = 20, l2 = 94, p1 = 1, p2 = 1.2,
rev = TRUE,
begin = 0.2,
name = "departures"
) +
theme(legend.position = c(0.85, .85))
p_flights_2d_bins
```
As an alternative to binning the data into rectangle, we can also bin into hexagons. This approach, first proposed by
@Carr-et-al-1987, has the advantage that the points in a hexagon are, on average, closer to the hexagon center than the points in an equal-area square are to the center of the square. Therefore, the colored hexagon represents the data slightly more accurately than the colored rectangle does. Figure \@ref(fig:nycflights-hex-bins) shows the flight departure data with hexagon binning rather than rectangular binning.
(ref:nycflights-hex-bins) Departure delay in minutes versus the flight departure time. Each colored hexagon represents all flights departing at that time with that departure delay. Coloring represents the number of flights represented by that hexagon.
```{r nycflights-hex-bins, fig.asp = 0.75, fig.cap = '(ref:nycflights-hex-bins)'}
p_flights_hex_bins <- p_flights_base +
geom_hex(aes(colour = stat(count)), bins = 50) +
#scale_fill_continuous_sequential(palette = "Blue-Yellow", l2 = 90, c2 = 20) +
scale_fill_continuous_sequential(
aesthetics = c("colour", "fill"),
h1 = -83, h2 = 20, c1 = 30, cmax = 40, c2 = 0, l1 = 20, l2 = 94, p1 = 1, p2 = 1.2,
rev = TRUE,
begin = 0.2,
name = "departures"
) +
theme(legend.position = c(0.85, .85))
p_flights_hex_bins
```
## Contour lines
Instead of binning data points into rectangles or hexagons, we can also estimate the point density across the plot area and indicate regions of different point densities with contour lines. This technique works well when the point density changes slowly across both the *x* and the *y* dimensions.
As an example for this approach, we return to the blue jays dataset from Chapter \@ref(visualizing-associations). Figure \@ref(fig:blue-jays-scatter) showed the relationship between head length and body mass for 123 blue jays, and there was some amount of overlap among the points. We can highlight the distribution of points more clearly by making the points smaller and partially transparent and plotting them on top of contour lines that delineate regions of similar point density (Figure \@ref(fig:blue-jays-contour)). We can further enhance the perception of changes in the point density by shading the regions enclosed by the contour lines, using darker colors for regions representing higher point densities (Figure \@ref(fig:blue-jays-contour-filled)).
(ref:blue-jays-contour) Head length versus body mass for 123 blue jays, as in Figure \@ref(fig:blue-jays-scatter). Each dot corresponds to one bird, and the lines indicate regions of similar point density. The point density increases towards the center of the plot, near a body mass of 75g and a head length between 55mm and 57.5mm. Data source: Keith Tarvin, Oberlin College
```{r blue-jays-contour, fig.width = 6, fig.asp = 3/4, fig.cap='(ref:blue-jays-contour)'}
blue_jays_base <- ggplot(blue_jays, aes(Mass, Head)) +
scale_x_continuous(
limits = c(57, 82),
expand = c(0, 0),
name = "body mass (g)") +
scale_y_continuous(
limits = c(49, 61),
expand = c(0, 0),
name = "head length (mm)"
) +
theme_dviz_grid()
blue_jays_base +
stat_density_2d(color = "black", size = 0.4, binwidth = 0.004) +
geom_point(color = "black", size = 1.5, alpha = 1/3)
```
(ref:blue-jays-contour-filled) Head length versus body mass for 123 blue jays. This figure is nearly identical to Figure \@ref(fig:blue-jays-scatter), but now the areas enclosed by the contour lines are shaded with increasingly darker shades of gray. This shading creates a stronger visual impression of increasing point density towards the center of the point cloud. Data source: Keith Tarvin, Oberlin College
```{r blue-jays-contour-filled, fig.width = 6, fig.asp = 3/4, fig.cap='(ref:blue-jays-contour-filled)'}
blue_jays_base +
stat_density_2d(aes(fill = ..level..), geom = "polygon", color = "black", size = 0.15, binwidth = 0.004) +
geom_point(color = "black", size = 1.5, alpha = .4) +
scale_fill_gradient(low = "grey95", high = "grey70", guide = "none")
```
In Chapter \@ref(visualizing-associations), we also looked at the relationship between head length and body mass separately for male and female birds (Figure \@ref(fig:blue-jays-scatter-sex)). We can do the same with contour lines, by drawing separately colored contour lines for male and female birds (Figure \@ref(fig:blue-jays-contour-by-sex)).
(ref:blue-jays-contour-by-sex) Head length versus body mass for 123 blue jays. As in Figure \@ref(fig:blue-jays-scatter-sex), we can also indicate the birds' sex by color when drawing contour lines. This figure highlights how the point distribution is different for male and female birds. In particular, male birds are more densely clustered in one region of the plot area whereas female birds are more spread out. Data source: Keith Tarvin, Oberlin College
```{r blue-jays-contour-by-sex, fig.width = 6, fig.asp = 3/4, fig.cap='(ref:blue-jays-contour-by-sex)'}
blue_jays_base +
aes(color = KnownSex) +
stat_density_2d(size = 0.4, binwidth = 0.006) +
geom_point(size = 1.5, alpha = 0.7) +
scale_color_manual(
values = c(F = "#D55E00", M = "#0072B2"),
breaks = c("F", "M"),
labels = c("female birds ", "male birds"),
name = NULL,
guide = guide_legend(
direction = "horizontal",
override.aes = list(size = 2, linetype = 0)
)
) +
theme_dviz_grid() +
theme(
legend.position = c(1, 0),
legend.justification = c(1, 0),
#legend.position = "top",
#legend.justification = "right",
#legend.box.spacing = unit(3.5, "pt"), # distance between legend and plot
legend.text = element_text(vjust = 0.6),
legend.spacing.x = unit(2, "pt"),
legend.background = element_rect(fill = "white", color = NA),
#legend.key.width = unit(10, "pt")
axis.ticks.length = unit(0, "pt"),
axis.ticks = element_blank()
)
```
Drawing multiple sets of contour lines in different colors can be a powerful strategy for showing the distributions of several point clouds at once. However, this technique needs to be employed with care. It only works when the number of groups with distinct colors is small (two to three) and the groups are clearly separated. Otherwise, we may end up with a hairball of differently colored lines all crisscrossing each other and not showing any particular pattern at all.
To illustrate this potential problem, I will employ the diamonds dataset, which contains information for 53,940 diamonds, including their price, weight (carat), and cut. Figure \@ref(fig:diamonds-points) shows this dataset as a scatter plot. We see clear problems with overplotting. There are so many different-colored points on top of one another that it is impossible to discern anything beyond the overall broad outline of where diamonds fall on the price--carat spectrum.
(ref:diamonds-points) Price of diamonds versus their carat value, for 53,940 individual diamonds. Each diamond's cut is indicated by color. The plot is labeled as "bad" because the extensive overplotting makes it impossible to discern any patterns among the different diamond cuts. Data source: Hadley Wickham, ggplot2
```{r diamonds-points, fig.width = 5*6/4.2, fig.asp = 3/4, fig.cap = '(ref:diamonds-points)'}
p <- ggplot(diamonds, aes(carat, price, color = cut)) +
geom_point(size = .2, alpha = 1/5) +
scale_x_continuous(
limits = c(-1, 5.1)
) +
scale_y_log10(
name = "price (USD)",
breaks = c(300, 1000, 3000, 10000),
labels = c("$300", "$1,000", "$3,000", "$10,000")
) +
scale_color_discrete_sequential(
palette = "Inferno",
nmax = 6,
order = 1:5,
rev = FALSE,
breaks = c("Ideal", "Premium", "Very Good", "Good", "Fair"),
labels = c("ideal", "premium", "very good", "good", "fair"),
guide = guide_legend(
override.aes = list(size = 2, alpha = 1)
)
) +
coord_cartesian(xlim = c(-.1, 3.2), ylim = c(240, 25000), expand = FALSE) +
theme_dviz_grid() +
panel_border() +
theme(
plot.margin = margin(18, 7, 1, 1.5),
legend.key.width = unit(6, "pt"),
legend.spacing.y = unit(3, "pt"),
legend.title = element_text(hjust = 0, margin = margin(0, 0, 0, 0)),
legend.position = c(.97, .3),
legend.justification = c(1, 0.5),
legend.box.margin = margin(7, 7, 7, 7),
legend.box.background = element_rect(fill = "white", color = NA),
axis.ticks.length = unit(0, "pt")
)
stamp_bad(p)
```
We could try to draw colored contour lines for the different qualities of cut, as in Figure \@ref(fig:blue-jays-contour-by-sex). However, in the diamonds dataset, we have five distinct colors and the groups strongly overlap. Therefore, the contour plot (Figure \@ref(fig:diamonds-contour-colors)) is not much better than the original scatter plot (Figure \@ref(fig:diamonds-points)).
(ref:diamonds-contour-colors) Price of diamonds versus their carat value. As Figure \@ref(fig:diamonds-points), but now individual points have been replaced by contour lines. The resulting plot is still labeled "bad", because the contour lines all lie on top of each other. Neither the point distribution for individual cuts nor the overall point distribution can be discerned. Data source: Hadley Wickham, ggplot2
```{r diamonds-contour-colors, fig.width = 5*6/4.2, fig.asp = 3/4, fig.cap = '(ref:diamonds-contour-colors)'}
p <- ggplot(diamonds, aes(carat, price, color = cut)) +
geom_density2d(size = .35, binwidth = 0.8) +
scale_x_continuous(
limits = c(-1, 5.1)
) +
scale_y_log10(
name = "price (USD)",
breaks = c(300, 1000, 3000, 10000),
labels = c("$300", "$1,000", "$3,000", "$10,000")
) +
scale_color_discrete_sequential(
palette = "Inferno",
nmax = 6,
order = 1:5,
rev = FALSE,
breaks = c("Ideal", "Premium", "Very Good", "Good", "Fair"),
labels = c("ideal", "premium", "very good", "good", "fair"),
guide = guide_legend(
override.aes = list(size = 0.5)
)
) +
coord_cartesian(xlim = c(-.1, 2.3), ylim = c(240, 25000), expand = FALSE) +
theme_dviz_grid() +
panel_border() +
theme(
plot.margin = margin(18, 7, 1, 1.5),
legend.spacing.y = unit(3, "pt"),
legend.title = element_text(hjust = 0, margin = margin(0, 0, 0, 0)),
legend.position = c(.97, .3),
legend.justification = c(1, 0.5),
legend.box.margin = margin(7, 7, 7, 7),
legend.box.background = element_rect(fill = "white", color = NA),
axis.ticks.length = unit(0, "pt")
)
stamp_bad(p)
```
What helps here is to draw the contour lines for each cut quality in its own plot panel (Figure \@ref(fig:diamonds-contour-facets)). The purpose of drawing them all in one panel might be to enable visual comparison between the groups, but Figure \@ref(fig:diamonds-contour-colors) is so busy that a comparison isn't possible. Instead, in Figure \@ref(fig:diamonds-contour-facets), the background grid enables us to make comparisons across cut qualities, by paying attention to where exactly the contour lines fall relative to the grid lines. (A similar effect could have been achieved by plotting partially transparent individual points instead of contour lines in each panel.)
(ref:diamonds-contour-facets) Price of diamonds versus their carat value. Here, we have taken the density contours from Figure \@ref(fig:diamonds-contour-colors) and drawn them separately for each cut. We can now see that better cuts (very good, premium, ideal) tend to have lower carat values than the poorer cuts (fair, good) but command a higher price per carat. Data source: Hadley Wickham, ggplot2
```{r diamonds-contour-facets, fig.width = 5.5*6/4.2, fig.asp = 3/4, fig.cap = "(ref:diamonds-contour-facets)"}
ggplot(diamonds, aes(carat, price)) +
stat_density_2d(aes(fill = ..level..), geom = "polygon", color = darken("#0072B2", .2), size = .3, binwidth = 0.8) +
#geom_density2d(color = darken("#0072B2", .2), size = .3, binwidth = 0.8) +
scale_fill_gradient(low = desaturate(lighten("#0072B2", .9), .6), high = desaturate(lighten("#0072B2", .6), .6), guide = "none") +
scale_x_continuous(
limits = c(-1, 5.1)
) +
scale_y_log10(
name = "price (USD)",
breaks = c(300, 1000, 3000, 10000),
labels = c("$300", "$1,000", "$3,000", "$10,000")
) +
coord_cartesian(xlim = c(-.1, 2.3), ylim = c(200, 25000), expand = FALSE) +
facet_wrap(~cut, scales = "free_x", labeller = labeller(cut = tolower)) +
theme_dviz_grid() +
panel_border() +
theme(
legend.title = element_text(hjust = 0.5),
legend.position = c(.95, .05),
legend.justification = c(1, 0),
axis.ticks.length = unit(0, "pt")
)
```
We can make out two main trends. First, the better cuts (very good, premium, ideal) tend to have lower carat values than the poorer cuts (fair, good). Recall that carat is a measure of diamond weight (1 carat = 0.2 gram). Better cuts tend to result (on average) in lighter diamonds because more material needs to be removed to create them. Second, at the same carat value, better cuts tend to command higher prices. To see this pattern, look for example at the price distribution for 0.5 carat. The distribution is shifted upwards for better cuts, and in particular it is substantially higher for diamonds with ideal cut than for diamonds with fair or good cut.