Addressing reviewer comments
clauswilke committed Jan 13, 2019
1 parent 035baf6 commit e22f93f
Showing 18 changed files with 83 additions and 79 deletions.
17 changes: 9 additions & 8 deletions boxplots_violins.Rmd
@@ -10,14 +10,14 @@ library(ggridges)

There are many scenarios in which we want to visualize multiple distributions at the same time. For example, consider weather data. We may want to visualize how temperature varies across different months while also showing the distribution of observed temperatures within each month. This scenario requires showing twelve temperature distributions at once, one for each month. None of the visualizations discussed in Chapters \@ref(histograms-density-plots) or \@ref(ecdf-qq) work well in this case. Instead, viable approaches include boxplots, violin plots, and ridgeline plots.

Whenever we are dealing with many distributions, it is helpful to think in terms of the response variable and one or more grouping variables. The response variable is the variable whose distributions we want to show, and the grouping variables define subsets of the data with distinct distributions of the response variable. For example, for temperature distributions across months, the response variable is the temperature and the grouping variable is the month. All techniques discussed in this chapter draw the response variable along one axis and the grouping variables along the other. In the following, I will first describe approaches that show the response variable along the vertical axis, and then I will describe approaches that show the response variable along the horizontal axis. In all cases discussed, we could flip the axes and arrive at an alternative and viable visualization. I am showing here the canonical forms of the various visualizations.
Whenever we are dealing with many distributions, it is helpful to think in terms of the response variable and one or more grouping variables. The response variable is the variable whose distributions we want to show. The grouping variables define subsets of the data with distinct distributions of the response variable. For example, for temperature distributions across months, the response variable is the temperature and the grouping variable is the month. All techniques discussed in this chapter draw the response variable along one axis and the grouping variables along the other. In the following, I will first describe approaches that show the response variable along the vertical axis, and then I will describe approaches that show the response variable along the horizontal axis. In all cases discussed, we could flip the axes and arrive at an alternative and viable visualization. I am showing here the canonical forms of the various visualizations.


## Visualizing distributions along the vertical axis {#boxplots-violins-vertical}

The simplest approach to showing many distributions at once is to show their mean or median as points, with some indication of the variation around the mean or median shown by error bars. Figure \@ref(fig:lincoln-temp-points-errorbars) demonstrates this approach for the distributions of monthly temperatures in Lincoln, Nebraska, in 2016. I have labeled this figure as bad because there are multiple problems with this approach. First, by representing each distribution by only one point and two error bars, we are losing a lot of information about the data. Second, it is not immediately obvious what the points represent, even though most readers would likely guess that they represent either the mean or the median. Third, it is definitely not obvious what the error bars represent. Do they represent the standard deviation of the data, the standard error of the mean, a 95% confidence interval, or something else altogether? There is no commonly accepted standard. By reading the figure caption of Figure \@ref(fig:lincoln-temp-points-errorbars), we can see that they represent here twice the standard deviation of the daily mean temperatures, meant to indicate the range that contains approximately 95% of the data. However, error bars are more commonly employed to visualize the standard error (or twice the standard error for a 95% confidence interval), and it is easy for readers to confuse the standard error with the standard deviation. The standard error quantifies how accurate our estimate of the mean is, whereas the standard deviation estimates how much spread there is in the data around the mean. It is possible for a dataset to have both a very small standard error of the mean and a very large standard deviation. Fourth, symmetric error bars are misleading if there is any skew in the data, which is the case here and almost always for real-world datasets.
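The difference between the standard deviation and the standard error can be checked numerically. The following is a minimal sketch on simulated data (hypothetical values, not the Lincoln temperature data); the variable names are illustrative only.

```r
# Hypothetical data: many draws with a large spread around the mean.
set.seed(1)
x <- rnorm(10000, mean = 50, sd = 20)

sd_x <- sd(x)                    # spread of the data around the mean
se_x <- sd_x / sqrt(length(x))   # uncertainty of the estimated mean

# With n = 10000 the standard error is 100 times smaller than the
# standard deviation, even though the data themselves are widely spread.
c(sd = sd_x, se = se_x)
```

Because the standard error shrinks with the square root of the sample size while the standard deviation does not, a large dataset can have a very small standard error and a very large standard deviation at the same time.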

(ref:lincoln-temp-points-errorbars) Mean daily temperatures in Lincoln, Nebraska in 2016. Points represent the average daily mean temperatures for each month, averaged over all days of the month, and error bars represent twice the standard deviation of the daily mean temperatures within each month. This figure has been labeled as "bad" because error bars are conventionally used to visualize the uncertainty of an estimate, not the variability in a population.
(ref:lincoln-temp-points-errorbars) Mean daily temperatures in Lincoln, Nebraska in 2016. Points represent the average daily mean temperatures for each month, averaged over all days of the month, and error bars represent twice the standard deviation of the daily mean temperatures within each month. This figure has been labeled as "bad" because error bars are conventionally used to visualize the uncertainty of an estimate, not the variability in a population. Data source: Weather Underground

```{r lincoln-temp-points-errorbars, fig.cap = '(ref:lincoln-temp-points-errorbars)'}
lincoln_weather %>%
@@ -53,7 +53,7 @@ lincoln_errbar <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature
stamp_bad(lincoln_errbar)
```

We can address all four shortcomings of Figure \@ref(fig:lincoln-temp-points-errorbars) by using a traditional and commonly used method for visualizing distributions, the boxplot. A boxplot divides the data into quartiles and visualizes them in a standardized manner (Figure \@ref(fig:boxplot-schematic)). Boxplots are simple yet informative, and they work well when plotted next to each other to visualize many distributions at once. For the Lincoln temperature data, using boxplots leads to Figure \@ref(fig:lincoln-temp-boxplots). In that figure, we can now see that temperature is highly skewed in December (most days are moderately cold and a few are extremely cold) and not very skewed at all in some other months, for example in July.
We can address all four shortcomings of Figure \@ref(fig:lincoln-temp-points-errorbars) by using a traditional and commonly used method for visualizing distributions, the boxplot. A boxplot divides the data into quartiles and visualizes them in a standardized manner (Figure \@ref(fig:boxplot-schematic)).

(ref:boxplot-schematic) Anatomy of a boxplot. Shown are a cloud of points (left) and the corresponding boxplot (right). Only the *y* values of the points are visualized in the boxplot. The line in the middle of the boxplot represents the median, and the box encloses the middle 50% of the data. The top and bottom whiskers extend either to the maximum and minimum of the data or to the maximum or minimum that falls within 1.5 times the height of the box, whichever yields the shorter whisker. The distances of 1.5 times the height of the box in either direction are called the upper and the lower fences. Individual data points that fall beyond the fences are referred to as outliers and are usually shown as individual dots.
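
The quantities named in the boxplot anatomy can be computed by hand. The sketch below uses a small made-up vector (not data from the book) and base R's default `quantile()` convention; plotting functions may use slightly different quartile definitions, so this is an illustration rather than a description of how any particular plotting function works.

```r
# Hypothetical data: the values 1..20 plus one extreme value.
y <- c(1:20, 40)

q <- quantile(y, c(0.25, 0.5, 0.75))  # quartiles: 6, 11, 16
box_height <- q[[3]] - q[[1]]         # height of the box: 10

# Fences lie 1.5 box heights beyond the box in either direction.
upper_fence <- q[[3]] + 1.5 * box_height  # 31
lower_fence <- q[[1]] - 1.5 * box_height  # -9

# Whiskers end at the most extreme data points inside the fences.
whisker_top    <- max(y[y <= upper_fence])  # 20
whisker_bottom <- min(y[y >= lower_fence])  # 1

# Points beyond the fences are drawn individually as outliers.
outliers <- y[y > upper_fence | y < lower_fence]  # 40
```

Here the upper whisker stops at 20 rather than at the fence value of 31, because whiskers extend to actual data points, and the value 40 is drawn as a separate dot.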

@@ -80,6 +80,7 @@ p_points <- ggplot(data.frame(y), aes(x = 0, y = y)) +
plot_grid(p_points, p_boxplot, rel_widths = c(.65, 1), nrow = 1)
```

Boxplots are simple yet informative, and they work well when plotted next to each other to visualize many distributions at once. For the Lincoln temperature data, using boxplots leads to Figure \@ref(fig:lincoln-temp-boxplots). In that figure, we can now see that temperature is highly skewed in December (most days are moderately cold and a few are extremely cold) and not very skewed at all in some other months, for example in July.

(ref:lincoln-temp-boxplots) Mean daily temperatures in Lincoln, Nebraska, visualized as boxplots.

@@ -146,11 +147,11 @@ lincoln_violin

Because violin plots are derived from density estimates, they have similar shortcomings (Chapter \@ref(histograms-density-plots)). In particular, they can generate the appearance that there is data where none exists, or that the data set is very dense when actually it is quite sparse. We can try to circumvent these issues by simply plotting all the individual data points directly, as dots (Figure \@ref(fig:lincoln-temp-all-points)). Such a figure is called a *strip chart.* Strip charts are fine in principle, as long as we make sure that we don't plot too many points on top of each other. A simple solution to overplotting is to spread out the points somewhat along the *x* axis, by adding some random noise in the *x* dimension (Figure \@ref(fig:lincoln-temp-jittered)). This technique is also called *jittering.*
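
Jittering itself is a small operation: each point's *x* position is shifted by a random amount while its *y* value is left untouched. The following is a minimal sketch with made-up values (hypothetical group indices and temperatures), assuming uniform noise of at most 0.15 in either direction; it does not reproduce the exact random offsets that any plotting library would draw.

```r
# Hypothetical data: three groups on the x axis, five observations each.
set.seed(320)
group <- rep(1:3, each = 5)
temp  <- c(30, 31, 31, 32, 35,  40, 40, 41, 43, 44,  55, 56, 56, 57, 60)

# Add uniform noise in the x dimension only; y values stay exact.
jitter_width <- 0.15
x_jittered <- group + runif(length(group), -jitter_width, jitter_width)

jittered <- data.frame(x = x_jittered, y = temp)
```

Keeping the noise width well below half the spacing between groups ensures that points from neighboring groups cannot overlap.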

(ref:lincoln-temp-all-points) Mean daily temperatures in Lincoln, Nebraska, visualized as individual temperature values. Each point represents the mean temperature for one day. This figure is labeled as "bad" because so many points are plotted on top of each other that it is not possible to ascertain which temperatures were the most common in each month.
(ref:lincoln-temp-all-points) Mean daily temperatures in Lincoln, Nebraska, visualized as a strip chart. Each point represents the mean temperature for one day. This figure is labeled as "bad" because so many points are plotted on top of each other that it is not possible to ascertain which temperatures were the most common in each month.

```{r lincoln-temp-all-points, fig.cap = '(ref:lincoln-temp-all-points)'}
lincoln_points <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)) +
geom_point() +
geom_point(size = 0.75) +
xlab("month") +
ylab("mean temperature (°F)") +
theme_dviz_open() +
@@ -160,11 +161,11 @@ stamp_bad(lincoln_points)
```


(ref:lincoln-temp-jittered) Mean daily temperatures in Lincoln, Nebraska, visualized as individual temperature values. The points have been jittered along the *x* axis to better show the density of points at each temperature value.
(ref:lincoln-temp-jittered) Mean daily temperatures in Lincoln, Nebraska, visualized as a strip chart. The points have been jittered along the *x* axis to better show the density of points at each temperature value.

```{r lincoln-temp-jittered, fig.cap = '(ref:lincoln-temp-jittered)'}
lincoln_jitter <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)) +
geom_point(position = position_jitter(width = .15, height = 0, seed = 320)) +
geom_point(position = position_jitter(width = .15, height = 0, seed = 320), size = 0.75) +
xlab("month") +
ylab("mean temperature (°F)") +
theme_dviz_open() +
@@ -185,7 +186,7 @@ Finally, we can combine the best of both worlds by spreading out the dots in pro
```{r lincoln-temp-sina, fig.cap = '(ref:lincoln-temp-sina)'}
lincoln_sina <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)) +
geom_violin(color = "transparent", fill = "gray90") +
stat_sina() +
stat_sina(size = 0.75) +
xlab("month") +
ylab("mean temperature (°F)") +
theme_dviz_open() +
2 changes: 1 addition & 1 deletion choosing_visualization_software.Rmd
@@ -44,7 +44,7 @@ ggridges::lincoln_weather %>% mutate(month_short = fct_recode(Month,
lincoln1 <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`)) +
geom_point(position = position_jitter(width = .15, height = 0, seed = 320),
size = .75) +
size = .5) +
xlab("month") +
ylab("mean temperature (°F)") +
theme_dviz_open(12) + theme(plot.margin = margin(3, 12, 3, 0))