Commit

several new paragraphs. chapter is almost done.
clauswilke committed Mar 5, 2018
1 parent 0586752 commit 8213765
Showing 3 changed files with 39 additions and 26 deletions.
65 changes: 39 additions & 26 deletions choosing_plotting_software.Rmd
Expand Up @@ -8,42 +8,25 @@ library(forcats)

# Choosing the right visualization software

Throughout this book, I have purposefully avoided one critical question of data visualization: How do we actually generate our figures? What tools should we use? This question can generate heated discussions, as many people have strong emotional bonds to the specific tools they are familiar with. I have often seen people vigorously defend their own preferred tools instead of investing time to learn a new approach, even if the new approach has objective benefits. And I will say that sticking with the tools you know is not entirely unreasonable. Learning any new tool will require time and effort, and you will have to go through a painful transition period where getting things done with the new tool is much more difficult than it was with the old tool. Whether going through this period is worth the effort can usually only be evaluated in retrospect, after one has fully made the transition. Therefore, regardless of the pros and cons of different tools and approaches, the overriding principle is that you need to pick one that works for you. If you can make the figures you want to make, without excessive effort, then that's all that matters.
Throughout this book, I have purposefully avoided one critical question of data visualization: How do we actually generate our figures? What tools should we use? This question can generate heated discussions, as many people have strong emotional bonds to the specific tools they are familiar with. I have often seen people vigorously defend their own preferred tools instead of investing time to learn a new approach, even if the new approach has objective benefits. And I will say that sticking with the tools you know is not entirely unreasonable. Learning any new tool will require time and effort, and you will have to go through a painful transition period where getting things done with the new tool is much more difficult than it was with the old tool. Whether going through this period is worth the effort can usually only be evaluated in retrospect, after one has made the investment to learn the new tool. Therefore, regardless of the pros and cons of different tools and approaches, the overriding principle is that you need to pick one that works for you. If you can make the figures you want to make, without excessive effort, then that's all that matters.

```{block type='rmdtip', echo=TRUE}
The best visualization software is the one that allows you to make the figures you need.
```

Having said this, I do think there are general principles we can use to assess the relative merits of different approaches to producing visualizations. These principles roughly break down by how easy it is to rapidly explore the data, how reproducible the resulting visualizations are, and to what extent the visual appearance of the output can be tweaked.


## Data exploration versus publication-ready figures


Throughout much of the 20th century, data visualizations were drawn by hand, mostly by technical illustrators who did so for a living.

One notable example is Randall Munroe, creator of the web comic XKCD,

(ref:sequencing-cost) After the introduction of next-gen sequencing methods, the sequencing cost per genome has declined much more rapidly than predicted by Moore's law. Data source: National Human Genome Research Institute

```{r sequencing-cost, out.width = "70%", fig.cap='(ref:sequencing-cost)'}
knitr::include_graphics("figures/sequencing_costs.png", auto_pdf = FALSE)
```


Having said this, I do think there are general principles we can use to assess the relative merits of different approaches to producing visualizations. These principles roughly break down by how reproducible the visualizations are, how easy it is to rapidly explore the data, and to what extent the visual appearance of the output can be tweaked.

## Reproducibility and repeatability

In the context of scientific experiments, we refer to work as reproducible if the overarching scientific finding of the work will remain unchanged if a different research group performs the same type of study. For example, if one research group finds that a new pain medication reduces perceived headache pain significantly without causing noticeable side effects and a different group subsequently studies the same medication on a different patient group and has the same findings, then the work is reproducible. By contrast, work is repeatable if very similar or identical measurements can be obtained by the same person repeating the exact same measurement procedure on the same equipment. For example, if I weigh my dog and find she weighs 41 lbs and then I weigh her again on the same scales and find again that she weighs 41 lbs, then this measurement is repeatable.

With minor modifications, we can apply these concepts to data visualization. A visualization is reproducible if the plotted data are available and any data transformations that may have been applied are exactly specified. For example, if you make a figure and then send me the exact data that you plotted, then I can prepare a figure that looks substantially similar. We may be using slightly different fonts or colors or point sizes to display the same data, so the two figures may not be exactly identical, but your figure and mine convey the same message and therefore are reproductions of each other. A visualization is repeatable, on the other hand, if it is possible to recreate the exact same visual appearance, down to the last pixel, from the raw data. Strictly speaking, repeatability requires that even if there are random elements in the figure, such as jitter (Chapter \@ref(overlapping-points)), those elements were specified in a repeatable way and can be regenerated at a future date. For random data, repeatability generally requires that we specify a particular random number generator for which we set and record a seed.

We can apply these concepts to data visualizations with minor modifications. A visualization is reproducible if the plotted data are available and any data transformations that may have been applied are exactly specified. For example, if you make a figure and then send me the exact data that you plotted, then I can prepare a figure that looks substantially similar. We may be using slightly different fonts or colors or point sizes to display the same data, so the two figures may not be exactly identical, but your figure and mine convey the same message and therefore are reproductions of each other. A visualization is repeatable, on the other hand, if it is possible to recreate the exact same visual appearance, down to the last pixel, from the raw data. Strictly speaking, repeatability requires that even if there are random elements in the figure, such as jitter (Chapter \@ref(overlapping-points)), those elements were specified in a repeatable way and can be regenerated at a future date. For random data, repeatability generally requires that we set and record a seed for the random number generator.
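
The role of the seed can be illustrated with a minimal sketch. It assumes a ggplot2 version in which `position_jitter()` accepts a `seed` argument; the data frame `df` and the helper `make_plot()` are hypothetical and exist only for this illustration.

```r
library(ggplot2)

# Hypothetical toy data, deterministic so that only the jitter is random.
df <- data.frame(
  group = rep(c("a", "b"), each = 10),
  value = rep(1:10, times = 2)
)

# Hypothetical helper: the same plot, with the jitter seed as a parameter.
make_plot <- function(seed) {
  ggplot(df, aes(x = group, y = value)) +
    geom_point(
      position = position_jitter(width = 0.2, height = 0, seed = seed)
    )
}

p1 <- make_plot(seed = 1234)
p2 <- make_plot(seed = 1234)  # same seed: intended as a repeat of p1
p3 <- make_plot(seed = 4321)  # different seed: a reproduction, not a repeat

# layer_data() returns the positions that are actually drawn.
same_jitter  <- identical(layer_data(p1)$x, layer_data(p2)$x)
other_jitter <- identical(layer_data(p1)$x, layer_data(p3)$x)
```

With a recorded seed, `p1` and `p2` should place every point at the same jittered position, whereas `p3` shows the same data with different jitter.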
Throughout this book, we have seen many examples of figures that reproduce but don't repeat other figures. For example, Chapter \@ref(avoid-line-drawings) shows several sets of figures where all figures in each set show the same data but each figure in each set looks somewhat different. Similarly, Figure \@ref(fig:lincoln-repro)a is a repeat of Figure \@ref(fig:lincoln-temp-jittered), down to the random jitter that was applied to each data point, whereas Figure \@ref(fig:lincoln-repro)b is a reproduction of that figure. Figure \@ref(fig:lincoln-repro)b has different jitter than Figure \@ref(fig:lincoln-temp-jittered), and it also uses a sufficiently different visual design that the two figures look quite distinct, even if they clearly convey the same information about the data.

**Make an example of a figure that reproduces but doesn't repeat another figure. Use a different theme, colors, point sizes, etc. Maybe also use altered jitter. And also refer to Chapter \@ref(avoid-line-drawings) which has several more such examples.**
(ref:lincoln-repro) Repeat and reproduction of a figure. Part (a) is a near-complete repeat of Figure \@ref(fig:lincoln-temp-jittered). With the exception of the exact sizes of the text elements and points, which were adjusted so the figure remains legible at the reduced size, the two figures are identical down to the random jitter that was applied to each point. By contrast, part (b) is a reproduction but not a repeat. In particular, the jitter in part (b) differs from the jitter in part (a) or Figure \@ref(fig:lincoln-temp-jittered).

(ref:lincoln-lincoln-repro) Repeat and reproduction of a figure. Part (a) is a near-complete repeat of Figure \@ref(fig:lincoln-temp-jittered). With the exception of the exact sizes of the text elements and points, which were adjusted so the figure remains legible at the reduced size, the two figures are identical down to the random jitter that was applied to each point. By contrast, part (b) is a reproduction but not a repeat. In particular, the jitter in part (b) differs from the jitter in part (a) or Figure \@ref(fig:lincoln-temp-jittered).

```{r lincoln-repro, fig.width = 8.5, fig.asp = .32, fig.cap = '(ref:lincoln-lincoln-repro)'}
```{r lincoln-repro, fig.width = 8.5, fig.asp = .32, fig.cap = '(ref:lincoln-repro)'}
ggridges::lincoln_weather %>% mutate(month_short = fct_recode(Month,
Jan = "January",
Feb = "February",
Expand Down Expand Up @@ -77,14 +60,44 @@ lincoln2 <- ggplot(lincoln_df, aes(x = month_short, y = `Mean Temperature [F]`))
plot_grid(lincoln1, lincoln2, labels = "auto", label_fontface = "plain", ncol = 2)
```

Both reproducibility and repeatability can be difficult to achieve when we're working with interactive plotting software. Many interactive programs allow you to transform or otherwise manipulate the data but don't keep track of every individual data transformation you perform, only of the final product. As a result, if somebody asked you to reproduce a figure you made or to produce a very similar one with a slightly different data set, you might have difficulty doing so. During my years as a postdoc and a young assistant professor, I used an interactive program for all my scientific visualizations, and this exact issue happened to me several times. For example, I had made several figures for a scientific manuscript. When I wanted to revise the manuscript a few months later and needed to reproduce a slightly altered version of one of the figures, I realized that I wasn't quite sure anymore how I had made the original figure in the first place. This experience has taught me to stay away from interactive programs as much as possible. I now make figures programmatically, by writing code (scripts) that generates the figures from the raw data. Programmatically generated figures will generally be repeatable by anybody who has access to the script that generated the figure as well as the programming language and specific libraries used.
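
As a rough sketch of what such a script can look like, the following example goes from data to a saved figure file entirely in code. The built-in `airquality` dataset stands in for a raw data file, and the object names and output file name are hypothetical choices made for this illustration.

```r
library(ggplot2)

# Stand-in for reading a raw data file; in a real project this line
# would typically be a call such as read.csv() on the raw data.
raw <- datasets::airquality

# Every data transformation is written down, so it can be rerun later.
clean <- raw[!is.na(raw$Ozone), ]
clean$Month <- factor(clean$Month, labels = month.abb[5:9])

# The figure itself is defined in code rather than by mouse clicks.
p <- ggplot(clean, aes(x = Month, y = Ozone)) +
  geom_boxplot() +
  labs(x = "month", y = "ozone (ppb)")

# Hypothetical output location; a temporary directory keeps the sketch tidy.
out_file <- file.path(tempdir(), "ozone_by_month.png")
ggsave(out_file, p, width = 5, height = 3.5, dpi = 300)
```

If the dataset is later expanded or the filtering step changes, rerunning the script regenerates the figure, and the script itself documents how the figure was made.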


## Data exploration versus data presentation

There are two distinct phases of data visualization, and they have very different requirements. The first is data exploration. Whenever you start working with a new dataset, you need to look at it from different angles and try various ways of visualizing it, just to develop an understanding of the dataset's key features. In this phase, speed and efficiency are of the essence. You need to try different types of visualizations, different data transformations, and different subsets of the data. The faster you can iterate through different ways of looking at the data, the more you will explore, and the higher the likelihood that you will notice an important feature in the data that you might otherwise have overlooked. The second phase is data presentation. You enter it once you understand your dataset and know what aspects of it you want to show to your audience. The key objective in this phase is to prepare a high-quality, publication-ready figure that can be printed in an article or book, included in a presentation, or posted on the internet.

**paragraph on exploration**

Once we have determined how exactly we want to visualize our data, what transformations we want to make, and what type of plot to use, we will commonly want to prepare a high-quality version for publication. At this point, we have many different avenues to pursue. First, we can finalize the figure using the same software platform we used for initial exploration. Second, we can switch to a platform that gives us finer control over the final product, even if that platform makes it harder to explore. Third, we can produce a draft figure with visualization software and then manually post-process it with an image manipulation or illustration program such as Photoshop or Illustrator. Fourth, we can redraw the entire figure from scratch, either by hand or using an illustration program.

In fact, hand-drawn data visualizations were commonplace throughout much of the 20th century, mostly out of necessity. Visualization software did not start to become widespread and readily available until the mid-to-late 1980s. Now, in the 21st century, computer-generated visualizations are everywhere, and users can choose from thousands of different software packages to visualize their data. However, in this context, manually drawn figures are making somewhat of a resurgence, likely because they represent a unique and personalized take on what can otherwise be a somewhat sterile and boring art form (see Figure \@ref(fig:sequencing-cost) for an example).

(ref:sequencing-cost) After the introduction of next-gen sequencing methods, the sequencing cost per genome has declined much more rapidly than predicted by Moore's law. This is a hand-drawn figure reproducing a figure prepared by the National Institutes of Health and widely publicized. Data source: National Human Genome Research Institute

```{r sequencing-cost, out.width = "70%", fig.cap='(ref:sequencing-cost)'}
knitr::include_graphics("figures/sequencing_costs.png", auto_pdf = FALSE)
```

I have no principled concern about hand-drawn figures or figures that have been manually post-processed, for example to change axis labels, add annotations, or modify colors. These approaches can yield beautiful and unique figures that couldn't easily be made in any other way. However, I would like to caution against manual sprucing up of figures in routine data analysis pipelines or for scientific publications. Manual steps in the figure preparation pipeline make repeating or reproducing a figure inherently difficult and time-consuming. And in my experience from working in the natural sciences, we rarely make a figure just once. Over the course of a study, we may redo experiments, expand the original dataset, or repeat an experiment several times with slightly altered conditions. Many times I have seen that late in the publication process, when we think everything is done and finalized, we end up introducing a small modification to how we analyze our data, and consequently all figures have to be redrawn. And I've also seen, in similar situations, that the decision is made not to redo the analysis or not to redraw the figures, either due to the effort involved or because the people who had made the original figure have moved on and aren't available anymore. In all these scenarios, an unnecessarily complicated and non-reproducible data visualization pipeline interferes with producing the best possible science.

## Separation of content and design

Good plotting software allows you to think separately about content and design.
Good visualization software should allow you to think separately about the contents and the design of your figures. By contents, I refer to the specific data set shown, the data transformations applied (if any), the specific mappings from data onto aesthetics, the scales, the axis ranges, and the type of plot (scatter plot, line plot, bar plot, boxplot, etc.). Design, on the other hand, describes features such as the foreground and background colors, font specifications (e.g. size, face, family), symbol shapes and sizes, the placement of legends, axis ticks, axis titles, and plot titles, and whether or not the figure has a background grid. When I work on a new visualization, I usually determine first what the contents should be, using the kind of rapid iteration described in the previous subsection. Once the contents are set, I may tweak the design, or more likely I will apply a pre-defined design that I like and/or that gives the figure a consistent look in the context of a larger body of work.


**pre-defined designs are good; general principle of publishing, also on web, in books, etc.**

Most data scientists are not designers, and they should not be expected to be.

For example, my current preferred plotting software, ggplot2, has the concept of a theme. A theme specifies the visual appearance of a plot, without making any assumptions about the plot contents. Theme authors can provide different complete designs, and it is very easy to take a plot and apply different themes to it (Figure \@ref(fig:unemploy-themes)).
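
A minimal sketch of this separation, using only themes that ship with ggplot2 and the `economics` dataset bundled with the package, might look as follows. The object names are hypothetical, and the themes here are not the ones used for the figures in this chapter.

```r
library(ggplot2)

# Contents: the data set, the aesthetic mappings, and the type of plot.
base <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Design: each theme changes the appearance but leaves the contents alone.
p_gray    <- base + theme_gray()                   # ggplot2's default look
p_minimal <- base + theme_minimal()
p_classic <- base + theme_classic(base_size = 14)  # larger base font size
```

All three plots draw the same line from the same data. Only fonts, backgrounds, grid lines, and similar design elements differ, so a design can be chosen or swapped after the contents have been settled.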


**Refer back to the figure from the previous subsection which shows the same content using different designs.** Chapter \@ref(avoid-line-drawings)


**Refer back to the figure from the previous subsection which shows the same content using different designs.**

(ref:unemploy-themes) Number of unemployed persons in the U.S. from 1970 to 2015. The same figure is displayed using four different designs: (a) the default design for this book; (b) the default design of ggplot2, the plotting software I have used to make all figures in this book; (c) a design similar to visualizations shown in the Economist; (d) a design similar to visualizations shown by FiveThirtyEight. FiveThirtyEight often forgoes axis labels in favor of plot titles and subtitles. Data source: U.S. Bureau of Labor Statistics
(ref:unemploy-themes) Number of unemployed persons in the U.S. from 1970 to 2015. The same figure is displayed using four different ggplot2 themes: (a) the default theme for this book; (b) the default theme of ggplot2, the plotting software I have used to make all figures in this book; (c) a theme that mimics visualizations shown in the Economist; (d) a theme that mimics visualizations shown by FiveThirtyEight. FiveThirtyEight often forgoes axis labels in favor of plot titles and subtitles, and therefore I have adjusted the figure accordingly. Data source: U.S. Bureau of Labor Statistics

```{r unemploy-themes, fig.width = 8.5, fig.asp = 0.75, fig.cap = '(ref:unemploy-themes)'}
unemploy_base <- ggplot(economics, aes(x = date, y = unemploy)) +
Expand Down
Binary file added figures/sequencing_costs.pdf
Binary file not shown.
Binary file added figures/sequencing_costs.png