cleanup, rename data-ink chapter

gedonis · Jan 13, 2019 · e01651e · e01651e
1 parent e22f93f
commit e01651e
Show file tree

Hide file tree

Showing 23 changed files with 32 additions and 1,939 deletions.
diff --git a/R/iris_plot.R b/R/iris_plot.R
diff --git a/R/misc.R b/R/misc.R
diff --git a/R/themes.R b/R/themes.R
diff --git a/_bookdown_draft.yml b/_bookdown_draft.yml
@@ -33,7 +33,7 @@ rmd_files: [
   "redundant_coding.Rmd", # completed
   "multi-panel_figures.Rmd", # completed
   "figure_titles_captions.Rmd", # completed
-  "balance_data_ink_ratio.Rmd", # completed
+  "balance_data_context.Rmd", # completed
   "small_axis_labels.Rmd", # completed
   "avoid_line_drawings.Rmd", # completed
   "no_3d.Rmd", # completed
@@ -44,10 +44,7 @@ rmd_files: [
   "telling_a_story.Rmd", # completed
   "annotated_bibliography.Rmd",
 
-  "outline.Rmd",
   "technical_notes.Rmd",
 
-  "notes.Rmd",
-
   "references.Rmd"
 ]
diff --git a/_bookdown_final.yml b/_bookdown_final.yml
@@ -33,7 +33,7 @@ rmd_files: [
   "redundant_coding.Rmd", # completed
   "multi-panel_figures.Rmd", # completed
   "figure_titles_captions.Rmd", # completed
-  "balance_data_ink_ratio.Rmd", # completed
+  "balance_data_context.Rmd", # completed
   "small_axis_labels.Rmd", # completed
   "avoid_line_drawings.Rmd", # completed
   "no_3d.Rmd", # completed
@@ -46,7 +46,5 @@ rmd_files: [
 
   "technical_notes.Rmd",
 
-  #"notes.Rmd",
-
   "references.Rmd"
 ]
diff --git a/balance_data_ink_ratio.Rmd → balance_data_context.Rmd b/balance_data_ink_ratio.Rmd → balance_data_context.Rmd
@@ -7,17 +7,17 @@ library(lubridate)
 library(ggrepel)
 ```
 
-# Balance the data-to-ink ratio {#balance-data-ink}
+# Balance the data and the context {#balance-data-context}
 
-We can broadly subdivide the graphical elements in any visualization into elements that represent data and elements that do not. The former are elements such as the points in a scatter plot, the bars in a histogram or barplot, or the shaded areas in a heatmap. The latter are elements such as plot axes, axis ticks and labels, axis titles, legends, and plot annotations. When designing a plot, it can be helpful to think about the amount of ink (Chapter \@ref(proportional-ink)) used to represent data and non-data elements. A common recommendation is to reduce the amount of non-data ink, and following this advice can often yield less cluttered and more elegant visualizations. At the same time, non-data ink provides context and visual structure, and overly minimizing such plot elements can result in figures that are difficult to read, confusing, or simply not that compelling.
+We can broadly subdivide the graphical elements in any visualization into elements that represent data and elements that do not. The former are elements such as the points in a scatter plot, the bars in a histogram or barplot, or the shaded areas in a heatmap. The latter are elements such as plot axes, axis ticks and labels, axis titles, legends, and plot annotations. These elements generally either provide context for the data and/or visual structure to the plot. When designing a plot, it can be helpful to think about the amount of ink (Chapter \@ref(proportional-ink)) used to represent data and context. A common recommendation is to reduce the amount of non-data ink, and following this advice can often yield less cluttered and more elegant visualizations. At the same time, context and visual structure are important, and overly minimizing such plot elements can result in figures that are difficult to read, confusing, or simply not that compelling.
 
-## Finding the appropriate data-to-ink ratio
+## Providing the appropriate amount of context
 
 The idea that distinguishing between data and non-data ink may be useful was popularized by Edward Tufte in his book "The Visual Display of Quantitative Information" [@TufteQuantDispl]. Tufte introduces the concept of the "data--ink ratio", which he defines as the "proportion of a graphic's ink devoted to the non-redundant display of data-information." He then writes (emphasis mine):
 
 > Maximize the data--ink ratio, *within reason.*
 
-I have emphasized the phrase "within reason" because it is critical and frequently forgotten. In fact, I think that Tufte himself forgets it in the remainder of his book, where he advocates overly minimalistic designs that, in my opinion, are neither elegant nor easy to decipher. If we interpret the phrase "maximize the data--ink ratio" to mean "remove clutter and strive for clean and elegant designs," then I think it is reasonable advice. But if we interpret it as "do everything you can to remove non-data ink" then it will result in poor design choices. If we go too far in either direction along the scale of possible data-to-ink ratios we will end up with ugly figures. However, away from the extremes there is wide range of designs with different data-to-ink ratios that are all acceptable and may be appropriate in different contexts.
+I have emphasized the phrase "within reason" because it is critical and frequently forgotten. In fact, I think that Tufte himself forgets it in the remainder of his book, where he advocates overly minimalistic designs that, in my opinion, are neither elegant nor easy to decipher. If we interpret the phrase "maximize the data--ink ratio" to mean "remove clutter and strive for clean and elegant designs," then I think it is reasonable advice. But if we interpret it as "do everything you can to remove non-data ink" then it will result in poor design choices. If we go too far in either direction we will end up with ugly figures. However, away from the extremes there is wide range of designs that are all acceptable and may be appropriate in different settings.
 
 To explore the extremes, let's consider a figure that clearly has too much non-data ink (Figure \@ref(fig:Aus-athletes-grid-bad)). The colored points in the plot panel (the framed center area containing data points) are data ink. Everything else is non-data ink. The non-data ink includes a frame around the entire figure, a frame around the plot panel, and a frame around the legend. None of these frames are needed. We also see a prominent and dense background grid that draws attention away from the actual data points. By removing the frames and minor grid lines and by drawing the major grid lines in a light gray, we arrive at Figure \@ref(fig:Aus-athletes-grid-good). In this version of the figure, the actual data points stand out much more clearly, and they are perceived as the most important component of the figure.
 
@@ -68,9 +68,9 @@ p_Aus_base +
   theme_dviz_grid()# + panel_border(colour = "grey70") + theme(axis.ticks = element_line(colour = "grey70"))
 ```
 
-At the other extreme of the data-to-ink-ratio scale, we might end up with a figure such as Figure \@ref(fig:Aus-athletes-min-bad), which is a minimalist version of Figure \@ref(fig:Aus-athletes-grid-good). In this figure, the axis tick labels and titles have been made so faint that they are hard to see. If we just glance at the figure we will not immediately perceive what data is actually shown. We only see points floating in space. Moreover, the legend annotations are so faint that the points in the legend could be mistaken for data points. This effect is amplified because there is no clear visual separation between the plot area and the legend. Notice how the background grid in Figure \@ref(fig:Aus-athletes-grid-good) both anchors the points in space and sets off the data area from the legend area. Both of these effects have been lost in Figure \@ref(fig:Aus-athletes-min-bad).
+At the other extreme, we might end up with a figure such as Figure \@ref(fig:Aus-athletes-min-bad), which is a minimalist version of Figure \@ref(fig:Aus-athletes-grid-good). In this figure, the axis tick labels and titles have been made so faint that they are hard to see. If we just glance at the figure we will not immediately perceive what data is actually shown. We only see points floating in space. Moreover, the legend annotations are so faint that the points in the legend could be mistaken for data points. This effect is amplified because there is no clear visual separation between the plot area and the legend. Notice how the background grid in Figure \@ref(fig:Aus-athletes-grid-good) both anchors the points in space and sets off the data area from the legend area. Both of these effects have been lost in Figure \@ref(fig:Aus-athletes-min-bad).
 
-(ref:Aus-athletes-min-bad) Percent body fat versus height in professional male Australian athletes. In this example, the concept of maximizing the data--ink ratio has been taken too far. The axis tick labels and title are too faint and are barely visible. The data points seem to float in space. The points in the legend are not sufficiently set off from the data points, and the casual observer might think they are part of the data. Data source: @Telford-Cunningham-1991
+(ref:Aus-athletes-min-bad) Percent body fat versus height in professional male Australian athletes. In this example, the concept of removing non-data ink has been taken too far. The axis tick labels and title are too faint and are barely visible. The data points seem to float in space. The points in the legend are not sufficiently set off from the data points, and the casual observer might think they are part of the data. Data source: @Telford-Cunningham-1991
 
 ```{r Aus-athletes-min-bad, fig.cap='(ref:Aus-athletes-min-bad)'}
 axis_col <- "grey70"

diff --git a/coordinate_systems_axes.Rmd b/coordinate_systems_axes.Rmd
@@ -199,6 +199,8 @@ plot_grid(plotlist[[1]], plotlist[[2]], plotlist[[3]], stamp_wrong(plotlist[[4]]
 
 ```
 
+Mathematically, there is no difference between plotting the log-transformed data on a linear scale or plotting the original data on a logarithmic scale (Figure \@ref(fig:linear-log-scales)). The only difference lies in the labeling for the individual axis ticks and for the axis as a whole. In most cases, the labeling for a logarithmic scale is preferable, because it places less mental burden on the reader to interpret the numbers shown as the axis tick labels. There is also less of a risk of confusion about the base of the logarithm. When working with log-transformed data, we can get confused about whether the data were transformed using the natural logarithm or the logarithm to base 10. And it's not uncommon for labeling to be ambiguous, e.g. "log(x)", which doesn't specify a base at all. I recommend that you always verify the base when working with log-transformed data. When plotting log-transformed data, always specify the base in the labeling of the axis.
+
 Because multiplication on a log scale looks like addition on a linear scale, log scales are the natural choice for any data that have been obtained by multiplication or division. In particular, ratios should generally be shown on a log scale. As an example, I have taken the number of inhabitants in each county in Texas and have divided it by the median number of inhabitants across all Texas counties. The resulting ratio is a number that can be larger or smaller than 1. A ratio of exactly 1 implies that the corresponding county has the median number of inhabitants. When visualizing these ratios on a log scale, we can see clearly that the population numbers in Texas counties are symmetrically distributed around the median, and that the most populous counties have over 100 times more inhabitants than the median while the least populous counties have over 100 times fewer inhabitants (Figure \@ref(fig:texas-counties-pop-ratio-log)). By contrast, for the same data, a linear scale obscures the differences between a county with median population number and a county with a much smaller population number than median (Figure \@ref(fig:texas-counties-pop-ratio-lin)).
 
 (ref:texas-counties-pop-ratio-log) Population numbers of Texas counties relative to their median value. Select counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to a county with median population number. The most populous counties have approximately 100 times more inhabitants than the median county, and the least populous counties have approximately 100 times fewer inhabitants than the median county. Data source: 2010 Decennial U.S. Census.

diff --git a/datasets/Australian_athletes/Sex,_sport,_and_body_size_dependency_of_hematology.4.pdf b/datasets/Australian_athletes/Sex,_sport,_and_body_size_dependency_of_hematology.4.pdf
diff --git a/datasets/Australian_athletes/scatter_plot.R b/datasets/Australian_athletes/scatter_plot.R
-Original file line number
+Diff line change
@@ Expand Up @@
     ```
+    Mathematically, there is no difference between plotting the log-transformed data on a linear scale or plotting the original data on a logarithmic scale (Figure \@ref(fig:linear-log-scales)). The only difference lies in the labeling for the individual axis ticks and for the axis as a whole. In most cases, the labeling for a logarithmic scale is preferable, because it places less mental burden on the reader to interpret the numbers shown as the axis tick labels. There is also less of a risk of confusion about the base of the logarithm. When working with log-transformed data, we can get confused about whether the data were transformed using the natural logarithm or the logarithm to base 10. And it's not uncommon for labeling to be ambiguous, e.g. "log(x)", which doesn't specify a base at all. I recommend that you always verify the base when working with log-transformed data. When plotting log-transformed data, always specify the base in the labeling of the axis.
     Because multiplication on a log scale looks like addition on a linear scale, log scales are the natural choice for any data that have been obtained by multiplication or division. In particular, ratios should generally be shown on a log scale. As an example, I have taken the number of inhabitants in each county in Texas and have divided it by the median number of inhabitants across all Texas counties. The resulting ratio is a number that can be larger or smaller than 1. A ratio of exactly 1 implies that the corresponding county has the median number of inhabitants. When visualizing these ratios on a log scale, we can see clearly that the population numbers in Texas counties are symmetrically distributed around the median, and that the most populous counties have over 100 times more inhabitants than the median while the least populous counties have over 100 times fewer inhabitants (Figure \@ref(fig:texas-counties-pop-ratio-log)). By contrast, for the same data, a linear scale obscures the differences between a county with median population number and a county with a much smaller population number than median (Figure \@ref(fig:texas-counties-pop-ratio-lin)).
     (ref:texas-counties-pop-ratio-log) Population numbers of Texas counties relative to their median value. Select counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to a county with median population number. The most populous counties have approximately 100 times more inhabitants than the median county, and the least populous counties have approximately 100 times fewer inhabitants than the median county. Data source: 2010 Decennial U.S. Census.
@@ Expand Down @@