Support simultaneous stacking and dodging by different variables in geom_col #6324

Closed
bakaburg1 opened this issue Feb 6, 2025 · 14 comments
Labels
feature a feature request or enhancement positions 🥇

Comments

@bakaburg1

I'd like to propose adding support for simultaneous stacking and dodging controlled by different variables in geom_col. Currently, this common visualization need requires workarounds that are both verbose and harder to maintain.

Current Limitation

When using geom_col, we can either stack or dodge bars based on a grouping variable, but not both at the same time using different variables. This makes it difficult to create visualizations where we want to:

  1. Stack bars by one categorical variable
  2. Dodge the resulting stacks by another categorical variable

Here's a reprex with counts from surveillance data, stratified by year, country, and surveillance protocol.

Minimal Reproducible Example

library(ggplot2)
library(dplyr)

# Sample data
df <- bind_rows(
    data.frame(
        year = rep(2016, 5),
        protocol = rep("M", 5),
        country = c("A", "B", "C", "D", "E"),
        freq = c(100, 50, 30, 40, 11)
    ),
    data.frame(
        year = rep(2016, 4),
        protocol = rep("L", 4),
        country = c("A", "B", "C", "D"),
        freq = c(23, 60, 200, 100)
    )
)

# Current workaround requires multiple geom_col calls
ggplot() +
    geom_col(
        data = df %>% filter(protocol == "M"),
        aes(x = year - .5, y = freq,
            fill = protocol, group = country),
        position = "stack",
        width = 0.4
    ) +
    geom_col(
        data = df %>% filter(protocol == "L"),
        aes(x = year + .5, y = freq,
            fill = protocol, group = country),
        position = "stack",
        width = 0.4
    )

Desired Behavior

Ideally, we would be able to specify both stacking and dodging variables in a single geom_col call, something like:

# Conceptual syntax (not working)
ggplot(df, aes(x = year, y = freq)) +
    geom_col(
        aes(fill = protocol, group = country),
        position = position_stackdodge(
            stack_by = "country",
            dodge_by = "protocol"
        )
    )

Use Cases

This functionality would be particularly useful for:

  • Comparing distributions across multiple categories
  • Visualizing nested hierarchical data
  • Creating more complex compositional charts without resorting to hacky solutions
  • Maintaining consistent spacing and positioning without manual x-axis adjustments

Benefits

  1. More intuitive API for common visualization needs
  2. Reduced code complexity
  3. Better maintainability
  4. Consistent positioning and spacing handled by ggplot2
  5. Easier integration with scales and themes
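
To make the request concrete, here is a minimal base R sketch of the coordinates such a position adjustment would need to produce: stack by country within each year and protocol, then dodge the resulting stacks by protocol. The helper name `stack_dodge_coords` and its `width` and `gap` arguments are hypothetical and are not part of ggplot2.

```r
# Sample data from the reprex above (2016 only)
df <- rbind(
  data.frame(year = 2016, protocol = "M",
             country = c("A", "B", "C", "D", "E"),
             freq = c(100, 50, 30, 40, 11)),
  data.frame(year = 2016, protocol = "L",
             country = c("A", "B", "C", "D"),
             freq = c(23, 60, 200, 100))
)

# Hypothetical helper (not a ggplot2 function): stack within each
# year x protocol bar, then dodge the bars by protocol.
stack_dodge_coords <- function(df, width = 0.4, gap = 0.1) {
  protocols <- sort(unique(as.character(df$protocol)))
  n <- length(protocols)

  # Centre offset of each dodged bar, symmetric around zero
  offsets <- (seq_len(n) - (n + 1) / 2) * (width + gap)
  names(offsets) <- protocols

  # Stack: cumulative sum of freq within each year x protocol bar
  key <- interaction(df$year, df$protocol, drop = TRUE)
  df$ymax <- ave(df$freq, key, FUN = cumsum)
  df$ymin <- df$ymax - df$freq

  # Dodge: shift each bar's centre by its protocol offset
  centre <- df$year + unname(offsets[as.character(df$protocol)])
  df$xmin <- centre - width / 2
  df$xmax <- centre + width / 2
  df
}

coords <- stack_dodge_coords(df)
coords
```

The resulting `xmin`, `xmax`, `ymin`, and `ymax` columns could be passed to `geom_rect()`, which is essentially what the two-layer workaround above does by hand.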
@teunbrand
Collaborator

teunbrand commented Feb 6, 2025

Thanks for the report! This request is similar to #2267, which was closed as unplanned.
I think one reason we've been reluctant to implement this is because it would break the API as position adjustments do not have the right authority to include variables (like stack_by and dodge_by) from the data.
However, because we implemented #6100, I think this limitation no longer holds and this suggestion no longer would break the API.
For these reasons, I think this should be possible, but I'm not yet convinced that it belongs to ggplot2 and not an extension package.

@teunbrand teunbrand added positions 🥇 feature a feature request or enhancement labels Feb 6, 2025
@bakaburg1
Author

Thank you!

In the meantime (with the great help of various AIs) I developed an ad hoc geom. I still think that a position_ function is more appropriate, since it could accommodate other geoms too (and I don't like the idea of a geom just for positioning), but I wasn't able to make one. Regarding whether to put it in ggplot2 or not, I would advise the first option. I was very surprised that this was not possible already; it's something one would expect out of the box!

GeomStackDodgeCol <- ggproto(
    "GeomStackDodgeCol", GeomRect,
    required_aes = c("x", "y", "fill", "group"),
    default_aes = aes(
        colour = "black",
        linewidth = 0.5,
        linetype = 1,
        alpha = NA
    ),
    
    setup_data = function(data, params) {
        # Reset stacking for each x value and fill group
        data <- data |>
            group_by(x, fill) |>
            mutate(
                ymin = c(0, head(cumsum(y), -1)),
                ymax = cumsum(y)
            ) |>
            ungroup()
        
        # Compute dodging offsets with width and padding
        fill_groups <- unique(data$fill)
        n_groups <- length(fill_groups)
        width <- params$width %||% 0.9     # width of the bars
        padding <- params$padding %||% 0.1  # padding between bars
        
        # Calculate total width needed for the group
        total_width <- n_groups * width + (n_groups - 1) * padding * width
        
        # Calculate positions with proper spacing
        positions <- seq(-total_width/2, total_width/2, length.out = n_groups)
        
        # Create rectangle coordinates
        data$xmin <- data$x + positions[match(data$fill, fill_groups)] - width/2
        data$xmax <- data$x + positions[match(data$fill, fill_groups)] + width/2
        
        data
    },
    
    draw_panel = function(data, panel_params, coord, width = 0.9, ...) {
        coords <- coord$transform(data, panel_params)
        
        grid::rectGrob(
            x = (coords$xmin + coords$xmax)/2,
            y = (coords$ymin + coords$ymax)/2,
            width = coords$xmax - coords$xmin,
            height = coords$ymax - coords$ymin,
            default.units = "native",
            just = c("center", "center"),
            gp = grid::gpar(
                col = coords$colour,
                fill = alpha(coords$fill, coords$alpha),
                lwd = coords$linewidth * .pt,
                lty = coords$linetype
            )
        )
    },
    
    parameters = function(complete = FALSE) {
        c("na.rm", "width", "padding")
    }
)

geom_stackdodge_col <- function(mapping = NULL, data = NULL,
                            position = "identity", 
                            width = 0.9,
                            padding = 0.1,
                            na.rm = FALSE,
                            show.legend = NA,
                            inherit.aes = TRUE, ...) {
    layer(
        geom = GeomStackDodgeCol,
        mapping = mapping,
        data = data,
        stat = "identity",
        position = position,
        show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = list(
            na.rm = na.rm,
            width = width,
            padding = padding
        )
    )
}
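
The dodge offsets computed in `setup_data()` can be checked in isolation. The sketch below copies that arithmetic into a standalone helper (the name `dodge_offsets` is hypothetical) and evaluates it with the `width = 0.1` and `padding = 0.5` values used in this thread's test call. It also suggests a possible edge case: with a single fill group, `seq(..., length.out = 1)` returns its `from` value, so the lone bar would not be centred on `x`.

```r
# Hypothetical helper mirroring the offset arithmetic in setup_data()
dodge_offsets <- function(n_groups, width = 0.9, padding = 0.1) {
  total_width <- n_groups * width + (n_groups - 1) * padding * width
  seq(-total_width / 2, total_width / 2, length.out = n_groups)
}

# Two fill groups: total width is 2 * 0.1 + 1 * 0.5 * 0.1 = 0.25,
# so the bar centres sit at -0.125 and +0.125 around x
two_groups <- dodge_offsets(2, width = 0.1, padding = 0.5)
two_groups

# One fill group: the single offset is -width / 2 rather than 0
one_group <- dodge_offsets(1, width = 0.1, padding = 0.5)
one_group
```

If the single-group behaviour matters, one option would be to return 0 when `n_groups` is 1; that is a suggestion, not something taken from the posted code.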

Of course, this still needs testing.

Here's some testing code:

local({
    df <- bind_rows(
        data.frame(
            year = rep(2016, 5),
            protocol = rep("M", 5),
            country = c("A", "B", "C", "D", "E"),
            freq = c(100, 50, 30, 40, 11) # sum is 231
        ),
        data.frame(
            year = rep(2016, 4),
            protocol = rep("L", 4),
            country = c("A", "B", "C", "D"),
            freq = c(23, 60, 200, 100) # sum is 383
        )
    )
    
   # Add more years
    df <- bind_rows(
        df,
        df |> mutate(year = 2017, freq = sample(freq)),
    )
    
    # Create summary data
    df_sum <- df |>
        summarise(
            label = paste(country, collapse = "\n"),
            freq = sum(freq),
            .by = c(year, protocol)
        )
    ggplot() +
        geom_stackdodge_col(
            data = df,
            aes(x = factor(year), y = freq, group = country,
                fill = protocol),
            width = 0.1, padding = 0.5
        ) +
        # Reference lines to show that the bars sum up to the expected values
        geom_hline(yintercept = c(sum(c(100, 50, 30, 40, 11)), sum(c(23, 60, 200, 100))))
})

Image

@clauswilke
Member

Regarding whether to put it in ggplot2 or not, I would advise the first option. I was very surprised that this was not possible already; it's something one would expect out of the box!

We have for many years now followed the philosophy that only the absolute core features are in ggplot2 itself and other, less commonly used features should go into extension packages. Maybe this would be a good fit for ggforce for example.

Also, while I'm of the opinion that everybody should be allowed and empowered to make any visualization they want, I find it difficult to think of a valid use case for this geom. I've never in my life thought "hm, I want to stack and dodge at the same time." This is definitely an obscure corner case, and I feel reasonably confident that any figure you make with this feature can be improved by removing one of the two position adjustments.

@bakaburg1
Author

Uhm, it's a pretty common scenario in epidemiology!

Should I cross post it to ggforce? Do they work also on position functions or only on geoms?

@clauswilke
Member

clauswilke commented Feb 6, 2025

Show me a figure that uses this feature and I'll tell you how to improve the figure.

@smouksassi

smouksassi commented Feb 6, 2025

Sorry, my reprex is crashing. This shows that you can do what you want using facets:

library(ggplot2)
library(dplyr)
library(patchwork)
# Sample data
df <- bind_rows(
    data.frame(
        year = rep(2016, 5),
        protocol = rep("M", 5),
        country = c("A", "B", "C", "D", "E"),
        freq = c(100, 50, 30, 40, 11)
    ),
    data.frame(
        year = rep(2016, 4),
        protocol = rep("L", 4),
        country = c("A", "B", "C", "D"),
        freq = c(23, 60, 200, 100)
    ),
    data.frame(
        year = rep(2017, 5),
        protocol = rep("M", 5),
        country = c("A", "B", "C", "D", "E"),
        freq = c(100, 50, 30, 40, 11)
    ),
    data.frame(
        year = rep(2017, 4),
        protocol = rep("L", 4),
        country = c("A", "B", "C", "D"),
        freq = c(23, 60, 200, 100)
    )
)

# Protocols on the x axis, one facet per year
a <- ggplot(data = df) +
    geom_col(
        aes(x = protocol, y = freq,
            fill = country, group = country),
        position = "stack",
        width = 0.4
    ) +
    scale_fill_viridis_d() +
    facet_grid(~year)

# Years on the x axis, one facet per protocol
b <- ggplot(data = df) +
    geom_col(
        aes(x = as.factor(year), y = freq,
            fill = country, group = country),
        position = "stack",
        width = 0.4
    ) +
    scale_fill_viridis_d() +
    facet_grid(~protocol)

a / b

@davidhodge931

I think stacking and dodging at the same time is useful. I've needed to do this in the past. I get by with hacking around using a combo of faceting, scale and theme adjustments. But it'd be awesome if a position_stackdodge function or similar was available to do this in a more elegant way
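
For readers unfamiliar with the faceting workaround mentioned here, the following is a minimal sketch of one common variant, assuming the sample data from earlier in the thread: put the would-be dodge variable on the x axis, facet by the original x variable, and use theme settings so that the facets read like groups on a single axis. The specific theme choices are illustrative, not a recommended recipe.

```r
library(ggplot2)

# Sample data as used earlier in the thread (2017 repeats 2016)
df <- data.frame(
  year = rep(c(2016, 2017), each = 9),
  protocol = rep(c(rep("M", 5), rep("L", 4)), times = 2),
  country = rep(c("A", "B", "C", "D", "E", "A", "B", "C", "D"), times = 2),
  freq = rep(c(100, 50, 30, 40, 11, 23, 60, 200, 100), times = 2)
)

# Stack by country, "dodge" by protocol via the x axis,
# and group by year via facets whose strips sit below the axis
p <- ggplot(df, aes(x = protocol, y = freq, fill = country)) +
  geom_col(position = "stack", width = 0.8) +
  facet_wrap(~year, strip.position = "bottom") +
  theme(
    strip.placement = "outside",        # strips below the axis labels
    panel.spacing.x = unit(0, "lines")  # remove the gap between panels
  )
p
```

This uses up the facets, which is the drawback raised in this thread: faceting by a further variable then needs yet another workaround.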

@smouksassi

Two things to consider are the default width of the bars and the ordering of the factors.
Here is what is currently possible, and what can be done using the code above:

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(patchwork)

GeomStackDodgeCol <- ggproto(
  "GeomStackDodgeCol", GeomRect,
  required_aes = c("x", "y", "fill", "group"),
  default_aes = aes(
    colour = "red",
    linewidth = 0.5,
    linetype = 1,
    alpha = NA
  ),
  
  setup_data = function(data, params) {
    # Reset stacking for each x value and fill group
    data <- data |>
      group_by(x, fill) |>
      mutate(
        ymin = c(0, head(cumsum(y), -1)),
        ymax = cumsum(y)
      ) |>
      ungroup()
    
    # Compute dodging offsets with width and padding
    fill_groups <- unique(data$fill)
    n_groups <- length(fill_groups)
    width <- params$width %||% 0.9     # width of the bars
    padding <- params$padding %||% 0.1  # padding between bars
    
    # Calculate total width needed for the group
    total_width <- n_groups * width + (n_groups - 1) * padding * width
    
    # Calculate positions with proper spacing
    positions <- seq(-total_width/2, total_width/2, length.out = n_groups)
    
    # Create rectangle coordinates
    data$xmin <- data$x + positions[match(data$fill, fill_groups)] - width/2
    data$xmax <- data$x + positions[match(data$fill, fill_groups)] + width/2
    
    data
  },
  
  draw_panel = function(data, panel_params, coord, width = 0.9, ...) {
    coords <- coord$transform(data, panel_params)
    
    grid::rectGrob(
      x = (coords$xmin + coords$xmax)/2,
      y = (coords$ymin + coords$ymax)/2,
      width = coords$xmax - coords$xmin,
      height = coords$ymax - coords$ymin,
      default.units = "native",
      just = c("center", "center"),
      gp = grid::gpar(
        col = coords$colour,
        fill = alpha(coords$fill, coords$alpha),
        lwd = coords$linewidth * .pt,
        lty = coords$linetype
      )
    )
  },
  
  parameters = function(complete = FALSE) {
    c("na.rm", "width", "padding")
  }
)

geom_stackdodge_col <- function(mapping = NULL, data = NULL,
                                position = "identity", 
                                width = 0.9,
                                padding = 0.1,
                                na.rm = FALSE,
                                show.legend = NA,
                                inherit.aes = TRUE, ...) {
  layer(
    geom = GeomStackDodgeCol,
    mapping = mapping,
    data = data,
    stat = "identity",
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = na.rm,
      width = width,
      padding = padding
    )
  )
}
    df <- bind_rows(
        data.frame(
            year = rep(2016, 5),
            protocol = rep("M", 5),
            country = c("A", "B", "C", "D", "E"),
            freq = c(100, 50, 30, 40, 11) # sum is 231
        ),
        data.frame(
            year = rep(2016, 4),
            protocol = rep("L", 4),
            country = c("A", "B", "C", "D"),
            freq = c(23, 60, 200, 100) # sum is 383
        )
    )
    df <- bind_rows(
        df,
        df |> mutate(year = 2017, freq = sample(freq)),
    )
    
    # Create summary data
    df_sum <- df |>
        summarise(
            label = paste(country, collapse = "\n"),
            freq = sum(freq),
            .by = c(year, protocol)
        )
    ggplot() +
        geom_stackdodge_col(
            data = df,
            aes(x = factor(year), y = freq, group = country,
                fill = protocol),
            width = 0.1, padding = 0.5
        )

                                  
  
    

    #Sample data
    df <- bind_rows(
      data.frame(
        year = rep(2016, 5),
        protocol = rep("M", 5),
        country = c("A", "B", "C", "D", "E"),
        freq = c(100, 50, 30, 40, 11)
      ),
      data.frame(
        year = rep(2016, 4),
        protocol = rep("L", 4),
        country = c("A", "B", "C", "D"),
        freq = c(23, 60, 200, 100)
      ),
      data.frame(
        year = rep(2017, 5),
        protocol = rep("M", 5),
        country = c("A", "B", "C", "D", "E"),
        freq = c(100, 50, 30, 40, 11)
      ),
      data.frame(
        year = rep(2017, 4),
        protocol = rep("L", 4),
        country = c("A", "B", "C", "D"),
        freq = c(23, 60, 200, 100)
      )
    )
    

    
    
    
    a<- ggplot(data = df) +
      geom_col(
        aes(x = protocol, y = freq,
            fill = country, group = country),
        position = "stack",
        width = 0.4
      ) +
      scale_fill_viridis_d()+
      facet_grid(~year)
    b <- ggplot(data = df) +
      geom_col(
        aes(x = as.factor(year) , y = freq,
            fill = country, group = country),
        position = "stack",
        width = 0.4
      ) +
      scale_fill_viridis_d()+
      facet_grid(~protocol)
    
    a2<- ggplot(data = df) +
      geom_col(
        aes(x = protocol, y = freq,
            fill = protocol, group = country),
        position = "stack",color="red",
        width = 0.4
      ) +
      scale_fill_viridis_d()+
      facet_grid(~year) 
  c <-    ggplot(data = df) +
    geom_stackdodge_col(
      aes(x = as.factor(year) , y = freq,
          fill = protocol, group = country),
      width = 0.1, padding = 0
    ) +
    scale_fill_viridis_d()
  a/a2/c + plot_layout(guides = "collect")

Created on 2025-02-11 with reprex v2.1.1


@teunbrand
Collaborator

Thanks everyone for the examples. I don't think implementation is the barrier for this issue, but Claus' remark below is:

We have for many years now followed the philosophy that only the absolute core features are in ggplot2 itself and other, less commonly used features should go into extension packages.

So the relevant question is whether simultaneously stacking and dodging is a core feature or not. On the one hand, I think (but am not wholly convinced) that this can be useful in some circumstances. On the other hand, I want to agree with Claus that there is most likely a better way to display the data than stacking and dodging.

@clauswilke
Member

I know we're somewhat offtopic now, but since the question is "is this a core feature" and not "should this be possible at all", I want to point out that stacking more than two categories is almost always bad, because it's usually impossible to actually compare the stacked data values. I discuss this in my book here: https://clauswilke.com/dataviz/visualizing-proportions.html

In addition, I'm not a fan of mixing a display of proportions (which you get by stacking) with a display of absolute values (which you get from bars that have different overall heights). It creates additional confusion in the viewer, as the absolute amount of something may increase from one condition to the other while the relative proportion goes down.

I generally discourage people from stacking unless they're dealing with a binary variable (male/female, success/failure, etc).
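
The point about absolute amounts and relative proportions moving in opposite directions can be illustrated with the sample data from this thread. The sketch below is plain arithmetic in base R; the variable names are illustrative.

```r
# Counts per country for each protocol, from the reprex data
freq_m <- c(A = 100, B = 50, C = 30, D = 40, E = 11)  # total 231
freq_l <- c(A = 23, B = 60, C = 200, D = 100)         # total 383

# Share of each country within its protocol
share_m <- freq_m / sum(freq_m)
share_l <- freq_l / sum(freq_l)

# Country B: the count rises from 50 (M) to 60 (L) ...
count_b <- c(M = freq_m[["B"]], L = freq_l[["B"]])
count_b

# ... while its share of the stack falls from about 21.6% to about 15.7%
share_b <- round(100 * c(M = share_m[["B"]], L = share_l[["B"]]), 1)
share_b
```

In a stacked bar with unequal totals, the viewer sees both quantities at once: B's segment is taller in the L bar but takes up a smaller fraction of it.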

@davidhodge931

An example below of a graph stacked and dodged copied from the internet.

In this one, the stacked type variable has heaps of values, but it could instead be a simple binary variable like Male/Female. It's also not super clear at present what the dodged stuff represents. But an alpha or pattern aesthetic could be used here.

You could do this instead by faceting - but then maybe you wanted to facet by a different variable anyway. You could potentially do this using patchwork. But everything gets hacky and difficult, compared to if there was a position adjustment for it

Image

@thomasp85
Member

The "can it be done" and "is it being done" questions have already been settled. The question is whether it should be in ggplot2 or in an extension package. ggplot2 is opinionated and, as with secondary axes, we sometimes steer away from "popular" approaches because they are flawed by design. I'm afraid this also falls into that category, which means that it will not end up in ggplot2 proper. However, we made the system extensible for a reason, so that you are not beholden to our pet peeves :-)

@thomasp85 thomasp85 closed this as not planned Feb 11, 2025
@szkabel

szkabel commented Feb 11, 2025

Hi All,
This seems to be an exciting discussion.
About a year ago I had a feature request for stratifying a column chart by a variable using shades. The plot was already dodged by colour. My solution was to extend the functionality of position_dodge.

Please see my pull-request: #6328
A bit more lengthy description here: https://rpubs.com/szkabel/dodgeStackDemo

I think that this is a needed feature as shown by the questions collected by @davidhodge931.

This was tested only for geom_col, but that seemed to be the most needed anyways. I also think it is more elegant than most of the above solutions. A minimal working example for the above case:

library(tidyverse)

base = bind_rows(
    tibble(count = c(10,9,26),type = factor(c(1:3))) %>% mutate(category = "A"),
    tibble(count = c(80,90,60),type = factor(c(1:3))) %>% mutate(category = "B") 
)
    
df = bind_rows(
  base %>% mutate(id = 1),
  base %>% mutate(id = 2),
  base %>% mutate(id = 3)
)

df %>% ggplot() + 
  aes(x = id,y = count, fill = type, alpha = category, group = category) +
  geom_col(position = position_dodge(stack_overlap = "by_extent")) +
  scale_alpha_manual(values = c("A" = 0.5, "B" = 1))

Image

I acknowledge that it doesn't yet work for the labels.
