grid-based transform limits for downsampling #9424

orbisvicis · 2024-09-02T16:13:09Z

Enhancement Description

Describe the feature's goal, motivating use cases, and its expected behavior.

I have 50,000 temporal observations and I wish to make them all available in an 800-pixel-wide plot of mark-type point so I can pan and zoom into any time period. I don't need to see all 50K points at once. In fact I don't want to because it slows vega-lite to a crawl. I'd like to limit the number of points visible at any given time to around 500 or width*2 (ideally). When I'm zoomed-out, 49.5K points are removed but I still get to see an overview. When I'm zoomed in sufficiently (ie, a day) no points are removed and I see all the data.

I tried applying a signal-based sample filter which averages to X points displayed: total_count_orig / (grid_count_orig / X). The problems is that the more points the transform removes, the slower vega-lite becomes. Significantly slower. Fully zoomed in, drawing 50K points was faster than sampling 50K points to 10 points. Panning those 10 points was excruciatingly slow.

I'm proposing a grid-based sampling filter. I believe the original sample was slow because it would constantly operate from the full dataset. I'm interested in a new sampler whose parameter describes the maximum number of data objects to include in the grid (current view). Any data outside would be dropped to make the transform more efficient.

That's the minimum requirements, but some additional features would be nice.

Transforms that reduce the data change the order of the data objects. It would be nice to re-insert the tuples in their original locations using an internal rowid key. I drop down into JS to log some statistics to a text mark, and constantly sorting the data has a performance penalty.

Sampling is not the ideal decimation algorithm. If there's anything you can think of to improve the sampling, let us know!

Here is the stack overflow session that inspired this issue: sample or density transform based on zoom domain. I've also tried to add my own transform, but that didn't work out: create a custom vega.transform in vega-lite.

Also in the meantime do you know of any vega optimization projects that implement this exact type of transform?

<!doctype html>
<html>
  <head>
    <title>Too Much Data</title>
    <meta charset="utf-8" />

    <script src="https://cdn.jsdelivr.net/npm/[email protected]/build/vega.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/[email protected]/build/vega-lite.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/[email protected]/build/vega-embed.js"></script>

    <style media="screen">
    h1 {
        text-align: center;
        font-family: Georgia, serif
    }
    #vis {
        width: 100%;
    }
    </style>
  </head>
  <body>
    <h1>Too Much Data</h1>
    <!-- Container for the visualization -->
    <div id="vis"></div>

    <script>
      // Assign the specification to a local variable vlSpec.
      data = [{"time": "2023-09-02 22:01:45", "data": 273.9798719763763, "category": 1}, {"time": "2023-09-02 22:03:14", "data": 71.63329574244004, "category": 0}, {"time": "2023-09-02 22:05:28", "data": 502.89069245514384, "category": 1}, {"time": "2023-09-02 22:05:32", "data": 995.3916608640706, "category": 1}, ...]

      // Assign the specification to a local variable vlSpec.
      var vlSpec =
        { $schema: "https://vega.github.io/schema/vega-lite/v5.json"
        , datasets:
          { source: data
          }
        , data: {name: "source"}
        , params:
          [ {name: "span", expr: "span(grid_time ? grid_time : [toDate(data('source')[0].time), toDate(peek(data('source')).time)])/1000"}
          //, {name: "samplespan", expr: "span/60/60/24 < 7 ? 10000 : 600"}
          //, {name: "samplespan", expr: "samplecount(span/60/60/24)"}
          //, {name: "samplespan", expr: "samplecount(span/60/60/24, grid_time, data('source'))"}
          , {name: "samplespan", expr: "samplecount(null, grid_time, data('source'))"}
          ]
        , transform:
          [ {filter: "datum.data > 0"}
          //, {filter: {param: "grid"}}
          //, {sample: 100000000}
          //, {sample: {signal: "width"}}
          , {sample: {signal: "samplespan"}}
          //, {window: [{op:"row_number", as: "rowidx"}]}
          //, {calculate: "datum.rowidx % 20", as: "bin"}
          //, {filter: "datum.bin > samplespan"}
          ]
        , title: "Too Much Data"
        , config: { font: "monospace" }
        , width: "container"
        , layer:
          [ { params:
              [ { name: "grid"
                , bind: "scales"
                , select:
                  { type: "interval"
                  , encodings: ["x"]
                  , on: "[mousedown[!event.shiftKey], mouseup] > mousemove"
                  , translate: "[mousedown[!event.shiftKey], mouseup] > mousemove!"
                  }
                }
              , { name: "brush"
                , select:
                  { type: "interval"
                  , encodings: ["x"]
                  , on: "[mousedown[event.shiftKey], mouseup] > mousemove"
                  , translate: "[mousedown[event.shiftKey], mouseup] > mousemove!"
                  }
                }
              ]
            , mark: "point"
            , encoding:
              { x: {field: "time", type: "temporal"}
              , y: {field: "data", type: "quantitative"}
              , color: {field: "category", type: "nominal"}
              }
            }
          , { data: {values: [{}]}
            , mark:
              { type: "rect"
              , fillOpacity: 0.5
              , stroke: "darkgrey"
              , strokeWidth: 2
              , fill: "white"
              }
            , encoding:
              { x: {value: 10}
              , y: {value: 10}
              , x2: {value: 186}
              , y2: {value: 108}
              }
            }
          , { data: {values: [{}]}
            , mark:
              { type: "text"
              , align: "left"
              , baseline: "top"
              // these are commented-out when evaluating performance
              //, text: "FOO"
              //, text: {expr: "statistics(grid_time, brush_time, data('source'), data('data_0'))"}
              , text: {expr: "statistics(grid_time, brush_time, data('source'), data('data_0'), span)"}
              }
              , encoding:
                { x: {value: 15}
                , y: {value: 15}
                }
            }
          ]
        }


      function bisect_left(xs, x, key=i=>i) {
        let lo=0;
        let hi=xs.length;
        let mid;
        while (lo < hi) {
          mid = Math.floor((lo + hi) / 2)
            if (x <= key(xs[mid]))
              hi = mid
            else
              lo = mid + 1
        }
        return lo
      }

      function bisect_right(xs, x, key=i=>i) {
        let lo=0;
        let hi=xs.length;
        let mid;
        while (lo < hi) {
          mid = Math.floor((lo + hi) / 2)
            if (x >= key(xs[mid]))
              lo = mid + 1
            else
              hi = mid
        }
        return lo
      }

      function statistics(grid_time, brush_time, data_orig, data_post, span) {
        let results, delta;

        // when I realized that reduction transforms shuffle the data:
        // later to be replaced by an O(n) iteration that collects all the
        // required in a single pass.
        data_post = data_post.sort(i=>new Date(i.time))
        data_post = data_post.sort((b,a)=>new Date(b.time) - new Date(a.time))

        if (!grid_time) {
          //grid_time = [data_post[0].time, data_post[data_post.length-1].time]
          grid_time = [data_orig[0], data_orig[data_orig.length-1]]
          grid_time = grid_time.map(i=>new Date(i.time).getTime())
        }
        
        results =
            { ...statistics_grid(grid_time)
            , ...statistics_brush(brush_time)
            , ...statistics_brush_data(brush_time, data_post)
            , ...statistics_grid_data(grid_time, data_orig, data_post)
            }
        delta = results.delta ? " over " + results.delta : ""
        return (
          [ "view_l " + results.view[0]
          , "view_r " + results.view[1]
          , "sel_l  " + results.selection[0]
          , "sel_r  " + results.selection[1]
          , "dat_l  " + results.data[0]
          , "dat_r  " + results.data[1]
          , "rate   " + results.count + delta
          , "vis    " + results.visible
          , "span   " + deltafmt(span) + "(" + span/60/60/24 + ")"
          ])
      }

      function statistics_grid(grid_time) {
        const fmt = i => datefmt(new Date(i));
        return {view: grid_time.map(fmt)}
      }

      function statistics_brush(brush_time) {
        let results = {selection: ["N/A", "N/A"], delta: ""};
        if (!brush_time)
          return results
        delta = deltafmt((brush_time[1] - brush_time[0]) / 1000)
        results.selection = brush_time.map(datefmt)
        results.delta = delta
        return results
      }

      function statistics_brush_data(brush_time, data) {
        let results = {data: ["N/A", "N/A"], count: 0};
        const fmt = i => datefmt(new Date(data[i].time));
        if (!brush_time)
          return results
        indexes = islice(...brush_time, data, i=>i.time)
        // no overlap
        if (indexes[0] > data.length || indexes[1] < 0)
          return results
        // no selection
        if (indexes[0] > indexes[1])
          return results
        results.data = indexes.map(fmt)
        results.count = indexes[1] - indexes[0] + 1
        return results
      }

      function statistics_grid_data(grid_time, data_orig, data_post) {
        let key_orig = i => (new Date(i.time));
        let key_post = i => (i.time);
        let indexes_orig = islice(...grid_time, data_orig, key_orig);
        let indexes_post = islice(...grid_time, data_post, key_post);
        //console.log(indexes_orig)
        //console.log(indexes_post)
        //console.log("orig: ", data_orig)
        //console.log("post: ", data_post)
        //console.log("grid: ", grid_time)
        let count_orig = indexes_orig[1] - indexes_orig[0] + 1;
        let count_post = indexes_post[1] - indexes_post[0] + 1;
          return {visible: `${count_post}/${count_orig} (${(count_post/count_orig*100).toFixed(2)})%`}
      }

      function islice(a, b, data, key) {
        ai = bisect_left(data, a, key)
        bi = bisect_right(data, b, key) - 1
        return [ai, bi]
      }

      function datefmt(d) {
        if (!(d instanceof Date))
          return d
        const pad = d => (d.toString().padStart(2, "0"));
        let year = d.getFullYear();
        let month = pad(d.getMonth());
        let day = pad(d.getDate());
        let hours = pad(d.getHours());
        let minutes = pad(d.getMinutes());
        let seconds = pad(d.getSeconds());
        return (year + "-" + month + "-" + day + " " +
          hours + ":" + minutes + ":" + seconds)
      }

      function deltafmt(s) {
        let seconds, minutes, hours, r;
        [s, seconds] = divmod(s, 60);
        [s, minutes] = divmod(s, 60);
        [s, hours] = divmod(s, 24);
        seconds = seconds.toString()
        minutes = minutes.toString().padStart(2, "0")
        hours = hours.toString().padStart(2, "0")
        r = `${hours}:${minutes}:${seconds}`
        if (s)
          r = `${s}d ${r}`
        return r
      }

      function divmod(x, y) {
        let r = x % y;
        let q = (x - r) / y;
        return [q, r].map(Math.trunc)
      }

      function samplecount(span, grid_time, data_orig) {
        // Sample based on a function of span
        //return (40000-600)/(1-365)*(span - 365) + 600
        //return 39400 * (1 - span/365)**100 + 600
        //return 39400 * (2**(1 - span/365)) + 600

        // Sample to an average of X (here, 400)
        if (!grid_time) {
          grid_time = [data_orig[0], data_orig[data_orig.length-1]]
          grid_time = grid_time.map(i=>new Date(i.time).getTime())
        }

        let key_orig = i => (new Date(i.time));
        let indexes_orig = islice(...grid_time, data_orig, key_orig);
        let count_orig = indexes_orig[1] - indexes_orig[0] + 1;

        //console.log(count_orig, count_orig / 400, 40000 / (count_orig / 400))
        return 40000 / (count_orig / 400)

        // Sample into zoom bins that match pre-transformed data bins
        //span_orig = [data_orig[0], data_orig[data_orig.length-1]]
        //span_orig = span_orig.map(i=>new Date(i.time).getTime())

        //if (!grid_time) {
        //  span_grid = span_orig
        //} else {
        //  span_grid = [grid_time[0], grid_time[grid_time.length-1]]
        //  span_grid = span_grid.map(i=>new Date(i))
        //}

        //count_orig = span_orig[1] - span_orig[0]
        //count_grid = span_grid[1] - span_grid[0]
        //ratio = (count_grid / count_orig)**(1/4)
        //bin = Math.trunc(ratio * 20)
        //if (bin >= 2)
        //  bin -= 2
        //console.log("bin: ", bin)
        //return bin
      }

      vega.expressionFunction("statistics", statistics)
      vega.expressionFunction("samplecount", samplecount)

      // My failed attempt to add a transform:

      // Insert a predetermined row only when the dataset is empty.
      // This can happen because aggregate does not understand empty sets,
      // i.e. "count" goes from 3,2,1,<empty dataset> rather than 3,2,1,0.
      class Default extends vega.Transform {
        Definition =
          { type: "Filter"
          , metadata: {changes: true}
          , params:
            [ { name: "values", type: "any", array: true }
            ]
          }

        constructor(params) {
          // todo: provide initial value
          // { count: 0, cache: null } ??
          super(null, params)
        }

        transform(params, pulse) {
          // rough draft (probably wrong):
          // 1. For each add, increment this.value.count
          // 2. For each rem, decrement this.value.count
          // 3. if this.value.count == 0 {
          //      if this.value.cache == null {
          //        // need to lookup null/undefined/==/===
          //        // need to console.log some tuples to learn the
          //        // expected ingest format
          //        this.value.cache = ingest(params.values)
          //      }
          //      pulse.add.push(this.value.cache)
          //    } else {
          //      // pretty sure I have to remove the default, i.e. will
          //      // persist during subsequent pulses
          //      pulse.rem.push(this.value.cache)
          //    }
          // 4. Handle params.modified() ...
          //    reset count to 0 (??) - before step 3, then after step 3:
          //    pulse.visit(pulse.REFLOW, ...) // anything except ADD/REM/MOD
          //    For each, increment this.value.count
          // Will these count strategies actually work? No idea. Need to
          // console.log all changes to count.
          // ???
          const out = pulse.fork(pulse.ALL)
          pulse.visit(pulse.ADD, t => {
            console.log(t)
          })

          return out
        }
      }

      // problem: need to convert the vega interface to the vega-lite interface.
      vega.transforms["default"] = Default

      // Embed the visualization in the container with id `vis`
      vegaEmbed('#vis', vlSpec).then(function(result) {
      // Access the Vega view instance as result.view
      // (https://vega.github.io/vega/docs/api/view/)
      }).catch(console.error);
    </script>
  </body>
</html>

Checklist

[ x] I checked for duplicate issues.

The text was updated successfully, but these errors were encountered:

mattijn · 2024-09-02T17:24:02Z

Cross-ref of an example of an applied thinning algorithm, vega/vegafusion#350 (comment). It’s in Python, but it should be possible to do similar in JS.

orbisvicis · 2024-09-02T19:47:53Z

M4 is the exact sort of thing I am looking for. I skimmed the JS implementation and it seems simple and fast. Even better the data isn't required to be sorted, though if it is the three loops required can be reduced to just one loop. The Vega dataflow model of add/rem/mod makes it difficult to enforce ordering for transforms, which is why I'm excited about M4.

Aside: How would could ordering be enforced with add/rem/mod? We only care about ordering on inserts to the add array, I think(??). With pan that would be to the beginning or end of that array, but with zoom or sampling it could be anywhere. This excludes insertions based on an internal auto-incrementing row index. The best approach would be to maintain an ordered stack of removals used to play back inserts in reverse. The transform would maintain an ordered history array of Vega tuple IDs (basically an internal auto-incrementing row index) as well as a map from tuple IDs to tuples. On a pulse the map and add-array would be inner-joined on the history list (ordered by the history list), and we'd walk over the result in reverse.

I see that you used JupyterChart to replace the entire chart on changes to the width signal. How is the performance of that? I'm hesitant to replace the chart or backing source array for interactive real-time usage of downsampling while scrolling and panning. Are pulses dropped from the Vega dataflow during congestion? If so it would be safe to run M4 on every little change, but I doubt it. If not we must define a threshold range, and only run M4 of the number of visible points falls above or below that threshold.

The simplest path towards integration would be to first merge all the data from add/mod/rem, and then run M4 on the result. I assume that REFLOW +ADD -REM (plus the MOD updates) constitutes the visible data and nothing more. If not and the Vega dataflow calculates transforms for the entire backing source array, we're in trouble. I'm not familiar with Vega but I really hope not.

What's nice about M4 is that it can likely be adapted for incremental usage, ie as a moving window. Rather than applied to the entire visible data, I think it can be applied independently to the ADD and REM arrays (where MOD contributes to both) and used to update the 4 internal accumulated statistics - the "M4".

The only alternative to replacing the chart (or data) would be to implement M4 as a transformation. Unfortunately, I seem to be hitting a wall as vega-lite refuses to recognize and load my custom vega transforms. Until then my hands are tied.

mattijn · 2024-09-04T08:19:35Z

I see that you used JupyterChart to replace the entire chart on changes to the width signal. How is the performance of that?

The reference I included was a test how this could be integrated within VegaFusion. In the past, we have 'good-enough' experiences by setting up params (signals in Vega) in chart specification and using the Vega View API to replace the data in the chart on-the-fly. A dedicated backend or WASM engine is then adopted to execute the custom queries and only return the data that fits within the current chart-view. We used like a max of showing 5K datapoints in the chart. Btw, a JupyterChart object can interact with the Vega View API within Python, if the latter is not included in your stack, then the JupyterChart feature is noise here.

domoritz · 2024-09-05T15:16:27Z

I love the ideas here. Do you think we would need a new transform in Vega or can this be done entirely in Vega-Lite (by generating specific Vega)?

orbisvicis added the Enhancement 🎉 label Sep 2, 2024

orbisvicis changed the title ~~grid-based transform limits~~ grid-based transform limits for downsampling Sep 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grid-based transform limits for downsampling #9424

grid-based transform limits for downsampling #9424

orbisvicis commented Sep 2, 2024

mattijn commented Sep 2, 2024

orbisvicis commented Sep 2, 2024 •

edited

Loading

mattijn commented Sep 4, 2024

domoritz commented Sep 5, 2024

grid-based transform limits for downsampling #9424

grid-based transform limits for downsampling #9424

Comments

orbisvicis commented Sep 2, 2024

Enhancement Description

Checklist

mattijn commented Sep 2, 2024

orbisvicis commented Sep 2, 2024 • edited Loading

mattijn commented Sep 4, 2024

domoritz commented Sep 5, 2024

orbisvicis commented Sep 2, 2024 •

edited

Loading