
Directly support HoloViews-style inspect operations #1126

Open
jbednar opened this issue Sep 26, 2022 · 12 comments

@jbednar
Member

jbednar commented Sep 26, 2022

When Datashader renders a large dataset, a human being is usually able to see patterns and interesting datapoints that merit further investigation. Unfortunately, the rendered image does not provide any easy means of doing so, as the original datapoints have all been reduced to pixels (or more accurately, to scalar accumulated values in bins of a 2D histogram). To support investigation of interesting features, HoloViews implements a series of "inspect" operations that query the original dataset after a selection or hover event on the rasterized data. E.g. inspect_points in https://examples.pyviz.org/ship_traffic will query the original dataset to show hover and other information about the original datapoints being visualized. However, going back to the original dataset is quite slow, because it requires traversing either the entire dataset or (for a spatially indexed data structure) at least a chunk of it, which makes the interface unpleasant and awkward and effectively rules out certain types of interactivity.

Datashader can collect multiple aggregations on a single pass through the data, so I suggest that we support an accumulation mode that gathers datapoint indexes rather than datapoints, so that hover and drilldown information can be supported instantaneously. Of course, arbitrarily many datapoints can be aggregated into a single pixel, while any practical aggregation can only accumulate a fixed number of indexes per pixel. Still, that's already how the inspect_ operations work; they discard all but a configurable number of results, which is fine for linking to one or two examples per pixel, and allows single-datapoint precision with enough zooming in. By default I'd suggest accumulating the index of the minimum and the maximum value per pixel, but even just keeping the first or last datapoint for that pixel would be useful.

If we keep at least three datapoints per pixel (e.g. min, max, and one other) we'd be able to distinguish between complete and incomplete inspection data for that pixel (i.e. are these the only points? Yes, if there are 2 or fewer; unclear otherwise). Seems to me that we should be able to have a fully responsive, fully inspectable rendering of a dataset at low computational and memory cost using this method.
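
For illustration, here is a minimal NumPy sketch of the idea (not Datashader code; the data and column names are made up): a single pass over the data that records, per pixel, the index of the datapoint with the largest value, which a hover or drilldown handler could later use to look up the original row directly.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(dict(x=rng.random(1000), y=rng.random(1000), value=rng.random(1000)))

w = h = 10                                    # raster size in pixels
col = np.minimum((df.x * w).astype(int), w - 1)
row = np.minimum((df.y * h).astype(int), h - 1)

best_value = np.full((h, w), -np.inf)         # running max per pixel
best_index = np.full((h, w), -1)              # -1 means "no datapoint in this pixel"
for i, (r, c, v) in enumerate(zip(row, col, df.value)):
    if v > best_value[r, c]:
        best_value[r, c] = v
        best_index[r, c] = i

# best_index[r, c] now gives a row to inspect, e.g. df.iloc[best_index[r, c]]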

@jbednar jbednar added this to the wishlist milestone Sep 26, 2022
@jlstevens
Collaborator

After some discussion, I think we agreed that a first and last index (per bin) aggregate makes sense and that a where aggregator (e.g. 'show the ship with the highest tonnage value that contributed to this pixel') would be nice too.

The only other point I think is important is that you need to know the count, because that gives you the context to know whether the 'first' or 'last' index is unique or just an arbitrary sample of the points that contributed to the pixel value (ignoring the possibility that it is meaningful due to sorting, e.g. by time).

@jbednar
Member Author

jbednar commented Sep 28, 2022

@philippjfr suggests implementing a where aggregator that returns some column's value given an aggregator that's applied to some other column, e.g. where(max('value'), 'index'). That way a user can define which samples are kept.

I believe this syntax could support an n argument, retaining the top n values (e.g. the datapoints with the n largest values encountered). It will be important to clearly indicate in the documentation the conditions under which it returns just n arbitrary samples versus the top n along a well-defined measure. The default for plotting purposes would probably need to be arbitrary, since counts are plotted by default and counts don't establish any ordering between datapoints. In that case a single datapoint is probably the most reasonable default (one exemplar per pixel), i.e. a default like where(..., 'index', n=1).
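
A sketch of how that proposed syntax might read (illustrative only: df, canvas, and the 'index' column are invented here, and the n argument is the suggestion above rather than an implemented signature):

import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(x=[0.0, 0.5, 1.0], y=[0.0, 0.5, 1.0],
                       value=[10.0, 30.0, 20.0], index=[0, 1, 2]))
canvas = ds.Canvas(plot_width=3, plot_height=3)

# Keep, per pixel, the 'index' of the datapoint with the largest 'value'
agg = canvas.points(df, 'x', 'y', agg=ds.where(ds.max('value'), 'index'))

# Proposed n argument: keep the indexes of the top-3 such datapoints per pixel
# (later in this thread the same idea is expressed as where(max_n('value', n=3), ...))
# agg3 = canvas.points(df, 'x', 'y', agg=ds.where(ds.max('value'), 'index', n=3))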

@ianthomas23
Member

Some of this can already be done in datashader, e.g.

import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 0, 1, 0], y=[0, 0, 1, 1, 0], myindex=[4, 5, 6, 7, 8]))
canvas = ds.Canvas(3, 3)
# Collect a count plus the first and last 'myindex' value per pixel in a single pass
agg = canvas.line(
    df, "x", "y",
    agg=ds.summary(count=ds.count(), first=ds.first("myindex"), last=ds.last("myindex")),
)

which produces

<xarray.Dataset>
Dimensions:  (x: 3, y: 3)
Coordinates:
  * x        (x) float64 0.1667 0.5 0.8333
  * y        (y) float64 0.1667 0.5 0.8333
Data variables:
    count    (y, x) uint32 2 1 1 0 2 0 1 1 1
    first    (y, x) float64 4.0 4.0 4.0 nan 5.0 nan 5.0 6.0 6.0
    last     (y, x) float64 7.0 4.0 4.0 nan 7.0 nan 5.0 6.0 6.0

and you can read individual variables using agg['first'] or similar. Note that I have manually added the myindex column to the DataFrame, and ds.first and ds.last always return floats.
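
To sketch how those per-pixel values can drive a lookup back into the original data (continuing the example above; NaN marks pixels that received no datapoints):

import numpy as np

first = agg["first"].values        # per-pixel 'myindex' of the first datapoint (float, NaN if empty)
iy, ix = 0, 0                      # some pixel of interest, e.g. from a hover event
idx = first[iy, ix]
if not np.isnan(idx):
    print(df[df["myindex"] == int(idx)])   # recover the original row(s)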

Longer-term ideas like where(max('value'), 'myindex') require some infrastructure changes, because they need two reductions to interact on a per-pixel basis, which is not currently supported; all current reductions are independent.

Eventually that could lead to where(max_n('value', n=3), 'myindex'). We would first need max_n as a standalone reduction, which has to write to a 3D array of shape (ny, nx, n); this also requires infrastructure changes.

I am hoping that the example above is sufficient to start implementing support for this in holoviews. That should give me time to work on a refactor of the canvas/reduction code in datashader to make adding the new reductions much easier.

@ianthomas23 ianthomas23 modified the milestones: wishlist, v0.14.3 Oct 24, 2022
@ianthomas23
Member

Possible API for where reduction:

where(selector: Reduction, lookup: str | None = None)

(although I have just made up the names selector and lookup and they can easily change).

If the user specifies a string for lookup, it is the name of a column that must already be in the DataFrame, and its values are what is returned to the user, selected per pixel by the selector. If lookup is None, Datashader uses the index of the row in the DataFrame instead.
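
A usage sketch under that proposed signature (the DataFrame and column names here are invented for illustration):

import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(x=[0.0, 0.5, 1.0], y=[0.0, 0.5, 1.0],
                       tonnage=[10.0, 30.0, 20.0], ship_id=[101, 102, 103]))
canvas = ds.Canvas(plot_width=3, plot_height=3)

# lookup given as a column name: per pixel, return df['ship_id'] for the row
# chosen by the selector max('tonnage')
agg_ids = canvas.points(df, 'x', 'y', agg=ds.where(ds.max('tonnage'), 'ship_id'))

# lookup omitted (None): per pixel, return that row's index in the DataFrame instead
agg_rows = canvas.points(df, 'x', 'y', agg=ds.where(ds.max('tonnage')))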

@jbednar
Member Author

jbednar commented Apr 24, 2023

@hoxbro , @jlstevens , @ianthomas23, @mattpap , thanks for your recent work making this closer to reality! Can you please chime in here with the remaining tasks involved? What I am aware of:

  • Ian: first/last on Datashader using Dask (datashader#1182)
  • Mateusz: Small issue with Bokeh hover? (Need a new issue.)
  • Jean-Luc: custom hovertool support in HoloViews (Need an issue)
  • All: Work out good defaults for HoloViews that give hover information approximating what the pre-datashaded Bokeh plot would include.

@jlstevens
Collaborator

jlstevens commented Apr 24, 2023

I think that is a good summary of what is needed.

For the Bokeh hover tool, my understanding was that the necessary changes would be fairly straightforward to implement but that some API changes/additions are also needed. @mattpap can correct me if I am wrong!

@ianthomas23
Member

Datashader: what you have at the moment is support for max, max_n, min and min_n reductions on CPU, GPU and Dask, both on their own and within a where reduction. Still needed are:

In HoloViews I don't think there is built-in support for calling Bokeh's categorical colormapping or Datashader's where reduction yet, but this probably needs @hoxbro to confirm?
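
As a sketch of the reductions described above (reusing the style of the earlier toy example; 'value' and 'myindex' are invented columns):

import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 0, 1, 0], y=[0, 0, 1, 1, 0],
                       value=[1.0, 2.0, 3.0, 4.0, 5.0], myindex=[4, 5, 6, 7, 8]))
canvas = ds.Canvas(plot_width=3, plot_height=3)

# max_n on its own: per pixel, the 3 largest 'value's, written to a (ny, nx, 3) array
agg_top3 = canvas.points(df, 'x', 'y', agg=ds.max_n('value', n=3))

# max_n within where: per pixel, the 'myindex' values of those same 3 datapoints
agg_idx3 = canvas.points(df, 'x', 'y', agg=ds.where(ds.max_n('value', n=3), 'myindex'))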

@mattpap
Contributor

mattpap commented Apr 24, 2023

For the Bokeh hover tool, my understanding was that the necessary changes would be fairly straightforward to implement but that some API changes/additions are also needed. @mattpap can correct me if I am wrong!

If this is what we discussed last week, then it requires some changes to make referencing custom formatters more robust (and hopefully allow us to deprecate HoverTool.formatters).

@jbednar
Member Author

jbednar commented Apr 25, 2023

Ok, please open the appropriate issues and then link back here! Thanks.

@mattpap
Contributor

mattpap commented Apr 25, 2023

I actually found a way to work around limitations related to referencing custom formatters. Consider this example (based on bokeh's examples/plotting/customjs_hover.py):

from bokeh.models import CustomJSHover, HoverTool
from bokeh.plotting import figure, show

# range bounds supplied in web mercator coordinates
p = figure(
    x_range=(-2000000, 6000000), y_range=(-1000000, 7000000),
    x_axis_type="mercator", y_axis_type="mercator",
)
p.add_tile("CartoDB Positron")

p.circle(x=[0, 2000000, 4000000], y=[4000000, 2000000, 0], size=30)

formatter = CustomJSHover(code="""
    const projections = Bokeh.require("core/util/projections")
    const {x, y} = special_vars
    const coords = projections.wgs84_mercator.invert(x, y)
    const dim = format == "x" ? 0 : 1
    return coords[dim].toFixed(2)
""")

p.add_tools(HoverTool(
    tooltips=[
        ("lon", "$x{x}"),
        ("lat", "$y{y}"),
    ],
    formatters={
        "$x": formatter,
        "$y": formatter,
    },
))

show(p)

The contents of {} can be anything except empty, and custom has no intrinsic meaning (in fact it is not referenced in the implementation at all), so you can use it to enumerate possible implementations of a custom formatter. This translates nicely to the example @hoxbro sent me. Note that I would consider this a bit of an abuse of the API.
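
For example, a single CustomJSHover can branch on whatever string is placed inside the braces of the tooltip spec (a sketch building on the example above; the labels lon/lat are arbitrary names chosen in the tooltips):

from bokeh.models import CustomJSHover

# The string inside {} in the tooltip spec arrives in the JS code as 'format',
# so one formatter object can provide several named behaviours.
formatter = CustomJSHover(code="""
    const projections = Bokeh.require("core/util/projections")
    const {x, y} = special_vars
    const coords = projections.wgs84_mercator.invert(x, y)
    switch (format) {
        case "lon": return coords[0].toFixed(2)
        case "lat": return coords[1].toFixed(2)
        default:    return value.toString()
    }
""")

# used with e.g. tooltips=[("lon", "$x{lon}"), ("lat", "$y{lat}")] and
# formatters={"$x": formatter, "$y": formatter}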

@hoxbro
Member

hoxbro commented Apr 25, 2023

Thank you @mattpap. Got it to work with your example.

I assume you would still want to make custom formatters more robust?

@philippjfr
Member

@jbednar I'd say we close this. We have other issues to actually leverage the new aggregates for inspection purposes in the other repos and afaik where along with <agg>_n covers everything we need out of datashader.
