Prefilter the tablechunks if there are relevant where statements #1
-
@lothar7 I'd be interested in getting a few more details on what you have in mind. If you could give some background (what is "very large", what format is the data, is the assumption that the data is pre-sorted in time order, etc.?) that would be very helpful. Kusto-loco should already be significantly more performant than the original implementation for these kinds of queries since it uses shared index tables into filtered rows to reduce copy overhead and memory allocation. However, I'm assuming you're suggesting an even higher-level optimisation based on something like looking at the "span" of a chunk in time and simply ignoring it if it's not relevant. Unfortunately, expressions are evaluated for filtering only after a chunk is loaded, so it's difficult to avoid this step without introducing additional "magic" (e.g. we add a flag to a column to provide hints to short-circuit per-row evaluation). Nevertheless, we use a lot of time-series data ourselves so I'm definitely interested in improving this aspect.
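Roughly the kind of pruning pass I mean (all names here are invented for illustration, not the current KustoLoco API): each chunk optionally carries a min/max time hint, and a pre-pass drops non-overlapping chunks before any per-row evaluation happens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical span hint a chunk could carry as the "magic" flag above.
public record ChunkSpan(DateTime Min, DateTime Max);
public record TableChunk(string Id, ChunkSpan? TimeSpanHint);

public static class ChunkPruner
{
    // Keep a chunk if its span overlaps the query window; chunks without
    // a hint must be kept and filtered row-by-row as they are today.
    public static IEnumerable<TableChunk> Prune(
        IEnumerable<TableChunk> chunks, DateTime from, DateTime to) =>
        chunks.Where(c => c.TimeSpanHint is null ||
                          (c.TimeSpanHint.Min <= to && c.TimeSpanHint.Max >= from));
}
```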
-
I have a timeseries storage system that can store large amounts of data (1-10 TB+) in roughly 1-2 GB files. These files contain 1K pages. All data is indexed by time and timeseries id, and each timeseries is sorted by time inside the files, so each file has a known start time/end time/duration. Each timeseries is stored in a set of consecutive pages inside the file. Each archive file may contain 100,000+ timeseries.

Anyhow - my plan was to create a "virtual" table in Kusto based directly on this raw data, or to do some low-level pre-aggregates first so we get shared timestamps for all timeseries and use that as the basis for a virtual table. Each archive file would match pretty well with a tablechunk (or, if using parquet files in the future, each rowgroup could be a tablechunk).

Now comes the tricky part. Most queries would have to include a where/filter statement specifying the timerange(s) and/or timestamp(s). Since the data spans months and years over hundreds or thousands of files, there is an obvious need to filter the tablechunks so only the relevant ones are scanned and the rest are ignored. Since each archive file has a time index, it would also be good to use that to filter the pages and ignore the ones that are not relevant. This is also important when asking for data at a specific timestamp.

One way to solve this would be to find a way of preparing/filtering the data/tablechunks by running the time filtering before running the query itself. But since a typical datetime filter expression is just like any other expression, it's hard to find the actual resulting timeranges/timestamps beforehand. Perhaps the Kusto engine could provide some kind of hook somewhere that provides the relevant time expression for the time column? At this stage I don't really know the best way, to be honest.
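To make that concrete, here is a rough model of the layout (the names and shapes are my guesses, not the actual format): each archive file knows its overall time range, and a per-file index maps each timeseries id to its run of consecutive pages, so pruning can happen at both file and page level.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Rough model of the archive format described above (illustrative names only).
public record PageRange(int FirstPage, int PageCount);

public record ArchiveFile(
    string Path,
    DateTime StartTime,
    DateTime EndTime,
    // Per-file time index: timeseries id -> its run of consecutive pages.
    IReadOnlyDictionary<long, PageRange> SeriesIndex);

public static class ArchivePreFilter
{
    // File-level pruning: only archives overlapping the query window survive.
    public static IEnumerable<ArchiveFile> SelectFiles(
        IEnumerable<ArchiveFile> files, DateTime from, DateTime to) =>
        files.Where(f => f.StartTime <= to && f.EndTime >= from);

    // Page-level pruning: within a surviving file, only read the pages
    // that actually hold the requested series.
    public static PageRange? PagesFor(ArchiveFile file, long seriesId) =>
        file.SeriesIndex.TryGetValue(seriesId, out var range) ? range : null;
}
```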
-
Thanks - that's very interesting. I've done quite a bit of work to add an "indirection" layer for chunks/columns. One advantage of this is that in principle it 1) allows loading of chunk data to be deferred until it really must be used and 2) allows for significant compression in the case where a chunk represents a continuous set of rows or is filtered out completely. It may be possible to extend the chunk definition to indicate an "index" column which is assumed to be sorted and which initially holds a start/end span. So the way I could imagine this working would be....
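As a very rough sketch of the shape I have in mind (invented names; the real indirection layer differs): the chunk exposes its start/end span immediately, but the column data is only materialised, once, when something actually asks for it.

```csharp
using System;

// Sketch of an "empty shell" chunk: span metadata is available up front
// so the chunk can be pruned cheaply; data is loaded only on first access.
public sealed class DeferredChunk
{
    private readonly Lazy<double[]> _data;

    public DateTime SpanStart { get; }
    public DateTime SpanEnd { get; }

    public DeferredChunk(DateTime start, DateTime end, Func<double[]> loader)
    {
        SpanStart = start;
        SpanEnd = end;
        _data = new Lazy<double[]>(loader);
    }

    // Expensive: triggers the actual file/page read on first use.
    public double[] Data => _data.Value;
}
```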
So the net effect is that you end up with a chunk for each file (page?) in your storage layer, but these are just empty shells until you need the data, and they can be filtered out very efficiently. Anyway, I'll put some more thought into this over the holidays; there's a bunch of other infrastructure work I need to get done first :-)
-
Some more thoughts on this (as much so I don't forget them as anything else).

**Assumptions**
For example, suppose we have time-series data for items that have a constant colour and size but varying price and sales count. So the data might look something like this (illustrative values):

| File | MON | Colour | Size | Timestamp | Price | Sales |
|------|-----|--------|------|-----------|-------|-------|
| f1 | 2024-01 | red | 10 | 2024-01-02 09:00 | 5.20 | 3 |
| f1 | 2024-01 | red | 10 | 2024-01-02 10:00 | 5.25 | 7 |
| f2 | 2024-02 | red | 10 | 2024-02-01 09:00 | 5.40 | 2 |
For the sake of clarity I have made partitions explicit columns in the file table but clearly this redundancy can be removed.
It may also be seen as desirable to perform the partition filtering/selection as part of the query (see discussion later).

**Mechanisms**

KustoLoco already has a "SingleValueColumn" which is simply a virtual mapping of N rows onto a single value. There is not yet a concept of a deferred column for which the data is loaded only at the point of use, so this would need to be introduced. A custom ITableSource would need to create a chunk for each file in the dataset with:

- a single-value column for each partition key (e.g. MON, Colour, Size)
- deferred columns for the real data (e.g. Price, Sales) that are only loaded at the point of use
In the example above, the columns for the first chunk might look like this:
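As a sketch, with hypothetical types (the real column classes will differ):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical column abstractions for illustration.
public interface IColumn { }

// N logical rows all mapping onto one value (in the spirit of
// KustoLoco's SingleValueColumn).
public sealed record SingleValueColumn<T>(T Value, int RowCount) : IColumn;

// Data fetched from storage only at the point of use.
public sealed record DeferredColumn<T>(Func<T[]> Loader) : IColumn;

public static class ExampleChunk
{
    public static Dictionary<string, IColumn> BuildFirstChunk(int rowCount) => new()
    {
        // Partition columns: one value for the whole chunk, trivially filterable.
        ["MON"]    = new SingleValueColumn<DateTime>(new DateTime(2024, 1, 1), rowCount),
        ["Colour"] = new SingleValueColumn<string>("red", rowCount),
        ["Size"]   = new SingleValueColumn<int>(10, rowCount),
        // Real data: only loaded if the chunk survives partition filtering.
        ["Price"]  = new DeferredColumn<double>(LoadPrices),
        ["Sales"]  = new DeferredColumn<long>(LoadSales),
    };

    private static double[] LoadPrices() => Array.Empty<double>(); // stand-in for file read
    private static long[] LoadSales() => Array.Empty<long>();      // stand-in for file read
}
```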
This would allow queries such as `data | where MON == datetime(2024-01-01) | where Price > 5`, where the first term can be evaluated against the single-value columns almost for free.

A further enhancement to avoid redundant scanning of single-value columns might be possible by modifying the lambda in GetScalarImplementation to check whether all supplied arguments are single-value and, if so, performing a single evaluation rather than one per row. This could potentially make initial filtering much faster.

Handling cases such as a filter where the partition column is buried inside a scalar expression, or worse, combined with a real data column, is harder since knowledge of "single-valuedness" needs to be pushed down to all operators. There is some scope in the case of logical operators for reordering arguments so that the single-value columns are evaluated first. If we get a short-circuit result by looking at the sv columns for the first row, we clearly never need to evaluate the later "real data" columns.
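A sketch of that reordering idea (invented shapes, not the real evaluator): sort the and-ed terms so those touching only single-value columns run first, and let ordinary short-circuiting stop before any real-data term is evaluated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical predicate over a chunk: cheap if it only touches
// single-value columns, expensive if it needs real data loaded.
public record ChunkPredicate(bool IsSingleValueOnly, Func<bool> Evaluate);

public static class ShortCircuit
{
    // Logical AND over a chunk's filter terms: single-value-only terms
    // run first. If one of them is false, All() short-circuits and the
    // per-row terms (which would force a data load) never execute.
    public static bool And(IEnumerable<ChunkPredicate> terms)
    {
        var ordered = terms.OrderByDescending(t => t.IsSingleValueOnly);
        return ordered.All(t => t.Evaluate());
    }
}
```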
Cases where the partition column is combined with real-data columns inside a single expression are probably impractical to manage in the current implementation. In theory you could rewrite the expression tree to hoist references to partition columns, but it's even more complicated when you consider the possibility of operations that rename or derive columns partway through the query.

**Problems/Limitations**

This approach separates partitions from column values. A big advantage of this is that it means n-dimensional partitioning is easily supported. The disadvantage is that we can no longer use the original column in the filter term; i.e. we need to write the filter against the partition column (e.g. MON) rather than against the underlying data column.

We've defined partitions as single-value (vs start..end spans or even small sets). That might make some kinds of partition a bit harder to express in query-friendly ways.

As described above, the mechanism really relies upon the partition filter being supplied as the first term in the query. Any deviation from this runs the risk of the engine trying to load the entire data-set and crashing. That might not be a problem if you are generating query strings from code and can guarantee a partition filter term will be present.

**Conclusion/Discussion**

It's almost certainly possible to implement something like the scheme above, and it could be a significant performance improvement for "moderate" size datasets. I'm not sure whether it's the right approach for "large" data though; the risk of missing a filter clause and blowing up your machine seems quite high! Is there a reason you want to avoid having a "pre-filter" operation where the time-range is specified separately from the kusto query and selects which partition files are considered for loading? You alluded to this in your first post but I wasn't sure why it wasn't a desirable solution?
-
Yes, that's what I meant by saying that there's nothing to stop MON being a DateTime. I.e. on chunk creation you'd create a single-value column of type DateTime (where the value is the beginning of Jan, for example). You could then write something like `| where MON >= datetime(2024-01-01) and MON < datetime(2024-02-01)` and have entire chunks pass or fail that test cheaply.
The tree is easily accessible from the engine and you could certainly run a pre-pass where you tried to find relevant filter expressions and either apply them directly or extract the constant values they are using. There are a lot of cases where that falls down though... the time constraint might be expressed as a complicated scalar expression or even as a columnar comparison against an evaluation-time computed value. The pathological case is a filter whose comparison value can only be known by evaluating the data itself.
For "simple" cases - i.e. comparison against scalar constant values I've almost convinced myself it is always safe to hoist them to the top of the evaluation tree but there may be cases I'm not considering. |
-
@lothar7 raised this idea....
I am looking to process very large timeseries tables by using the ITableSource. They all have a timestamp column that can be used for filtering. However, this won't work very well now since all the data must be scanned regardless.
If we could prune the tablechunks by time, assuming there is a "where" statement in the query, then we would only need to process the relevant table chunks.