Prefilter the tablechunks if there are relevant where statements #1

NeilMacMullen · 2023-12-21T23:04:08Z

NeilMacMullen
Dec 21, 2023
Maintainer

@lothar7 raised this idea....

I am looking to process very large timeseries tables by using the ITableSource. They all have a timestamp column that can be used for filtering. However this wont work very well now since all the data must be scanned regardless.

If we could prune the tablechunks by time assuming there is a "where" statement in the query then we would only need to process there relevant table chunks.

NeilMacMullen · 2023-12-21T23:14:41Z

NeilMacMullen
Dec 21, 2023
Maintainer Author

@lothar7 I'd be interested in getting a few more details on what you have in mind. If you could give some background (what is "very large", what format is the data, is the assumption that the data is pre-sorted in time order etc) that would be very helpful.

Kusto-loco should already be significantly more performant than the original implementation for these kinds of queries since it uses shared index tables into filtered rows to reduce copy-overhead and memory allocation. However I'm assuming you're suggesting an even higher level optimisation based on something like looking at the "span" of a chunk in time and simply ignoring it if not relevant.

Unfortunately expressions are evaluated after a chunk is loaded for filtering so it's difficult to avoid this step without introducing additional "magic" (e.g. we add a flag to a column to provide hints to short-circuit per-row evaluation).

Neverthless, we use a lot of time-series data ourselves so definitely interested in improving this aspect.

0 replies

lothar7 · 2023-12-22T20:19:29Z

lothar7
Dec 22, 2023

I have a timeseries storage system that can store large amounts of data (1-10TB+) in roughly 1-2 GB files. These files contains 1K pages. All data is indexed by time and timeseries id and each timeseries is sorted by time inside the files. So each file has a known starttime/endtime/duration. Each timeseries is stored in a set of consecutive pages inside the file. Each archive file may contain 100.000+ timeseries,
One problem with the data is that separate timeseries doesnt all share identical timestamps. (Although I may plan to use parquet files in the future in which case shared timestamps would be needed). I already have builting support for various aggregates and have a REST API and custom aggreation API on top. But what I really want is something like kusto and this project seemed a good match instead of starting from scratch with the kusto C# parser.

Anyhow - my plan was then to create a "virtual" table in kusto based on directly this raw data or do some lowlevel preaggregates first so we get shared timestamps for all timeseries and use that as a basis for a virtual table. Each archive file would match pretty well with tablechunk (or if using parquet files in the future - each rowgroup could be a tablechunk)

Now comes the tricky part. Most queries would have to include a where/filter statement specifying the timerange(s) and/or timestamp(s). Since the data spans months and years over hundreds or 1000's of files there is an obvious need to filter the tablechunks so only the relevant ones are scanned and the rest is ignored. Since each archive file has a timeindex it would also be good to use that to filter the pages and ignore the ones that are not relevant. This is also important if asking for data for a specific timestamp.

One way to solve this would be to find a way of preparing/filtering the data/tablechunks by running the time filtering before running the query itself. Since a typical datetime filter expression is just like any another expression its hard to find the actualy resulting timeranges/timestamps beforehand. Perhaps the kusto engine could provide some kind of hook somewhere that provides the relevant time expression for the time column? At this stage I dont really know the best way to be honest.

0 replies

NeilMacMullen · 2023-12-22T23:52:03Z

NeilMacMullen
Dec 22, 2023
Maintainer Author

Thanks - that's very interesting. I've done quite a bit of work to add an "indirection" layer for chunks/columns. One advantage of this is that in principle it 1) allows loading of chunk data to be deferred until it really must be used 2) allows for significant compression in the case where a chunk represents a continuous set of rows or is filtered out completely.

It may be possible to extend the chunk definition to indicate an "index" column which is assumed to be sorted and which initially holds a start/end span. So they way I could imagine this working would be....

custom ITableSource builder which understands your file/naming/storage convention (I'm assuming you must have a "cheap" way of figuring out spans without reading the file contents)
builder creates "indexed" chunks (which have the span of the chunk defined) and where the column contents are "deferred"
when evaluating queries, the first expression needs to be the timestamp comparison.
the filter operator sees this is an "indexed" chunk and that the expression is referring to the column you've marked as the index so does some "magic" (tbd) to short-circuit the normal row-by-row evaluation by looking at the chunk span.
If the chunk is filtered in then the data needs to be fetched at that point.

So the net effect is that you end up with a chunk for each file (page?) in your storage layer but these are just empty shells until you need the data and can be filtered out very efficiently.

Anyway, I'll put some more thought into this over the holidays; there's bunch of other infrastructure work I need to get done first :-)

1 reply

lothar7 Dec 24, 2023

This sounds promising. I like this concept

NeilMacMullen · 2023-12-24T11:16:33Z

NeilMacMullen
Dec 24, 2023
Maintainer Author

Some more thoughts on this (as much so I don't forget them as anything else).

Assumptions

On-disk data is partitioned into separate files with a naming scheme (or separate index) that makes it cheap to extract their partition vectors.

For example, suppose we have time-series data for items that have a constant colour and size but varying price and sales count. So the data might look like:

File Name	Colour(P)	Size(P)	Month(P)	Time	Price	Sales
JAN_RED_LG	RED	LG	JAN	jan 5th	..	..
JAN_RED_LG	RED	LG	JAN	jan 8th	..	..
JAN_BLUE_LG	BLUE	LG	JAN	jan 5th	..	..
FEB_RED_LG	RED	LG	FEB	feb 8th	..	..

For the sake of clarity I have made partitions explicit columns in the file table but clearly this redundancy can be removed.

Because we have a very large data set we wish to
- Avoid loading every file in the set for queries that will only visit a subset. (We may not even have enough memory for this.)
- Even if we could load the data into memory we'd prefer to avoid scanning/evaluating test expressions for every row in a partition column since we're doing a lot of computation just to repeatedly yield the same result.

It may also be seen as desirable to perform the partition filtering/selection as part of the query (*see discussion later)

Mechanisms

KustoLoco already has a "SingleValueColumn" which is simply a virtual mapping of N rows onto a single value.

There is not yet a concept of a deferred column for which the data is loaded only at the point of use so this would need to be introduced.

A custom ITableSource would need to create a chunk for each file in the dataset with:-

"SingleValue" columns for each partition axis
"Deferred" columns for the actual data

In the example above the columns for the first chunk might look like

Single(COLOUR=RED)	Single(Size=LG)	Single(Month=JAN)	Deferred(Time)	Deferred(Price)	Deferred(Sales)

This would allow queries such as data | where MON=='JAN' and COLOUR=='RED' | where price > 100 to immediately filter in only the desired partition(s).

A further enhancement to avoid redundant scanning of single-value columns might be possible by modifying the lambda in GetScalarImplementation to check whether all supplied arguments are single-value and, if so, performing a single evaluation. This could potentially make initial filtering much faster.

Handling cases such as

data | where MON=='JAN' and COLOUR=='RED' and price > 100

or worse,

data | where price > 100 and MON=='JAN' and COLOUR=='RED'

is harder since knowledge of "single-valuedness" needs to pushed down to all operators. There is some scope in the case of logical operators of reordering arguments so that the single-value columns are evaluated first. If we get a short-circuit result by looking at the sv columns for the first row we clearly never need to evalate the later "real data" columns.

Cases such as

data | where price > 100 | where MON=='JAN' and COLOUR=='RED'
data | summarize max(price) by MON | where MON=='JAN'
data | summarize max(price) by MON | order by MON | take 1

are probably impractical to manage in the current implementation. In theory you could rewrite the expression tree to hoist references to partition columns but it's even more complicated when you consider the possibly of operations such as project-rename and extend.

Problems/Limitations

This approach separates partitions from column values. A big advantage of this is that it means n-dimensional partitioning is easily supported. The disadvantage is that we can no longer use the original column in the filter term. I.e. we need to write MON == 'JAN' rather than Time > .... There's no reason though that MON can't be a DateTime - it's just that now it is clearly a different thing from "Time". A hybrid approach similar the the first idea above where a column could have a "dual" identity as both single-value and multi-row might be possible but complicated....

We've defined partitions as single-value (vs start..end spans or even small sets) That might make some kind of partitions a bit harder to express in query-friendly ways.

As described above, the mechanism really relies upon the partition filter being supplied as the first term in the query. Any deviation from this runs the risk of the engine trying to load the entire data-set and crashing. That might not be a problem if you are generating query strings from code and can guarantee a partition filter term will be present.

Conclusion/Discussion

It's almost certainly possible to implement something like the scheme above and could be a significant performance improvement for "moderate" size datasets. I'm not sure whether it's the right approach for "large" data though; the risk of missing a filter clause and blowing up your machine seems quite high!

Is there a reason you want to avoid having a "pre-filter" operation where the time-range is specified separately to the kusto query and selects which partition files are considered for loading? You alluded to this in your first post but I wasn't sure why it wasn't a desirable solution?

1 reply

lothar7 Dec 24, 2023

My initial thoughts so far - in holiday mode now so I haven't spent too much time thinking yet :)

Having a "pre-filter" operation does solve my immediate problem and is probably the simplest but I wanted to see if there was a unified approach that could work. That was my backup plan. Its less elegant but it does the job.

As for this concept you scetched out above - in my opinion it would be too limiting and a major problem if time filtering could only be written like MON == 'JAN' rather than Time > .... . If we could come up with a concept using the greater/less than operators then that would be great.

I also pondered whether another way would be to extract parts of the AST tree and run the filter query separately first against an index?

NeilMacMullen · 2023-12-24T12:26:16Z

NeilMacMullen
Dec 24, 2023
Maintainer Author

n my opinion it would be too limiting and a major problem if time filtering could only be written like MON == 'JAN' rather than Time > .... . If we could come up with a concept using the greater/less than operators then that would be great.

Yes that's what I meant by saying that there's nothing to stop MON being a DateTime. I.e. on chunk creation you'd create a single-value column of type DateTime (where the value is the beginning of Jan for example). You could then write where MON > date(....) and MON < date(...). But... it's still a differently-named column from the "Time" from which it was derived.

I also pondered whether another way would be to extract parts of the AST tree and run the filter query separately first against an index?

The tree is easily accessible from the engine and you could certainly run a pre-pass where you tried to find relevant filter expressions and either apply them directly or extract the constant values they are using. There are a lot of cases where that falls down though... the Time constraint might be expressed as a complicated scalar expression or even as columnar comparison against a evaluation-time computed value. Pathological case is something like:

data | extend midday=round_to_midday(Time) | project_rename NewTime=Time | summarize max(Price),midday by Colour,NewTime | where NewTime < midday

For "simple" cases - i.e. comparison against scalar constant values I've almost convinced myself it is always safe to hoist them to the top of the evaluation tree but there may be cases I'm not considering.

2 replies

lothar7 Dec 24, 2023

Thanks for clearing that up. I understand now. Then it could work and would be a decent solution.

There is still this thought in back of my head that if we managed to introduce some index concept that was attached to the column we could hit that first while querying that column. So any column could then potentially have a corresponding index that is run before the query is run against the table. But not knowing the code well enough its probably not going to work.

NeilMacMullen Dec 30, 2023
Maintainer Author

I've added changes to support "pure" index columns that are simply single-valued (per chunk). These columns are also used internally wherever a columnar value can be reasonably detected as containing a single value. The logic for InvokeScalar has been modified to short-circuit the entire column-scan in the case where all arguments are "single value". In addition, the logical-and and logical-or operations have been modified to perform "special-case" short-circuiting when one of their operands is single value and would result in a constant result regardless of the other operand.

With these changes "pre filters" of the form mybigdata | where indexA == '1' or indexB < date(2022-10-7)
are far more efficient and can avoid loading of other columns in the chunk (when deferred loading is fully implemented).

Of course, this still means that the "index columns" are distinct from the data columns. As you say, in principle it would be possible to "attach" an index value (or values) to a column and try to evaluate these first (in the same way we short-cut single-value columns).

However even in this case it's very difficult in the current implementation to efficiently avoid unnecessary column loading. For example if you write where index == 'X' and data == 'Y' the evaluator will evaluate the whole of index == 'X' (this will be quick at least since it's an index!) then try to evaluate the whole of data == 'Y' before passing both columns to the "and" operator which may immediately determine that the left-hand operand was false and therefore it never needed the right hand side. By that time the damage has been done; we've already done the disk-load of 'data' column.

Could this be avoided ? I can't see a straightforward way (yet) so the guidance would have to be to avoid mixing "index" and "normal" columns in the initial pre-filter pipe expression.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefilter the tablechunks if there are relevant where statements #1

{{title}}

Replies: 5 comments 4 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Prefilter the tablechunks if there are relevant where statements #1

NeilMacMullen Dec 21, 2023 Maintainer

Replies: 5 comments · 4 replies

NeilMacMullen Dec 21, 2023 Maintainer Author

lothar7 Dec 22, 2023

NeilMacMullen Dec 22, 2023 Maintainer Author

lothar7 Dec 24, 2023

NeilMacMullen Dec 24, 2023 Maintainer Author

Assumptions

Mechanisms

Problems/Limitations

Conclusion/Discussion

lothar7 Dec 24, 2023

NeilMacMullen Dec 24, 2023 Maintainer Author

lothar7 Dec 24, 2023

NeilMacMullen Dec 30, 2023 Maintainer Author

NeilMacMullen
Dec 21, 2023
Maintainer

Replies: 5 comments 4 replies

NeilMacMullen
Dec 21, 2023
Maintainer Author

lothar7
Dec 22, 2023

NeilMacMullen
Dec 22, 2023
Maintainer Author

NeilMacMullen
Dec 24, 2023
Maintainer Author

NeilMacMullen
Dec 24, 2023
Maintainer Author

NeilMacMullen Dec 30, 2023
Maintainer Author