Rethinking our runtime error design #8008

frankmcsherry · 2021-08-23T21:31:24Z

frankmcsherry
Aug 23, 2021
Maintainer

Materialize's current run-time error design (e.g. "you divided by zero") forms parallel streams of errors that flow alongside error-free results. This has good properties, but it is also limiting: errors are difficult to recover from, because their context is lost.

It seems reasonable to reconsider our error design, to try and think through whether there are other patterns that might be more expressive, potentially less complicated, and ideally less ambiguous.

First, let's talk through some desiderata. I don't know that we have any hard constraints, other than not crashing things and correctness.

Expressions that produce errors should produce the error as data, rather than aborting the computation.
Expressions that produce errors should propagate that error through other operators that rely on the value of the expression.

I added that last bit, "that rely on the value of the expression", because our current error strategy is more casual than this: if errors are produced but then discarded for some reason, we still produce an error. Ideally some analysis would tell us that we didn't actually depend on the value, but that may be hard when the dependence is only decided at run time.
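To make the two desiderata concrete, here is a minimal sketch of evaluation that returns errors as data and only surfaces an error when the result actually depends on the erroring sub-expression. Everything in it (`Expr`, `EvalError`, `eval`) is a hypothetical stand-in for illustration, not Materialize's actual expression types.

```rust
// Hypothetical types for illustration only; not Materialize's real API.
#[derive(Debug, Clone, PartialEq)]
enum EvalError {
    DivisionByZero,
}

#[derive(Debug, Clone)]
enum Expr {
    Literal(i64),
    Div(Box<Expr>, Box<Expr>),
    // Evaluates `then` if `cond` is nonzero, otherwise `els`.
    If {
        cond: Box<Expr>,
        then: Box<Expr>,
        els: Box<Expr>,
    },
}

// Errors are returned as data (`Err`) rather than aborting the computation.
fn eval(expr: &Expr) -> Result<i64, EvalError> {
    match expr {
        Expr::Literal(v) => Ok(*v),
        Expr::Div(num, den) => {
            // `?` propagates an error from any operand we depend on.
            let n = eval(num)?;
            let d = eval(den)?;
            if d == 0 {
                Err(EvalError::DivisionByZero)
            } else {
                // wrapping_div avoids a panic on i64::MIN / -1.
                Ok(n.wrapping_div(d))
            }
        }
        Expr::If { cond, then, els } => {
            // Only the branch that is taken gets evaluated, so an error in
            // the other branch is never produced.
            if eval(cond)? != 0 {
                eval(then)
            } else {
                eval(els)
            }
        }
    }
}
```

With this shape, `1 / 0` on its own evaluates to `Err(DivisionByZero)`, while the same expression sitting in the untaken branch of an `If` does not affect the result. The hard part described above is that a dataflow does not evaluate one expression tree at a time, so "the branch not taken" is not always this easy to identify.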

We also have some other free-form errors like AvroParseError or SubqueryGeneratedTooManyResults that may not obviously correspond to expressions.

Here are some design questions that don't have clear solutions:

Should errors replace entire rows, or individual columns?
Currently they replace entire rows, which is part of what makes recovering from them hard. At the same time, it is very easy to see if a result is an error or a valid row (as it is in the type, rather than in the data). Expression operators like IFERROR are left hanging because we cannot "undo" errors that affected only one expression.
Should errors be maintained in the same collections as valid data?
Separating errors out makes operations like join much easier as we can join only the valid data. On the other hand, it makes operations like reduce much harder, as we cannot produce two arrangements as output (this blocks pushing potentially erroring computation like HAVING 1/AVG(x) > 3 into the operator). The answer to this question might be different for arrangements where errors occur in the keys, where we perhaps always want to partition the results away.
Should rows with column-based errors participate in operations like join, reduce, and topk?
Unless the error is in the key, or the aggregate expressions for the reduce, we can still operate on the row that contains an error, and make "discovering" the error someone else's job later on. In essence, we delay the "evaluation" of the error.
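As a rough sketch of the third question, assume errors live in individual columns and a reduce only looks at the key column and the aggregated column. The names here (`Cell`, `Row`, `sum_by_key`) are made up for the example, and it assumes every row has at least `key_col + 1` and `val_col + 1` columns. Rows whose key is an error are partitioned away, an error in the aggregated column taints that group, and an error in any other column is simply not evaluated by this operator.

```rust
use std::collections::BTreeMap;

// Hypothetical types for illustration only; not Materialize's real API.
#[derive(Debug, Clone, PartialEq)]
enum EvalError {
    DivisionByZero,
}

// A column value is either valid data or an error.
type Cell = Result<i64, EvalError>;
type Row = Vec<Cell>;

// Sums column `val_col` grouped by column `key_col`.
// Returns the per-key aggregates and, separately, the errors found in keys.
fn sum_by_key(
    rows: &[Row],
    key_col: usize,
    val_col: usize,
) -> (BTreeMap<i64, Cell>, Vec<EvalError>) {
    let mut groups: BTreeMap<i64, Cell> = BTreeMap::new();
    let mut key_errors = Vec::new();
    for row in rows {
        match &row[key_col] {
            // An error in the key cannot be grouped; partition it away.
            Err(e) => key_errors.push(e.clone()),
            Ok(key) => {
                let acc = groups.entry(*key).or_insert(Ok(0));
                *acc = match (acc.clone(), &row[val_col]) {
                    (Ok(a), Ok(v)) => Ok(a.wrapping_add(*v)),
                    // Once a group has seen an error, it stays an error.
                    (Err(e), _) => Err(e),
                    (Ok(_), Err(e)) => Err(e.clone()),
                };
            }
        }
    }
    (groups, key_errors)
}
```

Columns other than `key_col` and `val_col` are never inspected, which is the "delay the evaluation of the error" behaviour: whoever consumes those columns later gets to discover the error.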

So clearly an alternate proposal is "extend Datum to contain an Error variant" and then update our logic in most places to deal with that variant, most often propagating the error, and in some cases addressing the fact that Error is meant to be a special value that 1. should not just join with other errors, 2. somehow taints results from aggregations, and 3. does other things we don't realize yet. Tbh, independent of whether we like this or not, I'd really like to go through the process of determining the intended results of operators on inputs that contain errors.
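For the sake of discussion, a toy version of that proposal might look like the following. This `Datum` is a two-variant stand-in invented for the example (the real one has many more variants), and `joins_with`, `add`, `sum`, and `if_error` are hypothetical helpers, each encoding one possible answer to the questions above rather than a settled design.

```rust
// Toy stand-in for illustration only; not Materialize's real `Datum`.
#[derive(Debug, Clone, PartialEq)]
enum Datum {
    Int(i64),
    Error(String),
}

impl Datum {
    // Join-key equality: an error never matches anything, not even an
    // identical error, so errors do not "just join with other errors".
    fn joins_with(&self, other: &Datum) -> bool {
        match (self, other) {
            (Datum::Int(a), Datum::Int(b)) => a == b,
            _ => false,
        }
    }
}

// Most operators propagate: any error input yields an error output.
// If both inputs are errors, the left one is kept (an arbitrary choice).
fn add(a: &Datum, b: &Datum) -> Datum {
    match (a, b) {
        (Datum::Error(e), _) | (_, Datum::Error(e)) => Datum::Error(e.clone()),
        (Datum::Int(x), Datum::Int(y)) => Datum::Int(x.wrapping_add(*y)),
    }
}

// An aggregation is tainted by any error among its inputs.
fn sum(values: &[Datum]) -> Datum {
    values.iter().fold(Datum::Int(0), |acc, v| add(&acc, v))
}

// With errors in the data, something like IFERROR can recover from an
// error that affected only one expression.
fn if_error(value: Datum, fallback: Datum) -> Datum {
    match value {
        Datum::Error(_) => fallback,
        other => other,
    }
}
```

Note the tension this makes visible: `joins_with` deliberately disagrees with the derived `PartialEq`, which is the kind of special-casing that would need to be decided operator by operator.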

frankmcsherry · 2021-09-15T18:06:58Z

frankmcsherry
Sep 15, 2021
Maintainer Author

Another thing to determine and commit to is whether errors reflect "deterministic errors associated with the input data and computation" or are extended to include "transient and non-deterministic errors that reflect the operating environment". I personally prefer the former, in that we want to distinguish between Materialize's transient difficulty in producing the right answer, and the right answer itself. But I'm happy to hear from folks who think we might want the latter.
