Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(duckdb): literal python floats are interpreted as Decimals, not DOUBLE #10890

Open
1 task done
NickCrews opened this issue Feb 24, 2025 · 11 comments · May be fixed by #10933
Open
1 task done

bug(duckdb): literal python floats are interpreted as Decimals, not DOUBLE #10890

NickCrews opened this issue Feb 24, 2025 · 11 comments · May be fixed by #10933
Labels
bug Incorrect behavior inside of ibis duckdb The DuckDB backend

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Feb 24, 2025

What happened?

In duckdb, a value like 1000000000.0000005 is interpreted as a DECIMAL, not a float. This usually causes no problems. But it sometimes does, like if you multiply two large numbers, you will get an OverFlowError with decimals, where if you were using a float, you would be fine.

import ibis
import duckdb

print(duckdb.sql("SELECT typeof(1000000000.0000005)").fetchone()[0])  # DECIMAL(17,7)
print(
    duckdb.sql("SELECT 1000000000.0000005::DOUBLE * 1000000").fetchone()[0]
)  # 1000000000000000.5

e = ibis.literal(1000000000.0000005) * ibis.literal(1_000_000)
print(
    ibis.to_sql(e)
)  # SELECT 1000000000.0000005 * 1000000 AS "Multiply(1000000000.0000005, 1000000)"
e.execute()
# OutOfRangeException: Out of Range Error: Overflow in multiplication of DECIMAL(18) (10000000000000005 * 1000000)

What version of ibis are you using?

The bug is not present on ibis 9.2.0 and before, and it is present on 9.3.0 and after. I think e4ff1bd was the commit that caused the regression.

What backend(s) are you using, if any?

duckdb

Relevant log output

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Feb 24, 2025
@cpcloud
Copy link
Member

cpcloud commented Feb 27, 2025

Thanks for opening an issue!

Do you have a use case for this?

For this specific issue, we removed casting because it was defeating some important pushdown optimizations in duckdb-spatial.

@cpcloud cpcloud added the duckdb The DuckDB backend label Feb 27, 2025
@NickCrews
Copy link
Contributor Author

NickCrews commented Feb 27, 2025

I found this when I got an overflow error from multiplying two large numbers as I describe.

To be specific, when multiplying multiple large probabilities in my record linkage library mismo.
See
NickCrews/mismo@357fa79#diff-1ec0582d16916ba2d4319891e339b61c8e19c5942fb4912be84f02e9d1fecd30R311-R313 (eg we have several columns which represent the odds from some feature, eg that the address or names match, and multiply them together. These odds can be quite large. See https://moj-analytical-services.github.io/splink/topic_guides/theory/fellegi_sunter.html#deriving-match-weights-from-m-and-u)

@NickCrews
Copy link
Contributor Author

If we add the explicit cast only for when ibis compiles a floatXX literal to duckdb, would that manage to avoid the optimization issue?

@NickCrews
Copy link
Contributor Author

If others run into this, the workaround I found was ibis.literal("12.34").cast(float)

@gforsyth
Copy link
Member

You can remove the cast and specify the dtype when you define the literal

[ins] In [2]: ibis.literal(1, type='float')
Out[2]: 1.0

[ins] In [3]: ibis.literal(12.34, type='float')
Out[3]: 12.34

[ins] In [4]: ibis.literal('12.34', type='float')
Out[4]: 12.34

@NickCrews
Copy link
Contributor Author

NickCrews commented Mar 2, 2025

Unfortunately, all those result in the naked 12.34 (which duckdb interprets as decimal) in the generate SQL:

import ibis

ibis.to_sql(ibis.literal(1, type="float"), dialect="duckdb")  # SELECT 1.0 AS "1.0"
ibis.to_sql(ibis.literal(12.34, type='float'), dialect='duckdb')  # SELECT 12.34 AS "12.34"
ibis.to_sql(ibis.literal("12.34", type='float'), dialect='duckdb')  # SELECT 12.34 AS "12.34"
ibis.to_sql(ibis.literal("12.34").cast("float"), dialect="duckdb")  # SELECT CAST('12.34' AS DOUBLE) AS "Cast('12.34', float64)"
ibis.to_sql(ibis.literal(12.34).cast("float"), dialect="duckdb")  # SELECT 12.34 AS "12.34"

In order to actually get ibis to generate the cast-to-float, you need to start with something that ibis doesn't think is a float, eg the ibis.literal("12.34") which ibis thinks has dtype of string.

@NickCrews
Copy link
Contributor Author

Here is a little test that I would expect to pass for any expression and backend. Could we add this to our tests, where hypothesis generates expressions for us?

import ibis
from ibis.backends.sql import SQLBackend
from ibis.expr import datatypes as dt


def assert_type_roundtrips(e: ibis.ir.Value, backend: SQLBackend):
    type_in_backend_string: str = backend.execute(e.typeof())
    type_in_backend_ibis: dt.DataType = backend.compiler.type_mapper.from_string(
        type_in_backend_string
    )
    assert e.type() == type_in_backend_ibis, (e.type(), type_in_backend_ibis)


e = ibis.literal(12.34)
backend = ibis.duckdb.connect()
assert_type_roundtrips(e, backend)
# AssertionError: (Float64(nullable=True), Decimal(precision=4, scale=2, nullable=True))

@cpcloud
Copy link
Member

cpcloud commented Mar 3, 2025

If you can come up with a way to not break existing optimizations on DuckDB and get that test to pass, you're more than welcome to try adding it to the test suite.

@cpcloud
Copy link
Member

cpcloud commented Mar 3, 2025

That test used to pass before we disabled literal casting on DuckDB. We made the concession because we deemed it more important to enable the optimization that was defeated by casting.

@NickCrews
Copy link
Contributor Author

NickCrews commented Mar 3, 2025

OK, that tradeoff makes sense that we want to prioritize the optimization more, I bet more people are affected by the optimization than by the bug that I experienced, and my bug has a workaround.

I am willing to try to put in a bit of effort to solve this, if it doesn't turn into a total rabbit hole.

Links to the perf issues:

@cpcloud in that linked PR, you mention what you think the long-term solution should be:

Longer term, I think we should probably support two kinds of literals:

  1. A ops.TypedLiteral when a type is specified in ibis.literal, which will be cast to the specific type
  2. ops.Literal would then be a literal that has type information but avoids casting. This is what op construction would use.

Can you give some example of this? If for number 2 you are saying that Ops.Multiply(1, 12.34) should avoid casting and compile to 1 * 12.34, then we would still suffer from ibis thinking this is a float64 but it getting evaluated as a decimal on the backend.

Alternatively, instead of having two different flavors of Ops at construction time, could we do it through re-write rules at compile time?

  1. in expressions like ops.Equal(x, ops.Literal(12.34)), we can avoid the cast, because the overall type of boolean doesn't care about if the 12.34 is a float or decimal.
  2. but, in expressions like ops.Add(x, ops.Literal(12.34)), where the overall type is numeric, we DO need the cast?

It might be death by a thousand cuts to properly enumerate all the rules for when the type is important and when it is not. What about if by default we made it so that we compiled the type cast, but just special cased the boolean comparison Ops to not include the cast during compilation? Eg

  1. we kept pretty much all our implementation for building the expression tree the same.
  2. we added a new ops.UntypedLiteral
  3. we added a rewrite rule so that ops.Equal(x, ops.Literal(12.34)) get rewritten to ops.Equal(x, ops.UntypedLiteral(12.34))
  4. ops.Literal, when compiled, always includes the cast, ops.UntypedLiteral does not.

My thought here is that by default we should try to have explicit casting so that assert_type_roundtrips always holds, and then only special case away from this when we find there are optimizations etc that we want to keep.

It also looks like in that PR, there is no automatic test to make sure that the filter pushdown happens, is that right? You were just manually running EXPLAIN <query> on the query from the original issue?

@NickCrews
Copy link
Contributor Author

#10933 implements the rewrite solution that I describe above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis duckdb The DuckDB backend
Projects
Status: backlog
Development

Successfully merging a pull request may close this issue.

3 participants