-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple Functions #12635
Comments
FYI: I am doing some experiments how this could look like |
I think the idea of making it easier to write functions that include specialized implementations for different types is a great idea. This would likely both make our code faster (more specializations, more uniformly handle constants) as well as reduce the replication (there are several different patterns for dispatch / specialization now -- consolidating would be really nice) |
here is some prior work on the precompiled regular expressions: #11146 (note compiling the regex hasn't ever shown up in any performance trace I did, probably because the cost of actually matching is so much bigger than the cost of compiling and the cost of compiling is spread over 8192 rows |
Thanks @findepi I think this process go through iterations, and easier than was before but still far from perfect. The ScalarUDFImpl common trait is already a huge help, but for example dynamic typing, like Rust can benefit from macros and generate functions with all possible output types combinations, but this is just 1 problem. I'm totally down if we can make it easier |
Just to be clear, what I am imagining comes out of this work is:
|
FYI i touched upon the topic of types on DataFusion meetup in Belgrade yesterday. |
I experimented with this on the way from DataFusion meetup in Belgrade. i came up with something like this function author would write thisrow-oriented function logic // hand-written
pub struct MyGcd {}
impl MyGcd {
pub fn call(x: i64, y: i64) -> Result<i64> {
datafusion::functions::math::gcd::compute_gcd(x, y)
}
} a macro would generategenerated (syntactically derived) impl datafusion::logical_expr::ScalarUDFImpl for MyGcd {
fn as_any(&self) -> &dyn Any {
self
}
fn name(&self) -> &str {
"MyGcd"
}
fn signature(&self) -> &Signature {
todo!() // TODO here we should use logical typyes
}
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
todo!() // TODO here we should use logical typyes
}
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
lifted_invoke::<MyGcd>(self, args)
}
}
impl Invocable for MyGcd {
const ARGUMENT_COUNT: u8 = 2;
type ArgumentTypeList = (i64, (i64, ()));
type OutArgType = ();
type FormalReturnType = Result<i64>;
fn invoke(&self, regular_args: Self::ArgumentTypeList, _out_arg: &mut Self::OutArgType) -> Self::FormalReturnType {
let (a, (b, ())) = regular_args;
MyGcd::call(a, b)
}
} All the heavy lifting happens inside Notes on API design:
Note:
|
Planning on opening the PR on |
Thanks @findepi I was thinking about proc macros as well, having the the original function I hope we can extract necessary information and generate the DF code and the documentation, I do some testing for documentation in #12822 on top of what @Omega359 has already built. If we think about simplication on this phase as to generate boilerplate code leaving the developer to implement the logic it sounds achievable |
Wouldn't this incur a significant amount of overhead calling a function for every single row? |
@Omega359 this is how the function logic is structured anyway --
|
🤔 |
I'll let @davidhewitt chime in but we've experience a lot of generic bloat from having to implement functions that operate on scalars, arrays, dictionary arrays and take multiple parameters: you end up with an exponential explosion of generics which is very painful at compile time. |
https://github.com/apache/datafusion/blob/main/datafusion/functions-nested/src/string.rs is a good example of what can result with supporting many types and args |
Thanks, looks similar to what we've done in |
It seems like datafusion-contrib/datafusion-functions-json#66 might provide a possible avenue; by widening input dictionary keys to |
Going back to this topic. Not much happened on the Simple Functions front, but a lot happened in the world. New type system changes (Logical types can be found in type signatures #13372; preliminary steps towards extension types #14247). We internally implemented quite a few functions and it allowed me to experience first-hand how important it is to make function authoring easier. Here is a design doc for what I would love to implement It's open to comment by anyone. TeaserWouldn't it be cool to be able to implement a function like this? // SQL regexp_like(text, pattern, flags) -> bool
#[vectorize]
fn regexp_like(
#[arg(2)] pattern: &str,
#[arg(3)] flags: &str,
) -> Result<impl FnMut(#[arg(1)] &str) -> bool> {
let regex = Regex::new(format!("(?{flags})^({pattern})$").as_str())?;
Ok(move |text: &str| regex.is_match(text))
} All the regexp and many date/time, and many other functions would greatly benefit if we can pull this up. |
I love the work that you've put into this! You use the regexp_like as an example of what could be handled by if you look at that function it actually doesn't operate on individual values but delegates the whole arrays off to arrow-string. My primary concern with this is not the direction at all - I love it. I am concerned primarily how to handle functions mixing string arrays and scalars. The whole casting logic for the various string types can get quite laboured and until Rust has Enum values as types I don't see a way around it unless we just force everything to a common type (GenericStringArray ?) which would remove the benefit of using something like StringViewArray. I'm going to review the doc again later today and think about it a bit more. |
Thanks @findepi and everyone, this work is epic, literally. in DataFusion it was always needed to unify the builtin functions as they implemented by different developers in different times. Please also consider that functions can be declared in separate or even external crates and be plugged dynamically like now we discussing #5600 Thanks for the google doc, would you mind attach here the final result how procedural macros are unwrapped into the function code supporting all the attributes like Another thing coming to my mind is to optimize builtin function if we know that the function parameter set by user is actually a literal which provides further optimizations |
Design doc
Is your feature request related to a problem or challenge?
Verbosity
Currently implementing a scalar function is a pretty involved process. For example a simple function calculating greatest common divisor looks like this (a lot of detail omitted for brevity):
This function is still "relatively simple" because:
return_type
and again ininvoke
x
or constanty
parametersmake_scalar_function
helper expands scalar constant argument into an array, thusgcd(x, 5)
is less efficient thangcd(x, y)
because it first needs to allocate a temporary buffer for5
s repeated times the batch lengthgcd
on partitioning keys, but this is just an exampleIt should be possible to express functions using simple row-level operations, because in query engines most function logic is structured like that anyway (
compute_gcd
in the example).Local data structures
Currently DataFusion functions are singletons plugged into the execution engine. They have no way to store and reuse buffers or compiled regular expressions, etc.
Thread local storage is not an option because DataFusion, when embedded, does not control creation of application threads.
Describe the solution you'd like
Simple
It should be possible to separate function logic from the mechanics.
Exemplary GCD function needs to provide
fn compute_gcd(x: i64, y: i64) -> Result<i64>
and the rest of boilerplate should not be hand-written for every function separately.It should be possible to implement a function that accepts string values, without having to deal with the 5 different runtime representations that can carry string values:
StringArray, LargeStringArray, StringViewArray, DictionaryArray<Key>, RunArray<Key>
(maybe more than 5 because they are recursive in theory: canDictionaryArray
contain aDictionaryArray
? can it containRunArray
?)Focused
Because SQL is statically typed, it is necessary to select function overload during planning time, so that return type is also known (the
return_type
function). Theinvoke
function needs to implement same logic again. This process should be less error-prone: once the function is bound to the query call site, its signature is known and the implementation should not need to do type checks.Performant / Efficient
It should be the compiler's / framework's job to provide vectorization, without hand-writing same logic in every function separately.
It should be the compiler's / framework's to do null checks, providing efficient tight loops for the all-non-null case without having to write such logic in every function separately.
It should be possible to write efficient functions that need thread local data structures, for example for regular expressions, without having to use thread locals and/or shared pools which introduce contention.
Describe alternatives you've considered
arrow-udf
liibrary #11413However, direct use of the library is not possible because
The library could still be part of the solution, but doesn't have to be and it's a non-goal to use a particular library.
Additional context
The text was updated successfully, but these errors were encountered: