Releases: Eventual-Inc/Daft
v0.3.8
Changes
👾 Bug Fixes
📖 Documentation
- [DOCUMENTATION] add value counts to rst @andrewgazelka (#3032)
🧰 Maintenance
- [DOCUMENTATION] add value counts to rst @andrewgazelka (#3032)
v0.3.7
v0.3.6
Changes
✨ New Features
- [FEAT] Implement standard deviation @raunakab (#3005)
- [FEAT] Add time travel to read_deltalake @kevinzwang (#3022)
- [FEAT] agg_list support for list and struct types @kevinzwang (#3019)
- [FEAT] Cast SparseTensor and FixedShapeSparseTensor to Python @sagiahrac (#3010)
- [FEAT] add `list.value_counts()` @andrewgazelka (#2902)
- [FEAT] Infer timedelta literal as duration @colin-ho (#3011)
- [DOCS] Naming consistency of `length` functions @vicky1999 (#2942)
👾 Bug Fixes
- [BUG] Pass parquet2 io errors correctly into arrow2 @desmondcheongzx (#3012)
- [BUG] Fix actor pool project splitting when column is not renamed @kevinzwang (#2998)
- [BUG] Add resources to Ray stateful UDF actor @kevinzwang (#2987)
- [BUG] Fix join errors with same key name joins (resolves #2649) @anmolsingh20 (#2877)
- [BUG]: error messages for add @universalmind303 (#2990)
📖 Documentation
- [FEAT] Implement standard deviation @raunakab (#3005)
- [DOC] fix link in doc @amitschang (#2944)
- [DOCS] Update readme to use python syntax highlighting @jaychia (#3006)
- [DOCS] Naming consistency of `length` functions @vicky1999 (#2942)
- [DOCS] Update readme to correctly reflect new messaging @jaychia (#3001)
🧰 Maintenance
- [CHORE] add/fix many clippy lints @andrewgazelka (#2978)
v0.3.5
Changes
✨ New Features
- [FEAT]: sql `read_deltalake` function @universalmind303 (#2974)
- [FEAT]: SQL add hash and minhash @universalmind303 (#2948)
- [FEAT] Enable init args for stateful UDFs @kevinzwang (#2956)
👾 Bug Fixes
- [BUG]: add count_matches and fix a bunch of str functions @universalmind303 (#2946)
- [BUG] Writes from empty partitions should return empty micropartitions with non-null schema @colin-ho (#2952)
- [CHORE] Enable test_creation and test_parquet for native executor @colin-ho (#2672)
- [BUG] improve error reporting for multistatement sql @amitschang (#2916)
- [BUG]: sql nested and wildcard @universalmind303 (#2937)
- [BUG] Enable groupby with alias for native executor @colin-ho (#2917)
- [BUG] Use dashes for machete dependency ignores @colin-ho (#2919)
📖 Documentation
- [DOCS] Fix docs to add SQL capabilities @jaychia (#2931)
- [DOCS] update arch png @samster25 (#2970)
- [DOCS] Add docs on to_arrow and as_arrow @samster25 (#2965)
- [DOCS]: add a helper function to list all sql functions @universalmind303 (#2943)
- [CHORE] Additional fixes for nightly tests @kevinzwang (#2936)
- [CHORE] Fix issues from nightly tests @kevinzwang (#2926)
🧰 Maintenance
- [CHORE] ignore 45e2944 @andrewgazelka (#2979)
- [CHORE] Enable test_creation and test_parquet for native executor @colin-ho (#2672)
- [CHORE] pin cargo machete to 0.7.0 @andrewgazelka (#2920)
- [CHORE] Refactor Binary Ops @samster25 (#2876)
- [CHORE] add pytest to vscode settings.json @andrewgazelka (#2930)
- [CHORE] Additional fixes for nightly tests @kevinzwang (#2936)
- [CHORE] update GH template name from md to yml @samster25 (#2934)
- [CHORE] update GH bug template @samster25 (#2932)
- [CHORE] Fix issues from nightly tests @kevinzwang (#2926)
- [CHORE] Enable sources to return empty tables @colin-ho (#2915)
v0.3.4
Changes
✨ New Features
- [FEAT] `agg_concat` doesn't work on strings @vicky1999 (#2847)
- [FEAT] Add ability for RayRunner to run actor pool projects (beta feature) @jaychia (#2881)
- [FEAT]: [SQL] struct subscript and json_query @universalmind303 (#2891)
- [FEAT] UTF8 to binary coercion flag @raunakab (#2893)
- [FEAT] Delta Lake partitioned writing @kevinzwang (#2884)
- [FEAT]: add partitioning_* functions to sql @universalmind303 (#2869)
- [FEAT]: add sql support for "DATE <date>" and "DATETIME <datetime>" @universalmind303 (#2870)
- [FEAT] Add Sparse Tensor logical type @michaelvay (#2722)
- [FEAT] [SQL] Enable SQL query to run on callers scoped variables @amitschang (#2864)
- Revert "[FEAT]: `shuffle_join_default_partitions` param" @jaychia (#2873)
- [FEAT] Iceberg partitioned writes @kevinzwang (#2842)
- [FEAT]: SQL temporal functions @universalmind303 (#2858)
- [FEAT]: sql list operations @universalmind303 (#2856)
- [FEAT]: `shuffle_join_default_partitions` param @universalmind303 (#2844)
- [FEAT] Add left/right/anti/semi joins to native executor @colin-ho (#2743)
🚀 Performance Improvements
- [PERF] Lazily import heavy modules to speed up import times @desmondcheongzx (#2826)
👾 Bug Fixes
- [BUG] Fix display for decimal types @raunakab (#2909)
- [BUG] Fix partitioning SQL scans on empty tables @desmondcheongzx (#2885)
- [BUG] Fix concat expression typing @colin-ho (#2868)
🧰 Maintenance
- [CHORE] Classify throttle and internal errors as Retryable in Python @samster25 (#2914)
- [CHORE] auto-fix prefer `Self` over explicit type @andrewgazelka (#2908)
- [CHORE]: bump sqlparser version @universalmind303 (#2886)
- [CHORE]: Move daft.sql.sql module to daft.sql @universalmind303 (#2907)
- [CHORE] ignore vendored crates for codecov @samster25 (#2895)
- [CHORE]: move `numeric` out of daft-dsl and into `daft-functions` @universalmind303 (#2857)
- [CHORE] Update documentation for config variables @jaychia (#2874)
- [CHORE] Move codspeed interactive tests to local files @samster25 (#2872)
- [CHORE]: move list functions from daft-dsl to daft-functions @universalmind303 (#2854)
- [CHORE] Change TPC-H q4 and q22 answers to use new join types @kevinzwang (#2756)
- [CHORE] Add native executor to CI @colin-ho (#2855)
⬆️ Dependencies
- Bump astral-sh/setup-uv from 2 to 3 @dependabot (#2888)
- Bump isbang/compose-action from 2.0.0 to 2.0.2 @dependabot (#2887)
v0.3.3
Changes
✨ New Features
- [FEAT]: Dataframe.filter method @universalmind303 (#2853)
- [FEAT] Add `to_pylist` on DataFrame @vicky1999 (#2816)
- [FEAT]: sql float operations @universalmind303 (#2834)
- [FEAT]: sql count(*) @universalmind303 (#2832)
- [FEAT] Delta lake allow unsafe rename for local writes @kevinzwang (#2824)
- [FEAT] Ellipsize glob scan paths @anmolsingh20 (#2809)
- [FEAT] [SQL] Add global agg support for SQL @amitschang (#2799)
- [FEAT] Adds str.length_bytes() function @thomasjpfan (#2775)
🚀 Performance Improvements
👾 Bug Fixes
- [BUG]: Sql groupby fix @universalmind303 (#2843)
- [BUG] Avoid reconstructing sql query in read_sql @colin-ho (#2818)
- [BUG] Perform cleanup of tasks and results when iterator is deleted @jaychia (#2812)
- [BUG] Propagate S3Config.num_tries to pyarrow S3 filesystem @jmurray-clarify (#2800)
📖 Documentation
- [FEAT]: Dataframe.filter method @universalmind303 (#2853)
- [FEAT] Add `to_pylist` on DataFrame @vicky1999 (#2816)
- [FEAT] Delta lake allow unsafe rename for local writes @kevinzwang (#2824)
- [DOCS] Add docs to hash and hash to docs @kevinzwang (#2821)
- [DOCS] Trigger the workflow after PR Labeler runs @jaychia (#2823)
- [CHORE] Update netlify publishing @jaychia (#2814)
- [DOCS] Enable hosted docs preview @jaychia (#2803)
- [DOCS] Fix documentation errors @jaychia (#2811)
- [DOCS] Add grouping and aggregation docs @colin-ho (#2805)
- [DOCS] Casting matrix @colin-ho (#2801)
- [FEAT] Adds str.length_bytes() function @thomasjpfan (#2775)
🧰 Maintenance
- [CHORE] Add rustfmt config file and run formatter @raunakab (#2807)
- [CHORE] Concretize casting semantics for temporal + decimal types @colin-ho (#2798)
- [CHORE]: Move jq out of core @universalmind303 (#2828)
- [CHORE] Install Python before using uv @samster25 (#2840)
- [CHORE] Decouple Ray tensor types from main Daft logic @desmondcheongzx (#2829)
- [CHORE] Ensure compatibility with deltalake version v0.19 @kevinzwang (#2827)
- [CHORE] Update PyO3 and use their new Bound API @kevinzwang (#2793)
- [CHORE]: Move image kernel out of daft-core @universalmind303 (#2804)
- [CHORE] Cleanup display impls - follow-up PR @raunakab (#2820)
- [CHORE] Break daft-plan/daft-scheduler dependency on daft-io @jaychia (#2813)
- [CHORE] Remove enum imports daft core @raunakab (#2819)
- [CHORE] Add `derive_more` to get rid of manual `Display` impls @raunakab (#2794)
- [CHORE] Move out datatype and schema from daft-core @samster25 (#2806)
- [CHORE] Update netlify publishing @jaychia (#2814)
- [CHORE] Remove user-facing arguments for casting to Ray's tensor type @jaychia (#2802)
- [CHORE] Use treenode for tree traversal in logical optimizer rules @kevinzwang (#2797)
v0.3.2
Changes
✨ New Features
- [FEAT] Add runner logic in PyRunner for ActorPoolProject @jaychia (#2677)
- [FEAT]: sql image_encode and image_resize @universalmind303 (#2764)
- [FEAT] sql `image_decode` @universalmind303 (#2757)
- [FEAT] Add an `approx_count_distinct` expression (using the HLL algorithm) @raunakab (#2718)
- [FEAT] Add support for sum aggregation for decimal128 type @amitschang (#2755)
- [FEAT] expose more type info @chuanlei-coding (#2762)
- [FEAT] Adds SQL function modules @rchowell (#2725)
- [FEAT] (ACTORS-3) Propagate feature flags from Planning Config through to logical optimizer @jaychia (#2674)
- [FEAT] Fix projection pushdowns in actor pool project @jaychia (#2680)
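The `approx_count_distinct` entry above names the HLL (HyperLogLog) algorithm. For readers unfamiliar with it, the following is a minimal pure-Python sketch of the general idea. It is not Daft's implementation: the function name `hll_estimate`, the use of SHA-1 as the hash, and the register parameter `p=12` are all assumptions chosen for illustration.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values using 2**p registers.

    Illustrative only; `hll_estimate` is a hypothetical helper, not a Daft API.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash taken from the first 8 bytes of SHA-1 (arbitrary choice).
        h = int.from_bytes(hashlib.sha1(repr(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                   # top p bits select a register
        rest = h & ((1 << (64 - p)) - 1)      # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in `rest`
        # (64 - p + 1 when `rest` is all zeros).
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)          # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# 10,000 distinct integers, each seen three times.
data = [i for i in range(10_000) for _ in range(3)]
print(round(hll_estimate(data)))
```

With 4096 registers the typical relative error is on the order of a couple of percent, so the printed estimate should land close to 10,000 without matching it exactly.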
🚀 Performance Improvements
👾 Bug Fixes
- [BUG] Groupby with alias not working @colin-ho (#2790)
- [BUG] Fix parquet reads with limit across row groups @desmondcheongzx (#2751)
- [BUG] Fix ScanTask memory estimations when limits are provided @jaychia (#2735)
- [BUG] Enable Spawn Functions for IO and Compute Functions @samster25 (#2687)
- [BUG] Fix `set_execution_config` not setting `hash_join_partition_size_leniency` @Vince7778 (#2759)
- [BUG] Fix `count("*")` behavior @Vince7778 (#2733)
- [BUG] Add marker prefixes to filter during reads @colin-ho (#2726)
- [BUG]: fsl to list with validity @universalmind303 (#2729)
- [BUG]: use recordbatch instead of table for `df.to_arrow_iter` @universalmind303 (#2724)
📖 Documentation
- [DOCS] Fix struct accessors in tutorial examples @jaychia (#1809)
- Fix huggingface.rst documentation @asmith26 (#2746)
- [DOCS]: Fix typos in UDF documentation @amitschang (#2728)
- [DOCS] Fix small typo in partitioning.rst @jaychia (#2721)
🧰 Maintenance
- [CHORE] Enable all targets for cargo check @samster25 (#2792)
- [CHORE] refactor daft-core with preclude @samster25 (#2782)
- [CHORE] Implement thiserror::Error for DaftError and arrow2::Error @raunakab (#2785)
- [CHORE] Rename vpartition -> micropartition @jaychia (#2781)
- [CHORE] Add check for stateful UDF outside of project @kevinzwang (#2771)
- [CHORE] Fix conditional compilation for UDFs @jaychia (#2761)
- [CHORE] Refactor local hash joins + pipeline connections @colin-ho (#2719)
- [CHORE]: remove this file @universalmind303 (#2752)
- [CHORE] Add .lldbinit for debugging @kevinzwang (#2750)
- [CHORE] early terminate read parquet bulk @samster25 (#2748)
- [CHORE] add large fake files for benchmarks (disabled) @samster25 (#2744)
- [CHORE] disables aqe tests in CI @samster25 (#2745)
- [CHORE] add benchmarks for interactive reads @samster25 (#2732)
v0.3.1
Changes
✨ New Features
- [FEAT] (ACTORS-2) Add optimization pass to split Project into ActorPoolProject @jaychia (#2627)
- [FEAT] Stream results from native executor into python @colin-ho (#2667)
- [FEAT]: huggingface integration @universalmind303 (#2701)
🚀 Performance Improvements
- [PERF] Fix excessive parquet metadata reading @Vince7778 (#2694)
👾 Bug Fixes
- [BUG] Use python logging level @colin-ho (#2705)
- [BUG] Add a with_execution/planning_config context manager and fix tests for splitting of parquet @jaychia (#2713)
- [BUG] Fix Resource Request Serialization and factor out Serialize Object as bincode @samster25 (#2707)
📖 Documentation
- [DOCS] Partitioning user guide and small doc fixes @jaychia (#2717)
- [FEAT] (ACTORS-2) Add optimization pass to split Project into ActorPoolProject @jaychia (#2627)
- [BUG] Add a with_execution/planning_config context manager and fix tests for splitting of parquet @jaychia (#2713)
- Update PreCommit Hooks @samster25 (#2715)
- [FEAT]: huggingface integration @universalmind303 (#2701)
🧰 Maintenance
v0.3.0
‼️ v0.2 → v0.3 Migration Guide ‼️
We're proud to release version 0.3.0 of Daft! Please note that with this minor version increment, v0.3 contains several breaking changes:
- `daft.read_delta_lake`
  - This function was deprecated in favor of `daft.read_deltalake` in v0.2.26 and is now removed. (#2663)
- `daft.read_parquet` / `daft.read_csv` / `daft.read_json`
  - Schema hints are deprecated in favor of `infer_schema` (whether to turn on schema inference) and `schema` (a definitive schema if infer_schema is False, otherwise it is used as a schema hint that is applied post inference). (#2326)
- `Expression.str.normalize()`
  - Parameters are now all False by default, and need to individually be toggled on. (#2647)
- `DataFrame.agg` / `GroupedDataFrame.agg`
  - Tuple syntax for aggregations was deprecated in v0.2.18 and is now no longer supported. Please use aggregation expressions instead. (#2663)
    - Ex: `df.agg([(col("x"), "sum"), (col("y"), "mean")])` should be written instead as `df.agg(col("x").sum(), col("y").mean())`
- `DataFrame.count`
  - Calling `.count()` with no arguments will now return a DataFrame with column “count” which contains the length of the entire DataFrame, instead of the count for each of the columns (#1996)
- `DataFrame.with_column`
  - Resource requests should now be specified on UDF expressions (`@udf(num_gpus=…)`) instead of on Projections (through `.with_column(..., resource_request=...)`) (#2654)
- `DataFrame.join`
  - When joining two DataFrames, columns will now be merged only if they exactly match join keys. (#2631)
    - Ex:
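import daft
from daft import col

df1 = daft.from_pydict({
    "a": ["x", "y"],
    "b": [1, 2]
})
df2 = daft.from_pydict({
    "a": ["y", "z"],
    "b": [20, 30]
})
# "a" is joined on the same column on both sides;
# "b" is joined against an expression on the right side.
result_df = df1.join(
    df2,
    left_on=[col("a"), col("b")],
    right_on=[col("a"), col("b")/10],  # NOTE THE "/10"
    how="outer"
)
result_df.sort("a").collect()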
# before
╭──────┬───────╮
│ a    ┆ b     │
│ ---  ┆ ---   │
│ Utf8 ┆ Int64 │
╞══════╪═══════╡
│ x    ┆ 1     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y    ┆ 2     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ z    ┆ 30    │
╰──────┴───────╯
# after
╭──────┬───────┬─────────╮
│ a    ┆ b     ┆ right.b │
│ ---  ┆ ---   ┆ ---     │
│ Utf8 ┆ Int64 ┆ Int64   │
╞══════╪═══════╪═════════╡
│ x    ┆ 1     ┆ None    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ y    ┆ 2     ┆ 20      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ z    ┆ None  ┆ 30      │
╰──────┴───────┴─────────╯
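To make the v0.3 join rule concrete without running Daft, here is a toy model in plain Python. It assumes that a pair of join keys is merged only when both sides reference the same bare column, and that any other right-hand column whose name clashes with a left-hand one gets a `right.` prefix, as in the "after" table. The `outer_join` helper and its `(column_name, transform)` key encoding are hypothetical and exist only for this illustration; they are not part of Daft.

```python
def outer_join(left, right, left_keys, right_keys):
    """Toy outer join over lists of row dicts (assumes non-empty inputs).

    Each key is a (column_name, transform) pair; transform is None for a
    bare column reference, or a callable applied to the column value.
    """
    def key_of(row, keys):
        return tuple(row[n] if fn is None else fn(row[n]) for n, fn in keys)

    # A key pair is merged only if both sides are the same bare column.
    merged = {
        lname
        for (lname, lfn), (rname, rfn) in zip(left_keys, right_keys)
        if lfn is None and rfn is None and lname == rname
    }
    left_cols = list(left[0])
    right_cols = [c for c in right[0] if c not in merged]
    # Right-hand columns that clash with a left-hand name get a prefix.
    out_name = {c: (f"right.{c}" if c in left_cols else c) for c in right_cols}

    rows, matched = [], set()
    for lrow in left:
        lkey = key_of(lrow, left_keys)
        hits = [i for i, r in enumerate(right) if key_of(r, right_keys) == lkey]
        matched.update(hits)
        for i in hits or [None]:  # keep unmatched left rows, padded with None
            out = dict(lrow)
            for c in right_cols:
                out[out_name[c]] = right[i][c] if i is not None else None
            rows.append(out)
    for i, rrow in enumerate(right):
        if i not in matched:  # unmatched right rows fill merged keys only
            out = {c: (rrow[c] if c in merged else None) for c in left_cols}
            for c in right_cols:
                out[out_name[c]] = rrow[c]
            rows.append(out)
    return rows


left = [{"a": "x", "b": 1}, {"a": "y", "b": 2}]
right = [{"a": "y", "b": 20}, {"a": "z", "b": 30}]
result = outer_join(
    left,
    right,
    left_keys=[("a", None), ("b", None)],
    right_keys=[("a", None), ("b", lambda v: v / 10)],  # note the "/ 10"
)
for row in sorted(result, key=lambda r: r["a"]):
    print(row)
```

Sorted by `a`, the rows have the columns `a`, `b`, and `right.b` and the same values as the "after" table: `a` is merged because both sides join on the bare column, while `b` is kept separately because the right-hand key is an expression.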
Changes
✨ New Features
- [FEAT] Ellipsize scan task sources if too many @Vince7778 (#2695)
- [FEAT] Allow user provided schema and schema inference length for read_sql @colin-ho (#2676)
- [FEAT] Add dataframe iteration on rows and change default buffer size @jaychia (#2685)
- [FEAT]: add to_arrow_iter @universalmind303 (#2681)
- [FEAT] Example Analyze for Local Execution Engine @samster25 (#2648)
- [FEAT] (ACTORS-1) Add DAFT_ENABLE_ACTOR_POOL_PROJECTS=1 feature flag and specifying concurrency @jaychia (#2668)
- [FEAT]: sql like & ilike @universalmind303 (#2666)
- [FEAT] Changes the default count() behavior to perform a global row count instead @jaychia (#2653)
- [FEAT] Support passing in column name strings to `to_struct` @Vince7778 (#2671)
- [FEAT]: refactor tree display to get more info into physicalplan @universalmind303 (#2640)
- [FEAT] Add `to_struct` function for merging columns into a struct @Vince7778 (#2662)
- [FEAT] Add hashing and groupby on structs @Vince7778 (#2657)
- [FEAT]: `daft.sql_expr` @universalmind303 (#2656)
- [FEAT] Deprecates usage of resource_request on df.with_column API @jaychia (#2654)
- [FEAT] Add input batching for UDFs @Vince7778 (#2651)
- [FEAT] Add `cbrt` expression @raunakab (#2646)
- [FEAT] use ObfuscatedString to hide creds when Display IOConfig @samster25 (#2645)
- [FEAT]: more sql functions @universalmind303 (#2596)
- [FEAT] Support __init__ arguments for StatefulUDFs @jaychia (#2634)
- [FEAT] Move resource requests to UDFs instead of on with_column @jaychia (#2632)
- [FEAT] Add wildcards in column expressions @Vince7778 (#2629)
- [FEAT] factor mermaid builder into its own module to use independently @samster25 (#2636)
- [FEAT] Remote parquet streaming @colin-ho (#2620)
- [FEAT]: mermaid formatter @universalmind303 (#2619)
- [FEAT] Add ActorPoolProject logical and physical plans @jaychia (#2601)
- [FEAT] Enable broadcast strategy on anti and semi joins @kevinzwang (#2621)
- [FEAT] Add `.list.sort()` for sorting lists within a list column @Vince7778 (#2589)
- [FEAT] Streaming Local Parquet Reads @colin-ho (#2592)
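The `count()` change listed above (#2653) and described in the migration guide can be pictured with a small plain-Python model. It contrasts a per-column count with a single global row count. The helper names are hypothetical, and the treatment of `None` as "not counted" in the per-column case is an assumption made for illustration, not something these release notes state.

```python
def per_column_counts(columns):
    """Pre-v0.3 shape: one count per column (assumed to skip None values)."""
    return {
        name: sum(value is not None for value in values)
        for name, values in columns.items()
    }


def global_count(columns):
    """v0.3 shape: a single "count" column holding the total number of rows."""
    n_rows = len(next(iter(columns.values()), []))
    return {"count": [n_rows]}


data = {"x": [1, None, 3], "y": ["a", "b", None]}
print(per_column_counts(data))
print(global_count(data))
```

The second result has exactly one column named "count", matching the migration note that `.count()` with no arguments now reports the length of the entire DataFrame.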
🚀 Performance Improvements
- [PERF] Add ability to automatically choose broadcast for anti/semi joins @kevinzwang (#2699)
- [PERF] Swordfish Dynamic Pipelines @samster25 (#2599)
- [PERF] Dyn Compare + Probe Table @samster25 (#2618)
👾 Bug Fixes
- [BUG] Fix Parquet reads with chunk sizing @desmondcheongzx (#2658)
- [BUG]: repr mermaid fix @universalmind303 (#2688)
- [BUG] Use Daft Pickle instead of Ray Pickle and use bincode for serializing @samster25 (#2693)
- [BUG] Add timeout to analytics client @raunakab (#2670)
- [BUG] Fix swordfish inner joins @colin-ho (#2678)
- [BUG] Fix struct `.hash()` naming bug @Vince7778 (#2673)
- [BUG] Fix filter pushdown into non-inner joins @kevinzwang (#2659)
- [BUG] Fix issues where we check "is_ray_runner" on non-initialized contexts @jaychia (#2652)
- [BUG] Fix nested parquet reads for .show() and .limit() @desmondcheongzx (#2643)
- [BUG] Fix join op names and join key definition @kevinzwang (#2631)
- [BUG] Fix projection pushdowns not working with limits @Vince7778 (#2635)
- [BUG] Fix Expr::with_new_children for ScalarFunction @kevinzwang (#2624)
- [BUG] Fix pushdown past monotonically increasing id @Vince7778 (#2622)
📖 Documentation
- [CHORE] Fix FOTW #1 images notebook @jaychia (#2697)
- [DOCS] Add join types, renaming behavior, and example to join docs @kevinzwang (#2691)
- [FEAT] Add dataframe iteration on rows and change default buffer size @jaychia (#2685)
- [DOCS]: add docs for cosine_distance @universalmind303 (#2675)
- [FEAT] Add `to_struct` function for merging columns into a struct @Vince7778 (#2662)
- [CHORE] Turn v0.3 deprecations into breaking changes @kevinzwang (#2663)
- [FEAT] Add `cbrt` expression @raunakab (#2646)
- [FEAT] Support __init__ arguments for StatefulUDFs @jaychia (#2634)
- [FEAT] Move resource requests to UDFs instead of on with_column @jaychia (#2632)
- [FEAT] Add wildcards in column expressions @Vince7778 (#2629)
- [DOCS] Enable doc tests in CI @colin-ho (#2615)
- [FEAT] Add `.list.sort()` for sorting lists within a list column @Vince7778 (#2589)
- docs: Add fotw tutorial on working with images @avriiil (#2490)
🧰 Maintenance
- [CHORE] fix merge conflict in repr tests @samster25 (#2700)
- [CHORE] Fix FOTW #1 images notebook @jaychia (#2697)
- [CHORE] Deprecate schema hints @colin-ho (#2655)
- [CHORE] Add error snafus for local executor @colin-ho (#2660)
- [FEAT]: refactor tree display to get more info into physicalplan @universalmind303 (#2640)
- [CHORE] Turn v0.3 deprecations into breaking changes @kevinzwang (#2663)
- [CHORE]: Drop use of deprecated form "default_features" @universalmind303 (#2665)
- [CHORE] bump dev version to 0.3.0 @samster25 (#2664)
- [CHORE]: fix feature flags @universalmind303 (#2661)
- [CHORE] Set `Expression.str.normalize()` options to False by default @Vince7778 (#2647)
- [CHORE] Improve swordfish error handling @colin-ho (#2628)
- [CHORE] Add ignore for helix editor @raunakab (#2642)
- [CHORE] Add toolchain check to Makefile @Vince7778 (#2641)
- [CHORE] Upgrade Rust toolchain to 2024-08-01 @Vince7778 (#2639)
- [CHORE] Track memory for swordfish tpch @colin-ho (#2633)
- [CHORE] Split resource-request and hashable-float-wrapper into utility crates @jaychia (#2630)
- [CHORE] Use parquet for native tpch benchmarks @colin-ho (#2609)
- [CHORE] Refactor UDFs to separate stateful and stateless @jaychia (#2597)
v0.2.33
Changes
✨ New Features
- [FEAT]: sql case/when @universalmind303 (#2591)
- [FEAT] Add comparison of timestamps with same timezone @Vince7778 (#2604)
- [FEAT] Add support for pyiceberg v0.7 @kevinzwang (#2594)
- [FEAT] Make the `end` argument for `.list.slice()` optional @desmondcheongzx (#2593)
🚀 Performance Improvements
- [PERF] Add physical plan optimizer and optimization @Vince7778 (#2557)
👾 Bug Fixes
- [BUG]: remove simsimd dependency @universalmind303 (#2605)
- [BUG] Fix parquet reads when a top-level column's final row spans more than one data page @desmondcheongzx (#2586)
- [BUG]: accept "iterable[pa.Table]" for from_arrow @universalmind303 (#2583)
📖 Documentation
- [CHORE] Fix imports on jupyter notebook examples @kevinzwang (#2600)
🧰 Maintenance
- [CHORE] Fix imports on jupyter notebook examples @kevinzwang (#2600)
- [CHORE]: ignore ".zed" directory @universalmind303 (#2595)