chore: Adding an optional hdfs crate #1377 (base: main)

Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##               main    #1377       +/-   ##
=============================================
- Coverage     56.12%   39.06%   -17.06%
- Complexity      976     2071     +1095
=============================================
  Files           119      263      +144
  Lines         11743    60746    +49003
  Branches       2251    12909    +10658
=============================================
+ Hits           6591    23733    +17142
- Misses         4012    32530    +28518
- Partials       1140     4483     +3343

View full report in Codecov by Sentry.
# Conflicts:
#	native/core/Cargo.toml
Can we have a README that covers any changes we have made (if any), and which versions of the hdfs client are supported (and on what platforms)?
Also, if we have made changes, can we add comments in the code itself to mark the areas where the code has been updated?
From the datafusion-contrib org, I see many hdfs crates; which one is best for Comet?
I know it was brought up in the original issue, but just plugging my pure Rust implementation: https://github.com/datafusion-contrib/hdfs-native-object-store. It's already integrated in delta-rs and delta-kernel-rs. Though this arguably could be the one time a JNI-based implementation could make sense, since you're guaranteed to have Java installed and probably your classpath already set correctly since you're running Spark. Doesn't seem like something that should have the implementation copied into the Comet repo, as it seems out of scope.
native/hdfs/Cargo.toml (outdated)

# Example: JAVA_HOME="/opt/homebrew/opt/openjdk@11" cargo build --features=hdfs

[package]
name = "datafusion-objectstore-hdfs"
Perhaps this should start with datafusion-comet-?
+1
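For context, making a crate like this optional usually means wiring it as an optional dependency behind a Cargo feature in the consuming crate (here, presumably native/core/Cargo.toml). A minimal sketch, assuming the renamed crate and a hypothetical relative path; this is illustrative, not necessarily what the PR does:

```toml
[dependencies]
# Hypothetical optional dependency on the renamed crate
datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true }

[features]
# `cargo build --features=hdfs` enables HDFS support;
# default builds skip the JNI/libhdfs requirements entirely.
hdfs = ["dep:datafusion-comet-objectstore-hdfs"]
```

With this shape, the JAVA_HOME requirement from the comment above only applies when the `hdfs` feature is enabled.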
native/hdfs/Cargo.toml (outdated)

[package]
name = "datafusion-objectstore-hdfs"
description = "Comet HDFS integration. Initiated by Yanghong Zhong <[email protected]> (https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)"
I think that it would be better to give credit in https://github.com/apache/datafusion-comet/blob/main/NOTICE.txt
+1
Hey @Kimahriman, it's nice to see you here. We checked the contributed crates https://github.com/datafusion-contrib/hdfs-native-object-store and https://github.com/datafusion-contrib/datafusion-objectstore-hdfs. I really like the crate you mentioned because it has a smaller memory footprint and no JVM dependency, and therefore no JVM round trips. Can't wait for this crate to mature so we can use it. For now
This is a good point.
If there are custom settings you rely on, please file an issue! We've been using this internally a decent amount and haven't seen many issues at all. Also curious how Comet handles custom Hadoop configs for accessing object stores (i.e.
That's a nice point. I would hope the crate matures as fast as possible, and it is a good idea to create feature requests; I'll do it.
I followed up on some comments, otherwise LGTM.
All of Comet's hdfs interaction is in the Parquet
Yeah, that's all I was referring to; I was wondering how Hadoop configs would get translated to DataFusion scan/object store configs. I thought maybe the whole thing would still go through the JNI for Hadoop file system interactions. I guess you could actually use the JNI-based object store in this PR for any Hadoop file system.
A small question: is it possible to use this crate to access filesystems speaking other Hadoop-compatible protocols? (libhdfs.so can do this)
Right now, Comet is losing the configs passed in to Hadoop, which is something we will address (soon?). Converting those configs to DataFusion configs will have to be done as and when we encounter them. To start with, the configs from hadoop-aws would probably be the first to be handled. I haven't dug any deeper into this yet.
@kazuyukitanimura @parthchandra @andygrove, appreciate it if you can have another look?
@@ -34,6 +34,7 @@ runs:
    run: |
      apt-get update
      apt-get install -y protobuf-compiler
      apt-get install -y clang
So we did not have clang before?
I think we don't, as fs-hdfs failed before because it was not able to find libclang*.so files.
@@ -3201,6 +3321,7 @@ checksum = "3d61fa4ffa3de412bfea335c6ecff681de2b609ba3c77ef3e00e521813a9ed9e"
 dependencies = [
  "backtrace",
  "bytes",
+ "parking_lot",
Wondering why this existing dependency has changed.
LGTM. Thanks @comphead.
Co-authored-by: Andy Grove <[email protected]>
Which issue does this PR close?
Closes #1368.
Rationale for this change
What changes are included in this PR?
How are these changes tested?