
chore: Adding an optional hdfs crate #1377

Open · comphead wants to merge 10 commits into main

Conversation

@comphead (Contributor) commented Feb 7, 2025

Which issue does this PR close?

Closes #1368 .

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@codecov-commenter commented Feb 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 39.06%. Comparing base (f09f8af) to head (adfacc7).
Report is 25 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1377       +/-   ##
=============================================
- Coverage     56.12%   39.06%   -17.06%     
- Complexity      976     2071     +1095     
=============================================
  Files           119      263      +144     
  Lines         11743    60746    +49003     
  Branches       2251    12909    +10658     
=============================================
+ Hits           6591    23733    +17142     
- Misses         4012    32530    +28518     
- Partials       1140     4483     +3343     

☔ View full report in Codecov by Sentry.

@comphead comphead marked this pull request as ready for review February 11, 2025 00:13
@parthchandra (Contributor) left a comment

Can we have a README that covers any changes we have made (if any), and what versions of the hdfs client are supported (and on what platforms)?
Also, if we have made changes, can we add comments in the code itself to mark the areas where the code has been updated?

@zuston (Member) commented Feb 11, 2025

In the datafusion-contrib org I see many hdfs crates; which one is the best fit for Comet?

native/core/Cargo.toml (review thread resolved)
.github/actions/rust-test/action.yaml (review thread resolved)
@Kimahriman (Contributor)

> In the datafusion-contrib org I see many hdfs crates; which one is the best fit for Comet?

I know it was brought up in the original issue, but just plugging my pure Rust implementation: https://github.com/datafusion-contrib/hdfs-native-object-store. It's already integrated in delta-rs and delta-kernel-rs. Though this is arguably the one time a JNI-based implementation could make sense, since you're guaranteed to have Java installed and probably have your classpath already set correctly because you're running Spark.

This doesn't seem like something that should have its implementation copied into the Comet repo; it seems out of scope.
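
Not part of this PR, but for context: a minimal sketch of wiring the pure-Rust store into a DataFusion session. It assumes the crate's `HdfsObjectStore::with_url` constructor and a hypothetical namenode address; treat it as an illustration, not settled API usage.

```rust
// Hedged sketch: registering the pure-Rust HDFS store with DataFusion.
// Assumes hdfs-native-object-store's `HdfsObjectStore::with_url` and a
// namenode reachable at hdfs://localhost:9000 (both assumptions).
use std::sync::Arc;

use datafusion::error::{DataFusionError, Result};
use datafusion::prelude::SessionContext;
use hdfs_native_object_store::HdfsObjectStore;
use url::Url;

fn register_hdfs(ctx: &SessionContext) -> Result<()> {
    // The client speaks the HDFS wire protocol directly, so no JVM or
    // libhdfs is needed at runtime.
    let store = HdfsObjectStore::with_url("hdfs://localhost:9000")
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    // All hdfs:// URLs in subsequent scans resolve through this store.
    let url = Url::parse("hdfs://localhost:9000").unwrap();
    ctx.register_object_store(&url, Arc::new(store));
    Ok(())
}
```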

# Example: JAVA_HOME="/opt/homebrew/opt/openjdk@11" cargo build --features=hdfs

[package]
name = "datafusion-objectstore-hdfs"
Review comment (Member):

Perhaps this should start with datafusion-comet- ?

Review comment (Contributor): +1
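
As an aside, the `JAVA_HOME=... cargo build --features=hdfs` example above corresponds to a Cargo feature gate. A purely illustrative sketch (these function names are not from the PR) of how such a gate keeps default builds free of the libhdfs/JNI dependency:

```rust
// Illustrative only, not the PR's code: a Cargo feature named `hdfs`
// compiles the HDFS-backed store in or out.
use std::sync::Arc;

use object_store::ObjectStore;
use url::Url;

#[cfg(feature = "hdfs")]
fn hdfs_store_for(url: &Url) -> Option<Arc<dyn ObjectStore>> {
    // Built with `cargo build --features=hdfs`: construct the HDFS store
    // here (construction elided in this sketch).
    let _ = url;
    None
}

#[cfg(not(feature = "hdfs"))]
fn hdfs_store_for(_url: &Url) -> Option<Arc<dyn ObjectStore>> {
    // Built without the feature: hdfs:// URLs are simply not handled
    // natively, and no libhdfs/JNI symbols are linked.
    None
}
```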


[package]
name = "datafusion-objectstore-hdfs"
description = "Comet HDFS integration. Initiated by Yanghong Zhong <[email protected]> (https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)"
Review comment (Member):
I think that it would be better to give credit in https://github.com/apache/datafusion-comet/blob/main/NOTICE.txt

Review comment (Contributor): +1

@comphead (Contributor, Author) commented Feb 11, 2025

> In the datafusion-contrib org I see many hdfs crates; which one is the best fit for Comet?

> I know it was brought up in the original issue, but just plugging my pure Rust implementation: https://github.com/datafusion-contrib/hdfs-native-object-store. It's already integrated in delta-rs and delta-kernel-rs. Though this is arguably the one time a JNI-based implementation could make sense, since you're guaranteed to have Java installed and probably have your classpath already set correctly because you're running Spark.

> This doesn't seem like something that should have its implementation copied into the Comet repo; it seems out of scope.

Hey @Kimahriman, it's nice to see you here. We checked the contribution crates https://github.com/datafusion-contrib/hdfs-native-object-store and https://github.com/datafusion-contrib/datafusion-objectstore-hdfs.

I really like the crate you mentioned because it has a smaller memory footprint and no JVM dependency, and therefore no JVM round trips. I can't wait for this crate to mature so we can use it.

For now, libhdfs on the JVM provides more HDFS client configuration options, which is critical on production sites, compared to https://github.com/Kimahriman/hdfs-native?tab=readme-ov-file#supported-hdfs-settings, where it is hard to configure cluster networking, namenode retries, etc.

@comphead (Contributor, Author)

> Can we have a README that covers any changes we have made (if any), and what versions of the hdfs client are supported (and on what platforms)? Also, if we have made changes, can we add comments in the code itself to mark the areas where the code has been updated?

This is a good point.

@Kimahriman (Contributor)

> For now, libhdfs on the JVM provides more HDFS client configuration options, which is critical on production sites, compared to https://github.com/Kimahriman/hdfs-native?tab=readme-ov-file#supported-hdfs-settings, where it is hard to configure cluster networking, namenode retries, etc.

If there are custom settings you rely on, please file an issue! We've been using this internally a decent amount and haven't seen many issues at all.

I'm also curious how Comet handles custom Hadoop configs for accessing object stores (i.e. fs.s3a.* type settings) on the native path.

@comphead (Contributor, Author)

> For now, libhdfs on the JVM provides more HDFS client configuration options, which is critical on production sites, compared to https://github.com/Kimahriman/hdfs-native?tab=readme-ov-file#supported-hdfs-settings, where it is hard to configure cluster networking, namenode retries, etc.

> If there are custom settings you rely on, please file an issue! We've been using this internally a decent amount and haven't seen many issues at all.

That's a fair point. I hope the crate matures as fast as possible, and creating feature requests is a good idea; I'll do that.

@kazuyukitanimura (Contributor)

I followed up on some comments; otherwise LGTM.

@parthchandra (Contributor)

> For now, libhdfs on the JVM provides more HDFS client configuration options, which is critical on production sites, compared to https://github.com/Kimahriman/hdfs-native?tab=readme-ov-file#supported-hdfs-settings, where it is hard to configure cluster networking, namenode retries, etc.

> If there are custom settings you rely on, please file an issue! We've been using this internally a decent amount and haven't seen many issues at all.

> I'm also curious how Comet handles custom Hadoop configs for accessing object stores (i.e. fs.s3a.* type settings) on the native path.

All of Comet's hdfs interaction is in the Parquet FileReader, which is in Java. We simply use the hadoop client to access any hdfs-compliant store, including those accessed via hadoop-aws. Since Spark recognizes all hadoop configs and passes them on to the FileReader, we are able to access anything a Java application can.
Doing all this natively is not going to be easy, so for the next phase, where we are moving more of our parquet code to native, we will have a hook back to the JVM class for when we cannot access hdfs or an object store natively.
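
A purely illustrative sketch of the dispatch described above; the enum and the supported-scheme list are assumptions for the sketch, not Comet's actual code:

```rust
// Route a read natively when the URL scheme has a Rust object_store
// implementation; otherwise fall back to the JVM-side Hadoop FileSystem
// through JNI. Illustrative only.
use url::Url;

enum ReaderSource {
    Native,      // a Rust object_store implementation handles it
    JvmFallback, // the Java Parquet FileReader handles it via JNI
}

fn choose_reader(url: &Url) -> ReaderSource {
    match url.scheme() {
        // The supported-scheme list here is an assumption for the sketch.
        "s3" | "s3a" | "hdfs" => ReaderSource::Native,
        // Anything else (viewfs, custom Hadoop filesystems, ...) goes back
        // through the JVM, which understands every Hadoop-compatible FS.
        _ => ReaderSource::JvmFallback,
    }
}
```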

@Kimahriman (Contributor) commented Feb 12, 2025

> Doing all this natively is not going to be easy, so for the next phase, where we are moving more of our parquet code to native, we will have a hook back to the JVM class for when we cannot access hdfs or an object store natively.

Yeah, that's all I was referring to; I was wondering how Hadoop configs would get translated to datafusion scan/object store configs. I thought maybe the whole thing would still go through the JNI for hadoop file system interactions. I guess you could actually use the JNI-based object store in this PR for any Hadoop file system.

@zuston (Member) commented Feb 12, 2025

A small question: is it possible to use this crate to access other Hadoop-compatible filesystems? (libhdfs.so can do this.)

@parthchandra (Contributor)

> Doing all this natively is not going to be easy, so for the next phase, where we are moving more of our parquet code to native, we will have a hook back to the JVM class for when we cannot access hdfs or an object store natively.

> Yeah, that's all I was referring to; I was wondering how Hadoop configs would get translated to datafusion scan/object store configs. I thought maybe the whole thing would still go through the JNI for hadoop file system interactions. I guess you could actually use the JNI-based object store in this PR for any Hadoop file system.

Right now, Comet is losing the configs passed in to hadoop, which is something we will address (soon?). Converting those configs to DataFusion configs will have to be done as and when we encounter them. To start with, the configs from hadoop-aws would probably be the first to be handled. I haven't dug any deeper into this yet.
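
For illustration, that translation could look roughly like the sketch below, building on object_store's `AmazonS3Builder`. The choice of keys and the mapping are assumptions, not Comet's implemented behavior:

```rust
// Hedged sketch: mapping a few hadoop-aws (fs.s3a.*) settings onto
// object_store's AmazonS3Builder.
use std::collections::HashMap;

use object_store::aws::{AmazonS3, AmazonS3Builder};

fn s3_store_from_hadoop_conf(
    conf: &HashMap<String, String>,
    bucket: &str,
) -> object_store::Result<AmazonS3> {
    let mut builder = AmazonS3Builder::new().with_bucket_name(bucket);
    // fs.s3a.access.key / fs.s3a.secret.key are the standard hadoop-aws
    // credential settings.
    if let Some(v) = conf.get("fs.s3a.access.key") {
        builder = builder.with_access_key_id(v.as_str());
    }
    if let Some(v) = conf.get("fs.s3a.secret.key") {
        builder = builder.with_secret_access_key(v.as_str());
    }
    // fs.s3a.endpoint covers S3-compatible stores (e.g. MinIO).
    if let Some(v) = conf.get("fs.s3a.endpoint") {
        builder = builder.with_endpoint(v.as_str());
    }
    builder.build()
}
```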

@comphead (Contributor, Author)

@kazuyukitanimura @parthchandra @andygrove I'd appreciate it if you could have another look.

@@ -34,6 +34,7 @@ runs:
run: |
apt-get update
apt-get install -y protobuf-compiler
apt-get install -y clang
Review comment (Contributor):

So we did not have clang before?

Review comment (Contributor, Author):

I don't think we did; fs-hdfs failed before because it was not able to find the libclang*.so files.

@@ -3201,6 +3321,7 @@ checksum = "3d61fa4ffa3de412bfea335c6ecff681de2b609ba3c77ef3e00e521813a9ed9e"
dependencies = [
"backtrace",
"bytes",
"parking_lot",
Review comment (Contributor):

wondering why this existing dependency has changed

NOTICE.txt (review thread resolved)
NOTICE.txt (review thread resolved)
@andygrove (Member) left a comment

LGTM. Thanks @comphead.

comphead and others added 2 commits February 12, 2025 17:13
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Linked issue: Add separate HDFS submodule to Comet (#1368)
7 participants