Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable fs.gs.inputstream.fast.fail.on.not.found.enable by default #5510

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

clairemcginty
Copy link
Contributor

@clairemcginty clairemcginty commented Oct 4, 2024

This should speed up ParquetReader initialization by skipping a GCS StorageObject lookup that validates that the file exists before trying to open a ReadChannel. Per the doc:

  /**
   * If {@code true}, on opening a file we will proactively perform a metadata {@code GET} to check
   * whether the object exists, even though the underlying channel will not open a data stream until
   * {@code read()} is actually called. This is necessary to technically match the expected behavior
   * of Hadoop filesystems, but incurs an extra latency overhead on {@code open()}. If the calling
   * code can handle late failures on not-found errors, or has independently already ensured that a
   * file exists before calling {@code open()}, then you can set this to {@code false} for more
   * efficient reads.
   *
   * <p>Note, this is known to not work with YARN {@code CommonNodeLabelsManager} and potentially
   * other Hadoop components. That's why it's not recommended to set this property to {@code false}
   * cluster-wide, instead set it for a specific job/application that is compatible with it.
   */
  public static final HadoopConfigurationProperty<Boolean>
      GCS_INPUT_STREAM_FAST_FAIL_ON_NOT_FOUND_ENABLE =
          new HadoopConfigurationProperty<>(
              "fs.gs.inputstream.fast.fail.on.not.found.enable", true);

This is redundant, since ParquetRead executes FileIO.readMatches before opening any ParquetReaders, which will throw an error on nonexistent files anyway; I tested this with Dataflow and confirmed:

Caused by: java.io.FileNotFoundException: No files matched spec: gs://my-bucket/path/doesnot/exist/*.parquet
	org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:174)
	org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:161)
	org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn.process(FileIO.java:753)

Therefore, we know we'll always invoke ParquetReader with an existent file path. Since this metadata lookup can be quite expensive threadpool-wise, so let's skip it since we don't need it.

@clairemcginty clairemcginty requested a review from kellen October 4, 2024 18:12
Copy link

codecov bot commented Oct 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.40%. Comparing base (b2c4ff1) to head (eec92db).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5510      +/-   ##
==========================================
- Coverage   61.42%   61.40%   -0.02%     
==========================================
  Files         312      312              
  Lines       11106    11106              
  Branches      771      771              
==========================================
- Hits         6822     6820       -2     
- Misses       4284     4286       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still require this file ? IIRC to use parquet with scio-smb we need scio-parquet where we define the same settings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point actually - I think it can be deleted

@clairemcginty
Copy link
Contributor Author

Hmm, on running some tests w/ debugger enabled, I'm still seeing GoogleCloudStorageOptions instantiated from GcsUtil with fastFailOnNotFound: true ... this might be something we need to update in the Beam layer. Looking into it more now

@clairemcginty
Copy link
Contributor Author

Hmm, on running some tests w/ debugger enabled, I'm still seeing GoogleCloudStorageOptions instantiated from GcsUtil with fastFailOnNotFound: true ... this might be something we need to update in the Beam layer. Looking into it more now

Small update: I discovered that gcs-connector options were not getting propagated to any Parquet reads that use SplittableDoFn API. I've contributed a workaround that will be included in the upcoming Beam 2.62.0 release (we'll need to set default options in Scio as well).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants