Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQueryIO uniformize direct and export reads #32360

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

RustedBones
Copy link
Contributor

@RustedBones RustedBones commented Aug 29, 2024

Refers to #26329, also fix #20100, #21076

When using readWithDatumReader and DIRECT_READ method, the transform would fail because the parseFn is expected. Refactor the IO so the avro datumReader can be use in both cases.

In some case, it is required to get the data with the desired schema. Currently, BQ io always uses the writer schema (or table schema). Create new APIs to set the reader schema.

This refactoring contains some breaking changes:

withFormat is not exposed anymore. Indeed, it is not possible to configure a TypedRead with a DatumReaderFactory and change the format later. Data format MUST be chosen when creating the transform.

In the TypedRead.Builder, replace the DatumReaderFactory with the BigQueryReaderFactory allowing to handle both avro and arrow in uniform fashion. This alters the BigQueryIOTranslation.
I need some help on that point to handle that in a better way.

Edit: reworked part of this PR to keep compatibility

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@RustedBones
Copy link
Contributor Author

assign set of reviewers

Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@RustedBones
Copy link
Contributor Author

Some BQ integration tests are failing.

I don't know schema & data of the following big_query_import_export.parallel_read_table_row_xxx tables so I can recreated the setup in a personal GCP project. Can someone give me a hand ?

@RustedBones RustedBones marked this pull request as draft September 3, 2024 07:18
@github-actions github-actions bot added examples and removed examples labels Sep 3, 2024
@RustedBones RustedBones marked this pull request as ready for review September 3, 2024 19:48
@github-actions github-actions bot added examples and removed examples labels Sep 4, 2024
// read table schema and infer coder if possible
Coder<T> c;
if (getCoder() == null) {
tableSchema = requestTableSchema(sourceDef, bqOptions, getSelectedFields());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fine to access the BQ table at graph creation time? (It was already doing that when beam schema was requested)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a valid concern. I've heard use case where pipeline submission machine does not or has incomplete permission to the resource, and infer schema at graph creation time can cause issue. General guideline is the use case used to work should be able to work still (and vice versa)

@@ -1731,7 +1870,7 @@ public void processElement(ProcessContext c) throws Exception {
.setTable(
BigQueryHelpers.toTableResourceName(
queryResultTable.getTableReference()))
.setDataFormat(DataFormat.AVRO))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was arrow even supported ?

@@ -89,8 +95,8 @@ static class BigQueryIOReadTranslator implements TransformPayloadTranslator<Type
.addNullableBooleanField("use_legacy_sql")
.addNullableBooleanField("with_template_compatibility")
.addNullableByteArrayField("bigquery_services")
.addNullableByteArrayField("parse_fn")
.addNullableByteArrayField("datum_reader_factory")
.addNullableByteArrayField("bigquery_reader_factory")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a complex object to serialize. subject to serialization error if there's changes between versions

Copy link
Contributor

Reminder, please take a look at this pr: @robertwb @ahmedabu98

@Abacn
Copy link
Contributor

Abacn commented Jan 9, 2025

waiting on author

@Abacn
Copy link
Contributor

Abacn commented Jan 9, 2025

Looks like this is a evolving PR, please let me know when it's ready. Also, good to split independent fix out so it expedites review. Thank you!

@RustedBones
Copy link
Contributor Author

RustedBones commented Jan 10, 2025

This is not evolving anymore.
I rebased many times the PR to re-trigger the CI, in the hope to get all green checks in CI, without success.
Seems most of the time the error is unrelated to the changes.

Will merge master in the future. It will be visible on your end that no changes are added.

@github-actions github-actions bot added kotlin and removed kotlin labels Jan 10, 2025
Copy link
Contributor

Reminder, please take a look at this pr: @robertwb @Abacn

Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Copy link
Contributor

Reminder, please take a look at this pr: @kennknowles @ahmedabu98

Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damondouglas for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Copy link
Contributor

github-actions bot commented Feb 8, 2025

Reminder, please take a look at this pr: @damondouglas @Abacn

Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @johnjcasey for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@Abacn
Copy link
Contributor

Abacn commented Feb 12, 2025

sorry for being late and thanks again for persistence. If there are something most urgent (and small) within this PR, feel free to send smaller PRs for review. Had been challenging to find sufficient chunk of time reviewing large changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BigQueryIO TableRowParser should support Arrow and Avro data formats
3 participants