updated/PLUGIN-296 #457

ailegion · 2020-12-17T14:51:16Z

added: Output schema to match fields from import query

Jira Ticket: https://cdap.atlassian.net/browse/PLUGIN-296

CuriousVini · 2020-12-21T21:11:25Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

  // listing table's schema documented at https://cloud.google.com/spanner/docs/information-schema
  private static final Statement.Builder SCHEMA_STATEMENT_BUILDER = Statement.newBuilder(
    String.format("SELECT  t.column_name,t.spanner_type, t.is_nullable FROM information_schema.columns AS t WHERE " +
                    "  t.table_catalog = ''  AND  t.table_schema = '' AND t.table_name = @%s", TABLE_NAME));
+  private static final Statement.Builder SCHEMA_STATEMENT_BUILDER_WITH_COLUMNS = Statement.newBuilder(
+    String.format("SELECT  t.column_name,t.spanner_type, t.is_nullable FROM information_schema.columns AS t WHERE " +


If the prefix of this string is the same as the string above, we can extract it and just concatenate the two strings.
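A rough sketch of what sharing the prefix could look like. The second statement is cut off in the hunk above, so the `AND t.column_name IN UNNEST(...)` suffix, the `COLUMN_NAMES` parameter, and the value of `TABLE_NAME` are assumptions for illustration only, not what the PR contains.

```java
final class SchemaQueries {
  // Assumed parameter names; the real constant values are not shown in the hunk.
  static final String TABLE_NAME = "TableName";
  static final String COLUMN_NAMES = "ColumnNames"; // hypothetical

  // Common prefix, written once and reused by both statements.
  static final String SCHEMA_QUERY_PREFIX =
    "SELECT t.column_name, t.spanner_type, t.is_nullable FROM information_schema.columns AS t WHERE "
      + "t.table_catalog = '' AND t.table_schema = '' AND t.table_name = @" + TABLE_NAME;

  // Full-table variant: the prefix as is.
  static final String SCHEMA_QUERY = SCHEMA_QUERY_PREFIX;

  // Column-filtered variant: prefix plus a (hypothetical) extra condition.
  static final String SCHEMA_QUERY_WITH_COLUMNS =
    SCHEMA_QUERY_PREFIX + " AND t.column_name IN UNNEST(@" + COLUMN_NAMES + ")";
}
```

Each string would then be passed to its own `Statement.newBuilder(...)` call, as in the existing code.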

CuriousVini · 2020-12-21T22:34:54Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

@@ -241,11 +255,40 @@ private Schema getSchema(FailureCollector collector) {
                                                         projectId)) {
      DatabaseClient databaseClient =
        spanner.getDatabaseClient(DatabaseId.of(projectId, config.instance, config.database));
-      Statement getTableSchemaStatement = SCHEMA_STATEMENT_BUILDER.bind(TABLE_NAME).to(config.table).build();
+      Map<String, String> columnNameMap = new HashMap<>();
+      // get columns from import query when query does not contain the '*' or 'case'


Why are we making this assumption? Also, why are we checking for 'case'?

The '*' indicates that the user is selecting all columns of the table.
As for 'case', the user can write a SELECT statement like this one:
SELECT A, B, CASE A WHEN 90 THEN 'red' WHEN 50 THEN 'blue' ELSE 'green' END AS result FROM Numbers
In these cases we fall back to getting the full schema of the table.

What is case used for?

If the user types a query like:

SELECT A, B, CASE A WHEN 90 THEN 'red' WHEN 50 THEN 'blue' ELSE 'green' END AS result FROM Numbers

then we fall back to getting the full schema of the table.

Hm, this approach is not ideal. You can construct queries where we unnecessarily fall back to getting the full table schema.
For example, if the query is SELECT COUNT(*) FROM table, we'll get the full schema.

Instead of parsing the query, can we execute the query with LIMIT 1 and get the schema from the query result?

Let's implement this correctly so we won't need to rework the fix later.
Ideally, we would get the correct schema for all queries.

@rmstar With the suggested implementation we face the following issue:
Currently, with the Spanner cloud library we can get the column name and column type, but we are missing the information on whether a column is nullable.
Spanner has this information in the 'ResultSetMetadata' class; however, this class is only available in the 'com.google.spanner' package (https://googleapis.dev/java/google-cloud-spanner/3.2.1/com/google/spanner/v1/ResultSetMetadata.html) and not in the 'com.google.cloud.spanner' package (https://googleapis.dev/java/google-cloud-spanner/latest/com/google/cloud/spanner/ResultSet.html).

What's the effort involved to use com.google.spanner library to get column metadata, including whether it's nullable or not?

If that's not feasible, can we use com.google.cloud.spanner to get the column names, and then use the existing approach in this PR (query information_schema) to check if the column is nullable?

We will not be able to use alternative 1, the com.google.spanner library.

Yes, we can use com.google.cloud.spanner to get the column names and then use the existing approach (information_schema) to check whether each column is nullable. However, there are cases where we can't get the nullability information, such as:

Querying using aliases

Aggregate functions

For these cases, should we default to a nullable column?

Yes, if you can't get nullable information, then default to nullable column.
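A minimal sketch of that fallback, assuming the column names come from the query result and the nullability flags come from information_schema. `NullabilityResolver` and its parameter names are hypothetical and not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NullabilityResolver {
  /**
   * Returns a nullability flag for every result column, in result order.
   * Columns with no information_schema entry (for example aliases or
   * aggregate functions) default to nullable.
   */
  static Map<String, Boolean> resolve(List<String> resultColumns,
                                      Map<String, Boolean> knownNullability) {
    Map<String, Boolean> resolved = new LinkedHashMap<>();
    for (String column : resultColumns) {
      resolved.put(column, knownNullability.getOrDefault(column, Boolean.TRUE));
    }
    return resolved;
  }
}
```

With this shape, an alias such as `Id` or an aggregate such as `COUNT(*)` simply has no entry in the lookup map and ends up nullable.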

CuriousVini · 2020-12-22T09:36:52Z

Schema should be populated in accordance with import query

google-cla · 2020-12-28T15:56:51Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment "@googlebot I fixed it." If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

ardian4 · 2020-12-28T15:58:03Z

Schema should be populated in accordance with import query

Updated the PR to populate the schema in accordance with the import query.

CuriousVini

Could you please provide list of select statements this fix has been tested with?

CuriousVini · 2020-12-28T22:31:37Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSourceConfig.java

@@ -84,10 +84,6 @@ public void validate(FailureCollector collector) {
    if (!containsMacro(NAME_SCHEMA) && schema != null) {
      SpannerUtil.validateSchema(schema, SUPPORTED_TYPES, collector);
    }
-    if (!containsMacro(NAME_SCHEMA) && schema == null) {


This will undo the changes for https://cdap.atlassian.net/browse/PLUGIN-251
The correct fix is to do something similar to the BigQuery source:
https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/source/BigQuerySource.java#L194

CuriousVini · 2020-12-28T22:34:08Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

-      Statement getTableSchemaStatement = SCHEMA_STATEMENT_BUILDER.bind(TABLE_NAME).to(config.table).build();
+      Map<String, String> columnNameMap = new HashMap<>();
+      // get columns from import query when query does not contain the '*' or 'case'
+      if (config.importQuery != null && !config.importQuery.contains("*") &&


Check for both a null and an empty query.
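One possible shape for that guard, written with plain JDK calls. The helper name `hasImportQuery` is hypothetical; a library utility for blank checks would work equally well.

```java
final class ImportQueryCheck {
  /** True only when the query is non-null and contains something other than whitespace. */
  static boolean hasImportQuery(String importQuery) {
    return importQuery != null && !importQuery.trim().isEmpty();
  }
}
```

The condition in the hunk above could then start with `hasImportQuery(config.importQuery)` instead of the bare null check.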

CuriousVini · 2020-12-28T22:35:06Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

@@ -241,11 +255,40 @@ private Schema getSchema(FailureCollector collector) {
                                                         projectId)) {
      DatabaseClient databaseClient =
        spanner.getDatabaseClient(DatabaseId.of(projectId, config.instance, config.database));
-      Statement getTableSchemaStatement = SCHEMA_STATEMENT_BUILDER.bind(TABLE_NAME).to(config.table).build();
+      Map<String, String> columnNameMap = new HashMap<>();
+      // get columns from import query when query does not contain the '*' or 'case'


What is case used for?

CuriousVini · 2020-12-28T22:35:35Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

+          }
+        }
+      }
+      Statement getTableSchemaStatement = null;


there is no need to initialize this as null

CuriousVini · 2020-12-28T22:39:09Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

      try (ResultSet resultSet = databaseClient.singleUse().executeQuery(getTableSchemaStatement)) {
        List<Schema.Field> schemaFields = new ArrayList<>();
        while (resultSet.next()) {
          String columnName = resultSet.getString("column_name");
+          // remap column name to alias
+          if (!columnNameMap.isEmpty() && columnNameMap.containsKey(columnName)) {


why is !columnNameMap.isEmpty() check needed? Why not just do columnNameMap.containsKey(columnName)?
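To illustrate the point: a lookup on an empty map already returns "not found", so the extra `isEmpty()` check changes nothing. The sketch below uses a hypothetical `resolve` helper that collapses the remapping into a single call.

```java
import java.util.Map;

final class AliasLookup {
  /** Returns the alias for the column if one is registered, otherwise the column name itself. */
  static String resolve(Map<String, String> columnNameMap, String columnName) {
    // On an empty map this simply returns columnName; no isEmpty() guard is needed.
    return columnNameMap.getOrDefault(columnName, columnName);
  }
}
```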

CuriousVini · 2020-12-28T22:43:10Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

+            // check for column aliases
+            if (column.toLowerCase().contains(" as ")) {
+              String[] columnNameAndAlias = column.split(COLUMN_ALIAS_SPLIT_PATTERN);
+              columnNameMap.put(columnNameAndAlias[0].trim(), columnNameAndAlias[1].trim());


We should check the size of columnNameAndAlias before indexing into it.
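A sketch of the guarded version. The value of `COLUMN_ALIAS_SPLIT_PATTERN` is not shown in the hunk, so the pattern below is an assumed stand-in, and `addAlias` is a hypothetical helper.

```java
import java.util.Map;

final class AliasParser {
  // Assumed stand-in for the PR's COLUMN_ALIAS_SPLIT_PATTERN: "as" surrounded by whitespace, any case.
  static final String COLUMN_ALIAS_SPLIT_PATTERN = "(?i)\\s+as\\s+";

  /** Records "column -> alias" only when the split yields exactly two non-empty parts. */
  static void addAlias(String column, Map<String, String> columnNameMap) {
    String[] columnNameAndAlias = column.trim().split(COLUMN_ALIAS_SPLIT_PATTERN);
    if (columnNameAndAlias.length != 2) {
      return; // e.g. "Identifier as" with no alias after it
    }
    String name = columnNameAndAlias[0].trim();
    String alias = columnNameAndAlias[1].trim();
    if (!name.isEmpty() && !alias.isEmpty()) {
      columnNameMap.put(name, alias);
    }
  }
}
```

Malformed fragments are skipped here rather than reported; whether the plugin should instead raise a validation failure is a separate decision.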

google-cla · 2020-12-29T13:37:29Z

Same CLA notice as the first google-cla comment above: agreements could not be found for all commit authors or co-authors.

ardian4 · 2020-12-29T13:44:01Z

Could you please provide list of select statements this fix has been tested with?

With a null output schema, tested with the following import queries:

empty import query
Select * from testtable
Select FirstName from testtable
Select FirstName as name from testtable
Select FirstName, Identifier from testtable
Select FirstName, Identifier as Id from testtable
Select FirstName, Identifier as from testtable
Select FirstName, as from testtable

CuriousVini · 2020-12-29T21:36:43Z

Could you please provide list of select statements this fix has been tested with?

With a null output schema, tested with the following import queries:

empty import query

Select * from testtable

Select FirstName from testtable

Select FirstName as name from testtable

Select FirstName, Identifier from testtable

Select FirstName, Identifier as Id from testtable

Select FirstName, Identifier as from testtable

Select FirstName, as from testtable

What is the behavior for the following invalid queries?

Select FirstName, Identifier as from testtable
Select FirstName, as from testtable

CuriousVini · 2020-12-30T01:15:16Z

also please test with queries like:

SELECT A, B, CASE A WHEN 90 THEN 'red' WHEN 50 THEN 'blue' ELSE 'green' END AS result FROM Numbers

CuriousVini · 2020-12-30T01:15:28Z

This should also have an integration test

ardian4 · 2020-12-30T14:06:06Z

Select FirstName, as from testtable

Select FirstName, Identifier as from testtable

ardian4 · 2020-12-30T14:09:19Z

also please test with queries like:

SELECT A, B, CASE A WHEN 90 THEN 'red' WHEN 50 THEN 'blue' ELSE 'green' END AS result FROM Numbers

When the import query contains 'case', it will get all the fields.

ardian4 · 2021-01-05T15:30:53Z

This should also have an integration test

cdapio/cdap-integration-tests#1084

google-cla · 2021-01-18T15:05:06Z

Same CLA notice as the first google-cla comment above: agreements could not be found for all commit authors or co-authors.

rmstar · 2021-01-19T20:56:38Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

+    if (StringUtils.containsIgnoreCase(importQuery, LIMIT)) {
+      int total = StringUtils.lastIndexOf(importQuery, LIMIT);
+      String substringToReplace = importQuery.substring(total);
+      query = importQuery.replace(substringToReplace, "limit 1");


This can break in corner cases where the table name contains the substring "limit", for example SELECT <columns> FROM limited.
Can you make sure we only replace "limit <number>" with "limit 1"?
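One way to do this is to match the keyword only when it is followed by a number at the end of the query. This is a sketch, not the code that was merged: `LimitRewriter` is a hypothetical name, and it assumes the LIMIT clause, if present, is the last thing in the query (no OFFSET, no trailing semicolon).

```java
import java.util.regex.Pattern;

final class LimitRewriter {
  // "limit" as a whole word, whitespace, digits, optional trailing whitespace, end of input.
  private static final Pattern TRAILING_LIMIT = Pattern.compile("(?i)\\blimit\\s+\\d+\\s*$");

  /** Rewrites a trailing "LIMIT n" to "LIMIT 1", or appends "LIMIT 1" when there is none. */
  static String toSingleRow(String importQuery) {
    String trimmed = importQuery.trim();
    if (TRAILING_LIMIT.matcher(trimmed).find()) {
      return TRAILING_LIMIT.matcher(trimmed).replaceFirst("LIMIT 1");
    }
    return trimmed + " LIMIT 1";
  }
}
```

Because the digits are required, a table named `limited` is left alone and the clause is appended instead.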

google-cla · 2021-01-20T14:18:52Z

Same CLA notice as the first google-cla comment above: agreements could not be found for all commit authors or co-authors.

rmstar

lgtm, please address minor comment and squash commits.

rmstar · 2021-01-20T19:29:40Z

src/main/java/io/cdap/plugin/gcp/spanner/source/SpannerSource.java

@@ -272,6 +309,20 @@ private Schema getSchema(FailureCollector collector) {

  }

+  private Statement getStatementForOneRow(String importQuery) {
+    String query;
+    String regex = "^(?:[^;']|(?:'[^']+'))+ LIMIT +\\d+(.*)";


Please add a comment that explains what this regex matches.
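For reference, here is the pattern from the hunk with one reading of it written out as comments. The wrapper class and method names are hypothetical; only the regex string itself comes from the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LimitRegexDemo {
  // ^                      start of the query
  // (?:[^;']|(?:'[^']+'))+ one or more of: a character that is neither ';' nor a single quote,
  //                        or a complete non-empty single-quoted string literal
  //  LIMIT +\d+            a space, the upper-case keyword LIMIT, spaces, then digits
  // (.*)                   whatever follows the limit value, captured as group 1
  static final Pattern LIMIT_PATTERN =
    Pattern.compile("^(?:[^;']|(?:'[^']+'))+ LIMIT +\\d+(.*)");

  /** True when the whole query matches the pattern. */
  static boolean hasLimitClause(String query) {
    return LIMIT_PATTERN.matcher(query).matches();
  }

  /** Text after the limit value, or null when the query has no matching LIMIT clause. */
  static String afterLimit(String query) {
    Matcher matcher = LIMIT_PATTERN.matcher(query);
    return matcher.matches() ? matcher.group(1) : null;
  }
}
```

Read this way, a LIMIT that appears inside a quoted literal is not treated as a clause, and a table named `limited` does not match. Two things worth noting in the comment: the keyword is matched in upper case only, and an empty string literal (`''`) earlier in the query prevents a match.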

ardian4 · 2021-01-21T10:27:15Z

squash

Done

bajram-adapt requested a review from CuriousVini December 17, 2020 15:18

CuriousVini reviewed Dec 22, 2020

bajram-adapt requested a review from CuriousVini December 22, 2020 14:40

CuriousVini reviewed Dec 28, 2020

CuriousVini added the 6.4 label Dec 29, 2020

bajram-adapt requested review from CuriousVini and rmstar January 5, 2021 19:50

rmstar reviewed Jan 19, 2021

bajram-adapt requested a review from rmstar January 20, 2021 14:29

rmstar approved these changes Jan 20, 2021

updated: SpannerSource schema to match import query

0f66e6e

ardian4 force-pushed the update/PLUGIN-296 branch from 2b7396a to 0f66e6e Compare January 21, 2021 10:26

rmstar merged commit 6f53291 into data-integrations:develop Jan 22, 2021

This was referenced Jan 22, 2021

updated: SpannerSource schema to match import query #522

Open

updated: SpannerSource schema to match import query #523

Open

updated: SpannerSource schema to match import query #524

Open
