[SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type #32037
Conversation
cc @cloud-fan @maropu @HyukjinKwon please take a look, thanks
Kubernetes integration test starting
Kubernetes integration test status success
Test build #136862 has finished for PR 32037 at commit
It seems we need to update the TPCDS query results accordingly if we change the schemas. I'll check this later.
Shall we comment out the action temporarily? I tried to re-generate the result golden file, but since the data is static, the job failed with a Parquet decoding mismatch. The static data isn't flexible enough when the write side changes its data or schema. Maybe we should generate the data within the test itself, but I don't know whether GA could handle such a large workload.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136876 has finished for PR 32037 at commit
@yaooqinn Yea, I'm planning to do something for the issue in a few weeks. Please see the related discussion: #31886 (comment)
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136926 has finished for PR 32037 at commit
Thank you @maropu for following up.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136951 has finished for PR 32037 at commit
I am good with going ahead, but I will leave it to @maropu since he's driving this.
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #137199 has finished for PR 32037 at commit
@@ -485,11 +485,10 @@ jobs:
          restore-keys: |
            tpcds-
      - name: Checkout TPC-DS (SF=1) generated data repository
        if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
We need to always fetch the data?
It failed the first time, so I turned it off. Shall I turn it on again now?
13:37:51.854 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 315.0 (TID 204)
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToInt(ParquetDictionary.java:38)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:298)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Do we need to disable the cache, merge this PR, and then enable the cache again? I still see the above error with the cache on.
What's the mechanism to refresh the cache?
I guess it's based on the cache key here https://github.com/apache/spark/pull/32037/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R484, which is derived from the cached data itself (seems like a mistake we made before?), so it never changes unless we get a chance to replace the cached data entirely by disabling it. Or let me try to modify the key to rely on the latest revision of maropu/tpcds-sf-1.
rely on the latest revision of maropu/tpcds-sf-1
SGTM
It may still hit the cache via fuzzy matching of restore-keys, which seems useless for this case. If it fails again, I will remove that too and try again.
Yea, it looks better to check whether it can work correctly without restore-keys.
Looks fine if the GA tests can pass.
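The direction the thread settles on could be sketched roughly as below (a hypothetical sketch, not the actual workflow change: step names are invented, and it assumes the data lives in maropu/tpcds-sf-1 and that its HEAD SHA can be resolved with `git ls-remote` in an earlier step):

```yaml
# Resolve the latest revision of the data repository first, then key the
# cache on that SHA instead of on the cached data itself, and drop
# restore-keys so a stale cache cannot be fuzzy-matched.
- name: Check latest revision of the TPC-DS data repository
  id: tpcds-sha
  run: echo "::set-output name=sha::$(git ls-remote https://github.com/maropu/tpcds-sf-1 HEAD | cut -f1)"
- name: Cache TPC-DS generated data
  id: cache-tpcds-sf-1
  uses: actions/cache@v2
  with:
    path: ./tpcds-sf-1
    key: tpcds-${{ steps.tpcds-sha.outputs.sha }}
- name: Checkout TPC-DS (SF=1) generated data repository
  if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
  uses: actions/checkout@v2
  with:
    repository: maropu/tpcds-sf-1
    path: ./tpcds-sf-1
```

With a key derived from the data repository's revision, regenerating the data invalidates the cache automatically, and the checkout step only runs on a cache miss.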
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137214 has finished for PR 32037 at commit
Test build #137215 has finished for PR 32037 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137221 has finished for PR 32037 at commit
Test build #137229 has finished for PR 32037 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
All tests pass except for a known issue in the GA environment; merging to master.
Test build #137258 has finished for PR 32037 at commit
…ricks/spark-sql-perf

### What changes were proposed in this pull request?
This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase. Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts. The code authors of these classes are: juliuszsompolski, npoggi, wangyum.

### Why are the changes needed?
We frequently use TPC-DS data now for benchmarks/tests, but the classes for the TPC-DS schemas of datagen and benchmarks/tests are managed separately, e.g.,
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
- https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema (#32037). So it would be useful for the Spark codebase to generate the data by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?
Dev only.

### How was this patch tested?
Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?
According to http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf, this PR proposes that we use int for identifiers instead of bigint to comply with the TPC-DS Standard Specification.

After this PR, the field schemas are consistent with the DDLs in the `tpcds.sql` from the TPC-DS toolkit, see https://gist.github.com/yaooqinn/b9978a77bbf4f871a95d6a9103019907

### Why are the changes needed?
To comply with the TPC-DS Standard Specification.

### Does this PR introduce any user-facing change?
No, test only.

### How was this patch tested?
Test only.