Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type #32037

Closed
wants to merge 24 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Apr 2, 2021

What changes were proposed in this pull request?

According to http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf

 2.2.2 Datatype
2.2.2.1 Each column employs one of the following datatypes:
a) Identifier means that the column shall be able to hold any key value generated for that column.
b) Integer means that the column shall be able to exactly represent integer values (i.e., values in increments of
1) in the range of at least ( − 2n − 1) to (2n − 1 − 1), where n is 64.
c) Decimal(d, f) means that the column shall be able to represent decimal values up to and including d digits,
of which f shall occur to the right of the decimal place; the values can be either represented exactly or
interpreted to be in this range.
d) Char(N) means that the column shall be able to hold any string of characters of a fixed length of N.
Comment: If the string that a column of datatype char(N) holds is shorter than N characters, then trailing
spaces shall be stored in the database or the database shall automatically pad with spaces upon retrieval such
that a CHAR_LENGTH() function will return N.
e) Varchar(N) means that the column shall be able to hold any string of characters of a variable length with a
maximum length of N. Columns defined as "varchar(N)" may optionally be implemented as "char(N)".
f) Date means that the column shall be able to express any calendar day between January 1, 1900 and
December 31, 2199.
2.2.2.2 The datatypes do not correspond to any specific SQL-standard datatype. The definitions are provided to
highlight the properties that are required for a particular column. The benchmark implementer may employ any internal representation or SQL datatype that meets those requirements.

This PR proposes that we use int for identifiers instead of bigint to reach a compromise with TPC-DS Standard Specification.

After this PR, the field schemas are now consistent with those DDLs in the tpcds.sql from tpc-ds tool kit, see https://gist.github.com/yaooqinn/b9978a77bbf4f871a95d6a9103019907

Why are the changes needed?

reach a compromise with TPC-DS Standard Specification

Does this PR introduce any user-facing change?

no test only

How was this patch tested?

test only

@github-actions github-actions bot added the SQL label Apr 2, 2021
@yaooqinn
Copy link
Member Author

yaooqinn commented Apr 2, 2021

cc @cloud-fan @maropu @HyukjinKwon please take a look, thanks

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41440/

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41440/

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Test build #136862 has finished for PR 32037 at commit 5ecd6e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Apr 2, 2021

It seems we need to update the TPCDS query results accordingly if we change the schemas. I'll check this later.

@yaooqinn
Copy link
Member Author

yaooqinn commented Apr 3, 2021

It seems we need to update the TPCDS query results accordingly if we change the schemas. I'll check this later.

Shall we comment out the action temporarily? I tried to re-generate the result golden file but as the data is static the job failed with unmatched parquet decoding. The static data is not comfy enough if the write-side involving data or schema change itself. Maybe we should generate the data within the test itself, but I don't know if the GA could be able to handle such a large workload.

@SparkQA
Copy link

SparkQA commented Apr 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41453/

@SparkQA
Copy link

SparkQA commented Apr 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41453/

@SparkQA
Copy link

SparkQA commented Apr 3, 2021

Test build #136876 has finished for PR 32037 at commit 936ecfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Apr 6, 2021

@yaooqinn Yea, I'm planning to do something for the issue in a few weeks. Please see the related discussion: #31886 (comment)

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41503/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41503/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Test build #136926 has finished for PR 32037 at commit 081fd00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Thank you @maropu for following up.

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41527/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41527/

@SparkQA
Copy link

SparkQA commented Apr 6, 2021

Test build #136951 has finished for PR 32037 at commit c4e4973.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

I am good with going ahead but I will leave it to @maropu since he's driving this.

@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41579/

@SparkQA
Copy link

SparkQA commented Apr 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41579/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41778/

@yaooqinn yaooqinn requested a review from maropu April 12, 2021 07:28
@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Test build #137199 has finished for PR 32037 at commit 2efff16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -485,11 +485,10 @@ jobs:
restore-keys: |
tpcds-
- name: Checkout TPC-DS (SF=1) generated data repository
if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to always fetch the data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It failed the first time, so I turned it off, shall I turn it on now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13:37:51.854 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 315.0 (TID 204)
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
	at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToInt(ParquetDictionary.java:38)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:298)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Do we need to disable the cache and merge this PR, then enable the cache again? Still seeing the above error if cache on

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the mechanism to refresh the cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's based on the catch key here https://github.com/apache/spark/pull/32037/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R484, which is based on the cache data itself(seems a mistake we made before?), and seems that it never changed if we have no opportunity to change the cache data entirely by disabling it. Or let me try to modify the key to rely on the lastest revision of maropu/tpcds-sf-1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rely on the lastest revision of maropu/tpcds-sf-1

SGTM

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may still hit the cache by fuzzy matching of restore-keys which seems useless for this case. if fails again, I will remove it too and try again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it looks better to check if it can work correctly without restore-keys.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine if the GA tests can pass.

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41794/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41795/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41795/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41801/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41801/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Test build #137214 has finished for PR 32037 at commit 545a42b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Test build #137215 has finished for PR 32037 at commit 98296c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41809/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41809/

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Test build #137221 has finished for PR 32037 at commit aa042bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 12, 2021

Test build #137229 has finished for PR 32037 at commit f129d95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41837/

@SparkQA
Copy link

SparkQA commented Apr 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41837/

@yaooqinn
Copy link
Member Author

All tests pass except a known issue of GA environment Cause: java.net.BindException: Port in use: fv-az201-716:0 which is not related to this PR.
I am going to merge this, thanks all for reviewing and help

merging to master

@yaooqinn yaooqinn closed this in 16e2faa Apr 13, 2021
@SparkQA
Copy link

SparkQA commented Apr 13, 2021

Test build #137258 has finished for PR 32037 at commit 3ddd450.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu added a commit that referenced this pull request May 3, 2021
…ricks/spark-sql-perf

### What changes were proposed in this pull request?

This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase.
Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.

The code authors of these classes are:
juliuszsompolski
npoggi
wangyum

### Why are the changes needed?

We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
 - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
 - https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

 I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema #32037. So, it would be useful for the Spark codebase to generate them by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?

dev only.

### How was this patch tested?

Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants