[SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type #32037
Conversation
cc @cloud-fan @maropu @HyukjinKwon please take a look, thanks
Kubernetes integration test starting
Kubernetes integration test status success
Test build #136862 has finished for PR 32037 at commit
It seems we need to update the TPCDS query results accordingly if we change the schemas. I'll check this later.
Shall we comment out the action temporarily? I tried to re-generate the result golden file, but since the data is static, the job failed with a Parquet decoding mismatch. The static data isn't flexible enough when the write side changes its data or schema. Maybe we should generate the data within the test itself, but I don't know whether GA could handle such a large workload.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136876 has finished for PR 32037 at commit
@yaooqinn Yea, I'm planning to do something for the issue in a few weeks. Please see the related discussion: #31886 (comment)
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136926 has finished for PR 32037 at commit
Thank you @maropu for following up.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136951 has finished for PR 32037 at commit
I am good with going ahead, but I will leave it to @maropu since he's driving this.
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #137199 has finished for PR 32037 at commit
@@ -485,11 +485,10 @@ jobs:
          restore-keys: |
            tpcds-
      - name: Checkout TPC-DS (SF=1) generated data repository
        if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
We need to always fetch the data?
It failed the first time, so I turned it off. Shall I turn it on again now?
13:37:51.854 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 315.0 (TID 204)
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToInt(ParquetDictionary.java:38)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:298)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Do we need to disable the cache, merge this PR, and then enable the cache again? I still see the above error with the cache on.
What's the mechanism to refresh the cache?
I guess it's based on the cache key here https://github.com/apache/spark/pull/32037/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R484, which is derived from the cached data itself (seems like a mistake we made before?), so it never changes unless we get a chance to replace the cached data entirely by disabling it. Or let me try to modify the key to rely on the latest revision of maropu/tpcds-sf-1.
rely on the latest revision of maropu/tpcds-sf-1
SGTM
It may still hit the cache via fuzzy matching of restore-keys, which seems useless for this case. If it fails again, I will remove that too and try again.
Yea, it looks better to check whether it can work correctly without restore-keys.
Looks fine if the GA tests can pass.
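The direction the thread settles on could be sketched roughly as below (a hypothetical sketch, not the actual workflow change: step names are invented, and it assumes the data lives in maropu/tpcds-sf-1 and that its HEAD SHA can be resolved with `git ls-remote` in an earlier step):

```yaml
# Resolve the latest revision of the data repository first, then key the
# cache on that SHA instead of on the cached data itself, and drop
# restore-keys so a stale cache cannot be fuzzy-matched.
- name: Check latest revision of the TPC-DS data repository
  id: tpcds-sha
  run: echo "::set-output name=sha::$(git ls-remote https://github.com/maropu/tpcds-sf-1 HEAD | cut -f1)"
- name: Cache TPC-DS generated data
  id: cache-tpcds-sf-1
  uses: actions/cache@v2
  with:
    path: ./tpcds-sf-1
    key: tpcds-${{ steps.tpcds-sha.outputs.sha }}
- name: Checkout TPC-DS (SF=1) generated data repository
  if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
  uses: actions/checkout@v2
  with:
    repository: maropu/tpcds-sf-1
    path: ./tpcds-sf-1
```

With a key derived from the data repository's revision, regenerating the data invalidates the cache automatically, and the checkout step only runs on a cache miss.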
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137214 has finished for PR 32037 at commit
Test build #137215 has finished for PR 32037 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137221 has finished for PR 32037 at commit
Test build #137229 has finished for PR 32037 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
All tests pass except for a known issue in the GA environment; merging to master.
Test build #137258 has finished for PR 32037 at commit
…ricks/spark-sql-perf

### What changes were proposed in this pull request?
This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase. Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts. The code authors of these classes are: juliuszsompolski, npoggi, wangyum.

### Why are the changes needed?
We frequently use TPC-DS data now for benchmarks/tests, but the classes for the TPC-DS schemas of datagen and benchmarks/tests are managed separately, e.g.,
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
- https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema (#32037). So it would be useful for the Spark codebase to generate the data by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?
Dev only.

### How was this patch tested?
Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?
According to http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf, this PR proposes that we use int for identifiers instead of bigint to comply with the TPC-DS Standard Specification.

After this PR, the field schemas are consistent with the DDLs in the `tpcds.sql` from the TPC-DS toolkit, see https://gist.github.com/yaooqinn/b9978a77bbf4f871a95d6a9103019907

### Why are the changes needed?
To comply with the TPC-DS Standard Specification.

### Does this PR introduce any user-facing change?
No, test only.

### How was this patch tested?
Test only.