From ba7b0c4cb9789c6ffa5a3bc17400e92c29ca5b32 Mon Sep 17 00:00:00 2001
From: Haejoon Lee
Date: Wed, 26 Feb 2025 16:51:05 +0900
Subject: [PATCH 1/2] [SPARK-51314][DOCS][PS] Add proper note for distributed-sequence about indeterministic case

---
 .../docs/source/user_guide/pandas_on_spark/options.rst | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index e8fffea7e33be..adc3da408399b 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -208,6 +208,16 @@ This is conceptually equivalent to the PySpark example as below:
     >>> spark_df.rdd.zipWithIndex().map(lambda p: p[1]).collect()
     [0, 1, 2]
 
+.. warning::
+    Unlike `sequence`, since `distributed-sequence` is executed in a distributed environment,
+    the rows corresponding to each index can be different although the index itself is still
+    generated globally sequential.
+    This happens because the rows are distributed across multiple partitions and nodes,
+    leading to indeterministic row-to-index mappings when the data is loaded.
+    Therefore, it is recommended to explicitly set an index column by using `index_col` parameter
+    instead of relying on the default index when creating `DataFrame`
+    if the row-to-index mapping is critical for your application.
+
 **distributed**: It implements a monotonically increasing sequence simply by using PySpark's `monotonically_increasing_id` function in a fully distributed manner. The values are indeterministic.
 If the index does not have to be a sequence that increases

From 7acc626975e035d76373b248449e82267af6ccf5 Mon Sep 17 00:00:00 2001
From: Haejoon Lee
Date: Thu, 27 Feb 2025 10:26:45 +0900
Subject: [PATCH 2/2] Applied comments

---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index adc3da408399b..31f3cff266de2 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -210,8 +210,8 @@ This is conceptually equivalent to the PySpark example as below:
 
 .. warning::
     Unlike `sequence`, since `distributed-sequence` is executed in a distributed environment,
-    the rows corresponding to each index can be different although the index itself is still
-    generated globally sequential.
+    the rows corresponding to each index may vary although the index itself is still
+    remains globally sequential.
     This happens because the rows are distributed across multiple partitions and nodes,
     leading to indeterministic row-to-index mappings when the data is loaded.
     Therefore, it is recommended to explicitly set an index column by using `index_col` parameter
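
Note (not part of the patch): the behaviour described in the added warning can be sketched without Spark. The following is a pure-Python simulation under stated assumptions, not pandas-on-Spark code. The helpers `default_index` and `explicit_index`, the sample rows, and the column names `id` and `name` are all hypothetical and exist only for illustration. The sketch numbers rows in partition arrival order, roughly in the spirit of the `zipWithIndex` context line in the hunk, and compares that with an index taken from a column in the data.

```python
def default_index(partitions):
    """Number rows 0..n-1 in the order the partitions are visited.

    Assumption: this only imitates a globally sequential default index;
    it is not how Spark is implemented.
    """
    rows = [row for part in partitions for row in part]
    return {i: row["name"] for i, row in enumerate(rows)}


def explicit_index(partitions, index_col):
    """Use a column from the data itself as the index (hypothetical helper)."""
    return {row[index_col]: row["name"] for part in partitions for row in part}


# The same three rows, split over two partitions.
p0 = [{"id": 10, "name": "a"}, {"id": 11, "name": "b"}]
p1 = [{"id": 12, "name": "c"}]

# Two loads of identical data that happen to see the partitions in a different order.
first_load = [p0, p1]
second_load = [p1, p0]

default_first = default_index(first_load)
default_second = default_index(second_load)
explicit_first = explicit_index(first_load, "id")
explicit_second = explicit_index(second_load, "id")

# The index values are the same sequence in both loads...
print(sorted(default_first), sorted(default_second))
# ...but the row attached to each index value differs between loads.
print(default_first == default_second)
# An index derived from the data maps to the same rows regardless of load order.
print(explicit_first == explicit_second)
```

In pandas API on Spark itself, the corresponding step would be passing `index_col` when creating the DataFrame, for example `spark_df.pandas_api(index_col="id")`, where `id` is an assumed column name.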