From ba7b0c4cb9789c6ffa5a3bc17400e92c29ca5b32 Mon Sep 17 00:00:00 2001
From: Haejoon Lee <haejoon.lee@databricks.com>
Date: Wed, 26 Feb 2025 16:51:05 +0900
Subject: [PATCH 1/2] [SPARK-51314][DOCS][PS] Add proper note for
 distributed-sequence about indeterministic case

---
 .../docs/source/user_guide/pandas_on_spark/options.rst | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index e8fffea7e33be..adc3da408399b 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -208,6 +208,16 @@ This is conceptually equivalent to the PySpark example as below:
     >>> spark_df.rdd.zipWithIndex().map(lambda p: p[1]).collect()
     [0, 1, 2]
 
+.. warning::
+    Unlike `sequence`, since `distributed-sequence` is executed in a distributed environment,
+    the rows corresponding to each index can be different although the index itself is still
+    generated globally sequential.
+    This happens because the rows are distributed across multiple partitions and nodes,
+    leading to indeterministic row-to-index mappings when the data is loaded.
+    Therefore, it is recommended to explicitly set an index column by using `index_col` parameter
+    instead of relying on the default index when creating a `DataFrame`
+    if the row-to-index mapping is critical for your application.
+
 **distributed**: It implements a monotonically increasing sequence simply by using
 PySpark's `monotonically_increasing_id` function in a fully distributed manner. The
 values are indeterministic. If the index does not have to be a sequence that increases

From 7acc626975e035d76373b248449e82267af6ccf5 Mon Sep 17 00:00:00 2001
From: Haejoon Lee <haejoon.lee@databricks.com>
Date: Thu, 27 Feb 2025 10:26:45 +0900
Subject: [PATCH 2/2] Applied comments

---
 python/docs/source/user_guide/pandas_on_spark/options.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst
index adc3da408399b..31f3cff266de2 100644
--- a/python/docs/source/user_guide/pandas_on_spark/options.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -210,8 +210,8 @@ This is conceptually equivalent to the PySpark example as below:
 
 .. warning::
     Unlike `sequence`, since `distributed-sequence` is executed in a distributed environment,
-    the rows corresponding to each index can be different although the index itself is still
-    generated globally sequential.
+    the rows corresponding to each index may vary, although the index itself
+    remains globally sequential.
     This happens because the rows are distributed across multiple partitions and nodes,
     leading to indeterministic row-to-index mappings when the data is loaded.
     Therefore, it is recommended to explicitly set an index column by using `index_col` parameter
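
The warning above can be illustrated with a minimal sketch in plain Python. It does not call Spark; it only mimics the idea of numbering rows in partition order, in the spirit of the `zipWithIndex` example quoted in the hunk context. The helper names (`zip_with_index`, `index_by_column`) and the sample rows are hypothetical and exist only for this illustration.

```python
from itertools import chain


def zip_with_index(partitions):
    """Number rows 0..n-1 in partition order (a stand-in for zipWithIndex)."""
    rows = chain.from_iterable(partitions)
    return {index: row for index, row in enumerate(rows)}


def index_by_column(partitions, column):
    """Use an explicit key column as the index instead of the row position."""
    rows = chain.from_iterable(partitions)
    return {row[column]: row for row in rows}


alice = {"id": 10, "name": "alice"}
bob = {"id": 20, "name": "bob"}
carol = {"id": 30, "name": "carol"}

# The same three rows, laid out across partitions in two different ways.
layout_a = [[alice], [bob, carol]]
layout_b = [[bob, carol], [alice]]

mapping_a = zip_with_index(layout_a)
mapping_b = zip_with_index(layout_b)

# The index values are the same globally sequential range in both layouts...
print(list(mapping_a) == list(mapping_b) == [0, 1, 2])
# ...but the row attached to a given index differs between layouts.
print(mapping_a[0]["name"], mapping_b[0]["name"])

# With an explicit key column, the row-to-index mapping no longer depends on layout.
keyed_a = index_by_column(layout_a, "id")
keyed_b = index_by_column(layout_b, "id")
print(keyed_a == keyed_b)
```

In this sketch the positional index is always `0, 1, 2`, yet index `0` points at a different row in each layout, while the mapping keyed on `id` is identical for both. That is the situation in which the note recommends setting `index_col` rather than relying on the default index.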