[SPARK-51314][DOCS][PS] Add proper note for distributed-sequence about indeterministic case #50086

Open · wants to merge 2 commits into base: master
10 changes: 10 additions & 0 deletions python/docs/source/user_guide/pandas_on_spark/options.rst
@@ -208,6 +208,16 @@ This is conceptually equivalent to the PySpark example as below:
>>> spark_df.rdd.zipWithIndex().map(lambda p: p[1]).collect()
[0, 1, 2]

.. warning::
    Unlike `sequence`, since `distributed-sequence` is executed in a distributed environment,
    the rows corresponding to each index can differ between runs, even though the index itself
    is still generated in a globally sequential order.
    This happens because the rows are distributed across multiple partitions and nodes,
    which makes the row-to-index mapping indeterministic when the data is loaded.
    Therefore, if the row-to-index mapping is critical for your application, it is recommended
    to explicitly set an index column by using the `index_col` parameter when creating a
    `DataFrame`, instead of relying on the default index.
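To make the warning concrete, here is a minimal pure-Python sketch (no Spark required; the partition contents are hypothetical) of why the mapping can change: the generated index is always ``0, 1, 2, ...``, but the order in which partitions happen to be enumerated determines which row receives which index.

```python
from itertools import chain

# Two hypothetical partitions of the same dataset. In a distributed
# setting, the order in which partitions are enumerated is not guaranteed.
partition_a = ["x", "y"]
partition_b = ["z"]

def assign_index(partitions):
    # Mimics distributed-sequence: the index is globally sequential
    # (0, 1, 2, ...) regardless of how the partitions are ordered.
    rows = list(chain.from_iterable(partitions))
    return list(enumerate(rows))

run1 = assign_index([partition_a, partition_b])  # [(0, 'x'), (1, 'y'), (2, 'z')]
run2 = assign_index([partition_b, partition_a])  # [(0, 'z'), (1, 'x'), (2, 'y')]

# Both runs yield the same sequential index values...
assert [i for i, _ in run1] == [i for i, _ in run2] == [0, 1, 2]
# ...but index 0 points at a different row in each run.
assert dict(run1)[0] != dict(run2)[0]
```

Setting an explicit index column (e.g. ``index_col``) sidesteps this entirely, because the index is then derived from the data itself rather than from partition enumeration order.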

**distributed**: It implements a monotonically increasing sequence simply by using
PySpark's `monotonically_increasing_id` function in a fully distributed manner. The
values are indeterministic. If the index does not have to be a sequence that increases