Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update delta protocol to specify remove file path must byte match add file path #4179

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

lzlfred
Copy link
Contributor

@lzlfred lzlfred commented Feb 20, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Update delta protocol to specify remove file path must string match add file path. The current spec doesn't actually say how to compare the paths from two string actions. An implementation could choose any of the following, with very different outcomes for ambiguous cases:

  • Do a raw string comparison. Not reliable because different URI encoding libraries could choose to encode a different subset of "illegal" characters. e.g. abc vs %65bc.
  • URI-decode the strings before comparing them. Moderately expensive, but still not reliable, e.g. /path/to/table and /path//to//table refer to the same URI.
  • Construct a full URI from the string is the most expensive, and still not reliable, e.g. /path/to/table and /path/to/other/../table are not considered the same URI but some systems still treat them as the same path

In practice, all Delta clients we know of just copy whatever was in add.path as remove.path, so they should anyway compare byte-equal. Many Delta clients already perform raw string comparisons.

Since none of the approaches is reliable, and raw string comparisons are already used in practice, hereby we just update the spec to make that the required behavior.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description needs to explain why the spec change is necessary. From what I understand:

  1. The current spec doesn't actually say how to compare the paths from two string actions. An implementation could choose any of the following, with very different outcomes for ambiguous cases:
    • Do a raw string comparison. Not reliable because different URI encoding libraries could choose to encode a different subset of "illegal" characters. e.g. abc vs %65bc.
    • URI-decode the strings before comparing them. Moderately expensive, but still not reliable, e.g. /path/to/table and /path//to//table refer to the same URI.
    • Construct a full URI from the string. Most expensive, and still not reliable, e.g. /path/to/table and /path/to/other/../table are not considered the same URI but some systems still treat them as the same path
  2. In practice, all Delta clients we know of just copy whatever was in add.path as remove.path, so they should anyway compare byte-equal.
  3. Many Delta clients already perform raw string comparisons.

Since none of the approaches is reliable, and raw string comparisons are already used in practice, we should just update the spec to make that the required behavior.

PROTOCOL.md Outdated
@@ -553,7 +553,7 @@ The schema of the `remove` action is as follows:

Field Name | Data Type | Description | optional/required
-|-|-|-
path| String | A relative path to a file from the root of the table or an absolute path to a file that should be removed from the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the data file path. | required
path| String | A relative path to a file from the root of the table or an absolute path to a file that should be removed from the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the data file path. The path must string match the corresponding `add` action. | required
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path must be a byte-identical copy of the corresponding add.path.

PROTOCOL.md Outdated
@@ -842,7 +842,7 @@ To achieve the requirements above, related actions from different delta files ne
- The latest `metaData` action seen wins
- For `txn` actions, the latest `version` seen for a given `appId` wins
- For `domainMetadata`, the latest `domainMetadata` seen for a given `domain` wins. The actions with `removed=true` act as tombstones to suppress earlier versions. Snapshot reads do _not_ return removed `domainMetadata` actions.
- Logical files in a table are identified by their `(path, deletionVector.uniqueId)` primary key. File actions (`add` or `remove`) reference logical files, and a log can contain any number of references to a single file.
- Logical files in a table are identified by their `(path, deletionVector.uniqueId)` primary key. File actions (`add` or `remove`) reference logical files, and a log can contain any number of references to a single file. `add` and `remove` are considered the same only if the paths are string equal.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two file actions refer to the same file only if their path strings are byte-identical (before URI decoding).

We need that extra requirement because there are ~infinite ways to URI-encode the same string, e.g.
abc could technically URI-encode as %65bc even tho it's completely unnecessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file there is a top-level description of how adds and removes work, what is legal/illegal, etc. We should probably update it as well. Somewhere near this paragraph:

The remove action includes a timestamp that indicates when the removal occurred. Physical deletion of physical files can happen lazily after some user-specified expiration time threshold. This delay allows concurrent readers to continue to execute against a stale snapshot of the data. A remove action should remain in the state of the table as a tombstone until it has expired. A tombstone expires when current time (according to the node performing the cleanup) exceeds the expiration threshold added to the remove action timestamp.

We should probably add a sentence explaining:

Because it is possible to represent the same physical file name with many different URI layouts and URI encodings, all add and remove actions that refer to the same physical file must use the same (byte-identical) path string. For example, [action reconciliation] will treat a relative path as distinct from an absolute path, and a as distinct from %65, because they are not byte-identical.

@felipepessoto
Copy link
Contributor

felipepessoto commented Feb 20, 2025

The description needs to explain why the spec change is necessary. From what I understand:

  1. The current spec doesn't actually say how to compare the paths from two string actions. An implementation could choose any of the following, with very different outcomes for ambiguous cases:

    • Do a raw string comparison. Not reliable because different URI encoding libraries could choose to encode a different subset of "illegal" characters. e.g. abc vs %65bc.
    • URI-decode the strings before comparing them. Moderately expensive, but still not reliable, e.g. /path/to/table and /path//to//table refer to the same URI.
    • Construct a full URI from the string. Most expensive, and still not reliable, e.g. /path/to/table and /path/to/other/../table are not considered the same URI but some systems still treat them as the same path
  2. In practice, all Delta clients we know of just copy whatever was in add.path as remove.path, so they should anyway compare byte-equal.

  3. Many Delta clients already perform raw string comparisons.

Since none of the approaches is reliable, and raw string comparisons are already used in practice, we should just update the spec to make that the required behavior.

@lzlfred, @scovich can paths be considered equal if only query string is different?

@lzlfred lzlfred changed the title Update delta protocol to specify remove file path must string match add file path Update delta protocol to specify remove file path must byte match add file path Feb 20, 2025
@lzlfred
Copy link
Contributor Author

lzlfred commented Feb 20, 2025

@felipepessoto may i ask to clarify "query string" ?

@lzlfred lzlfred requested a review from scovich February 20, 2025 19:49
@@ -486,7 +486,11 @@ The `remove` action includes a timestamp that indicates when the removal occurre
Physical deletion of physical files can happen lazily after some user-specified expiration time threshold.
This delay allows concurrent readers to continue to execute against a stale snapshot of the data.
A `remove` action should remain in the state of the table as a _tombstone_ until it has expired.
A tombstone expires when *current time* (according to the node performing the cleanup) exceeds the expiration threshold added to the `remove` action timestamp.
A tombstone expires when *current time* (according to the node performing the cleanup) exceeds the expiration
threshold added to the `remove` action timestamp. Because it is possible to represent the same physical file name with
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this addition should be a new paragraph?

@scovich
Copy link
Collaborator

scovich commented Feb 20, 2025

can paths be considered equal if only query string is different?

Normally two URI would not compare equal if they have different query strings, e.g. java.net.URI docs state:

For two hierarchical URIs to be considered equal, their paths must be equal and their queries must either both be undefined or else be equal.

What use case do you have in mind, that would require comparisons to ignore the query fragment?

@felipepessoto
Copy link
Contributor

@felipepessoto may i ask to clarify "query string" ?

for example:

abfss://..../file.parquet?copy=1
abfss://..../file.parquet?copy=2

@felipepessoto
Copy link
Contributor

can paths be considered equal if only query string is different?

Normally two URI would not compare equal if they have different query strings, e.g. java.net.URI docs state:

For two hierarchical URIs to be considered equal, their paths must be equal and their queries must either both be undefined or else be equal.

What use case do you have in mind, that would require comparisons to ignore the query fragment?

Not sure if there is a valid reason to do it, I was reviewing the list of examples you gave, like e.g. /path/to/table and /path/to/other/../table

And wonder if having the same file referenced twice in the Delta Log, but with different query strings are valid.

@scovich
Copy link
Collaborator

scovich commented Feb 21, 2025

can paths be considered equal if only query string is different?

Normally two URI would not compare equal if they have different query strings, e.g. java.net.URI docs state:

For two hierarchical URIs to be considered equal, their paths must be equal and their queries must either both be undefined or else be equal.

What use case do you have in mind, that would require comparisons to ignore the query fragment?

Not sure if there is a valid reason to do it, I was reviewing the list of examples you gave, like e.g. /path/to/table and /path/to/other/../table

And wonder if having the same file referenced twice in the Delta Log, but with different query strings are valid.

Gotcha. I'm not aware of any valid use case for that, and a big part of this change is to accept the fact that URI are too generic and too domain-specific to really be able to say whether two of them are "equivalent" or not. So we bake in the simplest (and already prevailing) way of comparing them to eliminate that whole source of confusion.

@felipepessoto
Copy link
Contributor

We may need to change Kernel to avoid the relative path -> absolute path conversion here:

String tableRoot = scanFileInfo.getString(TABLE_ROOT_ORDINAL);
String absolutePath =
new Path(new Path(URI.create(tableRoot)), new Path(URI.create(path))).toString();

and

// NOTE: table root is temporary, until the path in `add.path` is converted to
// an absolute path. https://github.com/delta-io/delta/issues/2089
.add(TABLE_ROOT_STRUCT_FIELD);

@scovich
Copy link
Collaborator

scovich commented Feb 28, 2025

We may need to change Kernel to avoid the relative path -> absolute path conversion here

Not sure I follow? As far as I know those code sites are converting to absolute paths for consumption by the engine, e.g. for benefit of the object store (which always requires an absolute path). I don't think it influences the deduplication logic of log replay?

@felipepessoto
Copy link
Contributor

I was planning to use this API to make modifications to Delta Log, and adding a remove action from the data returned by it cause problems because path doesn’t match.

Do we have a different API in Kernel that could help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants