-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update delta protocol to specify remove file path must byte match add file path #4179
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description needs to explain why the spec change is necessary. From what I understand:
- The current spec doesn't actually say how to compare the paths from two string actions. An implementation could choose any of the following, with very different outcomes for ambiguous cases:
- Do a raw string comparison. Not reliable because different URI encoding libraries could choose to encode a different subset of "illegal" characters. e.g.
abc
vs%65bc
. - URI-decode the strings before comparing them. Moderately expensive, but still not reliable, e.g.
/path/to/table
and/path//to//table
refer to the same URI. - Construct a full URI from the string. Most expensive, and still not reliable, e.g.
/path/to/table
and/path/to/other/../table
are not considered the same URI but some systems still treat them as the same path
- Do a raw string comparison. Not reliable because different URI encoding libraries could choose to encode a different subset of "illegal" characters. e.g.
- In practice, all Delta clients we know of just copy whatever was in
add.path
asremove.path
, so they should anyway compare byte-equal. - Many Delta clients already perform raw string comparisons.
Since none of the approaches is reliable, and raw string comparisons are already used in practice, we should just update the spec to make that the required behavior.
PROTOCOL.md
Outdated
@@ -553,7 +553,7 @@ The schema of the `remove` action is as follows: | |||
|
|||
Field Name | Data Type | Description | optional/required | |||
-|-|-|- | |||
path| String | A relative path to a file from the root of the table or an absolute path to a file that should be removed from the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the data file path. | required | |||
path| String | A relative path to a file from the root of the table or an absolute path to a file that should be removed from the table. The path is a URI as specified by [RFC 2396 URI Generic Syntax](https://www.ietf.org/rfc/rfc2396.txt), which needs to be decoded to get the data file path. The path must string match the corresponding `add` action. | required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path must be a byte-identical copy of the corresponding add.path
.
PROTOCOL.md
Outdated
@@ -842,7 +842,7 @@ To achieve the requirements above, related actions from different delta files ne | |||
- The latest `metaData` action seen wins | |||
- For `txn` actions, the latest `version` seen for a given `appId` wins | |||
- For `domainMetadata`, the latest `domainMetadata` seen for a given `domain` wins. The actions with `removed=true` act as tombstones to suppress earlier versions. Snapshot reads do _not_ return removed `domainMetadata` actions. | |||
- Logical files in a table are identified by their `(path, deletionVector.uniqueId)` primary key. File actions (`add` or `remove`) reference logical files, and a log can contain any number of references to a single file. | |||
- Logical files in a table are identified by their `(path, deletionVector.uniqueId)` primary key. File actions (`add` or `remove`) reference logical files, and a log can contain any number of references to a single file. `add` and `remove` are considered the same only if the paths are string equal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two file actions refer to the same file only if their path strings are byte-identical (before URI decoding).
We need that extra requirement because there are ~infinite ways to URI-encode the same string, e.g.
abc
could technically URI-encode as %65bc
even tho it's completely unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file there is a top-level description of how adds and removes work, what is legal/illegal, etc. We should probably update it as well. Somewhere near this paragraph:
The
remove
action includes a timestamp that indicates when the removal occurred. Physical deletion of physical files can happen lazily after some user-specified expiration time threshold. This delay allows concurrent readers to continue to execute against a stale snapshot of the data. Aremove
action should remain in the state of the table as a tombstone until it has expired. A tombstone expires when current time (according to the node performing the cleanup) exceeds the expiration threshold added to theremove
action timestamp.
We should probably add a sentence explaining:
Because it is possible to represent the same physical file name with many different URI layouts and URI encodings, all
add
andremove
actions that refer to the same physical file must use the same (byte-identical)path
string. For example, [action reconciliation] will treat a relative path as distinct from an absolute path, anda
as distinct from%65
, because they are not byte-identical.
@lzlfred, @scovich can paths be considered equal if only query string is different? |
@felipepessoto may i ask to clarify "query string" ? |
@@ -486,7 +486,11 @@ The `remove` action includes a timestamp that indicates when the removal occurre | |||
Physical deletion of physical files can happen lazily after some user-specified expiration time threshold. | |||
This delay allows concurrent readers to continue to execute against a stale snapshot of the data. | |||
A `remove` action should remain in the state of the table as a _tombstone_ until it has expired. | |||
A tombstone expires when *current time* (according to the node performing the cleanup) exceeds the expiration threshold added to the `remove` action timestamp. | |||
A tombstone expires when *current time* (according to the node performing the cleanup) exceeds the expiration | |||
threshold added to the `remove` action timestamp. Because it is possible to represent the same physical file name with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this addition should be a new paragraph?
Normally two URI would not compare equal if they have different query strings, e.g. java.net.URI docs state:
What use case do you have in mind, that would require comparisons to ignore the query fragment? |
for example: abfss://..../file.parquet?copy=1 |
Not sure if there is a valid reason to do it, I was reviewing the list of examples you gave, like And wonder if having the same file referenced twice in the Delta Log, but with different query strings are valid. |
Gotcha. I'm not aware of any valid use case for that, and a big part of this change is to accept the fact that URI are too generic and too domain-specific to really be able to say whether two of them are "equivalent" or not. So we bake in the simplest (and already prevailing) way of comparing them to eliminate that whole source of confusion. |
We may need to change Kernel to avoid the relative path -> absolute path conversion here: delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/InternalScanFileUtils.java Lines 110 to 112 in cad2672
and delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/InternalScanFileUtils.java Lines 63 to 65 in cad2672
|
Not sure I follow? As far as I know those code sites are converting to absolute paths for consumption by the engine, e.g. for benefit of the object store (which always requires an absolute path). I don't think it influences the deduplication logic of log replay? |
I was planning to use this API to make modifications to Delta Log, and adding a remove action from the data returned by it cause problems because path doesn’t match. Do we have a different API in Kernel that could help? |
Which Delta project/connector is this regarding?
Description
Update delta protocol to specify remove file path must string match add file path. The current spec doesn't actually say how to compare the paths from two string actions. An implementation could choose any of the following, with very different outcomes for ambiguous cases:
In practice, all Delta clients we know of just copy whatever was in add.path as remove.path, so they should anyway compare byte-equal. Many Delta clients already perform raw string comparisons.
Since none of the approaches is reliable, and raw string comparisons are already used in practice, hereby we just update the spec to make that the required behavior.