Improve Copy Operation by taking the Source URI #482

Merged
merged 21 commits into from
Jan 13, 2025

Contributor

@waahm7 waahm7 commented Jan 9, 2025

Issue #, if available:
awslabs/s3-connector-for-pytorch#295

Description of changes:
The current copy implementation has several limitations because we try to parse the source host and path from the x-amz-copy-source header and the destination URI. This breaks in many cases, such as when the bucket name contains a dot (.) or during cross-region copies, where we can't generate the source URL correctly because information is missing. Allowing the user to pass the source URI can bypass these limitations.

  • The code coverage action was broken; it is fixed by updating gcc to match the newer Ubuntu version.
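
To illustrate why deriving the source endpoint from the destination can break for dotted bucket names, here is a minimal sketch. The helper `guess_source_host` and its "swap everything before the first dot" heuristic are hypothetical and chosen only for illustration; this is not the aws-c-s3 implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of aws-c-s3): guess the source host by replacing
 * everything before the first '.' of the destination host with the source bucket.
 * Returns 0 on success, -1 if the host has no '.' or the output buffer is too small. */
static int guess_source_host(const char *dest_host, const char *src_bucket, char *out, size_t out_len) {
    assert(dest_host != NULL && src_bucket != NULL && out != NULL);

    const char *first_dot = strchr(dest_host, '.');
    if (first_dot == NULL) {
        return -1;
    }

    /* Keep the destination's suffix (starting at the first dot) and prepend the source bucket. */
    int written = snprintf(out, out_len, "%s%s", src_bucket, first_dot);
    if (written < 0 || (size_t)written >= out_len) {
        return -1;
    }
    return 0;
}
```

With a destination host of `dest-bucket.s3.us-west-2.amazonaws.com`, this heuristic yields a plausible source host. With `dest.bucket.s3.us-west-2.amazonaws.com`, the first dot falls inside the bucket name, so the result is `src-bucket.bucket.s3.us-west-2.amazonaws.com`, which is not a valid endpoint for the source bucket. Taking the source URI from the caller avoids this kind of guessing.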

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@waahm7 waahm7 changed the title from "WIP | Improve Copy Operation by taking the URI" to "Improve Copy Operation by taking the URI" Jan 10, 2025
@waahm7 waahm7 marked this pull request as ready for review January 10, 2025 19:48
@waahm7 waahm7 changed the title from "Improve Copy Operation by taking the URI" to "Improve Copy Operation by taking the Source URI" Jan 10, 2025

codecov-commenter commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 52.63158% with 9 lines in your changes missing coverage. Please review.

Project coverage is 89.61%. Comparing base (45894ed) to head (5121ae1).
Report is 10 commits behind head on main.

Files with missing lines        Patch %   Lines
source/s3_request_messages.c    52.63%    9 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   89.64%   89.61%   -0.04%     
==========================================
  Files          20       20              
  Lines        6144     6287     +143     
==========================================
+ Hits         5508     5634     +126     
- Misses        636      653      +17     
Files with missing lines        Coverage Δ
source/s3_request_messages.c    74.36% <52.63%> (-0.42%) ⬇️

... and 10 files with indirect coverage changes

AWS_LS_S3_META_REQUEST,
"Unable to parse the copy_source_uri provided in the request: " PRInSTR "",
AWS_BYTE_CURSOR_PRI(options->copy_source_uri));
goto on_error;
Contributor

The copy_object->synced_data.part_list was initialized above, so it needs to be cleaned up here.

Alternatively, move the initialization of the list to after everything that can fail.

We could add a simple test to pass in an invalid URI to add a bit more test coverage.
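
The second option above, doing the fallible work before initializing anything that needs cleanup, can be sketched as follows. This is a minimal illustration with hypothetical names (`copy_request`, `parse_source_uri`, `copy_request_new`); it is not the aws-c-s3 code, and the URI check is a deliberately trivial stand-in for real parsing.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a meta request that owns a dynamically allocated part list. */
struct copy_request {
    int *part_list;
    size_t part_capacity;
};

/* Hypothetical validator: accepts only URIs that start with "https://". */
static bool parse_source_uri(const char *uri) {
    return uri != NULL && strncmp(uri, "https://", 8) == 0;
}

/* Validate first, allocate second: if parsing fails, nothing has been initialized yet,
 * so the early return has nothing to clean up. */
static struct copy_request *copy_request_new(const char *source_uri) {
    if (!parse_source_uri(source_uri)) {
        return NULL;
    }

    struct copy_request *request = calloc(1, sizeof(*request));
    if (request == NULL) {
        return NULL;
    }

    request->part_capacity = 8;
    request->part_list = calloc(request->part_capacity, sizeof(int));
    if (request->part_list == NULL) {
        free(request);
        return NULL;
    }
    return request;
}

static void copy_request_destroy(struct copy_request *request) {
    if (request != NULL) {
        free(request->part_list);
        free(request);
    }
}
```

Passing an invalid URI such as `"not a uri"` returns NULL before any list is allocated, which is also the shape of the simple negative test suggested above.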

Contributor Author

Thanks, good catch. Fixed with a test.

struct aws_http_message *message = NULL;

message = aws_http_message_new_request(allocator);
Contributor

A lot of the error handling further down assumes there is nothing to clean up and just returns NULL.

Those paths need to be updated, since we now allocate the message at the beginning of the function.
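
One common way to address this is to route every failure after the allocation through a single cleanup label. The sketch below uses hypothetical stand-ins (`struct message`, `message_new`, `message_release`, `copy_message_new`) rather than the aws-c-http types, and a counter that exists only to make leaks observable.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an HTTP message; not the aws-c-http type. */
struct message {
    char *path;
};

/* Counts messages that have been created but not yet released (illustration only). */
static int g_live_messages = 0;

static struct message *message_new(void) {
    struct message *m = calloc(1, sizeof(*m));
    if (m != NULL) {
        ++g_live_messages;
    }
    return m;
}

static void message_release(struct message *m) {
    if (m == NULL) {
        return;
    }
    free(m->path);
    free(m);
    --g_live_messages;
}

/* The message is allocated first, so every later failure must go through error_cleanup
 * instead of returning NULL directly. */
static struct message *copy_message_new(const char *source_path) {
    struct message *m = message_new();
    if (m == NULL) {
        return NULL; /* nothing has been allocated yet */
    }

    if (source_path == NULL || source_path[0] != '/') {
        goto error_cleanup;
    }

    m->path = malloc(strlen(source_path) + 1);
    if (m->path == NULL) {
        goto error_cleanup;
    }
    strcpy(m->path, source_path);
    return m;

error_cleanup:
    message_release(m);
    return NULL;
}
```

A bare `return NULL` on the invalid-path branch would leave `g_live_messages` at 1; with the label, it returns to 0.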

Contributor Author

Thanks, updated.


message = aws_http_message_new_request(allocator);
if (message == NULL) {
goto error_cleanup;
Contributor

head_object_host_header has not been defined yet, but we will try to clean it up from this goto.
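
A minimal sketch of the usual remedy: declare and zero the buffer before the first `goto`, so the cleanup label always sees a well-defined value. The `byte_buf` type and `build_host_header` function here are hypothetical stand-ins, not the aws-c-common API, and `fail_early` exists only to simulate an early failure.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable byte buffer; not the aws-c-common type. */
struct byte_buf {
    char *buffer;
    size_t len;
};

static void byte_buf_clean_up(struct byte_buf *buf) {
    free(buf->buffer); /* free(NULL) is a no-op, so a zeroed buffer is safe to clean up */
    memset(buf, 0, sizeof(*buf));
}

/* Copies host into out via a temporary buffer. Returns 0 on success, -1 on failure. */
static int build_host_header(const char *host, int fail_early, char *out, size_t out_len) {
    /* Declare and zero the buffer before the first goto, so error_cleanup never
     * reads an uninitialized pointer. */
    struct byte_buf host_header;
    memset(&host_header, 0, sizeof(host_header));
    int result = -1;

    if (fail_early) {
        goto error_cleanup; /* host_header is zeroed, so cleanup is harmless */
    }

    host_header.len = strlen(host);
    host_header.buffer = malloc(host_header.len + 1);
    if (host_header.buffer == NULL) {
        goto error_cleanup;
    }
    memcpy(host_header.buffer, host, host_header.len + 1);

    if (host_header.len + 1 > out_len) {
        goto error_cleanup;
    }
    memcpy(out, host_header.buffer, host_header.len + 1);
    result = 0;

error_cleanup:
    byte_buf_clean_up(&host_header);
    return result;
}
```

If the zeroing were placed after the first `goto`, the early-failure path would call `free` on an indeterminate pointer, which is undefined behavior.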

Contributor Author

Thanks, fixed.

struct aws_byte_buf head_object_host_header;
AWS_ZERO_STRUCT(head_object_host_header);

if (message == NULL) {
Contributor

Do we need to worry about the signing region here as well? Or are we keeping the behavior that the source must be in the same region as the destination?

Contributor Author

Good catch. For now, I have updated the docs to mention that only limitations 1 and 2 will be bypassed using URI.

@waahm7 waahm7 merged commit 1c80418 into main Jan 13, 2025
35 checks passed
@waahm7 waahm7 deleted the fix-copy-dot-bucket branch January 13, 2025 23:12
4 participants