Retryable 400 errors are not retried due to ignoring XML error response #464

Closed
grrtrr opened this issue Dec 1, 2024 · 2 comments
Labels
bug Something isn't working needs-triage This issue or PR still needs to be triaged.

grrtrr commented Dec 1, 2024

Describe the bug

The C++ SDK decides whether to retry a request based on both the Exception name (taken from the XML response document) and the HTTP response code.
The aws-c-s3 client retries based only on the response code.

We are encountering fatal errors due to retry not applying in cases like the following:

[ERROR] 2024-11-20 18:29:29.374 S3MetaRequest [139830786781184] id=0x7eae94c4e500 Meta request failed from error 2058 (The connection has closed or is closing.). (request=0x7f2cdf6a6600, response status=400). Try to setup a retry.
terminate called after throwing an instance of 'av::CheckException'
  what():  Check failure at perception/dataset/tensor_group_io.cc:145:
  Expected: 'x is ok', with x := 'blobstore::write_blob(outfile, all_data)' [av::status::Status]
  x = PutObject() failed
  where: cloud/aws/s3/s3_streambuf.cc:100
  extra: s3://aurora-cloud-swe-prod-batch-artifacts/opt/2c9f8615/logs/77b0a8ba-1732-4444-bf99-a537dc6e7ddd/543317464f8ecea758af168e7cb50d99.rats: HTTP response code: 400
Resolved remote host IP address:
Request ID: 23TK08F9MWREV3JE
Exception name: RequestTimeout
Error message: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
7 response headers:
connection : close
content-type : application/xml 
date : Wed, 20 Nov 2024 18:29:27 GMT
server : AmazonS3
transfer-encoding : chunked
x-amz-id-2 : QY98w0bZSLwaRtEN4fxse39l6wcXtgEaG6/U7nTcjIMDIFkQA7Gzj8OI8B21wjd+jXgiKGtQbhU=
x-amz-request-id : 23TK08F9MWREV3JE

The above situation is a common case (S3 closing connections and sending 400 RequestTimeout errors), see e.g. here:

But similar retry support is lacking in aws-c-s3.

Expected Behavior

400 errors are retried based on evaluating the Exception name.
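
To make the expected behavior concrete, here is a minimal sketch of the retry decision in C. The function `should_retry` and the table `retryable_codes` are hypothetical names for illustration, not part of the aws-c-s3 API. Only `RequestTimeout` comes from this report; treating every 5xx status as retryable is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of S3 error codes worth retrying on a 400 response.
 * Only RequestTimeout is taken from this issue; extend as needed. */
const char *const retryable_codes[] = {
    "RequestTimeout",
};

/* Returns true if a failed response should be retried.
 * error_code is the <Code> value from the XML error body, or NULL if
 * the body carried none. */
bool should_retry(int http_status, const char *error_code) {
    /* Assumption: 5xx responses are retried on status alone. */
    if (http_status >= 500 && http_status <= 599) {
        return true;
    }
    /* A 400 is only retried when the error code is known to be transient. */
    if (http_status == 400 && error_code != NULL) {
        size_t count = sizeof(retryable_codes) / sizeof(retryable_codes[0]);
        for (size_t i = 0; i < count; ++i) {
            if (strcmp(error_code, retryable_codes[i]) == 0) {
                return true;
            }
        }
    }
    return false;
}
```

With this shape, the failure in the log above (status 400, Exception name RequestTimeout) would be classified as retryable, while a 400 with any other code would stay fatal.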

Current Behavior

Any 400 error is automatically treated as a fatal error, because it is translated into AWS_ERROR_S3_INVALID_RESPONSE_STATUS.
There is support for XML in source/s3_util.c, but it is not currently used to parse the response bodies of failed requests.

Reproduction Steps

Have the S3 backend return 400 errors with retryable Exception names. No retries happen.

Possible Solution

Add XML parsing of error response bodies and update the retry logic to retry RequestTimeout, as is currently done by aws-cli, the Golang v1 SDK, and the AWS C++ SDK for S3 (not S3-CRT).
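
As a rough illustration of the parsing step, the sketch below pulls the `<Code>` element out of an S3 error body using a plain substring scan. `extract_error_code` is a hypothetical helper, not an aws-c-s3 function, and a real implementation would presumably use a proper XML parser (such as the existing support in source/s3_util.c) rather than `strstr`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies the text between <Code> and </Code> into out, NUL-terminated.
 * Returns false if the element is missing or does not fit in out.
 * This is a substring scan for illustration, not a real XML parser. */
bool extract_error_code(const char *body, char *out, size_t out_len) {
    assert(body != NULL && out != NULL && out_len > 0);

    static const char open_tag[] = "<Code>";
    static const char close_tag[] = "</Code>";

    const char *start = strstr(body, open_tag);
    if (start == NULL) {
        return false;
    }
    start += sizeof(open_tag) - 1; /* skip past the opening tag */

    const char *end = strstr(start, close_tag);
    if (end == NULL) {
        return false;
    }

    size_t len = (size_t)(end - start);
    if (len >= out_len) {
        return false; /* no room for the code plus the terminator */
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return true;
}
```

For a body such as `<Error><Code>RequestTimeout</Code><Message>...</Message></Error>`, the helper yields the string `RequestTimeout`, which can then be compared against the set of retryable codes.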

aws-c-s3 version used

v0.2.3

Compiler and version used

clang-15.0.7

Operating System and version

ubuntu 22.04

@grrtrr grrtrr added bug Something isn't working needs-triage This issue or PR still needs to be triaged. labels Dec 1, 2024

grrtrr commented Dec 1, 2024

I just realized that v0.7.2 has support for XML parsing of error response bodies.

@grrtrr grrtrr closed this as completed Dec 1, 2024

waahm7 commented Dec 1, 2024

Thanks, yes, it was fixed in #457.
