SSL verification error #640

Open
mshannon-sil opened this issue Jan 28, 2025 · 4 comments
Assignees: mshannon-sil
Labels: bug (Something isn't working), mlops (DevOps, GPUs, ClearML)

Comments

@mshannon-sil
Collaborator

Uploading checkpoints to the S3 bucket fails with the following error: botocore.exceptions.SSLError: SSL validation failed for [file to be uploaded] EOF occurred in violation of protocol (_ssl.c:2426). The error appears about 70 seconds after the initial upload timestamp. The read timeout is currently set to 600 seconds, but the connect timeout is still at its 60-second default, so that may be the cause.
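
For anyone who wants to test the timeout hypothesis, here is a minimal sketch of raising the connect timeout through `botocore.config.Config`. It assumes the S3 client is created directly with boto3; if the uploads go through ClearML's storage layer, the setting may have to be applied there instead, which is not shown. The helper names `build_timeout_settings` and `make_s3_client` are hypothetical, and the 120-second connect timeout and the attempt count are illustrative values, not ones taken from this project.

```python
def build_timeout_settings(connect_timeout=120, read_timeout=600, total_max_attempts=5):
    """Return keyword arguments for botocore.config.Config.

    read_timeout=600 matches the value mentioned in this issue.
    connect_timeout=120 and total_max_attempts=5 are illustrative guesses.
    """
    return {
        # Seconds to wait while establishing the connection (botocore default: 60).
        "connect_timeout": connect_timeout,
        # Seconds to wait for data on an established connection.
        "read_timeout": read_timeout,
        # total_max_attempts includes the initial request.
        "retries": {"total_max_attempts": total_max_attempts, "mode": "standard"},
    }


def make_s3_client(settings=None):
    """Create an S3 client with the given timeout settings (requires boto3)."""
    # Imported lazily so the settings helper can be used without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client("s3", config=Config(**(settings or build_timeout_settings())))
```

If the failures still show up at roughly the 70-second mark with a longer connect timeout, the timeout is probably not the cause.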

mshannon-sil added the bug and mlops labels Jan 28, 2025
mshannon-sil self-assigned this Jan 28, 2025

@ddaspit
Collaborator

ddaspit commented Jan 28, 2025

Is this error causing problems?

@mshannon-sil
Collaborator Author

Yes, oftentimes experiments are failing after 10 retries.
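
The thread does not say how the existing 10 retries are spaced, so the following is only a generic sketch of retrying an upload with exponential backoff and jitter, which can help when failures come from a congested link. `upload_with_retries` and its parameters are hypothetical names, and `upload` stands in for whatever callable performs the actual transfer.

```python
import random
import time


def upload_with_retries(upload, max_attempts=10, base_delay=1.0, max_delay=60.0,
                        retry_on=(Exception,), sleep=time.sleep):
    """Call upload() until it succeeds or max_attempts is reached.

    Waits a random time between 0 and min(max_delay, base_delay * 2**(n-1))
    seconds after the n-th failed attempt, then tries again.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the last error propagate to the caller.
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

In practice `retry_on` would be narrowed to the transient errors of interest, for example `botocore.exceptions.SSLError`, so that unrelated failures are not retried.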

@Enkidu93
Collaborator

I just saw this on a production job as well: 2025-01-28 21:02:18,573 - clearml.storage - ERROR - Failed uploading: SSL validation failed for ... occurred in violation of protocol (_ssl.c:2406)

@mshannon-sil
Collaborator Author

After talking with TechOps, it looks like there may be an issue with the network's upload speeds. I submitted a ticket to IT services yesterday to see if they can resolve the issue on their end, and I'm currently waiting for a reply.
