Instance status checks failures #642
Comments
Happened to us again today:
Again:
Again:
Again:
Recently there seems to be a 1-2% chance of failure per 24 hours of uptime. I can't imagine this level of unreliability would be tolerated with any other instance type. This is happening across multiple accounts, multiple zones in us-east-1, and two different bitfiles, all with a single software architecture and set of FPGA drivers. The error message suggests a pure AWS issue. It wouldn't surprise me if we're contributing to the failures somehow, but the error message gives us nothing to go on.
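To put the 1-2% per day figure in perspective, here is a rough sketch of how it compounds over a month. It assumes each 24-hour period fails independently with the same probability, which is a simplification and not something measured in this thread. Under that assumption a single instance has roughly a 26% to 45% chance of hitting at least one failure in 30 days.

```python
def prob_at_least_one_failure(daily_failure_prob, days):
    """Chance of one or more failures over `days`, assuming independent days."""
    if not 0.0 <= daily_failure_prob <= 1.0:
        raise ValueError("daily_failure_prob must be between 0 and 1")
    # Survival probability compounds daily; failure is its complement.
    return 1.0 - (1.0 - daily_failure_prob) ** days


# The 1% and 2% bounds come from the estimate above; 30 days is an arbitrary window.
for p in (0.01, 0.02):
    print(f"{p:.0%} per day -> {prob_at_least_one_failure(p, 30):.1%} over 30 days")
```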
Hello, thanks for reaching out with this issue. We've been monitoring it internally and will report back soon. In the meantime, have you been able to follow some of the AWS EC2 troubleshooting steps?
I looked over the link with devops and most of it doesn't make sense in our context: it's primarily about configuration problems, whereas this happens after instances have run for extended periods. It looks like a more specific cause may be available if we can check the instance's details in the EC2 console, but the instances are in an autoscaling group, so they get terminated automatically and their details quickly become inaccessible on the console. Questions I have are:
Happy for you to close the support ticket if it helps your KPIs. I don't need this solved urgently, but it is problematic from the perspective of our customers, so I don't want it ignored either.
One thing that may be contributing to the instance instability is PCIe/AXI errors on the bus. Can you provide the shell timeout data from immediately prior to the instance failures? You can find more information on collecting this data with the SDK here: https://github.com/aws/aws-fpga/blob/863d963308231d0789a48f8840ceb1141368b34a/hdk/docs/HOWTO_detect_shell_timeout.md

Gathering this data will help us narrow down the issue, as "hanging the PCI bus" is the most likely root cause. Don't worry about closing the support tickets; they help us collect data and gain visibility on the issues!
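For anyone else trying to capture this data before an autoscaling group terminates the instance, a minimal sketch of a capture helper might look like the following. It assumes the SDK's `fpga-describe-local-image` tool is on the PATH and accepts `-S <slot> --metrics` as described in the linked HOWTO; the function names, file naming, and 30-second timeout are hypothetical choices for illustration, not the code used by anyone in this thread.

```python
import datetime
import pathlib
import subprocess


def build_metrics_command(slot=0, use_sudo=True):
    """Return the argv that dumps FPGA metrics for one slot."""
    # Flags assumed from the aws-fpga shell timeout HOWTO.
    cmd = ["fpga-describe-local-image", "-S", str(slot), "--metrics"]
    return ["sudo"] + cmd if use_sudo else cmd


def capture_fpga_metrics(out_dir, slot=0, runner=subprocess.run):
    """Run the metrics command and save its output to a timestamped file."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # `runner` is injectable so the helper can be exercised without an FPGA.
    result = runner(
        build_metrics_command(slot), capture_output=True, text=True, timeout=30
    )
    path = out_dir / f"fpga-metrics-slot{slot}-{stamp}.txt"
    path.write_text(
        f"exit={result.returncode}\n"
        f"--- stdout ---\n{result.stdout}\n"
        f"--- stderr ---\n{result.stderr}\n"
    )
    return path
```

Calling `capture_fpga_metrics("/var/log/fpga")` from a cleanup hook would leave one file per invocation. Since terminated instances lose their local disk, the file would also need to be copied somewhere durable, which this sketch leaves out.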
Added the below to the cleanup hook, which might provide some insight. I know OCL reads are working fine immediately prior to the status checks failure.
Could you share what shell interfaces your workload is exercising at the time of failures?
We didn't change anything except rebuilding the image to add the above code, and we haven't experienced the problem in the last two weeks, so it might be gone. The interfaces are OCL and DMA_PCIS. We do DMA writes to two DDRs from the processor, read from two DDRs to the FPGA, and do reads and writes between the processor and FPGA with OCL registers.
We're glad to hear you're no longer experiencing the issue. If you ever do experience the failure again, please reach out with any information you have!
Autoscaling group:
EC2 console:
fpga-describe-local-image:
I find the generic reasons for a reachability check failure implausible, given that OCL reads and DMA writes aren't erroring immediately prior to the fpga-describe-local-image call. A status checks failure usually comes with an associated issue from within our logic: a processing operation times out. I know about the timeout because the OCL reads that report it succeed. We've had these timeouts happen before with functional problems in our design, but I struggle to see how anything we do in the CL or with DDR could cause a reachability failure.
Representative power consumption from a working instance:
Any further thoughts?
Autoscaling group:
EC2 console:
fpga-describe-local-image:
Same deal.
I've observed several instance failures in our production environments during the last month. I'm curious what's happening, as this is new and highly disruptive.
From today: