One of the diffing service pods started failing and returning a variety of errors, causing downstream errors in the DB’s auto-analysis job. @Mr0grog was offline at a camp and was unable to address it for a full day.
The problem appears to have been caused by a broken executor (the process pool that actually runs the diffs).
All times in PDT.
The diffing service starts raising a few more errors than usual.
Sentry begins sending rollup error alerts with multiple errors because of the error frequency.
Rob sees a huge number of errors while checking in on the internet at DWeb Camp. Internet is limited, and the errors are mostly about fetch timeouts in the differ while fetching snapshots from S3. S3 is not reporting any issues, so it’s unclear what exactly is going wrong, and it is hard to fix at the time.
Rob gets home and starts looking into the issue in more detail. Sentry is mainly sending two error types:
- Timed out while fetching a snapshot from S3
- “Cannot send error response after headers written”
The second error indicates the service is in a bad state. Checking the actual logs shows errors coming from the process pool that actually runs the diffs. Based on that, it looks like the process pool is simply broken, and the only real remediation is to restart the differ pods.
Rob restarts all the differ pods one by one using:
> kubectl delete pod <diffing_service_pod_name>
After monitoring Sentry for half an hour, all errors seem to have stopped and the incident is resolved.
Plan to look into the cause and possible code fixes in more detail tomorrow.
- Logs provided useful information about the issue.
- Resolving the incident and restarting was reasonably straightforward.
- @Mr0grog was unavailable and largely offline for the weekend, and nobody else addressed the issue.
- Most of the errors Sentry was reporting were side effects of the actual issue (a broken process pool). The process pool issue was not at all immediately obvious.
- Look deeper into the actual cause and determine whether we could change anything to:
    - Automatically resolve similar issues.
    - Make similar issues more apparent when they occur (e.g. stop and warn about the process pool rather than warning about so many side effects).
- @Mr0grog
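
As a starting point for the follow-up above, here is a minimal sketch of one way a broken process pool could be detected and replaced automatically instead of requiring a pod restart. It assumes the differ runs its work through Python's `concurrent.futures.ProcessPoolExecutor`, which this report does not confirm. The `SelfHealingPool` class, its `make_pool` and `max_rebuilds` parameters, and the `rebuild_count` attribute are hypothetical names used only for illustration; they are not part of the existing service.

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


class SelfHealingPool:
    """Hypothetical wrapper that replaces a process pool once it breaks."""

    def __init__(self, make_pool=None, max_rebuilds=3):
        # make_pool is an assumed hook: any zero-argument callable that
        # returns an executor. It defaults to a standard process pool.
        self._make_pool = make_pool or (lambda: ProcessPoolExecutor(max_workers=2))
        self.max_rebuilds = max_rebuilds
        self.rebuild_count = 0
        self._pool = self._make_pool()

    def run(self, func, *args):
        """Run func(*args) in the pool, retrying once on a fresh pool."""
        try:
            return self._pool.submit(func, *args).result()
        except BrokenProcessPool:
            # A worker died abruptly, so this executor is permanently
            # unusable. Replace it rather than failing every later request.
            self._rebuild()
            return self._pool.submit(func, *args).result()

    def _rebuild(self):
        if self.rebuild_count >= self.max_rebuilds:
            # Fail loudly with one clear error about the pool itself,
            # instead of a stream of side-effect errors.
            raise RuntimeError("process pool keeps breaking; giving up")
        self.rebuild_count += 1
        self._pool.shutdown(wait=False)
        self._pool = self._make_pool()
```

Capping the number of rebuilds is a deliberate choice in this sketch: a pool that breaks repeatedly probably points at a deeper problem, and a single explicit error about the pool should be easier to spot in Sentry than many timeout errors. In a real service, `rebuild_count` could also be reported to monitoring so that a rebuild is visible even when it succeeds.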