Runner not terminated after cancellation of job #537

Closed
pharindoko opened this issue Apr 23, 2024 · 3 comments

Comments

@pharindoko
Contributor

pharindoko commented Apr 23, 2024

Hey,

I sometimes run into a case where a job is cancelled but the runner is not terminated.
I get a warning on the job in the UI:

build
Runner xxxxxx did not respond to a cancelation request with 00:05:00.

The step function and the runner itself are still in progress and seem to think that a job is still ongoing.
That's bad because the EC2 instances keep running without doing anything.

Does anybody else have this issue?
Does anyone have a solution for this behaviour?

@kichik
Member

kichik commented Apr 24, 2024

That's strange. Do you have more insight on what the instance is doing? Did a job actually start on it? The fact that cancellation is being attempted suggests a job did start.

Even if a job started, eventually the actions runner itself should time out as well (and then terminate the instance). Maybe it will have something useful in its logs once it does?

And if a job wasn't started, the runner should be deleted by the idle reaper. At that point, the runner will stop on the instance and terminate itself.

The only guess I have so far is that the instance ran out of memory, started thrashing swap space, and therefore wasn't able to respond to the GitHub server, causing the cancellation request to time out.

@pharindoko
Contributor Author

Good hint. Switched to an instance with more memory.

I will add an additional alert to see when instances idle too long (e.g. for an hour).
If that happens more often, I will check whether the related jobs are cancelled and the instance can be terminated.
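
For anyone wanting something similar, here is a rough sketch of the check behind such an alert. It is not part of this project. It assumes instance records shaped like the entries boto3's EC2 `describe_instances` returns (`InstanceId` plus a timezone-aware `LaunchTime`), and it uses time since launch as a stand-in for "idle", which is only an approximation. The function name `long_running_instances` and the one-hour default are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def long_running_instances(instances, max_age=timedelta(hours=1), now=None):
    """Return the IDs of instances launched more than `max_age` ago.

    `instances` is a list of dicts with "InstanceId" and a timezone-aware
    "LaunchTime" (assumed shape, mirroring boto3 describe_instances entries).
    Age since launch is only a proxy for idleness: a long job looks the same.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [
        inst["InstanceId"]
        for inst in instances
        if now - inst["LaunchTime"] > max_age
    ]


if __name__ == "__main__":
    # Hypothetical sample data; in practice this would come from the EC2 API.
    reference = datetime(2024, 4, 24, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"InstanceId": "i-old", "LaunchTime": reference - timedelta(hours=2)},
        {"InstanceId": "i-new", "LaunchTime": reference - timedelta(minutes=10)},
    ]
    print(long_running_instances(sample, now=reference))
```

The returned IDs would still need to be cross-checked against the related jobs before terminating anything, as described above.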

@kichik
Member

kichik commented Apr 24, 2024

FYI #518 will cause SSM to terminate the instance instead of the instance terminating itself. That might help here too.
