
[16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue dead jobs #716

Open
wants to merge 3 commits into base: 16.0

Conversation

@AnizR AnizR commented Dec 6, 2024

Goal

Automatically re-queue jobs that have been started but whose worker has been killed, and rely on Odoo's limit_time_cpu and limit_time_real to bound job execution time.
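For reference, these limits are the standard Odoo server options set in odoo.conf; the values below are purely illustrative:

```ini
[options]
; Workers exceeding these limits are killed by the Odoo server; with this PR
; the job such a worker was running is re-queued automatically.
limit_time_cpu = 600
limit_time_real = 1200
```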

Technical explanation

Everything relies on a new table queue_job_locks which contains the ids of jobs that have been started.
When a job is executed, its row in queue_job_locks is locked.

If a row is present in queue_job_locks for a job with state='started' but is not locked, the job is either:

  • a job whose worker has been killed
  • a job that has not actually started yet (this only happens in a very small time frame)

Using this information, we can re-queue these jobs.
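As a rough illustration of the idea (this is not the exact query from this PR; the table layout is an assumption, in particular that queue_job_locks.id mirrors queue_job.id), dead jobs can be found by trying to lock the rows a live worker would still be holding:

```python
# Hypothetical sketch: any queue_job_locks row we can lock with SKIP LOCKED
# while its job is still marked 'started' belongs to a worker that died, so
# the job is reset to 'pending' and its retry counter is incremented.
def requeue_dead_jobs(cr):
    cr.execute(
        """
        UPDATE queue_job
        SET state = 'pending', retry = retry + 1
        WHERE id IN (
            SELECT queue_job.id
            FROM queue_job_locks
            JOIN queue_job ON queue_job.id = queue_job_locks.id
            WHERE queue_job.state = 'started'
            FOR UPDATE OF queue_job_locks SKIP LOCKED
        )
        """
    )
```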

Why not lock directly in the `queue_job` table?

This was tried in #423 but it did not work when a job raised an error.
It seems the row was still locked while the framework tried to write on it to mark the job as failed, before committing.

Improvement over the current behavior

Re-queue jobs that have been killed, but increment their 'retries' counter to avoid a job that always gets killed being re-queued infinitely.

@OCA-git-bot
Contributor

Hi @guewen,
some modules you are maintaining are being modified, check this out!

@AnizR AnizR changed the title [IMP] queue_job: remove cron garbage collector and automatically requeue jobs in timeout [16.0][IMP] queue_job: remove cron garbage collector and automatically requeue jobs in timeout Dec 6, 2024
Member

@sbidoul sbidoul left a comment


This looks very promising!

A few thoughts:

  • There is a Caveat comment in runner.py that needs updating.
  • This should handle the enqueued state too, which is the state when the runner has decided that a job needs to run but the /queue_job/runjob controller has not set it to started yet. This state normally exists for a very short time, but I have seen situations where the workers are overloaded, take time to accept /queue_job/runjob requests and then die, leaving jobs in the enqueued state forever.
  • Since there are two race conditions, between enqueued and started, and between started and the time the job transaction actually starts with the lock, I wonder if we should not introduce a small elapsed time condition (10s?) in reset_dead_jobs. Maybe based on date_enqueued (a minimal sketch of this condition follows below).
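A minimal sketch of such a grace period, assuming it is applied to queue_job.date_enqueued and using the 10 second value floated above:

```python
# Hypothetical sketch: a job only becomes a dead-job candidate once it has
# been enqueued for more than ~10 seconds, which covers both short race
# windows (enqueued -> started, and started -> lock actually acquired).
from datetime import datetime, timedelta


def is_dead_job_candidate(date_enqueued, grace=timedelta(seconds=10)):
    return date_enqueued is not None and datetime.utcnow() - date_enqueued > grace
```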

queue_job/readme/CONFIGURE.rst
@AnizR AnizR marked this pull request as ready for review December 6, 2024 14:09
@AnizR
Author

AnizR commented Dec 6, 2024

This looks very promising!

A few thoughts:

* There is a Caveat comment in `runner.py` that needs updating.

* This should handle the `enqueued` state too, which is the state when the runner has decided that a job needs to run but the `/queue_job/runjob` controller has not set it to started yet. This state normally exists for a very short time, but I have seen situations where the workers are overloaded, take time to accept `/queue_job/runjob` requests and then die, leaving jobs in the `enqueued` state forever.

* Since there are two race conditions, between `enqueued` and `started`, and between `started` and the time the job transaction actually starts with the lock, I wonder if we should not introduce a small elapsed time condition (10s?) in `reset_dead_jobs`. Maybe based on `date_enqueued`.

Thank you for your suggestions. I have implemented the necessary corrections.
While I have not encountered a case where a job is left in the 'enqueued' state, I have made the appropriate correction to ensure the code is more robust.

Member

@sbidoul sbidoul left a comment


A few more minor comments.

queue_job/__manifest__.py
queue_job/controllers/main.py
queue_job/job.py
queue_job/job.py
Comment on lines 545 to 547

if self.max_retries and self.retry >= self.max_retries:
    raise FailedJobError("Max. retries (%d) reached" % (self.max_retries))
Member


Why add this? This sounds unrelated.

Author


The check that max_retries is greater than retry was done only when a RetryableJobError was raised during the execution of the job.

Since I am incrementing the retry count when re-queuing a job that was killed, I need to do the check before performing the job, not only when an exception is raised.

Moreover, I think that it doesn't make sense to perform the check only when an exception has been raised 🤔
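A minimal sketch of what that change could look like (simplified, not the exact code of this PR; FailedJobError is the exception already used in queue_job/job.py):

```python
# Hypothetical sketch: check the retry counter before executing the job, so a
# job whose worker keeps getting killed eventually fails instead of being
# re-queued forever. Previously this check only ran when a RetryableJobError
# was raised during execution.
def perform(self):
    if self.max_retries and self.retry >= self.max_retries:
        raise FailedJobError("Max. retries (%d) reached" % (self.max_retries,))
    return self.func(*self.args, **self.kwargs)
```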

queue_job/jobrunner/runner.py
@sbidoul sbidoul changed the title [16.0][IMP] queue_job: remove cron garbage collector and automatically requeue jobs in timeout [16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue jobs in timeout Dec 6, 2024
@sbidoul sbidoul changed the title [16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue jobs in timeout [16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue dead jobs Dec 6, 2024
queue_job/job.py
queue_job/migrations/16.0.2.6.9/pre-migration.py
@AnizR AnizR force-pushed the zra-imp-timeout2 branch 2 times, most recently from f98c856 to 8edf042 on December 9, 2024 15:56
Member

@guewen guewen left a comment


Thanks for tackling this old issue, I like this elegant solution. It should also be quite optimized. Congrats on this work!

@@ -1,17 +1,6 @@
<?xml version="1.0" encoding="utf-8" ?>
<odoo>
<data noupdate="1">
<record id="ir_cron_queue_job_garbage_collector" model="ir.cron">
Contributor


Should this become

<delete id="ir_cron_queue_job_garbage_collector" model="ir.cron"/>

to clean up upgrades and avoid filling the cron logs with errors, since the requeue_stuck_jobs method is gone?

Author


I have archived the cron in pre-migration.py, so there won't be any error.
I always prefer to archive (set active to False) rather than delete.
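A minimal sketch of that approach, assuming a plain SQL pre-migration script (the actual script in this PR may differ):

```python
# Hypothetical pre-migration sketch: deactivate the obsolete cron instead of
# deleting it, so upgraded databases stop calling the removed
# requeue_stuck_jobs method while keeping the record around.
def migrate(cr, version):
    cr.execute(
        """
        UPDATE ir_cron
        SET active = false
        WHERE id IN (
            SELECT res_id
            FROM ir_model_data
            WHERE module = 'queue_job'
              AND name = 'ir_cron_queue_job_garbage_collector'
              AND model = 'ir.cron'
        )
        """
    )
```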

@sbidoul
Member

sbidoul commented Dec 11, 2024

@AnizR I think we can also remove the set_job_pending function, as the new mechanism will take care of that part.

@AnizR
Author

AnizR commented Dec 13, 2024

@AnizR I think we can also remove the set_job_pending function, as the new mechanism will take care of that part.

Yes, jobs that have not been started will be re-queued by my new mechanism.
Removing this part will provide a more "uniform" process of re-queuing jobs.

@AnizR AnizR force-pushed the zra-imp-timeout2 branch 2 times, most recently from 6f8f4b8 to e0e9327 on December 23, 2024 13:42
Contributor

@adrienpeiffer adrienpeiffer left a comment


LGTM. Thanks for that.

@adrienpeiffer
Contributor

@guewen Could you merge this one?

@adrienpeiffer
Contributor

@guewen We're finally going to do a few more tests before we merge this. We'll keep you informed

Contributor

@florian-dacosta florian-dacosta left a comment


Review LGTM.
I also did some tests with jobs exceeding limit_time_cpu and it works as expected.

Contributor

@hparfr hparfr left a comment


Thanks for addressing this problem.

I'm asking for more comments in the code to make the intent explicit.

And I have a question: why all these joins instead of using the id or uuid directly?

queue_job/job.py
queue_job/jobrunner/runner.py
queue_job/migrations/16.0.2.7.0/pre-migration.py
Contributor

@rousseldenis rousseldenis left a comment


Some comments

queue_job/job.py
queue_job/jobrunner/runner.py
@rousseldenis
Contributor

@AnizR Could you check the last comments from @hparfr?

@AnizR
Author

AnizR commented Jan 30, 2025

Thank you for all your valuable feedback!

I appreciate the time you've taken to review this PR.
I will rework the code based on your suggestions and push an update as soon as possible.

@sbidoul
Member

sbidoul commented Jan 30, 2025

@AnizR another piece of feedback I got today is that the new queue_job_locks table got deleted by database_cleanup. Maybe we should create a model for it with a proper foreign key, even if we don't actually use the ORM to manipulate it?
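A minimal sketch of what such a model could look like (model and field names here are assumptions; the lock rows would still be written and locked with raw SQL, not through the ORM):

```python
# Hypothetical sketch: declaring the lock table as a real model with a foreign
# key to queue.job makes it a known table with proper metadata, so tools like
# database_cleanup no longer treat it as an orphan to be dropped.
from odoo import fields, models


class QueueJobLock(models.Model):
    _name = "queue.job.lock"
    _description = "Lock row held while a queue job is running"

    queue_job_id = fields.Many2one(
        "queue.job", required=True, index=True, ondelete="cascade"
    )
```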

@AnizR AnizR force-pushed the zra-imp-timeout2 branch 2 times, most recently from f3c13cc to 8646632 on January 31, 2025 15:03
@AnizR
Author

AnizR commented Jan 31, 2025

I think that I've addressed every remark. Thank you all!

I've managed to add some tests for the newly introduced mechanism.
I was forced to introduce some commit() calls in my tests since I am relying on rows and locks in the database.

…eue jobs in timeout

[IMP] queue_job: increment 'retry' when re-queuing job that have been killed
queue_job/job.py
A model is better than a manually managed table, as it will protect the table from deletion by database_cleanup.
@sbidoul
Member

sbidoul commented Feb 1, 2025

@AnizR I realized that the AbstractModel technique to create the lock table would not prevent it from being purged by database_cleanup. So I made the change to use a regular model on the train back from the sprint. Could you check?

@pedrobaeza
Member

You can add the exception in the database_cleanup module as well, as we have done with some columns, although it's true that if there's another way that doesn't require that manual list, it's better.

@AnizR
Author

AnizR commented Feb 3, 2025

@AnizR I realized that the AbstractModel technique to create the lock table would not prevent it from being purged by database_cleanup. So I made the change to use a regular model on the train back from the sprint. Could you check?

I did a small test to simulate re-queuing of 'dead jobs'.
It still works correctly, thanks!

Contributor

@hparfr hparfr left a comment


Comments added as I requested, thanks @AnizR

@rousseldenis
Contributor

@AnizR I realized that the AbstractModel technique to create the lock table would not prevent it from being purged by database_cleanup. So I made the change to use a regular model on the train back from the sprint. Could you check?

I did a small test to simulate re-queuing of 'dead jobs'. It still works correctly, thanks!

We'll do tests today

PeterC10 added a commit to otriumofficial/queue that referenced this pull request Feb 4, 2025
Backport of enhancement from Odoo 16 PR on OCA: OCA#716