A job is executed concurrently, does not follow @DisallowConcurrentExecution #1287

Open
AyoaHin opened this issue Dec 19, 2024 · 0 comments

AyoaHin commented Dec 19, 2024

In a cluster, a job annotated with @DisallowConcurrentExecution can be scheduled concurrently: when some containers are restarted while the previous fire of the job has not yet completed, the next fire is started anyway and the two runs overlap.
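
For context, a minimal sketch of the kind of setup being described (the class, job and trigger names are illustrative only; the 30-second interval and the overwrite-on-startup registration follow the description below):

```java
import java.util.Collections;

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

public class ReportJobSetup {

    // The job must never run concurrently, even across cluster nodes
    // sharing the same JDBC job store.
    @DisallowConcurrentExecution
    public static class ReportJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // long-running work; a single run may exceed the 30-second interval
        }
    }

    public static void register(Scheduler scheduler) throws SchedulerException {
        JobDetail job = JobBuilder.newJob(ReportJob.class)
                .withIdentity("reportJob", "reports")
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("reportTrigger", "reports")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(30)
                        .repeatForever())
                .build();

        // replace = true: every pod overwrites the stored job/trigger on startup,
        // which is the "configuration can be overwritten" behaviour in the report.
        scheduler.scheduleJob(job, Collections.singleton(trigger), true);
    }
}
```
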
I deployed four pods in a cluster; one of the jobs is scheduled to run every 30 seconds. During a rolling upgrade, a job marked as non-concurrent was executed twice at the same time (two different fire sessions; the previous execution ran longer than the 30-second interval). I could not find any description of this problem online, so below I describe what I observed, hoping it suggests where an optimization is needed.

Assume the four pods are a, b, c and d, and that a and b are restarted during the upgrade. From the database binlog I reconstructed the following sequence:

1. Before a and b restart (around 20:12:00), node c's acquireNextTriggers acquires the trigger for the 20:12:30 fire time: the row in the triggers table is marked ACQUIRED and an ACQUIRED row is inserted into the fired_triggers table. At this point the previous fire of the job has not yet completed.
2. At 20:12:02, a and b restart. Because my job configuration is allowed to overwrite the existing definition, startup re-registers the job/trigger in the database (twice, once per restarted pod), and afterwards the trigger state in the triggers table is back to WAITING.
3. At 20:12:27, after its restart, node a runs acquireNextTriggers for the first time, acquires the same fire time again and sets the trigger back to ACQUIRED. Now two nodes each hold an uncompleted fire cycle, and the fired_triggers table contains ACQUIRED rows for two different scheduler instances for the same job and the same fire time.
4. After acquireNextTriggers, the scheduler thread checks timeUntilTrigger (the difference between the fire time and the current time) to decide whether to continue with the fire. At 20:12:30, node c takes the lock first and fires the trigger: the trigger state in the triggers table becomes BLOCKED, the next fire time is updated, and the fired_triggers row moves to EXECUTING.
5. Node a then takes the lock inside releaseIfScheduleChangedSignificantly (reached from the timeUntilTrigger wait loop), decides its acquired trigger should be released, changes the trigger state in the triggers table from BLOCKED back to WAITING and deletes the corresponding record from the fired_triggers table (see the sketch after this list).
6. At this point a job instance is actually executing on node c, but its trigger state is WAITING. The next acquisition scan therefore picks the trigger up again and considers it eligible to fire, so two executions of the same job run in parallel under different fire sessions.
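
To make steps 4 and 5 easier to follow, here is a heavily simplified paraphrase of the wait loop in Quartz 2.3.1's QuartzSchedulerThread. It is not the literal source: the names acquireNextTriggers, timeUntilTrigger and releaseIfScheduleChangedSignificantly come from the Quartz code referenced above, but the helper body below is only a stand-in.

```java
import java.util.List;

import org.quartz.spi.OperableTrigger;

// Simplified paraphrase of the QuartzSchedulerThread wait loop (Quartz 2.3.1),
// reduced to the pieces referenced in the timeline above.
class SchedulerThreadSketch {

    void waitForFireTime(List<OperableTrigger> acquired) {
        long triggerTime = acquired.get(0).getNextFireTime().getTime();
        long timeUntilTrigger = triggerTime - System.currentTimeMillis();

        while (timeUntilTrigger > 2) {
            // ... block on the scheduler's signal lock until the fire time,
            //     waking early if another thread signals a schedule change ...

            // If the schedule appears to have changed significantly (for example
            // because a restarted node re-registered the job with replace = true),
            // the acquired trigger is released: its state goes back to WAITING and
            // its fired_triggers row is deleted -- even though another node may
            // already have fired it and set the trigger to BLOCKED.
            if (releaseIfScheduleChangedSignificantly(acquired)) {
                return; // give up this fire cycle and go back to acquisition
            }
            timeUntilTrigger = triggerTime - System.currentTimeMillis();
        }
        // ... otherwise the triggers are fired and the job is handed to a worker ...
    }

    // Stand-in for the real private method of the same name; in Quartz it compares
    // the trigger's current fire time with the acquired one and, when they differ
    // enough, releases the acquired trigger back to the job store.
    private boolean releaseIfScheduleChangedSignificantly(List<OperableTrigger> acquired) {
        return false;
    }
}
```
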
The above is the conclusion I drew from reading the Quartz 2.3.1 source code and the database binlog. Please correct me if I have misunderstood anything; I would also appreciate a reply to the questions above. Thank you.