In a clustered setup, a job annotated with @DisallowConcurrentExecution can be scheduled concurrently when some containers are restarted: the next fire of the trigger starts while the previous execution has not yet completed.
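For context, here is a minimal sketch of the kind of job involved (plain Quartz API; the class, group, and trigger names are made up for illustration, and registering the trigger with replace = true is my assumption based on the "configuration can be overwritten" behaviour described below):

```java
import java.util.Collections;

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Illustrative only: a non-concurrent job fired every 30 seconds, similar to
// the configuration described in this report.
@DisallowConcurrentExecution
public class ThirtySecondJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        // Long-running work; a single run may take longer than the
        // 30-second trigger interval.
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(ThirtySecondJob.class)
                .withIdentity("thirtySecondJob", "demo")
                .storeDurably()
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("thirtySecondTrigger", "demo")
                .forJob(job)
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(30)
                        .repeatForever())
                .build();

        // replace = true: each pod start overwrites the existing job/trigger
        // rows, which is what resets the trigger state to WAITING during the
        // rolling upgrade described below.
        scheduler.scheduleJob(job, Collections.singleton(trigger), true);
        scheduler.start();
    }
}
```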
I deployed four pods in a cluster; one of the jobs on them is scheduled every 30 seconds. During a rolling upgrade, a job marked as non-concurrent was executed twice at the same time (two different sessions; the previous run was still executing past the next scheduled fire time), and I could not find any description of this problem online. Below I describe what I observed, in the hope that it points toward a fix.

Assume there are four pods online, call them a, b, c and d, and two of them, a and b, are restarted. This is the timeline reconstructed from the database binlog:

- Before a and b are restarted (around 20:12:00), node c's acquireNextTriggers has already acquired the trigger for the 20:12:30 fire: the row in the triggers table is marked ACQUIRED and an ACQUIRED row is inserted into the fired_triggers table. Node c's scheduler thread has not yet finished this cycle; it is waiting for the fire time.
- At 20:12:02, a and b are restarted. Because my job configuration is set to be overwritten, the restart re-registers the execution plan in the database (it is refreshed twice, once per pod), after which the state in the triggers table goes back to WAITING.
- After a comes back up, at 20:12:27 it runs acquireNextTriggers for the first time, acquires the same fire of the job again, and sets the state to ACQUIRED. At this point two nodes each have an unfinished cycle for this fire, and the fired_triggers table holds ACQUIRED rows for the same job and the same session from two different scheduler instances.
- After acquireNextTriggers, there is a timeUntilTrigger check (the difference between the fire time and the current time) that decides whether the trigger should actually be fired (a simplified sketch of this loop is included after the timeline).
- At 20:12:30, node c takes the lock first and fires the job: the state in the triggers table becomes BLOCKED, the next fire time is updated, and the fired_triggers row becomes EXECUTING.
- Node a then takes the lock inside releaseIfScheduleChangedSignificantly (reached from the timeUntilTrigger branch), decides the trigger should be released, changes the state in the triggers table from BLOCKED back to WAITING, and deletes its fired_triggers record.

At this point a job execution is actually in progress, but its trigger state is WAITING, so the next acquire scan naturally picks it up and considers it fireable. The result is two executions of the same non-concurrent job running in parallel under different sessions.
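For reference, here is a simplified paraphrase of the wait loop in QuartzSchedulerThread that the timeline refers to. This is my reading of the logic, not the verbatim Quartz 2.3.1 source; the wrapper class, field names, and stubbed method body are mine:

```java
import java.util.List;

import org.quartz.spi.OperableTrigger;

/**
 * Simplified paraphrase of the QuartzSchedulerThread wait loop between
 * acquireNextTriggers and triggersFired (not the verbatim 2.3.1 source).
 */
class SchedulerWaitLoopSketch {

    private final Object sigLock = new Object();

    void waitForFireTime(List<OperableTrigger> triggers) throws InterruptedException {
        long triggerTime = triggers.get(0).getNextFireTime().getTime();
        long timeUntilTrigger = triggerTime - System.currentTimeMillis();

        while (timeUntilTrigger > 2) {
            synchronized (sigLock) {
                if (timeUntilTrigger >= 1) {
                    // Sleep until close to the fire time.
                    sigLock.wait(timeUntilTrigger);
                }
            }
            // If the schedule in the database changed "significantly" while
            // waiting, the acquired triggers are released: the job store sets
            // the trigger row back to WAITING and deletes its fired_triggers
            // entry. Per the timeline above, this is the step on node a that
            // wipes out node c's BLOCKED state while node c is still executing.
            if (releaseIfScheduleChangedSignificantly(triggers, triggerTime)) {
                return; // give up this batch and re-acquire on the next pass
            }
            timeUntilTrigger = triggerTime - System.currentTimeMillis();
        }
    }

    private boolean releaseIfScheduleChangedSignificantly(
            List<OperableTrigger> triggers, long triggerTime) {
        // In Quartz this delegates to JobStore.releaseAcquiredTrigger(...) when
        // the earliest candidate fire time has moved; stubbed here.
        return false;
    }
}
```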
The above is the conclusion I drew from reading the Quartz 2.3.1 source code and the database binlog. Please correct me if I have misunderstood anything; I would appreciate a reply on this issue. Thank you.