The test_scheduler_mix hangs on !x86_64! #1281
Comments
CC=gcc-11 CXX=g++-11 cmake -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_STANDARD=20 -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON ../..
cmake --build . --verbose --config RelWithDebInfo
ctest --timeout 180 --build-config RelWithDebInfo -R test_scheduler_mix --repeat-until-fail 3600
Ctest output:
RelWithDebInfo without LTO:
RelWithDebInfo backtrace:
Thank you for submitting this. We have reproduced the hang and are looking into it.
Is there any news?
After enough runs, the hang appears to reproduce on every platform with every configuration. We are still diagnosing the root cause.
The hang happens (after enough runs) even with only 1 user thread simply calling the actions. Only |
Any news about the root cause of the bug?
I haven't done further debugging since my previous update. As I mentioned, it does appear that the hang occurs on every platform after repeatedly running the simplified test (branch: dev/dnmokhov/test_scheduler_mix), where I removed
So just repeatedly creating, destroying, executing, enqueueing (in random order) seems to eventually cause the hang. Whether it is the test itself or scheduler code has yet to be determined. |
@dnmokhov |
@phprus, I am eventually planning to keep debugging it and root-cause the hang (i.e., test or scheduler). However, currently I have some higher priority tasks. Is this issue blocking anything on your side? |
One deadlock scenario in the test can be described as follows. It's possible to create an arena with I believe it's not specific to 1-slot arenas and may affect broader cases. One way to eliminate the possibility of deadlocks is to order requests for resources, i.e., if a thread owns an arena with index N, it may try to execute a task only in an arena with a larger index.
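The ordering rule above can be sketched in standard C++ without any oneTBB dependency. This is only an illustration of the idea, not the scheduler's or the test's actual code: `current_arena_index` and `try_execute_in` are hypothetical names, arenas are represented by plain integer indices, and `body()` stands in for whatever work would be submitted to the real arena (for example via `tbb::task_arena::execute`).

```cpp
#include <functional>

// Hypothetical bookkeeping, not part of oneTBB: index of the innermost arena
// the calling thread currently occupies; -1 means the thread holds no arena.
thread_local int current_arena_index = -1;

// Runs `body` "inside" arena `arena_index` only if doing so respects the
// ordering rule: a thread that holds arena N may enter only an arena with an
// index greater than N. Returns false, without running `body`, otherwise.
bool try_execute_in(int arena_index, const std::function<void()>& body) {
    if (arena_index <= current_arena_index) {
        return false;  // would break the ordering, so refuse instead of waiting
    }
    // Restore the previous index on scope exit, even if body() throws.
    struct restore_index {
        int saved;
        ~restore_index() { current_arena_index = saved; }
    } restore{current_arena_index};

    current_arena_index = arena_index;
    body();  // in real code this is where the work would enter the arena
    return true;
}
```

With this rule, a thread that already holds arena 1 may nest into arena 2, but a nested request for arena 0 is rejected, so two threads cannot end up waiting on each other's arenas in opposite order. A caller whose request is rejected would need a fallback, such as enqueueing the work instead of blocking.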
Thank you, @dnmokhov, for the hang report after fixing the test. It looks like a different issue; this time the problem is in the scheduler. The hang occurs when a master thread occupies an arena slot intended for workers and, at the same time, that thread is unable to steal tasks because of the stack size limit. If the dependent tasks the thread is waiting for were stolen and partially executed by other threads, the wait becomes infinite, because a worker thread is unable to join the arena (the worker slot is occupied by the master).
@Alexandr-Konovalov |
@phprus No news from my side. Just in case, do you have a specific scenario in mind that may lead to this situation?
In production code, no. But could this bug lead to hangs on the ARM platform (like #756)? Or not?
I don't think so. |
Commit: 58653a3
gcc version 11.3.0
OS: openSUSE 15.5
CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Without virtualization.
Build commands:
CC=gcc-11 CXX=g++-11 cmake -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=20 -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON ../..
cmake --build . --verbose --config Release
Ctest:
For that environment, 3600 retries are enough to cause a hang.
Average passed test time is approximately 2 sec. Timeout is 180 sec.
Ctest output: