Parallelize unit tests #902
Used:

```bash
# Watch for the test's states directory, pausing the build when it is deleted
# so the situation can be inspected.
directory=./states/tlc2.tool.liveness.CodePlexBug08EWD840FL2Test/

monitor_directory() {
    ant -f customBuild.xml compile compile-test dist test &
    # Get the PID of the ant process
    pid=$!
    echo $pid

    while [ ! -d "./states" ]; do
        inotifywait -e create "."
    done
    echo -e "\e[31mDirectory states created\e[0m"

    while [ ! -d "$directory" ]; do
        inotifywait -e create "./states"
    done
    echo -e "\e[31mDirectory '$directory' created\e[0m"

    # Monitor the directory for delete events
    while true; do
        inotifywait -e delete_self "$directory"
        echo -e "\e[31mDirectory '$directory' deleted, pausing program\e[0m"
        kill -STOP $pid
        echo -e "\e[31mProgram paused, press Enter to resume or Ctrl+C to terminate\e[0m"
        read -r
        kill -CONT $pid
    done
}

monitor_directory
```

Turns out it was ...
CommonTestCase should probably set TLCGlobals.metaRoot to a directory whose name is the concatenation of the test name and a timestamp.
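Purely as an illustration of that naming scheme (not code from this PR), using the test name from the script above and a Unix timestamp, the resulting per-test metadata directory would look something like this:

```bash
# Illustrative only: a per-test metadata directory named <test name>_<timestamp>,
# so concurrently running tests never share the same ./states/ subdirectory.
test_name="tlc2.tool.liveness.CodePlexBug08EWD840FL2Test"
meta_root="./states/${test_name}_$(date +%s)"
mkdir -p "$meta_root"
echo "per-test metadata root: $meta_root"
```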
I discovered that all the ...
I have now successfully run this workflow on my 8-threaded local workstation over 10 times, giving reasonable confidence that there are no additional concurrency failures lurking in the tests.
tlatools/org.lamport.tlatools/test/tlc2/tool/liveness/ModelCheckerTestCase.java
LGTM. @Calvin-L Feel free to merge as you see fit.
Dang, some tests fail when executed in parallel on my MacBook M1: ...
FWIW: Execution time is down from ~12 minutes to ~3 minutes.
Drats! Impressive execution time though. Okay, I'll try executing them on my M1 MacBook tomorrow.
If you touch this again, it would also be useful to support limiting the parallelism (down to a single thread) so the suite does not consume all system resources. For example, I typically launch the test suite and then tend to other work.
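One hypothetical way to expose this would be a command-line override along the lines of the sketch below; note that the `test.threads` property is illustrative only (it does not currently exist in customBuild.xml) and would need to be wired up to the `junit` task's `threads` attribute:

```bash
# Hypothetical: cap test parallelism from the command line. The test.threads
# property is illustrative and is not currently defined in customBuild.xml.
ant -f tlatools/org.lamport.tlatools/customBuild.xml \
    compile compile-test dist test -Dtest.threads=1
```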
@lemmy can't reproduce on my M1 MacBook unfortunately. It runs a bit slower than yours, completing in about 4 minutes. Ran it about five times without failures. Would you be able to run the following on your mini and then paste the output here? The Python script summary should give a good head start on debugging this:

```bash
ant -f tlatools/org.lamport.tlatools/customBuild.xml compile compile-test dist test 2>&1 | tee test-output.txt
python .github/scripts/parse-unit-test-output.py test-output.txt tlatools/org.lamport.tlatools/target/surefire-reports/
```

One weird thing I noticed, ...
All tests pass if there is no other activity on the machine. Could be that I ran TLC from VSCode yesterday while running the suite in the background. What's the OS of your workstation?
Workstation is running Linux.
Just to add another point of data: the tests run successfully on my 10-core M1 laptop (macOS 14.4) in 2m 16s. I haven't tried running them with significant background activity.
How long did it take to run the ...
Looks like 1m 22s: ...
It is a bad idea to parse Ant's output.
I don't necessarily agree it's a bad idea; the summary is very useful even if it is not always correct. The summary is a nice-to-have; we still have the original output and the original exit code. We can fix the summary to handle these cases as they arise.
In this case, here is the failure (which I had to find by downloading the full text of the CI log and searching through it for the keyword "FAILED", significantly less easy than looking at a summary):

...
In other words, this comment is factually incorrect. More importantly, it raises concerns about our reliance on a script that may or may not detect test failures.
Markus, I feel like you're being unnecessarily unfriendly today and would kindly request you show more grace.
This is an unfairly harsh assessment of something that is, in my estimation, a strict upgrade to the CI job.
It's true that ...

Furthermore, the exit code of ... test_failure ~> ci_failure holds with or without ...
@Calvin-L Thank you for taking the time to check that the interception of exit values with pipefail and passing it between steps worked reliably. I didn't have time to check it myself. While ...
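(For readers following along, here is a minimal sketch of the exit-code handling being discussed, reusing the commands from the earlier comment; this is not the actual CI configuration.)

```bash
# Sketch only: with pipefail, the tee that captures output for the summary
# script no longer masks Ant's exit status.
set -o pipefail

ant -f tlatools/org.lamport.tlatools/customBuild.xml compile compile-test dist test 2>&1 \
    | tee test-output.txt
ant_status=$?   # ant's exit code, not tee's, thanks to pipefail

# The summary is a nice-to-have on top of the raw log and exit code.
python .github/scripts/parse-unit-test-output.py \
    test-output.txt tlatools/org.lamport.tlatools/target/surefire-reports/ || true

exit "$ant_status"   # test_failure ~> ci_failure regardless of the summary script
```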
That is appreciated, and I am glad we have found a more robust alternative (and probably would have happily worked on it myself), but please do not rewrite history: until Calvin mentioned he appreciated the features of the script, the linked PR was a simple deletion of the script from the repo.
Before this change, you could insert ...

```
BUILD FAILED
.../tlatools/org.lamport.tlatools/customBuild.xml:535: Context loader has not been reset
```
@ahelwer I suggest bringing back the previous sequential test execution if restoring the original behavior is not possible.
Will take a look.
Findings from running on Arch Linux with OpenJDK 22 and Ant 1.10.14: ...

So unfortunately I cannot reproduce this behavior. Are you running on macOS? What are your OpenJDK and Ant versions?
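(For reference, the version information being asked about here can be printed with the commands below.)

```bash
# Print the JDK and Ant versions in use.
java -version
ant -version
```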
```
-> % git diff src/tlc2/tool/liveness/LiveCheck.java
diff --git a/tlatools/org.lamport.tlatools/src/tlc2/tool/liveness/LiveCheck.java b/tlatools/org.lamport.tlatools/src/tlc2/tool/liveness/LiveCheck.java
index 28660a2ff..348636518 100644
--- a/tlatools/org.lamport.tlatools/src/tlc2/tool/liveness/LiveCheck.java
+++ b/tlatools/org.lamport.tlatools/src/tlc2/tool/liveness/LiveCheck.java
@@ -668,6 +668,9 @@ public class LiveCheck implements ILiveCheck {
 			return;
 		}
+		if (nodes.length == 2) {
+			System.exit(-1);
+		}
 		final int alen = oos.getCheckAction().length;
 		// See node0.addTransition(..) of previous case for what the
```

```
-> % ant -d -f customBuild.xml info test > out.txt
[junit] Test pcal.CCallReturn1Test FAILED (crashed)
[junit] Test pcal.CallReturn1Test FAILED (crashed)
[junit] Test pcal.CMultiprocDefineTest FAILED (crashed)

BUILD FAILED
/Users/markus/src/TLA/tla/tlatools/org.lamport.tlatools/customBuild.xml:535: Context loader has not been reset
        at org.apache.tools.ant.AntClassLoader.setThreadContextLoader(AntClassLoader.java:446)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.logVmExit(JUnitTask.java:1898)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.logVmCrash(JUnitTask.java:1870)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.executeAsForked(JUnitTask.java:1302)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.execute(JUnitTask.java:1037)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.oneJunitThread(JUnitTask.java:941)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask.access$000(JUnitTask.java:137)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTask$JunitTestThread.run(JUnitTask.java:895)
        at java.base/java.lang.Thread.run(Thread.java:1570)

Total time: 49 seconds
```
Unfortunately I still can't reproduce this. Running on arm64 macOS with:

...

and

...

which seems to mostly match the versions you are using. When I run the tests I see a number of crashes:

...

but the test execution continues without the exception you saw. Searching "Context loader has not been reset" only brings up the place in the source code where the exception is thrown; nobody else seems to have hit this issue. Does the issue happen every time on your computer? Do you have access to another computer you can try to reproduce the issue on? I could certainly believe this is a weird concurrency bug in Ant. Probably most importantly, does the problem occur when you run with ...
Ant 1.10.14 has some changes related to ...

The release notes are a little poorly worded, but I bet that ...
This was actually easier than expected (edit: it was not easier than expected). Somehow I missed that the JUnit task has a `threads` parameter. I guess this is a newer addition, which is why not a lot of documentation mentions it. Now the unit test target will run with a number of threads equal to the available cores on the machine. Decent improvement: unit tests finish in less than 10 minutes now, which puts them below the time of the tlaplus/examples integration tests! There were a few small changes I had to make to some tests to ensure they don't interfere with each other via operations on the `states` metadir.

I also added a useful Python script to run during the CI after the tests; it parses the test output and the surefire reports and prints a summary of any failures.

I made the following changes to the Ant build: the `-Dtest.halt=true` parameter is no longer used; previously the tests would stop at the first failure, which IMO is not as useful for debugging. This necessitated adding a conditional `fail` target under the test targets, which activates when `failureproperty` is set in the `junit` target. This ensures the CI still fails with a nonzero exit code upon any failures.

I ran this workflow locally on my 8-thread workstation over 10 times, so I have fairly high confidence there aren't any other concurrency failures lurking.
Closes #901