errors when running on a cluster #103
It may be helpful to include the shell script you used to submit the job and the output/error file generated by SLURM.
Thank you for the comment! The essential part of my SLURM script is the following (it is generated from a Julia string `slurmContent`, so `$`-placeholders such as `$nthreads` are interpolated by Julia before submission):

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=$nthreads
#SBATCH --ntasks-per-node=1
#SBATCH --time=$ndays-0
#SBATCH --mem=$(nmem)G
#SBATCH --partition=$partitionName
#SBATCH --open-mode=truncate

export JULIA_PKG_OFFLINE=true
export JULIA_NUM_PRECOMPILE_TASKS=1

echo "job started at:"
date
echo "---------------------------"

/public/home/acpyi5g3o3/Downloads/julia-1.11.1/bin/julia -t $nthreads --compiled-modules=no --heap-size-hint=$(mem_hint)G $runfn $cb $ce $g $h
```

The output/error file generated by SLURM is attached in my first comment. I've tried with different clusters and different Julia versions, but the same error consistently appears.
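For reference, a minimal sketch of how a generated script like this could be written to disk and submitted from Julia; the file name `job.slurm` is an assumption, and `slurmContent` refers to the string described above:

```julia
# Illustrative submission helper; "job.slurm" is a hypothetical file name.
open("job.slurm", "w") do io
    write(io, slurmContent)  # the generated SLURM script shown above
end
run(`sbatch job.slurm`)  # hand the job to the SLURM scheduler
```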
I had a look at the error, and indeed it seems to originate in our implementation of the multi-threaded `rrule`. One thing you could try is to disable the threading there, by manually selecting the serial scheduler with `set_scheduler!(:serial)`, and seeing if the problem persists? Just for reference, the actually relevant part of the stacktrace is the error within the tasks.
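A minimal sketch of that suggestion, assuming `set_scheduler!` is exported by the package under discussion as the comment indicates:

```julia
using PEPSKit  # assumption: the package being discussed, which exports set_scheduler!

# Select the serial scheduler to rule out the multithreaded code path.
set_scheduler!(:serial)
```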
It seems like the way we are (ab)using Zygote and ChainRules along with the multithreading is not super robust, but honestly I have no clue what the cause of that could be...
Thank you @lkdvos for the comment! I've tested the case with a single thread (by explicitly requesting a single CPU in my SLURM script) before, and it failed. And just now I've also added `set_scheduler!(:serial)`; the same error appears. This problem is indeed very strange...
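To double-check which thread configuration the job actually sees on the compute node, one could add a small diagnostic like the following to the submitted script (an illustrative sketch, not part of the original discussion):

```julia
using LinearAlgebra

# Print the threading configuration visible inside the SLURM job.
println("Julia threads:       ", Threads.nthreads())
println("BLAS threads:        ", BLAS.get_num_threads())
println("CPU threads on node: ", Sys.CPU_THREADS)
```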
Are the error messages in all these cases the same?
Yes, in all cases the relevant part of the error messages is the same.
To give some context: I have run a couple of simulations on an HPC environment, also using SLURM, and I have not yet encountered these errors. I guess that makes the problem even stranger. Given that the multi-threaded `rrule` doesn't seem to be robust, maybe we should disable it by default and leave the multi-threaded reverse pass as an experimental feature? @lkdvos (Also, I wanted to perform some benchmarks on how much the multi-threading really improves performance in the backwards pass, because that is not yet super clear to me.)
The problem with just disabling that is that the error persists even when not run in a multithreaded environment, which is really strange to me. It should be somewhat easy to try replacing the threaded map calls with plain `map` to check if that really fixes anything?
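As an illustration of what both suggestions could look like, a hypothetical sketch; the names `threaded_rrule` and `_maybe_tmap` are illustrative and not actual package API, and `tmap` is the threaded map from OhMyThreads.jl:

```julia
using OhMyThreads: tmap

# Hypothetical opt-in switch: plain serial `map` by default, threaded
# `tmap` only when the experimental feature is explicitly enabled.
const threaded_rrule = Ref(false)

_maybe_tmap(f, xs) = threaded_rrule[] ? tmap(f, xs) : map(f, xs)
```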
So you mean replacing all calls to …?
Honestly, I would suggest trying both to see if it changes anything. It's a bit hard for me to do anything specific, since I can't seem to reproduce it, so it's basically a bit of whack-a-mole here.
Unfortunately, I also can't reproduce it on the HPC environment I have access to. Another thing to try might be to play around with the SLURM setup? For comparison, here's my rather simple batch script:

[batch script not preserved in this excerpt]

Note that I have installed Julia using … One reason there could be a problem with the SLURM setup is that apparently the example code runs fine on the front-end node, where I presume the code is just launched from the REPL or by invoking julia directly.
The essential difference between your script and mine is the line quoted above. I've just tested by including it; the same error message appears...
Dear all,
A simple example code like this one runs perfectly on my Mac, and also on the home node of a cluster when run directly. But when I submit it via the SLURM system to a compute node, errors are encountered soon after several steps of the calculation.
Note that I've been using SLURM to submit jobs involving MPSKit.jl for a long time, and that works perfectly.
The error output file is attached (output.txt); its first several lines are shown below (the whole file is pretty long). The problem does not seem to be a conventional one like out-of-memory. Any help is greatly appreciated!
Best,
Junsen