errors when running on a cluster #103
It may be helpful to include the shell script you used to submit the job and the output/error file generated by SLURM.
Thank you for the comment! The essential part of my SLURM script is the following (it is generated from a Julia string `slurmContent`, so `$`-placeholders such as `$nthreads` are interpolated by Julia before submission):

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=$nthreads
#SBATCH --ntasks-per-node=1
#SBATCH --time=$ndays-0
#SBATCH --mem=$(nmem)G
#SBATCH --partition=$partitionName
#SBATCH --open-mode=truncate

export JULIA_PKG_OFFLINE=true
export JULIA_NUM_PRECOMPILE_TASKS=1

echo "job started at:"
date
echo "---------------------------"

/public/home/acpyi5g3o3/Downloads/julia-1.11.1/bin/julia -t $nthreads --compiled-modules=no --heap-size-hint=$(mem_hint)G $runfn $cb $ce $g $h
```

The output/error file generated by SLURM is attached in my first comment. I've tried with different clusters and different Julia versions, but the same error consistently appears.
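For reference, a minimal sketch of how a generated script like this could be written to disk and submitted from Julia; the file name `job.slurm` is an assumption, and `slurmContent` refers to the string described above:

```julia
# Illustrative submission helper; "job.slurm" is a hypothetical file name.
open("job.slurm", "w") do io
    write(io, slurmContent)  # the generated SLURM script shown above
end
run(`sbatch job.slurm`)  # hand the job to the SLURM scheduler
```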
I had a look at the error, and indeed it seems to originate in our implementation of the multi-threaded `rrule`. One thing you could try is to disable the threading there, by manually selecting the serial scheduler with `set_scheduler!(:serial)`, and seeing if the problem persists? Just for reference, the actually relevant part of the stacktrace is the error within the tasks.
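A minimal sketch of that suggestion, assuming `set_scheduler!` is exported by the package under discussion as the comment indicates:

```julia
using PEPSKit  # assumption: the package being discussed, which exports set_scheduler!

# Select the serial scheduler to rule out the multithreaded code path.
set_scheduler!(:serial)
```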
It seems like the way we are (ab)using Zygote and ChainRules along with the multithreading is not super robust, but honestly I have no clue what the cause of that could be...
Thank you @lkdvos for the comment! I've tested the case with a single thread (by explicitly requesting a single CPU in my SLURM script) before, and it failed. And just now I've also added `set_scheduler!(:serial)`; the same error appears. This problem is indeed very strange...
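To double-check which thread configuration the job actually sees on the compute node, one could add a small diagnostic like the following to the submitted script (an illustrative sketch, not part of the original discussion):

```julia
using LinearAlgebra

# Print the threading configuration visible inside the SLURM job.
println("Julia threads:       ", Threads.nthreads())
println("BLAS threads:        ", BLAS.get_num_threads())
println("CPU threads on node: ", Sys.CPU_THREADS)
```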
Are the error messages in all these cases the same?
Yes, in all cases the relevant part of the error messages is the same.
To give some context: I have run a couple of simulations on an HPC environment, also using SLURM, and I have not yet encountered these errors. I guess that makes the problem even stranger. Given that the multi-threaded `rrule` doesn't seem to be robust, maybe we should disable it by default and leave the multi-threaded reverse pass as an experimental feature? @lkdvos (Also, I wanted to perform some benchmarks on how much the multi-threading really improves performance in the backwards pass, because that is not yet super clear to me.)
The problem with just disabling that is that the error persists even when not run in a multithreaded environment, which is really strange to me. It should be somewhat easy to try replacing the threaded map calls with plain `map` to check if that really fixes anything?
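As an illustration of what both suggestions could look like, a hypothetical sketch; the names `threaded_rrule` and `_maybe_tmap` are illustrative and not actual package API, and `tmap` is the threaded map from OhMyThreads.jl:

```julia
using OhMyThreads: tmap

# Hypothetical opt-in switch: plain serial `map` by default, threaded
# `tmap` only when the experimental feature is explicitly enabled.
const threaded_rrule = Ref(false)

_maybe_tmap(f, xs) = threaded_rrule[] ? tmap(f, xs) : map(f, xs)
```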
So you mean replacing all calls to …?
Honestly, I would suggest trying both to see if it changes anything. It's a bit hard for me to do anything specific, since I can't seem to reproduce it, so it's basically a bit of whack-a-mole here.
Unfortunately, I also can't reproduce it on the HPC environment I have access to. Another thing to try might be to play around with the SLURM setup? For comparison, here's my rather simple batch script:

[batch script not preserved in this excerpt]

Note that I have installed Julia using … One reason there could be a problem with the SLURM setup is that apparently the example code runs fine on the front-end node, where I presume the code is just launched from the REPL or by invoking julia directly.
The essential difference between your script and mine is the line quoted above. I've just tested by including it; the same error message appears...
Dear all,
A simple example code like this one runs perfectly on my Mac, and also on the home node of a cluster when run directly. But when I submit it via the SLURM system to a compute node, errors are encountered soon after several steps of the calculation.
Note that I've been using SLURM to submit jobs involving MPSKit.jl for a long time, and that works perfectly.
The error output file is attached (output.txt); its first several lines are shown below (the whole file is pretty long). The problem does not seem to be a conventional one like out-of-memory. Any help is greatly appreciated!
Best,
Junsen