
errors when running on a cluster #103

Open
phyjswang opened this issue Dec 5, 2024 · 12 comments
Comments

@phyjswang

Dear all,

A simple example code like this one runs perfectly on my Mac, and also when run directly on the cluster's login node. But when I submit it to a compute node via the SLURM system, errors appear soon after a few steps of the calculation.

Note that I've been using SLURM to submit jobs involving MPSKit.jl for a long time, and that has always worked perfectly.

The error output file is attached as output.txt, and its first several lines are shown below (the whole file is pretty long). The problem does not seem to be a conventional one like running out of memory.

Any help is greatly appreciated!

Best,
Junsen

  Activating project at `~/projects/2dising`
[ Info: CTMRG init:	obj = +1.533415612321e+00 +1.193497286278e+00im	err = 1.0000e+00
[ Info: CTMRG conv 33:	obj = +7.625582516626e+00	err = 4.7298809634e-09	time = 1.32 min
[ Info: CTMRG init:	obj = +7.625582516626e+00	err = 1.0000e+00
[ Info: CTMRG conv 4:	obj = +7.625582516626e+00	err = 2.0314190187e-10	time = 8.10 sec
ERROR: LoadError: TaskFailedException
Stacktrace:
  [1] wait(t::Task)
    @ Base ./task.jl:370
  [2] fetch
    @ ./task.jl:390 [inlined]
  [3] fetch
    @ ~/.julia/packages/StableTasks/3CrzR/src/internals.jl:9 [inlined]
  [4] mapreduce_first
    @ ./reduce.jl:421 [inlined]
  [5] _mapreduce(f::typeof(fetch), op::typeof(BangBang.append!!), ::IndexLinear, A::Vector{StableTasks.StableTask{Any}})
    @ Base ./reduce.jl:432
  [6] _mapreduce_dim(f::Function, op::Function, ::Base._InitialValue, A::Vector{StableTasks.StableTask{Any}}, ::Colon)
    @ Base ./reducedim.jl:337
  [7] mapreduce(f::Function, op::Function, A::Vector{StableTasks.StableTask{Any}})
    @ Base ./reducedim.jl:329
  [8] _tmapreduce(f::Function, op::Function, Arrs::Tuple{Vector{UnitRange{Int64}}}, ::Type{Any}, scheduler::OhMyThreads.Schedulers.DynamicScheduler{OhMyThreads.Schedulers.FixedCount, ChunkSplitters.Consecutive}, mapreduce_kwargs::@NamedTuple{})
    @ OhMyThreads.Implementation ~/.julia/packages/OhMyThreads/eiaNP/src/implementation.jl:113
  [9] #tmapreduce#22
    @ ~/.julia/packages/OhMyThreads/eiaNP/src/implementation.jl:85 [inlined]
 [10] tmapreduce
    @ ~/.julia/packages/OhMyThreads/eiaNP/src/implementation.jl:69 [inlined]
 [11] _tmap(scheduler::OhMyThreads.Schedulers.DynamicScheduler{OhMyThreads.Schedulers.FixedCount, ChunkSplitters.Consecutive}, f::Function, A::Vector{Tuple{ComplexF64, Zygote.var"#ad_pullback#61"{Tuple{PEPSKit.var"#302#303"{InfinitePEPS{TrivialTensorMap{ComplexSpace, 1, 4, Matrix{ComplexF64}}}, CTMRGEnv{TrivialTensorMap{ComplexSpace, 1, 1, Matrix{ComplexF64}}, TrivialTensorMap{ComplexSpace, 3, 1, Matrix{ComplexF64}}}}, Pair{Tuple{CartesianIndex{2}, Vararg{CartesianIndex{2}}}, TrivialTensorMap{ComplexSpace, N₁, N₂, Matrix{ComplexF64}} where {N₁, N₂}}}}}}, _Arrs::FillArrays.Fill{Float64, 1, Tuple{Base.OneTo{Int64}}})
    @ OhMyThreads.Implementation ~/.julia/packages/OhMyThreads/eiaNP/src/implementation.jl:451
 [12] #tmap#102
    @ ~/.julia/packages/OhMyThreads/eiaNP/src/implementation.jl:373 [inlined]
 [13] tmap
    @ ~/.julia/packages/OhMyThreads/eiaNP/src/implementation.jl:337 [inlined]
 [14] (::PEPSKit.var"#dtmap_pullback#22"{@Kwargs{}, Vector{typeof(identity)}, typeof(identity), Vector{Tuple{ComplexF64, Zygote.var"#ad_pullback#61"{Tuple{PEPSKit.var"#302#303"{InfinitePEPS{TrivialTensorMap{ComplexSpace, 1, 4, Matrix{ComplexF64}}}, CTMRGEnv{TrivialTensorMap{ComplexSpace, 1, 1, Matrix{ComplexF64}}, TrivialTensorMap{ComplexSpace, 3, 1, Matrix{ComplexF64}}}}, Pair{Tuple{CartesianIndex{2}, Vararg{CartesianIndex{2}}}, TrivialTensorMap{ComplexSpace, N₁, N₂, Matrix{ComplexF64}} where {N₁, N₂}}}}}}})(dy_raw::FillArrays.Fill{Float64, 1, Tuple{Base.OneTo{Int64}}})
    @ PEPSKit ~/.julia/packages/PEPSKit/EZBsW/src/utility/diffable_threads.jl:25
 [15] ZBack
    @ ~/.julia/packages/Zygote/nyzjS/src/compiler/chainrules.jl:212 [inlined]
 [16] expectation_value
    @ ~/.julia/packages/PEPSKit/EZBsW/src/algorithms/toolbox.jl:3 [inlined]
@Yue-Zhengyuan
Collaborator

It may be helpful to include the shell script you used to submit the job and the output/error file generated by SLURM.

@phyjswang
Author

It may be helpful to include the shell script you used to submit the job and the output/error file generated by SLURM.

Thank you for the comment!

The essential part of my SLURM script is the following:

slurmContent = "#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=$nthreads
#SBATCH --ntasks-per-node=1
#SBATCH --time=$ndays-0
#SBATCH --mem=$(nmem)G
#SBATCH --partition=$partitionName
#SBATCH --open-mode=truncate

export JULIA_PKG_OFFLINE=true
export JULIA_NUM_PRECOMPILE_TASKS=1

echo \"job started at:\"
date
echo \"---------------------------\"
/public/home/acpyi5g3o3/Downloads/julia-1.11.1/bin/julia -t $nthreads --compiled-modules=no --heap-size-hint=$(mem_hint)G $runfn $cb $ce $g $h"
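(For completeness: this string is built in a Julia helper script and then written to a file and submitted, roughly as sketched below; the file name submit.sh is just an example, and nthreads etc. are defined earlier in that helper.)

# Sketch: write the generated batch script to disk and hand it to sbatch.
open("submit.sh", "w") do io
    write(io, slurmContent)
end
run(`sbatch submit.sh`)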

The output/error file generated by SLURM is attached in my first comment.

I've tried different clusters and different Julia versions, but the same error consistently appears.
According to the output file, it seems to be related to OhMyThreads or multithreading. However, I've also tried different numbers of CPUs, including both multiple threads and a single thread. All attempts failed.

@lkdvos
Member

lkdvos commented Dec 5, 2024

I had a look at the error, and it does indeed seem to originate in our implementation of the rrule for the parallel map. To be honest, I'm really confused why it would work locally but not on a cluster, and the stack trace really doesn't give too much information...

One thing you could try is to disable the threading there by manually selecting the SerialScheduler (which should already happen automatically if you only have a single thread):

set_scheduler!(:serial)

and see whether the problem persists.
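In case it helps, a minimal sketch of where that call would go in the driver script (assuming PEPSKit is loaded as usual; everything else stays the same):

using PEPSKit
set_scheduler!(:serial)  # select the serial scheduler before any CTMRG/optimization calls
# ... rest of the script unchanged ...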

Just for reference, the actually relevant part of the stacktrace is the error within the tasks, which is

    nested task error: MethodError: no method matching ChainRulesCore.ZeroTangent()
    The applicable method may be too new: running in world age 47133, while current world is 67051.
    
    Closest candidates are:
      ChainRulesCore.ZeroTangent() (method too new to be called from this world context.)
       @ ChainRulesCore ~/.julia/packages/ChainRulesCore/6Pucz/src/tangent_types/abstract_zero.jl:58

It seems like the way we are (ab)using Zygote and chainrules along with the multithreading is not super robust, but honestly I have no clue what the cause of that could be...
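(For readers unfamiliar with this kind of message: a world-age error means a method was defined in a newer "world" than the code calling it. A minimal, unrelated Julia example that produces the same kind of error:)

# Minimal illustration of a world-age error; nothing to do with PEPSKit itself.
function call_fresh_method()
    @eval newmethod() = 42   # defines a method in a newer world age
    return newmethod()       # called from the older world -> MethodError
end
call_fresh_method()
# ERROR: MethodError: no method matching newmethod()
# The applicable method may be too new: running in world age ..., while current world is ...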

@phyjswang
Author

Thank you @lkdvos for the comment!

I've tested the case with a single thread (by explicitly requesting a single CPU in my SLURM script) before, and it failed.

And just now I've also added set_scheduler!(:serial) to my main code, which failed again.

This problem is indeed very strange...

@lkdvos
Member

lkdvos commented Dec 5, 2024

Are the error messages in all these cases the same?

@phyjswang
Author

Yes, in all cases the relevant part of the error message is the same.

@pbrehmer
Collaborator

pbrehmer commented Dec 6, 2024

To give some context: I have run a couple of simulations in an HPC environment, also using SLURM, and I have not yet encountered these errors. I guess that makes the problem even stranger.

Given that the multi-threaded rrule doesn't seem to be robust, maybe we should disable that by default and leave the multi-threaded reverse pass as an experimental feature? @lkdvos

(Also, I wanted to benchmark how much the multi-threading really improves performance in the backward pass, because that is not yet super clear to me.)

@lkdvos
Member

lkdvos commented Dec 6, 2024

The problem with just disabling that is that the error persists even when not run in a multithreaded environment, which is really strange to me. It should be somewhat easy to try replacing the calls with map to check whether that really fixes anything.
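(For anyone who wants to try that: a rough sketch of getting an editable copy of PEPSKit so the threaded calls can be swapped out locally; the relevant file is src/utility/diffable_threads.jl, as in the stack trace above.)

using Pkg
Pkg.develop("PEPSKit")   # editable checkout, by default under ~/.julia/dev/PEPSKit
# Then, in ~/.julia/dev/PEPSKit/src/utility/diffable_threads.jl, temporarily replace
# the threaded tmap/dtmap calls with plain map and rerun the failing job.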

@pbrehmer
Collaborator

pbrehmer commented Dec 6, 2024

So you mean replacing all calls to dtmap, or replacing tmap in the rrule with map? I suspect there is some weird interplay between how OhMyThreads handles threads (even in :serial mode?) and how HPC environments configure multi-threading.

@lkdvos
Member

lkdvos commented Dec 6, 2024

Honestly, I would suggest trying both to see if it changes anything. It's a bit hard for me to do anything specific, since I can't seem to reproduce it, so it's basically a bit of whack-a-mole here.

@pbrehmer
Collaborator

pbrehmer commented Dec 6, 2024

Unfortunately, I also can't reproduce it in the HPC environment I have access to. Another thing to try might be to play around with the SLURM setup. For comparison, here's my rather simple batch script:

#!/bin/bash

#SBATCH --job-name=job_name
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1  # Disable hyperthreading
#SBATCH --cpus-per-task=$n_cores
#SBATCH --output="/path/to/output/output.log"

export PATH="\$PATH:/home/username/.juliaup/bin"

export JULIA_NUM_THREADS=16
export OPENBLAS_NUM_THREADS=4

cd /path/to/repo/
julia --project "/path/to/script/script.jl"

Note that I have installed Julia using juliaup, so I add the corresponding binary directory to PATH.
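As a sanity check, something like the following at the top of the Julia script (a sketch, nothing PEPSKit-specific) prints the threading configuration that actually reaches the compute node:

using LinearAlgebra
@info "Threading setup" Threads.nthreads() BLAS.get_num_threads() get(ENV, "SLURM_CPUS_PER_TASK", "unset")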

One reason the SLURM setup could be the problem is that the example code apparently runs fine on the front-end node, where I presume it is just launched from the REPL or with julia --threads=... example_code.jl, right?

@phyjswang
Author

#SBATCH --ntasks-per-core=1 # Disable hyperthreading

The essential difference between your script and mine is the line above. I've just tested with that line included, and the same error message appears...
