CPUSummary.jl v0.1.14 breaks CI of Trixi.jl on skylake-avx512 #6

Open
ranocha opened this issue Mar 10, 2022 · 8 comments

Comments

@ranocha
Member

ranocha commented Mar 10, 2022

We observed some specific problems when going from CPUSummary.jl v0.1.8 to v0.1.14 at Trixi.jl. Everything is fine with the old version of CPUSummary.jl. CI also passes with the new version unless the GitHub CI runner happens to use LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512) (either ubuntu-latest or windows-latest).
I could reduce this problem at https://github.com/trixi-framework/TrixiDebug.jl. Using the latest version of CPUSummary.jl, CI fails on

Restricting CPUSummary.jl to v0.1.8 lets CI pass on

So far, we have not been able to reproduce this locally...

For context: We use some matrix multiplications based on matmul! from Octavian.jl. To me, it seems like these multiplications fail catastrophically, resulting in the errors shown in CI.

CC @sloede

@ranocha
Member Author

ranocha commented Mar 10, 2022

Additional information:

@chriselrod
Member

chriselrod commented Mar 10, 2022

Unfortunately, CPUSummary 0.1.8 did not work under wine (that is, it would segfault Julia as soon as you ran using CPUSummary, or using on any package that depends on it), and wine support was required by my employer, so reverting the changes is not an option.
The newer versions have been and continue to be mostly broken, but I'm not quite sure how to fix it.

I do have skylake-avx512 locally, so I probably just need to spend the time to figure out what is different in generic_topology.jl (which doesn't use hwloc) vs topology.jl and fix it, and perhaps also figure out why a misspecification causes packages like Octavian to get wrong answers.

@chriselrod
Member

chriselrod commented Mar 10, 2022

Unless you need to run Julia on wine, I suggest you pin CPUSummary 0.1.8.
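For readers who want to apply this workaround in their own environment, here is a minimal sketch using the standard Pkg API. The helper name pin_cpusummary is hypothetical and not part of any package; it assumes you want the pin in the currently active environment.

```julia
using Pkg

# Hypothetical helper (not part of CPUSummary.jl or Trixi.jl):
# install the given CPUSummary release into the active environment
# and pin it so that later `Pkg.update()` calls leave it alone.
function pin_cpusummary(version = "0.1.8")
    Pkg.add(Pkg.PackageSpec(name = "CPUSummary", version = version))
    Pkg.pin("CPUSummary")
end

# Usage (modifies the active environment, needs registry access):
# pin_cpusummary()
```

A package that depends on CPUSummary would more likely restrict the version through a [compat] entry in its Project.toml than through a pin; `Pkg.free("CPUSummary")` undoes the pin.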

@chriselrod
Member

chriselrod commented Mar 10, 2022

One problem is that my check for "will Hwloc segfault Julia or throw an error":

p = run(`$(Base.julia_cmd()) --project=$tmpd -e'using Pkg; Pkg.add("Hwloc"); using Hwloc; Hwloc.gettopology()'`, wait=false)
wait(p)
if p.exitcode == 0 && p.termsignal == 0

almost always returns a false positive, even though it passes when run from the REPL.
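One way to investigate a check like this is to run the child process synchronously and report its exit code and terminating signal, so that whatever the child prints also shows up in the parent's log. The sketch below is only a diagnostic aid under that assumption, not the check used in CPUSummary.jl, and it does not explain why the original check misfires; probe_subprocess is a hypothetical name.

```julia
# Hypothetical diagnostic helper, not part of CPUSummary.jl.
# Runs `cmd` to completion without throwing on a non-zero exit status,
# then reports how the child process ended.
function probe_subprocess(cmd::Cmd)
    # `ignorestatus` stops `run` from throwing when the child fails;
    # with the default `wait=true` the child shares the parent's stdout/stderr.
    p = run(ignorestatus(cmd))
    return (ok = p.exitcode == 0 && p.termsignal == 0,
            exitcode = p.exitcode,
            termsignal = p.termsignal)
end

# Example: a child Julia process that exits with status 3.
result = probe_subprocess(`$(Base.julia_cmd()) -e "exit(3)"`)
```

Here result.ok is false and result.exitcode is 3; a child killed by a signal would instead report a non-zero termsignal.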

@ranocha
Member Author

ranocha commented Mar 10, 2022

Okay, thanks!

> Unless you need to run Julia on wine, I suggest you pin CPUSummary 0.1.8.

Yeah, that's our current workaround at trixi-framework/Trixi.jl#1083

@ranocha
Member Author

ranocha commented Mar 10, 2022

If using Hwloc is the problem, it seems weird that our CI reports CPUSummary.USE_HWLOC = true for CPUSummary.jl v0.1.14 (and fails tests afterwards), see https://github.com/trixi-framework/TrixiDebug.jl/runs/5493893614?check_suite_focus=true#step:6:391.

@chriselrod
Member

chriselrod commented Mar 10, 2022

That will also generally be inaccurate.

julia> using CPUSummary

julia> CPUSummary.USE_HWLOC
true

julia> isdefined(CPUSummary, :safe_topology_load!)
false

This is a far more reliable check. safe_topology_load! is defined in the file included when using Hwloc, but not in the other.

Therefore, look at isdefined(CPUSummary, :safe_topology_load!) instead of USE_HWLOC.
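For CI logs, this check can be wrapped in a small helper. The following is a sketch only: topology_backend and its return symbols are hypothetical names, and the only fact taken from this thread is that safe_topology_load! is defined when the Hwloc-based file was included.

```julia
# Hypothetical helper, not part of CPUSummary.jl.
# Reports which topology file a loaded module appears to have included,
# based on whether `safe_topology_load!` is defined in it.
topology_backend(mod::Module) =
    isdefined(mod, :safe_topology_load!) ? :hwloc : :generic

# Usage in a CI step (assumes CPUSummary is installed):
# using CPUSummary
# @info "CPUSummary topology backend" backend = topology_backend(CPUSummary)
```

Taking the module as an argument keeps the helper usable without CPUSummary installed, for example when testing it against a stand-in module.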

ranocha added a commit to trixi-framework/Trixi.jl that referenced this issue Mar 10, 2022
@ranocha
Member Author

ranocha commented Mar 10, 2022
