hwloc can't bind beyond 8 threads on 16 core Threadripper 1950x #1678

Open
cryptonote-social opened this issue Jun 24, 2018 · 15 comments

@cryptonote-social

Processor is a threadripper 1950x, OS Ubuntu Linux 16.04 LTS. ulimits and hugepages appropriately configured.

If I try to assign more than 8 threads with xmr-stak, I get hwloc errors. The result is xmr-stak is slow on this machine, compared to xmrig which seems to work fine.

[2018-06-24 11:13:31] : Mining coin: monero7
[2018-06-24 11:13:31] : CPU configuration stored in file 'cpu.txt'
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 0.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 1.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 2.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 3.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 4.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 5.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 6.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 7.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 8.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 9.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 10.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 11.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 12.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 13.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 14.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 15.
[2018-06-24 11:13:31] : hwloc: can't bind memory

HASHRATE REPORT - CPU
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 72.8 | (na) | (na) | 1 | 73.0 | (na) | (na) |
| 2 | 73.0 | (na) | (na) | 3 | 72.9 | (na) | (na) |
| 4 | 72.6 | (na) | (na) | 5 | 72.9 | (na) | (na) |
| 6 | 72.6 | (na) | (na) | 7 | 72.9 | (na) | (na) |
| 8 | 65.8 | (na) | (na) | 9 | 65.8 | (na) | (na) |
| 10 | 65.8 | (na) | (na) | 11 | 65.8 | (na) | (na) |
| 12 | 65.8 | (na) | (na) | 13 | 65.8 | (na) | (na) |
| 14 | 65.8 | (na) | (na) | 15 | 65.8 | (na) | (na) |
Totals (CPU): 1109.0 0.0 0.0 H/s
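A minimal sketch for splitting the report above by thread group, assuming the table layout shown (two ID/10s/60s/15m groups per row). The helper name `parse_report` is made up for illustration and is not part of xmr-stak. It compares the threads whose memory was pinned (0 to 7) with the ones that logged "can't bind memory" (8 to 15).

```python
REPORT = """\
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 72.8 | (na) | (na) | 1 | 73.0 | (na) | (na) |
| 2 | 73.0 | (na) | (na) | 3 | 72.9 | (na) | (na) |
| 4 | 72.6 | (na) | (na) | 5 | 72.9 | (na) | (na) |
| 6 | 72.6 | (na) | (na) | 7 | 72.9 | (na) | (na) |
| 8 | 65.8 | (na) | (na) | 9 | 65.8 | (na) | (na) |
| 10 | 65.8 | (na) | (na) | 11 | 65.8 | (na) | (na) |
| 12 | 65.8 | (na) | (na) | 13 | 65.8 | (na) | (na) |
| 14 | 65.8 | (na) | (na) | 15 | 65.8 | (na) | (na) |
"""


def parse_report(text):
    """Return {thread_id: 10s rate} from a hashrate table like the one above."""
    rates = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Each row carries two (ID, 10s, 60s, 15m) groups side by side.
        for i in range(0, len(cells) - 3, 4):
            if cells[i].isdigit():  # skips the header row
                rates[int(cells[i])] = float(cells[i + 1])
    return rates


rates = parse_report(REPORT)
pinned = [rates[i] for i in range(0, 8)]     # "hwloc: memory pinned"
unpinned = [rates[i] for i in range(8, 16)]  # "hwloc: can't bind memory"

avg_pinned = sum(pinned) / len(pinned)
avg_unpinned = sum(unpinned) / len(unpinned)
total = sum(rates.values())

print(f"pinned avg:   {avg_pinned:.1f} H/s")
print(f"unpinned avg: {avg_unpinned:.1f} H/s")
print(f"total:        {total:.1f} H/s")
```

The per-thread sum lands within rounding of the reported total, and the unpinned threads come out roughly 7 H/s slower each.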

Compare this to xmrig which has no issues with memory binding and gets 200 more h/s:

[2018-06-24 11:14:49] READY (CPU) threads 16(16) huge pages 16/16 100% memory 32.0 MB
| THREAD | AFFINITY | 10s H/s | 60s H/s | 15m H/s |
| 0 | -1 | 81.2 | n/a | n/a |
| 1 | -1 | 81.2 | n/a | n/a |
| 2 | -1 | 82.7 | n/a | n/a |
| 3 | -1 | 83.2 | n/a | n/a |
| 4 | -1 | 81.2 | n/a | n/a |
| 5 | -1 | 83.2 | n/a | n/a |
| 6 | -1 | 81.2 | n/a | n/a |
| 7 | -1 | 82.7 | n/a | n/a |
| 8 | -1 | 82.7 | n/a | n/a |
| 9 | -1 | 82.4 | n/a | n/a |
| 10 | -1 | 81.2 | n/a | n/a |
| 11 | -1 | 81.2 | n/a | n/a |
| 12 | -1 | 83.2 | n/a | n/a |
| 13 | -1 | 83.2 | n/a | n/a |
| 14 | -1 | 81.2 | n/a | n/a |
| 15 | -1 | 81.2 | n/a | n/a |
[2018-06-24 11:15:08] speed 10s/60s/15m 1312.9 n/a n/a H/s max 1313.0 H/s
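For scale, a quick calculation of the gap between the two 10s totals quoted above. The function name `relative_gap` is just for illustration.

```python
def relative_gap(baseline, other):
    """Return (absolute gap, gap as a percentage of baseline)."""
    gap = other - baseline
    return gap, 100.0 * gap / baseline


xmr_stak_total = 1109.0  # "Totals (CPU)" from the xmr-stak report
xmrig_total = 1312.9     # 10s speed from the xmrig log line

gap, gap_pct = relative_gap(xmr_stak_total, xmrig_total)
print(f"xmrig is ahead by {gap:.1f} H/s ({gap_pct:.1f}% of the xmr-stak total)")
```

So the "200 more h/s" is a gap of a bit over 18% on this algorithm.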

The difference is far more stark on cn-lite, where xmrig is over 4000 H/s and xmr-stak barely cracks 2000 H/s no matter what I try.

I've set my hugepages and various limits appropriately. Here is my lstopo output in case it helps?

lstopo --of console
Machine (31GB total) + Package L#0
NUMANode L#0 (P#0 31GB)
L3 L#0 (8192KB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
L3 L#1 (8192KB)
L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
PCIBridge
PCI 1022:43b6
Block(Disk) L#0 "sda"
Block(Disk) L#1 "sdb"
Block(Removable Media Device) L#2 "sr0"
PCIBridge
PCIBridge
PCI 8086:1539
Net L#3 "enp5s0"
PCIBridge
PCI 1002:67df
GPU L#4 "renderD128"
GPU L#5 "card0"
GPU L#6 "controlD64"
CoProc L#7 "opencl0d0"
PCIBridge
PCI 1002:67df
GPU L#8 "card1"
GPU L#9 "controlD65"
GPU L#10 "renderD129"
CoProc L#11 "opencl0d1"
PCIBridge
PCI 1022:7901
NUMANode L#1 (P#1)
L3 L#2 (8192KB)
L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#24)
L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#25)
L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#26)
L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#27)
L3 L#3 (8192KB)
L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (64KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#28)
L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (64KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#29)
L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (64KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#30)
L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (64KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#31)
HostBridge L#7
PCIBridge
PCI 1002:67df
GPU L#12 "card2"
GPU L#13 "controlD66"
GPU L#14 "renderD130"
CoProc L#15 "opencl0d2"
PCIBridge
PCI 1022:7901
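One thing worth noticing in the output above: NUMANode L#0 reports 31GB, while NUMANode L#1 lists no memory size at all, and the PUs under it start at P#8. That lines up with the threads that logged "can't bind memory". Below is a rough sketch, assuming the console format shown here, that groups PUs by NUMA node and flags nodes without memory. The sample text is abridged to one core per node, and `numa_layout` is a hypothetical helper, not an hwloc API.

```python
import re

# Abridged from the lstopo output above (one core per NUMA node).
LSTOPO = """\
Machine (31GB total) + Package L#0
NUMANode L#0 (P#0 31GB)
L3 L#0 (8192KB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
NUMANode L#1 (P#1)
L3 L#2 (8192KB)
L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#24)
"""

# The memory size after the physical index is optional in this output.
NUMA_RE = re.compile(r"^NUMANode L#\d+ \(P#(\d+)(?: ([^)\s]+))?\)")
PU_RE = re.compile(r"^PU L#\d+ \(P#(\d+)\)")


def numa_layout(text):
    """Map NUMA node physical index -> {"memory": str or None, "pus": [P#...]}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = NUMA_RE.match(line)
        if m:
            current = int(m.group(1))
            nodes[current] = {"memory": m.group(2), "pus": []}
            continue
        m = PU_RE.match(line)
        if m and current is not None:
            nodes[current]["pus"].append(int(m.group(1)))
    return nodes


layout = numa_layout(LSTOPO)
for node, info in sorted(layout.items()):
    status = info["memory"] or "NO LOCAL MEMORY"
    print(f"NUMA node {node}: {status}, PUs {info['pus']}")
```

A node that shows up without memory means there is nothing local to bind to for the cores under it, which would be consistent with the binding failures on affinity 8 and up.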

@psychocrypt
Collaborator

psychocrypt commented Jun 24, 2018 via email

@psychocrypt
Collaborator

psychocrypt commented Jun 24, 2018 via email

@Spudz76
Contributor

Spudz76 commented Jun 24, 2018

Also there is default affinity for all cores in the other miner app (-1)
So why bother setting it in xmr-stak?
Try compiling without hwloc; I doubt the other miner app uses it. If you don't have multiple memory channels or dual physical CPUs (not just cores in one CPU), then hwloc makes basically no difference.

Be sure to compile on the rig it will run on, and do not use generic, so that the cpu backend gets optimized for it (-mcpu=native). I also get a boost of several H/s on some machines by using clang 3.8 or newer as the compiler versus any gcc revision before 6. Perhaps the other miners are built with clang, and that accounts for the missing 10 H/s per core.

There is also a custom clang from AMD, but I only have some ancient tri-core AMD which it doesn't really help with. However, it may be exactly what you want to try; it should optimize even better than the generic compilers specifically for newer AMD CPUs. I installed it and built with it just fine, it just didn't make any difference on an antique CPU.

@cryptonote-social
Author

cryptonote-social commented Jun 24, 2018

I've tried all kinds of configs (including no affinity), and they are all much slower than what I get from xmrig. I also tried compiling without hwloc -- again still slower. The config from above is the one generated automatically by xmr-stak which I thought would be most relevant. It appears below. I also compiled from source on the same rig, and it appears mcpu=native should have been on by default as I didn't specify a generic build.

Sounds like my RAM might be installed improperly though? I have two 16 GB DIMMs. I'll try moving one into a different slot I guess.

[
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 7 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 8 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 9 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 10 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 15 },

],

@cryptonote-social
Author

cryptonote-social commented Jun 24, 2018

Here are my speeds running with affine_to_cpu: false for all threads (similar to turning off hwloc in compilation)

HASHRATE REPORT - CPU
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 65.2 | (na) | (na) | 1 | 65.2 | (na) | (na) |
| 2 | 74.4 | (na) | (na) | 3 | 73.3 | (na) | (na) |
| 4 | 65.2 | (na) | (na) | 5 | 65.2 | (na) | (na) |
| 6 | 73.1 | (na) | (na) | 7 | 65.2 | (na) | (na) |
| 8 | 73.3 | (na) | (na) | 9 | 65.2 | (na) | (na) |
| 10 | 65.2 | (na) | (na) | 11 | 74.4 | (na) | (na) |
| 12 | 73.3 | (na) | (na) | 13 | 74.4 | (na) | (na) |
| 14 | 65.2 | (na) | (na) | 15 | 74.4 | (na) | (na) |
Totals (CPU): 1112.1 0.0 0.0 H/s

Totals (ALL): 1112.1 0.0 0.0 H/s
Highest: 1122.9 H/s

@cryptonote-social
Author

Update: I fixed the RAM issue and now have 16gb per channel according to lstopo, and now hwloc is able to pin all 16 threads, but the speed is little changed. So I guess hwloc isn't the issue at all. Not sure what else it could be?

[2018-06-24 17:01:38] : Mining coin: monero7
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 0.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 1.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 2.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 3.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 4.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 5.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 6.
[2018-06-24 17:01:38] : hwloc: memory pinned
[2018-06-24 17:01:38] : Starting 1x thread, affinity: 7.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 8.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 9.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 10.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 11.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 12.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 13.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 14.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Starting 1x thread, affinity: 15.
[2018-06-24 17:01:39] : hwloc: memory pinned
[2018-06-24 17:01:39] : Fast-connecting to bigmac:2222 pool ...
[2018-06-24 17:01:39] : Pool bigmac:2222 connected. Logging in...
[2018-06-24 17:01:39] : Difficulty changed. Now: 15000.
[2018-06-24 17:01:39] : Pool logged in.
[2018-06-24 17:01:43] : Result accepted by the pool.
HASHRATE REPORT - CPU
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 71.4 | (na) | (na) | 1 | 71.6 | (na) | (na) |
| 2 | 71.5 | (na) | (na) | 3 | 71.4 | (na) | (na) |
| 4 | 70.5 | (na) | (na) | 5 | 70.5 | (na) | (na) |
| 6 | 70.5 | (na) | (na) | 7 | 70.6 | (na) | (na) |
| 8 | 75.2 | (na) | (na) | 9 | 75.2 | (na) | (na) |
| 10 | 75.2 | (na) | (na) | 11 | 75.2 | (na) | (na) |
| 12 | 71.5 | (na) | (na) | 13 | 71.5 | (na) | (na) |
| 14 | 71.5 | (na) | (na) | 15 | 71.5 | (na) | (na) |
Totals (CPU): 1154.8 0.0 0.0 H/s

@cryptonote-social
Author

oops didn't mean to close issue.

@psychocrypt
Collaborator

psychocrypt commented Jun 25, 2018 via email

@cryptonote-social
Author

OK, I'll play with it a bit more when I get a chance. I tried low-power-mode == true and stuff was pretty different but still a bit of a mess.

I guess I thought there might be some bug in xmr-stak seeing as its default CPU configs have always been spot on and super fast for all my other (dozen or so) CPUs. Seems it just struggles with the Threadrippers I guess?

Appreciate the help. Feel free to close this out. If I have any updates I'll come back & post 'em.

@psychocrypt
Collaborator

psychocrypt commented Jun 25, 2018 via email

@Spudz76
Contributor

Spudz76 commented Jun 25, 2018

When a CPU has a pretty even 2MB cache per core, then it works out nice with the defaults and single thread (because each thread likes 2MB).
You have 16 cores and 40MB of cache, so you need to run 40/2=20 threads on 16 cores for full cache usage. Add four more threads with no affinity. Also try with prefetch on (no_prefetch:false) to see if it likes that (Intel CPUs usually don't).

[
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 7 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 8 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 9 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 10 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 15 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : false },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : false },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : false },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : false },

],

Such as that. I have seen the missing hashes come back by doing that on some over-cached Intels, and even some under-cached ones. I do not know why it works; it makes little sense, other than maybe the unbound threads roam around the other cores and keep them full of work (fill in the gaps). The rates for cores will drift on the display though (not at the pool side) as the threads migrate around.
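The cache arithmetic and the resulting thread list can be sketched as a small generator. This assumes 2 MB of cache per thread as stated above; `thread_count` and `build_threads` are illustrative names, and the output simply mimics the entry format used in this thread.

```python
def thread_count(cache_mb, scratchpad_mb=2):
    """Number of threads that fit in cache at scratchpad_mb each."""
    return cache_mb // scratchpad_mb


def build_threads(cores, total_threads):
    """One pinned entry per core, then unpinned entries for the remainder."""
    entries = []
    for i in range(total_threads):
        # Threads beyond the core count get no affinity.
        affinity = str(i) if i < cores else "false"
        entries.append(
            '{ "low_power_mode" : false, "no_prefetch" : true, '
            '"affine_to_cpu" : %s },' % affinity
        )
    return entries


threads = build_threads(cores=16, total_threads=thread_count(40))
print("[")
for entry in threads:
    print(entry)
print("],")
```

With 40 MB of cache this gives 20 entries: 16 pinned to cores 0 through 15 and 4 left unpinned, matching the list above.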

There was an unaccepted patch (that no longer applies cleanly) that generated low_power_mode settings all the way up to 100's of threads and that actually got me a good +18H/s total out of an Intel CPU. Having a much longer stack of threads lets the various internal CPU prefetchers actually work (they do nothing when work is chopped into tiny little molecular runs). Normal low_power_mode only goes up to 5 and may not hit a sweet spot.

@Spudz76
Contributor

Spudz76 commented Jun 25, 2018

That PR #1604 is the one with massive threading, I will see if I can rebase to current, or get the author to do so. We had only tested Intels, but I have a feeling the Ryzen would benefit from it even more.

I also run that patch on the few Intels it helped, so I need it updated anyway or I'm stuck on the older build where it applies cleanly. I think it's worth accepting the PR once it's cleaned up nicely, perhaps a CMake flag to expand the threads (it takes much longer to build all those permutations of the CPU-CN-kernel, and most users don't need it).

@cryptonote-social
Author

Sorry haven't had a chance to poke around on this any further just yet ... but I don't think it's a Ryzen issue, as xmr-stak loves my Ryzen 2700 and Ryzen 2700x... it's only this processor where I seem to have performance issues.

@JerichoJones

@cryptonote-social Have you tried some of the suggestions?

@miningnome

Here's my config, and I got more than 6k H/s

"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 8 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 10 },
{ "low_power_mode" : true, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 1 },
{ "low_power_mode" : true, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 3 },
{ "low_power_mode" : true, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 5 },
{ "low_power_mode" : true, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 7 },
{ "low_power_mode" : true, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 9 },
{ "low_power_mode" : true, "no_prefetch" : true, "asm" : "auto", "affine_to_cpu" : 11 },
],
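A rough way to read a config like this one is to count how many hashes are in flight and how much cache that implies. The sketch below assumes that low_power_mode true means two hashes per thread and a number N means N hashes per thread, and it borrows the 2 MB per hash figure mentioned earlier in the thread. The post does not say which algorithm this config was run on, so the footprint number is illustrative only.

```python
def hashes_per_thread(low_power_mode):
    """false -> 1 hash, true -> 2 hashes, integer N -> N hashes (assumed)."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(low_power_mode, bool):
        return 2 if low_power_mode else 1
    return int(low_power_mode)


# low_power_mode values from the config above: six false, then six true.
modes = [False] * 6 + [True] * 6

SCRATCHPAD_MB = 2  # assumption taken from the "each thread likes 2MB" comment

in_flight = sum(hashes_per_thread(m) for m in modes)
footprint_mb = in_flight * SCRATCHPAD_MB

print(f"{len(modes)} threads, {in_flight} hashes in flight, "
      f"about {footprint_mb} MB of scratchpad")
```

Under those assumptions, 12 threads carry 18 hashes at once, which is a different way of filling the cache than adding unpinned threads.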

