hwloc can't bind beyond 8 threads on 16 core Threadripper 1950x #1678
Please post your cpu.txt config.
|
All of your RAM is located on one channel. Your second logical
socket (NUMA node) has no RAM module. This also means that your PC is
slower than it could be.
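One way to confirm this on Linux without lstopo is to read the per-node memory totals from sysfs. This is only a minimal sketch: it assumes the usual `/sys/devices/system/node/nodeN/meminfo` layout, and the function names are made up for illustration.

```python
import re
from pathlib import Path


def numa_node_memory_kb(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: MemTotal in kB} for every NUMA node found under sysfs_root."""
    totals = {}
    for meminfo in sorted(Path(sysfs_root).glob("node[0-9]*/meminfo")):
        node_id = int(meminfo.parent.name[len("node"):])
        # Lines look like: "Node 0 MemTotal:       16336500 kB"
        match = re.search(r"MemTotal:\s+(\d+)\s*kB", meminfo.read_text())
        totals[node_id] = int(match.group(1)) if match else 0
    return totals


def nodes_without_memory(totals):
    """Return the sorted ids of NUMA nodes that report no local memory."""
    return sorted(node for node, kb in totals.items() if kb == 0)


if __name__ == "__main__":
    totals = numa_node_memory_kb()
    for node, kb in sorted(totals.items()):
        print(f"node {node}: {kb / 1024 / 1024:.1f} GiB")
    print("nodes without memory:", nodes_without_memory(totals))
```

If a node shows up in the last line, threads pinned to its cores have no local RAM to bind to, which would match the "can't bind memory" messages.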
|
Also, the other miner app uses the default affinity (-1) for all cores. Be sure to compile on the rig it will run on, and do not use a generic build, so that the CPU backend gets optimized for it. There is also a custom clang from AMD, but I only have an ancient tri-core AMD, which it doesn't really help with. However, it may be exactly what you want to try: it should optimize better than the generic compilers, specifically for newer AMD CPUs. I installed it and built with it just fine; it just didn't make any difference on my antique CPU. |
I've tried all kinds of configs (including no affinity), and they are all much slower than what I get from xmrig. I also tried compiling without hwloc -- again still slower. The config from above is the one generated automatically by xmr-stak which I thought would be most relevant. It appears below. I also compiled from source on the same rig, and it appears mcpu=native should have been on by default as I didn't specify a generic build. Sounds like my RAM might be installed improperly though? I have two 16 GB DIMMs. I'll try moving one into a different slot I guess. [ ], |
Here are my speeds running with affine_to_cpu: false for all threads (similar to turning off hwloc at compile time): HASHRATE REPORT - CPU
|
Update: I fixed the RAM issue and now have 16 GB per channel according to lstopo. hwloc is now able to pin all 16 threads, but the speed has barely changed, so I guess hwloc isn't the issue at all. Not sure what else it could be? [2018-06-24 17:01:38] : Mining coin: monero7 |
oops didn't mean to close issue. |
You should play with the low_power_mode option. You can set a value there,
e.g. 2, 3 or 4. This option allows you to run more than one hash per
thread/core.
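For anyone who wants to try a few combinations quickly, here is a rough sketch that prints a `cpu_threads_conf` block to paste into cpu.txt. It assumes the three per-thread fields xmr-stak writes into the generated cpu.txt (`low_power_mode`, `no_prefetch`, `affine_to_cpu`); the helper name and the example values are made up and not tuned for any CPU.

```python
import json


def cpu_threads_conf(affinities, low_power_mode=False, no_prefetch=True):
    """Build a cpu_threads_conf block with one entry per CPU id in `affinities`.

    low_power_mode may be a boolean or an integer number of hashes per thread.
    """
    entries = [
        {
            "low_power_mode": low_power_mode,
            "no_prefetch": no_prefetch,
            "affine_to_cpu": cpu,
        }
        for cpu in affinities
    ]
    # cpu.txt uses a relaxed JSON dialect, so a trailing comma after each entry is kept.
    body = "\n".join("    " + json.dumps(entry) + "," for entry in entries)
    return '"cpu_threads_conf" :\n[\n' + body + "\n],"


if __name__ == "__main__":
    # Illustrative only: 8 threads on CPUs 0-7, two hashes per thread.
    print(cpu_threads_conf(range(8), low_power_mode=2))
```

Fewer threads with a higher low_power_mode keeps the same number of hashes in flight, so it is worth comparing e.g. 16 threads at 1 hash against 8 threads at 2.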
|
OK, I'll play with it a bit more when I get a chance. I tried low-power-mode == true and stuff was pretty different but still a bit of a mess. I guess I thought there might be some bug in xmr-stak seeing as its default CPU configs have always been spot on and super fast for all my other (dozen or so) CPUs. Seems it just struggles with the Threadrippers I guess? Appreciate the help. Feel free to close this out. If I have any updates I'll come back & post 'em. |
Did you also try low_power_mode 4 with fewer threads? Please also have a
look at
xmrstak.com, where you can find configs from other users.
|
When a CPU has a pretty even 2MB cache per core, then it works out nice with the defaults and single thread (because each thread likes 2MB).
Such as that. I have seen the missing hashes come back by doing that on some over-cached Intels, and even on some under-cached ones. I do not know why it works; it makes little sense, other than that maybe the unbound threads roam around the other cores and keep them full of work (filling in the gaps). The per-core rates will drift on the display though (not at the pool side) as the threads migrate around. There was an unaccepted patch (that no longer applies cleanly) that generated low_power_mode settings all the way up to hundreds of threads, and that actually got me a good +18 H/s total out of an Intel CPU. Having a much longer stack of threads lets the various internal CPU prefetchers actually work (they do nothing when work is chopped into tiny little molecular runs). Normal low_power_mode only goes up to 5 and may not hit a sweet spot. |
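The 2 MB-per-thread rule of thumb is easy to check with a little arithmetic. A minimal sketch, assuming CryptoNight's 2 MiB scratchpad per hash (1 MiB for CryptoNight-Lite) and the four 8192KB L3 slices shown in the lstopo output in this issue; the function and dictionary names are made up for illustration.

```python
# Scratchpad size per hash in KB (assumption: standard CryptoNight variants).
SCRATCHPAD_KB = {"cryptonight": 2048, "cryptonight_lite": 1024}


def hashes_fitting_in_l3(l3_sizes_kb, algo="cryptonight"):
    """Count how many scratchpads fit in L3, slice by slice.

    Each L3 slice is counted separately because a scratchpad cannot
    straddle two slices without losing locality.
    """
    per_hash = SCRATCHPAD_KB[algo]
    return sum(size // per_hash for size in l3_sizes_kb)


if __name__ == "__main__":
    l3 = [8192, 8192, 8192, 8192]  # four L3 slices, as reported by lstopo
    print(hashes_fitting_in_l3(l3))                      # 16
    print(hashes_fitting_in_l3(l3, "cryptonight_lite"))  # 32
```

So on this CPU the cache works out to one hash per core for monero7, while cn-lite has room for twice as many hashes as cores, which is one place where thread count and low_power_mode may need different values.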
That PR #1604 is the one with massive threading, I will see if I can rebase to current, or get the author to do so. We had only tested Intels, but I have a feeling the Ryzen would benefit from it even more. I also run that patch on the few Intels it helped, so I need it updated anyway or I'm stuck on the older build where it applies cleanly. I think it's worth accepting the PR once it's cleaned up nicely, perhaps a CMake flag to expand the threads (it takes much longer to build all those permutations of the CPU-CN-kernel, and most users don't need it). |
Sorry haven't had a chance to poke around on this any further just yet ... but I don't think it's a Ryzen issue, as xmr-stak loves my Ryzen 2700 and Ryzen 2700x... it's only this processor where I seem to have performance issues. |
@cryptonote-social Have you tried some of the suggestions? |
Here's my config; I got more than 6k H/s with it: "cpu_threads_conf" :
Processor is a Threadripper 1950X, OS is Ubuntu Linux 16.04 LTS. ulimits and hugepages are appropriately configured.
If I try to assign more than 8 threads with xmr-stak, I get hwloc errors. The result is that xmr-stak is slow on this machine compared to xmrig, which seems to work fine.
[2018-06-24 11:13:31] : Mining coin: monero7
[2018-06-24 11:13:31] : CPU configuration stored in file 'cpu.txt'
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 0.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 1.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 2.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 3.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 4.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 5.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 6.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 7.
[2018-06-24 11:13:31] : hwloc: memory pinned
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 8.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 9.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 10.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 11.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 12.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 13.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 14.
[2018-06-24 11:13:31] : hwloc: can't bind memory
[2018-06-24 11:13:31] : Starting 1x thread, affinity: 15.
[2018-06-24 11:13:31] : hwloc: can't bind memory
HASHRATE REPORT - CPU
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 72.8 | (na) | (na) | 1 | 73.0 | (na) | (na) |
| 2 | 73.0 | (na) | (na) | 3 | 72.9 | (na) | (na) |
| 4 | 72.6 | (na) | (na) | 5 | 72.9 | (na) | (na) |
| 6 | 72.6 | (na) | (na) | 7 | 72.9 | (na) | (na) |
| 8 | 65.8 | (na) | (na) | 9 | 65.8 | (na) | (na) |
| 10 | 65.8 | (na) | (na) | 11 | 65.8 | (na) | (na) |
| 12 | 65.8 | (na) | (na) | 13 | 65.8 | (na) | (na) |
| 14 | 65.8 | (na) | (na) | 15 | 65.8 | (na) | (na) |
Totals (CPU): 1109.0 0.0 0.0 H/s
Compare this to xmrig which has no issues with memory binding and gets 200 more h/s:
[2018-06-24 11:14:49] READY (CPU) threads 16(16) huge pages 16/16 100% memory 32.0 MB
| THREAD | AFFINITY | 10s H/s | 60s H/s | 15m H/s |
| 0 | -1 | 81.2 | n/a | n/a |
| 1 | -1 | 81.2 | n/a | n/a |
| 2 | -1 | 82.7 | n/a | n/a |
| 3 | -1 | 83.2 | n/a | n/a |
| 4 | -1 | 81.2 | n/a | n/a |
| 5 | -1 | 83.2 | n/a | n/a |
| 6 | -1 | 81.2 | n/a | n/a |
| 7 | -1 | 82.7 | n/a | n/a |
| 8 | -1 | 82.7 | n/a | n/a |
| 9 | -1 | 82.4 | n/a | n/a |
| 10 | -1 | 81.2 | n/a | n/a |
| 11 | -1 | 81.2 | n/a | n/a |
| 12 | -1 | 83.2 | n/a | n/a |
| 13 | -1 | 83.2 | n/a | n/a |
| 14 | -1 | 81.2 | n/a | n/a |
| 15 | -1 | 81.2 | n/a | n/a |
[2018-06-24 11:15:08] speed 10s/60s/15m 1312.9 n/a n/a H/s max 1313.0 H/s
The difference is far more stark on cn-lite, where xmrig is over 4000 h/s and xmr-stak barely cracks 2000 h/s no matter what I try.
I've set my hugepages and various limits appropriately. Here is my lstopo output in case it helps?
lstopo --of console
Machine (31GB total) + Package L#0
NUMANode L#0 (P#0 31GB)
L3 L#0 (8192KB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
L3 L#1 (8192KB)
L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
PCIBridge
PCI 1022:43b6
Block(Disk) L#0 "sda"
Block(Disk) L#1 "sdb"
Block(Removable Media Device) L#2 "sr0"
PCIBridge
PCIBridge
PCI 8086:1539
Net L#3 "enp5s0"
PCIBridge
PCI 1002:67df
GPU L#4 "renderD128"
GPU L#5 "card0"
GPU L#6 "controlD64"
CoProc L#7 "opencl0d0"
PCIBridge
PCI 1002:67df
GPU L#8 "card1"
GPU L#9 "controlD65"
GPU L#10 "renderD129"
CoProc L#11 "opencl0d1"
PCIBridge
PCI 1022:7901
NUMANode L#1 (P#1)
L3 L#2 (8192KB)
L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#24)
L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#25)
L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#26)
L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#27)
L3 L#3 (8192KB)
L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (64KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#28)
L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (64KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#29)
L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (64KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#30)
L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (64KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#31)
HostBridge L#7
PCIBridge
PCI 1002:67df
GPU L#12 "card2"
GPU L#13 "controlD66"
GPU L#14 "renderD130"
CoProc L#15 "opencl0d2"
PCIBridge
PCI 1022:7901
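To make the output above easier to read against the hwloc log, here is a small sketch that groups the PU physical ids by NUMA node and notes whether the node reports any memory. It assumes the hwloc 1.x console format shown above, where PUs are listed under their NUMANode line; the function names are made up for illustration.

```python
import re

# "NUMANode L#0 (P#0 31GB)" has a size; "NUMANode L#1 (P#1)" has none.
NUMA_RE = re.compile(r"NUMANode L#(\d+) \(P#(\d+)(?: (\d+)(KB|MB|GB|TB))?\)")
PU_RE = re.compile(r"PU L#\d+ \(P#(\d+)\)")


def pus_by_numa_node(lstopo_text):
    """Return {node_physical_id: {"has_memory": bool, "pus": [physical PU ids]}}."""
    nodes, current = {}, None
    for line in lstopo_text.splitlines():
        numa = NUMA_RE.search(line)
        if numa:
            current = int(numa.group(2))
            nodes[current] = {"has_memory": numa.group(3) is not None, "pus": []}
            continue
        pu = PU_RE.search(line)
        if pu and current is not None:
            nodes[current]["pus"].append(int(pu.group(1)))
    return nodes


def node_of(nodes, cpu):
    """Return the NUMA node id that owns physical PU `cpu`, or None."""
    for node_id, info in nodes.items():
        if cpu in info["pus"]:
            return node_id
    return None


if __name__ == "__main__":
    sample = (
        "NUMANode L#0 (P#0 31GB)\n"
        "PU L#0 (P#0)\n"
        "PU L#1 (P#16)\n"
        "NUMANode L#1 (P#1)\n"
        "PU L#16 (P#8)\n"
        "PU L#17 (P#24)\n"
    )
    nodes = pus_by_numa_node(sample)
    for cpu in (0, 8):
        node = node_of(nodes, cpu)
        print(f"affinity {cpu}: node {node}, has memory: {nodes[node]['has_memory']}")
```

Run against the full output, affinities 0-7 land on node 0 (31GB) and 8-15 on node 1, which lists no memory. That lines up with "memory pinned" for the first eight threads and "can't bind memory" for the rest.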