Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linux: assign CPU temperatures by package/core or CCD #1352

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

leahneukirchen
Copy link

Closes #806.
Closes #1048.
Closes #1176.
Closes #1335.
Addresses #879 (on Linux).

@BenBE BenBE added enhancement Extension or improvement to existing feature Linux 🐧 Linux related issues labels Dec 25, 2023
@leahneukirchen leahneukirchen marked this pull request as draft December 25, 2023 15:46
@leahneukirchen
Copy link
Author

Fixed build without libsensors.

Fixed assignment on RPi.

linux/LibSensors.c Outdated Show resolved Hide resolved
linux/LibSensors.c Outdated Show resolved Hide resolved
linux/LinuxMachine.c Outdated Show resolved Hide resolved
linux/LinuxMachine.c Outdated Show resolved Hide resolved
linux/LinuxMachine.c Outdated Show resolved Hide resolved
@BenBE BenBE added this to the 3.4.0 milestone Dec 26, 2023
@leahneukirchen
Copy link
Author

Note that the cpu are only parsed once, this takes <0.1ms on my machine.

linux/LinuxMachine.c Outdated Show resolved Hide resolved
linux/LinuxMachine.c Outdated Show resolved Hide resolved
linux/LinuxMachine.c Outdated Show resolved Hide resolved
linux/LinuxMachine.c Outdated Show resolved Hide resolved
linux/LibSensors.c Outdated Show resolved Hide resolved
@leahneukirchen
Copy link
Author

I think this is ready to go, should I squash it together sensibly?

@BenBE
Copy link
Member

BenBE commented Jan 17, 2024

I think this is ready to go, should I squash it together sensibly?

As you like. Just took a quick peek at the list of commits and some suggest that they mostly rectify things changed in previous commits of the PR: Those are the prime subjects to squash in the PR right now. So there's not really much to squash together right now AFAICS.

@BenBE BenBE linked an issue Feb 2, 2024 that may be closed by this pull request
@BenBE
Copy link
Member

BenBE commented Feb 2, 2024

@leahneukirchen Anything left to do from your side? If not please mark the PR ready for review. TIA.

@leahneukirchen leahneukirchen marked this pull request as ready for review February 5, 2024 13:30
@@ -603,6 +603,100 @@ static void scanCPUFrequencyFromCPUinfo(LinuxMachine* this) {
}
}

#ifdef HAVE_SENSORS_SENSORS_H
static void LinuxMachine_fetchCPUTopologyFromCPUinfo(LinuxMachine* this) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does similar to scanCPUFrequencyFromCPUinfo() scan /proc/cpuinfo.
Parsing the frequency here should not make a performance difference.
Then scanCPUFrequencyFromSysCPUFreq() can be changed to write its values first to a local array and on success override the frequencies from that array.
Thus /proc/cpuinfo is not scanned multiple times.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to move the ifdef HAVE_SENSORS_SENSORS_H into that function, which makes it harder to read I think.

linux/LinuxMachine.c Show resolved Hide resolved
@@ -674,6 +772,13 @@ Machine* Machine_new(UsersTable* usersTable, uid_t userId) {
// Initialize CPU count
LinuxMachine_updateCPUcount(this);

#ifdef HAVE_SENSORS_SENSORS_H
// Fetch CPU topology
LinuxMachine_fetchCPUTopologyFromCPUinfo(this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe LinuxMachine_fetchCPUTopologyFromCPUinfo() can return whether the topology changed and then LibSensors_countCCDs() and LinuxMachine_assignCCDs() get only called on change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what this means, this code runs only once already?

@fasterit
Copy link
Member

@leahneukirchen Do you want to implement the three review comments from @cgzones ?

@cgzones
Copy link
Member

cgzones commented Apr 18, 2024

I somehow dislike that the sensors are iterated twice: once in LibSensors_countCCDs() and once in LibSensors_getCPUTemperatures().
Do we need to compute the number of ccds beforehand, what about adjusting the CPU data after iterating all sensors (and counting all CCD ones) in LibSensors_getCPUTemperatures() ?

@fasterit
Copy link
Member

I somehow dislike that the sensors are iterated twice: once in LibSensors_countCCDs() and once in LibSensors_getCPUTemperatures(). Do we need to compute the number of ccds beforehand, what about adjusting the CPU data after iterating all sensors (and counting all CCD ones) in LibSensors_getCPUTemperatures() ?

Seconded, libsensors is slow

@eworm-de
Copy link
Contributor

Tested on my notebook with Alder Lake CPU... It shows temperatures for all cores at least.

@ulope
Copy link

ulope commented May 29, 2024

Tested on an AMD EPYC 7F52 (16 core, 32 threads, 16x1 CCX), seems to work. At least plausible looking temperatures are now shown for all cores.

@BenBE
Copy link
Member

BenBE commented Jun 4, 2024

AFAICS this is waiting for some refactoring work, but that work is yet not fully specified/explained. @cgzones can you please elaborate the open discussion threads?.

@leahneukirchen
Copy link
Author

Note that sensors are iterated twice on first start only, later only to read the temperatures. (Patches welcome, I find it hard to rewrite this to use one pass.)

@eworm-de
Copy link
Contributor

eworm-de commented Jun 6, 2024

If this is a one-time event it sounds like a reasonable tradeoff, compared to (probably) more complex code. I have not looked at it myself, though.

@eworm-de
Copy link
Contributor

Oh, this crashes with a segmentation fault if libsensors.so is not available...

@leahneukirchen
Copy link
Author

Sigh. With Kernel 6.9 there seems to be a bug in the logic now, my i7-1355U isn't mapped correctly anymore...

@leahneukirchen
Copy link
Author

With kernel 6.9.5 I get:

% pcat /sys/class/hwmon/hwmon2/*label
/sys/class/hwmon/hwmon2/temp1_label	Package id 0
/sys/class/hwmon/hwmon2/temp2_label	Core 0
/sys/class/hwmon/hwmon2/temp6_label	Core 4
/sys/class/hwmon/hwmon2/temp10_label	Core 8
/sys/class/hwmon/hwmon2/temp11_label	Core 9
/sys/class/hwmon/hwmon2/temp12_label	Core 10
/sys/class/hwmon/hwmon2/temp13_label	Core 11
/sys/class/hwmon/hwmon2/temp14_label	Core 12
/sys/class/hwmon/hwmon2/temp15_label	Core 13
/sys/class/hwmon/hwmon2/temp16_label	Core 14
/sys/class/hwmon/hwmon2/temp17_label	Core 15

Note that the tempID are not sequential.

@eworm-de
Copy link
Contributor

Thanks a lot for the quick fix!

coreTempCount++;
}
}
ccdID++;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the number from the label here, e.g. Tccd5? The features might not be sorted or even monotonically increasing by 1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a total count of CCD however, too...

@Vogtinator
Copy link

On my laptop with a Ryzen 5500U, there's only a single Tctl measurement for the CPU:

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +47.4°C

According to the kernel doc's quote of the AMD manual:

It does not represent an actual physical temperature like die or case temperature.

But it appears somewhat representative, at least here, so it might make sense to include it.

@Vogtinator
Copy link

I just checked on a server with AMD EPYC 7502, which supposedly (no idea whether it's correct!) has 2 sockets x 4 CCD x 2 CCX x 4 cores = 64 cores. The sockets are exposed as separate k10temp instances:

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +38.4°C  
Tccd1:        +35.0°C  
Tccd3:        +36.8°C  
Tccd5:        +35.8°C  
Tccd7:        +35.8°C

k10temp-pci-00cb
Adapter: PCI adapter
Tctl:         +39.8°C  
Tccd1:        +38.0°C  
Tccd3:        +39.0°C  
Tccd5:        +37.5°C  
Tccd7:        +40.0°C

@leahneukirchen
Copy link
Author

Just having one sensor works already, no?

@Vogtinator
Copy link

Just having one sensor works already, no?

Depends on what you mean by "works". It shows a temperature value for each core, but I have no idea whether it matches. FWICT both sensors and its CCDs are read in whatever order sensors provides, which might or might not be related to the CPU core numbers...

@leahneukirchen
Copy link
Author

I tested on a AMD EPYC 7443 that when i lock processes to some cores, only these heat up.

@eworm-de
Copy link
Contributor

This need a rebase on top of 6de810f...

Copy link
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@leahneukirchen : What's still pending with the comment from @Vogtinator ?

@rharish101
Copy link

Can confirm that this solves #806 for me on Arch Linux, with an AMD Ryzen 7 7800X3D.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Extension or improvement to existing feature Linux 🐧 Linux related issues
Projects
None yet
8 participants