mOS for HPC v0.8 Administrator's Guide
This document provides the instructions to check out, build, install, boot and validate mOS for HPC. All the instructions provided below are validated on the following system configurations:
Component | Configuration |
---|---|
Processor | Intel(R) Xeon(R) Gold 6140 processor |
Cluster mode | None |
Memory | 128 GB DDR4 |
Distribution | SLES 15 SP1 |
Boot loader | GRUB |
The instructions are annotated where the build, install, boot and validation instructions differ for Intel(R) Xeon Phi(TM) processor 7250 based system configurations.
Component | Configuration |
---|---|
Processor | Intel(R) Xeon Phi(TM) processor 7250 |
Cluster mode | SNC-4 |
Memory mode | Flat |
Memory | 96 GB DDR, 16 GB MCDRAM |
Distribution | CentOS 7.3 |
Boot loader | GRUB |
You may need to modify the steps documented here if you have different hardware or software. See the mOS for HPC v0.8 Readme for information about platform requirements.
The mOS for HPC source can be checked out from GitHub at https://github.com/intel/mOS.
~ $ git clone https://github.com/intel/mOS.git
Cloning into 'mOS'...
remote: Enumerating objects: 7128056, done.
remote: Total 7128056 (delta 0), reused 0 (delta 0), pack-reused 7128056
Receiving objects: 100% (7128056/7128056), 1.44 GiB | 22.17 MiB/s, done.
Resolving deltas: 100% (5921622/5921622), done.
Checking out files: 100% (61523/61523), done.
~ $ cd mOS
mOS $ git checkout 5.4.18_0.8.mos
Checking out files: 100% (61778/61778), done.
Branch '5.4.18_0.8.mos' set up to track remote branch '5.4.18_0.8.mos' from 'origin'.
Switched to a new branch '5.4.18_0.8.mos'
mOS $
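To confirm that the working tree is on the expected branch before building, a quick check can be run from the source directory (the exact status output depends on the state of your clone):
mOS $ git status -sb
## 5.4.18_0.8.mos...origin/5.4.18_0.8.mos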
The mOS for HPC source contains the config.mos example file that should be used to configure it. The tables below show the settings needed to configure the source code.
Mandatory setting | Description |
---|---|
CONFIG_MOS_FOR_HPC=y | Activate the mOS for HPC code in the Linux kernel |
CONFIG_MOS_MOVE_SYSCALLS=y | Activate the mOS for HPC system call forwarding feature |
CONFIG_MOS_SCHEDULER=y | Enable the mOS for HPC scheduler |
CONFIG_MOS_LWKMEM=y | Enable the mOS for HPC memory management |
Strongly recommended setting | Description |
---|---|
CONFIG_NO_HZ_FULL=y | Activate the tickless feature of Linux. In conjunction with the mOS for HPC scheduler, this limits noise on LWK CPUs. |
CONFIG_RCU_NOCB_CPU=y | Offload RCU callback processing from boot-selected CPUs. mOS for HPC uses this capability to reduce noise on LWK CPUs. |
CONFIG_NODES_SHIFT=6 (or less) | Controls the size of the NUMA node masks within the kernel. A value of 6 represents a mask size of 8 bytes and is sufficient for any system with 64 or fewer NUMA domains. Specifying a larger value than necessary wastes kernel memory and impacts performance. |
In addition, there are several standard Linux kernel settings that mOS for HPC depends on, e.g., NUMA; see kernel/mOS/Kconfig for details. Although it may appear possible to disable major mOS for HPC functions at build time, always use all four mandatory settings plus the strongly recommended ones. mOS for HPC has not been tested with other combinations of these settings; disabling any of them is intended for debugging only.
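Before building, you can quickly confirm that the configuration you are about to use contains the four mandatory settings; for example, from the source directory (a sketch, assuming config.mos is the configuration you will copy to .config; the order of the lines may vary):
mOS $ grep -E 'CONFIG_MOS_(FOR_HPC|MOVE_SYSCALLS|SCHEDULER|LWKMEM)=' config.mos
CONFIG_MOS_FOR_HPC=y
CONFIG_MOS_MOVE_SYSCALLS=y
CONFIG_MOS_SCHEDULER=y
CONFIG_MOS_LWKMEM=y
All four options should be reported as =y.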
It is recommended that you build kernel RPMs for installation of mOS for HPC. The minimum build system requirements can be found at https://www.kernel.org/doc/html/latest/process/changes.html. A sample configuration file, config.mos, is provided. Your specific compute node hardware may require a different configuration. Please run the following commands from the directory where you checked out mOS for HPC:
mOS $ cp config.mos .config
mOS $ make -j 32 binrpm-pkg
HOSTCC scripts/basic/fixdep
HOSTCC scripts/kconfig/conf.o
LEX scripts/kconfig/lexer.lex.c
HOSTCC scripts/kconfig/expr.o
HOSTCC scripts/kconfig/confdata.o
YACC scripts/kconfig/parser.tab.[ch]
HOSTCC scripts/kconfig/symbol.o
HOSTCC scripts/kconfig/preprocess.o
HOSTCC scripts/kconfig/lexer.lex.o
HOSTCC scripts/kconfig/parser.tab.o
HOSTLD scripts/kconfig/conf
scripts/kconfig/conf --syncconfig Kconfig
UPD include/config/kernel.release
make -f ./Makefile
SYSTBL arch/x86/include/generated/asm/syscalls_32.h
SYSHDR arch/x86/include/generated/asm/unistd_32_ia32.h
SYSHDR arch/x86/include/generated/asm/unistd_64_x32.h
SYSTBL arch/x86/include/generated/asm/syscalls_64.h
SYSHDR arch/x86/include/generated/uapi/asm/unistd_32.h
SYSHDR arch/x86/include/generated/uapi/asm/unistd_64.h
SYSHDR arch/x86/include/generated/uapi/asm/unistd_x32.h
WRAP arch/x86/include/generated/uapi/asm/bpf_perf_event.h
WRAP arch/x86/include/generated/uapi/asm/errno.h
WRAP arch/x86/include/generated/uapi/asm/fcntl.h
WRAP arch/x86/include/generated/uapi/asm/ioctl.h
WRAP arch/x86/include/generated/uapi/asm/ioctls.h
WRAP arch/x86/include/generated/uapi/asm/ipcbuf.h
WRAP arch/x86/include/generated/uapi/asm/param.h
WRAP arch/x86/include/generated/uapi/asm/poll.h
WRAP arch/x86/include/generated/uapi/asm/socket.h
WRAP arch/x86/include/generated/uapi/asm/resource.h
WRAP arch/x86/include/generated/uapi/asm/sockios.h
WRAP arch/x86/include/generated/uapi/asm/termios.h
WRAP arch/x86/include/generated/uapi/asm/termbits.h
WRAP arch/x86/include/generated/uapi/asm/types.h
HOSTCC arch/x86/tools/relocs_32.o
HOSTCC arch/x86/tools/relocs_64.o
HOSTCC arch/x86/tools/relocs_common.o
WRAP arch/x86/include/generated/asm/dma-contiguous.h
WRAP arch/x86/include/generated/asm/early_ioremap.h
WRAP arch/x86/include/generated/asm/export.h
WRAP arch/x86/include/generated/asm/mcs_spinlock.h
WRAP arch/x86/include/generated/asm/mm-arch-hooks.h
WRAP arch/x86/include/generated/asm/mmiowb.h
UPD include/generated/uapi/linux/version.h
UPD include/generated/utsrelease.h
HOSTCC scripts/kallsyms
HOSTCC scripts/conmakehash
HOSTCC scripts/recordmcount
HOSTCC scripts/sortextable
HOSTCC scripts/asn1_compiler
HOSTCC scripts/extract-cert
HOSTCC scripts/genksyms/genksyms.o
LEX scripts/genksyms/lex.lex.c
YACC scripts/genksyms/parse.tab.[ch]
HOSTCC scripts/selinux/mdp/mdp
HOSTCC scripts/selinux/genheaders/genheaders
DESCEND objtool
...
Processing files: kernel-headers-5.4.18_0.8.mos-1.x86_64
Provides: kernel-headers = 5.4.18_0.8.mos kernel-headers = 5.4.18_0.8.mos-1 kernel-headers(x86-64) = 5.4.18_0.8.mos-1
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
Obsoletes: kernel-headers
Processing files: kernel-mOS-5.4.18_0.8.mos-1.x86_64
Provides: kernel-mOS = 5.4.18_0.8.mos-1 kernel-mOS(x86-64) = 5.4.18_0.8.mos-1
Requires(interp): /bin/sh
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
Requires(post): /bin/sh
Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/admin/rpmbuild/BUILDROOT/kernel-5.4.18_0.8.mos-1.x86_64
Wrote: /home/admin/rpmbuild/RPMS/x86_64/kernel-5.4.18_0.8.mos-1.x86_64.rpm
Wrote: /home/admin/rpmbuild/RPMS/x86_64/kernel-headers-5.4.18_0.8.mos-1.x86_64.rpm
Wrote: /home/admin/rpmbuild/RPMS/x86_64/kernel-mOS-5.4.18_0.8.mos-1.x86_64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.UuYAPp
+ umask 022
+ cd .
+ rm -rf /home/admin/rpmbuild/BUILDROOT/kernel-5.4.18_0.8.mos-1.x86_64
+ exit 0
mOS $
The RPMs built in the previous step need to be installed on the compute nodes or into the compute node image.
At a minimum, install the kernel-5.4.18_0.8.mos-1.x86_64 and kernel-mOS-5.4.18_0.8.mos-1.x86_64 RPMs into your compute node image. The exact RPM names may vary depending on the state of the code, whether a local version name is specified (for example, via make menuconfig), and how many times the RPMs have been built. However, the 5.4.18_0.8.mos part of the name should remain constant.
~ $ sudo rpm -ivh --force /home/admin/rpmbuild/RPMS/x86_64/kernel-5.4.18_0.8.mos-1.x86_64.rpm
Preparing... ################################# [100%]
Updating / installing...
1:kernel-5.4.18_0.8.mos-1 ################################# [100%]
~ $ sudo rpm -ivh --force /home/admin/rpmbuild/RPMS/x86_64/kernel-mOS-5.4.18_0.8.mos-1.x86_64.rpm
Preparing... ################################# [100%]
Updating / installing...
1:kernel-mOS-5.4.18_0.8.mos-1 ################################# [100%]
~ $
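After the installation, a quick sanity check is to confirm that the RPMs are registered and that the kernel image landed in /boot (a sketch; exact names depend on how your RPMs were built):
~ $ rpm -qa | grep 5.4.18_0.8.mos
~ $ ls /boot | grep 5.4.18_0.8.mos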
After the RPMs are installed, the kernel needs to be added to the GRUB menu on the compute nodes. The kernel parameters are taken from /etc/default/grub via the GRUB_CMDLINE_LINUX variable in that file. Please update or replace the GRUB_CMDLINE_LINUX variable in that file as follows (include movable_node on Intel(R) Xeon Phi(TM) processor based systems, as described in the table below):
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 selinux=0 rd.lvm.lv=centos/root rd.lvm.lv=centos/swap intel_pstate=disable nmi_watchdog=0 kernelcore=16G nohz_full=<see following table>"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 selinux=0 rd.lvm.lv=centos/root rd.lvm.lv=centos/swap intel_pstate=disable nmi_watchdog=0 kernelcore=16G movable_node nohz_full=<see following table>"
The following parameters and values are recommended for mOS for HPC. Not all combinations and variations of boot parameters have been validated and tested. Boot failure is possible if, for example, lwkcpus and lwkmem are not properly set for your system. The lwkcpus and lwkmem parameters can be omitted and the lightweight kernel partition created after booting using the lwkctl command. Please refer to Documentation/kernel-parameters.txt in the mOS for HPC kernel source for further details.
Name | Recommended Value | Description |
---|---|---|
nmi_watchdog | 0 | Disable the NMI watchdog interrupt in order to eliminate this additional source of noise on the CPUs. An alternative method of turning off the watchdog is writing a zero to the system file /proc/sys/kernel/nmi_watchdog, which would eliminate the need to set it here. |
intel_pstate | disable | Do not allow the system to dynamically adjust the frequency of the CPUs. When running HPC applications, we want a stable, consistent CPU frequency across the entire job. |
lwkcpus | topology dependent | List of CPUs to be controlled by mOS. This includes the CPUs that will be exclusively owned by mOS (implicitly marked as 'isolated') and also Linux CPUs that will be used by mOS to host utility threads and to execute migrated system calls. The lwkcpus argument designates CPU resources to the LWK. The format is: lwkcpus=<syscall cpu1>.<lwkcpu set1>:<syscall cpu2>.<lwkcpu set2>... For example, lwkcpus=28.1-13,29-41:42.15-27,43-55 designates two Linux CPUs, 28 and 42, to handle syscalls: CPU 28 hosts syscalls for LWK CPUs 1-13 and 29-41, and CPU 42 hosts syscalls for LWK CPUs 15-27 and 43-55. Note that this is a simplified example and may not be an optimized configuration. This parameter is only required if a lightweight kernel partition is to be created at boot time; it can be omitted and a lightweight kernel partition created after boot using the lwkctl command. |
lwkmem | topology dependent | Designate memory for use by mOS. The amount of memory requested is specified in parse_mem format (K,M,G), optionally per NUMA domain. The LWK memory requested on the kernel command line can only come from the movable memory in the system; use the 'kernelcore' argument described below to control the split between non-movable and movable memory. Example 1: lwkmem=126G requests that the kernel designate a total of 126G of physical memory to the LWK, allocated from all online NUMA nodes which have movable memory. Example 2: lwkmem=0:58G,1:16G requests that the kernel designate 58G of physical memory from NUMA node 0 and 16G from NUMA node 1 to the LWK. If the full amount of requested memory cannot be allocated on a specified NUMA node, the remainder of the request is distributed uniformly among the requests on subsequent NUMA nodes in the list; in this example, if the kernel could designate only 50G on NUMA node 0, the remaining 8G would be added to the 16G requested from NUMA node 1. This parameter is only required if a lightweight kernel partition is to be created at boot time; it can be omitted and a lightweight kernel partition created after boot using the lwkctl command. |
kernelcore | 16G | This Linux boot argument sets the total non-movable memory in the system. Non-movable memory is used only by the Linux kernel and cannot be dedicated to the LWK; the kernel treats the rest of the physical memory as movable memory, which can be dynamically provisioned between Linux and the LWK. The memory requested with the 'lwkmem' parameter described above can only come from movable memory, so adjust 'kernelcore' accordingly. On an Intel(R) Xeon Phi(TM) processor this can be accomplished by specifying the 'movable_node' kernel parameter (described below) along with the 'kernelcore' parameter; see the BIOS settings below for MCDRAM configuration. Example: kernelcore=16G movable_node on a system with 96G DDR and 16G MCDRAM. |
movable_node | | On systems with the Intel(R) Xeon Phi(TM) processor, this marks MCDRAM NUMA nodes as movable nodes if the MCDRAM is configured as hot-pluggable memory in BIOS, i.e. there will not be any kernel memory allocation in MCDRAM and all of it can be used by applications (Linux or LWK). Please see the BIOS settings below for MCDRAM configuration. |
nohz_full | CPU dependent | 1-<number of logical CPUs - 1>. Example: nohz_full=1-71 |
The last step is to update the grub configuration using the grub2-mkconfig command. Please ensure that appropriate rd.lvm.lv settings are specified for your system. The grub configuration file is grub.cfg. The location of this file varies. The example below shows a system where it is located in /boot/efi/EFI/centos/grub.cfg. Other systems might have it in /boot/grub2/grub.cfg. You may want to save a backup copy of your grub.cfg file before the following step.
$ sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Note: This command will add the kernel parameters in GRUB_CMDLINE_LINUX to every entry in the grub menu. You should preserve and restore the existing kernel entries in grub.cfg after running grub2-mkconfig.
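To confirm that an entry for the mOS kernel was generated, you can search the regenerated configuration for the mOS kernel version (path shown for the CentOS EFI example above; adjust for your system):
$ sudo grep menuentry /boot/efi/EFI/centos/grub.cfg | grep 5.4.18_0.8.mos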
It is recommended to treat MCDRAM as hot-pluggable memory. This setting in conjunction with the 'movable_node' kernel parameter is necessary for maximum MCDRAM availability for applications (either Linux or LWK). The following BIOS menu is used to configure MCDRAM:
EDKII Menu -> Advanced -> Uncore Configuration -> Treat MCDRAM as Hot-Pluggable Memory ==> <Yes>
If mOS for HPC has been properly installed and configured then the grub boot menu should have an entry for mOS. Please select the 5.4.18_0.8.mos entry during boot.
CentOS Linux (5.4.18_0.8.mos) 7 (Core)
CentOS Linux (3.10.0-327.36.3.el7.x86_64) 7 (Core)
CentOS Linux (0-rescue-71e25674024146aaa3ff5de0e403b11d) 7 (Core)
Use the ^ and v keys to change the selection.
Press 'e' to edit the selected item, or 'c' for a command prompt.
The selected entry will be started automatically in 0s.
In order to validate a successful installation, perform the following steps on the compute nodes where mOS for HPC is installed.
To test that yod is functional, launch a simple application using yod:
$ yod /bin/echo hello
hello
If LWK memory is active then you should be able to see some LWK entries in the process mapping of an LWK process:
[admin@knl-4 ~]$ yod cat /proc/self/maps | grep LWK
0060b000-0060d000 rwxp 00000000 00:00 0 LWK
00800000-00a00000 rwxp 00000000 00:00 0 [heap] LWK
2aaaaaaab000-2aaaaaaaf000 rwxp 00000000 00:00 0 LWK
The above example runs the cat program as an mOS process, reserving LWK memory resources for it.
Alternatively, you can use the lwkctl utility to view the mOS version and LWK configuration. Two example outputs from different systems are shown below.
$ lwkctl -s
mOS version : 0.8
Linux CPU(s): 0,20,40,60 [ 4 CPU(s) ]
LWK CPU(s): 1-19,21-39,41-59,61-79 [ 76 CPU(s) ]
Utility CPU(s): 0,20,40,60 [ 4 CPU(s) ]
LWK Memory(KB): 56623104 56623104 [ 2 NUMA nodes ]
CPU specification was automatically generated.
Memory specification was automatically generated.
$ lwkctl -s
mOS version : 0.8
Linux CPU(s): 0-1,18-19,68-69,86-87,136-137,154-155,204-205,222-223 [ 16 CPU(s) ]
LWK CPU(s): 2-17,20-67,70-85,88-135,138-153,156-203,206-221,224-271 [ 256 CPU(s) ]
Utility CPU(s): 0-1,18-19,68-69,86-87,136-137,154-155,204-205,222-223 [ 16 CPU(s) ]
LWK Memory(KB): 19922944 19922944 19922944 19922944 3145728 3145728 3145728 3145728 [ 8 NUMA nodes ]
CPU specification was automatically generated.
Memory specification was automatically generated.
Check the dmesg log for mOS entries (Intel(R) Xeon Phi(TM) example):
$ sudo dmesg | grep mOS
[ 6.290489] mOS-lwkctl: Creating default memory partition: lwkmem=0:16G,1:16G,2:16G,3:16G,4:4G,5:4G,6:4G,7:4G
[ 6.310113] mOS-mem: Initializing memory management
[ 6.523033] mOS-mem: Node 0: va 0xffff888148000000 pa 0x148000000 pfn 1343488-5537791 : 4194304
[ 6.541588] mOS-mem: Node 0: offlining va 0xffff888148000000 pa 0x148000000 pfn 1343488-5537791:4194304
[ 11.317740] mOS-mem: Node 0: Requested 16384 MB Allocated 16384 MB
[ 11.400463] mOS-mem: Node 1: va 0xffff888840000000 pa 0x840000000 pfn 8650752-12845055 : 4194304
[ 11.419409] mOS-mem: Node 1: offlining va 0xffff888840000000 pa 0x840000000 pfn 8650752-12845055:4194304
[ 14.307081] mOS-mem: Node 1: Requested 16384 MB Allocated 16384 MB
[ 14.389595] mOS-mem: Node 2: va 0xffff888f40000000 pa 0xf40000000 pfn 15990784-20185087 : 4194304
[ 14.409083] mOS-mem: Node 2: offlining va 0xffff888f40000000 pa 0xf40000000 pfn 15990784-20185087:4194304
[ 17.359969] mOS-mem: Node 2: Requested 16384 MB Allocated 16384 MB
[ 17.443401] mOS-mem: Node 3: va 0xffff889640000000 pa 0x1640000000 pfn 23330816-27525119 : 4194304
[ 17.463244] mOS-mem: Node 3: offlining va 0xffff889640000000 pa 0x1640000000 pfn 23330816-27525119:4194304
[ 20.401619] mOS-mem: Node 3: Requested 16384 MB Allocated 16384 MB
[ 20.435741] mOS-mem: Node 4: va 0xffff888640000000 pa 0x640000000 pfn 6553600-7602175 : 1048576
[ 20.455901] mOS-mem: Node 4: offlining va 0xffff888640000000 pa 0x640000000 pfn 6553600-7602175:1048576
[ 21.394487] mOS-mem: Node 4: Requested 4096 MB Allocated 4096 MB
[ 21.428897] mOS-mem: Node 5: va 0xffff888d40000000 pa 0xd40000000 pfn 13893632-14942207 : 1048576
[ 21.449896] mOS-mem: Node 5: offlining va 0xffff888d40000000 pa 0xd40000000 pfn 13893632-14942207:1048576
[ 22.345746] mOS-mem: Node 5: Requested 4096 MB Allocated 4096 MB
[ 22.379905] mOS-mem: Node 6: va 0xffff889440000000 pa 0x1440000000 pfn 21233664-22282239 : 1048576
[ 22.400215] mOS-mem: Node 6: offlining va 0xffff889440000000 pa 0x1440000000 pfn 21233664-22282239:1048576
[ 23.187042] mOS-mem: Node 6: Requested 4096 MB Allocated 4096 MB
[ 23.221759] mOS-mem: Node 7: va 0xffff889b40000000 pa 0x1b40000000 pfn 28573696-29622271 : 1048576
[ 23.242156] mOS-mem: Node 7: offlining va 0xffff889b40000000 pa 0x1b40000000 pfn 28573696-29622271:1048576
[ 24.048852] mOS-mem: Node 7: Requested 4096 MB Allocated 4096 MB
[ 24.065554] mOS-mem: Requested 81920 MB Allocated 81920 MB
[ 24.081471] mOS-lwkctl: LWK creating default LWKMEM partition..Done
[ 24.098153] mOS-lwkctl: Creating default CPU partition:
[ 24.098156] mOS-lwkctl: lwkcpu_profile=normal
[ 24.164631] mOS: LWK CPUs 52-67,256-271 will ship syscalls to Linux CPU 1
[ 24.182237] mOS: LWK CPUs 120-135,188-203 will ship syscalls to Linux CPU 69
[ 24.200178] mOS: LWK CPUs 2-17,206-221 will ship syscalls to Linux CPU 137
[ 24.218024] mOS: LWK CPUs 70-85,138-153 will ship syscalls to Linux CPU 205
[ 24.236021] mOS: LWK CPUs 20-35,224-239 will ship syscalls to Linux CPU 19
[ 24.253830] mOS: LWK CPUs 88-103,156-171 will ship syscalls to Linux CPU 87
[ 24.271619] mOS: LWK CPUs 36-51,240-255 will ship syscalls to Linux CPU 155
[ 24.289336] mOS: LWK CPUs 104-119,172-187 will ship syscalls to Linux CPU 223
[ 24.307218] mOS: Configured LWK CPUs: 2-17,20-67,70-85,88-135,138-153,156-203,206-221,224-271
[ 24.326811] mOS: Configured Utility CPUs: 1,19,69,87,137,155,205,223
[ 24.343941] mOS: LWK CPU profile set to: normal
[ 24.359875] mOS-sched: set unbound workqueue cpumask to 0-1,18-19,68-69,86-87,136-137,154-155,204-205,222-223
[ 24.381437] mOS-sched: IDLE MWAIT enabled. Hints min/max=00000000/00000010. CPUID_MWAIT substates=00000110
[ 27.968364] mOS-lwkctl: LWK creating default partition.. Done
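In addition, the kernel command line and the tickless CPU set can be inspected to confirm that the boot parameters described earlier took effect (a sketch; the values depend on your configuration):
$ cat /proc/cmdline
$ cat /sys/devices/system/cpu/nohz_full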
Check to validate that yod is using all the specified LWK CPUs:
$ [ $(yod cat /sys/kernel/mOS/lwkcpus_reserved) == $(cat /sys/kernel/mOS/lwkcpus) ] && echo "mOS for HPC is operational" || echo "mOS for HPC not operational"
mOS for HPC is operational
When mOS is booted and managing resources, an obvious question is what common system tools report about the machine state. The table below summarizes the behavior of some common tools.
Command / Tool | Notes |
---|---|
top, htop | Behave as expected, showing CPU utilization and process placement across CPUs |
/proc/meminfo, free | By default show memory usage statistics for both Linux and LWK. Furthermore, the mosview tool can be used to see only LWK side usage or only Linux side usage. |
dmesg | The mOS kernel writes information to the syslog, a good place to check for operational health |
debugging and profiling tools | mOS maintains compatibility with Linux so that tools such as ptrace, strace, and gdb continue to work as expected. In addition, Intel(R) Parallel Studio XE tools such as Intel(R) VTune(TM) Profiler and Intel(R) Advisor also work as designed. Please see https://software.intel.com/en-us/vtune-help-building-and-installing-the-sampling-drivers-for-linux-targets for information on building sampling drivers that are compatible with mOS. |
In mOS for HPC the resources, CPUs and memory, can be dynamically partitioned between Linux and the LWK. An LWK partition can be created using the user space command utility 'lwkctl' after the kernel boots up. A default LWK partition can also be created during the kernel boot up by specifying the LWK resources needed on the kernel command line through kernel parameters 'lwkcpus=' and 'lwkmem='.
lwkcpus=<syscall cpu1>.<lwkcpu set1>:<syscall cpu2>.<lwkcpu set2>...
lwkmem=<n1>:<size1>,<n2>:<size2>,...
where n1, n2, ... are NUMA node numbers and size1, size2, ... are the sizes of the LWK memory requests on the corresponding NUMA nodes.
Based on system needs, this default LWK partition can be deleted after boot and a new LWK partition created using the lwkctl command.
This command line utility offlines resources on Linux and hands them over (designates them) to the LWK, and vice versa. Once this partitioning is complete, further resource partitioning between LWK processes (reservation) is done using the mOS job launch utility yod. The lwkctl command requires root privileges to create or delete an LWK partition. Both the LWK CPU and LWK memory specifications need to be provided when creating an LWK partition. The specification value auto is supported for both the LWK CPU and LWK memory specifications; when auto is used, mOS generates a topology-aware specification suitable for most HPC application environments. Deleting an LWK partition deletes both the LWK CPU and LWK memory designations. The command can also be used to view the current LWK partition.
Quick Reference:
- Creating an LWK partition:
  sudo lwkctl -c 'lwkcpus=<lwkcpu_spec> lwkmem=<lwkmem_spec>'
  Example 1 (for Intel(R) Xeon Phi(TM)):
  sudo lwkctl -c 'lwkcpus=1.52-67,256-271:69.120-135,188-203:137.2-17,206-221:205.70-85,138-153:19.20-35,224-239:87.88-103,156-171:155.36-51,240-255:223.104-119,172-187 lwkmem=0:16G,1:16G,2:16G,3:16G,4:3968M,5:3968M,6:3968M,7:3968M'
  Example 2 - mOS determines the configuration:
  sudo lwkctl -c 'lwkcpus=auto lwkmem=auto'
  Note that the entire specification needs to be enclosed in single quotes (' ').
- Deleting an LWK partition:
  sudo lwkctl -d
- Viewing the existing LWK partition:
  To output in human readable format: lwkctl -s
  To output in raw format: lwkctl -s -r
For further details regarding the usage refer to the lwkctl man page on the compute node where mOS for HPC is installed.
Distribution of interrupts across CPUs is typically managed by a user space daemon called irqbalance, which is usually launched at boot time by systemd. mOS supports use of the irqbalance daemon for interrupt balancing; supporting other, similar tools may require further adaptations in the mOS tool lwkctl.
The mOS kernel tries to keep interrupts on LWK CPUs to a minimum. Any user space attempt to affinitize interrupts only to LWK CPUs is therefore denied, and a RAS message like the following is printed on the serial console or in dmesg:
__irq_set_affinity(irq 333, mask 15) can not affinitize only to LWKCPUs ret -22
Such an action can come from the irqbalance daemon or from any other user space tool that tries to balance interrupts across CPUs but is unaware of the LWK CPUs that exist once an LWK partition is created. To avoid this from the irqbalance daemon, restart it after setting its environment variable IRQBALANCE_BANNED_CPUS to the LWK CPU mask (which can be read from /sys/kernel/mOS/lwkcpus_mask). This ensures that the irqbalance daemon ignores CPUs that are booted as LWK CPUs when distributing interrupts in the system.
On systems that run the irqbalance daemon, lwkctl internally takes care of stopping the daemon when an LWK partition is being created or deleted and restarting it with the environment variable IRQBALANCE_BANNED_CPUS set to the value from /sys/kernel/mOS/lwkcpus_mask. It does so only if the irqbalance daemon was already running when the lwkctl command was issued, so no further steps are needed from the user to manually restart the daemon. This automatic restart does not take place when a default LWK partition is created at kernel boot, i.e. when the LWK partition specification is provided on the kernel command line. In that case the user needs to manually restart the irqbalance daemon after setting the IRQBALANCE_BANNED_CPUS environment variable to the LWK CPU mask.
The following one-time step manually restarts the irqbalance daemon after kernel boot; it is needed only when a default LWK partition is created during boot:
$ sudo service irqbalance restart
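If a specific banned-CPU mask is needed for that restart, one way to set it in the systemd manager environment first is sketched below (this assumes that irqbalance, when started by systemd on your distribution, inherits the manager environment):
$ sudo systemctl set-environment IRQBALANCE_BANNED_CPUS=$(cat /sys/kernel/mOS/lwkcpus_mask)
$ sudo service irqbalance restart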
Regardless of whether a default LWK partition was created at boot, if the lwkctl command is later used to dynamically create or delete an LWK partition, it is not necessary to repeat the steps above to manually restart the irqbalance daemon; lwkctl handles this automatically.
Note: lwkctl does not keep context across invocations and cannot save or restore a specific IRQBALANCE_BANNED_CPUS value; it simply overwrites any value already set in the systemd environment. If a specific value of this variable is needed when there is no LWK partition, set IRQBALANCE_BANNED_CPUS manually after deleting the LWK partition and restart the irqbalance daemon.
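To see what is currently set in the systemd manager environment, one option (a sketch) is:
$ sudo systemctl show-environment | grep IRQBALANCE_BANNED_CPUS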
Issuing lwkctl commands in quick succession (e.g. from test scripts or unit tests) can cause systemd to disable the irqbalance daemon, because systemd limits the restart rate of its services by default. To prevent this, create the following override file so that no restart rate limit is enforced by systemd.
For systemd version older than v230,
$ sudo cat /etc/systemd/system/irqbalance.service.d/override.conf
[Service]
StartLimitInterval=0
For systemd version v230 and beyond,
$ sudo cat /etc/systemd/system/irqbalance.service.d/override.conf
[Unit]
StartLimitIntervalSec=0
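The override file can be created with systemctl edit irqbalance, or directly as sketched below for a system with systemd v230 or later (use StartLimitInterval=0 under [Service] instead for older versions, as shown above); a daemon-reload makes systemd pick up the change:
$ sudo mkdir -p /etc/systemd/system/irqbalance.service.d
$ sudo tee /etc/systemd/system/irqbalance.service.d/override.conf <<'EOF'
[Unit]
StartLimitIntervalSec=0
EOF
$ sudo systemctl daemon-reload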