mOS for HPC v0.7 User's Guide
mOS for HPC combines a lightweight kernel (LWK) with Linux. Resources, e.g., CPUs and physical memory blocks, are either managed by the LWK or by the Linux kernel. The process of giving resources to the LWK, thereby taking them away from Linux management, is called designation. Resources that have been designated for the LWK are still visible from Linux but are now under LWK control. For example, the Linux kernel, when directed to do so, can perform I/O using memory designated for the LWK, but the LWK decides how that memory is managed. LWK resource designation can be done at boot time, or later using the lwkctl command.
Giving some or all of the designated LWK resources to an mOS process is called reservation and is done at process launch time using a utility called yod (see below). The third stage, allocation, happens when a running process requests a resource, e.g., through calls such as mmap() or sched_setaffinity(). A process can only allocate resources that have been reserved for it at launch time, and designated as LWK resources before that.
The lwkctl command can be used to display the LWK partition information. This includes the list of LWK CPUs, LWK memory and syscall CPUs.
To see the output in human-readable format, use:
lwkctl -s
To see the output in raw format, use:
lwkctl -s -r
For further details regarding usage, refer to the lwkctl man page on a compute node where mOS for HPC is installed.
Applications are run under mOS for HPC through the use of a launcher command called yod. Any program not launched with yod will simply run on Linux. This document discusses how to use yod in conjunction with mpirun, but does not discuss job schedulers.
The yod utility of mOS is the fundamental mechanism for spawning LWK processes. The syntax is:
yod yod-arguments program program-arguments
One of yod's principal jobs is to reserve LWK resources (CPUs, memory) for the process being spawned. yod supports a simple syntax whereby a fraction of the resources that have been designated for the LWK are reserved. This is useful when launching multiple MPI ranks per node. In such cases, the general pattern looks like this:
mpirun -ppn N mpirun-args yod -R 1/N yod-args program program-args
This reserves for each MPI rank an equal portion of the designated LWK resources.
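For example, launching 4 ranks per node, each reserving one quarter of the designated LWK resources, might look like this (the application name and any additional arguments are placeholders):

mpirun -ppn 4 yod -R 1/4 ./app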
Please consult the yod man page for a more thorough description of the yod arguments. Please consult the mpirun man page for further information on mpirun-args.
In addition to the arguments documented in the yod man page, there are some experimental options. These can easily change or even disappear in future releases. Some of the experimental options are described in the table below. All of these are passed to yod via the --opt (-o) option.
Option | Arguments | Description | Additional Notes |
---|---|---|---|
move-syscalls-disable | None | Disables system call migration from LWK CPUs to designated system call CPUs. | |
lwkmem-blocks-allocated | None | Enables the tracking and reporting of allocated LWK memory at process exit. The memory usage report counts the number of blocks that were allocated and collates that data by process, size and NUMA domain. The max memory usage by domain report shows the high-water mark (in bytes) of each NUMA domain of an LWK process. | The memory usage report is a total count of block allocations; block frees are not counted. Thus the total amount of memory allocation reported may exceed the amount of memory designated for the given NUMA domain. The high-water marks in the max memory usage by domain report did not necessarily occur at the same time, so the process' overall high-water mark may be less than the sum of the individual domains' high-water marks. This option is useful for debugging and has no other noticeable effect on the hosted applications. Both the existence of this option and the format of its output are subject to change. |
lwkmem-interleave | size | Controls the largest size of pages that will be interleaved by the LWK. When yod arguments result in reservation of more than one NUMA memory domain (per memory type), allocations will be interleaved within those domains. This option controls the largest page size that will be used for said interleaving. | The default is 2m. Legal values are 4k, 2m, 1g and 0 (disabled). In SNC-4 flat mode, this option has no effect if yod is invoked with "-R 1/N" where N is a multiple of 4 (N=4, 8, 12, ...), because yod will normally reserve CPUs and memory from the same NUMA domains in that case. It does have an effect for all other values of N; in particular, MPI workloads that use 1 or 2 ranks per node have seen improved performance with interleaving. This option has no effect in quadrant flat mode, since there is only one NUMA domain per given memory type. |
lwkmem-load-elf-disable | None | Disables loading the initialized/uninitialized sections of the ELF binary image into mOS memory for an mOS process. | The .data and/or .bss sections of a program are loaded into LWK memory by default. This option forces the .data and .bss sections to be loaded into Linux memory instead. |
lwkmem-zeroes-check | options | Tests LWK memory for zeroness as it is being accessed. Detections of non-zero bytes result in messages in the kernel log; these are limited to approximately one per page. A fix mode can additionally be requested via the options argument. | This is for diagnostic purposes only. Violations should be treated as bugs in mOS; running in fix mode should never be considered a supported mode. |
… | None | Some applications mmap large areas of memory with protections of PROT_NONE. These are, without further modification, inaccessible regions, and therefore it may not be desirable to (immediately) back them with LWK memory. This option defers such backing, leaving unbacked holes in the LWK process' virtual memory. These holes (or portions of holes) may be re-harvested later with mmap() or mprotect() calls that change the protections to something that is accessible. | This option has been useful in running code on the OpenJDK Java Virtual Machine from Oracle Corp. |
lwkmem-xpmem-stats | None | Enables recording of XPMEM statistics during the application run and displays them at process exit. | This is a debug option. It can be used to find XPMEM performance bottlenecks while using huge pages. |
lwksched-stats | level | Outputs counters to the kernel log at process exit. Data detail is controlled by <level>: a value of 1 generates an entry for every mOS CPU that had more than one mOS thread committed to run on it; a value of 2 adds a summary record for the exiting mOS process; a value of 3 adds records for all CPUs in the process and a process summary record, regardless of commitment levels. Information provided: PID (the TGID of the process; can be used to visually group the CPUs that belong to a specific process); CPUID (CPU corresponding to the data being displayed); THREADS (number of threads within the process, i.e. the main thread plus pthreads); CPUS (number of CPUs reserved for use by this process); MAX_COMMIT (high-water mark of the number of mOS threads assigned to run on this CPU); MAX_RUNNING (high-water mark of the number of tasks enqueued to the mOS run queue, including kernel tasks); GUEST_DISPATCH (number of times a non-mOS thread, i.e. a kernel thread, was dispatched on this CPU); TIMER_POP (number of timer interrupts, typically the result of a POSIX timer expiring or of round-robin dispatching if enabled through the option lwksched-enable-rr); SYS_MIGR (number of system calls that were migrated to a Linux CPU for execution); SETAFFINITY (number of sched_setaffinity system calls executed by this CPU); UTIL-CPU (indicator that this CPU has been designated as a utility CPU, meant to run utility threads such as the OMP monitor and the PSM progress threads). | This option is useful for debugging. The content and format of the output are highly dependent on the current implementation of the mOS scheduler and therefore are likely to change in future releases. |
util-threshold | <X:Y> | The X value indicates the maximum number of LWK CPUs that can be used for hosting utility threads. The Y value represents the maximum number of utility threads allowed to be assigned to any one LWK CPU. Some examples: a value of "0:0" prevents any utility threads from being placed on an LWK CPU and forces all utility threads onto the Linux CPUs that are defined to be the syscall target CPUs; a value of "-1:1" allows any number of LWK CPUs to hold utility threads, but a maximum of one utility thread will be assigned to each LWK CPU. | Default behavior is X = -1, Y = 1. The UTI API will be the preferred approach to controlling utility thread placement. |
idle-control | <MECHANISM,BOUNDARY> | MECHANISM is the fast-path idle/dispatch mechanism used by the idle task. BOUNDARY is the boundary within which the fast dispatch mechanism will be deployed; beyond this boundary, the CPU will request deep sleep. | The default <MECHANISM,BOUNDARY> is <mwait,reserved>. |
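As a sketch of how these options are passed (the exact value syntax should be verified against the yod man page; the application name is a placeholder), a no-argument option and options that take a value might be supplied as:

yod -o move-syscalls-disable ./app
yod -o lwksched-stats=2 -o lwkmem-interleave=0 ./app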
The mOS kernel will reserve unique CPU and memory resources for each process/rank within a node and will assign threads to the CPU resources owned (reserved) by the process. For these reasons, it is advisable to set the runtime-specific environment variables below so as not to interfere with this mOS behavior.
Name | Value | Description |
---|---|---|
I_MPI_PIN | off | Disables process pinning in Intel MPI. Without this set, Intel MPI gets confused by isolated CPUs (including mOS LWK CPUs) and may attempt to assign ranks to cores not controlled by mOS. Symptoms include core dumps from the pmi_proxy (HYDRA). When pinning is disabled via "I_MPI_PIN=off", processes forked by the pmi_proxy inherit the affinity mask of the proxy, which is what we want for mOS' yod. |
I_MPI_FABRICS | shm:tmi (2018 editions) or shm:ofi (2019 editions) | For use on clusters with Intel(R) Omni-Path Fabric. Selects shared memory for intra-node communication. For 2018 Intel MPI editions, the recommended setting for inter-node communication is the Tag Matching Interface (tmi). For 2019 Intel MPI editions, the recommended setting for inter-node communication is OpenFabrics Interfaces (ofi). See https://software.intel.com/en-us/mpi-developer-guide-linux-selecting-fabrics for additional information. |
I_MPI_TMI_PROVIDER | psm2 | For use on clusters with Intel(R) Omni-Path Fabric. Selects the PSM2 provider for the TMI fabric. This setting is recommended only for 2018 editions of Intel MPI; it has been deprecated or removed in 2019 editions. |
I_MPI_FALLBACK | 0 | Use only the specified communication fabric(s). |
PSM2_RCVTHREAD | 0 or 1 | When set to 0, disables the PSM2 progress thread. If not disabled, the PSM2 run-time will create an additional thread within each process. This additional thread could interfere with mOS process/thread placement and reduce performance. Some application environments may require the use of this progress thread in order to allow forward progress; in those environments the existence of the PSM2 progress thread must be made known to the mOS kernel through the yod --util_threads option. Please consult the yod man page for a more detailed description of this option. Some 2019 Intel MPI editions require that this thread be enabled. |
PSM2_MQ_RNDV_HFI_WINDOW | 4194304 | For use on clusters with Intel(R) Omni-Path Fabric. |
PSM2_MQ_EAGER_SDMA_SZ | 65536 | For use on clusters with Intel(R) Omni-Path Fabric. |
PSM2_MQ_RNDV_HFI_THRESH | 200000 | For use on clusters with Intel(R) Omni-Path Fabric. |
KMP_AFFINITY | none | Does not bind OpenMP threads to CPU resources, allowing the mOS kernel to choose from reserved CPU resources. If the operating system supports affinity, the compiler uses the OpenMP thread affinity interface to determine machine topology. Specify KMP_AFFINITY=verbose,none to list a machine topology map. |
HFI_NO_CPUAFFINITY | 1 | For use on clusters with Intel(R) Omni-Path Fabric. Disables affinitization of the PSM2 progress thread. |
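As an illustrative sketch only, assuming a 2019 Intel MPI edition on an Omni-Path cluster (the rank count and application name are placeholders), a job script might combine these settings with the yod launch pattern shown earlier:

export I_MPI_PIN=off
export I_MPI_FABRICS=shm:ofi
export I_MPI_FALLBACK=0
export PSM2_RCVTHREAD=0
export KMP_AFFINITY=none
export HFI_NO_CPUAFFINITY=1
mpirun -ppn 2 yod -R 1/2 ./app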
The UTIlity thread API (UTI API) has been developed by RIKEN Advanced Institute for Computational Science - Japan, Sandia National Laboratories, and Intel Corp. with feedback from Fujitsu and Argonne National Laboratory.
The UTI API allows run-times and applications to control the placement of the threads which are not the primary computational threads within the application.
The API:
- Keeps these extra threads from interfering with computational threads.
- Allows grouping and placing of utility threads across the ranks within a node to maximize performance.
- Does not require the caller to have detailed knowledge of the system topology or the scheduler. Allows the kernel to provide intelligent placement and scheduling behavior.
- Does not require the caller to be aware of other potentially conflicting run-time or application thread placement actions. CPU selection is managed globally across the node by mOS.
- Header file /usr/include/uti.h contains the function and macro declarations.
- #include <uti.h>
- Library /usr/lib/libmos.so contains the mOS implementation of the UTI API. Link using the following:
- "-lmos'
The programmer can provide behavior and location hints to the kernel. The kernel will then use its knowledge of the system topology and available scheduling facilities to intelligently place and run the utility thread. The scheduler can optimize scheduling actions for utility threads with the following behaviors: CPU intensive (e.g., constant polling), high or low scheduling priority, threads that block or yield infrequently, or threads that expect to run on a dedicated CPU. The scheduler can also optimize placement considering: L1/L2/L3/NUMA-domain, a specific Linux CPU, a lightweight kernel CPU, or CPUs that handle fabric interrupts.
There are various ways of specifying a location:
- Explicit NUMA domain
- Supply a bit mask containing NUMA domains.
- Location relative to the caller of the API.
- Same L1, L2, L3, or NUMA domain
- Different L1, L2, L3, or NUMA domain
- Location relative to other utility threads specifying a common key.
- Allows grouping of utility threads used across ranks within the node.
- Used in conjunction with a specification of "Same L1, L2, L3, or NUMA domain"
- Type of CPU
- Can be used in conjunction with the above location specifications
- FWK - Linux CPU running under the Linux scheduler
- LWK - lightweight kernel controlled CPU
- Fabric Interrupt handling CPU
This example shows the required sequence of operations to place utility threads on Linux CPUs running under the same L2 cache.
- Run-time agrees on a unique key value to use across ranks within a node.
- Each rank creates a utility thread and specifies:
- The same location key value.
- Request Same L2
- Request FWK CPU type
- When the first utility thread is created, mOS will pick an appropriate Linux CPU and L2 cache.
- All subsequent utility threads created with the same key will be placed on Linux CPUs and share the same L2 cache.
- The mOS kernel will assign the utility threads balanced across the available CPUs that satisfy the location requested.
The UTI attribute object is an opaque object that contains the behavior and location information to be used by the kernel when a pthread is created. The definition of the fields within the object is OS-specific and purposely hidden from the user interface. This object is treated similarly to the pthread_attr object within the pthread library. It is passed to the uti_pthread_create() interface, along with the standard arguments passed to pthread_create(). The libmos.so library contains the functions used to prepare the attribute object for use.
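Based on the example later in this section, the creation interface appears to take the standard pthread_create() arguments plus the UTI attribute object; a sketch of the assumed declaration:

/* Assumed declaration, inferred from the usage example below. */
int uti_pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg,
                       uti_attr_t *uti_attr);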
The following function is provided for initializing the attribute object before use:
- int uti_attr_init(uti_attr_t *attr);
The following function is provided to destroy the attribute object:
- int uti_attr_destroy(uti_attr_t *attr);
This is the list of library functions used to set behaviors in the attribute object:
- int uti_attr_cpu_intensive(uti_attr_t *attr);
- CPU intensive thread, e.g. constant polling
- int uti_attr_high_priority(uti_attr_t *attr);
- Expects high scheduling priority
- int uti_attr_low_priority(uti_attr_t *attr);
- Expects low scheduling priority
- int uti_attr_non_cooperative(uti_attr_t *attr);
- Does not play nicely with others; yields and/or blocks infrequently
- int uti_attr_exclusive_cpu(uti_attr_t *attr);
- Expects to run on a dedicated CPU
This is the list of library functions used to set location in the attribute object:
- int uti_attr_numa_set(uti_attr_t *attr, unsigned long *nodemask, unsigned long maxnodes);
- int uti_attr_same_numa_domain(uti_attr_t *attr);
- int uti_attr_different_numa_domain(uti_attr_t *attr);
- int uti_attr_same_l1(uti_attr_t *attr);
- int uti_attr_different_l1(uti_attr_t *attr);
- int uti_attr_same_l2(uti_attr_t *attr);
- int uti_attr_different_l2(uti_attr_t *attr);
- int uti_attr_same_l3(uti_attr_t *attr);
- int uti_attr_different_l3(uti_attr_t *attr);
- int uti_attr_prefer_lwk(uti_attr_t *attr);
- int uti_attr_prefer_fwk(uti_attr_t *attr);
- int uti_attr_fabric_intr_affinity(uti_attr_t *attr);
- int uti_attr_location_key(uti_attr_t *attr, unsigned long key);
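As a hedged sketch of combining these setters (error handling follows the same if-chaining pattern as the example below; the helper name is illustrative), a background thread could request a Linux (FWK) CPU within the caller's NUMA domain:

#include <uti.h>

/* Illustrative helper: hint that the thread should run on a Linux (FWK) CPU
 * in the same NUMA domain as the caller, at low scheduling priority. */
static int setup_uti_hints(uti_attr_t *attr)
{
    int ret;
    if ((ret = uti_attr_init(attr)) ||
        (ret = uti_attr_same_numa_domain(attr)) ||
        (ret = uti_attr_prefer_fwk(attr)) ||
        (ret = uti_attr_low_priority(attr)))
        return ret;
    return 0;
}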
The uti_pthread_create interface will return EINVAL if conflicting or invalid specifications are provided in the UTI attributes. For example, EINVAL will be returned if 'Same L2' and 'Different L2' are both requested; in these cases, no thread is created. In other situations, where there is no obvious conflict, the thread will be created even if the requested location or behavior could not be satisfied. Location and behavior results can be determined using the interfaces listed below; the return values are 1 (true) and 0 (false). Setting pthread attributes should be done with caution, since they will override the actions/results provided by the UTI attributes.
- int uti_result_different_numa_domain(uti_attr_t *attr);
- int uti_result_same_l1(uti_attr_t *attr);
- int uti_result_different_l1(uti_attr_t *attr);
- int uti_result_same_l2(uti_attr_t *attr);
- int uti_result_different_l2(uti_attr_t *attr);
- int uti_result_same_l3(uti_attr_t *attr);
- int uti_result_different_l3(uti_attr_t *attr);
- int uti_result_prefer_lwk(uti_attr_t *attr);
- int uti_result_prefer_fwk(uti_attr_t *attr);
- int uti_result_fabric_intr_affinity(uti_attr_t *attr);
- int uti_result_exclusive_cpu(uti_attr_t *attr);
- int uti_result_cpu_intensive(uti_attr_t *attr);
- int uti_result_high_priority(uti_attr_t *attr);
- int uti_result_low_priority(uti_attr_t *attr);
- int uti_result_non_cooperative(uti_attr_t *attr);
- int uti_result_location(uti_attr_t *attr);
- int uti_result_behavior(uti_attr_t *attr);
- int uti_result(uti_attr_t *attr);
Note: if your application could be running concurrently with another application using the UTI API, you may need to generate a location key that does not mistakenly match the key in the other application. This example simply uses a statically defined key value.
#include <uti.h>
pthread_attr_t p_attr;
uti_attr_t uti_attr;
int ret;
..
/* Initialize the attribute objects */
if ((ret = pthread_attr_init(&p_attr)) ||
(ret = uti_attr_init(&uti_attr)))
goto uti_exit;
/* Request to put the thread on the same L2 as other utility threads.
* Also indicate that the thread repeatedly monitors a device.
*/
if ((ret = uti_attr_same_l2(&uti_attr)) ||
(ret = uti_attr_location_key(&uti_attr, 123456)) ||
(ret = uti_attr_cpu_intensive(&uti_attr)))
goto uti_exit;
/* Create the utility thread */
if ((ret = uti_pthread_create(idp, &p_attr, thread_start, thread_info, &uti_attr)))
goto uti_exit;
/* Did the system accept our location and behavior request? */
if (!uti_result(&uti_attr))
printf("Warning: utility thread attributes not honored.\n");
if ((ret = uti_attr_destroy(&uti_attr)))
goto uti_exit;
..
uti_exit:
Interactions between pthread_attr and uti_attr
Avoid the use of pthread_attr_setaffinity_np when specifying a location with the uti_attr object. The pthread_attr_setaffinity_np directive is prioritized over the uti_attr location requests. If valid CPUs are specified, this action may alter the placement directives requested by the UTI attributes object. If invalid CPUs are provided, this will result in the uti_pthread_create interface returning EINVAL with no utility thread created.
Avoid the use of pthread_attr_setschedparam and pthread_attr_setschedpolicy when specifying a behavior within the uti_attr object. These attributes are prioritized over the uti_attr behavior requests, and their usage may alter the actions that would have been taken based on the uti_attr behavior hints. A policy or param that is invalid for an mOS process will result in the uti_pthread_create interface returning EINVAL with no utility thread created.
The XPMEM implementation of mOS is derived from the open source XPMEM implementation at https://github.com/hjelmn/xpmem. It is compatible with the open source XPMEM implementation with respect to user APIs. The user API description can be found either in the open source XPMEM implementation at the link above or in the mOS XPMEM installation header files mentioned in the table below. In addition, a few fixes were made to the user-space XPMEM library; users can pick up these changes by re-building/re-linking their applications with the mOS user-space XPMEM library.
XPMEM component | Path to installation on mOS |
---|---|
Shared library | /usr/lib/libxpmem.so |
Header files | /usr/include/xpmem/ |
The XPMEM kernel module is loaded during kernel boot and is ready to use upon a successful boot, without any additional steps needed by the user.
The mOS XPMEM implementation supports huge pages for attached virtual memory. The use of huge pages for an attachment is subject to the constraints listed below. In the descriptions that follow, an owner process is any user process that shares part of its virtual address space for consumption by others, and a non-owner is any other user process that attaches the owner's shared address space into its own virtual address space and accesses the shared memory. The tables below list the constraints on huge page usage in non-owner attachments and provide recommendations to mitigate each scenario.
Constraints due to owner virtual memory | Recommendations |
---|---|
Usage of huge pages in the owner page table itself. | Map and share large LWK memory in the owner process. For example, for an XPMEM share, mmap() large areas (>2 MB) using the MAP_PRIVATE \| MAP_ANONYMOUS flags, or use brk(). |
Remapping of huge pages is not supported for Linux memory. For an LWK process, the memory for data/bss, brk and private anonymous mmaps is allocated out of LWK memory, and the rest of the process memory is allocated from Linux memory (e.g. the text area, file-backed mmaps, etc.). The huge pages supported by mOS XPMEM in the non-owner apply only to corresponding LWK memory in the owner process. | Avoid XPMEM sharing of Linux memory in the owner address space if large memory needs to be shared via XPMEM. The expectation is that an LWK process will use more LWK memory than Linux memory. |
The alignment of the start address of the shared segment. | Create a shared segment with a virtual start address aligned to a huge page boundary. Typically, when large memory is mapped through LWKMEM, the mapped address is already huge page aligned. |
The length of the shared segment. | Needs to be at least 2 MB or 4 MB, depending on the smallest huge TLB size supported by the hardware. |
Holes in the virtual address space covered by the shared segment. | It is recommended that the non-owner attach to the owner's shared address space only after it has been mapped by the owner, e.g. using mmap, mremap or brk. Note that one can still create an XPMEM share over virtual address space that is not yet mapped; the recommendation is only that an attachment be created after the part of the address space being attached to has been mapped. |
Recreation of memory maps with larger sizes that could potentially result in using higher-order huge pages. | The alignment of an XPMEM attachment in the non-owner largely depends on the corresponding owner address space at the time of attachment. If the corresponding owner address space changes, i.e. a previously existing map is unmapped and a new map is created with a larger size, it is recommended to detach the existing XPMEM attachment and create a new attachment to ensure that the attachment is aligned to the newly allocated huge page size in the owner. |

Constraints due to non-owner virtual memory | Recommendations |
---|---|
The length of the XPMEM attachment used. | Needs to be at least 2 MB or 4 MB, depending on the smallest huge TLB size supported by the hardware. |
A fixed start virtual address used for the attachment. | It is recommended that the application not use a fixed start address (MAP_FIXED) for an attachment, so that the kernel can choose the best huge page alignment for that attachment. |
The offset of the attachment. | Offsets that are not multiples of the huge page size can result in attaching to an unaligned virtual memory start in the owner address space, which in turn forces the remap to use smaller pages if the resulting start/end address falls in the middle of a huge page. |
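The following is a minimal sketch of following these recommendations, assuming the user API of the open source XPMEM implementation referenced above (the share size, permission value and helper names are placeholders; the header is installed under /usr/include/xpmem/):

#include <sys/mman.h>
#include <xpmem.h>                 /* installed under /usr/include/xpmem/ */

#define SHARE_SIZE (64UL << 20)    /* > 2 MB and a multiple of the huge page size */

/* Owner: map a large private anonymous region (backed by LWK memory under mOS,
 * typically huge-page aligned) and make it available over XPMEM. */
static xpmem_segid_t share_region(void **buf)
{
    *buf = mmap(NULL, SHARE_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (*buf == MAP_FAILED)
        return -1;
    return xpmem_make(*buf, SHARE_SIZE, XPMEM_PERMIT_MODE, (void *)0600);
}

/* Non-owner: attach at a huge-page-multiple offset without MAP_FIXED, so the
 * kernel is free to choose a huge-page-friendly alignment for the attachment. */
static void *attach_region(xpmem_segid_t segid)
{
    struct xpmem_addr addr;
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);

    if (apid < 0)
        return NULL;
    addr.apid = apid;
    addr.offset = 0;
    return xpmem_attach(addr, SHARE_SIZE, NULL);
}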
The yod option 'lwkmem-xpmem-stats' captures mOS XPMEM statistics while remapping huge pages. When this option is used, the statistics are printed to the kernel log (dmesg) by each LWK process at the end of the run. It can be used while running an LWK application to see whether the application ran into one of the constraints above.
Ex: yod -o lwkmem-xpmem-stats <application> <application args>
Some components of the mOS LWK are instrumented with the Linux ftrace facility.
Enabling/disabling trace events and dumping the trace buffer requires root permissions.
To enable tracing, write a '1' to the individual event's control file or to the global control file:
# To see all of the supported events:
$ ls /sys/kernel/debug/tracing/events/mos
# To enable just the "mos_clone_cpu_assign" event:
$ echo 1 > /sys/kernel/debug/tracing/events/mos/mos_clone_cpu_assign/enable
# To enable all mOS events:
$ echo 1 > /sys/kernel/debug/tracing/events/mos/enable
After you run, you can dump the trace buffer:
$ cat /sys/kernel/debug/tracing/trace
Tracing real workloads can easily overflow the trace ring buffer, resulting in the loss of data from earlier in the run. This can be worked around by routing the ftrace pipe into a file prior to initiating the workload:
$ cat /sys/kernel/debug/tracing/trace_pipe | tee my.ftrace