Stats Data
----------

### Raw stats data: generated by `hpcperfstats`

A raw stats file consists of a multiline header, followed by one or more record groups. The first few lines of the header identify the version of hpcperfstats, the FQDN of the host, its uname, its uptime in seconds, and other properties to be specified.

    $hpcperfstats 1.0.2
    $hostname i101-101.ranger.tacc.utexas.edu
    $uname Linux x86_64 2.6.18-194.32.1.el5_TACC #18 SMP Mon Mar 14 22:24:19 CDT 2011
    $uptime 4753669

These are followed by schema descriptors for each of the types collected:

    !amd64_pmc CTL0,C CTL1,C CTL2,C CTL3,C CTR0,E,W=48 CTR1,E,W=48 CTR2,E,W=48 CTR3,E,W=48
    !cpu user,E,U=cs nice,E,U=cs system,E,U=cs idle,E,U=cs iowait,E,U=cs irq,E,U=cs softirq,E,U=cs
    !lnet tx_msgs,E rx_msgs,E rx_msgs_dropped,E tx_bytes,E,U=B rx_bytes,E,U=B rx_bytes_dropped,E
    !ps ctxt,E processes,E load_1 load_5 load_15 nr_running nr_threads
    ...

A schema descriptor consists of the character '!' followed by the type, followed by a space-separated list of elements. Each element consists of a key name, followed by a comma-separated list of options; the options currently used are:

- E meaning that the counter is an event counter,
- W=<BITS> meaning that the counter is <BITS> wide (as opposed to 64),
- C meaning that the value is a control register, not a counter,
- U=<STR> meaning that the value is in units specified by <STR>.

Note especially the event and width options. Certain counters, such as the performance counters, are subject to rollover, and as such their widths must be known for the values to be interpreted correctly.

\warning The archived stats files do not account for rollover. This task is left for postprocessing.

A record group consists of a blank line, a line containing the epoch time of the record and the current jobid, zero or more lines of marks (each starting with the % character), and several lines of statistics.


    1307509201 1981063
    %begin 1981063
    amd64_pmc 11 4259958 4391234 4423427 4405240 235835341001110 187269740525248 62227761639015 177902917871843
    amd64_pmc 10 4259958 4391234 4405239 4423427 221601328309784 187292967300939 47879507215852 174113618669738
    amd64_pmc 13 4259958 4405238 4391234 4423427 211997466129346 215850892876689 2218837366391 233806061617899
    amd64_pmc 12 4392928 4259958 4391234 4423427 6782043270201 102683296940807 2584394368284 174209034378272
    ...
    cpu 11 429720418 0 1685980 43516346 447875 155 3443
    cpu 10 429988676 0 1675476 43150935 559410 8 283
    ...
    net ib0 0 0 55915434547 0 0 0 0 0 0 0 0 0 159301288 0 46963995550 0 0 97 0 0 0 31404022 0
    ...
    ps - 4059349377 507410 1600 1600 1600 18 373
    ...

Each line of statistics contains the type (amd64_pmc, cpu, net, ps, ...), the device (11, 10, 13, 12, ..., ib0, -, ...), followed by the counter values in the order given by the schema. Note that when we cannot meaningfully attach statistics to a device, we use '-' as the device name.

### `TYPES`

## Miscellaneous Information

There is a large variety of data collected and summarized below:

- `amd64_pmc`: AMD Opteron performance counters (per core)
- `intel_hsw`: Intel Haswell Processor (HSW) (per core)
- `intel_hsw_ht`: Intel Haswell Processor - Hyper-threaded (per logical core)
- `intel_nhm`: Intel Nehalem Processor (NHM) (per core)
- `intel_uncore`: Westmere Uncore (WTM) (per socket)
- `intel_snb`: Intel Sandy Bridge (SNB) or Ivy Bridge (IVB) Processor (per core)
- `intel_snb(hsw)_cbo`: Caching Agent (CBo) for SNB (HSW) (per socket)
- `intel_snb(hsw)_pcu`: Power Control Unit for SNB (HSW) (per socket)
- `intel_snb(hsw)_imc`: Integrated Memory Controller for SNB (HSW) (per socket)
- `intel_snb(hsw)_qpi`: QPI Link Layer for SNB (HSW) (per socket)
- `intel_snb(hsw)_hau`: Home Agent Unit for SNB (HSW) (per socket)
- `intel_snb(hsw)_r2pci`: Ring to PCIe Agent for SNB (HSW) (per socket)
- `ib`: InfiniBand usage
- `ib_sw`: InfiniBand usage
- `ib_ext`: InfiniBand usage
- `llite`: Lustre filesystem usage (per mount)
- `lnet`: Lustre network usage
- `mdc`: Lustre network usage
- `mic`: MIC scheduler accounting (per hardware thread)
- `osc`: Lustre filesystem usage
- `block`: block device statistics (per device)
- `cpu`: scheduler accounting (per CPU)
- `mem`: memory usage (per socket)
- `net`: network device usage (per device)
- `nfs`: NFS system usage
- `numa`: weird NUMA statistics (per socket)
- `proc`: process-specific data (MaxRSS, executable name, etc.)
- `ps`: process statistics
- `sysv_shm`: SysV shared memory segment usage
- `tmpfs`: RAM-backed filesystem usage (per mount)
- `vfs`: dentry/file/inode cache usage
- `vm`: virtual memory statistics

For the source and meanings of the counters, see the hpcperfstats source `https://github.com/TACC/hpcperfstats`, the CentOS 5.6 kernel source, especially `Documentation/*`, and the manpages, especially proc(5).

\note All chip architecture related types are checked for existence at run time. Therefore, it is unnecessary for the user to filter for the types listed above; they will be filtered at run time. This should also work well for systems composed of multiple types of chip architectures.

\warning Due to a bug in Lustre, llite overreports read_bytes.

\warning Some event counters (from ib_sw, numa, and possibly others) suffer from occasional dips. This may be due to non-atomic accesses in the (kernel) code that presents the counter, a bug in hpcperfstats, or some other condition. Spurious rollover is easy to detect, however, because a naive adjustment produces a ridiculously large delta.

\warning We never reset counters; thus, to determine the number of events that occurred during a job, you must subtract the value at begin from the value at end.

\warning Due to a quirk in the Opteron performance counter architecture, we do not assign the same set of events to each core; see `amd64_pmc.c` in the hpcperfstats source for details.
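
The schema options, the rollover caveat, and the begin/end subtraction described above could be combined in postprocessing roughly as follows. This is a minimal sketch in Python, not code taken from hpcperfstats: `SchemaEntry`, `parse_schema`, `event_delta`, and `job_deltas` are hypothetical names introduced here, and the rollover correction assumes a counter wraps at most once between the two samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class SchemaEntry:
    """One element of a schema descriptor (hypothetical helper type)."""
    key: str
    is_event: bool = False      # option E
    is_control: bool = False    # option C
    width: int = 64             # option W=<BITS>; 64 when absent
    unit: Optional[str] = None  # option U=<STR>


def parse_schema(line: str) -> Tuple[str, List[SchemaEntry]]:
    """Parse a '!type key,opt,... key,opt,...' schema descriptor line."""
    if not line.startswith("!"):
        raise ValueError("schema descriptors start with '!'")
    type_name, *elements = line[1:].split()
    entries = []
    for element in elements:
        key, *options = element.split(",")
        entry = SchemaEntry(key)
        for opt in options:
            if opt == "E":
                entry.is_event = True
            elif opt == "C":
                entry.is_control = True
            elif opt.startswith("W="):
                entry.width = int(opt[2:])
            elif opt.startswith("U="):
                entry.unit = opt[2:]
        entries.append(entry)
    return type_name, entries


def event_delta(begin: int, end: int, width: int = 64) -> int:
    """Events between two samples, assuming at most one rollover."""
    return (end - begin) % (1 << width)


def job_deltas(entries: Sequence[SchemaEntry],
               begin_values: Sequence[int],
               end_values: Sequence[int]) -> Dict[str, int]:
    """Per-key deltas for event counters; non-event values are skipped."""
    return {
        e.key: event_delta(b, v, e.width)
        for e, b, v in zip(entries, begin_values, end_values)
        if e.is_event
    }
```

Taking the modulus by the counter width turns a negative difference into the wrapped-around count. A counter that merely dipped, rather than rolled over, therefore shows up as a delta close to 2^width, which a caller can flag as implausible instead of trusting it.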