==================
NMI Manager
==================
This module allows you to Panic or Ignore specific NMI events.
Author: Adrien Mahieux <[email protected]>
Source: https://github.com/saruspete/nmimgr
Tools: https://github.com/saruspete/kdumptools
https://fr.slideshare.net/Saruspete/kernel-crashdump-53496836
------------
What is this
------------
Manage NMI events in a more fine-grained manner than the "unknown_nmi_panic" sysctl.
When a production host is unresponsive, we'd like to take a Kernel Dump for
offline issue analysis.
If kdump is correctly set up, we need to crash/panic the system for it to start.
This is usually done by sending an NMI to the system (as no userland process
is responding anymore) through the BMC (whose name and implementation are
vendor-specific).
But if no handler registers the vendor-specific NMI event to trigger a crash,
the kernel logs a "Dazed and confused, but trying to continue" message and
the server remains unresponsive.
-----------------------------------------
Why not just use the "nmi_panic" sysctls?
-----------------------------------------
There are 3 sysctls that allow administrators to generate a panic:
- panic_on_io_nmi
- panic_on_unrecovered_nmi
- unknown_nmi_panic
These sysctls are overkill, as NMIs can also be generated for non-critical
events like:
- Software debugging, like perf on Pentium processors
- External cards like FPGA to communicate
- Motherboard alerts of a dying Power-Supply
If you are fine with the current unknown_nmi_panic settings, this module can
also be used to ignore other NMIs during the dump process, even those that have
a kernel module handling them. This avoids interrupting the dump process, which
would otherwise leave an unusable coredump.
-------------
How to use it
-------------
Build it
--------
Build it for the current kernel:
# make
Or specify custom/multiple versions if you have a build environment:
# make 2.6.32-642.15.1.el6.x86_64 3.10.0-327.36.1.el7.x86_64 4.8.13-100.fc23.x86_64
Load the module (temporarily)
-----------------------------
Usage as a module (temp, insmod):
insmod nmimgr.ko events_panic=0,1,2,5-12,13,255 events_ignore=99
Check the logs
--------------
As an NMI should be a serious indicator, the module generates some logs at
startup and when handling a new NMI.
When trying new hardware, you may just load the module, generate an NMI from
the BMC and check dmesg for lines containing "nmimgr:", specifically the log
"Handling new NMI".
The code you are interested in (the event) is the decimal value between
parentheses. If you see this log: "Handling new NMI type:1 event:0x10 (16)"
then the event code generated is 16.
To make the system panic:
# insmod nmimgr.ko events_panic=16
To ignore it and disable messages:
# insmod nmimgr.ko events_ignore=16
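To avoid reading the event code by eye, the decimal value can be extracted
from the log with sed. This is a sketch that assumes log lines follow the
format of the example above ("Handling new NMI type:N event:0xHH (DD)"). The
function name is hypothetical.

```shell
# Read log lines on stdin and print the decimal event code of every
# "Handling new NMI" line. Other lines are silently skipped.
# Assumption: the log format matches the single example in this README.
nmimgr_event_code() {
    sed -n 's/.*Handling new NMI type:[0-9][0-9]* event:0x[0-9a-fA-F][0-9a-fA-F]* (\([0-9][0-9]*\)).*/\1/p'
}
```

Typical usage would be "dmesg | nmimgr_event_code", which prints one event
code per handled NMI, ready to be passed to events_panic or events_ignore.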
Add it permanently
------------------
Once you have checked that it works with your kernel, you can make it permanent:
1) Copy the module (file nmimgr.ko) into the extra modules directory:
# cp nmimgr.kmod.$(uname -r)/nmimgr.ko /lib/modules/$(uname -r)/extra/
2) Regenerate the module database:
# depmod -a
3) Set the parameters for modprobe:
# echo "options nmimgr events_panic=16" > /etc/modprobe.d/nmimgr.conf
4) Load it with modprobe:
# modprobe nmimgr
5) Check the parameters are correctly set (you should see your value):
# cat /sys/module/nmimgr/parameters/events_panic
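Step 3 only sets events_panic. If you also want events_ignore, both
parameters go on the same "options" line. The helper below is a small sketch
that builds that line. Its name and argument order are hypothetical, and it
does not validate the lists.

```shell
# Build the modprobe.d "options" line for nmimgr.
#   $1: LIST for events_panic  (may be empty)
#   $2: LIST for events_ignore (may be empty)
nmimgr_options_line() {
    line="options nmimgr"
    if [ -n "$1" ]; then
        line="$line events_panic=$1"
    fi
    if [ -n "$2" ]; then
        line="$line events_ignore=$2"
    fi
    printf '%s\n' "$line"
}
```

For example, "nmimgr_options_line 16 99 > /etc/modprobe.d/nmimgr.conf" would
write a single line setting both parameters.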
Parameters
----------
- events_panic=LIST   Events that make the kernel panic
- events_ignore=LIST  Events to drop, so no other handler can process them
LIST is a standard kernel list, which can be composed of:
- simple lists: 0,13,16,44,10
- ranges: 10-100
- Mix of both: 0,1,2-8,10
If you build it into your kernel, you can configure it on the boot command line:
nmimgr.events_panic=0,1,2,5-12,13,255 nmimgr.events_ignore=99
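To double-check which events a LIST actually covers, it can be expanded into
individual numbers. The sketch below follows the syntax described above
(comma-separated values and A-B ranges). It is not the kernel's own parser,
and the function name is hypothetical.

```shell
# Expand a LIST such as "0,1,2-8,10" into one event number per line.
# Assumption: items are non-negative decimals, ranges are written A-B.
expand_event_list() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        if [ -n "$last" ]; then
            # Range item: print every value from first to last, inclusive
            i=$first
            while [ "$i" -le "$last" ]; do
                printf '%s\n' "$i"
                i=$((i + 1))
            done
        else
            # Single value
            printf '%s\n' "$first"
        fi
    done
}
```

For example, "expand_event_list 0,1,2-8,10 | grep -qx 5" succeeds because 5
falls inside the 2-8 range.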
--------------
How to test it
--------------
Generate an NMI
---------------
- ipmitool chassis power diag
- vboxmanage debugvm "VMName" injectnmi
- virsh inject-nmi "VMName"
Commonly generated NMI events (in decimal, to be used as module parameters):
- HP iLO    : 32,48
- Dell iDRAC: 32,33,48,49
- IBM       : 44,60
- VirtualBox: 0,16,32,48
-----------------------
Kernel Revision history
-----------------------
2.6.32: Using notifier_block structs
3.2 : Moved NMI descriptions to an enum: LOCAL, UNKNOWN, MAX
https://lwn.net/Articles/461215/
https://lkml.org/lkml/2012/3/8/386
3.5 : Moved "register_nmi_handler" to a macro + static struct nmiaction fn##_na
This broke the loop logic used between 3.2 and 3.5