megacli and megaclisas-status kill controller's FW #136

Open
romeor opened this issue Nov 24, 2022 · 16 comments

Comments

@romeor

romeor commented Nov 24, 2022

Hello,
I've installed megacli and megaclisas-status from your repository and ran into an issue with my hardware. First, my HW:

Linux pve2 5.19.17-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-1 (Mon, 14 Nov 2022 20:25:12  x86_64 GNU/Linux
18:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx

The RAID controller is a 3916, to be precise, running the latest FW:

Firmware Package Build = 52.22.0-4571
Firmware Version = 5.220.02-3691
PSOC FW Version = 0x0017
PSOC Part Number = 15987-231-8GB
NVDATA Version = 5.2200.21-0585
CBB Version = 23.25.01.00
Bios Version = 7.22.00.0_0x07160300
HII Version = 07.22.03.00
HIIA Version = 07.22.03.00
Driver Name = megaraid_sas
Driver Version = 07.719.03.00-rc1


System Information
        Manufacturer: Supermicro
        Product Name: SYS-110P-WTR

The issue: as soon as I run

megacli -AdpAllInfo -aALL or megaclisas-status (or the periodic run of megaclisas-statusd)

my system freezes for a while. I was not able to read from or write to disk, and dmesg was full of these errors:

[ 1661.722811] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1661.722829] megaraid_sas 0000:18:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion
[ 1661.722848] megaraid_sas 0000:18:00.0: resetting fusion adapter scsi0.
[ 1661.723202] megaraid_sas 0000:18:00.0: Outstanding fastpath IOs: 4
[ 1668.382749] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
[ 1691.286479] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1691.286483] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1691.286684] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 5101       LDIO threshold: 0
[ 1691.286687] megaraid_sas 0000:18:00.0: Performance mode :Balanced (latency index = 8)
[ 1691.286688] megaraid_sas 0000:18:00.0: FW supports sync cache        : Yes
[ 1691.286691] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1691.398489] megaraid_sas 0000:18:00.0: FW supports atomic descriptor : Yes
[ 1693.890459] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1       max_lds: 240
[ 1693.890471] megaraid_sas 0000:18:00.0: controller type       : MR(8192MB)
[ 1693.890476] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)  : Enabled
[ 1693.890479] megaraid_sas 0000:18:00.0: Secure JBOD support   : Yes
[ 1693.890482] megaraid_sas 0000:18:00.0: NVMe passthru support : Yes
[ 1693.890484] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[ 1693.890485] megaraid_sas 0000:18:00.0: JBOD sequence map support     : Yes
[ 1693.890486] megaraid_sas 0000:18:00.0: PCI Lane Margining support    : Yes
[ 1701.562362] megaraid_sas 0000:18:00.0: megasas_get_ld_map_info DCMD timed out, RAID map is disabled
[ 1708.170289] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
[ 1728.026073] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1728.026077] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1728.026300] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 5101       LDIO threshold: 0
[ 1728.026303] megaraid_sas 0000:18:00.0: Performance mode :Balanced (latency index = 8)
[ 1728.026304] megaraid_sas 0000:18:00.0: FW supports sync cache        : Yes
[ 1728.026306] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1728.402068] megaraid_sas 0000:18:00.0: FW supports atomic descriptor : Yes
[ 1728.550065] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1       max_lds: 240
[ 1728.550068] megaraid_sas 0000:18:00.0: controller type       : MR(8192MB)
[ 1728.550069] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)  : Enabled
[ 1728.550070] megaraid_sas 0000:18:00.0: Secure JBOD support   : Yes
[ 1728.550071] megaraid_sas 0000:18:00.0: NVMe passthru support : Yes
[ 1728.550072] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[ 1728.550074] megaraid_sas 0000:18:00.0: JBOD sequence map support     : Yes
[ 1728.550074] megaraid_sas 0000:18:00.0: PCI Lane Margining support    : Yes
[ 1736.149985] megaraid_sas 0000:18:00.0: megasas_get_ld_map_info DCMD timed out, RAID map is disabled
[ 1742.837909] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
[ 1762.581695] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1762.581700] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1762.581901] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 5101       LDIO threshold: 0
[ 1762.581904] megaraid_sas 0000:18:00.0: Performance mode :Balanced (latency index = 8)
[ 1762.581905] megaraid_sas 0000:18:00.0: FW supports sync cache        : Yes
[ 1762.581907] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1762.985689] megaraid_sas 0000:18:00.0: FW supports atomic descriptor : Yes
[ 1763.145688] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1       max_lds: 240
[ 1763.145690] megaraid_sas 0000:18:00.0: controller type       : MR(8192MB)
[ 1763.145692] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)  : Enabled
[ 1763.145693] megaraid_sas 0000:18:00.0: Secure JBOD support   : Yes
[ 1763.145694] megaraid_sas 0000:18:00.0: NVMe passthru support : Yes
[ 1763.145695] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[ 1763.145697] megaraid_sas 0000:18:00.0: JBOD sequence map support     : Yes
[ 1763.145698] megaraid_sas 0000:18:00.0: PCI Lane Margining support    : Yes
[ 1763.145699] megaraid_sas 0000:18:00.0: return -EBUSY from megasas_refire_mgmt_cmd 4362 cmd 0x5 opcode 0x10b0100
[ 1763.145732] megaraid_sas 0000:18:00.0: return -EBUSY from megasas_mgmt_fw_ioctl 8408 cmd 0x5 opcode 0x10b0100 cmd->cmd_status_drv 0x3
[ 1763.145782] megaraid_sas 0000:18:00.0: waiting for controller reset to finish
[ 1763.205697] megaraid_sas 0000:18:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[ 1763.205984] megaraid_sas 0000:18:00.0: Adapter is OPERATIONAL for scsi:0
[ 1763.206131] megaraid_sas 0000:18:00.0: Snap dump wait time   : 15
[ 1763.206132] megaraid_sas 0000:18:00.0: Reset successful for scsi0.
[ 1763.206295] megaraid_sas 0000:18:00.0: 10672 (722633074s/0x0020/DEAD) - Fatal firmware error: Line 188 in fw\raid\utils.c

[ 1763.206572] megaraid_sas 0000:18:00.0: 10675 (722633081s/0x0020/CRIT) - Controller encountered an error and was reset
[ 1763.211401] megaraid_sas 0000:18:00.0: scanning for scsi0...
[ 1763.211666] megaraid_sas 0000:18:00.0: 10719 (722633106s/0x0020/DEAD) - Fatal firmware error: Line 188 in fw\raid\utils.c

[ 1763.211963] megaraid_sas 0000:18:00.0: 10722 (722633113s/0x0020/CRIT) - Controller encountered an error and was reset
[ 1763.218960] megaraid_sas 0000:18:00.0: scanning for scsi0...
[ 1763.221603] megaraid_sas 0000:18:00.0: 10765 (722633133s/0x0020/DEAD) - Fatal firmware error: Line 188 in fw\raid\utils.c

[ 1763.221742] megaraid_sas 0000:18:00.0: 10768 (722633140s/0x0020/CRIT) - Controller encountered an error and was reset
[ 1763.226380] megaraid_sas 0000:18:00.0: scanning for scsi0...
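
For anyone trying to confirm they are hitting the same reset loop, here is a minimal sketch that counts the tell-tale messages in a dmesg dump. The three patterns are copied from the excerpt above; the function and dictionary names are made up for illustration, and other firmware builds may word these messages differently.

```python
import re

# Patterns taken from the dmesg excerpt in this issue (assumption: the
# wording is the same on other megaraid_sas driver/firmware versions).
FAULT_PATTERNS = {
    "fw_fault": re.compile(r"FW in FAULT state Fault code:(0x[0-9a-fA-F]+)"),
    "fatal_fw_error": re.compile(r"Fatal firmware error: (.+)$"),
    "controller_reset": re.compile(r"Controller encountered an error and was reset"),
}


def summarize_megaraid_faults(dmesg_text):
    """Count megaraid_sas fault messages in a block of dmesg output."""
    counts = {name: 0 for name in FAULT_PATTERNS}
    for line in dmesg_text.splitlines():
        # Ignore kernel messages from other drivers.
        if "megaraid_sas" not in line:
            continue
        for name, pattern in FAULT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feeding it the output of `dmesg` (for example captured with `subprocess.run(["dmesg"], capture_output=True, text=True)`) should give non-zero counts on an affected system.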

Nothing happens with megaraidsas-status and the latest storcli, which I got from the Broadcom site.

Could you please fix this or add storcli? An Ubuntu package is available from the Broadcom site: https://www.broadcom.com/products/storage/raid-controllers/megaraid-9560-16i

@ElCoyote27
Contributor

Hi,
On such a recent kernel and controller, perhaps megacli no longer works (that binary has not been updated in years). Could you try using storcli? megaclisas-status supports both.
Thanks,
Vincent

@romeor
Author

romeor commented Dec 3, 2022

Hi,
The package pulls in megacli as a dependency, and as soon as megaclisas-statusd starts, the server hangs and the FW crashes. When I delete the megacli string from the megaclisas-status script and run it, it says no controller was found:

# megaclisas-status
No MegaRAID or PERC adapter detected on your system!

while running storcli against the RAID controller shows it is OK:

storcli /c0 /vall show
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.19.17-1-pve
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name
---------------------------------------------------------------
1/238 RAID5 Optl  RW     Yes     RAWBD -   ON   12.221 TB DATA
0/239 RAID1 Optl  RW     Yes     RAWBD -   ON  223.062 GB OS
--------------------------------------------------------------
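
For reference, the new-style table above is easy enough to pick apart. This is only a sketch, assuming the layout shown here (three dashed rules, a two-token Size column, Name last); `parse_vd_rows` and `VD_COLUMNS` are hypothetical names, not part of megaclisas-status.

```python
# Leading columns of the "Virtual Drives" table as printed above.
VD_COLUMNS = ["DG/VD", "TYPE", "State", "Access", "Consist", "Cache", "Cac", "sCC"]


def parse_vd_rows(storcli_output):
    """Parse the 'Virtual Drives' table from `storcli /c0 /vall show` text output."""
    rows = []
    rules_seen = 0
    in_section = False
    for line in storcli_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Virtual Drives"):
            in_section = True
            continue
        if not in_section:
            continue
        if stripped.startswith("---"):
            rules_seen += 1
            if rules_seen == 3:  # closing rule: the table is finished
                break
            continue
        # Data rows sit between the second and third dashed rule.
        if rules_seen != 2 or not stripped:
            continue
        fields = stripped.split()
        if len(fields) < 10:
            continue
        row = dict(zip(VD_COLUMNS, fields[:8]))
        row["Size"] = " ".join(fields[8:10])   # e.g. "12.221 TB"
        row["Name"] = " ".join(fields[10:])    # may be empty
        rows.append(row)
    return rows
```

If the storcli build in use offers a JSON output mode, parsing that would likely be more robust than splitting on whitespace.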

@ElCoyote27
Contributor

Here are my recommendations:

  1. Uninstall 'megacli' from your system so that the script no longer finds it. megacli is what crashes your system; megaclisas-status merely calls it because it is the first CLI found in the PATH.
  2. Type 'which storcli' to check where in the PATH that CLI is.
  3. Run megaclisas-status with '--debug' and paste the output here.
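
Step 2 can also be checked from Python. A minimal sketch, assuming the binaries carry the names quoted later in this thread (megacli deliberately left out); `find_raid_cli` is a hypothetical helper, not a function from megaclisas-status.

```python
import shutil

# Candidate binary names, in the order quoted later in this thread.
CANDIDATES = ("perccli64", "perccli", "storcli64", "storcli")


def find_raid_cli(candidates=CANDIDATES, path=None):
    """Return the full path of the first candidate found on PATH, or None.

    `path` overrides the PATH string to search; None means use os.environ.
    """
    for name in candidates:
        found = shutil.which(name, path=path)
        if found:
            return found
    return None
```

A result of `None` means none of the new-style CLIs is reachable through the PATH the script sees, which can differ from an interactive shell's PATH.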

@romeor
Author

romeor commented Dec 6, 2022

Hello,

I am unable to install megaclisas-status without megacli, and I can't remove megacli without removing megaclisas-status. They depend on each other. If I install megaclisas-status right away, my system will crash again, because it also installs megaclisas-statusd, which starts right after installation and calls megacli...

@ElCoyote27
Contributor

This must be because you're using your package manager and it has dependencies which co-bundle the two things together.
megaclisas-status is just a self-contained script that uses either megacli or storcli. In your situation, I would remove megacli, since it crashes your system, and just use the plain megaclisas-status script with storcli.
You could install and distribute megaclisas-status outside of your package manager as it is only a script.

@romeor
Author

romeor commented Jan 31, 2023

Hello again.

It seems like your wrapper is not working with the newer storcli binary.

I've installed storcli from the server manufacturer's site (Supermicro)
and modified your script

os.environ["PATH"] += os.pathsep + "/usr/bin/storcli"
# Find MegaCli
for megabin in "perccli64", "perccli", "storcli64", "storcli":

to exclude megacli from the search.

# megaclisas-status
No MegaRAID or PERC adapter detected on your system!

please update
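
A side note on the PATH line in the modification above: PATH entries are directories, so appending the binary's own path (`/usr/bin/storcli`) does not add anything to the lookup. That is probably not what causes the "No MegaRAID or PERC adapter detected" message, since `/usr/bin` is normally on PATH already, but a sketch of the directory-based variant looks like this (`path_with_binary_dir` is a hypothetical helper):

```python
import os


def path_with_binary_dir(binary_path, current_path):
    """Return a PATH string extended with the directory holding binary_path."""
    directory = os.path.dirname(binary_path)
    parts = current_path.split(os.pathsep) if current_path else []
    # Only append the directory if it is not already being searched.
    if directory and directory not in parts:
        parts.append(directory)
    return os.pathsep.join(parts)
```

In the script this would be used as `os.environ["PATH"] = path_with_binary_dir("/usr/bin/storcli", os.environ.get("PATH", ""))`.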

@ElCoyote27
Contributor

Hi,
I just got an H750P and I've noticed the following behaviour:

  • The old MegaCLI binary hangs the system (on RHEL8).
  • The old perccli (1.11 from 2014) which supports the Legacy MegaCLI syntax -also- hangs the system.
  • Only the new perccli (perccli-007.1623.0000.0000-1.noarch from 2020) does not hang the system

Unfortunately, the latest perccli/storcli no longer supports the old Legacy MegaCLI syntax so I guess we'll have to rewrite many parts of megaclisas-status.. (Maybe create a percclisas-status?)

:(

@Leox0717

@romeor Hello, I encountered the exact same issue, while I only have storcli on my machine. Did you solve the problem?

@romeor
Author

romeor commented Mar 18, 2024

Hello, @Leox0717

uninstall megacli and install storcli

@andrewladlow

Sorry to bump this, I was just Googling for Line 188 in fw\raid\utils.c from dmesg and came across this - am using a Dell H750 card and made a few changes to the script to suit: https://gist.github.com/andrewladlow/9f4d03aab8ef0e957343b65ee6638c3a

Tested using perccli 007.0127, example output:

megaclisas-status
-- Controller information --
-- ID | H/W Model         | RAM    | Temp | BBU    | Firmware
c0    | PERC H750 Adapter | 8192MB | 42C  | Good   | FW: 52.21.0-4606

-- Array information --
-- ID  | Type   |    Size |  Strpsz | Flags | DskCache |   Status |  OS Path | CacheCade |InProgress
c0u239 | RAID-6 |  87313G |  256 KB | RA,WB |  Enabled |  Optimal |      239 | None      |None

-- Disk information --
-- ID     | Type | Drive Model                        | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID
c0u239p0  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9R9B | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:0]   | 23
c0u239p1  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P97HF | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:1]   | 21
c0u239p2  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9QHE | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:2]   | 25
c0u239p3  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9XY9 | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 27C  | [64:3]   | 24
c0u239p4  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9RRF | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:4]   | 22
c0u239p5  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P8GSL | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:5]   | 20
c0u239p6  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9Z4Q | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 27C  | [64:6]   | 18
c0u239p7  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P82Z8 | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 27C  | [64:7]   | 19

Not sure what the text would actually be for the BBU if it were to fail, just used [A-Za-z].* as a bit of a guess but this could end up not matching

@ElCoyote27
Contributor

@andrewladlow Wow, that's great! I have an H750P too, let me try your version.

@ElCoyote27
Contributor

Unfortunately, later versions of perccli removed the 'megacli' compatibility mode:

# rpm -q perccli
perccli-007.0127.0000.0000-1.noarch
# ./megaclisas-status 
-- Controller information --
-- ID | H/W Model         | RAM    | Temp | BBU    | Firmware     
c0    | PERC H750 Adapter | 8192MB | 49C  | Good   | FW: 52.26.0-5179 

-- Array information --
-- ID  | Type   |    Size |  Strpsz |   Flags | DskCache |   Status |  OS Path | CacheCade |InProgress   
c0u239 | RAID-0 |   1818G |  512 KB | ADRA,WB |  Enabled |  Optimal |      239 | None      |None         

-- Disk information --
-- ID     | Type | Drive Model                                      | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID  
c0u239p0  | SSD  | S620NG0R208075X Samsung SSD 870 EVO 2TB SVT02B6Q | 1.818 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:0]   | 8       

but if I upgrade perccli:

# rpm -q perccli
perccli-007.1910.0000.0000-1.noarch
# ./megaclisas-status 
No MegaRAID or PERC adapter detected on your system!
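
To make the version dependence explicit, here is a rough sketch that classifies a perccli/storcli version against the two data points above (007.0127 accepts the legacy syntax, 007.1910 rejects it). The exact cutoff is unknown, so anything in between is reported as "unknown", and treating everything older than 007.0127 as accepting and everything newer than 007.1910 as rejecting is an extrapolation. All names are hypothetical.

```python
import re


def parse_cli_version(text):
    """Extract a dotted four-part version from a 'CLI Version = ...' line
    or an rpm package name, as a tuple of ints (None if not found)."""
    match = re.search(r"(\d+(?:\.\d+){3})", text)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# The two versions reported in this thread.
LEGACY_OK = parse_cli_version("perccli-007.0127.0000.0000-1.noarch")
LEGACY_BROKEN = parse_cli_version("perccli-007.1910.0000.0000-1.noarch")


def legacy_syntax_status(version):
    """Return 'accepted', 'rejected' or 'unknown' for a version tuple."""
    if version is None:
        return "unknown"
    if version <= LEGACY_OK:       # assumption: older builds behave like 007.0127
        return "accepted"
    if version >= LEGACY_BROKEN:   # assumption: newer builds behave like 007.1910
        return "rejected"
    return "unknown"
```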

@ElCoyote27
Contributor

@andrewladlow There's an updated version here, btw:
https://github.com/ElCoyote27/hwraid/blob/master/wrapper-scripts/megaclisas-status
You seem to be using 1.78 and I have 1.87 in my fork.

@andrewladlow

Ah yeah I see what you mean, the script is trying to do -adpCount -NoLog but with the more recent version you just get:

CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 6.1.0-20-amd64
Status = Failure
Description = Deprecated command. Please use the new syntax.

The equivalent command seems to be show ctrlcount, but if you change that in the script you'll hit a similar syntax error when it tries to run -PDGetNum -a0 -NoLog for returnTotalDriveNumber (and so on), shame that it doesn't just accept the older syntax 😅

Thanks for mentioning the version by the way, didn't realise! Mine's from the Debian repo so must be a tad outdated by now
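
A wrapper could at least detect this case and report it instead of claiming that no adapter is present. A minimal sketch, assuming the "Key = Value" header format shown in the output above; both function names are hypothetical.

```python
def parse_cli_header(output):
    """Collect 'Key = Value' pairs from storcli/perccli text output.

    The first occurrence of each key wins.
    """
    header = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            header.setdefault(key.strip(), value.strip())
    return header


def is_deprecated_syntax(output):
    """True if the CLI rejected the command as legacy MegaCLI syntax."""
    header = parse_cli_header(output)
    return (
        header.get("Status") == "Failure"
        and "Deprecated command" in header.get("Description", "")
    )
```

With a check like this, the script could print a hint such as "this CLI version no longer accepts MegaCLI syntax" rather than the misleading "No MegaRAID or PERC adapter detected".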

@ElCoyote27
Contributor

What other changes did you add? I could only identify a %-5s vs %-6s on line 769.
If you create a PR against my branch I'll review it.
I know @eLvErDe has been super busy these past years so I have no idea if he'd be able to review a PR against the upstream.

@ElCoyote27
Contributor

If you run the script with --debug, you'll see the commands it executes:

e.g:

# megaclisas-status --debug 2>&1|grep perccli64|sort -u
# DEBUG (130) : Will use this executable: /opt/MegaRAID/perccli/perccli64
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l239 -a0 -NoLog
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -lall -a0 -NoLog
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -LdPdInfo -a0 -NoLog
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -PDGetNum -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -AdpAllInfo -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -AdpBbuCmd -GetBbuStatus -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -adpCount -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -AdpGetPciInfo -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l0 -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l100 -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l101 -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l102 -a0 -NoLog

All of these would have to be rewritten for the newest perccli and the patterns/logic would need to be adjusted too.
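
As a starting point for that rewrite, the debug output can be boiled down to the list of legacy commands that still need a new-syntax equivalent. This is a sketch only: it assumes the `# DEBUG (...)` line format shown above, and the single entry in `KNOWN_REPLACEMENTS` is the one mapping mentioned in this thread. All names are hypothetical.

```python
import re

# Matches the "Got Cached value" / "Not a Cached value" debug lines above.
DEBUG_LINE = re.compile(
    r"^# DEBUG \(\d+\) : (?:Got Cached value|Not a Cached value): (\S+) (.+)$"
)

# Only mapping confirmed in this thread; everything else is still to be done.
KNOWN_REPLACEMENTS = {"-adpCount": "show ctrlcount"}


def legacy_commands(debug_output):
    """Return the sorted, de-duplicated legacy commands (first CLI argument)."""
    names = set()
    for line in debug_output.splitlines():
        match = DEBUG_LINE.match(line.strip())
        if match:
            # group(2) holds the arguments, e.g. "-LDInfo -l239 -a0 -NoLog".
            names.add(match.group(2).split()[0])
    return sorted(names)


def unported_commands(debug_output):
    """Legacy commands that have no known new-syntax replacement yet."""
    return [c for c in legacy_commands(debug_output) if c not in KNOWN_REPLACEMENTS]
```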
