Describe the bug
After a few months of running just fine, Scrutiny suddenly started showing that all of my cciss drives are failing, which isn't true.
In my setup, /dev/sda is recognised by smartctl --scan, but /dev/sdb, /dev/sdc, /dev/sdd are HP Smart Array drives, so I configured them in collector.yaml like so:
version: 1
# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""
# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  - device: /dev/sdb
    type:
      #- 'cciss,0'
      - 'cciss,1'
      #- 'cciss,2'
  - device: /dev/sdc
    type:
      - 'cciss,0'
      #- 'cciss,1'
      #- 'cciss,2'
  - device: /dev/sdd
    type:
      #- 'cciss,0'
      #- 'cciss,1'
      - 'cciss,2'
Now, for example, running smartctl -a /dev/sdb -d cciss,1 shows that the drive is ok:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-101-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: WD Blue / Red / Green SSDs
Device Model: WDC WDS200T1R0A-68A4W0
Serial Number: 2035C4440507
LU WWN Device Id: 5 001b44 4a75fe2b9
Firmware Version: 411000WR
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Mar 29 00:51:15 2024 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 --- Old_age Always - 26916
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 78
165 Block_Erase_Count 0x0032 100 100 --- Old_age Always - 408220862923
166 Minimum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 4
167 Max_Bad_Blocks_per_Die 0x0032 100 100 --- Old_age Always - 53
168 Maximum_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 53
169 Total_Bad_Blocks 0x0032 100 100 --- Old_age Always - 948
170 Grown_Bad_Blocks 0x0032 100 100 --- Old_age Always - 0
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Average_PE_Cycles_TLC 0x0032 100 100 --- Old_age Always - 28
174 Unexpected_Power_Loss 0x0032 100 100 --- Old_age Always - 76
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 0
194 Temperature_Celsius 0x0022 072 040 --- Old_age Always - 28 (Min/Max 9/40)
199 UDMA_CRC_Error_Count 0x0032 100 100 --- Old_age Always - 0
230 Media_Wearout_Indicator 0x0032 003 003 --- Old_age Always - 0x035502500355
232 Available_Reservd_Space 0x0033 100 100 004 Pre-fail Always - 100
233 NAND_GB_Written_TLC 0x0032 100 100 --- Old_age Always - 60076
234 NAND_GB_Written_SLC 0x0032 100 100 --- Old_age Always - 78265
241 Host_Writes_GiB 0x0030 253 253 --- Old_age Offline - 51959
242 Host_Reads_GiB 0x0030 253 253 --- Old_age Offline - 22004
244 Temp_Throttle_Status 0x0032 000 100 --- Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
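The same manual check can be repeated for every overridden device. The following is a minimal sketch and not part of Scrutiny: it assumes the overrides from collector.yaml are copied by hand into a Python list, and the names `OVERRIDES` and `smartctl_commands` are made up for illustration. It only builds and prints the command lines; actually running them needs smartmontools and usually root.

```python
import shlex

# Overrides copied by hand from the collector.yaml above (assumption: these
# three entries are the only customised devices).
OVERRIDES = [
    {"device": "/dev/sdb", "type": ["cciss,1"]},
    {"device": "/dev/sdc", "type": ["cciss,0"]},
    {"device": "/dev/sdd", "type": ["cciss,2"]},
]


def smartctl_commands(overrides):
    """Return one `smartctl -a DEVICE -d TYPE` argv list per configured type."""
    commands = []
    for entry in overrides:
        types = entry["type"]
        # The config comment says type can be a string or a list; accept both.
        if isinstance(types, str):
            types = [types]
        for dev_type in types:
            commands.append(["smartctl", "-a", entry["device"], "-d", dev_type])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be pasted into a shell on the host.
    for argv in smartctl_commands(OVERRIDES):
        print(shlex.join(argv))
```

If each printed command reports PASSED when run on the host, the drives themselves are fine and the failure status comes from how the collector invokes smartctl.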
In the collector.log file that I've attached, scrutiny first collects the info about /dev/sdb with the proper device argument:
But when it has to collect metrics, it passes the device argument as sat when it should be cciss,1 instead, so smartctl can't open the device, and that results in Scrutiny treating the drive as failing.
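The behaviour I would expect can be written as a small lookup. This is an illustrative sketch only, not Scrutiny's actual code; the function name, the dictionary, and the detected type for /dev/sda are all hypothetical. The point is that a type set in collector.yaml should take precedence over the detected type at every step, including metrics collection.

```python
def resolve_device_type(device, detected_type, overrides):
    """Pick the -d value to pass to smartctl for `device`.

    `overrides` maps a device path to the type configured in collector.yaml.
    A configured type takes precedence; otherwise the detected type is used.
    """
    return overrides.get(device, detected_type)


# Hypothetical inputs mirroring the report: sdb, sdc and sdd are overridden,
# sda is not (its detected type here is an assumption for illustration).
overrides = {"/dev/sdb": "cciss,1", "/dev/sdc": "cciss,0", "/dev/sdd": "cciss,2"}

print(resolve_device_type("/dev/sdb", "sat", overrides))  # prints cciss,1
print(resolve_device_type("/dev/sda", "sat", overrides))  # prints sat
```

In the failing runs, the metrics step behaves as if the override were missing for /dev/sdb and falls back to sat.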
Same issue, also using HP hardware (although mine's a software RAID, as the HP card is in passthrough). Apparently version 0.7.2 still works, but I can't figure out how to use a specific version with the GitHub Docker registry!
Edit: I've also tried configuring the custom commands section for each drive but it still doesn't respect the cciss,0 part, just sends cciss which fails.
Can confirm: after rolling back to the v0.7.2-omnibus tag, deleting the influxdb folder and scrutiny.db in /config, and restarting Scrutiny, all devices show as passing.
Screenshots
![image](https://private-user-images.githubusercontent.com/65719382/317901853-10c711d7-92f8-4c2b-9f81-4bc9a000cb75.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg4NDI1MjgsIm5iZiI6MTczODg0MjIyOCwicGF0aCI6Ii82NTcxOTM4Mi8zMTc5MDE4NTMtMTBjNzExZDctOTJmOC00YzJiLTlmODEtNGJjOWEwMDBjYjc1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA2VDExNDM0OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTRkNWU4NTIwN2U4ODkzYzRlMGUxMDg4NDQxYWVjZjcyNTE5MGU0NjFhMTFjN2EyOWZmZDQ5ZjU5ODYxOGUyMGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.k49f7gciq1e4W2vBb9-ebwPK1JY2AFu4nlbHrSxzhHI)
Log Files
collector.log