[BUG] Scrutiny fails to set correct device type for smartctl, drives show as failing #618

Open
Z0lid opened this issue Mar 28, 2024 · 2 comments
Labels
bug Something isn't working

Z0lid commented Mar 28, 2024

Describe the bug
After a few months of running without problems, Scrutiny suddenly started reporting all of my cciss drives as failing, which isn't true.

In my setup, /dev/sda is recognised by smartctl --scan, but /dev/sdb, /dev/sdc, and /dev/sdd are HP Smart Array drives, so I configured them in collector.yaml like so:

version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""


# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  - device: /dev/sdb
    type:
      #- 'cciss,0'
      - 'cciss,1'
      #- 'cciss,2'
  - device: /dev/sdc
    type:
      - 'cciss,0'
      #- 'cciss,1'
      #- 'cciss,2'
  - device: /dev/sdd
    type:
      #- 'cciss,0'
      #- 'cciss,1'
      - 'cciss,2'
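
With this configuration, every smartctl call for a given disk would be expected to carry that disk's configured type. The sketch below spells out those expected invocations. It is only an illustration: `DEVICE_TYPES` and `expected_commands` are hypothetical names, not Scrutiny's actual code, and the two modes (`--info` and `--xall`) are taken from the collector log lines quoted further down.

```python
from typing import Dict, List

# Hypothetical mirror of the `devices` block in collector.yaml above.
DEVICE_TYPES: Dict[str, List[str]] = {
    "/dev/sdb": ["cciss,1"],
    "/dev/sdc": ["cciss,0"],
    "/dev/sdd": ["cciss,2"],
}


def expected_commands(device_types: Dict[str, List[str]]) -> List[List[str]]:
    """Build the smartctl argument lists one would expect per configured type.

    Assumption: the collector runs an --info call and an --xall call for
    each device, as seen in the collector.log excerpts in this issue.
    """
    commands = []
    for device, types in device_types.items():
        for dev_type in types:
            for mode in ("--info", "--xall"):
                commands.append(
                    ["smartctl", mode, "--json", "--device", dev_type, device]
                )
    return commands


if __name__ == "__main__":
    for cmd in expected_commands(DEVICE_TYPES):
        print(" ".join(cmd))
```

The point of the sketch is that both calls for /dev/sdb should use cciss,1; nothing in the configuration asks for sat.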

Now, for example, running smartctl -a /dev/sdb -d cciss,1 shows that the drive is ok:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-101-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     WD Blue / Red / Green SSDs
Device Model:     WDC  WDS200T1R0A-68A4W0
Serial Number:    2035C4440507
LU WWN Device Id: 5 001b44 4a75fe2b9
Firmware Version: 411000WR
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Mar 29 00:51:15 2024 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       26916
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       78
165 Block_Erase_Count       0x0032   100   100   ---    Old_age   Always       -       408220862923
166 Minimum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       4
167 Max_Bad_Blocks_per_Die  0x0032   100   100   ---    Old_age   Always       -       53
168 Maximum_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       53
169 Total_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       948
170 Grown_Bad_Blocks        0x0032   100   100   ---    Old_age   Always       -       0
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Average_PE_Cycles_TLC   0x0032   100   100   ---    Old_age   Always       -       28
174 Unexpected_Power_Loss   0x0032   100   100   ---    Old_age   Always       -       76
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   072   040   ---    Old_age   Always       -       28 (Min/Max 9/40)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
230 Media_Wearout_Indicator 0x0032   003   003   ---    Old_age   Always       -       0x035502500355
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 NAND_GB_Written_TLC     0x0032   100   100   ---    Old_age   Always       -       60076
234 NAND_GB_Written_SLC     0x0032   100   100   ---    Old_age   Always       -       78265
241 Host_Writes_GiB         0x0030   253   253   ---    Old_age   Offline      -       51959
242 Host_Reads_GiB          0x0030   253   253   ---    Old_age   Offline      -       22004
244 Temp_Throttle_Status    0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

In the collector.log file that I've attached, scrutiny first collects the info about /dev/sdb with the proper device argument:

time="2024-03-28T22:42:24Z" level=info msg="Executing command: smartctl --info --json --device cciss,1 /dev/sdb" type=metrics

But when it collects metrics, it passes sat as the device argument instead of cciss,1, so smartctl can't open the device, and Scrutiny then treats the drive as failing.

time="2024-03-28T22:42:25Z" level=info msg="Executing command: smartctl --xall --json --device sat /dev/sdb" type=metrics
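
To check a longer collector.log for the same problem, something like the following could be used. It is a minimal sketch that assumes every relevant line has the "Executing command: smartctl ..." form shown above, with the device path as the last argument. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the "Executing command" log lines quoted in this issue.
CMD_RE = re.compile(r'Executing command: smartctl (?P<args>[^"]+)"')


def device_types_by_disk(log_lines):
    """Map each device path to the --device types used for it, in log order."""
    seen = defaultdict(list)
    for line in log_lines:
        match = CMD_RE.search(line)
        if not match:
            continue
        args = match.group("args").split()
        if "--device" not in args:
            continue
        idx = args.index("--device")
        if idx + 1 >= len(args):
            continue  # malformed line: no value after --device
        # Assumption: the device path is the final argument.
        seen[args[-1]].append(args[idx + 1])
    return dict(seen)


def mismatched(log_lines):
    """Return only the devices queried with more than one distinct type."""
    return {
        device: types
        for device, types in device_types_by_disk(log_lines).items()
        if len(set(types)) > 1
    }


SAMPLE = [
    'time="2024-03-28T22:42:24Z" level=info msg="Executing command: '
    'smartctl --info --json --device cciss,1 /dev/sdb" type=metrics',
    'time="2024-03-28T22:42:25Z" level=info msg="Executing command: '
    'smartctl --xall --json --device sat /dev/sdb" type=metrics',
]

if __name__ == "__main__":
    print(mismatched(SAMPLE))  # {'/dev/sdb': ['cciss,1', 'sat']}
```

Any device listed in the result was queried with inconsistent types, which is the symptom described here.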

Screenshots
image

Log Files
collector.log

Z0lid added the bug label Mar 28, 2024

dansharpy commented Apr 1, 2024

Same issue, also using HP hardware (although mine is a software RAID, as the HP card is in passthrough). Apparently version 0.7.2 still works, but I can't figure out how to use a specific version with the GitHub Docker registry!
Edit: I've also tried configuring the custom commands section for each drive, but it still doesn't respect the cciss,0 part; it just sends cciss, which fails.

Z0lid commented Apr 1, 2024

Can confirm: after rolling back to the v0.7.2-omnibus tag, deleting the influxdb folder and scrutiny.db in /config, and restarting Scrutiny, all devices show as passing.
