[BUG] NVMe device shows as failed because of just 3 media errors #729

michnovka · 2024-12-06T09:55:45Z

I really think that 3 media errors that don't go up should not mark the device as failed. I have read many forums online, and it seems that as long as the count is not increasing, it's fine. The threshold value is 0 here, but that makes no sense. Look:

The device is really healthy. Is there something I am missing?

I think it would be cool if I could set the expected value to 3 in the config. Then the disk would show as OK as long as the count did not increase. This would draw my attention to the disk whenever the errors went up, and if that starts happening often, I know it's time to replace it. WDYT?

superuser@SuperTower:~/dockers/scrutiny$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning       : 0
temperature      : 40 °C (313 K)
available_spare        : 53%
available_spare_threshold       : 10%
percentage_used        : 2%
endurance group critical warning summary: 0
Data Units Read        : 121737857 (62.33 TB)
Data Units Written       : 54373352 (27.84 TB)
host_read_commands       : 7237904998
host_write_commands      : 3071156723
controller_busy_time     : 24094
power_cycles     : 149
power_on_hours    : 16300
unsafe_shutdowns       : 120
media_errors     : 3
num_err_log_entries      : 189
Warning Temperature Time   : 0
Critical Composite Temperature Time   : 0
Temperature Sensor 1           : 40 °C (313 K)
Temperature Sensor 2           : 45 °C (318 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
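
For anyone who wants to track this counter outside of Scrutiny, here is a minimal sketch that pulls the numeric fields out of text shaped like the `nvme smart-log` output above. The function name `parse_smart_log` and the sample string are hypothetical and only for illustration; it assumes every field of interest is a `name : value` line whose value starts with an integer.

```python
import re


def parse_smart_log(text: str) -> dict[str, int]:
    """Map each 'name : value' line to the integer its value starts with.

    Lines without a colon, or whose value does not begin with a digit
    (such as the 'Smart Log for NVME device:...' header), are skipped.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        match = re.match(r"\s*(\d+)", value)
        if match:
            fields[name.strip()] = int(match.group(1))
    return fields


# Hypothetical sample: a few lines copied from the output above.
sample = """\
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning       : 0
available_spare        : 53%
media_errors     : 3
num_err_log_entries      : 189
"""

log = parse_smart_log(sample)
print(log["media_errors"])
```

Units and percent signs are dropped, so `available_spare` comes back as the bare number 53.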


AnalogJ · 2025-01-04T22:59:17Z

NVMe thresholds are from the links in https://github.com/AnalogJ/scrutiny/blob/master/webapp/backend/pkg/thresholds/nvme_attribute_metadata.go

If you can share some links detailing how Media Errors work, and what an optimal range looks like, I'd be happy to tweak it.

michnovka · 2025-01-15T12:26:40Z

Hi, there are no specific thresholds for media errors. What's important is that they don't increase. So what might be a good feature (but a much more difficult one to implement) is a button "Accept as new acceptable value", so that the drive would not be marked as failed until the number increases. This is in accordance with what I found online: as long as the number does not go up, it's OK. It would also make the status more useful, as right now I have to treat the drive as "permanently failed". If it said "Healthy" and only changed to "Failed" when the number increased, that would tell me I should pay attention to it. Now I can easily overlook an increase in media errors, just because the drive is "failed" all the time anyway. WDYT?
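
To make the proposal concrete, here is a rough sketch of the "accept as new acceptable value" logic. This is not Scrutiny's code or API; the class `MediaErrorBaseline` and its methods are hypothetical names, and the default baseline of 0 is only meant to mirror the threshold described in this issue.

```python
from dataclasses import dataclass


@dataclass
class MediaErrorBaseline:
    """Hypothetical per-device baseline for the media_errors counter."""

    # Count the user has acknowledged; 0 mirrors the threshold in this issue.
    accepted: int = 0

    def status(self, current: int) -> str:
        # Only a count above the acknowledged baseline is treated as a failure.
        return "failed" if current > self.accepted else "healthy"

    def accept(self, current: int) -> None:
        # The "Accept as new acceptable value" button: raise the baseline,
        # never lower it, so an older reading cannot hide newer errors.
        self.accepted = max(self.accepted, current)


baseline = MediaErrorBaseline()
before = baseline.status(3)      # count is above the default baseline of 0
baseline.accept(3)               # user acknowledges the 3 existing errors
after = baseline.status(3)       # unchanged count no longer flags the drive
increased = baseline.status(4)   # any further increase flags it again
print(before, after, increased)
```

With this shape, the status changes only when the counter moves past what the user has already seen, which is the behavior asked for above.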

michnovka added the bug label (Something isn't working) on Dec 6, 2024

jacobalberty mentioned this issue Jan 5, 2025

[FEAT] Enable Baseline Exclusion for SMART Errors in Scrutiny Reports #713

Closed
