[BUG] NVMe device shows as failed because of just 3 media errors #729

Open
michnovka opened this issue Dec 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@michnovka

I really think that 3 media errors that don't go up should not mark the device as failed. I read many forums online, and it seems that as long as this count is not increasing, it's fine. The threshold value here is 0, but that makes no sense. Look:

image

image

The device is really healthy. Is there something I am missing?

I think it would be cool if I could set the expected value to 3 in the config. Then the disk would show as OK as long as this did not increase. This would bring my attention to the disk whenever the errors went up, and if it starts happening often, I know it's time to replace it. WDYT?
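
A minimal sketch of the behaviour proposed above, assuming a hypothetical per-device `expected` setting. The names `MediaErrorCheck` and `media_error_status` are made up for illustration and are not existing Scrutiny options or functions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaErrorCheck:
    """Hypothetical per-device setting; not an existing Scrutiny option."""
    expected: int = 0  # media_errors count the user has accepted as normal


def media_error_status(current: int, check: MediaErrorCheck) -> str:
    """Return "passed" while the counter has not grown past the accepted value."""
    return "passed" if current <= check.expected else "failed"


# With the drive from this report (media_errors = 3):
print(media_error_status(3, MediaErrorCheck(expected=3)))  # passed
print(media_error_status(4, MediaErrorCheck(expected=3)))  # failed
print(media_error_status(3, MediaErrorCheck()))            # failed (threshold of 0)
```

With the default of 0 the sketch reproduces the current result for this drive; raising `expected` to 3 would keep it green until the counter moves again.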

superuser@SuperTower:~/dockers/scrutiny$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning       : 0
temperature      : 40 °C (313 K)
available_spare        : 53%
available_spare_threshold       : 10%
percentage_used        : 2%
endurance group critical warning summary: 0
Data Units Read        : 121737857 (62.33 TB)
Data Units Written       : 54373352 (27.84 TB)
host_read_commands       : 7237904998
host_write_commands      : 3071156723
controller_busy_time     : 24094
power_cycles     : 149
power_on_hours    : 16300
unsafe_shutdowns       : 120
media_errors     : 3
num_err_log_entries      : 189
Warning Temperature Time   : 0
Critical Composite Temperature Time   : 0
Temperature Sensor 1           : 40 °C (313 K)
Temperature Sensor 2           : 45 °C (318 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
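
For anyone who wants to watch the counter outside of Scrutiny, here is a rough sketch that pulls `media_errors` out of text shaped like the output above. It assumes the plain `key : value` layout shown in this report; `parse_smart_log` is a hypothetical helper, and `SAMPLE` is just a few lines copied from the log above:

```python
# A few lines copied from the smart-log output in this report.
SAMPLE = """\
critical_warning       : 0
available_spare        : 53%
media_errors     : 3
num_err_log_entries      : 189
"""


def parse_smart_log(text: str) -> dict[str, str]:
    """Split each 'key : value' line at the first colon and trim whitespace."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


fields = parse_smart_log(SAMPLE)
print(int(fields["media_errors"]))  # 3
```

Values are kept as strings because some fields carry units (for example `53%`), so only the fields of interest are converted to integers.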
@michnovka michnovka added the bug Something isn't working label Dec 6, 2024
@AnalogJ
Owner

AnalogJ commented Jan 4, 2025

NVMe thresholds are from the links in https://github.com/AnalogJ/scrutiny/blob/master/webapp/backend/pkg/thresholds/nvme_attribute_metadata.go

If you can share some links detailing how Media Errors work, and what an optimal range looks like, I'd be happy to tweak it.

@michnovka
Author

Hi, there are no specific thresholds for media errors. What's important is that they don't increase. So what might be a good feature (but a much more difficult one to implement) is to add a button "Accept as new acceptable value", so that the drive would not be marked as failed until the number increases. This is in accordance with what I found online: as long as the number does not go up, it's OK. This would also make the status more useful, as right now I have to treat the drive as "permanently failed". If it were to say "Healthy" and only change to "Failed" when the number increased, it would indicate that I should dedicate my attention to it. Now I can easily overlook an increase in media errors, just because the drive is "failed" all the time anyway. WDYT?
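
To make the "Accept as new acceptable value" idea concrete, here is a small sketch of the state such a button would need. `MediaErrorMonitor`, `observe`, and `accept_current` are hypothetical names for illustration only; nothing here reflects how Scrutiny stores or evaluates attributes:

```python
class MediaErrorMonitor:
    """Tracks the media_errors count the user has acknowledged (hypothetical)."""

    def __init__(self, accepted: int = 0) -> None:
        self.accepted = accepted   # last value the user explicitly accepted
        self.last_seen = accepted  # most recent reading from the drive

    def observe(self, count: int) -> str:
        """Record a new reading and report the resulting status."""
        self.last_seen = count
        return "healthy" if count <= self.accepted else "failed"

    def accept_current(self) -> None:
        """What the proposed button would do: adopt the latest reading as the baseline."""
        self.accepted = self.last_seen


monitor = MediaErrorMonitor()
print(monitor.observe(3))  # failed  (baseline is still 0)
monitor.accept_current()   # user clicks the button
print(monitor.observe(3))  # healthy (count unchanged since acknowledgement)
print(monitor.observe(4))  # failed  (count increased, needs attention again)
```

The point of keeping the baseline separate from the latest reading is that the status only flips back to "failed" on a genuine increase, which is exactly the signal that currently gets lost.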

2 participants