[BUG] smartctl returned an error on QNAP #560

Closed
paulmorabito opened this issue Dec 29, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@paulmorabito

Running the latest version of Scrutiny on my QNAP TS453BE (with the latest QNAP firmware) is once again giving smartctl errors. My Compose file and logs are below:

services:
  scrutiny:
    image: ghcr.io/analogj/scrutiny:master-omnibus
    container_name: scrutiny
    privileged: true
    cap_add:
      - SYS_RAWIO
    volumes:
      - /run/udev:/run/udev:ro
      - /share/persistent/scrutiny:/scrutiny/config
      - ./influxdb:/opt/scrutiny/influxdb
    devices:
      - /dev/sda
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd
    ports:
      - 8180:8080
      - "8086:8086" 
    restart: always

logs:

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
github.com/AnalogJ/scrutiny                             dev-0.7.2
Start the scrutiny server
time="2023-12-29T14:38:53Z" level=info msg="Trying to connect to scrutiny sqlite db: /opt/scrutiny/config/scrutiny.db\n" type=web
time="2023-12-29T14:38:53Z" level=info msg="Successfully connected to scrutiny sqlite db: /opt/scrutiny/config/scrutiny.db\n" type=web
time="2023-12-29T14:38:53Z" level=info msg="InfluxDB certificate verification: true\n" type=web
time="2023-12-29T14:38:53Z" level=info msg="Database migration starting. Please wait, this process may take a long time...." type=web
time="2023-12-29T14:38:53Z" level=info msg="Database migration completed successfully" type=web
time="2023-12-29T14:38:53Z" level=info msg="SQLite global configuration migrations starting. Please wait...." type=web
2023/12/29 14:38:53 /go/src/github.com/analogj/scrutiny/vendor/github.com/go-gormigrate/gormigrate/v2/gormigrate.go:443 SLOW SQL >= 200ms
[422.567ms] [rows:1] INSERT INTO migrations (id) VALUES ("g20220802211500")
time="2023-12-29T14:38:53Z" level=info msg="SQLite global configuration migrations completed successfully" type=web
time="2023-12-29T14:38:58Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:58 +0000] \"HEAD /api/health\" 200 0 \"\" \"curl/7.74.0\" (1ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=1 method=HEAD path=/api/health referer= respLength=0 statusCode=200 type=web userAgent=curl/7.74.0
starting scrutiny collector (run-once mode. subsequent calls will be triggered via cron service)
2023/12/29 14:38:58 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.
 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                                dev-0.7.2
time="2023-12-29T14:38:58Z" level=info msg="Verifying required tools" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --scan --json" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sda" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sdb" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sdc" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sdd" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:58 +0000] \"POST /api/devices/register\" 200 2075 \"\" \"Go-http-client/1.1\" (408ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=408 method=POST path=/api/devices/register referer= respLength=2075 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:58Z" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --xall --json /dev/sda" type=metrics
time="2023-12-29T14:38:58Z" level=error msg="smartctl returned an error code (4) while processing sda\n" type=metrics
time="2023-12-29T14:38:58Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Publishing smartctl results for 2yjdn6sd\n" type=metrics
ts=2023-12-29T14:38:58.926265Z lvl=info msg="index opened with 8 partitions" log_id=0mPW3M0l000 service=storage-engine index=tsi
ts=2023-12-29T14:38:58.927330Z lvl=info msg="Reindexing TSM data" log_id=0mPW3M0l000 service=storage-engine engine=tsm1 db_shard_id=1
ts=2023-12-29T14:38:58.927369Z lvl=info msg="Reindexing WAL data" log_id=0mPW3M0l000 service=storage-engine engine=tsm1 db_shard_id=1
time="2023-12-29T14:38:58Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:58Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:58 +0000] \"POST /api/device/2yjdn6sd/smart\" 200 16 \"\" \"Go-http-client/1.1\" (140ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=140 method=POST path=/api/device/2yjdn6sd/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:58Z" level=info msg="Collecting smartctl results for sdb\n" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --xall --json /dev/sdb" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl returned an error code (4) while processing sdb\n" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Publishing smartctl results for 2yj8s5bd\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:59Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:59 +0000] \"POST /api/device/2yj8s5bd/smart\" 200 16 \"\" \"Go-http-client/1.1\" (337ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=337 method=POST path=/api/device/2yj8s5bd/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:59Z" level=info msg="Collecting smartctl results for sdc\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Executing command: smartctl --xall --json /dev/sdc" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl returned an error code (4) while processing sdc\n" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Publishing smartctl results for jehn4m1n\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:59Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:59 +0000] \"POST /api/device/jehn4m1n/smart\" 200 16 \"\" \"Go-http-client/1.1\" (204ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=204 method=POST path=/api/device/jehn4m1n/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:59Z" level=info msg="Collecting smartctl results for sdd\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Executing command: smartctl --xall --json /dev/sdd" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl returned an error code (4) while processing sdd\n" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Publishing smartctl results for 2yjdutkd\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:59Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:59 +0000] \"POST /api/device/2yjdutkd/smart\" 200 16 \"\" \"Go-http-client/1.1\" (188ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=188 method=POST path=/api/device/2yjdutkd/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
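
For reference, the (4) in the error lines above is smartctl's exit status, which smartctl(8) documents as a bitmask rather than a single error number. The sketch below decodes such a status into the meanings listed in the man page (paraphrased here). The function name decode_smartctl_status is made up for illustration and is not part of Scrutiny or smartmontools.

```shell
# Decode a smartctl exit status into the bit meanings from smartctl(8).
# The status is a bitmask, so several conditions can be set at once.
# decode_smartctl_status is a hypothetical helper name.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no error bits set"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed or device did not return an IDENTIFY DEVICE structure" \
        "a SMART or other ATA command failed, or a SMART data structure had a checksum error" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains records of errors" \
        "self-test log contains records of errors"
    do
        # Test one bit of the status at a time, lowest bit first.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}
```

Calling decode_smartctl_status 4 reports bit 2 only, which is consistent with the "checksum error" message Scrutiny logs next to the error code. Bit 2 also covers a failed SMART or ATA command, so the status alone does not say the disk data is corrupt.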

Please let me know if there is any further debugging or logging needed and I'll get to it.

Thanks,

@paulmorabito paulmorabito added the bug Something isn't working label Dec 29, 2023
@mcarbonne

You could run smartctl from inside the Scrutiny container to get more detailed output:

sudo docker exec -it scrutiny /bin/sh

(replace scrutiny with the name of your running container) and then execute smartctl --xall /dev/sda.

@paulmorabito
Author

Here you go:

root@7e29e6c5589c:/opt/scrutiny# smartctl --xall /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.60-qnap] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WD100EMAZ-00WJTA
Revision:             83.H
Compliance:           SPC-3
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        5400 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca273e1ef5b
Serial number:        2YJDN6SD
Device type:          disk
Local Time is:        Wed Jan  3 08:51:52 2024 UTC
SMART support is:     Unavailable - device lacks SMART capability.
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
Device does not support Background scan results logging

@mcarbonne

The issue is on the smartmontools side.
I don't know what kind of hardware your NAS uses, but it may be preventing smartctl from auto-detecting the device type. You can still try to find a working configuration manually.
From inside the Scrutiny container, run smartctl --xall --device=XXX /dev/sda (replacing XXX with sat, scsi, ...). The man page lists all available options (https://linux.die.net/man/8/smartctl).

If that succeeds, have a look at the metrics_smart_args parameter. Its default value is --xall --json, but you can add any extra parameters your drives require (--xall --json --device sat, for example).
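
The trial-and-error step described above can be scripted. This is only a sketch, meant to be run in a shell inside the container: probe_device_types is a hypothetical helper name, the default list of candidate types is an arbitrary choice, and treating exit-status bits 0 to 2 as "the command failed" is an assumption based on the bit meanings in smartctl(8). Reading the actual smartctl output is still the more reliable check.

```shell
# Try several --device types against one disk and report which ones
# smartctl accepts. probe_device_types is a hypothetical helper name.
probe_device_types() {
    dev=$1
    shift
    # Candidate types; override by passing them after the device path.
    [ $# -gt 0 ] || set -- sat scsi ata auto
    for type in "$@"; do
        smartctl --xall --device="$type" "$dev" >/dev/null 2>&1
        status=$?
        # Assumption: bits 0-2 (mask 7) mean the command itself failed;
        # higher bits describe disk health, so SMART data was still read.
        if [ $(( status & 7 )) -eq 0 ]; then
            echo "$type: ok (exit status $status)"
        else
            echo "$type: failed (exit status $status)"
        fi
    done
}
```

For example, probe_device_types /dev/sda tries the default candidates, while probe_device_types /dev/sda sat scsi limits the run to those two.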

@paulmorabito
Author

I found the missing parameter (--device=sat) and can run the command successfully from the command line in the container. When I update the config, though, the collector still runs the command without --device. My config is below:

# Commented Scrutiny Configuration File
#
# The default location for this file is /scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.


######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  # example for forcing device type detection for a single disk
  - device: /dev/sda
    type: 'sat'
  - device: /dev/sdb
    type: 'sat'
  - device: /dev/sdc
    type: 'sat'
  - device: /dev/sdd
    type: 'sat'
commands:
  #  metrics_scan_args: '--scan --json' # used to detect devices
  #  metrics_info_args: '--info --json' # used to determine device unique ID & register device with Scrutiny
  metrics_smart_args: '--xall --device=sat --json' # used to retrieve smart data for each device.

Error from the logs below:

time="2024-01-04T14:06:58Z" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2024-01-04T14:06:58Z" level=info msg="Executing command: smartctl --xall --json /dev/sda" type=metrics
time="2024-01-04T14:06:58Z" level=error msg="smartctl returned an error code (4) while processing sda\n" type=metrics
time="2024-01-04T14:06:58Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2024-01-04T14:06:58Z" level=info msg="Publishing smartctl results for 2yjdn6sd\n" type=metrics
time="2024-01-04T14:06:58Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web

I'm running the latest container version etc. Is there anything I am missing?

@chrisuhg

chrisuhg commented Jan 4, 2024

> The issue is on smartmontools side. I don't know the kind of HW used in your NAS but maybe it prevents auto detection from smartctl. Nevertheless you can try to manually find a working configuration. From inside the scrutiny docker, you can run smartctl --xall --device=XXX /dev/sda (replace XXX by sat, scsi ...). There is a list of all available options in man pages (https://linux.die.net/man/8/smartctl).
>
> If you succeed, then have a look at metrics_smart_args parameter. Default value is --xall --json but you can add extra required parameters for you drives (--xall --json --device sat for example).

Many thanks for your comment :)

I got the same error for all of my SATA SSDs in the collector.log file: level=error msg="smartctl returned an error code (4) while processing sdf\n" type=metrics

I opened a shell with docker exec -it scrutiny /bin/sh and ran smartctl --xall /dev/sde, which showed:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WDS250G2B0B-00YS
Revision:             20WD
Compliance:           SPC-3
User Capacity:        250,059,350,016 bytes [250 GB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        Solid State Device
Logical Unit id:      0x5001b444a773882b
Serial number:        202********10
Device type:          disk
Local Time is:        Thu Jan  4 15:35:31 2024 UTC
SMART support is:     Unavailable - device lacks SMART capability.
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
Device does not support Background scan results logging

I tried different values for --device=TYPE:

smartctl --xall --device=ata /dev/sdj
smartctl --xall --device=scsi /dev/sdj
smartctl --xall --device=sat /dev/sdj

Only smartctl --xall --device=sat /dev/sdj showed all of the SMART information, so sat is clearly the right type for that drive.

So I split my setup into two files, compose.yaml and collector.yaml:
compose.yaml

version: '3.5'

services:
  scrutiny:
    container_name: scrutiny
    image: ghcr.io/analogj/scrutiny:master-omnibus
    privileged: true    # !!PLEASE REMOVE WHEN WORKING!!
    cap_add:
      - SYS_RAWIO
      - SYS_ADMIN
    ports:
      - "8080:8080" # webapp
    environment:
      - PUID=1000
      - PGID=1000
      - DEBUG=true
      - COLLECTOR_LOG_FILE=/opt/scrutiny/config/collector.log
      - SCRUTINY_LOG_FILE=/opt/scrutiny/config/web.log
    volumes:
      - /run/udev:/run/udev:ro
      - ./config:/opt/scrutiny/config
      - ./influxdb:/opt/scrutiny/influxdb
    devices:    # if you will always run in "privileged" mode, you can remove this section
      - /dev/sda
      - /dev/nvme0

collector.yaml (saved at ./config/collector.yaml)

# Commented Scrutiny Configuration File
#
# The default location for this file is /opt/scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.
######################################################################

# Version
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  - device: /dev/sda
    type: 'ata'

  - device: /dev/sdb
    type: 'sat'

  - device: /dev/sde
    type: 'sat'

  - device: /dev/sdf
    type: 'ata'

  - device: /dev/sdj
    type: 'sat'

  - device: /dev/nvme0
    type: 'nvme'

Remove these files or folders to clear the cached state:

  • ./config/scrutiny.db
  • ./influxdb/engine
  • (optional) ./config/collector.log

Rebuild the Scrutiny container and open config/collector.log. Searching for smartctl --info --json --device sat should turn up lines like level=info msg="Executing command: smartctl --info --json --device sat /dev/sdf" type=metrics.
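
Instead of searching the log by hand, the executed commands can be pulled out of it directly. The sketch below assumes the log line format shown in this thread (msg="Executing command: smartctl ..."); list_smartctl_commands is a hypothetical helper name, and the three labels it prints are made up for readability.

```shell
# Print each distinct smartctl command the collector logged, labelling
# the ones that carry a --device override. Reads a log on stdin.
# list_smartctl_commands is a hypothetical helper name.
list_smartctl_commands() {
    # Log lines look like:
    #   msg="Executing command: smartctl --xall --json /dev/sda"
    sed -n 's/.*msg="Executing command: \(smartctl [^"]*\)".*/\1/p' |
    sort -u |
    while IFS= read -r cmd; do
        case "$cmd" in
            *--device*) echo "override   $cmd" ;;
            *--scan*)   echo "scan       $cmd" ;;
            *)          echo "default    $cmd" ;;
        esac
    done
}
```

With the log file configured in the compose.yaml above, usage would be list_smartctl_commands < ./config/collector.log. Any line labelled "default" for a drive that needs an override suggests collector.yaml was not picked up for that device.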

now everything is normal 👯

I spent a lot of time searching the issues and the official documentation for guidance on collector.yaml or metrics_smart_args, but I couldn't find (or I missed) any guide for the Synology Container Manager environment. I hope what I've shared is helpful :D

@paulmorabito
Author

paulmorabito commented Jan 4, 2024

@chrisuhg Thanks for the info. Although, I'm not sure why you need to rebuild the container when my config files are stored outside of it and the container reads the config upon every start/restart?

@chrisuhg

chrisuhg commented Jan 4, 2024

> @chrisuhg Thanks for the info. Although, I'm not sure why you need to rebuild the container when my config files are stored outside of it and the container reads the config upon every start/restart?

I know the container reads the config on every start, but Synology's Container Manager app did not pick up my changes after I edited the compose.yaml file.

So I assumed the Start/Stop buttons in the GUI are equivalent to docker run/stop, and that the option > build button is equivalent to docker-compose -config compose.yml.


Maybe I should attach more screenshots to make "build" or "re-build" easier to understand.

Anyway, thanks for asking :D

@paulmorabito
Author

paulmorabito commented Jan 4, 2024 via email

@pheetr

pheetr commented Jan 7, 2024

I also have a QNAP and had the same issue as you, @paulmorabito. The advice from @chrisuhg helped me get it working.
I just created the collector.yaml file in the config folder (where scrutiny.db is located) and specified the type for each drive individually. In my case that was 'sat', since that is what produced results in the Scrutiny container's console with the command smartctl --xall --device=sat /dev/sda.
Restarting the container afterwards was enough to get things going in my case.

Note: I initially tried setting the global command arguments in the config, since all my drives are SATA, but that didn't work for me.

collector.yaml:

# Commented Scrutiny Configuration File
#
# The default location for this file is /opt/scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.


######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""


# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
#  # example for forcing device type detection for a single disk
  - device: /dev/sda
    type: 'sat'
  - device: /dev/sdb
    type: 'sat'
  - device: /dev/sdc
    type: 'sat'
  - device: /dev/sdd
    type: 'sat'
  - device: /dev/sde
    type: 'sat'
  - device: /dev/sdf
    type: 'sat'
  - device: /dev/sdg
    type: 'sat'
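
When every drive needs the same type, as here, the repetitive devices block can be generated rather than typed. A minimal sketch, assuming only the version and devices keys shown in the config above are needed; make_collector_yaml is a hypothetical helper name and not something Scrutiny ships.

```shell
# Emit a minimal collector.yaml that forces one --device type for each
# device path given on the command line.
# make_collector_yaml is a hypothetical helper name.
make_collector_yaml() {
    type=$1
    shift
    echo "version: 1"
    echo "devices:"
    for dev in "$@"; do
        echo "  - device: $dev"
        echo "    type: '$type'"
    done
}
```

For example, make_collector_yaml sat /dev/sda /dev/sdb /dev/sdc /dev/sdd > collector.yaml writes a four-drive file. Drives that need different types (such as the 'ata' and 'nvme' entries earlier in the thread) would still have to be edited by hand.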

@paulmorabito
Author

Thanks for the reply @pheetr. I've taken a look with fresh eyes, and my config was pointing to the wrong location. I don't check Scrutiny very often; perhaps at some point the path was changed to /opt/scrutiny/config?

In any case, I already had "type: sat" set, so pointing to the correct config location sorted that out. I also noticed on restarting that I now have 4 "[/DEV/ -" devices that can't be clicked on or deleted. There were quite a few DB migrations on restart, so perhaps it's a side effect of those?

I also noted that the global "commands" setting doesn't seem to work, but that's a separate issue from this one. I'll close this for now, as the reported issue is fixed.
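
The startup log in the first post already hinted at the cause: the collector printed "No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.", while the Compose file mounted the config directory at /scrutiny/config. The sketch below checks a log for that message. It assumes the message wording stays as shown in this thread, and check_collector_config is a hypothetical helper name.

```shell
# Report whether the collector fell back to defaults because it could not
# find its configuration file. Reads a log on stdin.
# check_collector_config is a hypothetical helper name.
check_collector_config() {
    # Extract the path from the fallback message, if the message is present.
    missing=$(sed -n 's/.*No configuration file found at \(.*\)\. Using Defaults\..*/\1/p' | sort -u)
    if [ -n "$missing" ]; then
        echo "collector.yaml not found; expected at: $missing"
        return 1
    fi
    echo "no 'No configuration file found' message in this log"
}
```

A possible use is docker logs scrutiny 2>&1 | check_collector_config. If it reports a missing file, the host directory holding collector.yaml probably needs to be mounted at the path it prints, for example /share/persistent/scrutiny:/opt/scrutiny/config in the original Compose file.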
