Monitor drive error rates now that `smartd_log` has been removed

With one of the latest releases (v1.46), smartd_log was removed in favour of smartctl (pull request #17600). After I configured the new collector, I sadly no longer have any information on drive error rates, only a generalised pass/fail status plus temperature and uptime metrics.

Is there a way to get the read/write/verify corrected and uncorrected error rates back into the dashboard?

Hi, @cindy. Smartctl is supposed to collect all SMART attributes.


After I configured the new collector

What do you mean by configured? It doesn’t require configuration.


If there are no SMART attributes for some disks, share (redact sensitive data if any):

# all devices
smartctl --json --scan

# replace "deviceName" and "deviceType" with the values from the previous output
smartctl --json --all {deviceName} --device {deviceType}
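
For anyone who prefers to script the two commands above, here is a minimal Python sketch. It assumes smartctl is on the PATH and that the process is allowed to query the devices. The helper names (build_info_commands, run_json, collect_all) are made up for this illustration and are not part of Netdata or smartmontools.

```python
import json
import subprocess


def build_info_commands(scan_result):
    """Build one `smartctl --json --all` argv list per device in a scan result."""
    commands = []
    for device in scan_result.get("devices", []):
        commands.append(
            ["smartctl", "--json", "--all", device["name"], "--device", device["type"]]
        )
    return commands


def run_json(argv):
    """Run a command and parse its stdout as JSON.

    smartctl uses non-zero exit codes as status flags, so the return code
    is deliberately not treated as a failure here.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return json.loads(completed.stdout)


def collect_all(runner=run_json):
    """Scan for devices, then fetch the full report for each one.

    `runner` is injectable so the logic can be tried without real hardware.
    Returns a dict keyed by device name (argv[3] in the built command).
    """
    scan = runner(["smartctl", "--json", "--scan"])
    return {argv[3]: runner(argv) for argv in build_info_commands(scan)}
```

Calling collect_all() with no arguments runs the real commands; passing a different runner makes it possible to replay saved JSON instead.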

Hi @ilyam8 thank you so much for your reply.

Very curious; for me it only shows a small subset of items, see the image below. Since it was showing some statistics, I took it to be working as intended, hence my surprise that the error statistics were missing.
image

As for the ‘configuration’, I was referring to the steps from the documentation: I am running the collector in a Docker container and had to add the drives and permission to the compose file as follows:
image

Running the commands as mentioned, I think the output is ‘ok’, but please advise if I am overlooking something.

The scan command gives me the disks I expect:

root:/# smartctl --json --scan
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-6.1.0-21-amd64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
      "--scan"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/sda",
      "info_name": "/dev/sda",
      "type": "scsi",
      "protocol": "SCSI"
    },
    {
      "name": "/dev/sdc",
      "info_name": "/dev/sdc",
      "type": "scsi",
      "protocol": "SCSI"
    },
    {
      "name": "/dev/sdd",
      "info_name": "/dev/sdd",
      "type": "scsi",
      "protocol": "SCSI"
    },
    {
      "name": "/dev/sde",
      "info_name": "/dev/sde",
      "type": "scsi",
      "protocol": "SCSI"
    }
  ]
}

And the data seems to be output as far as I can tell (only one drive is included here, but all 4 return a similar set of values; only the drive type and serial number are redacted):

root:/# smartctl --json --all /dev/sda --device scsi
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-6.1.0-21-amd64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
      "--all",
      "/dev/sda",
      "--device",
      "scsi"
    ],
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1720689199,
    "asctime": "Thu Jul 11 09:13:19 2024 UTC"
  },
  "device": {
    "name": "/dev/sda",
    "info_name": "/dev/sda",
    "type": "scsi",
    "protocol": "SCSI"
  },
  "scsi_vendor": "HGST",
  "scsi_product": "~REDACTED~",
  "scsi_model_name": "~REDACTED~",
  "scsi_revision": "~REDACTED~",
  "scsi_version": "~REDACTED~",
  "user_capacity": {
    "blocks": 7814037168,
    "bytes": 4000787030016
  },
  "logical_block_size": 512,
  "scsi_lb_provisioning": {
    "name": "fully provisioned",
    "value": 0,
    "management_enabled": {
      "name": "LBPME",
      "value": 0
    },
    "read_zeros": {
      "name": "LBPRZ",
      "value": 0
    }
  },
  "rotation_rate": 7200,
  "form_factor": {
    "scsi_value": 2,
    "name": "3.5 inches"
  },
  "logical_unit_id": "~REDACTED~",
  "serial_number": "~REDACTED~",
  "device_type": {
    "scsi_terminology": "Peripheral Device Type [PDT]",
    "scsi_value": 0,
    "name": "disk"
  },
  "scsi_transport_protocol": {
    "name": "SAS (SPL-4)",
    "value": 6
  },
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "temperature_warning": {
    "enabled": true
  },
  "smart_status": {
    "passed": true
  },
  "temperature": {
    "current": 34,
    "drive_trip": 85
  },
  "power_on_time": {
    "hours": 1641,
    "minutes": 22
  },
  "scsi_start_stop_cycle_counter": {
    "year_of_manufacture": "2013",
    "week_of_manufacture": "51",
    "specified_cycle_count_over_device_lifetime": 50000,
    "accumulated_start_stop_cycles": 4,
    "specified_load_unload_count_over_device_lifetime": 600000,
    "accumulated_load_unload_cycles": 119
  },
  "scsi_grown_defect_list": 0,
  "scsi_error_counter_log": {
    "read": {
      "errors_corrected_by_eccfast": 647707,
      "errors_corrected_by_eccdelayed": 29,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 647736,
      "correction_algorithm_invocations": 586730,
      "gigabytes_processed": "36537.378",
      "total_uncorrected_errors": 0
    },
    "write": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 13549,
      "gigabytes_processed": "2811.293",
      "total_uncorrected_errors": 0
    },
    "verify": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 2146,
      "gigabytes_processed": "0.000",
      "total_uncorrected_errors": 0
    }
  }
}
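
The counters asked about in the original question sit under scsi_error_counter_log in the output above. As a rough illustration of how they could be pulled out of that JSON, here is a small Python sketch; the function name scsi_error_counters and the dotted key format are my own choices, not something the Netdata collector is known to use.

```python
import json


def scsi_error_counters(report):
    """Flatten `scsi_error_counter_log` into keys like "read.total_errors_corrected"."""
    flat = {}
    for operation, counters in report.get("scsi_error_counter_log", {}).items():
        for name, value in counters.items():
            # gigabytes_processed is emitted as a string such as "36537.378"
            flat[f"{operation}.{name}"] = float(value) if isinstance(value, str) else value
    return flat


# Trimmed copy of the report above, keeping only a few counters per operation.
sample = json.loads("""
{
  "scsi_error_counter_log": {
    "read": {
      "total_errors_corrected": 647736,
      "gigabytes_processed": "36537.378",
      "total_uncorrected_errors": 0
    },
    "write": {
      "total_errors_corrected": 0,
      "gigabytes_processed": "2811.293",
      "total_uncorrected_errors": 0
    },
    "verify": {
      "total_errors_corrected": 0,
      "gigabytes_processed": "0.000",
      "total_uncorrected_errors": 0
    }
  }
}
""")

counters = scsi_error_counters(sample)
for key in sorted(counters):
    print(key, counters[key])
```

A report without the scsi_error_counter_log key (for example from an ATA drive) simply yields an empty dict.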

Any advice on where to troubleshoot would be very welcome!

And what does smartctl --json --all /dev/sda --device sat return?

That seems to fail, going by the exit status of ‘2’:

root:/# smartctl --json --all /dev/sda --device sat
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-6.1.0-21-amd64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
      "--all",
      "/dev/sda",
      "--device",
      "sat"
    ],
    "exit_status": 2
  },
  "local_time": {
    "time_t": 1720692523,
    "asctime": "Thu Jul 11 10:08:43 2024 UTC"
  },
  "device": {
    "name": "/dev/sda",
    "info_name": "/dev/sda [SAT]",
    "type": "sat",
    "protocol": "ATA"
  }
}
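
For reference, smartctl's exit status is a bit mask rather than a plain error code, so the value 2 above means bit 1 is set (device open failed). A minimal Python sketch for decoding it follows; the bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page, and the names EXIT_STATUS_BITS and decode_exit_status are purely illustrative.

```python
# Meanings of bits 0..7 of the smartctl exit status, paraphrased from the
# smartctl man page (EXIT STATUS section).
EXIT_STATUS_BITS = [
    "command line did not parse",
    "device open failed, no identify data returned, or device in low-power mode",
    "a SMART or other ATA command failed, or a SMART data checksum error occurred",
    "SMART status check returned DISK FAILING",
    "prefail attributes found at or below threshold",
    "SMART status OK, but some attributes were at or below threshold in the past",
    "device error log contains records of errors",
    "device self-test log contains records of errors",
]


def decode_exit_status(status):
    """Return the meanings of every bit set in a smartctl exit status."""
    return [
        meaning
        for bit, meaning in enumerate(EXIT_STATUS_BITS)
        if status & (1 << bit)
    ]


for meaning in decode_exit_status(2):
    print(meaning)
```

An exit status of 0 decodes to an empty list, which matches the successful --device scsi run earlier in the thread.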

FYI, I did notice that in my earlier screenshot I had /dev/sdc mapped twice, once to sdc and once to sdd. Oops. Correcting this sadly did not make a difference.

Right. You have SCSI devices; they don’t have SMART attributes. What metrics were collected previously?

"scsi_error_counter_log": {
    "read": {
      "errors_corrected_by_eccfast": 647707,
      "errors_corrected_by_eccdelayed": 29,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 647736,
      "correction_algorithm_invocations": 586730,
      "gigabytes_processed": "36537.378",
      "total_uncorrected_errors": 0
    },
    "write": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 13549,
      "gigabytes_processed": "2811.293",
      "total_uncorrected_errors": 0
    },
    "verify": {
      "errors_corrected_by_eccfast": 0,
      "errors_corrected_by_eccdelayed": 0,
      "errors_corrected_by_rereads_rewrites": 0,
      "total_errors_corrected": 0,
      "correction_algorithm_invocations": 2146,
      "gigabytes_processed": "0.000",
      "total_uncorrected_errors": 0
    }
  }

read/write/verify corrected and uncorrected

anything else?

Makes sense. They are SMART-capable, but they do not report the attribute statistics to the OS, if I understand it correctly?
I am basing that on the response from smartctl:

  "smart_support": {
    "available": true,
    "enabled": true
  },
.......,
  "smart_status": {
    "passed": true
  }

As for what they report: the post above contains everything I know they report. For reference, I added a line from the smartd log below:

2024-06-03 10:20:53;	read-corr-by-ecc-fast;136967;	read-corr-by-ecc-delayed;3;	read-corr-by-retry;0;	read-total-err-corrected;136970;	read-corr-algorithm-invocations;105598;	read-gb-processed;11625.882;	read-total-unc-errors;0;	write-corr-by-ecc-fast;0;	write-corr-by-ecc-delayed;0;	write-corr-by-retry;0;	write-total-err-corrected;0;	write-corr-algorithm-invocations;4185;	write-gb-processed;2136.859;	write-total-unc-errors;0;	verify-corr-by-ecc-fast;0;	verify-corr-by-ecc-delayed;0;	verify-corr-by-retry;0;	verify-total-err-corrected;0;	verify-corr-algorithm-invocations;890;	verify-gb-processed;0.000;	verify-total-unc-errors;0;	non-medium-errors;0;	temperature;33;
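
To make the line above easier to compare with the smartctl JSON, here is a small Python sketch that splits it into name/value pairs. It assumes the layout visible in this one line (a timestamp followed by semicolon-separated name;value pairs with numeric values); parse_attrlog_line is a hypothetical helper, not an existing tool.

```python
def parse_attrlog_line(line):
    """Split one smartd log line into (timestamp, {metric_name: value})."""
    # Fields are separated by ";" and padded with tabs or spaces.
    fields = [field.strip() for field in line.split(";")]
    fields = [field for field in fields if field]
    timestamp, rest = fields[0], fields[1:]
    metrics = {}
    # After the timestamp, fields alternate between name and value.
    for name, value in zip(rest[0::2], rest[1::2]):
        metrics[name] = float(value)
    return timestamp, metrics


# Shortened version of the log line above.
line = (
    "2024-06-03 10:20:53;\tread-total-err-corrected;136970;"
    "\tread-gb-processed;11625.882;\ttemperature;33;"
)
timestamp, metrics = parse_attrlog_line(line)
print(timestamp)
for name, value in metrics.items():
    print(name, value)
```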

And these are the same metrics which were reported before with the old plugin:


(The screenshot only shows one set, but the rest were there as well.)
Is there any way to get these SCSI metrics into the dashboard with the current plugin?

Is there any way to get these SCSI metrics into the dashboard with the current plugin?

I will need to update the smartctl collector.

Clear, thank you so much for the help so far. It really helps to know why it was not showing.
(Should I mark this topic as solved, or leave it open until an update is available?)
Is there anything I can do to help update the collector?

Fixed in 18119.
