Device nodes are not guaranteed to be consistent over time #134

Open
Scandiravian opened this issue Jul 19, 2023 · 12 comments

Comments

@Scandiravian

Currently the smartctl exporter only attaches the device label to all metrics except smartctl_device. This was introduced in #83.

This can unfortunately lead to issues, since device nodes are not guaranteed to be consistent over time, so after a reboot /dev/sdc might for instance become /dev/sda.

This makes it difficult to create dashboards in Grafana that track, for instance, temperature over time, since a query will break after a reboot.

If there is a goal to limit the number of labels sent, I think it would be better to switch to using the serial number as the identifying label sent with metrics. These are not guaranteed to be unique, though I think a conflict will be unlikely in most cases.

@k0ste
Contributor

k0ste commented Jul 19, 2023

> This can unfortunately lead to issues, since device nodes are not guaranteed to be consistent over time, so after a reboot /dev/sdc might for instance become /dev/sda.
>
> This makes it difficult to create dashboards in Grafana that track, for instance, temperature over time, since a query will break after a reboot.

Which query exactly breaks?

@kfox1111

I could see it breaking if you squashed an alert, and then the drive letters flip after reboot, and then the wrong drive is squashed...

disk-by-path might be another way to get a more stable identifier.

@Scandiravian
Author

Scandiravian commented Jul 20, 2023

> Which query exactly breaks?

@k0ste Since PromQL doesn't support many-to-many joins and a new timeseries is created for each unique combination of labels, it's not possible to do a "join" between smartctl_device and any of the metrics that only have device to identify them by. In such a case I don't think it's possible to determine by querying which actual piece of hardware is, for instance, overheating; it can only be done by manually looking it up.

Then there's also the issue that every time a device node is reassigned, any graph that tracks history becomes wrong. If I'm tracking changes in disk space used, power cycles, etc., and have alerts based on a percentage increase, those might trigger on a reboot when the device node is reassigned to a drive with different metrics.

Finally, with three servers, each with four drives, that'll eventually create 48 different timeseries in smartctl_device (each machine has 16 different ways to combine device node+serial number). This doesn't affect queries directly, but it does make it difficult to determine what is the current "right" one.
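The arithmetic behind the 48 figure can be checked with a small sketch. The server names, device nodes and serial numbers below are hypothetical placeholders, and the sketch assumes the worst case in which every drive eventually shows up on every device node of its machine:

```python
from itertools import product

# Hypothetical inventory: three servers with four drives each.
servers = ["server-a", "server-b", "server-c"]
device_nodes = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]
serials = ["SN1", "SN2", "SN3", "SN4"]


def possible_device_series(servers, device_nodes, serials):
    """Return every (instance, device, serial_number) label combination that
    could appear if any drive can land on any device node over time."""
    return set(product(servers, device_nodes, serials))


series = possible_device_series(servers, device_nodes, serials)
print(len(series))  # 3 servers * 4 device nodes * 4 serial numbers
```

Only 12 of those combinations are current at any moment; the rest are leftovers from earlier device node assignments.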

@Scandiravian
Author

Scandiravian commented Jul 20, 2023

> disk-by-path might be another way to get a more stable identifier.

@kfox1111 I did consider suggesting this as well, though I decided against it, as that information is either available through smartctl or is not consistent (a UUID, for instance, changes when a disk is formatted).

As I understand it, the use-case for this exporter is to track individual pieces of hardware over time, for instance if a drive is about to fail. I think the easiest way to set up good tracking is to identify each hard drive by an id that doesn't change over time. I think the serial number is the only piece of data available that works for that, though I'm by no means a hardware expert, so there might be (and probably is) a smarter solution than I can think of 😅

@k0ste
Contributor

k0ste commented Jul 20, 2023

@Scandiravian if you operate by Linux device name, this is totally wrong; you should operate only by the device serial_number. Linux device names are not persistent, for example:

  • one of your 36 disks, for example /dev/sdaa, was powered off (backplane/PSU problem)
  • when power returns, your disk will be /dev/sdab
  • what will you do? Reboot the kernel?

All your recording rules / alerts should look like this:

smartctl_device{form_factor="3.5 inches"} * on (instance, device)
  group_left () smartctl_device_temperature > 30

In this case, at any one moment in time, the device label will be the same in all metrics. This is how meta labels were designed.
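What the join above does can be sketched in a few lines of Python. This is only a simplified model of matching on (instance, device), not how Prometheus implements it, and the label values and the function name are made up for illustration:

```python
def join_on_instance_device(info_series, value_series):
    """Pair each sample with the info series that shares its instance and
    device labels. The result carries the info labels (serial_number etc.)
    together with the sample value."""
    index = {(labels["instance"], labels["device"]): labels for labels in info_series}
    joined = []
    for labels, value in value_series:
        info = index.get((labels["instance"], labels["device"]))
        if info is None:
            continue  # no matching info series, so the sample is dropped
        joined.append((dict(info), value))
    return joined


# Hypothetical smartctl_device label sets and temperature samples.
device_info = [
    {"instance": "host1", "device": "sda", "serial_number": "SN1"},
    {"instance": "host1", "device": "sdb", "serial_number": "SN2"},
]
temperatures = [
    ({"instance": "host1", "device": "sda"}, 35.0),
    ({"instance": "host1", "device": "sdb"}, 28.0),
]

# Equivalent of the "> 30" filter in the alert expression.
hot = [(labels, value) for labels, value in join_on_instance_device(device_info, temperatures) if value > 30]
print(hot)
```

Because both sides are scraped at the same moment, the device label agrees between them even if it changes after the next reboot, and the serial number comes along with the result.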

@kfox1111

I think there is a use case for querying both by disk-by-path (so you can identify slot 3 in node B in queries) as well as drive serial_numbers so you can track a drive no matter where it shows up.

@kennethso168

kennethso168 commented Feb 21, 2024

> I think there is a use case for querying both by disk-by-path (so you can identify slot 3 in node B in queries) as well as drive serial_numbers so you can track a drive no matter where it shows up.

Agreed. I believe that to support both use cases, it may be appropriate to revert #83. Users could then configure Prometheus to drop the unwanted label(s) in the scrape config.

I have forked and reverted #83 for my own use. If deemed appropriate I can make a PR as well.

@k0ste
Contributor

k0ste commented Feb 21, 2024

> > I think there is a use case for querying both by disk-by-path (so you can identify slot 3 in node B in queries) as well as drive serial_numbers so you can track a drive no matter where it shows up.
>
> Agreed. I believe that to support both use cases, it may be appropriate to revert #83. Users could then configure Prometheus to drop the unwanted label(s) in the scrape config.
>
> I have forked and reverted #83 for my own use. If deemed appropriate I can make a PR as well.

This is impossible to "resolve" on the Prometheus side, because before Prometheus can drop something it first has to download it.

What exactly is the issue you have with the current design?

@kennethso168

Oh, my Google-fu must have been really bad yesterday.

I would like to track the lifetime (e.g. Total Bytes Written) of a disk consistently. Yesterday I tested this by intentionally causing a device node flip, and the stats were, as expected, flipped.

[Image: grafana1]

You provided an alert rule example, which inspired me to do something like this:

[Image: grafana2]

That was still two series for the same drive before and after the device node flip. I would really like to join them as one. Then I was stuck.

In fact, I was using VictoriaMetrics instead of Prometheus. I tried using the MetricsQL label_del function to drop the device label, but I got duplicate output timeseries. I was stuck and thought that it was impossible to solve without changing the labels exported by the exporter. So I forked the project, added back the exported labels (including the serial number of the hard drive etc.) and configured my scrape config to drop the device label.

After your reply I Googled again and came across "metricsQL: add function for merging time series values based on label value".

That inspired me to come up with the following query and my problem is solved!

max without (device) (
    smartctl_device{form_factor="3.5 inches"} 
        * on (instance, device) group_left() 
    smartctl_device_attribute{instance=~'fileserver',attribute_name="Total_LBAs_Written",attribute_value_type="raw"}
) * 512

[Image: grafana3]
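To illustrate why aggregating away the device label stitches the two halves together, here is a rough Python model of that aggregation step. The sample data and the function name are hypothetical, and real query engines evaluate this per timestamp with extra rules (staleness, lookback) that the sketch ignores:

```python
from collections import defaultdict


def max_without(samples, drop_label):
    """Group samples by all labels except drop_label and keep, per timestamp,
    the maximum value in each group."""
    merged = defaultdict(dict)
    for labels, timestamp, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        current = merged[key].get(timestamp)
        merged[key][timestamp] = value if current is None else max(current, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# The same drive (serial SN1) seen as sdc before a reboot and as sda after it.
samples = [
    ({"serial_number": "SN1", "device": "sdc"}, 1, 100),
    ({"serial_number": "SN1", "device": "sdc"}, 2, 110),
    ({"serial_number": "SN1", "device": "sda"}, 3, 120),
    ({"serial_number": "SN1", "device": "sda"}, 4, 130),
]

result = max_without(samples, "device")
print(result)
```

With the device label gone, both halves share the same remaining labels and end up as one continuous series keyed by serial number.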

And for calculating the rate of increase:

[Image: grafana4]

Still open to discussion on whether adding a serial number label is necessary.

@k0ste
Contributor

k0ste commented Feb 24, 2024

> That inspired me to come up with the following query and my problem is solved!

Good to hear that!

You can also write this query like this:

sum by (attribute_name, model_name, serial_number) # <- result labels, that you actually need
  (smartctl_device_attribute{attribute_name="Total_LBAs_Written",
  attribute_value_type="raw"}
    * on (instance, device) group_right(attribute_name)
  smartctl_device{form_factor="3.5 inches"}
) * 512

The more you practice and the more production cases you resolve, the more experience you gain in building dashboards that work for you. For me the priority is "do not ssh into the host to find something out".
On our host dashboard we have a panel where supply or engineering teams can get answers:

  • how many disks are installed on a host
  • which disk models and serial numbers
  • how long each disk has been running
  • disk health (your SMART attributes)

Something like this:

[Image: Screenshot 2024-02-24 at 17 51 19]

We also have our own thresholds (the value for the disk replacement alert), and you can sort Grafana tables by attribute value, for example:

[Image: Screenshot 2024-02-24 at 17 53 32]

Good luck!

@lazywebm

lazywebm commented May 2, 2024

I'm in the process of setting up this exporter for the first time, on a system with 25 SATA disks attached. #83 is not a good change in my opinion. I would always want to have the device serial number in the metrics output, for the reasons described above (unstable identification via /dev/sdX).

Until then, I'll use my own forked version as well.

@Informatic
Contributor

IMO a somewhat correct solution for this is replacing /dev/sd* with proper /dev/disk/by-id/... symlinks. This should in fact be solvable by passing the (seemingly) undocumented --device by-id option to smartctl --scan:

# smartctl --scan --device by-id 
/dev/disk/by-id/ata-CT500MX500SSD1_XXX -d scsi # /dev/disk/by-id/ata-CT500MX500SSD1_XXX, SCSI device
/dev/disk/by-id/ata-Hitachi_HUS724030ALE641_XXX -d scsi # /dev/disk/by-id/ata-Hitachi_HUS724030ALE641_XXX, SCSI device
/dev/disk/by-id/ata-ST31000528AS_XXX -d scsi # /dev/disk/by-id/ata-ST31000528AS_XXX, SCSI device
/dev/disk/by-id/ata-TOSHIBA_HDWD130_XXX -d scsi # /dev/disk/by-id/ata-TOSHIBA_HDWD130_XXX, SCSI device
/dev/disk/by-id/ata-TOSHIBA_HDWD130_YYY -d scsi # /dev/disk/by-id/ata-TOSHIBA_HDWD130_YYY, SCSI device
/dev/disk/by-id/ata-TOSHIBA_HDWD130_ZZZ -d scsi # /dev/disk/by-id/ata-TOSHIBA_HDWD130_ZZZ, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

As seen above, it's not perfect, since it only applies to SATA/SAS devices (the NVMe device keeps its /dev/nvme0 name), but this should be easily solvable in smartmontools regardless.

I'll (try to) prepare a PR adding a flag that enables this behaviour, since it's a breaking change over what was there before.
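As a rough idea of how a consumer could pick the device path and type out of scan output shaped like the lines above, here is a minimal Python sketch. It assumes the "path -d type # comment" layout shown in this comment; the function name is hypothetical and this is not the exporter's actual parsing code:

```python
def parse_scan_line(line):
    """Split one line of `smartctl --scan` style output into
    (device_path, device_type), or return None if the line does not match."""
    entry = line.split("#", 1)[0].split()  # drop the trailing comment
    if len(entry) >= 3 and entry[1] == "-d":
        return entry[0], entry[2]
    return None


scan_output = """\
/dev/disk/by-id/ata-CT500MX500SSD1_XXX -d scsi # /dev/disk/by-id/ata-CT500MX500SSD1_XXX, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
"""

devices = [parsed for parsed in map(parse_scan_line, scan_output.splitlines()) if parsed]
print(devices)
```

The by-id path could then serve as the device label, while entries such as /dev/nvme0 would still carry the unstable kernel name.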
