Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add device type support #107

Closed
wants to merge 1 commit into from
Closed

Add device type support #107

wants to merge 1 commit into from

Conversation

SuperQ
Copy link
Contributor

@SuperQ SuperQ commented Jan 10, 2023

Allow passing a custom -d / --device= flag to smartctl. The default is the same (auto) as upstream smartctl.

Fixes: #26

Signed-off-by: SuperQ [email protected]

Allow passing a custom `-d` / `--device=` flag to smartctl. The default
is the same (`auto`) as upstream smartctl.

Fixes: #26

Signed-off-by: SuperQ <[email protected]>
@kfox1111
Copy link

I think it can (and in some cases, must?) be passed multiple times. sometimes one per drive?

@SuperQ
Copy link
Contributor Author

SuperQ commented Jan 10, 2023

According to the docs, it can only set once per device. But, given that we loop over the devices, it does make sense that we allow setting it differently per device.

We could do something like parse the device like --smartctl.device=/dev/foo;type. For example, --smartctl.device=/dev/sda;cciss,0.

@darxriggs
Copy link

@SuperQ As already suggested by you, this change should be adapted so that it is possible to set the device type per --smartctl.device parameter. This is for example to support the RAID controller case where only the logical device is visible to the OS and the actual physical devices behind this logical device can only be accessed via the smartctl --device=TYPE parameter.

Example: smartctl usage

smartctl --all --device=cciss,0 /dev/sda
smartctl --all --device=cciss,1 /dev/sda

The part where the result of smartctl --scan is used should also be updated. For some RAID controllers this scanning also returns the device type which is available as the type property in JSON. For these controllers specifying the devices manually then would not be required anymore.

Example: smartctl --scan output

/dev/sda -d scsi 
/dev/sdb -d scsi 
/dev/bus/0 -d megaraid,0 
/dev/bus/0 -d megaraid,1 
/dev/bus/0 -d megaraid,2 
/dev/bus/0 -d megaraid,3

@k0ste
Copy link
Contributor

k0ste commented Jan 27, 2023

@darxriggs thank you for good explanation!
Can you also record and paste (as .txt attachment) all json outputs from smartctl? For developer is very (almost impossible) to develop without test data

@darxriggs
Copy link

@darxriggs
Copy link

Looking at the currently generated metrics and this change, it becomes apparent that a third detail should be adapted too - the labels.

Example

# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda"} 1

Now that device is not unique anymore with this change (it will be possible to specify the same device but with different types) an additional label is required to identify the actual device. Either an equivalent of the type or info_name from the JSON output could be used.

Example

# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda","type"="cciss,0"} 1
smartctl_device_smart_status{device="sda","type"="cciss,1"} 1

@kfox1111
Copy link

Is that really true? I thought device still needed to be something under /dev, so should be unique still?

@darxriggs
Copy link

darxriggs commented Jan 27, 2023

@kfox1111 Let me refer you to the man page of smartctl's --device option. It describes all the supported types and use cases in more detail.

A clear distinction must be made between the positional device argument and the named --device option (contained as type property in the JSON) of smartctl and the device label of smartctl_exporter.

@kfox1111
Copy link

Cool. thanks for the info. :)

@anthonyeleven
Copy link

Followup to @darxriggs ' notes above:

It is entirely possible and not that uncommon for a given system to present both LSI HBA virtual devices and passed-through individual drives.

Eg. the first two drives are mirrored with the HBA and need -d megaraid,0, -d megaraid,1, but the balance of drives are passed through and are accessed as /dev/sdXX.

On at least some HBAs with some personality / mode settings, any physical drives that aren't part of a VD are passed through.

While I personally am not a fan of HBA RAID or RoC HBAs in general, we're still stuck with legacies, and some organizations embrace HBA RAID.

Ideally the exporter would iterate smartctl invocations over the results from smartctl --scan, invoking as necessary for each entry. Having to hardcode devices and access flags in a config file or on the commandline would be rather inconvenient and prone to errors -- that would require external scripting to discover devices and keep that mapping updated across inventory changes.

To the note about adding a type label for, let's call them occulted devices, I get where the suggested scheme is coming from. I don't think that's the ideal approach, though. type as a label is IMHO a misnomer, this isn't a type but rather a subunit in the vein of a LUN. I'd rather see something like device=sda,0 or device=sda.0 or device=megaraid0.0. It would be awkward to have to correlate labels for some but not all metrics, or to have to devices complex relabeling rules at Prometheus ingest.

Notes:

  • Dell's PERC HBAs are rebadged LSI.
  • A system may contain multiple HBAs of the same or different type, one might argue that metrics should reflect the HBA to which a drive is attached.

@anthonyeleven
Copy link

anthonyeleven commented Jul 13, 2023

Also, to be clear, the positional /dev/xx parameter to storcli is not actually used, so it could be anything. Arguably then the device label should be cciss,0, cciss,1, megaraid,4, etc. I'm currently using a customized fork of someone's smartmon.sh and jumping through hoops to make these labels useful -- and aligned with drives that are not handicapped by being under an RoC volume.

@robbat2
Copy link
Contributor

robbat2 commented Oct 16, 2023

Can we figure out how to make this more scalable & managable?

Example: a 36-disk supermicro chassis or a 60-disk BackBlaze pod.

Concerns:

  • naively, this would lead to a very long commandline!
  • changing the parameters to update a list of drives means restarting the exporter; could this be avoided w/ a config file that hot-reloads?
  • This would bringing back the config file that was removed in 0.8.0
  • Are there any controllers where the identifier is not stable between boots (e.g. the drives are not numbered by slot, but rather by enumeration)
  • Are there multipath SAS controllers&disks, where a single disk shows up twice the same system?

@NiceGuyIT
Copy link
Member

@SuperQ Do you mind if I make this a feature branch so others can work on it and iron out the details? I don't have any devices for testing.

@zxzharmlesszxz
Copy link
Contributor

Any updates?

@anthonyeleven
Copy link

At this point I'm just going to have my protege write a SMART collector from scratch in Python. This is a pretty glaring omission on the part of smartctl_exporter.

@kfox1111
Copy link

@anthonyeleven Could your protege update this pr instead? It seems like the pr is close, but the original developer ran out of time to finish it?

@anthonyeleven
Copy link

He's not a strong golang coder.

@zxzharmlesszxz
Copy link
Contributor

I made changes for fix this. #205

@jakubgs
Copy link

jakubgs commented May 28, 2024

This PR was promising, it seemed like a simple solution to a big problem with cciss type devices.

@anthonyeleven
Copy link

We've given up on smartctl_exporter for now and have instead been forking storcli.py

@k0ste
Copy link
Contributor

k0ste commented Oct 16, 2024

@SuperQ we have working PR #235 without merge
Can you please push it so the community can start rebase for the other good patches?

Thanks

@SuperQ
Copy link
Contributor Author

SuperQ commented Dec 19, 2024

Superseded by #257

@SuperQ SuperQ closed this Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non standard device accessors such as -d cciss,N do not work
9 participants