Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jmx state metrics #12369

Merged
merged 24 commits into from
Oct 17, 2024
Merged

jmx state metrics #12369

merged 24 commits into from
Oct 17, 2024

Conversation

SylvainJuge
Copy link
Contributor

@SylvainJuge SylvainJuge commented Sep 30, 2024

Fixes #12229

This introduces the ability to capture "state metrics" that are captured from a JMX attribute value.

A "state metric" has the following properties:

  • the value is captured from a single MBean attribute
  • the MBean attribute providing state value can be mapped to a string
  • the state is reflected in a single "state attribute"
  • the state metric value will be either 0 or 1
  • the state metric will use the "state attribute" for breakdown, so removing the "state attribute" will always sum to 1.
  • in order to limit cardinality, the set of possible states must be known and provided in advance through a "state mapping".

This follows a similar way to map state as we have in the (experimental) hardware semantic conventions as seen here with the hw.state attribute.


Example from a common MBean in Tomcat:

⚠️ This is just an example of "how to capture state metrics", the actual definitions that might be added for Tomcat will not match this example

JMX object name: Catalina:type=Connector,port=*, JMX attribute name stateName, the value of stateName comes from LifeCycleState enum (12 distinct values).

We can capture this as tomcat.connector.state with two attributes

  • state with value: ok, failed or degraded (name is not limited to state and any name could be used here).
  • port that provides per-port breakdown (from the mbean wildcard)

The YAML configuration used:

rules:
  - bean: Catalina:type=Connector,port=*
    mapping:
      stateName:
        type: state
        metric: tomcat.connector.state
        metricAttribute:
          port: param(port)
          state:
            ok: STARTED
            failed: [STOPPED,FAILED]
            degraded: _

For a given port, let's say 8080 and this configuration 3 metrics are captured on every sample. The metric type will always be updowncounter and the value either 0 or 1.

When stateName = STARTED, we have:

  • tomcat.connector.state value = 1, attributes port = 8080 and state = ok
  • tomcat.connector.state value = 0, attributes port = 8080 and state = failed
  • tomcat.connector.state value = 0, attributes port = 8080 and state = degraded

When stateName = STOPPED or FAILED, we have:

  • tomcat.connector.state value = 0, attributes port = 8080 and state = ok
  • tomcat.connector.state value = 1, attributes port = 8080 and state = failed
  • tomcat.connector.state value = 0, attributes port = 8080 and state = degraded

For other values of stateName, we have:

  • tomcat.connector.state value = 0, attributes port = 8080 and state = ok
  • tomcat.connector.state value = 0, attributes port = 8080 and state = failed
  • tomcat.connector.state value = 1, attributes port = 8080 and state = degraded

The ability to limit cardinality is relevant here to reduce the cardinality from the 12 potential values to only 3, most of which will very likely never be relevant in practice.

Comment on lines +99 to +112
Metric m2 = attr.get("ATTRIBUTE2");
assertThat(m2).isNotNull();
assertThat(m2.getMetric()).isEqualTo("METRIC_NAME2");
assertThat(m2.getDesc()).isEqualTo("DESCRIPTION2");
assertThat(m2.getUnit()).isEqualTo("UNIT2");

JmxRule def2 = defs.get(1);
assertThat(def2.getBeans()).containsExactly("OBJECT:NAME3=*");
assertThat(def2.getMetricAttribute()).isNull();

assertThat(def2.getMapping()).hasSize(1);
Metric m3 = def2.getMapping().get("ATTRIBUTE3");
assertThat(m3.getMetric()).isEqualTo("METRIC_NAME3");
assertThat(m3.getUnit()).isNull();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for reviewer] this is mostly adding extra test assertions, not related to adding state metrics.

Comment on lines +138 to +146
assertThat(attr)
.hasSize(5)
.containsKeys("ATTRIBUTE31", "ATTRIBUTE32", "ATTRIBUTE33", "ATTRIBUTE34", "ATTRIBUTE35");
assertThat(attr.get("ATTRIBUTE32")).isNull();
assertThat(attr.get("ATTRIBUTE33")).isNull();
assertThat(attr.get("ATTRIBUTE34")).isNotNull();
Metric attribute34 = attr.get("ATTRIBUTE34");
assertThat(attribute34).isNotNull();
assertThat(attribute34.getMetric()).isEqualTo("METRIC_NAME34");
assertThat(attr.get("ATTRIBUTE35")).isNull();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for reviewer] this is mostly adding extra test assertions, not related to adding state metrics.

Comment on lines +195 to +233
assertThat(m.getAttributes())
.hasSize(3)
.extracting("attributeName")
.contains("LABEL_KEY1", "LABEL_KEY2", "LABEL_KEY3");

MetricInfo metricInfo = m.getInfo();
assertThat(metricInfo.getMetricName()).isEqualTo("PREFIX.METRIC_NAME1");
assertThat(metricInfo.getDescription()).isEqualTo("DESCRIPTION1");
assertThat(metricInfo.getUnit()).isEqualTo("UNIT1");
assertThat(metricInfo.getType()).isEqualTo(MetricInfo.Type.COUNTER);
})
.anySatisfy(
m -> {
assertThat(m.getMetricValueExtractor().getAttributeName()).isEqualTo("ATTRIBUTE2");
assertThat(m.getAttributes())
.hasSize(2)
.extracting("attributeName")
.contains("LABEL_KEY1", "LABEL_KEY2");

MetricInfo metricInfo = m.getInfo();
assertThat(metricInfo.getMetricName()).isEqualTo("PREFIX.METRIC_NAME2");
assertThat(metricInfo.getDescription()).isEqualTo("DESCRIPTION2");
assertThat(metricInfo.getUnit()).isEqualTo("UNIT2");
})
.anySatisfy(
m -> {
assertThat(m.getMetricValueExtractor().getAttributeName()).isEqualTo("ATTRIBUTE3");

MetricInfo mb3 = m.getInfo();
assertThat(mb3.getMetricName()).isEqualTo("PREFIX.ATTRIBUTE3");
MetricInfo metricInfo = m.getInfo();
assertThat(metricInfo.getMetricName()).isEqualTo("PREFIX.ATTRIBUTE3");
assertThat(metricInfo.getDescription()).isNull();

// syntax extension - defining a default unit and type
assertThat(mb3.getType()).isEqualTo(MetricInfo.Type.UPDOWNCOUNTER);
assertThat(mb3.getUnit()).isEqualTo("DEFAULT_UNIT");
assertThat(metricInfo.getType())
.describedAs("default type should match jmx rule definition")
.isEqualTo(jmxDef.getMetricType());
assertThat(metricInfo.getUnit())
.describedAs("default unit should match jmx rule definition")
.isEqualTo(jmxDef.getUnit());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for reviewer] this is mostly adding extra test assertions, not related to adding state metrics.

return new ArrayList<>(set.values());
}

private static StateMapping getEffectiveStateMapping(Metric m, JmxRule rule) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for reviewer] for state mapping, the resulting mapping is not a combination of rule level and metric level definitions. Only one of them is used.

@SylvainJuge SylvainJuge marked this pull request as ready for review October 1, 2024 14:39
@SylvainJuge SylvainJuge requested a review from a team as a code owner October 1, 2024 14:39
@SylvainJuge
Copy link
Contributor Author

Making it ready for review now that it has been validated/tested with a real tomcat instance.

@trask
Copy link
Member

trask commented Oct 2, 2024

cc @PeterF778 in case you have a chance to review

@PeterF778
Copy link
Contributor

PeterF778 commented Oct 2, 2024

The YAML configuration used:

rules:
  - bean: Catalina:type=Connector,port=*
    mapping:
      stateName:
        type: updowncounter
        metric: tomcat.connector.count
        stateMapping:
          ok: STARTED
          failed: [STOPPED,FAILED]
          degraded: '*'
        metricAttribute:
          state: statekey()
          state: param(port)

I believe you have a typo, it should be "port: param(port)".

Also, for the sake of better understanding, shouldn't the metric name be something like tomcat.connector.state rather than count. I do not see any real counts here.

@PeterF778
Copy link
Contributor

Some observations: the OpenTelemetry metric type must always be updowncounter, and the stateMapping can be used only for one metric attribute. Hence, we could use a different syntax (hopefully easier to read and maintain):

rules:
  - bean: Catalina:type=Connector,port=*
    mapping:
      stateName:
        type: state    # maps automatically to updowncounter with values 0 or 1
        metric: tomcat.connector.status
        metricAttribute:
          state:
            ok: STARTED
            failed: [STOPPED,FAILED]
            degraded: '*'
          port: param(port)

@trask
Copy link
Member

trask commented Oct 2, 2024

the simplification seems nice.

would this attribute have to be named state, or would its type be implicitly understood due to the presence of the nested yaml?

          state:
            ok: STARTED
            failed: [STOPPED,FAILED]
            degraded: '*'

@PeterF778
Copy link
Contributor

PeterF778 commented Oct 2, 2024

would this attribute have to be named state, or would its type be implicitly understood due to the presence of the nested yaml?

          state:
            ok: STARTED
            failed: [STOPPED,FAILED]
            degraded: '*'

It could be named anything, the nested yaml would provide the way to map it to OTel metric.

@SylvainJuge
Copy link
Contributor Author

Thanks a lot @PeterF778 for your suggestions, that really makes it simpler:

  • adding a state metric type that automatically maps to 0/1 updowncounter
  • using nested yaml structure to detect the state attribute
  • naming the metric tomcat.connector.state even if that's more something to "conventions for JMX state metrics", but given this is a new feature we can start spec/document it right now without having to change other metrics.

So I will update this PR to implement this.

Copy link
Contributor Author

@SylvainJuge SylvainJuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PeterF778 I've added the changes you suggested, let me know if there are further changes/improvements needed here.

Comment on lines +198 to +200
protected Object getSampleValue(MBeanServerConnection connection, ObjectName objectName) {
return extractAttributeValue(connection, objectName, logger);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for reviewer] this allows to bypass the sample value check for the "artificial" attribute value that we need for the state metric.

Comment on lines +125 to +129
case STATE:
{
// CHECKSTYLE:ON
throw new IllegalStateException("state metrics should not be registered");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for reviewer] this is needed to have the switch statement cover all cases.

@github-actions github-actions bot requested a review from theletterf October 9, 2024 08:41
}

if (defaultState == null) {
throw new IllegalStateException("missing default state");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is requiring the default state necessary? I understand that if it is missing and the actual state string value is not in the list, the metric value will be zero for all combinations of provided metric attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this might be "best practice" rather than strict requirement, but this would likely be the case for all the mappings embedded in the agent anyway, so preventing invalid metrics and making mappings more future proof is probably a better option, I'm fine to remove this as a strict requirement though.

@SylvainJuge
Copy link
Contributor Author

Also I think changing the default wildcard value from * to _ could help avoid having to quote the default '*' as _ is a valid string literal in YAML, while * is not, I'll update the PR with this proposal as it makes mappings simpler to write.

@PeterF778
Copy link
Contributor

Also I think changing the default wildcard value from * to _ could help ...

Yes, this is fine. However, I just realized that the values of the string MBeans can be anything, in particular * and _ (or even empty string). But I do not have any ideas for fixing that.

@SylvainJuge
Copy link
Contributor Author

I just realized that the values of the string MBeans can be anything, in particular * and _ (or even empty string). But I do not have any ideas for fixing that.

I agree, it can be anything, however I think it's definitely unlikely enough that we need to wait for any feedback on this to be an issue. The empty string might already be supported and we would just need to have a way to escape the _.

@SylvainJuge
Copy link
Contributor Author

@trask do you think this could make it for the 2.9.0 release ?

…etry/instrumentation/jmx/yaml/MetricStructure.java

Co-authored-by: Jay DeLuca <[email protected]>
@laurit laurit added this to the v2.9.0 milestone Oct 16, 2024
Comment on lines 239 to 242
connector_state:
ok: STARTED
failed: [STOPPED,FAILED]
degraded: _
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought of open-telemetry/opentelemetry-configuration#115

though we're already not doing this in the jmx metrics (prefering yaml brevity), so this comment should not hold up this PR

@jack-berg do you think this approach (using objects for brevity) in jmx metrics module is ok long-term?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we would switch to using an array of objects, for example with name and value, then we could add a third one with default attribute key with value true|false to indicate which entry would be the default instead of using a "magic value" to represent this.

@@ -312,25 +361,125 @@ void testConf8() throws Exception {
assertThat(mb1.getUnit()).isNull();
}

private static final String CONF9 =
"--- # keep stupid spotlessJava at bay\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

- a string literal or a string array
- a `_` character to provide default option and avoid enumerating all values.

Exactly one `_` value must be present in the mapping to ensure all possible values of the MBean attribute can be mapped to a state key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why this is a requirement. If there are only certain states that you are interested in, why not allow to only specify those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a requirement, but allows to ensure that state metrics are always captured consistently, this is more best practice than a technical requirement.

Doing so means that at a given time, the sum of all samples is equal to 1, which translates to "the state of the thing we observe is known at all times". I think it's worth doing that for now unless we have good examples where it can be problematic.

The example I have in mind when adding this restriction is again probably biased by current focus on Tomcat state, so I'll elaborate a bit:

Let's say we monitor 3 tomcat servers, each of them with a single connector and a state metric to get the connector state with on|off values. When consuming those metrics to create a dashboard:

Scenario 1:

  • default set to off to indicate something is not right.
  • creating a linear chart (sum) with breakdown on state will always sum to 3, and the breakdown provides an obvious clue when something is not in on state.
  • creating a pie-chart with the latest value with breakdown on state also provides a good representation of the 3 tomcat instances.
  • the end-user does not even need to know how many tomcat instances exist, you get information from the ratio of on|off in the charts and then can investigate what could be wrong if needed.
  • creating an extra dedicated unknown state could be used to gather those states as they might represent a gap in the metrics mapping.

Scenario 2:

  • state value on when connector is RUNNING, off when connector is STOPPED, other possible values are not mapped.
  • creating a linear chart (sum) with breakdown on state will sum to 3 when the state is either on or off, but will go down whenever an un-mapped state is present
  • the end-user needs to know that the "usual value is 3", and that "when it goes down to less than 3 something is unknown and maybe we have a gap in our mapping"
  • it is possible to trigger an alert by monitoring when the value goes down, but this is definitely more complex and depends mostly on the backend and metrics processing.
  • with some visualizations it's even more tricky, for example when creating a pie-chart of the latest reported values to represent the "current state", if not all instances can be represented because their state is unknown then they will be ignored. In an extreme case, if only a single instance is in on state and the two others are in an unknown state then the pie-chart could be misleading representing "100% is running and it's fine", when in practice it's "only 1/3 fine".

Given that we usually need to monitor things that are working properly and even more the ones that not working properly (which are not in an expected state), having this as a requirement makes it harder to shoot yourself in the foot by missing a state whose value comes from an implementation detail of the monitored product.

@trask
Copy link
Member

trask commented Oct 17, 2024

@trask do you think this could make it for the 2.9.0 release ?

let's discuss in Thu SIG meeting, we can hold release until Fri if we need a bit more time to discuss

@trask
Copy link
Member

trask commented Oct 17, 2024

Have you looked at the state metrics in the collector? open-telemetry/semantic-conventions#1032 (comment)

@SylvainJuge
Copy link
Contributor Author

Have you looked at the state metrics in the collector? open-telemetry/semantic-conventions#1032 (comment)

No, I wasn't aware of those "state metrics", but looking for those

For k8sclusterreceiver is the one where this comment author has challenges, and we can see that the metric modeling is not idea and far from "state metrics" as defined here:

k8s.pod.phase metric where the value encodes the state link

some metrics split in two metrics depending on the state, for example k8s.replicaset.available and k8s.replicaset.desired.

For kubeletstatsreceiver (link) does not seem to report metrics that look like "state metrics".

@SylvainJuge
Copy link
Contributor Author

I've restored the original wildcard *:

  • it must be quoted when used alone AND in a string array (I initially thought in an array it was fine without, but after checking it's not)
  • the case @breedx-splk raised during today's meeting about duplicate mappings is already covered by unit tests

So it's ready for a final review (just to be sure) and it should be good to be merged.

@trask trask merged commit 238a201 into open-telemetry:main Oct 17, 2024
56 checks passed
@SylvainJuge SylvainJuge deleted the state-jmx-metrics branch October 18, 2024 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ability to capture non-numeric MBean attributes as metrics
5 participants