Opening this ticket as a way to track open questions on the service map feature.
For the most part, we tried to emulate the filtering and aggregation in OpenSearch Dashboards. However, we ran into a few cases where the numbers for latency and throughput don't match:
1. Destination resource filtering
In order to get correct latencies and throughput for a service, the span-* indices need to be filtered by the operations that a particular service receives. For example, the average latency for `customer` below will be calculated only from the spans that have the `HTTP GET customer` operationName.
This holds for all the other services as well. It is calculated from the query filter below (from the Data Prepper dashboard with Hot R.O.D. test data):
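Roughly, in query-DSL terms, we understand that filter to look something like the sketch below. The field names (`operationName`, `durationInNanos`, `spanId`) are the ones used in this issue's description, not copied from the actual dashboard request, which may differ:

```ts
// Sketch of the per-service stats query as we understand it. Field names are
// assumed from this issue's description; the real dashboard request may differ.
const customerStatsQuery = {
  size: 0,
  query: {
    bool: {
      // Keep only spans for operations that the `customer` service receives.
      filter: [{ terms: { operationName: ['HTTP GET customer'] } }],
    },
  },
  aggs: {
    // Average latency over the matching spans.
    avg_latency: { avg: { field: 'durationInNanos' } },
    // Throughput: how many such spans arrived in the selected time range.
    throughput: { value_count: { field: 'spanId' } },
  },
};
```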
However, this doesn't hold up for the `frontend` service. In the test data, there are actually traces that contain spans with serviceName `frontend` and operationName `/driver.DriverService/FindNearest`, AND others with serviceName `driver` and operationName `/driver.DriverService/FindNearest`. From the trace timeline below we can see that these spans overlap:
Shouldn't one of these spans then be excluded from the calculation of either frontend's or driver's latency/throughput? More specifically, shouldn't it be frontend's span?
Therefore, we would expect the above filter to include not only the operationName, but also the service that the request is made TO. That way we wouldn't include the `/driver.DriverService/FindNearest` operation in frontend's stats, only in driver's, meaning there would be a filter: `operationName='/driver.DriverService/FindNearest' && serviceName='driver'`.
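In query-DSL terms, the change we'd expect looks something like this (a sketch, with the same assumed field names as above):

```ts
// Sketch of the filter we would expect: scope the operation to the service
// the request is made TO, so that frontend's client-side span for the same
// operation is excluded from driver's stats.
const driverStatsFilter = {
  bool: {
    filter: [
      { term: { serviceName: 'driver' } },
      { term: { operationName: '/driver.DriverService/FindNearest' } },
    ],
  },
};
```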
There's probably a reason why the filtering doesn't include the serviceName too, so it would be good to know the background on this.
2. Latency calculations for single trace view
OpenSearch Dashboards has a single trace view, where the user can view the spans, timeline, and service map for a particular trace.
We request the stats for the service map by including the aggregations for the trace list (with a range) and just adding a filter by `traceId: {our trace}`.
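Concretely, our request looks something like the sketch below (the traceId value is a placeholder, and the aggregation body is elided since it's identical to the trace-list one):

```ts
// Sketch of our single-trace request: the trace-list aggregations plus a
// term filter on traceId ('<our trace>' is a placeholder).
const singleTraceStatsQuery = {
  size: 0,
  query: {
    bool: {
      filter: [{ term: { traceId: '<our trace>' } }],
    },
  },
  aggs: {
    // ...same per-service latency/throughput aggregations as the trace list
  },
};
```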
However, we discovered this is not how OpenSearch does it. In the code for dashboards-observability, it seems like the average latency of the service is divided by the throughput:
https://github.com/opensearch-project/dashboards-observability/blob/79faaebf3cce56f6bd0f4c3b8f7d712518b4f8d3/public/components/trace_analytics/components/traces/trace_view.tsx#L194
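As far as we can tell, that line boils down to something like this (our paraphrase, with names of our choosing, not the actual code):

```ts
// Our paraphrase of the linked calculation: the service's aggregated average
// latency is divided by its throughput, instead of re-aggregating latency
// over only the spans that belong to the selected trace.
function latencyShownInTraceView(avgLatency: number, throughput: number): number {
  return throughput > 0 ? avgLatency / throughput : 0;
}
```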
This leads to a significant difference in calculations between the plugin and OS Dashboards.
Does this mean that spans aren't filtered by operation name / destination.resource (like in the previous section) when calculating latency for a service in a single trace? In general, it would be good to have some info on why it's calculated this way.