Service Map feature - investigation ticket for open questions #359

Open
idastambuk opened this issue Apr 15, 2024 · 0 comments
Labels: datasource/OpenSearch, question (Further information is requested)

Opening this ticket as a way to track open questions on the service map feature.

For the most part, we tried to emulate the filtering and aggregation in OpenSearch Dashboards. However, we ran into a few cases where the latency and throughput numbers don't match:

1. Destination resource filtering

In order to get correct latencies and throughput for a service, the span-* indices need to be filtered by the operations that the particular service receives. For example, the average latency for `customer` below is calculated only from the spans that have the `HTTP get customer` operationName:

[Screenshot: service map showing the average latency for the customer service]

The same logic applies to the other services as well. The numbers are calculated from the query filter below (taken from the Data Prepper dashboard with Hot R.O.D. test data):

[Screenshot: query filter from the Data Prepper dashboard]
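Since the screenshot may not reproduce here, a rough sketch of the filter shape as we understand it; the `operationName` field and example value come from this issue, while the surrounding Query DSL structure is our assumption:

```ts
// Hedged sketch (not the exact dashboard query): per-service stats are taken
// from span-* documents filtered only by the operations the service receives.
const customerStatsQuery = {
  query: {
    bool: {
      filter: [
        // Only spans for operations that the `customer` service receives.
        { terms: { operationName: ['HTTP get customer'] } },
      ],
    },
  },
};
```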

However, this doesn't hold up for the frontend service. In the test data, there are actually traces that contain spans with serviceName `frontend` and operationName `/driver.DriverService/FindNearest` AND others with serviceName `driver` and operationName `/driver.DriverService/FindNearest`. From the trace timeline below we can see that these spans overlap:

[Screenshot: trace timeline showing the overlapping frontend and driver spans]

Shouldn't one of these spans then be excluded from the calculation of either frontend's or driver's latency/throughput? More specifically, shouldn't it be the frontend's span?
Therefore, we would expect the above filter to include not only the operationName, but also the service that the request is made TO. So we wouldn't include the `/driver.DriverService/FindNearest` operation in frontend's stats, only in driver's, meaning there would be a filter like `operationName='/driver.DriverService/FindNearest' && serviceName='driver'` (a sketch follows below).
There's probably a reason why the filtering doesn't include the serviceName too, so it would be good to know the background on this.
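A minimal sketch of the filter we would expect instead, assuming the same span-* field names as above; the `serviceName` term is the proposed addition:

```ts
// Hedged sketch of the proposed filter: scope the operation to the service
// the request is made TO, so frontend's overlapping span is excluded from
// driver's stats. Field names are assumptions based on the span-* mapping
// discussed above.
const driverStatsQuery = {
  query: {
    bool: {
      filter: [
        { term: { operationName: '/driver.DriverService/FindNearest' } },
        { term: { serviceName: 'driver' } },
      ],
    },
  },
};
```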

2. Latency calculations for single trace view
OpenSearch Dashboards has a single trace view, where the user can view the spans, timeline, and service map for a particular trace.
We request the stats for the service map by reusing the aggregations from the trace list (with a range) and simply adding a filter by `traceId:{our trace}`.
However, we discovered this is not how OpenSearch Dashboards does it. In the dashboards-observability code, the average latency of the service appears to be derived by dividing latency by throughput:
https://github.com/opensearch-project/dashboards-observability/blob/79faaebf3cce56f6bd0f4c3b8f7d712518b4f8d3/public/components/trace_analytics/components/traces/trace_view.tsx#L194
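For contrast, a minimal sketch of the two approaches as we understand them; all names below are ours, not the plugin's or OpenSearch's:

```ts
// Our plugin's approach (hedged sketch): reuse the trace-list aggregations
// and narrow them to a single trace with a traceId filter. The placeholder
// trace id is illustrative.
const singleTraceFilter = {
  term: { traceId: '<our trace id>' },
};

// Our reading of dashboards-observability trace_view.tsx#L194: the service's
// latency in the single trace view looks like a latency total divided by
// throughput, rather than an average over operation-filtered spans.
function serviceLatency(totalLatencyMs: number, throughput: number): number {
  // Guard against division by zero for services with no recorded requests.
  return throughput > 0 ? totalLatencyMs / throughput : 0;
}
```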

This leads to a significant difference in the numbers between the plugin and OpenSearch Dashboards:
[Screenshots: service map latency for the same trace in the plugin vs. OpenSearch Dashboards]

Does this mean that spans aren't filtered by operation name / destination.resource (as in the previous section) when calculating latency for a service in a single trace? In general, it would be good to have some background on why it is calculated this way.

@idastambuk idastambuk added question Further information is requested datasource/OpenSearch labels Apr 15, 2024
@idastambuk idastambuk moved this from Incoming to Waiting in AWS Datasources Apr 15, 2024