Opening this ticket as a way to track open questions on the service map feature.
For the most part, we tried to emulate the filtering and aggregation in OpenSearch Dashboards. However, we ran into a few cases where the numbers for latency and throughput don't match:
1. Destination resource filtering
In order to get correct latencies and throughput for a service, the span-* indices need to be filtered by the operations that a particular service receives. For example, the average latency for `customer` below will be calculated only from the spans that have the `HTTP GET customer` operationName.
This holds for all the other services as well. It is calculated from the query filter below (from the Data Prepper dashboard with Hot R.O.D. test data):
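Roughly, in query-DSL terms, we understand that filter to look something like the sketch below. The field names (`operationName`, `durationInNanos`, `spanId`) are the ones used in this issue's description, not copied from the actual dashboard request, which may differ:

```ts
// Sketch of the per-service stats query as we understand it. Field names are
// assumed from this issue's description; the real dashboard request may differ.
const customerStatsQuery = {
  size: 0,
  query: {
    bool: {
      // Keep only spans for operations that the `customer` service receives.
      filter: [{ terms: { operationName: ['HTTP GET customer'] } }],
    },
  },
  aggs: {
    // Average latency over the matching spans.
    avg_latency: { avg: { field: 'durationInNanos' } },
    // Throughput: how many such spans arrived in the selected time range.
    throughput: { value_count: { field: 'spanId' } },
  },
};
```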
However, this doesn't hold up for the `frontend` service. In the test data, there are actually traces that contain spans with serviceName `frontend` and operationName `/driver.DriverService/FindNearest`, AND others with serviceName `driver` and operationName `/driver.DriverService/FindNearest`. From the trace timeline below we can see that these spans overlap:
Shouldn't one of these spans then be excluded from the calculation of either frontend's or driver's latency/throughput? More specifically, shouldn't it be frontend's span?
Therefore, we would expect the above filter to include not only the operationName, but also the service that the request is made TO. That way we wouldn't include the `/driver.DriverService/FindNearest` operation in frontend's stats, only in driver's, meaning there would be a filter: `operationName='/driver.DriverService/FindNearest' && serviceName='driver'`.
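In query-DSL terms, the change we'd expect looks something like this (a sketch, with the same assumed field names as above):

```ts
// Sketch of the filter we would expect: scope the operation to the service
// the request is made TO, so that frontend's client-side span for the same
// operation is excluded from driver's stats.
const driverStatsFilter = {
  bool: {
    filter: [
      { term: { serviceName: 'driver' } },
      { term: { operationName: '/driver.DriverService/FindNearest' } },
    ],
  },
};
```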
There's probably a reason why the filtering doesn't include the serviceName too, so it would be good to know the background on this.
2. Latency calculations for single trace view
OpenSearch Dashboards has a single trace view, where the user can view the spans, timeline, and service map for a particular trace.
We request the stats for the service map by including the aggregations for the trace list (with a range) and just adding a filter by `traceId: {our trace}`.
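Concretely, our request looks something like the sketch below (the traceId value is a placeholder, and the aggregation body is elided since it's identical to the trace-list one):

```ts
// Sketch of our single-trace request: the trace-list aggregations plus a
// term filter on traceId ('<our trace>' is a placeholder).
const singleTraceStatsQuery = {
  size: 0,
  query: {
    bool: {
      filter: [{ term: { traceId: '<our trace>' } }],
    },
  },
  aggs: {
    // ...same per-service latency/throughput aggregations as the trace list
  },
};
```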
However, we discovered this is not how OpenSearch does it. In the code for dashboards-observability, it seems like the average latency of the service is divided by the throughput:
https://github.com/opensearch-project/dashboards-observability/blob/79faaebf3cce56f6bd0f4c3b8f7d712518b4f8d3/public/components/trace_analytics/components/traces/trace_view.tsx#L194
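As far as we can tell, that line boils down to something like this (our paraphrase, with names of our choosing, not the actual code):

```ts
// Our paraphrase of the linked calculation: the service's aggregated average
// latency is divided by its throughput, instead of re-aggregating latency
// over only the spans that belong to the selected trace.
function latencyShownInTraceView(avgLatency: number, throughput: number): number {
  return throughput > 0 ? avgLatency / throughput : 0;
}
```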
This leads to a significant difference in calculations between the plugin and OS Dashboards.
Does this mean that spans aren't filtered by operation name / destination.resource (like in the previous section) when calculating latency for a service in a single trace? In general, it would be good to have some info on why it's calculated this way.