---
title: Troubleshooting Slow Requests in Cloud Foundry
owner: Routing
---
This topic suggests ways that an operator of Cloud Foundry (CF) can diagnose the location of app request delays.
## <a id="request-path"></a> App Request Path
App requests typically transit the following components. Only the router (Gorouter) and the app are within the scope of Cloud Foundry. Operators may have deployed the HAProxy load balancer that comes with CF instead of, or in addition to, an infrastructure load balancer.
<%= image_tag("request_lifecycle.png") %>
You can use `time` to measure a request's full round-trip time from the client, examine `cf logs` output to measure the time spent within Cloud Foundry, and add log messages to your app for fine-grained measurements of where the app itself spends time. By comparing these times, you can determine whether a delay originates outside CF, within the Gorouter, or in the app.
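If the total time is long, you can get a finer client-side breakdown with `curl`'s built-in `-w` timing variables, which separate DNS lookup and TCP connect time from the wait for the first response byte. A minimal example, using the same illustrative endpoint as the rest of this topic:
<pre class="terminal">
$ curl -o /dev/null -s -w "namelookup: %{time_namelookup}s\nconnect: %{time_connect}s\nstarttransfer: %{time_starttransfer}s\ntotal: %{time_total}s\n" http://app1.app_domain.com
</pre>
A large gap between `starttransfer` and `connect` means the time is spent waiting on the server side (load balancer, Cloud Foundry, or the app), while large `namelookup` or `connect` values point to DNS or network issues outside CF.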
The following sections describe a scenario of diagnosing the source of delay for an app `app1`.
## <a id="total-latency"></a> Measure Total Round-Trip App Requests
On a command line, run `time curl -v APP-ENDPOINT` to measure the total round-trip time for deployed app `app1`.
For example:
<pre class="terminal">
$ time curl -v http://app1.app_domain.com
GET /hello HTTP/1.1
Host: app1.app_domain.com
User-Agent: curl/7.43.0
Accept: */*
HTTP/1.1 200 OK
Content-Type: application/json;charset=utf-8
Date: Tue, 14 Dec 2016 00:31:32 GMT
Server: nginx
X-Content-Type-Options: nosniff
X-Vcap-Request-Id: c30fad28-4972-46eb-7da6-9d07dc79b109
Content-Length: 602
hello world!
real 2m0.707s
user 0m0.005s
sys 0m0.007s
</pre>
The `real` time output shows that the request to `http://app1.app_domain.com` took approximately 2 minutes, round-trip. This seems like an unreasonably long time, so it makes sense to find out where the delay is occurring. To narrow it down, the next step measures the part of that request response time that comes from within Cloud Foundry.
<p class="note"><strong>Note</strong>: If your <code>curl</code> outputs an error like <code>Could not resolve host: NONEXISTENT.com</code> then DNS failed to resolve. If <code>curl</code>returns normally but lacks a <code>X-Vcap-Request-Id</code>, the request from the Load Balancer did not reach Cloud Foundry.</p>
## <a id="within-cf"></a> Measure App Requests within Cloud Foundry
The `cf logs` command streams log messages from the Gorouter as well as from apps. To see the timestamps of Gorouter messages to and from your app, do the following:
1. If necessary, run `cf apps` to determine the name of the app.
1. Run `cf logs APP-NAME`. Replace `APP-NAME` with the name of the app.
1. From another terminal window, send a request to your app.
1. After your app returns a response, enter `Ctrl-C` to stop streaming `cf logs`.
For example:
<pre class="terminal">
$ cf logs app1
2016-12-14T00:33:32.35-0800 [RTR/0] OUT app1.app_domain.com - [14/12/2016:00:31:32.348 +0000] "GET /hello HTTP/1.1" 200 0 60 "-" "HTTPClient/1.0 (2.7.1, ruby 2.3.3 (2016-11-21))" "10.0.4.207:20810" "10.0.48.67:61555" x_forwarded_for:"52.3.107.171" x_forwarded_proto:"http" vcap_request_id:"01144146-1e7a-4c77-77ab-49ae3e286fe9" response_time:120.00641734 app_id:"13ee085e-bdf5-4a48-aaaf-e854a8a975df" app_index:"0" x_b3_traceid:"3595985e7c34536a" x_b3_spanid:"3595985e7c34536a" x_b3_parentspanid:"-"
2016-12-14T00:32:32.35-0800 [APP/PROC/WEB/0] OUT app1 received request at [14/12/2016:00:32:32.348 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
^C
</pre>
In the example above, the first line is the Gorouter access log message, which records both when the Gorouter received the request and how long it took to process it:
* `14/12/2016:00:31:32.348`: Gorouter receives request
* `response_time:120.00641734`: Gorouter round-trip processing time
This output shows that it took 120 seconds for the Gorouter to process the request, which means that the 2-minute delay above takes place within CF, either within the Gorouter or within the app.
To determine whether the app is responsible, [add logging](#app_logs) to your app to measure where it is spending time.
<p class="note"><strong>Note</strong>: Every incoming request should generate an access log message. If a request does not generate an access log message, it means the Gorouter did not receive the request.</p>
## <a id="app_logs"></a>Use App Logs to Locate Delays in CF
To gain a more detailed picture of where delays exist in your request path, augment the logging that your app generates. For example, call your logging library from the request handler to generate log lines when your app receives a request and finishes processing it:
<pre class="terminal">
2016-12-14T00:33:32.35-0800 [RTR/0] OUT app1.app_domain.com - [14/12/2016:00:31:32.348 +0000] "GET /hello HTTP/1.1" 200 0 60 "-" "HTTPClient/1.0 (2.7.1, ruby 2.3.3 (2016-11-21))" "10.0.4.207:20810" "10.0.48.67:61555" x_forwarded_for:"52.3.107.171" x_forwarded_proto:"http" vcap_request_id:"01144146-1e7a-4c77-77ab-49ae3e286fe9" response_time:120.00641734 app_id:"13ee085e-bdf5-4a48-aaaf-e854a8a975df" app_index:"0" x_b3_traceid:"3595985e7c34536a" x_b3_spanid:"3595985e7c34536a" x_b3_parentspanid:"-"
2016-12-14T00:32:32.35-0800 [APP/PROC/WEB/0] OUT app1 received request at [14/12/2016:00:32:32.348 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
2016-12-14T00:32:32.50-0800 [APP/PROC/WEB/0] OUT app1 finished processing req at [14/12/2016:00:32:32.500 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
</pre>
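To produce log lines like these, instrument your request handler. The following is an illustrative sketch only, assuming a Python Flask app; the route, logger setup, and message format are hypothetical, but the `X-Vcap-Request-Id` header is set by the Gorouter on requests it forwards, which is what lets you correlate app log lines with router access log messages.

```python
import logging
from datetime import datetime, timezone

from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app1")

def timestamp():
    # Matches the [14/12/2016:00:32:32.348 +0000] style used in the logs above.
    return datetime.now(timezone.utc).strftime("%d/%m/%Y:%H:%M:%S.%f")[:-3]

@app.route("/hello")
def hello():
    # The Gorouter adds an X-Vcap-Request-Id header to each request it
    # forwards; logging it lets you line app logs up with RTR access logs.
    request_id = request.headers.get("X-Vcap-Request-Id", "-")
    logger.info('app1 received request at [%s +0000] with "vcap_request_id": "%s"',
                timestamp(), request_id)
    result = "hello world!"  # the app's real work happens here
    logger.info('app1 finished processing req at [%s +0000] with "vcap_request_id": "%s"',
                timestamp(), request_id)
    return result
```

Anything the app writes to stdout or stderr is collected by Loggregator, so these lines appear in `cf logs` output alongside the Gorouter's `RTR` messages.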
Comparing the router access log messages from the [previous section](#within-cf) with the new app logs above, you can construct the following timeline:
* `14/12/2016:00:31:32.348`: Gorouter receives request
* `2016-12-14T00:32:32.35`: App receives request
* `2016-12-14T00:32:32.50`: App finishes processing request
* `2016-12-14T00:33:32.35`: Gorouter finishes processing request
The timeline indicates that the Gorouter took close to 60 seconds to send the request to the app and another 60 seconds to receive the response from the app. This suggests a delay either within the Gorouter or in network latency between the Gorouter and the Diego cells hosting the app.
## <a id="gorouter"></a>Time the Gorouter Processing
To determine whether a Gorouter delay comes from the Gorouter itself or from network latency between the Gorouter and the app, log in to the Gorouter VM and compare the response times of calling the app in two different ways:
* Call the app through the router proxy to process requests through the Gorouter
* Directly call a specific instance of the app, bypassing Gorouter processing
To make this comparison, do the following:
1. Log in to the `router` using `bosh ssh`. See the [BOSH SSH documentation](https://bosh.io/docs/cli-v2.html#ssh-mgmt) for more information.
1. Use `time` and `curl` to measure the response time of an app request that originates from the Gorouter VM and is processed by the Gorouter, without transiting the client network or the load balancer:
<pre class="terminal">
$ time curl -H "Host: app1.app_domain.com" <span>http</span>://IP-GOROUTER-VM:80
</pre>
1. Obtain the IP address and port of a specific app instance by running the following command and recording the associated `host` and `port` values:
<pre class="terminal">
$ cf curl /v2/apps/$(cf app app1 --guid)/stats
{
   "0": {
      "state": "RUNNING",
      "stats": {
         [...]
         "host": "10.10.148.39",
         "port": 60052,
         [...]
      },
      [...]
   }
}
</pre>
1. Use the IP address and port values to measure response time calling the app instance directly, bypassing Gorouter processing:
<pre class="terminal">
$ time curl <span>http</span>://APP-HOST-IP:APP-PORT
</pre>
If the Gorouter and direct response times are similar, network latency between the Gorouter and the Diego cell is the likely cause. If the Gorouter time is much longer than the direct-to-instance time, the Gorouter itself is slow to process requests. The next section lists potential causes of Gorouter latency.
## <a id="gorouter-latency"></a>Potential Causes for Gorouter Latency
* Gorouters are under heavy load from incoming client requests.
* Apps are taking a long time to process requests. This increases the number of concurrent threads held open by the Gorouter, reducing capacity to handle requests for other apps.
## <a id="ops-recommendations"></a>Operations Recommendations
- Monitor CPU load for Gorouters. At high CPU (70%+), latency increases. If the Gorouter CPU reaches this threshold, consider adding another Gorouter instance.
- Monitor latency of all routers using metrics from the Firehose. Do not monitor the average latency across all routers. Instead, monitor them individually on the same graph.
- Consider using [Pingdom](http://www.pingdom.com) against an app on your Cloud Foundry deployment to monitor latency and uptime.
- Consider enabling access logs on your load balancer. See your load balancer documentation for instructions. Just as Gorouter access log messages were used above to measure Gorouter latency, you can use load balancer logs to identify latency between the load balancer and the Gorouter. You can also compare load balancer response times with client response times to identify latency between the client and the load balancer.
- [Deploy a nozzle to the Loggregator Firehose](../loggregator/nozzle-tutorial.html) to track metrics for the Gorouter. Available metrics include:
- CPU utilization
- Latency
- Requests per second