
Primary cannot stay connected to worker behind a load balancer #732

Closed
campma opened this issue Mar 15, 2024 · 5 comments

Comments

@campma

campma commented Mar 15, 2024

Summary

Hi, I am evaluating Cronicle for use at my organization. In initial tests using a few throwaway VMs, it worked great and has just about all the features I was looking for! The plan is to put a primary node in our primary VPC and then run worker nodes inside our various environments (which are separate VPCs and aren't directly accessible from the primary node). However, when rolling it out to our development environment, I've run into a problem where I cannot keep the worker connected to the primary node.

Steps to reproduce the problem

I tried multiple ways of accessing the worker from the primary, including an ssh tunnel through a middle-man bastion server, before settling on a locked-down port on the load balancer into that environment. In this configuration, adding the worker via the Add Server button on the Servers page lets it connect, but it goes grey almost immediately. It appears that even though I enter the load balancer address in the worker hostname box, the primary grabs the hostname from the worker server's environment and then tries to use that to connect, instead of the FQDN I gave it initially. Additionally, the worker's logs suggest that it thinks the load balancer IP is the primary, and it is trying to connect back to it instead of reaching out to the actual primary.

Your Setup

  • Single Primary hosted in primary VPC, let's say this VPC has network 192.168.0.0/24.
    • In order to allow the worker to connect back to it, the primary sits behind a public-facing load balancer with the ports limited to our public IPs.
    • The primary's config.json has its base URL set to its load balancer, and the worker's config.json has its base URL set to its respective load balancer (see the config sketch after this list).
  • Current test worker is hosted in a development VPC that mimics production VPC, let's say that this VPC has networks 192.168.5.0/24, 192.168.6.0/24, 192.168.7.0/24, & 192.168.8.0/24.
    • The development VPC is pretty isolated; the only traffic in or out goes through either an ssh bastion server or a load balancer.
    • The jobs that are being run need to be run on the app servers behind the load balancer
    • The AWS Application Load Balancer has port 3012 open and points to a target group in which this worker is the only member, and it shows as healthy.
  • The expectation is that assuming I get this working, I will be putting workers in other VPCs as well, which have similar isolation
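
For clarity, here is roughly what the base URL configuration looks like on each node. This is only a sketch, using Cronicle's base_app_url config property: the worker's value is the scrubbed load balancer DNS name from the logs below, and the primary's load balancer hostname is a placeholder.

Primary config.json (placeholder load balancer hostname):

"base_app_url": "http://xxprimaryvpc-lbxx.us-east-1.elb.amazonaws.com:3012"

Worker config.json (scrubbed load balancer DNS name):

"base_app_url": "http://xxdevvpc-lbxx.us-east-1.elb.amazonaws.com:3012"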

Operating system and version?

Both primary and worker are running Ubuntu 22.04

Node.js version?

v21.7.0 & v20.11.1

Cronicle software version?

Version 0.9.44

Are you using a multi-server setup, or just a single server?

Single Primary, single worker (but want to expand to multiple workers)

Are you using the filesystem as back-end storage, or S3/Couchbase?

local filesystem

Can you reproduce the crash consistently?

It's not crashing per se, but the issue is consistent.

Log Excerpts

Primary

Cronicle.log:

[1710534773.097][2024-03-15 20:32:53][ip-192-168-0-24][80242][Cronicle][debug][9][Sending API request to remote server: http://xxdevvpc-lbxx.us-east-1.elb.amazonaws.com:3012/api/app/check_add_server][]
[1710534773.151][2024-03-15 20:32:53][ip-192-168-0-24][80242][Cronicle][debug][4][Adding remote slave server to cluster: ip-192-168-8-73][{"hostname":"ip-192-168-8-73","ip":"192.168.8.73"}]
[1710534773.151][2024-03-15 20:32:53][ip-192-168-0-24][80242][Cronicle][debug][5][Adding slave to cluster: ip-192-168-8-73 (192.168.8.73)][]
[1710534773.152][2024-03-15 20:32:53][ip-192-168-0-24][80242][Cronicle][debug][8][Connecting to slave via socket.io: http://ip-192-168-8-73:3012][]
[1710534778.163][2024-03-15 20:32:58][ip-192-168-0-24][80242][Cronicle][debug][5][Marking slave as disabled: ip-192-168-8-73][]
[1710534779.164][2024-03-15 20:32:59][ip-192-168-0-24][80242][Cronicle][debug][6][Reconnecting to slave: ip-192-168-8-73][]
[1710534779.164][2024-03-15 20:32:59][ip-192-168-0-24][80242][Cronicle][debug][8][Connecting to slave via socket.io: http://ip-192-168-8-73:3012][]

Worker

Cronicle.log:

[1710537884.052][2024-03-15 16:24:44][ip-192-168-8-73][3143204][Cronicle][debug][2][Spawning background daemon process (PID 3143204 will exit)][["/root/.nvm/versions/node/v20.11.1/bin/node","/opt/cronicle/lib/main.js"]]
[1710537884.315][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][2][Cronicle v0.9.44 Starting Up][{"pid":3143212,"ppid":1,"node":"v20.11.1","arch":"x64","platform":"linux","argv":["/root/.nvm/versions/node/v20.11.1/bin/node","/opt/cronicle/lib/main.js"],"execArgv":[]}]
[1710537884.317][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][9][Writing PID File: logs/cronicled.pid: 3143212][]
[1710537884.318][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][9][Confirmed PID File contents: logs/cronicled.pid: 3143212][]
[1710537884.318][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][2][Server IP: 192.168.8.73, Daemon PID: 3143212][]
[1710537884.319][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][3][Starting component: Storage][]
[1710537884.322][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][3][Starting component: WebServer][]
[1710537884.328][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][3][Starting component: API][]
[1710537884.328][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][3][Starting component: User][]

[1710537884.329][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][3][Starting component: Cronicle][]
[1710537884.329][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][3][Cronicle engine starting up][]
[1710537884.331][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][4][Using broadcast IP: 192.168.8.255][]
[1710537884.331][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][4][Starting UDP server on port: 3014][]
[1710537884.346][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][4][Server not found in cluster -- waiting for a master server to contact us][]
[1710537884.346][2024-03-15 16:24:44][ip-192-168-8-73][3143212][Cronicle][debug][2][Startup complete, entering main loop][]

WebServer.log:

[1710538445.322][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][8][New incoming HTTP connection: c73][{"ip":"::ffff:192.168.6.115","port":3012,"num_conns":1}]
[1710538445.322][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][8][New HTTP request: GET / (192.168.6.115)][{"id":"r74","socket":"c73","version":"1.1"}]
[1710538445.322][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Incoming HTTP Headers:][{"host":"192.168.8.73:3012","connection":"close","user-agent":"ELB-HealthChecker/2.0","accept-encoding":"gzip, compressed"}]
[1710538445.322][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Serving static file for: /][{"file":"/opt/cronicle/htdocs"}]
[1710538445.323][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Serving directory index: /opt/cronicle/htdocs/index.html][]
[1710538445.324][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Serving pre-gzipped version of file: /opt/cronicle/htdocs/index.html.gz][]
[1710538445.324][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Sending streaming HTTP response: 200 OK][{"Content-Encoding":"gzip","Etag":"\"65384-1090-1709238073372\"","Last-Modified":"Thu, 29 Feb 2024 20:21:13 GMT","Content-Type":"text/html","Content-Length":1090,"Cache-Control":"public, max-age=3600","Access-Control-Allow-Origin":"*","Server":"Cronicle 1.0","Connection":"close"}]
[1710538445.324][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Request complete][{"id":"r74"}]
[1710538445.325][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Response finished writing to socket][{"id":"r74"}]
[1710538445.325][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Request performance metrics:][{"scale":1000,"perf":{"total":2.612,"queue":0.066,"read":0.01,"process":0.005,"write":0.801},"counters":{"bytes_in":134,"bytes_out":1368,"num_requests":1}}]
[1710538445.325][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][9][Closing socket: c73][]
[1710538445.325][2024-03-15 16:34:05][ip-192-168-8-73][3143212][WebServer][debug][8][HTTP connection has closed: c73][{"ip":"::ffff:192.168.6.115","port":3012,"total_elapsed":3,"num_requests":1,"bytes_in":134,"bytes_out":1368}]

Any help you could provide would be greatly appreciated!

@jhuckaby
Owner

Hi there! Thank you for the very detailed issue report.

Having read everything, I think this is the key part right here:

[1710534778.163][2024-03-15 20:32:58][ip-192-168-0-24][80242][Cronicle][debug][5][Marking slave as disabled: ip-192-168-8-73][]
[1710534779.164][2024-03-15 20:32:59][ip-192-168-0-24][80242][Cronicle][debug][6][Reconnecting to slave: ip-192-168-8-73][]
[1710534779.164][2024-03-15 20:32:59][ip-192-168-0-24][80242][Cronicle][debug][8][Connecting to slave via socket.io: http://ip-192-168-8-73:3012][]

It is pretty clear to me from this log excerpt that something is closing the WebSocket connections that Cronicle needs to stay alive between the primary and worker servers. This is likely going to be security related, i.e. some kind of security software, or perhaps a setting in your load balancer or proxy software.

Some load balancers / proxies do not support WebSocket connections unless you explicitly allow them. See this recent issue regarding nginx: #535

I also recall someone else reporting this recently: #725

The only solution there was to run separate Cronicle instances, with no WebSocket connections spanning a load balancer / proxy.

I'm sorry if this doesn't help, but I don't know what else to try here. Things like this almost always come down to some incompatibility between WebSockets (and sometimes specifically socket.io, which Cronicle uses on top of WebSockets) and a piece of network security, proxy software, or hardware in your environment closing the sockets prematurely.

I hope this helps. Best of luck to you.

@campma
Author

campma commented Mar 18, 2024

Yeah, I figured that was the part that was wrong here. That hostname/IP is not directly accessible from the primary node. The primary initially connects to a load balancer, which forwards the packets to the worker node behind it. The worker then responds with its internal IP, which is inaccessible from the primary, so the primary tries to connect to that instead and fails.

I guess what I was hoping to hear was that there is a way to have it actually use the fully qualified DNS name I supplied in the initial Add Server dialog. I gave it a domain of (scrubbed example) xxdevvpc-lbxx.us-east-1.elb.amazonaws.com, and it turns around and tries to use the worker's hostname (and related IP) instead. If it actually used the DNS name provided, or even the base_url in the config.json, it wouldn't have a problem. If that option doesn't exist, is it a feature that could be added relatively easily? I feel like I wouldn't be alone in wanting it.

@campma
Author

campma commented Apr 25, 2024

Hi @jhuckaby, I wanted to follow up and see if you had read my reply. I think the real problem is that the worker node replaces the initial DNS entry with either its IP or hostname: the initial connection works because the DNS correctly routes it to the worker, but then the worker "corrects" the primary with the non-routable hostname/IP, and that's when it goes grey. I'd like to see if there's a way to get a feature where the worker doesn't overwrite the primary's connection string, or at least makes that "correction" configurable.

@jhuckaby
Owner

Try overriding the worker's hostname and IP by using these two top-level config properties in the worker's config.json file:

"hostname": "corrected-hostname.mydomain.com",
"ip": "1.2.3.4"

These are undocumented properties but they will override the "auto-detection" that happens on startup, where the system tries to figure out its own hostname and IP.
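
For example, in this particular setup it might look something like the following in the worker's config.json, using the load balancer DNS name from your report (the IP here is just a placeholder for whatever address the primary can actually route to):

"hostname": "xxdevvpc-lbxx.us-east-1.elb.amazonaws.com",
"ip": "203.0.113.10"

With those set, the primary should reconnect through the load balancer address rather than the worker's auto-detected internal hostname/IP.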

@campma
Author

campma commented Apr 30, 2024

Thanks @jhuckaby! This worked for my situation! Since these properties are undocumented, it would seem worthwhile to add them to the documentation, as they would probably help a number of people who have nodes behind load balancers or NATs.

@campma campma closed this as completed Apr 30, 2024