-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docker container healthcheck for CNCLI usage #1841
Comments
I think the health check in place currently is for the image CC: @TrevorBenson |
Sounds ideal. Would something like this make sense? ... A component-mapped docker image, allowing users to enable/disable individual components on a per container basis (eg: dbsync, ogmios, cncli, etc), each with a working healthcheck. This allows for modular containers that follow the single responsibility principle, running one process per container. When not in a container, these services are enabled with deploy-as-systemd.sh. This change would allow easier deployment of the services in a containerized environment. Perhaps the container's service could be chosen by simply setting an env variable like so ... Run the container as the default service - Leave commented (or omit from compose file) for default: Choose to override, eg. using cncli sync instead: [[ -z ${SERVICE} ]] && SERVICE="cnode.sh" An example docker compose:
|
Separate healthcheck scripts could be created, but would potentially use the same logic for all non cnode.sh scripts anyway. This would result in the operator needing to change not only the entrypoint process variable, but redefine the healthcheck, also requiring associated documentation explaining how to properly make all changes for either docker or docker-compose environments. I think it makes the most sense to just enhance the healthcheck to align with the entrypoint process and reduce any additional steps the operator would need to make except setting the entrypoint process, but if that is not preferred I am open to other suggestions. Based on the above thoughts I have already started a branch to address enhancing the healthcheck to be more flexible. It leverages the ENTRYPOINT_PROCESS variable used to change the executed process in the monolithic container image and adds additional logic making the health check dynamic.
The current HEALTHCHECK definition in the Dockerfile uses an interval of 5 minutes and a timeout of 100 seconds, retries is undefined which should inherit the default of 3 retries. This should result in a 15-18 minute window of unhealthy behavior before a container would be marked unhealthy (due to CPU remaining above 80% or somehow binary isn't running but container did not crash due to wrapper script pid 1). These values can be adjusted for sidecars by the operator at the time of container creation. |
I do like the idea of more modular containers, each containing the parts they require instead of a single monolithic container with everything. The guild-deploy.sh is mostly built around adding things into a single host environment with the dependencies mostly mixed together instead of a separate list of what is required only to add db-sync, ogmios, cncli, etc.. So currently the container images would not be significantly smaller except for the binaries themselves removed from the image. While this would reduce the image size slightly, it would not be very significant compared to upstream container images which are built with only the required dependencies and binary. The build process also does not leverage a cache of image layers. This would result in the layers of the each image, which are essentially identical between them, taking up separate space on the container host as well as network bandwidth to download. This is even when there are no differences between images or multiple release of the same image version. All of this could be changed, but to get the most benefit from the work:
Not all of the above is required to split the images up. However, the drawbacks quickly build up and increase the maintenance required of the operator to not overfill their container storage partitions. If the container storage is not separated from the container volume storage, in separate partitions, then each image reduces available space for volumes (cardano-node DB, db-sync, etc.) Over time this can lead to exhausting the partitions space, potentially leading to corrupted cardano node DB and db-sync DB if/when the amount of available memory can no longer contain operations that cannot be commited to disk. Also if the node/dbsync are stopped prior to providing additional storage to allow the memory to commit to disk. For these reasons I would be against splitting up the containers unless implemenbting the above changes, including leveraging a cache in the build process so that overlayfs layers are not constantly being rebuilt. This would also reduce the amount of layers required for pulls of new versions of each image. EDIT: However, while this provides more elegant container images the work of splitting container images and implementing the above changes requires many more hours of work, testing and review than just enhancing the healthcheck to work with the current design of the monolithic container image. |
@TrevorBenson - On contrary, I'd argue their healthchecks are actually different (eg: dbsync being on tip would be checked using it's prom metrics for queue length, cncli sync status would be checking actual sync status => would likely be using sqlite against block table) rather than simply checking for presence of those processes being up. They'd have their own variables and functions to check anyways, and wouldnt be reusing existing healthcheck script. If we've to make existing healthcheck script re-usable, then we'd need to add a rather complete health checking mechanism for those components - as regards different naming, that can easily be managed using service name added as a prefix |
We can definitely extend it to have a check unique to each process it supports. At the moment the extended healthcheck in the PR draft ensures wrapper scripts are not obfuscating a failed binary, or extended periods of time with excessive CPU load, which might impact the node process. Checking for open ports, expected responses from socket connections, or various other methods are common to healthchecks. For checking tip on cncli or dbsync, I would suggest not using koios, but instead relying on the node query tip. This way if the node isn't syncing (for whatever reason) then only the node is marked as unhealthy. This prevents a reduction in observability by not marking cncli or dbsync as unhealthy if they are in sync with the node, even if the node is behind tip. |
I have no idea where koios [API] comes into picture, wasnt referenced anywhere - I said it should be checked on individual component instead of node, I mentioned dbsync's own prometheus metrics (that runs locally on port 8080 by default), for cncli => I mention checking result of cncli sync status (to be >= 99%), both of these maintain their own status, regardless of node. You already have node's status reported in node, and there will likely be dependencies set in compose for when node fails. Thus, for other components, this should be querying what makes sense for that component, not node (and for above mentioned example, checking process is kinda not relevant, as absence of process will likely end up crashing container - while it's presence wouldnt mark it healthy). |
The current healthcheck scripts default method for checking the tip https://github.com/cardano-community/guild-operators/blob/3c13f1cc83a211148bb257c182984a2b5903a5fc/files/docker/node/addons/healthcheck.sh#L13C1-L43C3 |
Hello,
I wonder if it makes sense to request an improvement to the builtin healthcheck of the Docker container? ...
When enabling CNCLI services by configuring it to run as a docker sidecar container, the sidecar container's built-in healthcheck fails.
Would it be possible to update healthcheck.sh to detect when the container has been started with 'ENTRYPOINT_PROCESS: cncli.sh', and 'command: sync, validate, etc...' and adjust the heathcheck that is performed?
Below is an example docker compose file to illustrate a sidecar config...
(The CNCLI sidecar container's entrypoint is overridden by setting env variable:
ENTRYPOINT_PROCESS: cncli.sh
andcommand: sync
)Thankyou!
-Adam
The text was updated successfully, but these errors were encountered: