Postgres backup verification error - The server failed to start #83
In the run above, the verification is failing because the local server instance (used for validation) did not start within the configured startup timeout. There are other factors that can cause the server to start slowly, such as CPU and memory resource allocation.
If your database does not use a "public" schema, you can specify the name of your database's schema via the TABLE_SCHEMA parameter.
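If it helps, both of these can be supplied as environment variables on the backup container's deployment. A minimal sketch using `oc` — the variable names `DATABASE_SERVER_TIMEOUT` and `TABLE_SCHEMA` are as documented in the backup-container README, but verify them against the version you're running; namespace and schema values are placeholders:

```bash
# Raise the verification server's startup timeout and point the
# verification at a non-default schema. Placeholder names throughout.
oc -n my-namespace set env dc/backup \
  DATABASE_SERVER_TIMEOUT=300 \
  TABLE_SCHEMA=my_schema
```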
Thanks Wade. Those four parameter values are set to the defaults and the table_schema parameter is set to the appropriate schema. I added the timeout parameter set to 300 seconds, and the verification now times out in 5 minutes. I have updated the Dockerfile to pull Postgres v13, as our app database is v13. Hopefully that isn't the cause of the issue.
That could be the issue. We haven't tested the backup-container with that version. It could be that some startup parameters have changed. What's the version of the database you're backing up?
Full version:
I will redo the backup build using v12 to see if it is compatible.
It's typically preferable to have the backup-container use the same or newer version as the database you are backing up. Would you be up for troubleshooting the startup issues and contributing the upgrade back to the project? I can provide you with some guidance on how to go about troubleshooting.
Certainly. Keep in mind that this is my first and only experience with OpenShift and I am still learning how everything works/interacts.
Open a terminal on a running instance of your backup-container.
For the purpose of troubleshooting, it does not really matter what the values are for the database name and credentials. You're basically mimicking what's done by this function, https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L104-L123, without suppressing the log output; see the sketch below.
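Something along these lines, as a rough sketch (it assumes the sclorg-style `run-postgresql` startup script from the base image; the values are throwaways for the scratch verification instance):

```bash
# Provide the variables the startup script expects; for troubleshooting,
# throwaway values are fine since this is a scratch instance.
export POSTGRESQL_DATABASE=verify_db
export POSTGRESQL_USER=verify_user
export POSTGRESQL_PASSWORD=verify_password

# Run the same startup the plugin performs, but WITHOUT the
# `> /dev/null 2>&1` redirection, so startup errors print to the terminal.
run-postgresql
```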
I'm interested in this as well, using the PostGIS extension with Postgres 13, and just starting to look at using the backup container.
Below is the output:
Once the upgrade is complete, the PostGIS extension would need to be installed on the backup-container image as well.
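A quick way to check whether the verification image has the extension available, as a sketch (connection values are placeholders):

```bash
# From a terminal in the backup container, with the scratch server running,
# ask the server which extensions its image actually ships.
psql -h 127.0.0.1 -U "${POSTGRESQL_USER}" -d "${POSTGRESQL_DATABASE}" \
  -c "SELECT name, default_version FROM pg_available_extensions WHERE name = 'postgis';"
```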
@MScallion, other than the changes above, what's the output from your configuration?
Should the Environment Name be the namespace? |
Yes, but that shouldn't affect the verification.
I accepted all of the default values for the build. Below is our param file for the deployment. Side note: the webhook did not produce notifications in Teams as expected.
When I updated the app database to v13 I did not contribute the changes because I split the database user into a DB owner and an application proxy user. I understood that this could not be used because it was inconsistent with previous templates.
@MScallion, you are missing environment variables for DATABASE_USER and DATABASE_PASSWORD.
That would cause these lookups to return blank strings and cause the startup to fail: https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L119-L120
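A minimal illustration of the failure mode (the exact lookup lives at the link above; this just shows the effect of an unset variable):

```bash
# The plugin derives a variable name and dereferences it. If that
# variable was never set on the container, the result is an empty
# string, and the server is started with blank credentials.
_userVar="DATABASE_USER"
_user="${!_userVar}"     # bash indirect expansion; "" when unset
echo "user='${_user}'"   # prints user='' when DATABASE_USER is missing
```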
Are you saying you've modified the code, or just configured the container in a very specific way?
Thanks Wade, I don't see those in the backup-container repository, which is the repo that I cloned. I added the four parameters to the deployment template.
Should I add the two lines as-is? I.e., are these values to be set the same for all projects, or should I update them to reflect project-specific components?
Those values are a customization that needs to be done for each application of the container. The examples were to show:
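For example, one way to wire these onto the backup container's deployment, sketched with standard `oc` commands (the secret name is a hypothetical placeholder; its keys would need to be named DATABASE_USER and DATABASE_PASSWORD):

```bash
# Project the application's database credentials into the backup
# container's environment from an existing secret.
oc -n my-namespace set env dc/backup \
  --from=secret/my-db-credentials
```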
I replaced the APP_USER/APP_PASSWORD secret values with app-db-owner-username/app-db-owner-password and app-proxy-username/app-proxy-password in order to configure a super user, a DB owner, and an application user, in an attempt to comply with the principle of least privilege. I then administered the necessary role and grants for the app user. I also updated the Dockerfile to reference:
and to include grants that helped to resolve an error early on, possibly a result of the changes to the users;
So for our application database we have used the following parameters in the deployment template:
For which I have provided the following values:
Those are the default ones, which are used for the legacy mode of operation only. When using the newer backup.conf based configuration, the database credentials are read from environment variables prefixed with the service name.
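For reference, a sketch of what a backup.conf entry can look like (service, port, database, and schedule are placeholders to adapt; see the README for the authoritative format):

```bash
# backup.conf -- one entry per database to back up.
# The service name here ("patroni-master") is what determines the
# credential variables the container looks up: PATRONI_MASTER_USER
# and PATRONI_MASTER_PASSWORD.
postgres=patroni-master:5432/my_database

# Optional cron-style schedule lines are also supported.
0 1 * * * default ./backup.sh -s
```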
Understood, I will work to add those and then post my results. I had assumed that because the backups execute successfully, the credentials might not be the root cause of the issue. The backups successfully complete in ~2 seconds, condensing the 66 MB of DB storage into a 13 MB backup file. I just updated the secret username and password pair to use the postgres user and the issue persisted.
The readme states that the parameter names are prefixed with the service name. Does it have to be exact? If so, then I will likely need to alter our service names, as they currently include the environment, i.e. patroni-master-dev. I am hoping that I can use PATRONI_MASTER_USER and PATRONI_MASTER_PASSWORD for all environments, without having to modify the existing service names.
The verification again timed out in 5m after I added those parameters to the deployment template and redeployed:
@WadeBarnes, I have cloned/updated the orgbook-containers repo and I am able to run both the backup and validation. I am now receiving a validation error stating that the application role, to which I grant all object privileges, does not exist. The grants are required because I have split the user into a DB owner and an application proxy user. I am investigating alternate dumpfile options, i.e. pg_dumpall and/or the -x option, to find the best fit for my current configuration and restore capabilities.
No, it has to be exact. It's best practice to keep environment designations (i.e. dev, test, prod) out of service names; the namespace already provides that context.
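To make the mapping concrete, a sketch of the transformation involved (illustrative; the plugin source linked earlier has the exact logic):

```bash
# Credential variable names are derived from the service name:
# uppercase it, replace dashes with underscores, append _USER/_PASSWORD.
service="patroni-master-dev"
prefix=$(echo "${service}" | tr '[:lower:]' '[:upper:]' | tr '-' '_')
echo "${prefix}_USER"       # PATRONI_MASTER_DEV_USER
echo "${prefix}_PASSWORD"   # PATRONI_MASTER_DEV_PASSWORD
```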
Would you be able to share the error log?
Thanks for the tip Wade. One limitation that I see is that we have test and UAT in the same namespace, so we will need to review our configuration and naming standards. I have confirmed that the role does not exist in the generated SQL. The documentation states that pg_dumpall can include the roles, and I will work with that to determine the best settings for our requirements.
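As a starting point, a hedged sketch with standard PostgreSQL tooling (host, user, and database names are placeholders):

```bash
# Roles and their memberships are cluster-level objects, so a plain
# pg_dump omits them. pg_dumpall --globals-only (-g) captures them,
# and the resulting SQL can be restored before the database dump.
pg_dumpall -h patroni-master -U postgres --globals-only > globals.sql
pg_dump    -h patroni-master -U postgres -Fc my_database > my_database.dump
```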
FYI, I tried running backup verification with my postgis 13-3.1 image within an OpenShift CronJob, and got the following error:
waiting for server to start... pg_ctl: directory "/var/lib/pgsql/data/userdata" does not exist
Checking the image, /var/lib/postgresql/data exists instead. The backups seem to run properly (from a CronJob); I just seem to have issues running backup verification.
@basilv, do you have a volume mounted to /var/lib/pgsql/data for the verification instance?
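A quick check from a terminal inside the container, as a sketch (the path comes from the pg_ctl error above):

```bash
# Confirm a writable volume is mounted where the verification server
# expects to create its data directory.
df -h /var/lib/pgsql/data
ls -ld /var/lib/pgsql/data
touch /var/lib/pgsql/data/.write-test && rm /var/lib/pgsql/data/.write-test
```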
Oops, yup, it would help to mount the verification PVC.
Here's the latest result after adding the verification PVC and setting the database startup timeout to 300 seconds:
pg_ctl: directory "/var/lib/pgsql/data/userdata" does not exist
[!!ERROR!!] - Backup verification failed: /backups/daily/2021-08-04/fom-db-dev-86-fom_2021-08-04_14-46-14.sql.gz
Note that I'm not using separate roles/accounts like MScallion, so that shouldn't be the issue. Presumably I'm either missing some required setting, or there's a Postgres 13 compatibility issue. I do have the output from /backup.sh -c, provided below:
Listing configuration settings ...
Settings:
@MScallion, have you been able to resolve the issues you were having?
No, as per #85 I figured out how to run the backup verification from my cron job, but it still fails as described above. I've manually tested the database restore process and confirmed the backups are working properly, so it's just something in the verification script. Not sure if it is a Postgres 13 issue or a PostGIS issue (e.g. maybe the verification script is skipping initialization needed for the PostGIS extension).
It's failing to start postgres during the verification, which is typically due to the database credentials not being wired up correctly. You could troubleshoot by running the startup manually from a terminal on the container, as described above.
It would be nice to have an option to show the logs without having to build a completely different container to remove the >/dev/null 2>&1 bit.
Looking at the onStartServer function defined in backup.postgres.plugin, it seems to execute 'run-postgresql', but I don't see that defined anywhere (I would have expected pg_ctl to be used, like it is in onStopServer). I tried running pg_ctl start/init; here are the results:
Also, I've had a hard time telling from the documentation how to supply the credentials. I'm using the OpenShift CronJob approach. For the cronjob container, I'm supplying the env variables DATABASE_USER and DATABASE_PASSWORD as well as POSTGRESQL_USER and POSTGRESQL_PASSWORD (all values pulled from a secret). I'm pretty sure I just need DATABASE_USER and DATABASE_PASSWORD though.
(FYI, I've updated my database to use patroni (with Postgres 13 and PostGIS). This hasn't seemed to impact the backup procedure at all; it still works fine, so at least that's positive.)
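For what it's worth, run-postgresql doesn't appear to be defined in the backup-container scripts themselves; it seems to come from the base PostgreSQL image (the Red Hat/sclorg images ship a startup script by that name). A quick way to confirm from inside the container, as a sketch:

```bash
# Locate the startup script and peek at it to see how the server
# is actually launched (and what it does before calling postgres).
command -v run-postgresql
head -n 20 "$(command -v run-postgresql)"
```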
No. I would need to update the backup.postgres.plugin file in the BCDevOps/backup-container repo to use pg_dumpall, which I have verified locally, in place of pg_dump -Fc. I am also not able to pull from my fork of that repo due to security settings. In the short term, I was able to back up and successfully restore the database manually by 1) executing my deployment script to create the database, users, and roles, then 2) restoring the SQL from the backup. Fortunately I had separated the create-database step from the create-DB-objects step in my deployment scripts. I am not able to successfully validate the backup files using the automation provided in the backup-container scripts, due to the grants to the roles, which I have added to the source templates and which are not included in the pg_dump output.
Thanks @joeleonjr, I checked all my prior comments but see no API keys. (I didn't check old edits of comments; no idea how to do that.)
Thanks! @MScallion is on a separate team, so they'll need to respond.
First time executing the backup was this morning and I cannot get the verification to work. We do not have a public schema in our database; hopefully it is not required, as it was not for the backup. I have attempted the verification several times. Any help with steps I can take to investigate the cause of this issue is appreciated.
Namespace = 30245e-dev.