Prevent downtime #177
Comments
It looks like CloudFlare doesn't have alerts. They invite you to use Pingdom instead, which is probably great for Pingdom's business but sucks for us and our users. I really wish we had our own infrastructure. |
The server rebooted again during the night. Fortunately this time I was running a monitoring script that restarted the app automatically. The site was down for about 15 minutes according to the logs, which is still an unacceptably long time. |
We've talked about this in IRC. I've sent an email to http://bearstech.com/ asking if they can provide us with an alternative to hosting on OpenShift. |
CloudFlare had some issues in Europe today. At least one of our users was affected: he got a timeout error page and reported the problem on Twitter. |
Hello I'm the whiner from Twitter :-) |
Why are you using OpenShift? It seems pretty expensive. |
@matunixe
Works perfectly in the EU; it has only had issues today. |
What would we replace it with? There are reasons why we use CloudFlare, and for the most part it does the job well. As I've said before I wish we had our own infrastructure, but that costs time and money, and we're short on both.
We've already white-listed Tor. As for today's connectivity issues, they could have happened to any Internet service provider.
We're using OpenShift Online, which is free. They're in the process of relaunching it with a container-based infrastructure, so their website may be confusing right now. |
For monitoring a website, https://www.nixstats.com/ is a nice service. |
Here's how we can work around CloudFlare's lack of alerts: instead of pinging the site every few minutes with an HTTP request, let's make the app ping a monitoring server continuously with small TCP packets. That way we can detect downtime almost instantly, without overloading the app and spamming its logs with lots of ping HTTP requests. |
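A minimal sketch of that heartbeat idea, under stated assumptions: the names `HeartbeatMonitor` and `send_heartbeats`, the one-byte payload, and the timeout values are all hypothetical, and the monitor only handles a single connection. It is meant to illustrate the shape of the approach, not how Liberapay ended up doing it.

```python
import socket
import threading
import time


class HeartbeatMonitor:
    """Monitoring side: accept one TCP connection and note when data last arrived."""

    def __init__(self, host="127.0.0.1", port=0, timeout=1.0):
        self.timeout = timeout          # seconds of silence before we call it "down"
        self.last_seen = None           # monotonic timestamp of the last heartbeat
        self._server = socket.create_server((host, port))
        self.address = self._server.getsockname()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        conn, _ = self._server.accept()
        with conn:
            # recv() returns b"" once the peer closes the connection.
            while conn.recv(16):
                self.last_seen = time.monotonic()

    def is_up(self):
        if self.last_seen is None:
            return False
        return time.monotonic() - self.last_seen < self.timeout


def send_heartbeats(address, interval=0.1, count=None):
    """App side: keep one connection open and send one byte per interval."""
    with socket.create_connection(address) as sock:
        sent = 0
        while count is None or sent < count:
            sock.sendall(b".")
            sent += 1
            time.sleep(interval)
```

In the app, `send_heartbeats` would run in a background thread with `count=None`. The monitoring server would poll `is_up()` and send an alert when it flips to `False`, so the detection delay is roughly the chosen `timeout`.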
We have another problem which is the postgresql server dying for no apparent reason. It's happened more than a few times now. I need to figure out why, and make the python code restart postgres automatically in production (instead of quitting without even sending an alert). |
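One possible shape for "alert and recover instead of quitting", sketched with hypothetical names: `run_with_recovery`, `restart` and `alert` are placeholders, and `ConnectionError` stands in for whatever exception the database driver actually raises when the server is gone.

```python
import time


def run_with_recovery(operation, restart, alert, retries=3, delay=1.0):
    """Call operation(); on a connection failure, alert, run restart() and retry.

    The last failure is re-raised so that the caller can still show an
    error page instead of hanging forever.
    """
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except ConnectionError as exc:
            alert(f"database unavailable (attempt {attempt}/{retries}): {exc}")
            if attempt == retries:
                raise
            restart()          # e.g. ask the service manager to start postgres again
            time.sleep(delay)  # give the server a moment to come back up
```

The point of the design is that every failure produces an alert, and the process only gives up after a bounded number of restart attempts.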
Upgrading PostgreSQL may be part of the solution: liberapay/openshift-cartridge-postgres#4. We're not set up for zero-downtime upgrades though (liberapay/openshift-cartridge-postgres#5). |
I've just upgraded postgresql from 9.4.5 to 9.4.10 in production. We had a few minutes of downtime, but on the bright side #480 worked perfectly: visitors saw a proper 503 page, I received a Sentry alert, and the app recovered automatically once the new version of postgres was installed. |
There's one option we haven't considered yet: moving "down" to AWS. It's what we're currently running on through OpenShift, but getting rid of the abstraction layer would open up many possibilities. https://aws.amazon.com/elasticbeanstalk/ and https://aws.amazon.com/rds/ seem to make it pretty easy to run a Python web app with a PostgreSQL database. If I'm reading AWS' pricing correctly, having one micro EC2 instance and one micro PostgreSQL DB costs around 3 USD per week. We could pay that. Moreover, it looks like the first 12 months are free. |
And bandwidth. I don't know the bandwidth cost on OpenShift for now, but AWS will cost some (or a lot of) dollars even when hiding behind CloudFlare. |
Thanks @revi, I hadn't seen the data transfer costs. They seem to be pretty low though.
Our CloudFlare stats: "Uncached Bandwidth | Last Month | 1 GB". In other words even if our outbound traffic increased tenfold we'd only have to pay an extra $0.81 per month (less than $0.20 per week). Moreover, the free tier for the first 12 months includes "15 GB of bandwidth out aggregated across all AWS services". OpenShift doesn't cost us anything, but the version we're running on is deprecated and flawed. If the next version of OpenShift doesn't become available for production use soon and with attractive pricing (or sponsorship), then AWS seems like the best alternative. |
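The arithmetic behind the $0.81 figure can be checked with a short sketch. The rate of $0.09 per GB and the 1 GB free monthly allowance are assumptions inferred from that figure, not numbers quoted in this thread, and `monthly_transfer_cost` is a hypothetical helper.

```python
def monthly_transfer_cost(gb_out, free_gb=1.0, usd_per_gb=0.09):
    """Outbound data transfer cost in USD for one month.

    free_gb and usd_per_gb are assumed values, chosen because they
    reproduce the $0.81 estimate for 10 GB of outbound traffic.
    """
    billable_gb = max(gb_out - free_gb, 0.0)
    return billable_gb * usd_per_gb


# Tenfold increase over the 1 GB of uncached bandwidth reported by CloudFlare.
monthly = monthly_transfer_cost(10)
weekly = monthly * 12 / 52
print(f"{monthly:.2f} USD per month, {weekly:.2f} USD per week")
```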
CloudFlare isn't our only outbound traffic though: there are also emails (Mailgun SMTP), Sentry events, API requests to MangoPay and other platforms, etc. |
That's good to hear, though the AWS free tier is full of tricks if you want to take full advantage of it. As for the other traffic, I think it's minor compared to the CF traffic. (And I think we can use AWS SES: their free tier allows 62,000 mails from their infra, making it cheaper than Mailgun.) |
Note that the 62000 "free" outbound emails per month don't include the data transfer cost, but there's no way around that, so it's indeed better than the 10000 free emails per month we currently have on Mailgun. |
We've just suffered significant downtime again due to http://status.openshift.com/incidents/n6lkx92g8wyk. |
Working on #505 has led me to take another look at the possibilities to improve our server stack, and thus our uptime. Here is what I came up with:
They represent different trade-offs between monetary costs, engineering costs, reliability, etc. I haven't been able to determine a clear winner so far. Liberapay can run on cheap servers, but we need high security. We want as little downtime as possible, but we have a small budget. It's a difficult situation. I'm wary of AWS because of its complex pricing, and because of "GCE vs AWS in 2016: Why you shouldn’t use Amazon". I dislike Google Cloud because I dislike Google. I'm also not convinced that it would actually be better for us in terms of costs compared to AWS. Clever Cloud is theoretically nice, but their cheap PostgreSQL plans are stuck at version 9.2.8, which is unacceptable (we need at least 9.4, and we really want 9.6). I dread the OVH option because setting it up and maintaining it would cost a lot of time, and we would be responsible for the security of the entire stack (from the kernel all the way up). |
Hello @Changaco, I can't really help you with the money side, but I can do something on the sysadmin part, especially if you decide to go with a solution like OVH/Scaleway/DO/etc. 👋 |
Hey @matunixe, it'd be great to have some help, but what are you proposing exactly? |
Hi @Changaco,
I can give you my time, security knowledge and experience in maintaining servers. I'm definitely not an AWS pro but for sure I can be useful with a more classical option. |
I've been taking a deeper look at GCloud and AWS. GCloud does appear to be cheaper than AWS, but it really isn't much better than a bunch of VMs: they don't support PostgreSQL as a service, and the flexible Python environments we'd need to run Liberapay aren't available in Europe (according to this doc). AWS is a lot better, but if we went with an optimal setup (EU region, zero downtime) it would cost us more than $6 per week (not counting taxes). Theoretically we could achieve the same result for less than half of that if we did everything ourselves, but the monetary savings would be more than offset by the increased engineering costs. So, it looks like our best bet is to go with AWS but stay within the free tier (one webapp runner, one non-redundant database). That won't guarantee zero downtime, but system updates should no longer be a problem like they are on OpenShift (AWS Elastic Beanstalk supports Managed Updates for Single-Instance environments). Once the time-limited free tier is over it should cost us less than 5€ per week (incl. taxes). |
Let's try AWS. If it ends up costing us too much or causing us other problems then we'll reconsider. @matunixe Thanks for offering to help. :-) I may ping you someday. |
The migration to AWS is done (see #553 for details), I can finally close this issue! \o/ |
The site was out for about an hour this morning, and it seems to be OpenShift's fault. 😠 😞
This message appeared in the `rhc tail` I had running:
I see no error message indicating why the app wasn't restarted after the reboot. `rhc app restart` brought the site back.