Prevent downtime #177

Changaco · 2016-02-22T08:37:57Z

The site was down for about an hour this morning, and it seems to be OpenShift's fault. 😠 😞

This message appeared in the rhc tail I had running:

The system is going down for reboot in 1 minute!
This node will restart in 1 minutes. Please save your work. [Rebooting nodes via Ansible]

I see no error message indicating why the app wasn't restarted after the reboot.

rhc app restart brought the site back.


Changaco · 2016-02-23T10:46:39Z

It looks like CloudFlare doesn't have alerts; they invite you to use Pingdom instead, which is probably great for Pingdom's business but sucks for us and our users. I really wish we had our own infrastructure.

Changaco · 2016-02-29T08:55:10Z

The server rebooted again during the night. Fortunately this time I was running a monitoring script that restarted the app automatically. The site was down for about 15 minutes according to the logs, which is still an unacceptably long time.
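
A minimal sketch of what such a monitoring script could look like, not the script that was actually running. It assumes a plain HTTP check against the site's URL and uses `rhc app restart` as quoted earlier in this thread (the real invocation may need the app name). The function names, the 30-second interval and the two-failure threshold are illustrative assumptions.

```python
import http.client
import subprocess
import time
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers without an HTTP error within the timeout."""
    try:
        # urlopen raises HTTPError (a subclass of OSError) for 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        return False


def restart_app():
    """Restart the app with the command quoted in this thread (assumed sufficient)."""
    return subprocess.call(["rhc", "app", "restart"])


def watch(check, restart, interval=30, failures_needed=2,
          sleep=time.sleep, max_checks=None):
    """Call restart() after `failures_needed` consecutive failed checks.

    Runs forever unless max_checks is given. Returns the number of restarts.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= failures_needed:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
    return restarts
```

It could be started with something like `watch(lambda: site_is_up(SITE_URL), restart_app)`, where `SITE_URL` is a placeholder. With these settings, detection alone takes up to a minute, so a shorter interval would be needed to get well below the 15 minutes observed here.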

Changaco · 2016-02-29T13:38:26Z

We've talked about this in IRC. I've sent an email to http://bearstech.com/ asking if they can provide us with an alternative to hosting on OpenShift.

Changaco · 2016-06-20T14:53:13Z

CloudFlare had some issues in Europe today. At least one of our users was affected: he got a timeout error page and reported the problem on Twitter.

ghost · 2016-06-20T14:59:04Z

Hello, I'm the whiner from Twitter :-)
What about letting go of Cloudflare? It's a real pain-in-the-ass product: it blocks Tor, doesn't work in the EU...

angristan · 2016-06-20T15:05:20Z

Why are you using OpenShift? It seems pretty expensive.

angristan · 2016-06-20T15:06:31Z

@matunixe

> What about letting go of Cloudflare? It's a real pain-in-the-ass product: it blocks Tor, doesn't work in the EU...

It works perfectly in the EU; it has only had issues today.

Changaco · 2016-06-20T15:28:53Z

> What about letting go of Cloudflare?

What would we replace it with? There are reasons why we use CloudFlare, and for the most part it does the job well.

As I've said before I wish we had our own infrastructure, but that costs time and money, and we're short on both.

> It's a real pain-in-the-ass product: it blocks Tor, doesn't work in the EU...

We've already white-listed Tor. As for today's connectivity issues, they could have happened to any Internet service provider.

> Why are you using OpenShift? It seems pretty expensive.

We're using OpenShift Online, which is free. They're in the process of relaunching it with a container-based infrastructure, so their website may be confusing right now.

kindlyfire · 2016-06-20T17:15:46Z

For monitoring a website, https://www.nixstats.com/ is a nice service.

Changaco · 2016-11-07T16:32:18Z

Here's how we can work around CloudFlare's lack of alerts: instead of pinging the site every few minutes with an HTTP request, let's make the app ping a monitoring server continuously with small TCP packets. That way we can detect downtime almost instantly, without overloading the app and spamming its logs with lots of ping HTTP requests.
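
A rough sketch of that idea, under stated assumptions: the app keeps one TCP connection open to a monitoring server and sends a single byte at a fixed interval, and the monitor raises an alert as soon as the bytes stop or the connection closes. The function names, the one-byte payload, and the interval and timeout values are all hypothetical. A real monitor would also need to handle reconnects and more than one client.

```python
import socket
import threading


def send_heartbeats(host, port, interval=1.0, stop=None):
    """App side: keep one TCP connection open and send one byte per interval."""
    stop = stop or threading.Event()
    with socket.create_connection((host, port), timeout=5) as conn:
        while not stop.is_set():
            conn.sendall(b".")
            stop.wait(interval)


def watch_heartbeats(server_sock, timeout=3.0, on_down=print):
    """Monitor side: accept one client and report when its heartbeats stop.

    Returns "timeout" if no byte arrived within `timeout` seconds,
    "closed" if the peer closed the connection, "error" on a socket error.
    """
    conn, addr = server_sock.accept()
    with conn:
        conn.settimeout(timeout)
        while True:
            try:
                data = conn.recv(64)
            except socket.timeout:
                on_down("no heartbeat from %s for %.1fs" % (addr[0], timeout))
                return "timeout"
            except OSError as exc:
                on_down("connection from %s failed: %s" % (addr[0], exc))
                return "error"
            if not data:
                on_down("connection from %s closed" % addr[0])
                return "closed"
```

The app would run `send_heartbeats` in a background thread, and `on_down` is where an email or other alert would be triggered. The monitor's timeout should be a few times the sender's interval so that a single delayed packet doesn't cause a false alarm.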

Changaco · 2016-12-03T14:03:20Z

We have another problem: the PostgreSQL server keeps dying for no apparent reason. It's happened more than a few times now. I need to figure out why, and make the Python code restart postgres automatically in production (instead of quitting without even sending an alert).

Changaco · 2017-01-01T10:15:36Z

Upgrading PostgreSQL may be part of the solution: liberapay/openshift-cartridge-postgres#4. We're not set up for zero-downtime upgrades though (liberapay/openshift-cartridge-postgres#5).

Changaco · 2017-01-07T10:25:00Z

I've just upgraded postgresql from 9.4.5 to 9.4.10 in production. We had a few minutes of downtime, but on the bright side #480 worked perfectly: visitors saw a proper 503 page, I received a Sentry alert, and the app recovered automatically once the new version of postgres was installed.

Changaco · 2017-01-07T20:35:23Z

There's one option we haven't considered yet: moving "down" to AWS. It's what we're currently running on through OpenShift, but getting rid of the abstraction layer would open up many possibilities.

https://aws.amazon.com/elasticbeanstalk/ and https://aws.amazon.com/rds/ seem to make it pretty easy to run a Python web app with a PostgreSQL database.

If I'm reading AWS' pricing correctly, having one micro EC2 instance and one micro PostgreSQL DB costs around 3 USD per week. We could pay that. Moreover, it looks like the first 12 months are free.

revi · 2017-01-08T06:45:30Z

And bandwidth.

I don't know what bandwidth costs on OpenShift right now, but AWS will cost some (or a lot of) dollars even when hiding behind CloudFlare.

Changaco · 2017-01-08T09:43:47Z

Thanks @revi, I hadn't seen the data transfer costs. They seem to be pretty low though.

AWS data transfer pricing:

first 1GB / month: free
up to 10TB / month: $0.09 / GB

Our CloudFlare stats: "Uncached Bandwidth | Last Month | 1 GB".

In other words even if our outbound traffic increased tenfold we'd only have to pay an extra $0.81 per month (less than $0.20 per week).
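
The arithmetic can be checked with a few lines of Python. This sketch uses only the two pricing tiers quoted above, so it is valid up to 10 TB per month, and it assumes 52/12 weeks per month for the weekly figure.

```python
FREE_GB_PER_MONTH = 1      # first 1 GB / month is free
PRICE_PER_GB = 0.09        # USD per GB, valid up to 10 TB / month
WEEKS_PER_MONTH = 52 / 12  # assumption used for the per-week figure


def monthly_transfer_cost(outbound_gb):
    """Return the monthly data transfer cost in USD for the given outbound GB."""
    billable_gb = max(0, outbound_gb - FREE_GB_PER_MONTH)
    return billable_gb * PRICE_PER_GB


current = monthly_transfer_cost(1)    # last month's uncached bandwidth
tenfold = monthly_transfer_cost(10)   # ten times that traffic
print(f"1 GB/month:  ${current:.2f}")
print(f"10 GB/month: ${tenfold:.2f} per month, "
      f"${tenfold / WEEKS_PER_MONTH:.2f} per week")
```

With 10 GB of outbound traffic, 9 GB are billable, which gives 9 × $0.09 = $0.81 per month, or about $0.19 per week.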

Moreover, the free tier for the first 12 months includes "15 GB of bandwidth out aggregated across all AWS services".

OpenShift doesn't cost us anything, but the version we're running on is deprecated and flawed. If the next version of OpenShift doesn't become available for production use soon and with attractive pricing (or sponsorship), then AWS seems like the best alternative.

Changaco · 2017-01-08T10:27:35Z

Traffic to CloudFlare isn't our only outbound traffic though: there are also emails (Mailgun SMTP), Sentry events, API requests to MangoPay and other platforms, etc.

revi · 2017-01-08T10:36:23Z

That's good to hear, though it takes a few tricks to get the full advantage of the AWS free tier.

As for the other traffic, I think it's minor compared to the CF traffic. (And I think we can use AWS SES: their free tier allows 62,000 mails sent from their infra, making it cheaper than Mailgun.)

Changaco · 2017-01-08T10:43:18Z

Note that the 62,000 "free" outbound emails per month don't include the data transfer cost, but there's no way around that, so it's indeed better than the 10,000 free emails per month we currently have on Mailgun.

Changaco · 2017-01-13T20:13:43Z

We've just suffered significant downtime again due to http://status.openshift.com/incidents/n6lkx92g8wyk.

Changaco · 2017-03-09T09:17:23Z

Working on #505 has led me to take another look at the possibilities to improve our server stack, and thus our uptime. Here is what I came up with:

migrate to AWS
migrate to Google Cloud
migrate to Clever Cloud
build our own thing with a few cheap OVH virtual machines

They represent different trade-offs between monetary costs, engineering costs, reliability, etc. I haven't been able to determine a clear winner so far.

Liberapay can run on cheap servers, but we need high security. We want as little downtime as possible, but we have a small budget. It's a difficult situation.

I'm wary of AWS because of its complex pricing, and because of "GCE vs AWS in 2016: Why you shouldn’t use Amazon".

I dislike Google Cloud because I dislike Google. I'm also not convinced that it would actually be better for us in terms of costs compared to AWS.

Clever Cloud is theoretically nice, but their cheap PostgreSQL plans are stuck at version 9.2.8, which is unacceptable (we need at least 9.4, and we really want 9.6).

I dread the OVH option because setting it up and maintaining it would cost a lot of time, and we would be responsible for the security of the entire stack (from the kernel all the way up).

ghost · 2017-03-09T12:45:59Z

Hello @Changaco ,

I can't really help you with the money side, but I can do something on the sysadmin side, especially if you decide to go with a solution like OVH/Scaleway/DO/etc.
Ping me if you are interested.

👋

Changaco · 2017-03-10T09:03:34Z

Hey @matunixe, it'd be great to have some help, but what are you proposing exactly?

ghost · 2017-03-10T10:30:46Z

Hi @Changaco,

> I dread the OVH option because setting it up and maintaining it would cost a lot of time, and we would be responsible for the security of the entire stack (from the kernel all the way up).

I can give you my time, security knowledge and experience in maintaining servers. I'm definitely not an AWS pro but for sure I can be useful with a more classical option.

Changaco · 2017-03-10T10:51:13Z

I've been taking a deeper look at GCloud and AWS.

GCloud does appear to be cheaper than AWS, but it really isn't much better than a bunch of VMs: they don't support PostgreSQL as a service, and the flexible Python environments we'd need to run Liberapay aren't available in Europe (according to this doc).

AWS is a lot better, but if we went with an optimal setup (EU region, zero downtime) it would cost us more than $6 per week (not counting taxes). Theoretically we could achieve the same result for less than half of that if we did everything ourselves, but the monetary savings would be more than offset by the increased engineering costs.

So, it looks like our best bet is to go with AWS but stay within the free tier (one webapp runner, one non-redundant database). That won't guarantee zero downtime, but system updates should no longer be a problem like they are on OpenShift (AWS Elastic Beanstalk supports Managed Updates for Single-Instance environments). Once the time-limited free tier is over it should cost us less than 5€ per week (incl. taxes).

Changaco · 2017-03-13T08:46:42Z

Let's try AWS. If it ends up costing us too much or causing us other problems then we'll reconsider.

@matunixe Thanks for offering to help. :-) I may ping you someday.

Changaco · 2017-03-18T20:08:16Z

The migration to AWS is done (see #553 for details). I can finally close this issue! \o/

Changaco added the critical issues that threaten the very existence of Liberapay label Feb 22, 2016

Changaco self-assigned this Feb 22, 2016

This was referenced Nov 5, 2016

Fix hot deployment #150

Closed

Watercooler - Q4 2016 liberapay/salon#78

Closed

Changaco mentioned this issue Dec 21, 2016

Automatically (re)start postgres in production #462

Merged

Changaco mentioned this issue Mar 8, 2017

Transparent accounting for organizations #505

Open

Changaco mentioned this issue Mar 13, 2017

Migration to AWS #553

Closed

Changaco closed this as completed Mar 18, 2017

Changaco mentioned this issue Mar 18, 2017

Watercooler - Q1 2017 liberapay/salon#98

Closed

Changaco mentioned this issue Apr 18, 2018

AWS #1075

Closed

Changaco mentioned this issue Apr 7, 2020

Servers #1727

Open
