Prevent downtime #177

Closed
Changaco opened this issue Feb 22, 2016 · 27 comments
Assignees
Labels
critical issues that threaten the very existence of Liberapay

Comments

@Changaco
Member

The site was down for about an hour this morning, and it seems to be OpenShift's fault. 😠 😞

This message appeared in the rhc tail I had running:

The system is going down for reboot in 1 minute!
This node will restart in 1 minutes. Please save your work. [Rebooting nodes via Ansible]

I see no error message indicating why the app wasn't restarted after the reboot.

rhc app restart brought the site back.

@Changaco Changaco added the critical issues that threaten the very existence of Liberapay label Feb 22, 2016
@Changaco Changaco self-assigned this Feb 22, 2016
@Changaco
Member Author

It looks like CloudFlare doesn't have alerts. They invite you to use Pingdom instead, which is probably great for Pingdom's business, but sucks for us and our users. I really wish we had our own infrastructure.

@Changaco
Member Author

The server rebooted again during the night. Fortunately this time I was running a monitoring script that restarted the app automatically. The site was down for about 15 minutes according to the logs, which is still an unacceptably long time.
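
A minimal sketch of what such a monitoring script could look like, not the script that was actually running. The only detail taken from this thread is the `rhc app restart` command; the URL, the check interval, the failure threshold and all function names are assumptions made for illustration.

```python
import subprocess
import time
import urllib.error
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers with a non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as e:
        # 4xx still means the app is answering; 5xx counts as down.
        return e.code < 500
    except OSError:
        # Covers URLError, refused connections and timeouts.
        return False


def watch(url, restart_cmd, interval=30, max_failures=2, checks=None,
          is_up=site_is_up, run=subprocess.call, sleep=time.sleep):
    """Poll `url`; run `restart_cmd` after `max_failures` consecutive failures.

    `checks=None` means "loop forever". Returns the number of restarts issued.
    """
    failures = 0
    restarts = 0
    done = 0
    while checks is None or done < checks:
        done += 1
        if is_up(url):
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                run(restart_cmd)
                restarts += 1
                failures = 0  # give the app time to come back before retrying
        sleep(interval)
    return restarts
```

It would be started with something like `watch("https://example.org/", ["rhc", "app", "restart"])`, where the URL is a placeholder and the restart command is passed through as-is with whatever arguments the deployment needs. With a 30-second interval and two failures required, detection alone takes up to a minute, which is one reason polling cannot bring downtime close to zero.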

@Changaco
Member Author

We've talked about this in IRC. I've sent an email to http://bearstech.com/ asking if they can provide us with an alternative to hosting on OpenShift.

@Changaco
Member Author

CloudFlare had some issues in Europe today. At least one of our users was affected: he got a timeout error page and reported the problem on Twitter.

@ghost

ghost commented Jun 20, 2016

Hello, I'm the whiner from Twitter :-)
What about letting go of Cloudflare? It's a real pain-in-the-ass product: it blocks Tor, doesn't work in the EU...

@angristan

Why are you using OpenShift? It seems pretty expensive.

@angristan

@matunixe

What about letting go of Cloudflare? It's a real pain-in-the-ass product: it blocks Tor, doesn't work in the EU...

It works perfectly in the EU; it only had issues today.

@Changaco
Member Author

What about letting go of Cloudflare?

What would we replace it with? There are reasons why we use CloudFlare, and for the most part it does the job well.

As I've said before I wish we had our own infrastructure, but that costs time and money, and we're short on both.

It's a real pain-in-the-ass product: it blocks Tor, doesn't work in the EU...

We've already white-listed Tor. As for today's connectivity issues, they could have happened to any Internet service provider.

Why are you using OpenShift? It seems pretty expensive.

We're using OpenShift Online, which is free. They're in the process of relaunching it with a container-based infrastructure, so their website may be confusing right now.

@kindlyfire
Member

For monitoring a website, https://www.nixstats.com/ is a nice service.

This was referenced Nov 5, 2016
@Changaco
Member Author

Changaco commented Nov 7, 2016

Here's how we can work around CloudFlare's lack of alerts: instead of pinging the site every few minutes with an HTTP request, let's make the app ping a monitoring server continuously with small TCP packets. That way we can detect downtime almost instantly, without overloading the app and spamming its logs with lots of ping HTTP requests.
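
A rough sketch of that heartbeat idea, under stated assumptions: the app keeps one TCP connection open to a monitoring server and sends a single byte at a fixed interval, and the monitor treats silence beyond a threshold (or a closed connection) as downtime. The one-second interval, the three-second threshold and every name below are hypothetical; nothing here describes an existing Liberapay component.

```python
import socket
import time


def send_heartbeats(host, port, interval=1.0, count=None):
    """App side: keep one TCP connection open and send one byte per interval.

    `count=None` means "send forever". Returns the number of bytes sent.
    """
    sent = 0
    with socket.create_connection((host, port), timeout=5) as conn:
        while count is None or sent < count:
            conn.sendall(b".")
            sent += 1
            time.sleep(interval)
    return sent


class HeartbeatMonitor:
    """Monitor side: the app counts as down once it has been silent too long."""

    def __init__(self, max_silence=3.0, clock=time.monotonic):
        self.max_silence = max_silence
        self.clock = clock
        self.last_beat = None

    def beat(self):
        """Record that a heartbeat was just received."""
        self.last_beat = self.clock()

    def is_down(self):
        """True if no heartbeat was ever seen or the last one is too old."""
        if self.last_beat is None:
            return True
        return self.clock() - self.last_beat > self.max_silence


def receive_heartbeats(server_socket, monitor, on_down, poll=0.5):
    """Accept one app connection; call on_down() when it goes silent or closes."""
    conn, _addr = server_socket.accept()
    with conn:
        conn.settimeout(poll)
        monitor.beat()
        while True:
            try:
                data = conn.recv(64)
            except socket.timeout:
                if monitor.is_down():
                    break  # silent for longer than max_silence
                continue
            if not data:
                break  # connection closed by the app (or its host)
            monitor.beat()
    on_down()
```

Because the monitor only has to notice a missing byte, detection time is bounded by `max_silence` rather than by a polling period of several minutes, and the app's HTTP logs stay free of ping requests.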

@Changaco
Member Author

Changaco commented Dec 3, 2016

We have another problem: the PostgreSQL server dies for no apparent reason. It's happened more than a few times now. I need to figure out why, and make the Python code restart PostgreSQL automatically in production (instead of quitting without even sending an alert).

@Changaco
Member Author

Changaco commented Jan 1, 2017

Upgrading PostgreSQL may be part of the solution: liberapay/openshift-cartridge-postgres#4. We're not set up for zero-downtime upgrades though (liberapay/openshift-cartridge-postgres#5).

@Changaco
Member Author

Changaco commented Jan 7, 2017

I've just upgraded postgresql from 9.4.5 to 9.4.10 in production. We had a few minutes of downtime, but on the bright side #480 worked perfectly: visitors saw a proper 503 page, I received a Sentry alert, and the app recovered automatically once the new version of postgres was installed.
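
For readers unfamiliar with the pattern, here is a generic sketch of how a Python web app can degrade to a 503 while its database is unreachable and recover on its own afterwards. This is not the code from #480: the `DatabaseUnavailable` exception, the middleware name and the `on_error` hook are all hypothetical stand-ins (the hook is where an alert, for example to Sentry, could be sent).

```python
import sys


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the app when the database is unreachable."""


def service_unavailable_middleware(app, retry_after=30, on_error=None):
    """Wrap a WSGI app so that database outages yield a 503 instead of a crash."""

    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except DatabaseUnavailable as e:
            if on_error is not None:
                on_error(e)  # e.g. forward the exception to an alerting service
            body = b"The site is temporarily unavailable, please retry shortly.\n"
            headers = [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", str(retry_after)),
            ]
            # exc_info lets the server replace headers the app may have set.
            start_response("503 Service Unavailable", headers, sys.exc_info())
            return [body]

    return wrapped
```

Nothing is cached between requests, so the first request that arrives after the database is back simply goes through the wrapped app again. Note that this sketch only catches errors raised while the app is called, not errors raised later while a lazily generated response body is being iterated.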

@Changaco
Member Author

Changaco commented Jan 7, 2017

There's one option we haven't considered yet: moving "down" to AWS. It's what we're currently running on through OpenShift, but getting rid of the abstraction layer would open up many possibilities.

https://aws.amazon.com/elasticbeanstalk/ and https://aws.amazon.com/rds/ seem to make it pretty easy to run a Python web app with a PostgreSQL database.

If I'm reading AWS' pricing correctly, having one micro EC2 instance and one micro PostgreSQL DB costs around 3 USD per week. We could pay that. Moreover it looks like the first 12 months are free.

@revi
Member

revi commented Jan 8, 2017

And bandwidth.

I don't know what bandwidth costs on OpenShift right now, but on AWS it will cost some (or a lot of) dollars, even when hiding behind CloudFlare.

@Changaco
Member Author

Changaco commented Jan 8, 2017

Thanks @revi, I hadn't seen the data transfer costs. They seem to be pretty low though.

AWS data transfer pricing:

  • first 1GB / month: free
  • up to 10TB / month: $0.09 / GB

Our CloudFlare stats: "Uncached Bandwidth | Last Month | 1 GB".

In other words even if our outbound traffic increased tenfold we'd only have to pay an extra $0.81 per month (less than $0.20 per week).

Moreover, the free tier for the first 12 months includes "15 GB of bandwidth out aggregated across all AWS services".
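
The arithmetic behind the $0.81 figure can be checked with a few lines of Python. The prices are the ones quoted above (first 1 GB free, then $0.09 per GB up to 10 TB); the function name is made up for this example, and higher pricing tiers are deliberately not modelled.

```python
def monthly_transfer_cost(gb_out, free_gb=1, price_per_gb=0.09):
    """Outbound data transfer cost in USD per month.

    Only valid below 10 TB per month, where the single $0.09/GB rate applies.
    """
    billable_gb = max(0, gb_out - free_gb)
    return billable_gb * price_per_gb


current = monthly_transfer_cost(1)    # 1 GB out: fully covered by the free GB
tenfold = monthly_transfer_cost(10)   # 10 GB out: 9 billable GB at $0.09
weekly = tenfold * 12 / 52            # spread the yearly cost over 52 weeks

print(f"current: ${current:.2f}/month")
print(f"tenfold: ${tenfold:.2f}/month, about ${weekly:.2f}/week")
```

Nine billable gigabytes at $0.09 come to $0.81 per month, or roughly $0.19 per week, which matches the "less than $0.20 per week" estimate.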


OpenShift doesn't cost us anything, but the version we're running on is deprecated and flawed. If the next version of OpenShift doesn't become available for production use soon and with attractive pricing (or sponsorship), then AWS seems like the best alternative.

@Changaco
Member Author

Changaco commented Jan 8, 2017

CloudFlare isn't our only outbound traffic though: there are also emails (Mailgun SMTP), Sentry events, API requests to MangoPay and other platforms, etc.

@revi
Member

revi commented Jan 8, 2017

That's good to hear, though the AWS free tier has a lot of catches if you want to take full advantage of it.

As for the other traffic, I think it's minor compared to the CF traffic. (And I think we can use AWS SES: its free tier allows 62,000 emails sent from their infrastructure, making it cheaper than Mailgun.)

@Changaco
Member Author

Changaco commented Jan 8, 2017

Note that the 62000 "free" outbound emails per month don't include the data transfer cost, but there's no way around that, so it's indeed better than the 10000 free emails per month we currently have on Mailgun.

@Changaco
Member Author

We've just suffered significant downtime again due to http://status.openshift.com/incidents/n6lkx92g8wyk.

@Changaco
Member Author

Changaco commented Mar 9, 2017

Working on #505 has led me to take another look at the possibilities to improve our server stack, and thus our uptime. Here is what I came up with:

  • migrate to AWS
  • migrate to Google Cloud
  • migrate to Clever Cloud
  • build our own thing with a few cheap OVH virtual machines

They represent different trade-offs between monetary costs, engineering costs, reliability, etc. I haven't been able to determine a clear winner so far.

Liberapay can run on cheap servers, but we need high security. We want as little downtime as possible, but we have a small budget. It's a difficult situation.

I'm wary of AWS because of its complex pricing, and because of "GCE vs AWS in 2016: Why you shouldn’t use Amazon".

I dislike Google Cloud because I dislike Google. I'm also not convinced that it would actually be better for us in terms of costs compared to AWS.

Clever Cloud is theoretically nice, but their cheap PostgreSQL plans are stuck at version 9.2.8, which is unacceptable (we need at least 9.4, and we really want 9.6).

I dread the OVH option because setting it up and maintaining it would cost a lot of time, and we would be responsible for the security of the entire stack (from the kernel all the way up).

@ghost

ghost commented Mar 9, 2017

Hello @Changaco ,

I can't really help you with the money side, but I can help with the sysadmin part, especially if you decide to go with a solution like OVH/Scaleway/DO/etc.
Ping me if you are interested.

👋

@Changaco
Member Author

Hey @matunixe, it'd be great to have some help, but what are you proposing exactly?

@ghost

ghost commented Mar 10, 2017

Hi @Changaco,

I dread the OVH option because setting it up and maintaining it would cost a lot of time, and we would be responsible for the security of the entire stack (from the kernel all the way up).

I can give you my time, security knowledge and experience in maintaining servers. I'm definitely not an AWS pro but for sure I can be useful with a more classical option.

@Changaco
Member Author

I've been taking a deeper look at GCloud and AWS.

GCloud does appear to be cheaper than AWS, but it really isn't much better than a bunch of VMs: they don't support PostgreSQL as a service, and the flexible Python environments we'd need to run Liberapay aren't available in Europe (according to this doc).

AWS is a lot better, but if we went with an optimal setup (EU region, zero downtime) it would cost us more than $6 per week (not counting taxes). Theoretically we could achieve the same result for less than half of that if we did everything ourselves, but the monetary savings would be more than offset by the increased engineering costs.

So, it looks like our best bet is to go with AWS but stay within the free tier (one webapp runner, one non-redundant database). That won't guarantee zero downtime, but system updates should no longer be a problem like they are on OpenShift (AWS Elastic Beanstalk supports Managed Updates for Single-Instance environments). Once the time-limited free tier is over it should cost us less than 5€ per week (incl. taxes).

@Changaco
Member Author

Let's try AWS. If it ends up costing us too much or causing us other problems then we'll reconsider.

@matunixe Thanks for offering to help. :-) I may ping you someday.

@Changaco
Member Author

The migration to AWS is done (see #553 for details), so I can finally close this issue! \o/

4 participants