diff --git a/outages/2016/2016-06 AWS.md b/outages/2016/2016-06 AWS.md
deleted file mode 100644
index b19795fa2..000000000
--- a/outages/2016/2016-06 AWS.md
+++ /dev/null
@@ -1,16 +0,0 @@
-## Root Cause
-
-- datacenter power outage after storm
-
-## Duration
-
-10h
-
-## Impact
-
-- EC2, EBS
-- Sydney only
-
-## Media
-
-- https://www.readitquik.com/articles/cloud-3/top-7-aws-outages-that-wreaked-havoc/
diff --git a/outages/2016/2016-10-21 Dyn DNS.md b/outages/2016/2016-10-21 Dyn DNS.md
deleted file mode 100644
index 946791bef..000000000
--- a/outages/2016/2016-10-21 Dyn DNS.md
+++ /dev/null
@@ -1,19 +0,0 @@
-## Root Cause
-
-- DNS DDoS by Mirai IoT botnet
-
-## Duration
-
-~11h
-
-## Impact
-
-- all DynDNS customers
-
-## Media
-
-- https://www.techradar.com/news/5-of-the-worlds-biggest-network-outages
-
-## Status Page
-
-- https://www.dynstatus.com/incidents/nlr4yrr162t8
diff --git a/outages/2019/2019-02-07 Cloudflare.md b/outages/2019/2019-02-07 Cloudflare.md
deleted file mode 100644
index 640621105..000000000
--- a/outages/2019/2019-02-07 Cloudflare.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Root Cause
-
-- bad deployment causing massive CPU spike
-
-## Duration
-
-30min
-
-## Impact
-
-- only HTTP 502 pages were delivered
-
-## Status Page
-
-- https://blog.cloudflare.com/cloudflare-outage/
diff --git a/outages/2019/2019-05-02 Azure.md b/outages/2019/2019-05-02 Azure.md
deleted file mode 100644
index ecd838b3c..000000000
--- a/outages/2019/2019-05-02 Azure.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Root Cause
-
-- DNS migration from "legacy DNS" to Azure DNS
-
-## Duration
-
-19:43 and 22:35 UTC
-
-## Impact
-
-- most cloud services
-
-## Media
-
-- https://build5nines.com/may-2-2019-major-azure-outage-due-dns-migration-issue/
diff --git a/outages/2019/2019-05-18 Salesforce.md b/outages/2019/2019-05-18 Salesforce.md
deleted file mode 100644
index 5e29c0a23..000000000
--- a/outages/2019/2019-05-18 Salesforce.md
+++ /dev/null
@@ -1,17 +0,0 @@
-## Root Cause
-
-- internal DB update script messed up user privileges (making them too open)
-
-## Duration
-
-- ~15h
-
-## Impact
-
-- all customers shut off to prevent unprivileged data access
-
-## Media
-
-- https://techhq.com/2019/05/salesforce-hit-by-15-hour-downtime/
-- https://www.crn.com/slide-shows/cloud/6-things-to-know-about-the-latest-salesforce-outage
-- https://www.datacenterdynamics.com/opinions/salesforce-database-outage-why-it-happened-and-how-prevent-another-one/
diff --git a/outages/2019/2019-06-02 GCP Outage.md b/outages/2019/2019-06-02 GCP Outage.md
deleted file mode 100644
index d31f41771..000000000
--- a/outages/2019/2019-06-02 GCP Outage.md
+++ /dev/null
@@ -1,22 +0,0 @@
-## Root Cause
-
-- Network control plane
-- automation tool
-
-## Duration
-
-- 4h
-
-## Impact
-
-- G-Suite, Gmail, Google Docs, Google Drive, Google Cloud, YouTube
-- Vimeo, Shopify, Discord, Snapchat
-
-## Media
-
-- https://techhq.com/2019/08/what-we-learned-from-google-clouds-june-outage/
-
-## Status Page
-
-- https://www.google.com/appsstatus
-- https://status.cloud.google.com/incident/cloud-networking/19009
diff --git a/outages/2019/2019-06-24 Verizon.md b/outages/2019/2019-06-24 Verizon.md
deleted file mode 100644
index d9706d4f2..000000000
--- a/outages/2019/2019-06-24 Verizon.md
+++ /dev/null
@@ -1,16 +0,0 @@
-## Root Cause
-
-- BGP route leak
-- Route propagation
-
-## Duration
-
-3h
-
-## Impact
-
-- Google, AWS, Reddit, Netflix, Cloudflare customers
-
-## Media
-
-- https://slate.com/technology/2019/06/verizon-dqe-outage-internet-cloudflare-reddit-aws.html
diff --git a/outages/2019/2019-07-18 Slack.md b/outages/2019/2019-07-18 Slack.md
deleted file mode 100644
index 99e8290d9..000000000
--- a/outages/2019/2019-07-18 Slack.md
+++ /dev/null
@@ -1,16 +0,0 @@
-## Root Cause
-
-- some servers unavailablity, performance degradation
-
-## Duration
-
-~7h
-
-## Impact
-
-- connectivity issues
-- 10-25% error rate
-
-## Media
-
-- https://www.cnet.com/news/slack-explains-last-weeks-hours-long-outage/
diff --git a/outages/2020/2020-03-03 Azure.md b/outages/2020/2020-03-03 Azure.md
deleted file mode 100644
index da70b6f71..000000000
--- a/outages/2020/2020-03-03 Azure.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Root Cause
-
-- physical datacenter malfunction of air ventilation, overheating HW
-
-## Duration
-
-6h
-
-## Impact
-
-- us-east1
-
-## Media
-
-- https://techhq.com/2020/12/3-biggest-public-cloud-outages-of-2020/
diff --git a/outages/2020/2020-05-12 Slack.md b/outages/2020/2020-05-12 Slack.md
deleted file mode 100644
index 48a45e092..000000000
--- a/outages/2020/2020-05-12 Slack.md
+++ /dev/null
@@ -1,22 +0,0 @@
-## Root Cause
-
-- scaling up automation failure
-- new servers were not added to LBs, causing continuous performance degradation
-
-## Duration
-
-3h (everyone)
-1d (for Electron app users)
-
-## Impact
-
-- no messages could be sent
-
-## Media
-
-- https://statusgator.com/blog/2020/08/21/5-biggest-outages-of-q2-2020/
-
-## Status Page
-
-- https://status.slack.com/2020-05-12 (Incident Report)
-- https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/ (Postmortem)
diff --git a/outages/2020/2020-05-17 Zoom.md b/outages/2020/2020-05-17 Zoom.md
deleted file mode 100644
index 30634789d..000000000
--- a/outages/2020/2020-05-17 Zoom.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Root Cause
-
-- undisclosed
-
-## Duration
-
-7h
-
-## Impact
-
-- customers unable to join meetings
-
-## Media
-
-- https://statusgator.com/blog/2020/08/21/5-biggest-outages-of-q2-2020/
diff --git a/outages/2020/2020-06-10 IBM Cloud.md b/outages/2020/2020-06-10 IBM Cloud.md
deleted file mode 100644
index 795959b7c..000000000
--- a/outages/2020/2020-06-10 IBM Cloud.md
+++ /dev/null
@@ -1,11 +0,0 @@
-## Duration
-
-several hours
-
-## Impact
-
-- cloud down globally
-
-## Media
-
-- https://statusgator.com/blog/2020/08/21/5-biggest-outages-of-q2-2020/
diff --git a/outages/2020/2020-06-29 Github.md b/outages/2020/2020-06-29 Github.md
deleted file mode 100644
index c0a780f91..000000000
--- a/outages/2020/2020-06-29 Github.md
+++ /dev/null
@@ -1,11 +0,0 @@
-## Duration
-
-2h
-
-## Impact
-
-- FIXME
-
-## Media
-
-- https://statusgator.com/blog/2020/08/21/5-biggest-outages-of-q2-2020/
diff --git a/outages/2020/2020-08-24 Zoom.md b/outages/2020/2020-08-24 Zoom.md
deleted file mode 100644
index be893b2d7..000000000
--- a/outages/2020/2020-08-24 Zoom.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Root Cause
-
-- not disclosed
-
-## Duration
-
-3h
-
-## Media
-
-- https://techhq.com/2020/12/3-biggest-public-cloud-outages-of-2020/
-
-## Status Page
-
-- https://status.zoom.us/
diff --git a/outages/2020/2020-11-26 AWS.md b/outages/2020/2020-11-26 AWS.md
deleted file mode 100644
index ee1d2c71d..000000000
--- a/outages/2020/2020-11-26 AWS.md
+++ /dev/null
@@ -1,17 +0,0 @@
-## Root Cause
-
-??
-
-## Duration
-
-??
-
-## Impact
-
-- only us-east1
-- Roku, Adobe, Glassdoor, Autodesk, The Wall Street Journal, 1Password
-- Kinesis Data Streams API and other dependent services
-
-## Media
-
-- https://techhq.com/2020/12/3-biggest-public-cloud-outages-of-2020/
diff --git a/outages/2021/2021-03-10 OVH SBG Datacenters.md b/outages/2021/2021-03-10 OVH SBG Datacenters.md
deleted file mode 100644
index 53e5167fe..000000000
--- a/outages/2021/2021-03-10 OVH SBG Datacenters.md
+++ /dev/null
@@ -1,13 +0,0 @@
-## Impact
-
-- 4 datacenters down
-- 2 destroyed
-- recovery >10days
-
-## Media
-
-- https://www.bleepingcomputer.com/news/technology/ovh-data-center-burns-down-knocking-major-sites-offline/
-
-## Provider Status Page
-
-- https://status.us.ovhcloud.com/
diff --git a/outages/2021/2021-03-23 quay.io.md b/outages/2021/2021-03-23 quay.io.md
deleted file mode 100644
index 7565c558a..000000000
--- a/outages/2021/2021-03-23 quay.io.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Impact
-
-- No image pulls possible
-
-## Duration
-
-4h
-
-## Root Cause
-
-- somehow AWS related
-
-## Status Page
-
-- https://status.quay.io/incidents/vfs19hmq660h (Incident Report)
diff --git a/outages/2021/2021-05-11 Salesforce.md b/outages/2021/2021-05-11 Salesforce.md
deleted file mode 100644
index f720a261b..000000000
--- a/outages/2021/2021-05-11 Salesforce.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Impact
-
-- All services not available due to DNS outage
-
-## Duration
-
-4h
-
-## Root Cause
-
-- failed global DNS change
-
-## Status Page
-
-- https://www.theregister.com/2021/05/19/salesforce_root_cause/ (Report)
diff --git a/outages/2021/2021-06-08 Fastly.md b/outages/2021/2021-06-08 Fastly.md
deleted file mode 100644
index 1a0584093..000000000
--- a/outages/2021/2021-06-08 Fastly.md
+++ /dev/null
@@ -1,17 +0,0 @@
-## Impact
-
-- global incident
-- high origin loads
-- "Customers could continue to experience a period of increased origin load and lower Cache Hit Ratio (CHR)."
-
-## Duration
-
->2h
-
-## Root Cause
-
-- unknown
-
-## Status Page
-
-- https://status.fastly.com/incidents/vpk0ssybt3bj (Report)
diff --git a/outages/2021/2021-10-04 Facebook.md b/outages/2021/2021-10-04 Facebook.md
deleted file mode 100644
index 07098eead..000000000
--- a/outages/2021/2021-10-04 Facebook.md
+++ /dev/null
@@ -1,12 +0,0 @@
-## Impact
-
-- Facebook, Instagram, Whatsapp down
-
-## Duration
-
-6h
-
-## Media
-
-- https://engineering.fb.com/2021/10/04/networking-traffic/outage/
-- https://twitter.com/jgrahamc/status/1445068309288951820
diff --git a/outages/2021/2021-12-08 AWS.md b/outages/2021/2021-12-08 AWS.md
deleted file mode 100644
index 2144fc901..000000000
--- a/outages/2021/2021-12-08 AWS.md
+++ /dev/null
@@ -1,12 +0,0 @@
-## Impact
-
-- different services in us-east1#
-
-## Duration
-
->4h
-
-## Media
-
-- https://www.zdnet.com/article/aws-goes-down-and-with-it-goes-a-host-of-websites-and-services/
-
diff --git a/outages/2022/2022-02-22 Slack.md b/outages/2022/2022-02-22 Slack.md
deleted file mode 100644
index 6dcac3405..000000000
--- a/outages/2022/2022-02-22 Slack.md
+++ /dev/null
@@ -1,10 +0,0 @@
----
-impact: Slack not loading
-duration: 5h
-cause: |
- Quote from Slack status page: *A configuration change inadvertently lead to a sudden
- increase in activity on our database infrastructure. Due to this increased activity,
- the affected databases failed to serve incoming requests to connect to Slack.*
-links:
-- https://status.slack.com/2022-02-22
----
diff --git a/outages/2022/2022-03-01 Apple.md b/outages/2022/2022-03-01 Apple.md
deleted file mode 100644
index 42b4bc59b..000000000
--- a/outages/2022/2022-03-01 Apple.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: App Store, Maps, TV
-duration: 4h
-cause: DNS problems
-links:
-- https://www.crn.com/news/cloud/the-10-biggest-cloud-outages-of-2022-so-far?page=6
----
diff --git a/outages/2022/2022-04-05 Atlassian.md b/outages/2022/2022-04-05 Atlassian.md
deleted file mode 100644
index 9f37249b0..000000000
--- a/outages/2022/2022-04-05 Atlassian.md
+++ /dev/null
@@ -1,14 +0,0 @@
----
-impact: |
- 400 companies and anywhere from 50,000 to 400,000 users had no access to JIRA,
- Confluence, OpsGenie, JIRA Status page, and other Atlassian Cloud services
-
-duration: ">14days for some customers"
-cause: |
- global scale orchestration human error, instead of shutting down component
- product instances were terminated
-
-links:
-- https://www.atlassian.com/engineering/april-2022-outage-update
-- https://newsletter.pragmaticengineer.com/p/scoop-atlassian?s=r
----
diff --git a/outages/2022/2022-06-21 Cloudflare.md b/outages/2022/2022-06-21 Cloudflare.md
deleted file mode 100644
index 55c5d6b11..000000000
--- a/outages/2022/2022-06-21 Cloudflare.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: many affected websites
-duration: 1h 15min
-cause: "*A change to the network configuration in those locations caused an outage*"
-links:
-- https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022
----
diff --git a/outages/2022/2022-07-08 AWS US East2 AZ1.md b/outages/2022/2022-07-08 AWS US East2 AZ1.md
deleted file mode 100644
index 20babc502..000000000
--- a/outages/2022/2022-07-08 AWS US East2 AZ1.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: AZ1 of US East2 without connectivity
-duration: 20min
-cause: power failure
-links:
-- https://www.thousandeyes.com/blog/aws-outage-analysis-july-28-2022
----
diff --git a/outages/2022/2022-08-09 Google Search+Maps.md b/outages/2022/2022-08-09 Google Search+Maps.md
deleted file mode 100644
index cab0266c8..000000000
--- a/outages/2022/2022-08-09 Google Search+Maps.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: Google Search, Google Maps globally unavailable
-duration: 1h
-cause: software update
-links:
-- https://www.networkworld.com/article/971832/top-10-outages-of-2022.html
----
diff --git a/outages/2022/2022-09-15 Zoom.md b/outages/2022/2022-09-15 Zoom.md
deleted file mode 100644
index 4dab5e033..000000000
--- a/outages/2022/2022-09-15 Zoom.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: no meeting globally
-duration: 1h
-cause: unclear
-links:
-- https://www.thousandeyes.com/blog/internet-report-pulse-update-september-26-2022
----
diff --git a/outages/2022/2022-10-25 Whatsapp.md b/outages/2022/2022-10-25 Whatsapp.md
deleted file mode 100644
index adc3ce8ef..000000000
--- a/outages/2022/2022-10-25 Whatsapp.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: users unable to send/receive messages
-duration: 2h
-cause: backend application service failure
-links:
-- https://www.thousandeyes.com/blog/internet-report-pulse-update-november-7-2022
----
diff --git a/outages/2022/2022-12-05 AWS US East2.md b/outages/2022/2022-12-05 AWS US East2.md
deleted file mode 100644
index be14794e1..000000000
--- a/outages/2022/2022-12-05 AWS US East2.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: US East2 connectivity issues
-duration: 75min
-cause: unclear
-links:
-- https://www.networkworld.com/article/971716/aws-suffers-outage-at-its-us-east-2-cloud-region.html
----
diff --git a/outages/2023/2023-01-25 Microsoft Teams.md b/outages/2023/2023-01-25 Microsoft Teams.md
deleted file mode 100644
index 8a9c918a3..000000000
--- a/outages/2023/2023-01-25 Microsoft Teams.md
+++ /dev/null
@@ -1,8 +0,0 @@
----
-impact: World-wide MS Teams outage
-duration: 1h
-cause: network configuration error
-links:
-- https://twitter.com/msft365status/status/1549934141738651648?s=21&t=bq79TBlFzHzpiH_qwB_tVA
-- https://portal.office.com/adminportal/home?#/servicehealth/:/alerts/TM402718
----
diff --git a/outages/2023/2023-02-13 Oracle OCI.md b/outages/2023/2023-02-13 Oracle OCI.md
deleted file mode 100644
index fc33dd725..000000000
--- a/outages/2023/2023-02-13 Oracle OCI.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: OCI Vault, API Gateway, Oracle Digital Assistant and OCI Search with OpenSearch
-duration: 3d
-cause: performance problems in DNS-based load management
-links:
- - https://www.networkworld.com/article/3688509/oracle-outages-serve-as-warning-for-companies-relying-on-cloud-technology.html
----
diff --git a/outages/2023/2023-02-16 GCP.md b/outages/2023/2023-02-16 GCP.md
deleted file mode 100644
index 866fb1fda..000000000
--- a/outages/2023/2023-02-16 GCP.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: Gmail, Youtube, Google Drive partial outage
-duration: 6h
-cause: network update caused traffic disruption
-links:
-- https://www.linkedin.com/pulse/recent-cloud-platform-outages-2023-pankaj-kumar-mandal
----
diff --git a/outages/2023/2023-03-09 Datadog.md b/outages/2023/2023-03-09 Datadog.md
deleted file mode 100644
index 44463abf4..000000000
--- a/outages/2023/2023-03-09 Datadog.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: Service outage
-duration: 2d
-cause: OS update
-links:
- - https://www.crn.com/news/cloud/the-15-biggest-cloud-outages-of-2023?page=5
----
diff --git a/outages/2023/2023-04-07 SpaceX.md b/outages/2023/2023-04-07 SpaceX.md
deleted file mode 100644
index 60f9be3e8..000000000
--- a/outages/2023/2023-04-07 SpaceX.md
+++ /dev/null
@@ -1,15 +0,0 @@
-## Impact
-
-No connection
-
-## Duration
-
-2h
-
-## Root Cause
-
-Expired certificate
-
-## Links
-
-https://blog.cloudflare.com/q2-2023-internet-disruption-summary/
diff --git a/outages/2023/2023-04-25 GCP-europe-west-9.md b/outages/2023/2023-04-25 GCP-europe-west-9.md
deleted file mode 100644
index acdd1fd01..000000000
--- a/outages/2023/2023-04-25 GCP-europe-west-9.md
+++ /dev/null
@@ -1,10 +0,0 @@
----
-impact: |
- Cloud region europe-west-9 was offline (one day)
- Zone europe-west-9-a was offline (two weeks)
-
-duration: 1d
-cause: fire after cooling system water pipe leak
-links:
-- https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPfkY
----
diff --git a/outages/2023/2023-06-13 AWS-us-east1.md b/outages/2023/2023-06-13 AWS-us-east1.md
deleted file mode 100644
index ed891089d..000000000
--- a/outages/2023/2023-06-13 AWS-us-east1.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: Service degradation of 104 AWS services (that where using AWS Lambda)
-duration: 3h
-cause: Lambda scaling crossing a new threshold hit a functional bug
-links:
-- https://aws.amazon.com/message/061323/
----
diff --git a/outages/2023/2023-07-05 Azure.md b/outages/2023/2023-07-05 Azure.md
deleted file mode 100644
index a196e6f64..000000000
--- a/outages/2023/2023-07-05 Azure.md
+++ /dev/null
@@ -1,8 +0,0 @@
----
-impact: Region West Europe partially down
-duration: 8h
-cause: fiber cut caused by severe weather conditions in the Netherlands
-links:
-- https://azure.status.microsoft/en-gb/status/history/
-- https://www.youtube.com/watch?v=tODJb-Tm_q0
----
diff --git a/outages/2023/2023-11-02 Cloudflare.md b/outages/2023/2023-11-02 Cloudflare.md
deleted file mode 100644
index b480b61ac..000000000
--- a/outages/2023/2023-11-02 Cloudflare.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-impact: Cloudflare control panel and analytics outage
-duration: 2d
-cause: data center outage + high availability did not work
-links:
-- https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/
----
diff --git a/outages/README.md b/outages/README.md
deleted file mode 100644
index 557ee553d..000000000
--- a/outages/README.md
+++ /dev/null
@@ -1,14 +0,0 @@
-## Cloud / SaaS Outages
-
-A collection of large cloud outages collected by year which to serve as a data basis
-for the argument against believing in cloud SLAs. Not that I'm against using the cloud,
-but knowing there will be several days outage a year with a certain probability despite
-any promised SLA is important.
-
-### Listing criteria
-
-To understand which outages are listed and why others are not. The criteria are roughly
-
-- outage is global
-- or very long >1h
-- or causing a complete service loss in a region