Commit

Markdown, spelling fixes in posts
jabenninghoff committed Nov 1, 2023
1 parent 6c8ff02 commit 866b537
Showing 10 changed files with 154 additions and 112 deletions.
34 changes: 33 additions & 1 deletion .vscode/settings.json
@@ -2,38 +2,70 @@
"cSpell.words": [
"Aboelssaad",
"Allspaw",
"Ashgate",
"Aven",
"Bainbridge",
"Benninghoff",
"Braithwaite",
"CIO",
"CISO",
"CONOPS",
"Cyentia",
"Danyel",
"dastergon",
"Drachten",
"Edmondson",
"Eede",
"Endsley",
"Flin",
"Fong",
"Forsgren",
"Frazelle",
"Garvin",
"Helmreich",
"Holling",
"Hollnagel",
"infosec",
"jabenninghoff",
"joshualande",
"Kersten",
"Klinect",
"Lauche",
"Leveson",
"lorin",
"Maguire",
"Muhren",
"Nemeth",
"Nimda",
"OODA",
"OWASP",
"Packrat",
"Perrow",
"Petoff",
"Provan",
"PSAS",
"readr",
"reimagining",
"Renn",
"renv",
"Repenning",
"rescanning",
"rmarkdown",
"rstudio",
"sciency",
"SIRA",
"SIRAcon",
"sociotechnical",
"SPSS",
"Stanke",
"Sterman",
"STPA",
"Westrum"
"SUBSAFE",
"Veracode",
"Villalba",
"Walle",
"Westrum",
"wormable",
"Wreathall"
]
}
10 changes: 5 additions & 5 deletions _posts/2016-12-02-successful-safety-programs.md
@@ -1,6 +1,6 @@
---
layout: post
title: Elements of Succesful Safety Programs
title: Elements of Successful Safety Programs
author: jabenninghoff
comments: true
---
@@ -9,10 +9,10 @@ I've previously [written](http://transvasive.com/?p=21) about how aviation safet

What are the elements of a successful safety program and how can we apply these elements to security programs? Comparison of the [SUBSAFE](https://en.wikipedia.org/wiki/SUBSAFE) program and [PhishMe's](https://cofense.com) offering, along with my own experience implementing a vulnerability management (VM) program at a previous company suggest that successful safety or security risk management programs share common features:

* Explicit Goals
* Defined Activities
* Feedback & Incentives
* Continual Improvement
- Explicit Goals
- Defined Activities
- Feedback & Incentives
- Continual Improvement

## Three Programs

60 changes: 32 additions & 28 deletions _posts/2019-09-15-chaos-resilience-engineering.md
@@ -7,44 +7,48 @@ comments: true
I'm giving a talk next Tuesday (9/24) at the [September OWASP MSP Meeting](https://www.meetup.com/OWASP-MSP-Meetup/events/264466608/) on *"Chaos & Resilience Engineering"*. Because the talk is told as a story and a demo, I won't be posting copies of the slides, but I am including an abstract and a list of references here. The talk tells the story of my journey to find chaos engineering, introduces chaos engineering, describes how it is complemented by resilience engineering, and discusses how to get started and join the movement.

## Abstract

*Chaos engineering started at Netflix in 2011 with the invention of the Chaos Monkey, a tool that intentionally disrupted systems on the production network to discover systemic weaknesses so that they could be removed. Since then, the Chaos Monkey has grown to become the Simian Army, and chaos engineering has spread to a global community that develops free & commercial tools to facilitate experiments in QA and production.*

*My journey to chaos & resilience engineering started in 2009 with my desire to find a better way, leading me to the world of safety science and to its connection to the work at Netflix, Etsy, and elsewhere. In this talk, I'll explain chaos engineering, the prerequisites for doing it in production, and how it relates to resilience. I will share some of the work I've done in chaos engineering (in a small way) and resilience engineering (in a larger way), and also ask attendees to share their own experiences in chaos & resilience engineering - you might not realize how easy it is to get started, or know that you're already doing it!*

## My Journey to Chaos Engineering
* [Risk Homeostasis](https://en.wikipedia.org/wiki/Risk_compensation#Risk_homeostasis)
* [The Checklist Manifesto](https://en.wikipedia.org/wiki/The_Checklist_Manifesto)
* [How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf) ([video](https://www.youtube.com/watch?v=2S0k12uZR14))
* [Engineering a Safer World](https://mitpress.mit.edu/books/engineering-safer-world)
* [STAMP/STPA/CAST](https://psas.scripts.mit.edu/home/)
* [Managing Risk and System Change](https://psychology.tcd.ie/postgraduate/msc-riskandchange/)
* [Secure360](https://secure360.org)

- [Risk Homeostasis](https://en.wikipedia.org/wiki/Risk_compensation#Risk_homeostasis)
- [The Checklist Manifesto](https://en.wikipedia.org/wiki/The_Checklist_Manifesto)
- [How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf) ([video](https://www.youtube.com/watch?v=2S0k12uZR14))
- [Engineering a Safer World](https://mitpress.mit.edu/books/engineering-safer-world)
- [STAMP/STPA/CAST](https://psas.scripts.mit.edu/home/)
- [Managing Risk and System Change](https://psychology.tcd.ie/postgraduate/msc-riskandchange/)
- [Secure360](https://secure360.org)

## Chaos & Resilience Engineering
* [Chaos Monkey](https://github.com/Netflix/chaosmonkey)
* [Simian Army](https://github.com/Netflix/SimianArmy) (retired)
* [Chaos Engineering Book](https://www.oreilly.com/library/view/chaos-engineering/9781491988459/)
* [Awesome Chaos Engineering](https://github.com/dastergon/awesome-chaos-engineering)
* [Gremlin](https://www.gremlin.com) (free, limited feature version available for up to 5 nodes)
* [Gremlin Demo](https://github.com/jabenninghoff/gremlin-demo)
* [Principles of Chaos Engineering](https://principlesofchaos.org):

- [Chaos Monkey](https://github.com/Netflix/chaosmonkey)
- [Simian Army](https://github.com/Netflix/SimianArmy) (retired)
- [Chaos Engineering Book](https://www.oreilly.com/library/view/chaos-engineering/9781491988459/)
- [Awesome Chaos Engineering](https://github.com/dastergon/awesome-chaos-engineering)
- [Gremlin](https://www.gremlin.com) (free, limited feature version available for up to 5 nodes)
- [Gremlin Demo](https://github.com/jabenninghoff/gremlin-demo)
- [Principles of Chaos Engineering](https://principlesofchaos.org):
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
1. Hypothesize that this steady state will continue in both the control group and the experimental group.
1. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
1. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
* [Resilience Engineering Book](https://www.crcpress.com/Resilience-Engineering-Concepts-and-Precepts/Woods-Hollnagel/p/book/9780754649045)
* [Four Potentials of Resilience](https://erikhollnagel.com/ideas/resilience%20assessment%20grid.html)
* [Etsy Blameless Post-Mortem](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
- [Resilience Engineering Book](https://www.crcpress.com/Resilience-Engineering-Concepts-and-Precepts/Woods-Hollnagel/p/book/9780754649045)
- [Four Potentials of Resilience](https://erikhollnagel.com/ideas/resilience%20assessment%20grid.html)
- [Etsy Blameless Post-Mortem](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
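
The four Principles of Chaos Engineering steps listed above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not how Chaos Monkey or Gremlin work: the "system" is a pool of simulated servers, the steady-state metric is request success rate, and the names `run_group`, `chaos_experiment`, and `tolerance` are hypothetical.

```python
import random


def run_group(requests, servers, crashed, rng):
    """Return the success rate (the assumed 'steady state' metric) for one group.

    Each request goes to a random server. If that server is crashed,
    the request is retried once on another randomly chosen server.
    """
    ok = 0
    for _ in range(requests):
        for _attempt in range(2):  # first try plus one retry
            if rng.randrange(servers) not in crashed:
                ok += 1
                break
    return ok / requests


def chaos_experiment(crashed, requests=10_000, servers=10, tolerance=0.05, seed=42):
    """Run a control group and an experimental group and compare steady state."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # 1. Steady state: success rate of requests.
    # 2. Hypothesis: the success rate stays the same in both groups.
    control = run_group(requests, servers, crashed=set(), rng=rng)

    # 3. Introduce a real-world variable: some servers crash.
    experimental = run_group(requests, servers, crashed=set(crashed), rng=rng)

    # 4. Try to disprove the hypothesis by comparing the two groups.
    hypothesis_holds = abs(control - experimental) <= tolerance
    return control, experimental, hypothesis_holds


if __name__ == "__main__":
    for crashed in ({0}, set(range(9))):
        control, experimental, holds = chaos_experiment(crashed)
        print(f"crashed={len(crashed)} control={control:.3f} "
              f"experimental={experimental:.3f} hypothesis_holds={holds}")
```

With one crashed server the single retry absorbs most failures, so the hypothesis survives within the assumed tolerance. With nine of ten servers down it is disproved, which is the kind of systemic weakness an experiment like this is meant to surface.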

## How to get started and join the movement
* [After-Action Review](https://en.wikipedia.org/wiki/After-action_review):
* What was expected to happen?
* What actually happened?
* Why were these different?
* What has been learned?
* John Boyd’s [OODA Loop](https://en.wikipedia.org/wiki/OODA_loop)
* [Situation Awareness](https://en.wikipedia.org/wiki/Situation_awareness#Theoretical_model)
* [Safety II](https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf)
* [FMEA](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis)
* [STPA/CAST Handbooks](http://psas.scripts.mit.edu/home/materials/)
* [Veracode State of the Software V9](https://info.veracode.com/report-state-of-software-security-volume-9.html)

- [After-Action Review](https://en.wikipedia.org/wiki/After-action_review):
- What was expected to happen?
- What actually happened?
- Why were these different?
- What has been learned?
- John Boyd’s [OODA Loop](https://en.wikipedia.org/wiki/OODA_loop)
- [Situation Awareness](https://en.wikipedia.org/wiki/Situation_awareness#Theoretical_model)
- [Safety II](https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf)
- [FMEA](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis)
- [STPA/CAST Handbooks](http://psas.scripts.mit.edu/home/materials/)
- [Veracode State of the Software V9](https://info.veracode.com/report-state-of-software-security-volume-9.html)
82 changes: 43 additions & 39 deletions _posts/2020-05-01-chaos-resilience-secure360.md
@@ -11,49 +11,53 @@ I'm speaking at [Secure360](https://secure360.org) on May 5, 2020, presenting an
My story is told in three acts: My journey to find chaos engineering (ACT I), Chaos engineering and how resilience engineering complements it (ACT II), What I’ve learned so far (ACT III), and How to get started with chaos & resilience engineering (END).

## ACT I: My Journey to Chaos Engineering
* [Risk Homeostasis](https://en.wikipedia.org/wiki/Risk_compensation#Risk_homeostasis)
* [The Checklist Manifesto](https://en.wikipedia.org/wiki/The_Checklist_Manifesto)
* [How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf) ([video](https://www.youtube.com/watch?v=2S0k12uZR14))
* [Engineering a Safer World](https://mitpress.mit.edu/books/engineering-safer-world)
* [STAMP/STPA/CAST](https://psas.scripts.mit.edu/home/)
* [TCD: Managing Risk and System Change](https://psychology.tcd.ie/postgraduate/msc-riskandchange/)
* [Lund: Human Factors & System Safety](https://www.humanfactors.lth.se)

- [Risk Homeostasis](https://en.wikipedia.org/wiki/Risk_compensation#Risk_homeostasis)
- [The Checklist Manifesto](https://en.wikipedia.org/wiki/The_Checklist_Manifesto)
- [How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf) ([video](https://www.youtube.com/watch?v=2S0k12uZR14))
- [Engineering a Safer World](https://mitpress.mit.edu/books/engineering-safer-world)
- [STAMP/STPA/CAST](https://psas.scripts.mit.edu/home/)
- [TCD: Managing Risk and System Change](https://psychology.tcd.ie/postgraduate/msc-riskandchange/)
- [Lund: Human Factors & System Safety](https://www.humanfactors.lth.se)

## ACT II: Chaos & Resilience Engineering
* [Chaos Monkey](https://github.com/Netflix/chaosmonkey)
* [Simian Army](https://github.com/Netflix/SimianArmy) (retired)
* [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781491988459/) Book
* [Chaos Engineering: System Resiliency in Practice](http://shop.oreilly.com/product/0636920203957.do) (new book, 2020)
* [Principles of Chaos Engineering](https://principlesofchaos.org)
* [Resilience Engineering](https://www.crcpress.com/Resilience-Engineering-Concepts-and-Precepts/Woods-Hollnagel/p/book/9780754649045) Book
* [The Four Potentials of Resilience](https://erikhollnagel.com/ideas/resilience%20assessment%20grid.html)
* [Etsy Blameless Post-Mortem](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)

- [Chaos Monkey](https://github.com/Netflix/chaosmonkey)
- [Simian Army](https://github.com/Netflix/SimianArmy) (retired)
- [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781491988459/) Book
- [Chaos Engineering: System Resiliency in Practice](http://shop.oreilly.com/product/0636920203957.do) (new book, 2020)
- [Principles of Chaos Engineering](https://principlesofchaos.org)
- [Resilience Engineering](https://www.crcpress.com/Resilience-Engineering-Concepts-and-Precepts/Woods-Hollnagel/p/book/9780754649045) Book
- [The Four Potentials of Resilience](https://erikhollnagel.com/ideas/resilience%20assessment%20grid.html)
- [Etsy Blameless Post-Mortem](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)

## ACT III: What I’ve learned so far
* Lesson 1: Incident Management Teams in Technology are similar to those in Oil & Gas
* Crichton, M. T., Lauche, K., & Flin, R. (2005). [Incident Command Skills in the Management of an Oil Industry Drilling Incident: a Case Study](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-5973.2005.00466.x) ([PDF](https://www.academia.edu/38675561/Incident_Command_Skills_in_the_Management_of_an_Oil_Industry_Drilling_Incident_a_Case_Study))
* *Muhren, W. J., van den Eede, G. G. P., & van de Walle, B. A. (2007). [Organizational learning for

- Lesson 1: Incident Management Teams in Technology are similar to those in Oil & Gas
- Crichton, M. T., Lauche, K., & Flin, R. (2005). [Incident Command Skills in the Management of an Oil Industry Drilling Incident: a Case Study](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-5973.2005.00466.x) ([PDF](https://www.academia.edu/38675561/Incident_Command_Skills_in_the_Management_of_an_Oil_Industry_Drilling_Incident_a_Case_Study))
- *Muhren, W. J., van den Eede, G. G. P., & van de Walle, B. A. (2007). [Organizational learning for
the incident management process](https://research.tilburguniversity.edu/en/publications/organizational-learning-for-the-incident-management-process-lesso) ([PDF](https://aisel.aisnet.org/cgi/viewcontent.cgi?article=1131&context=ecis2007))*
* [Situation Awareness](https://en.wikipedia.org/wiki/Situation_awareness#Theoretical_model)
* Dossier 1: A socio-technical case study of an IT major incident management team
* Lesson 2: Safety has risk assessment methods that can be applied to computer systems
* [NIST 800-30](https://csrc.nist.gov/publications/detail/sp/800-30/rev-1/final)
* [STPA Handbook](http://psas.scripts.mit.edu/home/materials/)
* [FMEA](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis) (Failure mode and effects analysis)
* [GameDay Discussion](https://queue.acm.org/detail.cfm?id=2371297) (2012)
* Dossier 3: A comparison of NIST and STPA risk assessment methods applied to an informational website
* Lesson 3: Changes cause outages
* [Wikipedia: Downtime](https://en.wikipedia.org/wiki/Downtime)
- [Situation Awareness](https://en.wikipedia.org/wiki/Situation_awareness#Theoretical_model)
- Dossier 1: A sociotechnical case study of an IT major incident management team
- Lesson 2: Safety has risk assessment methods that can be applied to computer systems
- [NIST 800-30](https://csrc.nist.gov/publications/detail/sp/800-30/rev-1/final)
- [STPA Handbook](http://psas.scripts.mit.edu/home/materials/)
- [FMEA](https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis) (Failure mode and effects analysis)
- [GameDay Discussion](https://queue.acm.org/detail.cfm?id=2371297) (2012)
- Dossier 3: A comparison of NIST and STPA risk assessment methods applied to an informational website
- Lesson 3: Changes cause outages
- [Wikipedia: Downtime](https://en.wikipedia.org/wiki/Downtime)

## END: How to get started with chaos & resilience engineering
* Chaos Engineering -- break stuff
* [Twin Cities Chaos Engineering Community](https://www.meetup.com/Twin-Cities-Chaos-Engineering-Community/)
* [Awesome Chaos Engineering](https://github.com/dastergon/awesome-chaos-engineering)
* [Gremlin](https://www.gremlin.com)
* Resilience Engineering -- fix stuff
* [Resilience engineering papers](https://github.com/lorin/resilience-engineering) ([Where do I start?](https://github.com/lorin/resilience-engineering/blob/master/intro.md) page)
* [Learning from Incidents in Software](https://www.learningfromincidents.io)
* *DevOps -- build stuff*
* *[Google DevOps](https://cloud.google.com/devops)*
* *[Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations](https://itrevolution.com/book/accelerate/) - [Nicole Forsgren](https://nicolefv.com), Jez Humble, Gene Kim*
* information-safety.org [resources](/resources/)

- Chaos Engineering -- break stuff
- [Twin Cities Chaos Engineering Community](https://www.meetup.com/Twin-Cities-Chaos-Engineering-Community/)
- [Awesome Chaos Engineering](https://github.com/dastergon/awesome-chaos-engineering)
- [Gremlin](https://www.gremlin.com)
- Resilience Engineering -- fix stuff
- [Resilience engineering papers](https://github.com/lorin/resilience-engineering) ([Where do I start?](https://github.com/lorin/resilience-engineering/blob/master/intro.md) page)
- [Learning from Incidents in Software](https://www.learningfromincidents.io)
- *DevOps -- build stuff*
- *[Google DevOps](https://cloud.google.com/devops)*
- *[Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations](https://itrevolution.com/book/accelerate/) - [Nicole Forsgren](https://nicolefv.com), Jez Humble, Gene Kim*
- information-safety.org [resources](/resources/)
13 changes: 7 additions & 6 deletions _posts/2020-07-12-failover-conf.md
@@ -11,9 +11,10 @@ The highlights for me were two talks on Site Reliability Engineering (SRE) by Je
The downside of the conference was the unusually high number of marketing emails participants received; I mean, I know it's a *free* conference, but even Gremlin admitted there were too many. Thankfully, you can watch all the talks without registration [here](https://www.youtube.com/playlist?list=PLLIx5ktghjqItStdp_NUh3CQ_y4M49Gb1).

The conference also had a dedicated Slack for discussion during and after the talks, which was for me at least as interesting as the talks themselves. From the Slack discussion, I got recommendations on some additional academic reading on Resilience Engineering from J Paul Reed, which I am sharing here:
* [REdeploy](https://re-deploy.io/)
* [SNAFUcatchers](https://snafucatchers.github.io/)
* [Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems](https://www.researchgate.net/publication/333091997_Approaching_Overload_Diagnosis_and_Response_to_Anomalies_in_Complex_and_Automated_Production_Software_Systems) (masters thesis)
* [Managing the Hidden Costs of Coordination](https://queue.acm.org/detail.cfm?id=3380779)
* [ACM Queue Vol. 17 No. 6, Human Factors](https://queue.acm.org/issuedetail.cfm?issue=3380774) - includes the article above
* [Maps, Context, and Tribal Knowledge: On the Structure and Use of Post-Incident Analysis Artifacts in Software Development and Operations](https://jpaulreed.com/jpaulreed-lund-thesis-v1_1.pdf) (I think this was actually recommended by John Allspaw, also available at [Lund](https://lup.lub.lu.se/student-papers/search/publication/8966930))

- [REdeploy](https://re-deploy.io/)
- [SNAFUcatchers](https://snafucatchers.github.io/)
- [Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems](https://www.researchgate.net/publication/333091997_Approaching_Overload_Diagnosis_and_Response_to_Anomalies_in_Complex_and_Automated_Production_Software_Systems) (masters thesis)
- [Managing the Hidden Costs of Coordination](https://queue.acm.org/detail.cfm?id=3380779)
- [ACM Queue Vol. 17 No. 6, Human Factors](https://queue.acm.org/issuedetail.cfm?issue=3380774) - includes the article above
- [Maps, Context, and Tribal Knowledge: On the Structure and Use of Post-Incident Analysis Artifacts in Software Development and Operations](https://jpaulreed.com/jpaulreed-lund-thesis-v1_1.pdf) (I think this was actually recommended by John Allspaw, also available at [Lund](https://lup.lub.lu.se/student-papers/search/publication/8966930))