Table of Contents
- Foreword
- Preface
- Part I - Introduction
- Chapter 1 - Introduction
- Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE
- Part II - Principles
- Chapter 3 - Embracing Risk
- Chapter 4 - Service Level Objectives
- Chapter 5 - Eliminating Toil
- Chapter 6 - Monitoring Distributed Systems
- Chapter 7 - The Evolution of Automation at Google
- Chapter 8 - Release Engineering
- Chapter 9 - Simplicity
- Part III - Practices
- Chapter 10 - Practical Alerting
- Chapter 11 - Being On-Call
- Chapter 12 - Effective Troubleshooting
- Chapter 13 - Emergency Response
- Chapter 14 - Managing Incidents
- Chapter 15 - Postmortem Culture: Learning from Failure
- Chapter 16 - Tracking Outages
- Chapter 17 - Testing for Reliability
- Chapter 18 - Software Engineering in SRE
- Chapter 19 - Load Balancing at the Frontend
- Chapter 20 - Load Balancing in the Datacenter
- Chapter 21 - Handling Overload
- Chapter 22 - Addressing Cascading Failures
- Chapter 23 - Managing Critical State: Distributed Consensus for Reliability
- Chapter 24 - Distributed Periodic Scheduling with Cron
- Chapter 25 - Data Processing Pipelines
- Chapter 26 - Data Integrity: What You Read Is What You Wrote
- Chapter 27 - Reliable Product Launches at Scale
- Part IV - Management
- Chapter 28 - Accelerating SREs to On-Call and Beyond
- Chapter 29 - Dealing with Interrupts
- Chapter 30 - Embedding an SRE to Recover from Operational Overload
- Chapter 31 - Communication and Collaboration in SRE
- Chapter 32 - The Evolving SRE Engagement Model
- Part V - Conclusions
- Chapter 33 - Lessons Learned from Other Industries
- Chapter 34 - Conclusion
- Appendix A - Availability Table
- Appendix B - A Collection of Best Practices for Production Services
- Appendix C - Example Incident State Document
- Appendix D - Example Postmortem
- Appendix E - Launch Coordination Checklist
- Appendix F - Example Production Meeting Minutes
- Bibliography