Bibliography
-
[Ada15] Bram Adams, Stephany Bellomo, Christian Bird, Tamara Marshall-Keim, Foutse Khomh, and Kim Moir, "The Practice and Future of Release Engineering: A Roundtable with Three Release Engineers", IEEE Software, vol. 32, no. 2 (March/April 2015), pp. 42–49.
-
[Agu10] M. K. Aguilera, "Stumbling over Consensus Research: Misunderstandings and Issues", in Replication, Lecture Notes in Computer Science 5959, 2010.
-
[All10] J. Allspaw and J. Robbins, Web Operations: Keeping the Data on Time: O’Reilly, 2010.
-
[All12] J. Allspaw, "Blameless PostMortems and a Just Culture", blog post, 2012.
-
[All15] J. Allspaw, "Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages", MSc thesis, Lund University, 2015.
-
[Ana07] S. Anantharaju, "Automating web application security testing", blog post, July 2007.
-
[Ana13] R. Ananthanarayanan et al., "Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams", in SIGMOD '13, 2013.
-
[And05] A. Andrieux, K. Czajkowski, A. Dan, et al., "Web Services Agreement Specification (WS-Agreement)", September 2005.
-
[Bai13] P. Bailis and A. Ghodsi, "Eventual Consistency Today: Limitations, Extensions, and Beyond", in ACM Queue, vol. 11, no. 3, 2013.
-
[Bai83] L. Bainbridge, "Ironies of Automation", in Automatica, vol. 19, no. 6, November 1983.
-
[Bak11] J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services", in Proceedings of the Conference on Innovative Data System Research, 2011.
-
[Bar11] L. A. Barroso, "Warehouse-Scale Computing: Entering the Teenage Decade", talk at 38th Annual Symposium on Computer Architecture, video available online, 2011.
-
[Bar13] L. A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition, Morgan & Claypool, 2013.
-
[Ben12] C. Bennett and A. Tseitlin, "Chaos Monkey Released Into The Wild", blog post, July 2012.
-
[Bla14] M. Bland, "Goto Fail, Heartbleed, and Unit Testing Culture", blog post, June 2014.
-
[Boc15] L. Bock, Work Rules!, Twelve Books, 2015.
-
[Bol11] W. J. Bolosky, D. Bradshaw, R. B. Haagens, N. P. Kusters, and P. Li, "Paxos Replicated State Machines as the Basis of a High-Performance Data Store", in Proc. NSDI 2011, 2011.
-
[Boy13] P. G. Boysen, "Just Culture: A Foundation for Balanced Accountability and Patient Safety", in The Ochsner Journal, Fall 2013.
-
[Bra15] VM Brasseur, "Failure: Why it happens & How to benefit from it", YAPC 2015.
-
[Bre01] E. Brewer, "Lessons From Giant-Scale Services", in IEEE Internet Computing, vol. 5, no. 4, July / August 2001.
-
[Bre12] E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed", in Computer, vol. 45, no. 2, February 2012.
-
[Bro15] M. Brooker, "Exponential Backoff and Jitter", on AWS Architecture Blog, March 2015.
-
[Bro95] F. P. Brooks Jr., "No Silver Bullet—Essence and Accidents of Software Engineering", in The Mythical Man-Month, Boston: Addison-Wesley, 1995, pp. 180–186.
-
[Bru09] J. Brutlag, "Speed Matters", on Google Research Blog, June 2009.
-
[Bul80] G. M. Bull, The Dartmouth Time-sharing System: Ellis Horwood, 1980.
-
[Bur99] M. Burgess, Principles of Network and System Administration: Wiley, 1999.
-
[Bur06] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems", in OSDI '06: Seventh Symposium on Operating System Design and Implementation, November 2006.
-
[Bur16] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and Kubernetes", in ACM Queue, vol. 14, no. 1, 2016.
-
[Cas99] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance", in Proc. OSDI 1999, 1999.
-
[Cha10] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum, "FlumeJava: Easy, Efficient Data-Parallel Pipelines", in ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.
-
[Cha96] T. D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems", in J. ACM, 1996.
-
[Cha07] T. Chandra, R. Griesemer, and J. Redstone, "Paxos Made Live—An Engineering Perspective", in PODC '07: 26th ACM Symposium on Principles of Distributed Computing, 2007.
-
[Cha06] F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data", in OSDI '06: Seventh Symposium on Operating System Design and Implementation, November 2006.
-
[Chr09] G. P. Chrousos, "Stress and Disorders of the Stress System", in Nature Reviews Endocrinology, vol. 5, no. 7, 2009.
-
[Clos53] C. Clos, "A Study of Non-Blocking Switching Networks", in Bell System Technical Journal, vol. 32, no. 2, 1953.
-
[Con15] C. Contavalli, W. van der Gaast, D. Lawrence, and W. Kumari, "Client Subnet in DNS Queries", IETF Internet-Draft, 2015.
-
[Con63] M. E. Conway, "Design of a Separable Transition-Diagram Compiler", in Commun. ACM 6, 7 (July 1963), 396–408.
-
[Con96] P. Conway, "Preservation in the Digital World", report published by the Council on Library and Information Resources, 1996.
-
[Coo00] R. I. Cook, "How Complex Systems Fail", in Web Operations: O’Reilly, 2010.
-
[Cor12] J. C. Corbett et al., "Spanner: Google’s Globally-Distributed Database", in OSDI '12: Tenth Symposium on Operating System Design and Implementation, October 2012.
-
[Cra10] J. Cranmer, "Visualizing code coverage", blog post, March 2010.
-
[Dea13] J. Dean and L. A. Barroso, "The Tail at Scale", in Communications of the ACM, vol. 56, 2013.
-
[Dea04] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI’04: Sixth Symposium on Operating System Design and Implementation, December 2004.
-
[Dea07] J. Dean, "Software Engineering Advice from Building Large-Scale Distributed Systems", Stanford CS297 class lecture, Spring 2007.
-
[Dek02] S. Dekker, "Reconstructing human contributions to accidents: the new view on error and performance", in Journal of Safety Research, vol. 33, no. 3, 2002.
-
[Dek14] S. Dekker, The Field Guide to Understanding "Human Error", 3rd edition: Ashgate, 2014.
-
[Dic14] C. Dickson, "How Embracing Continuous Release Reduced Change Complexity", presentation at USENIX Release Engineering Summit West 2014, video available online.
-
[Dur05] J. Durmer and D. Dinges, "Neurocognitive Consequences of Sleep Deprivation", in Seminars in Neurology, vol. 25, no. 1, 2005.
-
[Eis16] D. E. Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer", in NSDI '16: 13th USENIX Symposium on Networked Systems Design and Implementation, March 2016.
-
[Ere03] J. R. Erenkrantz, "Release Management Within Open Source Projects", in Proceedings of the 3rd Workshop on Open Source Software Engineering, Portland, Oregon, May 2003.
-
[Fis85] M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process", J. ACM, 1985.
-
[Fit12] B. W. Fitzpatrick and B. Collins-Sussman, Team Geek: A Software Developer’s Guide to Working Well with Others: O’Reilly, 2012.
-
[Flo94] S. Floyd and V. Jacobson, "The Synchronization of Periodic Routing Messages", in IEEE/ACM Transactions on Networking, vol. 2, issue 2, April 1994, pp. 122–136.
-
[For10] D. Ford et al., "Availability in Globally Distributed Storage Systems", in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.
-
[Fox99] A. Fox and E. A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", in Proceedings of the 7th Workshop on Hot Topics in Operating Systems, Rio Rico, Arizona, March 1999.
-
[Fow08] M. Fowler, "GUI Architectures", blog post, 2006.
-
[Gal78] J. Gall, SYSTEMANTICS: How Systems Really Work and How They Fail, 1st ed., Pocket, 1977.
-
[Gal03] J. Gall, The Systems Bible: The Beginner’s Guide to Systems Large and Small, 3rd ed., General Systemantics Press/Liberty, 2003.
-
[Gaw09] A. Gawande, The Checklist Manifesto: How to Get Things Right: Henry Holt and Company, 2009.
-
[Ghe03] S. Ghemawat, H. Gobioff, and S-T. Leung, "The Google File System", in 19th ACM Symposium on Operating Systems Principles, October 2003.
-
[Gil02] S. Gilbert and N. Lynch, "Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", in ACM SIGACT News, vol. 33, no. 2, 2002.
-
[Gla02] R. Glass, Facts and Fallacies of Software Engineering, Addison-Wesley Professional, 2002.
-
[Gol14] W. Golab et al., "Eventually Consistent: Not What You Were Expecting?", in ACM Queue, vol. 12, no. 1, 2014.
-
[Gra09] P. Graham, "Maker’s Schedule, Manager’s Schedule", blog post, July 2009.
-
[Gup15] A. Gupta and J. Shute, "High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads", in Workshop on Business Intelligence for the Real Time Enterprise, 2015.
-
[Ham07] J. Hamilton, "On Designing and Deploying Internet-Scale Services", in Proceedings of the 21st Large Installation System Administration Conference, November 2007.
-
[Han94] S. Hanks, T. Li, D. Farinacci, and P. Traina, "Generic Routing Encapsulation over IPv4 networks", IETF Informational RFC, 1994.
-
[Hic11] M. Hickins, "Tape Rescues Google in Lost Email Scare", in Digits, Wall Street Journal, 1 March 2011.
-
[Hix15a] D. Hixson, "Capacity Planning", in ;login:, vol. 40, no. 1, February 2015.
-
[Hix15b] D. Hixson, "The Systems Engineering Side of Site Reliability Engineering", in ;login:, vol. 40, no. 3, June 2015.
-
[Hod13] J. Hodges, "Notes on Distributed Systems for Young Bloods", blog post, 14 January 2013.
-
[Hol14] L. Holmwood, "Applying Cardiac Alarm Management Techniques to Your On-Call", blog post, 26 August 2014.
-
[Hum06] J. Humble, C. Read, and D. North, "The Deployment Production Line", in Proceedings of the IEEE Agile Conference, July 2006.
-
[Hum10] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation: Addison-Wesley, 2010.
-
[Hun10] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems", in USENIX ATC, 2010.
-
[IAEA12] International Atomic Energy Agency, "Safety of Nuclear Power Plants: Design, SSR-2/1", 2012.
-
[Jai13] S. Jain et al., "B4: Experience with a Globally-Deployed Software Defined WAN", in SIGCOMM '13.
-
[Jon15] C. Jones, T. Underwood, and S. Nukala, "Hiring Site Reliability Engineers", in ;login:, vol. 40, no. 3, June 2015.
-
[Jun07] F. Junqueira, Y. Mao, and K. Marzullo, "Classic Paxos vs. Fast Paxos: Caveat Emptor", in Proc. HotDep '07, 2007.
-
[Jun11] F. P. Junqueira, B. C. Reed, and M. Serafini, "Zab: High-performance broadcast for primary-backup systems", in 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), 27 June 2011, pp. 245–256.
-
[Kah11] D. Kahneman, Thinking, Fast and Slow: Farrar, Straus and Giroux, 2011.
-
[Kar97] D. Karger et al., "Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web", in Proc. STOC '97, 29th annual ACM symposium on theory of computing, 1997.
-
[Kem11] C. Kemper, "Build in the Cloud: How the Build System Works", Google Engineering Tools blog post, August 2011.
-
[Ken12] S. Kendrick, "What Takes Us Down?", in ;login:, vol. 37, no. 5, October 2012.
-
[Kinc09] J. Kincaid, "T-Mobile Sidekick Disaster: Danger’s Servers Crashed, And They Don’t Have A Backup", TechCrunch, 10 October 2009, accessed 20 January 2015, https://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup
-
[Kin15] K. Kingsbury, "The trouble with timestamps", blog post, 2013.
-
[Kir08] J. Kirsch and Y. Amir, "Paxos for System Builders: An Overview", in Proc. LADIS '08, 2008.
-
[Kla12] R. Klau, "How Google Sets Goals: OKRs", blog post, October 2012.
-
[Kle06] D. V. Klein, "A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack", in Proceedings of the 20th Large Installation System Administration Conference, December 2006.
-
[Kle14] D. V. Klein, D. M. Betser, and M. G. Monroe, "Making Push On Green a Reality", in ;login:, vol. 39, no. 5, October 2014.
-
[Kra08] T. Krattenmaker, "Make Every Meeting Matter", in Harvard Business Review, February 27, 2008.
-
[Kre12] J. Kreps, "Getting Real About Distributed System Reliability", blog post, 19 March 2012.
-
[Kri12] K. Krishnan, "Weathering The Unexpected", in Communications of the ACM, vol. 55, no. 11, November 2012.
-
[Kum15] A. Kumar et al., "BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing", in SIGCOMM '15.
-
[Lam98] L. Lamport, "The Part-Time Parliament", in ACM Transactions on Computer Systems 16, 2, May 1998.
-
[Lam01] L. Lamport, "Paxos Made Simple", in ACM SIGACT News 121, December 2001.
-
[Lam06] L. Lamport, "Fast Paxos", in Distributed Computing 19.2, October 2006.
-
[Lim14] T. A. Limoncelli, S. R. Chalup, and C. J. Hogan, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2: Addison-Wesley, 2014.
-
[Loo10] J. Loomis, "How to Make Failure Beautiful: The Art and Science of Postmortems", in Web Operations: O’Reilly, 2010.
-
[Lu15] H. Lu et al., "Existential Consistency: Measuring and Understanding Consistency at Facebook", in SOSP '15, 2015.
-
[Mao08] Y. Mao, F. P. Junqueira, and K. Marzullo, "Mencius: Building Efficient Replicated State Machines for WANs", in OSDI '08, 2008.
-
[Mas43] A. H. Maslow, "A Theory of Human Motivation", in Psychological Review 50(4), 1943.
-
[Mau15] B. Maurer, "Fail at Scale", in ACM Queue, vol. 13, no. 12, 2015.
-
[May09] M. Mayer, "This site may harm your computer on every search result?!?!", blog post, January 2009.
-
[McI86] M. D. McIlroy, "A Research Unix Reader: Annotated Excerpts from the Programmer’s Manual, 1971–1986".
-
[McN13] D. McNutt, "Maintaining Consistency in a Massively Parallel Environment", presentation at USENIX Configuration Management Summit 2013, video available online.
-
[McN14a] D. McNutt, "Accelerating the Path from Dev to DevOps", in ;login:, vol. 39, no. 2, April 2014.
-
[McN14b] D. McNutt, "The 10 Commandments of Release Engineering", presentation at 2nd International Workshop on Release Engineering 2014, April 2014.
-
[McN14c] D. McNutt, "Distributing Software in a Massively Parallel Environment", presentation at USENIX LISA 2014, video available online.
-
[Mic03] Microsoft TechNet, "What is SNMP?", last modified March 28, 2003, https://technet.microsoft.com/en-us/library/cc776379%28v=ws.10%29.aspx.
-
[Mea08] D. Meadows, Thinking in Systems: Chelsea Green, 2008.
-
[Men07] P. Menage, "Adding Generic Process Containers to the Linux Kernel", in Proc. of Ottawa Linux Symposium, 2007.
-
[Mer11] N. Merchant, "Culture Trumps Strategy, Every Time", in Harvard Business Review, March 22, 2011.
-
[Moc87] P. Mockapetris, "Domain Names - Implementation and Specification", IETF Internet Standard, 1987.
-
[Mol86] C. Moler, "Matrix Computation on Distributed Memory Multiprocessors", in Hypercube Multiprocessors 1986, 1987.
-
[Mor12a] I. Moraru, D. G. Andersen, and M. Kaminsky, "Egalitarian Paxos", Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-108, 2012.
-
[Mor14] I. Moraru, D. G. Andersen, and M. Kaminsky, "Paxos Quorum Leases: Fast Reads Without Sacrificing Writes", in Proc. SOCC '14, 2014.
-
[Mor12b] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, and S. Bhansali, "Searching for Build Debt: Experiences Managing Technical Debt at Google", in Proceedings of the 3rd Int’l Workshop on Managing Technical Debt, 2012.
-
[Nar12] C. Narla and D. Salas, "Hermetic Servers", blog post, 2012.
-
[Nel14] B. Nelson, "The Data on Diversity", in Communications of the ACM, vol. 57, 2014.
-
[Nic12] K. Nichols and V. Jacobson, "Controlling Queue Delay", in ACM Queue, vol. 10, no. 5, 2012.
-
[Oco12] P. O’Connor and A. Kleyner, Practical Reliability Engineering, 5th edition: Wiley, 2012.
-
[Ohn88] T. Ohno, Toyota Production System: Beyond Large-Scale Production: Productivity Press, 1988.
-
[Ong14] D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)".
-
[Pen10] D. Peng and F. Dabek, "Large-scale Incremental Processing Using Distributed Transactions and Notifications", in Proc. of the 9th USENIX Symposium on Operating System Design and Implementation, November 2010.
-
[Per99] C. Perrow, Normal Accidents: Living with High-Risk Technologies, Princeton University Press, 1999.
-
[Per07] A. R. Perry, "Engineering Reliability into Web Sites: Google SRE", in Proc. of LinuxWorld 2007, 2007.
-
[Pik05] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall", in Scientific Programming Journal, vol. 13, no. 4, 2005.
-
[Pot16] R. Potvin and J. Levenberg, "The Motivation for a Monolithic Codebase: Why Google stores billions of lines of code in a single repository", in Communications of the ACM, vol. 59, no. 7, 2016. Video available on YouTube.
-
[Roo04] J. J. Rooney and L. N. Vanden Heuvel, "Root Cause Analysis for Beginners", in Quality Progress, July 2004.
-
[Sai39] A. de Saint-Exupéry, Terre des Hommes (Paris: Le Livre de Poche, 1939), in translation by Lewis Galantière as Wind, Sand and Stars.
-
[Sam14] R. R. Sambasivan, R. Fonseca, I. Shafer, and G. R. Ganger, "So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience", Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102, 2014.
-
[San11] N. Santos and A. Schiper, "Tuning Paxos for High-Throughput with Batching and Pipelining", in 13th Int’l Conf. on Distributed Computing and Networking, 2012.
-
[Sar97] N. B. Sarter, D. D. Woods, and C. E. Billings, "Automation Surprises", in Handbook of Human Factors & Ergonomics, 2nd edition, G. Salvendy (ed.), Wiley, 1997.
-
[Sch14] E. Schmidt, J. Rosenberg, and A. Eagle, How Google Works: Grand Central Publishing, 2014.
-
[Sch15] B. Schwartz, "The Factors That Impact Availability, Visualized", blog post, 21 December 2015.
-
[Sch90] F. B. Schneider, "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial", in ACM Computing Surveys, vol. 22, no. 4, 1990.
-
[Sec13] Securities and Exchange Commission, "Order In the Matter of Knight Capital Americas LLC", file 3-15570, 2013.
-
[Sha00] G. Shao, F. Berman, and R. Wolski, "Master/Slave Computing on the Grid", in Heterogeneous Computing Workshop, 2000.
-
[Shu13] J. Shute et al., "F1: A Distributed SQL Database That Scales", in Proc. VLDB 2013, 2013.
-
[Sig10] B. H. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", Google Technical Report, 2010.
-
[Sin15] A. Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network", in SIGCOMM '15.
-
[Skel13] M. Skelton, "Operability can Improve if Developers Write a Draft Run Book", blog post, 16 October 2013.
-
[Slo11] B. Treynor Sloss, "Gmail back soon for everyone", blog post, 28 February 2011.
-
[Tat99] S. Tatham, "How to Report Bugs Effectively", 1999.
-
[Ver15] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg", in Proceedings of the European Conference on Computer Systems, 2015.
-
[Wal89] D. R. Wallace and R. U. Fujii, "Software Verification and Validation: An Overview", IEEE Software, vol. 6, no. 3 (May 1989), pp. 10–17.
-
[War14] R. Ward and B. Beyer, "BeyondCorp: A New Approach to Enterprise Security", in ;login:, vol. 39, no. 6, December 2014.
-
[Whi12] J. A. Whittaker, J. Arbon, and J. Carollo, How Google Tests Software: Addison-Wesley, 2012.
-
[Woo96] A. Wood, "Predicting Software Reliability", in Computer, vol. 29, no. 11, 1996.
-
[Wri12a] H. K. Wright, "Release Engineering Processes, Their Faults and Failures" (section 7.2.2.2), PhD thesis, University of Texas at Austin, 2012.
-
[Wri12b] H. K. Wright and D. E. Perry, "Release Engineering Practices and Pitfalls", in Proceedings of the 34th International Conference on Software Engineering (ICSE '12), IEEE, 2012, pp. 1281–1284.
-
[Wri13] H. K. Wright, D. Jasper, M. Klimek, C. Carruth, and Z. Wan, "Large-Scale Automated Refactoring Using ClangMR", in Proceedings of the 29th International Conference on Software Maintenance (ICSM '13), IEEE, 2013, pp. 548–551.
-
[Yor11] N. York, "Build in the Cloud: Accessing Source Code", Google Engineering Tools blog post, June 2011.
-
[Zoo14] ZooKeeper Project (Apache Foundation), "ZooKeeper Recipes and Solutions", in ZooKeeper 3.4 documentation, 2014.