Highly recommended
For more details, see https://www.oreilly.com/library/view/software-engineering-at/9781492082781/
Google is huge and has accumulated a lot of expertise – from programming a single computer through to managing fleets of servers, with processes being moved from one server to another. They have a metaphor for this: when you start off small, you treat servers like pets, giving each one a lot of time and attention. As you increase in scale, you end up treating servers like cattle – anonymous and interchangeable. From this book’s point of view, Software Engineering is more rigorous than programming and is heavily supported by tools and automation; the book covers the past two decades of Google’s experience of applying Software Engineering at a very large scale. Quite often, their in-house tools (or variants of them) are available as open-source tools. Another frequently mentioned idea, especially when discussing increases in the scale of operations, is Hyrum’s law: ‘With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviours of your system will be depended on by somebody.’
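To make Hyrum’s law concrete, here is a minimal sketch of my own (not taken from the book). The function names and the user list are hypothetical. Both implementations honour the same documented contract, yet a caller that leans on an unpromised behaviour (the ordering) changes its answer when the implementation changes.

```python
def list_users_v1(users):
    """Contract: return every user name; the order is NOT promised."""
    return list(users)  # happens to preserve insertion order


def list_users_v2(users):
    """Same contract, new implementation: names now come back sorted."""
    return sorted(users)


def fragile_newest_user(list_users, users):
    # Depends on an observable but unpromised behaviour:
    # "the last item is the most recently added user".
    return list_users(users)[-1]


def robust_has_user(list_users, users, name):
    # Depends only on what the contract promises: membership.
    return name in list_users(users)


users = ["zoe", "adam", "mia"]  # hypothetical data; "mia" was added last

print(fragile_newest_user(list_users_v1, users))  # mia
print(fragile_newest_user(list_users_v2, users))  # zoe
```

Neither version breaks its contract, but the fragile caller gives a different answer after the switch. With enough callers, somebody will have written the fragile one.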
So it should not be a surprise that this is a huge topic and an eye-opening book, spanning 25 chapters – yet economically written. It has one curator, two main authors and lots of contributing authors. It starts off with its thesis and then spends the next three parts of the book (Culture, Processes, Tools) building on it. For those who need an aide memoire or an executive summary, each chapter has a ‘TL;DR’ summary at its end.
There are some really good ideas for sharing knowledge throughout Google. Every decision seems to trade off costs (money, machine resources, personnel costs, cost of acting vs not acting, impact on society). Decisions are made using data/evidence, precedent, and argument.
Part II, Culture (6 chapters), covers some interesting topics – working in teams, knowledge sharing, catering for a diverse set of customers, team leadership (for small and large teams) and, finally, gathering data on engineers’ productivity.
The ‘How to Work Well on Teams’ chapter covers how developers’ insecurities can have a bad effect on the projects they work on. It promotes the benefits of working in teams (as opposed to the lone wolf software developer). Being able to work with others is heavily emphasised, boiled down to ‘humility, respect, and trust’.
The ‘Engineering for Equity’ chapter is all about the impact that Software Engineering has on ‘under represented and diverse societies’ and this is something that Google is still learning to do. ‘Build for everyone’ is an example of a Google brand statement that is more of an aspiration than an achievement. It seems that a Hippocratic Oath will be a requirement for future engineers – but if you are a single engineer in a large organisation, how can you be sure your work won’t be used for evil purposes?
The ‘How to Lead a Team’ chapter covers the challenges faced by Engineering Management and Technical Leads, how to become a leader, the importance of ‘humility, respect, and trust’ and some useful leadership patterns and anti-patterns.
The ‘Leading at Scale’ chapter builds on its predecessor, dealing with the leadership of multiple related teams and exploring three maxims – ‘Always be deciding’, ‘Always be leaving’, and ‘Always be scaling’. This topic is out of my league but the advice given seems sound.
The ‘Measuring Engineering Productivity’ chapter covers a thorny topic. For Google to grow and properly scale up, it embraced data-driven decisions and automating engineers’ processes with in-house tests. It also considers whether a particular thing is worth measuring at all, and recommends supplementing raw data with anecdotes, structured interviews, and case studies. The Goals / Signals / Metrics framework is also covered.
Part III, Processes (8 chapters), starts off with a chapter on ‘Style Guides and Rules’ – showing how to use and develop them within an organisation. In particular, automation using tools is used to enforce them – guides are recommendations, rules are laws that must be adhered to – or otherwise have exceptions granted on a need-to-use basis. There are all sorts of goals for the use of rules, and in this case, it is keeping Google’s vast codebase (more than 2 billion lines of code) resilient to change and keeping their tens of thousands of engineers productive. A lot of programming experience resides in Google and a lot of effort is spent sharing and refining that knowledge.
The ‘Code Review’ chapter discusses this topic in general and Google’s approach – using Critique, their in-house tool. They have so much code written that some of their engineers have a saying, “If you’re writing it from scratch you’re doing it wrong!”. It is also an opportunity for code authors and reviewers to share knowledge.
The ‘Documentation’ chapter covers everything from in-source comments through to stand-alone documents and why Google prefers to treat them the same as source code. A key aim throughout is to know your audience and, instead of writing a single ‘covers everything’ document, to provide documents targeting different types of readers.
The next four chapters cover Testing – an overview of it, unit-testing, mocking (‘test doubles’), and really large scale testing. In fact, they could be a separate book in their own right.
The ‘Testing Overview’ chapter covers developer-driven, automated testing and why Google does it – higher quality, resilience to change, reducing the cost of Quality Assurance.
The ‘Unit Testing’ chapter reveals that most (80%) of the tests at Google are unit tests and discusses their pros and cons by way of Google’s best practices.
The ‘Test Doubles’ chapter describes how Google got its fingers burnt using test doubles and covers how to successfully use them. It emphasises that code must be written with test doubles in mind because retrofitting tests is a painful exercise and that, in general, this type of testing is difficult to scale.
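As an illustration of what ‘written with test doubles in mind’ can look like, here is a minimal sketch of my own, not code from the book. The class and method names (PaymentProcessor, FakeCreditCardService, charge) are hypothetical. The point is that the dependency is passed in rather than created internally, so a test can substitute a lightweight fake.

```python
class PaymentProcessor:
    """Production code that receives its dependency instead of building it."""

    def __init__(self, credit_card_service):
        # The injected service is the seam that lets a test swap in a double.
        self._service = credit_card_service

    def make_payment(self, card_number, amount_pence):
        if amount_pence <= 0:
            return False  # reject before touching the external service
        return self._service.charge(card_number, amount_pence)


class FakeCreditCardService:
    """Test double: records charges in memory rather than calling a bank."""

    def __init__(self):
        self.charges = []

    def charge(self, card_number, amount_pence):
        self.charges.append((card_number, amount_pence))
        return True


# Usage in a test: no network, no real card, fully deterministic.
fake = FakeCreditCardService()
processor = PaymentProcessor(fake)
accepted = processor.make_payment("4111", 500)
rejected = processor.make_payment("4111", 0)
```

Had PaymentProcessor constructed a real service inside its constructor, a test could not replace it without changing the production code, which is the retrofitting pain described above.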
The ‘Larger Testing’ chapter covers tests that provide confidence that overall a system works as intended. It notes the role of configuration changes as the main reason for major outages and that they should be version controlled. It goes on to show that large scale tests are… complicated.
The ‘Deprecation’ chapter covers the orderly removal of obsolete systems – from internal APIs to entire software stacks. It notes that they have found migrating to entirely new systems extremely expensive and that refactoring an existing system is worth considering. One process for finding products that use a deprecated system is to simply have planned outages and see what breaks. It covers the social and technical challenges of deprecation.
Part IV, Tools, takes up nearly half of this book, spanning 10 chapters, and could be another book in its own right. The ‘Version Control and Branch Management’ chapter explains that Google uses Piper, its in-house VCS, to meet the challenge of supporting 50,000 engineers working on 80 terabytes of content and metadata, billions of lines of code, and 60,000+ commits per day, all in a single repository. They do release branches, but pretty much everything else is done in that repository.
The ‘Code Search’ chapter covers Google’s in-house tool for browsing and searching their codebase (also called Code Search) which seems incredibly useful and part of Google’s ‘secret sauce’. As well as searching the current state of code it can search the historical states of the codebase – including configuration settings. The pros and cons of different indexing strategies are discussed. Code Search is an important productivity tool because engineers spend more time reading and searching code than actually writing it.
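To give a flavour of what an indexing strategy involves, here is a minimal sketch of one well-known approach, a trigram index. It is my own toy illustration and not Google’s implementation; the function names and file contents are hypothetical. The index narrows the search to candidate files that contain every three-character fragment of the query, and only those candidates are then scanned for the full string.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file paths that contain it."""
    index = defaultdict(set)
    for path, content in files.items():
        for gram in trigrams(content):
            index[gram].add(path)
    return index


def search(index, files, query):
    """Find files containing query, using the index to prune candidates."""
    grams = trigrams(query)
    if not grams:
        candidates = set(files)  # query too short for the index: scan all
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all trigrams does not guarantee a match, so verify each one.
    return sorted(path for path in candidates if query in files[path])


files = {  # hypothetical miniature codebase
    "a.py": "def parse_config(path): ...",
    "b.py": "config = load()",
    "c.py": "print('hello')",
}
index = build_index(files)
print(search(index, files, "config"))  # ['a.py', 'b.py']
print(search(index, files, "parse_"))  # ['a.py']
```

The trade-off is typical of the ones the chapter weighs: the index costs space and build time up front, in exchange for not having to read every file on every query.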
The ‘Build Systems and Build Philosophy’ chapter discusses how conventional Build tools don’t handle the scale that Google works at. Their response was to build a new tool, Blaze (an open source version called Bazel is available). It seeks to build systems across multi-core, multi-system environments.
The next chapter covers code reviews at Google – both manual and automatic, supported by more tools (such as Critique). Critique is tightly bound to Google’s monorepo, so for their FLOSS projects, they built a similar tool, Gerrit, which, in turn, is tightly bound to Git.
The ‘Static Analysis’ chapter documents Google’s exploits – how to make static analysis tools scale – especially by integrating them with the daily workflow and by having a feedback loop between developers and the people providing their tools.
The ‘Dependency Management’ chapter covers what is commonly referred to as ‘DLL Hell’ – but on a larger scale. It is mainly an account of what does not work at Google’s scale and how this is a serious, ongoing problem. The perils of open sourcing a project are covered as well.
The ‘Large Scale Changes’ chapter is all about changes that ‘are logically related but cannot practically be submitted as a single unit’ – and, as is their way, they heavily use automation.
The ‘Continuous Integration’ and ‘Continuous Delivery’ chapters, like other chapters, cover a difficult topic in a frank manner, from Google’s point of view, with examples of their own infrastructure.
Finally, the ‘Compute as a Service’ chapter covers Google’s journey in this area. As ever, automation plays a heavy role – from a single server through to automating turning up a datacentre. The difficulties of starting off small (treating servers like pets – human-intensive system administration) and gradually increasing in scale (treating servers like cattle – automation-intensive system administration) are discussed.
Criticisms. This book could do with a glossary – especially of Google-specific phrases and project names. As would be expected, plenty of websites are cited. Unfortunately, some of them are abbreviated (e.g. oreil.ly/k3CJx) and are a nightmare to type in. Also, this is the second book I’ve read that cites a book by Kevlin Henney and gets his name wrong.
Conclusions:
- Contains ideas for infrastructure along with what Google did and how their systems had to evolve over time.
- Lots of Enterprise-level Software Engineering advice, backed up by anecdotes, with TL;DRs for those who need an executive summary.