Site Reliability Engineering
30 Sep 2016

Taken as a whole, the book is good but not stellar.
Looking at individual chapters, as expected, some are really good, but some feel a lot like blog posts from the internal/external engineering teams copy/pasted into the book. There is no cohesion carried through the entire book, and sometimes it feels like many pieces were stitched together to aim for something greater than its parts, and there it falls rather short.
Overall it is a bit too long for my taste, which also costs it one “star” in my rating. I would recommend this book mostly to teams which are already quite advanced in their daily ops tasks and are not suffering from “cloud childhood issues”. On a fair note - also clearly stated in the SRE book itself - not all things are applicable or can be ported to smaller teams or smaller infrastructures, as it won’t make any technical or commercial sense.
The book does contain many good ideas and practices, though; I’ll put down a few of them:
- don’t try to have 100% availability, as none of the supporting systems (network, power grid, storage, and so on) has it; going from 99.999% to 100% availability adds too much burden in cost and people
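To make the “nines” concrete, here is a quick back-of-the-envelope calculation (my own illustration, not from the book) of how much downtime per year each availability target still allows:

```python
# Rough downtime budget per year for a given availability target
# (my own illustration, not from the book).
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.3%} availability -> ~{allowed:.1f} minutes of downtime/year")
```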
- SREs are part developers, part sysadmins, and their main goal is to automate all possible tasks related to the systems/networking/provisioning environment(s)
- half of their time can go to ops-related tasks; if more than that is needed, tasks are pushed back to the development teams to take over
- no more than 1-2 incidents per day/shift, and all should have post-mortem analysis in order to improve
- all SRE activities should be related to risk; if we assign a risk capital that we can spend, we can trade off between having more releases (with possibly more incidents and downtime, spending risk capital) and doing fewer features and releases (thus having increased stability and spending less risk capital)
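A minimal sketch (my own numbers and naming, not from the book) of this “risk capital” idea, usually called an error budget: the gap between the SLO and 100% is what you are allowed to spend on incidents and risky releases.

```python
# Minimal error-budget illustration (my own sketch of the "risk capital" idea).
SLO = 0.999                  # availability target for the quarter
total_requests = 10_000_000  # requests served this quarter
failed_requests = 6_500      # requests that failed

budget = total_requests * (1 - SLO)   # failures we are allowed to have
remaining = budget - failed_requests  # what is left to "spend"

print(f"error budget: {budget:.0f} failed requests")
print(f"remaining:    {remaining:.0f} ({remaining / budget:.0%} of budget left)")
if remaining < 0:
    print("budget exhausted: freeze risky releases, focus on stability")
```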
- at least 50% of the time should be coding-related or spent creating automated tools/improvements; if an SRE spends less than that, we should consider pushing tasks back to the development team itself
- availability should be considered in the context of the client’s or application’s other infrastructure, in order to decide on the trade-off between leaving leeway for extra capacity or not
- collaboration inside the team is essential for achieving some of these ideas
- our monitoring tools should give us the end-of-quarter (or end-of-month, for that matter) metrics and make it easy to see our current “risk spending”
- SLO (service level objective), SLI (service level indicator, e.g. latency) and SLA (service level agreement) are not the same thing; SREs focus on SLOs and SLIs
- 4 broad categories of systems: user-facing systems (availability, latency, throughput), storage (latency, availability, durability), big data (throughput, end-to-end latency), and correctness of data (not an SRE concern, but an application concern)
- don’t just average over 10 minutes; measure every 10 seconds so we don’t get a skewed image of our system’s performance/issues under bursty requests
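A tiny illustration of why this matters (made-up numbers, mine, not the book’s): a short burst almost disappears in a coarse average but is obvious in fine-grained samples.

```python
# Coarse averages hide bursts that fine-grained samples reveal
# (hypothetical latency numbers, my own illustration).
import statistics

# Per-10s latency samples (ms) over 10 minutes, with one 60-second burst.
samples = [20] * 54 + [900] * 6

print(f"10-minute average: {statistics.mean(samples):.0f} ms")  # burst is diluted
print(f"worst 10s sample:  {max(samples)} ms")                  # the actual burst
print(f"95th percentile:   {statistics.quantiles(samples, n=100)[94]:.0f} ms")
```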
- introduce deliberate throttling or restarts so users don’t become over-reliant on a service or assume it’s always available, and also test those scenarios
- the Four Golden Signals for monitoring:
- Latency: the time it takes to service a request, also considering the latency of failed requests separately
- Traffic: how much demand is being placed on a system, e.g. HTTP requests per second
- Errors: the rate of failing requests
- Saturation: how “full” your service is (considering constraints: memory, CPU, network), with a utilization target, e.g. 80%, as systems degrade well before 100%
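A minimal sketch of how these four signals might be captured and checked for a single service; the field names and thresholds are my own illustration, not from the book:

```python
# Hypothetical golden-signals snapshot and check (names and thresholds are
# my own illustration, not from the book).
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p99_ms: float   # latency of successful requests
    requests_per_s: float   # traffic
    error_rate: float       # fraction of requests failing
    utilization: float      # saturation of the tightest resource (CPU, RAM, ...)

def check(s: GoldenSignals) -> list[str]:
    """Return alert messages for signals that cross their (illustrative) targets."""
    alerts = []
    if s.latency_p99_ms > 300:
        alerts.append(f"p99 latency {s.latency_p99_ms} ms over 300 ms target")
    if s.error_rate > 0.001:
        alerts.append(f"error rate {s.error_rate:.2%} over 0.1% target")
    if s.utilization > 0.80:
        alerts.append(f"utilization {s.utilization:.0%} over 80% target")
    return alerts

print(check(GoldenSignals(latency_p99_ms=450, requests_per_s=1200,
                          error_rate=0.0004, utilization=0.86)))
```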
- strive to focus on simplicity for monitoring, as it can get out of hand fast
- choose to alert on symptoms or imminent problems, adapting your targets
- Automation pros
- Consistency; Faster Repairs; Time Saving - “managing cattle instead of pets”
- all automated tasks should be backed up by tests which are executed against the automation itself
- release engineering should be “baked in” at the start of a project, as later it’s more difficult to retrofit
- for Google, everything is bundled together for builds, so you get a sort of “whole world” when building a particular version of the tools
- there is a short chapter describing SRE approach to the monitoring tools and aggregation of data
- balance on-call work so that it takes no more than 50% of the engineers’ time
- having a common approach for troubleshooting makes it not only more reliable but also safer and more predictable from a person POV
- similar to 5 Whys: WHAT / WHERE / WHY approach
- focus on getting the system back online (if no data loss is incurred) instead of first finding the root cause or doing extensive analysis
- have endpoints for self-diagnosis of systems and connected systems
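As a rough idea of what such a self-diagnosis endpoint could look like (entirely my own sketch, with hypothetical checks and paths):

```python
# Hypothetical self-diagnosis endpoint (my own sketch): expose internal state
# so on-callers and tooling can query the service directly.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_checks() -> dict:
    # Placeholder checks; a real service would probe its dependencies here.
    return {"database": "ok", "cache": "ok", "queue_backlog": 12}

class DiagnosticsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps(run_checks()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), DiagnosticsHandler).serve_forever()
```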
- create tools to run tests against the systems and take notes of what was tried or not
- the book goes over an (internal) system which tracks incidents, can create statistics per incident/team/application, and makes it easy to see the bigger issues in particular areas
- some points about testing: unit tests, integration, system, non-functional (performance, stress)
- configuration testing; canary testing; break-glass testing, where changes that broke things are auto-referenced back to the culprit commit(s)
- implement various techniques to spread load more evenly across datacenters/clusters: Weighted Round Robin, client-side throttling, customer quotas
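For client-side throttling, the book describes (as I recall it) an adaptive scheme where each client compares how many requests it recently sent against how many the backend actually accepted, and starts rejecting locally once the ratio drifts too far; a rough sketch of that idea:

```python
# Rough sketch of client-side (adaptive) throttling, based on my reading of
# the book: each client tracks recent requests vs. requests the backend
# actually accepted, and starts rejecting locally when the ratio gets too high.
import random

class AdaptiveThrottle:
    def __init__(self, k: float = 2.0):
        self.k = k            # how far beyond "accepts" we tolerate going
        self.requests = 0.0   # requests attempted recently (should decay over a window)
        self.accepts = 0.0    # requests the backend accepted recently

    def allow(self) -> bool:
        reject_prob = max(0.0, (self.requests - self.k * self.accepts)
                               / (self.requests + 1))
        self.requests += 1
        if random.random() < reject_prob:
            return False      # reject locally, don't even hit the backend
        return True

    def on_backend_accept(self):
        self.accepts += 1
```

In a real implementation the counters would be kept over a sliding time window rather than growing forever; this sketch only shows the decision logic.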
- implement systems which can keep working in states other than fully healthy (e.g. degraded mode): still providing some replies, querying less data; there are lots of ideas presented:
- Lame duck - the server is still working, but clients are asked to stop sending requests to this node/location/…
- how and when to retry depending on traffic propagation patterns and fan-out rules
- have something similar to QoS implemented in order to discern various types of requests/traffic to be handled with priority
- Simple RR, Weighted Round Robin and Least-Loaded Round Robin
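A toy sketch of weighted round-robin (my own illustration, not the book’s actual implementation), where backends with more capacity get a proportionally larger share of requests:

```python
# Toy weighted round-robin picker (my own illustration): backends with higher
# weight (capacity) receive proportionally more of the requests.
import itertools

def weighted_round_robin(backends: dict[str, int]):
    """Yield backend names in a repeating order proportional to their weights."""
    cycle = [name for name, weight in backends.items() for _ in range(weight)]
    return itertools.cycle(cycle)

picker = weighted_round_robin({"be-1": 3, "be-2": 1, "be-3": 2})
print([next(picker) for _ in range(12)])
# ['be-1', 'be-1', 'be-1', 'be-2', 'be-3', 'be-3', 'be-1', ...]
```

Real load balancers interleave the picks more smoothly and feed in live health and load information (which is where least-loaded variants come in); this only shows the weighting idea.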
- presenting a few cases of possible cascading failures and ways to address them:
- determining possible application and server limits is regarded as one of the important ways to prevent later failures; try to measure in a system as close as possible to the final (production) one
- servers should protect some configured hard limits, beyond which they discard/throttle requests so that no load-induced issues come up (see the sketch after this list)
- propagating deadlines downstream is one way to make sure lower layers stop working on requests the caller has already given up on
- “load test until it breaks and beyond”; this might provide clues or early help for later on-call troubleshooters
- try to answer questions about how the service recovers after a restart or once load returns to usual limits
- consider dropping traffic in order to remedy a cascading failure that is already in progress
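A hypothetical guard combining two of the ideas above (hard limits and propagated deadlines); the limit value, names and status strings are my own, not from the book:

```python
# Hypothetical load-shedding guard (my own sketch): reject work beyond a hard
# concurrency limit, and skip requests whose propagated deadline already passed.
import time

MAX_IN_FLIGHT = 200          # illustrative hard limit for this server
in_flight = 0

def handle(request_deadline: float, do_work) -> str:
    global in_flight
    if in_flight >= MAX_IN_FLIGHT:
        return "503: overloaded, request shed"            # protect the hard limit
    if time.monotonic() >= request_deadline:
        return "504: deadline already expired upstream"   # don't do wasted work
    in_flight += 1
    try:
        return do_work()
    finally:
        in_flight -= 1

# Example: a request with 100 ms left on its deadline.
print(handle(time.monotonic() + 0.1, lambda: "200: ok"))
```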
- intro to distributed consensus algorithms - mainly Paxos is presented, with a few others and some variations mentioned
- informal approaches to solving this non-trivial problem will most certainly result in broken systems; applicable for leader election, distributed locking, or critical shared state
- the Paxos algorithm runs as a series of uniquely numbered proposals; if a majority of acceptors agree, the proposal is committed and saved (stored locally as well) - see the sketch after these notes
- a Replicated State Machine (RSM) requires a consensus algorithm to work correctly: execute the same set of operations, in the same order, on different processes
- some considerations for the various consensus algorithms are presented: Multi-Paxos, Fast Paxos, the number of replicas, and geo-distributed placement ideas and issues
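To make the proposal-numbering idea a bit more tangible, here is a heavily simplified single-decree Paxos acceptor (my own sketch; the real protocol and the book’s treatment cover much more, e.g. durable storage and the proposer side):

```python
# Minimal single-decree Paxos acceptor (my own simplified sketch, not the
# book's code): it promises to ignore proposals numbered below the highest it
# has seen, and accepts/stores the value of the newest eligible proposal.
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted_num = -1      # proposal number of the accepted value
        self.accepted_val = None    # accepted value (would be persisted to disk)

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_num, self.accepted_val)
        return ("reject", None, None)

    def accept(self, n: int, value):
        """Phase 2: accept the value if we haven't promised a higher number."""
        if n >= self.promised:
            self.promised = n
            self.accepted_num = n
            self.accepted_val = value
            return "accepted"
        return "rejected"

# A proposal is committed once a majority of acceptors return "accepted".
```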
- chapter mentioning issues that might appear (and how to solve some) if implementing a distributed cron
- very nice chapter on DR, backup and data integrity
- the most important part is the recovery of data; it’s not sufficient to have backups, you need tools and procedures in place to be able to recover quickly
- there need to be tasks which check that saved data is valid data; otherwise restoring backed-up data solves nothing (see the sketch below)
- some ideas of data snapshotting/streaming backups that maintain 2 levels of backups: short term and long term
- there need to be tools in place which check that no live data is corrupted, or which detect issues early on, so that fixing them gets attention sooner
- always test recovery procedures with actual data backups
- the Google Music example was rather interesting from the viewpoint of procedures
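The chapter’s point that backups only count once you can restore and verify them could look, in a very reduced form, something like this (my own sketch; paths, manifest format and checks are hypothetical):

```python
# Very simplified restore-verification drill (my own sketch): restore a backup
# into a scratch location and verify its contents against recorded checksums.
import hashlib
import json
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(restored_dir: Path, manifest_file: Path) -> bool:
    """Compare every restored file against the checksums recorded at backup time."""
    manifest = json.loads(manifest_file.read_text())  # {"relative/path": "sha256"}
    ok = True
    for rel_path, expected in manifest.items():
        candidate = restored_dir / rel_path
        if not candidate.exists() or sha256(candidate) != expected:
            print(f"MISMATCH or missing: {rel_path}")
            ok = False
    return ok

# Run this regularly against real backups, not just when disaster strikes.
```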
- launching products at scale: challenges and some ideas presented - a sort of super SRE team which recommends best practices and ideas
- batched roll-outs, canary releases, procedures
- best practices for new hires:
- shadow senior members of the teams during on-call and regular work, and participate in all tasks (with help from others); don’t throw grunt work at new employees
- reverse engineer existing applications and present them to the existing team (members)
- do small meaningful tasks that benefit everyone as well as help the individual learn more
- go on-call after all initial ramp up is done, and some sort of informal test is passed
- have rehearsal environments where specific scenarios can be tested/replicated
- break actual things in production and see how to fix them
- try to separate the daily tasks of people who are on-call from those who are not, as interruptions are hugely time-consuming and switching between tasks causes lots of lost time
- have blameless postmortem analysis