Web Operations by John Allspaw
28 Jan 2017I was really surprised by this book, taking into account that the domain is rapidly changing and technologies come and go. As with most good books, there are lots of good advice peppered through the book, and as well as some so-and-so chapters, overall a nice read. I would have considered this book good, if it was just out of print but coming with so many great ideas and tips for 2010 it’s quite an achievement in itself.
Some ideas I’ve jotted down:
Ch1 - mention of SRE, in a 2010 book, pretty cool, huh?
Ch2/3 - Monitoring & Metrics
- Describing how a monitoring tool or system - Ganglia presented - should be built, some of its trade-offs and some implementation details
- Going over UDP vs TCP in a very hands-on approach and a great description of what should we monitor, presenting RRD (round robin database) notion used in Graphite (and other similar tools)
- Monitor high-level metrics, and always track business relevant metrics, not only CPU/memory and the like
Ch4/5 - CD / Infra as Code
- work in small batches/increments, to reduce risk, gain confidence and visibly improve the product
- always use the tools at hand and try to follow your procedures
-
have everything (configuration, provisioning info, tools) in a version control system so you can rebuild infra in case of disasters
- Getting feedback from customers is important, not only to keep them in the loop but acknowledge and try to work with them
- More monitoring: understand what you’re monitoring
- understand normal behavior
- learn and improve from your actions and users
- Dev and Ops, On call developers, feature toggles in production, blameless post-mortems
- always include statistics from outside your organization/data center so you are sure if traffic flows in or it’s just your monitoring tools running
- have an overall Application Performance Index that tracks the overall site feel
- a really great chapter about MySQL and database with many tips and tricks and possible approaches to storing and retrieving data
- sharding should not be taken lightly, there are usually other options which should be considered first
- consider a NoSQL approach if applicable
- possible replication scenarios/approach for replicating data
- use caching if possible/available - “cache early and often”
- determine your allowed data loss - RPO / RTO - depending on accepted business analysis
Cache early and often, but don’t shard early and often.
Above all, remember this: take backups. The long-term benefits of backups have been proved by scientists, whereas the rest of my advice has no basis more reliable than my own meandering experience.
Data is your most precious business asset, and it is irreplaceable.