Infrastructure as Code - Kief Morris
24 Dec 2016

Even though most of the ideas are implementation agnostic and pitched at a higher level rather than going into the nitty-gritty details, the book provides plenty of good approaches to the situations most commonly encountered when migrating from the so-called “iron age” to the much-praised “cloud age”. Compared with its period sibling - the SRE book - this one feels better put together, aiming to help readers take the first steps in migrating their tools, monitoring, practices and finally their infrastructure to a dynamically provisioned one. For newcomers to this wide domain I would definitely see this book as a prerequisite for the SRE book.
“For example it might take a member of the infrastructure team 5 to 10 minutes to provision a new server […]. But when the full cycle is analyzed […] the total time might vary from a minimum of 6 days up to 18 days.”
Part 1 / Chapter 1
- Challenges appear as soon as the infra grows, together with the initial hurdle of introducing automation
- server sprawl: infrastructure gets too big, or too big too quickly
- snowflake server (pets vs cattle): when too much time was invested into “tuning” a particular host
- configuration drift: updates, patches, hand tuning
- automation fear: get past the initial stage of using automation just for provisioning
- Fixes
- Systems can easily be re-created, and are always created the same way
- Start with an automation/scripting-first approach, instead of the other way around
- Use definition files (recipes for Chef, and so on…) - see the sketch after this list
- Make the process self-documenting (going as far as auto-creating a paper trail if needed)
- Store everything in version control, and do small changes instead of big-bang migrations
- Antifragility - optimize for MTTR (mean time to recovery) instead of MTBF (mean time between failures), as there will always be failures
- this part is really in tune with the Google SRE approach to all (possible) issues
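To make the “definition files” fix above concrete, here is a minimal sketch of what such a file could look like in Terraform (one of the tools mentioned later in these notes); the image id and names are hypothetical placeholders:

```hcl
# web-server.tf - lives in version control with everything else.
# Applying this file always yields the same server, so a broken or lost
# instance is simply re-created instead of being hand-repaired.

provider "aws" {
  region = "eu-west-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # hypothetical image id
  instance_type = "t2.micro"

  tags = {
    Name = "web-server"
    Role = "web"
  }
}
```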
Chapter 2
- The dynamic infra should be automated, not necessarily virtualized
- the solution should be programmable, on-demand, and self-service; otherwise it cannot be considered a valid platform
- “hand-cranked cloud” is not a cloud
- bare-metal “cloud” is possible
- Storage should provide both block and object storage (think S3)
- Cloud portability should be taken into account if deciding on a public cloud platform
Chapter 3 - Infra definition tools / Chapter 4 - Server configuration tools
- everything should be scriptable/automatable, otherwise we can’t consider it a mature software offering
- all “steps” of the automation should be idempotent, with validation performed before and after the fact (see the sketch after this list)
- configuration should be external to the tool(s) used, and stored in a version control system (e.g. git)
- the CMDB is tied to the “iron age” and not the “cloud age”
- infrastructure should live in an ever-updating “database”; it should not merely be checked for conformance to the initial design
- all changes should be done routinely, reliably, and safely
- tools should be used to apply the right configuration/updates to selected pools of servers
- Chef, Puppet, Ansible, SaltStack, CFEngine
- running ad-hoc commands on multiple servers should stay a quick fix, not a day-to-day operation
- Containers are not VMs :)
- extra care is needed when selecting container images; e.g. use internal image repos instead of open ones
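As a hedged sketch of the idempotency and “pools of servers” points above (Terraform again, with made-up names and counts): the plan step validates the change before the fact, and re-applying an unchanged definition is a no-op.

```hcl
# A hypothetical pool of identically configured app servers.
# `terraform plan` previews the change before anything is touched,
# and running `terraform apply` twice in a row changes nothing the
# second time - the steps are idempotent.

variable "app_server_count" {
  default = 3
}

resource "aws_instance" "app" {
  count         = var.app_server_count
  ami           = "ami-0123456789abcdef0" # hypothetical image id
  instance_type = "t2.small"

  tags = {
    Name = "app-${count.index}"
    Pool = "app"
  }
}
```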
Chapter 5 - General infra services
- continuously test tools, changes, and procedures so you gain confidence in your infra and tooling
- alerting should warn in time or trigger actionable items (see the sketch after this list)
- displaying CPU/memory graphs is not an “information radiator”
- clear red/green “radiators” are much more helpful for showing that something is on fire and needs to be put out
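A sketch of what “actionable alerting” might look like when the alerting itself is kept as code, here as a Terraform CloudWatch alarm; the metric, threshold, instance id and SNS topic are placeholders and would need to be chosen per service:

```hcl
resource "aws_cloudwatch_metric_alarm" "web_cpu_high" {
  alarm_name          = "web-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 90

  # An alert is only useful if it triggers something actionable;
  # the SNS topic below (hypothetical) would page the on-call person.
  alarm_actions = ["arn:aws:sns:eu-west-1:123456789012:oncall-alerts"]

  dimensions = {
    InstanceId = "i-0abc1234def567890" # hypothetical instance id
  }
}
```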
Chapter 6/7 - Provisioning Servers/Managing Server Templates
- server provisioning can be approached in multiple ways:
- always deploy bare OS and later add required tools/apps as updates
- always deploy your required tools / apps packaged in the image, and re-create on updates/deployments
- take care of data or logs which might need to be saved before the destruction of the instance
- where Kief really hits the nail on the head is with the antipatterns; the what-not-to-do sections are very sound advice
- antipattern: hot cloned server - when you create new servers from existing old/running ones you’re just waiting for disaster around the corner, and for troubleshooting sessions chasing non-existent errors
- 2 models to bootstrap a server: push (similar to the way Puppet works) or pull (where the server itself starts tools to bring it up to date or to finalize the install)
- in the end pull is preferred, as it puts less of a burden on the infrastructure side; both are valid approaches though (see the sketch after this list)
- always test the changes, run smoke/sanity checks on the latest additions
- Server provisioning can happen at creation time, in the template itself, or as a mix of both
- Baking templates from OS install image or from (minimal) stock image
- some minimal-OS options: Red Hat Atomic, CoreOS, Microsoft Nano Server, Ubuntu Snappy, VMware Photon
- Layering multiple templates is the more advanced approach, but it can fall short when the update process causes serious issues and delays, depending on the implementation
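As an illustration of the pull model from the bootstrap discussion above, a hedged sketch in Terraform where a new instance fetches and applies its own configuration on first boot via cloud-init user data; the image id, bootstrap URL and role are hypothetical:

```hcl
# Pull-model bootstrapping: the server brings itself up to date at boot
# instead of waiting for something central to push configuration onto it.

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # hypothetical minimal stock image
  instance_type = "t2.small"

  # Runs on first boot via cloud-init: fetch the (hypothetical) bootstrap
  # script and converge the server to its role.
  user_data = <<-EOF
    #!/bin/bash
    curl -fsSL https://config.example.internal/bootstrap.sh | bash -s -- --role app
  EOF
}
```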
Chapter 8/9 - Updating Servers/Define Infrastructure
- immutable servers are difficult to attain, in that practically every single change can cause the server to drift from the ideal “frozen” instance
- continuous configuration synchronization means that everything pertaining to the server config is kept in check by an automated tool; the problem is where to draw the line between what is and isn’t part of the server’s config
- same push/pull discussion as before, related to the continuous deployment of changes
- as a best practice try to keep the “configuration surface area” as small as possible, think CoreOS for the base server template
- transactional server updates (NixOS example) where all or none of the changes are applied
- antipattern: handcrafted infra :)
- antipattern: per-environment definition files
- pattern: reusable definition files (see the sketch at the end of these notes)
- it becomes easy to apply and test changes if the definition file is parametrized and reusable in its configuration
- as everything will be in VCS, traceability should come out of the box with no extra overhead
- antipattern: monolithic stack - it seems easy to update, but any change, small or big, can break/affect everything else
- the environment can be divided into multiple stacks and each stack can be handled separately
- we can leverage the OS packaging (e.g. deb, rpm) in order to share configuration between the stacks
- good practice would be to manage the application code and the Infra Code together
- not only does everything get tested in all environments by different teams, but the app/infra parts become more tightly integrated and can be delivered as a single unit instead of as disparate components with many fragile touch points
- a good example would be microservices, where each component of the app can be handled separately from all other components without interfering with the overall functionality
- tools to be considered: Terraform, AWS CloudFormation, Heat (OpenStack)
- running the definition tools from a workstation is possible, but the best practice is to have them run by an agent: this avoids a snowflake environment, and it is easier to log, track, and trigger automatically
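Finally, a hedged sketch of the “reusable definition files” pattern in Terraform (names and defaults are made up): one definition is parametrized per environment rather than copy-pasted per environment.

```hcl
# One reusable definition instead of a file per environment.
# Environment-specific values live in small, version-controlled
# files such as test.tfvars and production.tfvars.

variable "environment" {}

variable "instance_type" {
  default = "t2.micro"
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # hypothetical image id
  instance_type = var.instance_type

  tags = {
    Name        = "app-${var.environment}"
    Environment = var.environment
  }
}
```

The same file can then be applied with `terraform apply -var-file=test.tfvars` or `-var-file=production.tfvars`, so environments differ only in their parameter values, and every change stays traceable through version control.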