General Operations 101 [video]

The Ops guys from Etsy give us a short tour of monitoring a complex system.

First, John presents an investigation of past outages from the point of view of a (hopefully) objective investigator. It is quite a high-level overview, but it is full of great ideas and methodologies, and it goes further into the philosophy behind trying to fix the culture and not the people.

Investigation 101: lots of data is collected and, hopefully, aggregated across systems to get insight into how they behaved. Interestingly, Twitter feeds, internal chats/email and graphs of system load/performance are all mixed together for a post-mortem analysis of what went wrong, and of how the problems unfolded within the organization.
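To make the "mix everything together" idea a bit more concrete, here is a minimal sketch of how events from several sources could be merged into a single post-mortem timeline. This is my own illustration, not something taken from the videos: the function name `build_timeline`, the tuple layout and all the sample events are made up.

```python
from datetime import datetime
from heapq import merge


def build_timeline(*sources):
    """Merge per-source event lists (each already sorted by time) into one list."""
    # Every event is a (timestamp, source, text) tuple; we order by timestamp only.
    return list(merge(*sources, key=lambda event: event[0]))


def at(hour, minute):
    # Arbitrary, invented date: only the relative order of the events matters here.
    return datetime(2013, 1, 1, hour, minute)


# Invented sample data standing in for graphs, internal chat and Twitter.
graphs = [
    (at(10, 0), "graph", "checkout latency starts climbing"),
    (at(10, 12), "graph", "latency back to baseline"),
]
chat = [
    (at(10, 3), "chat", "anyone else seeing slow checkouts?"),
    (at(10, 9), "chat", "rolled back the last deploy"),
]
tweets = [
    (at(10, 5), "twitter", "customer reports errors at checkout"),
]

timeline = build_timeline(graphs, chat, tweets)

for when, source, text in timeline:
    print(f"{when:%H:%M} [{source:7}] {text}")
```

Reading the merged list top to bottom shows how the system's behaviour and the human reactions interleave, which is the kind of view the investigation part is after.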

Daniel gives a short intro to Graphite/Carbon/statsd. Jon briefly reviews remote connectivity and screen multiplexers, and skims over some ticketing services such as JIRA. Laurie quickly installs and configures Nagios.
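For readers who have never touched these tools, here is a rough sketch of what a metric looks like on the wire. It is not taken from the course: it only formats a statsd counter and timer (`name:value|c`, `name:value|ms`) and a line in Carbon's plaintext protocol (`path value timestamp`). The metric names and helper functions are hypothetical, and the default host and port in `send_udp` assume a statsd daemon listening on its conventional UDP port 8125.

```python
import socket
import time


def statsd_counter(name, value=1):
    """Format a statsd counter increment, e.g. 'name:1|c'."""
    return f"{name}:{value}|c"


def statsd_timer(name, millis):
    """Format a statsd timing sample in milliseconds, e.g. 'name:320|ms'."""
    return f"{name}:{millis}|ms"


def carbon_line(path, value, timestamp=None):
    """Format one line of Carbon's plaintext protocol: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions for a local statsd."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names; nothing is sent over the network here.
    print(statsd_counter("shop.checkout.errors"))
    print(statsd_timer("shop.checkout.duration", 320))
    print(carbon_line("shop.web01.load", 0.42), end="")
```

Calling `send_udp(statsd_counter("shop.checkout.errors"))` would push the counter to a local daemon; because it is UDP, the application does not block or fail if nothing is listening.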

Overall I would say that seasoned DevOps people and sysadmins might not pick up many new tricks here, but as an introductory course it is pretty good, and John’s take on the whole investigative part is rather interesting.

You can find the videos on the O’Reilly website: General Operations 101