Article originally published by VictorOps
The original all-hands-on-deck culture faltered during growth.
Raleigh Schickel, DevOps Manager, has seen uShip evolve from a small team with a few developers to a larger company with a dev team of 60 and growing. Initially, everyone was always on-call for their own code. But this culture of ownership changed as the company grew.
“As we hired more people, code ownership centralized, though not intentionally,” says Raleigh. “Developers started expecting Operations to be responsible for the running system, and started working on features and moving on.”
uShip is a continuous deployment shop and developers are empowered to deploy code at any time. As Operations became more centralized, it became more difficult to determine the cause of application issues.
“Problems that could have had a quick easy fix would lag,” says Raleigh. “This made our Time to Identify and Time to Resolve unnecessarily long. The question became: How can we decrease time to identify and resolve?”
The challenge: how to recreate that early culture of code ownership.
uShip’s developers also expressed their desire to own their own infrastructure and not wait on others. But they didn’t want to be on-call. Raleigh says this didn’t add up.
“[The developers] are looking for ways to speed up their development processes and time to market, but there are security and operational problems with that. If they change a setting and go to sleep and the service breaks, who deals with it? Who knows the most about it? I don’t know what they did.”
In response to their request, Raleigh convinced the developers to take on-call responsibilities. “Our rationale was this,” says Raleigh. “If you are willing to be responsible for the code you are delivering today, then we can expand access to the infrastructure that we are creating for tomorrow.” They agreed.
VictorOps helps democratize the on-call process.
Now (more than ever) with a team of 60 and growing, uShip needed a better way to manage on-call. uShip had handled incidents and log communication via email, with Nagios paging the team directly. This process was unwieldy.
They chose VictorOps for intelligent alerting, routing, and incident management. Now this 60+ person team could intelligently and humanely handle incidents that might occur anytime. Raleigh explains:
“Before VictorOps, we were limited to the same four or five people who were on-call all the time and that was a burnout gig. VictorOps allowed us to democratize the on-call process. We spread out the on-call load, which helps build empathy among developers about what other people go through. It allows those people who have traditionally been on-call to step back for a moment and catch their breath.”
The VictorOps developer discounted pricing program enabled uShip to affordably provide accounts for the entire development team.
Creative approaches to on-call rotation schedules.
To manage their on-call schedules, development teams work on a three-month cycle in which each team spends two weeks on call. They are on-call from 6pm until 9am, at which time the Ops team takes over.
uShip’s developers use the VictorOps team setup and scheduling features extensively. Since each team sets its own schedule for its members, they have used their creativity to design complex rotations. For example, they used VictorOps to put themselves on-call in two-hour chunks.
Raleigh especially loves the scheduled override feature because if there is a last-minute schedule change, it’s no problem. If someone on-call gets sick or something happens, they can just create an override instead of having to tweak the on-call schedule.
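The overnight hand-off and the override idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (Override, in_dev_window, who_is_on_call) are hypothetical, the code does not use or represent the VictorOps scheduling API, and it assumes the Ops team covers everything outside the 6pm-to-9am window.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Iterable

# Window taken from the article: devs cover 6pm until 9am, then Ops takes over.
DEV_START = time(18, 0)
DEV_END = time(9, 0)


@dataclass
class Override:
    """Hypothetical model of a temporary substitution for the scheduled dev."""
    start: datetime
    end: datetime
    responder: str


def in_dev_window(moment: datetime) -> bool:
    """True when the moment falls in the overnight window (which wraps past midnight)."""
    t = moment.time()
    return t >= DEV_START or t < DEV_END


def who_is_on_call(moment: datetime, scheduled_dev: str,
                   overrides: Iterable[Override] = ()) -> str:
    """Return the responder for a moment; an active override wins over the schedule."""
    if not in_dev_window(moment):
        return "ops"  # assumption: Ops covers daytime hours
    for override in overrides:
        if override.start <= moment < override.end:
            return override.responder
    return scheduled_dev
```

For example, if the scheduled developer is out sick, a single Override covering the shift swaps in a colleague without touching the underlying rotation, which is the convenience Raleigh describes.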
Devs on-call handle application health and well-defined issues.
Raleigh explains that uShip’s development teams are primarily on-call to monitor application health. They respond to incidents related to questions such as: How many exceptions do we have? Is the marketplace healthy? Do we have enough new listings? Do we have enough new users?
Developers are also on-call for infrastructure issues that have well-defined, simple fixes. “As long as the alert is clear and tells them what is going on, they can go push a button and easily fix a problem,” says Raleigh. “If they have to go reset app servers, we have buttons for that.”
However, if a Linux server is broken and requires intensive troubleshooting, or if a telemetry system is down, an Ops team member handles the incident. Those problems are not a developer’s responsibility to solve.
Always on-call in their particular area of expertise.
For the most part, developers aren’t part of a time-based on-call rotation. Instead, they are always on-call for their code in their area of expertise. Via the VictorOps Incident Automation Engine, Raleigh set up routing keys that send each alert to the right expert. During feature releases, the responsible dev team goes on-call until everyone is comfortable that the deployment was successful.
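The idea of routing each alert to the right expert can be pictured as a simple lookup. The sketch below is not the VictorOps Incident Automation Engine or its configuration format; the routing keys, team names, and the route_alert function are all hypothetical, and the fallback to Ops for unrecognized keys is an assumption.

```python
from typing import Dict, Mapping

# Hypothetical routing table: routing key -> team that owns that area of code.
ROUTES: Dict[str, str] = {
    "listings": "marketplace-team",
    "payments": "payments-team",
}

# Assumption: alerts without a known routing key fall back to Ops.
DEFAULT_TEAM = "ops"


def route_alert(alert: Mapping[str, str],
                routes: Mapping[str, str] = ROUTES,
                default: str = DEFAULT_TEAM) -> str:
    """Pick the team to page based on the alert's routing key."""
    key = alert.get("routing_key", "")
    return routes.get(key, default)
```

With a table like this, an alert tagged "payments" pages only the payments experts, while anything untagged lands with Ops rather than waking an unrelated team.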
“Developers get to think about and understand the whole system in a way that they were not able to before,” says Raleigh. “Their mindset was: of course my code works. Actually, there is a giant system out there that interacts with your code.”
Using Slack to create manual incidents eliminates even more noise.
The dev teams self-organized to have one person from each team on-call at all times in case of a problem. They wrote an app called the Victorbot that allows anybody in Slack to create a manual incident and page the appropriate team via Slack in case of an emergency. “This is another way that VictorOps has helped us only page the right team when we need a response,” says Raleigh.
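A chat-driven manual incident might work roughly as follows. The article does not show how Victorbot is implemented, so this is a guess at the general shape only: the "page <team> <message>" command syntax, the team names, and the returned dictionary are all hypothetical, and no Slack or VictorOps API calls are made.

```python
from typing import Optional

# Hypothetical set of teams that can be paged from chat.
KNOWN_TEAMS = {"marketplace", "payments", "ops"}


def parse_page_command(text: str) -> Optional[dict]:
    """Turn a chat message like 'page payments checkout is failing' into an incident.

    Returns None when the message is not a well-formed page command
    or names a team that is not in KNOWN_TEAMS.
    """
    parts = text.strip().split(maxsplit=2)
    if len(parts) < 3 or parts[0].lower() != "page":
        return None
    team, message = parts[1].lower(), parts[2]
    if team not in KNOWN_TEAMS:
        return None
    # A real bot would hand this off to the paging system; here it is just returned.
    return {"team": team, "message": message, "source": "slack-manual"}
```

Requiring an explicit team name in the command is one way to keep a manual page from reaching anyone other than the team being asked for help.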
Devs on-call feel empathy and build even better code.
Raleigh explains why putting devs on-call has been so great for the team. He says, “The devs get a little taste of what it’s like to wake up in the middle of the night and handle the platform. They have shown great desire to make sure the launch of new code is healthy and for being the primary person on-call for it at launch. The best part is that we’re shifting back to the ownership culture.”
Choosing to build features rather than building a huge operations team.
Ultimately, owning code isn’t just a nice-to-have. It enables uShip to put its resources toward development rather than toward supporting increasingly complicated infrastructure, especially as microservices proliferate and require specialization. Raleigh says:
“If you believe in democratizing operations, then developers need to be on-call. Otherwise, if you have 20 microservices and five go down, how many Ops people would you need to put that fire out? It’s a choice. Are you going to pay for developers or are you going to pay for an Ops team? We think our company benefits more from delivering products. The more developers you have, the more you can develop product. It’s just kind of the reality.”
The DevOps team has more time to innovate.
With uShip’s culture shift and devs now available to take on-call shifts, Raleigh’s DevOps team has more time to help the company innovate faster, which means writing code and developing and improving infrastructure. Raleigh says, “At some point in your company’s life, you’re going to take another look at code that was written with ideals in mind, and realize that as volume and traffic increase, it doesn’t always perform very well. Right now, for example, our team is currently focused on writing code to reduce load on the database.
“We’re turning two of my senior .NET developers into SREs so we can focus on this work,” Raleigh continues. “It’s good for everyone involved to be doing this type of work as a means to improve platform performance and mitigate future issues and alerts. We would rather have the team working for the future than firefighting today’s application issues.”