I came across an article titled “Devs don’t want to do ops” that started with the premise that developers managing their own production infrastructure is stressful (it is), questioned whether development and operations should be separated again, and settled on declaring that “DevOps is dead” and that platform engineering is the future. It was quite a ride. It also raised some good questions about DevOps and the ideal approach to building and running code. Is DevOps really dead? Is platform engineering really the future? What does it mean to “own your own code in production”?
Why were we doing DevOps in the first place?
The original idea behind DevOps was to solve issues that stemmed from development and operations being separate teams, with code handed off between them in order to deploy something in production. Being 2 separate teams, they had 2 separate sets of priorities that, while not necessarily mutually exclusive, led to tensions and slowed release cadences down. Developers prioritized building and rapid shipping, and operations prioritized keeping things running smoothly.
DevOps stemmed from the realization that both teams have the same larger goals of using software to help the business make money, but different focuses on how to do that. Rather than keeping these roles on completely separate teams, where these different focuses were easy to turn into conflicts, we could put the 2 groups together and negotiate these concerns up front and then go forward with an optimized solution.
So why didn’t that work out?
Well, the idea was good, but the implementations were…surprisingly hard to pull off well. In fact, there’s a whole musical number about it. Basically, it seems like businesses will do almost anything to avoid actually forming cross-functional teams. I have a personal hypothesis that the entire DevOps movement started to fix bad Scrum implementations. Businesses were told they needed to build teams that could handle every aspect of building and running software, including product owners (representing customers), QA, designers, and operations. And businesses responded by saying “Yeah, yeah – we’ll make teams that have developers, QA, and product people on them.” So people started a new movement to fix this screw-up and bring operations into the team (seriously – look at the makeup of most Scrum teams in organizations that say they do Scrum – 0 operations people, and rarely someone whose job consists of designing the user experience in your application).
At this point, either businesses got on board and made operations a part of the process from the beginning, or they dug their heels in and decided to rely on managed infrastructure in the cloud. I know I keep bringing up cloud platforms, but I think they had a huge impact on DevOps evolving into the bastardizations that people use today. It’s a testament to companies like AWS that developers have been able to build and run their own infrastructure in production, while still building features, for this long. Even before they went as heavy into serverless and fully-managed services, companies like AWS made running infrastructure incredibly easy. Add to that strong tech companies adopting a “run your own application in production” philosophy (again, something that makes a lot of sense and works really well when you have actual cross-functional teams), and all of a sudden you hit the “who needs operations” mindset that led to the original “Devs don’t want to do Ops” article.
So did DevOps fail?
Given that DevOps is a practice of having developers and operators work closely together, on the same team, for the entire software development process, and that practice has worked well for companies that did it that way…no, it didn’t. The real problem with DevOps is that, despite its definition being perfectly clear, almost nobody bothered to do it, opting instead to try to change the definition to “tooling team” or “devs run the production servers” in order to claim to be following the latest “industry best practice,” all without having to actually change anything. That said, the lack of actual adoption certainly indicates that DevOps isn’t going to be a legitimate thing in software development.
That leaves us with the question of how we should approach reducing the tension between writing and running software on the cloud. There are other best practices that have caught on and should be kept, such as developers owning their code in production. After all, almost all of the issues coming out of production these days tend to be code problems, so we do need to take a more active role in running the code.
In terms of operations, the most “hands-on” a lot of applications get is running directly on virtual machines, utilizing some pre-defined autoscaling options. More likely now is that you’re running applications using managed services, maybe in some sort of container management service (either direct container management, like Elastic Container Service from AWS, or in a managed Kubernetes cluster), functions-as-a-service (a la AWS Lambda), or you’re using other services in a completely on-demand fashion, without provisioning instances at all. In every case, almost all of the physical infrastructure is handled by your cloud provider, leaving just application monitoring, observability, and alerting for issues in the code. All of which are basically development functions now.
If anything’s killed DevOps, that’s probably it – “running” code doesn’t require “running” infrastructure anymore, so what we usually think of as “operations” has fallen by the wayside, for better or for worse. The good news is, most of our problems are code-based. The bad news is, it puts a lot of extra pressure on developers. Pile writing our infrastructure as code on top of that, and there’s not a lot of “Ops” left that isn’t being done by “Devs,” thus leading to the problem that the original article claims platform engineering solves.
The real issue – What is an operations?
Given that most of traditional operations is being handled by cloud providers, but there’s still a bunch of non-traditional development tasks that developers don’t want to be doing, the pertinent question probably isn’t “what’s going to replace DevOps?” but rather “what role does operations play now?” Well, for starters, there’s building out service infrastructure in the first place. Yes, I know we’re doing “infrastructure as code,” but at this point we’re moving away from raw “code” (like conventional bash scripts or Chef scripts) and onto deployment files (like Terraform or Helm charts). Those don’t have to be written by developers.
In fact, cloud architectures are a lot more involved to run reliably than I’ve over-simplified here. Yes, you should be replicating instances and containers, but there’s more to running in the cloud than spinning up more instances. In addition to replication, other factors in your architecture include scaling, inter-service communication (e.g. firewall rules, event streams and message queues, etc.), how to handle database migrations, and system performance. Oh, and since you’re running all of this on the cloud, you likely want some basic understanding of costs and about when you want to change your approach. I’m not saying you have to memorize (or always have handy) your cloud provider’s price list (because I don’t, and usually don’t think twice about it), but these services can charge a lot of money, so it certainly helps to know your most cost-effective options for various services at various usage volumes.
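To make the cost point a little more concrete, here’s a minimal sketch of the kind of back-of-the-envelope comparison I mean: a flat-priced, always-on option versus a pay-per-request option, and the usage volume where they break even. Every price, count, and function name below is a made-up placeholder for illustration, not anything from a real provider’s price list.

```python
# All prices below are made-up placeholders, NOT real cloud provider rates.
INSTANCE_PRICE_PER_MONTH = 30.0      # hypothetical flat cost of one always-on instance
INSTANCE_COUNT = 2                   # hypothetical replica count
ON_DEMAND_PRICE_PER_MILLION = 0.50   # hypothetical cost per million on-demand requests


def provisioned_cost(instances: int = INSTANCE_COUNT) -> float:
    """Monthly cost of always-on instances, independent of traffic."""
    return instances * INSTANCE_PRICE_PER_MONTH


def on_demand_cost(requests_per_month: int) -> float:
    """Monthly cost of a pay-per-request service at a given volume."""
    return requests_per_month / 1_000_000 * ON_DEMAND_PRICE_PER_MILLION


def break_even_requests(instances: int = INSTANCE_COUNT) -> int:
    """Request volume at which both options cost the same per month."""
    return round(provisioned_cost(instances) / ON_DEMAND_PRICE_PER_MILLION * 1_000_000)


def cheaper_option(requests_per_month: int) -> str:
    """Name the cheaper option at this volume (ties go to provisioned)."""
    if on_demand_cost(requests_per_month) < provisioned_cost():
        return "on_demand"
    return "provisioned"


if __name__ == "__main__":
    for volume in (10_000_000, 120_000_000, 500_000_000):
        print(volume, cheaper_option(volume))
```

A real comparison would also need to account for things like free tiers, data transfer, and storage, so treat this as a starting point for the conversation rather than a pricing model.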
These operational concerns touch on a lot of other roles that have traditionally been considered development-related, like software architecture and site reliability. What’s important is that none of this can really be done independently of (or in isolation from) developers, and it can’t be substituted with a tools team. “Operations” works for lack of a better term, since it’s more focused on the ancillary aspects of running software. I don’t really know if I’m willing to go so far as to call it another job, other than the fact that the amount of effort involved in that work plus the time and effort involved in writing code can seem like it’s adding up to more than a standard 40 hours per week. On the surface it seems like better automation will do all of this for you, but there’s a significant amount of expertise involved in building that automation, which means you still need an operations team anyway.
So does platform engineering solve everything?
Well…meh. Platform engineering is probably a topic all of its own, but in this context, it helps by building “paved road” solutions, which certainly simplifies a lot of things. From what I’ve read, though, I’m not totally sure that it actually solves anything on its own. If anything, platform engineering seems to be something organizations with a strong DevOps culture do, but as part of the larger context of getting the people who write and run the software to work more closely together.
I think the bigger takeaway from all of this is that cutting the people who were in charge of running software and piling their work onto the people in charge of writing the software wasn’t successful. But while we were coming to that realization, the job of running code changed. The need for operations and feature development to work hand-in-hand stayed the same, however. Some practices, like platform engineering, certainly seem useful, and a good fit in a DevOps culture, but nothing I’ve seen shakes me from the belief that what really leads to success are actual cross-functional teams, so you’ll have to forgive me for remaining unconvinced that platform engineering is the silver bullet it’s being hyped to be.