Software Engineering

SE Radio 569: Vladyslav Ukis on Rolling out SRE in an Enterprise


Vladyslav Ukis, author of the book Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations, discusses how to roll out SRE in an enterprise. SE Radio host Brijesh Ammanath speaks with Vlad about the origins of SRE and how it complements ITIL (Information Technology Infrastructure Library). They examine how firms can establish foundations for rolling out SRE, as well as how to overcome challenges they might face in adopting it. Vlad also recommends steps that organizations can take to sustain and advance their SRE transformation beyond the foundations.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Brijesh Ammanath 00:00:17 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. And today my guest is Vladyslav Ukis. Vlad is the head of R&D at Siemens Healthineers Teamplay digital health platform and reliability lead for all of Siemens Healthineers digital health products. Vlad is also the author of the book Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations. Vlad, welcome to Software Engineering Radio. Is there anything I missed in your bio that you would like to add?

Vladyslav Ukis 00:00:47 Thank you very much, Brijesh, for inviting me and for introducing me. I think you’ve covered everything. So looking forward to getting started with the episode.

Brijesh Ammanath 00:00:57 Great. We have covered SRE previously in SE Radio in episode 548 where Alex discussed implementing service level objectives, episode 544 where Ganesh discussed the differences between DevOps and SRE, episode 455 where Jamie talked about software telemetry, and episode 276 where Bjorn talked about site reliability engineering as a subject. In this episode, we will talk about the foundations of implementing SRE within an organization and I’ll also make sure that we link back to all those previous episodes in the show notes. To start off, Vlad, can you give me a brief introduction on what SRE is and how it differs from traditional ops?

Vladyslav Ukis 00:01:39 Let me start by giving you a little bit of history of SRE. SRE is a methodology that’s called site reliability engineering, and it was conceived at Google because Google had a big problem many years ago, which was Google was growing and the number of people that were required to operate Google also was growing, and the problem was that Google was growing so fast that it became impossible to hire operations engineers in line with the growth of Google. And they were looking for solutions to that problem: How can you grow a web property in such a way that you don’t require a linear growth of operations personnel in order to run the site? And that led to the birth of SRE approaches, which they then several years later wrote up in the well-known SRE books by Google, and this is where it’s coming from. So it’s got its origins in a way of setting up operations in such a way that you can grow the site, the web property, and at the same time you don’t need to grow linearly the personnel that’s required to run it.

Vladyslav Ukis 00:03:04 So it’s got a very business-oriented approach and digging deeper, it’s got its origins in software engineering. At Google, there is a saying that SRE is what happens when you task software engineers with designing the operations function of the enterprise. And it’s true. So you, once you dig into this, you see the software engineering approach inside SRE. How it’s different from the traditional way of operating software is that it’s got a set of primitives that enable you to create good alignment of the organization on operational concerns because it gives the participants in a software delivery organization clear roles to fulfill, and using that then the alignment can be brought about if an organization is serious about implementing SRE. And once that alignment is there, then it’s possible to do the alerting of the operations engineers, not just on the traditional IT parameters — like for example, CPU is too high or the memory is too low — but you actually are able to alert on the symptoms that are really experienced by the users. So you are alerting on the higher-level stuff, so to speak, that’s really felt by the user. And once you do this, then also the alerts, they are much more meaningful to the operations engineers running the site because then there is a clear connection between the alert and the user experience, and with that the motivation to fix the problem is high. And also you don’t get as many problems, you don’t get as many alerts as you would if you just alert on the IT parameters like CPU usage is too high and things like that.

Brijesh Ammanath 00:05:01 I like the quote when you say SRE is what happens when you get software engineers to design operations and run it. And I believe that also implies that software engineers will implement the software engineering design principles, like continuous integration and engineering principles around measurability?

Vladyslav Ukis 00:05:18 Yeah, so in terms of the software engineering approach in SRE, fundamentally what SRE brings to the table is this: imagine you’ve got a software engineering team and the software engineering team is ready to ship some digital service into production. And typically, they just do it and then they see what happens. With SRE, that’s not the approach that the team would take. With SRE, before doing the final deployment, the team will get together including the product owner and they will define the so-called service level objectives for the service, and these service level objectives, they would then quantify the reliability of the service, that is, the reliability that they want the service to fulfill. And then once deployed to production, that reliability, which is quantified, will get monitored and then they will get alerts whenever they don’t fulfill the reliability as envisioned. So you see, it creates a very powerful feedback loop where you apply effectively the tried-and-true scientific method to software operations.

Vladyslav Ukis 00:06:32 So you, before you deploy to production, you then define the SLOs which quantify the reliability that you want your service to provide. And then, once the service is in production, then you get feedback from production that tells you whenever you don’t fulfill the reliability that you actually thought the service would provide. So, it provides that powerful additional feedback loop, which is actually pretty tight. And that means that you don’t just do continuous integration in a sense that you’ve got some stages, some stages that lead you through some testing towards production. But you also think about the operational aspects much more during the development because there is an ongoing conversation about the quantification of reliability.
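
Editor's illustration (not part of the recorded conversation): the feedback loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tooling used at Siemens Healthineers. The `Slo` class, the function names, the service name, and the 99.9% target are all hypothetical and chosen only to show the idea of quantifying reliability before deployment and checking it against production measurements afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A reliability target agreed on before deployment (hypothetical shape)."""
    name: str
    target: float  # required fraction of good requests, e.g. 0.999


def availability(good_requests: int, total_requests: int) -> float:
    """Measured fraction of requests that succeeded from the user's point of view."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing was violated
    return good_requests / total_requests


def is_breached(slo: Slo, good_requests: int, total_requests: int) -> bool:
    """True when production feedback shows the service missed its target."""
    return availability(good_requests, total_requests) < slo.target


# Hypothetical SLO defined by the team and the product owner before release.
checkout_slo = Slo(name="checkout-availability", target=0.999)

# Hypothetical production measurement for one evaluation window.
if is_breached(checkout_slo, good_requests=99_800, total_requests=100_000):
    print(f"SLO breach: {checkout_slo.name}")
```

The alert here is tied to what users experience (failed requests) rather than to a resource metric such as CPU usage, which is the distinction drawn earlier in the conversation.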

Brijesh Ammanath 00:07:24 We will dig a bit deeper into SLOs, how do you go and educate the teams about it and how do you implement it, later in the podcast. But prior to that, I wanted to understand a bit about this: prior to SRE, organizations used methodologies like ITIL, Information Technology Infrastructure Library, and some organizations still continue to use that. Is SRE complementary to ITIL, or is it something which will replace ITIL?

Vladyslav Ukis 00:07:53 Right. ITIL is a very, very popular methodology to set up the IT function of an enterprise. I think there is a bit of misconception there in the industry. On the one hand, ITIL is there to, as the name suggests, set up the IT function of an enterprise. So every enterprise requires an IT function in order to set up the shared services that are used by all the departments, and that’s what ITIL is great for. Whereas SRE has got a different focus, and therefore it’s also complementary to ITIL. So SRE’s focus is to put a software delivery organization in a position to operate the digital services at scale. So, it’s not about setting up an IT function of an enterprise; it’s about really being able to operate highly scalable digital services that the company offers as a product. So, therefore the existence of ITIL and SRE in an enterprise is very complementary.

Vladyslav Ukis 00:09:03 So there is actually no contradiction there, but you are totally right in noticing that actually in the industry, these things are often not clearly delineated, which leads to questions, okay, so do we now do SRE or do we now do ITIL? And if we now do ITIL, do we need to throw it overboard and replace it with SRE? Because these are two different methodologies which have got totally different focus, well, not totally different focus, but I would say rather different focus. So these questions, they actually don’t need to arise because those two methodologies are complementary. So one thing is with ITIL, you set up your IT function in such a way that everything is compliant, that you provide good quality of service to the enterprise users, and with SRE you create a powerful alignment on operational concerns within the software delivery organization that also operates the services that you offer.

Brijesh Ammanath 00:10:05 Right. So if I understood it correctly, ITIL is broader in scope; it’s about introducing the entire IT function and setting up that environment, whereas SRE is focused on addressing the concern about reliability? Is that a right understanding?

Vladyslav Ukis 00:10:20 Yes, in general that’s the right understanding. That’s right.

Brijesh Ammanath 00:10:23 Okay. Appreciate, you know, Google introduced SRE as a concept based on their journey of setting it up. It was very new to the industry. And since then many organizations have introduced SRE into their own way of operating and setting up operations. Can you tell me the common pitfalls or challenges that organizations have encountered while introducing SRE in the existing setup?

Vladyslav Ukis 00:10:48 Definitely. Thanks for this question because that’s exactly the question that I was answering at length while I was writing my book Establishing SRE Foundations. The central question of the book was, okay, so you’ve got some examples of SRE implementation at companies like Google where it originated, and those are the companies that were born on the internet and therefore, they were looking for new approaches to operate highly scalable digital services. And now, you’ve got some traditional organization and you want to also introduce something like SRE because you think it might help you with the operations of your digital services, but you’ve got a totally different context. You’ve got a totally different context from the organizational point of view, from the people point of view, from the technical point of view, from the culture point of view, from the process point of view. So everything is different.

Vladyslav Ukis 00:11:47 Now, would it be possible to take say SRE out of Google and implant it into another organization, and would it start blossoming or not? And I would say there are a couple of main challenges there. With SRE you’ve got some responsibilities that are typically not there in a traditional software delivery organization. For example, in a traditional software delivery organization, the developers, they never go on call. Developers just develop and, as you mentioned with the example of continuous integration, their tasks end with the final internal environment, so to speak. From then onwards, someone else takes the software and brings it into production, whatever it is, whether it’s on premise or say some data center or Cloud deployment and so on. So with SRE, developers, they need to start going on call for their services. The extent to which they go on call is a matter of negotiation.

Vladyslav Ukis 00:12:59 So, they could either go on call completely, so being fully on call, fully responsible for their services, or it could be just a small percentage of their time, but in any case, developers, they need to go on call. That’s a huge change. And that means that developers need to start acting like traditional operations engineers. Whereas on the other side, on the side of the operations, they are used to operating services. So they are used to being on call, whereas what they need to do under the SRE framework, they need to enable developers to go on call. And that’s a totally new thing to them because they suddenly need to become software developers developing a framework, developing an infrastructure that enables others to do operations. And that’s a very big change because then in essence the development department needs to do operations work and the operations department needs to do development work, and that’s a difficult transformation.

Brijesh Ammanath 00:13:59 Do you have any stories around how developers within your organization took the ask about getting involved in operations and being on call? How was their reaction, and how did you approach that negotiation?

Vladyslav Ukis 00:14:12 Yes, definitely thanks for asking that question. I think that’ll be a very interesting one to answer and hopefully also to listen to. When we started with the Siemens Healthineers Teamplay digital health platform, we were the first ones in the company to offer software as a service. We were the first ones in the company to put up a service out there (it was in the Cloud, or it is in the Cloud) and then offer that as an offering on a subscription basis. So before that, the company didn’t sell subscriptions and with the Teamplay digital health platform, we started selling subscriptions. So with the sale of subscriptions came also the realization that now the obligation of running the services is actually on us. And with that then came the realization that we need to learn how to operate the services, and the services are deployed in six data centers around the world.

Vladyslav Ukis 00:15:13 And there was also a growing number of users. And with that, of course, the expectations of the availability of the service were growing higher and higher. With the higher expectations of availability of the service, also the realization came in that that leads to shorter and shorter time to recover from the incidents that might happen. And with that then came the realization that in order to be able to recover from incidents fast, we need totally new processes, which we didn’t have back then. So we need the developers to be very close to production; only then it’s possible to recover fast from the incidents. And we need to equip the developers, first of all with some technical infrastructure for being able to do so. Then also with some processes and with some mindset change because that’s a totally new area for them. So once that realization set in, we then started looking for solutions, and after stumbling a couple of times, we then arrived at SRE. We then started learning about SRE, so what that means and how that could work, could that work in our context?

Vladyslav Ukis 00:16:32 And then we decided to give it a try at some point. So we then decided to start building a very small piece of infrastructure inside the operations organization. So we put a real developer inside the operations organization who then started digging deeper into the SRE concepts and implementing them for our organization. And then we started going team by team. So, then essentially traversing the organization, onboarding them onto the infrastructure and doing this in a very agile manner, which means the infrastructure was always no more than one step ahead of the teams that were using the infrastructure. That means that the feedback loop between a feature implemented in the infrastructure and that feature being used by one of the teams was very tight, which drove then the further development of the infrastructure. So we made sure that any feature that we implemented got used by the teams in their daily operations. Very quickly with that we got either the confirmation that the feature was implemented properly or we got feedback on how to adapt the feature to meet the need of a particular team better. So, that was our approach, and over time we managed to implant the SRE ideas in all teams until the point came where SRE became the default methodology of running services in the organization.

Brijesh Ammanath 00:18:09 I’d like to dig a bit deeper into that statement where you said you started off by injecting one developer into the operations team and that kind of started blossoming that entire journey for implementing SRE across teams. What was the skillset of that developer, and was he fine with moving into operations? Did he struggle initially? What were the challenges that you faced around getting the operations team to accept that developer as part of that team? Can you give me a bit more color over that please?

Vladyslav Ukis 00:18:40 The developer actually was very happy in the operations organization because our operations organization is also very, very close to development. So, our operations organization actually doesn’t do traditional operations in a sense that there are lots of people, like teams that are just operating services because we’ve got the SRE model now, and that means that the majority of operations activities, they are happening in the development teams using the SRE infrastructure. So, the developer was actually pretty happy because it was development work for him. So, it wasn’t anything kind of totally different. It was just the context was different because the context was about implementing the SRE infrastructure, but it was development still. And that’s also one of the original sort of strengths of SRE that it’s all inspired by software engineering. Therefore for that developer it was still the software engineering world which was important.

Vladyslav Ukis 00:19:42 So the developer started learning about SRE together with me and we then drove the transformation by understanding the features that would be needed in the infrastructure, by understanding the team’s needs so that they would be willing to use the infrastructure. And that’s actually one of the important points. So we didn’t force anyone, any team, to use the SRE infrastructure. So if a team was happier using something different, then we accepted this and then moved on to another team — which by the way didn’t happen a lot because it was clear that the SRE infrastructure provides advantages. So that was our journey, and I think the apprehension of developers to, for example, take part in the SRE infrastructure implementation work would not be generally there. So if a developer is open to work on infrastructure instead of, for example, on some fancy application development, then that will be still a very interesting development field for a developer.

Brijesh Ammanath 00:20:59 Right. I’d now like to move on to the approach and if you can help me walk through a step-by-step approach to establishing SRE foundation. You’ve expanded on this in your book about assessment of readiness, achieving organizational buy-in, and the organizational structures that need to be changed. So if you can just expand on that please.

Vladyslav Ukis 00:21:21 Yeah, thank you. This is a very broad question, of course, because I wrote an entire book about this. Let me give it a try and summarize this as far as possible. When you’ve got an organization that’s new to SRE, that has never done operations before, or that did operations using some other means which didn’t make the organization happy in terms of operations and therefore they want to try SRE, then there will be several significant steps to take. One significant step at the very beginning is actually to decide — and that already requires quite some alignment of the organization. On the one hand, it requires alignment at different levels of the organization. That means that there needs to be some interested people in the teams to give it a try, which means some interested people in the operations organization, some interested people in the development organization, because they see the potential value of applying SRE in the organization.

Vladyslav Ukis 00:22:29 Then another important bit is that investing into the SRE infrastructure and investing into using the infrastructure by the development teams requires effort, and therefore the leadership of the organization needs to be aligned on giving it a try, which means the head of product, head of development, head of operations, they need to be aligned that they want to give it a try because it will require capacity in the operations teams and in the development teams. So, that alignment needs to be achieved to some degree. So that means that SRE at some point needs to find its place on the list of the bigger initiatives that the organization undertakes. So each organization will have a list like that. Either it’s covered in an entire portfolio management system or there’s just a list of initiatives that the organization undertakes, and SRE needs to find its place there.

Vladyslav Ukis 00:23:31 It needs to be there because it requires the involvement of all the roles in a software delivery organization because the software developers will be involved, the product owners will be involved, and the operations engineers will be involved. Therefore in order to make it happen, a certain degree of alignment at the leadership level will be required as well. Then the next step once that is there is to assess what actually needs to be done in different parts of the organization in order to bring the organization onto SRE. So, you would need to assess things like, okay, so where are we in terms of the organization in the sense of what are the formal and informal leadership structures? So, how can we influence teams, how can we influence people in that particular organization? Then in terms of the people assessment, you need to understand how far away people are from production.

Vladyslav Ukis 00:24:33 So, are the developers currently totally disconnected from production and they just don’t get feedback loops from production, or are there already some feedback loops and therefore they are already somewhat closer? Maybe there is a difference there between the teams. Maybe one team is already really operating the services actually quite well, just not using SRE means, and maybe there are teams that are really too far away from production. So you need to understand this. Then the next assessment that needs to be done is technical. So what are the technical means that are available in order to run something like SRE? So do we have unified logging in the organization? Do we actually know which services are deployed and where? Say, then what is the current, say, strategy for alerting? What do we alert upon? Is there alert fatigue already now, or maybe there are just no alerts because the development organization is totally disconnected from production.

Vladyslav Ukis 00:25:36 You need to understand this. And then in terms of culture also you need to assess the organization on the Westrum model, which defines certain aspects of a high-performance organization. Like, for example, what is the level of cooperation in the organization? Do we have a typical divide between the operations organization and the development organization, and then the development organization just throws their software over the fence to the operations organization? So what is the degree of cooperation there? Then you need to assess things like okay, so how does the organization treat the risks that surface themselves? Do the messengers get killed, or are the messengers welcome to present negative news and then the organization has got good structures to learn from them and move forward? You need to understand in general how cohesively the organization works in terms of the bridges between the departments.

Vladyslav Ukis 00:26:38 So, how close is the collaboration between development and product management; how close is the cooperation between development and operations; and then is there any cooperation at all between the product management organization and the operations organization? So you need to understand things like that in order to assess the culture. Also another aspect that would play into the culture is how does the organization deal with failure if there is an outage, so what is done? Are there any postmortems? Is there any blame game going on? Are people fearful to voice their concerns or the other way around? So that’s another aspect of understanding where the organization is. So then once you’ve taken that step, that means you’ve got already a permission to run the SRE transformation and you also now have assessed the organization from various dimensions. So organization, people, tech, culture, and process as well.

Vladyslav Ukis 00:27:38 So what is the process of releasing the software and so on? How frequently is it released? Then you are in a position to craft some plan of how the SRE transformation could potentially unfold, and I’m deliberately saying “could potentially unfold” because this is such a huge socio-technical change for an organization that has never done operations using SRE that you’ll never be able to predict what will happen. It all depends on the people that are in there and there is a lot of non-determinism that will be going on during such a transformation. So then once you start, I think one of the first things will need to be to come up with some minimal SRE infrastructure and then finding a team that is most willing to jump on it. And then from there you start snowballing. So you then improve the infrastructure based on the feedback from the first team.

Vladyslav Ukis 00:28:38 Then you find the second-best team to put onto the infrastructure because they’re also interested. Then you find the third-best team and so on, until it becomes a thing in the organization and there are so many teams on the infrastructure already that people are talking about it, and teams are then generally either already waiting to get on board or even actively knocking on the door and asking when they could be onboarded. So then with the onboarding onto the SRE infrastructure, several major things will happen in the team. So one major thing that will happen is the definition of the service level objectives that I mentioned earlier, so the initial quantification of reliability will happen. And then another major step for each team will be to start reacting to the SLO breaches that will be coming from the SRE infrastructure that will start tracking the defined SLOs in all deployment environments that are relevant.

Vladyslav Ukis 00:29:42 So generally in all production deployment environments. So once that is in place, then at some point the formalization of the on-call rotations will need to happen, and with that then the conversations between operations, development, and product management need to happen in order to understand a good split of the on-call work between the developers and the operations engineers. So that’ll be one of the major points and then at some point also further things will evolve and unfold. Like, for example, at some point the SRE infrastructure will be mature enough to start tracking the error budget consumption in such a way that you’ll be able to aggregate the data and present the data to various stakeholders, to the product managers, to the leadership, and so on, so that everybody becomes aware of the reliability of the services, and the question of whether we invest now into reliability or into new features could be answered in a more data-driven way than before. So as you can see, very many steps on the way, but the good thing is that with every small step you are making a small improvement that is also visible and therefore you don’t need to run all the way through to the end until you start seeing improvements. Every little step will mean a tangible improvement.
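
Editor's illustration (not part of the recorded conversation): a rough sketch of what aggregating error budget consumption for stakeholders might look like. It assumes an availability-style SLO per service and a simple request count per reporting window. The function names, the service names, the sample numbers, and the 80% warning threshold are hypothetical and not taken from the episode or from any particular SRE tool.

```python
def error_budget_consumed(target: float, good: int, total: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_failures = (1.0 - target) * total
    if allowed_failures == 0:
        # No budget at all: any failure overspends it.
        return 0.0 if good == total else float("inf")
    return (total - good) / allowed_failures


def budget_report(services: dict, warn_at: float = 0.8) -> list:
    """Rank services by budget consumption and suggest an investment focus.

    `services` maps a service name to (slo_target, good_requests, total_requests).
    `warn_at` is an assumed threshold above which reliability work is suggested.
    """
    rows = []
    for name, (target, good, total) in services.items():
        consumed = error_budget_consumed(target, good, total)
        focus = "reliability" if consumed >= warn_at else "features"
        rows.append((name, consumed, focus))
    # Most-consumed budgets first, so the riskiest services lead the report.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical measurements for one reporting window.
window = {
    "viewer": (0.999, 99_950, 100_000),
    "upload": (0.99, 9_880, 10_000),
}

for name, consumed, focus in budget_report(window):
    print(f"{name}: {consumed:.0%} of error budget used, suggested focus: {focus}")
```

A report like this gives product managers and leadership one shared number per service, which is what makes the reliability-versus-features conversation data-driven rather than anecdotal.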

Brijesh Ammanath 00:31:19 Yeah, quite a few topics over there that we can deep dive into later in the session, but when I summarize it, I think there are primarily three foundational steps. First is the alignment to ensure that the SRE transformation initiative gets into that prioritized list of initiatives. And for that alignment to happen you need all stakeholders, or majority of stakeholders, to be supporting it because it involves cost as well as capacity allocated for the transformation. The second foundational step would be the current state assessment to understand where is the organization currently and the third one, once you’ve got that list into the prioritized list of initiatives and you’ve got the current state assessment, the third foundational step would be to plan for SRE transformation and once you have planned it, the next steps that you spoke about starting onboarding and formalization of on-call schedule and so on are all implementation steps that come after the foundation. Would that be a correct summary, Vlad?

Vladyslav Ukis 00:32:18 Yeah, I think so. Thanks for summarizing it succinctly.

Brijesh Ammanath 00:32:22 Excellent. Now we’ll dig a bit deeper into each of these and I’d really be interested in understanding, do you have any example or story on how you went about getting that alignment and getting stakeholder support for such a major transformation initiative?

Vladyslav Ukis 00:32:39 Yes, definitely for sure. So, concretely what we did at Teamplay digital health platform was first of all, there were a couple of people in the organization who were interested in trying SRE because they were intrinsically motivated to, on the one hand improve the status quo, but on the other hand also they saw, themselves, the potential. So they were eager to explore the potential of SRE because they saw that that would be a good fit for what we were doing. Then a couple of bottom-up things happened, like some presentations, just informal meetings like Lean Coffee in the organization about SRE, what that could mean, what that could bring to the organization, what improvements could that yield for us. And that seeded already the initial understanding that there is something out there which could actually help us with taming the beast in production, so to speak.

Vladyslav Ukis 00:33:43 Because, as I mentioned earlier, actually everything was growing, and that means the number of users was growing, the number of digital services was growing, the expectations in terms of availability of course were growing, and the number of data centers where the platform was deployed was growing, the number of applications on the platform was growing; everything was growing, and once you are in such a situation, you really need some innovative approaches to really tame the beast in production. Otherwise, if you don’t have the right organization for this, it just doesn’t work. So what happened next? We started preparing the leadership team to put SRE into the portfolio management for the organization. So in the portfolio management, we’ve got bigger initiatives that the organization undertakes, and they are all stack ranked. So on the one hand it was important to put SRE onto that list, and the second important thing was to rank it high enough so that it gets noticed by the teams, so to speak, and we’ll be able to allocate some capacity in each team in order to work on this.

Vladyslav Ukis 00:34:56 Then we were talking separately to the head of development, head of operations, head of product, and were having conversations about the issues that we had back then with operating the platform and how SRE could help, and what we would need in order to make the first steps there and then assess whether we are seeing improvements. And then if we were, then we would be rolling out SRE more and more in the organization. So once those leaders were kind of on board, in the sense that they would agree to giving it a try, then we managed to bring this into the portfolio discussion and bring SRE onto the portfolio list, and then rank it high enough so that enough capacity could be allocated in teams. So, that was the process that we took, and since then I also advised several other product lines inside the organization and showed them the process, and they were also following the process and reported that that kind of approach to getting the initial alignment was helpful.

Vladyslav Ukis 00:36:10 So I’d say, in summary, the initial alignment is working both ways. It’s working bottom-up. You need to have some people in the organization in the teams that are interested in that kind of thing. So you need to prepare the teams themselves, and you also need to work at the leadership level — so top-down — so that at some point some capacity is allocated for the SRE work and then you can get started. I would say that combination of bottom-up and top-down is absolutely necessary here because one without the other doesn’t work. So if you don’t have anything prepared in the team yet and then you get the leadership alignment and the leaders come and say, okay, now work on SRE, I don’t think that’ll work because then the teams will feel like they’re getting overruled by some buzzword that they’re not aware of and that the managers just read about in some management magazine. And then, I think, they might think, okay, so that’s not fit for purpose because what we’re doing here is something different and so on.

Vladyslav Ukis 00:37:18 So I think that’s not a good idea. And the other way around, if you’ve got then teams burning with desire to try SRE because they think that that would improve the operational capabilities of the organization, but the leadership is not aligned and doesn’t allocate capacity in one way or another, then I think you can probably get started a little bit using bottom-up initiatives, but you’ll not be able to bring it to a point where it’ll become a major initiative and all the teams will be onboarded and so on. That’ll not work, so you’ll be able to only go so far. Therefore, that combination is important, and that’s how we did it. And that’s how I saw that also being a successful approach in other product lines.

Brijesh Ammanath 00:38:06 Vlad, you talked about developers doing on call. Usually that’s been a very thorny topic, and developers take it very personally because it impacts their work-life balance. Do you have any stories about the challenges you faced around this conversation, and how did you address them? And any tips for our listeners: if they had to roll it out in their organization, what could they look at doing, and what learnings do you have for them?

Vladyslav Ukis 00:38:31 Brijesh, thank you very much for asking this question and I’m really looking forward to answering it because I think that was the most frequently asked question by the developers when we started the SRE transformation. So do I now need to go on call out of hours? Do I need to get up at 4:00 AM at night to rectify my service? We had lots of questions like this, and I’m happy to share how we addressed this. What we started doing right at the beginning of SRE transformation was to say, look, the whole thing is an experiment. We are new to operating software as a service, we are just trying out whether SRE would be useful for us in our context. Therefore, let’s only go on call and talk about on call in the context of the regular business hours. Regardless where you are, regardless which time zone your team is in, we are only talking about on call during business hours. And that went down very well because developers generally they’re eager to try something new, and if it’s still within the business hours doesn’t disrupt their life outside of work, then they are generally happy and looking forward to trying new things.

Vladyslav Ukis 00:39:54 So, this is still partly the approach that we’ve got right now. So now what we’ve got is then a development team that’s happy with the on-call hours by being on call only during the usual business hours. But still, that challenges a development team very profoundly because a typical development team that has never done operations before actually has never had a live feedback loop from production. The development team was working on a release for some time and then once that release was over, then the development team started looking into the next release, then worked on that second release for some time, then moved on to the third release. And this is how life in a development team unfolded. Now with SRE and on call, suddenly all that changes because you get a live feedback loop from production, which you need to react to. And the development team then needs to reorganize itself in terms of how they allocate capacity, in terms of how they distribute the knowledge to be effective at being on call — because it doesn’t make sense to put somebody on call who doesn’t know how to rectify the services.

Vladyslav Ukis 00:41:12 Then you need to adapt your planning procedures, capacity allocation procedures. So lots of aspects are touched upon when you introduce that live feedback loop from production into a development team. And also, you need to take into account a particular deployment topology that you might be having. For example, in the Teamplay digital health platform we have got six data centers around the world, and now if you are saying that you are on call then are you on call for all the six data centers, or are you on call for only one, and for how long and so on. So each team needs to deal with those questions, and we took a coaching based approach and brought that to each team and discussed that at length in each team in order to find the setup that’s suitable for them. So, we don’t have a one-size-fits-all approach there, but each team found over time an approach that’s most appropriate for them that can also change over time.

Vladyslav Ukis 00:42:15 So that’s when it comes to the operations of the services that the teams own, which means that the scope of a person that’s going on call is just their service that they own. And that’s what we call now bottom-up monitoring because it just looks at the services in depth. What we then learned was required additionally to be introduced in order to really provide a reliable service is the so-called top-down monitoring. The top-down monitoring is system-level monitoring that looks at, we call them core functionalities, that cut through all the services and all the teams and provide really core functionalities — as the name suggests — without which the platform doesn’t work. One example of those core functionalities on our platform is we are in the healthcare domain and we connect hospitals to the Cloud and upload data from hospitals after minimization to the cloud.

Vladyslav Ukis 00:43:23 So we’ve got a core functionality that is a function of the data being uploaded to a data center from all connected hospitals on average over a time window. If that data-upload throughput drops significantly, then we consider this as a potential problem with one of the core functionalities, and we look into this. So that combination of bottom-up monitoring done by the teams looking at the services that they own, respectively, and then that top-down monitoring of core functionalities done by a small central operations team is the best setup for us. In terms of on call, the developers are on call eight-by-five, which means eight hours a day, five days a week, but for core functionalities, the operations team is responsible for being on call 24/7. Still, here we managed to set up the follow-the-sun approach, which means putting people into three different time zones, eight hours each, so that the people all operate only during their business hours, but still we ensure enough on-call coverage and enough on-call depth in order to provide a reliable platform. So that was our answer to this.
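
[Editor’s illustration] The core-functionality check described above, average upload throughput over a time window that is flagged when it drops significantly, could be sketched roughly as follows. This is a minimal sketch and not the Teamplay implementation: the function name, the window size, and the 50% drop threshold are all assumptions made for illustration, since the episode does not state how "significantly" is defined.

```python
from statistics import mean


def upload_throughput_dropped(samples, window=6, drop_ratio=0.5):
    """Flag a significant drop in average data-upload throughput.

    samples:    throughput measurements, oldest first (hypothetical unit,
                e.g. MB uploaded per interval across all connected hospitals)
    window:     number of samples per averaging window (assumed value)
    drop_ratio: the recent average must fall below this fraction of the
                previous window's average to count as a drop (assumed value)
    """
    if len(samples) < 2 * window:
        return False  # not enough history to compare two full windows

    baseline = mean(samples[-2 * window:-window])  # previous window
    recent = mean(samples[-window:])               # most recent window
    return baseline > 0 and recent < drop_ratio * baseline


# Usage: steady uploads do not alert; a sharp drop in the latest window does.
steady = [100] * 12
dropped = [100] * 6 + [30] * 6
print(upload_throughput_dropped(steady))
print(upload_throughput_dropped(dropped))
```

Comparing two adjacent windows keeps the sketch self-contained; a production check would more likely compare against a longer-term or seasonal baseline.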

Brijesh Ammanath 00:44:57 I think a few points stood out for me. One is it’s important to call out initially that it’s an experimental approach so it’s not something which is set in stone. So developers have that flexibility to give feedback and change the approach, if needed. I think that provided them the reassurance. So that’s very important. And I think your tip about stressing that developers only need to support during business hours is a very good point, something to take on board for other organizations who want to implement SRE. I think your answer also nicely transitions us to the next topic, which is around sustenance. So once you’ve got the foundations in place, what are the key elements for sustaining and advancing and building on the foundations of SRE?

Vladyslav Ukis 00:45:39 In order to sustain SRE further in the organization, at some point you would need to start formalizing the SRE as a role in the organization, and that then can be either seen as a responsibility that a developer takes on or it could be even a full-time SRE role. It depends on the context, but you need to deal with the formalization of the role, number one in the organization. Then number two, another thing, you need to establish error budget based, data-driven decision making where you then decide — which means prioritize — investments in feature work versus investments in reliability work based on error budget consumption. The SRE infrastructure needs to provide data which is aggregated and presented accordingly, so that different stakeholders can engage with the data and make decisions based on the data. Once you’ve got this, then that’s another point that entrenches SRE well in the inner workings of an organization — and even better if you’ve got some organization-wide continuous improvement framework and you can put SRE there, or rather just reliability there, as a dimension for continuous improvement. Then that’s even better because then you are part of a bigger continuous improvement framework where you inserted reliability as a dimension, which is measured using SRE means.

Vladyslav Ukis 00:47:18 Then another thing that you can do, which can be effective is the setup of an SRE community of practice where the people from different teams — development organization, operations organization — can meet on a cadence and then share experience, have lean coffee sessions, have lunch and learn sessions, brown bag lunches and so on, just to foster the exchange, and to foster the advancements and the maturation of the SRE practice over time.

Brijesh Ammanath 00:47:54 Thanks, Vlad. I’d like you to just expand on the concept of error budget. If you can explain to our listeners what an error budget is, I think it’ll be useful to understand the previous answer and the importance of it.

Vladyslav Ukis 00:48:06 Definitely. Actually, I think I should have introduced that long ago at the beginning of the episode, but let me do that now. So, once you’ve defined your service-level objectives, then the error budget is calculated automatically based on the service-level objectives. So let me take a simple example. Imagine you set an availability SLO to, say, 90%, and say it’s at the endpoint level. That means your endpoint should be 90% available. Depending on how you calculate this, a calculation could be that it’s available in 90% of the calls in a given period of time. That means that your budget for errors is a hundred minus 90, so 10% of the calls — and that’s your error budget. So the error budget is calculated automatically based on the SLO. If your SLO is 90%, then your error budget is 10%.

Vladyslav Ukis 00:49:08 If your SLO is 95%, then your error budget is 5%. That means then in the last example, in 5% of the cases, if it’s an availability SLO, then you are allowed to be non-available, and then you can use that error budget for things like deployments because every deployment has got the potential to chip away a little bit of the error budget because deployments can cause failures, or just during runtime something happens and you are not available for some time and then you use your error budget. So the powerful concept behind the error budget tracking is that the SRE infrastructure can tell you whether you stayed within your error budget, or whether you actually used more error budget than you were granted by the SLO. And this is something that you can then feed into the decision making by doing proper aggregations at the service level, then maybe even team level, and so on. So you can do aggregations that are necessary in order to engage different stakeholders, and that enables you then to say, okay, so actually we granted to this set of services the error budget of 5%, but actually they used, say, 10%. That means they’re using more error budget than granted and that means they’re less reliable than dictated by the SLOs. And that means then as a consequence we need to invest into reliability of those services because we actually want them to be more reliable than they currently are.
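
[Editor’s illustration] The arithmetic in this answer can be written down as a short sketch. It assumes the call-based calculation mentioned above (availability measured as the share of successful calls in a period); the function names, the dictionary layout, and the example service names are hypothetical and not taken from any particular SRE tool.

```python
def error_budget_report(slo_percent, total_calls, failed_calls):
    """Derive the error budget from an availability SLO and compare it with
    the failures actually observed in the same period."""
    budget_percent = 100 - slo_percent                  # SLO 95% -> budget 5%
    allowed_failures = total_calls * budget_percent / 100
    if allowed_failures > 0:
        consumed = failed_calls / allowed_failures      # 1.0 = budget fully used
    else:
        consumed = float("inf") if failed_calls else 0.0
    return {
        "budget_percent": budget_percent,
        "allowed_failures": allowed_failures,
        "consumed": consumed,
        "over_budget": failed_calls > allowed_failures,
    }


def services_over_budget(observations):
    """Service-level aggregation: names of services that used more error
    budget than their SLO granted, i.e. candidates for reliability work.

    observations maps a service name to (slo_percent, total_calls, failed_calls).
    """
    return sorted(
        name
        for name, (slo, total, failed) in observations.items()
        if error_budget_report(slo, total, failed)["over_budget"]
    )


# Usage, mirroring the example in the answer: a 5% budget was granted,
# but 10% of the calls failed, so twice the budget was consumed.
report = error_budget_report(slo_percent=95, total_calls=10_000, failed_calls=1_000)
print(report["budget_percent"], report["consumed"], report["over_budget"])
```

Expressing consumption as a ratio of the granted budget makes services with different SLOs comparable, which is what the aggregation for different stakeholders relies on.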

Brijesh Ammanath 00:50:43 Right. So I guess the error budget is also the budget, or the capacity, for the development team to roll out changes, because once you have exhausted that, you’ve got to focus on reliability stories rather than on enhancements. We have covered a lot of ground here, Vlad, but if there was one thing an engineering manager should remember from our show, what would that be?

Vladyslav Ukis 00:51:06 I think if it’s just one thing, then at its core, SRE helps you to quantify reliability and then introduce a process around tracking whether you are in compliance with the quantified reliability. If it’s one thing, then I’d say quantify reliability, which is actually a hard problem because usually the development teams traditionally they’re not very good at quantifying reliability. And SRE provides you with means to do so and also with processes that put your organization onto the continuous improvement path in terms of reliability, and all that is possible because the reliability is quantified. Therefore I would say quantify reliability. If it’s just one thing that you want to take away from this podcast.

Brijesh Ammanath 00:52:01 That’s a good way to remember it, I would say. Was there anything we missed that you would like to mention?

Vladyslav Ukis 00:52:06 Brijesh, there is so much in each of the points that we discussed today, so I don’t think we have missed anything grossly, but there is so much more to cover, so there is so much more to learn and I would encourage everyone to go ahead and deepen the knowledge in terms of SRE and in terms of reliability in general.

Brijesh Ammanath 00:52:28 Absolutely. And I’ll make sure we have a link to your book in the show notes so that people can learn more about rolling out SRE in their own organizations and learn from your learnings.

Vladyslav Ukis 00:52:38 Thank you. Thank you very much for having me, and it was a pleasure to be here.

Brijesh Ammanath 00:52:42 Vlad, thank you for coming on the show. It’s been a real pleasure. This is Brijesh Ammanath for Software Engineering Radio. Thank you for listening.

[End of Audio]