Randy Shoup of eBay discusses the evolution of eBay’s tech stack. SE Radio host Jeremy Jung speaks with Shoup about eBay’s origins as a single C++ class with an Oracle database, a five-year migration to multiple Java services, sharing a database between the old and new systems, building a distributed tracing system, working with bare metal, why most companies should stick to cloud, why individual services should own their own data storage, how scale has caused solutions to change, rejoining a former company, choosing what to work on first, the Accelerate Book, and improving delivery time.
This transcript was automatically generated. To suggest improvements in the text, please contact email@example.com and include the episode number and URL.
Jeremy Jung 00:00:17 Today I’m talking to Randy Shoup, he’s the VP of Engineering and Chief Architect at eBay. He was previously the VP of Engineering at WeWork and Stitch Fix, and he was also a Chief Engineer and Distinguished Architect at eBay back in 2004. Randy, welcome back to Software Engineering Radio. This will be your fifth appearance on this show. I’m pretty sure that’s a record.
Randy Shoup 00:00:39 Thanks, Jeremy. I’m really excited to come back. I always enjoy listening to, and then also contributing to, Software Engineering Radio.
Jeremy Jung 00:00:46 Back at QCon 2007, you spoke with Markus Völter (he was the founder of SE Radio), and you were talking about developing eBay’s new search engine at the time. And kind of looking back, I wonder if you could talk a little bit about how eBay was structured back then, maybe organizationally, and then we can talk a little bit about the tech stack and that sort of thing.
Randy Shoup 00:01:09 Oh, sure. Okay. Yeah. So eBay started in 1995. I just want to like orient everybody: same as the web, same as Amazon, same as a bunch of stuff. eBay was actually almost 10 years old when I joined that seemingly very old first time. So yeah, what was eBay’s tech stack like then? So, eBay has gone through five generations of its infrastructure. It was transitioning between the second and the third when I joined in 2004. So the first iteration was Pierre Omidyar, the founder, three-day Labor Day weekend in 1995 playing around with this new cool thing called the Web. He wasn’t intending to build a business, he just was playing around with auctions and wanted to put up a webpage. So he had a Perl back end and every item was a file, and it lived on his little 486 tower or whatever he had at the time. So that wasn’t scalable and wasn’t meant to be. The second generation of eBay architecture was what we called V2. Very creatively.
Randy Shoup 00:02:02 That was a C++ monolith, an ISAPI DLL with essentially — well, at its worst, which grew to 3.4 million lines of code in that single DLL. And basically in a single class, not just in a single like repo or a single file, but in a single class. So that was very unpleasant to work in, as you can imagine. eBay had about a thousand engineers at the time and they were as you can imagine, like really stepping on each other’s toes and not being able to make much forward progress. So starting in, I want to call it 2002, so two years before I joined, they were migrating to the creatively named V3. And V3’s architecture was Java and not microservices, but like we didn’t even have that term, but it wasn’t even that. It was mini applications.
Randy Shoup 00:02:49 So actually let’s take a step back. V2 was a monolith, so like all of eBay’s code in that single DLL and like that was buying and selling and search and everything. And then we had two monster databases: a primary and a backup, big Oracle machines on Sun hardware that was bigger than refrigerators. And that ran eBay for a bunch of years before we changed the upper part of the stack. We chopped up that single monolithic database into a bunch of domain-specific databases or entity-specific databases, right? So a set of databases around users, sharded by the user ID — we could talk about all that if you want — Items again, sharded by item ID, transactions sharded by transaction ID, dot dot dot. I think when I joined, it was the several hundred instances of Oracle databases spread around, but still that monolithic front end.
Randy Shoup 00:03:41 And then in 2002, I want to say, we started migrating into that V3 that I was mentioning. So that was a rewrite in Java, again, mini applications. So you take the front end and instead of having it be in one big unit, it was this EAR file, if people remember back to those days in Java, 220 different ones of those. So like, one application would be the search application and it would do all the search-related stuff, the handful of pages around search; ditto for the buying area, ditto for the checkout area, ditto for the selling area, dot dot dot, 220 of those. And that was, again, vertically sliced domains. And then the relationship between those V3 applications and the databases was a many-to-many thing. So like many of those applications would interact with items, so they would interact with those item databases. Many of them would interact with users, and so they would interact with the user databases, et cetera. Happy to go into as much gory detail as you want about all that. But we were in the transition period between the V2 monolith and the V3 mini applications in 2004. I’m just going to pause there and like, let me know where you want to take it.
Jeremy Jung 00:04:57 Yeah. So you were saying that it started as Perl, then it became C++, and that’s kind of interesting that you said it was all in one class, right?
Randy Shoup 00:05:06 So, it’s pretty much, yeah.
Jeremy Jung 00:05:08 Wow. That’s got to be a gigantic file. . .
Randy Shoup 00:05:10 It was brutal. I mean, completely brutal. Yeah. 3.4 million lines of, yeah. We were hitting compiler limits on the number of methods per class. So, I’m scared that I happen to know that at least at the time, Microsoft allowed you 16K methods per class and we were hitting that limit. So, not great.
Jeremy Jung 00:05:28 Wow. It’s just kind of interesting to think about how do you walk through that code, right? I guess you just have this giant file.
Randy Shoup 00:05:37 Yeah. I mean, there were different methods, but yeah, it was a big mess. I mean, it was a monolith, it was a spaghetti mess. And as you can imagine, Amazon went through a really similar thing, by the way. So this wasn’t super, I mean, it was bad, but like we weren’t the only people that were making that mistake. And just like Amazon, where they did like one update a quarter at that period, like 2000, we were doing something really similar, like very, very slow updates. And when we moved to V3, the idea was to make changes much faster. And we were very proud of ourselves starting in 2004 that we upgraded the whole site every two weeks. And we didn’t have to do the whole site, but like each of those individual applications that I was mentioning, right, those 220 applications, each of those would roll out on this biweekly cadence, and they had interdependencies. And so we rolled them out in this dependency order and anyway, lots of, lots of complexity associated with that. Yeah. There you go.
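Editor's illustration: rolling applications out "in dependency order" is a topological sort of the dependency graph. The sketch below is a minimal Python example of that idea, not eBay's actual tooling. The application names and the dependency table are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical application names and dependencies, for illustration only.
# Each key depends on the applications listed in its value set.
DEPENDENCIES = {
    "search": {"user-core", "item-core"},
    "buying": {"user-core", "item-core"},
    "checkout": {"buying"},
    "selling": {"user-core", "item-core"},
    "user-core": set(),
    "item-core": set(),
}


def rollout_order(dependencies):
    """Return a deploy order in which every application follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, app in enumerate(rollout_order(DEPENDENCIES), start=1):
        print(step, app)
```

Any order returned this way deploys the shared pieces before the applications that rely on them. A cycle in the table raises `graphlib.CycleError`, which is usually the point where such a rollout needs human attention.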
Jeremy Jung 00:06:34 The V3 that was written in Java, I’m assuming this was a complete rewrite. You, didn’t use the C++ code at all?
Randy Shoup 00:06:41 Correct, yeah. We migrated page by page. So in the transition period, which lasted probably five years: in the beginning, all pages were served by V2. In the end, all pages are served by V3. And over time you iterate and you like rewrite and maintain in parallel the V3 version of XYZ page and the V2 version of XYZ page. And then when you’re ready, you start to test out at low percentages of traffic: what does V3 look like? Is it correct? And when it isn’t, you go and fix it, but then ultimately you migrate the traffic over to fully be in the V3 world, and then you remove or comment out or whatever the code that supported that in the V2 monolith.
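Editor's illustration: the page-by-page migration described here, where a low percentage of traffic goes to the new version first, can be sketched as a per-page rollout table plus a stable hash. This is a minimal Python sketch under assumptions. The page names, percentages, and function names are hypothetical, and the transcript does not say how eBay actually split traffic.

```python
import hashlib

# Hypothetical per-page rollout table: percentage of traffic sent to the new version.
ROLLOUT_PERCENT = {"view_item": 5, "search": 100, "checkout": 0}


def bucket(user_id: str, page: str) -> int:
    """Map a (user, page) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{page}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, page: str) -> str:
    """Return "v3" for the configured share of traffic, otherwise "v2"."""
    percent = ROLLOUT_PERCENT.get(page, 0)  # pages not listed stay on the old version
    return "v3" if bucket(user_id, page) < percent else "v2"
```

Hashing the user and page together keeps each user on the same version of a page between requests, so a problem seen in the new version can be reproduced. Raising a page's percentage to 100 completes its migration, after which the old code path can be removed.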
Jeremy Jung 00:07:27 And then you had mentioned using Oracle databases. Did you have a set for V2 and a set for V3, and you were kind of trying to keep them in sync?
Randy Shoup 00:07:35 Oh, great question. Thank you for asking that question. No, no. We had the databases. So again, as I mentioned, we had pre-demonolithed (that’s a technical term), pre-broken up the databases starting in, let’s call it 2000, actually. I’m almost certain it’s 2000 because we had a major site outage in 1999, which everybody who was there at the time still remembers. Wasn’t me, I wasn’t there at the time, but you can look at that. Anyway, so yeah, starting in 2000, we broke up that monolithic database into what I was telling you before, those entity-aligned databases. Again, one set for items, one set for users, one set for transactions, dot dot dot. Those databases were shared between V2 using those things and V3 using those things. And so we completely decoupled the rewrite of the database, kind of data storage layer, from the rewrite of the application layer, if that makes sense.
Jeremy Jung 00:08:32 Yeah. So, so you had V2 that was connecting to these individual Oracle databases. You said like they were for different types of entities, like maybe for items and users and things like that. But it was a shared database situation where V2 was connected to the same database as V3. Is that right?
Randy Shoup 00:08:50 Correct. And also in V3, even when done, different V3 applications were also connecting to the same database. Again, like anybody who used the user entity, which is a lot, were connecting to the user suite of databases, and anybody who used the item entity, which again is a lot, were connecting to the item databases, et cetera. So yeah, it was this many-to-many, that’s what I was trying to say, many-to-many relationship between applications in the V3 world and databases.
Jeremy Jung 00:09:19 Okay. Yeah. I think I got it because,
Randy Shoup 00:09:21 It’s easier with a diagram.
Jeremy Jung 00:09:23 Yeah. Because when you, when you think about services, now you think of services having dependencies on other services. Whereas in this case you would have multiple services that rather than talking to a different service, they would all just talk to the same database. They all needed users. So they all needed to connect to the user’s database.
Randy Shoup 00:09:42 Right, exactly. And so I don’t want to jump ahead in this conversation, but like everybody who’s feeling uncomfortable at the moment, you’re right to feel uncomfortable, because that was an unpleasant situation. And microservices, or more generally the idea that individual services would own their own data, and the only interactions to the service would be through the service interface and not like behind the service’s back to the data storage layer, that’s better. And Amazon discovered that, a lot of people discovered that around that same early 2000s period. And so yeah, we had that situation at eBay at the time. It was better than it was before, right? Better than a monolithic database and a monolithic application layer, but it definitely also had issues, as you can imagine.
Jeremy Jung 00:10:26 Thinking about back to that time where you were saying it’s better than a monolith, what were sort of the tradeoffs of you have a monolith connecting to all these databases versus you having all these applications, connecting to all these databases, like what were the things that you gained and what did you lose if that made sense?
Randy Shoup 00:10:46 Yeah. Well, why we did it in the first place is like isolation between development teams, right? So we’re looking for developer productivity or the phrase we used to use was feature velocity so how quickly would we be able to move? And to the extent that we could move independently. The search team could move independently from the buying team, which could move independently from the selling team, et cetera. That was what we were gaining. What were we losing? When you’re in a monolith situation, if there’s an issue, you know where it is, it’s in the monolith. You might not know where in the monolith, but like there’s only one place it could be. And so an issue that one has when you break things up into smaller units, especially when they have this shared mutable state, essentially in the form of these databases, like who changed that column?
Randy Shoup 00:11:35 What’s the deal? Actually, we did have a solution for that, or something that really helped us, which was, more than 20 years ago, we had something that we would now call distributed tracing. Actually I talked about this way back in the 2007 thing, because it was pretty cool at the time. Just like the spans one would create using modern distributed tracing, OpenTelemetry or any of the distributed tracing vendors, just like you would do that. We didn’t use the term span, but that same idea, and the goal was the same: to like debug stuff. So every time we were about to make a database call, we would say, “Hey, I’m about to make this data”. We would log “about to make this database call” and then it would happen. And then we would log whether it was successful or not successful.
Randy Shoup 00:12:18 We could see how long it took, et cetera. And so we built our own monitoring system, which we called Central Application Logging or CAL totally proprietary to eBay, happy to talk about whatever gory details you want to know about that. But it was pretty cool. Certainly way back in 2000, it was. And that was our mitigation against the thing I’m telling you, which is when not, if something is weird in the database, we can kind of back up and figure out where it might have happened. Or things are slow, what’s the deal? And because sometimes the database is slow for reasons. And what thing is from an application perspective, I’m talking to 20 different databases, but things are slow. Like what is it? And CAL helped us to figure out both elements of that, right?
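Editor's illustration: the pattern described here (log before a database call, then log the outcome and how long it took) can be sketched as a small wrapper. This is a minimal Python sketch, not CAL itself, which is proprietary to eBay. The function name `logged_call`, the record fields, and the example database name are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("call_log")


@contextmanager
def logged_call(application: str, target: str, operation: str, records: list):
    """Record the start, outcome, and duration of one outbound call."""
    log.info("start app=%s target=%s op=%s", application, target, operation)
    started = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the caller still sees the original failure
    finally:
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        records.append(
            {"app": application, "target": target, "op": operation,
             "status": status, "ms": elapsed_ms}
        )
        log.info("end app=%s target=%s op=%s status=%s ms=%.2f",
                 application, target, operation, status, elapsed_ms)


if __name__ == "__main__":
    records = []
    with logged_call("search", "item-db-07", "select_item", records):
        pass  # the real database call would go here
    print(records)
```

Because every record names both the calling application and the target, the same data answers both questions raised in the conversation: which applications talk to a given database, and which databases a given application talks to.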
Randy Shoup 00:13:04 Like what applications are talking to what databases and what backend services, and like debug and diagnose from that perspective. And then for a given application, what databases and backend services are you talking to, and debug that. And then we had monitors on those things and we would notice when databases would have a lot of errors or when databases started being slower than they used to be. And then we implemented what people would now call circuit breakers, where we would notice that, oh, everybody who’s trying to talk to database 1, 2, 3, 4 is seeing it slow down. I guess 1, 2, 3, 4 is unhappy. So now flip everybody to say, don’t talk to 1, 2, 3, 4. And like just that kind of stuff, you’re not going to be able to serve. But whatever, that’s better than stopping everything. So I hope that makes sense. So all these, all these like modern resilience techniques, we had our own proprietary names for them, but we implemented a lot of them way back when.
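Editor's illustration: a circuit breaker of the kind mentioned here stops calling an unhealthy dependency and serves a degraded response instead. The Python sketch below is a minimal, single-process version under assumptions. The class name, thresholds, and cooldown are hypothetical, and it counts failures only, whereas the system described also reacted to slowness observed across many callers.

```python
import time


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through (often called "half-open").
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def call(self, func, fallback):
        """Run func unless the circuit is open; use fallback when skipped or failed."""
        if not self.allow():
            return fallback()
        try:
            result = func()
        except Exception:
            self.record_failure()
            return fallback()
        self.record_success()
        return result
```

The fallback is where "you're not going to be able to serve" that feature shows up: the page renders without the data from the unhealthy database rather than the whole request failing.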
Jeremy Jung 00:14:02 Yeah. And I guess just to contextualize it for the audience, I mean this was back in 2004?
Randy Shoup 00:14:09 No, this was 2000.
Jeremy Jung 00:14:10 Oh, back in 2000. Okay.
Randy Shoup 00:14:11 Yeah. Again, because we had this, sorry to interrupt you because we had the problem so that we were just talking about where many applications are talking to many services and databases and we didn’t know what was going on. And so we needed some visibility into what was going on. Sorry, go ahead.
Jeremy Jung 00:14:25 Yeah. Okay. So all the way back in 2000, there’s a lot less services out there. Like nowadays you think about so many software as a service products. If you were building the same thing today, what are some of the services that people today would just go and say like, oh, I’ll just, I’ll just pay for this and have this company handle it for me. That wasn’t available then.
Randy Shoup 00:14:47 Yeah, sure. Well, there were no, essentially no, well there was no Cloud. Cloud didn’t happen until 2006 and there were a few software as a service vendors like Salesforce existed at the time, but they weren’t usable in the way you’re thinking of where I could give you money and you would operate a technical or technological software service on my behalf. You know what I mean? So we didn’t have any of the monitoring vendors. We didn’t have any of the stuff today. So yeah. So what would we do to solve that specific problem today? I would, as we do today at eBay, I would instrument everything with OpenTelemetry because that’s generic. Thank you, Ben Sigelman and Lightstep for starting that whole open sourcing process of that thing and getting all the vendors to respect it.
Randy Shoup 00:15:34 And then I would choose for my back end, I would choose one of the very many wonderful distributed tracing vendors of which there are so many, I can’t remember. Like Lightstep is one, Honeycomb, dot dot dot. There are a bunch of backend distributed tracing vendors in particular for that. What else do you have today? I mean, we could go on for hours on this one, but like, we didn’t have distributed logging, or we didn’t have like logging vendors. So there was no Splunk, there was no, any of those distributed log or centralized logging vendors. So we didn’t have any of those things. We were like cavemen; we built our own data centers. We racked our own servers. We installed all the OSs on them. By the way, we still do all that because it’s way cheaper for us at our scale to do that. But happy to talk about that too. Anyway, but yeah, I don’t know if this is where you want to go. In 2022, the software developer has this massive menu of options. If you only have a credit card, and it doesn’t usually cost that much, you can get a lot of stuff done from the Cloud vendors, from the software service vendors, et cetera, et cetera. And none of that existed in 2000.
Jeremy Jung 00:16:44 It’s really interesting to think about how different, I guess the development world is now, like, because you mentioned how Cloud wasn’t even really a thing until 2006. All these vendors that people take for granted, none of them existed. And so it’s just, it must have been a very, very different time.
Randy Shoup 00:17:03 Well, every year is better than the previous year. In software, every year. So at that time we were really excited that we had all the tools and capabilities that we did have. And also you look back from 20 years in the future and it looks caveman from that perspective, but all those things were cutting edge at the time. What happened really was the big companies rolled their own. Everybody built their own data centers, racked their own servers, at least at scale. And the best you could hope for, the most you could pay anybody else to do, is rack your servers for you. You know what I mean? Like there were external people, and they still exist, a lot of them, the Rackspaces, Equinixes, etc. of the world. Like they would have a co-location facility, you ask them, please I’d like to buy these specific machines and please rack these specific machines for me and connect them up on the network in this particular way. That was the thing you could pay for. But you pretty much couldn’t pay them to put software on there for you. That was your job, and then operating it was also your job. If that makes sense.
Jeremy Jung 00:18:04 And then back then, would that be where employees would actually have to go to the data center and then put in their Windows CD or their Linux CD and, actually do everything right there.
Randy Shoup 00:18:17 Yeah 100%. In fact, again anybody who operates data centers, I mean, there’s more automation, but conceptually, when we run three data centers ourselves at eBay right now all of our software runs on them. So like we have those physical data centers. We have employees that physically work in those things, physically rack and stack the servers again, we’re smarter about it now. Like we buy a whole rack, we roll the whole rack in and cable it with one big, kachunk sound as distinct from individual wiring and the networks are different and better. So there’s a lot less like individual stuff, but at the end of the day, but yeah, everybody in quotes, everybody at that time was doing that or paying somebody to do exactly that. Right?
Jeremy Jung 00:18:58 Yeah. And it’s interesting too, that you mentioned that it’s still being done by eBay. You said you have three data centers because it seems like now maybe it’s just assumed that someone’s using a Cloud service they’re using AWS or whatnot. And so, oh, go ahead.
Randy Shoup 00:19:16 Well, I was going to riff off what you said, how the world has changed. I mean, and so much, right? So I’ve been, it’s fine, you didn’t need to say my whole LinkedIn, but like I used to work on Google Cloud. So I’ve been a Cloud vendor, and at a bunch of previous companies I’ve been a Cloud consumer: Stitch Fix and WeWork and other places. So I’m fully aware, fully personally aware of all that stuff. But yeah, I mean, eBay is at the size where it is actually cost effective, very cost effective (can’t tell you more than that) for us to operate our own infrastructure. Right? So, no one would expect, if Google didn’t operate their own infrastructure, nobody would expect Google to use somebody else’s, right? Like that doesn’t make any economic sense.
Randy Shoup 00:19:54 And Facebook is in the same category. For a while, Twitter and PayPal have been in that category. So there’s like this, there are the known hyperscalers, right? The Google, Amazon, Microsoft that are like Cloud vendors in addition to consumers, internally, of their own Clouds. And then there’s a whole class of other places that operate their own internal Clouds, in quotes, but don’t offer them externally. And again, Facebook or Meta is one example, eBay’s another. Dropbox actually famously started in the Cloud and then found it was much cheaper for them to operate their own infrastructure, again, for the particular workloads that they had. So, yeah, there’s probably, I’m making this up, call it two dozen around the world of these, I’m making this term up, mini hyperscalers, right? Like self hyperscalers or something like that. And eBay’s in that category.
Jeremy Jung 00:20:46 I know this is kind of a big, what if, but you were saying how once you reach a certain scale, that’s when it makes sense to move into your own data center. And I’m wondering if eBay had started more recently, like, let’s say in the last 10 years. I wonder if it would’ve made sense for it to start on a public Cloud and then move to its own infrastructure after it got bigger or if it really did make sense to just start with your own infrastructure from the start.
Randy Shoup 00:21:18 Oh, I’m so glad you asked that. The answer is obvious, but like, I’m so glad you asked that because I love to make this point. No one should ever, ever start by building your own servers and your own Cloud. Like no, you should be so lucky after years and years and years that you outgrow the Cloud vendors. Right? It happens, but doesn’t happen that often. It happens so rarely that people write articles about it when it happens. Do you know what I mean? Like Dropbox is a good example. So yes, 100%, anytime. Where are we, 2022? Anytime in more than the last 10 years. Yeah. Let’s call it 2010, 2012, right? When Cloud had proved itself many times over. Anybody who starts since that time should absolutely start in the public Cloud, there’s no argument about it.
Randy Shoup 00:22:04 And again, one should be so lucky that over time you’re seeing successive zeros added to your Cloud bill and it becomes so many zeros that it makes sense to shift your focus toward building and operating your own data centers. And I haven’t been part of that transition. I’ve been the other way. At other places where I’ve migrated from owned data centers and colos into public Cloud, that’s the more common migration. And again, there are a handful, maybe not even a handful, of companies that have migrated away, but when they do, they’ve done all the math, right. I mean, Dropbox has done some great talks and articles about their transition and boy, the math makes sense for them. So. Yep.
Jeremy Jung 00:22:46 Yeah. And it also seems like maybe it’s for certain types of businesses where moving off of public Cloud makes sense. Like you mentioned Dropbox where so much of their business is probably centered around storage or centered around bandwidth and there’s probably certain workloads that it’s like need to leave public Cloud earlier.
Randy Shoup 00:23:06 Yeah. I think that’s fair. I think that’s an insightful comment. Again, it’s all about the economics. At some point it’s a big investment, and it takes years to develop, forget the money that you’re paying people, but like just to develop the internal capabilities. They’re very specialized skill sets around building and operating data centers. So like it’s a big deal. And so are there particular classes of workloads where you would, for the same dollar figure or whatever, migrate earlier or later? I’m sure that’s probably true. And again, one can absolutely imagine. Well, let’s say Dropbox in this example. Yeah. It’s because like they need to go direct to the storage. I mean, like they want to remove every middle person from the flow of the bytes that are coming into the storage media, and it makes perfect sense for them. And when I last understood what they were doing, which was a number of years ago, they were hybrid, right. So they kept the top external layer in public Cloud. And then the storage layer was all custom. I don’t know what they do today, but people could check.
Jeremy Jung 00:24:11 And kind of coming back to your first time at eBay, is there anything you felt that you would’ve done differently with the knowledge you have now, but with the technology that existed then?
Randy Shoup 00:24:25 Gosh, that’s the 20/20 hindsight. The one that comes to mind is the one we touched on a little bit, but I’ll say it more starkly. If I could go back in time 20 years and say, Hey, we’re about to do this V3 transition at eBay. I would have had us move directly to what we would now call microservices in the sense that individual services own their own data storage and are only interacted with through the public interface. There’s a famous Amazon memo around that same time. So Amazon did the transition from a monolith into what we would now call microservices over about a 4-5 year period, 2000 to 2005. And there’s a famous Jeff Bezos memo, from the early part of that, where seven requirements, I can’t remember them, but essentially it was, you may never talk to anybody else’s database.
Randy Shoup 00:25:20 You may only interact with other services through their public interfaces. I don’t care what those public interfaces are. So they didn’t standardize around CORBA or JSON or gRPC, which didn’t exist at the time. Like they didn’t standardize around any particular interaction mechanism, but you did need to, again, have this kind of microservice capability. That’s modern terminology, where services own their own data and nobody can talk in the back door. So that is the one architectural thing that I wish, with 20/20 hindsight, that I would bring back in my time travel to 20 years ago. Because that does help a lot. And to be fair, Amazon was pioneering in that approach. And a lot of people internally and externally from Amazon, I’m told, didn’t think it would work, and it did, famously. So that’s, that’s the thing I would do. Yeah.
Jeremy Jung 00:26:09 I’m glad you brought that up because when you had mentioned that, I think you said there were 220 applications or something like that, at certain scales people might think like, oh, that sounds like microservices to me. But when you mentioned that microservice to you means it having its own data store, I think that’s a good distinction to bring up.
Randy Shoup 00:26:30 Yeah. So I talk a lot about microservices and have for a decade or so. Yeah. I mean, several of the distinguishing characteristics are: the micro in microservices is size and scope of the interface, right? So you can have a service-oriented architecture with one big service or some very small number of very large services. But the micro in microservice means this thing maybe doesn’t have one operation, but it doesn’t have a thousand, and the several or the handful or several handfuls of operations are all about this one particular thing. So that’s the one part of it. And then the other part of it that is critical to the success of that is owning your own data storage. So each service, again, it’s hard to do this without a diagram, but like imagine the bubble of the service surrounding the data storage, right? So like people, anybody from the outside, whether they’re interacting synchronously, asynchronously, messaging, synchronous, whatever, HTTP, doesn’t matter, are only interacting with the bubble and never getting inside where the data is. I hope that makes sense.
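Editor's illustration: the "bubble" picture, where a service surrounds its own data and outsiders only use its interface, can be sketched in a few lines. This is a minimal in-process Python sketch under assumptions. The `UserService` name, its three operations, and the in-memory dictionary standing in for the service's private database are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: int
    email: str


class UserService:
    """Owns user data; callers use these methods and never touch the storage directly."""

    def __init__(self):
        self._users = {}  # private storage; stands in for the service's own database

    def create_user(self, user_id: int, email: str) -> User:
        if user_id in self._users:
            raise ValueError(f"user {user_id} already exists")
        user = User(user_id, email)
        self._users[user_id] = user
        return user

    def get_user(self, user_id: int) -> User:
        return self._users[user_id]  # raises KeyError for unknown users

    def change_email(self, user_id: int, email: str) -> User:
        # Every write goes through the service, so there is one place to validate it.
        if "@" not in email:
            raise ValueError("invalid email")
        current = self._users[user_id]  # raises KeyError for unknown users
        updated = User(current.user_id, email)
        self._users[user_id] = updated
        return updated
```

The interface is small (a handful of operations about one thing), and because no other component can reach `_users`, the question "who changed that column?" has exactly one possible answer.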
Jeremy Jung 00:27:32 Yeah. I mean it’s kind of in direct contrast to before you were talking about how you had all these databases that all these services shared. So it was probably hard to kind of keep track of who had modified data. One service could modify it, then another service could go to get data out and it’s been changed, but it didn’t change it. So it could be kind of hard to track what’s going on.
Randy Shoup 00:27:53 Yeah, exactly. Integration at the database level is something that people have been doing since probably the 1980s. And so again, in retrospect it looks like a caveman approach. It was pretty advanced at the time, actually, even the idea of sharding, of “Hey, there are users and the users live in databases, but they don’t all live in the same one”. They live in 10 different databases or 20 different databases. And then there’s this layer that, for this particular user, figures out which of the 20 databases it’s in and finds it and gets it back. And that was all pretty advanced. And by the way, all those capabilities still exist. They’re just hidden from everybody behind nice, simple, software as a service interfaces. Anyway, but that takes nothing away from your excellent point, which is, yeah, again, when there’s this many-to-many relationship between applications and databases and there’s shared mutable state in those databases, that’s bad. It’s not bad to have state, it’s not bad to have mutable state, it’s bad to have shared mutable state.
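Editor's illustration: the routing layer described here, which works out which of the 20 databases holds a particular user, can be sketched with simple modulo routing. This is a minimal Python sketch under assumptions. The shard names and the modulo scheme are hypothetical, since the transcript does not describe eBay's actual mapping, and in-memory dictionaries stand in for the databases.

```python
# Hypothetical shard names; the transcript does not describe eBay's actual scheme.
USER_SHARDS = [f"user-db-{n:02d}" for n in range(20)]


def shard_for_user(user_id: int, shards=USER_SHARDS) -> str:
    """Pick the database that holds a given user with simple modulo routing."""
    return shards[user_id % len(shards)]


class ShardedUserStore:
    """Route reads and writes for a user to the single shard that owns that user."""

    def __init__(self, shards=USER_SHARDS):
        self.shards = list(shards)
        self.databases = {name: {} for name in self.shards}  # in-memory stand-ins

    def put(self, user_id: int, record: dict) -> None:
        self.databases[shard_for_user(user_id, self.shards)][user_id] = record

    def get(self, user_id: int):
        return self.databases[shard_for_user(user_id, self.shards)].get(user_id)
```

Modulo routing is the simplest choice but makes changing the number of shards expensive, because most keys would move. A lookup table from key ranges to shards is a common alternative when shards need to be split or rebalanced.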
Jeremy Jung 00:28:58 Yeah. And I think anybody who’s kind of interested in learning more about the, you had talked about sharding and things like that. If they go back and listen to your first appearance on Software Engineering Radio, it kind of struck me how you were talking about sharding and, and how it was something that was kind of unique or unusual. Whereas today it feels like it’s very, I don’t know, quaint is the right word, but it’s like, it’s something that people kind of are accustomed to now.
Randy Shoup 00:29:24 Yeah. It seems obvious in retrospect. Yeah. At the time, and by the way, eBay didn’t invent sharding. As I said in 2007, Google and Yahoo and Amazon and, it was the obvious, took a while to reach it. But it’s one of those things where once people have the brainwave to see, “Oh, you know what? We don’t actually have to store this in one database. We can chop that database up into chunks that look self-similar.” That was reinvented by lots of the big companies at the same time, again because everybody was solving that same problem at the same time. But yeah, when you look back and, I mean, honestly, like everything that I said there, it’s still like, all the techniques about how you shard things. It’s not interesting anymore because the problems have been solved, but all those solutions are still the solutions. If that makes any sense?
Jeremy Jung 00:30:14 For sure. I mean I think anybody who goes back and listens to it. Yeah. Like you said, it’s very interesting because it all still applies. And it’s like, I think the solutions that are kind of interesting to me are ones where it’s things that could have been implemented long ago, but we just later on realized like this is how we could do it.
Randy Shoup 00:30:36 Well, part of it is, as we grow as an industry, we discover new problems. Sharding over databases is only a problem when one database doesn’t work, when the load that you put on that database is too big or you want the availability of multiple. And so that’s not a day one problem, right? That’s a day two or day 2000 kind of problem, right? And so a lot of these things, well, it’s software. So we could have done any of these things in older languages and older operating systems with older technology. But for the most part we didn’t have those problems, or not enough people had the problem for us to have solved it as an industry, if that makes any sense?
Jeremy Jung 00:31:21 Yeah. No, that’s a good point because you think about when Amazon first started and it was just a bookstore. Right? And the number of people using the site were, who knows it was, it might have been tens a day or hundreds a day. I don’t know. And so like you said, the problems that Amazon has now in terms of scale are just like, it’s a completely different world than when they started.
Randy Shoup 00:31:43 Yeah. I mean, probably I’m making it up, but I don’t think that’s too off to say that it’s a billion times more, their problems are a billionfold from what they were.
Jeremy Jung 00:31:53 The next thing I’d like to talk about is, you came back to eBay, I think, has it been about two years ago?
Randy Shoup 00:32:02 Two years. Yeah.
Jeremy Jung 00:32:03 Yeah. And so tell me about the experience of coming back to an organization that you had been at 10 years prior or however long it was like, how is your onboarding different when it’s somewhere you’ve been before?
Randy Shoup 00:32:18 Yeah, sure. So like you said, I worked at eBay from 2004 to 2011 and I worked in a different role than I have today. I worked mostly on eBay’s search engine and then I left to co-found a startup, which was in the 99% instead of the 1%, like it didn’t really do much. I worked at Google in the early days of Google Cloud, as I mentioned, on Google App Engine, and had a bunch of other roles including more recently, like you said, Stitch Fix and WeWork, leading those engineering teams. And so coming back to eBay as Chief Architect and leading, essentially, the developer platform part of eBay. What was the onboarding like? I mean, lots of things had changed in the intervening 10 years or so, and lots had stayed the same, not in a bad way, but just some of the technologies that we use today are still some of the technologies we used 10 years ago. A lot has changed though.
Randy Shoup 00:33:08 A bunch of the people are still around. So there’s something about eBay that people tend to stay a long time. It’s not really very strange for people to be at eBay for 20 years. In my particular team of, let’s call it, 150, there are four or five people that have crossed their 20-year anniversary at the company. And I rejoined with a bunch of other boomerangs, as is the term we use internally. So that’s including the CEO, by the way. So sort of bringing the band back together, a bunch of people that had gone off and worked at other places have come back for various reasons over the last couple of years. So it was both a lot of familiarity, a lot of unfamiliarity, a lot of familiar faces. Yep.
Jeremy Jung 00:33:47 So I mean, having these people who you work with still be there and actually coming back with some of those people, what were some of the big, I guess, advantages or benefits you got from those existing connections?
Randy Shoup 00:34:01 Yeah. Well, as with all things, everybody can imagine getting back together with friends that they had from high school or university, or like, people had some schooling at some point, and you get back together with those friends and there’s this implicit trust in most situations, because you went through a bunch of stuff together and you knew each other a long time ago. And so that definitely helps when you’re returning to a place where, again, there are a lot of familiar faces, where there’s a lot of trust built up. And then it’s also helpful because eBay’s a pretty complicated place. Ten years ago it was too big to hold in any one person’s head, and it’s even harder to hold it in one person’s head now, but to be able to come back and have a little bit of that, well, more than a little bit of that context about, okay, here’s how eBay works.
Randy Shoup 00:34:47 And here are the unique complexities of the marketplace, because it’s very unique in the world. And so yeah, no, I mean it was helpful. It’s helpful a lot. And then also in my current role, my main goal actually is to just make all of eBay better. So we have about 4,000 engineers and my team’s job is to make all of them better and more productive and more successful. And being able to combine knowing the context about eBay and having a bunch of connections to a bunch of the leaders here with 10 years of experience doing other things at other places, that’s helpful, because now there are things that we do at eBay where, okay, well, this other place has that same problem and is solving it in a different way. And so maybe we should look into that option.
Jeremy Jung 00:35:34 So you mentioned just trying to make developers’ work or lives easier. You start the job. How do you decide what to tackle first? Like how do you figure out where the problems are or what to do next?
Randy Shoup 00:35:47 Yeah, that’s a great question. So again, I lead this thing that we internally call the velocity initiative, which is about just giving us the ability to deliver features and bug fixes more quickly to customers, right? And so for that problem, how can we deliver things more quickly to customers and get more customer value and business value? What I did, in collaboration with a bunch of people, is what one would call a value stream map. And that’s a term from lean software and lean manufacturing where you just look end to end at a process and say all the steps and how long those steps take. So a value stream, as you can imagine, is all these steps that are happening, and at the end there’s some value, right? Like we’ve produced some feature or hopefully gotten some revenue or helped out the customer or the business in some way.
Randy Shoup 00:36:38 And so mapping that value stream, that’s what it means. And when you can see the end-to-end process and like really see it in some kind of diagram, you can look for opportunities like, oh, okay, well, I’m making this up, if it takes us a week from when we have an idea to when it shows up on the site, well, some of those steps take five minutes. That’s not worth optimizing, but some of those steps take five days and that is worth optimizing. And so getting some visibility into the system, looking end to end with that kind of view of the system, systems thinking, that will give you the knowledge about the opportunities, about what can be improved. And so that’s what we did.
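As a rough illustration of reading a value stream map, the Python sketch below lists the steps of a delivery process with their durations and ranks them by share of total time. The step names and durations are invented for the example (the conversation gives only "five minutes" versus "five days" as made-up figures), and `Step` and `rank_steps` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    minutes: float


def rank_steps(steps):
    """Return (name, share_of_total_time) pairs, slowest step first."""
    total = sum(step.minutes for step in steps)
    if total <= 0:
        raise ValueError("steps must have a positive total duration")
    ranked = sorted(steps, key=lambda step: step.minutes, reverse=True)
    return [(step.name, step.minutes / total) for step in ranked]


# Invented numbers: one step takes five days, another takes five minutes.
value_stream = [
    Step("run unit tests", 5),
    Step("wait for code review", 5 * 24 * 60),
    Step("build", 90),
    Step("deploy to staging", 30),
]

for name, share in rank_steps(value_stream):
    print(f"{name}: {share:.1%} of elapsed time")
```

With numbers like these, nearly all of the elapsed time sits in one waiting step, which is the kind of signal the map is meant to surface.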
Randy Shoup 00:37:17 And we didn’t talk with all 4,000 engineers or all whatever, half a thousand teams or whatever we had, but we sampled a few. And after we talked with three teams, we were already hearing a bunch of the same things. So we were hearing in the whole product life cycle, which I like to divide into four stages. I like to say, there’s Planning. How does an idea become a project or a thing that people work on? Software Development, how does a project become committed code? Software Delivery, how does committed code become a feature that people actually use? And then what I call Post-release Iteration, which is, okay, it’s now out there on the site and we’re turning it on and off for individual users. We’re learning in analytics and usage in the real world and experimenting. And so there were opportunities at eBay at all four of those stages, which I’m happy to talk about, but what we ended up seeing again and again is that that software delivery part was our current bottleneck.
Randy Shoup 00:38:12 So again, that’s the, how long does it take from when an engineer commits her code to when it shows up as a feature on the site? And two years ago, before we started the work that I’ve been doing for the last two years with a bunch of people, on average at eBay, it was like a week and a half. So it’d be a week and a half between when someone’s finished and then it gets code reviewed and dot, dot, dot, it gets rolled out. It gets tested, all that stuff. It was essentially 10 days. Now, for the teams that we’ve been working with, it’s down to two. So we used a lot of what people may be familiar with from the Accelerate book. So it’s called Accelerate, by Nicole Forsgren, Jez Humble and Gene Kim, 2018.
Randy Shoup 00:38:50 Like if there’s one book anybody should read about software engineering, it’s that. So please read Accelerate. It summarizes almost a decade of research from the State of DevOps reports, which the three people that I mentioned led. So Nicole Forsgren is a doctor. She’s a PhD in data science. She knows how to do all this stuff. Anyway, so when your problem happens to be software delivery, the Accelerate book tells you all the kind of continuous delivery techniques, trunk-based development, all sorts of stuff that you can do to solve those problems. And then there are also four metrics that they use to measure the effectiveness of an organization’s software delivery. So people might be familiar with Deployment Frequency, how often are we deploying a particular application. Lead Time for Change, that’s that time from when a developer commits her code to when it shows up on the site. Change Failure Rate, which is when we deploy code, how often do we roll it back or hot fix it, or there’s some problem that we need to address. And then Mean Time to Restore, which is when we have one of those incidents or problems, how quickly can we roll it back or do that hot fix.
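To make the four metrics concrete, here is a minimal Python sketch that computes them from a list of deployment records. The `Deployment` record shape and the `four_key_metrics` function are assumptions made for this illustration; they are not eBay’s tooling or an official definition from the Accelerate book, and using the median for lead time is a choice made here, not something stated in the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime            # when the change was committed
    deployed_at: datetime             # when it reached production
    failed: bool = False              # needed a rollback or hot fix
    restored_at: Optional[datetime] = None  # when service was restored


def four_key_metrics(deployments, window_days):
    """Summarize one application's deployments over a time window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    mean_restore = (
        sum(restore_times, timedelta()) / len(restore_times)
        if restore_times
        else None
    )
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": mean_restore,
    }
```

A team would feed this one record per production deployment; the two speed metrics come from the commit and deploy timestamps, and the two quality metrics from the failure flag and restore timestamp.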
Randy Shoup 00:39:54 And again, the beauty of Nicole Forsgren’s research summarized in the Accelerate book is that the science shows that companies cluster. In other words, mostly the organizations that are not good at deployment frequency and lead time are also not good at the quality metrics of mean time to restore and change failure rate, and the companies that are excellent at deployment frequency and lead time are also excellent at mean time to restore and change failure rate. So companies or organizations divide into these four categories. So there’s low performers, medium performers, high performers, and then elite performers, and eBay, on average at the time, and still on average, is solidly in that medium performer category. And what we’ve been able to do with the teams that we’ve been working with is we’ve been able to move those teams to the high category. So just super briefly, and I will give you a chance to ask some more questions, but in the low category, all those things are kind of measured in months, right?
Randy Shoup 00:40:53 So how often are we deploying? Measure that in months. How long does it take us to get a commit to the site? Measure that in months. And then the medium performers are like, everything’s measured in weeks, right? So we would deploy once every couple weeks or once a week, lead time is measured in weeks, etc. For the high performers, things are measured in days, and for the elite performers, things are measured in hours. And so you can see there’s like order of magnitude improvements when you move from one of those clusters to another cluster. Anyway, so what we were focused on, again, because our problem was with software delivery, was moving the whole set of teams from that medium performer category where things are measured in weeks to the high performer category where things are measured in days.
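The "months, weeks, days, hours" rule of thumb can be written down as a tiny classifier. The exact cutoffs below (one day, one week, thirty days) are assumptions chosen to match the units mentioned in the conversation; they are not the published thresholds from the State of DevOps research, and `performance_tier` is a hypothetical name.

```python
from datetime import timedelta


def performance_tier(lead_time: timedelta) -> str:
    """Bucket a lead time by the unit it is naturally measured in."""
    if lead_time < timedelta(days=1):
        return "elite"   # measured in hours
    if lead_time < timedelta(weeks=1):
        return "high"    # measured in days
    if lead_time < timedelta(days=30):
        return "medium"  # measured in weeks
    return "low"         # measured in months


# The two lead times mentioned in the conversation: ten days before,
# two days after for the teams in the pilot.
print(performance_tier(timedelta(days=10)))
print(performance_tier(timedelta(days=2)))
```

Under these assumed cutoffs, a ten-day lead time falls in the medium bucket and a two-day lead time in the high bucket, consistent with the move described above.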
Jeremy Jung 00:41:39 Throughout all this, you said the big thing that you focused on was the delivery time. So somebody wrote code and they felt that it was ready for deployment, but for some reason it took 10 days to actually get out to the actual site. So I wonder if you could talk a little bit about maybe a specific team or a specific application where, where was that time being spent? You said you moved from 10 days to two days. What was happening in the meantime?
Randy Shoup 00:42:06 Yeah, no, that’s a great question. Thank you. Yeah, so okay, we looked end to end at the process and we found that software delivery was the first place to focus. And then there are other issues in other areas, but we’ll get to them later. So then to improve software delivery, we asked individual teams, we had some conversation like I’m about to say. So we said, Hi, it looks like you’re deploying kind of once or twice a month. If I told you, you had to deploy once a day, tell me all the reasons why that’s not going to work. And the teams are like, oh, of course, well, build times take too long. And the deployments aren’t automated and our testing is flaky. So we have to retry it all the time and dot, dot, dot.
Randy Shoup 00:42:44 And we said, Great! You just gave my team our backlog. Right? So rather than just, like, let’s complain about it, which, it’s legit for the teams to complain, we were able, because again the developer platform is part of my team, to say, great, you just told us all your top issues, or your impediments, as we say, and we’re going to work on them with you. And so every time we had some idea about, well, I bet we can use Canary deployments to automate the deployment, which we have now done, we would pilot that with a bunch of teams, we’d learn what works and doesn’t work. And then we would roll that out to everybody. So what were the impediments? It was a little bit different for each individual team, but in sum, the things we ended up focusing on, or have been focusing on, are build times. So we build everything in Java still.
Randy Shoup 00:43:29 And even though we’re generation five, as opposed to that generation three that I mentioned still build times for a lot of applications were taking way too long. And so we spent a bunch of time improving those things and we were able to take stuff from hours down to single digit minutes. So that’s a huge improvement to developer productivity. We made a lot of investment in our continuous delivery pipelines. So making all the automation around deploying something to one environment and checking it there, then deploying it into a common staging environment and checking it there and then deploying it from there into the production environment. And then rolling it out via this Canary mechanism, we invested a lot in something that we call traffic mirroring, which we didn’t invent, but other places have a different name for this.
Randy Shoup 00:44:12 I don’t know that there’s a standard industry name. Some people call it shadowing, but the idea is I have a change that I’m making, which is not intended to change the behavior. Like lots of changes that we make, bug fixes, et cetera, upgrading to new open-source dependencies, whatever, changing the version of the framework. There’s a bunch of changes that we make regularly day to day as developers, which are like refactorings, kind of, where we’re not actually intending to change the behavior. And so traffic mirroring was our idea of you have the old code that’s running in production and you fire a production request at that old code and it responds. But then you also fire that request at the new version and compare the results: did the same JSON come back between the old version and the new version?
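The comparison step in traffic mirroring can be sketched as follows. This is a simplified illustration under stated assumptions, not eBay’s system: `mirror`, `old_handler`, and `new_handler` are hypothetical names, handlers are modeled as plain functions that return JSON text, and the mirrored call is made inline, whereas a production setup would typically send it asynchronously and ignore fields that are expected to differ, such as timestamps.

```python
import json
from typing import Callable, Tuple

# A handler takes a request (modeled as a dict) and returns JSON text.
Handler = Callable[[dict], str]


def mirror(request: dict, old_handler: Handler,
           new_handler: Handler) -> Tuple[str, bool]:
    """Serve from the old version; report whether the new version agrees."""
    old_body = old_handler(request)  # this is what the user receives
    try:
        new_body = new_handler(request)
        # Compare parsed JSON so key order and whitespace do not matter.
        matches = json.loads(old_body) == json.loads(new_body)
    except Exception:
        # A crash or invalid JSON in the new version counts as a mismatch
        # and must never affect the response served to the user.
        matches = False
    return old_body, matches


def old_version(request: dict) -> str:
    return json.dumps({"item": request["id"], "price": 10})


def new_version(request: dict) -> str:
    # Same data, different key order: not a behavior change.
    return json.dumps({"price": 10, "item": request["id"]})


body, same = mirror({"id": 42}, old_version, new_version)
print(same)
```

Because the user only ever sees the old version’s response, a mismatch is just a signal to investigate before the new version takes real traffic.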
Randy Shoup 00:44:54 That’s a great way kind of from the outside to sort of black box detect any unintended changes in the behavior. And so we definitely leverage that very, very aggressively. We’ve invested in a bunch of other things, but all those investments are driven by what the particular teams tell us is getting in their way. And there are a bunch of things that the teams themselves have been motivated to do. So my team’s not the only one that’s making improvements. Teams have moved from branching development to trunk-based development, which makes a big difference, and to making sure that PR approvals and code reviews are happening much more regularly. So a thing that some teams have started doing is, immediately after standup in the morning, everybody does all the code reviews that are waiting. And so things don’t drag on for 2-3 days, everybody kind of works on that much more quickly. Teams are building their own automations for things like testing, site speed, and accessibility and all sorts of stuff. So, like all the things that a team goes through in the development and rollout of their software, we’ve been spending a lot of time automating and making leaner, making more efficient.
Jeremy Jung 00:45:59 So some of those, it sounds like the PR example is really on the team. Like you’re telling them, hey, this is something that you internally should change in how you work. For things like improving the build time and things like that, did you have like a separate team that was helping these teams speed that process up? Or what was that like?
Randy Shoup 00:46:21 Yeah. Great. I mean, those two examples you give are, like you say, very different. So I’m going to start from, we just simply showed everybody, here’s your deployment frequency for this application. Here’s your lead time for this application. Here’s your change failure rate and here’s your mean time to restore. And again, as I didn’t mention before, all the State of DevOps research in the Accelerate book proves that by improving those metrics, you get better engineering outcomes and you also get better business outcomes. So like it’s scientifically proven that improving those four things matters. Okay. So now we’ve shown to teams, Hey, we would like you to improve for your own good, but more broadly at eBay, we would like the deployment frequency to be faster. And we would like the lead time to be shorter. And the insight there is when we deploy smaller units of work, when we don’t batch up a week’s worth of work, a month’s worth of work, it’s much less risky to just deploy like an hour’s worth of work.
Randy Shoup 00:47:19 And the insight is the hour’s worth of work fits in your head. And if you roll it out and there’s an issue, first off rolling back’s no big deal, because you’ve only lost an hour of work for a temporary period of time. But also like you never have this thing, like what in the world broke? Because like with a month’s worth of work, there’s a lot of things that changed and a lot of stuff that could break. But with an hour’s worth of work, it’s only like one change that you made. So if something happens, like it’s pretty much, pretty much guaranteed to be that thing. Anyway that’s the backstory and so yeah. Then we were just working with individual teams. Oh yeah, so the teams were motivated to see what’s the biggest bang for the buck in order to improve those things.
Randy Shoup 00:47:57 How can we improve those things? And again, some teams were saying, well you know what? A huge component of that lead time between when somebody commits and it’s a feature on the site, a huge percentage of that, maybe multiple days, is like waiting for somebody to code review. Okay, great. We can just change our team agreements and our team behavior to make that happen. And then yes, to answer your question, there were the other things, like building the Canary capability and traffic mirroring and build time improvements. Those were done by central platform and infrastructure teams, some of which were in my group and some of which are in peer groups in my part of the organization. So yeah. So I mean, like providing the generic tools and generic capabilities, those are absolutely things that a platform organization does.
Randy Shoup 00:48:41 Like that’s our job and we did it. And then there are a bunch of other things that are around kind of team behavior and how you approach building a particular application that are and should be completely in the control of the individual teams. And we were trying not to be, not trying not to be, we were definitely not being super prescriptive. Like we didn’t come in and say, by next Tuesday, we want you to be doing trunk-based development, by the Tuesday after that we want to see test-driven development, dot, dot, dot. We would just offer to teams, here’s where you are. Here’s where we know you can get, because like we work with other teams and we’ve seen that they can get there. We just work together on, well, what’s the biggest bang for the buck and what would be most helpful for that team? So it’s like a menu of options and you don’t have to take everything off the menu, if that makes sense.
Jeremy Jung 00:49:26 And how did that communication flow from you and your team down to the individual contributor? Like you have, I’m assuming you have engineering managers and technical leads and all these people sort of in the chain. How does it actually go through that?
Randy Shoup 00:49:40 Thanks for asking that. Yeah. I didn’t really say how we work as an initiative. So there are a bunch of teams that are involved and we have every Monday morning, so just so happens it’s late Monday morning today. So we already did this a couple hours ago, but once a week we get all the teams that are involved, both like the platform kind of provider teams and also the product, or we would say domain like consumer teams. And we do a quick scrum of scrums, like a big old kind of stand up. What have you all done this week? What are you working on next week? What are you blocked by kind of idea. And there are probably 20 or 30 teams again, across the individual platform capabilities and across the teams that consume this stuff and everybody gives a quick update and it’s a great opportunity for people to say, oh, I have that same problem too.
Randy Shoup 00:50:29 Maybe we should offline try to figure out how to solve that together. Or you built a tool that automates the site speed stuff, that’s great. I would so love to have that. And so this weekly meeting has been a great opportunity for us to share wins, share help that people need, and then get teams to help each other. And also, similarly, one of the platform teams would say something like, Hey, we’re about to be done, or in beta, let’s say, with this new Canary capability, I’m making this up, anybody want to pilot that for us? And then you get a bunch of hands raised, Oh, we would be very happy to pilot that, that would be great. So that’s how we communicate back and forth. And it’s kind of like engineering managers are the kind of level that are involved in that typically. So it’s not individual developers, but it’s like somebody on most every team, if that makes any sense. So, that’s kind of how we do that communication back to the individual developers, if that makes sense.
Jeremy Jung 00:51:25 So it sounds like you would have, like you said, the engineering manager go to the standup, and you said maybe 20-30 teams, or I’m just trying to get a picture for how many people are in this meeting.
Randy Shoup 00:51:37 It’s like 30 or 40 people.
Jeremy Jung 00:51:38 Okay.
Randy Shoup 00:51:39 And again, it’s quick, right? So it’s an hour. So we just go, boom, boom, boom, boom. And we’ve just developed a cadence of people. Like we have a shared Google doc and like people like write their little summaries of what they’re, what they’ve worked on and what they’re working on. So we’ve over time made it so that it’s pretty efficient with people’s time and pretty, pretty dense in a good way of like information flow back and forth. And then also separately, we meet more in more detail with the individual teams that are involved, again, try to elicit, okay, now here’s where you are. Please let us know what problems you’re seeing with this part of the infrastructure or problems you’re seeing in the pipelines or something like that. And we’re constantly trying to learn and get better and solicit feedback from teams on what we can do differently.
Jeremy Jung 00:52:25 Earlier you had talked a little bit about how there were a few services that got brought over from V2 or V3 eBay, basically more legacy or older services that have been a part of eBay for quite some time. And I was wondering if there were things about those services that made this process different, like in terms of how often you could deploy, or just what were some key differences between something that was made recently versus something that has been with the company for a long time?
Randy Shoup 00:53:00 Sure. I mean, the stuff that’s been with the company for a long time was best in class as of when we built it maybe 15 or sometimes 20 years ago. There’re actually even less than a handful. There are, as we speak, there are two or three of those V3 clusters or applications or services still around and they should be gone and completely migrated away from in the next couple of months. So like, we’re almost at the end of moving all to more modern things, but yeah I mean, again, stuff that was state of the art 20 years ago, which was like deploying things once every two weeks, like that was a big deal in 2000 or 2004. And it’s like, that was fast in 2004 and it’s slow in 2022. So yeah. I mean, what’s the difference?
Randy Shoup 00:53:46 Yeah. I mean, a lot of these things, if they haven’t already been migrated, there’s a reason, and it’s often because they’re way in the guts of something that’s really important. It’s a core part of, I’m making these examples up and they’re not even right, but like it’s a core part of the payments flow. It’s a core part of how sellers get paid. And those aren’t examples, those are modern, but you see what I’m saying? Like stuff that’s really core to the business, and that’s why it’s lasted.
Jeremy Jung 00:54:14 And I’m kind of curious from the perspective of some of these new things you’re introducing, like you’re talking about improving continuous delivery and things like that. When you’re working with some of these services that have been around a long time, are the teams, the rate at which they deploy or the rate at which you find defects, is that noticeably different from services that are more recent?
Randy Shoup 00:54:41 Absolutely. I mean, and that’s true of any legacy at any place. Right? So yeah, I mean, people legitimately have some trepidation, let’s say, about changing something that’s been running the business for a long, long time. And so it’s a lot slower going exactly because it’s not always completely obvious what the implications are of those changes. So we are very careful and we test things a whole lot, and maybe we didn’t write stuff with a whole bunch of automated tests in the beginning. And so there’s a lot of manual stuff there. This is just what happens when you have a company that’s been around for a long time.
Jeremy Jung 00:55:19 Yeah. I guess just to kind of to start wrapping up, as this process of you coming into the company and identifying where the problems are and working on ways to speed up delivery, is there anything that kind of came up that really surprised you? I mean, you’ve been at a lot of different organizations. Is there anything about your experience here at eBay that was very different than what you’d seen before?
Randy Shoup 00:55:45 No, I mean, it’s a great question. I think the thing that’s surprising is how unsurprising it is. Like, the details are different. Like, okay, we have this V3. I mean, we have some uniqueness around eBay, but I think what is maybe pleasantly surprising is all the techniques about how one might notice the things that are going on in terms of, again, deployment frequency, lead time, et cetera, and what techniques you would deploy to make those things better. All the standard stuff applies. So again, all the techniques that are mentioned in the State of DevOps research and in Accelerate, and just all the known good practices of software development, they all apply everywhere. I think that’s the wonderful thing. So like maybe the most surprising thing is how unsurprising or how applicable the standard industry techniques are. I certainly hoped that to be true, but that’s why, I didn’t really say, but we piloted this stuff with a small number of teams, exactly because we thought, and it turned out to be true, that they applied, but we weren’t entirely sure. We didn’t know what we didn’t know. And we also needed proof points not just out there in the world, but at eBay, that these things made a difference, and it turns out they do.
Jeremy Jung 00:56:58 Yeah. I mean, I think it’s easy for people to kind of get caught up and think like, my problem is unique or my organization is unique. But it sounds like in a lot of cases, maybe we’re not so different.
Randy Shoup 00:57:10 I mean, the stuff that works tends to work. Yeah, there’s always some detail, but yeah. I mean, all aspects of the continuous delivery and kind of lean approach to software. I mean, we, the industry, have yet to find a place where they don’t work, seriously, yet to find any place where they don’t work.
Jeremy Jung 00:57:27 If people want to learn more about the work that you’re doing at eBay, or just follow you in general, where should they head?
Randy Shoup 00:57:34 So I tweet semi-regularly at @randyshoup. So my name all one word, R A N D Y S H O U P. I had always wanted to be a blogger. Like there is randyshoup.com and there are some blogs on there, but they’re pretty old. Someday I hope to be doing more writing. I do a lot of conference speaking though. So I speak at the QCon conferences. I’m going to be at the CraftCon in Budapest in a week and a half as of this recording. So you can often find me on Twitter or at software conferences.
Jeremy Jung 00:58:02 All right, Randy. Well, thank you so much for coming back on Software Engineering Radio.
Randy Shoup 00:58:07 Thanks for having me, Jeremy. This is fun.
Jeremy Jung 00:58:09 This has been Jeremy Jung for Software Engineering Radio. Thanks for listening.
[End of Audio]