Xe Iaso of Tailscale discusses how a VPN can be a useful tool when building software. SE Radio host Jeremy Jung spoke with Iaso about what VPNs are, onboarding, access control, authentication in the network vs individual services, peer-to-peer vs centralized VPNs, relay servers, tech stacks, forking the Go compiler, the iOS network extension limit, testing and infrastructure, running your company on your own product, working at Heroku vs Tailscale, and their experience writing technical blog posts.
This transcript was automatically generated. To suggest improvements in the text, please contact firstname.lastname@example.org and include the episode number and URL.
Jeremy Jung 00:00:16 Today I’m talking to Xe Iaso. They’re the archmage of infrastructure at Tailscale, and they also have a great blog everyone should check out. Xe welcome to Software Engineering Radio.
Xe Iaso 00:00:27 Thanks. It’s great to be here.
Jeremy Jung 00:00:29 I think the first thing we should start with is what’s a VPN? Because I think some people, they may have used it to remote into their workplace or something like that, but I think the scope of what it’s good for and what it does is a lot broader than that. So maybe you could talk a little bit about that first.
Xe Iaso 00:00:47 Okay. A VPN is short for virtual private network. It’s basically a fake network that’s overlaid on top of existing networks, and then you can use that network to do whatever you would with a normal computer network. This term has been co-opted by companies that are attempting to get into the, like, hide-my — style market where you know, you encrypt your internet information and keep it safe from hackers. So that makes it really annoying and hard to talk about what a VPN actually is because Tailscale, the company I work for, is closer to like the actual intent of a VPN and not just, you know, like hide your internet traffic that’s already encrypted anyway with another level of encryption and just make a great access point for three-letter agencies.
Jeremy Jung 00:01:37 But are there use cases past that, like when you’re developing a piece of software, why would you decide to use a VPN outside of just because I want my, you know, my workers to be able to get access to this stuff?
Xe Iaso 00:01:52 So, something that’s come up when I’ve been working at Tailscale is that sometimes we’ll make changes to something and it’ll be changes to like the user experience of something on the admin panel or something. So in a lot of other places I’ve worked, in order to have other people test that, you know, you’d have to push it to the Cloud; it would have to spin up a review app in Heroku or some terrifying Terraform abomination would have to put it out onto like an actual cluster or something. But with Tailscale, if your app is running locally, you just give the name of your computer and the port number and other people are able to just see it and poke it and experience it. And that basically turns the feedback cycle from having to wait for the state of the world to converge into: make a change, press F5, give the URL to a coworker, and be like, ‘Hey, is this Gucci?’
Jeremy Jung 00:02:52 They can connect to your app as if you were both connected to the same switch. You don’t have to worry about pushing to a Cloud service or opening ports, things like that.
Xe Iaso 00:03:01 Yep. It will act like it’s in the same room even when they’re not. It’ll even work if you’re both at Starbucks and the Starbucks has reasonable policies, like ‘holy crap don’t allow devices to connect to each other directly.’ So you’re working on like your screenplay app at your Starbucks or something and you have a coworker there and you’re like, Hey, check this out and give them the link. And then you know, they’re also seeing the screenplay editor.
Jeremy Jung 00:03:28 In terms of security and things like that, I’m picturing it kind of like we were sitting in the same room and there’s a switch and we both plugged in. Normally, when you do something like that you kind of have full access to whatever else is on the switch, you know, provided it’s not being blocked by a firewall. Is there like a layer of security on top of that that a VPN service like Tailscale would provide?
Xe Iaso 00:03:54 Yes. There are these things called access control lists, which are kind of like firewall rules except you don’t have to deal with the nightmare of writing an iptables rule that also works in Windows Firewall and whatever they use in macOS. The ACL rules are applied at the tailnet level for every device in the tailnet. So if you have like developer machines, you can put people into groups such as developers and say that developer machines can talk to production, but people in QA can’t; they can only talk to testing. And people on SRE have, you know, permissions to go everywhere, and people within their own teams can connect to each other. You can make more complicated policies like that fairly easily.
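The group-based policy described in that answer can be sketched in a few lines of Python. This is only an illustrative model, not Tailscale's policy format or implementation: the user names, the `Rule` type, the `allowed` function, and the choice to deny anything no rule matches are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    src_group: str  # group the connecting user belongs to
    dst_tag: str    # label on the destination machine, "*" for any


# Hypothetical group membership, loosely following the example in the answer.
GROUPS = {
    "developers": {"alice"},
    "qa": {"bob"},
    "sre": {"carol"},
}

# Hypothetical rules: developers reach production, QA only testing, SRE anything.
RULES = [
    Rule("developers", "production"),
    Rule("qa", "testing"),
    Rule("sre", "*"),
]


def allowed(user: str, dst_tag: str) -> bool:
    """Return True if some rule lets this user connect to the destination.

    Assumption for this sketch: anything not matched by a rule is denied.
    """
    for rule in RULES:
        in_group = user in GROUPS.get(rule.src_group, set())
        if in_group and rule.dst_tag in ("*", dst_tag):
            return True
    return False
```

With these made-up rules, `allowed("bob", "production")` is false while `allowed("carol", "production")` is true, which matches the QA and SRE cases in the answer.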
Jeremy Jung 00:04:40 And when we think about infrastructure for companies, you were talking about how there could be development infrastructure, production infrastructure, and you kind of separate it all out. When you’re working with Cloud infrastructure, a lot of times there’s the — I always forget what it stands for, but there’s like IAM, there’s like policies that you can set up with the Cloud provider that says these users can access this or these machines can access this. And I wonder from your perspective when you would choose to use that versus use something at the network or the VPN level?
Xe Iaso 00:05:14 The way I think about it is that things like IAM enforce permissions for more granularly scoped things like ‘can create EC2 instances’ or ‘can delete EC2 instances’ or something like that. And that’s just kind of a different level of thing. Tailscale ACLs are more, you know, ‘X is allowed to connect to Y’ or, with Tailscale SSH, ‘X is allowed to connect as user Y.’ And that’s really different than like arbitrary capability things like IAM offers. You could think about it as an IAM system, but the main thing it exposes is: can X connect to Y on port Zed?
Jeremy Jung 00:05:55 What are some other use cases where if you weren’t using a VPN you’d have to do a lot more work or there’s a lot more complexity kind of what are some cases where it’s like okay, using a VPN here makes a lot of sense.
Xe Iaso 00:06:08 There is a service internal at Tailscale called go links, which is a clone of Google’s so-called go links, where it’s basically a URL shortener that lives at http://go and, you know, you have go/something to get to some internal admin service or another thing to get to like, you know, the company directory in Notion or something. And this kind of thing you could do with a normal setup. You know, you could set it up and have to do OAuth challenges everywhere and have to make sure that everyone has the right DNS configurations so that it shows up in the right place. And then you’d have to deal with HTTPS because OAuth requires HTTPS for understandable and kind of important reasons, and it’s just a mess. Like, there’s so many layers of stuff that the barrier to get, you know, like just a darn URL shortener up turns from like 20 minutes into three days of effort trying to understand how these various arcane things work together.
Xe Iaso 00:07:13 You need to have state for your OAuth implementation; you need to worry about what the hell a JWT is. It’s just bad. And I really think that with something like Tailscale, everybody has an IP address. In order to get into the network you have to sign in with your auth provider. Your auth provider tells Tailscale who you are. So transitively every IP address is tied to an owner, which means that you can enforce access permission based on the IP address and the metadata about it that you grab from the Tailscale daemon. It’s just so much simpler. Like you don’t have to think about, oh how do I set up OAuth this time? What the hell is an OAuth proxy? What is a Kubernetes? That sort of thing. You just think about doing the thing and you just do it, and then everything else gets taken care of. It’s like kind of the ultimate network infrastructure because it’s both omnipresent and something you don’t have to think about. And I think that’s really the power of Tailscale.
Jeremy Jung 00:08:12 Typically, when you would spin up a service that you want your developers or your system admins to be able to log into, you would have to have some way of authenticating and authorizing that user. And so, you were talking about bringing in OAuth and having your service understand that. But I guess what you’re saying is that when you have something like Tailscale that’s kind of front-loaded I guess? You authenticate with Tailscale, you get onto the network, you get your IP and then from that point on you can access all these different services that know like, Hey because you’re on the network, we know you’re authenticated and those services can just maybe map that IP that’s not going to change to like users in some kind of table and not have to worry about figuring out how do I authenticate this user?
Xe Iaso 00:09:05 I would personally more suggest that you use the Whois lookup route in the Tailscale daemon’s local API, but basically yeah you don’t really have to worry too much about the authentication layer because the authentication layer has already been done — you know, you’ve already done your two factor with Gmail or whatever and then you can just transitively push that property onto your other machines.
Jeremy Jung 00:09:30 So when you talk about this Whois daemon, can you give an example of ‘I’m in the network, now I’m going to make a service call to an application,’ what am I doing with this Whois daemon?
Xe Iaso 00:09:42 It’s more of like an internal API call that we expose via tailscaled’s Unix socket. But basically you give it an IP address and a port and it tells you who the person is. It’s kind of like the Unix ident protocol in a way except completely not. And at a high level, you know, if you have something like a proxy for Grafana, you have that proxy for Grafana make a call to the local Tailscale daemon and be like, hey who is this person? And the Tailscale daemon will spit back a JSON object like ‘oh it’s this person on this device’ and there you can do additional logic like maybe you shouldn’t be allowed to delete things from an iOS device. You know, crazy ideas like that. There’s not really support for arbitrary capabilities in tailscaled at the time of recording, but we’ve had some thoughts. Would be cool.
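The flow in that answer (ask the local daemon who is behind a connection, then apply extra logic such as refusing deletes from an iOS device) can be sketched as follows. This is a hedged illustration: the `whois` function here is a stand-in backed by a hard-coded table rather than a real call to the daemon's socket, and the `Identity` fields, addresses, and `may_delete` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:
    login: str      # who the auth provider says this is
    device_os: str  # operating system of the connecting device


# Stand-in for the daemon's lookup table; addresses and users are made up.
PEERS = {
    "100.64.0.1": Identity("alice@example.com", "linux"),
    "100.64.0.2": Identity("alice@example.com", "iOS"),
}


def whois(remote_ip: str) -> Optional[Identity]:
    """Hypothetical who-is lookup: map a connecting IP to its owner and device."""
    return PEERS.get(remote_ip)


def may_delete(remote_ip: str) -> bool:
    """Allow deletes only for known peers that are not on an iOS device."""
    who = whois(remote_ip)
    if who is None:
        return False  # unknown address: not on the network, so refuse
    return who.device_os.lower() != "ios"
```

The point of the design is that the service never handles passwords or tokens itself; it only asks who owns the source address and decides from there.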
Jeremy Jung 00:10:40 Would that also include things like having roles for example, even if it’s just strings, that you get back so that your application would know, okay this person is supposed to have admin access to this service based on what I got back from this service?
Xe Iaso 00:10:57 Not currently. You can probably do it via convention or something, but what’s currently implemented in the actual source code and user experience, you can’t do that right now. It is something that I’ve been trying to think about different ways to solve, but it’s also a problem that’s a bit big for me personally to tackle.
Jeremy Jung 00:11:17 There’s so many, I guess, different ways of doing it that it’s kind of interesting to think of a solution that’s kind of built into the network, yeah?
Xe Iaso 00:11:28 Yeah. And when I describe that authentication thing to some people it makes them recoil in shock because there’s kind of a Stockholm syndrome-type effect with security for a lot of things where the easy way to do something and the secure way to do something are, you know, like completely opposite and directly conflicting with each other in almost every way. And over time people have come to associate security, or like corporate VPNs, as annoying, complicated and difficult, and the idea of something that isn’t annoying, complicated, or difficult will make people reject it. Like, just on principle because you know, they’ve been trained that, you know, VPN equals ‘virtual pain network’ and it’s hard to get that association out of people’s heads because you know a lot of VPNs are virtual pain networks. Like, I used to work for Salesforce, and Salesforce had this corporate VPN where no matter what you did, all of your traffic would go out to the internet from their data center — I think it was in San Francisco or something — and I was in the Seattle area so whenever I had the VPN on my latency to Google shot up by like eight times, and being a software person, you know, I used Google the same way that others breathe, and it was just not fun and I only had the VPN on for the bare minimum of when I needed it and, oh God it was so bad.
Jeremy Jung 00:13:01 Like some people when they picture VPN, they picture exactly what you’re describing where all of my traffic is going to get routed to some central point, it’s going to go connect to the thing for me, and then send the result back. So maybe you could talk a little bit about why that’s maybe a wrong assumption, I guess, in the case of Tailscale or maybe in the case of just more modern VPN solutions.
Xe Iaso 00:13:24 Yeah, so the thing that I was describing is what I’ve been lovingly calling the ‘single point of failure as a service’ type model of VPN? Where you know, you have like the big server somewhere, it concentrates all the connections and you know like does things to make the computer feel like they’ve teleported over there, but overall it’s a single point of failure and if that falls over, you know, like, goodbye VPN, everybody’s just totally screwed. And in contrast, Tailscale does a more peer-to-peer thing, so that everyone is basically on equal footing. Everyone can send traffic directly to each other, and if it can’t get directly to there it’ll use a network of relay servers lovingly called DERP, and you don’t have to worry about your single point of failure in your cluster because there’s just no single point of failure. Everything will directly communicate as much as possible, and if it can’t it’ll still communicate anyway.
Jeremy Jung 00:14:26 Let’s say I start up my computer and I want to connect to a server in a data center somewhere, at the very beginning am I connecting to some server hosted at Tailscale and then there’s some kind of negotiation process where after that I connect directly, or do I just connect directly straight away?
Xe Iaso 00:14:47 If you just turn on your laptop and log in, it signs into Tailscale and gets you on the tailnet and whatnot. Then it will actually start all connections via DERP just so that it can negotiate the direct connection and in case it can’t, you know, it’s already connected via DERP so it just continues the connection with DERP. And this creates a kind of seamless magic type experience where doing things over DERP is slower. Yes, it is measurably slower because, you know, like you’re not going directly; you’re doing TCP inside of TCP and you know that comes with an average minefield of lasers or whatever you call it. And it does work though. It’s not ideal if you want to do things like copy large amounts of data, but if you just want to SSH into prod and see the logs for what the heck is going on and why you’re getting a page at 3:00AM, it’s pretty great.
Jeremy Jung 00:15:43 What you’re calling DERP, is it where you have servers kind of all over the world and somehow it determines which one, I guess, is it which one’s closest to your destination or which one’s closest to you? I’m kind of,
Xe Iaso 00:15:57 It’s really interesting. It’s one of the most weird distributed systems type things that I’ve ever seen. It’s the kind of thing that could only come out of the mind of an ex-Googler, but basically every Tailscale node has a connection to all of the DERP servers, and through process of, you know, latency testing, it figures out which connection is the fastest and the lowest latency and it calls that its home DERP. But because everything is connected to every DERP, you can have two people with different home DERPs getting their packets relayed to other clients from different DERPs. So, you know, if you have a laptop in Ottawa and a laptop in San Francisco, the laptop in San Francisco will probably use the DERP that’s closest to it, but the laptop in Ottawa will also use the DERP that’s closest to it. So you get this sort of like asymmetric thing, and it actually works out a lot better in practice than you’re probably imagining.
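The home-relay choice described there (measure latency to every relay, keep the lowest) comes down to taking a minimum. The sketch below assumes made-up relay names and latency figures, and `pick_home_relay` is a hypothetical helper, not a function from the Tailscale codebase.

```python
from typing import Mapping


def pick_home_relay(latencies_ms: Mapping[str, float]) -> str:
    """Return the relay with the lowest measured latency.

    Ties go to whichever relay appears first in the mapping.
    """
    if not latencies_ms:
        raise ValueError("no relay measurements available")
    return min(latencies_ms, key=lambda name: latencies_ms[name])


# Made-up measurements for the two laptops in the example.
ottawa_laptop = {"relay-toronto": 12.0, "relay-sf": 78.0, "relay-frankfurt": 105.0}
sf_laptop = {"relay-toronto": 80.0, "relay-sf": 4.0, "relay-frankfurt": 150.0}

ottawa_home = pick_home_relay(ottawa_laptop)
sf_home = pick_home_relay(sf_laptop)
```

Because each node picks independently, the two laptops end up with different home relays, which is the lopsided arrangement the answer describes.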
Jeremy Jung 00:16:51 And then these servers, what was the technical term for them? Are they like relays or what’s the…?
Xe Iaso 00:16:56 They’re relays. They only really deal with encrypted WireGuard packets and there’s no way for us at Tailscale to see the contents of DERP messages. It is literally just a forwarder; it literally just forwards things based on the key ID.
Jeremy Jung 00:17:12 I guess if Tailscale isn’t able to decrypt the traffic, is that because the keys are only on the user’s devices, like it’s on their laptop and on the server they’re trying to reach or…?
Xe Iaso 00:17:26 Yeah, the private keys live and die with those devices (the devices they were minted on), and the public keys are given to the coordination server and the coordination server spreads those around to every device in your tailnet. It does some limiting so that like if you don’t have ACL access to something, you don’t get the public key for it. The public key, not the private key, the public key, not the private key; and then you know, you just go that way and it’ll just figure it out. It’s pretty nice.
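The key-distribution step in that answer can be modeled as a filter over a table of public keys. This is a rough sketch under stated assumptions: the node names, key strings, `ALLOWED` pairs, and `visible_public_keys` function are invented for illustration, and treating a peer as visible when traffic is allowed in either direction is an assumption of the sketch, not something stated in the interview.

```python
from typing import Dict, Set, Tuple

# Public keys the coordination server knows about (placeholder strings).
# Private keys never appear here: they stay on the devices that minted them.
PUBLIC_KEYS: Dict[str, str] = {
    "laptop": "pub-laptop",
    "prod-db": "pub-prod-db",
    "test-box": "pub-test-box",
}

# Hypothetical ACL result: (source, destination) pairs that may connect.
ALLOWED: Set[Tuple[str, str]] = {
    ("laptop", "test-box"),
}


def visible_public_keys(node: str) -> Dict[str, str]:
    """Return the public keys of peers this node is allowed to know about."""
    visible = {}
    for peer, key in PUBLIC_KEYS.items():
        if peer == node:
            continue
        # Assumption: either direction of access makes the peer visible.
        if (node, peer) in ALLOWED or (peer, node) in ALLOWED:
            visible[peer] = key
    return visible
```

Here the laptop learns the key for `test-box` but never sees `prod-db` at all, so it has nothing to start a tunnel with.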
Jeremy Jung 00:17:53 When we’re kind of talking about situations where it can’t connect directly, that’s where you would use the relay. What are kind of the typical cases where that happens where you aren’t able to just connect directly?
Xe Iaso 00:18:06 Hotel wifi and paranoid network security setups. Hotel wifi is the most notorious one because you know you have like an overpriced wifi connection and if you bring, like, I don’t know, like you’re recording a bunch of footage on your iPhone and because in 2022 the iPhone has a USB2 connection on it and you know you want to copy that, you want to use the network but you can’t, so you could just let it upload through iCloud or something or do the bare minimum you need to get the data off with DERP. It wouldn’t be ideal but it would work, and ironically enough, that entire complexity involved with, you know, doing TCP inside of TCP to copy a video file over to your laptop might actually be faster than USB2, which is something that I did the math for a while ago and I just started laughing.
Jeremy Jung 00:19:02 That is pretty ridiculous.
Xe Iaso 00:19:04 Welcome to the future, man.
Jeremy Jung 00:19:07 In terms of connecting directly, usually when you have a computer on the internet, you don’t have all your ports open, you don’t necessarily allow just anybody to send you traffic over UDP, and so forth. Let’s say I want to send UDP data to a server on my network, but, you know, maybe it has some TCP ports open. I’m assuming once I connect into the network via the VPN I’m able to use other protocols and ports that weren’t necessarily exposed. Is that correct?
Xe Iaso 00:19:40 Yeah, you can use UDP. You can do basically anything you would do on a normal network except multicast because multicast is weird. I mean there’s thoughts on how to handle multicast, but the main problem is that WireGuard, which is what Tailscale is built on top of, is a so-called OSI model layer 3 network, where it’s at, like you know, the IP address level, and multicast is a layer-2 or data-link layer type thing, and they are different numbers. And you can’t really easily put, like, broadcast packets into IP. IPv4 thinks otherwise, but in practice, no, people don’t actually use the broadcast address.
Jeremy Jung 00:20:23 So, for someone who has a project or their company wants to get started, I mean, what does onboarding look like? What do they have to do to get all these devices talking to one another?
Xe Iaso 00:20:35 Basically, you install Tailscale, you log in with a little GUI thing, or on a Linux server you run tailscale up, and then you all log into, like, a G Suite account with the same domain name. So you know, if your domain is like example.com, then everybody logs in with their example.com G Suite account, and there is no step three. Everything is allowed and everything can just connect and you can change the permissions from there. By default the ACLs are set to a, you know, very permissive allow everyone to talk to everyone on any port just so that people can verify that it’s working. You can ping to your heart’s content, you can play Minecraft with others, you can host an HTTP server, you can SSH into your development box and write blog posts with Emacs, whatever you want.
Jeremy Jung 00:21:26 Okay, you install the software on your servers, your workstations, your laptops and so on. And then after that there’s some kind webpage or dashboard you would go in and say I want these people to be able to access these things and these ports and so on.
Xe Iaso 00:21:44 You can customize the access control rules with something that looks like JSON, but with trailing commas and comments allowed, and you can go from there to customize basically anything to your heart’s content. You can set rules so that people on the DevOps team can access everything, but you know maybe marketing doesn’t need access to the production database, so you don’t have to worry about that as much.
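To make "JSON, but with trailing commas and comments allowed" concrete, here is a small sketch that strips those two extensions so Python's standard `json` module can read the result. It is an illustration only, not the parser Tailscale uses; `strip_relaxed_json` is a hypothetical helper, and the keys in the sample policy are invented rather than taken from the real policy schema.

```python
import json


def strip_relaxed_json(text: str) -> str:
    """Remove // and /* */ comments and trailing commas, leaving strict JSON."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            while i < n and text[i] != "\n":  # skip to end of line
                i += 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")  # keep neighboring tokens separated
        elif ch in "}]":
            j = len(out) - 1
            while j >= 0 and out[j].isspace():
                j -= 1
            if j >= 0 and out[j] == ",":
                del out[j]  # drop the trailing comma before the closer
            out.append(ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


SAMPLE_POLICY = """{
  // DevOps can reach everything
  "rules": [
    {"who": "devops", "can_reach": "*"}, /* trailing comma next */
  ],
  "note": "not // a comment",
}"""

policy = json.loads(strip_relaxed_json(SAMPLE_POLICY))
```

The scanner tracks whether it is inside a string so that text such as `"not // a comment"` is left alone.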
Jeremy Jung 00:22:10 There’s been different, I guess you would call them VPN protocols. I mean, people have probably worked with IPsec in some situations, they may have heard of OpenVPN, WireGuard. In the case of Tailscale, I believe you chose to build it on top of WireGuard. So, I wonder if you could talk a little bit about why you chose WireGuard and maybe what makes it unique.
Xe Iaso 00:22:35 I wasn’t on the team that initially wrote like the core of Tailscale itself, but from what I understand WireGuard was chosen because, what overhead? It’s literally you just encrypt the packets, you send them to the other server, the other server decrypts them and, you know, you’re done. It’s also based purely on the key pairs involved. And from what I understand, like at the WireGuard protocol level, there’s no reason why you would need an IP address at all, in theory, but in practice you kind of need an IP address because, you know, everything sucks. But also WireGuard is like UDP-only, I think, in its core implementation, which is a step up from like AnyConnect and OpenVPN where they have TCP modes so you can experience the glorious trash fire of TCP-in-TCP. And from what I understand with WireGuard, you don’t need to set up a certificate authority or figure out how the heck to revoke certificates. You just have key pairs and if a node needs to be removed you delete the key pair, and you’re done. And I think that really matches up with a lot of the philosophy behind how Tailscale networks work a lot better. You know, you have a list of keys, and if the network changes the list of keys changes; that’s the end of the story.
Jeremy Jung 00:23:55 So maybe one of the big selling points was just what has the least amount of things, I guess, to deal with? Or what’s the simplest when you’re using it as a component that you want to put into your own product. You kind of want the least amount of things that could go wrong, I guess?
Xe Iaso 00:24:10 Yeah, it’s more like simple but not like limiting. Like, for example, a set of Tinkertoys is simple in that you know you can build things and you don’t have to worry too much about the material science, but a set of Tinkertoys is also limiting because you know like they’re little wooden dowels and little circles made out of wood that you stick the dowels into. You know, you can only do so much with it. And I think that in comparison WireGuard is simple, you know there’s just key pairs, there’s just encryption, and it’s simple in its like overall theory and its implementation, but it’s not limiting. Like, you can do pretty much anything you want with it.
Jeremy Jung 00:24:52 Inherently, whenever we build something that’s what we want. But that’s an interesting way of putting it.
Xe Iaso 00:24:57 Yeah, it can be kind of annoyingly hard to figure out how to make things as simple as they need to be but still allow for complexity to occur, so you don’t have to like set up a keyboard macro to write ‘if error not equals nil’ over and over.
Jeremy Jung 00:25:11 I guess the next thing I’d like to talk a little bit about is we’ve covered it a little bit but at a high level I understand that Tailscale uses WireGuard, which is the open-source VPN protocol I guess you could call it. And then there’s the client software you’re saying you need to install on each of the servers and workstations, but there’s also a control plane, and I wonder if you could kind of talk a little bit about, I guess at a high level, what are all the different components of Tailscale?
Xe Iaso 00:25:42 There’s the agent that you install on your devices. The agent is basically the same between all the devices; it’s all written in Go, and it turns out that Go can actually cross compile fairly well. So, you have your implementation in Go that is basically the same code more or less running on Windows, macOS, FreeBSD, Android, Chrome OS, iOS, Linux. I think I just listed all the platforms, I’m not sure. But you have that and then there’s the sort of control plane on Tailscale’s side. The control plane is basically like Control, which is I think a Get Smart reference, and that is basically a key drop box. So you authenticate through there, that’s where the admin panel’s hosted and that’s what tells the different Tailscale nodes the keys of all the other machines on the tailnet. And also on Tailscale’s side there’s DERP, which is a fleet of a bunch of different VPSs in various Clouds all over the world, both to try to minimize cost and to have resiliency, because if both DigitalOcean and Vultr go down globally we probably have bigger problems.
Jeremy Jung 00:26:55 I believe you mentioned that the clients were written in Go, are the control plane and the relay the DERP portion, are those also written in Go or are they…?
Xe Iaso 00:27:06 They’re all written in Go, yeah. Go as much as possible. Yeah. It’s kind of what happens when you have some ex-Go team members as the core people involved in Tailscale. Like there’s a Go compiler fork that has some additional patches that Go upstream either can’t accept, won’t accept, or hasn’t yet accepted. For a while it was how we did things like trying to shave off bytes from binary size to attempt to fit it into the iOS network extension limit because for some reason they only allowed you to have 15 megabytes of RAM for both, like, your application and working RAM, and it turns out that 15 megabytes of RAM is way more than enough to do something like OpenVPN but you know when you have a peer-to-peer VPN engine, it doesn’t really work that well. So, a lot of interesting engineering challenges.
Jeremy Jung 00:27:59 That was specifically for iOS, so to run it on an iPhone?
Xe Iaso 00:28:03 Yeah, and amazingly, after the person who did all of the optimization to the linker had finished (trying to get the binary size down as much as possible, like replacing Unicode packages with something that’s more code efficient, you know, like basically all but compressing parts of the binary to try to save space), the iOS, I think, 15 beta dropped and we found out that they increased the network extension RAM limit to 50 megabytes, and the look of defeat on that poor person’s face. I feel very bad for him.
Jeremy Jung 00:28:37 You got what you wanted but you’re sad about it.
Xe Iaso 00:28:40 Yeah.
Jeremy Jung 00:28:41 So that’s interesting too. You were using a fork of the Go compiler?
Xe Iaso 00:28:46 Basically, everything that is built is built using the Tailscale fork of the Go compiler.
Jeremy Jung 00:28:53 Going forward is the sort of assumption is that’s what you’ll do or is it you’re hoping you can get this stuff upstream and then eventually move off of it?
Xe Iaso 00:29:02 I’m pretty sure that — I don’t know if I can really make a forward-looking statement like that, but I’ve come to accept the fact that there’s a fork of the Go compiler and as a result it allows a lot more experimentation and a bit more control over what’s going on. I’m not like the most happy with it, but I understand why it exists and I’ve made my peace with it.
Jeremy Jung 00:29:25 And I suppose it helps somewhat that the people who are working on it actually originally worked on the Go compiler at Google. Is that right?
Xe Iaso 00:29:34 Oh yeah. If there weren’t ex-Go team people working on that then I would definitely feel way less comfortable about it. But I trust that the people that are working on it know what they’re doing — at least enough.
Jeremy Jung 00:29:47 I feel like that’s kind of the position we put ourselves in with software in general, right? Is like do we trust ourselves enough to do this thing we’re doing?
Xe Iaso 00:29:55 Yeah, trust is a —-.
Jeremy Jung 00:29:58 I think one of the things that’s interesting about Tailscale is that it’s a product that’s kind of, it’s like network infrastructure, right? It’s to connect you to your other devices, and that’s a little different than somebody running a software-as-a-service. And so how do you test something that’s like built to support a network and how is that different than just making a web app or something like that?
Xe Iaso 00:30:23 Well, it’s a lot more complicated for one, especially when you have to have multiple devices in the mix with multiple different operating systems. And I was working on some integration testing stuff for a while, and it was really complicated. You have to spin up virtual machines, you know you have to like make sure the virtual machines are attempting to download the version of the Tailscale client you want to test. And it’s quite a lot, in practice.
Jeremy Jung 00:30:50 I mean, do you have a lab, you know, with Android phones and iPhones and laptops and all this sort of stuff, and you have some kind of automated test suite to see like, hey if these machines are in Ottawa and my server’s in San Francisco, like you were mentioning before, that I can get from my iPhone to this server in the data center over here? That kind of thing.
Xe Iaso 00:31:13 What’s the right way to phrase this without making things look bad? It’s a work in progress. It’s really a hard problem to solve, especially when the company is fully remote and, like, the address that’s listed on the business records is literally one of the founder’s condos because you know the company has no office so that makes the logistics for a lot of this even more fun.
Jeremy Jung 00:31:38 Probably any company that’s in an early stage feels the same way where it’s like, everything’s a work in progress and we’re just going to, we’re going to keep going and we’re going to get there and as long as everything keeps running we’re good.
Xe Iaso 00:31:51 Yeah, I don’t like thinking about it in that way because it kind of sounds like pessimistic or defeatist, but at some level it’s, it really is a work in progress because it’s a hard problem, and hard problems take a lot of time to solve — especially if you want a solution that you’re happy with.
Jeremy Jung 00:32:08 And I think it’s kind of a unique case too where it’s not like if it goes down it’s like people can’t do their job right? So it’s, yeah.
Xe Iaso 00:32:18 Actually, if Tailscale’s control plane goes down, I don’t think people would notice until they tried to like reboot a laptop or connect a new device to their tailnet because once all the Tailscale agents have all of the information they need from the control plane, you know, they just continue on independently and don’t have to care. DERP is also fairly independent of the, like, the key drop box component, and you know if that goes down DERP doesn’t care at all.
Jeremy Jung 00:32:50 Oh okay. So if the control plane is down as long as you had authenticated earlier in the day, you can still, I don’t know if it’s cached or something, but you can still continue to reach the relay servers, the DERP servers or your …. ?
Xe Iaso 00:33:06 …other nodes. Yeah. Yeah, I’m pretty sure that in most cases the control plane could be down for several hours a day and nobody would notice unless they’re trying to deal with the panel.
Jeremy Jung 00:33:16 Got it. That’s a little bit of a relief I suppose for all of you running it.
Xe Iaso 00:33:21 Yeah, it’s also kind of hard to sell people on the idea of here is a VPN thing; you don’t need to self-host it and they’re like, what? Why? And yeah, can be fun.
Jeremy Jung 00:33:35 Though, I mean I feel like anybody who has self-hosted a VPN, they probably like don’t really want to do it. I don’t know, maybe I’m wrong.
Xe Iaso 00:33:46 So, a lot of the idea of wanting to self-host it is, I think it’s more of like trying to be self-sufficient and not have to rely on other companies’ failures dictating your company’s downtime. And you know, like, from some level that’s very understandable, and you know, if Tailscale were to get bought out and the new owners would, like, basically kill the product, they’d still have something that would work for them. I don’t know if, like, such a defeatist attitude is productive, but it is certainly the opinion that I have received when I have asked people why they want to self-host. Other people don’t want to deal with identity providers, or they want to use their own identity provider. And what was hilarious was there was one thing where they were like, our old VPN server died once and we got locked out of our network, so therefore we want to self-host Tailscale in the future so that this won’t happen again. And I’m like, buddy, let’s just take a moment and retrace the steps here, ’cause I don’t think you mean what you think you mean.
Jeremy Jung 00:34:49 Yeah, yeah.
Xe Iaso 00:34:51 In general, like, I suggest to people that, you know, even if they’re like way deep into the Tailscale Kool-Aid, they still have at least one other method of getting into their servers. Ideally two. I admit that I come from an SRE-style background and I am way more paranoid than most, but I usually like having a backup just in case.
Jeremy Jung 00:35:12 So I suppose on that note, let’s talk a little bit about your role at Tailscale. The title of archmage of infrastructure is one of the coolest titles I’ve seen. So maybe you can go a little bit into what that entails at Tailscale.
Xe Iaso 00:35:27 I started that title as a joke that kind of stuck. My initial intent was that every time someone asked, I’d say I had a different, you know, like mystic-sounding title, but archmage of infrastructure kind of stuck. And since then I’ve actually been pivoting more into developer relations stuff rather than pure software engineering. And from the feedback that I’ve gotten at the various conferences I’ve spoken at, they like that title even though it doesn’t really fit with developer relations work at all; it’s like it fits because it doesn’t, you know, in that kind of koan-y kind of way.
Jeremy Jung 00:36:01 I guess this would go more into the infrastructure side, but what does the scale of your infrastructure look like? I mean, I think that you touched a little bit on the fact that you have relay servers all over the place and you’ve got this control plane, but I wonder if you could give people a little bit of perspective of what kind of undertaking this is?
Xe Iaso 00:36:21 I’m pretty sure at this point we have more developer laptops and the like than we do production servers. I’m pretty sure that the number of production servers is in the tens at most. It turns out that computers are pretty darn efficient and you don’t really need, like, a lot of computers to do something amazing.
Jeremy Jung 00:36:41 The part that I guess surprises me a little bit is the relay servers I suppose because I would imagine there’s a lot of traffic that goes through those. Are you finding that just most of the time they just aren’t needed and usually you can make a direct connection and that’s why you don’t need too many of these?
Xe Iaso 00:36:56 From what I understand, I don’t know if we actually have a way to tell, like, what percentage of data is going over the relays versus not. And I think that was an intentional decision that may have been revisited (I’m operating based off of like 6-12 month old information right now), but in general, the only state that the relay servers have is in RAM, and whenever you disconnect the state is dropped, and even then that state is like, you know, this key is listening, it is connected in case you want to send packets over here, I guess. It’s a bit less bandwidth than you’re probably thinking. It’s not like enough to max it out 24/7, but it is measurable and there are some costs associated with it. This is also why it’s on DigitalOcean and Vultr and not AWS, but in general it’s a lot less than you’d think. I’m pretty sure that, like, if I had to give a baseless assumption, I’d say that probably about like 85% of traffic goes directly, and the remaining is like the few cases in the hole-punching engine that we haven’t figured out yet. Like Palo Alto firewalls, oh God, those things are a nightmare.
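Editor's illustration, not part of the conversation: the relay state described above ("this key is listening, it is connected") can be pictured as a small in-memory table that is discarded on disconnect. The class and method names below are hypothetical and are not Tailscale's DERP implementation; this is only a minimal sketch of the idea.

```python
class RelayRegistry:
    """Hypothetical in-RAM table of which public keys are currently connected."""

    def __init__(self):
        # public key -> list standing in for that client's open connection
        self._connections = {}

    def connect(self, public_key):
        # Register the key as listening; nothing is written to disk.
        self._connections[public_key] = []

    def disconnect(self, public_key):
        # All state for the key is dropped the moment it disconnects.
        self._connections.pop(public_key, None)

    def forward(self, dest_key, packet):
        # Deliver only if the destination key is connected right now.
        inbox = self._connections.get(dest_key)
        if inbox is None:
            return False
        inbox.append(packet)
        return True

    def inbox(self, public_key):
        # Copy of the packets delivered to a connected key (empty if absent).
        return list(self._connections.get(public_key, []))
```

Because the only state is this table, a relay in this sketch has nothing to recover after a restart: clients simply reconnect and register their keys again.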
Jeremy Jung 00:38:12 I see. So it’s most of the traffic actually ends up being straight peer-to-peer, doesn’t have to go through your infrastructure, and therefore it’s like you don’t need too many machines to make this whole thing work.
Xe Iaso 00:38:26 Yeah, it turns out that computers are pretty darn fast, and that copying data is something that computers are really good at doing. So if you have, you know, some pretty darn fast computers basically just sitting there and copying data back and forth all day, like you can do a lot with shockingly little. When I first started, I believe that the DERP VMs were using like sometimes as little as one core and 512 megabytes of RAM as, like, a primary DERP. And we only noticed when there were some weird connection issues for people that were only on DERP, because there were enough users that the machine had run out of memory. So we just, you know, upped the virtual machine size and called it a day. But it’s truly remarkable how far you can get with very little.
Jeremy Jung 00:39:12 And you mentioned the relay servers, the DERP servers, were on services like DigitalOcean and Vultr, I’m assuming because of the bandwidth cost. For the control plane, is that on AWS or some other big cloud provider?
Xe Iaso 00:39:28 It’s on AWS, I believe it’s in eu-central-1.
Jeremy Jung 00:39:31 You’re helping people connect from device to device. And in a situation like that, what does monitoring look like and incidents — like, what are you looking for to determine like, hey, something’s not working?
Xe Iaso 00:39:46 There’s monitoring with, you know, Prometheus, Grafana, all of that stuff. There are some external probing things. There’s also some continuous functional testing for trying to connect to Tailscale and, like, log in as an account, and if that fails like twice in a row, then you know something’s very wrong and, you know, raise the alarm. But in general, a lot of our monitoring is kind of hard at some level because we’re Tailscale. Tailscale can’t always benefit from Tailscale to help operate Tailscale because, you know, it’s Tailscale. So we’re still trying to figure out how to detangle the chicken and egg situation; it’s really annoying.
Jeremy Jung 00:40:30 There’s the term ‘dogfooding’, right, where they’re saying like, oh we run our own development on our own platform or our own software, but I could see when your product is network infrastructure, VPNs, where that could be a little, little dicey.
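Editor's illustration, not part of the conversation: the "fails twice in a row, then raise the alarm" rule mentioned above can be sketched as a small counter that resets on any success. The class name and the default threshold are assumptions for illustration; the actual probing setup is not described in the interview.

```python
class ConsecutiveFailureAlarm:
    """Hypothetical alarm that fires after N failed probes in a row."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = 0

    def record(self, success):
        """Record one probe result; return True if the alarm should fire."""
        if success:
            # Any successful login probe resets the streak.
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

Requiring consecutive failures keeps a single flaky probe from paging anyone, at the cost of detecting a real outage one probe interval later.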
Xe Iaso 00:40:44 Yeah, it is very annoying, but I’m pretty sure we’ll figure something out. It’s just a matter of when. Another thing that’s come up is we’ve kind of wanted to use Tailscale’s SSH features where you’d specify ACL rules to allow people to SSH into other nodes as various users, but if that becomes your main access to production, then, you know, like, if Tailscale is down and you’re Tailscale, how do you get in? There’ve been various philosophical discussions about this. It’s also slightly worse if you use what’s called check mode in SSH. With Tailscale SSH without check mode, you know, the server just checks against the policy rules and the ACL, and if it’s okay it lets you in, and if not it says no. But with check mode there’s also this, like, 8-hour quote-unquote lifetime, like sudo mode on GitHub, where you do an auth challenge with your auth provider and then, you know, you’re given a ‘hey, this person has done this thing’ type verification. And if that’s down, and that goes through the control plane, and if the control plane is down and you’re Tailscale trying to debug the control plane, and in order to get into the control plane over Tailscale you need to use the control plane, you know, that’s like chicken and egg problem level 78, which is a mythical level of chicken and egg problem that has only been foretold in the legends of yore or something.
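Editor's illustration, not part of the conversation: the difference between plain policy checks and check mode, and why check mode makes the chicken-and-egg problem worse, can be sketched as a decision function. Everything here is hypothetical: the function name, the return values, and in particular the assumption that a recent check is remembered without contacting the control plane, which may not match how Tailscale actually behaves.

```python
from datetime import datetime, timedelta

# The roughly 8-hour lifetime mentioned in the interview.
CHECK_PERIOD = timedelta(hours=8)


def ssh_decision(policy_allows, check_mode, last_auth, now, control_plane_up):
    """Return 'deny', 'allow', 'reauthenticate', or 'stuck' (hypothetical values)."""
    if not policy_allows:
        return "deny"  # the ACL says no, regardless of mode
    if not check_mode:
        return "allow"  # plain mode: the policy check alone is enough
    if last_auth is not None and now - last_auth < CHECK_PERIOD:
        return "allow"  # a recent auth challenge is still valid (assumed cached)
    if not control_plane_up:
        return "stuck"  # re-auth needs the control plane, which is down
    return "reauthenticate"
```

The "stuck" branch is the case described above: the only way in requires the component that is currently broken.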
Jeremy Jung 00:42:12 At that point, it sounds like somebody just needs to drive to the data center and plug into the switch.
Xe Iaso 00:42:18 I mean, it probably wouldn’t be like, you know, ‘we need to get a person with an angle grinder off of Craigslist’ type bad, like it was with the Facebook BGP outage. But it’s definitely a chicken and egg problem in its own right. It makes you do a lot of lateral thinking too, which is also kind of interesting.
Jeremy Jung 00:42:35 When you say ‘lateral thinking’, I’m just kind of curious if you have an example of what you mean.
Xe Iaso 00:42:40 I don’t know of any example that isn’t NDA’d, but basically, you know, Tailscale is getting to the point where Tailscale is relying on Tailscale to make Tailscale function and, you know, yeah, this is a classic ouroboros-style problem. I’ve heard a wise friend of mine say that that is an ideal problem to have, which sounds weird at face value, but if you’re getting to that point, that means that you’re successful enough that you’re having that problem, which is in itself a good thing, paradoxically.
Jeremy Jung 00:43:12 Better to have that problem than to have nobody care about the product, right?
Xe Iaso 00:43:17 Yeah.
Jeremy Jung 00:43:18 Kind of on that note, you mentioned you worked at Salesforce — I believe that was working on Heroku. I wonder if you could talk a little about your experience working at, you know, Tailscale, which is kind of more of a, you know, early startup versus an established company like Salesforce.
Xe Iaso 00:43:38 So, at the time I was working at Heroku, it definitely didn’t feel like I was working at Salesforce for the majority of it. It felt like I was working, you know, at Heroku. Like, on my resume I list it as Heroku; when I talked about it to people, I said I worked at Heroku, and Salesforce was this, you know, mythical ohana thing that I didn’t have to deal with unless I absolutely had to. By the end of the time I was working at Heroku, the Salesforce sort of started to creep in and, you know, we moved from tracking issues in GitHub issues, like we were used to, to using their (what’s the polite way to say this?) creation, which was like the moral equivalent of Jira implemented on top of Salesforce. You had to be behind the VPN for it and, you know, every ticket had 20 fields and there were no templates. And in comparison, with Tailscale, you know, we just use GitHub issues. Maybe some, like, things in Notion for doing like longer-term tracking or kanban stuff, but it’s nice to not have, you know, all of the pomp and ceremony of filling out 20 fields in a ticket for like two sentences of ‘this thing is obviously wrong and it’s causing X to happen, please fix.’
Jeremy Jung 00:44:56 I like that phrase, ‘the creation’. That’s a very diplomatic term.
Xe Iaso 00:45:02 I mean, I can think of other ways to describe it, but I’m pretty sure those ways wouldn’t be allowed on the podcast.
Jeremy Jung 00:45:09 But yeah, I know what you mean for sure. Where it feels like there’s this movement from ‘hey, let’s just do what we need’ (like, let’s fill in the information that’s actually relevant and don’t do anything else) to ‘we need to fill in these 10 fields because that’s the thing we do.’ Yeah.
Xe Iaso 00:45:30 Yeah. And in the time I’ve been working for Tailscale, I’m like employee ID 12, and Tailscale has gone from a company where I literally know everyone to, just recently, the point where I don’t know everyone anymore. And it’s a really weird feeling. I’ve never been in, like, a small-stage startup that’s gotten to this size before, and I’ve described some of my feelings to other people who have been there and they’re like, yeah, welcome to the club. So, I figure a lot of it is normal. From what I understand though, there’s a lot of intentionality to try to prevent Tailscale from becoming, you know, like Google-style organizational complexity unless that is absolutely necessary to do something.
Jeremy Jung 00:46:13 It’s a function of size, right? Like as you have more people, more teams, then more process comes in. That’s a really tricky balance to grow and still keep that feeling of I’m just doing the thing, I’m doing the work rather than all this other process stuff.
Xe Iaso 00:46:32 Yeah. But I’ve also kind of managed to pigeonhole myself off into a corner with devRel stuff and that’s been nice. Been working a bunch with like marketing people and helping out with support occasionally and doing a God-awful amount of writing.
Jeremy Jung 00:46:48 The writing for our audience’s benefit, I think they should really check out your blog because I think that the way you write your articles is very thoughtful in terms of the balance of the actual example code or example scripts and the descriptions, and there’s a little bit of a narrative sometimes too.
Xe Iaso 00:47:09 I’m actually more of a prose writer just by like how I naturally write things.
Jeremy Jung 00:47:15 As we wrap up, is there anything we missed or anything else you want to mention?
Xe Iaso 00:47:19 If you want to look at my blog, it’s on xeiaso.net. That’s X-E-I-A-S-O.net. That’s where I post things. You can see like the 280-something articles at time of recording; it’s probably going to get to 300 at some point. (Oh God, it’s going to get to 300 at some point.) And yeah, I try to post articles about weekly, depending on facts and circumstances. I have a bunch of talks coming up, like one about the hilarious overengineering I did in my blog, and maybe some more if I get back positive responses from call-for-papers submissions. I have a couple talks that are going to be up by the time this is published. One of them is my RustConf talk on, what was it called? I think it was called The Surreal Horrors of PAM or something, where I discussed my experience trying to debug a PAM module in Rust for work. And it’s the kind of story where, you know it’s bad when you have a breakpoint on dlopen.
Jeremy Jung 00:48:23 That sounds like a nightmare.
Xe Iaso 00:48:25 Oh yeah. Like, part of attempting to fix that process involved going very deep. We’re talking like an HTML frameset in the Internet Archive for SunOS documentation that was written around the time that PAM was used. Like, things were bad enough where, like, everything in the frameset but the contents had eroded away through bit rot and, you know, you’re very lucky just to have what you do.
Jeremy Jung 00:48:52 Well, I’m glad it was you and not me. We’ll get to hear about it and not have to go through the suffering ourselves.
Xe Iaso 00:48:58 Yeah. One of the things I’ve been telling people is that I’m not like a brilliant programmer. Like, I know a bunch of people who are definitely way smarter than me, but what I am is determined and determination is a bit stronger of a force than you’d think.
Jeremy Jung 00:49:13 Yeah. I mean without it nothing gets done. Right?
Xe Iaso 00:49:16 Yeah.
Jeremy Jung 00:49:17 Very cool. Well, Xe thank you so much for coming on Software Engineering Radio.
Xe Iaso 00:49:22 Yeah, thank you for having me. I hope you have a good day, and try out Tailscale — note my bias, but I think it’s great.
Jeremy Jung 00:49:28 This has been Jeremy Jung for Software Engineering Radio. Thanks for listening.
[End of Audio]