
SE Radio 556: Alex Boten on Open Telemetry : Software Engineering Radio


Software engineer Alex Boten, author of Cloud Native Observability with OpenTelemetry, joins host Robert Blumen for a conversation about software telemetry and the OpenTelemetry project. After a brief review of the topic and the OpenTelemetry project’s origins rooted in the need for interoperability between telemetry sources and back ends, they discuss the OpenTelemetry server and its features, including transforms, filtering, sampling, and rate limiting. They consider a range of topics, starting with alternative topologies with and without the telemetry server, server pipelines, and scaling out the server, as well as a detailed look at extension points and extensions; authentication; adoption; and migration.

Transcript brought to you by IEEE Software magazine. This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Robert Blumen 00:00:16 For Software Engineering Radio, this is Robert Blumen. Today I have with me Alex Boten. Alex is a senior staff software engineer at LightStep. Prior to that, he was at Cisco. He’s contributed to open-source projects in the telemetry area, including the OpenTelemetry project. He’s the author of the book, Cloud Native Observability with OpenTelemetry, and that will be the subject of our conversation today. Alex, welcome to Software Engineering Radio.

Alex Boten 00:00:50 Hello. Thank you for having me. It’s great to be here.

Robert Blumen 00:00:52 Would you like to add anything about your background that I didn’t mention?

Alex Boten 00:00:57 I think you captured most of it. I’ve been contributing to OpenTelemetry for a little bit over three years. I’ve worked on various components of the project as well as the specification, and I’m currently a maintainer on the OpenTelemetry Collector.

Robert Blumen 00:01:11 Great. Now on Software Engineering Radio, we have covered quite a lot of telemetry-related issues, including Logging in episode 220, High Cardinality Monitoring, which was 429, Prometheus, Distributed Tracing, and episode 455, which was called Software Telemetry. So, listeners can definitely listen to some of those in our back catalog to get more general information. We’ll be focusing more in this conversation on what OpenTelemetry brings to the table that we have not already covered. Let’s start out with, in the telemetry space, where could you situate OpenTelemetry? What is it similar to? What is it different from? What problem does it solve?

Alex Boten 00:02:02 That’s a great question. So, I think the problem that OpenTelemetry aims to solve — and we’ve already seen it happen in the industry today — is it changes how application developers instrument their application, how telemetry is generated, and how it’s collected, and then transmitted across systems. And if I were to think of what it’s similar to the first thing that comes to mind are the projects that really caused it to emerge, which are OpenCensus and OpenTracing, which are two other open-source projects that were formed a little bit earlier. I think it started in maybe 2017, 2016, to provide a standard around producing distributed tracing. And then also OpenCensus also addressed a little bit around metrics and log collection.

Robert Blumen 00:02:50 What was going on in the telemetry area prior to those projects that created the need for them, and what did they do?

Alex Boten 00:02:57 Yeah, so I think, if you think of telemetry as the domain in software, it’s been around for a really long time, right? Like, people as early as the earliest of computer scientists wanted to know what their computers were doing. And earlier in the days of having a single machine, it was fairly easy to print some log statements and look at what your machine was doing. But as the industry grew, as the Internet of Things picked up, as systems became larger and larger to address the increasing demand, I think systems became inherently more complex. And we’ve seen an evolution of what software telemetry really became. So, if you think of earlier we were able to log data on a single system. As people had to deploy multiple systems, a need for centralized logging came along so that you can aggregate and do aggregate searches on logs.

Alex Boten 00:03:54 And that became really costly. And then we saw an increase in folks wanting to capture more meaningful metrics from their systems where they could create dashboards and do queries, whereas it was cheaper than going through and analyzing log data. And I think the thing that I’ve seen happen in the last 20 years is every time there was a new maybe paradigm around the type of telemetry that systems should emit, there has been a chance for innovation to take place, which is great to see, but if you’re an end user who’s just trying to get telemetry out of a system, out of an application, it’s a really frustrating process to have to go and reinstrument your code every few months or every few years, depending on what the flavor of the day is. And I think what OpenCensus and OpenTracing and OpenTelemetry tried to capture is addressing the pain that users have when it comes to instrumenting their code.

Robert Blumen 00:04:49 What is the relationship of OpenTelemetry to other systems out there, such as Zipkin, Jaeger, Graylog, Prometheus?

Alex Boten 00:05:00 So the relationship that OpenTelemetry has with the Zipkin, the Jaegers and the Prometheus of the world is really around providing interoperability between those systems. So, an application developer would instrument their code using OpenTelemetry, and then they can emit that telemetry data to whatever backend systems they want. So, if you wanted to continue using Jaeger, you could definitely do that with an application that’s instrumented with OpenTelemetry. The other thing that OpenTelemetry tries to do is it tries to provide a translation layer so that folks that are maybe today emitting data to Zipkin or to Jaeger or to Prometheus can deploy a collector within their environments and then translate the data from a specific format of those other systems into the OpenTelemetry format, so that they can then emit the data to whatever backend they choose by simply updating the configuration on their Collector without having to go back to their applications who may be legacy systems that nobody wants to modify anymore and still be able to send their data to different destinations.

Robert Blumen 00:06:06 Is OpenTelemetry then an interoperability standard, a system, or both?

Alex Boten 00:06:13 It’s really the standard to instrument your applications and to provide the interoperability between the different systems. OpenTelemetry doesn’t offer a backend; there’s no log database or metrics database that OpenTelemetry provides. Maybe at some point in the future that that will happen. We’re certainly seeing people that are supporting the OpenTelemetry format starting to provide those backend options for folks that are emitting only OpenTelemetry data. But that’s not something the project is interested in solving at this point. It’s really about the instrumentation piece and the collection and transmission of the data.

Robert Blumen 00:06:52 In reading about this, I came across discussion of a protocol called OTLP. Can you explain what that is?

Alex Boten 00:07:00 So the OpenTelemetry protocol is a protocol that’s generated from protobuf definitions. Every implementation of OpenTelemetry supports it. Its aim is to provide high-performance data transmission in a format that’s standardized across all the implementations. It’s also supported by the OpenTelemetry Collector. And what it really means is that this format supports all the different signals that OpenTelemetry supports: logs, traces, metrics, and maybe down the road, events and profiling, which are currently being developed in the project. And the idea is if you support the OpenTelemetry protocol, this is the protocol that you would use to either transmit the data, or if you’re a vendor or if you’re a backend provider, you would use that protocol to receive the data. And it’s actually been really good to see even projects like Prometheus starting to support OTLP for transmitting data.

Robert Blumen 00:07:56 So, let me summarize what we have so far, and you can tell me if I’ve understood. I’m building an application, I could instrument it in a way that’s compatible with this standard. I might not even know where my logs or metrics are going to end up. And then whoever uses my system, which may be people in the same organization or maybe I’m shipping an open-source project, which has many users — they can then plug in their backend of choice, and they are not necessarily tied to any decisions I made about how I think the telemetry will be collected. It creates the ability of users to plug and play between the applications and the backends. Is that more or less correct?

Alex Boten 00:08:42 Yeah, that’s exactly right. I think it really decouples the instrumentation piece, which historically has been the most expensive aspect of organizations gaining observability within their systems, from the decision of where am I going to send that data. And the nice thing about this is that it really frees the end users from the idea of vendor lock-in, which I think a lot of us who have worked in systems for a long time always found to be difficult. The conversation of trying to maybe try out a new vendor if you wanted to test some new feature that you wanted to have or whatever, usually would mean that you would have to go back and re-instrument your code. Whereas now with OpenTelemetry, if you have instrumented your application, hopefully this is the last time you have to worry about instrumenting your application because you can just point that data to different backends.

Robert Blumen 00:09:34 A little while ago you did mention the Collector, and we will be spending some time on that, but I want to understand what are the possible configurations of the system. What I think we’re talking about now is if the code is instrumented with the OpenTelemetry standard, that it could talk directly to backends. The other option being you have a Collector in between them. Are those the two main configurations?

Alex Boten 00:10:02 Yeah, that’s right. It’s also possible to configure your instrumented application to send data to backends directly: if you wanted to choose to send the data to Jaeger, I think most implementations that support OpenTelemetry officially have a Jaeger exporter, for example. So there are options if you wanted to send data from your application to your backend, but ideally you would send that data in a protocol that you can then configure using an OpenTelemetry Collector later down the line.

Robert Blumen 00:10:31 Let’s come back to Collector in a bit, but I want to talk about instrumentation. Often if I want to talk to a certain backend, I need to use their library to emit the telemetry. How does that change with OpenTelemetry?

Alex Boten 00:10:49 Yeah, so with the OpenTelemetry standard, you have two aspects of the instrumentation. So, there’s the OpenTelemetry API, which is really what most developers would interact with. There’s a very limited amount of surface area that the API covers. For example, for tracing, the API is essentially this: you can get a tracer, you can start a span, and you can finish a span. That’s roughly the surface area that’s trying to be covered there. And the idea we wanted to push forward with our limited API is to just reduce the cognitive load that users would have to take on to adopt OpenTelemetry. The other piece of the instrumentation that folks will have to interact with is the SDK, which really allows end users to configure how the telemetry is produced and where it’s sent to. If you’re thinking about this in the context of how it is different from a particular backend and its instrumentation, the difference is that with OpenTelemetry you would only ever use the OpenTelemetry API and configure the SDK to send data to the backend of choice.

Alex Boten 00:11:55 But the API that you would use for instrumenting the code wouldn’t be any different depending on which backend you send it to. And there’s that clear separation between the API and the SDK that allows you to really only instrument with that minimal interface and worry about the details of how and where that data is sent using the SDK configuration, which in my book I refer to as telemetry pipelines.
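The API/SDK split described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `Tracer`, `InMemoryExporter`, `get_tracer`, and `handle_request` are hypothetical stand-ins written for this example, not the real OpenTelemetry classes, and a span is modelled as a simple dictionary.

```python
import time
from contextlib import contextmanager


class InMemoryExporter:
    """SDK side (hypothetical): decides where finished spans are sent."""

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class Tracer:
    """API side (hypothetical): the small surface application code uses."""

    def __init__(self, exporter):
        self._exporter = exporter

    @contextmanager
    def start_span(self, name):
        # Start the span, hand it to the caller, and finish it on exit.
        span = {"name": name, "start": time.monotonic(), "attributes": {}}
        try:
            yield span
        finally:
            span["end"] = time.monotonic()
            self._exporter.export(span)


def get_tracer(exporter):
    """Configuration step: the only place that knows about the destination."""
    return Tracer(exporter)


def handle_request(tracer):
    # Application code: no knowledge of which backend receives the span.
    with tracer.start_span("handle_request") as span:
        span["attributes"]["http.method"] = "GET"
        return "ok"


exporter = InMemoryExporter()
tracer = get_tracer(exporter)
result = handle_request(tracer)
```

Swapping `InMemoryExporter` for another object with an `export` method would change the destination without touching `handle_request`, which is the decoupling being discussed.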

Robert Blumen 00:12:17 In that discussion you mentioned tracing, I’ve seen a lot of logging systems, you can log whatever you want and then it puts the burden on a Collector to pick up the logs and format them. And then metrics, you may have to use a library. If I’m adopting OpenTelemetry, how does it handle logs and metrics?

Alex Boten 00:12:40 Yeah, so for metrics, there is an API that calls out specific instruments. So OpenTelemetry has a list of, I believe it’s six instruments currently that it supports to more or less have the same functionality as like the library. And I think a lot of those instruments were developed in collaboration with both the OpenMetrics and the Prometheus communities to ensure that we’re compatible with those folks. So, for the logging library, that’s a little bit different in OpenTelemetry, or at least it was at the time of writing my book, which was written in 2021, mostly. The idea behind logging in OpenTelemetry was, we already were aware there were so many different APIs for logging in each language. Each language has like a dozen logging APIs and we didn’t necessarily want to create a new logging API that people would have to adopt. And so, the idea was to really hook into those existing APIs. It’s been an interesting transition though. I think in the past, maybe in the past six or eight months or so, there’s been almost an ask for an API and an SDK in the logging signal as well. That’s still currently in development. So, stay tuned for what’s going to happen there.

Robert Blumen 00:13:51 In what languages are the OpenTelemetry SDKs available?

Alex Boten 00:13:57 Yeah, so there are currently 11 officially supported languages. I’m probably going to forget some of them, but there’s definitely one in C++, in Go, in Rust, in Python, Ruby, PHP, Java, JavaScript; all those languages are covered officially by OpenTelemetry. And what this means is that the implementations were reviewed by someone on the technical committee, and the implementations themselves live within the OpenTelemetry organization in GitHub and have the same process. We have maintainers and approvers for each one of those languages. There’s a couple of additional implementations that aren’t officially supported yet, but that’s really just because there haven’t been enough contributors to them yet. So, I think there’s one in Lua and maybe Julia is the other one?

Robert Blumen 00:14:46 I’ve found when instrumenting code I end up spending a lot of time doing things like writing a message that a certain method has been called, and here are the parameters: very boilerplate steps. I understand that OpenTelemetry can to some extent automate that? How does that work?

Alex Boten 00:15:08 Yeah, so there is — one of the very first OTEPs (the OpenTelemetry Enhancement Proposals) that was created in the early stages of the project was to help to support auto instrumentation out of the box. So, the effort of auto instrumentation in different languages is at different stages. So, I know the Java and the Python auto instrumentation efforts are a little bit further along. I think .NET is coming along nicely, and I think JavaScript is, as well. But the idea behind auto instrumentation with OpenTelemetry specifically is very similar to what we’ve seen in other efforts before where it really ties instrumentation to existing third party open-source library or third party libraries. Right? And the idea being, for example, if you’re using the Python SDK — I’m using that as an example because I spent a decent amount of time writing some code there.

Alex Boten 00:16:02 If you’re using the Python SDK and you wanted to use, for example, the Python Redis library, well you could use the instrumentation library that’s provided by OpenTelemetry, which allows you to call to this library, which monkey patches the Redis library that it then makes a call to. But, in that intermediate step, it acts as a middle layer that instruments the calls to the library that you would be making. So, if you were calling connect, for example, it would call connect on the instrumentation library, start a span, maybe record some kind of metric about the operation, make the call to the Redis library, and then on the return it would end the span and produce some telemetry there with some semantic convention attributes.

Robert Blumen 00:16:49 Explain the term monkey patching.

Alex Boten 00:16:52 So monkey patching is when a library intercepts a call and replaces a call with itself instead of the original call. So, in the case of the Redis example I was using, the Redis instrumentation library intercepts the call to connect to Redis, and then it replaces it with its own connect call, which does the instrumentation, as well.
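A minimal Python sketch of the monkey-patching idea just described. The `FakeRedisClient` class, the `instrument` helper, and the `recorded_spans` list are all hypothetical names made up for this illustration; this is not the actual OpenTelemetry Redis instrumentation, only the general technique of replacing a method with a wrapper that records a span around the original call.

```python
import functools


class FakeRedisClient:
    """Hypothetical stand-in for a third-party client library."""

    def connect(self, host):
        return f"connected to {host}"


# Spans recorded by the wrapper; a real SDK would export these instead.
recorded_spans = []


def instrument(cls, method_name):
    """Replace cls.method_name with a wrapper that records a span."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        span = {"name": f"{cls.__name__}.{method_name}", "ended": False}
        recorded_spans.append(span)  # "start" the span
        try:
            # Delegate to the untouched library code.
            return original(self, *args, **kwargs)
        finally:
            span["ended"] = True  # "end" the span on return or error

    # The monkey patch itself: callers now reach the wrapper.
    setattr(cls, method_name, wrapper)


instrument(FakeRedisClient, "connect")
result = FakeRedisClient().connect("localhost")
```

The application still calls `connect` exactly as before; only the class attribute was swapped at runtime. That is also why this approach is fragile when the underlying library changes a method signature.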

Robert Blumen 00:17:17 This I could see being very useful in that if you’ve got a library and something’s going wrong inside of the library, I don’t know where, then the previous option has been that I need to get the source code of the library, and if I want logging, I would have to go and insert log statements or insert metrics or whatever type of telemetry I’m trying to capture into someone else’s source code and rebuild it. So, does this enable you to get visibility of what is happening inside third-party libraries that you’ve downloaded with your package manager and you’re not interested in modifying the code?

Alex Boten 00:17:57 Right. I think that’s a key benefit of it is that you’re finally able to see what these libraries are doing, or maybe you’re not familiar with the code or you’re not really sure of the path through the code and you’re able to see all of the library calls that are instrumented on underneath the original call of your application, which a lot of the time you’ll find problems there, but it’s really hard to identify them because you don’t necessarily know what’s happening without reading the source code underneath at all.

Robert Blumen 00:18:24 I’ve used some of those languages in the 11. I am aware that every language is different as far as what access it gives you to intercept things at runtime or maybe generate byte code and inject it into the library. I would think that the ability to do this is going to differ considerably based on the language, and maybe C++ being rather unfriendly to that. Do you expect to achieve a parity with all the languages in the extent that you can offer this feature? Or will it always work better on some than others?

Alex Boten 00:19:02 That’s a great question. I think, ideally, I imagine that instrumentation libraries are a temporary fix. I really believe that’s what everybody’s hoping for within the community, and we’ve seen some open-source projects already reach out and start instrumenting their applications. We’re really hoping that these libraries will use the OpenTelemetry API to instrument themselves and remove the need for these instrumentation libraries altogether. For example, if an HTTP server framework were to instrument its calls to its endpoints using OpenTelemetry, the end user wouldn’t even need this instrumentation library. And we could achieve parity across all the languages because each one of those libraries would just use the standard rather than relying on either byte code manipulation or monkey patching, which works for what it is, but it’s not always the greatest option.

Alex Boten 00:20:01 With monkey patching, maybe the underlying library’s call changes parameters, and you have to keep track of those changes within those instrumentation libraries. And so that always poses a challenge. But ideally, like I said, those libraries will go away as the project continues to gain traction across the industry. And we’ve already seen, I think there were a few Python open-source projects that reached out. I know the Spring folks in Java had a project to instrument using OpenTelemetry. Envoy and a few other proxies have also started using OpenTelemetry. So definitely, I think instrumentation libraries are great for the short term, but in the long term it would be ideal if things were instrumented themselves.

Robert Blumen 00:20:45 That would be great. But there are always going to be some older libraries that maybe not under as active development where there’s not really anyone around to modify them. Then you always have this to fall back on in those cases. I wouldn’t see it’s going away.

Alex Boten 00:21:02 Right. Ideally, the norm would become: instrument your libraries with OpenTelemetry, and for those libraries that aren’t being modified, absolutely continue to use the mechanisms that we have in place today.

Robert Blumen 00:21:16 Now I think it’s the time to start talking about the Collector. We’ve talked about the source and how this data gets published. A little while ago we talked about you can send directly data from a publisher to a backend or you can have a Collector in between. What is the Collector, what does it do, why might I want one?

Alex Boten 00:21:36 Yeah, so the Collector is a separate process that would be running inside your environment. It is a binary that’s published as a separate binary, or docker image if you’re interested in that. There’s also packages for, I think, Debian and RedHat. And the Collector is really a destination for your telemetry that can then act as a router. So, it has a selection of, I believe it’s over a hundred receivers, which support different formats and also can scrape metric data from different systems. And it has exporters, and again, I lose track of it, but I think it’s over a hundred formats of exporters that the OpenTelemetry Collector supports. So you can send data to it in one format and export it using a different format if you’re so keen on. You can also use processors within the Collector, which allow you to manipulate the data, whether it be for things like redacting, maybe PII that you might have, or if you wanted to enrich the data with some additional attributes — maybe about your environments that only the Collector would know about.

Alex Boten 00:22:44 And that’s the Collector in a nutshell. It’s available to deploy, as I said, as an image or as a package. There’s also, you can deploy using Helm charts. You can deploy using the OpenTelemetry operator if you’re using a Kubernetes environment.

Robert Blumen 00:22:59 I’m going to delve into some of those internal components. I want to talk first a little bit about the networking. It can be simpler if I have N sources and K backends: instead of an N-cross-K topology, I have an N-cross-1 and a 1-cross-K. Do you have any thoughts on, is that a motivator to simplify your networking and everything that goes along with that? Is that a motivator for adopting a Collector?

Alex Boten 00:23:30 Yeah, I think so. I think the Collector makes it very appealing for a variety of reasons. One being that your egress from your network may only be coming from one point. So, from a security auditing kind of perspective, you can see where all the data is really going out rather than having a bunch of different endpoints that have to be connected to some external systems. I think from that point alone, it’s definitely worth deploying a Collector within a network. I think there is also the ability to throttle the data that’s going out is key. If you have N endpoints that are sending data, it’s really difficult to throttle how much data is actually leaving your network, which could end up being costly. So, if you wanted to do things like sampling, you would probably want to have a Collector in place, so that you could really adjust it as needed.

Robert Blumen 00:24:22 How much telemetry can one instance of Collector handle?

Alex Boten 00:24:30 Yeah, I mean I think that always depends on the size of the instance that you’re running. On the OpenTelemetry Collector repository, there are pretty comprehensive benchmarks that have been run against the Collector for traces, logs, and metrics. And I believe the instance sizes that were used, if memory serves right, they were using EC2 for the testing for the benchmarks. And I believe that’s all listed on the website there, for folks that are interested in finding out.

Robert Blumen 00:25:01 If I wanted to either run more workload than what I could put through one instance or, for high-availability reasons, have a clustered implementation with multiple Collectors, is it possible say to put a load balancer in front of it and distribute it? Or what are the options for a more clustered implementation?

Alex Boten 00:25:24 Yeah, so the way you would want to probably deploy this is: you would want to use some kind of load balancer. Depending on the telemetry you’re sending out, you may want to use something like a routing processor that allows you to be more specific as to which data each one of the Collectors will be receiving. So for example, if you had maybe a bunch of Collectors that are deployed closer to your applications, that would then be routed through maybe a Collector as a gateway, and you wanted to send only a certain number of traces to the Collector as a gateway, you could fork it using the routing processor based on the trace IDs or something like that, if you wanted to.

Robert Blumen 00:26:06 So, with stateless servers you can set up a pretty dumb load balancer and every request would get routed essentially to a random instance. Are there any reasons to have a bit more of a sharding or pinning of certain workloads in a clustered implementation?

Alex Boten 00:26:27 I think some of this depends on what you’re doing with the Collectors. So for example, if you’re doing sampling on traces, you wouldn’t want your sampling decision being made across Collectors, because there’s no way to share that sampling decision across Collectors. And so, you would want to be able to make that decision on the same instance of the Collector, for example. And so you would really want all of the data for a specific trace to go to the same Collector to be able to make the decision on the sample.
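One way to picture "all of the data for a specific trace goes to the same Collector" is a deterministic hash of the trace ID. The sketch below is an assumption-laden illustration, not how any particular Collector component is implemented: `pick_collector`, the collector names, and the sample spans are all hypothetical.

```python
import hashlib


def pick_collector(trace_id, collectors):
    """Map a trace ID to one collector, the same one every time."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Hypothetical collector instances and spans, for illustration only.
collectors = ["collector-0", "collector-1", "collector-2"]
spans = [
    {"trace_id": "trace-a", "name": "GET /cart"},
    {"trace_id": "trace-b", "name": "GET /home"},
    {"trace_id": "trace-a", "name": "SELECT cart_items"},
]

# Record which collector(s) each trace was routed to.
assignments = {}
for span in spans:
    target = pick_collector(span["trace_id"], collectors)
    assignments.setdefault(span["trace_id"], set()).add(target)
```

Because the mapping depends only on the trace ID, every span of `trace-a` lands on one instance, so a sampling decision for that trace can be made in one place. Note that a plain modulo like this reshuffles traces whenever the number of collectors changes; a production setup would need to account for that.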

Robert Blumen 00:26:56 You use the word gateway, which is a common word, but I understand it means something specific in OpenTelemetry where you have a gateway model and an agent model. Explain those two models, the difference between them.

Alex Boten 00:27:11 Yeah, so in the agent deployment for the OpenTelemetry Collector, you would be running your OpenTelemetry Collector on the same host or the same node, maybe as part of a DaemonSet in Kubernetes. So, you would have a separate instance of the Collector for each one of the nodes that are running inside your environment. And you would have your application sending data to the local agent before it would then send it up to wherever your destination is. In the gateway deployment model, you would have the Collector act as a standalone application, and it would have its own deployment. Maybe you would have one per data center or maybe one per region. And that would act as maybe the egress out of your network. And that’s kind of the gateway deployment.

Robert Blumen 00:28:02 What you described as an agent model that sounds very similar to me of what I’ve seen called sidecar with some other services. Is agent the same as a sidecar?

Alex Boten 00:28:14 Yes and no. It can be like a sidecar. When I think of a sidecar, I would assume that it would be attached to every application that’s running with a sidecar alongside it, which would mean that you might end up with several instances of the Collector running on the same node, for example, which may be necessary in specific cases, or it may not be. It really depends on your use case, whether or not there’s accessibility from your application to the host at all. That depends on what your policies are, how your policies are defined. So, it could be the same as the sidecar, but it doesn’t necessarily have to be.

Robert Blumen 00:28:52 Delving more into the internals of the Collector and what you can do, you talked about processors and exporters — and you’ve covered some of this before, but why don’t you start with what are some of the major types of processors that you might want to use?

Alex Boten 00:29:11 Yeah, so I think that the two processors recommended by the community are, first, the batch processor, which tries to take your data and batch it rather than sending it every time there’s telemetry coming in. This is trying to optimize some of the compression and reduce the amount of data that gets sent out. So that’s one of the recommended processors. The other one is the memory limit processor, which limits kind of the upper bound of memory that you would allow a Collector to use. So you would probably want to use that in the case where you have a specific instance of some sort with some kind of memory defined. You would want to configure your memory limit processor to be below that threshold so that when the Collector hits that memory limit, it can start returning error messages to all of its receivers so that maybe the senders of the data can go ahead and back off on the amount of data that’s being sent or something like that.
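The behaviour of these two processors can be approximated in a short Python sketch. It is a simplification under stated assumptions: `BatchProcessor`, `MemoryLimiter`, and `OverLimitError` are hypothetical classes written for this example, the batch is flushed only on size (no timer), and memory usage is supplied by a caller-provided function rather than measured.

```python
class OverLimitError(Exception):
    """Signals senders that they should back off (hypothetical)."""


class BatchProcessor:
    """Buffers items and sends them downstream in groups."""

    def __init__(self, send, max_batch_size):
        self._send = send
        self._max_batch_size = max_batch_size
        self._buffer = []

    def consume(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, if anything, as one batch.
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


class MemoryLimiter:
    """Refuses new data once estimated usage reaches the limit."""

    def __init__(self, next_consumer, limit_bytes, usage_fn):
        self._next = next_consumer
        self._limit_bytes = limit_bytes
        self._usage_fn = usage_fn  # assumption: caller supplies the estimate

    def consume(self, item):
        if self._usage_fn() >= self._limit_bytes:
            raise OverLimitError("memory limit reached; back off")
        self._next.consume(item)


sent_batches = []
batcher = BatchProcessor(sent_batches.append, max_batch_size=3)
usage = {"bytes": 0}
limiter = MemoryLimiter(batcher, limit_bytes=100, usage_fn=lambda: usage["bytes"])

for i in range(7):
    limiter.consume(f"span-{i}")
batcher.flush()  # push out the final partial batch

usage["bytes"] = 150  # simulate the process crossing its limit
try:
    limiter.consume("span-7")
    rejected = False
except OverLimitError:
    rejected = True
```

Seven items leave as three downstream sends instead of seven, and once the simulated usage crosses the limit the sender gets an error it can react to by slowing down.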

Alex Boten 00:30:02 One of the other processors that’s really interesting to many folks is the transform processor, which allows you to use the OpenTelemetry Transformation Language to modify data. So, maybe you want to strip some particular attributes, or maybe you want to change some values inside your telemetry data, and you can do that with the transform processor, which is still currently under development. But I think in the early days there was a lot of excitement around what could be done with processors. And so, people started developing filtering processors and attribute processors for metrics and all these other kinds of processors, which made it a little bit complicated to know which processors folks should be using because there are so many of them. And sometimes, one may support one signal but not the other, whereas the transform processor really tries to unify this into a single processor that can be used to do all of that.

Robert Blumen 00:30:55 You said there’s a lot of excitement around this feature. What was it that people found so exciting about it?

Alex Boten 00:31:01 Yeah, I think from the maintainer and contributor standpoint, I think we were looking forward to deprecating some of the other processors that could be combined within a single one. It reduces the, again, I think it reduces the cognitive load that people have to deal with when ramping up on OpenTelemetry. I think knowing that if you want to modify your telemetry, all you have to do is use this one processor and, learn the language that you would need to transform the data as opposed to going through and searching the repository for five or six different processors. I think that’s generally great to consolidate that a little bit.

Robert Blumen 00:31:39 Tell me more about the language that is used to do these transforms.

Alex Boten 00:31:43 Yeah, so the OpenTelemetry Transformation Language, for folks that are interested in finding the full definition, is all available inside the OpenTelemetry Collector contrib repository, but it really allows folks to define, in a language that’s signal agnostic, what they would like to do with their data. So it allows you to get particular attributes, set particular attributes, and modify data inside your Collector.
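To make "signal agnostic" concrete, here is a small Python sketch of the idea: the same attribute operations applied to a span and to a log record. This is not the OpenTelemetry Transformation Language and does not use its syntax; `set_attribute`, `delete_attribute`, `apply_transforms`, and the attribute names are hypothetical, chosen only to illustrate getting, setting, and stripping attributes.

```python
def set_attribute(key, value):
    """Build an operation that sets one attribute on a record."""
    def operation(record):
        record["attributes"][key] = value
    return operation


def delete_attribute(key):
    """Build an operation that strips one attribute if present."""
    def operation(record):
        record["attributes"].pop(key, None)
    return operation


def apply_transforms(record, operations):
    """Run each operation in order; works for any record with attributes."""
    for operation in operations:
        operation(record)
    return record


# One list of operations, reused across signals.
operations = [
    delete_attribute("user.email"),  # e.g. redact PII
    set_attribute("deployment.environment", "staging"),  # e.g. enrich
]

span = {
    "name": "checkout",
    "attributes": {"user.email": "a@example.com", "http.method": "POST"},
}
log = {
    "body": "payment accepted",
    "attributes": {"user.email": "a@example.com"},
}

apply_transforms(span, operations)
apply_transforms(log, operations)
```

The operations never inspect whether they are handling a span or a log, which is the property that lets one processor replace several signal-specific ones.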

Robert Blumen 00:32:09 The other internal component of Collectors I want to spend some time on is exporters. What do those do?

Alex Boten 00:32:17 Yeah, so the exporters take the data that’s been ingested by the OpenTelemetry Collector. So, the OpenTelemetry Collector uses receivers to receive the data in a format that’s specific to whichever receiver is configured. It then transforms the data to an internal data format within the Collector, and then it exports it using whichever exporter is configured. So, the exporter’s job is to take the data in the internal data format and format it to the specification of the destination of the exporter.
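
The receive, convert, and export flow can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `InternalSpan` stands in for the Collector's internal data format, the two "wire formats" are invented, and the JSON output is a made-up destination format rather than any real backend's specification.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class InternalSpan:
    """Hypothetical stand-in for the Collector's internal representation."""
    name: str
    duration_ms: float


def receive_format_a(payload: Dict[str, Any]) -> InternalSpan:
    """Hypothetical receiver: format A reports duration in microseconds."""
    return InternalSpan(
        name=str(payload["operation"]),
        duration_ms=float(payload["duration_us"]) / 1000.0,
    )


def receive_format_b(payload: Dict[str, Any]) -> InternalSpan:
    """Hypothetical receiver: format B reports duration in seconds."""
    return InternalSpan(
        name=str(payload["span_name"]),
        duration_ms=float(payload["seconds"]) * 1000.0,
    )


def export_as_json(span: InternalSpan) -> str:
    """Hypothetical exporter: serialize the internal form for one destination."""
    return json.dumps({"name": span.name, "durationMs": span.duration_ms}, sort_keys=True)


if __name__ == "__main__":
    span = receive_format_a({"operation": "GET /", "duration_us": 2500})
    print(export_as_json(span))
```

The point of the shared internal form is that the exporter never needs to know which receiver produced the data.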

Robert Blumen 00:32:50 Okay. So, what are some examples of different exporters that are available?

Alex Boten 00:32:54 Yeah, so there’s a bunch of exporters that are vendor-specific exporters that live in the repository today. There’s also many of the open-source projects have their own exporters. So, Jaeger has its own, Prometheus has its own exporter. There’s a few different logging options as well. Yeah.

Robert Blumen 00:33:12 So data comes in, it goes through some number of processors and then goes out through an exporter. Is there a concept of a pipeline that maps the path that data takes through the Collector?

Alex Boten 00:33:26 Yeah, so the best place to find this is really inside the Collector configuration. So, the Collector is configured using YAML and at the very essence of it, you would configure your exporters, your receivers, and your processors, and then you would define the path through those components in the pipeline section of the configuration, which allows you to specify what pipelines you want to configure for tracing, and for logs, and for metrics to go through to the Collector. So, you would configure your receivers there, and then your processors, and then your exporters within each one of those definitions. And you can configure multiple pipelines for each signal, giving them individual names.
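
To make the shape of that configuration concrete, here is a Python dictionary that mirrors the layout just described: top-level `receivers`, `processors`, and `exporters` sections, plus pipelines under `service` that reference them by name, with a second traces pipeline given its own name. The real configuration is written in YAML; the specific components, the placeholder endpoint, and the `undefined_components` helper are assumptions for this sketch and should be checked against the documentation for your Collector version.

```python
from typing import Any, Dict, List, Tuple

# Mirrors the YAML layout as a dict; component choices are illustrative only.
collector_config: Dict[str, Any] = {
    "receivers": {"otlp": {}, "zipkin": {}},
    "processors": {"batch": {}},
    "exporters": {
        "otlp": {"endpoint": "backend.example.com:4317"},  # placeholder endpoint
        "debug": {},
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp"],
            },
            # A second, individually named pipeline for the same signal.
            "traces/zipkin": {
                "receivers": ["zipkin"],
                "processors": ["batch"],
                "exporters": ["debug"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp"],
            },
        }
    },
}


def undefined_components(config: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """List (pipeline, section, component) references that are not defined.

    Hypothetical helper for this sketch; it is not part of the Collector.
    """
    problems: List[Tuple[str, str, str]] = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for component in pipeline.get(section, []):
                if component not in config.get(section, {}):
                    problems.append((pipeline_name, section, component))
    return problems


if __name__ == "__main__":
    print(undefined_components(collector_config))
```

A check like this catches the common mistake of referencing a component in a pipeline without defining it in the corresponding top-level section.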

Robert Blumen 00:34:07 And how does incoming data select or get mapped onto a particular pipeline?

Alex Boten 00:34:14 Yeah, so the way that the data would be mapped on each pipeline is via the specific receiver that is used to receive the data. So for example, if you’ve configured a Jaeger receiver on one pipeline and a Zipkin receiver on a different pipeline and you’re sending data through Zipkin, then the pipeline that has the Zipkin endpoint would be the destination of that data, and then that’s the pipeline that the data would go through.
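
The mapping rule described here, where the receiver that accepted the data determines which pipeline it flows through, can be sketched as a simple lookup. The pipeline names and the `pipelines_for` function are hypothetical and only model the idea; they do not reflect how the Collector implements routing internally.

```python
from typing import Dict, List

# Hypothetical table: pipeline name -> names of the receivers attached to it.
PIPELINE_RECEIVERS: Dict[str, List[str]] = {
    "traces/jaeger": ["jaeger"],
    "traces/zipkin": ["zipkin"],
}


def pipelines_for(receiver: str, pipelines: Dict[str, List[str]]) -> List[str]:
    """Return every pipeline that lists `receiver` among its receivers."""
    return [name for name, receivers in pipelines.items() if receiver in receivers]


if __name__ == "__main__":
    # Data arriving on the Zipkin endpoint flows only through the Zipkin pipeline.
    print(pipelines_for("zipkin", PIPELINE_RECEIVERS))
```

Returning a list leaves room for the case where one receiver is attached to more than one pipeline.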

Robert Blumen 00:34:40 So, does each endpoint listen on a different port or does it have a path or what’s the mapping?

Alex Boten 00:34:47 Yeah, so that depends on the specific receiver. So, some receivers have the ability to configure different paths; some only configure different ports. It also depends on the protocol that you’re using for the receiver and whether it supports it or not. And as I mentioned, there’s also these things known as scrapers, which are receivers that can go out and scrape different endpoints for metrics, for example. And those can also be configured as receivers, which would then take their own path to the Collector.

Robert Blumen 00:35:17 I think we’ve been mostly talking about under the assumption of a push model, but this scraper sounds like it also supports pull. Did I understand that correctly?

Alex Boten 00:35:28 Yeah, that’s correct. And, if you think of the Prometheus receiver, for example, the Prometheus receiver uses the pull model as well. So, you would define the targets that you would like to scrape, and then the data will be pulled into the Collector as opposed to pushed to the Collector.

Robert Blumen 00:35:43 So to wrap this all up, then I would instrument or configure my sources to point them toward the OTel Collector or Collectors on my network. They would have a domain name or an IP address and a port and maybe a path that comes after that. They’re instrumented, they push data out, it goes to the Collector, the Collector will process it and then export it to the backend of choice. Is that a good description of the whole process?

Alex Boten 00:36:17 Yeah, that’s exactly right.

Robert Blumen 00:36:18 How do the sources authenticate themselves to the Collector?

Alex Boten 00:36:23 Yeah, so for authenticating to the OpenTelemetry Collector, there are several extensions that are available for authentication. So, there’s the OIDC authentication extension, there’s the bearer token authentication extension. You can also use the basic auth extension if you’d like. So, there are a few different available extensions for that.

Robert Blumen 00:36:43 Yeah, okay. Well, let’s talk about extensions. So, what are the extension points that are offered?

Alex Boten 00:36:49 Yeah, so extensions are essentially components in the Collector that don’t necessarily have anything to do with the pipeline of the telemetry going through the Collector. And so, some of the extensions that are available are the pprof extension, which allows you to get profiling data out of the Collector. There’s the health check extension, which allows you to run health checks against the Collector, and there’s a few other ones that are all available in the Collector repositories.

Robert Blumen 00:37:20 Okay. So, we’ve pretty much covered most of what I had planned about what it does, how it works. Suppose you have a project that has not been built with this in mind and is interested in migrating. What is a possible migration path to OTel from a project that might have been built several years ago before this was available?

Alex Boten 00:37:45 I would say the first path that I would recommend to folks is really to think about, is there a way that I can drop in a Collector and receive data in the format that’s already maybe being emitted by an application? That’s really the very first step that I would suggest taking. I know that there are a few different mechanisms for collecting telemetry that predate the Collector. So, Telegraf is an example of one of those. If you have Telegraf running in your environment and you’re interested in seeing if you can connect it to the Collector, maybe a good place to start is to look at connecting the two. And I know Telegraf, for example, emits OTLP, so that’s already something that is somewhat supported. So that’s really the first step I would take: can I just get away with dropping in a Collector and emitting a format that’s maybe already supported?

Alex Boten 00:38:30 One thing to note is if you have a format out there that’s not currently supported in the Collector, you can always go to the community and ask, ‘hey, is this a component that folks are interested in adopting?’ And that’s always a good avenue to kind of take on. If you’ve got commitment from your organization to maybe change the instrumentation libraries that you’re using within your code, then great. I would start looking at resources. I know there are a few different use cases that have been documented, I think on OpenTelemetry.io, around migrating away from either OpenTracing or OpenCensus. So, I would definitely start looking for those resources.

Robert Blumen 00:39:07 So we’ve talked about the history and what it does, what’s on the roadmap?

Alex Boten 00:39:12 Yeah, so on the roadmap for OpenTelemetry, which we actually very recently published. So, up until earlier this year there wasn’t an official roadmap published by the community. But we’re finally starting to change the process a little bit to try and really focus the efforts of the community. So, currently on the roadmap we have five projects that are happening. So, some of the work is being done around both client-side instrumentation, so either web browser-based or mobile clients, and around profiling. So, this is profiling data being emitted using an existing format, but there’s some discussion around whether or not there’s going to be an additional signal called profiles added to OpenTelemetry. There’s also a lot of effort being put into trying to stabilize semantic conventions. So, if you’ve seen the semantic conventions inside the OpenTelemetry specification, you’ll probably know that a lot of them are marked as experimental.

Alex Boten 00:40:10 And that’s just because we haven’t had the chance to really focus the community on trying to come to agreement on what stable semantic conventions should look like. So, there’s a lot of effort to bring in experts in each one of the domains to ensure that they make sense. The other effort that I’m excited about, because I’m part of the work, is to put together a configuration layer for OpenTelemetry as a whole so that users can configure using some kind of configuration file, take that configuration file across any implementation, and know that the same results will occur. So, for example, if you’re configuring your Jaeger exporter in Python, using this configuration format you’d be able to take that same configuration to your .NET implementation or Java and not have to write code manually to translate that configuration. And then, there’s some effort around function-as-a-service support from OpenTelemetry. So, the group is currently focused around Lambdas because that’s the first serverless or function-as-a-service model that’s come to us. But there’s also effort to bring in folks from Azure and GCP as well, to kind of round that out.

Robert Blumen 00:41:19 We’re at time, we’ve covered everything. Where can listeners find your book?

Alex Boten 00:41:25 Yeah, so you can find the book on Amazon. You can also buy directly from Packt Publishing. And yeah, it’s also available at your local bookstores.

Robert Blumen 00:41:35 If users would like to find your presence anywhere on the internet, where should they look?

Alex Boten 00:41:40 Yeah, so they can find me on LinkedIn, a little bit on Mastodon, or on Twitter, though not as much anymore. And they can find me on the Slack channels for the CNCF Slack instance. I’m pretty active there.

Robert Blumen 00:41:55 Alex Boten, thank you very much for speaking to Software Engineering Radio.

Alex Boten 00:41:59 Yeah, thank you very much. It’s been great.

Robert Blumen 00:42:01 This has been Robert Blumen for Software Engineering Radio. Thank you for listening. [End of Audio]