What is Distributed Tracing, and Why Should you Care About it?

A presentation at All Things Open in October 2021 in Raleigh, NC, USA by Ricardo Ferreira

Criminal Investigation TV Shows and Movies @riferrei

Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case

Agenda
• What is distributed tracing?
• How does distributed tracing work?
• Benefits of distributed tracing
• Challenges: it ain’t all flowers

Who am I?
• Principal Developer 🥑 at Elastic
• Community, Developer Relations
• Before Elastic ➡ Confluent, Oracle, and Red Hat (JBoss)
• Distributed systems, databases, observability, streaming systems
• https://riferrei.com

For starters: it is nothing new!
Each log statement has its own “schema”

    type Log struct {
        request_path string
        request_size int64
        status       int32
        latency_ms   float64
    }

    logEntry := Log{"/customers/find", 840, 200, 35}
    fmt.Printf("%+v", logEntry)

Follow the thread troubleshooting
[Diagram: Thread 1 and Thread 2 pass through the API, Customer, and Database components; each component writes one log entry per thread]
• API: Log{"/api/find", 840, 200, 35} and Log{"/api/find", 840, 200, 45}
• Customer: Log{"/customer/find", 235, 200, 30} and Log{"/customer/find", 235, 200, 42}
• Database: Log{"/db/find", 450, 200, 5} and Log{"/db/find", 450, 200, 3}

Stitching the threads manually

    type CustomLog struct {
        request_path string
        request_size int64
        status       int32
        latency_ms   float64
        customer_id  int64
        database_id  int64
    }

    logEntry := CustomLog{
        requestPath(), requestSize(), requestStatus(), latencyMS(),
        cust.customerID(), db.databaseTenantID(),
    }
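To make the manual stitching idea concrete, here is a minimal sketch that groups log entries by customer ID so the entries belonging to one request flow can be read together. It is an illustration only: the exported field names, the stitchByCustomer helper, and the customer and database ID values are assumptions that do not appear on the slide.

```go
package main

import (
	"fmt"
	"sort"
)

// CustomLog mirrors the struct on the slide, using exported Go-style names.
type CustomLog struct {
	RequestPath string
	RequestSize int64
	Status      int32
	LatencyMS   float64
	CustomerID  int64
	DatabaseID  int64
}

// stitchByCustomer (hypothetical helper) groups entries by CustomerID,
// preserving input order within each group.
func stitchByCustomer(entries []CustomLog) map[int64][]CustomLog {
	grouped := make(map[int64][]CustomLog)
	for _, e := range entries {
		grouped[e.CustomerID] = append(grouped[e.CustomerID], e)
	}
	return grouped
}

func main() {
	// Hypothetical entries: customer IDs 1 and 2 and database ID 7 are made up.
	entries := []CustomLog{
		{"/api/find", 840, 200, 35, 1, 7},
		{"/api/find", 840, 200, 45, 2, 7},
		{"/customer/find", 235, 200, 30, 1, 7},
		{"/customer/find", 235, 200, 42, 2, 7},
		{"/db/find", 450, 200, 5, 1, 7},
		{"/db/find", 450, 200, 3, 2, 7},
	}

	grouped := stitchByCustomer(entries)

	// Sort the customer IDs so the output order is stable.
	ids := make([]int64, 0, len(grouped))
	for id := range grouped {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	for _, id := range ids {
		fmt.Printf("customer %d\n", id)
		for _, e := range grouped[id] {
			fmt.Printf("  %-15s %5.0fms\n", e.RequestPath, e.LatencyMS)
		}
	}
}
```

The grouping only works because every component agreed to log the same extra field, which is the manual effort that tracing is meant to automate.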

Beginning of chaos: distributed computing
[Diagram: Host 1 and Host 2, each running Thread 1 and Thread 2]

Using virtualization
[Diagram: Host 1 with VM 1 and VM 2; Thread 1 and Thread 2 run across the API, Customer, and Database components in each VM]

Using containerization
[Diagram: Host 1, VM 1, VM 2, Container 1, and Container 2; Thread 1 and Thread 2 run across the API, Customer, and Database components in each container]

“Let’s now break down the services into functions”
Ops: Is this a joke to you?

Tracing automates system-wide stitching
Transaction data is collected and becomes ready to search
[Diagram: a transaction spanning Service A, Service B, Service C, and Service D]

How Does Distributed Tracing Work?

“Whatcha talkin’ about, Willis?”
It is all about setting the context

Tracing, spans, and context propagation
Trace ID: 12345 ⬅ This is the context!
• Transaction (Root Span), Trace ID: 12345, Time: 55ms
• Service A (Child Span), Trace ID: 12345, Time: 30ms
• Service B (Child Span), Trace ID: 12345, Time: 15ms
• Service C (Child Span), Trace ID: 12345, Time: 5ms
• Service D (Child Span), Trace ID: 12345, Time: 5ms

Capture, process, store, repeat
[Diagram: the child spans of Service A, Service B, Service C, and Service D, with times of 30ms, 15ms, 5ms, and 5ms, flow into the Tracing Pipeline and then into the Data Store]

One heck of a data store 💡
[Diagram: Metrics, Traces, and Logs in a single Data Store]

Benefits of Distributed Tracing

Reduced mean time to resolution (MTTR)
• Suspect (10%): watching metric values, caught up with alerts, bringing people onboard
• Drill Down (60%): understand topologies, isolate the anomalies, collect contextual data
• Solve (30%): read logs and events, create code patches, create a new release

Finding bugs between releases
[Diagram: V1.12 and V1.13]

Challenges: because it ain’t all flowers

Picking open-source is not always easy
• Agent for my programming language?
• Frameworks versus libraries?
• Data store scalability?
• Does my architecture fit?

Black-box versus white-box tracing
• Black-box: code is not changed; handled by the runtime; minimal execution visibility
• White-box: requires code changes; handled by the application; full execution visibility
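To show what "requires code changes" means for white-box tracing, the sketch below wraps application functions in explicit tracing calls. It is only an illustration under stated assumptions: the Tracer and Timing types, the Trace method, and the findCustomer function are hypothetical stand-ins for a real tracing SDK and real application code.

```go
package main

import (
	"fmt"
	"time"
)

// Timing is a hypothetical record of how long one named operation took.
type Timing struct {
	Name     string
	Duration time.Duration
}

// Tracer collects timings in memory; it stands in for a real tracing SDK.
type Tracer struct {
	Timings []Timing
}

// Trace runs fn and records its duration under name, even if fn fails.
func (t *Tracer) Trace(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	t.Timings = append(t.Timings, Timing{Name: name, Duration: time.Since(start)})
	return err
}

// findCustomer is application code that was edited to add tracing calls:
// this is the "requires code changes" side of white-box tracing.
func findCustomer(t *Tracer) error {
	return t.Trace("customer.find", func() error {
		return t.Trace("db.find", func() error {
			time.Sleep(2 * time.Millisecond) // simulated database work
			return nil
		})
	})
}

func main() {
	tracer := &Tracer{}
	if err := findCustomer(tracer); err != nil {
		fmt.Println("error:", err)
	}
	for _, tm := range tracer.Timings {
		fmt.Printf("%s took %v\n", tm.Name, tm.Duration)
	}
}
```

The developer chooses exactly which operations are visible, which is where the full execution visibility comes from. With black-box tracing the same function would stay untouched and the runtime or an agent would decide what to record.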

Client-side tracing is in its early days

Sampling everything comes at a price
[Diagram: Tracing Pipeline]
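One common way to avoid paying for every trace is to keep only a fraction of them. The slide does not prescribe a strategy, so the following is just one possible sketch: a deterministic, ratio-based decision derived from the trace ID. The shouldSample function, the bucket count, and the trace ID format are assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic keep/drop decision from the trace ID,
// so every service that sees the same ID reaches the same decision.
// ratio is the fraction of traces to keep, from 0.0 to 1.0.
func shouldSample(traceID string, ratio float64) bool {
	const buckets = 10000
	h := fnv.New32a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	return float64(h.Sum32()%buckets) < ratio*buckets
}

func main() {
	kept := 0
	total := 100000
	for i := 0; i < total; i++ {
		// Hypothetical trace IDs, generated only for this demonstration.
		if shouldSample(fmt.Sprintf("trace-%d", i), 0.10) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces\n", kept, total)
}
```

Hashing the trace ID instead of rolling a random number per service keeps a trace either whole or absent, at the cost of possibly dropping the one trace you later wish you had.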

Useful Resources
✓ Join #openTelemetry on https://slack.cncf.io
✓ https://github.com/riferrei/otel-with-java
✓ Book ➡ Distributed Tracing in Practice ✅

Additional Resources
• Discuss Forum
• Elastic Community
• Ricardo’s Channel

Ricardo Ferreira
@riferrei


Let’s face it: most people talking about o11y (observability) end up talking about distributed tracing somehow. It is a technology that is radically changing the way we identify and solve technical problems. In a world where virtually all applications are born distributed, it is something that you as an SRE ought to know in more detail.

This talk will provide a pragmatic overview of distributed tracing by clearly articulating its motivation, the problems it solves, its challenges, the technologies you should use to ensure a vendor-agnostic implementation, and the aspects you should consider when picking an o11y backend.

While discussing the challenges, this talk will highlight white-box versus black-box instrumentation, which is valuable knowledge for determining where the developer’s responsibility ends, where the Ops team’s begins, and where the two teams’ responsibilities may overlap.

Code

The following code examples from the presentation can be tried out live.

https://github.com/riferrei/otel-with-java

OpenTelemetry in Java with Elastic Observability

This project showcases how to instrument a microservice written in Java using OpenTelemetry, so that it produces telemetry data (traces and metrics) and sends it to Elastic Observability.

Buzz and feedback

Here’s what was said about this presentation on social media.

Thanks @riferrei for a really great talk at All Things Open explaining the basics of distributed tracing. Although I'm new to the topic, I was able to follow along for all of it and I learned a lot. That's everything I'm looking for in a conference talk :-)
— Dani (@DanisYellis) October 19, 2021

Finally, tomorrow I'll give an in-person talk about distributed tracing at @AllThingsOpen. This will be an introductory talk about this where I will make you understand how it works and also — fall in love with distributed tracing 😍
If you're in Raleigh tomorrow, come to say 👋🏻 pic.twitter.com/7kFBKhEcm5
— Ricardo Ferreira (@riferrei) October 18, 2021

We're thrilled to have Ricardo Ferreira - @riferrei, Principal Developer Advocate at @elastic, presenting 'What is distributed tracing, and why should you care about it?' #AllThingsOpen! https://t.co/DKMD4DZeg6 pic.twitter.com/uDJMsl7l0v
— All Things Open (@AllThingsOpen) October 7, 2021
