What is Distributed Tracing, and Why Should You Care About it? Ricardo Ferreira Principal Developer Advocate Elastic @riferrei

Criminal Investigation TV Shows and Movies

Detective TV shows and movies
• Collecting evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Building a strong case

Agenda
• What is distributed tracing?
• How does distributed tracing work?
• Benefits of distributed tracing
• Challenges: it ain’t all flowers

Who am I?
• Principal Developer 🥑 at Elastic
• Community, Developer Relations
• Before Elastic ➡ Confluent, Oracle, and Red Hat (JBoss)
• Distributed systems, databases, observability, streaming systems
• https://riferrei.com

What is Distributed Tracing?

For starters: it is nothing new!

    package main

    import "fmt"

    // Log is the ad hoc "schema" of a single log statement.
    type Log struct {
        request_path string
        request_size int64
        status       int32
        latency_ms   float64
    }

    func main() {
        logEntry := Log{"/customers/find", 840, 200, 35}
        fmt.Printf("%+v\n", logEntry)
    }

Each log statement has its own “schema”

Follow the thread troubleshooting
[Diagram: Thread 1 and Thread 2 each pass through API, Customer, and Database, emitting one log entry per hop]
Log entries shown:
    Log{"/api/find", 840, 200, 35}
    Log{"/api/find", 840, 200, 45}
    Log{"/customer/find", 235, 200, 30}
    Log{"/customer/find", 235, 200, 42}
    Log{"/db/find", 450, 200, 5}
    Log{"/db/find", 450, 200, 3}

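Stitching the threads manually

    // CustomLog extends the log "schema" with the IDs needed to
    // correlate entries by hand.
    type CustomLog struct {
        request_path string
        request_size int64
        status       int32
        latency_ms   float64
        customer_id  int64
        database_id  int64
    }

    // The helper calls below are the slide's placeholders for values
    // taken from the current request, customer, and database.
    logEntry := CustomLog{
        requestPath(), requestSize(),
        requestStatus(), latencyMS(),
        cust.customerID(),
        db.databaseTenantID(),
    }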
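To make the manual approach concrete, here is a minimal sketch that groups log entries by the customer ID they carry. The `stitchByCustomer` helper, the camelCase field names, and the sample ID values are assumptions for illustration; they are not part of the original slides.

```go
package main

import "fmt"

// CustomLog mirrors the slide's struct (with Go-style field names):
// every entry carries the IDs needed to correlate it by hand.
type CustomLog struct {
	requestPath string
	requestSize int64
	status      int32
	latencyMS   float64
	customerID  int64
	databaseID  int64
}

// stitchByCustomer groups log entries by customer ID, keeping the
// input order within each group. (Hypothetical helper.)
func stitchByCustomer(entries []CustomLog) map[int64][]CustomLog {
	grouped := make(map[int64][]CustomLog)
	for _, e := range entries {
		grouped[e.customerID] = append(grouped[e.customerID], e)
	}
	return grouped
}

func main() {
	// The customer and database IDs are made-up sample values.
	entries := []CustomLog{
		{"/api/find", 840, 200, 35, 1, 10},
		{"/api/find", 840, 200, 45, 2, 10},
		{"/customer/find", 235, 200, 30, 1, 10},
		{"/customer/find", 235, 200, 42, 2, 10},
		{"/db/find", 450, 200, 5, 1, 10},
		{"/db/find", 450, 200, 3, 2, 10},
	}
	for id, group := range stitchByCustomer(entries) {
		fmt.Printf("customer %d: %d entries\n", id, len(group))
	}
}
```

This works only as long as every service remembers to log the same IDs, which is the burden that tracing is meant to remove.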

Beginning of chaos: distributed computing
[Diagram: Host 1 and Host 2, each running Thread 1 and Thread 2]

Using virtualization
[Diagram: Host 1 runs VM 1 and VM 2; in each VM, Thread 1 and Thread 2 pass through API, Customer, and Database]

Using containerization
[Diagram: Host 1 with VM 1 and VM 2, plus Container 1 and Container 2; in each container, Thread 1 and Thread 2 pass through API, Customer, and Database]

“Let’s now break down the services into functions”
Ops: Is this a joke to you?

Tracing automates system-wide stitching
Transaction data is collected and becomes searchable
[Diagram: a transaction flowing through Service A, Service B, Service C, and Service D]


How Does Distributed Tracing Work?

“Whatcha talkin’ about, Willis?”
It is all about setting the context

Tracing, spans, and context propagation
Trace ID: 12345 ⬅ This is the context!
• Transaction (Root Span): Trace ID 12345, Time: 55ms
• Service A (Child Span): Trace ID 12345, Time: 30ms
• Service B (Child Span): Trace ID 12345, Time: 15ms
• Service C (Child Span): Trace ID 12345, Time: 5ms
• Service D (Child Span): Trace ID 12345, Time: 5ms
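A minimal sketch of context propagation, using only Go's standard `context` package rather than a real tracing library. The `Span` type and the helpers `newTraceContext`, `startSpan`, and `handleTransaction` are hypothetical names, and the sketch assumes every service span is a direct child of the root span, which the slide does not state.

```go
package main

import (
	"context"
	"fmt"
)

// traceIDKey is an unexported key type, so no other package can
// collide with the value stored in the context.
type traceIDKey struct{}

// Span is a simplified record of one unit of work in a trace.
type Span struct {
	TraceID string
	Name    string
	Parent  string // empty for the root span
}

// newTraceContext returns a context carrying the trace ID.
// This value is "the context" that gets propagated.
func newTraceContext(id string) context.Context {
	return context.WithValue(context.Background(), traceIDKey{}, id)
}

// traceIDFrom reads the trace ID back out ("" if none was set).
func traceIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceIDKey{}).(string)
	return id
}

// startSpan creates a span that inherits its trace ID from ctx.
func startSpan(ctx context.Context, name, parent string) Span {
	return Span{TraceID: traceIDFrom(ctx), Name: name, Parent: parent}
}

// handleTransaction opens the root span and then "calls" each
// service, passing the same ctx so all spans share one trace ID.
func handleTransaction(ctx context.Context) []Span {
	spans := []Span{startSpan(ctx, "Transaction", "")}
	for _, svc := range []string{"Service A", "Service B", "Service C", "Service D"} {
		spans = append(spans, startSpan(ctx, svc, "Transaction"))
	}
	return spans
}

func main() {
	for _, s := range handleTransaction(newTraceContext("12345")) {
		fmt.Printf("%+v\n", s)
	}
}
```

Across process boundaries, the same idea typically applies with the trace ID carried in request headers instead of an in-memory context value.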

Capture, process, store, repeat
[Diagram: the child spans for Service A, Service B, Service C, and Service D, each with its time, flow into the Tracing Pipeline, which writes to the Data Store]
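The capture, process, store loop can be sketched with a channel feeding an in-memory store. Everything here (`Span`, `Store`, `runPipeline`, `ingest`, and the rule that spans without a trace ID are dropped) is an assumption for illustration, not how any particular tracing backend works.

```go
package main

import (
	"fmt"
	"sync"
)

// Span is a simplified captured span.
type Span struct {
	TraceID    string
	Name       string
	DurationMS int
}

// Store is an in-memory stand-in for a trace data store.
type Store struct {
	mu     sync.Mutex
	traces map[string][]Span
}

func NewStore() *Store { return &Store{traces: make(map[string][]Span)} }

// Save appends a span to its trace.
func (s *Store) Save(sp Span) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.traces[sp.TraceID] = append(s.traces[sp.TraceID], sp)
}

// Trace returns a copy of the spans stored for a trace ID.
func (s *Store) Trace(id string) []Span {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]Span(nil), s.traces[id]...)
}

// runPipeline drains captured spans, drops those without a trace ID
// (the "process" step), and stores the rest until in is closed.
func runPipeline(in <-chan Span, store *Store) {
	for sp := range in {
		if sp.TraceID == "" {
			continue
		}
		store.Save(sp)
	}
}

// ingest pushes spans through the pipeline and waits for it to finish.
func ingest(spans []Span, store *Store) {
	ch := make(chan Span)
	done := make(chan struct{})
	go func() {
		defer close(done)
		runPipeline(ch, store)
	}()
	for _, sp := range spans {
		ch <- sp
	}
	close(ch)
	<-done
}

func main() {
	store := NewStore()
	// Sample spans; names and durations are illustrative values.
	ingest([]Span{
		{"12345", "Service A", 30},
		{"12345", "Service B", 15},
		{"", "malformed", 1},
	}, store)
	fmt.Printf("%+v\n", store.Trace("12345"))
}
```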

One heck of a data store 💡
[Diagram: Metrics, Traces, and Logs feeding one Data Store]


Benefits of Distributed Tracing

Reduced mean time to resolution (MTTR)

Suspect (10%)
• Watching metric values
• Catching up with alerts
• Bringing people on board

Drill Down (60%)
• Understanding topologies
• Isolating the anomalies
• Collecting contextual data

Solve (30%)
• Reading logs and events
• Creating code patches
• Creating a new release

Finding bugs between releases
[Diagram: V1.12 next to V1.13]

Improved Team Collaboration


Challenges: because it ain’t all flowers

Picking open source is not always easy
• Agent for my programming language?
• Frameworks versus libraries?
• Data store scalability?
• Does my architecture fit?

Black-box versus white-box tracing

Black-box
• Code is not changed
• Handled by the runtime
• Minimal execution visibility

White-box
• Requires code changes
• Handled by the application
• Full execution visibility
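The trade-off can be illustrated with a small sketch. It is only an analogy: real black-box tracing is usually done by an agent or the runtime, whereas here a plain wrapper function stands in for it. `Recorder`, `blackBox`, `findCustomer`, and `findCustomerInstrumented` are hypothetical names.

```go
package main

import "fmt"

// Recorder collects span names; a stand-in for a real tracer.
type Recorder struct{ spans []string }

func (r *Recorder) record(name string) { r.spans = append(r.spans, name) }

// findCustomer is business code with no tracing calls in it.
func findCustomer(id int) string {
	return fmt.Sprintf("customer-%d", id)
}

// blackBox wraps an unchanged function. It only sees the call
// boundary, so it can record a single span per call.
func blackBox(r *Recorder, name string, fn func(int) string) func(int) string {
	return func(id int) string {
		r.record(name)
		return fn(id)
	}
}

// findCustomerInstrumented is the white-box variant: the code itself
// is changed to record its internal steps, giving finer visibility.
func findCustomerInstrumented(r *Recorder, id int) string {
	r.record("findCustomer")
	r.record("findCustomer/validate")
	r.record("findCustomer/lookup")
	return fmt.Sprintf("customer-%d", id)
}

func main() {
	bb := &Recorder{}
	blackBox(bb, "findCustomer", findCustomer)(7)
	fmt.Println("black-box spans:", bb.spans)

	wb := &Recorder{}
	findCustomerInstrumented(wb, 7)
	fmt.Println("white-box spans:", wb.spans)
}
```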

Client-side tracing is in its early days

Sampling everything comes at a price
[Diagram: question marks on either side of the Tracing Pipeline]
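One common way to lower that price is to keep only a fraction of traces. The sketch below shows a deterministic, trace-ID-based probabilistic sampler; the `shouldSample` and `countSampled` functions, the FNV hash, and the 10,000-bucket resolution are assumptions chosen for illustration, not the behavior of any specific tracing product.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample decides once per trace whether to keep it. Hashing the
// trace ID makes the decision deterministic, so every service that
// sees the same trace ID reaches the same answer.
// rate is the fraction of traces to keep, from 0.0 to 1.0.
func shouldSample(traceID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	bucket := h.Sum32() % 10000 // 0..9999
	return float64(bucket) < rate*10000
}

// countSampled reports how many of n synthetic trace IDs are kept.
func countSampled(n int, rate float64) int {
	kept := 0
	for i := 0; i < n; i++ {
		if shouldSample(fmt.Sprintf("trace-%d", i), rate) {
			kept++
		}
	}
	return kept
}

func main() {
	fmt.Println("kept at 100%:", countSampled(1000, 1.0))
	fmt.Println("kept at 10%: ", countSampled(1000, 0.1))
}
```

The trade-off is that a lower rate cuts pipeline and storage load but also increases the chance that the one trace needed for an investigation was never kept.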

Wrapping Up

Useful Resources
✓ Join #openTelemetry on https://slack.cncf.io
✓ https://github.com/riferrei/otel-with-java
✓ Book ➡ Distributed Tracing in Practice ✅

Additional Resources
• Discuss Forum
• Elastic Community
• Ricardo’s Channel

Thank You