What is Distributed Tracing, and Why Should You Care About It?

A presentation at All Things Open in October 2021 in Raleigh, NC, USA by Ricardo Ferreira

Slide 1

What is Distributed Tracing, and Why Should You Care About It?
Ricardo Ferreira, Principal Developer Advocate, Elastic
@riferrei

Slide 2

Criminal Investigation TV Shows and Movies

Slide 3

Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case

Slide 4

Agenda
• What is distributed tracing?
• How does distributed tracing work?
• Benefits of distributed tracing
• Challenges: it ain’t all flowers

Slide 5

Who am I?
• Principal Developer 🥑 at Elastic
• Community, Developer Relations
• Before Elastic ➡ Confluent, Oracle, and Red Hat (JBoss)
• Distributed systems, databases, observability, streaming systems
• https://riferrei.com

Slide 6

What is Distributed Tracing?

Slide 7

For starters: it is nothing new!
Each log statement has its own "schema"

    package main

    import "fmt"

    // Log is the "schema" of a single log statement.
    type Log struct {
        request_path string
        request_size int64
        status       int32
        latency_ms   float64
    }

    func main() {
        logEntry := Log{"/customers/find", 840, 200, 35}
        fmt.Printf("%+v", logEntry)
    }

Slide 8

Follow the thread troubleshooting
[Diagram: Thread 1 and Thread 2 each pass through API, Customer, and Database, and every hop writes its own log entry]
• API: Log{"/api/find", 840, 200, 35} and Log{"/api/find", 840, 200, 45}
• Customer: Log{"/customer/find", 235, 200, 30} and Log{"/customer/find", 235, 200, 42}
• Database: Log{"/db/find", 450, 200, 5} and Log{"/db/find", 450, 200, 3}

Slide 9

Stitching the threads manually

    type CustomLog struct {
        request_path string
        request_size int64
        status       int32
        latency_ms   float64
        customer_id  int64
        database_id  int64
    }

    logEntry := CustomLog{
        requestPath(), requestSize(),
        requestStatus(), latencyMS(),
        cust.customerID(),
        db.databaseTenantID(),
    }
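Once every entry carries a correlation field such as customer_id, the "stitching" is a group-by done when reading the logs back. The following is a minimal sketch of that step, not code from the talk: the Go-style field names, the groupByCustomer helper, and the sample entries are all assumptions made for illustration.

```go
package main

import "fmt"

// CustomLog mirrors the struct on the slide, using Go-style field names.
type CustomLog struct {
	RequestPath string
	RequestSize int64
	Status      int32
	LatencyMS   float64
	CustomerID  int64
	DatabaseID  int64
}

// groupByCustomer is a hypothetical helper that stitches together
// all entries sharing the same customer ID.
func groupByCustomer(entries []CustomLog) map[int64][]CustomLog {
	grouped := make(map[int64][]CustomLog)
	for _, e := range entries {
		grouped[e.CustomerID] = append(grouped[e.CustomerID], e)
	}
	return grouped
}

func main() {
	// Made-up sample data, for illustration only.
	entries := []CustomLog{
		{"/api/find", 840, 200, 35, 1, 10},
		{"/api/find", 840, 200, 45, 2, 10},
		{"/customer/find", 235, 200, 30, 1, 10},
		{"/customer/find", 235, 200, 42, 2, 10},
	}
	for id, logs := range groupByCustomer(entries) {
		fmt.Printf("customer %d: %d entries\n", id, len(logs))
	}
}
```

The weakness of this approach is that every team has to agree on the correlation fields and remember to log them at every hop.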

Slide 10

The beginning of chaos: distributed computing
[Diagram: Host 1 and Host 2, each running Thread 1 and Thread 2]

Slide 11

Using virtualization
[Diagram: Host 1 runs VM 1 and VM 2; each VM runs Thread 1 and Thread 2, and each thread passes through API, Customer, and Database]

Slide 12

Using containerization
[Diagram: Host 1 with VM 1, VM 2, Container 1, and Container 2; Thread 1 and Thread 2 each pass through API, Customer, and Database]

Slide 13

“Let’s now break down the services into functions”
Ops: Is this a joke to you?

Slide 14

Tracing automates system-wide stitching
Transaction data is collected and becomes ready to search
[Diagram: a Transaction spanning Service A, Service B, Service C, and Service D]

Slide 15

Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case

Slide 16

How Does Distributed Tracing Work?

Slide 17

“Watcha talkin’ about, Willis?”
It is all about setting the context

Slide 18

Tracing, spans, and context propagation
Trace ID: 12345 ⬅ This is the context!
• Transaction (Root Span), Trace ID: 12345, Time: 55ms
• Service A (Child Span), Trace ID: 12345, Time: 30ms
• Service B (Child Span), Trace ID: 12345, Time: 15ms
• Service C (Child Span), Trace ID: 12345, Time: 5ms
• Service D (Child Span), Trace ID: 12345, Time: 5ms

Slide 19

Capture, process, store, repeat
[Diagram: the child spans of Service A, Service B, Service C, and Service D (30ms, 15ms, 5ms, 5ms) flow into the Tracing Pipeline, which writes to the Data Store]
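The capture, process, store loop can be sketched with a channel and an in-memory map. This is an illustrative toy under stated assumptions, not how any particular tracing backend works: Span, Store, NewStore, and runPipeline are hypothetical names, and the only "processing" shown is dropping spans that arrive without a trace ID.

```go
package main

import "fmt"

// Span is a simplified, hypothetical span record.
type Span struct {
	TraceID string
	Name    string
	Millis  int
}

// Store is an in-memory stand-in for the data store.
type Store struct {
	traces map[string][]Span
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{traces: make(map[string][]Span)}
}

// Save appends a span to the trace it belongs to.
func (s *Store) Save(sp Span) {
	s.traces[sp.TraceID] = append(s.traces[sp.TraceID], sp)
}

// Trace returns the spans stored for one trace ID.
func (s *Store) Trace(id string) []Span {
	return s.traces[id]
}

// runPipeline captures spans from in, processes them, and stores the result.
// It returns once in is closed and fully drained.
func runPipeline(in <-chan Span, store *Store) {
	for sp := range in {
		if sp.TraceID == "" {
			continue // process step: drop spans that lost their context
		}
		store.Save(sp)
	}
}

func main() {
	in := make(chan Span)
	store := NewStore()
	done := make(chan struct{})

	go func() {
		runPipeline(in, store)
		close(done)
	}()

	// Services emit spans as they finish their work.
	in <- Span{"12345", "Service A", 30}
	in <- Span{"", "orphan", 1}
	in <- Span{"12345", "Service D", 5}
	close(in)
	<-done // wait for the pipeline before reading the store

	fmt.Println(len(store.Trace("12345")), "spans stored for trace 12345")
}
```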

Slide 20

One heck of a data store 💡
[Diagram: Metrics, Traces, and Logs around a single Data Store]

Slide 21

Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case

Slide 22

Benefits of Distributed Tracing

Slide 23

Reduced mean time to resolution (MTTR)
Suspect (10%)
• Watching metric values
• Caught up with alerts
• Bringing people onboard
Drill Down (60%)
• Understand topologies
• Isolate the anomalies
• Collect contextual data
Solve (30%)
• Read logs and events
• Create code patches
• Create a new release

Slide 24

Finding bugs between releases
[Diagram: V1.12 and V1.13]

Slide 25

Improved Team Collaboration

Slide 26

Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case

Slide 27

Challenges: because it ain’t all flowers

Slide 28

Picking open-source is not always easy
• Agent for my programming language?
• Frameworks versus libraries?
• Data store scalability?
• Does my architecture fit?

Slide 29

Black-box versus white-box tracing
Black-box
• Code is not changed
• Handled by the runtime
• Minimal execution visibility
White-box
• Requires code changes
• Handled by the application
• Full execution visibility

Slide 30

Client-side tracing is in its early days

Slide 31

Sampling everything comes at a price
[Diagram: the Tracing Pipeline surrounded by question marks]
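One common way to keep that price down is to record only a fraction of traces. The sketch below shows one possible approach, deterministic sampling on a hash of the trace ID, so that every service reaches the same keep-or-drop decision for the same trace. It is an assumption for illustration rather than the method discussed in the talk; shouldSample, the FNV hash, and the 10% rate are all illustrative choices.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample keeps roughly the given fraction (0.0 to 1.0) of traces.
// Hashing the trace ID makes the decision repeatable across services.
func shouldSample(traceID string, rate float64) bool {
	if rate <= 0 {
		return false
	}
	if rate >= 1 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	// Map the 32-bit hash onto [0, 1) and compare it with the rate.
	return float64(h.Sum32())/float64(1<<32) < rate
}

func main() {
	kept := 0
	for i := 0; i < 10000; i++ {
		if shouldSample(fmt.Sprintf("trace-%d", i), 0.10) {
			kept++
		}
	}
	fmt.Printf("kept %d of 10000 traces\n", kept)
}
```

The trade-off is visibility: whatever is not sampled cannot be investigated later, so the rate is a balance between storage cost and the chance of missing the one trace that mattered.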

Slide 32

Wrapping Up

Slide 33

Useful Resources
✓ Join #openTelemetry on https://slack.cncf.io
✓ https://github.com/riferrei/otel-with-java
✓ Book ➡ Distributed Tracing in Practice ✅

Slide 34

Additional Resources
• Discuss Forum
• Elastic Community
• Ricardo’s Channel

Slide 35

Thank You