What Is Distributed Tracing, and Why Should You Care About It?
Ricardo Ferreira, Principal Developer Advocate, Elastic
@riferrei
Slide 2
Criminal Investigation TV Shows and Movies
Slide 3
Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case
Slide 4
Agenda
• What is distributed tracing?
• How does distributed tracing work?
• Benefits of distributed tracing
• Challenges: it ain’t all flowers
Slide 5
Who am I?
• Principal Developer 🥑 at Elastic
• Community, Developer Relations
• Before Elastic ➡ Confluent, Oracle, and Red Hat (JBoss)
• Distributed systems, databases, observability, streaming systems
• https://riferrei.com
Slide 6
What is Distributed Tracing?
Slide 7
For starters: it is nothing new!

type Log struct {
    request_path string
    request_size int64
    status       int32
    latency_ms   float64
}

Each log statement has its own "schema"

logEntry := Log{"/customers/find", 840, 200, 35}
fmt.Printf("%+v", logEntry)
Begin of chaos: distributed computing
[Diagram: Host 1 and Host 2, each running Thread 1 and Thread 2]
Slide 11
Using virtualization
[Diagram: Host 1 runs VM 1 and VM 2; each VM runs Thread 1 and Thread 2, serving the Customer API and the Customer Database]
Slide 12
Using containerization
[Diagram: Host 1 runs VM 1 and VM 2; the VMs run Container 1 and Container 2, each with Thread 1 and Thread 2, serving the Customer API and the Customer Database]
Slide 13
“Let’s now break down the services into functions”
Ops: Is this a joke to you?
Slide 14
Tracing automates system-wide stitching
Transaction data is collected and becomes ready to search
[Diagram: one transaction spanning Service A, Service B, Service C, and Service D]
Slide 15
Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case
Slide 16
How Does Distributed Tracing Work?
Slide 17
“Watcha talkin’ about, Willis?”
It is all about setting the context
Slide 18
Tracing, spans, and context propagation
Trace ID: 12345 ⬅ This is the context!
Transaction (Root Span), Trace ID: 12345, Time: 55ms
• Service A (Child Span), Trace ID: 12345, Time: 30ms
• Service B (Child Span), Trace ID: 12345, Time: 15ms
• Service C (Child Span), Trace ID: 12345, Time: 5ms
• Service D (Child Span), Trace ID: 12345, Time: 5ms
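The idea on this slide can be sketched in a few lines of Go. This is only a minimal, in-process illustration: the Span type, the traceIDKey context key, and the helper functions are hypothetical names made up for this sketch, not the API of any tracing library. It shows the one thing that matters here: every child span inherits the same trace ID from the context it receives.

```go
package main

import (
	"context"
	"fmt"
)

// traceIDKey is a hypothetical context key used only in this sketch.
type traceIDKey struct{}

// Span is a simplified record of one unit of work inside a trace.
type Span struct {
	TraceID string // shared by every span of the same transaction
	Name    string
	Parent  string // empty for the root span
}

// withTrace returns a context that carries the trace ID.
func withTrace(ctx context.Context, traceID string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, traceID)
}

// startSpan creates a span that inherits the trace ID found in ctx.
func startSpan(ctx context.Context, name, parent string) Span {
	id, _ := ctx.Value(traceIDKey{}).(string)
	return Span{TraceID: id, Name: name, Parent: parent}
}

// handleTransaction creates the root span and one child span per service,
// passing the same context (and so the same trace ID) to each of them.
func handleTransaction(ctx context.Context) []Span {
	spans := []Span{startSpan(ctx, "Transaction", "")}
	for _, svc := range []string{"Service A", "Service B", "Service C", "Service D"} {
		spans = append(spans, startSpan(ctx, svc, "Transaction"))
	}
	return spans
}

// runExample builds the trace from the slide using trace ID 12345.
func runExample() []Span {
	return handleTransaction(withTrace(context.Background(), "12345"))
}

func main() {
	for _, s := range runExample() {
		fmt.Printf("%+v\n", s)
	}
}
```

Across real services the context cannot travel in memory, so the trace ID is typically carried in request metadata such as an HTTP header; that part is left out of this sketch.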
Slide 19
Capture, process, store, repeat
[Diagram: the child spans for Service A, Service B, Service C, and Service D, each with its duration, flow into the Tracing Pipeline, which writes to the Data Store]
Slide 20
One heck of a data store 💡
[Diagram: Metrics, Traces, and Logs around a single Data Store]
Slide 21
Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case
Slide 22
Benefits of Distributed Tracing
Slide 23
Reduced mean time to resolution (MTTR)

Suspect (10%)
• Watching metric values
• Caught up with alerts
• Bringing people onboard

Drill Down (60%)
• Understand topologies
• Isolate the anomalies
• Collect contextual data

Solve (30%)
• Read logs and events
• Create code patches
• Create a new release
Slide 24
Finding Bugs between releases
V1.12
V1.13
Slide 25
Improved Team Collaboration
Slide 26
Detective TV shows and movies
• Collect evidence
• Searching data
• Interrogation
• Creating timelines
• Forensic science
• Motive probing
• Human psychology
• Build strong case
Slide 27
Challenges: because it ain’t all flowers
Slide 28
Picking open source is not always easy
• Agent for my programming language?
• Frameworks versus libraries?
• Data store scalability?
• Does my architecture fit?
Slide 29
Black-box versus white-box tracing

Black-box
• Code is not changed
• Handled by the runtime
• Minimal execution visibility

White-box
• Requires code changes
• Handled by the application
• Full execution visibility
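To make the white-box side concrete, here is a minimal sketch of application code that was changed by hand to record a span. The Tracer and Span types and the findCustomer function are hypothetical and exist only for this illustration; a real tracing library would export spans instead of keeping them in a slice, and this Tracer is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical, simplified span record for this sketch.
type Span struct {
	Name     string
	Start    time.Time
	Duration time.Duration
}

// Tracer keeps finished spans in memory. A real tracer would send them
// to a tracing pipeline instead.
type Tracer struct {
	Finished []Span
}

// Start begins a span and returns a function that ends it and records it.
func (t *Tracer) Start(name string) (end func()) {
	start := time.Now()
	return func() {
		t.Finished = append(t.Finished, Span{
			Name:     name,
			Start:    start,
			Duration: time.Since(start),
		})
	}
}

// findCustomer is application code that was edited to emit a span.
// These two extra lines are the "code changes" that white-box tracing needs.
func findCustomer(t *Tracer, id int) string {
	end := t.Start("findCustomer")
	defer end()
	return fmt.Sprintf("customer-%d", id)
}

func main() {
	tracer := &Tracer{}
	fmt.Println(findCustomer(tracer, 42))
	for _, s := range tracer.Finished {
		fmt.Printf("%s took %s\n", s.Name, s.Duration)
	}
}
```

With black-box tracing the function body would stay untouched and the runtime or an agent would observe the call from the outside, which is why it tends to see less of what happens inside the application.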
Slide 30
Client-side tracing is in its early days
Slide 31
Sampling everything comes at a price
[Diagram: the Tracing Pipeline, with question marks on what goes in and what comes out]
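One common way to lower that price is to keep only a fraction of the traces. The sketch below assumes a simple ratio-based decision derived from a hash of the trace ID; the function names and the "trace-N" IDs are hypothetical, and this is not the algorithm of any particular tracing product. Because the decision depends only on the trace ID, every service that sees the same trace reaches the same keep-or-drop answer.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic keep/drop decision from the trace ID.
// ratio is the fraction of traces to keep, from 0.0 (none) to 1.0 (all).
func shouldSample(traceID string, ratio float64) bool {
	if ratio <= 0 {
		return false
	}
	if ratio >= 1 {
		return true
	}
	h := fnv.New64a()
	h.Write([]byte(traceID))
	// Map the top 53 bits of the hash onto [0, 1) and compare with the ratio.
	position := float64(h.Sum64()>>11) / float64(1<<53)
	return position < ratio
}

// countSampled reports how many of n made-up trace IDs would be kept.
func countSampled(n int, ratio float64) int {
	kept := 0
	for i := 0; i < n; i++ {
		if shouldSample(fmt.Sprintf("trace-%d", i), ratio) {
			kept++
		}
	}
	return kept
}

func main() {
	fmt.Printf("kept %d of 10000 traces at ratio 0.10\n", countSampled(10000, 0.10))
}
```

The trade-off is that a dropped trace is gone for good, so a low ratio saves storage and network cost but can miss the rare request that would have explained an incident.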
Slide 32
Wrapping Up
Slide 33
Useful Resources
✓ Join #openTelemetry on https://slack.cncf.io
✓ https://github.com/riferrei/otel-with-java
✓ Book ➡ Distributed Tracing in Practice ✅
Slide 34
Additional Resources
• Discuss Forum
• Elastic Community
• Ricardo’s Channel