What is it?
Is it the same as monitoring?
Is it a kind of tool?
Definitely something to do with DevOps
Can you see what it's doing?
If it starts squeaking, can you find out why?
dashboard shows me what the car is doing
while I'm driving
learn what “normal” looks like
tells me when I need to look deeper
helps me understand the car
beyond mere function
so I can drive it better
systems that show us what they are doing
while they are running
learn what “normal” looks like
show when we need to look deeper
observable systems help us understand them
so they are more operable
and more maintainable!
What does “normal” look like?
How does it hum?
What does it do when it fails?
What does it do when part of it fails?
Observability lets us... observe
Graph all the things!
Graph just these 3 things:
“Golden Signals”
What about the work your system does?
What about the users of your system?
You need to be better friends with your ops team :)
But why wouldn't you want to understand your system better?
You don't understand it until it's in production
curl https://host0.myapp.internal/x/status
{"status":"ok"}
Even a socket timeout or 502 error is helpful
Can extend this - e.g. report a "starting" status during a lengthy startup procedure
curl https://host1.myapp.internal/x/version
{"build": 1234, "git_sha": "321face"}
“Did my fix roll out yet?”
“Are all nodes on the same version?”
curl https://host2.myapp.internal/x/info
{
"requests": 4074,
"errors": 5,
"cache_hits": 15923,
"cache_misses": 24,
}
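Roughly what these endpoints might look like if the service can host a small HTTP handler, sketched here with Flask (the build number and counter values are placeholders):
from flask import Flask, jsonify

app = Flask(__name__)
counters = {"requests": 0, "errors": 0, "cache_hits": 0, "cache_misses": 0}

@app.route("/x/status")
def status():
    return jsonify(status="ok")                    # could return "starting" during a long startup

@app.route("/x/version")
def version():
    return jsonify(build=1234, git_sha="321face")  # stamped in at build time

@app.route("/x/info")
def info():
    return jsonify(counters)                       # incremented by the real request path

app.run(port=8080)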
A great way of making a system more observable...
... is to have it write down what it's doing.
11:32:33 [INFO ] Processing request for user 42
11:32:33 [WARN ] sprocket cache timed out!
11:32:33 [ERROR] NullPointerException in adjustSprocket()
11:32:33 [INFO ] Processing request for user 42
11:32:33 [WARN ] sprocket cache timed out!
11:32:33 [INFO ] Request processed successfully
11:32:33 [INFO ] Processing request for user 43
11:32:33 [WARN ] sprocket cache timed out!
11:32:33 [INFO ] Processing request for user 44
11:32:33 [INFO ] Request processed successfully
11:32:34 [ERROR] NullPointerException in adjustSprocket()
11:32:34 [INFO ] Processing request for user 43
11:32:34 [INFO ] Request processed successfully
11:32:34 [INFO ] Processing request for user 44
11:32:33 [INFO ] Processing request for user 42
11:32:33 [INFO ] Initializing sprockets
11:32:33 [WARN ] SprocketImpl.init() is deprecated
11:32:33 [WARN ] sprocket cache timed out!
11:32:33 [INFO ] Request processed successfully
11:32:33 [INFO ] Processing request for user 43
11:32:33 [INFO ] Initializing sprockets
11:32:33 [WARN ] SprocketImpl.init() is deprecated
11:32:33 [WARN ] sprocket cache timed out!
11:32:33 [INFO ] Processing request for user 44
11:32:33 [INFO ] Initializing sprockets
11:32:33 [WARN ] SprocketImpl.init() is deprecated
11:32:33 [INFO ] Request processed successfully
11:32:34 [ERROR] NullPointerException in adjustSprocket()
11:32:34 [ERROR] THIS SHOULD NEVER HAPPEN!
11:32:34 [INFO ] Processing request for user 43
11:32:34 [INFO ] Initializing sprockets
11:32:34 [WARN ] SprocketImpl.init() is deprecated
11:32:34 [INFO ] Request processed successfully
11:32:34 [INFO ] Processing request for user 44
09:00 [INFO ] Starting service on port 8080
15:34 [ERROR] Exception: the widget exploded!
15:34 [INFO ] Shutting down service
“Log every time the system does work”
“Don't log as ERROR unless it's actionable”
“Only output WARNING and above in production”
“I can't find the error because this log file is full of identical warnings”
“Man, I wish we'd enabled DEBUG logging before it crashed!”
“Let's spend a sprint cleaning up our logs”
“The node was down because logs filled up the disk”
“All nodes were down because logs filled up the disks”
You're reading your logs
A machine is generating your logs...
Why is a human reading them?
There are too many of them and you are too slow
grep -F '[ERROR]'
This is about the simplest machine we could imagine
We can imagine more ☺
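Even a short script is a smarter machine than grep. A sketch, assuming plain-text logs on stdin whose lines start with "HH:MM":
import sys
from collections import Counter

errors_per_minute = Counter()
for line in sys.stdin:
    if "[ERROR]" in line:
        errors_per_minute[line[:5]] += 1   # bucket by the "HH:MM" prefix

for minute, count in sorted(errors_per_minute.items()):
    print(minute, count)
Or let someone else run the machine for you: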
e.g. Bugsnag, Sentry, etc
Send it your errors
It notifies you when it sees a new one
“Inbox” workflow for tracking errors
Understands stack traces, JavaScript source maps, etc
Use these!
Assumes most errors are bugs
What about errors you can't fix?
What about errors you won't fix?
Often really useful information
Really hard to manage
You'll always have more exceptions than you have time to fix
You at least want to know frequency/impact
Recording that an exception happened looks pretty similar to logging:
[ERROR] NullPointerException in adjustSprocket()
Exception notification services are machines reading your logs
But we're structuring the information for machine consumption
{
"hostname": "host1.myapp",
"app": "myapp",
"error": "NullPointerException",
"stack_trace": [["adjustSprocket()", 318], ...],
}
If you're logging for machines to read...
Write logs machines are good at reading
Parsing strings is slow and imprecise
Anything that can write JSON will do
{
"hostname": "host1.myapp",
"app": "myapp",
"error": "NullPointerException",
"stack_trace": [["adjustSprocket()", 318], ...],
}
But don't stop there!
No longer constrained by an 80-character terminal
e.g. user_id, mobile_platform, items_in_shopping_cart
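A sketch of what emitting such events might look like, here in Python with the standard json module (the field values are invented):
import json
import sys
from datetime import datetime, timezone

def log_event(**fields):
    event = {"timestamp": datetime.now(timezone.utc).isoformat(), "app": "myapp"}
    event.update(fields)
    sys.stdout.write(json.dumps(event) + "\n")   # one JSON object per line

log_event(level="WARN", message="sprocket cache timed out",
          user_id=42, mobile_platform="android", items_in_shopping_cart=7)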
Now our logs are structured...
We can do analytics on them
“Which item has the worst cache hit rate?”
“Show me the error rate for large shopping carts”
Look for trends and correlations:
“How often does that "sprocket cache" warning occur?”
“How many users does it affect?”
“Does it correlate with high traffic?”
“Does it only happen to certain hosts?”
“Alert me if it starts happening more than once a minute.”
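As a toy version of one of those questions ("show me the error rate for large shopping carts"), assuming one JSON event per line and an arbitrary 20-item cutoff:
import json
import sys
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])          # bucket -> [events, errors]
for line in sys.stdin:
    event = json.loads(line)
    bucket = "large" if event.get("items_in_shopping_cart", 0) > 20 else "small"
    totals[bucket][0] += 1
    if event.get("level") == "ERROR":
        totals[bucket][1] += 1

for bucket, (events, errors) in totals.items():
    print(bucket, errors / events)
The point isn't the script; it's that the same question is one query away in any of these: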
Splunk, ELK, Loggly, Honeycomb
Google BigQuery
Kafka Streams
Hadoop ecosystem (e.g. Impala)
We could just add "level": "ERROR"
But now machines are reading our logs, can we do better?
DEBUG, INFO, WARNING, ERROR
Not all events are equally interesting:
A host outage
A customer couldn’t buy your product because your code threw an exception
A call to the payment service timed out, but you retried and succeeded
A customer bought a product successfully
Annotate logs with a measure of how interesting they are
Instead of a crude log level...
Try a number
"sample_rate": 100
"sample_rate": 100
Now you can throw away the other 99
Don't pay to store or process them
"Dynamic": choose sample rate based on interestingness
Analysis tools need to understand sample rate
Easy if you're rolling your own
Some tools understand sampling natively
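A sketch of sampled, rate-annotated logging in the same JSON-lines style as before (the rates here are illustrative):
import json
import random
import sys

def log_sampled(sample_rate, **fields):
    # keep 1 in sample_rate events; record the rate so analysis can scale counts back up
    if random.randrange(sample_rate) != 0:
        return
    fields["sample_rate"] = sample_rate
    sys.stdout.write(json.dumps(fields) + "\n")

log_sampled(100, level="INFO", message="request processed successfully")           # boring: keep 1%
log_sampled(1, level="ERROR", message="NullPointerException in adjustSprocket()")  # interesting: keep all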
Controllability is the dual of observability
Goes together nicely!
curl -XPUT -dDEBUG https://host2.myapp.internal/x/log_level
Set log level dynamically
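A minimal sketch of the receiving end, again assuming a Flask-style handler and Python's standard logging:
import logging
from flask import Flask, request

app = Flask(__name__)

@app.route("/x/log_level", methods=["PUT"])
def set_log_level():
    level = request.get_data(as_text=True).strip()   # e.g. "DEBUG" from the curl above
    logging.getLogger().setLevel(level)              # setLevel accepts level names
    return level + "\n"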
Turn parts of your system on and off at runtime
... or for certain users
... or for 10% of traffic
Deploy both code paths
Check at runtime
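One common shape for that runtime check, sketched here (the flag name and percentage are made up; real systems usually read them from config or a flag service):
import zlib

ROLLOUT_PERCENT = {"new_sprocket_algorithm": 10}   # hypothetical flag, 10% of traffic

def feature_enabled(flag, user_id):
    # hash flag+user so each user gets a stable answer as the percentage ramps up
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)

if feature_enabled("new_sprocket_algorithm", user_id=42):
    ...   # new code path
else:
    ...   # old code path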
How does being able to control the system make it more observable?
Poke it and see how it responds!
You don't understand it until it's in production
Enable features just for your team
Turn off that new feature, just for one user
Ramp the new change to 10% and compare
See also: