timeout engineering
src: Lies, damned lies & timeout engineering - 2017
Abstract
The classic engineering approach to coping with failures, especially timeouts, is to retry.
Some definitions:
- A timeout is an approximation of failure; false positives are possible after a timeout.
- A retry is an effort to reduce failures, but it:
  - has a cost
  - has an effect on the system
- Failure prevention is an escalation that avoids a certain part of the system (e.g. blackholing an unhealthy host).
A timeout is not necessarily a network issue, and it can be misleading: a timeout only tells us that communication with a remote service took too long. The root cause can come from various locations:
- service: resource retention, head-of-line blocking, locking…
- library+VM: GC, function calls with nondeterministic timing…
- kernel: system calls with nondeterministic timing, background tasks, unfair scheduling…
- hardware: blocking IO, congestion, cycle stealing…
- network: packet drop, queue backup…
Timeout cascades can happen, especially with cascading requests (A→B→C): an upstream caller may give up while the downstream call is still in flight.
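A minimal Go sketch (not from the talk) of one way to limit the cascade: if A propagates its remaining deadline to B and C via context, a downstream call cannot outlive the caller that is waiting for it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callC simulates the deepest dependency; it respects the caller's deadline.
func callC(ctx context.Context) error {
	select {
	case <-time.After(300 * time.Millisecond): // pretend C is slow
		return nil
	case <-ctx.Done():
		return ctx.Err() // gives up as soon as the inherited deadline expires
	}
}

// callB forwards the same context, so B's work shares A's remaining budget.
func callB(ctx context.Context) error {
	return callC(ctx)
}

func main() {
	// A allows 200ms overall; B and C inherit whatever is left of that budget.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	if err := callB(ctx); err != nil {
		fmt.Println("A observed:", err) // context deadline exceeded
	}
}
```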
Inconvenient truths about timeouts:
- timeouts often do not indicate remote service health
- the optimal timeout is a moving target
- often less predictable in shared environments
- timeouts leave gaps in the overall timeline (not every stage of a request is covered)
Retry != Replay
- potential behavioral change
- state change
- hidden retries in lower stack (e.g. TCP)
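A hedged Go sketch of the "retry != replay" point; the payment endpoint and the Idempotency-Key mechanism are hypothetical illustrations, not something the talk prescribes. The server, not the client, ultimately decides whether a retried request is a replay of the first attempt or a second state change.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// chargeOnce posts a payment. The endpoint and the Idempotency-Key header are
// hypothetical; without some deduplication on the server, a retry after a
// timeout may apply the state change twice.
func chargeOnce(client *http.Client, key string) error {
	req, err := http.NewRequest(http.MethodPost, "https://example.com/charge",
		bytes.NewBufferString(`{"amount": 42}`))
	if err != nil {
		return err
	}
	req.Header.Set("Idempotency-Key", key)
	resp, err := client.Do(req)
	if err != nil {
		// A timeout here does NOT mean the charge did not happen on the server.
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	client := &http.Client{Timeout: 150 * time.Millisecond}
	key := "order-1234" // reuse the same key on every attempt
	for attempt := 1; attempt <= 3; attempt++ {
		if err := chargeOnce(client, key); err == nil {
			break
		}
		fmt.Println("attempt", attempt, "failed; retrying with the same key")
	}
}
```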
Retry at scale is a serious subject that needs to be handled carefully, since retries can DDoS the system itself:
- large number of clients
- stateful backend, fixed route
- full connectivity mesh
- variable connections per route
- container-based deploy
So a service mesh can make retries significantly worse:
- top-level retries affect the whole system
- retry multipliers get bigger toward the bottom of the stack (see the worked example below)
- bottlenecks become hard to predict
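A back-of-the-envelope illustration of those multipliers (the per-hop numbers are assumptions, not from the talk): if each of three layers of calls independently makes up to 3 attempts, the bottom of the stack can see 3 × 3 × 3 = 27 attempts for a single original request.

```go
package main

import "fmt"

func main() {
	// Worst-case attempts when every hop in a call chain retries independently.
	attemptsPerHop := []int{3, 3, 3} // 1 try + 2 retries at each layer, for example
	load := 1
	for depth, attempts := range attemptsPerHop {
		load *= attempts
		fmt.Printf("hop %d: up to %d attempts per original request\n", depth+1, load)
	}
}
```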
Failure prevention could amplify a bad decision:
- local failures become global
- transient failures become permanent
- inconsistent state
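As a sketch of why the escalation needs an exit path: a state-based blackhole rule that excludes a host after consecutive failures but re-admits it after a window. The 5-failure / 30s numbers mirror the config quoted at the end; the code structure itself is an assumption. Without the revive step, a transient failure would turn into a permanent exclusion.

```go
package main

import "time"

// blackhole tracks consecutive failures per host and excludes a host after a
// threshold, but only until the revive window has passed.
type blackhole struct {
	failures   map[string]int
	excludedAt map[string]time.Time
	threshold  int
	revive     time.Duration
}

func newBlackhole() *blackhole {
	return &blackhole{
		failures:   map[string]int{},
		excludedAt: map[string]time.Time{},
		threshold:  5,                // 5 consecutive failures
		revive:     30 * time.Second, // revive after 30s
	}
}

func (b *blackhole) report(host string, ok bool) {
	if ok {
		b.failures[host] = 0
		return
	}
	b.failures[host]++
	if b.failures[host] >= b.threshold {
		b.excludedAt[host] = time.Now()
	}
}

func (b *blackhole) usable(host string) bool {
	t, excluded := b.excludedAt[host]
	if !excluded {
		return true
	}
	if time.Since(t) < b.revive {
		return false // still blackholed
	}
	// Revive: give the host another chance instead of excluding it forever.
	delete(b.excludedAt, host)
	b.failures[host] = 0
	return true
}

func main() {
	b := newBlackhole()
	for i := 0; i < 5; i++ {
		b.report("10.0.0.7:8080", false)
	}
	_ = b.usable("10.0.0.7:8080") // false for the next 30s, then true again
}
```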
Some advice for reducing timeout issues:
- good timeouts: minimize false positives
  - characterize latencies by stack layer
  - reason about root causes
  - whenever possible, trace
- good retries: judicious, with brakes
  - customize; do not copy the configuration from others
  - err on the conservative side
  - create state-based rules
  - treat failure as both global and local
  - enforce a top-down retry budget (see the sketch after this list)
  - apply and act on back pressure
- test!
  - test common failure scenarios
  - test failures at different parts of the architecture
  - test partial failures
  - test common failure scenarios with more forgiving timeout/retry/prevention logic
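The "top-down budget" item maps onto the retry budget in the configuration below (20% of requests, min 10 retries/s). A minimal sketch of such a budget; the token accounting details are assumed rather than taken from the talk.

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget lets retries consume at most a fraction of recent request volume.
// (The talk's config also guarantees a minimum of 10 retries/s; that refinement
// is omitted here for brevity.)
type retryBudget struct {
	mu       sync.Mutex
	tokens   float64
	ratio    float64 // 0.2 => retries may be at most ~20% of requests
	maxBurst float64 // cap on accumulated credit, e.g. ~10s worth
}

func newRetryBudget() *retryBudget {
	return &retryBudget{ratio: 0.2, maxBurst: 100, tokens: 10}
}

// onRequest earns retry credit for every original (non-retry) request.
func (b *retryBudget) onRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += b.ratio
	if b.tokens > b.maxBurst {
		b.tokens = b.maxBurst
	}
}

// allowRetry spends one token; when the budget is exhausted the retry is dropped.
func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	budget := newRetryBudget()
	for i := 0; i < 50; i++ {
		budget.onRequest()
	}
	fmt.Println("retry allowed?", budget.allowRetry())
}
```

Original requests earn fractional credit and each retry spends a whole token, so even when the backend is failing hard, retries stay bounded at roughly the configured fraction of traffic.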
The speaker uses the following configuration:
- timeout
  - overall: 500ms
  - request: 150ms
  - connect: 200ms
  - pipelining: a request timeout does not reset the connection
- retry
  - read: 2 tries, no write-back, no backoff
  - write: 3 tries, random backoff (5-200ms)
  - overall retry budget: 20% of requests, min 10 retries per second (10s credit window)
- blackhole: 5 consecutive failures, revive after 30s
- centralized topology manager, with changes dampened over > 1 min
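For reference, a hedged sketch of the same configuration as Go structs; the field names and types are illustrative, only the values come from the talk.

```go
package main

import "time"

// Values below come from the speaker's config; struct shapes and names are my own.
type timeoutConfig struct {
	Overall time.Duration // 500ms end-to-end
	Request time.Duration // 150ms per request
	Connect time.Duration // 200ms to establish a connection
	// Pipelining: a request timeout does not reset the connection.
}

type retryConfig struct {
	ReadTries    int              // 2 tries, no backoff
	WriteTries   int              // 3 tries
	WriteBackoff [2]time.Duration // random backoff between 5ms and 200ms
	BudgetRatio  float64          // retries capped at 20% of requests
	BudgetMinRPS int              // at least 10 retries/s (10s credit window)
}

type preventionConfig struct {
	BlackholeAfter int           // 5 consecutive failures
	ReviveAfter    time.Duration // 30s
	DampenChanges  time.Duration // topology changes dampened over > 1 min
}

func main() {
	_ = timeoutConfig{Overall: 500 * time.Millisecond, Request: 150 * time.Millisecond, Connect: 200 * time.Millisecond}
	_ = retryConfig{ReadTries: 2, WriteTries: 3, WriteBackoff: [2]time.Duration{5 * time.Millisecond, 200 * time.Millisecond}, BudgetRatio: 0.2, BudgetMinRPS: 10}
	_ = preventionConfig{BlackholeAfter: 5, ReviveAfter: 30 * time.Second, DampenChanges: time.Minute}
}
```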