Timeouts engineering

Abstract

The classic engineering approach to coping with failures, especially timeouts, is to retry.

Some definitions:

  • A timeout is an approximation of failure; false positives are possible after a timeout, since the work may still have succeeded (see the sketch after this list).
  • A retry is an effort to reduce failure, but it:
    • has a cost
    • has an effect on the system
  • Failure prevention is an escalation: a certain part of the system is avoided entirely.
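
To make the first point concrete, here is a minimal Go sketch (all names and timings are invented) in which the caller declares failure after a deadline while the work may still complete remotely:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowRemoteCall stands in for a request to a remote service.
// It keeps running even after the caller gives up, which is how
// a timeout can be a false positive: the work may still succeed.
func slowRemoteCall(done chan<- string) {
	time.Sleep(300 * time.Millisecond) // pretend the service is slow today
	done <- "ok"
}

func callWithTimeout(timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan string, 1)
	go slowRemoteCall(done)

	select {
	case res := <-done:
		return res, nil
	case <-ctx.Done():
		// We report failure, but the goroutine above may still finish:
		// the remote side may have committed the work anyway.
		return "", errors.New("timeout: failure is only assumed, not observed")
	}
}

func main() {
	if _, err := callWithTimeout(150 * time.Millisecond); err != nil {
		fmt.Println(err)
	}
	time.Sleep(200 * time.Millisecond) // give the "remote" time to finish anyway
}
```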

A timeout is not necessarily a network issue, and it can be misleading: a timeout only tells us that communication with a remote service did not complete in time. The root cause can come from various locations:

  • service: resource retention, head-of-line blocking, locking…
  • library+VM: GC, function calls with nondeterministic timing…
  • kernel: system calls with nondeterministic timing, background tasks, unfair scheduling…
  • hardware: blocking IO, congestion, cycle stealing…
  • network: packet drop, queue backup…

Timeout cascades can happen, especially with cascading requests (A→B→C): each hop consumes part of the caller's budget, so an upstream timeout can fire while downstream work is still in flight.
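
A hedged Go sketch of such a cascade using context deadline propagation (hop names and durations are made up): A starts with a 500ms budget, B burns most of it, and C inherits too little to finish.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Each hop consumes part of a shared deadline. A starts with 500ms, spends
// 100ms, and calls B; B spends 300ms and calls C; C inherits only ~100ms
// and times out. In a real system that failure then propagates back up.
func hop(ctx context.Context, name string, work time.Duration, next func(context.Context)) {
	if d, ok := ctx.Deadline(); ok {
		fmt.Printf("%s: %v left in the budget\n", name, time.Until(d).Round(time.Millisecond))
	}
	select {
	case <-time.After(work):
		if next != nil {
			next(ctx)
		}
	case <-ctx.Done():
		fmt.Printf("%s: gave up: %v\n", name, ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	hop(ctx, "A", 100*time.Millisecond, func(ctx context.Context) {
		hop(ctx, "B", 300*time.Millisecond, func(ctx context.Context) {
			hop(ctx, "C", 200*time.Millisecond, nil) // only ~100ms left: C times out
		})
	})
}
```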

Inconvenient truths about timeouts:

  • timeouts often do not indicate remote service health
  • the optimal timeout is a moving target (see the sketch after this list)
  • timeouts are often less predictable in shared environments
  • timeouts leave gaps in the overall timeline
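
One common way to chase that moving target (not necessarily the speaker's) is to derive the timeout from a recent latency percentile plus headroom. A rough Go sketch, with an arbitrary window, percentile, and headroom:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentileTimeout derives a timeout from recent observed latencies,
// e.g. p99 plus some headroom, so the value tracks current conditions
// instead of being frozen at deploy time.
func percentileTimeout(samples []time.Duration, pct float64, headroom float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(pct * float64(len(sorted)-1))
	return time.Duration(float64(sorted[idx]) * (1 + headroom))
}

func main() {
	// A sliding window of recent request latencies (made-up numbers).
	window := []time.Duration{
		20 * time.Millisecond, 22 * time.Millisecond, 25 * time.Millisecond,
		30 * time.Millisecond, 35 * time.Millisecond, 40 * time.Millisecond,
		45 * time.Millisecond, 60 * time.Millisecond, 80 * time.Millisecond,
		120 * time.Millisecond,
	}
	fmt.Println("suggested timeout:", percentileTimeout(window, 0.99, 0.5))
}
```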

Retry != Replay: a retry is not a faithful replay of the original request (see the sketch after this list).

  • potential behavioral change
  • state change
  • hidden retries lower in the stack (e.g. TCP retransmissions)
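
Because a retry is not a replay, retried writes are only safe when the server can deduplicate them. One common technique for this (not from the talk) is an idempotency key; a minimal Go sketch with hypothetical names:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// A server-side dedup table: the same idempotency key applied twice
// performs the state change only once, making retry behave like replay.
type Server struct {
	mu      sync.Mutex
	applied map[string]string // key -> result of the first attempt
}

func (s *Server) Write(key, payload string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.applied[key]; ok {
		return res // duplicate attempt: return the original result, no new state change
	}
	res := "wrote:" + payload
	s.applied[key] = res
	return res
}

func newKey() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

func main() {
	s := &Server{applied: map[string]string{}}
	key := newKey() // one key per logical operation, reused across retries

	fmt.Println(s.Write(key, "debit $10")) // first attempt
	fmt.Println(s.Write(key, "debit $10")) // retry after a timeout: deduplicated
}
```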

Retry at scale is a serious subject that needs to be handled carefully, as retries can DDoS the system itself:

  • large number of clients
  • stateful backend, fixed route
  • full connectivity mesh
  • variable connections per route
  • container-based deployments

A service mesh can therefore make retries significantly worse:

  • top-level retries affect the whole system
  • multipliers grow toward the bottom of the stack (see the sketch below)
  • bottlenecks become hard to predict
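
A back-of-the-envelope Go sketch of the multiplier effect, under the assumption that every layer retries independently of the others:

```go
package main

import "fmt"

// Worst-case attempt amplification when every layer retries independently:
// with `tries` attempts per hop and `depth` retrying hops stacked on top of
// a failing service, the bottom can see tries^depth attempts per user request.
func worstCaseAttempts(tries, depth int) int {
	attempts := 1
	for i := 0; i < depth; i++ {
		attempts *= tries
	}
	return attempts
}

func main() {
	// 3 tries per hop, 3 layers of retrying clients (app, mesh sidecar, library):
	fmt.Println(worstCaseAttempts(3, 3), "attempts hit the bottom service") // 27
}
```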

Failure prevention can amplify a bad decision:

  • local failures become global
  • transient failures become permanent
  • inconsistent state

Some advice for reducing timeout issues:

  • good timeouts - minimize false positives
    • characterize latencies by stack
    • root cause reasoning
    • whenever possible, trace
  • good retries - judicious, with brakes
    • customize; do not copy configurations from others
    • err on the conservative side
    • create state-based rules
  • treat failure as both global and local
  • enforce a top-down budget (see the retry-budget sketch after this list)
  • apply and act on back pressure
  • test!
    • test common failure scenarios
    • test failures at different parts of the architecture
    • test partial failures
    • test common failure scenarios with more forgiving timeout/retry/prevention logic
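
As an illustration of a top-down budget with brakes, here is a rough Go sketch of a retry budget in the spirit of the speaker's "20% of requests, with a minimum allowance" rule. The counters here never decay; a real implementation would use a sliding credit window.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryBudget is a brake on retries: retries are allowed only while they
// stay under a ratio of recent requests, plus a small minimum so that
// low-traffic clients can still retry.
type RetryBudget struct {
	mu       sync.Mutex
	requests int
	retries  int
	ratio    float64 // e.g. 0.20 = retries may be 20% of requests
	minRetry int     // always allow at least this many retries
}

func (b *RetryBudget) RecordRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

// TryRetry returns true and consumes budget if a retry is allowed now.
func (b *RetryBudget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	allowed := float64(b.requests)*b.ratio + float64(b.minRetry)
	if float64(b.retries) >= allowed {
		return false // budget exhausted: fail fast instead of piling on
	}
	b.retries++
	return true
}

func main() {
	budget := &RetryBudget{ratio: 0.20, minRetry: 10}
	for i := 0; i < 100; i++ {
		budget.RecordRequest()
	}
	granted := 0
	for i := 0; i < 50; i++ { // 50 failures all want a retry
		if budget.TryRetry() {
			granted++
		}
	}
	fmt.Println("retries granted:", granted) // 30 = 20% of 100 + 10 minimum
}
```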

The speaker uses the following configuration:

  • timeout
    • overall: 500ms
    • request: 150ms
    • connect: 200ms
    • pipelining: a request timeout does not reset the connection
  • retry
    • read: 2 tries, no write-back, no backoff
    • write: 3 tries, random backoff (5-200ms)
    • overall retry budget: 20% of requests, min 10 retries per second (10s credit window)
  • blackhole: 5 consecutive failures, revive after 30s (see the sketch below)
  • centralized topology manager; changes dampened for > 1 min
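
The blackhole rule above could look roughly like the following Go sketch (type and field names are invented; real systems track this state per endpoint):

```go
package main

import (
	"fmt"
	"time"
)

// Blackhole implements the stated rule: after N consecutive failures a host
// is avoided entirely, then revived after a cool-down period.
type Blackhole struct {
	consecutive int
	threshold   int           // e.g. 5 consecutive failures
	reviveAfter time.Duration // e.g. 30s
	blackedAt   time.Time
	blacked     bool
}

// Allow reports whether the host may receive traffic right now.
func (b *Blackhole) Allow(now time.Time) bool {
	if b.blacked && now.Sub(b.blackedAt) >= b.reviveAfter {
		b.blacked, b.consecutive = false, 0 // revive and start over
	}
	return !b.blacked
}

// Record feeds the outcome of each attempt into the failure streak.
func (b *Blackhole) Record(ok bool, now time.Time) {
	if ok {
		b.consecutive = 0 // any success resets the streak
		return
	}
	b.consecutive++
	if b.consecutive >= b.threshold {
		b.blacked, b.blackedAt = true, now
	}
}

func main() {
	bh := &Blackhole{threshold: 5, reviveAfter: 30 * time.Second}
	now := time.Now()
	for i := 0; i < 5; i++ {
		bh.Record(false, now) // five consecutive failures
	}
	fmt.Println("allowed right after:", bh.Allow(now))                     // false
	fmt.Println("allowed 30s later:", bh.Allow(now.Add(30*time.Second))) // true
}
```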