timeout engineering
src: Lies, damned lies & timeout engineering - 2017
Abstract
The classic engineering approach to coping with failures, especially timeouts, is to retry.
Some definitions:
- A timeout is an approximation of failure; false positives are possible after a timeout.
- A retry is an effort to reduce failures, but it:
  - has a cost
  - has an effect on the system
- Failure prevention is an escalation that avoids a certain part of the system (e.g. blackholing an unhealthy host).
A timeout is not necessarily a network issue, and it can be misleading: a timeout only tells us that communication with a remote service took too long. The root cause can come from various locations:
- service: resource retention, head-of-line blocking, locking…
- library+VM: GC, function calls with nondeterministic timing…
- kernel: system calls with nondeterministic timing, background tasks, unfair scheduling…
- hardware: blocking IO, congestion, cycle stealing…
- network: packet drop, queue backup…
Timeout cascades can happen, especially with cascading requests (A→B→C): an upstream caller may give up while the downstream call is still in flight.
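A minimal Go sketch (not from the talk) of one way to limit the cascade: if A propagates its remaining deadline to B and C via context, a downstream call cannot outlive the caller that is waiting for it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callC simulates the deepest dependency; it respects the caller's deadline.
func callC(ctx context.Context) error {
	select {
	case <-time.After(300 * time.Millisecond): // pretend C is slow
		return nil
	case <-ctx.Done():
		return ctx.Err() // gives up as soon as the inherited deadline expires
	}
}

// callB forwards the same context, so B's work shares A's remaining budget.
func callB(ctx context.Context) error {
	return callC(ctx)
}

func main() {
	// A allows 200ms overall; B and C inherit whatever is left of that budget.
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	if err := callB(ctx); err != nil {
		fmt.Println("A observed:", err) // context deadline exceeded
	}
}
```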
Inconvenient truths about timeouts:
- timeouts often do not indicate remote service health
- the optimal timeout is a moving target
- often less predictable in shared environments
- timeouts leave gaps in the overall timeline (not every stage of a request is covered)
Retry != Replay
- potential behavioral change
- state change
- hidden retries in lower stack (e.g. TCP)
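A hedged Go sketch of the "retry != replay" point; the payment endpoint and the Idempotency-Key mechanism are hypothetical illustrations, not something the talk prescribes. The server, not the client, ultimately decides whether a retried request is a replay of the first attempt or a second state change.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// chargeOnce posts a payment. The endpoint and the Idempotency-Key header are
// hypothetical; without some deduplication on the server, a retry after a
// timeout may apply the state change twice.
func chargeOnce(client *http.Client, key string) error {
	req, err := http.NewRequest(http.MethodPost, "https://example.com/charge",
		bytes.NewBufferString(`{"amount": 42}`))
	if err != nil {
		return err
	}
	req.Header.Set("Idempotency-Key", key)
	resp, err := client.Do(req)
	if err != nil {
		// A timeout here does NOT mean the charge did not happen on the server.
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	client := &http.Client{Timeout: 150 * time.Millisecond}
	key := "order-1234" // reuse the same key on every attempt
	for attempt := 1; attempt <= 3; attempt++ {
		if err := chargeOnce(client, key); err == nil {
			break
		}
		fmt.Println("attempt", attempt, "failed; retrying with the same key")
	}
}
```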
Retry at scale is a serious subject that needs to be handled carefully, since retries can DDoS the system itself:
- large number of clients
- stateful backend, fixed route
- full connectivity mesh
- variable connections per route
- container-based deploy
So a service mesh can make retries significantly worse:
- top-level retries affect the whole system
- retry multipliers get bigger toward the bottom of the stack (see the worked example below)
- bottlenecks become hard to predict
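A back-of-the-envelope illustration of those multipliers (the per-hop numbers are assumptions, not from the talk): if each of three layers of calls independently makes up to 3 attempts, the bottom of the stack can see 3 × 3 × 3 = 27 attempts for a single original request.

```go
package main

import "fmt"

func main() {
	// Worst-case attempts when every hop in a call chain retries independently.
	attemptsPerHop := []int{3, 3, 3} // 1 try + 2 retries at each layer, for example
	load := 1
	for depth, attempts := range attemptsPerHop {
		load *= attempts
		fmt.Printf("hop %d: up to %d attempts per original request\n", depth+1, load)
	}
}
```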
Failure prevention could amplify a bad decision:
- local failures become global
- transient failures become permanent
- inconsistent state
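As a sketch of why the escalation needs an exit path: a state-based blackhole rule that excludes a host after consecutive failures but re-admits it after a window. The 5-failure / 30s numbers mirror the config quoted at the end; the code structure itself is an assumption. Without the revive step, a transient failure would turn into a permanent exclusion.

```go
package main

import "time"

// blackhole tracks consecutive failures per host and excludes a host after a
// threshold, but only until the revive window has passed.
type blackhole struct {
	failures   map[string]int
	excludedAt map[string]time.Time
	threshold  int
	revive     time.Duration
}

func newBlackhole() *blackhole {
	return &blackhole{
		failures:   map[string]int{},
		excludedAt: map[string]time.Time{},
		threshold:  5,                // 5 consecutive failures
		revive:     30 * time.Second, // revive after 30s
	}
}

func (b *blackhole) report(host string, ok bool) {
	if ok {
		b.failures[host] = 0
		return
	}
	b.failures[host]++
	if b.failures[host] >= b.threshold {
		b.excludedAt[host] = time.Now()
	}
}

func (b *blackhole) usable(host string) bool {
	t, excluded := b.excludedAt[host]
	if !excluded {
		return true
	}
	if time.Since(t) < b.revive {
		return false // still blackholed
	}
	// Revive: give the host another chance instead of excluding it forever.
	delete(b.excludedAt, host)
	b.failures[host] = 0
	return true
}

func main() {
	b := newBlackhole()
	for i := 0; i < 5; i++ {
		b.report("10.0.0.7:8080", false)
	}
	_ = b.usable("10.0.0.7:8080") // false for the next 30s, then true again
}
```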
Some advice for reducing timeout issues:
- good timeouts: minimize false positives
  - characterize latencies by stack layer
  - reason about root causes
  - whenever possible, trace
- good retries: judicious, with brakes
  - customize; do not copy the configuration from others
  - err on the conservative side
  - create state-based rules
  - treat failure as both global and local
  - enforce a top-down retry budget (see the sketch after this list)
  - apply and act on back pressure
- test!
  - test common failure scenarios
  - test failures at different parts of the architecture
  - test partial failures
  - test common failure scenarios with more forgiving timeout/retry/prevention logic
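The "top-down budget" item maps onto the retry budget in the configuration below (20% of requests, min 10 retries/s). A minimal sketch of such a budget; the token accounting details are assumed rather than taken from the talk.

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget lets retries consume at most a fraction of recent request volume.
// (The talk's config also guarantees a minimum of 10 retries/s; that refinement
// is omitted here for brevity.)
type retryBudget struct {
	mu       sync.Mutex
	tokens   float64
	ratio    float64 // 0.2 => retries may be at most ~20% of requests
	maxBurst float64 // cap on accumulated credit, e.g. ~10s worth
}

func newRetryBudget() *retryBudget {
	return &retryBudget{ratio: 0.2, maxBurst: 100, tokens: 10}
}

// onRequest earns retry credit for every original (non-retry) request.
func (b *retryBudget) onRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += b.ratio
	if b.tokens > b.maxBurst {
		b.tokens = b.maxBurst
	}
}

// allowRetry spends one token; when the budget is exhausted the retry is dropped.
func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	budget := newRetryBudget()
	for i := 0; i < 50; i++ {
		budget.onRequest()
	}
	fmt.Println("retry allowed?", budget.allowRetry())
}
```

Original requests earn fractional credit and each retry spends a whole token, so even when the backend is failing hard, retries stay bounded at roughly the configured fraction of traffic.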
The speaker uses the following configuration:
- timeout
  - overall: 500ms
  - request: 150ms
  - connect: 200ms
  - pipelining: a request timeout does not reset the connection
- retry
  - read: 2 tries, no write-back, no backoff
  - write: 3 tries, random backoff (5-200ms)
  - overall retry budget: 20% of requests, min 10 retries per second (10s credit window)
- blackhole: 5 consecutive failures, revive after 30s
- centralized topology manager, with changes dampened over > 1 min
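For reference, a hedged sketch of the same configuration as Go structs; the field names and types are illustrative, only the values come from the talk.

```go
package main

import "time"

// Values below come from the speaker's config; struct shapes and names are my own.
type timeoutConfig struct {
	Overall time.Duration // 500ms end-to-end
	Request time.Duration // 150ms per request
	Connect time.Duration // 200ms to establish a connection
	// Pipelining: a request timeout does not reset the connection.
}

type retryConfig struct {
	ReadTries    int              // 2 tries, no backoff
	WriteTries   int              // 3 tries
	WriteBackoff [2]time.Duration // random backoff between 5ms and 200ms
	BudgetRatio  float64          // retries capped at 20% of requests
	BudgetMinRPS int              // at least 10 retries/s (10s credit window)
}

type preventionConfig struct {
	BlackholeAfter int           // 5 consecutive failures
	ReviveAfter    time.Duration // 30s
	DampenChanges  time.Duration // topology changes dampened over > 1 min
}

func main() {
	_ = timeoutConfig{Overall: 500 * time.Millisecond, Request: 150 * time.Millisecond, Connect: 200 * time.Millisecond}
	_ = retryConfig{ReadTries: 2, WriteTries: 3, WriteBackoff: [2]time.Duration{5 * time.Millisecond, 200 * time.Millisecond}, BudgetRatio: 0.2, BudgetMinRPS: 10}
	_ = preventionConfig{BlackholeAfter: 5, ReviveAfter: 30 * time.Second, DampenChanges: time.Minute}
}
```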