🗒️ l-lin

✏️ 2024-07-12 🗓️️ Log

article/video
devops

Resilient application in a degraded world

src: ⚡Une application résiliente, dans un monde partiellement dégradé ⚡ - 2024-05-03

Abstract

the question is not “what if it fails”, but “what should we do when it fails”

counting nines of availability is not very useful; user experience is a better measure

monitor response time instead

if all your microservices communicate with each other, you’ve just created a distributed monolith

microservices without asynchronous communication are no better than writing a monolithic app

blast radius

a small blast radius is better: a failure in one component does not impact the others

Observability

logs

expensive

metrics

cardinality

distributed trace

sampling

expensive

SLI - SLO - SLA

SLI: Service Level Indicator

rate

mean

percentile

e.g.

latency: Xms to respond to this type of request

error rate: X% of requests fail

throughput: X requests per second

availability: service available for X% of the time

durability: X% chance that a file still exists after 1 year

SLO: Service Level Objective

measured by SLI

difficult to choose an SLO

e.g.

latency: < 25ms

error rate: < 0.01% of requests fail

throughput: >= 250 requests per second

availability: >= 99% of the time

durability: 14 nines chance that a file still exists

SLA: Service Level Agreement

contract with the user

SLI + SLO + SLA

Our API must respond successfully to 99% of the requests received during a calendar month.

SLI: “respond successfully to … of the requests received during a calendar month”

SLO: 99%

If this commitment is not met, we will offer you a 25% discount on your next month's invoice.

SLA: “If this commitment is not met … offer you a 25% discount”
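
A minimal sketch of how the success-rate SLI in the example above could be compared with the 99% SLO. It assumes the request outcomes for the month are available as a plain list of booleans; the names `success_rate` and `meets_slo` are hypothetical and do not come from the talk.

```python
def success_rate(outcomes: list[bool]) -> float:
    """SLI: fraction of received requests that were answered successfully."""
    if not outcomes:
        # Assumption: a period without traffic counts as fully successful.
        return 1.0
    return sum(outcomes) / len(outcomes)


def meets_slo(outcomes: list[bool], objective: float = 0.99) -> bool:
    """SLO: the measured SLI must be at or above the objective."""
    return success_rate(outcomes) >= objective


# 99 successes out of 100 requests just meets a 99% objective.
month = [True] * 99 + [False]
print(success_rate(month), meets_slo(month))
```

The SLA is the part that cannot be computed here: it is the contractual consequence (the 25% discount) applied when `meets_slo` is false for the billing period.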

Alerts

only alert when we need to ACT

Timeouts

short

should depend on user experience, e.g. a user will wait no longer than ~2s

cancel request if user leaves

feature flipping

deactivate expensive functionalities, automatically

blue / green deployments - canary

load tests

auto-scaling

auto-scale on user experience (e.g. response time), not on CPU
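
The timeout notes in this section (keep it short, tie it to what the user tolerates, cancel abandoned work) could be sketched as follows. This is only an illustration using `asyncio.wait_for`; the helper name `call_with_timeout`, the constant `USER_PATIENCE_SECONDS`, and the choice to return `None` on timeout are assumptions, not something prescribed by the talk.

```python
import asyncio

# From the notes: a user waits no longer than about 2 seconds.
USER_PATIENCE_SECONDS = 2.0


async def call_with_timeout(make_call, timeout: float = USER_PATIENCE_SECONDS):
    """Await make_call(), giving up once the timeout has elapsed."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the in-flight call at this point.
        # Assumption: None signals "no answer in time" to the caller.
        return None


async def slow_backend():
    """Hypothetical backend call that is far too slow."""
    await asyncio.sleep(10)
    return "too late"


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(slow_backend, timeout=0.1)))
```

The same cancellation path covers the "user leaves" case: cancelling the task that awaits `call_with_timeout` propagates to the call in flight.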

Random jitter

random cache duration

random cron periodicity

session duration

token validity
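
One way to apply random jitter to the durations listed above, so that caches, cron jobs, sessions, and tokens do not all expire at the same instant. The helper `jittered` and the default spread of 10% are illustrative assumptions.

```python
import random


def jittered(base_seconds: float, spread: float = 0.1) -> float:
    """Return base_seconds shifted randomly by up to +/- spread (a fraction)."""
    return base_seconds * (1 + random.uniform(-spread, spread))


# A nominal 5-minute cache TTL becomes a value between roughly 270s and 330s,
# so entries written together do not all expire together.
cache_ttl = jittered(300)
print(round(cache_ttl, 1))
```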

Retry

request fails? retry

request takes too much time? cancel and retry

add max number of retries

add exponential backoff
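
The retry rules above could be combined roughly like this. It is a sketch under assumptions: the function name `retry`, its default values, and the use of random jitter on each delay are illustrative choices. A call that exceeds its timeout is assumed to raise an exception, which makes it eligible for a retry like any other failure.

```python
import random
import time


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter so that many clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.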

Degraded mode

if a component fails, do not fail the whole system

⇒ better user experience

load shedding

reject the share of requests (e.g. 10%) that would otherwise cause the whole system to fail

circuit breaker
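
A minimal circuit breaker sketch showing how a failing component can be bypassed with a fallback instead of failing the whole system. The class, its thresholds, and the `fallback` callable are hypothetical; production libraries offer more states and options than this.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: do not touch the failing component
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()  # degraded answer instead of an error
        self.failures = 0  # success closes the circuit again
        return result
```

For example, `breaker.call(fetch_recommendations, lambda: [])` would return an empty list while the hypothetical `fetch_recommendations` dependency is down, so the rest of the page can still render.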

Chaos engineering

chaos monkey

test what happens when a component fails
