Resilient application in a degraded world
src: ⚡Une application résiliente, dans un monde partiellement dégradé ⚡ - 2024-05-03
Abstract
- the question is not “what if it fails?” but “what should we do when it fails?”
- “X nines” of availability is not a very useful target; user experience is a better one
- measure response time instead
- if all your microservices communicate with each other, you’ve just created a distributed monolith
- microservices without asynchronous communication are no better than a monolithic app
- blast radius
  - a small blast radius is better: a failing component does not impact the others
Observability
- logs
  - expensive
- metrics
  - cardinality
- distributed trace
  - sampling
  - expensive
SLI - SLO - SLA
SLI: Service Level Indicator
- rate
- mean
- percentile
- e.g.
  - latency: X ms to respond to this type of request
  - error rate: X% of requests fail
  - throughput: X requests per second
  - availability: service available X% of the time
  - durability: X% chance that a file still exists after 1 year
SLO: Service Level Objective
- measured by SLI
- difficult to choose a good SLO
- e.g.
  - latency: < 25 ms
  - error rate: < 0.01% of requests fail
  - throughput: >= 250 requests per second
  - availability: >= 99% of the time
  - durability: 14-nines chance that the file still exists
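A minimal sketch of how SLIs could be computed from raw request records and compared against SLOs. The `Request` record, the nearest-rank percentile, and the `SLOS` thresholds (taken from the examples above) are assumptions for illustration, not something prescribed by the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, p):
    """Nearest-rank percentile, with p an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(requests):
    """Turn raw request records into indicators (the SLIs)."""
    latencies = [r.latency_ms for r in requests]
    failures = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": percentile(latencies, 99),
        "error_rate_pct": 100 * failures / len(requests),
    }


# Hypothetical objectives: each SLI must stay strictly below its limit.
SLOS = {"latency_p99_ms": 25.0, "error_rate_pct": 0.01}


def slo_report(slis, slos):
    """Map each objective to True (met) or False (missed)."""
    return {name: slis[name] < limit for name, limit in slos.items()}
```

The percentile is used rather than the mean because a handful of very slow requests can hide behind a good average.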
SLA: Service Level Agreement
- contract with the user
SLI + SLO + SLA
- “Our API must respond successfully to 99% of the requests received during a calendar month.”
  - SLI: “respond successfully to … of the requests received during a calendar month”
  - SLO: 99%
- “If this commitment is not met, we will give you a 25% discount on your invoice for the following month.”
  - SLA: “If this commitment is not met … we will give you a 25% discount”
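The arithmetic behind this example can be sketched as follows. The function names and the sample figures in the usage lines are hypothetical. Only the 99% objective and the 25% discount come from the notes.

```python
def error_budget(total_requests, slo_pct):
    """Number of requests that may fail before the SLO is missed."""
    return int(total_requests * (100 - slo_pct) / 100)


def next_invoice(total, failed, slo_pct, invoice, credit_pct):
    """Apply the SLA discount to next month's invoice if the SLO was missed."""
    success_pct = 100 * (total - failed) / total
    if success_pct < slo_pct:
        return invoice * (100 - credit_pct) / 100
    return invoice


# Hypothetical month: 1,000,000 requests and a 200.0 invoice.
budget = error_budget(1_000_000, 99)
missed = next_invoice(1_000_000, 20_000, 99, 200.0, 25)
met = next_invoice(1_000_000, 5_000, 99, 200.0, 25)
```

Here `budget` is the number of failures the month can absorb, `missed` is the discounted invoice after a month below 99%, and `met` is the unchanged invoice.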
Alerts
- only alert when we need to ACT
Timeouts
- short
- should depend on user experience, i.e. user can wait no longer than ~2s
- cancel request if user leaves
- feature flipping
- deactivate expensive functionalities, automatically
- blue / green deployments - canary
- load tests
- auto-scaling
- do not auto-scale based on CPU but on the user experience
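The timeout bullets above (short, driven by what the user will tolerate, cancel abandoned work) could look roughly like this with `asyncio`. `call_with_timeout`, `slow_dependency`, and the fallback value are hypothetical names. The 2 s budget is the figure from the notes.

```python
import asyncio

USER_PATIENCE_S = 2.0  # a user waits roughly this long at most


async def call_with_timeout(coro_factory, timeout_s=USER_PATIENCE_S, fallback=None):
    """Run a call under a time budget; on timeout, cancel it and return the fallback."""
    try:
        # wait_for cancels the inner coroutine when the timeout expires.
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def slow_dependency(delay_s, log):
    """Stand-in for a downstream call; records when it gets cancelled."""
    try:
        await asyncio.sleep(delay_s)
        return "fresh result"
    except asyncio.CancelledError:
        log.append("cancelled")
        raise


log = []
degraded = asyncio.run(
    call_with_timeout(lambda: slow_dependency(0.5, log), timeout_s=0.05, fallback="cached result")
)
```

The same cancellation path can serve the "user leaves" case: cancelling the task that handles the request propagates to the calls it is awaiting.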
Random jitter
- random cache duration
- random cron periodicity
- session duration
- token validity
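A small sketch of the idea: add a random offset to any fixed duration so that caches, cron jobs, sessions, and tokens do not all expire at the same instant. The helper name `jittered` and the 10% spread are assumptions.

```python
import random


def jittered(base_s, spread=0.1, rng=random):
    """Return base_s shifted randomly by up to +/- spread (a fraction of base_s)."""
    return base_s * (1 + rng.uniform(-spread, spread))


# Hypothetical usage: a 5-minute cache TTL that lands between about 270 s and 330 s.
cache_ttl_s = jittered(300)
```

Without jitter, entries created together expire together, and the backend receives all the refresh requests in a single burst.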
Retry
- request fails? retry
- request takes too much time? cancel and retry
- add max number of retries
- add exponential backoff
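A minimal retry sketch combining the four bullets above, assuming timeouts surface as exceptions and that the operation is safe to repeat (idempotent). The function name, defaults, and the use of random jitter on each delay are assumptions.

```python
import random
import time


def retry(operation, max_attempts=4, base_delay_s=0.1, max_delay_s=2.0, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            # Exponential backoff: 0.1 s, 0.2 s, 0.4 s, ... capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Sleep a random fraction of the delay so clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

The cap on attempts matters as much as the backoff: unbounded retries against a struggling service add load exactly when it can least handle it.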
Degraded mode
- if a component fails, do not fail the whole system
  - ⇒ better user experience
- load shedding
  - reject the excess traffic (e.g. 10% of requests) that would otherwise cause the whole system to fail
- circuit breaker
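One possible shape for a circuit breaker, as a sketch under assumptions: the class name, threshold, and reset delay are hypothetical, and a production version would also need thread safety. After a run of consecutive failures the breaker "opens" and rejects calls immediately, so the caller can switch to its degraded mode instead of waiting on a broken dependency.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency disabled, use the fallback")
            # Reset delay elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and serve a cached or simplified response, which keeps the failure inside a small blast radius.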
Chaos engineering
- chaos monkey
  - test what happens when a component fails