Resilient application in a degraded world

Abstract

  • the question is not “what if it fails?” but “what should we do when it fails?”
  • “X nines” of availability is not a very useful measure; user experience is a better one
    • measure response time instead
  • if all your microservices communicate with each other, you have just created a distributed monolith
  • microservices without asynchronous communication are no better than a monolithic app
  • blast radius
    • a small blast radius is better: a failure in one component does not impact the others

Observability

  • logs
    • expensive
  • metrics
    • cardinality
  • distributed trace
    • sampling
    • expensive

SLI - SLO - SLA

SLI: Service Level Indicator

  • rate
  • mean
  • percentile
  • e.g.
    • latency: X ms to respond to this type of request
    • error rate: X% of requests fail
    • throughput: X requests per second
    • availability: service available X% of the time
    • durability: X% chance that a stored file still exists after 1 year
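
As a rough illustration of the indicators above, here is a minimal Python sketch that derives an error rate, a mean latency and a percentile latency from a list of request records. The `Request` type, its fields and the nearest-rank percentile method are assumptions made for this example, not something the notes prescribe.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one handled request.
    latency_ms: float
    ok: bool


def error_rate(requests):
    """Rate: percentage of requests that failed (assumes a non-empty list)."""
    failed = sum(1 for r in requests if not r.ok)
    return 100.0 * failed / len(requests)


def mean_latency(requests):
    """Mean: average latency in milliseconds (assumes a non-empty list)."""
    return sum(r.latency_ms for r in requests) / len(requests)


def percentile_latency(requests, pct):
    """Percentile (nearest rank): smallest latency with pct% of requests at or below it."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [Request(10, True), Request(12, True), Request(15, True), Request(300, False)]
```

On this sample the mean is pulled up by the single slow request while the 50th percentile stays at 12 ms, which is one reason percentiles are often easier to read than means.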

SLO: Service Level Objective

  • measured by SLI
  • choosing an SLO is difficult
  • e.g.
    • latency: < 25 ms
    • error rate: < 0.01% of requests fail
    • throughput: >= 250 requests per second
    • availability: >= 99% of the time
    • durability: 14 nines (99.999999999999%) chance

SLA: Service Level Agreement

  • contract with the user

SLI + SLO + SLA

  • Our API must respond successfully to 99% of the requests received during a calendar month.
    • SLI: “respond successfully to … of the requests received during a calendar month”
    • SLO: 99%
  • If this commitment is not met, we will offer you a 25% discount on your invoice for the following month.
    • SLA: “If this commitment is not met … will offer you a 25% discount”
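
The example above can be sketched as a small check: compute the SLI for the month, compare it with the SLO, and apply the SLA penalty if the objective is missed. The function names and the request counts used to try it out are hypothetical; only the 99% and 25% figures come from the example.

```python
SLO_SUCCESS_PCT = 99.0    # objective from the example
SLA_DISCOUNT_PCT = 25.0   # penalty from the example


def success_rate_pct(successful, received):
    """SLI: percentage of received requests answered successfully.

    Assumes received > 0 (a month with no traffic is not handled here).
    """
    return 100.0 * successful / received


def next_invoice_discount_pct(successful, received):
    """SLA: discount owed on next month's invoice when the SLO is missed."""
    sli = success_rate_pct(successful, received)
    return SLA_DISCOUNT_PCT if sli < SLO_SUCCESS_PCT else 0.0
```

With 995,000 successes out of 1,000,000 requests the SLI is 99.5% and no discount is due; with 980,000 successes it is 98% and the 25% discount applies.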

Alerts

  • only alert when we need to ACT

Timeouts

  • keep them short
  • base them on user experience, i.e. a user will wait no longer than ~2s
  • cancel the request if the user leaves
  • feature flipping
    • deactivate expensive features automatically
    • blue / green deployments - canary
  • load tests
  • auto-scaling
    • auto-scale on user experience (e.g. response time), not on CPU
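
For the timeout and cancellation points in this section, a minimal sketch using Python's `asyncio.wait_for`, which cancels the awaited call once the time budget is spent. The `slow_backend` coroutine and the `fallback` parameter are hypothetical, and the 2-second default simply mirrors the figure quoted above.

```python
import asyncio

USER_PATIENCE_S = 2.0  # assumption: the ~2s a user is willing to wait


async def call_with_timeout(coro, timeout_s=USER_PATIENCE_S, fallback=None):
    """Await coro, but cancel it and return fallback once timeout_s has elapsed."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def slow_backend(delay_s):
    # Hypothetical downstream call, simulated with a sleep.
    await asyncio.sleep(delay_s)
    return "fresh"
```

Returning a fallback value rather than raising is one possible design choice; it lets the caller serve a cached or simplified response instead of an error.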

Random jitter

  • random cache duration
  • random cron periodicity
  • session duration
  • token validity
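
A minimal sketch of the idea, assuming the goal is to stop many caches, cron jobs, sessions or tokens from expiring at the same instant. The `jittered` helper and its `spread` parameter are hypothetical names chosen for this example.

```python
import random


def jittered(base_seconds, spread=0.1, rng=random):
    """Return base_seconds shifted randomly by up to +/- spread (a fraction of the base)."""
    return base_seconds * (1 + rng.uniform(-spread, spread))


# Example: a cache entry meant to live about one hour gets a TTL
# somewhere between 54 and 66 minutes instead of exactly 60.
cache_ttl_s = jittered(3600, spread=0.1)
```

The same helper could be applied to a cron interval, a session duration or a token lifetime.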

Retry

  • request fails? retry
  • request takes too long? cancel and retry
  • set a maximum number of retries
  • use exponential backoff
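
The points above can be sketched as a small helper: retry on failure, stop after a maximum number of retries, and double the wait between attempts. The `retry` function and its parameter names are assumptions for this example; the `call` passed in is expected to enforce its own timeout so that a slow request surfaces as an exception.

```python
import time


def retry(call, max_retries=3, base_delay_s=0.1, max_delay_s=2.0, sleep=time.sleep):
    """Run call(); on an exception, retry up to max_retries times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait 0.1s, 0.2s, 0.4s, ... capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay)
```

Adding a random jitter to each delay would keep many clients from retrying in lockstep; it is left out here to keep the sketch short.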

Degraded mode

  • if a component fails, do not fail the whole system
    • better user experience
  • load shedding
    • reject a share of requests (e.g. 10%) rather than let the load bring down the whole system
  • circuit breaker
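
A minimal sketch of a circuit breaker, assuming a simple policy: open after a number of consecutive failures, serve a fallback while open, and let one trial call through after a cool-down. The class, its thresholds and the `fallback` callback are hypothetical; production libraries usually offer more states and options.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; try the component again after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: degrade without touching the component
            # Cool-down elapsed: fall through and let one trial call go out.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback is where the degraded mode lives, for instance a cached value or an empty recommendation list, so a failing component does not fail the whole response.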

Chaos engineering

  • Chaos Monkey
  • test what happens when a component fails
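
One lightweight way to test what happens when a component fails is to inject faults in a test environment. The sketch below is an in-process illustration only, not how Chaos Monkey itself operates; `with_faults`, `InjectedFault` and `failure_rate` are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real call to simulate a failing component."""


def with_faults(func, failure_rate, rng=random):
    """Wrap func so that a fraction of calls fail on purpose (for tests only)."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0, 1), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a dependency this way in a test makes it possible to check that the caller falls back to its degraded mode instead of failing entirely.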