evolutionary architecture

Abstract

Instead of replacing everything all at once, choose one end-to-end use case and reimplement that. You bound your risk, and are guaranteed to learn a lot you did not expect. Use those learnings to help inform the next step.

At the GOTO 2014 conferences in Copenhagen and Aarhus, I had the opportunity to have an extended set of conversations with Martin Fowler about an idea he had been turning over recently in his head — “Sacrificial Architectures”. Every technology architecture is of necessity temporary, and we should both get comfortable with, and take advantage of, that reality. Coincidentally, I’d also been thinking along similar lines — that it is far more a privilege than a burden to get to rewrite a system you have outgrown. So I was inspired to write down a few thoughts about evolutionary architecture.

Why Are Architectures Temporary?

There is no one perfect architecture for all products and all scales. Any architecture meets a particular set of goals or range of requirements (functionality, scale, etc.), within a particular set of constraints or context.

The functionality of your product or service will almost certainly evolve over time. It should not be surprising that your architecture should as well.

Your scale changes, hopefully up and to the right! What works at scale X rarely works at scale 10X or 100X.

Finally, the longer you keep doing something, the more deeply you learn about it. So even if functionality or scale never changes, your future you knows a lot more about your domain than you do now.

The Small and The Large

The goal of a small startup is to prove a business model, within strict constraints around resources and time. So a startup’s architecture should optimize for cheap, rapid changes to the product. This means technologies that are familiar to the team and easy to use. Typically these days this is a monolithic application in a dynamic language like Ruby or PHP over a single monolithic RDBMS. And while I’m feeling some eBay and Google colleagues shudder, this is absolutely correct.

Even more importantly, being a startup means building for the near term. There is no guarantee of any future beyond 3/6/12 months, and if you are around, your business is likely to be very different than it is now. So there is absolutely no reason to build for that future. Thinking about 2 years ahead is not only not particularly useful, but counterproductive. Any effort spent on that future comes with a serious opportunity cost — that effort could and should be spent improving your product in the now.

I have advised a number of small startups over the last several years, and I often get asked “Could you tell us how eBay and Google do things?” Sure, I can tell you, but you have to promise me up and down that you won’t do it!

By contrast to the startup, the primary goals for a large Internet company are to meet the needs of its (comparatively large) number of current users, make efficient use of its (comparatively large) resources, and maximize the productivity of its (comparatively large) engineering organization. It’s far more about efficient execution in an established direction than about business model exploration. And now we need to focus on various flavors of scaling issues — how to scale the organization and the technology to continue being efficient and productive.

eBay and Google, depending on how you count, are each on their fifth entire rewrite of their architecture from top to bottom. eBay’s architecture, for example, evolved from

  • Perl and files (v1, 1995), to
  • C++ and Oracle (v2, 1997), to
  • XSL and Java (v3, 2002), to
  • Full-stack Java (v4, 2007), to
  • Polyglot micro-services (v5, 2013+)

Looking back with 20–20 hindsight, some technology guesses look prescient and others look short-sighted. But each of those phases used the best (cheapest, fastest, etc.) tool for the job at the time. The related obvious point is that if eBay had implemented the 1995 equivalent of micro-services out of the gate, we would not even be talking about the company. v1 would have collapsed under its own weight, and would have taken the company with it.

Typically at this scale, we have divided a monolithic team into smaller focused groups, componentized our monolithic application into something like micro-services, and sharded our persistence infrastructure. We have designed in resilience to failures in machines, networks, data centers, etc. We have also introduced specialized systems for particular technological niches, like analytics, search, etc. We have probably found that some of our technological needs are not met well by anything preexisting in the commercial or open-source worlds, and so have built custom systems from the ground up.

I like to think of it this way — it’s not that eBay and Google had to evolve their architectures; they got to evolve their architectures! It is very much a first-world problem to be growing so fast that you outstrip your current architecture. It is a rare and wonderful privilege to have to rewrite. I’m not missing that this is almost always “under the gun”, and pretty stressful, but you genuinely should feel happy that it is even worth it. People care about your product, it’s straining under the weight of their enthusiasm, and you have the resources to fix it.

Conclusions and Recommendations

So what can we learn from this? I will suggest a few concrete lessons:

1. Build for the “now”

Build to meet the needs for your near-term time horizon, about which you have reasonable certainty on requirements and priorities. Depending on where you are in the cycle, this may be a few months, 1–2 years, etc.

Beyond that horizon, you will likely need to evolve the architecture (if you are lucky!) — you just don’t know now how or in which direction. Expect it, accept it, welcome it. Getting to evolve your architecture is not an indicator of failure; it is an indicator of success.

2. Prefer evolution

Once you have met your needs in (1), if you have choices among a number of different technological approaches, prefer the one which gives you the maximum ability to modify / replace / evolve the architecture. Finance folks call this “option value”, and just as in the markets, it is often worth it to pay now for flexibility in the future.

In the technology world, maximizing option value is about minimizing the cost of replacing or upgrading parts of the system. There are two related ways to reduce this cost: Simplicity and Isolation. Bounding the complexity of any one component makes that component inexpensive to replace. Bounding the interaction surface area between components makes the replacement of that component inexpensive for the rest of the system. Strict component encapsulation, loose coupling, and event-driven / data-driven programming styles are all examples of this.
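A minimal sketch of the isolation idea, assuming a hypothetical rate-limiting component (none of these names come from the article): callers see a one-method interface, so swapping the implementation costs the rest of the system nothing.

```python
from typing import Protocol


class RateLimiter(Protocol):
    """The entire interaction surface of the component: one method."""

    def allow(self, user_id: str) -> bool: ...


class FixedBudgetLimiter:
    """A simple first implementation: each user gets a fixed number of calls."""

    def __init__(self, budget: int) -> None:
        self._budget = budget
        self._used: dict[str, int] = {}

    def allow(self, user_id: str) -> bool:
        used = self._used.get(user_id, 0)
        if used >= self._budget:
            return False
        self._used[user_id] = used + 1
        return True


class AllowAllLimiter:
    """A replacement implementation with the same one-method surface."""

    def allow(self, user_id: str) -> bool:
        return True


def handle_request(limiter: RateLimiter, user_id: str) -> str:
    # The caller depends only on the interface, never on a concrete class.
    return "ok" if limiter.allow(user_id) else "rejected"
```

Because `handle_request` is written against `RateLimiter` alone, either implementation can be passed in without touching the caller.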

It won’t be a surprise to anyone that modern programming approaches from Agile methodologies to the Reactive Manifesto place a premium on these properties.

3. Evolve Incrementally

When you are so lucky that you have to evolve the architecture, do it iteratively and incrementally. As Martin Fowler likes to say,

The only thing a Big Bang rewrite guarantees is a Big Bang.

Instead of replacing everything all at once, choose one end-to-end use case and reimplement that. You bound your risk, and are guaranteed to learn a lot you did not expect. Use those learnings to help inform the next step.
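One way to picture this, as a sketch under assumptions rather than a prescribed mechanism: a small router sends one reimplemented use case to new code and everything else to the existing system. The names `MigrationRouter`, `legacy_system`, and `new_checkout` are hypothetical.

```python
from typing import Callable

Handler = Callable[[dict], str]


def legacy_system(request: dict) -> str:
    # Stand-in for the existing system, which can still serve every use case.
    return f"legacy:{request['use_case']}"


def new_checkout(request: dict) -> str:
    # Stand-in for the single end-to-end use case that was reimplemented.
    return f"new:{request['use_case']}"


class MigrationRouter:
    """Routes migrated use cases to new code and the rest to the old system."""

    def __init__(self, fallback: Handler) -> None:
        self._fallback = fallback
        self._migrated: dict[str, Handler] = {}

    def migrate(self, use_case: str, handler: Handler) -> None:
        self._migrated[use_case] = handler

    def rollback(self, use_case: str) -> None:
        # Bounded risk: a single use case can be sent back to the old system.
        self._migrated.pop(use_case, None)

    def handle(self, request: dict) -> str:
        handler = self._migrated.get(request["use_case"], self._fallback)
        return handler(request)
```

Each later step is another `migrate` call for the next use case, informed by what the previous one taught you.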


Abstract

By taking an evolutionary approach, we can tackle these aspects incrementally, which lets us learn a lot about the problem to be solved and the possible solutions before deciding. Instead of a big design up front based on many assumptions, we create options among which we can choose the best one as late as possible. 23 architectural aspects to guide design.

Key takeaways:

  • build only what you need now
  • have a concept for how to evolve the architecture
    • for several scenarios: how do we react in case of…
  • keep options for as long as possible

Why?

The team continuously delivers to users; the users' feedback drives the team's further development.

Focus on LEARNING.

Evolutionary

  • works for today’s needs
  • can adapt to future needs in many small steps

Architectural aspects

  • persistence
  • translation (UI and data)
  • Communication between parts
  • scaling
  • security
  • journaling, auditing
  • reporting
  • data migration, import
  • releasability
  • versioning
  • backward compatibility
  • response times + throughput
  • archiving data
  • data validation
  • distribution
  • event sourcing
  • public interfaces
  • time & time zones & calendars
  • history
  • exception handling
  • layering & structure
  • testability

Persisting data

First check if we really need to persist data or not, then introduce persisting, then scaling, …

  • keep data in-memory
    • “can we solve the business problem?”
  • abstract away data persistency from business logic
    • “we’ll really have to persist data”
  • persist data in a database: relational / document-based / key-value / graph-based
    • “now we know what data we have”
  • scale the database: horizontal / vertical / none
    • “if we need scaling, we add it”
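The first three steps above can be sketched as follows, assuming a hypothetical note-taking domain. `NoteStore`, `InMemoryNoteStore`, `SqliteNoteStore`, and `rename_note` are illustrative names; SQLite stands in for whichever database is eventually chosen.

```python
import sqlite3
from typing import Optional, Protocol


class NoteStore(Protocol):
    """Abstraction that hides persistence from the business logic."""

    def save(self, note_id: str, text: str) -> None: ...
    def load(self, note_id: str) -> Optional[str]: ...


class InMemoryNoteStore:
    """Step 1: keep data in memory while proving the business problem."""

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}

    def save(self, note_id: str, text: str) -> None:
        self._notes[note_id] = text

    def load(self, note_id: str) -> Optional[str]:
        return self._notes.get(note_id)


class SqliteNoteStore:
    """Step 3: a real database, added once the shape of the data is known."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS notes (id TEXT PRIMARY KEY, text TEXT NOT NULL)"
        )

    def save(self, note_id: str, text: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO notes (id, text) VALUES (?, ?)",
                (note_id, text),
            )

    def load(self, note_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT text FROM notes WHERE id = ?", (note_id,)
        ).fetchone()
        return row[0] if row else None


def rename_note(store: NoteStore, note_id: str, new_text: str) -> bool:
    # Step 2: business logic depends only on the abstraction.
    if store.load(note_id) is None:
        return False
    store.save(note_id, new_text)
    return True
```

`rename_note` runs unchanged against either store, which is what keeps the database decision open until the data is understood.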

Storing data

  • CRUD
    • CRUD with history (alternative to local in-memory cache with distributed token)
      • same table
      • separate table
  • Event sourcing (possible path to CRUD)
    • Event stream only (always project)
      • cache, project if not found
        • cache invalidation
          • size
          • time
          • on update
        • distributed cache
          • local in-memory cache with distributed token
      • read model update on write
        • consistent
        • eventual-consistent
      • snapshot-events
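As an illustration of the event-sourcing branch, here is a minimal sketch of an event stream whose projection is cached and invalidated on update. The account domain and the names `Event` and `AccountEvents` are assumptions made for the example; the `projections_run` counter exists only to make the caching visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    amount: int  # positive for a deposit, negative for a withdrawal


class AccountEvents:
    """Append-only event stream with a cached balance projection."""

    def __init__(self) -> None:
        self._stream: list[Event] = []
        self._balance_cache: dict[str, int] = {}
        self.projections_run = 0  # instrumentation for the example only

    def append(self, event: Event) -> None:
        self._stream.append(event)
        # Cache invalidation "on update": drop the now-stale projection.
        self._balance_cache.pop(event.account, None)

    def _project_balance(self, account: str) -> int:
        # "Always project": fold over every event for this account.
        self.projections_run += 1
        return sum(e.amount for e in self._stream if e.account == account)

    def balance(self, account: str) -> int:
        # Use the cache; project only if the value is not found.
        if account not in self._balance_cache:
            self._balance_cache[account] = self._project_balance(account)
        return self._balance_cache[account]
```

Starting from "always project" and adding the cache later changes nothing for callers of `balance`, so the caching strategy can be deferred until read load demands it.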

Layering and structure

(Diagram: 2023-03-23-evolutionary-architecture.excalidraw)

Scaling

  • single thread
    • prepare for scaling by using immutability, pure functions
    • vertical scaling
  • horizontal scaling: inter-process communication / fallacies of distributed computing
    • multi-thread
    • multi-process
    • multi-machine
    • auto-scaling
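A small sketch of why pure functions prepare code for scaling, assuming a hypothetical pricing calculation: because `price_with_tax` touches no shared state, the same function can be mapped over a thread pool without locks. In CPython, threads mainly help I/O-bound work; `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface for the multi-process step, though it needs a module-level function instead of a lambda.

```python
from concurrent.futures import ThreadPoolExecutor


def price_with_tax(net_cents: int, tax_percent: int) -> int:
    # Pure: the result depends only on the arguments, no shared state.
    return net_cents + net_cents * tax_percent // 100


def price_all_single_thread(nets: list, tax_percent: int) -> list:
    return [price_with_tax(n, tax_percent) for n in nets]


def price_all_multi_thread(nets: list, tax_percent: int, workers: int = 4) -> list:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(lambda n: price_with_tax(n, tax_percent), nets))
```

Both variants return the same list for the same input, so the move from one thread to many is a deployment decision rather than a rewrite.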

Response times

  • do all side effects on the request synchronously
    • offload side effects that don’t have to happen immediately (message bus)
    • fire and forget request and get notified when done
    • split 1 request into several
      • parallelism
      • cache results for individual request
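The offloading step can be sketched as below, with an in-process queue and worker thread standing in for a message bus. That stand-in is an assumption made for brevity, and `BackgroundWorker` and `place_order` are hypothetical names. The request returns as soon as the side effect is queued.

```python
import queue
import threading
from typing import Callable


class BackgroundWorker:
    """Runs queued side effects on a separate thread (message bus stand-in)."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._tasks: queue.Queue = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, task: str) -> None:
        self._tasks.put(task)

    def drain(self) -> None:
        # Block until every queued task has been processed.
        self._tasks.join()

    def _run(self) -> None:
        while True:
            task = self._tasks.get()
            try:
                # A handler that raises would stop this simple worker;
                # a real bus would retry or dead-letter the message.
                self._handler(task)
            finally:
                self._tasks.task_done()


def place_order(order_id: str, worker: BackgroundWorker) -> str:
    # Only the work that must finish before responding belongs here.
    worker.submit(order_id)  # e.g. a confirmation email, sent later
    return f"accepted:{order_id}"
```

The response time of `place_order` no longer includes the side effect, at the cost of the side effect completing slightly later and needing its own failure handling.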