Software Design is Knowledge Building

Software development can be reduced to a single, iterative action. Almost everything we do in the course of a day — the pull requests, the meetings, the whiteboard diagrams, the hallway conversations — is an explanation. Our job is to explain, over and over, the meaning of our software.

1. The Story

This is the story of system SVC from company ORG. It’s a true story, but I’ve smoothed out the details: by making it generic, I hope that it will be more familiar.

  1. ORG relies on an integration service, SaaS, to decouple its business logic from vendor software dealing with billing, analytics, customer management, etc.
  2. ORG shifts from assume we have infinite budget mode to we need to break even next year or we’ll die.
  3. Some dutiful analyst notices that ORG spends an egregious amount of money on seemingly innocuous middleware SaaS.
  4. Some dutiful executive figures that, seeing as they already spend a lot of money on software engineers, they should be able to replace SaaS with a system built in-house, to be called SVC.
  5. Some dutiful manager tasks X10, one of ORG ’s finest engineers, with the job of building SVC. X10 will be working alone and on a tight schedule, as the executive wants to get rid of SaaS before the next billing cycle kicks in.
  6. X10 gets the job done in time, as everyone in ORG has come to expect from her. With SVC working and SaaS out of the way, X10 moves on to more important—or more pressing—matters and, eventually, leaves ORG.
  7. TEAM takes over ownership of SVC. For all intents and purposes, development is done, they only need to keep the lights on.
  8. Requirements change, assumptions are proven wrong, unknown unknowns are uncovered. A dutiful product owner realizes that the business now demands changes to SVC, so he writes down some tickets for TEAM to work on.
  9. TEAM fails miserably to deliver. They don’t seem to know what they are doing, taking forever to apply the smallest changes, always side-tracked by bugs and production outages.
  10. TEAM is, uh, restructured into TEAM++, of higher average seniority, and continues working on SVC.
  11. GOTO 9.

This entire process takes less than a year.

2. Commentary

What is going on with SVC? Did X10 do a bad job? By all accounts, she did not. The project was finished on time, according to specifications; ORG will save a lot of money next year. X10 followed best practices, too: good test coverage, no over-engineering. Surely she cut some corners; she didn’t think twice about the design of the system. As in any project, there’s a lot of room for improvement, but nothing that a team of competent engineers couldn’t handle. But, if X10 got the hard part of the job done, how come not one but many teams of competent engineers fail to apply a few small changes to the system?

What fascinates me about this scenario is how a seemingly functional 6-month-old project automatically turns into a haunted forest just by changing hands. Regardless of its age, SVC is textbook legacy software because, more often than not, a question posed about the system, to any team member, results in the same answer: I don’t know.

The problem is that TEAM members don’t have enough elements to build a satisfactory mental model of SVC. They need to go by a mix of the client’s interpretation of what the system should be, and what they can tell from the code that the system actually is. These views can be disconnected and contradictory. The code may tell the what and the how, but it doesn’t tell the why. Only X10 could say what was a functional requirement, what a technical necessity, what a whim, what an accident. The team has to resort to reverse engineering, extrapolating, and guessing.

3. The Theory

Underlying the decision to move X10 out of the project once the system was operational, is the common misconception that software development consists of producing code; that, once working code exists, programmers should act as interchangeable operators of varying qualities1.

In his Software Aging paper, David Parnas warns against putting software in the hands of developers who haven’t contributed to (and thus don’t understand) its design:

Although it is essential to upgrade software to prevent aging, changing software can cause a different form of aging. The designer of a piece of software usually had a simple concept in mind when writing the program. If the program is large, understanding that concept allows one to find those sections of the program that must be altered when an update or correction is needed. Understanding that concept also implies understanding the interfaces used within the system and between the system and its environment.

Changes made by people who do not understand the original design concept almost always cause the structure of the program to degrade. Under those circumstances, changes will be inconsistent with the original concept; in fact, they will invalidate the original concept. Sometimes the damage is small, but often it is quite severe. After those changes, one must know both the original design rules, and the newly introduced exceptions to the rules, to understand the product. After many such changes, the original designers no longer understand the product. Those who made the changes, never did. In other words, nobody understands the modified product. Software that has been repeatedly modified (maintained) in this way becomes very expensive to update. Changes take longer and are more likely to introduce new “bugs”. Change induced aging is often exacerbated by the fact that the maintainers feel that they do not have time to update the documentation. The documentation becomes increasingly inaccurate thereby making future changes even more difficult.

TEAM was bound to make what Parnas calls “ignorant surgery”, the system degrading over time. But that doesn’t quite explain why they were immediately helpless as they took over the project. I find a better characterization in Peter Naur’s Programming as Theory Building:

At least with certain kinds of large programs, the continued adaptation, modification, and correction of errors in them, is essentially dependent on a certain kind of knowledge possessed by a group of programmers who are closely and continuously connected with them.

Programming should be regarded as an activity by which the programmers form or achieve a certain kind of insight, a theory, of the matters at hand. This suggestion is in contrast to what appears to be a more common notion, that programming should be regarded as a production of a program and certain other texts.

In this view, the mental model that allows the designer to map a subset of the world (the domain) to and from the system, and not the system itself, is the primary product of the software design activity:

  1. The programmer having the theory of the program can explain how the solution relates to the affairs of the world that it helps to handle. Thus the programmer must be able to explain, for each part of the program text and for each of its overall structural characteristics, what aspect or activity of the world is matched by it. Conversely, for any aspect or activity of the world the programmer is able to state its manner of mapping into the program text.
  2. The programmer having the theory of the program can explain why each part of the program is what it is, in other words is able to support the actual program text with a justification of some sort.
  3. The programmer having the theory of the program is able to respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner. Designing how a modification is best incorporated into an established program depends on the perception of the similarity of the new demand with the operational facilities already built into the program. The kind of similarity that has to be perceived is one between aspects of the world.

Naur defines software design as an intellectual activity, consisting of building and having a theory, where theory is understood as

the knowledge a person must have in order not only to do certain things intelligently but also to explain them, to answer queries about them, to argue about them, and so forth.

Notice the similarity to Zach Tellman’s thesis in his ongoing newsletter:

Software development can be reduced to a single, iterative action. Almost everything we do in the course of a day — the pull requests, the meetings, the whiteboard diagrams, the hallway conversations — is an explanation. Our job is to explain, over and over, the meaning of our software.

We must tell a story about what our software is, and what it’s expected to become. When understanding software, we tell that story to ourselves. When changing software, we tell that story to others. Software which is complex takes a long time to explain.

A more conventional way to define the software design activity is in terms of minimizing complexity. If we acknowledge that reducing ambiguity, obscurity, unknown unknowns, and cognitive load—all of them forms of removing complexity—also enables better understanding and easier explanations, then we should conclude that both models are compatible, if not equivalent.

4. Postscript

The theory-building view explains why TEAM couldn’t take ownership of SVC. When X10 left the company, taking the development context—the mental model—with her, the system deteriorated. In Naur’s terms, while still operational, the system was dead:

The building of the program is the same as the building of the theory of it by the team of programmers. During the program life a programmer team possessing its theory remains in active control of the program, and in particular retains control over all modifications. The death of a program happens when the programmer team possessing its theory is dissolved. A dead program may continue to be used for execution in a computer and to produce useful results. The actual state of death becomes visible when demands for modifications of the program cannot be intelligently answered. Revival of a program is the rebuilding of its theory by a new programmer team.

TEAM ’s duty, then, is to revive the system by building a new theory of it. But reconstructing the model while keeping the system operational is a slow and difficult process—a hard sell for an organization convinced that development has just finished. Naur goes as far as saying that program revival, from code and documentation alone, is impossible. The program should preferably be discarded, and the new team should be given the opportunity to resolve the problem from scratch.

With three extra decades of hindsight, I tend to disagree. Revival is very hard, yes, but I’ve seen it happen. It may require that the new team ultimately rewrite every line of the original, one at a time. And I’ve seen fresh starts fail more consistently than that2.

Knowing that revival is a plausible future need has powerful consequences for our work. To approach it correctly, we should mind the people that one day will have to take the project out of its coma: in the style of the code and the structure of the system, but also in its paratexts—comments, docstrings, READMEs, Pull Requests, commit messages, Jira tickets, and Confluence pages.

Granted, my story was an all-too-perfect illustration of Naur’s thesis. I can’t prove it, but I suspect that we could benefit from accepting his theory as a law: the ultimate goal of software design should be (organizational) knowledge building.

So the next time you choose a name, or structure a project, or ponder whether to write or omit a certain comment, rather than thinking in terms of the burden on future maintainers, think: how much will this decision affect—how much will it help or hinder—their building of a mental model of the system, of the business, of the world.

Footnotes

  1. A misconception similarly made by those who intend to replace programmers with statistical models.

  2. In my experience, developers advocate for greenfield rewrites to escape the operational annoyances of production systems. They, too, fall in the trap of assuming that clean code is the hard part of software development. Even in the unlikely case that they produce a replacement matching or improving on the original system, they won’t stick around to run it in production when development slows down. What is left is another stillborn like SVC.