Secure by Design - Benefits of cloud thinking

Abstract

  • The twelve-factor app and cloud-native concepts can be used to increase the security of applications and systems.
  • You should run your application as stateless processes that can be started or decommissioned at any time.
  • Any result of processing should be stored to a backing service, such as a database, log service, or distributed cache.
  • Separating code and configuration is the key to allowing deployment to multiple environments without rebuilding the application.
  • Sensitive data should never be stored in resource files, because it can be accessed even after the application has terminated.
  • Configuration that changes with the environment should be part of the environment.
  • Administration tasks are important and should be part of the solution; they should be run as processes on the node.
  • Logging shouldn’t be done to a local file on disk, because it yields several security issues.
  • Using a centralized logging service yields several security benefits, regardless of whether you’re running an application in the cloud or on-premises.
  • Service discovery can increase security by improving availability and promoting an ever-changing system.
  • Applying the concept of the three R’s—rotate, repave, and repair—significantly improves many aspects of security. Designing your applications for the cloud is a prerequisite for doing this.

The 12-factor app and cloud-native concepts

The 12-factor app is a methodology for building software-as-a-service applications:

  • codebase: one codebase tracked in revision control, many deploys
  • dependencies: explicitly declare and isolate dependencies
  • config: store configuration in the environment
  • backing services: treat backing services as attached resources
  • build, release, run: strictly separate build and run stages
  • processes: execute the app as one or more stateless processes
  • port binding: export services via port binding
  • concurrency: scale out via the process model
  • disposability: maximize robustness with fast startup and graceful shutdown
  • dev/prod parity: keep development, staging, and production as similar as possible
  • logs: treat logs as event streams
  • admin processes: run admin/management tasks as one-off processes

Quote

A cloud-native application is an application that has been designed and implemented to run on a Platform-as-a-Service installation and to embrace horizontal elastic scaling.

Storing configuration in the environment

  • don’t put environment configuration in code
  • never store secrets in resource files
    • unless it’s encrypted; otherwise, it can be accessed even after the application has terminated
    • encryption adds significant complexity, e.g. where do you store the decryption key, and how do you provide it to the application?
    • secrets are shared with everyone in the development team, regardless of whether they’re stored in code or in resource files ⇒ secrets should be provided at runtime

Placing configuration in the environment

A common practice used in the cloud and suggested by the 12-factor app methodology is to store configuration data in the environment, e.g. in environment variables if you’re using UNIX.
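
As a minimal sketch of this (Java; the variable names DB_URL and DB_PASSWORD are illustrative assumptions, not prescribed by the methodology), configuration and secrets can be read from the environment at startup:

```java
// Reads configuration from the environment at startup (12-factor style).
// The variable names are illustrative, not prescribed.
public class AppConfig {
    public final String dbUrl;
    public final String dbPassword;

    private AppConfig(String dbUrl, String dbPassword) {
        this.dbUrl = dbUrl;
        this.dbPassword = dbPassword;
    }

    public static AppConfig fromEnvironment() {
        String url = require("DB_URL");
        String password = require("DB_PASSWORD"); // provided at runtime, never stored in a resource file
        return new AppConfig(url, password);
    }

    private static String require(String name) {
        String value = System.getenv(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing required environment variable: " + name);
        }
        return value;
    }
}
```

The same build can then be deployed to any environment; only the environment variables differ between deployments.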

Audit trail

From a development perspective, using this pattern makes life a whole lot easier because it reduces implementation complexity: the responsibility of creating audit trails shifts from the application code to the infrastructure layer, i.e. it becomes an IAM (Identity and Access Management) problem.

Sharing secrets

Application developers can focus on using secrets rather than managing them. But from a security perspective this isn’t enough: in most operating systems, a process’s environment variables can be read out, which becomes a security problem if the secrets are stored in clear text.

For example, in most Linux systems, it’s possible to inspect environment variables using cat /proc/$PID/environ.

Encryption

Storing encrypted secrets in environment variables certainly minimizes the risk of leaking sensitive data, but the general problems with decryption remain. For example, how do you provide the decryption key to the application? Where do you store it? And how do you update it?

Another strategy is to use ephemeral secrets that change frequently in an automatic fashion, but this requires a different mindset.

Separate processes

One of the main pieces of advice on how to run your application in a cloud environment is to run it as a separate stateless process.

The main direct security advantage is improving the availability of the service, e.g. by easily spinning up new service instances when needed to meet a rise in client traffic.

You also get some improvement in integrity because you can easily decommission a service instance with a problem, be it a memory leak or a suspected infection.

Deploying and running are separate things

principle of least privilege

Every program and every privileged user of the system should operate using the least amount of privilege necessary to complete the job.

It’s even harmful to have higher privileges than necessary: if the process or component is hacked, the attacker can do things there was never a need to allow.

Processing instances don’t hold state

A well-designed cloud application shouldn’t assume a specific instance is linked to a specific client. Each and every call from each client should end up at any of the instances that are on duty at the moment.

For this reason, the processes shouldn’t save any client conversation state between calls. Any result of processing a client request must either be sent back to the client or stored in a stateful backing service, usually a database.
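
A minimal Java sketch of what this means in code; the SessionStore interface is hypothetical and stands in for a real backing service such as Redis or a database:

```java
// Hypothetical backing-service abstraction: the process itself keeps no state,
// so any instance can serve any request and can be killed at any time.
interface SessionStore {
    void put(String sessionId, String data); // persisted in the backing service
    String get(String sessionId);            // readable from every instance
}

public class CheckoutHandler {
    private final SessionStore sessions; // injected; which store to use is an environment concern

    public CheckoutHandler(SessionStore sessions) {
        this.sessions = sessions;
    }

    public String addItem(String sessionId, String item) {
        // No instance fields hold conversation state; everything round-trips
        // through the backing service between calls.
        String cart = sessions.get(sessionId);
        String updated = (cart == null) ? item : cart + "," + item;
        sessions.put(sessionId, updated);
        return updated;
    }
}
```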

backing services

A backing service is an external resource your application uses, typically by accessing it over the network. This can be a database for persistent storage, a message queue, a logging service, or a shared cache.

An important aspect of backing services is that they should be managed by the environment—not by the application. An application shouldn’t connect to a database specified in code. Instead, which database to use should be part of the deployment, as mentioned in the previous section on storing configuration in the environment. When the connection to the database is managed by the environment, it’s possible to detach the database and attach another to replace it during runtime.
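
A sketch of what this attachment can look like (Java; the DATABASE_URL variable name is illustrative): which database the application talks to is decided entirely by the environment, so one database can be detached and another attached without rebuilding the application:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// The database is an attached resource: the application never hard-codes
// which one to use. Swapping databases is a deployment-time change of
// DATABASE_URL (an illustrative name), with no rebuild of the application.
public class Database {
    public static Connection connect() throws SQLException {
        String url = System.getenv("DATABASE_URL"); // e.g. jdbc:postgresql://db1:5432/app
        if (url == null) {
            throw new IllegalStateException("DATABASE_URL must be provided by the environment");
        }
        return DriverManager.getConnection(url);
    }
}
```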

Security benefits

Separating installation and deployment from running the application works well with the principle of least privilege. If the process can do only what the application is intended to do (e.g. contact the database or write to a log file), the effects of an attack will be contained; the attacker can’t compromise the integrity of the system itself.

When processes are stateless and share nothing (except backing services), it’s easy to scale capacity up or down according to need.

Obviously, this is good for availability: fire up a few more servers, and they immediately share the load.

You can even do zero-downtime releases by starting servers with a new version of the software at the same time you kill the old servers.

Avoid logging to file

Logs are used as the primary source of information about a running system. This means logs must be accessible at all times, but the data they contain must also be locked away because of its sensitive nature, a contradiction in terms, it seems.

But great security and high accessibility aren’t mutually exclusive features. There is a design pattern used by cloud-native applications that addresses this dichotomy: it’s called logging as a service.

Let’s analyze it from a CIA perspective:

Confidentiality

Allowing logs to be easily fetched is the key to high accessibility, but it also introduces the problem of confidentiality.

Storing log data on disk and accessing it using remote login makes log access an IAM problem, but the strategy only holds as long as no one is able to download any files. If logs can be downloaded, it becomes extremely difficult to uphold access rights, which more or less defeats the purpose of IAM.

Integrity

A system behaves the same way no matter what you write in the logs. This certainly makes sense, but if you consider logs as evidence or proof, then integrity suddenly becomes important. For example, if logs are used in a court case claiming that a transaction has been made, then you want to be sure the logs haven’t been tampered with. This is hard to guarantee when logging to a file, because anyone with write access to the file can alter past entries.

Availability

Logging to a local disk also risks filling it up, which threatens the availability of both the logs and the application. A common way to mitigate this is to have an admin process that automatically rotates logs.

You do, indeed, need to consider several security issues when logging to a file on disk. Some are harder and some are easier to resolve than others, but one can’t help thinking that there must be a better way to do this—and there is. The solution is found in the cloud, and it involves logging as a service rather than logging to a file on disk.

Logging as a Service

Confidentiality

Every log call is sent over the network to the logging service.

Restricting access to the logging system isn’t enough on its own to solve the sensitive data access problem, but if you choose to separate log data into different categories, such as Audit, System, and Error, then the logging system could easily restrict users to seeing only log data of a certain category. And, as a developer, this makes perfect sense, because you’re probably interested in technical data only (for example, debug or performance data) and not sensitive audit information.

As a final note, there’s one more significant distinction between file-based and service-based logging to remember. When logging to disk, you might consider the disk to be within the same trust boundary as your application. If so, you don’t need to worry about protecting data while in transit (from your application to the disk). But if you use a logging service, data is sent over a network, which opens up the possibility of eavesdropping on log traffic. This means that using service-based logging requires data protection while in transit, which is often done using TLS (Transport Layer Security) or end-to-end encryption.
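
A minimal sketch of shipping a log event over TLS (Java standard library; the host and port are placeholders, and a real application would typically use a logging framework’s TLS appender instead):

```java
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Sends a log event to a logging service over TLS so that log data is
// protected while in transit. Host and port are illustrative placeholders.
public class TlsLogShipper {
    public static void send(String event) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket("logs.example.com", 6514)) {
            socket.startHandshake(); // verifies the service certificate before any data is sent
            Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
            out.write(event + "\n");
            out.flush();
        }
    }
}
```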

Integrity

If you choose service-based logging, the task becomes more or less trivial. This is because the logging system can easily be designed to separate read and write operations, where writes are only performed by trusted sources.

For example, if you set up your application to be the only one authorized to write data and give all other consumers read access, then your logs can only contain data written by your application. If you also choose an append-only strategy, then you ensure that log data is never updated, deleted, or overwritten, but only appended to the logs. This way, you can easily ensure a high level of integrity at a low cost.
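
At the interface level, this read/write separation can be as simple as handing the application and the log consumers different types (a Java sketch; the interfaces are illustrative, not from the source):

```java
import java.time.Instant;
import java.util.List;

// The application is only handed an AppendOnlyLog; consumers only a LogReader.
// Neither interface offers any way to update, delete, or overwrite entries.
interface AppendOnlyLog {
    void append(Instant timestamp, String entry);
}

interface LogReader {
    List<String> read(Instant from, Instant to);
}
```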

Availability

If the network is less reliable than disk access and the logging service can’t be reached, log data might need to be buffered in local memory and forwarded later, provided the risk of losing buffered data is acceptable.
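
A minimal sketch of such a bounded in-memory buffer (Java; the capacity and drop-oldest policy are illustrative assumptions):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded in-memory buffer in front of the logging service. If the service
// is unreachable, events queue up in RAM; if the buffer fills, the oldest
// events are dropped, an explicit and accepted risk of losing data.
public class BufferedLogSender {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000);

    public void log(String event) {
        while (!buffer.offer(event)) {
            buffer.poll(); // buffer full: drop the oldest event to make room
        }
    }

    // Called by a background thread that drains the buffer to the service.
    public String next() throws InterruptedException {
        return buffer.take();
    }
}
```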

You can adapt the storage capacity based on need and improve overall availability, which is particularly easy if the service is running in the cloud.

Admin processes

Admin tasks, e.g. running batch imports or triggering a resend of all messages from a specific time interval, are often treated as second-class citizens compared to what is seen as the real functionality that customers directly interface with.

Admin functionality should be treated as a first-class citizen; it should be developed together with the system, version-controlled on par with the rest of the functionality, and deployed to the live system as a separate interface (API or GUI). This gives you several security benefits:

  • better confidentiality, because the system can be locked down
  • integrity is improved, because the admin tools are ensured to be well synchronized with the rest of the system
  • availability of admin tasks is improved even under system stress

The security risk of overlooked admin tasks

Having a means of general access such as ssh opens up the attack surface more than is necessary. If ssh access is allowed, there’s a risk of it being used by the wrong people for the wrong reasons. An attacker who happens to get their hands on root-level ssh access can do almost unlimited harm.

The second risk is that if (or rather when) the system and the admin scripts get out of sync, bad things can happen. For example, if the development team refactors the database structure and the sysadmin SQL commands aren’t updated accordingly, applying the old SQL commands to the new table structure can cause havoc and potentially destroy data.

There’s a third risk: having system code maintained separately from system admin scripts by different groups of people tends to contribute to a strained relationship between the two groups, something that’s not beneficial in the long run.

Admin tasks as first-class citizens

Even if an attacker gains access to the admin API, they’ll only be able to trigger the predefined functionality, not arbitrary OS or SQL commands, so the attack surface is kept low.

Even if the admin functionality is part of the deployed system, you’ll still want it as a separate process. You’ll probably want to have the admin parts available when things are getting slow and unresponsive. If it’s embedded in the same process as the usual functionality, there’s a risk that it’ll become unavailable at the wrong time.

Example

A classic admin task is rotating logs.

By following the guidelines in this section, you get all three kinds of security benefits:

  • confidentiality increases because the system is locked down to only provide specific admin tasks and not, for example, a general SQL prompt
  • integrity is better because you know that the admin tools are in sync with the application, so there’s no risk of those tools working on old assumptions and causing havoc
  • availability is higher because it’s possible to launch admin tasks even under high load
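
A minimal sketch of admin tasks as a predefined, deployed interface (Java, using the JDK’s built-in HttpServer; the port, path, and rotateLogs task are illustrative, and a real deployment would add authentication and TLS):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Admin tasks exposed as a separate, predefined interface instead of general
// ssh access: only the tasks coded here can ever be triggered.
public class AdminServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(9090), 0);
        server.createContext("/admin/rotate-logs", exchange -> {
            rotateLogs(); // the one predefined action this endpoint can perform
            byte[] body = "log rotation triggered\n".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start(); // runs as its own process, separate from the main application
    }

    private static void rotateLogs() {
        // application-specific rotation logic goes here
    }
}
```

Because the admin server is its own process, it stays responsive even when the main application is under heavy load.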

Service discovery and load balancing

Service discovery and load balancing are two central concepts in cloud environments and PaaS solutions. Service discovery can improve security because it can be used to increase the availability of a system. It also increases security by allowing a system to be less static.

Centralized load balancing

Cloud-native applications are stateless and elastic by definition, which means individual instances come and go, and they should be able to do so without the consumer of a service being affected.

When using centralized load balancing, the consumer, or caller, is unaware of how many instances of an application are sharing the load and which instance will receive a specific request. The distribution of the load is managed centrally.

Client-side load balancing

An alternative approach to centralized load balancing is client-side load balancing. This puts the decision of which instance to call on the caller.

There are several reasons to use client-side load balancing instead of centralized load balancing:

  • it can simplify your architecture and deployment processes
  • it allows the caller to make informed decisions about how to distribute the load

You need service discovery to do client-side load balancing. Because the discovery is performed at runtime, service discovery works well in ever-changing cloud environments.
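
A sketch of client-side load balancing on top of service discovery (Java; the ServiceRegistry interface is hypothetical and stands in for Consul, Eureka, DNS SRV records, or similar):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// The caller asks a service registry for the instances currently alive and
// picks one itself, instead of relying on a central load balancer.
interface ServiceRegistry {
    List<String> instancesOf(String serviceName); // e.g. ["10.0.0.5:8080", "10.0.0.7:8080"]
}

class ClientSideLoadBalancer {
    private final ServiceRegistry registry;

    ClientSideLoadBalancer(ServiceRegistry registry) {
        this.registry = registry;
    }

    String pickInstance(String serviceName) {
        // Discovery happens at call time, so the answer tracks the ever-changing
        // set of instances; here the caller spreads load uniformly at random.
        List<String> instances = registry.instancesOf(serviceName);
        if (instances.isEmpty()) {
            throw new IllegalStateException("No live instances of " + serviceName);
        }
        return instances.get(ThreadLocalRandom.current().nextInt(instances.size()));
    }
}
```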

Embracing change

Change is good from a security perspective. Spreading load across multiple instances can increase the availability of a system. Moreover, a system, application, or environment that’s constantly changing is far more difficult to exploit than one that stays static.

The three R’s of enterprise security

  • rotate secrets every few minutes or hours
  • repave servers and applications every few hours
  • repair vulnerable software as soon as possible (within a few hours) after a patch is available

The common traditional rationale is that change increases risk. To reduce risk, rules are made that prevent change in systems or software in various ways: for example, limitations on how often new versions of applications are allowed to be released. Protocols and processes are introduced that stretch out the cycle time for getting new updates and patches at the OS level. Secrets such as passwords, encryption keys, and certificates are rarely or never changed. Reinstalling an entire server is almost unheard of.

The purpose of the three R’s is the total opposite: reduce risk by increasing change.

Rotate

Passwords can be treated as ephemeral by the platform. They’re generated on demand and then injected directly into the environment of a running container or host, so they only ever live in nonpersistent RAM. You don’t have to deal with the hassle of encrypting a password placed in a configuration file.

Certificates can be rotated in a similar fashion. Because you rotate them instead of renewing them, you reduce the time frame for which a certificate is valid. If someone were to steal one, they wouldn’t have much time to use it. In addition, you’ll never again have a problem with a certificate expiring because you forgot to renew it.

Tip

Keep secrets short-lived and replace them when they expire.

Sometimes the application needs to be aware of how the credentials are being rotated, e.g. when rotation involves details better encapsulated in the application than pushed into the platform. With this approach, the application is responsible for retrieving updated secrets before they expire. In terms of responsibility, this is similar to client-side load balancing in that it puts more responsibility on the application.
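
A sketch of this approach (Java; SecretsClient is hypothetical and mimics the lease-with-TTL semantics of secret stores such as HashiCorp Vault):

```java
import java.time.Duration;
import java.time.Instant;

// The application refreshes a short-lived credential before it expires,
// instead of the platform injecting it. All names here are illustrative.
interface SecretsClient {
    Credential fetchDatabaseCredential(); // returns a fresh, short-lived secret
}

record Credential(String username, String password, Instant expiresAt) {}

class RotatingCredentialSource {
    private final SecretsClient client;
    private volatile Credential current;

    RotatingCredentialSource(SecretsClient client) {
        this.client = client;
        this.current = client.fetchDatabaseCredential();
    }

    Credential get() {
        // Refresh ahead of expiry so callers never see a stale secret.
        if (Instant.now().isAfter(current.expiresAt().minus(Duration.ofMinutes(1)))) {
            current = client.fetchDatabaseCredential();
        }
        return current;
    }
}
```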

Note

Rotating secrets doesn’t improve the security of the secrets themselves, but it’s an effective way of reducing the time during which a leaked secret can be misused.

Repave

Recreating all servers and containers and the applications running on them from a known good state every few hours is an effective way of making it hard for malicious software to spread through the system. A PaaS can perform rolling deployments of all application instances, and if the applications are cloud-native, you can do this without any downtime.

Instead of redeploying only when you’re releasing a new version of your application, you can do this every other hour, redeploying the same version. Rebuild your VM or container from a base image and deploy a fresh instance of your application on it. Once the new instance is up and running, terminate one of the older ones. By terminate, we mean burn it down to the ground and don’t reuse anything, including erasing anything put on a file mount used by the server instance.

By repaving all your instances, you not only erase your server and application but also any possible malicious software placed on the instance, perhaps as part of an ongoing APT attack.

Repair

Think of repairing as a variant of repaving. If you’ve got repaving down, it’s not that different to start repairing.

If you’re familiar with continuous delivery and continuous deployment, you might already be applying the repair concept on your own applications, even if you don’t know it.

If you’re constantly changing your software, an attacker constantly needs to find new ways to break it.