Building resilient microservice workflows with Temporal
In today’s fast-paced and ever-evolving world of software-as-a-service (SaaS) companies, it is crucial for software engineers to have the right architecture, infrastructure and design patterns to adapt quickly as user and product growth lead to new business needs. SafetyCulture is constantly innovating and exploring efficient and reliable systems to manage business processes and workflows. In this article, we’ll take a look at how our engineering group adopted the Temporal.io workflow engine and how it has increased our engineering productivity and overall success.
Modern Application Architecture
Microservices architecture allows software engineering teams to develop and scale services at a granular level, enabling the larger engineering organization to consist of loosely coupled teams that own specific feature domains. This localization of knowledge and complexity makes the organization more responsive, scalable and able to deliver features quickly compared to traditional monolith architectures.
Microservices typically interact with each other either synchronously, through direct request-response, or asynchronously, through event-driven communication using an event broker such as Apache Kafka.
Asynchronous, event-driven communication enables complex processes and interactions between microservices as a series of loosely coupled domain events, in a pattern known as “choreography”. These events are published by producers and then consumed asynchronously by subscribers, where flow control is implicitly defined because producers have no explicit knowledge of subscribers. Choreography, as a decentralized approach to service composition, is highly scalable and well suited to fire-and-forget scenarios.
Challenges that Need Innovative Approaches
Certain business problems do not suit either direct point-to-point service calls or choreography. These problems often involve complex and/or long running business processes that span multiple services and may additionally involve human interaction, or processes in a distributed environment that require transactional properties, such as processing a user sign-up request or implementing sagas.
Simple point-to-point request-response communication is unreliable for these types of problems due to the unpredictable nature of distributed networked services. Meanwhile, relying on choreography-style communication leads to state being dispersed across multiple disparate services, making it difficult to maintain transaction state and visibility of a long-running business process.
Developers must write ad-hoc retry and backoff logic, durable timers, state storage and other boilerplate glue code to solve these problems, adding complexity to microservices. SafetyCulture’s extensive backend includes areas that exhibited these traits, with different teams implementing custom solutions using a combination of timers, caches, databases and Kafka. This creates friction for engineers maintaining or adding new features to services.
Trying to maintain state for long running business processes in a microservices architecture.
We realized that we needed better design patterns and tools to address these types of problems when we had to make compromises such as limiting the size of bulk functionality to reduce implementation complexity, or simplifying functionality for faster market launches. Our engineering discussions both revealed the need for a better approach and pointed toward one: the concepts of “flow control” and “orchestration” consistently emerged during brainstorming. This led us to explore the orchestration architecture pattern for microservices.
Microservice Orchestration
Microservice orchestration involves the use of a centralized platform to manage all coordination logic and state required for executing business processes or workflows. This centralization provides clear and explicit flow control, and the orchestration platform ensures consistency among individual services. Orchestration offers an architecture to build durable, resilient business processes as workflows with better end-to-end visibility. Compared to choreography, orchestration trades off some scalability for stronger consistency.
Source: distributed-transaction-patterns-microservices-compared
The Search for an Orchestrator
The concept of workflow orchestration engines has existed since the early 2000s and there are numerous options available, with a comprehensive listing found at awesome-workflow-engines. To select the best orchestration engine for SafetyCulture, a six-point set of evaluation criteria was used to create a shortlist of orchestration engines:
- Reliability — ability to handle network interruptions and unexpected services outages
- Scalability — ability to easily and quickly scale to higher workload
- Observability and durability — logging, persistence, UI for debugging workflows
- API-driven workflow management with GUI management capabilities
- Compatibility with our tech stack
- Ease of configuring workflows
Temporal stood out from the rest because it not only met the evaluation criteria, but offered additional benefits, such as a familiar developer experience in each team’s main programming language of choice, and code-based workflow definitions, which are more effective at expressing complex business logic than YAML, JSON or DSLs such as BPML.
Temporal Core Concepts
Like any new tool, Temporal comes with a learning curve for both its infrastructure and its programming model.
Temporal Server Infrastructure
The Temporal server consists of several component services responsible for executing units of application logic in a resilient manner by persisting state changes and automatically handling intermittent failures. A Temporal cluster includes the server component services, as well as database, admin and web UI components (see diagram below). A Temporal cluster can be deployed as a standalone executable, in Docker containers, in self-hosted Kubernetes using Helm charts, or as a managed Temporal Cloud service.
Programming Model
Temporal’s execution paradigm operates on the principle of execution inversion, where the responsibility of executing your workflow code lies with worker applications and the server is only responsible for orchestrating the execution of your workflows as a series of tasks. In other words, your business logic in the form of a workflow definition is not executed on the server but rather, the server schedules workflows as tasks for execution by worker applications. The Temporal server persists task execution history by recording task inputs and results, enabling workflows to be reliable and resilient against failures in any component of the Temporal ecosystem.
A worker is a service or application that connects to the Temporal server, registers the workflows it can execute, and polls the server for workflow tasks to execute. It can be thought of as similar to a Kafka consumer that registers topics it listens to and has handlers to process messages.
Temporal Architecture Diagram
Defining and Implementing Workflows
The structure of a workflow in Temporal is defined as a function, but it must follow strict rules for determinism. Every time a workflow is executed, it must follow the exact same sequence of steps. This precludes the workflow from directly executing code that may result in varying or unpredictable outcomes, such as network API calls or non-deterministic language primitives like random number generation or multithreading routines.
To ensure deterministic behavior, the Temporal SDK offers alternatives for language primitives that have inherent randomness. For example, to start a new goroutine:
workflow.Go(ctx, f(x, y, z)) // instead of go f(x, y, z)
Code that may fail intermittently is placed inside functions known as activities. Unlike workflows, which must behave deterministically, a Temporal activity can contain any code without limitations. The Temporal server orchestrates activities as tasks, and if an activity fails, it is retried from its initial state. Results from successful activity executions are persisted by the Temporal server in the workflow’s execution history, so that during re-executions of a workflow the activity is not run again but its result is retrieved, thereby ensuring workflow determinism.
Putting it together, consider the following example workflow that retrieves weather forecasts for a list of cities using a weather API.
Activity function example that fetches forecast for a city:
package activities
… // abbreviated code
// GetForecast is an activity that uses the weather client to fetch a city’s weather forecast
func (a *WeatherForecastActivity) GetForecast(ctx context.Context, city string) (*clients.WeatherForecast, error) {
    forecast, err := a.weatherClient.GetForecast(ctx, city)
    if err != nil {
        return nil, err
    }
    return forecast, nil
}
Workflow function example that fetches forecasts for a list of cities:
package workflows
… //abbreviated code
func ForecastWeatherWorkflow(ctx workflow.Context, cities []string) (map[string]*WeatherForecast, error) {
    response := map[string]*WeatherForecast{}
    ac := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Second,
    })
    var activity *activities.WeatherForecastActivity
    for _, city := range cities {
        var forecast *WeatherForecast
        err := workflow.ExecuteActivity(ac, activity.GetForecast, city).Get(ctx, &forecast)
        if err != nil {
            return nil, err
        }
        response[city] = forecast
    }
    return response, nil
}
Worker application example:
package main
… //abbreviated code
func main() {
    // Create the client object just once per process
    c, err := client.NewClient(client.Options{})
    if err != nil {
        log.Fatalln("unable to create Temporal client", err)
    }
    defer c.Close()
    w := worker.New(c, "weather-forecast-queue", worker.Options{})
    // register activity and workflows
    activity := activities.NewWeatherForecastActivity()
    w.RegisterActivity(activity.GetForecast)
    w.RegisterWorkflow(workflows.ForecastWeatherWorkflow)
    // Start listening to the Task Queue
    err = w.Run(worker.InterruptCh())
    if err != nil {
        log.Fatalln("unable to start Worker", err)
    }
}
SDK client application example that starts a weather forecast workflow:
package main
… // abbreviated code
func main() {
    // The Temporal client is a heavyweight object
    c, err := client.Dial(client.Options{})
    if err != nil {
        log.Fatalln("Unable to create client", err)
    }
    defer c.Close()
    workflowOptions := client.StartWorkflowOptions{
        ID:        "weather_forecast_workflowID",
        TaskQueue: "weather-forecast-queue",
    }
    we, err := c.ExecuteWorkflow(
        context.Background(),
        workflowOptions,
        workflows.ForecastWeatherWorkflow,
        []string{"sydney", "manchester", "manila", "kansas-city"},
    )
    if err != nil {
        log.Fatalln("Unable to execute workflow", err)
    }
    log.Println("Started workflow", "WorkflowID", we.GetID(), "RunID", we.GetRunID())
    // Synchronously wait for the workflow completion.
    var result map[string]*workflows.WeatherForecast
    err = we.Get(context.Background(), &result)
    if err != nil {
        log.Fatalln("Unable to get workflow result", err)
    }
}
An execution of the above workflow looks like this in Temporal Web UI:
The above example workflow and activity functions may look like standard Go code, but because it runs as a Temporal workflow, the Temporal SDK guarantees the workflow runs to completion, with all request and response parameters persisted by the Temporal server. In other words, without any explicit implementations for durable timers, retry logic or state stores, this workflow is robust against failures such as:
- Intermittent failures in third party APIs, handled with the built-in retry policy featuring exponential backoff;
- Service interruptions or failures in any component of the Temporal cluster
Temporal’s default retry policies ensure a workflow resumes from the point of failure once services are restored, without external intervention. The retry policy consists of a collection of attributes that can be customized according to the needs of a specific scenario.
Some Best Practices With Temporal
Do’s
- Ensure activities are idempotent — activities should be designed to be retryable by default. The default Temporal setting is to retry activities forever, so carefully consider the consequences if they are not idempotent
- Favor heavier activities to avoid a bloated workflow history — many fine-grained activities produce a large workflow history that is difficult to debug and requires more memory for storage and replay. Instead, consider wrapping several API calls into a single activity, or using an activity heartbeat with progress information when performing an operation over a large sequence of resources
- Monitor both Temporal server and SDK metrics for better observability — Both the Temporal server and Temporal SDK emit metrics that can be scraped and stored by the likes of Prometheus. Temporal also supplies default Grafana dashboards for easy performance monitoring.
Don’ts
- Don’t rely solely on metrics like P95 or P99 latency when deciding on activity retry policies — below is an example latency distribution with P99 latency well under 1 second, but with tail latencies of 1 to 2 minutes. Basing a retry policy with a small initial interval and a limited number of retries on P95 or P99 metrics alone could result in unexpected activity failures
- Don’t pass large data blobs between activities — Temporal stores the inputs and results of every activity, so passing large data blobs from one activity to another is an anti-pattern
- Don’t write long workflow functions — decompose long workflows using the language primitives at your disposal (e.g. Go functions, methods on struct types) to simplify workflows and improve their maintainability and testability
For additional best practices, check out Drew Hoskins’ presentation from the Temporal Replay conference.
Introducing Temporal into SafetyCulture
Our Temporal adoption journey began with testing the Temporal server’s capabilities in Docker containers on an engineer’s laptop. This validated the applicability of Temporal workflows for our needs and verified its resiliency and orchestration capabilities.
Building on this promising start, we moved on to develop a Proof of Concept workflow in our sandpit environment and conducted an Engineering Design Review for the introduction of a workflow orchestration engine into our engineering platform. Our internal World Class Engineering program, an initiative to continuously improve SafetyCulture engineering as an organization and encourage grassroots new ideas and solutions, provided the perfect platform for spreading the word.
The adoption process posed some challenges, such as the difficulty in reusing the default Helm charts within our infrastructure setup. However, Temporal helpfully provided suggestions and we subsequently created our own charts by referencing Kubernetes resources compiled from the default Helm charts provided by Temporal. Other challenges included some small gaps in language-specific documentation where, again, Temporal engineers were prompt in responding to questions on their Slack channels.
Off the back of a successful PoC, internal workshops and presentations were held for the engineering group. One engineering team took the initiative to implement workflows for multiple backend batch processes and was able to design, write and deploy interactive workflows using just a single worker service. In contrast, previous implementations required three separate microservices to achieve a similar level of reliability, observability and interactivity for just a single batch process. The simplicity of the solution enables us to solve customer problems faster and more reliably than ever before.
Conclusion
Temporal is a cloud-native workflow engine that helps developers create and execute complex, long-running and fault-tolerant applications. It offers a developer-friendly way to build and manage business workflows that are scalable, resilient and observable without having to write custom timers, state persistence or other boilerplate code. While this article discussed the basics of a Temporal workflow, there are many more compelling features, such as rate limiting, heartbeat activities, signals and queries, that make Temporal the Swiss Army knife of workflow orchestration platforms.
At SafetyCulture, Temporal has become a core piece of an engineer’s toolkit. By allowing our engineers to use command-driven orchestration patterns to solve problems, it complements event-driven choreography design patterns in our microservice architecture. The table below summarizes the typical characteristics of choreography versus orchestration design patterns:
Choreography vs Orchestration
The additional flexibility to tackle engineering challenges has given us new ways of rethinking our approach to complex, long running processes and solving new challenges brought on by the scale of the product and user growth on our cloud platform.
If you like to tackle difficult challenges that have real world impact and delight customers, find out more about SafetyCulture and get in touch!