🇺🇸 Pizzeria and Resilient Systems

Arley Pádua
5 min read · Sep 15, 2018

When we think about services, we always rely on transferring data over the network. But do you think the network is reliable?

We know that it is not. You have certainly experienced hardware, software or security issues while talking to a remote service.

Focusing on the code now, I’m sure you have seen this at least once before:

var service = new MyWebService();
var result = service.Process(request);

There’s some sort of magic happening behind the scenes here:
First the request is serialized and sent over the wire. When it arrives at the destination, it is deserialized and the code is invoked with it. The result is then serialized, sent back over the wire and deserialized, and only then can the calling code continue doing what it is supposed to do.

A lot happens behind these two lines of code, and several things can go wrong. The most common case: timeouts. The simple image below shows how much complexity hides behind all that magic.

Synchronous request (request x response)
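The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `remote_process`, `process` and `flaky_wire` are hypothetical names, JSON stands in for whatever serialization your service uses, and the "wire" is just a local function call.

```python
import json

def remote_process(payload: bytes) -> bytes:
    """Stands in for the remote service: this part runs on 'the other side'."""
    request = json.loads(payload.decode("utf-8"))                # deserialize the request
    result = {"order": request["order"], "status": "processed"}  # invoke the code
    return json.dumps(result).encode("utf-8")                    # serialize the result

def process(request: dict, send=remote_process) -> dict:
    """Roughly what a call like service.Process(request) hides from the caller."""
    payload = json.dumps(request).encode("utf-8")  # serialize
    response = send(payload)                       # go through the wire (may time out)
    return json.loads(response.decode("utf-8"))    # deserialize the response

def flaky_wire(payload: bytes) -> bytes:
    """Simulates a timeout that happens after the server already did the work."""
    remote_process(payload)
    raise TimeoutError("no response received in time")
```

Calling `process({"order": 87}, send=flaky_wire)` raises `TimeoutError` even though the remote side finished its work, which is exactly the situation where the caller cannot tell what really happened.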

What developers usually do is: a timeout happened, log that something went wrong, and let’s fix it later.

Then your system becomes less resilient to failure. Your team has to be proactive and investigate what the problem was, and it often fails to find it, because most of the time the call succeeds, except that sometimes it fails for “no reason”. There will also always be these questions: When did the timeout happen? Did the request arrive at the destination? Did the server process the message? Did all the side effects happen? If not, what do I still need to do to make my system consistent?

I could list dozens more questions, and it could take your team forever to sort out such scenarios.

A fun fact is that in these scenarios you always hear that common phrase:

It works on my machine!

Having said that, I want to point out that a change of mindset and architecture is needed in order to solve these scenarios.

To introduce this mindset, ask yourself this question: why do these processes within your system need to be synchronous?

In most cases you don’t need to make a request and receive the response at the exact same time. You can handle it in such a way that you eventually get a response at some point in the future.

Let me tell you a real-life story to make it easier to understand:

You go into a pizzeria, the waiter comes to you, and you order a pepperoni pizza. The waiter then takes your order to the kitchen, where it is eventually picked up by a chef, prepared and put into the oven. After a few minutes it is ready and the chef shouts from the kitchen: Order number 87 is ready! The waiter takes the order and gives it to you.

If this worked as request-response in real life, the waiter would awkwardly stare at you until the pizza was ready. He cannot do that, because he needs to take orders from other customers and forward them to the kitchen queue.

This little story is no different from what a resilient distributed system should do.

A system (you) should send a message through a message broker (waiter) saying that it needs something processed (order). A worker (chef) will eventually pick up this message and process it (prepare your order). Once it is done, the worker sends a message (order ready) through the broker, and it is delivered to a service that is interested in this message (you).

If you introduce this concept into your system, the caller becomes less dependent on who processes its request: it sends a message to the broker containing the request details (a.k.a. a command) and listens for an eventual message saying that something happened with that request (a.k.a. an event).

Asynchronous flow
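To make the pizzeria flow concrete, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the `Broker` class, the `PlaceOrder` command and the `OrderReady` event are hypothetical names, and a real message broker runs as a separate, durable service instead of a Python object.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass(frozen=True)
class PlaceOrder:   # command: "please do this"
    order_id: int
    pizza: str

@dataclass(frozen=True)
class OrderReady:   # event: "this happened"
    order_id: int
    pizza: str

class Broker:
    """Hypothetical in-memory broker that routes messages by type."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, message_type, handler):
        self._handlers[message_type].append(handler)

    def publish(self, message):
        self._queue.append(message)  # returns immediately, nobody waits

    def run(self):
        # Deliver queued messages until nothing is left.
        while self._queue:
            message = self._queue.popleft()
            for handler in self._handlers[type(message)]:
                handler(message)

broker = Broker()
served = []

def chef(command: PlaceOrder):
    # The worker handles the command and announces the outcome as an event.
    broker.publish(OrderReady(command.order_id, command.pizza))

def customer(event: OrderReady):
    served.append(event)

broker.subscribe(PlaceOrder, chef)
broker.subscribe(OrderReady, customer)
broker.publish(PlaceOrder(87, "pepperoni"))  # the waiter takes the order
broker.run()
```

Note that the customer never calls the chef directly. Both only know the broker and the message types, which is what keeps them decoupled.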

Getting back to network failures: you have now decoupled most network issues from the caller, and the message broker has the task of handling these issues for you.

There are a few strategies available to the message broker:

  • If a worker fails to process a message, the message broker can apply a progressive retry, retrying the request after 1 second, then 5 seconds, 30 seconds, 1 minute and so on. You should decide which strategy works best for you.
  • If it keeps failing, the issue is probably not transient and someone needs to look at it. So once the progressive retries are exhausted, the message broker forwards the message to a different place (a dead letter queue), where someone can look into the issue, fix it, and then retry manually.
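The two strategies above can be sketched together as follows. This is a simplified illustration, not how any particular broker implements it: `deliver` and `dead_letters` are hypothetical names, the delays are the ones from the example above, and the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time

RETRY_DELAYS = [1, 5, 30, 60]  # seconds: 1 s, 5 s, 30 s, 1 min

def deliver(message, handler, dead_letters, delays=RETRY_DELAYS, sleep=time.sleep):
    """Try the handler once, then once more after each delay.

    Returns True on success. When every attempt fails, the message is
    parked in dead_letters together with the last error, and False is returned.
    """
    last_error = None
    for attempt in range(len(delays) + 1):
        try:
            handler(message)
            return True
        except Exception as error:  # treat any failure as potentially transient
            last_error = error
            if attempt < len(delays):
                sleep(delays[attempt])  # wait progressively longer
    dead_letters.append((message, repr(last_error)))
    return False
```

A real broker would normally schedule the redelivery instead of blocking in `sleep`, so other messages keep flowing while one is waiting for its next attempt.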

All of these strategies add a layer of resiliency to your system, and your team doesn’t need to worry about timeouts or other transient errors, because the messages will be retried and processed. The team can focus on the issues that get past this layer and be more precise in pinning down what the problem is.

Introducing this type of communication between your systems is a challenge, and some changes are needed. Below I describe some considerations you always need to keep in mind.

Workers are re-entrant (Idempotency)

Since messages will be retried, handling the same message more than once should always produce the same result.

For instance, if you have a service that generates an invoice for a customer, you need to write your worker so that it generates a single invoice per request. You don’t want a new invoice every time the request gets retried.
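One common way to get there is to key the work on a request identifier. The sketch below assumes that every request carries a unique `request_id`; the `InvoiceWorker` class is hypothetical and its dictionary stands in for a database table.

```python
class InvoiceWorker:
    """Hypothetical worker that creates at most one invoice per request id."""

    def __init__(self):
        self.invoices = {}  # request_id -> invoice (stands in for a database table)

    def handle(self, request_id: str, customer: str, amount: float) -> dict:
        existing = self.invoices.get(request_id)
        if existing is not None:
            return existing  # a retry: reuse the invoice, do not create another one
        invoice = {
            "number": len(self.invoices) + 1,
            "customer": customer,
            "amount": amount,
        }
        self.invoices[request_id] = invoice
        return invoice
```

With a real database, the check and the insert would need to be atomic, for example through a unique constraint on the request id, so that two concurrent retries cannot both create an invoice.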

Guarantee the side effects

If your worker does a job whose output other systems might be interested in, make sure you always send your messages, even when the operation is a retry of a previous failure.

There might be a case where you processed the request, persisted the result, and then failed to send the message saying that processing succeeded. The next time the request is processed, the data will already be in the desired state, but the messages still need to be sent.

Sending these messages more than once should never be a problem, since all workers are idempotent.
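A minimal sketch of that rule, assuming a hypothetical `handle_generate_invoice` handler, a dictionary in place of the database, and a `publish` callable in place of the broker client:

```python
def handle_generate_invoice(request_id, invoices, publish):
    """Persist the invoice once, but publish the event on every attempt."""
    if request_id not in invoices:
        invoices[request_id] = {"request_id": request_id, "status": "generated"}
    # Publishing is not skipped when the invoice already exists, so a retry
    # after a failed publish still notifies the interested systems.
    publish({"type": "InvoiceGenerated", "request_id": request_id})
```

If `publish` fails on the first attempt, the invoice is already stored. When the message is retried, no second invoice is created, but the event still goes out.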

Monitor the health of your worker

Because a system can never be 100% resilient, you should adopt strategies to track your system’s health.

  • Create alerts
    You don’t want to be blind to what is happening with the processing time, memory and CPU usage of your workers.
  • Solve issues as soon as possible
    If you don’t pay attention, you will end up with a huge backlog of messages and may lose important ones when they show up.
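As a rough idea of what such an alert could look like, the sketch below compares a few worker metrics against limits. The metric names and the limits are assumptions made for illustration; in practice these numbers would come from your monitoring tool, along with memory and CPU usage.

```python
from dataclasses import dataclass

@dataclass
class WorkerMetrics:
    queue_depth: int          # messages waiting to be processed
    dead_letters: int         # messages parked for manual attention
    avg_processing_ms: float  # average handling time per message

# Hypothetical limits: tune them to your own workload.
LIMITS = {"queue_depth": 1000, "dead_letters": 0, "avg_processing_ms": 500.0}

def alerts(metrics: WorkerMetrics, limits=LIMITS) -> list:
    """Return one human-readable alert per metric that is above its limit."""
    found = []
    for name, limit in limits.items():
        value = getattr(metrics, name)
        if value > limit:
            found.append(f"{name} is {value}, above the limit of {limit}")
    return found
```

An empty list means the worker looks healthy. Anything else is a reason to look at the queue before the backlog grows.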

This is just an introduction to how resilient systems can help you in day-to-day development, and there is more to talk about in future posts. If you find it interesting, stay updated by following @_arleypadua.

Cya on the next one!

This article is also available in Portuguese.

Arley Pádua

Software Engineer and passionate about distributed systems