On availability aspects of microservices

Does splitting a service into microservices help or hurt the availability of the overall system?

Jan 30, 2022

This post is a copy of the gist I posted in response to the WeekendDevPuzzle of 2022-01-29, shown below.

As the thread above has submissions from people who chose to share their perspectives on the puzzle, I'll be linking this post to this Twitter thread, to avoid polluting the original one.

Motivation for today's topic

Unlike the usual peel-the-onion kind of topics, this one focuses more on how we think and on our mental models of architectural decisions. But there's more.

Over the past 15 years, programming has become extremely accessible. While that's undoubtedly a good thing, it has also become much easier to just rely on best practices, with or without context. Today's puzzle was designed to bring that context to the discussion table, albeit in a super simplified fashion.

Dissecting the puzzle

Let's dissect the question, and bring out the core elements of it.

Basics

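For a distributed system where the call graph looks like A → B → C, it's easy to see that the call succeeds only when all of the components are available. We can write this as: P(U) = P(A) * P(B) * P(C), where P(U) is the probability that the user's call succeeds, and P(A), P(B) and P(C) are the probabilities that the respective components are available.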
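The product rule above can be sketched in a few lines of Python. This is only an illustration: the function name `serial_availability` and the 99.9% figures are made up for the example, and it assumes the components fail independently of each other.

```python
from math import prod


def serial_availability(*components: float) -> float:
    """Availability of a call chain that succeeds only if every component is up.

    Assumes component failures are independent, so the probabilities multiply.
    """
    return prod(components)


# Hypothetical numbers, for illustration only: A, B and C at 99.9% each.
p_u = serial_availability(0.999, 0.999, 0.999)
print(f"P(U) = {p_u:.4%}")
```

With three components at 99.9% each, the chain as a whole comes out at roughly 99.70%, already lower than any single component.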

Side note: If reading the word probability immediately switched off your mind, don't worry. We'll keep the maths super basic, given our super simplified scenarios. Who knows, it might even encourage you to get more comfortable with it!

It would seem like we have the answer to our puzzle then. But do we? Let's dig deeper.

A deeper look

The puzzle mentions the following:

What do you think it means for P(LB), P(WEB) and P(DB)?

So we now have a richer perspective on Scenario A, even if it raised more questions. What about Scenarios B & C? After all, the puzzle made no mention of downtime numbers for them.

An even deeper look

Let's start with Scenario B:

Similarly, we can analyse the picture for Scenario C, with only one additional point of interest:

Is there anything else we're missing?

Implicit assumptions

We made an implicit assumption above: that our network is perfect. But that isn't always the case (cable breaks, SFP failures, OS network tables filling up, network saturation, router misconfiguration; the list is long). A surprising number of engineers tend to assume perfect networks just because they haven't seen one fail, until it does, again and again.

Let's denote the network's availability as P(N) < 1.0. Clearly, every distributed call over that network has its availability reduced by this factor. So, Scenario A would look something like P(U) = P(LB) * P(N) * P(WEB) * P(N) * P(DB). Two network hops reduce the availability by a factor of P(N)^2. Can you see why?
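To make the effect of the network hops concrete, here is a small sketch. The helper `chain_availability` and the component availabilities are hypothetical, chosen only for illustration; it assumes a linear chain with exactly one network hop between neighbouring components and independent failures.

```python
from math import prod


def chain_availability(components: list[float], p_network: float) -> float:
    """Availability of a linear call chain with one network hop between neighbours."""
    hops = max(len(components) - 1, 0)
    return prod(components) * p_network ** hops


# Hypothetical component availabilities (not the puzzle's numbers).
p_lb, p_web, p_db = 0.9999, 0.999, 0.999
p_n = 0.99995  # assumed network availability per hop

ideal = chain_availability([p_lb, p_web, p_db], 1.0)
real = chain_availability([p_lb, p_web, p_db], p_n)
print(f"ideal network : {ideal:.4%}")
print(f"P(N) = {p_n:.3%}: {real:.4%}")
```

LB → WEB → DB has two hops, so the second figure is the first one multiplied by P(N) twice.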

Nitesh correctly points this out, though he mistakes my comment about observed availability to be design availability.

There're a few other implicit assumptions being made here, but for the sake of brevity, let's ignore them for now.

I need a break!

At this point, some of you might be thinking: "WTH! Dissecting it was supposed to simplify the question, not add to it!" Let's take a deep breath.

Sometimes, breaking down a problem may noticeably amplify the number of factors, but the story from there only gets better, as we start hacking & slashing away at the problem, by making simplifying (yet informed and explicit) assumptions. So let's start doing that.

My submission

Let's quickly go over the calculations we did earlier:

I'm going to make the following assumptions. The exact numbers don't matter; it's the model that matters:

For the remaining parameters, let's evaluate two situations, H1 (for hypothesis 1) and H2.

H1 (refactoring improved availability because of reduced code complexity, no more buggy DB driver code eating up web resources, etc):

H2 (refactoring reduced availability due to poor coding, poor abstractions, less integration testing, etc):

Let's see what we get

Scenario   H1       H2
A          99.43%   99.43%
B          99.18%   98.91%
C          99.18%   98.91%

So, this is very interesting. Whether you believe me or not, I didn't plan for the numbers to be like this. The following observations stand out to me:

  1. Scenario A is noticeably better than B or C, even in H1, i.e. where we're assuming an improvement in availability.
  2. Scenario B == Scenario C on availability.
  3. H2 being worse off in Scenarios B & C is no surprise, as we're deliberately assuming a worsening of post-refactor availabilities.

Let's analyse these outcomes one by one.

Analysing the outcomes

Scenario A is noticeably better than B & C even in H1

One can read this as: breaking a monolith, even while increasing individual availabilities, still reduced the overall availability. How did this happen? Well, two things:

  1. More moving parts means the product of probabilities decreases faster.
  2. The network, assumed to be 99.995% available, added to the cost. Even with an ideal network (P(N) == 100%), Scenario B still only reaches 99.20% under H1, which means that the number of moving parts tends to dominate.

Can we assume this to be a universal truth? I'd say no. The takeaway here is that the number of moving parts impacts availability substantially, and when breaking a monolith, the individual availability improvements should be large enough to compensate for the reduction caused by the increase in moving parts. For example, if the refactoring in H1 had improved P(WEB) to 99.85% and P(SVC) to 99.75%, Scenarios B & C would become better than A.
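One way to get a feel for "large enough to compensate" is to compute a break-even point. The sketch below is a simplification under stated assumptions, not the calculation used for the table above: the function `required_component_availability` and all numbers except the 99.995% network figure are hypothetical, the new components are assumed to be equally available, and failures are assumed independent.

```python
def required_component_availability(
    target: float, n_components: int, p_network: float = 1.0, hops: int = 0
) -> float:
    """Per-component availability needed so that n equally available components
    in series, plus `hops` extra network hops, still match `target`."""
    network_cost = p_network ** hops
    return (target / network_cost) ** (1.0 / n_components)


# Hypothetical: one component at 99.5% is split into two services,
# which introduces one extra network hop at 99.995%.
before = 0.995
needed = required_component_availability(
    before, n_components=2, p_network=0.99995, hops=1
)
print(f"each new service needs at least {needed:.4%} to break even")
```

Anything below that break-even value for the new services means the split reduced overall availability, however clean the refactoring felt.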

Scenario B == Scenario C on availability

This is just an artifact of choosing P(SMART_CLIENT) == 99.99%. A higher number would put Scenario B ahead of C, and a lower number would reverse that. The takeaway here is that when focussing on embedded smart clients, their code quality (which is reflected in P(SMART_CLIENT)) needs to be very high for them to be better than an HA dedicated hardware LB.

This part, though so clear mathematically, was a bit of a surprise. The fact that I was surprised tells me that there was a bias in my head, a chink in my mental model, that assumed embedded smart clients to be better. Reflecting deeper, I feel the performance implications were leaking into the availability assumptions in my head.

An important point

It's not the answer that's important here, it's the model or the factors that lead to the answer. The value of each parameter would vary depending upon the circumstances, so the outcome can be different. So, I'd request you to focus on the takeaways, instead of the answer.

Conclusion

Phew! This was one of the longer ones.

Some of you might say that I've taken a particularly elaborate (to the point of being unnecessary) approach to the puzzle. I wouldn't disagree with you. But, in my defense, this is supposed to be a weekend puzzle 😄, to be noodled over & over, in different forms, lazily & elaborately. It isn't designed to be a race.

My own reason for being this elaborate is simply that I wanted to lay out, as exhaustively as I could, all the different elements of the puzzle. For some of you, it might bring attention to a hidden chink in your mental model. For a few others, it might've helped convert implicit assumptions into explicit ones. For the rest, it could either be an affirmation of how they thought, or an opportunity to help correct my calculations here.

Irrespective of whether you liked this elaborate approach or not, or even agree with my view of the puzzle, I hope you still had fun thinking about it, including all of its different aspects.

If you've got a comment to share, head over to the gist of this post.