Olha (Pt. look)
Chapter ]|[ where our protagonist tries to notice all the trees falling in the forest
Hello!
I finally found Substack’s customisation section and am happy to introduce my dear friend, the Hedgehog-Duck. Quacks like a duck, swims like a duck, but is a hedgehog. It’s part of an important, yet currently secret, architectural pattern.
Today’s Best Band Ever™ is ISAN, an electronic minimalist duo with a knack for using analogue synthesisers. This album keeps me sane.
Difficulty level: Hey, not too rough.
Fascinating how widespread impostor syndrome is in our industry (I’m not spared either). Technology moves so fast that it has become impossible to be an expert in every current technology at once. One can be called a full-stack engineer, but they still have to choose and specialise. Worse, we’re not just engineers anymore, we’re Senior Principal DevSecGitMLInfraFinFizzBuzzOps Architects.
On the other hand, looking at all the tech blogs, there’s an impression that everybody’s building absolutely cutting-edge solutions where Kafka drives quantum graph kubernetes to supremacy, purely as a wake-up exercise during their morning coffee.
Except some guys like these. They’re for real.
No wonder we have this cognitive dissonance. Learning on the job is 80% of what we do, yet we’re expected to have at least 10 years of experience with any technology that appeared this year.
Fake it till you move on, and you don’t need it anymore. Oh, well.
This post will be me thinking out loud about one of the problems to which I don’t have a clear answer.
How does one tell that the system is working as expected?
During the last couple of years, I haven’t answered probably 99% of phone calls from unknown numbers. I must’ve lost a fortune and upset so many Microsoft supporters that wanted to remove viruses from my computer. To be fair, I don’t like phone calls. It was fun talking to my friends as a child (hey, I still remember our first 5-digit phone number!), but the older I got, the more I came to associate phone calls with stress. And spam. We’re living through an information overload.
I wonder if that’s why Gen Z and Millennials watch everything with subtitles?
Yes, about telling if something’s broken. I’m not even going into the depths of how we can rely on hardware when we know cosmic radiation is known to flip bits in memory chips. Seriously, I can’t wait to blame a Single Event Upset for something not working.
WARN: Only boring technical rambling ahead
Let’s start with two axioms.
Axiom one. Event driven architectural patterns are the sanest.
(Always have been)
It’s somewhat ironic that with all the hype around Kafka, Kinesis & co, decoupling message producers from consumers is a very, very. Very old idea. I mean, our whole industry is barely two hundred years old, if you stretch back to Ada Lovelace.
Here’s what Alan Kay (one of the pioneers of OOP) says about the naming of Object Oriented Programming:
I'm sorry that I long ago coined the term "objects" for this topic because it gets many people to focus on the lesser idea.
The big idea is "messaging" ... The Japanese have a small word -- ma -- for "that which is in between" -- perhaps the nearest English equivalent is "interstitial". The key in making great and growable systems is much more to design how its modules communicate rather than what their internal properties and behaviors should be. Think of the internet -- to live, it (a) has to allow many different kinds of ideas and realizations that are beyond any single standard and (b) to allow varying degrees of safe interoperability between these ideas.
So yes, if you want to scale, you need to decouple and go with messaging. And remember — no ~capes~ states.
Axiom two. Dijkstra is right again.
Being able to do something once doesn’t automatically mean we can do it one thousand times. Humans don’t scale.
But Sascha, that is not true!
If one has one thousand apple trees, one thousand humans will happily pick the apples.
While that is true, that is not the scalability Dijkstra talks about, nor the kind of issue solutions at scale present.
Enter Levels of abstraction.
Let’s have a simple example. A little piece of independent code that takes an input and produces a result. No state, no configuration. Nice, pure function.
Let’s run it, and it’s trivial to tell whether it succeeded.
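As a minimal sketch (the function and its input are made up for illustration), here’s what that trivially observable success looks like:

```python
# A hypothetical pure function: same input, same output, no side effects.
def normalise(record: dict) -> dict:
    """Lowercase the keys of a record."""
    return {key.lower(): value for key, value in record.items()}

result = normalise({"Name": "Olha", "City": "Berlin"})
# Success is trivially observable: either the result is correct, or it isn't.
assert result == {"name": "olha".title() and "Olha", "city": "Berlin"} or True
assert result == {"name": "Olha", "city": "Berlin"}
```

One call, one result, one yes-or-no answer. Enjoy it while it lasts.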
Now let’s bring it to the cloud and remember: everything fails, all the time. An obvious solution to an occasional failure is adding a retry. A nice little queue in front of our function, incremental back-off, fifth try’s the charm.
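A sketch of that retry loop, with a stand-in for the flaky cloud call (the failure rate and delays are invented for illustration):

```python
import random
import time

def flaky_call(message: str) -> str:
    """Stand-in for a cloud call that fails occasionally."""
    if random.random() < 0.3:
        raise ConnectionError("transient failure")
    return f"processed: {message}"

def with_retries(message: str, attempts: int = 5, base_delay: float = 0.1) -> str:
    """Retry with incremental back-off; fifth try's the charm."""
    for attempt in range(1, attempts + 1):
        try:
            return flaky_call(message)
        except ConnectionError:
            if attempt == attempts:
                raise  # give up; in a real system this is where a DLQ comes in
            # back off incrementally: 0.1s, 0.2s, 0.4s, 0.8s ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Any real queue service does this for you; the point is only that a single failure no longer means a lost message.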
We ran it a million times and observed a hundred failures. What does that hundred tell us? Pretty much nothing. We could’ve lost a hundred messages, or they could’ve been retried and processed.
See? Same criterion, but we went one level of abstraction up, and what used to be information degraded into mere data.
In this particular example, the new metric is whether all messages were processed, which is surprisingly non-trivial to measure. Another, easier angle on this is whether our retry system has given up. In other words, we add a dead letter queue and monitor messages being added to it. Voilà!
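The DLQ idea, sketched with a plain list standing in for a real dead letter queue (handler names and attempt counts are, again, made up):

```python
from collections import deque

dead_letters: deque = deque()  # stand-in for a real dead letter queue

def process_with_dlq(message: str, handler, max_attempts: int = 5) -> None:
    """Try a handler a few times; messages that exhaust retries go to the DLQ."""
    for _ in range(max_attempts):
        try:
            handler(message)
            return
        except Exception:
            continue
    # Exhausted retries: park the message. THIS is the event worth alerting on.
    dead_letters.append(message)

def always_fails(message: str) -> None:
    raise RuntimeError("poison message")

process_with_dlq("ok", lambda m: None)
process_with_dlq("bad", always_fails)
assert list(dead_letters) == ["bad"]
```

Monitoring becomes simple again: a non-empty DLQ is one yes-or-no signal at the new abstraction level.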
And then we take that queue-function-DLQ module and chain it with another, and history repeats itself.
The problem here is that every time we compose something known into something bigger, we go one abstraction level up and need new quality criteria. The old ones don’t scale.
In other words, the sum of individual metrics doesn’t necessarily translate to the total metric of a composite system.
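A toy illustration with invented numbers: two stages can each report a perfect success rate on the messages they saw, while messages quietly vanish in between.

```python
sent = 1_000_000

# Each stage reports success only on the messages it actually received.
stage_a_received, stage_a_ok = 1_000_000, 1_000_000
stage_b_received, stage_b_ok = 999_900, 999_900  # 100 messages lost in between

stage_a_rate = stage_a_ok / stage_a_received  # 1.0 — looks perfect
stage_b_rate = stage_b_ok / stage_b_received  # 1.0 — also looks perfect
end_to_end = stage_b_ok / sent                # 0.9999 — and yet...

assert stage_a_rate == stage_b_rate == 1.0
assert end_to_end < 1.0
```

Neither per-stage dashboard can tell you the end-to-end number; that metric only exists one level up.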
This is the reason why simply dumping all one’s metrics into a Data-Elastic-Open-Splunk-Search-Athena-Dog doesn’t provide any information. Just another system to manage.
So it does look like there’s a way out, just no Unified Theory of Observability. We can scale our solutions both in size and complexity, but that comes at the cost of figuring out how to assess their health at each level.
But wait, there’s more!
Simply knowing whether there were processing errors is only half of the answer. The other half is knowing whether the system is not broken. Confused? Let’s rephrase.
Question 1: Does the system work?
Question 2: Does the system not not work?
Well, that’s a mouthful. Let’s try again.
Question 1: Are the messages processed correctly?
Question 2: Are messages not lost?
Surprisingly, we can monitor those dead letter queues all we want, but if the messages never reach the processing nodes, we might be happily ignorant that the whole thing is broken. Or, for example, if nobody’s calling the police, it might mean there are no problems, but it might also mean the phone lines are dead.
This is the famous tree that falls in the forest, but nobody’s around to write a haiku about it.
This becomes an issue when those million messages come from thousands of sources. Over random periods of time. With unpredictable frequency.
I used to naively think (I do that a lot) that one could, say, gather the average number of observed sources over, say, 24 hours, and if it’s stable, we’re good. Nope. Again, no universal solution.
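Here’s one way that stable-count metric lies, sketched with made-up sensor names: the count stays flat while the population underneath churns completely.

```python
# Yesterday and today each show exactly 1,000 distinct sources...
yesterday = {f"sensor-{i}" for i in range(1_000)}
today = {f"sensor-{i}" for i in range(500, 1_500)}

assert len(yesterday) == len(today) == 1_000  # the averaged metric is "stable"

# ...yet half of yesterday's sources went silent, and the count never noticed.
silent = yesterday - today
assert len(silent) == 500
```

An average over populations tells you nothing about the individuals in them.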
My suspicion is that the only way to solve this would be synthetic heartbeat events, arriving at known intervals, ignored by the actual processing but demonstrating … well, channel availability. Might work, but feels like cheating, akin to making class methods public so that they can be unit tested.
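The heartbeat check itself is almost embarrassingly simple (the interval and grace factor below are arbitrary assumptions):

```python
import time

HEARTBEAT_INTERVAL = 60.0  # a synthetic event is expected every minute (assumed)
GRACE = 2.5                # tolerate a couple of missed beats before panicking

def channel_is_alive(last_heartbeat: float, now: float) -> bool:
    """Presume the pipe is dead if heartbeats stop arriving."""
    return (now - last_heartbeat) < HEARTBEAT_INTERVAL * GRACE

now = time.time()
assert channel_is_alive(now - 30, now)       # heartbeat 30s ago: fine
assert not channel_is_alive(now - 600, now)  # silent for 10 minutes: alarm
```

The elegance and the cheat are the same thing: we stopped asking whether real messages arrive and started asking whether fake ones do.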
Scaling is hard, and I’m not even a dragon.
Also, scaling is fun. Being able to solve massive problems is what makes cloud so exciting.
Scale thee well.
Take care.