In this talk from RabbitMQ Summit 2019, we listen to Omar Elasfar from Zalando.
Zalando's communication platform products are powered by a RabbitMQ cluster. Multiple telemetry points allow the system to apply scalability and resilience patterns in real time. Thanks to this setup, managing cost efficiency and performance at Zalando has been a seamless effort.
Short biography
Working at the intersection of technology, human experience design and product development has always been where Omar Elasfar thrives the most. He started his career as a software engineer before moving towards product management, building a mobile-only, crowd-sourced traffic experience for passengers' daily commutes, the first of multiple endeavors building complex products.
For the past three years, Omar has been working as a Software Project Manager with an engineering department focused on sending daily shoppers the right message through the right medium at the right time, building systems that crunch millions of data points and provide shoppers with the items that delight them the most.
Monitoring & scaling based on RabbitMQ telemetry
Thanks, everyone, for joining this talk.
Before starting, I would like to share the motivation for giving this talk. We attended last year's RabbitMQ Summit. Not me personally, but my team members. We generally have a knowledge sharing session within our team afterwards about what was gained from the Summit and what we could include in our normal day-to-day work.
Towards the end of the debriefing session, we asked ourselves, “What would have been a topic that we could have contributed to or that was missing last year?” We came to the conclusion, one year ago, that monitoring was not really covered in the first summit edition. We decided, a year ago, that we were going to come this year and present this topic, and here we are today. Actually, it seems like we were not the only team that thought about this.
Monitoring seems to be a first-class citizen in 3.8, with native Prometheus support and, at the same time, a Grafana plugin that gives you all the dashboards you could need. Other talks today are also focusing on monitoring or tracing, on how you get better visibility over your system and how you run it. This is basically what I'm going to be talking about: how we monitor our system in its production state nowadays. Unfortunately, we're not yet on 3.8, so my talk covers what we have been monitoring from 3.6 onwards and our experience based on what we were monitoring at that point in time.
Overview
That's an overview of what I'm going to be talking about today. An intro to Zalando. An intro to our team, what we're doing, and how we use RabbitMQ in our own product. What we monitor today and what our monitoring looks like. What we have learned, so a couple of observations from the trenches of monitoring our system. And then, how we actually use RabbitMQ telemetry to scale our system dynamically.
Zalando at a glance
That’s Zalando, basically, at a glance. We started off 11 years ago. Nowadays, we operate in 17 countries. We have a little bit over 29 million active customers, so people who perceive Zalando as a starting point of fashion. We seem to be on the right track. Most of our traffic actually comes from mobile devices, so whether that's mobile web or our native apps.
This slide was actually just updated the day before yesterday because I had October numbers but we shared our November numbers. In the past quarter, we had more than a billion site visits. In total, we have around 15,500 employees. Our headquarters is in Berlin, with around 3,000 engineers focusing on different engineering problems around the fashion space.
Customer inbox at a glance
This is Customer Inbox, at a glance. Has anyone shopped at Zalando before? Then you've probably interacted with our product. Customer Inbox is the team that manages the customer's inboxes. We view your email inbox as our premises and as a window back into our shop, where you come back or get the information you need at a glance. In the same way, your notification center is an inbox from our point of view as well.
This is the spot where we're able to reach our customers and notify them about information that they care about. Customer inbox is mainly the team that builds the overall tooling around communication with our consumers. We have an in-house-built tooling around template designing, audience building, campaign setup and sending transactional and commercial communication to our customers.
We always strive to make the communication we send out to our customers, first of all, supportive, so that it provides the information a customer needs at that point in time. Whether that's creating an account, getting a confirmation that your account has been created, signing up for our newsletters and learning more through them, or the transactional flow where you get your order confirmation, shipping confirmation, return confirmation, payment or open invoice - all of this transactional flow gives you the right piece of information at the right time and supports what you need to do next.
On the other hand, we want to build a very deep relationship with our customers. We don't want our communication to be only transactional. We also want our commercial communication to build a long-lasting relationship, where the messages we send reach customers at the right point in time, whether that's when they're waiting for a sale or when they've bought something and want to pair it with something else. This is what we mainly focus on, on the Customer Inbox side.
Email and push notifications are our most prominent channels, but they are not the only ones; we also support sending SMS and Facebook Messenger messages. That's the whole premise of building our platform in-house: we can easily extend it towards any channel in the future. That extensibility also comes from the fact that we rely on RabbitMQ because, at any point in time, we could just add another topic, another routing key, and another consumer and, out of the box, you have another channel.
Woken up in the middle of the night?
That takes me to the bitter point of why we built this solution. Has anyone ever been woken up in the middle of the night just to add a couple more instances? Not as many hands as people in my team would raise. We have seen this a lot.
Within Customer Inbox, we have four multidisciplinary teams working on the same product. In the end, we have the product that builds the audience, builds the template, sends this out to our providers, and collects the asynchronous and synchronous feedback about what's going on. This whole product is supported by a 24/7 on-call rotation in which, generally, the majority of the four teams participate.
We normally have what we call a WOR meeting, the weekly operation and review meeting, that takes place with the handover of on-call duty from one person to the next. That's a one-hour meeting where we go through our SLOs, product SLOs alongside specific service SLOs, and also go through the incidents that occurred. Why was this person woken up? What was written in the post mortem? What action items were there?
We've noticed a pattern there. Sometimes, we're woken up at night just for someone to add a couple more instances; the messages in the queue drain, the application is scaled back down, and everything works fine. That makes for a rather anemic post mortem. The action item and the knowledge gained is that we probably had more messages than expected and, if the system had scaled on its own, that would have solved the problem.
This is what we've been aiming to do over the past year, in 80% of the cases, because just scaling can sometimes worsen the problem you have. You need to be careful about which applications you aim to scale dynamically without watching out for this.
You need to make sure that you don't end up with an even bigger problem when you're scaling an application that doesn't have available connections to a data store, or scaling an application with a dependency on an upstream system, where you're actually causing more harm to the customer experience than benefit by protecting your own system. If our system scales and we're able to handle messages at a higher throughput, that doesn't necessarily mean that the customer will enjoy it, because if the shop itself is not able to scale as fast as we can, customers will not be able to interact with us. 80% is the share of cases we've observed where the problem was within our domain and scaling made sense.
Our Setup / Inter-product broker
Before talking about how we scale, I'd like to give you an intro to our setup, how we use RabbitMQ, and what else we rely on. RabbitMQ is the inter-product broker within the Customer Inbox realm.
Within Zalando, in general, we have another event streaming platform, which is called Nakadi. It's an open source project. This is where most products communicate when the messages between systems are relevant for other products as well, and where we generally get information about things that are happening outside our realm. We then process everything that's within our domain inside our RabbitMQ cluster. That is split into a constellation of products, like when we build the audience of customers that we actually want to communicate to, or when we're consuming events that happen on the platform in general and start communicating with the customers right away, building the message and sending it out.
We're generally running a three-node cluster, with a Dockerized RabbitMQ running on AWS. That's the actual official logo of an EC2 instance, by the way. That was a new learning for me while creating these slides.
In general, those three building blocks are a constellation of 70 microservices that deal with everything around message building, template building, and asynchronous and synchronous feedback. Not all of those are connected to the RabbitMQ broker. I think around 22 of them actually have a direct connection to RabbitMQ, but the whole platform interconnects and works together.
Within our setup, we rely on different feature sets provided by RabbitMQ. We have high-availability queues for everything in the hot path of a message, where we want to make sure it's highly available and mirrored across nodes in case we lose a node while operating. We also rely on publish confirms for messages that we must not lose, so that they're always sent out. For instance, for a password reset or an order confirmation, we don't want to miss sending that email out. This is something that a customer expects, and in most cases it's legally binding as well; you don't want to miss sending out an order confirmation.
We rely on a couple of exclusive queues as well. I'm going to get into why and how we use them. We have a couple of lazy queues that are mainly for dead-lettering or delayed processing while bundling certain things together. That's basically our setup for the time being.
We have a Blue/Green deployment with a predefined configuration file. That's something that was mentioned earlier in another talk: it's better to have your topology in a centralized place so that you're able to deploy another broker right away. This is the case for all of the exchanges, bindings and queues that we know we need. On the other hand, we have a couple of queues and exchanges that are dynamically created by our applications based on interactions with our email campaign management UI, which the applications take over by themselves when we deploy from one broker to another.
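To make the Blue/Green idea concrete, here is a minimal sketch of importing a predefined topology into a freshly deployed broker via the management API. It assumes the API on localhost:15672 with default credentials and a definitions.json exported earlier; it is not our actual deployment tooling.

```python
import json
import requests

# Assumptions: management API on localhost:15672, default guest credentials,
# and a definitions.json (exchanges, queues, bindings, policies) exported
# earlier from the old broker. Not Zalando's actual Blue/Green tooling.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")

with open("definitions.json") as f:
    definitions = json.load(f)

# POST /api/definitions uploads the whole topology in one call, so a newly
# started (green) broker can be brought to the same shape as the old (blue) one.
resp = requests.post(f"{MGMT}/definitions", json=definitions, auth=AUTH, timeout=30)
resp.raise_for_status()
print("Topology imported, status", resp.status_code)
```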
How we collect metrics
How do we collect metrics? We rely heavily on the management plugin, which is enabled on our cluster, and we poll the Management API endpoints to get all of the information that we store. We do this with a tool called ZMON. That's a Zalando open-source monitoring and alerting tool that actually kickstarted as a Hack Week project, maybe five years ago, because normal monitoring tooling lacked the possibility to define any entity that you're interested in. It's a very flexible solution: you define any kind of entity that you need, create checks and, on top of a check, create alerts. Sometimes it's just a dummy alert to make sure the check runs periodically and you keep collecting data, which is persisted for 30 days by default. And then, you hook this up to Grafana so you have nice dashboards visualizing everything.
Since I mentioned on-call duty: we rely on Opsgenie as well for getting paged. It's directly hooked up to any of our checks in ZMON that we consider mission critical. By default, we have tier definitions for applications and their criticality within the whole Zalando realm. A critical application, by default, comes with critical alerts and a team that takes care of the on-call duty for all of this.
One thing I would like to mention as well: while polling the management endpoint, we generally do this on the aggregate nodes endpoint to get the information for all nodes, because we run a three-node setup, so you get aggregate information alongside individual information from each node. But if you're running a multi-cluster setup, your monitoring system should ideally be aware of what your nodes are so that it can automatically poll each node on its own. Or you might have a different setup with a dedicated node that runs for monitoring and reporting to aggregate everything. That's something your check system needs to be aware of while polling.
Another thing: this is something you shouldn't do too aggressively. Running a check every second isn't that beneficial because the management endpoint, at the moment, is tightly coupled with the RabbitMQ runtime itself. By polling it excessively, you might cause more harm than you gain in visibility. Our checks generally run at a 30- to 60-second interval. That's generally sufficient, in our case, to have the visibility we need over our cluster.
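As an illustration of that kind of check, here is a minimal sketch that polls the aggregate overview endpoint at a 60-second interval. It assumes the management API on localhost:15672 with default credentials; it is not an actual ZMON check, which would be scheduled by ZMON itself rather than by a loop.

```python
import time
import requests

# Assumptions: management API on localhost:15672, default guest credentials.
# ZMON schedules checks itself; the polling loop here is only illustrative.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")
INTERVAL_SECONDS = 60          # 30-60 seconds is enough; don't hammer the endpoint

def check():
    overview = requests.get(f"{MGMT}/overview", auth=AUTH, timeout=10).json()
    return {
        "messages": overview.get("queue_totals", {}).get("messages", 0),
        "messages_ready": overview.get("queue_totals", {}).get("messages_ready", 0),
        "connections": overview.get("object_totals", {}).get("connections", 0),
        "consumers": overview.get("object_totals", {}).get("consumers", 0),
        "publish_rate": overview.get("message_stats", {})
                                .get("publish_details", {}).get("rate", 0),
    }

while True:
    print(check())             # a monitoring system would persist this and alert on it
    time.sleep(INTERVAL_SECONDS)
```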
What do we monitor
That's a snippet of our Grafana dashboard. The dashboards are our own opinion of what to monitor; they don't contain all the information that you could monitor. The endpoints are very extensive and verbose: you can get everything that you see in the management UI, alongside a lot of information that you can combine with metrics from your own application stack or clients.
Most of the screenshots I'm showing now are from our RabbitMQ Grafana dashboard. It's not the only dashboard we use for monitoring. Every application generally has its own dashboard where it monitors its own performance, data stores and everything else, alongside its own connections, channels, and publish and consumption rates from RabbitMQ. This is just our RabbitMQ Grafana dashboard, which is generally a very good starting point for me to check where things are going wrong, if something is going wrong, because it gives very good visibility over the broker as a whole.
The first panel that we have is about the infrastructure itself. We split our monitoring very similarly to how it is documented in the RabbitMQ monitoring guidelines. We monitor the system - the underlying system itself and the RabbitMQ nodes. We monitor the queues and exchanges, the overall footprint of every queue and exchange, the number of messages and messages in RAM around all of this, alongside the clients.
Nodes at a glance
This is how the panels are set. The first overview here is actually from the nodes at a glance which are our three nodes. This is where we monitor the I/O rates, the operations, the available disk space, the used memory, the file descriptors.
Whenever you see a bump here that goes beyond your limit, that's an alarming thing to look at, because this is where RabbitMQ would probably start throttling your producers while your consumers are unable to catch up. We also watch each instance's memory usage and where it is relative to the memory watermark defined for our broker, so how far away we are and how much leeway we have, plus the CPU usage of all nodes.
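A hedged sketch of that node-level check is below: it reads the /api/nodes endpoint and computes how much headroom is left before the memory watermark, file descriptor limit, and free disk limit are hit. The endpoint location, credentials, and the 20% thresholds are assumptions for illustration.

```python
import requests

# Assumptions: management API on localhost:15672, default guest credentials;
# the alerting thresholds are illustrative, not Zalando's actual values.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")

for node in requests.get(f"{MGMT}/nodes", auth=AUTH, timeout=10).json():
    mem_headroom = 1 - node["mem_used"] / node["mem_limit"]      # vs. memory watermark
    fd_headroom = 1 - node["fd_used"] / node["fd_total"]         # file descriptors
    disk_ratio = node["disk_free"] / max(node["disk_free_limit"], 1)

    print(f'{node["name"]}: mem headroom {mem_headroom:.0%}, '
          f'fd headroom {fd_headroom:.0%}, disk free {disk_ratio:.1f}x limit')

    # alert before RabbitMQ itself starts blocking publishers
    if mem_headroom < 0.2 or fd_headroom < 0.2 or disk_ratio < 2:
        print(f'ALERT: {node["name"]} is close to a resource limit')
```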
Globals at a glance
Then, our next panel, at the very top, is globals: what's happening at a global level. We monitor the global queue statistics, so the overall number of messages, the overall number of messages in a ready state, and the overall number of messages that are in a rejected state or have been rejected by a consumer, alongside the message rates themselves.
At a global level, we also track the overall number of connections and the overall number of consumers that we have at any point in time. That fluctuates a lot, depending on how many instances we run for every service, but they should all be roughly in line. If you see an uptick in message rate while our consumers are unable to handle that uptick, we would also see an uptick in the number of connections and consumers as applications automatically scale out to deal with the throughput. If you look at a bigger timeframe, you can see this scaling up and down very nicely.
One thing to point out here: the number of consumers on this panel is almost 10,000 but, at the same time, there are around 2,000 messages that are not being consumed. That number, 2,000, actually means something in our system. We generally have dead letters; a couple of messages might be dead-lettered, and we have alarming and alerting set up around this. For commercial messaging, dead-lettering 2,000 messages, that is, missing out on some commercial messages, is within our acceptable SLO. Those messages just sit in our dead letter queue; we're not consuming them for now but, if the number increases beyond this, someone will get an alert and have to check on it.
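A minimal sketch of that kind of dead letter queue alert is shown below, assuming the management API on localhost:15672 with default credentials; the queue name is hypothetical and the 2,000-message threshold simply mirrors the example above.

```python
import requests

# Assumptions: management API on localhost:15672, default guest credentials,
# default vhost ("/", URL-encoded as %2F). Queue name and threshold are illustrative.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")
DLQ = "commercial-messages.dlq"
SLO_THRESHOLD = 2000

queue = requests.get(f"{MGMT}/queues/%2F/{DLQ}", auth=AUTH, timeout=10).json()
backlog = queue.get("messages", 0)

if backlog > SLO_THRESHOLD:
    # more dead-lettered commercial messages than the SLO tolerates:
    # this is where the on-call person would get paged
    print(f"ALERT: {DLQ} holds {backlog} messages (threshold {SLO_THRESHOLD})")
```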
Queues at a glance
Then, we have an overall panel for the queues - the message rate per queue, the consumer utilisation per queue, the number of messages per queue, and the actual RAM usage per queue. We have different message sizes: an entry message coming into our system, before rendering or templating or figuring out what will eventually go out, is very slim.
By contrast, a message that's already rendered is an email that's about to go out. A small number of messages in that queue might have a much higher memory and CPU footprint than small messages in an asynchronous feedback queue that we use to persist metadata about the message itself. We look at this hand in hand with the message bytes per queue and the message bytes in RAM per queue. That gives you an overview of how many messages you have, what the message rate per queue is, and what the footprint of each queue is. It's also a very good view of consumer utilisation.
This is generally how consumer utilisation should look: at 100%. 100% means that you're utilizing your system at its best. Your consumers are able to cope with the production throughput of your producers, you don't need to add more consumers to any queue, and everything is probably in memory, which is the state we always strive for. In reality, that's not always the case.
This is a view from a busy time of our queues at a glance. If you look in between, that was around midnight, where we're generally not sending anything to our customers between 12:00 and 1:00 am because that's when most people are sleeping. We generally try to optimize for open rate, so we try to send early in the morning. If you look at this earlier in the morning, we have a lot of consumers being underutilized or unutilized and then popping back up. At the same time, we have a higher message rate per queue. In general, most of our queues run at around 1,000 operations per second, and we have backlogs in certain queues; some queues have 20k messages that still need to be consumed.
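For the per-queue view, a sketch like the following pulls consumer utilisation and backlog per queue from the management API (localhost and default credentials assumed); a utilisation well below 100% combined with a growing backlog is the pattern worth flagging.

```python
import requests

# Assumptions: management API on localhost:15672, default guest credentials.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")

for queue in requests.get(f"{MGMT}/queues", auth=AUTH, timeout=10).json():
    utilisation = queue.get("consumer_utilisation") or 0   # reported per queue, may be null
    backlog = queue.get("messages_ready", 0)
    deliver_rate = queue.get("message_stats", {}) \
                        .get("deliver_get_details", {}).get("rate", 0)

    # consumers not keeping up AND messages piling up: worth a closer look
    if utilisation < 1.0 and backlog > 0:
        print(f"{queue['name']}: utilisation {utilisation:.0%}, "
              f"backlog {backlog}, delivery rate {deliver_rate}/s")
```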
Clients at a glance
In general, not having 100% consumer utilisation is not the perfect state, but it doesn't mean that your system is not functioning, because a consumer can lag for a while and then, all of a sudden, start picking up again. That's when the producers slow down their production rate or the consumer applications add more instances. So 100% is the optimum you should aim for. Being below 100% for a very long time is alarming, and you should start checking your application code to see why it's not keeping up with the production rate. Smaller dips followed by catching up again are generally how you see your system dealing with the burstiness of certain producers, where a producer suddenly wakes up and produces a million messages, and then the consumers prefetch, sift through the messages, and process them slowly but surely.
To give you a little bit of an overview of our application stack before diving into clients: most of our backend services are Java Spring Boot applications. Applications running Spring Boot 2.0 or higher have Prometheus support by default and report everything to centralized checks of the application state; anything still running below Spring Boot 2.0 reports all of this to the normal health check endpoint. Either way, we have an overview of all the interactions of each application with our broker. We generally report the number of consumers, connections and channels, the publish rate and the consumption rate.
Each application has its own user as well, so we have this overview per application. That's a very easy way for us to know which application is hogging more resources at the moment and which application is producing or consuming more. The majority of our applications are either producers or consumers, but we have an overlapping set of applications that both produce and consume. There, I would generally recommend that you run production and consumption on different connections, each with their own channels, because that gives you better throughput in general, and if RabbitMQ throttles your publishing connection, your consumption, which is not throttled at all by RabbitMQ, is not impacted.
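Since our stack is Spring Boot, the sketch below is only an illustration of the principle in Python with pika: one connection dedicated to publishing and a separate one for consuming, so a broker alarm that blocks publishers does not stop the consumer from draining queues. The queue and exchange names are made up and assumed to exist already.

```python
import pika

# Assumptions: local broker, default credentials, a pre-declared "inbound" queue
# and "downstream" exchange. Illustrative only; our services use Spring AMQP.
params = pika.ConnectionParameters("localhost")

publish_connection = pika.BlockingConnection(params)   # used only for publishing
publish_channel = publish_connection.channel()

consume_connection = pika.BlockingConnection(params)   # used only for consuming
consume_channel = consume_connection.channel()
consume_channel.basic_qos(prefetch_count=50)

# If the broker raises a memory or disk alarm it blocks publishing connections;
# with a dedicated consuming connection, consumers can keep draining the queue.
for method, properties, body in consume_channel.consume("inbound", inactivity_timeout=5):
    if method is None:          # nothing arrived within the timeout window
        break
    publish_channel.basic_publish(exchange="downstream",
                                  routing_key="processed", body=body)
    consume_channel.basic_ack(method.delivery_tag)
```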
The more you monitor, the more you get to know your system
As always, the more you monitor your system, the more that you get to know about it. Or, the more that bad things happen, the more you get to know what's going on.
What did we learn
A very interesting learning forum for me personally, within our organization, is our 24/7 chat, because this is generally where someone gets pulled in and then shares a tracing overview, a Grafana dashboard, or a check that has gone flaky. Just going through these monitoring pieces of the organization gives you a more holistic overview of what's actually going on. This is, in practice, how we figured things out within our system as well.
Missing Exclusive queues
I would like to share a couple more examples of what we learned while monitoring, and the use cases behind them. As I mentioned earlier, we rely on exclusive queues. To give you a little bit more context around how and why we use exclusive queues: the most computationally intensive process in our pipeline is rendering the message itself. We've built a templating solution in-house with our own UI, where marketers can go into the tool and drag and drop any base components they want into a template. That automatically generates the template that could go out. It could be a push notification, an email, or anything else.
Email is generally the most intensive because we have to render the full HTML alongside the actual text content, so we have to render both. We have a rendering system that renders all of this.
It's a microservice architecture, so every service has its own bounded context. Templates live in one system, rendering happens in another system, and there is an orchestrator in between. We have, as most systems do, a worker service that knows about the message, fetches the template, calls the rendering system to render the template, and then passes this on to another exchange and its queues to go to the different senders.
We started caching the template state in the rendering engine, so which template I'm rendering at the moment. We added a distributed cache, relying on Redis. That was not cutting it for the SLO measures we had in place, because we define the SLO of the product and that's something we don't break. With different workloads, we had to optimize further to meet all of this. We ended up with local caching as well, on the render services themselves. Those are services that scale out on their own.
This is where we relied on Spring Cloud Bus alongside exclusive queues. Our template service publishes a template change event inside our RabbitMQ for coordination between the template service and the render service. Within the render service, the first instance that picks up the message fans it out through Spring Cloud Bus, over RabbitMQ, to the other render service instances to invalidate their local caches. The render services themselves scale dynamically, either based on queue backlog or based on CPU, so you cannot just hard-code the number of exclusive queues you plan to monitor. That number always changes because you scale out and in. The queues are there, but you need to know the actual number to check against.
This is where ZMON, alongside the management endpoint, helped us get clear visibility over all of this, because an AWS instance that runs one of our applications is defined as an entity, so we know how many entities are running for that application. At the same time, from the management endpoint, we know how many exclusive queues exist at that point. Once an instance passes the health check, gets hooked up to the load balancer and starts getting traffic, we know that it exists and that it has an exclusive queue bound next to it. If those numbers match, then everything is fine: everyone is connected and every cache gets invalidated in the right manner.
If there is a mismatch between those numbers, then we have an instance missing its queue, which we then have to either fix or replace. Generally, just killing the connection, which is something we also do automatically, solves the issue. That was our usage of exclusive queues and the learning we gained: monitor the number of exclusive queues against the number of running instances, and you get much better visibility into the usage of exclusive queues and what you expect from them.
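A minimal sketch of that comparison is below. The exclusive queue count comes from the management API; the instance count is stubbed out, since in our case ZMON already knows it from its AWS entities. The queue name prefix, credentials, and the helper are all illustrative assumptions.

```python
import requests

# Assumptions: management API on localhost:15672, default guest credentials,
# exclusive queues named with a recognizable prefix, and a stubbed helper for
# the instance count (ZMON derives this from its EC2/ASG entities).
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")

def running_render_instances():
    return 4  # stub: in practice this comes from your deployment platform

def exclusive_queue_count(prefix="springCloudBus."):
    queues = requests.get(f"{MGMT}/queues", auth=AUTH, timeout=10).json()
    return sum(1 for q in queues
               if q.get("exclusive") and q["name"].startswith(prefix))

instances = running_render_instances()
queues = exclusive_queue_count()
if queues != instances:
    # an instance without its cache-invalidation queue; killing the affected
    # connection (or replacing the instance) usually resolves it
    print(f"ALERT: {instances} instances but {queues} exclusive queues")
```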
Default no. of cached channels
Another learning concerns the default number of cached channels. A little bit of context around channels: RabbitMQ generally has a long-lasting TCP connection from the application to the broker and, within every connection, you can multiplex multiple channels. That gives you higher throughput while publishing or consuming. It's generally recommended that you have a single connection and utilize multiple channels within your application.
While monitoring our applications' channel churn rate, we noticed that we were not optimizing channel creation in one of our applications for its expected throughput, so we dug deeper. We figured out that the default of the caching connection factory in Spring Boot is one cached channel. Once our application introduced publish confirms, because it was an application producing messages that we did not want to lose at all, it started opening a channel, publishing on it and waiting for the acknowledgement, and it created way more channels than we expected. We were able to drill down to the root cause: we were not reusing channels at all. The number of cached channels was one, so only one channel is cached and another thread picks it up. But we have 200 threads running in parallel, so all of them tried to create channels. One of them finishes and caches one channel, which is picked up by the next thread, but the other threads that are still working are also creating more channels. That was very evident in our graph: we saw around 2,000 channels from a single application and a huge memory footprint. Channels are relatively cheap compared to connections, but they also use resources, and you don't want to waste resources on something you could optimize. What's generally recommended is a single-digit number of cached channels.
We actually went a little bit higher, because the application in question consumes from our event-streaming platform, where we might have a backlog of 10 million messages. Those 10 million messages are 10 million price reductions that happened overnight. Since we don't send communication overnight, we start in the morning, but we want to get through the whole backlog before the normal peak of the shop's transactional flow and people ordering. We want to finish processing the price reductions and matching them with customers that are interested in items that are now reduced.
We capped the number of channels at 200, which gave us the throughput we need, together with the right thread configuration in this application and the right batch size when consuming from our external event-streaming platform. That was the sweet spot. You can enforce this cap on the connection factory by setting a checkout timeout. If you only set the number of cached channels, without a timeout, you don't actually cap anything; if you set a timeout, the cache size also acts as a cap on the number of channels. That was our next learning.
The other thing that we did is another usage of the management endpoint: flow control, sort of parking and processing messages in an exponential backoff manner. As I mentioned earlier, we rely on RabbitMQ as an inter-product broker but, at the same time, within the company, we have an event streaming platform.
Take the price reduction use case, for instance. We have a constellation of services that is aware, most of the time, of what's on your wish list or in your cart. Whenever there is a price reduction on the platform, we try to match those reductions with customers that are interested in them. That's the audience building part. That triggers events that go outside of our RabbitMQ system towards our event streaming platform, which we then consume again inside our RabbitMQ system for enrichment. And then, we process the message.
Since we don't have commercial communication running at night, we generally park all of this on our event streaming platform. We don't want to park those messages inside the broker, because we would like to optimize the usage of the broker to always have everything in memory.
Our event streaming platform is built on top of Kafka and is built for different teams to consume from, based on whichever cursor they're at. It provides the functionality of parking messages outside, and this is what we do. Then, in the morning, our services wake up and start processing.
Other usage of the Management endpoint
We check the number of messages inside our RabbitMQ for a specific set of queues. If we still have messages there, we throttle the consuming application that brings messages in until our internal services have processed the backlog. That's another interesting usage of telemetry within a production workload, but I would highly encourage you to use this with a lot of caution and to have sane defaults on your end. Our application is built so that, if the management endpoint does not provide a value, we fall back to a sane default and, based on that, we decide whether we want to throttle or not, because this mixes your production load with telemetry information.
In our case, it made sense because we had a default to fall back to and it fit our normal processing pipeline. But that was another interesting usage of the management endpoint: applying back pressure and parking messages outside of our broker, so that we make sure the broker performs in line with our expected throughput and SLOs.
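A hedged sketch of that flow control is below: before pulling the next batch from the event streaming platform, it checks the backlog of a set of internal queues via the management API, backs off exponentially while the backlog is above a threshold, and falls back to a sane default (don't throttle) whenever the telemetry is unavailable. Queue names, threshold, and endpoint are illustrative.

```python
import time
import requests

# Assumptions: management API on localhost:15672, default guest credentials,
# default vhost; queue names and the threshold are illustrative.
MGMT = "http://localhost:15672/api"
AUTH = ("guest", "guest")
INTERNAL_QUEUES = ["price-reduction-enrichment", "audience-matching"]
BACKLOG_THRESHOLD = 1000

def internal_backlog():
    """Sum of ready messages across the internal queues, or None if telemetry fails."""
    try:
        total = 0
        for name in INTERNAL_QUEUES:
            resp = requests.get(f"{MGMT}/queues/%2F/{name}", auth=AUTH, timeout=5)
            resp.raise_for_status()
            total += resp.json().get("messages_ready", 0)
        return total
    except requests.RequestException:
        return None

def wait_until_drained():
    """Exponential backoff while the internal pipeline still has a backlog."""
    delay = 1
    while True:
        backlog = internal_backlog()
        if backlog is None:      # sane default: never block production on missing telemetry
            return
        if backlog < BACKLOG_THRESHOLD:
            return
        time.sleep(delay)
        delay = min(delay * 2, 60)

# before consuming the next batch from the event streaming platform:
wait_until_drained()
```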
How we solved scaling
Now, that's some interesting branding, I know. The color matches the RabbitMQ color. That was not intentional. That's actually the light version of the AWS branding for CloudWatch, CloudWatch alarms and Auto Scaling. What an interesting coincidence.
What we have built around the telemetry and the management endpoint is a Lambda function. As I mentioned earlier, we run on AWS. We've built a small Python Lambda function that polls the management endpoint and fetches the queue sizes of the queues that we're interested in, or sometimes computes a combination of queues that we're interested in into a single metric. The Lambda function then reports this information to CloudWatch. Based on the metrics that we set in CloudWatch, we define CloudWatch alarms to trigger our Auto Scaling strategy.
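A sketch along those lines is shown below: a Lambda handler that sums the ready messages of a few queues and pushes the result to CloudWatch as a custom metric. The environment variables, queue names, metric namespace, and dimensions are illustrative, and the requests dependency would have to be packaged with the function; it is not Zalando's actual Lambda.

```python
import os
import boto3
import requests  # assumption: packaged with the function; not in the default runtime

# All names below (env vars, namespace, dimension) are illustrative placeholders.
MGMT_URL = os.environ["MGMT_URL"]                  # e.g. https://rabbitmq.internal:15672/api
AUTH = (os.environ["MGMT_USER"], os.environ["MGMT_PASSWORD"])
QUEUES = os.environ["QUEUES"].split(",")           # queues whose backlogs we combine

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    backlog = 0
    for name in QUEUES:
        resp = requests.get(f"{MGMT_URL}/queues/%2F/{name}", auth=AUTH, timeout=5)
        resp.raise_for_status()
        backlog += resp.json().get("messages_ready", 0)

    cloudwatch.put_metric_data(
        Namespace="CustomerInbox/RabbitMQ",
        MetricData=[{
            "MetricName": "QueueBacklog",
            "Dimensions": [{"Name": "QueueGroup", "Value": "render-pipeline"}],
            "Value": backlog,
            "Unit": "Count",
        }],
    )
    return {"backlog": backlog}
```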
For a number of our services, the primary scaling strategy is now entirely based on queue backlog. You can set this based on the queue having an absolute or relative value of more than X for X amount of minutes. That triggers the alarm and, based on this alarm, your Auto Scaling kicks in. Afterwards, if the condition no longer holds, your downscaling kicks in, depending on how you scale down and how you cool down.
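Wiring that metric to a scaling policy could then look roughly like this; the alarm name, threshold, evaluation periods, and the Auto Scaling policy ARN are placeholders, not our production values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="render-pipeline-backlog-high",          # illustrative name
    Namespace="CustomerInbox/RabbitMQ",
    MetricName="QueueBacklog",
    Dimensions=[{"Name": "QueueGroup", "Value": "render-pipeline"}],
    Statistic="Average",
    Period=60,                       # evaluate the metric every minute
    EvaluationPeriods=5,             # backlog above the threshold for 5 minutes...
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[
        # ...triggers the scale-out policy of the consuming service (placeholder ARN)
        "arn:aws:autoscaling:eu-central-1:123456789012:scalingPolicy:example",
    ],
)
```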
This is how we solved scaling our consumers automatically, where scaling those consumers makes sense, without anyone getting woken up at 2:00 am just to add a couple more instances.
This is how it looks in practice. I think it was last week that I was checking our Auto Scaling strategy and when Auto Scaling kicked in based on a queue backlog; that was early in the morning, around 8:00. The combination of both those queues is generally a good indicator on our end that we need to scale a downstream dependency that would otherwise be overwhelmed with load.
As I mentioned in another example, we have a worker that just delegates calls to different services and then publishes a message onwards. One of those dependencies scales based on queue backlog, because that was a better indicator in our case than request latency or CPU usage: based on the messages in those queues, we already know that this service is about to start working. By taking those metrics in and scaling on them, by the time the messages reach the service, it is already able to deal with the workload.
You can see this in practice. We were piling up messages there and running four instances at that point in time. The service scaled to six because that's our increment step: we add two more instances at a time. Once the backlog was processed, the service gradually scaled down. In the end, from our point of view, we scaled in a cost-efficient manner, over around 20 minutes, to handle the burst of a backlog at a very fast processing speed.
Other possible solutions / RabbitMQ-cloudwatch-exporter
This is not the only way to solve it. There is a RabbitMQ plugin, not in the RabbitMQ Git repository but an open source plugin developed by an individual contributor, called the CloudWatch exporter. It exports all of the information from the Management API to CloudWatch.
If you want to try out what I just described today, you can enable this plugin and export everything. But do this with caution, because you're exporting all the information, and CloudWatch bills you by the metrics you export, times the number of nodes you have (depending on your cluster setup), times the number of times you export all of this. It's tier-based: you pay more for the first couple of tiers and, even after you've moved to another tier, you still pay the higher rate for the first ones.
That was one of the main reasons why we didn't opt in to using it. Another reason is that our Lambda function setup had been in production seven months before that plugin came into the open source community. After it came out, we debated internally within our team whether we wanted to switch to it and deprecate our Lambda function or not. In the end, we did basic math around how much extra we would be paying for CloudWatch while we anyway have our own check, alerting and metric storage in our in-house tooling. It didn't really make sense. We thought about contributing metric filtering so that it becomes a little more similar to what we have in our Lambda function.
But we also put this aside, because we're not going to use the plugin; the biggest rock at hand that we want to move is migrating to Kubernetes. At that point, reporting those metrics might pose other challenges and might be done differently. We might handle this in our own controller that operates our cluster. We're still debating this.
If you want to use what I've just presented right away, you can enable this plugin in your test setup and start seeing metrics there. What the plugin lacks is a way of combining multiple metrics together. That's a feature available in CloudWatch itself: you can combine two metrics, merge them together into another metric and, based on that combined metric, you can scale. You have to do this with care as well: if it's the number of messages, that's easy and you don't need to care about a lot of math, but if you're combining CPU utilization with consumption rate, you have to be very aware of what you're combining.
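For completeness, here is a hedged sketch of CloudWatch metric math via boto3, combining the backlogs of two queues into one scaling signal. Metric names, dimensions, and the policy ARN are illustrative placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def backlog_metric(metric_id, queue_name):
    """One input metric for the math expression (ReturnData=False hides it from the alarm)."""
    return {
        "Id": metric_id,
        "ReturnData": False,
        "MetricStat": {
            "Metric": {
                "Namespace": "CustomerInbox/RabbitMQ",      # illustrative namespace
                "MetricName": "QueueBacklog",
                "Dimensions": [{"Name": "Queue", "Value": queue_name}],
            },
            "Period": 60,
            "Stat": "Average",
        },
    }

cloudwatch.put_metric_alarm(
    AlarmName="combined-backlog-high",
    EvaluationPeriods=5,
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    Metrics=[
        backlog_metric("m1", "render-requests"),
        backlog_metric("m2", "render-results"),
        # the alarm evaluates the metric-math expression: the sum of both backlogs
        {"Id": "combined", "Expression": "m1 + m2", "ReturnData": True},
    ],
    AlarmActions=["arn:aws:autoscaling:example"],           # placeholder policy ARN
)
```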
Across service boundaries
The last thing that I want to share from our setup is about crossing service boundaries, tracing, and what we have been doing recently. One year ago, we decided to adopt OpenTracing as an organization. The sheer number of services that we run in production, with different production loads, makes it unmanageable to get a bird's-eye view of the overall architecture and the different dependencies between microservices and systems.
Within our team alone, which is four multidisciplinary teams, we run 70 services. Zalando has 200 teams, and we run almost 6,000 microservices in production. Getting a full overview of a business transaction across product boundaries, without tracing, becomes a little bit hard. We've adopted OpenTracing and settled on LightStep as one of our means to visualize our traces. We also utilize the LightStep plugin in Grafana to oversee what's going on.
What has been beneficial in our case, when it comes to communication and the different communication workloads, is that we've defined streams, which are collections of spans and traces that cross multiple services, based on the business function that we care about. Generally, there is a stream for order placement. That doesn't just include our system; it includes checkout completed, order placed, payment preauthorization occurred, stock-level reservations occurred, order-position-level reservations occurred, and a business event was emitted to the event stream bus. Our system consumed it, and then everything starts inside our RabbitMQ cluster. That gives you a very holistic view of what's going on.
How we did this, within our team, is that we built a small client library for tracing everything going through RabbitMQ with an annotation, so that it's not too intrusive in our code base. You just annotate where you're publishing or consuming and, by default, that starts shipping the traces.
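Our library is a Java, annotation-based wrapper, so the sketch below only shows the underlying idea in Python with the OpenTracing API and pika: start a span around the publish and inject its context into the AMQP headers so the consumer can continue the same trace. Tag names, exchange, and routing key are illustrative.

```python
import pika
import opentracing
from opentracing.propagation import Format

# Illustrative only; assumes a tracer has been registered as the global tracer
# (e.g. a LightStep or Jaeger tracer) and a pika channel is already open.
def traced_publish(channel, exchange, routing_key, body):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("rabbitmq.publish") as scope:
        scope.span.set_tag("span.kind", "producer")
        scope.span.set_tag("message_bus.destination", exchange)

        # propagate the trace context in the message headers so the consuming
        # service can extract it and continue the same trace
        headers = {}
        tracer.inject(scope.span.context, Format.TEXT_MAP, headers)

        channel.basic_publish(
            exchange=exchange,
            routing_key=routing_key,
            body=body,
            properties=pika.BasicProperties(headers=headers),
        )
```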
Within each service dashboard and our product overview Grafana dashboard, we have that overview of our RabbitMQ performance within the context of what we want to serve as a business. When it comes to order confirmation, we have an overview of whether our production or consumption is breaching the SLO that we've defined for this exact use case, rather than just looking at RabbitMQ holistically, or at a queue, an exchange, or a service's publish and consumption rates, because our system that consumes and enriches data does this for around 200 events.
In order to have visibility on the specific use case that you're interested in, we figured out that relying on tracing and seeing all of this in our Grafana dashboard made more sense. We're able to see that we're meeting what we committed to as a team and as a product. That's the last thing I wanted to share with you.
I would recommend it. It hasn't been in production for very long; we instrumented our services a year ago. But now it's starting to pay off, because all the other teams have instrumented their systems too. If you check a span and a trace now, you can see what's going on throughout the whole organization.
[Applause]