In this talk from RabbitMQ Summit 2019 we listen to Kacper Mentel from Erlang Solutions.
RabbitMQ runs on a battle-tested Erlang VM which provides Distributed, Fault-Tolerant, Highly Available and Robust systems out of the box. But this doesn't always prevent our RabbitMQ installation from misbehaving without clear signs of what's wrong. Here Tracing, another Erlang feature, comes to the rescue!
Short biography
Kacper Mentel (GitHub) is an Erlang/Elixir Developer at Erlang Solutions. He has also worked as a RabbitMQ consultant, auditing RabbitMQ installations, debugging issues and designing solutions.
Do not guess, Trace
I’m glad I can be here, giving this talk to you. I would like to share with you the debugging process of an airline application system, along with the tools that leverage one of the Erlang VM’s coolest features, Erlang Tracing; RabbitMQ is software powered by the Erlang virtual machine. Before we begin, a few words about myself: my name is Kacper Mentel, I’m a Software Engineer at Erlang Solutions. I’ve been working as a RabbitMQ consultant, and apart from that I’m also involved in load testing and development of our open-source XMPP server written in Erlang, MongooseIM.
Agenda
The presentation is split into three parts. The first part is the “real-world” problem, in which I will describe the problem we had and how we reproduced it with load tests, along with the results. The second part is about the actual debugging process: I will try to answer the question of where to start, and I will present the tools that I found useful and applied during the debugging process. In the last part I will explain the solution to the problem.
“Real-world” problem
When working as a RabbitMQ consultant, I saw many different RabbitMQ installations, each with their own problems. The problem I’m going to focus on throughout this presentation is one of these.
The Problem
One of our customers had a system with RabbitMQ at its core. As a side note, the RabbitMQ version used in that system was 3.7.9. Among the applications that interacted with RabbitMQ was one that made processing decisions based on queue data, such as the number of ready messages in each queue. The diagram depicts what the system looked like. We have a RabbitMQ node with the management plug-in enabled. The management plug-in is a component that allows you to manage RabbitMQ to some extent and provides insights into the system, letting you get data and statistics related to the RabbitMQ installation.
It also exposes an HTTP API that can be used to get statistics about the system, for example information about the queues. So the application was periodically sending HTTP requests to the management plug-in’s /api/queues endpoint to get data about the queues, and the plug-in returned that data to the application.
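As an illustration, polling that endpoint from Erlang could look roughly like this (the host and the default guest/guest credentials are assumptions, not details from the talk):

    %% A minimal sketch of how an application might poll the management API
    %% for queue data, using OTP's built-in httpc client.
    fetch_queue_data() ->
        ok = application:ensure_started(inets),
        Auth = "Basic " ++ base64:encode_to_string("guest:guest"),
        {ok, {{_, 200, _}, _Headers, Body}} =
            httpc:request(get,
                          {"http://localhost:15672/api/queues",
                           [{"authorization", Auth}]},
                          [], []),
        %% Body is a JSON string: an array with one object per queue,
        %% including fields such as "name" and "messages_ready".
        Body.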
But from time to time it happened that obtaining the data from the API took a long time. Sometimes the application was not even able to get any response from the API. Also, as the application was not fed with fresh data, it could not make optimal decisions and the system became inefficient. This was a real problem for our customer.
The Plan
To find the root cause of the problem, we first wanted to reproduce it. Once we could do that, we could find the Erlang function in the management plug-in source code that handles the HTTP requests coming to the plug-in’s HTTP API. And once we knew exactly which function was responsible, we could learn how that function spends its execution time.
Load Test Platform
We ran load tests to reproduce the issue. For the load tests we used our continuous load testing platform, which was originally developed for load testing MongooseIM. The diagram shows the logical architecture of the platform. What’s worth remembering from the diagram is that we have two Docker hosts: one for the system under test, which in our case was just a cluster of RabbitMQ nodes, and the other for the load generators.
Every application that runs on our load test platform runs inside a Docker container. For load generation we use Amoc (A Murder of Crows). This is a load generator created at Erlang Solutions and written in Erlang. It is open source and designed to support massively parallel load tests.
When it comes to how Amoc works, it is fairly simple. Amoc has an abstraction called a user, which is just an Erlang process underneath, and a user follows a certain scenario. A scenario is just an Erlang module, so we can simulate very complex behaviours; the only thing that limits us is Erlang itself.
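As a rough illustration, a scenario module skeleton might look like this (a hypothetical sketch based on the amoc_scenario behaviour as I recall it, not the scenario used in these tests):

    %% Hypothetical Amoc scenario skeleton; callback names follow the
    %% amoc_scenario behaviour, and the details are an assumption.
    -module(example_scenario).
    -behaviour(amoc_scenario).
    -export([init/0, start/1]).

    %% Runs once, before any users are started.
    init() ->
        ok.

    %% Runs for every simulated "user"; each user is an Erlang process.
    start(_UserId) ->
        %% A user could, for example, open an AMQP connection, publish or
        %% consume messages, or poll an HTTP endpoint in a loop.
        timer:sleep(timer:minutes(1)),
        ok.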
The last thing is metrics: system-level metrics, such as CPU utilization of the Docker containers, and application-level metrics, such as the number of queues in RabbitMQ or custom metrics implemented in an Amoc scenario, are gathered and stored in an InfluxDB database to be visualized later.
We used the load test scenario below to recreate the problem. The diagram shows a simplified version of the scenario; we just multiplied it in the final load test. So we had a queue with two consumers (marked C, with blue arrows) and one publisher (marked P, with a red arrow). There is also another Amoc user, marked A with a green arrow, which simulates the application that was sending HTTP requests to the management plug-in HTTP API. In the final load test, it sent an HTTP request every two seconds.
Final Load Test Specification
First graph
So, for the final load test specification, we had three RabbitMQ nodes, 3,000 queues, 3,000 publishers, 6,000 consumers, and 20 connections that were shared between the AMQP clients. Each node had 7 CPU cores available during the load test. On the graph, 100% corresponds to one fully utilized CPU core, so each node consumed about 6 of its 7 CPU cores.
Second graph
Two of the nodes consumed about 1.4 gigabytes of memory each. The green line shows a node with higher memory consumption, which may be because that node had to handle more HTTP requests than the others.
Third graph
This shows the response times of the management plug-in HTTP API requests to the /api/queues endpoint. The red line shows the median response time, about 1.5 seconds. The other lines show the 95th, 99th and 99.9th percentiles and the maximum values. During peaks, the maximum reached about 20 seconds.
With these results, we considered our problem reproduction successful.
Debugging Process
Once we had the problem reproduced, we could start the actual debugging process. The idea was to find the Erlang function that is called when a request to the /api/queues endpoint comes in.
Where to start?
First, we can manually look through the source code to find functions that might be called when a request to this management API endpoint comes in. Once we have some candidates, we can verify which one is correct using Erlang Tracing.
The management plug-in uses Cowboy, a well-known Erlang HTTP server, for serving HTTP requests. In Cowboy there is a mapping between HTTP endpoints and Erlang modules: each endpoint has a dedicated callback module, and one of the Cowboy callbacks is the to_json function. The rabbit_mgmt_wm_queues:to_json function seemed to be the one that handles these HTTP requests.
In the diagram you can see that when a request to /api/queues comes in, it is handled by Cowboy, which in turn calls the to_json callback, where the actual business logic resides; this is where data about the queues is gathered. That was only a hypothesis, which we confirmed using Erlang Tracing.
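To make the endpoint-to-module mapping concrete, here is a minimal sketch of a Cowboy REST callback module of this shape; this is illustrative boilerplate (the module name and payload are made up), not the actual rabbit_mgmt_wm_queues code:

    %% Illustrative Cowboy 2.x REST handler, not the real management plug-in module.
    -module(example_queues_handler).
    -export([init/2, content_types_provided/2, to_json/2]).

    init(Req, State) ->
        %% Tell Cowboy to drive this handler through the REST state machine.
        {cowboy_rest, Req, State}.

    %% Map the "application/json" content type to the to_json/2 callback.
    content_types_provided(Req, State) ->
        {[{<<"application/json">>, to_json}], Req, State}.

    %% The business logic lives here: gather queue data and encode it as JSON.
    to_json(Req, State) ->
        Body = <<"[]">>,   %% placeholder for the real queue data
        {Body, Req, State}.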
Erlang Tracing in a Nutshell
Erlang Tracing is a neat feature of the BEAM that lets you inspect function calls and message passing directly on a live VM, in real time. What does it look like in practice? You can attach an Erlang remote shell to a running Erlang node and enable tracing there. You provide a pattern that matches a particular function (or functions), and whenever that function is called, information about the call appears in the tracing process’s mailbox, or is printed for you if you use a higher-level tracing library such as recon.
To confirm our hypothesis, we first attached an Erlang shell to the running RabbitMQ node and then enabled tracing of the to_json function using recon. We passed the module name, the function name and the arity of the function we wanted to trace, along with the maximum number of trace events to receive. The output showed that the pattern we provided matched exactly one function in the system. Once we enabled tracing, we sent an HTTP request to the management plug-in manually using curl (not visible in the snippet), and the trace popped up. As you can see, it contains information about that particular function call: the module name, the function name and the arguments that were passed to the call. In this way we confirmed that we had found the right function.
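A sketch of roughly what this looked like in the remote shell (the arity, the trace limit and the exact output format here are illustrative, not copied from the slides):

    %% recon_trace:calls({Module, Function, Arity}, MaxTraces) enables tracing;
    %% the return value is the number of trace patterns that matched.
    1> recon_trace:calls({rabbit_mgmt_wm_queues, to_json, 2}, 10).
    1
    %% After `curl http://localhost:15672/api/queues` a trace event is printed,
    %% showing the calling process, module, function and arguments, roughly:
    %% <0.1294.0> rabbit_mgmt_wm_queues:to_json(Req, Context)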
Profiling the Function
One of the techniques that helps with profiling is flame graphs. Flame graphs are a visualization of profiled software that allows the most frequent code paths to be identified quickly and accurately. This fitted our case perfectly, as we wanted to identify the most expensive code behind the API.
Here you can see an example flame graph. It shows the stack trace of a traced function: in other words, which functions were called from the traced function and how many resources they consumed, relative to the traced function. Such a visualization lets us quickly identify suspicious places in the code, as the longer a function’s bar, the more resources it is likely to consume. For example, the dispatch_command function on the graph, together with the functions it calls (the tree visible on the left-hand side), consumes the majority of the resources.
For Erlang we have the eflame library, which gathers traces of a particular function and builds a flame graph out of them.
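Using eflame looks roughly like this (the function names and the conversion script are from the eflame project as I remember it, so treat the details as an assumption):

    %% Run the target function under eflame; the call below and its arguments
    %% are placeholders, not the exact invocation we used.
    eflame:apply(rabbit_mgmt_wm_queues, to_json, [Req, Context]).
    %% By default the stack samples land in a file (stacks.out), which the
    %% eflame repository's script turns into an SVG flame graph, e.g.:
    %%   ./stack_to_flame.sh < stacks.out > flame.svg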
We decided to use eflame in our load tests, but the question that arose was how to inject the library into our RabbitMQ Docker image. For this we created a RabbitMQ Docker repository that makes it easy to build a RabbitMQ Docker image from custom source code, as shown in the diagram below.
We performed a number of measurements and got two different flame graphs. The green rectangles in these graphs highlight the key difference, and they are quite interesting: the main difference is the run_augmentation function. What does that mean? From the load test results and manual code analysis we knew that queries to the management API sometimes completed quickly and were slower at other times.
Here, the median of the HTTP API response times was 1.5 seconds, while the maximum values were about 20 seconds. We also manually measured the execution time of the function, and the results were consistent with these observations. This behaviour is due to a cache in the management plug-in: when the cache is valid, the queue data is returned almost immediately, but when the cache is invalid, fresh data must be gathered, which takes time. Although the two graphs have the same width, they represent different execution times: the first graph shows the case when the query completes quickly, and the second shows slower query processing.
Note that in the second graph the augmentation function takes so much time because fresh data needs to be gathered. So why don’t we see the data collection function itself in these graphs? This is because the cache sits in another Erlang process.
The gen_server:call visible in the flame graph makes the actual call to the cache process and waits for a response. Depending on the cache state, the response comes back quickly or slowly.
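A stripped-down sketch of this caching pattern (the module, message names and TTL are made up; this is not the actual management-agent code):

    %% Hypothetical gen_server-based cache, illustrating the pattern only.
    -module(queue_data_cache).
    -behaviour(gen_server).
    -export([start_link/0, fetch/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    -define(TTL_MS, 5000).  %% assumed time-to-live for cached data

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Callers block here until the cache process replies.
    fetch() ->
        gen_server:call(?MODULE, fetch).

    init([]) ->
        {ok, #{data => undefined, fetched_at => 0}}.

    handle_call(fetch, _From, State = #{data := Data, fetched_at := T}) ->
        Now = erlang:monotonic_time(millisecond),
        case Data =/= undefined andalso Now - T < ?TTL_MS of
            true ->
                %% Cache valid: reply immediately with the cached data.
                {reply, Data, State};
            false ->
                %% Cache invalid: gather fresh (expensive) data first.
                Fresh = collect_queue_stats(),
                {reply, Fresh, State#{data := Fresh, fetched_at := Now}}
        end.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    collect_queue_stats() ->
        %% Stands in for the expensive data collection step.
        [].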
The diagram visualizes the cache process. A fetch request is sent to it; when the cache is valid, the data is returned to the original caller immediately. When the cache is invalid, fresh data needs to be gathered first, and only then is the data about the queues returned to the caller. The fetch request and the queue data are sent as Erlang messages, because that is how Erlang processes communicate with each other. The data collection is implemented in the list_queue_stats function, which is invoked from the cache process. We tried to profile that function with eflame, which led to the following results:
- The Erlang heap memory allocator tried to allocate 42 gigabytes of memory
- eflame takes advantage of Erlang Tracing to generate flame graphs
- As a result, it got overloaded by the sheer number of trace events generated by the traced function
Profiling with fprof
This led us to fprof, which according to the official Erlang documentation is a time profiling tool that uses trace to file for minimal runtime performance impact. Indeed, the tool coped with the data collection function; however, it took several minutes to produce the results.
The output was quite big, so these slides contain only the crucial lines. fprof:apply profiles a function and takes the module name, function name and arguments. We profiled the list_queue_stats function, which is responsible for gathering data about the queues. The output is organized into entries: in each entry, the function marked with a percent sign is the one the entry concerns, and the functions listed below it were called from the marked function.
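For reference, a typical fprof session looks roughly like this (Module, Function and Args are placeholders for the function being profiled):

    %% Run the function under the profiler; the trace is written to a file.
    fprof:apply(Module, Function, Args).
    %% Turn the raw trace into profiling data held by the fprof server.
    fprof:profile().
    %% Write the human-readable analysis (the entries discussed here) to a file.
    fprof:analyse([{dest, "fprof.analysis"}]).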
The third column, ‘ACC’, shows the total execution time of the function in milliseconds; for example, the total execution time of the current function equals 58 milliseconds. This specific entry concerns the list_queue_stats function, and notice that it spent most of its execution time in one particular function lower in the call tree. We can then find the entry that concerns that function and, in turn, check where it spent most of its execution time.
In this way we can walk through the fprof output and track down the real cause of the slow management API.
Reading through the fprof output in this way, we ended up with the following entry:
    { {exometer_slide,to_normalized_list,6}, 3713, 364774.737, 206.874}
This function in turn called two suspicious functions, both from the same module. Digging into them gave further insight:
    { {orddict,find,2}, 19825884, 133239.862, 132804.853}
We found that a lot of the execution time, approximately 86%, was consumed by these two functions. This led us to the exometer_slide module of the RabbitMQ management agent plug-in; if you check that module, you will find the connection between these functions.
At this stage we decided to close the investigation; there is no doubt that the issue lies there. We will share our findings with the community so that we can come up with a reliable solution together. However, this part of the investigation did not happen while we were at the customer site, but afterwards. The solution we applied for our customer took a slightly different approach. So let’s think for a while: is there another way of obtaining the queue names and the corresponding number of messages in them?
It turned out that there is a function in RabbitMQ’s rabbit_amqqueue module that gives us exactly the information we were interested in, directly from the queue processes. So we could use that API from a custom RabbitMQ plug-in exposing an HTTP endpoint, which would return that information when queried. We turned that idea into reality and built a proof-of-concept RabbitMQ plug-in called rabbitmq-queue-info that does exactly this:
It exposes an HTTP endpoint called list_queues; when a request to that endpoint comes in, information about the queues is gathered directly from the queue processes and returned to the calling entity.
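The core idea can be sketched as follows (a simplification, not the actual plug-in code; it assumes RabbitMQ’s internal rabbit_amqqueue API with list/0 and info/2):

    %% Gather the name and ready-message count directly from each queue process.
    -module(queue_info_sketch).
    -export([list_queues/0]).

    list_queues() ->
        [rabbit_amqqueue:info(Q, [name, messages_ready])
         || Q <- rabbit_amqqueue:list()].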
We load tested the plug-in with a specification exactly the same as the one used for the management HTTP API load test. Here are the results:
The first graph depicts the management HTTP API response times, as shown previously. The second graph shows the response times of HTTP requests to the list_queues endpoint of the rabbitmq-queue-info plug-in. Note that the median is below 200 milliseconds, while the percentiles and maximum values are slightly higher, but in the majority of cases still below 300 milliseconds.
The performance of the rabbitmq-queue-info plug-in was definitely better than that of the management HTTP API.
Shout-outs
Finally, I would like to say thank you to my brother, Szymon Mentel, who helped me a lot with this whole investigation, and to Andrew Talansky, who was also involved in the whole debugging process.
[Applause]