CloudAMQP has long advised against using mingling, gossip, and heartbeats with Python Celery. Why? Mainly because of the flood of unnecessary messages they introduce. This article explores the purpose of these features and their impact on resource consumption, especially when RabbitMQ serves as the message broker for Celery.
Before diving into the intricacies of resource consumption, let's first explore the concepts of Mingling, Gossip, and Heartbeats in Celery.
Mingle
When a Celery worker starts up, it doesn't want to be left out of the loop. To ensure it's up-to-date with the latest information, it interacts with other existing workers. This process is called "mingling". During this interaction, the worker gathers information about tasks that have been revoked and the logical clocks of the other workers. Essentially, this is a synchronization step and it happens just once—at startup time.
Picture this as arriving at a meeting slightly late. Before diving into the agenda, you might quickly ask a colleague for any critical updates you missed in the initial minutes.
By default, mingle is enabled in a Celery worker but can be disabled.
Gossip
As workers perform their duties, they maintain a light chatter with one another. This is termed "Gossip". Right now, the main reason workers maintain this chatter is to synchronize their "clocks", or their sense of time. Unlike mingling, this synchronization happens continuously while the workers run.
This can be likened to a busy office environment, where colleagues might provide quick updates about their work progress or any new developments. Celery workers keep each other informed to ensure a harmonious and informed working environment.
By default, Gossip is enabled but can be disabled.
Heartbeats
Heartbeats are signals sent at regular intervals to verify the connectivity and liveliness of a worker. They're analogous to a pulse or heartbeat in living organisms, signifying that a system is active and functioning. Within Celery, they ensure that a worker is still connected to the broker and is operating as expected. If a Heartbeat fails, it's an indication that there might be an issue with the worker, and corrective actions can be taken.
By default, Heartbeat is also enabled but can be disabled.
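Each of the three features can be switched off individually when starting a worker. For example, to start a worker without the mingle step (the matching --without-gossip and --without-heartbeat flags work the same way, and all three appear together at the end of this article):

➜ celery -A python_celery worker --without-mingle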
While Mingling, Gossip, and Heartbeats play a role in the orchestration of workers, they also consume resources and can, under specific conditions, cause strain on the system.
We conducted some testing with Celery and a RabbitMQ instance on CloudAMQP. What follows are our findings.
Requirements
- A RabbitMQ instance running on CloudAMQP (Sassy Squirrel plan).
- Celery 5.1+ with Python 3.9+ running locally.
Starting Celery
Celery was started with the command:
➜ celery -A python_celery worker --concurrency=1 -n worker1@localhost --loglevel=debug
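For completeness, the python_celery module referenced by -A could look something like the following minimal sketch; the broker URL is a placeholder for the AMQP URL of your CloudAMQP instance:

# python_celery.py - a minimal Celery app (sketch, not the exact module used in our tests)
from celery import Celery

# Replace the placeholder with the AMQP URL shown on your CloudAMQP instance's details page
app = Celery("python_celery", broker="amqps://user:password@host/vhost")

@app.task
def add(x, y):
    return x + y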
The RabbitMQ web manager shows that three AMQP connections have been created from the `py-amqp` Python client, which represents the Celery app.
Figure 1 - Connections from Celery application
One connection has two channels, while the others have just one. The dual-channel connection is designated for the 'consumer' or 'Task worker'. While one channel is subscribed to the task queue, the other is subscribed to a queue named `celeryev.80cc1a4b-0c2f-4738-9198-480de1575c4d`. Interestingly, this latter queue receives Celery's heartbeat messages, which are distinct from AMQP heartbeats.
What about the remaining connections? They're for Celery's 'Mingling' and 'Gossip' features. Even though both features are enabled, only one of these connections is actively used by Celery. The 'Gossip' connection intermittently sends messages to the `celeryev` exchange with a routing key that is bound to the first connection's heartbeat queue, `celeryev.80cc1a4b-0c2f-4738-9198-480de1575c4d`.
On the other hand, the 'Mingling' connection remains inactive. We can even shut it down via the RabbitMQ web manager, and Celery's worker logs show no changes.
Celery has minimal documentation on the differences between 'Mingling' and 'Gossip'. In the source code, there are separate classes for both of them. Perhaps they will be merged in a future version?
Let's explore the heartbeat messages sent by the 'Gossip' class in Celery in more detail.
Deep Dive: Heartbeat Messages
Celery's 'Gossip' class sends out heartbeat messages in a specific manner. The messages are dispatched to the `celeryev` exchange with a routing key of `worker.[workername]`. The exchange is a topic exchange, with the `celeryev.[UUID]` queue bound to it using the binding key `worker.#`.
These messages are not persistent, which means that unless RabbitMQ is under severe memory stress, they will not be written to disk. We can inspect the messages by creating a new queue with the same binding to the `celeryev` exchange. Here is an example of the message body:
{"hostname": "worker2@localhost", "utcoffset": 5, "pid": 42572, "clock": 200, "freq": 2.0, "active": 0, "processed": 0, "loadavg": [2.06, 1.99, 2.16], "sw_ident": "py-celery", "sw_ver": "5.1.2", "sw_sys": "Darwin", "timestamp": 1651009149.021955, "type": "worker-heartbeat"}
There is perhaps some useful information in the message above: every two seconds, each worker is reminded that the other workers (and their separate event connections) are still alive and publishing. But nothing here is critical to the tasks the worker has to perform.
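As mentioned above, the messages can be inspected by binding an extra queue to the `celeryev` exchange. Here is a minimal sketch of such an inspection consumer using pika; the queue name and connection URL are placeholders, not anything Celery creates itself:

import json
import pika

# Placeholder URL; replace with the AMQP URL of your CloudAMQP instance
params = pika.URLParameters("amqps://user:password@host/vhost")
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Extra queue bound to the celeryev topic exchange with the same binding key
# that Celery's own event queues use
channel.queue_declare(queue="celeryev.inspect", auto_delete=True)
channel.queue_bind(queue="celeryev.inspect", exchange="celeryev", routing_key="worker.#")

def on_message(ch, method, properties, body):
    # Pretty-print each heartbeat event as it arrives
    print(json.dumps(json.loads(body), indent=2))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="celeryev.inspect", on_message_callback=on_message)
channel.start_consuming()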
What happens if we remove the binding so that the worker's connection continues to publish these messages but they are never received? In that case, RabbitMQ will drop the messages. Curiously, no log entries are printed in the Celery client. It does not care that it is not receiving its updates. That leads one to wonder why they are sent.
Let's try another experiment. Given that we know the format of messages (and their headers, which are not detailed here), we can publish our own custom message to the exchange pretending to be a 'fake' worker. In this case, the Celery worker does take note of this and prints entries to the log:
[2022-04-29 11:20:14,239: DEBUG/MainProcess] workerfake@localhost joined the party
[2022-04-29 11:20:57,969: INFO/MainProcess] missed heartbeat from workerfake@localhost
The Celery worker notes that the 'fake' worker joined (it actually didn't; there is no additional consumer on the task queue) and then notes that no further messages were published by the fake worker. This might be useful if your application parses the Celery logs for messages like these. However, you would be much better off monitoring the number of consumers and the number of messages enqueued on the task queue.
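For anyone who wants to reproduce this experiment, hand-crafting the raw AMQP message works, but a simpler route is to let Celery's own event dispatcher take care of the headers and serialization. A minimal sketch, assuming the python_celery module exposes the Celery app instance and that the fake hostname is arbitrary:

# Publish a single fake worker-heartbeat event to the celeryev exchange
from python_celery import app  # the Celery() instance used by the workers

with app.connection() as connection:
    dispatcher = app.events.Dispatcher(connection, hostname="workerfake@localhost")
    # The dispatcher fills in hostname, clock, pid and timestamp for us;
    # the remaining fields mimic the message body shown earlier
    dispatcher.send("worker-heartbeat", freq=2.0, active=0, processed=0,
                    sw_ident="py-celery", sw_ver="5.1.2", sw_sys="Darwin")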
Scaling with Multiple Workers
Finally, we can consider what happens as the number of task workers increases. Each worker's gossip connection will publish a message every two seconds to the `celeryev` exchange.
Each worker's heartbeat message consumer has its own queue, and every queue receives messages from every other worker. This means that the messages delivered to workers scale with the square of the number of workers. If you have a small number of task workers, this is not a huge issue.
However, CloudAMQP has some customers with nearly 1000 task workers. At that scale, with each of the 1000 workers publishing a heartbeat every two seconds and every message copied to every worker's event queue, the delivery rate approaches half a million messages per second. That throughput can cause RabbitMQ to run out of memory. On our free Lemur plan, gossip and heartbeat traffic alone can easily exceed the monthly message limit.
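A quick back-of-the-envelope sketch of this fan-out, assuming one heartbeat every two seconds per worker and every worker's event queue bound with `worker.#`:

# Rough estimate of heartbeat deliveries per second for N workers
def heartbeat_delivery_rate(workers, interval_seconds=2.0):
    published_per_second = workers / interval_seconds  # one heartbeat per worker per interval
    return published_per_second * workers              # each message is copied to every worker's event queue

print(heartbeat_delivery_rate(10))    # 50.0 deliveries/second - negligible
print(heartbeat_delivery_rate(1000))  # 500000.0 deliveries/second - serious pressure on the cluster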
Mitigating the Message Overload
There are several strategies that could be used to reduce the load on RabbitMQ caused by these heartbeat messages. A partial solution is to decrease the frequency of message publishing. However, as of Celery 5.2, we couldn't find any configuration option to control this. A knowledgeable Python user can edit the source code of the 'Heart' class (celery/worker/heartbeat.py) to increase the interval from two seconds to a larger value.
Another approach available in RabbitMQ 3.12 is to utilize stream queues for the heartbeat queues. Messages can be published to a single queue, and all consumers can read from it. In this scenario, the message throughput scales linearly with the number of workers. Implementing this change would require significant modifications to the Celery codebase.
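Purely to illustrate the broker side of that idea (Celery itself would still need the changes mentioned above), a stream queue is declared over AMQP simply by setting the x-queue-type argument; a minimal pika sketch with placeholder names:

import pika

connection = pika.BlockingConnection(pika.URLParameters("amqps://user:password@host/vhost"))
channel = connection.channel()

# One shared stream for all heartbeat events; streams must be durable,
# and every consumer can read the same messages independently
channel.queue_declare(
    queue="celeryev.heartbeats",
    durable=True,
    arguments={"x-queue-type": "stream"},
)
channel.queue_bind(queue="celeryev.heartbeats", exchange="celeryev", routing_key="worker.#")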
CloudAMQP's Suggestions
We have observed that default configurations for Celery can result in excessive message traffic that provides little utility and can cause severe stress on the RabbitMQ cluster. CloudAMQP recommends specifying flags that disable heartbeat, gossip, and mingling. The command line would look like this:
➜ celery -A python_celery worker --concurrency=1 -n worker1@localhost --loglevel=debug --without-mingle --without-gossip --without-heartbeat
There are numerous advantages to using these flags:
- 1 AMQP connection with 1 channel per worker instead of 3 AMQP connections and 4 channels
- The only messages managed by RabbitMQ are tasks to be performed by workers
- Alarms can easily be configured on the number of tasks in the message queue or the number of consumers subscribed to a queue; Celery logs are not needed to estimate the number of workers
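As an example of the last point, the consumer count and queue depth can be read straight from the RabbitMQ management HTTP API (the same API behind the CloudAMQP console); a sketch with placeholder host, vhost, and credentials:

import requests

# Placeholder host, vhost, and credentials; 'celery' is the default task queue name
resp = requests.get("https://host/api/queues/vhost/celery", auth=("user", "password"))
queue = resp.json()
print(queue["consumers"], queue["messages"])  # workers subscribed, tasks waiting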
Conclusion
When using Python Celery with RabbitMQ, understanding features like Heartbeats, Mingling, and Gossip is crucial. These features can significantly impact message traffic and resource consumption. To optimize performance and resource usage, consider following CloudAMQP's recommendations.
Ready to start using Celery with RabbitMQ in your architecture? CloudAMQP is one of the world's largest RabbitMQ cloud hosting providers. In addition to RabbitMQ, we have also created our in-house message broker, LavinMQ - we benchmarked its throughput at around 1,000,000 messages/sec.
Easily create a free RabbitMQ or free LavinMQ instance today to start testing out RabbitMQ with Celery. You will be asked to sign up first if you do not have an account, but it’s super easy to do. For any suggestions, questions, or feedback, get in touch with us at contact@cloudamqp.com