
Monitoring

Overview

This document provides an overview of topics related to RabbitMQ monitoring. Monitoring your RabbitMQ installation is an effective means to detect issues before they affect the rest of your environment and, eventually, your users.

Many aspects of the system can be monitored. This guide groups them into a handful of categories:

  • Infrastructure and kernel metrics
  • RabbitMQ metrics: cluster-wide, per node and per queue
  • Application-level metrics

Log aggregation across all nodes and applications is closely related to monitoring and also covered in this guide.

A number of popular tools, both open source and commercial, can be used to monitor RabbitMQ.

Infrastructure and Kernel Metrics

The first step towards a useful monitoring system is to collect infrastructure and kernel metrics. There are quite a few of them, but some are more important than others. The following metrics should be monitored on all nodes in a RabbitMQ cluster and, if possible, on all instances that host RabbitMQ clients:

  • CPU stats (user, system, iowait & idle percentages)
  • Memory usage (used, buffered, cached & free percentages)
  • Virtual Memory statistics (dirty page flushes, writeback volume)
  • Disk I/O (operations & amount of data transferred per unit time, time to service operations)
  • Free disk space on the mount used for the node data directory
  • File descriptors used by beam.smp vs. max system limit
  • TCP connections by state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT)
  • Network throughput (bytes received, bytes sent) & maximum network throughput
  • Network latency (between all RabbitMQ nodes in a cluster as well as to/from clients)

There is no shortage of existing tools (such as Prometheus or Datadog) that collect infrastructure and kernel metrics, store them, and visualise them over periods of time.
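
As an illustration of what basic collection can look like, the sketch below samples a few of the metrics listed above using the psutil Python library (an assumption of this example, not something RabbitMQ ships with); in practice, an agent from one of the tools mentioned above would do this for you.

    # A minimal sketch of sampling a few infrastructure metrics with psutil.
    # Assumes `pip install psutil`; field availability varies by platform
    # (e.g. iowait and buffers/cached are Linux-specific).
    import psutil

    def sample_infrastructure_metrics(data_dir="/var/lib/rabbitmq"):
        # data_dir is an assumption; use the mount that holds the node data directory
        cpu = psutil.cpu_times_percent(interval=1)   # user, system, idle percentages
        mem = psutil.virtual_memory()                # used, free, percent
        disk = psutil.disk_usage(data_dir)           # free space on the data directory mount
        net = psutil.net_io_counters()               # bytes_sent, bytes_recv since boot

        return {
            "cpu.user_percent": cpu.user,
            "cpu.system_percent": cpu.system,
            "cpu.idle_percent": cpu.idle,
            "mem.used_percent": mem.percent,
            "disk.free_bytes": disk.free,
            "net.bytes_sent": net.bytes_sent,
            "net.bytes_recv": net.bytes_recv,
        }

    if __name__ == "__main__":
        print(sample_infrastructure_metrics())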

RabbitMQ Metrics

The RabbitMQ management plugin provides an API for accessing RabbitMQ metrics. The plugin will store up to one day's worth of metric data. Longer term monitoring should be accomplished with an external tool. Collecting and storing metrics for the long term is critically important for anomaly detection, root cause analysis, trend detection and capacity planning.

This section will cover multiple RabbitMQ-specific aspects of monitoring.

Monitoring Clusters

When monitoring clusters it is important to understand the guarantees provided by the HTTP API. In a clustered environment every node can serve metric endpoint requests. This means that it is possible to retrieve cluster-wide metrics from any node, assuming the node is capable of contacting its peers. That node will collect and aggregate data from its peers as needed before producing a response.

Every node can also serve requests to endpoints that provide node-specific metrics for itself as well as for other cluster nodes. Like infrastructure and OS metrics, node-specific metrics must be stored for each node individually. However, HTTP API requests can be issued to any node.

As mentioned earlier, inter-node connectivity issues will affect HTTP API behaviour. It is therefore recommended that a random online node is chosen for monitoring requests, e.g. with the help of a load balancer or round-robin DNS.

Note that some endpoints (e.g. aliveness check) perform operations on the contacted node specifically; they are an exception, not the rule.
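
The following sketch illustrates the random-node approach in Python using the requests library; the node list, port and credentials are placeholders for this example, and a load balancer or round-robin DNS entry in front of the cluster achieves the same effect without any client-side logic.

    # Query cluster-wide metrics from any reachable node, trying nodes in random order.
    # NODES, the credentials and the port are assumptions for this sketch.
    import random
    import requests

    NODES = ["rabbit-1.example.com", "rabbit-2.example.com", "rabbit-3.example.com"]
    AUTH = ("guest", "guest")  # default management credentials; use real ones in production

    def fetch_overview():
        for host in random.sample(NODES, len(NODES)):
            try:
                resp = requests.get(f"http://{host}:15672/api/overview", auth=AUTH, timeout=5)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                continue  # node unreachable or unhealthy, try the next one
        raise RuntimeError("no cluster node responded to the monitoring request")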

Frequency of Monitoring

Many monitoring systems poll their monitored services periodically. How often that's done varies from tool to tool but usually can be configured by the operator.

It is recommended that metrics are collected at 60 second intervals (or, if a higher update rate is desired, at intervals of no less than 30 seconds). More frequent collection will increase load on the system and provide no practical benefit for monitoring purposes.
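
For example, a collection loop built around fetch_overview() from the previous sketch could poll on a 60 second schedule along these lines (real monitoring systems handle scheduling, jitter and retries themselves):

    # Poll at 60-second intervals; fetch_overview() is the helper from the previous sketch.
    import time

    POLL_INTERVAL = 60  # seconds; 30 is a reasonable lower bound if a higher update rate is needed

    while True:
        started = time.monotonic()
        try:
            overview = fetch_overview()
            print(overview["object_totals"])   # forward metrics to your storage backend here
        except RuntimeError as exc:
            print(f"collection failed: {exc}")
        time.sleep(max(0.0, POLL_INTERVAL - (time.monotonic() - started)))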

Cluster-wide Metrics

Cluster-wide metrics provide the highest possible level view of cluster state. Some (cluster links, detected network partitions) represent node interaction, others are aggregates of certain metrics across all cluster members. They are complementary to infrastructure and node metrics.

GET /api/overview is the HTTP API endpoint that returns cluster-wide metrics.

Metric | JSON field name
Cluster name | cluster_name
Cluster-wide message rates | message_stats
Total number of connections | object_totals.connections
Total number of channels | object_totals.channels
Total number of queues | object_totals.queues
Total number of consumers | object_totals.consumers
Total number of messages (ready plus unacknowledged) | queue_totals.messages
Number of messages ready for delivery | queue_totals.messages_ready
Number of unacknowledged messages | queue_totals.messages_unacknowledged
Messages published recently | message_stats.publish
Message publish rate | message_stats.publish_details.rate
Messages delivered to consumers recently | message_stats.deliver_get
Message delivery rate | message_stats.deliver_get_details.rate
Other message stats | message_stats.* (see this document)
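
As an illustration, the fields above can be extracted from the decoded JSON response along these lines (a sketch that assumes a dict such as the one returned by fetch_overview() in the earlier example):

    # Pick out the cluster-wide metrics listed above from a decoded
    # GET /api/overview response. The message_stats keys only appear once
    # there has been message activity, hence the defensive .get() calls.
    def extract_cluster_metrics(overview):
        stats = overview.get("message_stats", {})
        return {
            "cluster_name": overview["cluster_name"],
            "connections": overview["object_totals"]["connections"],
            "channels": overview["object_totals"]["channels"],
            "queues": overview["object_totals"]["queues"],
            "consumers": overview["object_totals"]["consumers"],
            "messages": overview["queue_totals"]["messages"],
            "messages_ready": overview["queue_totals"]["messages_ready"],
            "messages_unacknowledged": overview["queue_totals"]["messages_unacknowledged"],
            "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
            "deliver_get_rate": stats.get("deliver_get_details", {}).get("rate", 0.0),
        }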

Node Metrics

There are two HTTP API endpoints that provide access to node-specific metrics:

  • GET /api/nodes/{node} returns stats for a single node
  • GET /api/nodes returns stats for all cluster members

The latter endpoint returns an array of objects. Monitoring tools that support (or can support) that as an input should prefer that endpoint, since it reduces the number of requests that have to be issued. When that's not the case, the former endpoint can be used to retrieve stats for every cluster member in turn. That implies that the monitoring system is aware of the list of cluster members.

Most of the metrics represent point-in-time absolute values. Some, however, represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are primarily useful when compared to their previous values and historical mean/percentile values.

Metric | JSON field name
Total amount of memory used | mem_used
Memory usage high watermark | mem_limit
Is a memory alarm in effect? | mem_alarm
Free disk space low watermark | disk_free_limit
Is a disk alarm in effect? | disk_free_alarm
File descriptors available | fd_total
File descriptors used | fd_used
File descriptor open attempts | io_file_handle_open_attempt_count
Sockets available | sockets_total
Sockets used | sockets_used
Message store disk reads | message_stats.disk_reads
Message store disk writes | message_stats.disk_writes
Inter-node communication links | cluster_links
GC runs | gc_num
Bytes reclaimed by GC | gc_bytes_reclaimed
Erlang process limit | proc_total
Erlang processes used | proc_used
Runtime run queue | run_queue
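
A similar sketch for the node endpoints is shown below; the hostname, port and credentials are placeholders, and the cumulative GC counters are turned into per-poll deltas as suggested above.

    # Fetch per-node metrics from GET /api/nodes and derive deltas for cumulative counters.
    import requests

    AUTH = ("guest", "guest")   # default management credentials; use real ones in production
    previous_gc_runs = {}       # node name -> gc_num seen on the previous poll

    def fetch_node_metrics(host="rabbit-1.example.com"):
        resp = requests.get(f"http://{host}:15672/api/nodes", auth=AUTH, timeout=5)
        resp.raise_for_status()
        for node in resp.json():        # one object per cluster member
            name = node["name"]
            gc_delta = node["gc_num"] - previous_gc_runs.get(name, node["gc_num"])
            previous_gc_runs[name] = node["gc_num"]
            yield {
                "node": name,
                "mem_used": node["mem_used"],
                "mem_alarm": node["mem_alarm"],
                "disk_free_alarm": node["disk_free_alarm"],
                "fd_used": node["fd_used"],
                "fd_total": node["fd_total"],
                "proc_used": node["proc_used"],
                "run_queue": node["run_queue"],
                "gc_runs_since_last_poll": gc_delta,
            }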

Individual Queue Metrics

Individual queue metrics are made available through the HTTP API via the GET /api/queues/{vhost}/{qname} endpoint.

Metric | JSON field name
Memory | memory
Total number of messages (ready plus unacknowledged) | messages
Number of messages ready for delivery | messages_ready
Number of unacknowledged messages | messages_unacknowledged
Messages published recently | message_stats.publish
Message publishing rate | message_stats.publish_details.rate
Messages delivered recently | message_stats.deliver_get
Message delivery rate | message_stats.deliver_get_details.rate
Other message stats | message_stats.* (see this document)
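
A corresponding sketch for a single queue might look like the following; the queue name, hostname and credentials are placeholders, and note that the default virtual host, /, must be percent-encoded as %2F in the URL.

    # Fetch metrics for one queue via GET /api/queues/{vhost}/{qname}.
    # Queue name, hostname and credentials are placeholders for this sketch.
    from urllib.parse import quote

    import requests

    def fetch_queue_metrics(vhost="/", queue="orders", host="rabbit-1.example.com",
                            auth=("guest", "guest")):
        # quote(..., safe='') encodes the default vhost "/" as %2F
        url = f"http://{host}:15672/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"
        resp = requests.get(url, auth=auth, timeout=5)
        resp.raise_for_status()
        q = resp.json()
        stats = q.get("message_stats", {})  # absent until the queue has seen activity
        return {
            "memory": q["memory"],
            "messages": q["messages"],
            "messages_ready": q["messages_ready"],
            "messages_unacknowledged": q["messages_unacknowledged"],
            "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
            "deliver_get_rate": stats.get("deliver_get_details", {}).get("rate", 0.0),
        }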

Monitoring Tools

The following is an alphabetised list of third-party tools that collect RabbitMQ metrics. These tools can monitor the system and RabbitMQ metrics recommended above. Note that this list is by no means complete.

Monitoring Tool | Online Resource(s)
AppDynamics | AppDynamics, GitHub
collectd | GitHub
DataDog | DataDog RabbitMQ integration, GitHub
Ganglia | GitHub
Graphite | Tools that work with Graphite
Munin | Munin docs, GitHub
Nagios | GitHub
New Relic | NewRelic Plugins, GitHub
Prometheus | Prometheus guide, GitHub
Zabbix | Blog article
Zenoss | RabbitMQ ZenPack, Instructional Video

Application-level Metrics

A system that uses RabbitMQ, or any messaging-based system, is almost always distributed or can be treated as such. In such systems it is often not immediately obvious which component is problematic. Every single one of them, including applications, should be monitored and investigated.

Some infrastructure-level and RabbitMQ metrics can indicate the presence of unusual system behaviour or an issue but can't pinpoint the root cause. For example, it is easy to tell that a node is running out of disk space but not always easy to tell why. This is where application metrics come in: they can help identify a runaway publisher, a repeatedly failing consumer, a consumer that cannot keep up with the rate, or even a downstream service that's experiencing a slowdown (e.g. a missing index in a database used by the consumers).

Some client libraries (e.g. the RabbitMQ Java client) and frameworks (e.g. Spring AMQP) provide means of registering metrics collectors or collect metrics out of the box. With others, developers have to track metrics in their application code.

Which metrics applications track can be system-specific, but some are relevant to most systems (a minimal tracking sketch follows this list):

  • Connection opening rate
  • Channel opening rate
  • Connection failure (recovery) rate
  • Publishing rate
  • Delivery rate
  • Positive delivery acknowledgement rate
  • Negative delivery acknowledgement rate
  • Mean/95th percentile delivery processing latency
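
Where the client library offers no hooks, a small in-application counter covering events such as the ones above is often a sufficient starting point. The class below is a hypothetical sketch, not part of any RabbitMQ client library; a production version would typically delegate to an existing metrics library instead.

    # A hypothetical in-application counter for messaging metrics;
    # not part of any RabbitMQ client library.
    import threading
    import time
    from collections import defaultdict

    class MessagingMetrics:
        def __init__(self):
            self._lock = threading.Lock()
            self._counts = defaultdict(int)
            self._started = time.monotonic()

        def increment(self, name, by=1):
            # e.g. "connections.opened", "messages.published", "messages.acked"
            with self._lock:
                self._counts[name] += by

        def rates(self):
            # average per-second rates since start-up; a production version would
            # use a sliding window and track latency percentiles as well
            elapsed = max(time.monotonic() - self._started, 1e-9)
            with self._lock:
                return {name: count / elapsed for name, count in self._counts.items()}

    # usage: call metrics.increment("messages.published") from the publishing code path
    metrics = MessagingMetrics()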

Log Aggregation

While not technically a metric, one more piece of information can be very useful in troubleshooting a multi-service distributed system: logs. Consider collecting logs from all RabbitMQ nodes as well as all applications (if possible). Like metrics, logs can provide important clues that will help identify the root cause.

Getting Help and Providing Feedback

If you have questions about the contents of this guide or any other topic related to RabbitMQ, don't hesitate to ask them on the RabbitMQ mailing list.

Help Us Improve the Docs <3

If you'd like to contribute an improvement to the site, its source is available on GitHub. Simply fork the repository and submit a pull request. Thank you!