
Monitoring

Overview

This document provides an overview of topics related to RabbitMQ monitoring. Monitoring your RabbitMQ installation is an effective means to detect issues before they affect the rest of your environment and, eventually, your users.

Many aspects of the system can be monitored. This guide will group them into a handful of categories:

  • Infrastructure and kernel metrics
  • RabbitMQ metrics: cluster-wide, per node and per queue
  • Application-level metrics
  • Health checks

Log aggregation across all nodes and applications is closely related to monitoring and also covered in this guide.

A number of popular tools, both open source and commercial, can be used to monitor RabbitMQ.

What is Monitoring?

For the purpose of this guide we define monitoring as a continuous process of capturing the behaviour of a system via health checks and metrics in order to detect anomalies: when the system is unavailable, experiences an unusual load, runs out of certain resources or otherwise does not behave within its normal (expected) parameters. Monitoring involves collecting and storing metrics for the long term, which is critically important not just for anomaly detection but also for root cause analysis, trend detection and capacity planning.

Monitoring systems typically integrate with alerting systems.

When an anomaly is detected by a monitoring system an alarm of some sort is typically passed to an alerting system, which notifies interested parties such as the technical operations team.

Having monitoring in place means that important deviations in system behaviour, from degraded service in some areas to complete unavailability, are easier to detect and the root cause takes significantly less time to find. Operating a distributed system without monitoring data is a bit like trying to get out of a forest without a GPS navigator device or compass. It doesn't matter how brilliant or experienced the person is: having relevant information is critically important for a good outcome.

A Health check is the most basic aspect of monitoring. It involves a periodically executed command or set of commands that collect a few essential metrics of the monitored system and evaluate them. For example, whether RabbitMQ's Erlang VM is running is one such check. It involves a metric (is an OS process running?), an expected range of normal operating parameters (the process is expected to be running, otherwise it cannot possibly serve any clients) and an evaluation step. Of course, there are more varieties of health checks. Which one is considered to be most appropriate depends on the definition of a "healthy node" used, and thus is a system- and team-specific decision. RabbitMQ CLI tools provide a number of commands that can serve as useful health checks. They will be covered later in this guide.

While health checks are a useful tool, they only provide so much insight into the state of the system because they are by design focused on just one or a handful of metrics, usually check a single node and can only reason about the state of that node at a particular moment in time. For a more comprehensive assessment many more metrics have to be collected continuously at reasonable intervals. This allows more types of anomalies to be detected, as some can only be identified over longer periods of time. This is usually done by tools commonly referred to as monitoring tools, of which there is a great variety. This guide covers some tools commonly used for RabbitMQ monitoring.

Some metrics are RabbitMQ-specific: they are collected and reported by RabbitMQ nodes. The rest of this guide refers to them simply as "RabbitMQ metrics". Examples include the number of socket descriptors used, the total number of enqueued messages or inter-node communication traffic rates. Other metrics are collected and reported by the OS kernel. Such metrics are often called system metrics or infrastructure metrics. System metrics are not RabbitMQ-specific: CPU utilisation rate, total amount of memory used by a certain process or group of processes, network packet loss rate and so on. Both types are important to monitor: individual metrics are not useful in all scenarios, but when analysed collectively they can quickly provide a reasonably complete insight into the state of the system, and help operators form a hypothesis as to what's going on and what needs further investigation and/or fixing.

Frequency of Monitoring

Many monitoring systems poll their monitored services periodically. How often that's done varies from tool to tool but usually can be configured by the operator.

Very frequent polling can have negative consequences on the system under monitoring. For example, excessive load balancer checks that open a test TCP connection to a node can lead to a high connection churn. Excessive checks of channels and queues in RabbitMQ will increase its CPU consumption, which can be substantial when there are many (say, 10s of thousands) of them on a node.

It is recommended that metrics are collected at 60 second intervals (or at least 30 second intervals if a higher update rate is desired). More frequent collection will increase load on the system under monitoring and provide no practical benefit for most monitoring setups.
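As a very rough illustration of this recommendation, the loop below polls a basic node check once a minute. It is only a sketch (the log file path is an arbitrary example, not a standard location); a real deployment would rely on a proper monitoring tool instead:

# minimal sketch: run a basic check every 60 seconds and record failures
# the log file path below is just an example
while true; do
  rabbitmq-diagnostics -q ping || echo "$(date): ping check failed" >> /tmp/rabbitmq-checks.log
  sleep 60
done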

Infrastructure and Kernel Metrics

The first step towards a useful monitoring system is collecting infrastructure and kernel metrics. There are quite a few of them, but some are more important than others. The following metrics should be monitored on all nodes in a RabbitMQ cluster and, if possible, on all instances that host RabbitMQ clients:

  • CPU stats (user, system, iowait & idle percentages)
  • Memory usage (used, buffered, cached & free percentages)
  • Virtual Memory statistics (dirty page flushes, writeback volume)
  • Disk I/O (operations & amount of data transferred per unit time, time to service operations)
  • Free disk space on the mount used for the node data directory
  • File descriptors used by beam.smp vs. max system limit
  • TCP connections by state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT)
  • Network throughput (bytes received, bytes sent) & maximum network throughput
  • Network latency (between all RabbitMQ nodes in a cluster as well as to/from clients)

There is no shortage of existing tools (such as Prometheus or Datadog) that collect infrastructure and kernel metrics, store and visualise them over periods of time.
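For a quick ad hoc look at some of the metrics listed above on a Linux node, standard OS tools can be used. The snippet below is only a sketch: it assumes the node data directory lives under /var/lib/rabbitmq and that a single Erlang VM (beam.smp) runs on the host. A monitoring agent (for example, Prometheus node_exporter or the Datadog agent) would collect these continuously instead.

# sketch of ad hoc collection on a Linux node; paths and process selection are assumptions
df -h /var/lib/rabbitmq                     # free disk space on the data directory mount
ss -s                                       # TCP connection counts by state
RMQ_PID=$(pgrep -f beam.smp | head -n 1)    # PID of the RabbitMQ Erlang VM (assumes a single VM on the host)
ls /proc/"$RMQ_PID"/fd | wc -l              # file descriptors currently in use (may require elevated privileges)
grep "open files" /proc/"$RMQ_PID"/limits   # per-process file descriptor limit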

RabbitMQ Metrics

The RabbitMQ management plugin provides an API for accessing RabbitMQ metrics. The plugin will store up to one day's worth of metric data. Longer term monitoring should be accomplished with an external tool.
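If the management plugin is not yet enabled on a node, it can be enabled with rabbitmq-plugins:

rabbitmq-plugins enable rabbitmq_management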

This section will cover multiple RabbitMQ-specific aspects of monitoring.

Monitoring of Clusters

When monitoring clusters it is important to understand the guarantees provided by the HTTP API. In a clustered environment every node can serve metric endpoint requests. This means that it is possible to retrieve cluster-wide metrics from any node, assuming the node is capable of contacting its peers. That node will collect and aggregate data from its peers as needed before producing a response.

Every node also can serve requests to endpoints that provide node-specific metrics for itself as well as other cluster nodes. Like infrastructure and OS metrics, node-specific metrics must be stored for each node individually. However, HTTP API requests can be issued to any node.

As mentioned earlier, inter-node connectivity issues will affect HTTP API behaviour. It is therefore recommended that a random online node is chosen for monitoring requests, e.g. with the help of a load balancer or round-robin DNS.

Note that some endpoints (e.g. aliveness check) perform operations on the contacted node specifically; they are an exception, not the rule.
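For example, the aliveness check endpoint is served by the node that receives the request. The command below assumes the default guest:guest credentials, a management listener on localhost and the default virtual host ("/", percent-encoded as %2F), as in the other examples in this guide:

curl --silent -u guest:guest -X GET http://127.0.0.1:15672/api/aliveness-test/%2F
# => {"status":"ok"}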

Cluster-wide Metrics

Cluster-wide metrics provide the highest possible level view of cluster state. Some of them (cluster links, detected network partitions) represent node interaction, others are aggregates of certain metrics across all cluster members. They are complementary to infrastructure and node metrics.

GET /api/overview is the HTTP API endpoint that returns cluster-wide metrics.

| Metric | JSON field name |
|---|---|
| Cluster name | `cluster_name` |
| Cluster-wide message rates | `message_stats` |
| Total number of connections | `object_totals.connections` |
| Total number of channels | `object_totals.channels` |
| Total number of queues | `object_totals.queues` |
| Total number of consumers | `object_totals.consumers` |
| Total number of messages (ready plus unacknowledged) | `queue_totals.messages` |
| Number of messages ready for delivery | `queue_totals.messages_ready` |
| Number of unacknowledged messages | `queue_totals.messages_unacknowledged` |
| Messages published recently | `message_stats.publish` |
| Message publish rate | `message_stats.publish_details.rate` |
| Messages delivered to consumers recently | `message_stats.deliver_get` |
| Message delivery rate | `message_stats.deliver_get_details.rate` |
| Other message stats | `message_stats.*` (see this document) |
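To retrieve these metrics, query the endpoint and, if desired, narrow the output down with jq. The example below assumes the default guest:guest credentials and a management listener on localhost, as elsewhere in this guide:

curl --silent -u guest:guest -X GET http://127.0.0.1:15672/api/overview | jq '.object_totals'
# prints cluster-wide object totals (connections, channels, queues, consumers and so on)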

Node Metrics

There are two HTTP API endpoints that provide access to node-specific metrics:

  • GET /api/nodes/{node} returns stats for a single node
  • GET /api/nodes returns stats for all cluster members

The latter endpoint returns an array of objects. Monitoring tools that support (or can support) that as an input should prefer that endpoint since it reduces the number of requests that have to be issued. When that's not the case, the former endpoint can be used to retrieve stats for every cluster member in turn. That implies that the monitoring system is aware of the list of cluster members.

Most of the metrics represent point-in-time absolute values. Some, however, represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are primarily useful when compared to their previous values and historical mean/percentile values.

| Metric | JSON field name |
|---|---|
| Total amount of memory used | `mem_used` |
| Memory usage high watermark | `mem_limit` |
| Is a memory alarm in effect? | `mem_alarm` |
| Free disk space low watermark | `disk_free_limit` |
| Is a disk alarm in effect? | `disk_free_alarm` |
| File descriptors available | `fd_total` |
| File descriptors used | `fd_used` |
| File descriptor open attempts | `io_file_handle_open_attempt_count` |
| Sockets available | `sockets_total` |
| Sockets used | `sockets_used` |
| Message store disk reads | `message_stats.disk_reads` |
| Message store disk writes | `message_stats.disk_writes` |
| Inter-node communication links | `cluster_links` |
| GC runs | `gc_num` |
| Bytes reclaimed by GC | `gc_bytes_reclaimed` |
| Erlang process limit | `proc_total` |
| Erlang processes used | `proc_used` |
| Runtime run queue | `run_queue` |
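As an example of consuming these endpoints, the command below retrieves a few selected metrics for every cluster member. It assumes the default guest:guest credentials and a management listener on localhost, as in the other examples in this guide:

curl --silent -u guest:guest -X GET http://127.0.0.1:15672/api/nodes | jq '.[] | {name, fd_used, mem_used, run_queue}'
# prints one object per cluster member with the selected fields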

Individual Queue Metrics

Individual queue metrics are made available through the HTTP API via the GET /api/queues/{vhost}/{qname} endpoint.

| Metric | JSON field name |
|---|---|
| Memory | `memory` |
| Total number of messages (ready plus unacknowledged) | `messages` |
| Number of messages ready for delivery | `messages_ready` |
| Number of unacknowledged messages | `messages_unacknowledged` |
| Messages published recently | `message_stats.publish` |
| Message publishing rate | `message_stats.publish_details.rate` |
| Messages delivered recently | `message_stats.deliver_get` |
| Message delivery rate | `message_stats.deliver_get_details.rate` |
| Other message stats | `message_stats.*` (see this document) |
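For example, the following command retrieves the message counts of a hypothetical queue named qa.orders in the default virtual host. Note that the default virtual host, "/", must be percent-encoded as %2F in the URL; credentials are assumed to be the defaults used elsewhere in this guide:

curl --silent -u guest:guest -X GET http://127.0.0.1:15672/api/queues/%2F/qa.orders | jq '{messages, messages_ready, messages_unacknowledged}'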

Application-level Metrics

A system that uses RabbitMQ, or any messaging-based system, is almost always distributed or can be treated as such. In such systems it is often not immediately obvious which component is problematic. Every single one of them, including applications, should be monitored and investigated.

Some infrastructure-level and RabbitMQ metrics can demonstrate the presence of unusual system behaviour or an issue but cannot pinpoint the root cause. For example, it is easy to tell that a node is running out of disk space but not always easy to tell why. This is where application metrics come in: they can help identify a runaway publisher, a repeatedly failing consumer, a consumer that cannot keep up with the rate, or even a downstream service that's experiencing a slowdown (e.g. a missing index in a database used by the consumers).

Some client libraries (e.g. RabbitMQ Java client) and frameworks (e.g. Spring AMQP) provide means of registering metrics collectors or collect metrics out of the box. With others developers have to track metrics in their application code.

What metrics applications track can be system-specific but some are relevant to most systems:

  • Connection opening rate
  • Channel opening rate
  • Connection failure (recovery) rate
  • Publishing rate
  • Delivery rate
  • Positive delivery acknowledgement rate
  • Negative delivery acknowledgement rate
  • Mean/95th percentile delivery processing latency

Health Checks

A health check is a periodically executed command that tries to determine whether an aspect of the RabbitMQ service is operating normally.

There is a series of health checks that can be performed, from the most basic ones, which virtually never produce false positives, to increasingly comprehensive, intrusive and opinionated checks that have a higher probability of false positives. In other words, the more comprehensive a health check is, the less conclusive its result will be.

Health checks can verify the state of an individual node (node health checks), or the entire cluster (cluster health checks).

Individual Node Checks

This section covers several examples of node health checks. They are organised in stages. Higher stages perform increasingly comprehensive and opinionated checks, which have a higher probability of false positives. Some stages have dedicated RabbitMQ CLI tool commands, others can require additional tools.

Note that while the health checks are ordered, a greater number does not necessarily indicate a "better" check.

The health checks can be used selectively and combined. Unless noted otherwise, the checks should follow the same monitoring frequency recommendation as metric collection.

Stage 1

The most basic check ensures that the runtime is running and (indirectly) that CLI tools can authenticate to it.

Aside from the CLI tool authentication part, the probability of false positives can be considered to approach 0 outside of upgrades and maintenance windows.

rabbitmq-diagnostics ping performs this check:

rabbitmq-diagnostics ping -q
# => Ping succeeded if exit code is 0

Stage 2

A slightly more comprehensive check is executing rabbitmq-diagnostics status:

This effectively includes the stage 1 check plus retrieves some essential system information which is useful for other checks and should always be available if RabbitMQ is running on the node (see below).

rabbitmq-diagnostics -q status
# => [output elided for brevity]

This is a common way of sanity checking a node. The probability of false positives can be considered approaching 0 except for upgrades and maintenance windows.

Stage 3

Includes previous checks and also verifies that the RabbitMQ application is running (not stopped with rabbitmqctl stop_app or the Pause Minority partition handling strategy) and there are no resource alarms.

# lists alarms in effect across the cluster, if any
rabbitmq-diagnostics -q alarms

rabbitmq-diagnostics check_running is a check that makes sure that the runtime is running and the RabbitMQ application on it is not stopped or paused.

rabbitmq-diagnostics check_local_alarms checks that there are no local alarms in effect on the node. If there are any, it will exit with a non-zero status.

The two commands in combination deliver the stage 3 check:

rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms
# if both checks succeed, the exit code will be 0

The probability of false positives is low. Systems hovering around their high runtime memory watermark will have a higher probability of false positives, and the probability can rise significantly during upgrades and maintenance windows.

Specifically for memory alarms, the GET /api/nodes/{node}/memory HTTP API endpoint can be used for additional checks. In the following example its output is piped to jq:

curl --silent -u guest:guest -X GET http://127.0.0.1:15672/api/nodes/[email protected]/memory | jq
# => {
# =>     "memory": {
# =>         "connection_readers": 24100480,
# =>         "connection_writers": 1452000,
# =>         "connection_channels": 3924000,
# =>         "connection_other": 79830276,
# =>         "queue_procs": 17642024,
# =>         "queue_slave_procs": 0,
# =>         "plugins": 63119396,
# =>         "other_proc": 18043684,
# =>         "metrics": 7272108,
# =>         "mgmt_db": 21422904,
# =>         "mnesia": 1650072,
# =>         "other_ets": 5368160,
# =>         "binary": 4933624,
# =>         "msg_index": 31632,
# =>         "code": 24006696,
# =>         "atom": 1172689,
# =>         "other_system": 26788975,
# =>         "allocated_unused": 82315584,
# =>         "reserved_unallocated": 0,
# =>         "strategy": "rss",
# =>         "total": {
# =>             "erlang": 300758720,
# =>             "rss": 342409216,
# =>             "allocated": 383074304
# =>         }
# =>     }
# => }

The breakdown information it produces can be reduced down to a single value using jq or similar tools:

curl --silent -u guest:guest -X GET http://127.0.0.1:15672/api/nodes/[email protected]/memory | jq ".memory.total.allocated"
# => 397365248

rabbitmq-diagnostics -q memory_breakdown provides access to the same per category data and supports various units:

rabbitmq-diagnostics -q memory_breakdown --unit "MB"
# => connection_other: 50.18 mb (22.1%)
# => allocated_unused: 43.7058 mb (19.25%)
# => other_proc: 26.1082 mb (11.5%)
# => other_system: 26.0714 mb (11.48%)
# => connection_readers: 22.34 mb (9.84%)
# => code: 20.4311 mb (9.0%)
# => queue_procs: 17.687 mb (7.79%)
# => other_ets: 4.3429 mb (1.91%)
# => connection_writers: 4.068 mb (1.79%)
# => connection_channels: 4.012 mb (1.77%)
# => metrics: 3.3802 mb (1.49%)
# => binary: 1.992 mb (0.88%)
# => mnesia: 1.6292 mb (0.72%)
# => atom: 1.0826 mb (0.48%)
# => msg_index: 0.0317 mb (0.01%)
# => plugins: 0.0119 mb (0.01%)
# => queue_slave_procs: 0.0 mb (0.0%)
# => mgmt_db: 0.0 mb (0.0%)
# => reserved_unallocated: 0.0 mb (0.0%)

Stage 4

Includes all checks in stage 3 plus a check on all enabled listeners (using a temporary TCP connection).

To inspect all listeners enabled on a node, use rabbitmq-diagnostics listeners:

rabbitmq-diagnostics -q listeners
# => Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
# => Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
# => Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
# => Interface: [::], port: 15672, protocol: http, purpose: HTTP API
# => Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)

rabbitmq-diagnostics check_port_connectivity is a command that performs the basic TCP connectivity check mentioned above:

rabbitmq-diagnostics -q check_port_connectivity
# If the check succeeds, the exit code will be 0

The probability of false positives is generally low but it can rise significantly during upgrades and maintenance windows.

Stage 5

Includes all checks in stage 4 plus checks that there are no failed virtual hosts.

RabbitMQ CLI tools currently do not provide a dedicated command for this check, but here is an example that could be used in the meantime:

rabbitmqctl eval 'true = lists:foldl(fun(VHost, Acc) -> Acc andalso rabbit_vhost:is_running_on_all_nodes(VHost) end, true, rabbit_vhost:list()).'
# => true

The probability of false positives is generally low except for systems that are under high CPU load.

Stage 6

Includes all checks in stage 5 plus checks all channel and queue processes on the target node for aliveness.

The combination of rabbitmq-diagnostics check_port_connectivity and rabbitmq-diagnostics node_health_check is the closest alternative to this check currently available.

This combination of commands includes all checks up to and including stage 4 and also checks all channel and queue processes on the target node for aliveness:

rabbitmq-diagnostics -q check_port_connectivity && \
rabbitmq-diagnostics -q node_health_check
# if both checks succeed, the exit code will be 0

The probability of false positives is moderate for systems under above average load or with a large number of queues and channels (starting with 10s of thousands).

Optional Check 1

This check verifies that an expected set of plugins is enabled. It is orthogonal to the primary checks.

rabbitmq-plugins list --enabled is the command that lists enabled plugins on a node:

rabbitmq-plugins -q list --enabled
# => Configured: E = explicitly enabled; e = implicitly enabled
# => | Status: * = running on [email protected]
# => |/
# => [E*] rabbitmq_auth_mechanism_ssl       3.7.10
# => [E*] rabbitmq_consistent_hash_exchange 3.7.10
# => [E*] rabbitmq_management               3.7.10
# => [E*] rabbitmq_management_agent         3.7.10
# => [E*] rabbitmq_shovel                   3.7.10
# => [E*] rabbitmq_shovel_management        3.7.10
# => [E*] rabbitmq_top                      3.7.10
# => [E*] rabbitmq_tracing                  3.7.10

A health check that verifies that a specific plugin, rabbitmq_shovel, is enabled and running:

rabbitmq-plugins -q is_enabled rabbitmq_shovel
# if the check succeeded, exit code will be 0

The probability of false positives is generally low but rises in environments where environment variables that can affect rabbitmq-plugins are overridden.

Monitoring Tools

The following is an alphabetised list of third-party tools commonly used to collect RabbitMQ metrics. These tools vary in capabilities but usually can collect both infrastructure-level and RabbitMQ metrics.

Note that this list is by no means complete.

| Monitoring Tool | Online Resource(s) |
|---|---|
| AppDynamics | AppDynamics, GitHub |
| `collectd` | GitHub |
| DataDog | DataDog RabbitMQ integration, GitHub |
| Ganglia | GitHub |
| Graphite | Tools that work with Graphite |
| Munin | Munin docs, GitHub |
| Nagios | GitHub |
| New Relic | NewRelic Plugins, GitHub |
| Prometheus | Prometheus guide, GitHub |
| Zabbix | Blog article |
| Zenoss | RabbitMQ ZenPack, Instructional Video |

Log Aggregation

While not technically a metric, one more piece of information can be very useful in troubleshooting a multi-service distributed system: logs. Consider collecting logs from all RabbitMQ nodes as well as all applications (if possible). Like metrics, logs can provide important clues that will help identify the root cause.
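For example, on nodes installed from the Debian or RPM packages, logs are written to /var/log/rabbitmq by default and can be shipped to a log aggregation system from there. The exact path depends on the installation method and configuration, so the command below is only a sketch:

tail -f /var/log/rabbitmq/rabbit@$(hostname -s).log
# follows the log of a node named rabbit@<short hostname>; adjust the path for other installation methods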

Getting Help and Providing Feedback

If you have questions about the contents of this guide or any other topic related to RabbitMQ, don't hesitate to ask them on the RabbitMQ mailing list.

Help Us Improve the Docs <3

If you'd like to contribute an improvement to the site, its source is available on GitHub. Simply fork the repository and submit a pull request. Thank you!