Monitoring

This document explains which RabbitMQ and system metrics are most important to monitor. Monitoring your RabbitMQ installation is an effective way to catch issues before they affect the rest of your environment and, eventually, your users.

Infrastructure and Kernel Metrics

The first step towards a useful monitoring system is to collect infrastructure and kernel metrics. There are quite a few of them, but some are more important than others. The following metrics should be monitored on all nodes in a RabbitMQ cluster and, if possible, on all nodes that host applications:

  • CPU (idle, user, system, iowait)
  • Memory (free, cached, buffered)
  • Disk I/O (reads & writes per unit time, I/O wait percentages)
  • Free Disk Space
  • File descriptors used by beam.smp vs. max system limit
  • Network throughput (bytes received, bytes sent) vs. maximum network link throughput
  • VM statistics (dirty page flushes, writeback volume)
  • System load average (/proc/loadavg)

There is no shortage of existing tools (such as Graphite or Datadog) that collect, store and visualise infrastructure and kernel metrics over time.
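For a quick spot check without a full monitoring stack, a couple of the metrics above can also be sampled directly from /proc on Linux. The following is a minimal, illustrative sketch that reads the system load average and counts the file descriptors open in a given process; the beam.smp PID is a placeholder you would look up yourself (e.g. with pgrep), and reading another process's /proc entries typically requires running as the same user or as root.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Plain-Java sketch: sample two of the metrics above on Linux by reading /proc.
    public class ProcSampler {

        // System load averages: the first three fields of /proc/loadavg (1, 5 and 15 minutes).
        static String loadAverage() throws IOException {
            String[] fields = Files.readString(Path.of("/proc/loadavg")).split("\\s+");
            return fields[0] + " " + fields[1] + " " + fields[2];
        }

        // File descriptors currently open by a process: entries under /proc/<pid>/fd.
        static long openFileDescriptors(long pid) throws IOException {
            try (var entries = Files.list(Path.of("/proc/" + pid + "/fd"))) {
                return entries.count();
            }
        }

        public static void main(String[] args) throws IOException {
            long beamPid = Long.parseLong(args[0]); // PID of beam.smp, looked up externally
            System.out.println("load average: " + loadAverage());
            System.out.println("open fds:     " + openFileDescriptors(beamPid));
        }
    }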

RabbitMQ Metrics

The RabbitMQ management plugin provides a starting point for monitoring RabbitMQ metrics. One limitation, however, is that it stores at most one day's worth of metrics. Historical metrics are an important tool for determining the root cause of issues affecting your users and for planning future capacity.

RabbitMQ metrics are made available through the HTTP API via the api/queues/{vhost}/{qname} endpoint. Collecting metrics at 60-second intervals is recommended: more frequent collection may place too much load on the RabbitMQ server and negatively affect performance.

Metric                 | JSON field name
Memory                 | memory
Queued Messages        | messages
Un-acked Messages      | messages_unacknowledged
Messages Published     | message_stats.publish
Message Publish Rate   | message_stats.publish_details.rate
Messages Delivered     | message_stats.deliver_get
Message Delivery Rate  | message_stats.deliver_get_details.rate
Other Message Stats    | message_stats.* (see this document)
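As an illustration, the sketch below polls the queue endpoint once for a single queue using Java 11's built-in HTTP client. The host, credentials, vhost (%2F is the URL-encoded default vhost "/") and queue name are placeholders; in a real collector you would run this on a 60-second timer and parse the JSON body with a library of your choice.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    // Sketch: fetch one queue's metrics from the management HTTP API.
    // Host, credentials, vhost and queue name below are placeholders.
    public class QueueMetricsPoller {
        public static void main(String[] args) throws Exception {
            String url = "http://localhost:15672/api/queues/%2F/my-queue";
            String auth = Base64.getEncoder()
                    .encodeToString("guest:guest".getBytes());

            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON body contains the fields listed above, e.g. "messages",
            // "messages_unacknowledged" and "message_stats"; parse it with a
            // JSON library rather than printing it as done here.
            System.out.println(response.body());
        }
    }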

Third Party Monitoring Tools

The following is an alphabetised list of third-party tools for collecting RabbitMQ metrics. These tools can monitor both the recommended system metrics and RabbitMQ metrics.

Plugin       | Online Resource(s)
AppDynamics  | AppDynamics, GitHub
collectd     | GitHub
DataDog      | DataDog-RabbitMQ Integration, GitHub
Ganglia      | GitHub
Graphite     | Tools that work with Graphite
Munin        | Munin homepage, GitHub
Nagios       | GitHub
New Relic    | NewRelic Plugins, GitHub
Prometheus   | Prometheus Plugin
Zabbix       | Blog article
Zenoss       | RabbitMQ ZenPack, Instructional Video

Application-level Metrics

A system that uses RabbitMQ, or any messaging-based system, is almost always distributed or can be treated as such. In such systems it is often not immediately obvious which component is problematic. Every single one of them, including applications, should be monitored and investigated.

Some infrastructure-level and RabbitMQ metrics can demonstrate the presence of unusual system behaviour or an issue but can't pinpoint the root cause. For example, it is easy to tell that a node is running out of disk space but not always easy to tell why. This is where application metrics come in: they can help identify a runaway publisher, a repeatedly failing consumer, a consumer that cannot keep up with the rate, or even a downstream service that is experiencing a slowdown (e.g. a missing index in a database used by the consumers).

Some client libraries (e.g. RabbitMQ Java client) and frameworks (e.g. Spring AMQP) provide means of registering metrics collectors or collect metrics out of the box.
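For example, the RabbitMQ Java client can report its metrics through Micrometer. The sketch below assumes the amqp-client (5.x) and micrometer-core dependencies are on the classpath and uses a SimpleMeterRegistry for brevity; in practice you would plug in a registry backed by your monitoring system (Prometheus, Graphite, Datadog and so on).

    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.impl.MicrometerMetricsCollector;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    // Sketch: register a metrics collector with the RabbitMQ Java client so that
    // connection, channel, publish and consume counters flow into Micrometer.
    public class ClientMetrics {
        public static void main(String[] args) throws Exception {
            SimpleMeterRegistry registry = new SimpleMeterRegistry();

            ConnectionFactory factory = new ConnectionFactory();
            factory.setMetricsCollector(new MicrometerMetricsCollector(registry));

            try (var connection = factory.newConnection();
                 var channel = connection.createChannel()) {
                channel.basicPublish("", "my-queue", null, "hello".getBytes());
            }

            // Meters such as rabbitmq.published are now in the registry and can be
            // exported by whichever Micrometer backend the application is using.
            registry.getMeters().forEach(m -> System.out.println(m.getId()));
        }
    }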

Log Aggregation

While not technically a metric, one more piece of information can be very useful in troubleshooting a multi-service distributed system: logs. Consider collecting logs from all RabbitMQ nodes as well as all applications (if possible). Like metrics, logs can provide important clues that will help identify the root cause.