Monitoring with Prometheus & Grafana

Overview

This guide covers RabbitMQ monitoring with two popular tools: Prometheus, a monitoring toolkit; and Grafana, a metrics visualisation system.

Together, these tools form a powerful toolkit for long-term metric collection and monitoring of RabbitMQ clusters. While the RabbitMQ management UI also provides access to a subset of metrics, by design it does not try to be a long-term metric collection solution.

Please read through the main guide on monitoring first. The monitoring principles and available metrics it describes mostly apply when Prometheus and Grafana are used as well.

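Some key topics covered by this guide are

  • Built-in Prometheus Support
  • Grafana Support
  • Quick Start
  • RabbitMQ Overview Dashboard
  • Installation

Grafana dashboards follow a number of conventions to make the system more observable and anti-patterns easier to spot. Their design decisions are explained in a number of sections:

  • Health Indicators
  • Colour Labelling in Graphs
  • Thresholds in Graphs
  • Relevant Documentation for Graphs
  • Using Graphs to Spot Anti-patterns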

Built-in Prometheus Support

As of 3.8.0, RabbitMQ ships with built-in Prometheus & Grafana support.

Support for the Prometheus metric collector ships in the rabbitmq_prometheus plugin. The plugin exposes all RabbitMQ metrics on a dedicated TCP port, in Prometheus text format.

These metrics provide deep insight into the state of RabbitMQ nodes and the runtime. They make reasoning about the behaviour of RabbitMQ, applications that use it and various infrastructure elements a lot more informed.

Grafana Support

Collected metrics are not very useful unless they are visualised. Team RabbitMQ provides a prebuilt set of Grafana dashboards that visualise a large number of available RabbitMQ and runtime metrics in context-specific ways.

There are a number of dashboards available, including RabbitMQ-Overview, RabbitMQ-Raft, Erlang-Distribution and others. Each is meant to provide an insight into a specific part of the system. When used together, they are able to explain RabbitMQ and application behaviour in detail.

Note that the Grafana dashboards are opinionated and use a number of conventions, for example, to spot system health issues quicker or make cross-graph referencing possible. Like all Grafana dashboards, they are also highly customizable. The conventions they assume are considered to be good practices and are thus recommended.

An Example

When RabbitMQ is integrated with Prometheus and Grafana, this is what the RabbitMQ Overview dashboard looks like:

RabbitMQ Overview Dashboard

Quick Start

Before We Start

This section explains how to set up a RabbitMQ cluster with Prometheus and Grafana dashboards, as well as some applications that will produce some activity and meaningful metrics.

With this setup you will be able to interact with RabbitMQ, Prometheus & Grafana running locally. You will also be able to try out different load profiles to see how it all fits together, make sense of the dashboards, panels and so on.

This is merely an example; the rabbitmq_prometheus plugin and our Grafana dashboards do not require the use of Docker Compose demonstrated below.

Prerequisites

The instructions below assume a host machine that has a certain set of tools installed:

  • A terminal to run the commands
  • Git to clone the repository
  • Docker Desktop to use Docker Compose locally
  • A Web browser to browse the dashboards

Their installation is out of scope of this guide. Use

git version
docker info && docker-compose version

on the command line to verify that the necessary tools are available.
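The same check can be scripted. The following is a minimal sketch, not part of the rabbitmq-prometheus repository; the helper name require_tools is an assumption made for illustration. It only tests that each command can be found on PATH, so it does not replace running docker info to confirm that the Docker daemon is reachable.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command that cannot be
# found on PATH. Returns non-zero if at least one command is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Harmless demonstration: "sh" is present on any POSIX system.
require_tools sh && echo "all tools found"

# For this guide, the call would be:
# require_tools git docker docker-compose
```

If any tool is reported as missing, install it before continuing with the steps below.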

Clone a Repository with Manifests

The first step is to clone a Git repository, rabbitmq-prometheus, with the manifests and other components required to run a RabbitMQ cluster, Prometheus and a set of applications:

git clone https://github.com/rabbitmq/rabbitmq-prometheus.git
cd rabbitmq-prometheus/docker

Run Docker Compose

Next, use the Docker Compose manifests to run a pre-configured RabbitMQ cluster, a Prometheus instance and a basic workload that will produce the metrics displayed in the RabbitMQ Overview dashboard:

docker-compose -f docker-compose-metrics.yml up -d
docker-compose -f docker-compose-overview.yml up -d

The docker-compose commands above can also be executed with a make target:

make metrics overview

When the above commands succeed, there will be a functional RabbitMQ cluster and a Prometheus instance collecting metrics from it running in a set of containers.

Access RabbitMQ Overview Grafana Dashboard

Now navigate to http://localhost:3000/dashboards in a Web browser. It will bring up a login page. Use admin for both the username and the password. On the very first login Grafana will suggest changing your password. For the sake of this example, we suggest that this step is skipped.

Navigate to the RabbitMQ-Overview dashboard that will look like this:

RabbitMQ Overview Dashboard Localhost

Congratulations! You now have a 3-node RabbitMQ cluster integrated with Prometheus & Grafana running locally. This is a perfect time to learn more about the available dashboards.

RabbitMQ Overview Dashboard

All metrics available in the management UI Overview page are available in the Overview Grafana dashboard. They are grouped by object type, with a focus on RabbitMQ nodes and message rates.

Health Indicators

Single stat metrics at the top of the dashboard capture the health of a single RabbitMQ cluster. In this case, there's a single RabbitMQ cluster, rabbitmq-overview, as seen in the Cluster drop-down menu just below the dashboard title.

The panels on all RabbitMQ Grafana dashboards use different colours to capture the following metric states:

  • Green means the value of the metric is within a healthy range
  • Blue means under-utilisation or some form of degradation
  • Red means the value of the metric is below or above the range that is considered healthy

RabbitMQ Overview Dashboard Single-stat

Default ranges for the single stat metrics will not be optimal for all RabbitMQ deployments. For example, in environments with many consumers and/or high prefetch values, it may be perfectly fine to have over 1,000 unacknowledged messages. The default thresholds can be easily adjusted to suit the workload and system at hand. Users are encouraged to revisit these ranges and tweak them to fit their workloads, monitoring and operational practices, and tolerance for false positives.

Metrics and Graphs

Most RabbitMQ and runtime metrics are represented as graphs in Grafana: they are values that change over time. This is the simplest and clearest way of visualising how some aspect of the system changes. Time-based charting makes it easy to understand the change in key metrics: message rates, memory used by every node in the cluster, or the number of concurrent connections. All metrics except for health indicators are node-specific, that is, they represent values of a metric on a single node.

Some metrics, such as the panels grouped under CONNECTIONS, are stacked to capture the state of the cluster as a whole. These metrics are collected on individual nodes and grouped visually, which makes it easy to notice when, for example, one node serves a disproportionate number of connections.

We would refer to such a RabbitMQ cluster as unbalanced, meaning that at least in some ways, a minority of nodes perform the majority of work.

In the example below, connections are spread out evenly across all nodes most of the time:

RabbitMQ Overview Dashboard CONNECTIONS

Colour Labelling in Graphs

All metrics on all graphs are associated with specific node names. For example, all metrics drawn in green are for the node that contains 0 in its name, e.g. rabbit@rmq0. This makes it easy to correlate metrics of a specific node across graphs. Metrics for the first node, which is assumed to contain 0 in its name, will always appear as green across all graphs.

It is important to remember this aspect when using the RabbitMQ Overview dashboard. If a different node naming convention is used, the colours will appear inconsistent across graphs: green may represent e.g. rabbit@foo in one graph, and e.g. rabbit@bar in another graph.

When this is the case, the panels must be updated to use a different node naming scheme.

Thresholds in Graphs

Most metrics have pre-configured thresholds. They define expected operating boundaries for the metric. On the graphs they appear as semi-transparent orange or red areas, as seen in the example below.

RabbitMQ Overview Dashboard Single-stat

Metric values in the orange area signal that some pre-defined threshold has been exceeded. This may be acceptable, especially if the metric recovers. A metric that comes close to the orange area without entering it is still considered to be in a healthy state.

Metric values in the red area need attention and may identify some form of service degradation. For example, metrics in the red area can indicate that an alarm is in effect or that the node is out of file descriptors and cannot accept any more connections or open new files.

In the example above, we have a RabbitMQ cluster that runs at optimal memory capacity, which is just above the warning threshold. If there is a spike in published messages that should be stored in RAM, the amount of memory used by the node will go up and the metric on the graph will go down (as it indicates the amount of available memory).

Because the system has more memory available than is allocated to the RabbitMQ node it hosts, we notice the dip below 0 B. This emphasizes the importance of leaving spare memory available for the OS, housekeeping tasks that cause short-lived memory usage spikes, and other processes. When a RabbitMQ node exhausts all memory that it is allowed to use, a memory alarm is triggered and publishers across the entire cluster will be blocked.

On the right side of the graph we can see that consumers catch up and the amount of memory used goes down. That clears the memory alarm on the node and, as a result, publishers become unblocked. This and related metrics across the cluster should then return to their optimal state.

There is No "Right" Threshold for Many Metrics

Note that the thresholds used by the Grafana dashboards have to have a default value. No matter what value is picked by dashboard developers, it will not be suitable for all environments and workloads.

Some workloads may require higher thresholds, others may choose to lower them. While the defaults should be adequate in many cases, the operator must review and adjust the thresholds to suit their specific requirements.

Relevant Documentation for Graphs

Most metrics have a help icon in the top-left corner of the panel.

RabbitMQ Overview Dashboard Single-stat

Some, like the available disk space metric, link to dedicated pages in RabbitMQ documentation. These pages contain information relevant to the metric. Getting familiar with the linked guides is highly recommended and will help the operator understand what the metric means better.

Using Graphs to Spot Anti-patterns

Any metric drawn in red hints at an anti-pattern in the system. Such graphs try to highlight sub-optimal uses of RabbitMQ. A red graph with non-zero metrics should be investigated. Such metrics might indicate an issue in RabbitMQ configuration or sub-optimal actions by clients (publishers or consumers).

In the example below we can see the usage of highly inefficient polling consumers that keep polling, even though most or even all polling operations return no messages. Like any polling-based algorithm, it is wasteful and should be avoided where possible.

It is a lot more efficient to have RabbitMQ push messages to the consumer.

RabbitMQ Overview Dashboard Antipatterns

Example Workloads

The Prometheus plugin repository contains example workloads that use PerfTest to simulate different workloads. Their goal is to exercise all metrics in the RabbitMQ Overview dashboard. These examples are meant to be edited and extended as developers and operators see fit when exploring various metrics, their thresholds and behaviour.

To deploy a workload app, run docker-compose -f docker-compose-overview.yml up -d. The same command will redeploy the app after the file has been updated.

To delete all workload containers, run docker-compose -f docker-compose-overview.yml down or

gmake down

More Dashboards: Raft and Erlang Runtime

There are two more Grafana dashboards available: RabbitMQ-Raft and Erlang-Distribution. They collect and visualise metrics related to the Raft consensus algorithm (used by Quorum Queues and other features) as well as more nitty-gritty runtime metrics such as inter-node communication buffers.

The dashboards have corresponding RabbitMQ clusters and PerfTest instances, which are started and stopped in the same way as those for the Overview dashboard. Feel free to experiment with the other workloads that are included in the same docker directory.

For example, the docker-compose-dist-tls.yml Compose manifest is meant to stress the inter-node communication links. This workload uses a lot of system resources. docker-compose-qq.yml contains a quorum queue workload.

To stop and delete all containers used by the workloads, run docker-compose -f [file] down or

make down

Installation

Unlike the Quick Start above, this section covers monitoring setup geared towards production usage.

We will assume that the following tools are provisioned and running:

  • A 3-node RabbitMQ 3.8 cluster
  • Prometheus, including network connectivity with all RabbitMQ cluster nodes
  • Grafana, including configuration that lists the above Prometheus instance as one of the data sources

RabbitMQ Configuration

Cluster Name

The first step is to give the RabbitMQ cluster a descriptive name so that it can be distinguished from other clusters.

To find the current name of the cluster, use

rabbitmq-diagnostics -q cluster_status

This command can be executed against any cluster node. If the current cluster name is distinctive and appropriate, skip the rest of this paragraph. To change the name of the cluster, run the following command (the name used here is just an example):

rabbitmqctl -q set_cluster_name testing-prometheus

Enable rabbitmq_prometheus

Next, enable the rabbitmq_prometheus plugin on all nodes:

rabbitmq-plugins enable rabbitmq_prometheus

The output will look similar to this:

rabbitmq-plugins enable rabbitmq_prometheus

Enabling plugins on node rabbit@ed9618ea17c9:
rabbitmq_prometheus
The following plugins have been configured:
  rabbitmq_management_agent
  rabbitmq_prometheus
  rabbitmq_web_dispatch
Applying plugin configuration to rabbit@ed9618ea17c9...
The following plugins have been enabled:
  rabbitmq_management_agent
  rabbitmq_prometheus
  rabbitmq_web_dispatch

started 3 plugins.

To confirm that RabbitMQ now exposes metrics in Prometheus format, get the first couple of lines with curl or similar:

curl -s localhost:15692/metrics | head -n 3
# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks{node="rabbit@65f1a10aaffa",cluster="rabbit@65f1a10aaffa"} 0

Notice that RabbitMQ exposes the metrics on a dedicated TCP port, 15692 by default.
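In the Prometheus text format, lines that start with # carry metadata (TYPE and HELP), while the remaining lines are samples of the form name{labels} value. To look at a single metric, the output can be filtered. The sketch below is illustrative only: the function name metric_samples is an assumption, and the sample input is copied from the output above so that the snippet runs without a live node.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text format on standard input and
# print only the sample lines (not the # HELP / # TYPE lines) of one metric.
metric_samples() {
  grep -v '^#' | grep -E "^$1(\{| )"
}

# Sample input copied from the curl output shown above.
sample='# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks{node="rabbit@65f1a10aaffa",cluster="rabbit@65f1a10aaffa"} 0'

printf '%s\n' "$sample" | metric_samples erlang_mnesia_held_locks

# Against a live node, the equivalent would be:
# curl -s localhost:15692/metrics | metric_samples erlang_mnesia_held_locks
```

The second grep anchors on the metric name followed by either { or a space, so metrics that merely share a prefix are not matched.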

Prometheus Configuration

Once RabbitMQ is configured to expose metrics to Prometheus, Prometheus should be made aware of where it should scrape RabbitMQ metrics from. There are a number of ways of doing this. Please refer to the official Prometheus configuration documentation. There's also a first steps with Prometheus guide for beginners.

Prometheus will periodically scrape (read) metrics from the systems it monitors, every 60 seconds by default. RabbitMQ metrics are updated periodically, too, every 5 seconds by default. Since this value is configurable, check the metrics update interval by running the following command on any of the nodes:

rabbitmq-diagnostics environment | grep collect_statistics_interval
# => {collect_statistics_interval,5000}

The returned value will be in milliseconds.
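When the interval is used in scripts, it is usually easier to work with seconds. The following sketch extracts the number from a line in the format shown above and converts it. The function name stats_interval_ms is an assumption, and the input is hard-coded here so that the snippet runs without a RabbitMQ node.

```shell
#!/bin/sh
# Hypothetical helper: extract the numeric value (in milliseconds) from a
# line such as "{collect_statistics_interval,5000}" read on standard input.
stats_interval_ms() {
  sed -n 's/.*{collect_statistics_interval,\([0-9][0-9]*\)}.*/\1/p'
}

# Hard-coded sample line; on a node this would instead be:
# ms=$(rabbitmq-diagnostics environment | stats_interval_ms)
ms=$(echo '{collect_statistics_interval,5000}' | stats_interval_ms)

# Integer division converts milliseconds to whole seconds.
echo "collect_statistics_interval: $(( ms / 1000 ))s"
```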

For production systems, we recommend a minimum value of 15s for the Prometheus scrape interval and a value of 10000 (10s) for RabbitMQ's collect_statistics_interval. With these values, Prometheus doesn't scrape RabbitMQ too frequently, and RabbitMQ doesn't update metrics unnecessarily. If you configure a different value for the Prometheus scrape interval, remember to set an appropriate range when visualising metrics in Grafana with rate(): a range of 4x the scrape interval is considered safe.

If you are using RabbitMQ's Management UI default 5 second auto-refresh, you may want to keep the default collect_statistics_interval setting, which is also 5000 ms (5 seconds) for this reason.
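To make the recommended values concrete, the sketch below prints a minimal Prometheus configuration with a 15s scrape interval and derives the matching rate() range. It is an illustration under assumptions, not a definitive setup: the hostnames, the job name rabbitmq and both function names are hypothetical, and static targets are only one of several ways to tell Prometheus what to scrape.

```shell
#!/bin/sh
# Hypothetical node hostnames; replace them with the real ones.
NODES="rmq0.example.com rmq1.example.com rmq2.example.com"
# Scrape interval in seconds (15s is the recommended minimum above).
SCRAPE_INTERVAL_S=15

# Print a minimal Prometheus configuration that scrapes every node on
# the default rabbitmq_prometheus port, 15692. The job name is an assumption.
render_prometheus_config() {
  echo "global:"
  echo "  scrape_interval: ${SCRAPE_INTERVAL_S}s"
  echo "scrape_configs:"
  echo "  - job_name: rabbitmq"
  echo "    static_configs:"
  echo "      - targets:"
  for node in $NODES; do
    echo "          - '${node}:15692'"
  done
}

# Range to use with rate() in Grafana: 4x the scrape interval.
rate_window() {
  echo "$(( $1 * 4 ))s"
}

render_prometheus_config
echo "rate() range: $(rate_window "$SCRAPE_INTERVAL_S")"
```

The output of render_prometheus_config could be redirected to a file and passed to Prometheus. With a 15 second scrape interval the resulting rate() range is 60s. On the RabbitMQ side, the matching statistics interval would be set in rabbitmq.conf with the line collect_statistics_interval = 10000.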

To confirm that Prometheus is scraping RabbitMQ metrics from all nodes, ensure that all RabbitMQ endpoints are UP on the Prometheus Targets page, as shown below:

Prometheus RabbitMQ Targets

Grafana Configuration

The last component in this setup is Grafana. If this is your first time integrating Grafana with Prometheus, please follow the official integration guide.

After Grafana is integrated with the Prometheus instance that reads and stores RabbitMQ metrics, it is time to import the Grafana dashboards that Team RabbitMQ maintains. Please refer to the official Grafana tutorial on importing dashboards in Grafana.

Grafana dashboards for RabbitMQ and Erlang are open source and publicly available from the rabbitmq-prometheus GitHub repository. To import the RabbitMQ-Overview Grafana Dashboard from GitHub, navigate to the raw version and copy and paste the file contents into Grafana, then click Load, as seen below:

Grafana Import Dashboard

Repeat the process for all other Grafana dashboards that you would like to use with this RabbitMQ deployment.

Congratulations! Your RabbitMQ is now monitored with Prometheus & Grafana!

Using Prometheus with RabbitMQ 3.7

RabbitMQ versions prior to 3.8 can use a separate plugin, prometheus_rabbitmq_exporter, to expose metrics to Prometheus. The plugin uses the RabbitMQ HTTP API internally and requires visualisation to be set up separately.

Getting Help and Providing Feedback

If you have questions about the contents of this guide or any other topic related to RabbitMQ, don't hesitate to ask them on the RabbitMQ mailing list.
