
RabbitMQ Kubernetes Operator reaches 1.0

November 17th, 2020 by admin

We are pleased to announce that the RabbitMQ Operator for Kubernetes is now generally available. The RabbitMQ Operator makes it easy to provision and manage RabbitMQ clusters consistently on any certified Kubernetes distribution.  Operators inform the Kubernetes container orchestration system how to provision and control specific applications. The Kubernetes (hereafter K8s) Operator pattern is a way to extend the K8s API and state management to include the provisioning and management of custom resources -- resources not provided in a default K8s deployment. In this post, we’ll discuss how the Operator enables the K8s system to control a RabbitMQ cluster.

Where do I start?

If you are new to K8s, start by learning the basics of K8s and kubectl before attempting to use the operator. You can also visit the Kube Academy for more in-depth primers. Watch this episode of TGIR to see how easy it is to deploy and monitor a RabbitMQ cluster with the operator. Next, try the quick start guide. In a minute or two you will have the operator running and your first operator-created RabbitMQ cluster.

Want to go further? Look at the operator YAML examples for quick deployment of advanced clusters. For the entire set of supported features, look at the documentation.

The Rabbit K8s Operator

It’s recommended that you first be familiar with the basics of K8s. If you need a refresher, VMware’s Kube Academy has a comprehensive set of resources. The RabbitMQ K8s operator is composed of two building blocks:
  • Custom resource: an extension of the native K8s resources used to manage the custom state of a particular stateful platform or application. In our case, the custom resource reflects the configuration, size, and state of a RabbitMQ cluster.
  • Custom controller: a K8s controller is a non-terminating loop that regulates the state of standard K8s resources, like ReplicaSet, StatefulSet, or Deployment. The custom controller adds a non-terminating control loop with custom logic to regulate the state of the custom resource. In the case of a RabbitMQ cluster, the controller can reconcile changes to cluster and node configurations.
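
For illustration, a minimal RabbitmqCluster custom resource might look like the sketch below. The cluster name is made up, and the rabbitmq.com/v1beta1 API version and replicas field reflect the operator at the time of writing; check the operator documentation for the authoritative schema.

    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: hello-rabbit        # illustrative name
    spec:
      replicas: 3               # desired number of RabbitMQ nodes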

What’s the value of a RabbitMQ Cluster Operator?

The RabbitMQ Cluster Operator (aka the operator) is a bridge between K8s managed states and RabbitMQ configuration and state.  The RabbitMQ Cluster Operator helps to simplify two types of tasks:
  • Provisioning of new clusters
  • Post-installation tasks, managed by K8s

Provisioning of a new RabbitMQ cluster

Creating a new RabbitMQ cluster on K8s manually is a multi-step process that the operator automates:
  1. Creating the RabbitMQ configuration file as a ConfigMap, so that it can be mounted as a file by the RabbitMQ container
  2. Setting the K8s secrets required by RabbitMQ (TLS certificates, default user password, etc.)
  3. Creating a StatefulSet that will manage the cluster nodes (as pods)
  4. Creating a headless service (a service without a cluster IP) to manage RabbitMQ node discovery
  5. Creating a service for RabbitMQ clients to access the cluster
 

The operator is not the only way to automate this process, but it has the benefit of mapping all the RabbitMQ configuration into a K8s descriptor and a custom resource. This means there is a single source of truth for the cluster state, and users can manage their cluster configurations using GitOps.

By using an operator to provision RabbitMQ clusters, the user enjoys several benefits:

  • A declarative API to create a RabbitMQ cluster with any setting with a single command. The operator automates the provisioning of the complex set of K8s resources that compose the cluster: services, pods, a StatefulSet, persistent volumes, and so on.
  • The operator comes with a set of YAML examples, so in most cases little effort is needed to stand up a development or even a production-grade cluster.
  • Once the cluster is created, it has a K8s state and description you can observe to know whether your RabbitMQ cluster is ready. RabbitMQ comes with in-built support for Prometheus. Node and cluster metrics can be visualised with Grafana.
  • The operator goes further and displays status conditions for the RabbitMQ cluster (see the sketch after this list). The possible status conditions are:
    • ClusterAvailable - RabbitMQ is accessible by client apps
    • AllReplicasReady - the RabbitMQ cluster is fully available
    • ReconcileSuccess - the Custom Resource was reconciled successfully. If this is false, the user may need to intervene (for example, if TLS is enabled but the secret does not exist).
  • K8s controllers can now regulate the set of resources that compose the cluster: the RabbitMQ nodes, the routing service, the node registry service and the volumes. If any of these resources becomes unavailable according to its K8s liveness probe, K8s will heal it automatically.
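
As a hedged illustration, once a cluster exists (here assumed to be named hello-rabbit, matching the custom resource sketch above), these conditions can be inspected with standard kubectl commands:

    kubectl get rabbitmqclusters
    # the Conditions section of the describe output lists ClusterAvailable,
    # AllReplicasReady and ReconcileSuccess with their current status
    kubectl describe rabbitmqcluster hello-rabbit
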
While the operator is useful for day 1 tasks, it confers even greater benefits when we talk about day 2 operations triggered by the user or by k8s.

Post Installation Tasks

Having a RabbitMQ cluster provisioned is just the beginning of the journey as any developer and operator knows. There are various lifecycle events to take care of:
  • Scaling the cluster when the application messaging volume increases as a result of additional functionality or demand growth
  • Self-healing the cluster when a node crashes or the network breaks
  • Rotating certificates
  • Upgrading the cluster with zero downtime when a new version of RabbitMQ is needed for security patching or new features
Many of the above processes require graceful termination of a node, RabbitMQ API calls after a node starts, or more complex flows that sequence K8s lifecycle events and RabbitMQ cluster events. This is exactly where the value of the operator stands out.
  • The Cluster Operator will allow RabbitMQ users to address all of these with a simple K8s CLI declarative command.
  • The operator will automate these complex flows using both K8s building blocks and custom controller logic that takes care of the RabbitMQ administrative tasks
The operator now supports the core of day 2 RabbitMQ operations such as:
  • Reconfigurations
  • Enabling / disabling of plugins
  • Self healing 
  • Scaling (see the sketch after this list)
  • In-place upgrades
  • Certificate rotations - using rolling updates
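
For example, scaling might be as simple as changing the desired replica count on the custom resource and letting the controller reconcile the rest (a sketch; the cluster name hello-rabbit is illustrative):

    # grow the cluster from 3 to 5 nodes; the operator adds the new pods,
    # volumes and RabbitMQ nodes and joins them to the existing cluster
    kubectl patch rabbitmqcluster hello-rabbit --type merge -p '{"spec":{"replicas":5}}'
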
In the future, the operator can be upgraded to provide new flows. For example, a new version of the operator may be released with a new major version of RabbitMQ. The use of the operator pattern means we can provide specific logic to, say, upgrade the existing RabbitMQ clusters to a new major version without losing messages and without downtime. The existence of a new operator version does not force users to upgrade existing clusters; K8s will still be able to manage these clusters going forward. 

It is only a K8s operator that can automate such complex and delicate processes in such an elegant way and without risk. Here’s an example of how RabbitMQ will save users from pain and errors by automating the process of an in-place upgrade. The diagram lists the manual steps a user needs to perform in order to do a rolling upgrade of the cluster (without an operator).

Now watch as Gerhard covers more advanced topics in running RabbitMQ reliably on K8s.

But wait - there’s more

The operator comes with a kubectl plugin that provides many commands to make your life easier. As described here, you can install the kubectl plugin using krew. Some handy commands are installing the cluster operator as well as creating, listing and deleting RabbitMQ clusters. Other commands targeting a specific RabbitMQ cluster include printing the default user secret, opening the RabbitMQ management UI, enabling debug logging on all RabbitMQ nodes, and running perf-test.
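
A sketch of typical plugin usage (the cluster name my-cluster is illustrative; run kubectl rabbitmq help for the authoritative list of subcommands):

    kubectl rabbitmq install-cluster-operator
    kubectl rabbitmq create my-cluster
    kubectl rabbitmq list
    kubectl rabbitmq secrets my-cluster     # print the default user credentials
    kubectl rabbitmq manage my-cluster      # open the management UI
    kubectl rabbitmq debug my-cluster       # enable debug logging on all nodes
    kubectl rabbitmq perf-test my-cluster   # run a basic workload against the cluster
    kubectl rabbitmq delete my-cluster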

What’s next?

The Cluster Operator gives users a lot of power to create clusters and manage them. There is still a wider scope of functionality to support, depending on your requests and feedback.

Some examples are:

  • Examples of topologies using Istio for encryption/decryption of traffic between nodes as well as client traffic
  • Scaling down gracefully
  • Monitoring the operator - the operator will report metrics in Prometheus format
  • Labelling cluster resources with RMQ and operator metadata
  • Testing on additional K8s providers (currently testing on Tanzu Kubernetes Grid and GKE)
In addition, we plan to add another operator that will wrap some of RabbitMQ's APIs with K8s declarative API, allowing for creation of users, queues and exchanges.

We welcome feedback, feature requests, bug reports and any questions you have regarding RabbitMQ:

 

This Month in RabbitMQ, Aug/Sep 2020 Recap

November 6th, 2020 by Michael Klishin

This month in RabbitMQ features a blog from Michael Klishin on deploying RabbitMQ on Kubernetes. Also this month: RabbitMQ consumers on AWS, a three-part series on developing microservices with Lumen and RabbitMQ, and several articles on RabbitMQ and ASP.NET Core.

Highlights and Updates

Community Writings and Resources

Learn More

Ready to learn more? Check out these upcoming opportunities to learn more about RabbitMQ:

This Month in RabbitMQ, July 2020 Recap

August 31st, 2020 by Michael Klishin

It’s not the holidays yet, but the RabbitMQ community has presents for you anyway! The RabbitMQ Kubernetes cluster operator is now open-sourced and developed in the open in GitHub. Also, Gavin Roy has a new Python app that migrates queues between types. Finally, a webinar on RabbitMQ consumers from Ayanda Dube, Head of RabbitMQ Engineering at Erlang Solutions.

Highlights and Updates

Project updates

Community Writings and Resources

Learn More

Ready to learn more? Check out these upcoming opportunities to learn more about RabbitMQ:

Udemy is running a special training on RabbitMQ: 12 courses at $12.99 each. Highlights:

  • Learn RabbitMQ: Asynchronous Messaging with Java and Spring
  • Getting Started .NET Core Microservices RabbitMQ
  • RabbitMQ & Java (Spring Boot) for System Integration

In upcoming webinars: RabbitMQ consumers under-the-hood. Presented by Ayanda Dube, Head of RabbitMQ Engineering at Erlang Solutions.

Deploying RabbitMQ to Kubernetes: What’s Involved?

August 10th, 2020 by Michael Klishin

Introduction

Over time, we have seen the number of Kubernetes-related queries on our community mailing list and Slack channels soar. In this post we'd like to explain the basics of a DIY deployment of RabbitMQ on Kubernetes: what Kubernetes resources will be necessary, how to make sure RabbitMQ nodes use durable storage, how to approach configuration of sensitive values, and so on.

Deploying a stateful data service such as RabbitMQ to Kubernetes is a bit like assembling a jigsaw puzzle. There are multiple pieces involved: a namespace and RBAC rules, a Stateful Set, services for inter-node and client traffic, durable storage, secrets, and node configuration.

In this post, we will try to cover the key parts as well as mention a couple more steps that are not technically required to run RabbitMQ on Kubernetes, but every production system operator will have to worry about sooner rather than later:

  • How to set up cluster monitoring with Prometheus and Grafana
  • How to deploy a PerfTest instance to do basic functional and load testing of the cluster

This post by no means covers every aspect that may be relevant when deploying RabbitMQ to Kubernetes; our goal is to highlight the most important parts. Deployment- and workload-specific decisions such as what resource limits to apply to RabbitMQ node pod (containers), what kind of durable storage to use, how to approach TLS certificate/key pair rotation, log aggregation, and upgrades are great topics for separate blog posts. Let us know what you'd like to see in a follow-up!

Executable Examples

The files that accompany this post can be found in the DIY RabbitMQ on Kubernetes example repository. This post uses a Google Kubernetes Engine (GKE) cluster but Kubernetes concepts are universal.

To follow along with the examples, you will need access to a Kubernetes cluster and the kubectl CLI.

This post assumes that the reader is familiar with kubectl usage basics and that the tool is set up to work with a GKE cluster.

RabbitMQ Docker Image

We recommend using the community RabbitMQ Docker image. The image is maintained by the Docker Community and is built with the latest versions of RabbitMQ, Erlang and OpenSSL. The image has a variant built with RabbitMQ release candidates for early testing and adoption.

Now let's begin with the first building block of a RabbitMQ cluster running on Kubernetes: picking a namespace to deploy to.

Kubernetes Namespace and Permissions (RBAC)

Every set of Kubernetes objects belongs to a Kubernetes Namespace. RabbitMQ cluster resources are no exception.

We recommend using a dedicated Namespace to keep the RabbitMQ cluster separate from other services that may be deployed in the Kubernetes cluster. Having a dedicated namespace makes logical sense and makes it easy to grant just enough permissions to the cluster nodes. This is a good security practice.

RabbitMQ's Kubernetes peer discovery plugin relies on the Kubernetes API as a data source. On first boot, every node will try to discover their peers using the Kubernetes API and attempt to join them. Nodes that finish booting emit a Kubernetes event to make it easier to discover such events in cluster activity (event) logs.

The plugin requires the following access to Kubernetes resources:

  • get access to the endpoints resource
  • create access to the events resource

Specify a Role, Role Binding and a Service Account to configure this access.

An example namespace, along with RBAC rules, can be seen in the rbac.yaml example file.
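
A rough sketch of what such a file contains (resource names here are illustrative; see rbac.yaml in the example repository for the actual definitions):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: test-rabbitmq
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: rabbitmq
      namespace: test-rabbitmq
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: rabbitmq-peer-discovery
      namespace: test-rabbitmq
    rules:
    - apiGroups: [""]
      resources: ["endpoints"]
      verbs: ["get"]            # peer discovery reads the endpoints resource
    - apiGroups: [""]
      resources: ["events"]
      verbs: ["create"]         # nodes emit an event when they finish booting
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: rabbitmq-peer-discovery
      namespace: test-rabbitmq
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: rabbitmq-peer-discovery
    subjects:
    - kind: ServiceAccount
      name: rabbitmq
      namespace: test-rabbitmq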

If following from the example, use the following command to create a namespace and the required RBAC rules. Note that this creates a namespace called test-rabbitmq.
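
    # apply the example file (path assumes the example repository layout)
    kubectl apply -f rbac.yaml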

The kubectl examples below will use the test-rabbitmq namespace. This namespace can be set to be the default one for convenience:
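
    # make test-rabbitmq the default namespace for the current kubectl context
    kubectl config set-context --current --namespace=test-rabbitmq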

Alternatively, --namespace="test-rabbitmq" can be appended to all kubectl commands demonstrated below.

Use a Stateful Set

RabbitMQ requires using a Stateful Set to deploy a RabbitMQ cluster to Kubernetes. The Stateful Set ensures that the RabbitMQ nodes are deployed in order, one at a time. This avoids running into a potential peer discovery race condition when deploying a multi-node RabbitMQ cluster.

There are other, equally important reasons for using a Stateful Set instead of a Deployment: sticky identity, simple network identifiers, stable persistent storage and the ability to perform ordered rolling upgrades.

The Stateful Set definition file is packed with detail such as mounting configuration, mounting credentials, opening ports, etc, which is explained topic-wise in the following sections.

The final Stateful Set file can be found under the gke directory.

Create a Service For Clustering and CLI Tools

The Stateful Set definition can reference a Service which gives the Pods of the Stateful Set their network identity. Here, we are referring to the v1.StatefulSet.Spec.serviceName property.

This is required by RabbitMQ for clustering, and as mentioned in the Kubernetes documentation, has to be created before the Stateful Set.

RabbitMQ uses port 4369 for node discovery and port 25672 for inter-node communication. Since this Service is used internally and does not need to be exposed, we create a Headless Service. It can be found in the example headless-service.yaml file.
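
A sketch of such a Service (the name and label selector are illustrative; the example file is authoritative):

    apiVersion: v1
    kind: Service
    metadata:
      name: rabbitmq-headless
      namespace: test-rabbitmq
    spec:
      clusterIP: None              # headless: no cluster IP is allocated
      selector:
        app: rabbitmq              # must match the Stateful Set pod labels
      ports:
      - name: epmd
        port: 4369                 # node discovery
      - name: cluster-links
        port: 25672                # inter-node and CLI tool communication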

If following from the example, run the following to create a Headless Service for inter-node and CLI tool traffic:
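
    # create the headless Service from the example file
    kubectl apply -f headless-service.yaml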

The service now can be observed in the test-rabbitmq namespace:
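
    # list Services in the namespace (output omitted here)
    kubectl get svc --namespace=test-rabbitmq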

Use a Persistent Volume for Node Data

In order for RabbitMQ nodes to retain data between Pod restarts, the node's data directory must use durable storage. A Persistent Volume must be attached to each RabbitMQ Pod.

If a transient volume is used to back a RabbitMQ node, the node will lose its identity and all of its local data in case of a restart. This includes both schema and durable queue data. Syncing all of this data on every node restart would be highly inefficient. In case of a loss of quorum during a rolling restart, this will also lead to data loss.

In our statefulset.yaml example, we create a Persistent Volume Claim to provision a Persistent Volume.

The Persistent Volume is mounted at /var/lib/rabbitmq/mnesia. This path is used for a RABBITMQ_MNESIA_BASE location: the base directory for all persistent data of a node.
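
An illustrative excerpt of how this can look in a Stateful Set definition (the claim name and storage size are assumptions; see the example file for the real values):

    volumeClaimTemplates:
    - metadata:
        name: rabbitmq-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi          # size is workload-specific
    # ...and inside the RabbitMQ container spec:
    #   volumeMounts:
    #   - name: rabbitmq-data
    #     mountPath: /var/lib/rabbitmq/mnesia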

A description of default file paths for RabbitMQ can be found in the RabbitMQ documentation.

A node's data directory base can be changed using the RABBITMQ_MNESIA_BASE variable if needed. Make sure to mount a Persistent Volume at the updated path.

Node Authentication Secret: the Erlang Cookie

RabbitMQ nodes and CLI tools use a shared secret known as the Erlang cookie to authenticate to each other. The cookie value is a string of alphanumeric characters up to 255 characters in size. The value must be generated before creating a RabbitMQ cluster since it is needed by the nodes to form a cluster.

With the community Docker image, RabbitMQ nodes will expect the cookie to be at /var/lib/rabbitmq/.erlang.cookie. We recommend creating a Secret and mounting it as a Volume on the Pods at this path.

This is demonstrated in the statefulset.yaml example file.

The secret is expected to have the following key/value pair:

To create a cookie Secret, run
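
    # a sketch: the Secret name is illustrative, and the cookie value is assumed
    # to have been generated into a local file named "cookie"
    kubectl create secret generic erlang-cookie --from-file=./cookie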

This will create a Secret with a single key, cookie, taken from the file name, and the file contents as its value.

Administrator Credentials

RabbitMQ will seed a default user with well-known credentials on first boot. The username and password of this user are both guest.

This default user can only connect from localhost by default. It is possible to lift this restriction by opting in. This may be useful for testing but is very insecure. Instead, an administrative user must be created using generated credentials.

The administrative user credentials should be stored in a Kubernetes Secret and mounted onto the RabbitMQ Pods. The RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS environment variables can then be set to the Secret values. The community Docker image will use them to override the default user credentials.

Example for reference.

The secret is expected to have the following key/value pair:

To create an administrative user Secret, use
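
    # a sketch: assumes local files named "user" and "pass" holding the generated credentials
    kubectl create secret generic rabbitmq-admin --from-file=./user --from-file=./pass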

This will create a Secret with two keys, user and pass, taken from the file names, and file contents as their respective values.

Users can also be created explicitly using CLI tools. See the RabbitMQ doc section on user management to learn more.

Node Configuration

There are several ways to configure a RabbitMQ node. The recommended way is to use configuration files.

Configuration files can be expressed as Config Maps, and mounted as a Volume onto the RabbitMQ pods.

To create a Config Map with RabbitMQ configuration, apply our minimal configmap.yaml example:
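
    # apply the example file; it defines a ConfigMap whose rabbitmq.conf data key
    # holds the node configuration (peer discovery settings and similar)
    kubectl apply -f configmap.yaml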

Use an Init Container

Since Kubernetes 1.9.4, Config Maps are mounted as read-only volumes onto Pods. This is problematic for the RabbitMQ community Docker image: the image can try to update the config file at the time of container startup.

Thus, the path at which the RabbitMQ config is mounted must be read-write. If a read-only file is detected by the Docker image, you'll see the following warning:

While the Docker image does work around the issue, it is not ideal to store the configuration file in /tmp and we recommend instead making the mount path read-write.

Like a few other projects in the Kubernetes community, we use an init container to overcome this.

Examples:
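
A hedged sketch of the idea (volume and container names are illustrative): an init container copies rabbitmq.conf from the read-only ConfigMap volume into a writable emptyDir volume that the RabbitMQ container mounts at /etc/rabbitmq.

    initContainers:
    - name: copy-rabbitmq-config
      image: busybox
      command: ["sh", "-c", "cp /tmp/rabbitmq/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf"]
      volumeMounts:
      - name: rabbitmq-config          # the ConfigMap volume (mounted read-only)
        mountPath: /tmp/rabbitmq
      - name: rabbitmq-config-rw       # an emptyDir shared with the RabbitMQ container
        mountPath: /etc/rabbitmq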

Run The Pod As the rabbitmq User

The Docker image runs as the rabbitmq user with uid 999 and writes to the rabbitmq.conf file. Thus, the file permissions on rabbitmq.conf must allow this. A Pod Security Context can be added to the Stateful Set definition to achieve this. Set the runAsUser, runAsGroup and the fsGroup to 999 in the Security Context.

See Security Context in the Stateful Set definition file.
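
A minimal sketch of such a Security Context, using the uid mentioned above:

    securityContext:
      runAsUser: 999     # the rabbitmq user in the community Docker image
      runAsGroup: 999
      fsGroup: 999       # makes mounted volumes writable for gid 999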

Importing Definitions

RabbitMQ nodes can import definitions exported from another RabbitMQ cluster. This may also be done at node boot time.

Following from the RabbitMQ documentation, this can be done using the following steps:

  1. Export definitions from the RabbitMQ cluster you wish to replicate and save the file
  2. Create a Config Map with the key being the file name, and the value being the contents of the file (See the rabbitmq.conf Config Map example)
  3. Mount the Config Map as a Volume on the RabbitMQ Pod in the Stateful Set definition
  4. Update the rabbitmq.conf Config Map with load_definitions = /path/to/definitions/file

Readiness Probe

Kubernetes uses a check known as the readiness probe to determine if a pod is ready to serve client traffic. This is effectively a specialized health check defined by the system operator.

When an ordered pod deployment policy is used — and this is the recommended option for RabbitMQ clusters — the probe controls when the Kubernetes controller will consider the currently deployed pod to be ready and proceed to deploy the next one. This check, if not chosen appropriately, can deadlock a rolling cluster node restart.

RabbitMQ nodes that belong to a cluster will attempt to sync schema from their peers on startup. If no peer comes online within a configurable time window (five minutes by default), the node will give up and voluntarily stop. Before the sync is complete, the node won't mark itself as fully booted.

Therefore, if a readiness probe assumes that a node is fully booted and running, a rolling restart of RabbitMQ node pods using such a probe will deadlock: the probe never succeeds for the restarted pod, so the controller never proceeds to deploy the next pod, even though that pod must come online for the first one to finish booting and be considered ready.

It is therefore recommended to use a very basic RabbitMQ health check for readiness probe:
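
    # a sketch of a basic probe; rabbitmq-diagnostics ping only checks that the
    # node's OS process is up and responds locally, so it does not require the
    # node to have fully booted (timings here are illustrative)
    readinessProbe:
      exec:
        command: ["rabbitmq-diagnostics", "-q", "ping"]
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 10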

While this check is not thorough, it allows all pods to be started and re-join the cluster within a certain time period, even when pods are restarted one by one, in order.

This is covered in a dedicated section of the RabbitMQ clustering guide: Restarts and Health Checks (Readiness Probes).

The readiness probe section in the Stateful Set definition file demonstrates how to configure a readiness probe.

Liveness Probe

Similarly to the readiness probe described above, Kubernetes allows for pod health checks using a different health check called the liveness probe. The check determines if a pod must be restarted.

As with all health checks, there is no single solution that can be recommended for all deployments. Health checks can produce false positives, which means reasonably healthy, operational nodes will be restarted or even destroyed and re-created for no reason, reducing system availability.

Moreover, a RabbitMQ node restart won't necessarily address the issue. For example, restarting a node that is in an alarmed state because it is low on available disk space won't help.

All this is to say that liveness probes must be chosen wisely and with false positives and "recoverability by a restart" taken into account. Liveness probes also must use node-local health checks instead of cluster-wide ones.

RabbitMQ CLI tools provide a number of pre-defined health checks that vary in how thorough they are, how intrusive they are and how likely they are to produce false positives in different scenarios, e.g. when the system is under load. The checks are composable and can be combined. The right liveness probe choice is a system-specific decision. When in doubt, start with a simpler, less intrusive and less thorough option such as
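
    # a node-local, low-cost check that does not depend on the cluster being formed
    rabbitmq-diagnostics -q ping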

The following checks can be reasonable liveness probe candidates:

Note, however, that they will fail for nodes paused by the "pause minority" partition handling strategy.

The liveness probe section in the Stateful Set definition file demonstrates how to configure a liveness probe.

Plugins

RabbitMQ supports plugins. Some plugins are essential when running RabbitMQ on Kubernetes, e.g. the Kubernetes-specific peer discovery implementation.

The rabbitmq_peer_discovery_k8s plugin is required to deploy RabbitMQ on Kubernetes. It is quite common to also enable rabbitmq_management plugin in order to get a browser-based management UI and an HTTP API, and rabbitmq_prometheus for monitoring.

Plugins can be enabled in different ways. We recommend mounting the plugins file, enabled_plugins, to the node configuration directory, /etc/rabbitmq. A Config Map can be used to express the value of the enabled_plugins file. It can then be mounted as a Volume onto each RabbitMQ container in the Stateful Set definition.

In our configmap.yaml example file, we demonstrate how to populate the enabled_plugins file and mount it under the /etc/rabbitmq directory.
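
For illustration, the data key might look roughly like this (the ConfigMap name is illustrative; enabled_plugins uses the Erlang term format, terminated by a period):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rabbitmq-config
      namespace: test-rabbitmq
    data:
      enabled_plugins: |
        [rabbitmq_peer_discovery_k8s,rabbitmq_management,rabbitmq_prometheus].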

Ports

The final consideration for the Stateful Set is the ports to open on the RabbitMQ Pods. Protocols supported by RabbitMQ are all TCP-based and require the protocol ports to be opened on the RabbitMQ nodes. Depending on the plugins that are enabled on a node, the list of required ports can vary.

The example enabled_plugins file mentioned above enables a few plugins: rabbitmq_peer_discovery_k8s (mandatory), rabbitmq_management and rabbitmq_prometheus. Therefore, the service must open several ports relevant for the core server and the enabled plugins:

  • 5672: used by AMQP 0-9-1 and AMQP 1.0 clients
  • 15672: management UI and HTTP API
  • 15692: Prometheus scraping endpoint
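
An illustrative excerpt of the corresponding container ports in the Stateful Set (4369 and 25672 are exposed via the headless Service described earlier):

    ports:
    - containerPort: 5672      # AMQP 0-9-1 and AMQP 1.0 clients
    - containerPort: 15672     # management UI and HTTP API
    - containerPort: 15692     # Prometheus scraping endpoint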

Deploy the Stateful Set

These are the key components in the Stateful Set file. Please have a look at the file, and if following from the example, deploy the Stateful Set:
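
    # create the Stateful Set from the example file
    kubectl apply -f statefulset.yaml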

This will start spinning up a RabbitMQ cluster. To watch the progress:
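
    # watch pods come up one at a time (Ctrl-C to stop watching)
    kubectl get pods --namespace=test-rabbitmq --watch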

Create a Service for Client Connections

If all the steps above succeeded, you should have a functioning RabbitMQ cluster deployed on Kubernetes! However, having a RabbitMQ cluster on Kubernetes is only useful if clients can connect to it.

Time to create a Service to make the cluster accessible to client connections.

The type of the Service depends on your use case. The Kubernetes API reference gives a good overview of the types of Services available.

In the client-service.yaml example file, we have gone with a LoadBalancer Service. This gives us an external IP that can be used to access the RabbitMQ cluster.

For example, this should make it possible to visit the RabbitMQ management UI by visiting {external-ip}:15672, and signing in. Client applications can connect to endpoints such as {external-ip}:5672 (AMQP 0-9-1, AMQP 1.0) or {external-ip}:1883 (MQTT). Please refer to the get started guide to learn how to use RabbitMQ.

If following from the example, run
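
    # create the client-facing Service from the example file
    kubectl apply -f client-service.yaml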

to create a Service of type LoadBalancer with an external IP address. To find out what the external IP address is, use kubectl get svc:

Resource Usage and Limits

Container resource management is a topic that deserves its own post. Capacity planning recommendations are entirely workload-, environment- and system-specific. Optimal values are usually found via extensive monitoring of the system, trial, and error. However, when picking the limits and resource allocation settings, consider a few RabbitMQ-specific things.

Use the Latest Major Erlang Release

RabbitMQ runs on the Erlang runtime. Recent Erlang/OTP releases have introduced a number of improvements highly relevant to the users who run RabbitMQ on Kubernetes:

  • In Erlang 22, inter-node communication latency and head-of-line blocking (see http://blog.erlang.org/OTP-22-Highlights/) have been significantly reduced. In earlier versions, link congestion was known to make cluster node heartbeat false positives likely.
  • In Erlang 23, the runtime will respect the container CPU quotas when computing the default number of schedulers to start. This means that nodes will respect the Kubernetes-managed CPU resource limits.

The Docker community image for RabbitMQ ships with Erlang 23 at the time of writing. Users of custom Docker images are highly recommended to provision Erlang 23 as well.

CPU Resource Usage

RabbitMQ was designed for workloads that involve multiple queues and where a node serves multiple clients at the same time. Nodes will generally use all the CPU cores allowed without any explicit configuration. As the number of cores grows, some tuning may be necessary to reduce CPU context switching.

How CPU time is spent can be monitored via the runtime thread activity metrics which are also exposed via the RabbitMQ Prometheus plugin.

If RabbitMQ pods hover around their CPU resource allowance and experience throttling in environments with a large number of relatively idle clients, the load likely can be reduced with a modest amount of configuration.

Memory Limits

RabbitMQ uses the concept of a runtime memory high watermark. By default a node will use 40% of detected (available) memory as the watermark. When the watermark is crossed, publishers across the entire cluster will be blocked and more aggressive paging out to disk will be initiated. The watermark value may seem like a memory quota on Kubernetes at first, but there is an important difference: RabbitMQ resource alarms assume a node can typically recover from this state. For example, a large backlog of messages will eventually be consumed.

Kubernetes memory limits are enforced by the OOM killer: no recovery is expected. This means that a RabbitMQ node's high memory watermark must be lower than the memory limit imposed on the node container. Kubernetes deployments should use the relative watermark values in the recommended range.

Memory usage breakdown data should be used to determine what consumes most memory on the node.

Disk Usage

We highly recommend overprovisioning the disk space available to RabbitMQ containers. A node that has run out of disk space won't always be able to recover from such an event. Such nodes must be decommissioned and replaced.

Consider Available Network Link Bandwidth

Finally, consider what kind of links and Kubernetes networking options are used for inter-node communication. Network link congestion can be a significant limiting factor to system throughput and affect its availability.

Below is a very simplistic formula to estimate the amount of bandwidth needed by a workload, in bits per second: multiply the expected message rate by the average message size in bytes and by 8 bits per byte, per direction of traffic.

Therefore a workload with an average message size of 3 kiB and an expected peak message rate of 20K messages a second can consume up to roughly 20,000 × 3 kiB × 8 ≈ 0.5 Gbit/s of bandwidth in each direction, before accounting for fan-out, replication and protocol overhead.

Team RabbitMQ maintains a Grafana dashboard for inter-node communication link metrics.

Using rabbitmq-perf-test to Run a Functional and Load Test of the Cluster

RabbitMQ comes with a load simulation tool, PerfTest, which can be executed from outside of a cluster or deployed to Kubernetes using the perf-test public Docker image. Here's an example of how the image can be deployed to a Kubernetes cluster:
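
    # a sketch: {username}, {password} and {service} are placeholders explained below;
    # pivotalrabbitmq/perf-test is the public PerfTest image
    kubectl run perf-test --image=pivotalrabbitmq/perf-test -- --uri "amqp://{username}:{password}@{service}"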

Here the {username} and {password} are the user credentials, e.g. those set up in the rabbitmq-admin Secret. The {service} is the hostname to connect to. We use the name of the client service, which will resolve as a hostname when deployed.

The above kubectl run command will start a PerfTest pod, which can be observed in the output of kubectl get pods.

For a functioning RabbitMQ cluster, running kubectl logs -f {perf-test-pod-name} where {perf-test-pod-name} is the name of the pod as reported by kubectl get pods, will produce output similar to this:

To learn more about PerfTest, its settings, capabilities and output, see the PerfTest doc guide.

PerfTest is not meant to be running permanently. To tear down the perf-test pod, use
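
    # the kubectl run command above created a bare pod named perf-test
    kubectl delete pod perf-test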

Monitoring the Cluster

Monitoring is a critically important part of any production deployment.

RabbitMQ comes with in-built support for Prometheus. To enable it, enable the rabbitmq_prometheus plugin. This in turn can be done by adding rabbitmq_prometheus to the enabled_plugins Config Map as explained above.

The Prometheus scraping port, 15692, must be open on both the Pod and the client Service.

Node and cluster metrics can be visualised with Grafana.

Alternative Option: the Kubernetes Cluster Operator for RabbitMQ

As this post demonstrates, there are quite a few parts involved in hosting a stateful data service such as RabbitMQ on Kubernetes. It may seem like a daunting task. There are several alternatives to the kind of DIY deployment demonstrated in this post.

Team RabbitMQ at VMware has open sourced a Kubernetes Operator pattern implementation for RabbitMQ. As of August 2020, this is a young project under active development. While it currently has limitations, it is our recommended option over the manual DIY setup demonstrated in this post.

See RabbitMQ Cluster Operator for Kubernetes to learn more. The project is developed in the open at rabbitmq/cluster-operator on GitHub. Give it a try and let us know how it goes. Besides GitHub, two great venues for providing feedback to the team behind the Operator are the RabbitMQ mailing list and the #kubernetes channel in RabbitMQ community Slack.

This Month in RabbitMQ, June 2020 Recap

July 30th, 2020 by Michael Klishin

This month in RabbitMQ features the release of the RabbitMQ Cluster Kubernetes Operator, benchmarks and cluster sizing case studies by Jack Vanlightly (@vanlightly), and a write up of RabbitMQ cluster migration by Tobias Schoknecht (@tobischo), plus lots of other tutorials by our vibrant community!

Project Updates

Community Writings and Resources

Learn More

Ready to learn more? Check out these upcoming opportunities to learn more about RabbitMQ

  • Udemy is offering “Learn RabbitMQ: In-Depth Concepts from Scratch with Demos” for $13.99, an 85% discount. You can’t afford not to learn RabbitMQ at this price!

Disaster Recovery and High Availability 101

July 7th, 2020 by Jack Vanlightly

In this post I am going to cover perhaps the most commonly asked question I have received regarding RabbitMQ in the enterprise.

How can I make RabbitMQ highly available and what architectures/practices are recommended for disaster recovery?
RabbitMQ offers features to support high availability and disaster recovery but before we dive straight in I’d like to prepare the ground a little. First I want to go over Business Continuity Planning and frame our requirements in those terms. From there we need to set some expectations about what is possible. There are fundamental laws such as the speed of light and the CAP theorem which both have serious impacts on what kind of DR/HA solution we decide to go with.

Finally we’ll look at the RabbitMQ features available to us and their pros/cons.

Read the rest of this entry »

This Month in RabbitMQ, May 2020 Recap

June 30th, 2020 by Michael Klishin

This month, Jack Vanlightly continues his blog series on Quorum Queues in RabbitMQ. Also, be sure to watch the replay of his related webinar.

Finally, Episode 5 of TGI RabbitMQ is out -- Gerhard Lazu walks us through how to run RabbitMQ on Kubernetes. Don’t miss!

Project Updates

  • RabbitMQ 3.8.4 was released in late May, the first release to feature Erlang 23 compatibility. Three weeks later 3.8.5 followed with complete Erlang 23 support.
  • The Docker community-maintained RabbitMQ image adopted Erlang 23 less than two weeks after its release
  • rabbit-hole, the most popular Go RabbitMQ HTTP API client, has reached version 2.2.0
  • Merged an impressive pull request from GitHub user @joseliber that fixed the generation of password-encrypted certificates in the tls-gen project. This project is used by RabbitMQ, its client libraries, and other projects to easily generate self-signed certificates.

    Read the rest of this entry »

How quorum queues deliver locally while still offering ordering guarantees

June 23rd, 2020 by Jack Vanlightly

The team was recently asked about whether and how quorum queues can offer the same message ordering guarantees as classic queues given that they will deliver messages from a local queue replica (leader or follower) when possible. Mirrored queues always deliver from the master (the leader), so delivering from any queue replica sounds like it could impact those guarantees. 

That is the subject of this post. Be warned, this post is a technical deep dive for the curious and the distributed systems enthusiast. We’ll take a look at how quorum queues can deliver messages from any queue replica, leader or follower, without additional coordination (beyond Raft) while maintaining message ordering guarantees.

Read the rest of this entry »

Cluster Sizing Case Study – Quorum Queues Part 2

June 18th, 2020 by Jack Vanlightly

In the last post we started a sizing analysis of our workload using quorum queues. We focused on the happy scenario in which consumers keep up, meaning that there are no queue backlogs and all brokers in the cluster are operating normally. By running a series of benchmarks modelling our workload at different intensities, we identified the top 5 cluster size and storage volume combinations in terms of cost per 1000 msg/s per month.

  1. Cluster: 7 nodes, 8 vCPUs (c5.2xlarge), gp2 SSD. Cost: $54
  2. Cluster: 9 nodes, 8 vCPUs (c5.2xlarge), gp2 SSD. Cost: $69
  3. Cluster: 5 nodes, 8 vCPUs (c5.2xlarge), st1 HDD. Cost: $93
  4. Cluster: 5 nodes, 16 vCPUs (c5.4xlarge), gp2 SSD. Cost: $98
  5. Cluster: 7 nodes, 16 vCPUs (c5.4xlarge), gp2 SSD. Cost: $107
 

There are more tests to run to ensure these clusters can handle adverse conditions such as broker failures and large backlogs accumulating during outages or system slowdowns.

All quorum queues are declared with the following properties:

  • x-quorum-initial-group-size=3
  • x-max-in-memory-length=0
 

The x-max-in-memory-length property forces the quorum queue to remove message bodies from memory as soon as it is safe to do so. You can set it to a higher limit; zero is the most aggressive setting, designed to avoid large memory growth at the cost of more disk reads when consumers do not keep up. Without this property, message bodies are kept in memory at all times, which can drive memory growth to the point of triggering memory alarms, severely impacting the publish rate - something we want to avoid in this workload case study.

Read the rest of this entry »

Cluster Sizing Case Study – Quorum Queues Part 1

June 18th, 2020 by Jack Vanlightly

In the first post in this sizing series we covered the workload, the tests, and the cluster and storage volume configurations on AWS EC2. In this post we'll run a sizing analysis with quorum queues. We also ran a sizing analysis on mirrored queues.

In this post we'll run the increasing intensity tests that will measure our candidate cluster sizes at varying publish rates, under ideal conditions. In the next post we'll run resiliency tests that measure whether our clusters can handle our target peak load under adverse conditions.

All quorum queues are declared with the following properties:

  • x-quorum-initial-group-size=3 (replication factor)
  • x-max-in-memory-length=0
 

The x-max-in-memory-length property forces the quorum queue to remove message bodies from memory as soon as it is safe to do so. You can set it to a higher limit; zero is the most aggressive setting, designed to avoid large memory growth at the cost of more disk reads when consumers do not keep up. Without this property, message bodies are kept in memory at all times, which can drive memory growth to the point of triggering memory alarms, severely impacting the publish rate - something we want to avoid in this workload case study.

Read the rest of this entry »