Archive for the ‘Introductory’ Category

Deploying RabbitMQ to Kubernetes: What’s Involved?

Monday, August 10th, 2020


Over time, we have seen the number of Kubernetes-related queries on our community mailing list and Slack channels soar. In this post we'd like to explain the basics of a DIY deployment of RabbitMQ on Kubernetes: what Kubernetes resources will be necessary, how to make sure RabbitMQ nodes use durable storage, how to approach configuration of sensitive values, and so on.

Deploying a statful data service such as RabbitMQ to Kubernetes is a bit like assembling a jigsaw puzzle. There are multiple pieces involved:

In this post, we will try to cover the key parts as well as mention a couple more steps that are not technically required to run RabbitMQ on Kubernetes, but every production system operator will have to worry about sooner rather than later:

  • How to set up cluster monitoring with Prometheus and Grafana
  • How to deploy a PerfTest instance to do basic functional and load testing of the cluster

This post by no means covers every aspect that may be relevant when deploying RabbitMQ to Kubernetes; our goal is to highlight the most important parts. Deployment- and workload-specific decisions such as what resource limits to apply to RabbitMQ node pod (containers), what kind of durable storage to use, how to approach TLS certificate/key pair rotation, log aggregation, and upgrades are great topics for separate blog posts. Let us know what you'd like to see in a follow-up!

Executable Examples

The files that accompany this post can be found in the DIY RabbitMQ on Kubernetes example repository. This post uses a Google Kubernetes Engine (GKE) cluster but Kubernetes concepts are universal.

To follow along the examples,

This post assumes that the reader is familiar with kubectl usage basics and the tool is set up to work with a GKE cluster.

RabbitMQ Docker Image

We recommend using the community RabbitMQ Docker image. The image is maintained by the Docker Community and is built with the latest versions of RabbitMQ, Erlang and OpenSSL. The image has a variant built with RabbitMQ release candidates for early testing and adoption.

Now let's begin with the first building block of a RabbitMQ cluster running on Kubernetes: picking a namespace to deploy to.

Kubernetes Namespace and Permissions (RBAC)

Every set of Kubernetes objects belongs to a Kubernetes Namespace. RabbitMQ cluster resources are no exception.

We recommend using a dedicated Namespace to keep the RabbitMQ cluster separate from other services that may be deployed in the Kubernetes cluster. Having a dedicated namespace makes logical sense and makes it easy to grant just enough permissions to the cluster nodes. This is a good security practice.

RabbitMQ's Kubernetes peer discovery plugin relies on the Kubernetes API as a data source. On first boot, every node will try to discover their peers using the Kubernetes API and attempt to join them. Nodes that finish booting emit a Kubernetes event to make it easier to discover such events in cluster activity (event) logs.

The plugin requires the following access to Kubernetes resources:

  • get access to the endpoints resource
  • create access to the events resource

Specify a Role, Role Binding and a Service Account to configure this access.

An example namespace, along with RBAC rules can be seen in the rbac.yaml example file.

If following from the example, use the following command to create a namespace and the required RBAC rules. Note that this creates a namespace called test-rabbitmq.

The kubectl examples below will use the test-rabbitmq namespace. This namespace can be set to be the default one for convenience:

Alternatively, --namespace="test-rabbitmq" can be appended to all kubectl commands demonstrated below.

Use a Stateful Set

RabbitMQ requires using a Stateful Set to deploy a RabbitMQ cluster to Kubernetes. The Stateful Set ensures that the RabbitMQ nodes are deployed in order, one at a time. This avoids running into a potential peer discovery race condition when deploying a multi-node RabbitMQ cluster.

There are other, equally important reasons for using a Stateful Set instead of a Deployment: sticky identity, simple network identifiers, stable persistent storage and the ability to perform ordered rolling upgrades.

The Stateful Set definition file is packed with detail such as mounting configuration, mounting credentials, opening ports, etc, which is explained topic-wise in the following sections.

The final Stateful Set file can be found in the under gke directory.

Create a Service For Clustering and CLI Tools

The Stateful Set definition can reference a Service which gives the Pods of the Stateful Set their network identity. Here, we are referring to the v1.StatefulSet.Spec.serviceName property.

This is required by RabbitMQ for clustering, and as mentioned in the Kubernetes documentation, has to be created before the Stateful Set.

RabbitMQ uses port 4369 for port 4369 for node discovery and port 25672 for inter-node communication. Since this Service is used internally and does not need to be exposed, we create a Headless Service. It can be found in the example headless-service.yaml file.

If following from the example, run the following to create a Headless Service for inter-node and CLI tool traffic:

The service now can be observed in the test-rabbitmq namespace:

Use a Persistent Volume for Node Data

In order for RabbitMQ nodes to retain data between Pod restarts, node's data directory must use durable storage. A Persistent Volume must be attached to each RabbitMQ Pod.

If a transient volume is used to back a RabbitMQ node, the node will lose its identity and all of its local data in case of a restart. This includes both schema and durable queue data. Syncing all of this data on every node restart would be highly inefficient. In case of a loss of quorum during a rolling restart, this will also lead to data loss.

In our statefulset.yaml example, we create a Persistent Volume Claim to provision a Persistent Volume.

The Persistent Volume is mounted at /var/lib/rabbitmq/mnesia. This path is used for a RABBITMQ_MNESIA_BASE location: the base directory for all persistent data of a node.

A description of default file paths for RabbitMQ can be found in the RabbitMQ documentation.

Node's data directory base can be changed using the RABBITMQ_MNESIA_BASE variable if needed. Make sure to mount a Persistent Volume at the updated path.

Node Authentication Secret: the Erlang Cookie

RabbitMQ nodes and CLI tools use a shared secret known as the Erlang Cookie, to authenticate to each other. The cookie value is a string of alphanumeric characters up to 255 characters in size. The value must be generated before creating a RabbitMQ cluster since it is needed by the nodes to form a cluster.

With the community Docker image, RabbitMQ nodes will expect the cookie to be at /var/lib/rabbitmq/.erlang.cookie. We recommend creating a Secret and mounting it as a Volume on the Pods at this path.

This is demonstrated in the statefulset.yaml example file.

The secret is expected to have the following key/value pair:

To create a cookie Secret, run

This will create a Secret with a single key, cookie, taken from the file name, and the file contents as its value.

Administrator Credentials

RabbitMQ will seed a default user with well-known credentials on first boot. The username and password of this user are both guest.

This default user can only connect from localhost by default. It is possible to lift this restriction by opting in. This may be useful for testing but very insecure. Instead, an administrative user must be created using generated credentials.

The administrative user credentials should be stored in a Kubernetes Secret, and mounting them onto the RabbitMQ Pods. The RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS environment variables then can be set to the Secret values. The community Docker image will use them to override default user credentials.

Example for reference.

The secret is expected to have the following key/value pair:

To create an administrative user Secret, use

This will create a Secret with two keys, user and pass, taken from the file names, and file contents as their respective values.

Users can be create explicitly using CLI tools as well. See RabbitMQ doc section on user management to learn more.

Node Configuration

There are several ways to configure a RabbitMQ node. The recommended way is to use configuration files.

Configuration files can be expressed as Config Maps, and mounted as a Volume onto the RabbitMQ pods.

To create a Config Map with RabbitMQ configuration, apply our minimal configmap.yaml example:

Use an Init Container

Since Kubernetes 1.9.4, Config Maps are mounted as read-only volumes onto Pods. This is problematic for the RabbitMQ community Docker image: the image can try to update the config file at the time of container startup.

Thus, the path at which the RabbitMQ config is mounted must be read-write. If a read-only file is detected by the Docker image, you'll see the following warning:

While the Docker image does work around the issue, it is not ideal to store the configuration file in /tmp and we recommend instead making the mount path read-write.

As a few other projects in the Kubernetes community, we use an init container to overcome this.


Run The Pod As the rabbitmq User

The Docker image runs as the rabbitmq user with uid 999 and writes to the rabbitmq.conf file. Thus, the file permissions on rabbitmq.conf must allow this. A Pod Security Context can be added to the Stateful Set definition to achieve this. Set the runAsUser, runAsGroup and the fsGroup to 999 in the Security Context.

See Security Context in the Stateful Set definition file.

Importing Definitions

RabbitMQ nodes can importi definitions exported from another RabbitMQ cluster. This may also be done at node boot time.

Following from the RabbitMQ documentation, this can be done using the following steps:

  1. Export definitions from the RabbitMQ cluster you wish to replicate and save the file
  2. Create a Config Map with the key being the file name, and the value being the contents of the file (See the rabbitmq.conf Config Map example)
  3. Mount the Config Map as a Volume on the RabbitMQ Pod in the Stateful Set definition
  4. Update the rabbitmq.conf Config Map with load_definitions = /path/to/definitions/file

Readiness Probe

Kubernetes uses a check known as the readiness probe to determine if a pod is ready to serve client traffic. This is effectively a specialized health check defined by the system operator.

When an ordered pod deployment policy is used — and this is the commended option for RabbitMQ clusters — the probe controls when the Kubernetes controller will consider the currently deployed pod to be ready and proceed to deploy the next one. This check, if not chosen appropriately, can deadlock a rolling cluster node restart.

RabbitMQ nodes that belong to a clsuter will attempt to sync schema from their peers on startup. If no peer comes online within a configurable time window (five minutes by default), the node will give up and voluntarily stop. Before the sync is complete, the node won't mark itself as fully booted.

Therefore, if a readiness probe assumes that a node is fully booted and running, a rolling restart of RabbitMQ node pods using such probe will deadlock: the probe will never succeed, and will never proceed to deploy the next pod, which must come online for the original pod to be considered ready by the deployment.

It is therefore recommended to use a very basic RabbitMQ health check for readiness probe:

While this check is not thorough, it allows all pods to be started and re-join the cluster within a certain time period, even when pods are restarted one by one, in order.

This is covered in a dedicated section of the RabbitMQ clustering guide: Restarts and Health Checks (Readiness Probes).

The readiness probe section in the Stateful Set definition file demonstrates how to configure a readiness probe.

Liveness Probe

Similarly to the readiness probe described above, Kubernetes allows for pod health checks using a different health check called the liveness probe. The check determines if a pod must be restarted.

As with all health checks, there is no single solution that can be recommended for all deployments. Health checks can produce false positives, which means reasonably healthy, operational nodes will be restarted or even destroyed and re-created for no reason, reducing system availability.

Moreover, a RabbitMQ node restart won't necessarily address the issue. For example, restarting a node that is in an alarmed state because it is low on available disk space won't help.

All this is to say that liveness probes must be chosen wisely and with false positives and "recoverability by a restart" taken into account. Liveness probes also must use node-local health checks instead of cluster-wide ones.

RabbitMQ CLI tools provide a number of pre-defined health checks that vary in how thorough they are, how intrusive they are and how likely they are to produce false positives in different scenarios, e.g. when the system is under load. The checks are composable and can be combined. The right liveness probe choice is a system-specific decision. When in doubt, start with a simpler, less intrusive and less thorough option such as

The following checks can be reasonable liveness probe candidates:

Note, however, that they will fail for the nodes paused by the "pause minority" partition handliner strategy.

The liveness probe section in the Stateful Set definition file demonstrates how to configure a liveness probe.


RabbitMQ supports plugins. Some plugins are essential when running RabbitMQ on Kubernetes, e.g. the Kubernetes-specific peer discovery implementation.

The rabbitmq_peer_discovery_k8s plugin is required to deploy RabbitMQ on Kubernetes. It is quite common to also enable rabbitmq_management plugin in order to get a browser-based management UI and an HTTP API, and rabbitmq_prometheus for monitoring.

Plugins can be enabled in different ways. We recommend mounting the plugins file, enabled_plugins, to the node configuration directory, /etc/rabbitmq. A Config Map can be used to express the value of the enabled_plugins file. It can then be mounted as a Volume onto each RabbitMQ container in the Stateful Set definition.

In our configmap.yaml example file, we demonstrate how to popular the the enabled_plugins file and mount it under the /etc/rabbitmq directory.


The final consideration for the Stateful Set is the ports to open on the RabbitMQ Pods. Protocols supported by RabbitMQ are all TCP-based and require the protocol ports to be opened on the RabbitMQ nodes. Depending on the plugins that are enabled on a node, the list of required ports can vary.

The example enabled_plugins file mentioned above enables a few plugins: rabbitmq_peer_discovery_k8s (mandatory), rabbitmq_management and rabbitmq_prometheus. Therefore, the service must open several ports relevant for the core server and the enabled plugins:

  • 5672: used by AMQP 0-9-1 and AMQP 1.0 clients
  • 15672: management UI and HTTP API)
  • 15692: Prometheus scraping endpoint)

Deploy the Stateful Set

These are the key components in the Stateful Set file. Please have a look at the file, and if following from the example, deploy the Stateful Set:

This will start spinning up a RabbitMQ cluster. To watch the progress:

Create a Service for Client Connections

If all the steps above succeeded, you should have functioning RabbitMQ cluster deployed on Kubernetes! ? However, having a RabbitMQ cluster on Kubernetes is only useful clients can connect to it.

Time to create a Service to make the cluster accessible to client connections.

The type of the Service depends on your use case. The Kubernetes API reference gives a good overview of the types of Services available.

In the client-service.yaml example file, we have gone with a LoadBalancer Service. This gives us an external IP that can be used to access the RabbitMQ cluter.

For example, this should make it possible to visit the RabbitMQ management UI by visiting {external-ip}:15672, and signing in. Client applications can connect to endpoints such as {external-ip}:5672 (AMQP 0-9-1, AMQP 1.0) or {external-ip}:1883 (MQTT). Please refer to the get started guide to learn how to use RabbitMQ.

If following from the example, run

to create a Service of type LoadBalancer with an external IP address. To find out what the external IP address is, use kubectl get svc:

Resource Usage and Limits

Container resource management is a topic that deserves its own post. Capacity planning recommendations are entirely workload-, environment- and system-specific. Optimal values are usually found via extensive monitoring of the system, trial, and error. However, when picking the limits and resource allocation settings, consider a few RabbitMQ-specific things.

Use the Latest Major Erlang Release

RabbitMQ runs on the Erlang runtime. Recent Erlang/OTP releases have introduced a number of improvements highly relevant to the users who run RabbitMQ on Kubernetes:

  • In Erlang 22, inter-node communication [latency and head-of-line blocking( have been significantly reduced. In earlier versions, link congestion was known to make cluster node heartbeat false positives likely.
  • In Erlang 23, the runtime will respect the container CPU quotas when computing the default number of schedulers to start. This means that nodes will respect the Kubernetes-managed CPU resource limits.

Docker community image for RabbitMQ ships with Erlang 23 at the time of writing. Users of custom Docker images are highly recommended to provision Erlang 23 as well.

CPU Resource Usage

RabbitMQ was designed for workloads that involve multiple queues and where a node serves multiple clients at the same time. Nodes will generally use all the CPU cores allowed without any explicit configuration. As the number of cores grows, some tuning may be necessary to reduce CPU context switching.

How CPU time is spent can be monitored via the runtime thread activity metrics which are also exposed via the RabbitMQ Prometheus plugin.

If RabbitMQ pods hover around their CPU resource allowance and experience throttling in environments with a large number of relatively idle clients, the load likely can be reduced with a modest amount of configuration.

Memory Limits

RabbitMQ uses the concept of a runtime memory high watermark. By default a node will use 40% of detected (available) memory as the watermark. When the watermark is crossed, publishers across the entire cluster will be blocked and more aggressive paging out to disk initiated. The watermark value may seem like a memory quota on Kubernetes at first but there is an important difference: RabbitMQ resource alarms assume a node can typically recover from this state. For example, a large backlog of messages will eventually be consumed.

Kubernetes memory limits are enforced by the OOM killer: no recovery is expected. This means that a RabbitMQ node's high memory watermark must be lower than the memory limit imposed on the node container. Kubernetes deployments should use the relative watermark values in the recommended range.

Memory usage breakdown data should be used to determine what consumes most memory on the node.

Disk Usage

We highly recommend overprovisioning the disk space available to RabbitMQ containers. A node that has run out of disk space won't always be able to recover from such an event. Such nodes must be decomissioned and replaced.

Consider Available Network Link Bandwidth

Finally, consider what kind of links and Kubernetes networking options are used for inter-node communication. Network link congestion can be a significant limiting factor to system throughput and affect its availability.

Below is a very simplistic formula to calculate the amount of bandwidth needed by a workload, in bits:

Therefore a workload with average message size of 3 kiB and expected peak message rate of 20K messages a second can consume up to

of bandwidth.

Team RabbitMQ maintains a Grafana dashboard for inter-node communication link metrics.

Using rabbitmq-perf-test to Run a Functional and Load Test of the Cluster

RabbitMQ comes with a load simulation tool, PerfTest, which can be executed from outside of a cluster or deployed to Kubernetes using the perf-test public docker image. Here's an example of how the image can be deployed to a Kubernetes cluster

Here the {username} and {password} are the user credentials, e.g. those set up in the rabbitmq-admin Secret. The {serivce} is the hostname to connect to. We use the name of the client service that will resolve as a hostname when deployed.

The above kubectl run command will start a PerfTest pod which can be observed in

For a functioning RabbitMQ cluster, running kubectl logs -f {perf-test-pod-name} where {perf-test-pod-name} is the name of the pod as reported by kubectl get pods, will produce output similar to this:

To learn more about PerfTest, its settings, capabilities and output, see the PerfTest doc guide.

PerfTest is not meant to be running permanently. To tear down the perf-test pod, use

Monitoring the Cluster

Monitoring is a critically important part of any production deployment.

RabbitMQ comes with in-built support for Prometheus. To enable it, enable the rabbitmq_prometheus plugin. This in turn can be done by adding rabbitmq_promethus to the enabled_plugins Config Map as explained above.

The Prometheus scraping port, 15972, must be open on both the Pod and the client Service.

Node and cluster metrics can be visualised with Grafana.

Alternative Option: the Kubernetes Cluster Operator for RabbitMQ

As this post demonstrates, there are quite a few parts involved in hosting a stateful data services such as RabbitMQ on Kubernetes. It may seem like a daunting task. There are several alternatives to this kind of DIY deployment demonstrated in this post.

Team RabbitMQ at VMware has open sourced a Kubernetes Operator pattern implementation for RabbitMQ. As of August 2020, this is a young project under active development. While it currently has limitations, it is our recommended option over the manual DIY setup demonstrated in this post.

See RabbitMQ Cluster Operator for Kubernetes to learn more. The project is developed in the open at rabbitmq/cluster-operator on GitHub. Give it a try and let us know how it goes. Besides GitHub, two great venues for providing feedback to the team behind the Operator are the RabbitMQ mailing list and the #kubernetes channel in RabbitMQ community Slack.

RabbitMQ Performance Measurements, part 1

Tuesday, April 17th, 2012

So today I would like to talk about some aspects of RabbitMQ's performance. There are a huge number of variables that feed into the overall level of performance you can get from a RabbitMQ server, and today we're going to try tweaking some of them and seeing what we can see.


Sizing your Rabbits

Saturday, September 24th, 2011

One of the problems we face at the RabbitMQ HQ is that whilst we may know lots about how the broker works, we don't tend to have a large pool of experience of designing applications that use RabbitMQ and which need to work reliably, unattended, for long periods of time. We spend a lot of time answering questions on the mailing list, and we do consultancy work here and there, but in some cases it's as a result of being contacted by users building applications that we're really made to think about long-term behaviour of RabbitMQ. Recently, we've been prompted to think long and hard about the basic performance of queues, and this has lead to some realisations about provisioning Rabbits. (more…)

RabbitMQ, backing stores, databases and disks

Thursday, January 20th, 2011

From time to time, on our mailing list and elsewhere, the idea comes up of using a different backing store within RabbitMQ. The backing store is the bit that's responsible for writing messages to disk (a message can be written to disk for a number of reasons) and it's a fairly frequent suggestion to see what RabbitMQ would look like if its own backing store was replaced with another storage system.

Such a change would permit functionality that is not currently possible, for example out-of-band queue browsing, or distributed storage, but there is a fundamental difference in the nature of data storage and access patterns between a message broker such as RabbitMQ and a generic database. Indeed RabbitMQ deliberately does not store messages in such a database. (more…)

What’s Going on with the Ruby AMQP Gem?

Wednesday, January 12th, 2011

In the past year development of the AMQP gem was practicaly stagnating, as its original author Aman Gupta (@tmm1) was busy. A lot of bugs stayed unresolved, the code was getting old and out-dated and no new features or documentation were made.

At this point I started to talk with the RabbitMQ guys about possible collaboration on this. Actually originally I contacted VMware when I saw Ezra Zygmuntowicz looking for people to his cloud team, but when I found that VMware recently acquired the RabbitMQ project in London, I got interested. I signed the contract, switched from script/console to Wireshark and the RabbitMQ Tracer and since November I've been happily hacking on the AMQP and AMQ-Protocol gems.

To introduce myself, my name's Jakub Stastny (@botanicus) and I work as a Ruby contractor. I contributed to such projects as RubyGems, Merb and rSpec and I wrote my own framework called Rango, the only Ruby framework with template inheritance. I work with Node.js as well and I created Minitest.js, BDD framework for testing asynchronous code. My other hobbies are photography and travelling.

I asked Aman if I can take over the maintainership over the AMQP gem and he was happy to do so. At this point other two guys, Michael Klishin (michaelklishin) and Ar Vicco (arvicco) showed interest in the development, so we created ruby-amqp organisation at GitHub and forked the original code there, as well as a few other related repositories. The GitHub guys were happy to make our repository to be the main one, instead of just a fork, so since now, everything will be there (except the old issues which are still on tmm1's fork and which we want to solve and close soon).

Soo What's New?

Test Suite

At the beginning, there were barely any tests at all, so it was basically impossible to tell if the changes I made break something or not. So I started to write some. In the later stage, when michaelklishin and arvicco joined the development, we rewrote the few original Bacon specs to rSpec 2 and now arvicco is porting his specs which he happened to write some time ago to the main repository. Arvicco has also written amqp-spec, superset of em-spec for testing the AMQP gem.

AMQP 0.9.1

Currently the gem speaks only AMQP 0.8, which is more than 2 years old version, so probably the most important upcoming feature is support of AMQP 0.9.1. Because this is something what can be beneficial for other clients as well, I decided to create a new library called AMQ-protocol. It's using rabbitmq-codegen as many others client libraries.

One of the main goals of this gem is to be really fast and memory-efficient (not for the sake of memory-efficiency itself, but because the garbage collector of MRI is quite weak). I'm about to create some benchmarks soon to see if the performance is better and how much.

AMQ-Protocol is still work-in-progress. It works, but it still needs some polishing, refactoring and optimizations, as well as documentation and tests.

Other Changes

I fixed a lot of bugs and I merged all the pending pull requests to the main repository. I'm going to write more about the changes once I'll release AMQP 0.7. I released 0.7.pre recently, you can try it by running gem install amqp --pre, which would be greatly appreciated. As the work on the test suite is still in progress now, the release process is kind of russian roulette at the moment.

Backward compatibility

I fixed quite a few bugs and obviously the fixed code is never backward-compatible with the old buggy one. One of the major changes is that MQ#queues (as well as MQ#fanouts etc) is not a hash anymore, but an array-like collection with hash-like behaviour. It does NOT override anonymous instances when another anonymous instance is created (as it used to do before) and it does support server-generated names. So instead of MQ#queues[nil] = <first instance> and then MQ#queues[nil] = <second instance>) it now just adds both instances to the collection and when it receives Queue.Declare-Ok from the server, it updates the name to it.

Future plans

The AMQP gem is very opinionated. If you don't want to use EventMachine, you're out of luck. You might want to use something more low-level like or just another async library like You might not even want to care about the asynchronous code at all.

It'd be great if we could have one really un-opinionated AMQP client library which only job would be to expose low-level API defined by the AMQP protocol without any abstraction like hidding channels etc. Such library would be intended for another library implementators rather than for the end users. AMQP is a complex protocol and because of some design decisions it's pretty hard to design a good and easy-to-use (opinionated) client library for it. So some basic library which doesn't make any assumptions would help others to play around and try to implement their own, opinionated libraries on top of this one without the need to manually implement the hard stuff like encoding/decoding or basic socket communication.

Questions? Ideas? Get in touch!

Are you interested in the AMQP gem development? Do you want to participate or do you have some questions? Feel free to contact me, either by comments under this blog post, or you can drop me an e-mail to or drop by to Jabber MUC room at where all the current maintainers usually are. And for all the news make sure you are following me on Twitter!

Chapter 1: Introduction to Distributed Systems

Wednesday, November 17th, 2010

RabbitMQ needs more and better documentation. (And who doesn’t?) In particular, we need more and better introductory material that introduces the reader to various basic concepts, explains why they’re important, and motivates him or her to keep reading and learn more about RabbitMQ. Here’s a cut at Chapter 1 of that introduction. Your comments are welcome, and Chapters 2 and 3 will follow soon.

(You probably already know all of this, but a surprising number of people don’t. This introduction is for them.)

The Old Future

Long, long ago, the American science-fiction writer Isaac Asimov imagined a future world in which one single giant computer, “Multivac,” would control all of mankind’s affairs. Information would flow in from people and businesses and governments across the globe, and Multivac would store it and process it, and send exactly the right important new information right back out. All sorts of futuristic questions would pour in from our future selves, and the right futuristic answers would just pour back out. This future was a great place!

And our present-day world isn’t all that different from Asimov’s future, just without all that shininess. We’ve got the Internet, and it connects people and businesses and governments all over the globe, and information flows in, and information flows out, and questions pour in, and answers pour out. We’ve got our Googles and our Amazons and our eBays and our Facebooks, and our lives keep getting better every day. More and better information; more and better storage and processing; more and better answers.

But Asimov was only a lowly Ph.D. chemist turned science-fiction writer, not any sort of real Computer Scientist like we have now, and he never worked out all (or, really, any!) of the technical details of exactly how you’d build that one giant, all-knowing, all-powerful Multivac at the North Pole, and exactly who’d pay for it, and exactly what uses they’d allow, and so on. He left that part for future generations to figure out, if in fact they could. And as time has gone by, it’s also turned out that any one single computer that anyone can buy at the computer shop down the street is still several orders of magnitude too small and too weak to control all of mankind’s affairs. That’s the bad news.

The New Future

The good news, which Asimov didn’t anticipate (ha!), is that computers here in the future are cheap—almost dirt cheap, being made largely of silicon, which is after all just processed dirt. So if any one computer you can buy at the shop (or rent on the cloud from Amazon, or whatever) has a million times too little storage or processing power for what you want to do to or for mankind, just get a million of them and plug them together! (Some assembly required.) Google is close to doing just that—just as soon as it completes its takeover of the North Pole—and everyone else is trying to follow close behind. Google’s got its own computers to execute its plans for the world, and Facebook’s got its own computers and its own plans too, and the CIA too, and your company or organization too, and everyone cooperates and competes in controlling all of mankind’s affairs. Our old centralized computer systems couldn’t possibly grow big enough, so we’re replacing them with shiny new distributed systems that could presumably grow bigger forever. And our lives keep getting better every day.

But getting your million computers (or even just a thousand, or even fewer!) to work together on their assigned tasks isn’t as easy as it might sound to your upper management. One given server computer might crash once a year due to bad hardware or bad software or bad power or bad whatever—and that’s usually being pretty optimistic. If you have only a thousand server computers, one will crash on the average every 9 hours; if you had a million, one would crash about every 30 seconds; if you had a billion—which not even Google has yet—about 30 would crash every second, and good luck getting the remainder not to crash or otherwise go bonkers too! One centralized computer can be either up or down, and that’s it, but a distributed computer system is more likely to be 99% up and 1% down at any moment, and the 1% that’s down keeps shifting around and further confusing the other 99%. Problems in distributed systems are unavoidable, and they can multiply without bound. Welcome to the future!

You may have just a thousand computers so far, or maybe just a hundred, or maybe even just ten or so, but you’re still going to have problems and bad things are still going to happen. Crashes are one obvious cause, but lost messages or misconfigured systems or subtle race conditions all add to the error rate too. If you can’t think of half a dozen more potential problems with large distributed systems, you’ve probably never built or operated one. It would be impossibly hard to build that one giant Multivac at the North Pole, but it might be even harder to figure out exactly how those zillion smaller computers that you buy instead will ever work together. What to do?

Perfect Reliability* (*Not really)

There’s a great saying: If you ever see a computer system described as “reliable,” look for the asterisk and the footnote that says “Not really.” Perfect reliability is impossible to achieve. Put your computers in an expensive data center in California, and one sufficiently large earthquake can knock them all out. Spread them out across a bunch of expensive data centers on different continental plates, and you just need a few more earthquakes (or tsunamis, or whatever) to knock enough of your computers (or network links, or whatever) to render the others useless. Enough natural or man-made catastrophes can ruin anything, and they can happen a lot more often than you might think—especially the man-made ones! That’s the bad news.

The good news is, while you can’t build perfectly reliable systems, you can build systems that are reliable enough, whatever that happens to be. That is, you can build computer systems that are arbitrarily reliable. You can ensure that if enough of your computers are up and connected and working correctly, then the system as a whole will continue to do the right things, and that even if more fail, then the system as a whole still won’t do anything wrong. (It might not do anything at all, but that’s life.) If you want more reliability, you can buy more computers (maybe a lot more) and connect them properly. If you know how.

Cargo Cults and Banks

Unfortunately, much of the time, it seems that our distributed-system needs are growing faster than our expertise. Distributed systems are hard to build and they may never become all that easy. Right now, it’s often all we can do adopt best practices—to look at distributed systems that got it right, and try to figure out why they succeeded, and to try to duplicate their success. It’s a little like running your own cargo cult, but without all the coconuts.

Banks are in many ways an excellent industry to study and perhaps to imitate. Banks (and other financial institutions) can clearly care very much about reliability, and banks have been building pretty large, pretty reliable distributed systems for some while now. Banks today tend to build their reliable distributed systems atop reliable message-queuing systems, and they’ve even developed an open standard for such message-queuing systems, and that’s worked out pretty well for them, and that’s what we’ll look at next.