Clustering and Network Partitions


This guide covers one specific aspect of clustering: network failures between nodes, their effects and recovery options. For a general overview of clustering, see Clustering and Peer Discovery and Cluster Formation guides.

Clustering can be used to achieve different goals: increased data safety through replication, increased availability for client operations, higher overall throughput and so on. Different configurations are optimal for different purposes.

Network connection failures between cluster members have an effect on data consistency and availability (as in the CAP theorem) to client operations. Since different applications have different requirements around consistency and can tolerate unavailability to a different extent, different partition handling strategies are available.

Detecting Network Partitions

Nodes determine if its peer is down if another node is unable to contact it for a period of time, 60 seconds by default. If two nodes come back into contact, both having thought the other is down, the nodes will determine that a partition has occurred. This will be written to the RabbitMQ log in a form like:

=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@hostname): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, hare@hostname}

Partition presence can be idenfied via server logs, HTTP API (for monitoring) and a CLI command, rabbitmqctl cluster_status.

rabbitmqctl cluster_status will normally show an empty list for partitions:

rabbitmqctl cluster_status
# => Cluster status of node rabbit@smacmullen ...
# => [{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
# =>  {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
# =>  {partitions,[]}]
# => ...done.

However, if a network partition has occurred then information about partitions will appear there:

rabbitmqctl cluster_status
# => Cluster status of node rabbit@smacmullen ...
# => [{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
# =>  {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
# =>  {partitions,[{rabbit@smacmullen,[hare@smacmullen]},
# =>               {hare@smacmullen,[rabbit@smacmullen]}]}]
# => ...done.

The HTTP API will return partition information for each node under partitions in /api/nodes. The management UI will show a warning on the overview page if a partition has occurred.

During a Network Partition

While a network partition is in place, the two (or more!) sides of the cluster can evolve independently, with both sides thinking the other has crashed. This scenario is known as split-brain. Queues, bindings, exchanges can be created or deleted separately. Mirrored queues which are split across the partition will end up with one master on each side of the partition, again with both sides acting independently. Unless a partition handling strategy, such as pause_minority, is configured to be used, the split will continue even after network connectivity is restored.

Partitions Caused by Suspend and Resume

While we refer to "network" partitions, really a partition is any case in which the different nodes of a cluster can have communication interrupted without any node failing. In addition to network failures, suspending and resuming an entire OS can also cause partitions when used against running cluster nodes - as the suspended node will not consider itself to have failed, or even stopped, but the other nodes in the cluster will consider it to have done so.

While you could suspend a cluster node by running it on a laptop and closing the lid, the most common reason for this to happen is for a virtual machine to have been suspended by the hypervisor. While it's fine to run RabbitMQ clusters in virtualised environments, you should make sure that VMs are not suspended while running. Note that some virtualisation features such as migration of a VM from one host to another will tend to involve the VM being suspended.

Partitions caused by suspend and resume will tend to be asymmetrical - the suspended node will not necessarily see the other nodes as having gone down, but will be seen as down by the rest of the cluster. This has particular implications for pause_minority mode.

Recovering From a Split-Brain

To recover from a split-brain, first choose one partition which you trust the most. This partition will become the authority for the state of the system (schema, messages) to use; any changes which have occurred on other partitions will be lost.

Stop all nodes in the other partitions, then start them all up again. When they rejoin the cluster they will restore state from the trusted partition.

Finally, you should also restart all the nodes in the trusted partition to clear the warning.

It may be simpler to stop the whole cluster and start it again; if so make sure that the first node you start is from the trusted partition.

Partition Handling Strategies

RabbitMQ also three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. The default behaviour is referred to as ignore mode.

In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends. This configuration prevents split-brain and is therefore able to automatically recover from network partitions without inconsistencies.

In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.

In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts.

The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).

You can enable either mode by setting the configuration parameter cluster_partition_handling for the rabbit application in your configuration file to:

  • autoheal
  • pause_minority
  • pause_if_all_down
If using the pause_if_all_down mode, additional parameters are required:
  • nodes - nodes, which should be unavailable to pause
  • recover - recover action, can be ignore or autoheal
Example config snippet that uses pause_if_all_down:
cluster_partition_handling = pause_if_all_down

## Recovery strategy. Can be either 'autoheal' or 'ignore'
cluster_partition_handling.pause_if_all_down.recover = ignore

## Node names to check
cluster_partition_handling.pause_if_all_down.nodes.1 = rabbit@myhost1
cluster_partition_handling.pause_if_all_down.nodes.2 = rabbit@myhost2

Which mode should I pick?

It's important to understand that allowing RabbitMQ to deal with network partitions automatically does not make them less of a problem. Network partitions will always cause problems for RabbitMQ clusters; you just get some degree of choice over what kind of problems you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, you should use federation or the shovel.

With that said, you might wish to pick a recovery mode as follows:

  • ignore - Your network really is reliable. All your nodes are in a rack, connected with a switch, and that switch is also the route to the outside world. You don't want to run any risk of any of your cluster shutting down if any other part of it fails (or you have a two node cluster).
  • pause_minority - Your network is maybe less reliable. You have clustered across three datacenters, and you assume that only one datacenter will fail at once. In that scenario you want the remaining two datacenters to continue working and the nodes from the failed datacenter to rejoin automatically and without fuss when the datacenter comes back. Those datacenters have to have a direct and reliable connection to each other, like availability zones in EC2.
  • autoheal - Your network may not be reliable. You are more concerned with continuity of service than with data integrity. You may have a two node cluster.

More about pause-minority mode

The Erlang VM on the paused nodes will continue running but the nodes will not listen on any ports or do any other work. They will check once per second to see if the rest of the cluster has reappeared, and start up again if it has.

Note that nodes will not enter the paused state at startup, even if they are in a minority then. It is expected that any such minority at startup is due to the rest of the cluster not having been started yet.

Also note that RabbitMQ will pause nodes which are not in a strict majority of the cluster - i.e. containing more than half of all nodes. It is therefore not a good idea to enable pause-minority mode on a cluster of two nodes since in the event of any network partition or node failure, both nodes will pause. However, pause_minority mode is safer than ignore mode, with regards to integrity. For clusters of more than two nodes, especially if the most likely form of network partition is that a single minority of nodes drops off the network, the availability is remains as good as with ignore mode.

Finally, note that pause_minority mode will do nothing to defend against partitions caused by cluster nodes being suspended. This is because the suspended node will never see the rest of the cluster vanish, so will have no trigger to disconnect itself from the cluster.

Getting Help and Providing Feedback

If you have questions about the contents of this guide or any other topic related to RabbitMQ, don't hesitate to ask them on the RabbitMQ mailing list.

Help Us Improve the Docs <3

If you'd like to contribute an improvement to the site, its source is available on GitHub. Simply fork the repository and submit a pull request. Thank you!