Clustering and Network Partitions

RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don't. You should use federation or the shovel instead.

However, sometimes accidents happen. This page documents how to detect network partitions, some of the bad effects that may happen during partitions, and how to recover from them.

RabbitMQ stores information about queues, exchanges, bindings etc in Erlang's distributed database, Mnesia. Many of the details of what happens around network partitions are related to Mnesia's behaviour.

Detecting network partitions

Mnesia will typically determine that a node is down if another node is unable to contact it for a minute or so (see the page on net_ticktime). If two nodes come back into contact, both having thought the other is down, Mnesia will determine that a partition has occured. This will be written to the RabbitMQ log in a form like:

=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network, hare@smacmullen}

RabbitMQ nodes will record whether this event has ever occured while the node is up, and expose this information through rabbitmqctl cluster_status and the management plugin.

rabbitmqctl cluster_status will normally show an empty list for partitions:

# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
 {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
 {partitions,[]}]
...done.

However, if a network partition has occured then information about partitions will appear there:

# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes,[{disc,[hare@smacmullen,rabbit@smacmullen]}]},
 {running_nodes,[rabbit@smacmullen,hare@smacmullen]},
 {partitions,[{rabbit@smacmullen,[hare@smacmullen]},
              {hare@smacmullen,[rabbit@smacmullen]}]}]
...done.

The management plugin API will return partition information for each node under partitions in /api/nodes. The management plugin UI will show a large red warning on the overview page if a partition has occured.

During a network partition

While a network partition is in place, the two (or more!) sides of the cluster can evolve independently, with both sides thinking the other has crashed. Queues, bindings, exchanges can be created or deleted separately. Mirrored queues which are split across the partition will end up with one master on each side of the partition, again with both sides acting independently. Other undefined and weird behaviour may occur.

It is important to understand that when network connectivity is restored, this state of affairs persists. The cluster will continue to act in this way until you take action to fix it.

Recovering from a network partition

To recover from a network partition, first choose one partition which you trust the most. This partition will become the authority for the state of Mnesia to use; any changes which have occured on other partitions will be lost.

Stop all nodes in the other partitions, then start them all up again. When they rejoin the cluster they will restore state from the trusted partition.

Finally, you should also restart all the nodes in the trusted partition to clear the warning.

It may be simpler to stop the whole cluster and start it again; if so make sure that the first node you start is from the trusted partition.

Automatically handling partitions

RabbitMQ also offers two ways to deal with network partitions automatically: pause-minority mode and autoheal mode. (The default behaviour is referred to as ignore mode).

In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run.

In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred. It will restart all nodes that are not in the winning partition. The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).

You can enable either mode by setting the configuration parameter cluster_partition_handling for the rabbit application in your configuration file to either pause_minority or autoheal.

Which mode should I pick?

It's important to understand that allowing RabbitMQ to deal with network partitions automatically does not make them less of a problem. Network partitions will always cause problems for RabbitMQ clusters; you just get some degree of choice over what kind of problems you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, you should use federation or the shovel.

With that said, you might wish to pick a recovery mode as follows:

More about pause-minority mode

The Erlang VM on the paused nodes will continue running but the nodes will not listen on any ports or do any other work. They will check once per second to see if the rest of the cluster has reappeared, and start up again if it has.

Note that nodes will not enter the paused state at startup, even if they are in a minority then. It is expected that any such minority at startup is due to the rest of the cluster not having been started yet.

Also note that RabbitMQ will pause nodes which are not in a strict majority of the cluster - i.e. containing more than half of all nodes. It is therefore not a good idea to enable pause-minority mode on a cluster of two nodes since in the event of any network partition or node failure, both nodes will pause. However, pause_minority mode is likely to be safer than ignore mode for clusters of more than two nodes, especially if the most likely form of network partition is that a single minority of nodes drops off the network.