Clustering and Network Partitions
Introduction
This guide covers one specific aspect of clustering: network failures between nodes, their effects and recovery options. For a general overview of clustering, see Clustering and Peer Discovery and Cluster Formation guides.
Clustering can be used to achieve different goals: increased data safety through replication, increased availability for client operations, higher overall throughput and so on. Different configurations are optimal for different purposes.
Network connection failures between cluster members affect data consistency and the availability of client operations (as in the CAP theorem). Since different applications have different requirements around consistency and can tolerate unavailability to different extents, different partition handling strategies are available.
Detecting Network Partitions
A node considers a peer to be down if it is unable to contact that peer for a period of time, 60 seconds by default. If two nodes come back into contact, both having thought the other was down, they will determine that a partition has occurred. This will be written to the RabbitMQ log in a form like:
2020-05-18 06:55:37.324 [error] <0.341.0> Mnesia(rabbit@warp10): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@hostname2}
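The detection window is governed by the Erlang net tick time. As a sketch, it can be tuned in rabbitmq.conf (the net_ticktime key and the value shown are given for illustration; the value is in seconds and should be identical on all cluster nodes):
## Detection window for unreachable peers, in seconds (assumed key name)
net_ticktime = 60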
Partition presence can be identified via server logs, the HTTP API (for monitoring) and a CLI command:
rabbitmq-diagnostics cluster_status
rabbitmq-diagnostics cluster_status will normally show an empty list for partitions:
rabbitmq-diagnostics cluster_status
# => Cluster status of node rabbit@warp10 ...
# => Basics
# =>
# => Cluster name: local.1
# =>
# => ...edited out for brevity...
# =>
# => Network Partitions
# =>
# => (none)
# =>
# => ...edited out for brevity...
However, if a network partition has occurred then information about partitions will appear there:
rabbitmqctl cluster_status
# => Cluster status of node rabbit@warp10 ...
# => Basics
# =>
# => Cluster name: local.1
# =>
# => ...edited out for brevity...
# =>
# => Network Partitions
# =>
# => Node flopsy@warp10 cannot communicate with hare@warp10
# => Node rabbit@warp10 cannot communicate with hare@warp10
The HTTP API will return partition information for each node under partitions in the GET /api/nodes endpoint.
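For example, a quick way to inspect this from the shell (a sketch that assumes the management plugin on its default port 15672, default guest credentials and jq for readability; adjust to your environment):
# List each node's name and the peers it cannot reach; "partitions" should
# be an empty list on a healthy cluster
curl -s -u guest:guest http://localhost:15672/api/nodes | jq '.[] | {name, partitions}'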
The management UI will show a warning on the overview page if a partition has occurred.
Behavior During a Network Partition
While a network partition is in place, the two (or more!) sides of the cluster can evolve independently, with both sides thinking the other has crashed. This scenario is known as split-brain. Queues, bindings and exchanges can be created or deleted separately.
Quorum queues will elect a new leader on the majority side. Quorum queue replicas on the minority side will no longer make progress (i.e. accept new messages, deliver to consumers, etc.); all of that work will be done by the new leader.
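One way to see which replica currently acts as the leader of a given quorum queue is the quorum_status command (a sketch; the queue name "orders" and the default virtual host are assumptions):
# Show the Raft membership and current leader of a quorum queue
rabbitmq-queues quorum_status "orders"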
Unless a partition handling strategy, such as pause_minority, is configured to be used, the split will continue even after network connectivity is restored.
Partitions Caused by Suspend and Resume
While we refer to "network" partitions, really a partition is any case in which the different nodes of a cluster can have communication interrupted without any node failing. In addition to network failures, suspending and resuming an entire OS can also cause partitions when used against running cluster nodes - as the suspended node will not consider itself to have failed, or even stopped, but the other nodes in the cluster will consider it to have done so.
While you could suspend a cluster node by running it on a laptop and closing the lid, the most common reason for this to happen is for a virtual machine to have been suspended by the hypervisor.
While it's fine to run RabbitMQ clusters in virtualised environments or containers, make sure that VMs are not suspended while running.
Note that some virtualisation features such as migration of a VM from one host to another will tend to involve the VM being suspended.
Partitions caused by suspend and resume will tend to be asymmetrical - the suspended node will not necessarily see the other nodes as having gone down, but will be seen as down by the rest of the cluster. This has particular implications for pause_minority mode.
Recovering From a Split-Brain
To recover from a split-brain, first choose one partition which you trust the most. This partition will become the authority for the state of the system (schema, messages); any changes which have occurred on other partitions will be lost.
Stop all nodes in the other partitions, then start them all up again. When they rejoin the cluster they will restore state from the trusted partition.
Finally, you should also restart all the nodes in the trusted partition to clear the warning.
It may be simpler to stop the whole cluster and start it again; if so make sure that the first node you start is from the trusted partition.
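As a concrete sketch of the first approach (node names are hypothetical, and a full node or service restart achieves the same effect as the stop_app/start_app pair used here):
# On each node in the untrusted partition(s): stop the RabbitMQ application...
rabbitmqctl -n rabbit@hostname2 stop_app
# ...then start it again so the node rejoins and restores state from the trusted partition
rabbitmqctl -n rabbit@hostname2 start_app
# Finally, restart the nodes in the trusted partition as well to clear the partition warning
rabbitmqctl -n rabbit@hostname1 stop_app
rabbitmqctl -n rabbit@hostname1 start_app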
Partition Handling Strategies
RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. The default behaviour is referred to as ignore mode.
In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. in a partition with half or fewer of the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends. This configuration prevents split-brain and is therefore able to automatically recover from network partitions without inconsistencies.
In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node. This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context. For instance, if the cluster is made of two nodes in rack A and two nodes in rack B, and the link between racks is lost, pause-minority mode will pause all nodes. In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause. Note that it is possible the listed nodes get split across both sides of a partition: in this situation, no node will pause. That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.
In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred, and will restart all nodes that are not in the winning partition. Unlike pause_minority mode it therefore takes effect when a partition ends, rather than when one starts.
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).
You can enable one of these modes by setting the configuration parameter cluster_partition_handling for the rabbit application in the configuration file to:
autoheal
pause_minority
pause_if_all_down
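For the first two modes no additional parameters are needed. A minimal rabbitmq.conf sketch (the file name and location are assumed to be the standard ones):
## Pause nodes that find themselves in a minority after a partition
cluster_partition_handling = pause_minority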
If using the pause_if_all_down mode, additional parameters are required:
nodes: nodes which should be unavailable to pause
recover: recover action, can be ignore or autoheal
Example config snippet that uses pause_if_all_down:
cluster_partition_handling = pause_if_all_down
## Recovery strategy. Can be either 'autoheal' or 'ignore'
cluster_partition_handling.pause_if_all_down.recover = ignore
## Node names to check
cluster_partition_handling.pause_if_all_down.nodes.1 = rabbit@myhost1
cluster_partition_handling.pause_if_all_down.nodes.2 = rabbit@myhost2
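To double-check what a running node actually picked up, its effective configuration can be inspected (a sketch; the grep pattern is only an example):
# Print the node's effective configuration and filter for the partition handling keys
rabbitmq-diagnostics environment | grep -A 3 cluster_partition_handling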
Which Mode to Pick?
It's important to understand that allowing RabbitMQ to deal with network partitions automatically comes with trade-offs.
As stated in the introduction, to connect RabbitMQ clusters over generally unreliable links, prefer Federation or the Shovel.
With that said, here are some guidelines to help the operator determine which mode may or may not be appropriate:
ignore: use when network reliability is the highest practically possible and node availability is of topmost importance. For example, all cluster nodes can be in the same rack or equivalent, connected with a switch, and that switch is also the route to the outside world.
pause_minority: appropriate when clustering across racks or availability zones in a single region, and the probability of losing a majority of nodes (zones) at once is considered to be very low. This mode trades off some availability for the ability to automatically recover if/when the lost node(s) come back.
autoheal: appropriate when you are more concerned with continuity of service than with data consistency across nodes.
More About Pause-minority Mode
The Erlang VM on the paused nodes will continue running but the nodes will not listen on any ports or be otherwise available. They will check once per second to see if the rest of the cluster has reappeared, and start up again if it has.
Note that nodes will not enter the paused state at startup, even if they are in a minority then. It is expected that any such minority at startup is due to the rest of the cluster not having been started yet.
Also note that RabbitMQ will pause nodes which are not in a strict majority of the cluster - i.e. containing more than half of all nodes. It is therefore not a good idea to enable pause-minority mode on a cluster of two nodes since in the event of any network partition or node failure, both nodes will pause. However, pause_minority mode is safer than ignore mode with regards to integrity. For clusters of more than two nodes, especially if the most likely form of network partition is that a single minority of nodes drops off the network, the availability remains as good as with ignore mode.
Note that pause_minority mode will do nothing to defend against partitions caused by cluster nodes being suspended. This is because the suspended node will never see the rest of the cluster vanish, so will have no trigger to disconnect itself from the cluster.