Detecting Dead TCP Connections with Heartbeats


Networks can fail in many ways, some of them subtle (e.g. a high rate of packet loss). Disrupted TCP connections take a moderately long time (about 11 minutes with the default configuration on Linux, for example) to be detected by the operating system. AMQP 0-9-1 offers a heartbeat feature to ensure that the application layer promptly finds out about disrupted connections (and also about completely unresponsive peers). Heartbeats also defend against certain network equipment that may terminate "idle" TCP connections when there is no activity on them for a certain period of time.

Heartbeat Timeout Interval

The heartbeat timeout value defines the period of time after which the peer TCP connection should be considered unreachable (down) by RabbitMQ and client libraries. This value is negotiated between the client and the RabbitMQ server at the time of connection. The client must be configured to request heartbeats. In RabbitMQ versions 3.0 and higher, the broker will attempt to negotiate heartbeats by default (although the client can still veto them). The timeout is in seconds, and the default value is 60 (580 prior to release 3.5.5).

Heartbeat frames are sent about every timeout / 2 seconds. After two missed heartbeats, the peer is considered to be unreachable. Different clients manifest this differently, but the TCP connection will be closed. When a client detects that a RabbitMQ node is unreachable due to missed heartbeats, it needs to reconnect.
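The timing rules above reduce to simple arithmetic. The following is a minimal sketch of that arithmetic only; HeartbeatTiming and its methods are hypothetical names used for illustration and are not part of any RabbitMQ client library.

```java
// Illustrative only: HeartbeatTiming is a hypothetical helper, not a client library class.
final class HeartbeatTiming {
    private HeartbeatTiming() {}

    // Heartbeat frames are sent roughly every timeout / 2 seconds.
    static double sendIntervalSeconds(int timeoutSeconds) {
        if (timeoutSeconds <= 0) {
            throw new IllegalArgumentException("heartbeats are disabled or the timeout is invalid");
        }
        return timeoutSeconds / 2.0;
    }

    // Two missed heartbeats at that interval add up to about one full timeout.
    static double detectionDelaySeconds(int timeoutSeconds) {
        return 2 * sendIntervalSeconds(timeoutSeconds);
    }

    public static void main(String[] args) {
        int timeoutSeconds = 60; // the default timeout mentioned above
        System.out.println("send interval: " + sendIntervalSeconds(timeoutSeconds) + " s");
        System.out.println("detection delay: " + detectionDelaySeconds(timeoutSeconds) + " s");
    }
}
```

With the default timeout of 60 seconds, this gives a frame roughly every 30 seconds and an unreachable peer being noticed after about 60 seconds.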

Any traffic (e.g. protocol operations, published messages, acknowledgements) counts as a valid heartbeat. Clients may choose to send heartbeat frames regardless of whether there was any other traffic on the connection, but some only do it when necessary.

Heartbeats can be disabled by setting the timeout interval to 0. This is not a recommended practice.

Enabling Heartbeats with the Java Client

To configure the heartbeat timeout in the Java client, set it with ConnectionFactory#setRequestedHeartbeat before creating a connection:

ConnectionFactory cf = new ConnectionFactory();

// set the heartbeat timeout to 60 seconds
cf.setRequestedHeartbeat(60);

Note that if the RabbitMQ server has a non-zero heartbeat timeout configured (which is the default in versions starting with 3.6.x), the client can only lower the value, not increase it.
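The "lower but not increase" rule can be pictured as taking the smaller of the two values. The sketch below is an illustration under stated assumptions, not the broker's or any client's actual code: HeartbeatNegotiation is a hypothetical name, and the handling of a zero value on either side (treated here as "no preference", so the other side's value is used) is an assumption of this sketch that may not match every client.

```java
// Illustrative only: a hypothetical model of heartbeat timeout negotiation.
final class HeartbeatNegotiation {
    private HeartbeatNegotiation() {}

    static int negotiate(int clientRequestedSeconds, int serverConfiguredSeconds) {
        if (clientRequestedSeconds < 0 || serverConfiguredSeconds < 0) {
            throw new IllegalArgumentException("timeouts must not be negative");
        }
        // Assumption: a zero on either side means "no preference", so the other value wins.
        if (clientRequestedSeconds == 0 || serverConfiguredSeconds == 0) {
            return Math.max(clientRequestedSeconds, serverConfiguredSeconds);
        }
        // Both sides are non-zero: the client can lower the server's value but not raise it.
        return Math.min(clientRequestedSeconds, serverConfiguredSeconds);
    }

    public static void main(String[] args) {
        System.out.println(negotiate(30, 60));  // client lowers the value
        System.out.println(negotiate(120, 60)); // client cannot raise the value
    }
}
```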

Enabling Heartbeats with the .NET Client

To configure the heartbeat timeout in the .NET client, set it with ConnectionFactory.RequestedHeartbeat before creating a connection:

var cf = new ConnectionFactory();

// set the heartbeat timeout to 60 seconds
cf.RequestedHeartbeat = 60;

Low Timeout Values and False Positives

Setting the heartbeat timeout too low can lead to false positives (a peer being considered unavailable when it is not) due to transient network congestion, short-lived server flow control, and so on. This should be taken into consideration when picking a timeout value.

Several years' worth of feedback from users and client library maintainers suggests that values lower than 5 seconds are fairly likely to cause false positives, and values of 1 second or lower are very likely to do so. Values within the 5 to 20 seconds range are optimal for most environments.
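An application could check a configured timeout against these ranges before connecting. The following is a minimal sketch of such a check; HeartbeatTimeoutCheck and its Assessment values are hypothetical names, and the thresholds simply restate the guidance above rather than anything enforced by RabbitMQ.

```java
// Illustrative only: a hypothetical sanity check for a configured heartbeat timeout.
final class HeartbeatTimeoutCheck {
    private HeartbeatTimeoutCheck() {}

    enum Assessment {
        DISABLED,                     // 0 disables heartbeats
        VERY_LIKELY_FALSE_POSITIVES,  // 1 second or lower
        LIKELY_FALSE_POSITIVES,       // lower than 5 seconds
        RECOMMENDED_RANGE,            // 5 to 20 seconds
        ABOVE_RECOMMENDED_RANGE       // slower detection of dead peers
    }

    static Assessment assess(int timeoutSeconds) {
        if (timeoutSeconds < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        if (timeoutSeconds == 0) return Assessment.DISABLED;
        if (timeoutSeconds <= 1) return Assessment.VERY_LIKELY_FALSE_POSITIVES;
        if (timeoutSeconds < 5) return Assessment.LIKELY_FALSE_POSITIVES;
        if (timeoutSeconds <= 20) return Assessment.RECOMMENDED_RANGE;
        return Assessment.ABOVE_RECOMMENDED_RANGE;
    }

    public static void main(String[] args) {
        for (int timeout : new int[] {0, 1, 3, 10, 60}) {
            System.out.println(timeout + " s -> " + assess(timeout));
        }
    }
}
```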

Heartbeats in STOMP

STOMP 1.2 includes heartbeats. In STOMP, heartbeat timeouts can be asymmetrical: that is to say, client and server can use different values. The RabbitMQ STOMP plugin fully supports this feature.

Heartbeats in STOMP are opt-in. To enable them, use the heart-beat header when connecting. See the STOMP specification for an example.
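As a rough illustration of the heart-beat header, the sketch below builds a STOMP 1.2 CONNECT frame by hand. In STOMP 1.2 the header carries two values in milliseconds: the interval at which the sender can guarantee outgoing heartbeats, and the interval at which it would like to receive them. StompConnectFrame is a hypothetical helper, the host value is a placeholder, and login headers are omitted for brevity; a real application would normally rely on a STOMP client library instead.

```java
// Illustrative only: builds the text of a STOMP 1.2 CONNECT frame that opts in to heartbeats.
final class StompConnectFrame {
    private StompConnectFrame() {}

    static String build(String host, int sendIntervalMs, int receiveIntervalMs) {
        if (sendIntervalMs < 0 || receiveIntervalMs < 0) {
            throw new IllegalArgumentException("intervals must not be negative");
        }
        return "CONNECT\n"
                + "accept-version:1.2\n"
                + "host:" + host + "\n"
                // heart-beat:<can send every N ms>,<want to receive every N ms>
                + "heart-beat:" + sendIntervalMs + "," + receiveIntervalMs + "\n"
                + "\n"    // blank line ends the headers
                + '\0';   // NUL byte terminates the frame
    }

    public static void main(String[] args) {
        // Asymmetrical values: send every 10 seconds, expect to receive every 20 seconds.
        String frame = build("/", 10_000, 20_000);
        System.out.println(frame.replace("\0", "^@"));
    }
}
```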

Heartbeats (Keepalives) in MQTT

MQTT 3.1.1 includes heartbeats under a different name ("keepalives"). The RabbitMQ MQTT plugin fully supports this feature.

Keepalives in MQTT are opt-in. To enable them, set the keepalive interval when connecting. Please consult your MQTT client's documentation for examples.

Heartbeats in Shovel and Federation Plugins

The Shovel and Federation plugins open Erlang client connections to RabbitMQ nodes under the hood. As such, they can be configured to use a desired heartbeat value.

Please refer to the AMQP 0-9-1 URI query parameters reference for details.
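For illustration, the sketch below appends a heartbeat query parameter (a timeout in seconds) to a connection URI of the kind these plugins accept. ShovelUriExample is a hypothetical helper, and the host name and credentials are placeholders.

```java
// Illustrative only: appends a heartbeat query parameter to an AMQP 0-9-1 URI string.
final class ShovelUriExample {
    private ShovelUriExample() {}

    static String withHeartbeat(String baseUri, int timeoutSeconds) {
        if (timeoutSeconds < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        // Use '&' if the URI already has query parameters, '?' otherwise.
        String separator = baseUri.contains("?") ? "&" : "?";
        return baseUri + separator + "heartbeat=" + timeoutSeconds;
    }

    public static void main(String[] args) {
        // Placeholder host and credentials; %2f is the URL-encoded default virtual host "/".
        String uri = withHeartbeat("amqp://user:secret@remote-host.example:5672/%2f", 15);
        System.out.println(uri);
        // Parsing is only a syntax check here; no connection is opened.
        System.out.println(java.net.URI.create(uri).getQuery());
    }
}
```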

Heartbeats and TCP Keepalives

TCP contains a mechanism similar in purpose to the heartbeats (a.k.a. keepalives) in the messaging protocols covered above and to the net tick timeout: TCP keepalives. With their default settings, TCP keepalives are not suitable for messaging protocols and can even be counterproductive. However, with proper tuning they can be useful as an additional defense mechanism in environments where applications cannot be expected to enable heartbeats or use reasonable values. See the Networking guide for details.

Heartbeats and TCP Proxies

Certain networking tools (HAProxy, AWS ELB) and equipment (hardware load balancers) may terminate "idle" TCP connections when there is no activity on them for a certain period of time. Most of the time this is not desirable.

Enabling heartbeats on a connection results in periodic light network traffic. Heartbeats therefore have the side effect of guarding client connections that can go idle for periods of time against premature closure by proxies and load balancers.

Heartbeat timeouts from 10 to 30 seconds will produce periodic network traffic often enough (roughly every 5 to 15 seconds) to satisfy the defaults of most proxy tools and load balancers. Also see the section on low timeouts and false positives above.
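Whether a given heartbeat timeout protects a connection from a particular proxy comes down to comparing the heartbeat traffic interval (about half the timeout) with the proxy's idle timeout. The sketch below shows that comparison; ProxyIdleCheck is a hypothetical helper, and the idle timeout values are made-up inputs rather than the defaults of any specific product.

```java
// Illustrative only: checks whether heartbeat traffic is frequent enough for a proxy's idle timeout.
final class ProxyIdleCheck {
    private ProxyIdleCheck() {}

    static boolean keepsConnectionActive(int heartbeatTimeoutSeconds, int proxyIdleTimeoutSeconds) {
        if (heartbeatTimeoutSeconds <= 0) {
            return false; // heartbeats disabled: no periodic traffic is generated
        }
        // Heartbeat frames are sent roughly every timeout / 2 seconds.
        double trafficIntervalSeconds = heartbeatTimeoutSeconds / 2.0;
        return trafficIntervalSeconds < proxyIdleTimeoutSeconds;
    }

    public static void main(String[] args) {
        int proxyIdleTimeoutSeconds = 60; // hypothetical idle timeout of a load balancer
        System.out.println(keepsConnectionActive(30, proxyIdleTimeoutSeconds));
        System.out.println(keepsConnectionActive(120, proxyIdleTimeoutSeconds));
    }
}
```

The comparison is strict on purpose: traffic that arrives exactly at the idle timeout leaves no margin, so in practice a heartbeat interval comfortably below the proxy's limit is the safer choice.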