Net Tick Time (Inter-node Communication Heartbeats)
Overview
This guide covers a mechanism used by RabbitMQ nodes and CLI tools (well, Erlang nodes) to determine peer [un]availability, known as "net ticks" or kernel.net_ticktime.
Each pair of nodes in a cluster is connected by the transport layer. Periodic tick messages are exchanged between all pairs of nodes to maintain the connections and to detect disconnections. Network interruptions could otherwise go undetected for a fairly long period of time (depending on the transport and OS kernel settings, e.g. for TCP). Fundamentally this is the same problem that heartbeats seek to address in messaging protocols, just between different peers: RabbitMQ cluster nodes and CLI tools.
Nodes and connected CLI tools periodically send each other small data frames. If no data was received from a peer in a given period of time, that peer is considered to be unavailable ("down").
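To observe the distribution connections that these ticks maintain, one option is to ask a running node which peers it currently considers connected. nodes/0 is a standard Erlang BIF, so it can be invoked via rabbitmqctl eval:
rabbitmqctl eval 'nodes().'
Peers that have been detected as down will no longer appear in the returned list.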
When one RabbitMQ node determines that another node has gone down it will log a message giving the other node's name and the reason, like:
2018-11-22 10:44:33.654 [info] node rabbit@peer-hostname down: net_tick_timeout
In this case the net_tick_timeout event tells us that the other node was detected as down due to the net ticktime being exceeded. Another common reason is connection_closed, meaning that the connection was explicitly closed at the TCP level.
Erlang documentation contains more details on this subsystem.
Tick Frequency
The frequency of both tick messages and detection of failures is controlled by the net_ticktime configuration setting. Normally four ticks are exchanged between a pair of nodes every net_ticktime seconds. If no communication is received from a node within net_ticktime (± 25%) seconds, the node is considered down and no longer a member of the cluster.
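For example, with the default net_ticktime of 60 seconds, a tick is sent roughly every 15 seconds, and an unreachable peer will typically be detected as down somewhere between 45 and 75 seconds (60 ± 25%) after it stops responding.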
Increasing the net_ticktime across all nodes in a cluster will make the cluster more resilient to short network outages, but it will take longer for the remaining nodes to detect crashed nodes. Conversely, reducing the net_ticktime across all nodes in a cluster will reduce detection latency, but increase the risk of detecting spurious partitions.
The impact of changing the default net_ticktime should be carefully considered. All nodes in a cluster must use the same net_ticktime. The following sample advanced.config configuration demonstrates doubling the default net_ticktime from 60 to 120 seconds:
[
  {kernel, [{net_ticktime, 120}]}
].
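After a restart, the effective value can be verified from the runtime. net_kernel:get_net_ticktime/0 is a standard Erlang kernel function, so one way to check (a sketch, assuming a locally reachable node) is:
rabbitmqctl eval 'net_kernel:get_net_ticktime().'
This should return 120 on a node running with the configuration above.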
Effects on HTTP API
The HTTP API often needs to perform cluster-wide queries, which means the management UI can appear unresponsive until a partition is detected and handled. Lowering net_ticktime can help improve responsiveness during such events, but any decision to change net_ticktime should be made carefully, as emphasised above.
Windows Quirks
Due to how RabbitMQ starts as a Windows service, you can't use a configuration file to set net_ticktime. Please see this section in the Windows Quirks document to set net_ticktime when running RabbitMQ as a Windows service.