High Availability (HA) ensures that the services running on the firewall are always available even if one unit is unavailable due to maintenance or a hardware problem. Reliable HA depends on the correct configuration of the surrounding switches and routers. Especially important is the ARP cache time or ARP timeout, which must be set to a value between 30 and 60 seconds. When the virtual server fails over to the secondary firewall, the MAC addresses associated with the virtual server IPs also change. The MAC address is immediately sent out via gratuitous or unsolicited ARP requests, updating the MAC address table or ARP cache of the connected switches and routers. If the lifetime of the ARP timeout of the switch is set to be longer, for example 300 seconds, the secondary unit would not be reachable for up to 5 minutes because the ARP cache would not be updated for that time. Longer timeouts also increase the number of ARP requests sent out by the firewall, increasing the load on the switch.
Requirements and Limitations for High Availability
- Both units must use the same platform. You cannot mix virtual and physical appliances.
- Both units must be the same model. Using different revisions of the same hardware appliance is possible.
- If you are running an HA setup with different appliance revisions, ensure that both physical ports of the private uplink are using identical port labels. Otherwise, HA synchronization may fail.
- Latency on the HA sync connection may not exceed 80 ms.
Stand-alone High Availability
For a stand-alone HA cluster, the primary unit downloads the licenses for both units, and when the secondary unit is joined to the HA cluster, the license for the secondary unit is transferred over. The licenses are bound to the MAC addresses of the primary and secondary unit.
For more information, see How to Set Up a High Availability Cluster.
In Depth: Transparent Failover Procedure and Limitations
An HA system can be used for load balancing to exploit all features that are available through the firewall architecture. Use transparent failover to synchronize the forward packet sessions (inbound and outbound TCP, UDP, ICMP-Echo, and OTHER-IP-Protocols) of the firewall server between the two HA partners. Transparent failover is enabled by default and is set per access rule.
The following information is not HA-synced:
|Module or component||Unsynchronized sub-components|
|Access Control Service||All|
Synchronization can be carried out via dedicated HA uplink and/or the LAN connection. Synchronization traffic is transmitted by AES-encrypted UDP packets, so-called sync packets, on port 689. The AES keys are created by using the BOX RSA keys and renewed every 60 seconds.
Only a small amount of synchronization traffic is necessary for synchronizing via LAN connection. Sync traffic is kept at a minimum by synchronizing only sessions and not each packet. Due to the characteristics of the TCP protocol (SYN, SYN-ACK, …), only existing established TCP connections are synchronized. When the synchronization takes place during the TCP handshake, the handshake must be repeated.
The synchronizing procedure takes place immediately (if possible). If synchronization packets are lost, up to 70 sessions per second are synchronized.
Depending on the system availability, the behavior differs:
- If the partner unit is inactive/rebooted – It may happen that the backup unit is not available and, therefore, does not respond to the sync packets (for example, for maintenance reasons). In this case, the active unit stops synchronizing. As soon as the partner unit reappears, the active unit checks whether the other one was rebooted or has an obsolete session state and re-synchronizes all necessary sessions.
- If the active unit reboots without a takeover – The Firmware Restart button was clicked. The ACPF sessions and sockets are gone, but the unit is not rebooted physically. In this case, the partner unit recognizes that its session state is obsolete and removes all synchronized sessions.
When the primary, active HA unit does not respond to the heartbeat (Control UDP 801), a takeover is initiated after a 10-15 second delay. This delay is necessary to account for potentially low network performance.
When the primary unit stays inactive, the synchronized sessions on the second unit are activated and all connections are available again. The backup unit does not have the current TCP sequence numbers. In case of a takeover, the sequence number is not checked for correctness. As soon as the connection has traffic, the sequence number is known to the former backup unit, and the sequence number check can be performed again. The missing sequence number on the backup unit also results from the fact that TCP connections that were taken over but have since had no traffic cannot be reset in a clean way. Terminating the session via the Terminate Session button removes the connection, but does not send a TCP Reset (TCP-RST) signal.