High Availability
If you want a system to be highly available, it is not sufficient to simply run multiple nodes of your application: if one goes down, your clients still need to find their way to another one. There are a few options to achieve this:
- Application Retry: You could program your client application to go through the servers one by one until one responds. This introduces latency whenever the first server is down, and there is no load balancing: all requests are handled by the first node until it fails.
- External Load Balancing: You could use an external load balancer like HAProxy or NGINX. This gives you load balancing, and if one of your nodes goes down, the load balancer simply redirects all traffic to the remaining nodes. But what if the load balancer itself goes down? You have probably just moved the single point of failure one layer up; unless the load balancer itself is highly available, it does not help you.
- Keepalived: The third option is my preferred one, because it has no external dependencies. It does not provide load balancing, though.
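To illustrate the first option, here is a hypothetical client-side fallback sketch in shell. The host list, port, and `/health` endpoint are assumptions for the example, not part of any real setup:

```shell
# pick_server CHECK HOST...
# Echo the first host for which the CHECK command succeeds; fail if none do.
pick_server() {
    check="$1"; shift
    for host in "$@"; do
        if "$check" "$host"; then
            echo "$host"
            return 0
        fi
    done
    return 1
}

# Hypothetical check: is the application answering on this host?
is_up() {
    curl -s --connect-timeout 2 "http://$1:8080/health" >/dev/null
}

# The client walks the list until one server responds.
pick_server is_up 10.0.1.62 10.0.1.63 10.0.1.64 || echo "no server reachable"
```

Note the drawback described above: every client always probes the hosts in the same order, so the first reachable server receives all the traffic.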
Keepalived
Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to achieve high availability. Every node gets a priority, and the one with the highest priority becomes the `MASTER` node. The master node listens on the virtual IP (VIP) and receives all the traffic.
All the other nodes (called `BACKUP`) constantly listen for the master's advertisements to check that it is still there. If the master stops responding, the backup node with the highest priority steps in, binds the VIP and thereby receives the traffic.
As soon as the master comes back up, all nodes agree to hand the VIP back to it.
This ensures that your clients can always reach one of your nodes without any external dependencies. One downside is that you cannot load-balance traffic between the nodes. You are also limited to 255 VRRP instances (virtual router IDs) per network.
Installation
Install Packages
First of all we need the keepalived package. It is in the standard repositories of every RHEL-based Linux distribution and can be installed like this:
$ dnf install keepalived
Master Configuration
The next step is to configure the master node.
To do so, open the configuration file with a text editor. If it contains an example configuration, you can safely delete it.
$ vim /etc/keepalived/keepalived.conf
vrrp_instance Postgres_VIP {
    state MASTER
    interface ens18
    virtual_router_id 51
    priority 255
    advert_int 1
    unicast_src_ip 10.0.1.62
    unicast_peer {
        10.0.1.63
        10.0.1.64
    }
    authentication {
        auth_type PASS
        auth_pass PostgresReplication
    }
    virtual_ipaddress {
        10.0.2.100/8
    }
}
The parameters explained:
- `vrrp_instance` defines an individual instance of the VRRP protocol running on an interface. Just choose a unique name.
- `state` defines the initial state the instance should start in (MASTER or BACKUP).
- `interface` defines the interface VRRP runs on. This should be your LAN interface.
- `virtual_router_id` is the unique identifier of this instance. It must not be used twice in the same network.
- `priority` is the advertised priority of this node. Every node should have a unique priority; the node with the highest priority receives the VIP.
- `advert_int` specifies the interval between advertisements (1 second in this case).
- `authentication` specifies the information the servers participating in VRRP use to authenticate with each other. In this case, a simple password is defined.
- `virtual_ipaddress` defines the IP addresses (there can be multiple) that VRRP is responsible for: in other words, your VIP.
- `unicast_src_ip` is the IP of this node.
- `unicast_peer` lists the IPs of the other nodes.
Backup Configuration
Next we will create the configuration on the backup nodes.
It is mostly identical to the master node's, with a few key differences:
- The IP addresses have to be changed: `unicast_src_ip` is this node's own IP, and `unicast_peer` lists the other nodes.
- The priority must be lower than the master's, and every node should get its own unique priority.
If there is an example configuration in the file, you can delete it.
$ vim /etc/keepalived/keepalived.conf
vrrp_instance Postgres_VIP {
    state BACKUP
    interface ens18
    virtual_router_id 51
    priority 254
    advert_int 1
    unicast_src_ip 10.0.1.63
    unicast_peer {
        10.0.1.62
        10.0.1.64
    }
    authentication {
        auth_type PASS
        auth_pass PostgresReplication
    }
    virtual_ipaddress {
        10.0.2.100/8
    }
}
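A related option worth knowing: by default, a higher-priority node always preempts, which is why the VIP moves back to the master as soon as it returns. If you would rather avoid that second switchover, keepalived supports the `nopreempt` option. A minimal sketch (keepalived requires the initial `state` to be BACKUP on all nodes when using it):

```
vrrp_instance Postgres_VIP {
    state BACKUP    # nopreempt requires state BACKUP, even on the preferred node
    nopreempt       # do not take the VIP back when a higher-priority node returns
    ...
}
```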
Firewall
If you are using firewalld
you will need to allow VRRP traffic so the nodes can see each other's advertisements. Note that VRRP is IP protocol number 112, not a TCP port:
$ firewall-cmd --add-protocol=vrrp
$ firewall-cmd --runtime-to-permanent
Start Service
Now that everything is configured, we can start and enable the keepalived service. At this point the master node will begin listening on the virtual IP.
$ systemctl enable --now keepalived
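To see which node currently holds the VIP, you can check whether the address (10.0.2.100 on interface ens18 in this example) is bound locally:

```shell
# On the current master this prints the VIP; on backup nodes it prints nothing.
ip -o addr show ens18 2>/dev/null | grep -o '10\.0\.2\.100' \
    || echo "VIP not on this node"
```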
Failover Test
To test the failover, I will disconnect the network on the master and watch the logs on one of the backup nodes.
You can see the log below.
$ journalctl -f | grep Keepalived_vrrp
Dec 9 11:05:38 postgres-02 Keepalived_vrrp[7675]: (Postgres_VIP) Entering MASTER STATE
Dec 9 11:05:38 postgres-02 Keepalived_vrrp[7675]: (Postgres_VIP) setting VIPs.
Dec 9 11:05:38 postgres-02 Keepalived_vrrp[7675]: (Postgres_VIP) Sending/queueing gratuitous ARPs on ens18 for 10.0.2.100
The backup node enters the master state and activates the VIP; all traffic now goes through this node.
As soon as the original master comes back up, it will detect that a lower-priority node is acting as master and reclaim control. That is when the backup server deactivates the VIP so the master can take over.
$ journalctl -f | grep Keepalived_vrrp
Dec 9 11:05:37 postgres-01 Keepalived_vrrp[7675]: (Postgres_VIP) received lower priority (100) advert from 10.0.1.63 - discarding
Dec 9 11:05:38 postgres-01 Keepalived_vrrp[7675]: (Postgres_VIP) received lower priority (100) advert from 10.0.1.63 - discarding
Dec 9 11:05:38 postgres-01 Keepalived_vrrp[7675]: (Postgres_VIP) Receive advertisement timeout
Dec 9 11:05:38 postgres-01 Keepalived_vrrp[7675]: (Postgres_VIP) Entering MASTER STATE
Dec 9 11:05:38 postgres-01 Keepalived_vrrp[7675]: (Postgres_VIP) setting VIPs.
Dec 9 11:05:38 postgres-01 Keepalived_vrrp[7675]: (Postgres_VIP) Sending/queueing gratuitous ARPs on ens18 for 10.0.2.100
Because of that, the backup node switches back to the backup state.
$ journalctl -f | grep Keepalived_vrrp
Dec 9 11:01:39 postgres-02 Keepalived_vrrp[7724]: (Postgres_VIP) Entering BACKUP STATE (init)
Scripts
By default the VIP stays on the master as long as the server itself is running.
If you want the failover to depend on a specific port or service instead, you can use a health-check script.
To do so, we declare a `vrrp_script`
called apiserver and use `curl`
as the test command. In this case I am checking the Kubernetes API on port 6443:
vrrp_script apiserver {
    script "/usr/bin/curl -s -k https://localhost:6443/healthz -o /dev/null"
    interval 5    # check every 5 seconds
    timeout 2     # fail if it takes longer than 2 seconds
    rise 5        # claim master state after 5 successful checks
    fall 2        # remove master state after 2 failed checks
    user root     # execute checks as root
}

vrrp_instance ServerCenter {
    state MASTER
    interface ens192
    virtual_router_id 55
    priority 30
    advert_int 1
    unicast_src_ip 10.0.0.246
    unicast_peer {
        10.0.1.4
        10.0.1.51
    }
    authentication {
        auth_type PASS
        auth_pass svcentPass
    }
    virtual_ipaddress {
        10.0.0.191/22
    }
    track_script {
        apiserver    # use the script from above
    }
}
To activate it we need to reload the service.
$ systemctl reload keepalived
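If the service you want to track does not expose an HTTP endpoint for `curl`, the same pattern works with a plain TCP probe. A hypothetical sketch (the host, port, and script path are placeholders, not from any real setup):

```shell
#!/bin/sh
# Hypothetical TCP health check for use as a vrrp_script.
# check_tcp HOST PORT: exit 0 if HOST:PORT accepts a TCP connection
# within 2 seconds; keepalived treats a non-zero exit as a failed check.
check_tcp() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Saved as an executable file that ends with: check_tcp "$1" "$2"
# it could be referenced from the configuration like this:
#   script "/usr/local/bin/check_tcp.sh 127.0.0.1 5432"
```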
Now your service should be highly available: if one node fails, another can take over.