This is a quick note on setting up keepalived for Kubernetes API failover on NixOS.
NixOS has sane defaults for most of the keepalived config, so we only need a few lines:
services.keepalived = {
  enable = true;
  vrrpInstances.k3s = {
    interface = "{{node.interface}}";
    virtualIps = [{ addr = "172.30.190.100"; }];
    virtualRouterId = 1;
    trackScripts = ["k3s"];
  };
  vrrpScripts.k3s = {
    user = "root";
    script = "systemctl is-active k3s";
    interval = 10;
  };
};
{{node.interface}} is a placeholder for my templating engine; put a real interface name there.
This does a basic check that the k3s unit is running. Sure, it is possible for k3s to be running but not healthy; I think just checking the unit is a fine balance of simplicity for now. It should at least trigger a failover if the unit is stuck in a ‘restarting’ state.
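If you later want a stricter check, one option is to probe the apiserver's health endpoint instead of the unit state. This is a sketch, not something from my setup: it assumes the apiserver listens on 6443 and that anonymous access to the health endpoints is still enabled (the Kubernetes default):

  # Hypothetical stricter check: ask the apiserver itself whether it is ready.
  # -s silent, -f fail on HTTP errors, -k skip cert verification.
  vrrpScripts.k3s = {
    user = "root";
    script = "${pkgs.curl}/bin/curl -sfk https://127.0.0.1:6443/readyz";
    interval = 10;
  };

This would also catch the case where the unit is active but the apiserver is wedged, at the cost of depending on the apiserver's auth configuration.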
I don’t bother to set a priority; with equal priorities, VRRP breaks the tie in favor of the highest IP address.
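If you do want a deterministic preferred master, the NixOS module exposes an explicit priority option. A sketch, where "node-a" is a made-up hostname:

  # Hypothetical: pin the preferred master with an explicit priority
  # (higher wins the VRRP election).
  services.keepalived.vrrpInstances.k3s.priority =
    if config.networking.hostName == "node-a" then 150 else 100;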
In my client config on my desktop, I have my cluster registered via a DNS name:
~ > kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://kubernetes.default:6443
Which is set via a hosts entry:
networking.extraHosts = ''
  172.30.190.100 kubernetes.default
'';
Testing
If we go to the current master and stop k3s:
Dec 05 20:04:41 d752cf78-d889-5bd9-8dd4-2bb7eeb898f4 Keepalived_vrrp[974]: Script `k3s` now returning 3
Dec 05 20:05:01 d752cf78-d889-5bd9-8dd4-2bb7eeb898f4 Keepalived_vrrp[974]: VRRP_Script(k3s) failed (exited with status 3)
Dec 05 20:05:01 d752cf78-d889-5bd9-8dd4-2bb7eeb898f4 Keepalived_vrrp[974]: (k3s) Entering FAULT STATE
Another node will claim the address:
Dec 05 20:05:02 933fa9c8-51d7-5477-8163-5890a80109bd Keepalived_vrrp[957]: (k3s) Entering MASTER STATE
Likewise, if we force k3s into a weird restarting state:
[root@933fa9c8-51d7-5477-8163-5890a80109bd:~]# while true; do systemctl restart k3s ; done
keepalived does catch on:
Dec 05 20:07:44 933fa9c8-51d7-5477-8163-5890a80109bd Keepalived_vrrp[957]: Script `k3s` now returning 3
Dec 05 20:08:04 933fa9c8-51d7-5477-8163-5890a80109bd Keepalived_vrrp[957]: VRRP_Script(k3s) failed (exited with status 3)
Dec 05 20:08:04 933fa9c8-51d7-5477-8163-5890a80109bd Keepalived_vrrp[957]: (k3s) Entering FAULT STATE
because is-active will report either activating or failed:
[root@933fa9c8-51d7-5477-8163-5890a80109bd:~]# systemctl is-active k3s
activating
In many of these cases there is at least 10 seconds of downtime during failover. We could of course add complexity: a load-balancing layer, richer health checks, tighter intervals. But I think this is perfectly acceptable for a quick, low-effort, low-cost control-plane failover mechanism.
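If that ~10 second gap ever matters, the first knob to turn is the script interval. A sketch of a tighter configuration; the fall and rise counts are keepalived options I have not tuned in this setup:

  # Hypothetical: check every 2s, enter FAULT after 2 consecutive
  # failures, recover after 2 consecutive successes.
  services.keepalived.vrrpScripts.k3s = {
    user = "root";
    script = "systemctl is-active k3s";
    interval = 2;
    fall = 2;
    rise = 2;
  };

The trade-off is more churn: a brief k3s restart that the 10-second interval would have sailed past can now trigger a failover.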