Serving HTTP over Anycast

I make heavy use of anycast and FRR BGP for load balancing and failover of HTTP services. This page contains my running notes on the state of the technology on Linux.

motivation

Anycast load balancing - that is, advertising the same /32 IP address from multiple hosts across a network - is to my knowledge the best bang-for-buck load-balancing/HA mechanism when evaluated in terms of complexity vs. capability. With every other mechanism I’m aware of, I find myself asking “ok, but who load-balances the load-balancers?”. With anycast, BGP, and ECMP, the network itself performs the load balancing and failover - much like the internet at large.
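
For concreteness, here is a minimal sketch of what this looks like with FRR on each host - all addresses, ASNs, and interface names below are made up for illustration, not taken from my setup. The anycast address lives on the loopback and bgpd advertises it upstream; the spine learns the same /32 from several hosts.

    # on each host: put the anycast service address on the loopback
    ip addr add 192.0.2.10/32 dev lo

    # minimal /etc/frr/frr.conf fragment on the host
    router bgp 65001
     neighbor 10.0.0.1 remote-as 65000
     address-family ipv4 unicast
      network 192.0.2.10/32
     exit-address-family

On the spine side, 'maximum-paths' under the same address-family is what allows the equal-cost routes learned from multiple hosts to be installed together as an ECMP route.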


drawbacks

There are two primary risks that I’ve become aware of in using this approach:

next-hop stability

In making forwarding decisions for an anycast address, spine routers hash connections over next-hops of equal cost. In a steady state this keeps TCP sessions intact. The catch is that classic BGP+ECMP does not implement a ‘stable’ hashing algorithm: when the downstream endpoint set changes (for example, when a Kubernetes deployment is scaled), forwarding decisions get shuffled around, breaking at least some TCP connections. For otherwise “stateless” HTTP endpoints this is generally an acceptable tradeoff for me, given the simplicity of the system as a whole - but for many workloads it may not be. The good news is that resilient hashing has made it into the Linux kernel; as of January 2023, though, support for this feature in things like FRR appears limited.
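
For reference, the kernel mechanism in question is resilient next-hop group hashing (merged around Linux 5.13), which keeps existing hash buckets pinned to their next-hop when group membership changes. A rough iproute2 sketch with made-up addresses - this shows the raw kernel feature, independent of whether FRR will program it for you:

    # two next-hops toward hosts advertising the anycast /32 (illustrative)
    ip nexthop add id 1 via 10.0.1.10 dev eth0
    ip nexthop add id 2 via 10.0.1.11 dev eth0

    # a resilient group: 512 hash buckets; buckets idle for 120s may be remapped
    ip nexthop add id 100 group 1/2 type resilient buckets 512 idle_timer 120

    # route the anycast address via the resilient group
    ip route add 192.0.2.10/32 nhid 100

If next-hop 2 disappears, only the buckets that pointed at it get remapped; flows hashed onto next-hop 1 keep their existing assignment.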

null routes

Another risk to be aware of is inconsistency in the DC routing table due to ungraceful failure of a host or endpoint - that is to say, a host might run out of memory or disk space and continue to wrongfully advertise a prefix it can no longer actually serve.

The fundamental issue here, to my mind, is that BGP is not a CP system - in CAP-theorem terms it is AP, favoring availability over consistency.

Contrast this with how the same sort of failure would be handled inside Kubernetes: components in a cluster inherently benefit from the consistent, partition-tolerant nature of the cluster’s key-value store, and kube-proxy will not forward to nodes or pods that have been determined NotReady.

We could perhaps have datacenter spine routers interact with the kube-apiserver to more reliably discover ‘ready’ pods, but I’m uncertain whether that would solve the problem or just move it. I suppose there is some value in the fact that a router running no significant storage or memory workload is much less likely to fail in an unpleasant way than any given compute node.
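
One host-side mitigation worth sketching - an illustration only, not a cure for the underlying AP-ness - is to gate the advertisement on a local health check: if the HTTP service stops answering, drop the loopback address so the connected /32 leaves the RIB and FRR withdraws the prefix (assuming FRR's network import-check behavior, the default in recent versions). The address, port, and health URL below are made up.

    #!/bin/sh
    # hypothetical health gate: withdraw the anycast /32 when the local
    # HTTP service stops responding. All values are illustrative.
    VIP=192.0.2.10/32

    while sleep 5; do
        if curl -fsS --max-time 2 http://127.0.0.1:8080/healthz >/dev/null; then
            # healthy: ensure the address (and therefore the route) is present
            ip addr add "$VIP" dev lo 2>/dev/null || true
        else
            # unhealthy: removing the address removes the connected route,
            # so the 'network' statement no longer matches and the prefix
            # is withdrawn from the spine
            ip addr del "$VIP" dev lo 2>/dev/null || true
        fi
    done

Of course, a host that is wedged badly enough to stop running this loop is right back to advertising a prefix it can’t serve - which is exactly the “just move the problem” concern above.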


Nathan Hensel



2023-01-30