A resilient Ceph-backed image registry for Kubernetes

When building services on Kubernetes, your image registry very quickly becomes a mission-critical component of the infrastructure. Your first-principles, indestructible, stateless web app can’t autoscale if the registry is down (not without some fancy image priming, at least). An HA registry is therefore a fundamental building block for success with on-premises Kubernetes.

A naive first approach to this requirement might be to run the registry outside the cluster - using systemd, haproxy, keepalived/carp, and perhaps nfs for shared storage - a sort of classic HA stack. To my mind this is a bit of an anti-pattern: the shortcomings of those sorts of technologies are among the good reasons to use Kubernetes in the first place. For example, it would be nice for the registry to be indifferent to rack/availability-zone failures, which isn’t necessarily achievable in a modern DC network with simple L2 ARP/MAC-based failover.

If we’re to run the registry inside Kubernetes itself, the next question we arrive at is how to bootstrap a customized registry deployment - with storage and networking - without yet having a registry to store images in.

storage

I’m assuming you are somewhat versed in k8s storage paradigms and some of the solutions that are out there. I’m using the Rook Ceph distribution. Rook provides primitives for block, object, and NFS storage - and is itself highly resilient.
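
For context, the object side of Rook is declared with a CephObjectStore resource, which stands up the RADOS gateway (RGW) that the registry will talk to. Roughly - and this is a sketch rather than my exact manifest - it looks like the following; the store name local-store and gateway port 9000 line up with the regionendpoint used below, while the pool sizes and gateway instance count are illustrative:

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: local-store
  namespace: rook-ceph
spec:
  # 3x replication assumes at least three OSD nodes
  metadataPool:
    replicated:
      size: 3
  dataPool:
    replicated:
      size: 3
  gateway:
    # rook exposes this as the rook-ceph-rgw-local-store service
    port: 9000
    instances: 2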

Kubernetes is fundamentally a CP system with regard to the CAP theorem. This means that in the event of a network partition or node failure, it will not allow itself to enter an inconsistent or incorrect state. The most significant caveat is that pods on a disconnected node will continue to execute; the kubelet on that island node simply takes no action until connectivity is restored. This is why Kubernetes can’t recover pods with mounted block storage. If a node running a pod with a mounted volume is disconnected or powered off, the control plane’s majority quorum has no way of knowing whether that pod is still out there somewhere writing to the volume - and therefore will not attempt recovery as it would for a standard stateless deployment. Due to this behavior, we cannot simply deploy our registry with a block volume and a replica count of 1, expecting Kubernetes to shuffle things around for us if a host is lost.
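
For illustration, this is the sort of manifest that gets ruled out - a sketch of the backing volume for a single-replica registry, where rook-ceph-block is the usual Rook StorageClass name and the size is arbitrary:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 50Gi

A registry Deployment with a single replica mounting this claim won’t fail over cleanly after an unclean node loss: the replacement pod can’t safely attach the volume while the control plane can’t confirm the old pod has stopped writing to it, so recovery ends up needing manual intervention.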

The solution here is object storage. Backed by object storage, the registry itself remains stateless and is therefore trivial to replicate and fail over. We can point it at the object store - without rebuilding the image - using a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: registry
data:
  config.yml: |
    version: 0.1
    http:
        addr: :5000
        tls:
            certificate: /var/run/secrets/certs/tls.crt
            key: /var/run/secrets/certs/tls.key
    storage:
        redirect:
            disable: true
        s3:
            accesskey:
            secretkey:
            region: us-east-1
            regionendpoint: http://rook-ceph-rgw-local-store.rook-ceph.svc.cluster.local:9000
            bucket: registry
            encrypt: false
            secure: false
            v4auth: true
            chunksize: 5242880
            rootdirectory: /

In my environment, Rook is operating a RADOS object gateway (RGW) deployment behind a ClusterIP service at rook-ceph-rgw-local-store.rook-ceph.svc.cluster.local. My secretkey and accesskey have been omitted.

It is of course not ideal to store credentials this way, but I am not aware of a way to embed Secrets into ConfigMaps. To me this small tradeoff is well worth not having to build a custom registry image and get into a chicken-and-egg scenario.
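
As for where the accesskey and secretkey come from in the first place: one option is to have Rook mint an S3 user for the registry. A sketch, assuming the object store is named local-store (the user name here is arbitrary) - Rook drops the generated keys into a Secret in the rook-ceph namespace, and the bucket itself can then be created with any S3 client pointed at the regionendpoint:

apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: registry
  namespace: rook-ceph
spec:
  store: local-store
  displayName: "image registry"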

This bit is required because we’re pointing the registry at the ClusterIP’s internal DNS name: with redirects enabled, the registry answers blob requests by redirecting the client straight to the storage endpoint, which isn’t resolvable or reachable from outside the cluster. Disabling redirects makes the registry proxy the data itself:

        redirect:
            disable: true

networking

The next hurdle is how to plumb up the networking. To keep things as simple as possible - with no external dependencies - I’ve opted for a NodePort service.

NodePorts can waste bandwidth: external clients have no way of knowing which node a pod is actually running on, so traffic often takes an extra hop through kube-proxy before it reaches the registry. That’s an acceptable tradeoff here. We can alleviate the inefficiency for pulls by mapping the registry hostname to 127.0.0.1 on the k8s nodes themselves via /etc/hosts, so traffic immediately enters the cluster network on the local node and is routed to wherever the pod lives.
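
Concretely, that’s a single line on each node, where registry.local stands in for whatever hostname the registry’s TLS certificate actually covers and 30500 is the NodePort chosen below:

# /etc/hosts on every k8s node
127.0.0.1   registry.local

Pod specs then reference images as registry.local:30500/some-image:tag, and a node pulling that image resolves the registry to itself, entering the cluster network on the spot.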

With the ConfigMap, bucket, certificates, and credentials in place, the full solution looks like this:

resource "kubernetes_service" "registry" {
  metadata {
    name = "registry"
  }
  spec {
    selector = {
      app = "registry"
    }
    port {
      port        = "5000"
      target_port = "5000"
      node_port   = "30500"
    }
    type             = "NodePort"
    session_affinity = "ClientIP"
  }
}

resource "kubernetes_deployment" "registry" {
  metadata {
    name = "registry"
  }
  spec {
    replicas = 2
    selector {
      match_labels = {
        app = "registry"
      }
    }
    template {
      metadata {
        labels = {
          app = "registry"
        }
      }
      spec {
        container {
          image = "registry:2.8"
          name  = "registry"
          volume_mount {
            name       = "cert-vol"
            mount_path = "/var/run/secrets/certs"
          }
          volume_mount {
            name       = "config-yml"
            mount_path = "/etc/docker/registry/"
            read_only  = true
          }
          env {
            name  = "REGISTRY_HTTP_TLS_CERTIFICATE"
            value = "/var/run/secrets/certs/tls.crt"
          }
          env {
            name  = "REGISTRY_HTTP_TLS_KEY"
            value = "/var/run/secrets/certs/tls.key"
          }
        }
        volume {
          name = "cert-vol"
          secret {
            secret_name = "registry-cert"
          }
        }
        volume {
          name = "config-yml"
          config_map {
            name = "registry"
          }
        }
      }
    }
  }
}
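
Two small loose ends. The Deployment references a TLS secret named registry-cert that isn’t shown above; given an existing key pair for the registry hostname it can be created the usual way, and a quick push then exercises the whole path end to end (registry.local again being a stand-in for the real hostname, run from any machine that resolves it and trusts the certificate):

kubectl create secret tls registry-cert --cert=registry.crt --key=registry.key

# smoke test
docker pull alpine:3.17
docker tag alpine:3.17 registry.local:30500/alpine:3.17
docker push registry.local:30500/alpine:3.17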

And there we have a rack/AZ-agnostic registry on top of distributed storage that will (within reason) handle arbitrary node reboots, pod migrations, and so on.

Nathan Hensel

2023-02-16