Monitoring k8s Nodes from Within

This is a writeup on migrating telegraf from systemd units on k8s cluster hosts to a daemonset within k8s itself. I’m doing this as part of my ongoing endeavor to absolutely minimize the configs of the cluster hosts themselves.

Here’s what the initial NixOS telegraf config looked like:

  services.telegraf = {
    enable = true;
    extraConfig = {
      outputs = {
        influxdb_v2 = {
          urls = ["http://172.30.190.62:8086"];
          token = "asdfasdf";
          organization = "default";
          bucket = "default";
        };
      };
      inputs = {
        mem = {};
        sensors = {};
        cpu = {
          percpu = false;
        };
        linux_cpu = {
          metrics = ["cpufreq"];
        };
        disk = {
          mount_points = ["/"];
        };
        diskio = {
          devices = ["{{node.boot_device}}"];
        };
        net = {
          interfaces = ["{{node.interface}}"];
        };
        smart = {
          use_sudo = true;
          attributes = true;
        };
      };
    };
  };

  systemd.services.telegraf.path = [ pkgs.lm_sensors pkgs.smartmontools pkgs.nvme-cli "/run/wrappers" ];
  security.sudo.extraRules = [{
    users = [ "telegraf" ];
    commands = [
      { command = "${pkgs.smartmontools}/bin/smartctl"; options = [ "NOPASSWD" ]; }
      { command = "${pkgs.nvme-cli}/bin/nvme"; options = [ "NOPASSWD" ]; }
    ];
  }];

Here’s the DaemonSet manifest:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf
  namespace: default
  labels:
    app: telegraf
spec:
  selector:
    matchLabels:
      app: telegraf
  template:
    metadata:
      labels:
        app: telegraf
    spec:
      hostNetwork: true
      containers:
        - name: telegraf
          image: telegraf:1.33
          securityContext:
            privileged: true
          env:
            - name: HOST_PROC
              value: /host/proc
          volumeMounts:
            - name: telegraf
              mountPath: /etc/telegraf
            - name: udev
              mountPath: /run/udev
              readOnly: true
            - name: proc
              mountPath: /host/proc
              readOnly: true
      volumes:
        - name: telegraf
          configMap:
            name: telegraf
        - name: udev
          hostPath:
            path: /run/udev
            type: Directory
        - name: proc
          hostPath:
            path: /proc
            type: Directory

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf
  namespace: default
data:
  telegraf.conf: |
    [inputs.cpu]
    percpu = false

    [inputs.disk]
    mount_points = ["/"]

    [inputs.diskio]
    devices = ["/dev/nvme0n1"]

    [inputs.linux_cpu]
    metrics = ["cpufreq"]

    [inputs.mem]

    [inputs.sensors]

    [outputs.influxdb_v2]
    bucket = "default"
    organization = "default"
    token = "asdfasdf"
    urls = ["http://172.30.190.62:8086"]

inputs.net has been deprecated in 1.33, so it’s been removed until I find a replacement. inputs.smart requires smartmontools and nvme-cli, which aren’t in that telegraf image, so it’s been removed until I get around to creating my own image.

Other than that, everything works the same. What allows this to work is privileged: true and hostNetwork: true, along with the host’s /proc mounted at /host/proc and exposed through HOST_PROC. Together these get us as close to ‘running straight on the host’ as we can get from within a pod.
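
If inputs.disk ever turns out to be reporting the container’s view of / rather than the host’s, the Telegraf disk plugin docs describe mounting the host root into the container and pointing HOST_MOUNT_PREFIX at it. A rough sketch of what that might look like as an addition to the pod spec above follows. The hostfs volume name and the /hostfs mount path are assumptions picked for illustration, and this is only a fragment to merge into the existing DaemonSet, not a complete manifest.

```yaml
# Sketch only: fragment of the DaemonSet pod spec, to be merged into the manifest above.
# The "hostfs" volume name and the /hostfs path are illustrative assumptions.
spec:
  template:
    spec:
      containers:
        - name: telegraf
          env:
            # Already present in the manifest above.
            - name: HOST_PROC
              value: /host/proc
            # Tells the disk input to look up mount points under /hostfs.
            - name: HOST_MOUNT_PREFIX
              value: /hostfs
          volumeMounts:
            # Host root filesystem, read-only.
            - name: hostfs
              mountPath: /hostfs
              readOnly: true
      volumes:
        - name: hostfs
          hostPath:
            path: /
            type: Directory
```

With this in place, mount_points = ["/"] should keep referring to the host’s root, since the plugin strips the prefix again before tagging the metric. I’d treat it as something to test on one node first rather than a drop-in change.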

Nathan Hensel

2024-12-31