Scalable System Engineering with NixOS

I finally took a weekend to try out nixos, and in short, nix appears to ‘solve’ configuration management better than anything else I’ve seen.

While I know I’m late to the party, watching Matthew Croughan’s “What Nix Can Do (Docker Can’t) - SCaLE 20x” made nix feel like as big a deal as docker was when it came into the world, or even kubernetes.

As many have said before me, you really have to see it for yourself to grok it.

What makes nix a big deal is that it’s a purely declarative means of system configuration - like opentf, but without the messy global state file. How is this different from chef, puppet, ansible? The problem with these classic config management tools is that they do not actually guarantee an end result. If I apply role A to a node with ansible, then remove role A from our yaml and apply role B, it will make a mess of our system. Whatever role A set up will still be there, and we are left in an incorrect state. If I perform a similar sequence of actions with nixos, the end state will correctly resemble the config.

This may not seem like a huge deal, because how often are we outright changing the role of a host? - but this problem shows itself in other ways. What if we have a loop in ansible that creates a home directory and sets up a number of tools/database access/etc for some arbitrary list of users? If we remove a user from that list, our naive ansible script is probably not going to actively remove them from everything we had provisioned. Yes, we can carefully craft some anti-playbook or a separate cleanup job altogether, but workarounds like this will quickly get you into the weeds with what is really meant to be done in ansible. You’ll likely find yourself doing things like running commands with shell:, looping through shell.stdout lines, ugh.
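To make the user-list problem concrete, here is a toy sketch in python. It is not how ansible or nix work internally; it only models a host as a set of provisioned user names, and the function names and the users alice and bob are made up for illustration.

```python
from typing import Iterable, Set


def apply_additive(host: Set[str], wanted: Iterable[str]) -> Set[str]:
    """Mimic a naive provisioning loop: it only ever creates things."""
    result = set(host)
    for user in wanted:
        result.add(user)  # nothing in this loop ever removes a user
    return result


def apply_declarative(host: Set[str], wanted: Iterable[str]) -> Set[str]:
    """Mimic a declarative apply: the result depends only on the config."""
    return set(wanted)  # whatever the host had before is ignored


# First run provisions alice and bob; second run has bob dropped from the list.
additive_host = apply_additive(set(), ["alice", "bob"])
additive_host = apply_additive(additive_host, ["alice"])

declarative_host = apply_declarative(set(), ["alice", "bob"])
declarative_host = apply_declarative(declarative_host, ["alice"])

print("additive:   ", sorted(additive_host))
print("declarative:", sorted(declarative_host))
```

In the additive version bob is still on the host after the second run, which is the leftover that the anti-playbook would have to clean up. In the declarative version the old state never enters the calculation, so there is nothing to clean up.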

Here are some other scenarios that won’t be correctly handled with classic config management tools (without extra work):

  • pivot all the infrastructure from telegraf to prometheus.
  • remove an ansible.posix.firewalld statement, with the expectation that the rule will be removed.
  • remove an ansible postgresql_user statement, with the expectation the user will be dropped.
  • deploy config repo rev 2b2b2b2, change mind, git reset HEAD~ and deploy rev 1a1a1a1.
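The last scenario is really about history: does the host end up the same no matter which revisions were deployed before? A rough python sketch of that property, assuming each revision is nothing more than a set of open firewall ports (the port numbers are hypothetical, and the real mechanics of a nixos deploy are not modeled here):

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical contents for the two revisions named above.
REVISIONS: Dict[str, FrozenSet[int]] = {
    "1a1a1a1": frozenset({22, 443}),
    "2b2b2b2": frozenset({22, 443, 8086}),
}


def deploy_additive(history: List[str]) -> Set[int]:
    """Each deploy only adds rules, so earlier deploys leave residue."""
    open_ports: Set[int] = set()
    for rev in history:
        open_ports |= REVISIONS[rev]
    return open_ports


def deploy_declarative(history: List[str]) -> Set[int]:
    """Each deploy replaces the state, so only the last revision matters."""
    open_ports: Set[int] = set()
    for rev in history:
        open_ports = set(REVISIONS[rev])
    return open_ports


rolled_back = ["2b2b2b2", "1a1a1a1"]  # deploy, change mind, deploy the old rev
fresh = ["1a1a1a1"]                   # a host that only ever saw the old rev

additive_matches = deploy_additive(rolled_back) == deploy_additive(fresh)
declarative_matches = deploy_declarative(rolled_back) == deploy_declarative(fresh)

print("additive rollback matches fresh host:   ", additive_matches)
print("declarative rollback matches fresh host:", declarative_matches)
```

With the additive model, the rolled-back host keeps the extra port from rev 2b2b2b2 and no longer matches a host that was only ever given rev 1a1a1a1. With the declarative model the two are identical, which is the behavior I want out of a rollback.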

I’m coming at this from the perspective of someone who is interested in running hundreds of physical container, vm, hpc, or storage hosts.

In my career, I’ve seen this done in roughly three ways:

  • kickstart/preseed installs, careful (chef,puppet,ansible) config management; maintaining the expectation of the occasional ‘nuke from orbit’ when nodes get too far off the rails.
  • pxe boot everything - this is attractive, but your pxe infrastructure becomes system-critical. High dependence on L2 network technologies, which are hard to ‘HA’. High dependence on network storage, which we still need to manage.
  • brute-force immutable vmdk, qcow disk image builds with packer, cloudinit, etc - which never cross an OS upgrade. Pretty much the docker build model at the VM layer. Not efficient. Only really an option on virtual infrastructure, which, again, we still need to manage somehow.

NixOS imo isn’t a replacement for kubernetes or containers in general, but may give us the most sustainable and ‘correct’ path to deploying and maintaining the baseos for kubernetes hosts.

I’ve only yet taken the first step in that direction - an auto-installer, forked from kevincox of github and gitlab.

NixOS makes building our own isos easier than any other distro I’ve seen - we can even skip the iso and build disk images with our configuration.nix already embedded.

Nathan Hensel

on caving, mountaineering, networking, computing, electronics


2023-09-10