Adopting NixOS for my RKE1 Kubernetes nodes

For those not aware, Nix is a declarative package manager, and NixOS is a Linux distribution built on it that provides a declarative environment definition and atomic upgrades. Declarative means that instead of running apt-get install docker, you write down everything you want, and the system installs it and removes anything you didn’t declare. You can use the same language to manage packages, users, firewall rules, networking, and more. This is useful because you can revision control your OS state in Git and keep exact replicas across multiple hosts.
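
To make that concrete, here’s a minimal sketch of what a declarative configuration.nix fragment can look like (the user name and package choices are illustrative, not from this post):

{ pkgs, ... }:
{
  virtualisation.docker.enable = true;             # instead of apt-get install docker
  users.users.adam.isNormalUser = true;            # declare a user
  networking.firewall.allowedTCPPorts = [ 22 ];    # declare firewall state
  environment.systemPackages = with pkgs; [ git ]; # declare installed packages
}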

My friend, dade, and I have been diving into Nix and NixOS. He got it working on his laptop, and I’m trying to make it the OS for my four dedicated servers, all running Kubernetes. In this post, I’ll walk through the main issues I encountered and how I got a single node running in an existing RKE1 cluster.

I’m not going all the way and using Nix to configure everything, including my Kubernetes setup. I know that’s possible, but I already have a Kubernetes cluster deployed using RKE1 that I’m not ready to break yet, since it hosts this blog and other services. Maybe in a future iteration I will.

Booting

First, I needed to make the disk bootable via EFI:

{ config, lib, pkgs, ... }:
{
  # Use the systemd-boot EFI boot loader.
  boot.loader.systemd-boot.enable = true;

  boot.loader.efi.canTouchEfiVariables = true;
}
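
This config assumes the machine boots via UEFI. For a legacy BIOS machine, a hedged alternative would be GRUB installed to the MBR (the disk path below is an assumption, so adjust it to your boot disk; the EF02 partition in the disko config below exists for exactly this case):

{ config, lib, pkgs, ... }:
{
  # Hypothetical legacy-BIOS setup: install GRUB into the disk's MBR
  # instead of using systemd-boot with EFI.
  boot.loader.grub.enable = true;
  boot.loader.grub.device = "/dev/sda"; # assumption: your boot disk
}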

Disk Partitioning

The NixOS minimal live CD does not provide a partitioning tool beyond fdisk, and you’re responsible for figuring out partition types yourself. Luckily, the Nix community has disko, which uses the same Nix language to declaratively define partitions, so I could borrow the config from somebody else who had already figured it out. Nice.

I had two drives, one SSD and one spinning hard drive, and followed the disko guide. My configuration looked like this:

{ config, lib, pkgs, disko, ... }:
{
  disko.devices = {
    disk = {
      ssd = {
        # When using disko-install, we will overwrite this value from the commandline
        device = "/dev/disk/by-id/5dbd3913-bc15-4e14-9902-de9c1703eacd";
        type = "disk";
        content = {
          type = "gpt";
          partitions = {
            MBR = {
              type = "EF02"; # for grub MBR
              size = "1M";
              priority = 1; # Needs to be first partition
            };
            ESP = {
              type = "EF00";
              size = "500M";
              content = {
                type = "filesystem";
                format = "vfat";
                mountpoint = "/boot";
              };
            };
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "btrfs";
                mountpoint = "/";
              };
            };
          };
        };
      };
      hdd = {
        # When using disko-install, we will overwrite this value from the commandline
        device = "/dev/disk/by-id/eae13bd2-c505-46b2-9066-5aa1028f10d7";
        type = "disk";
        content = {
          type = "gpt";
          partitions = {
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "btrfs";
                mountpoint = "/mnt/hdd";
              };
            };
          };
        };
      };
    };
  };
}

Writing to the disk was easy (make sure sdb and sda point at the right disks):

nix --extra-experimental-features flakes --extra-experimental-features nix-command run 'github:nix-community/disko#disko-install' -- --flake '/tmp/config/etc/nixos#srv5' --write-efi-boot-entries --disk ssd /dev/sdb --disk hdd /dev/sda

Making Vim sane

NixOS’s default vimrc is weird for me, but it’s easy to fix:

environment.systemPackages = with pkgs; [
  ((vim_configurable.override { }).customize {
    name = "vim";
    vimrcConfig.customRC = ''
      set mouse=""
      set backspace=indent,eol,start
      syntax on
    '';
  })
];

Moving on to K8s

I’m currently using RKE1, which involves running ./rke up to provision a new node, rather than NixOS’s own Kubernetes support, because my cluster is a few years old now. Adding a new node looks like this:

 # cluster.yml
 nodes:
+  - address: 1.2.3.4
+    user: adam
+    role:
+      - controlplane
+      - etcd
+      - worker

Then run the following (--ignore-docker-version is needed because NixOS ships Docker 27.x, but RKE1 only supports up to 26.x):

./rke_linux-amd64 up --ignore-docker-version

And wait for it to provision.

Resource requests (The dumb issue)

The node joined the cluster, but my next problem was that no pods could start because networking was down.

The kubelet logs (docker logs kubelet) and the pod scheduling events showed:

kubelet.go:2862] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"

0/4 nodes are available: 1 Insufficient cpu, 3 node is filtered out by the prefilter result. preemption: 0/4 nodes are available: 1 Insufficient cpu, 3 Preemption is not helpful for scheduling..

I was stumped. How come there wasn’t enough CPU? Was it because Calico couldn’t spin up, so kubelet was reporting unhealthy because there was no CNI, thus no CPU? A circular dependency?

Nah, let’s get the dumb mistakes out of the way first: it turns out I had only provisioned one CPU for this test VM, since it was just for experimenting, not production use.

And I had reserved 1 core for the host OS:

# rke cluster.yaml
services:
  kubelet:
    extra_args:
      kube-reserved: cpu=1000m,memory=512Mi

So there was nothing left over for the Kubernetes pods themselves. Whoops. I gave it more CPUs.

Inter-node traffic

The next problem was diagnosing why the pods weren’t able to communicate with the rest of the cluster. I’m currently using Canal (Calico plus Flannel) for networking, which uses Flannel to create an encapsulated overlay network between the different nodes in the cluster. The pods weren’t able to communicate, and I needed to figure out why, so I installed tcpdump, a packet capture program. I added the following, then ran nixos-rebuild switch:

environment.systemPackages = with pkgs; [
  tcpdump
];

On a different node, I pinged a pod running on my test node to identify which ports were needed:

ping 10.42.4.72

PING 10.42.4.72 (10.42.4.72) 56(84) bytes of data.

Gotcha:

tcpdump -f 'host 51.81.64.31 or dst 149.56.22.10'

05:04:55.054420 ens33 Out IP srv8.7946 > srv6.technowizardry.net.7946: UDP, length 96
05:04:55.068243 ens33 In  IP srv6.technowizardry.net.7946 > srv8.7946: UDP, length 49

Opening up the firewall is easy enough in Nix. Ideally, I’d lock this down to just the other nodes in my cluster, but I didn’t know how to do that in NixOS yet (there’s a sketch of one possible approach after the config below).

{ config, lib, pkgs, ... }:
{
  networking.firewall.allowedTCPPorts = [
    7946 # flannel
    8285 # flannel
    8472 # flannel

    # Kubernetes
    2379 # etcd-client
    2380 # etcd-cluster
    6443 # kube-apiserver

    # Prometheus metrics
    10250
    10254
  ];
  networking.firewall.allowedUDPPorts = [
    7946 # flannel
    8472 # flannel
  ];
}
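
For later reference, one way to lock these ports down might be NixOS’s networking.firewall.extraCommands option, which accepts raw iptables rules. A hedged sketch, reusing the two peer addresses from the tcpdump below as stand-ins for the other cluster nodes (the restricted port would also need to be removed from allowedTCPPorts for this to have any effect):

networking.firewall.extraCommands = ''
  # Sketch: only accept kube-apiserver traffic from known peer nodes.
  # nixos-fw and nixos-fw-accept are the chains the NixOS firewall
  # module creates; the source IPs are examples, not a tested setup.
  iptables -A nixos-fw -p tcp -s 51.81.64.31 --dport 6443 -j nixos-fw-accept
  iptables -A nixos-fw -p tcp -s 149.56.22.10 --dport 6443 -j nixos-fw-accept
'';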

Another nixos-rebuild switch, and the pings were flowing between pods.

(it’s always) DNS

All my servers run PowerDNS (via a DaemonSet) as an authoritative DNS server for my various domain names. However, on my new node, it wasn’t able to start up:

Guardian is launching an instance
Unable to bind UDP socket to '0.0.0.0:53': Address already in use
Fatal error: Unable to bind to UDP socket

PowerDNS binds to ports 53/tcp and 53/udp on 0.0.0.0, but systemd-resolved is already listening on that port at 127.0.0.53 and 127.0.0.54. Even though resolved only listens on loopback, PowerDNS still conflicts because it’s trying to bind to 0.0.0.0.

lsof -iUDP -P -n
COMMAND      PID            USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
systemd-r    692 systemd-resolve   11u  IPv4   14287      0t0  UDP *:5355
systemd-r    692 systemd-resolve   13u  IPv6   14290      0t0  UDP *:5355
systemd-r    692 systemd-resolve   15u  IPv4   14292      0t0  UDP *:5353
systemd-r    692 systemd-resolve   16u  IPv6   14293      0t0  UDP *:5353
systemd-r    692 systemd-resolve   20u  IPv4   14297      0t0  UDP 127.0.0.53:53
systemd-r    692 systemd-resolve   22u  IPv4   14299      0t0  UDP 127.0.0.54:53

To fix this, I disabled the local resolver and pointed the server at an external resolver:

{ config, lib, pkgs, ... }:
{
  networking.nameservers = [ "1.1.1.1" ];

  services.resolved.enable = false;
}
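
A hedged alternative, if you’d rather keep systemd-resolved running for the host itself: resolved.conf supports DNSStubListener=no, which stops it from claiming port 53 while leaving the resolver in place:

{ config, lib, pkgs, ... }:
{
  # Keep systemd-resolved, but free up port 53 for PowerDNS.
  services.resolved.enable = true;
  services.resolved.extraConfig = ''
    DNSStubListener=no
  '';
}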

Longhorn

I use Longhorn as my Kubernetes CSI (storage provider) because it supports mirroring across multiple hosts and automatic backups. Under the hood, it uses NFS to share a volume between the Longhorn storage pod and the host running the application pod. To do this, it requires a few packages on the host itself (nfs-utils) and the iscsid system daemon.

When I tried to start Longhorn, I got the following error messages because the NFS utilities weren’t installed:

MountVolume.MountDevice failed for volume "pvc-8416dbbf-89cf-45c0-bdc8-8a2b55a4e58a" :
rpc error: code = Internal desc = mount failed: exit status 32 Mounting command:
/usr/local/sbin/nsmounter Mounting arguments:
mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,softerr 10.43.149.52:/pvc-8416dbbf-89cf-45c0-bdc8-8a2b55a4e58a /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/6d6bbeebe575c0ca8e25eb35c0248aca3e81a606bc647be0c2d9901b6b4bd9b3/globalmount Output:
mount: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/6d6bbeebe575c0ca8e25eb35c0248aca3e81a606bc647be0c2d9901b6b4bd9b3/globalmount: 
bad option; for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program. dmesg(1) may have more information after failed mount system call.

As per the Longhorn docs, we need to install a few packages:

environment.systemPackages = [ pkgs.nfs-utils pkgs.openiscsi ];

However, NixOS doesn’t place the files where Longhorn expects them (GitHub issue). Longhorn expects to find mount.nfs on the PATH, but in NixOS it actually lives at /nix/store/2l9hiinf01ikdjjxd1lafb9mqs5ssfp5-nfs-utils-2.7.1/bin/mount.nfs (the path may differ on your host). When Longhorn runs nsenter --mount=/host/proc/14084/ns/mnt --net=/host/proc/14084/ns/net mount.nfs from its container to break out into the host namespaces and mount the NFS volume, it searches a PATH of /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin, doesn’t find the binary, and fails.

We need to trick Longhorn into looking in the right place by overriding the PATH environment variable; when nsenter runs, it uses the container’s PATH to invoke the command on the host OS. I like using Kyverno for this, a Kubernetes policy engine that can validate and mutate resources. I’d used it in the past and already had it installed on my cluster, so this next part worked well.

Inspired by this comment, I created a simple Kyverno policy that mutates every pod in the longhorn-system namespace to include the custom PATH:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: longhorn-add-nixos-path
  annotations:
    policies.kyverno.io/title: Add Environment Variables from ConfigMap
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/category: Other
    policies.kyverno.io/description: >-
      Longhorn invokes executables on the host system, and needs
      to be aware of the host systems PATH. This modifies all
      deployments such that the PATH is explicitly set to support
      NixOS based systems.      
spec:
  rules:
    - name: add-env-vars
      match:
        resources:
          kinds:
            - Pod
          namespaces:
            - longhorn-system
      mutate:
        patchStrategicMerge:
          spec:
            initContainers:
              - (name): "*"
                env:
                  - name: PATH
                    value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/run/wrappers/bin:/nix/var/nix/profiles/default/bin:/run/current-system/sw/bin
            containers:
              - (name): "*"
                env:
                  - name: PATH
                    value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/run/wrappers/bin:/nix/var/nix/profiles/default/bin:/run/current-system/sw/bin

After that, I redeployed the Longhorn pods and the containers were able to start up without crashing.

iSCSI

However, trying to mount a volume on the host would fail:

failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/14084/ns/mnt --net=/host/proc/14084/ns/net iscsiadm -m discovery -t sendtargets -p 10.42.0.20], output , stderr
  iscsiadm: can't open iscsid.startup configuration file etc/iscsi/iscsid.conf\n
  iscsiadm: iscsid is not running. Could not start it up automatically using the startup command in the iscsid.conf iscsid.startup setting. Please check that the file exists or that your init scripts have started iscsid.
  iscsiadm: can not connect to iSCSI daemon (111)!
  iscsiadm: can't open iscsid.startup configuration file etc/iscsi/iscsid.conf
  iscsiadm: iscsid is not running. Could not start it up automatically using the startup command in the iscsid.conf iscsid.startup setting. Please check that the file exists or that your init scripts have started iscsid.
  iscsiadm: can not connect to iSCSI daemon (111)!
  iscsiadm: Cannot perform discovery. Initiatorname required.
  iscsiadm: Could not perform SendTargets discovery: could not connect to iscsid
  exit status 20

Nodes cleaned up for iqn.2019-10.io.longhorn:unifi-restore" func="iscsidev.(*Device).StartInitator" file="iscsi.go:168

msg="Failed to startup frontend" func="controller.(*Controller).startFrontend" file="control.go:489" error="failed to execute:
  /usr/bin/nsenter [nsenter --mount=/host/proc/14084/ns/mnt --net=/host/proc/14084/ns/net iscsiadm -m node -T iqn.2019-10.io.longhorn:unifi-restore -o update -n node.session.err_timeo.abort_timeout -v 15], output , stderr
  iscsiadm: No records found
  exit status 21"
  ...

Longhorn can’t find the iscsid service provided by the iscsid.socket and iscsid.service systemd units. Just installing the package doesn’t enable the service; for that, we need to add:

 { config, lib, pkgs, ... }:
 {
     environment.systemPackages = with pkgs; [
         pkgs.nfs-utils
         pkgs.openiscsi
     ];

+    services.openiscsi = {
+       enable = true;
+       name = "iqn.2005-10.nixos:${config.networking.hostName}";
+   };
 }

Then redeploy the application pod.

I forgot the Firewall

At this point everything was deploying, but I couldn’t access anything running on this host. Normally Docker and Kubernetes hijack the iptables rules for you, but the NixOS firewall still filters inbound traffic, so the service ports have to be opened explicitly. Easy fix:

networking.firewall.allowedTCPPorts = [
  53
  # nginx
  80
  443
  # mail
  25
  110
  143
  465
  587
  993
  995

  7946 # flannel
  8285 # flannel
  8472 # flannel

  # Rancher UI
  8443 # I don't know if this is actually used
  32121

  # TODO: Lock these down to intra-node only
  2379 # etcd-client
  2380 # etcd-cluster
  6443 # kube-apiserver
  10250
  10254
  9100 # prom-node-export
];
networking.firewall.allowedUDPPorts = [
  53

  7946 # flannel
  8472 # flannel
];

Missing kubelet logs

Next, I had an issue where kubectl logs would not return logs for any pod:

$ kubectl --context=local -n cattle-system logs helm-operation-hpq86
Defaulted container "helm" out of: helm, proxy, init-kubeconfig-volume (init)
failed to try resolving symlinks in path "/var/log/pods/cattle-system_helm-operation-hpq86_f63d99d0-ad39-478f-be8f-bdc1f4657d1d/helm/0.log": lstat /var/log/pods/cattle-system_helm-operation-hpq86_f63d99d0-ad39-478f-be8f-bdc1f4657d1d/helm/0.log: no such file or directory

Looking at the host, no files were present in /var/log/pods and nothing was in /var/lib/docker/containers/{container}/*-json.log. The default Docker logDriver in NixOS is journald. I tried switching it to local, which didn’t work, but this StackOverflow post suggests it needs to be set to json-file.

Easy fix. I updated the logDriver, and a nixos-rebuild switch to restart the Docker daemon fixed my logs:

virtualisation.docker = {
  enable = true;
  logDriver = "json-file";
};

Conclusion

Thus far, I feel that Nix is powerful, and I like having declarative config that can manage everything about the OS. I can define a series of configuration modules, define per-host flakes for my servers and computers, and pick and choose which machines get which config, as in the sketch below.
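
As a rough sketch of what that per-host layout can look like (the module paths and the second hostname are assumptions; srv5 comes from the disko-install command earlier):

{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }: {
    nixosConfigurations = {
      # Shared modules plus one file of per-host overrides each.
      srv5 = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./common.nix ./hosts/srv5.nix ];
      };
      srv6 = nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";
        modules = [ ./common.nix ./hosts/srv6.nix ];
      };
    };
  };
}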

But I don’t find the language itself very intuitive. I tried to create my own derivation, but ran into issues. I hope it gets better over time, and I’ll keep playing with it.
