~$ dobryops.com

Running a fully declarative homelab, four years in

Every machine in my house rebuilds from a git repo. Power loss, dead SSDs, a kid pulling cables - I run one command and twenty minutes later it's back. Here's how, why, and what I'd still do differently.

I’ve had some version of a homelab since 2021, and four rewrites later it’s finally something I can recommend without asterisks. The current shape is fifteen self-hosted services on three small machines, all running NixOS, with workloads scheduled by k3s, secrets in SOPS, and the whole setup described by one git repo I push to from my laptop.

What follows is the architecture, the pieces I think are worth stealing, and the bits that I would not recommend even to people I like.

Why declarative

The shortest version: I have lost a homelab to a single failed SSD, and I will not do it again. The longer version is that running imperative servers at home is mostly fine until the day you need to remember why a thing works, and the only living memory of that decision is in your fingers.

Declarative configuration moves that memory into the repo. The cost is up-front pain. The payoff is that future-you gets to treat machines like cattle without needing the team, the budget, or the SRE on-call rotation that usually comes with that.

// rule of thumb
If a config change isn’t in git, it didn’t happen. I run a hook that screams when /etc drifts.

The shape of it

Three nodes - one beefy NUC, two cheap mini-PCs - all running NixOS with a flake at the root of the repo. A k3s control plane on the NUC, agents on the others. Everything else is a NixOS module, a Helm release, or a custom resource.

// flake.nix (abridged)
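{
  outputs = inputs: {
    # mkHost is a local helper (definition elided) that returns a NixOS
    # system for the named host; `role` decides what k3s does on it.
    nixosConfigurations = {
      core   = mkHost "core"   { role = "control-plane"; };
      edge-1 = mkHost "edge-1" { role = "agent"; };
      edge-2 = mkHost "edge-2" { role = "agent"; };
    };
    # Abridged: the flake schema expects this under apps.<system>.deploy,
    # as an attrset of the form { type = "app"; program = ...; }.
    apps.deploy = ./scripts/deploy.nix;
  };
}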
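
mkHost isn’t shown above. A minimal sketch of what such a helper could look like, assuming x86_64 nodes, a nixpkgs flake input, and one directory per host under hosts/. The file names, paths, and system string are illustrative, not lifted from the repo:

```nix
# lib/mk-host.nix (hypothetical file; names and paths are illustrative)
# Assumed usage in flake.nix: mkHost = import ./lib/mk-host.nix inputs;
inputs: name: { role }:
inputs.nixpkgs.lib.nixosSystem {
  system = "x86_64-linux";                  # assumption: all nodes are x86_64
  specialArgs = { inherit inputs role; };   # lets modules branch on the role
  modules = [
    (../hosts + "/${name}")                 # per-host hardware and disk config
    ../modules/k3s.nix                      # shared module that reads `role`
    { networking.hostName = name; }
  ];
}
```

Passing role through specialArgs keeps the per-host entries in flake.nix down to one line each, which is the property the abridged flake shows off.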

I deploy with nix run .#deploy. It evaluates the flake, builds new system closures, ships them to each host, and activates them. If any host fails to switch, the previous generation is still bootable. This is the single feature that turned NixOS from “fun” into “trustworthy.”
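
The deploy app itself can be small. A sketch under assumptions: it leans on nixos-rebuild’s --flake and --target-host flags, root SSH access to every node, and host names that resolve on the LAN. The real script may well do more (ordering, health checks), so treat this as an outline rather than the repo’s implementation:

```nix
# Hypothetical deploy app. It would slot into `outputs` as
# apps.x86_64-linux.deploy = import ./scripts/deploy.nix { inherit pkgs; };
{ pkgs }:
let
  deploy = pkgs.writeShellApplication {
    name = "deploy";
    runtimeInputs = [ pkgs.nixos-rebuild pkgs.openssh ];
    text = ''
      # Agents first, control plane last (an arbitrary choice for this sketch).
      for host in edge-1 edge-2 core; do
        echo "==> $host"
        # Builds locally, copies the closure over SSH, then activates it.
        nixos-rebuild switch --flake ".#$host" --target-host "root@$host"
      done
    '';
  };
in {
  type = "app";
  program = "${deploy}/bin/deploy";
}
```

writeShellApplication runs the script with errexit set, so a host that fails to switch stops the loop instead of being silently skipped.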

Secrets without panic

Anything sensitive lives in SOPS-encrypted files inside the repo, using age keys. Each host has its own key, the laptops have their own keys, and the master key lives in Bitwarden where past-me has already trusted it.
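
On the NixOS side, the wiring might look like this. It assumes the sops-nix module does the decrypting at activation time (one common way to pair SOPS with NixOS), and the file name, key path, and secret name are made up for illustration:

```nix
# Hypothetical host module; assumes sops-nix is imported by the flake.
{ ... }:
{
  sops = {
    # Encrypted with SOPS, so it is safe to commit alongside the config.
    defaultSopsFile = ../secrets/core.yaml;
    # This host's own age private key, provisioned once out of band.
    age.keyFile = "/var/lib/sops-nix/key.txt";
    # Illustrative secret; sops-nix decrypts it under /run/secrets/.
    secrets."postgres/password" = {
      owner = "postgres";
    };
  };
}
```

Services then point at the decrypted file path rather than embedding the value, so nothing sensitive ends up in the world-readable Nix store.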

The thing I want to talk people into is the recovery path. If my laptop goes in the river:

  1. Buy a new laptop.
  2. Pull the master key from Bitwarden.
  3. Generate a new age key, add it to .sops.yaml, run sops updatekeys on the secrets (the master key does the decrypting), push.
  4. Run the deploy. Done.

No re-encrypting from memory, no “wait, was that DB password the one with the dollar sign or the underscore.”
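
After a new public key lands in .sops.yaml, each encrypted file still has to be re-wrapped for the new recipient list, which is what sops updatekeys does. A sketch of a small helper app for that, assuming the secrets live under secrets/*.yaml and that SOPS_AGE_KEY_FILE points at the master key pulled from Bitwarden (both are assumptions, as is the app name):

```nix
# Hypothetical `rekey` app; the secrets/ layout is illustrative.
{ pkgs }:
let
  rekey = pkgs.writeShellApplication {
    name = "rekey";
    runtimeInputs = [ pkgs.sops ];
    text = ''
      # Re-encrypt each file's data key for the recipients in .sops.yaml.
      # Needs one existing private key, e.g. via SOPS_AGE_KEY_FILE.
      for f in secrets/*.yaml; do
        sops updatekeys --yes "$f"
      done
    '';
  };
in {
  type = "app";
  program = "${rekey}/bin/rekey";
}
```

The secret values themselves never change in this step; only the list of keys allowed to unlock them does.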

k3s, but only just

A homelab does not need Kubernetes. I will say that again because I mean it: a homelab does not need Kubernetes. I run k3s because it’s the same primitives I use professionally, and dogfooding is cheaper than reading docs. If your homelab is for watching films, run Docker Compose and go outside.

With that out of the way: k3s is great. Single-binary install, built-in storage (local-path-provisioner), sane defaults. I disabled the bundled Traefik in favour of my own ingress and pulled in cert-manager for ACME certs. The rest is just Helm releases pinned in the flake.
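
A sketch of how the k3s side could be expressed as a NixOS module. It assumes the role string from flake.nix reaches modules as a module argument (for example via specialArgs), that the control plane is reachable as core, and that the join token sits at an illustrative path:

```nix
# modules/k3s.nix (hypothetical); `role` is assumed to be a module argument.
{ lib, role, ... }:
let
  isServer = role == "control-plane";
in {
  services.k3s = {
    enable = true;
    role = if isServer then "server" else "agent";
    # Join token shared by all nodes; the path is illustrative.
    tokenFile = "/run/secrets/k3s/token";
    # Agents need to know where the control plane lives.
    serverAddr = lib.mkIf (!isServer) "https://core:6443";
    # Drop the bundled Traefik so another ingress can take its place.
    extraFlags = lib.mkIf isServer "--disable=traefik";
  };

  # The Kubernetes API port only needs opening on the control plane.
  networking.firewall.allowedTCPPorts = lib.mkIf isServer [ 6443 ];
}
```

One module serving both roles means adding a fourth node is a single new line in flake.nix.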

Things that broke (so you don’t have to break them)

  • Storage class drift. Local-path-provisioner is fine until a pod gets rescheduled and you find out its volume stayed behind on the old node. Pin replicas, label nodes, or use Longhorn. I learned this with my Postgres volume. Once.
  • SOPS + git hooks + impatience. If you forget to re-encrypt after rotating a key, your deploy will succeed and your services will fail to start with a wonderful, unhelpful error. Add a pre-push hook that checks every secret is encrypted for the current set of keys.
  • NixOS upgrades on cheap hardware. Cross-compiling for ARM on a NUC is fine; building the world on a tiny ARM board is not. Set up a remote builder before you need it.
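
For the last point, the remote builder can be declared in NixOS too. A sketch assuming the NUC (core) builds on behalf of a small ARM board and is reachable over SSH as an existing user named builder; the key path and job count are placeholders. Note that this uses QEMU binfmt emulation on the NUC, which is simpler to set up than true cross-compilation but slower:

```nix
# Two hypothetical modules: one for the builder, one for the machine that delegates.
{
  # On the NUC: allow it to run aarch64 builds through emulation.
  builder = { ... }: {
    boot.binfmt.emulatedSystems = [ "aarch64-linux" ];
    nix.settings.trusted-users = [ "builder" ];   # assumed pre-existing user
  };

  # On the small board: send builds to the NUC instead of building locally.
  client = { ... }: {
    nix.distributedBuilds = true;
    nix.buildMachines = [{
      hostName = "core";
      systems = [ "x86_64-linux" "aarch64-linux" ];
      sshUser = "builder";
      sshKey = "/root/.ssh/id_builder";           # placeholder path
      maxJobs = 4;                                # placeholder value
      supportedFeatures = [ "big-parallel" ];
    }];
    # Let the builder fetch from binary caches itself.
    nix.settings.builders-use-substitutes = true;
  };
}
```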

What I’d do again

  • Flake-only. configuration.nix is a sweet trap.
  • One repo for all hosts. Multi-repo seemed clean for a week.
  • Pangolin tunnels for any service I want from outside - no exposed ports, no VPN to onboard guests onto.
  • Prometheus + Grafana even at home scale. Asking “why is the lab slow” and having actual data is its own reward.

What I’d do differently

  • Skip k3s for the first version. NixOS modules + systemd would have got me to v1 a month sooner.
  • Write the disaster-recovery doc on day one, not after the first disaster.
  • Buy ECC RAM. Just do it.
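
For a sense of what “NixOS modules + systemd” means in practice, here is a sketch of one service with no cluster involved. Jellyfin, Caddy, and the hostname are stand-ins chosen for illustration, not services named anywhere in this post:

```nix
# Hypothetical single-host service: media server behind a reverse proxy.
{ ... }:
{
  # Runs as an ordinary systemd unit managed by NixOS.
  services.jellyfin.enable = true;

  services.caddy = {
    enable = true;
    # Placeholder hostname; 8096 is Jellyfin's default HTTP port.
    virtualHosts."media.home.example".extraConfig = ''
      reverse_proxy 127.0.0.1:8096
    '';
  };

  networking.firewall.allowedTCPPorts = [ 80 443 ];
}
```

That is the whole deployment: no Helm release, no ingress controller, and rollbacks still come for free from NixOS generations.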

If you want the actual repo, it’s at github.com/bovf/homelab-overkill. Forks and questions both welcome - preferably in that order.

filed under #nix #homelab #gitops