Native Routing on Kubernetes with BGP

Arda Xi

Ever since I first started working with Kubernetes nearly a decade ago now, I’ve been preaching that for self-managed clusters it’s much simpler to just let the kernel and network handle routing. Yet pretty much everywhere people seem to stick to the default tunnel setup. As I recently had cause to look into Cilium’s BGP support, I decided to take that opportunity to also prove that you can do native networking like that. It wasn’t necessarily obvious, but it also wasn’t terribly difficult. I’ll explain why I did this, my test setup, and then the actual configuration.

Background

First a very basic primer on Kubernetes. This will be overly simplified since I’m only focusing on the parts that are relevant to networking. In other words, this is almost completely wrong, but it should be useful enough.

Kubernetes is a container orchestration system. In other words, a way to distribute multiple instances of multiple workloads over multiple machines. The primary resources we care about are Nodes and Pods. A node is a Linux host that is part of a cluster, and it’s what the workloads run on. A pod is a collection of one or more workloads that share a set of namespaces. For our purposes, we’ll just focus on the network namespace the pod runs in.

The namespace is (usually) completely isolated from the host system. To allow pods to talk to each other and the broader network, a Container Network Interface (CNI) plugin is used. In self-managed setups this is often Flannel or (more recently) Cilium. The CNI plugin generally works by defining a private network (the PodCIDR) per node. When a new pod is created, the plugin creates a virtual Ethernet (veth) pair: one end stays in the host’s network namespace, the other is placed inside the pod’s and gets an IP address from the node’s PodCIDR.
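Purely as an illustration (the namespace name, interface names and the 10.244.0.0/24 PodCIDR are made up, and real plugins differ in the details), the plumbing for a single pod looks roughly like this:

# Create the pod's network namespace
ip netns add pod1

# Create a veth pair and move one end into the namespace, renamed to eth0
ip link add veth-pod1 type veth peer name veth-pod1-p
ip link set veth-pod1-p netns pod1
ip -n pod1 link set veth-pod1-p name eth0

# Pod end: an address out of this node's PodCIDR, default route via the host end
ip -n pod1 addr add 10.244.0.5/24 dev eth0
ip -n pod1 link set eth0 up
ip -n pod1 route add default via 10.244.0.1

# Host end: acts as the pod's gateway
ip addr add 10.244.0.1/24 dev veth-pod1
ip link set veth-pod1 up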

The other main thing the CNI plugin is responsible for is ensuring that pods can talk to each other even if they’re on different nodes. For simplicity, this is usually done with an overlay network (technically the same thing as a VPN, though usually not encrypted), often based on VXLAN. That overlay is the part we’re concerned with here.
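To get a feel for what the overlay consists of: it is essentially a VXLAN interface like the one below on every node, plus forwarding state that maps remote PodCIDRs to the other nodes’ addresses (the name, VNI and port here are illustrative):

# A VXLAN device that wraps pod-to-pod traffic in UDP between nodes
ip link add vxlan42 type vxlan id 42 dstport 8472 dev eth0
ip link set vxlan42 up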

The Problem

While this approach works, it has a few downsides. To start with, it necessarily turns each node into a full NAT router. Any packet sent from a pod to the outside world has to be rewritten to come from the node’s IP address instead, which obfuscates the source. If one workload starts hammering a database outside the cluster, for example, the database will just see connections coming from a node, and you’ll have to work out which pod they were coming from. The same thing applies in the other direction. You can expose workloads in multiple ways, like NodePorts, where each node listens on a port on its host address and forwards requests to one of the pods behind the service. There are also solutions like MetalLB which advertise a virtual IP from the nodes.
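Concretely, the outbound half of this is a masquerade rule along these lines on every node (the PodCIDR is illustrative; depending on configuration Cilium implements this in eBPF rather than iptables, but the effect is the same):

# Rewrite the source of anything leaving the pod network to the node's own address
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE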

The problem with exposing services this way is that, by default, it still requires NAT. The incoming packet is DNATed to one of the pods (possibly on another node) and its source address is rewritten to that of the node that first received it, so the response can be routed back the same way. This means the workload no longer sees the real client IP. You can avoid this by forcing traffic to stay on the local node, essentially disabling the proxy. The downside is that you can no longer do load-balancing. With MetalLB in L2 mode this means that only one node (with usually one pod) will ever receive requests.
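In stock Kubernetes, forcing traffic to stay on the local node is done with the Service’s externalTrafficPolicy: Local setting; a minimal sketch (the name and ports are made up):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # only nodes running a backing pod answer, and the client IP is preserved
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080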

The final downside is complexity. Because there’s an overlay, it is no longer obvious how the network works, and there’s another layer to debug when things go wrong. The CNI is something you don’t really notice or think about until it breaks. Then it makes your life much harder than it needs to be.

Going Native

Most CNI plugins allow you to turn off the overlay network. When you do that, you’re now responsible for ensuring the nodes know how to route to each PodCIDR in the cluster. If all nodes are connected to the same switch (people also call this the same L2 network) this is actually quite simple. Each node just needs to route requests destined for a given PodCIDR to the node that PodCIDR belongs to. Cilium can automatically do this for you with the auto-direct-node-routes option. But what if your nodes are not directly connected, like if they’re in different datacentres? That’s where BGP comes in.
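For reference, that directly-connected case boils down to one route per peer node, which is exactly what auto-direct-node-routes maintains for you. A hand-written equivalent (node addresses and PodCIDRs are illustrative):

# On node-1: send traffic for node-2's PodCIDR straight to node-2
ip route add 10.244.1.0/24 via 10.0.0.118

# On node-2: the mirror image for node-1's PodCIDR
ip route add 10.244.0.0/24 via 10.0.0.117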

Test Setup

To prove this approach works, I set up a testing environment using a server at home with libvirt/QEMU. My aim was to set up two routers with one Kubernetes node connected to each, to replicate two datacentres. I chose OpenWRT for the routers and Talos Linux for the nodes. For the routers, anything that can run BGP should work. For the nodes, anything you can run Kubernetes on will do, though for Cilium you’ll want a kernel with eBPF support. If you want to replicate this setup, here’s how. If you already have a cluster with Cilium, or you’d prefer to set it up yourself, feel free to skip ahead to the BGP section.

OpenWRT

First, create three isolated networks: one for each DC, and one peering network that connects the two routers. Then create your first router. OpenWRT assumes the first interface is the LAN and the second is the WAN, so your first NIC should be on the DC1 network. The second NIC is the WAN; it should let you reach the router from your workstation and give the router access to the internet. As I was hosting the VMs on my server and accessing them from my laptop, I used a macvtap device; if you’re using the same machine for both, you may want a bridge device instead. The third NIC goes on the peering network.
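If you’d rather script the libvirt side than click through virt-manager, something like this creates the three isolated networks (the names are my own choice; leaving out the <forward> element is what makes a libvirt network isolated):

for net in dc1 dc2 peering; do
  cat > "$net.xml" <<EOF
<network>
  <name>$net</name>
  <bridge name="virbr-$net"/>
</network>
EOF
  virsh net-define "$net.xml"
  virsh net-start "$net"
  virsh net-autostart "$net"
done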

To boot it, I went with the fancy way of downloading the kernel and rootfs separately. Go to the downloads for OpenWRT and fetch the kernel and ext4-rootfs image for your target (probably x86-64).

As of writing, you can find the latest release here: OpenWRT 23.05.3 x86-64

Set the VM’s disk to the rootfs image. Enable direct kernel boot, point it at the kernel, and add the kernel args:

root=/dev/vda rootwait
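For reference, here is roughly the same VM as a single virt-install invocation. Treat it as a sketch: the file names assume the 23.05.3 release (gunzip the rootfs image first if it came compressed), and the second NIC uses libvirt’s default network where you would substitute your macvtap or bridge device:

virt-install \
  --name openwrt-dc1 \
  --memory 512 --vcpus 1 \
  --import \
  --disk path=openwrt-23.05.3-x86-64-generic-ext4-rootfs.img,bus=virtio \
  --boot kernel=openwrt-23.05.3-x86-64-generic-kernel.bin,kernel_args="root=/dev/vda rootwait" \
  --network network=dc1,model=virtio \
  --network network=default,model=virtio \
  --network network=peering,model=virtio \
  --osinfo detect=on,require=off \
  --graphics none --console pty,target_type=serial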

Now start it up. Next, you’ll want to configure the firewall and network. Since this is a test setup and I trust the network, for simplicity I just opened everything. So first we change the default wan and lan zones in /etc/config/firewall:

config zone
    option name 'lan'
    list network 'lan'
    option input 'ACCEPT'
    option output 'ACCEPT'
    option forward 'ACCEPT'
    option masq_allow_invalid '1'

config zone
    option name 'wan'
    list network 'wan'
    list network 'wan6'
    option input 'ACCEPT'
    option output 'ACCEPT'
    option forward 'ACCEPT'
    option mtu_fix '1'
    option masq '1'
    option masq_allow_invalid '1'

The masq_allow_invalid option is important, because we will be passing traffic that was routed to us from a node. We’ll also want rules for a new peering zone we’ll create, so add the following:

config zone
    option name 'peering'
    option input 'ACCEPT'
    option output 'ACCEPT'
    option forward 'ACCEPT'
    list network 'peering'
    option masq_allow_invalid '1'

config forwarding
    option src 'peering'
    option dest 'lan'

config forwarding
    option src 'lan'
    option dest 'peering'

This sets up forwarding between the peering and lan zones. That is traffic coming from a node in DC1 going to a node in DC2. We don’t need to set up forwarding to WAN, since each router will be talking to WAN directly.

Next, in /etc/config/network, I changed the lan configuration to use the IP address 10.0.0.1/24 instead of the default 192.168.1.1/24. This is partly to keep things simple and partly because my home network is already in the 192.168.0.0/16 range. You’ll also want to add the peering interface:

config interface 'peering'
    option proto 'static'
    option device 'eth2'
    list ipaddr '172.16.0.1/24'

This is where we’ll connect the routers. In /etc/config/dhcp you’ll want to make one small change: add the following to the config dnsmasq section:

    option localservice '0'

This ensures that pods will also be able to make DNS requests. Without this option, dnsmasq will refuse any requests coming from PodCIDRs.

We also want to do BGP. I chose to use Bird 2. So, install it:

opkg install bird2

And then overwrite /etc/bird.conf. I mostly copied this config over from something I had lying around for dn42. It could probably be simplified, but this works. As an aside, I chose to use one ASN for the routers and one for the cluster. You could use more, but don’t use the same ASN everywhere. If you do that, the routes won’t propagate properly, as Bird assumes that all routers within the same AS already have routes to each other.

define OWNAS  = 65000;
define OWNIP  = 172.16.0.1;
define OTHIP  = 172.16.0.2;
define OWNNET = 10.0.0.0/24;
define KUBEAS = 65001;

router id OWNIP;
protocol device {
  scan time 10;
}
protocol kernel {
  scan time 20;
  ipv4 {
    import none;
    export filter {
      if source = RTS_STATIC then reject;
      krt_prefsrc = OWNIP;
      accept;
    };
  };
}
protocol static {
  route OWNNET reject;
  ipv4 {
    import all;
    export none;
  };
}
protocol bgp ibgp_node {
  local as OWNAS;
  neighbor OTHIP as OWNAS;
  direct;
  ipv4 {
    next hop self;
    import all;
    export all;
  };
}

We’ll add the cluster node as a BGP neighbour later, once it has an IP address. At this point you can shut down this VM and clone it. Change the clone’s first NIC to point to the DC2 network instead of DC1; it should start up fine. At that point, change the lan address in /etc/config/network to 10.0.1.1/24, set the peering address to 172.16.0.2/24, and swap OWNIP and OTHIP in /etc/bird.conf. Then reboot this VM and restart the other.
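On the clone, those changes amount to something like this (the sed patterns assume bird.conf exactly as written above):

# New LAN subnet for DC2, and the other end of the peering link
uci set network.lan.ipaddr='10.0.1.1'
uci delete network.peering.ipaddr
uci add_list network.peering.ipaddr='172.16.0.2/24'
uci commit network

# Swap OWNIP and OTHIP in the bird config
sed -i -e 's/OWNIP  = 172.16.0.1/OWNIP  = 172.16.0.2/' \
       -e 's/OTHIP  = 172.16.0.2/OTHIP  = 172.16.0.1/' /etc/bird.conf

reboot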

Talos Linux

Now we’ll create two more VMs, one control plane node and one worker node. At this point there are a few utilities you’ll want to install on your local system: talosctl, kubectl, and the cilium CLI.

Check your talosctl version and get a matching image from the Talos Image Factory with the qemu-guest-agent extension. Fetch the kernel and initramfs from the bottom of the page, and configure a VM with at least 2GB RAM and an empty disk (at least 10GB), booting from the downloaded kernel and initramfs. Use the following kernel args:

talos.platform=metal console=ttyS0 console=tty0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512

This VM only needs one NIC, the DC1 network. The VM should come up and get an IP address from the OpenWRT router. You can now shut down this VM and clone it. The clone should again have its NIC switched to DC2.

Now you can restart the first Talos node. Think of a name for your cluster (I just called it libvirt) and note the IP address of the node. It should be in the 10.0.0.0/24 range; I happened to get 10.0.0.117. Turn that into an HTTPS URL on port 6443, in my case https://10.0.0.117:6443. This is your cluster-endpoint. We’ll also need to prepare a patch file so the cluster is ready for Cilium and so the nodes install to /dev/vda instead of /dev/sda. Create patch.yaml with the following contents:

cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
machine:
  install:
    disk: /dev/vda

Then, from your workstation, run:

talosctl gen config <cluster-name> <cluster-endpoint> --config-patch @patch.yaml
export TALOSCONFIG=$PWD/talosconfig

The patch already points the installer at /dev/vda rather than /dev/sda, but it’s worth double-checking the generated controlplane.yaml and worker.yaml. Now we can install this node as the control plane:

talosctl apply-config --insecure -n <node-ip> --file controlplane.yaml

The --insecure flag is needed because the node doesn’t yet have the certificates we just generated. The node should now reboot. You may want to run the following to make things easier:

talosctl config node <node-ip>
talosctl config endpoint <node-ip>

When the node is running, you’ll have to bootstrap the cluster.

talosctl bootstrap
talosctl kubeconfig

This also sets up kubectl to reach the cluster. The Talos node will hang on phase 18/19 because we currently have no CNI. So we’ll have to install Cilium:

cilium install \
    --helm-set=ipam.mode=kubernetes \
    --helm-set=kubeProxyReplacement=true \
    --helm-set=securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --helm-set=securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --helm-set=cgroup.autoMount.enabled=false \
    --helm-set=cgroup.hostRoot=/sys/fs/cgroup \
    --helm-set=k8sServiceHost=localhost \
    --helm-set=k8sServicePort=7445 \
    --helm-set=bgpControlPlane.enabled=true \
    --helm-set=routingMode=native \
    --helm-set=enableIPv4Masquerade=false

The first set of options are what’s needed to enable Cilium on Talos. The last three set up BGP and native routing. Now we can add the second node. Boot up the second Talos VM and note its IP. Then run:

talosctl apply-config --insecure -n <node-ip> --file worker.yaml

BGP

If you haven’t been following along and you’ve already got a cluster with Cilium, just ensure you have the following Helm values set:

bgpControlPlane.enabled=true
routingMode=native
enableIPv4Masquerade=false
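If you manage Cilium through Helm directly, one way to apply them (assuming the usual cilium/cilium chart in the kube-system namespace):

helm upgrade cilium cilium/cilium --namespace kube-system \
    --reuse-values \
    --set bgpControlPlane.enabled=true \
    --set routingMode=native \
    --set enableIPv4Masquerade=false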

We disable masquerading because the upstream router will take care of it when needed.

Next, we need to label our nodes so Cilium knows which router each one should talk to. Find your nodes’ names with kubectl get nodes -o wide and label them: the node in 10.0.0.0/24 is dc1, the one in 10.0.1.0/24 is dc2.

kubectl label nodes <node-name> dc=dc1
kubectl label nodes <node-name> dc=dc2

Now we can add the peering policy. Create a file peering.yaml with the following contents:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: dc1
spec:
  nodeSelector:
    matchLabels:
      dc: dc1
  virtualRouters:
  - localASN: 65001
    exportPodCIDR: true
    neighbors:
    - peerAddress: '10.0.0.1/32'
      peerASN: 65000
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: dc2
spec:
  nodeSelector:
    matchLabels:
      dc: dc2
  virtualRouters:
  - localASN: 65001
    exportPodCIDR: true
    neighbors:
    - peerAddress: '10.0.1.1/32'
      peerASN: 65000

And apply it:

kubectl apply -f peering.yaml
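One caveat if you copied my bird.conf verbatim: it only defines the iBGP session between the two routers, so each router also needs its local node as a neighbour before the sessions can establish. On router 1 that could look like the following, using the node address we noted earlier (mirror it on router 2 with the DC2 node’s address, and consider giving the nodes static DHCP leases so the addresses don’t change); reload bird with birdc configure afterwards:

protocol bgp node1 {
  local as OWNAS;
  neighbor 10.0.0.117 as KUBEAS;
  direct;
  ipv4 {
    import all;
    export all;
  };
}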

It may take a few moments, but after a while cilium bgp routes should show routes being advertised to the routers. At this point, all connectivity should just work, with no overlay.
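To watch both ends of the conversation, a few commands that come in handy:

# From the workstation, via the cilium CLI
cilium bgp peers
cilium bgp routes

# On either router, ask bird itself
birdc show protocols
birdc show route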

How this works

The nodes are now advertising their PodCIDRs to the router they’re connected to. That router in turn advertises them to the other router. So if a pod on node-1 tries to reach a pod on node-2, this is what happens (example routes follow the list):

  1. The host (node-1) looks up the target IP address in its routing table. As there is no matching entry, the packet is forwarded to the default gateway. That’s router-1.
  2. router-1 does have a route for the target IP address, because it was advertised by router-2. So it forwards the packet to it over the peering network.
  3. router-2 has a route advertised by node-2, so it goes there over the lan network.
  4. node-2 forwards the packet to the pod.
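Written out as routing table entries, with PodCIDRs assumed to be 10.244.0.0/24 on node-1 and 10.244.1.0/24 on node-2, the lookup chain looks roughly like this:

# node-1: no specific route for 10.244.1.0/24, so the default applies
default via 10.0.0.1

# router-1: learned over iBGP from router-2
10.244.1.0/24 via 172.16.0.2

# router-2: learned over eBGP from node-2 (10.0.1.117, also an example)
10.244.1.0/24 via 10.0.1.117

# node-2: Cilium delivers the packet to the pod's veth locally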

With more nodes on the same network, it is preferable to add auto-direct-node-routes=true to the Cilium configuration so traffic between nodes in the same DC doesn’t take a detour through the router. In addition, you may want to set the PodCIDRs explicitly using Cilium’s Cluster-Pool IPAM, to avoid colliding with ranges already in use on your network. At that point, this setup should scale to any number of nodes or routers.
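As a sketch, those two changes as extra flags to the cilium install from earlier (replacing ipam.mode=kubernetes; the pool CIDR and per-node mask size are examples, so pick ranges that don’t overlap anything on your network):

    --helm-set=autoDirectNodeRoutes=true \
    --helm-set=ipam.mode=cluster-pool \
    --helm-set=ipam.operator.clusterPoolIPv4PodCIDRList="{10.244.0.0/16}" \
    --helm-set=ipam.operator.clusterPoolIPv4MaskSize=24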

Diagram

For clarity, I’ve included a diagram of this setup, with routing tables per machine and example routes: