Running a Linux router on macOS

Using macOS's Virtualization.Framework and a few other hacks to configure a fully functional, high-performance Linux VM as a router on macOS.

Jul 03, 2024

For more than 2 decades now, I've used Linux-on-a-PC as the main router for my home, essentially using ISP-provided modems as dumb network bridges. It has been a crucial tool for tinkering with firewalls, DNS, and VPNs, and at one point I even ran my own SMTP server.

Over the last 7 years, that "PC" has been a virtual machine running atop a 2017 iMac Pro. Until recently, I'd been using hyperkit to run this VM, which I first encountered when running Docker on macOS. The network topology with the VM looks something like this:

my home network topology, and yes I use two ISPs

Hyperkit is based on macOS's Hypervisor.Framework (HVF), a low-level mechanism on top of which something like QEMU can be built. Unfortunately, network device passthrough isn't available with HVF, so I had to use the tuntap-osx driver to get L2 networking to work.

Why tuntap

If you don't have a background in networking, you might be wondering why I'm even mentioning something like tuntap when VMs with functioning networking run on macOS all the time. In a typical VM, networking is enabled at the L3 level via NAT: the host OS (macOS in my case) exposes a virtual network to its guests (the Linux VM) and translates packets going in and out of the VMs, allowing them to reach the external network. Effectively, the VM is invisible to the external world, and all its packets appear to come from the host macOS.

In this case, however, this is no ordinary VM: it's supposed to be a router, which means other machines on the network need to be able to reach it at the L2 layer, i.e. it needs to be visible via its own ethernet address, independent of the host OS.

The tuntap driver allows us to create two kinds of network interfaces: tun (which operates at L3, i.e. with IP addresses) and tap (which operates at L2, i.e. with ethernet addresses). We're specifically interested in tap (tun is available natively in macOS now).

We create a tap interface and pass it to hyperkit as a virtio-tap device (i.e. host-guest communication uses virtio). What this means is that any packet sent by the VM is received on this tap interface, and vice-versa. We then add this tap interface to a network bridge, which is natively available on macOS thanks to its BSD heritage. A bridge is, in effect, a software network switch.

a network bridge with members 'en6' and 'tap1'.

In the picture above, the bridge1 interface is created on the host macOS, and we add en6 and tap1 as member interfaces. en6 is the network interface corresponding to a physical NIC on the machine, while tap1 was created by us using the tuntap driver and passed to the hyperkit VM.

With the above arrangement, network packets can flow between all the member interfaces without any need for NAT, and with distinct ethernet addresses. This allows our router to provide its L2 level services, specifically routing and DHCP.

But this setup will become unviable soon.

tuntap has been deprecated

Apple has been making changes to its macOS kernel (xnu) to lock down third-party drivers as much as it can. tuntap is collateral damage, and has been marked deprecated ever since the Apple M1's release, which means it cannot be relied upon as a strategy anymore. Apple has reduced the damage to VPN providers by providing an effective utun mechanism for L3 routing, but has not provided an L2 alternative, likely due to a perceived security risk.

Given that all of Apple's Mac product line is going the Apple Silicon way, this creates a major issue for me. Either I move back to a dedicated SFF machine as the Linux router, or I have to figure out an alternative mechanism for my Linux VM. It's a matter of when, not if, my iMac Pro needs replacement, so I need a future-proof way to run my Linux router VM.

Enter Virtualization.Framework

Apple's Virtualization.Framework (VF) is a higher-level framework provided by Apple on top of the HVF discussed earlier. The good part is that it has far fewer entitlement requirements (permissions from Apple) for code that uses it.

Apple's VF has pretty good documentation on how to get started with running a Linux VM, so I won't go over that here. I skipped the GUI pieces, as our router will run as a daemon.

If you pay close attention to the doc, you'll see that there are 2 ways we could enable L2 networking on VF:

  1. VZBridgedNetworkDeviceAttachment - this is the best option, but unfortunately Apple guards it closely behind an entitlement (com.apple.vm.networking, granted only on request to Apple), making it an unreasonably difficult process. So I decided to skip it.
  2. VZFileHandleNetworkDeviceAttachment - this allows us to create a SOCK_DGRAM socket and pass it to the VM. Packets sent by the VM can be read from this socket with their L2 header, and any L2 packets we write to it are received by the VM. This also has the benefit of not requiring any specific entitlements, so we don't need permission from anyone to build our own switch. This is the option I chose.

With VZFileHandleNetworkDeviceAttachment as the choice going forward, let's look at how we make it work.
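
To make this concrete, here's a minimal sketch of that wiring: a SOCK_DGRAM socket pair, with one end wrapped in the attachment and handed to the VM configuration (the createVMSocket() function further below is the corresponding piece in the actual code):

import Foundation
import Virtualization

// Minimal sketch: create a SOCK_DGRAM socket pair, keep one end for our own
// packet-switching code, and hand the other end to the VM via the attachment.
var fds: [Int32] = [0, 0]
guard socketpair(PF_LOCAL, SOCK_DGRAM, 0, &fds) == 0 else {
    fatalError("socketpair() failed: \(String(cString: strerror(errno)))")
}
let vmSocket = fds[0]      // our side: raw L2 frames are read/written here
let remoteSocket = fds[1]  // VM side: wrapped in the attachment below

let netDevice = VZVirtioNetworkDeviceConfiguration()
netDevice.attachment = VZFileHandleNetworkDeviceAttachment(
    fileHandle: FileHandle(fileDescriptor: remoteSocket))
netDevice.macAddress = VZMACAddress.randomLocallyAdministered()

let vmConfig = VZVirtualMachineConfiguration()
vmConfig.networkDevices = [netDevice]
// ... CPU, memory, bootloader, storage, etc. as per Apple's VF documentation.

The vmSocket end is what the rest of the code in this post reads from and writes to.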

Exchanging L2 packets

Let's say I want to logically pass through the physical interface en6 to the VM, so that any packet received on en6 is sent to the VM and vice-versa. Because true network device passthrough isn't possible on macOS, we do this by reading/writing packets ourselves between the physical interface en6 and the VM's socket.

We now run into a different issue: on macOS there's no mechanism to create a single socket that can both read from and write to a physical interface. So we have to open two different sockets:

  1. A BPF socket for reading L2 packets. Any packets that we receive from this socket, we then write to the VM socket.
  2. An NDRV socket for writing L2 packets. Any packets that we receive from the VM socket, we then write to this socket.
packet flow between a physical network interface and the VM interface

The full code for our custom software network switch implementing the above can be found here (look at NetworkSwitch.swift specifically), but we can explore some of the key ideas in the code snippets below (these have been modified slightly from the linked code, to better fit an explanatory structure).

Creation of all the sockets looks something like this:

/// VM socket. This is created as a socket pair - vmSocket is what we use on our code
/// side, while remoteSocket is used to initialise a VZFileHandleNetworkDeviceAttachment
/// which is passed to the VM configuration.
/// Note: this function does not exist in the original code linked above, and is shown
/// this way only for brevity here.
func createVMSocket() -> (Int32, Int32) {
    var socketPair: (Int32, Int32) = (0, 0)
    withUnsafePointer(to: &socketPair) {
        let ptr = UnsafeMutableRawPointer(mutating: $0).bindMemory(to: Int32.self, capacity: 2)
        guard socketpair(PF_LOCAL, SOCK_DGRAM, 0, ptr) == 0 else {
            fatalError("socketpair() failed: \(String(cString: strerror(errno)))")
        }
    }

    let (vmSocket, remoteSocket) = socketPair
    return (vmSocket, remoteSocket)
}

/// Create BPF socket. The bpfFilter parameter specifies the BPF program, which makes
/// sure that we receive only traffic for the VM's mac address.
/// See the original code (linked above) to see exact details.
func bpfSocket(_ ifc: String, _ buffSize: Int, _ bpfFilter: [bpf_insn]) -> Int32 {
    // get the first available bpf socket
    for i in 1..<256 {
        let dev = "/dev/bpf\(i)"
        let fd = open(dev, O_RDONLY)
        if fd >= 0 {
            // make a bunch of important ioctl() calls which I'm ignoring here for brevity
            // ...
            // bind to interface
            var ifr = ifreq()
            memset(&ifr, 0, MemoryLayout<ifreq>.size)
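            // copyTo() is a small String helper from the linked code that copies
            // the interface name into the fixed-size C char array (ifr_name)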
            ifc.copyTo(&ifr.ifr_name)
            guard ioctl(fd, BpfIoctl.BIOCSETIF, &ifr) == 0 else {
                fatalError("bpf ioctl(BIOCSETIF) failed for \(ifc): \(String(cString: strerror(errno)))")
            }
            // another bunch of important ioctl() calls which I'm ignoring here for brevity
            // ...
            // set filter
            var filter = bpf_program()
            filter.bf_len = UInt32(bpfFilter.count)
            filter.bf_insns = UnsafeMutablePointer<bpf_insn>.allocate(capacity: bpfFilter.count)
            for i in 0..<bpfFilter.count {
                filter.bf_insns[i] = bpfFilter[i]
            }
            guard ioctl(fd, BpfIoctl.BIOCSETFNR, &filter) == 0 else {
                fatalError("bpf ioctl(BIOCSETFNR) failed for \(ifc): \(String(cString: strerror(errno)))")
            }
            return fd
        }
    }
    fatalError("bpf open() failed for \(ifc): \(String(cString: strerror(errno)))")
}

/// Create NDRV socket for the specified physical interface (e.g. en6 in this example)
func ndrvSocket(_ ifc: String) -> Int32 {
    let fd = socket(PF_NDRV, SOCK_RAW, 0)
    guard fd >= 0 else {
        fatalError("ndrv socket() failed for \(ifc): \(String(cString: strerror(errno)))")
    }

    // bind to interface
    var nd = sockaddr_ndrv()
    nd.snd_len = UInt8(MemoryLayout<sockaddr_ndrv>.size)
    nd.snd_family = UInt8(AF_NDRV)
    ifc.copyTo(&nd.snd_name)

    withUnsafePointer(to: &nd) { nd_ptr in
        nd_ptr.withMemoryRebound(to: sockaddr.self, capacity: 1) { nd_ptr in
            if Darwin.bind(fd, nd_ptr, socklen_t(MemoryLayout<sockaddr_ndrv>.size)) != 0 {
                fatalError("ndrv bind() failed for \(ifc): \(String(cString: strerror(errno)))")
            }
            if Darwin.connect(fd, nd_ptr, socklen_t(MemoryLayout<sockaddr_ndrv>.size)) != 0 {
                fatalError("ndrv connect() failed for \(ifc): \(String(cString: strerror(errno)))")
            }
        }
    }
    return fd
}

We use kqueue to poll the sockets for reading, giving us high performance. I'm not including the kqueue code here because it's tangential to the main point of discussion, but you can take a look at it in the code linked above.
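
For a flavour of what that looks like, here's a rough sketch of the polling loop (a sketch of the idea only, not the repo's exact code): register the BPF and VM sockets for read events, then dispatch to the handlers shown below.

// Rough sketch only; bpfSocket and vmSocket are the descriptors created earlier.
let kq = kqueue()
guard kq >= 0 else { fatalError("kqueue() failed: \(String(cString: strerror(errno)))") }

var changes = [
    kevent64_s(ident: UInt64(bpfSocket), filter: Int16(EVFILT_READ),
               flags: UInt16(EV_ADD | EV_ENABLE), fflags: 0, data: 0, udata: 0, ext: (0, 0)),
    kevent64_s(ident: UInt64(vmSocket), filter: Int16(EVFILT_READ),
               flags: UInt16(EV_ADD | EV_ENABLE), fflags: 0, data: 0, udata: 0, ext: (0, 0)),
]
guard kevent64(kq, &changes, Int32(changes.count), nil, 0, 0, nil) >= 0 else {
    fatalError("kevent64() registration failed: \(String(cString: strerror(errno)))")
}

var events = [kevent64_s](repeating: kevent64_s(), count: 8)
while true {
    let n = kevent64(kq, nil, 0, &events, Int32(events.count), 0, nil)
    for i in 0..<Int(max(n, 0)) {
        let ev = events[i]
        if ev.ident == UInt64(bpfSocket) {
            hostToVM(ev)        // BPF socket readable: host -> VM
        } else if ev.ident == UInt64(vmSocket) {
            vmToHost(ev)        // VM socket readable: VM -> host
        }
    }
}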

The packet reading/writing code looks something like the following. Note that this code is fairly unoptimised - that's deliberate, and will be explained later in this post:

/// Route traffic from host to VM by reading from bpfSocket and writing to vmSocket.
func hostToVM(_ event: kevent64_s) {
    var numPackets = 0, wlen = 0, wlenActual = 0
    let buffer = bpfReadBuffer.baseAddress!
    let len = read(bpfSocket, buffer, bpfBufferSize)
    if len > 0 {
        let endPtr = buffer.advanced(by: len)
        var pktPtr = buffer.assumingMemoryBound(to: bpf_hdr.self)
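        // a single read() from a BPF device can return multiple packets, each
        // prefixed with a bpf_hdr and padded to BPF_ALIGNMENT, so walk them one by one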
        while pktPtr < endPtr {
            // for each packet
            let hdr = pktPtr.pointee
            let nextPktPtr = UnsafeMutableRawPointer(pktPtr).advanced(by: Int(hdr.bh_caplen) + Int(hdr.bh_hdrlen))
            if hdr.bh_caplen > 0 {
                if nextPktPtr > endPtr {
                    NetworkSwitch.logger.error("\(hostInterface)-h2g: nextPktPtr out of bounds: \(nextPktPtr) > \(endPtr). current pktPtr=\(pktPtr) hdr=\(hdr)", throttleKey: "h2g-next-oob")
                }
                let hdr = pktPtr.pointee
                let dataPtr = UnsafeMutableRawPointer(mutating: pktPtr).advanced(by: Int(hdr.bh_hdrlen))
                let writeLen = write(vmSocket, dataPtr, Int(hdr.bh_caplen))
                numPackets += 1
                wlen += Int(hdr.bh_caplen)
                wlenActual += writeLen
                if writeLen < 0 {
                    NetworkSwitch.logger.error("\(hostInterface)-h2g: write() failed: \(String(cString: strerror(errno)))", throttleKey: "h2g-writ-fail")
                } else if writeLen != Int(hdr.bh_caplen) {
                    NetworkSwitch.logger.error("\(hostInterface)-h2g: write() failed: partial write", throttleKey: "h2g-writ-partial")
                }
            }
            pktPtr = nextPktPtr.alignedUp(toMultipleOf: BPF_ALIGNMENT).assumingMemoryBound(to: bpf_hdr.self)
        }
    } else if len == 0 {
        NetworkSwitch.logger.error("\(hostInterface)-h2g: EOF", throttleKey: "h2g-eof")
    } else if errno != EAGAIN && errno != EINTR {
        NetworkSwitch.logger.error("\(hostInterface)-h2g: read() failed: \(String(cString: strerror(errno)))", throttleKey: "h2g-read-fail")
    }
}

/// Send traffic from VM to host by reading from vmSocket and writing to ndrv socket.
func vmToHost(_ event: kevent64_s, onlyOne: Bool = true) {
    let availableLen = min(bpfReadBuffer.count, Int(event.data))
    let basePtr = bpfReadBuffer.baseAddress!
    var offset = 0
    while offset < availableLen {
        let n = read(vmSocket, basePtr, availableLen - offset)
        if n > 0 {
            let len = write(ndrvSocket, basePtr, n)
            if len != n {
                if len < 0 {
                    NetworkSwitch.logger.error("\(hostInterface)-g2h: write() failed: \(String(cString: strerror(errno)))", throttleKey: "g2h-writ-fail")
                } else if errno != EAGAIN && errno != EINTR {
                    NetworkSwitch.logger.error("\(hostInterface)-g2h: write() failed: partial write", throttleKey: "g2h-writ-partial")
                }
                break
            }
            offset += n
            if onlyOne {
                break
            }
        } else {
            if n == 0 {
                NetworkSwitch.logger.error("\(hostInterface)-g2h: EOF", throttleKey: "g2h-eof")
            } else if errno != EAGAIN && errno != EINTR {
                NetworkSwitch.logger.error("\(hostInterface)-g2h: read() failed: \(String(cString: strerror(errno))): e=\(event)", throttleKey: "g2h-read-fail")
            }
            break
        }
    }
}

Communication with the host

Up until now, we've built mechanisms to pipe traffic between the VM and a physical interface. This works great for external traffic, but fails when the source or destination is the host itself. I can think of 3 different ways to solve this; I'll describe the one I went with here.

macOS has a notion of feth (fake ethernet) network interfaces. For those of you familiar with Linux's veth, this is essentially the same idea; for those unfamiliar, it creates two peered interfaces, where writing to one side produces output on the other side and vice-versa. We use a feth pair together with a bridge on the host side (to which the host's IP is assigned): one feth of the pair gets added to the bridge, while the other is used as the target for the packet switching with the VM socket. It looks like this:

feth based communication between VM and host

Notice that the bridge has an IP address, so any packets that reach this bridge are also read by the host macOS, and if they're destined for it, they get consumed internally in the correct fashion. Given how neatly this fits into our earlier workflow, we simply check if the target interface is a bridge (you can see this in the original code linked above), and if it is, we create a feth pair to set up this topology; all of the rest of the code remains identical.
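
The real code sets this topology up programmatically via the hidden ioctls described in the next section, but the equivalent manual setup boils down to a handful of ifconfig invocations. Here's a rough sketch (interface names and the IP address are placeholders, and it needs root):

import Foundation

// Illustrative helper that shells out to ifconfig; the real code drives the
// equivalent ioctls directly instead of spawning processes.
func ifconfig(_ args: String...) throws {
    let p = Process()
    p.executableURL = URL(fileURLWithPath: "/sbin/ifconfig")
    p.arguments = args
    try p.run()
    p.waitUntilExit()
}

try ifconfig("feth0", "create")                   // host-facing end of the pair
try ifconfig("feth1", "create")                   // switch-facing end of the pair
try ifconfig("feth0", "peer", "feth1")            // peer the two feths (undocumented subcommand)
try ifconfig("bridge2", "create")
try ifconfig("bridge2", "addm", "feth0", "up")
try ifconfig("bridge2", "inet", "192.168.64.1", "netmask", "255.255.255.0")  // placeholder host IP
// feth1 then takes the place of en6 as the target of the packet-switching code above.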

The hacks

Not all of the above is public API. For example, fake ethernet devices don't have a userspace API other than ifconfig, and even ifconfig's man page doesn't document them. Similarly, adding/removing network interfaces to/from a bridge doesn't seem to have a userspace API other than ifconfig.

Fortunately, this is where access to the xnu kernel's source code is helpful, even if Apple's runtime is tightly locked down. Reading the kernel source lets us determine what ioctl parameters it expects, so we can reconstruct those calls and use the hidden APIs. Obviously, anything that's hidden can be discontinued without notice, but except for the feth parts, I doubt anything else (like bridging) is at any real risk of change, simply because it would break a lot of existing code.
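
To make "reconstruct" concrete: Swift doesn't expose the BSD ioctl-numbering macros, so request codes get rebuilt by mirroring _IOW() from <sys/ioccom.h> and plugging in the group/number/struct values found in xnu's headers and source. A small sketch:

import Darwin

// Mirror of BSD's _IOW() macro (see <sys/ioccom.h>), used to rebuild ioctl request
// codes whose definitions only exist in C headers or in xnu's source.
func _IOW(_ group: Character, _ num: UInt, _ size: Int) -> UInt {
    let IOC_IN: UInt = 0x8000_0000
    let IOCPARM_MASK: UInt = 0x1fff
    return IOC_IN | ((UInt(size) & IOCPARM_MASK) << 16) | (UInt(group.asciiValue!) << 8) | num
}

// Example with a public definition: <net/bpf.h> defines BIOCSETIF as _IOW('B', 108, struct ifreq).
// The same approach is used for the private bridge/feth ioctls reconstructed from xnu's source.
let BIOCSETIF = _IOW("B", 108, MemoryLayout<ifreq>.size)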

Results

The VM has now been running for a little over 3 weeks. No packet loss seen. iperf3 tests saturate my physical network before coming anywhere close to using up the CPU.

There does seem to be an mbuf overflow on the feth network path in the kernel beyond a certain bandwidth threshold (~8Gbps on the M1, ~5Gbps on my Intel iMac Pro), which when exceeded sometimes leads to a kernel panic. feth being a non-public API probably allows Apple to be OK with this, but I'd guess it should eventually get addressed. Anyhow, on my home network I'm limited by my WiFi bandwidth to much lower levels than that, so it works for me. Without feth, purely on virtio, I get about 25Gbps, which is significantly higher than what I used to get on hyperkit (~1.5Gbps). This is why I've chosen to keep the packet transfer code completely unoptimised - the code isn't the bottleneck yet.

You could ask why I wrote my own when things like lima-vm exist, but given the simplicity and performance of Apple's VF, I felt it was a good choice. Plus, it gives me flexibility in specific areas (related to running outside the context of a user session) that matter for a router.

I've kept the overall code private while keeping the L2 networking part open, because some parts of the overall VM code are specific to my needs. If anyone out there shares the weirdness of running a Linux router VM on macOS, let me know and I'll open-source the full thing after refactoring the code a bit.