The future of NetBSD network infrastructure has to efficiently embrace two major design criteria: Symmetric Multi-Processing (SMP) and modularity. Other design considerations include not only supporting but taking advantage of the capability of newer network devices to do packet classification, payload splitting, and even full connection offload.

Symmetric Multi-Processing

NetBSD networking has evolved to work in a uniprocessor environment. Switching it to use fine-grained locking is a hard and complex problem.

You can divide the network infrastructure into 5 major components:

Part of the complexity is that, due to the monolithic nature of the kernel, each layer currently feels free to call any other layer. This makes designing a lock hierarchy difficult and likely to fail.

Part of the problem is asynchronous upcalls, among which are:

Another source of complexity is the large number of global variables scattered throughout the source files. This makes putting locks around them difficult.

The proposed solutions presented here include (in no particular order):

Revamped struct protosw

Split out the obvious PR*_xxx requests that should never have been dispatched through pr_usrreq/pr_ctloutput. Note that pr_ctloutput is replaced by pr_getopt/pr_setopt.

It's expected that pr_drain will just schedule a kernel continuation.

The following entry points have been added:

	int pr_init(void *dsc);
	int pr_fini(void *dsc);

Lockless queues

Producer/Consumer Queue (PCQ)

A PCQ allows multiple writers (producers) but only a single reader (consumer). Compare-and-swap operations are used to allow lockless updates. The consumer is expected to be protected by a mutex that covers the structure the PCQ is embedded into (e.g. socket lock, ifnet hwlock). These queues operate in a First-In, First-Out (FIFO) manner. The act of inserting or removing an item from a PCQ does not modify the item in any way. A PCQ does not prevent an item from being inserted multiple times into a single PCQ.

Since this structure isn't specific to networking, it will be accessed via <sys/pcq.h> and the code will live in kern/subr_pcq.c.
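
To make the interface concrete, here is a minimal usage sketch. It assumes entry points along the lines of pcq_put/pcq_get, and a hypothetical pcq_t *so_rcvq member embedded in struct socket; none of these names are final.

#include <sys/mbuf.h>
#include <sys/pcq.h>
#include <sys/socketvar.h>

/* Producer (e.g. interrupt path): lockless; concurrent producers are safe. */
static void
rx_enqueue(struct socket *so, struct mbuf *m)
{
	/* so_rcvq is a hypothetical per-socket PCQ. */
	if (!pcq_put(so->so_rcvq, m))
		m_freem(m);		/* queue full; drop the packet */
}

/* Consumer: the caller holds the socket lock that covers the PCQ. */
static void
rx_drain(struct socket *so)
{
	struct mbuf *m;

	while ((m = pcq_get(so->so_rcvq)) != NULL)
		sbappend(&so->so_rcv, m);
}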

Atomic FIFO/LIFO Queue

These routines allow items of a common type to be locklessly inserted at either the head or tail of a queue, for last-in, first-out (LIFO) or first-in, first-out (FIFO) behavior, respectively. However, a queue is not intrinsically LIFO or FIFO; its behavior is determined solely by how each item was pushed onto the queue.

It is only possible for an item to be removed from the head of the queue. This removal is also performed in a lockless manner.

All items in the queue must have an atomic_queue_link_t member at the same offset from the beginning of the item. This offset is passed to atomic_qinit.
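
A sketch of the declarations involved. atomic_queue_link_t and atomic_qinit come from the description above; atomic_queue_t and the push/pop names are invented here purely for illustration.

/* Illustrative only: atomic_queue_t and the push/pop names are assumed. */
struct pkt {
	atomic_queue_link_t	pkt_link;	/* shared link member */
	struct mbuf		*pkt_m;
};

static atomic_queue_t	pktq;

static void
example(struct pkt *p, struct pkt *q)
{
	/* The link offset is fixed for every item in the queue. */
	atomic_qinit(&pktq, offsetof(struct pkt, pkt_link));

	atomic_qpush_tail(&pktq, p);	/* FIFO insertion: p joins the tail */
	atomic_qpush_head(&pktq, q);	/* LIFO insertion: q jumps the head */
	q = atomic_qpop(&pktq);		/* removal is always from the head */
}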

Generic Radix / PATRICIA Trees

BSD systems have always used a radix tree for their routing tables. However, the radix tree implementation is showing its age. Its lack of flexibility (it is only suitable for use in a routing table) and its overhead of use (memory allocation/deallocation is required for insertions and removals) make replacing it with something better tuned to today's processors a necessity.

Since a radix tree branches on bit differences, finding these bit differences efficiently is crucial to the speed of tree operations. This is most quickly done by XORing the key and the tree node's value together and then counting the number of leading zeroes in the result. Many processors today (ARM, PowerPC) have instructions that can count the number of leading zeroes in a 32-bit word (or even a 64-bit word). Even those that don't can use a simple constant-time routine to count them:

int
clz(unsigned int bits)
{
	int zeroes = 0;

	if (bits == 0)
		return 32;
	/*
	 * Each step keeps only the upper half of every group still in
	 * play; when that half is empty, the most significant set bit
	 * must lie in the lower half, so add the half-width to the count.
	 */
	if (bits & 0xffff0000) bits &= 0xffff0000; else zeroes += 16;
	if (bits & 0xff00ff00) bits &= 0xff00ff00; else zeroes += 8;
	if (bits & 0xf0f0f0f0) bits &= 0xf0f0f0f0; else zeroes += 4;
	if (bits & 0xcccccccc) bits &= 0xcccccccc; else zeroes += 2;
	if (bits & 0xaaaaaaaa) bits &= 0xaaaaaaaa; else zeroes += 1;
	return zeroes;
}
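
Given clz, locating the branching bit between two keys reduces to an XOR and a count. A small illustrative helper (not part of any proposed API):

/*
 * The branching bit is the index, counted from the most significant
 * bit, of the first bit at which two keys differ.
 */
int
branching_bit(unsigned int a, unsigned int b)
{
	return clz(a ^ b);	/* returns 32 if the keys are equal */
}
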
The existing BSD radix tree implementation does not use this method; instead it uses a far more expensive method of comparison. Adapting the existing implementation to do the above would actually be more expensive than writing a new implementation.

The primary requirements for the new radix tree are:

To make the radix tree flexible, all knowledge of how keys are represented has been encapsulated into a pt_tree_ops_t structure, which contains 4 functions:

All bit offsets are relative to the most significant bit of the key.

The ptree programming interface contains just five routines:

Kernel Continuations

Combine callouts, softints, and workqueues into a single framework. Continuations are meant to be cheap. Very cheap.

The API is primarily derived from the callout(9) API and is a superset of the softint(9) API. The most significant change is that workqueue items are not tied to a specific kernel thread.
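
What such an API might look like, as a purely hypothetical sketch: every kcont_* name below is invented here, since the text only commits to an API derived from callout(9) that supersets softint(9).

/* Every kcont_* name is hypothetical. */
static void
do_work(void *arg)
{
	/* ... the continuation body, not tied to a dedicated thread ... */
}

static void
example(void *cookie)
{
	kcont_t *kc;

	/* modeled on softint_establish(9)/callout_init(9) */
	kc = kcont_establish(KCONT_MPSAFE, do_work, cookie);

	kcont_schedule(kc, 0);		/* run soon, like a softint */
	kcont_schedule(kc, hz / 2);	/* run later, like a callout */
}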

Separate nexthop cache; not part of the routing table.

Remove ARP, AARP, ISO SNPA, and IPv6 Neighbors from the routing table. Instead, the ifnet structure will have a set of nexthop caches (usually implemented using patricia trees), one per address family. Each nexthop entry will contain the datalink header needed to reach the neighbor.

This will remove cloneable routes from the routing table and remove the need for protocol-specific code in the common Ethernet, FDDI, PPP, etc. code, putting it back where it belongs: in the protocol itself.
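
The resulting layout might look something like this. This is a sketch; the pt_tree_t/pt_node_t types and every member name are assumptions.

/* Sketch only: all names are illustrative. */
struct nexthop {
	pt_node_t	nh_ptnode;	/* patricia-tree linkage */
	struct sockaddr	nh_addr;	/* the neighbor's protocol address */
	uint8_t		nh_dlhdr[64];	/* prebuilt datalink header */
};

/* Added to struct ifnet: one nexthop cache per address family. */
	pt_tree_t	if_nhcache[AF_MAX];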

if_poll

When a network device gets an interrupt, it will call <iftype>_defer(ifp) to schedule a kernel continuation for that interface, which invokes <iftype>_poll. Whether the interrupt source should be masked depends on whether the device is a DMA device or a PIO device. This routine will call (*ifp->if_poll)(ifp) to deal with servicing the interrupt.

During servicing, any received packets will be passed up via (*ifp->if_input)(ifp, m), which will be responsible for ALTQ or any other optional processing as well as protocol dispatch. Protocol dispatch in <iftype>_input will decode the datalink headers, if needed, via a table lookup and call the matching protocol's pr_input to process the packet. As such, interrupt queues (e.g. ipintrq) will no longer be needed. Any transmitted packets can be processed, as can MII events. if_poll returns true if another invocation of <iftype>_poll for this interface should be scheduled immediately, and false otherwise.
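
In outline, the continuation body might look like this. ether_defer/ether_poll follow the <iftype>_defer/<iftype>_poll naming above; ether_unmask_intr is a hypothetical helper.

static void
ether_poll(void *arg)		/* the interface's kernel continuation */
{
	struct ifnet * const ifp = arg;

	/*
	 * Service RX (each packet goes up via (*ifp->if_input)(ifp, m)),
	 * TX completions, and MII events.
	 */
	if ((*ifp->if_poll)(ifp))
		ether_defer(ifp);	/* more work: run again immediately */
	else
		ether_unmask_intr(ifp);	/* hypothetical: re-enable interrupts */
}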

Memory allocation will be prohibited in the interrupt routines. The device's if_poll routine should pre-allocate enough mbufs to do any required buffering. For devices doing DMA, the buffers are placed into receive descriptors to be filled via DMA.

For devices doing PIO, pre-allocated mbufs are enqueued onto the softc of the device, so when the interrupt routine needs one it simply dequeues one, fills it in, enqueues it onto a completed queue, and finally calls <iftype>_defer. If the number of pre-allocated mbufs drops below a threshold, the driver may decide to increase the number of mbufs that if_poll pre-allocates. If there are no mbufs left to receive a packet, the packet is dropped and the number of mbufs for if_poll to pre-allocate should be increased.
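
A sketch of that interrupt path. sc_freeq, sc_doneq, sc_nprealloc, and the pio_* helpers are hypothetical; the queue operations reuse the atomic FIFO/LIFO names sketched earlier.

static void
pio_rx_intr(struct pio_softc *sc)
{
	struct pkt *p = atomic_qpop(&sc->sc_freeq);	/* pre-allocated */

	if (p == NULL) {
		sc->sc_nprealloc++;	/* have if_poll grow the pool */
		pio_discard_rx(sc);	/* hypothetical: drain and drop */
		return;
	}
	pio_copyin_rx(sc, p);		/* hypothetical: PIO copy into mbuf */
	atomic_qpush_tail(&sc->sc_doneq, p);	/* FIFO completion queue */
	ether_defer(sc->sc_ifp);
}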

When interrupts are unmasked depends on a few things. If the device is interrupting "too" often, it might make sense for the device's interrupts to remain masked and to just schedule the device's continuation for the next clock tick. This assumes the system has a high enough value set for HZ.

Fast protocol/port demultiplexing.

When a packet is received and it's destined for a socket, simply place the packet in the socket's receive PCQ and wake any lwp blocked on the socket. Then it's off to process the next packet.
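
A sketch of that path. fast_demux stands in for the protocol/port lookup, and so_rcvq is the hypothetical per-socket PCQ used in the earlier sketch.

static void
deliver(struct mbuf *m)
{
	struct socket *so = fast_demux(m);	/* hypothetical PCB/port lookup */

	if (so == NULL || !pcq_put(so->so_rcvq, m)) {
		m_freem(m);	/* no taker, or receive PCQ full */
		return;
	}
	sorwakeup(so);		/* wake the lwp sleeping on the socket */
}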

The typical packet flow from ip_input is to {rip,tcp,udp}_input which

Lazy Receive Processing

Instead of having a set of active workqueue lwps waiting to service sockets, use the kernel lwp that's blocked on the socket to service the work item. It's not being productive while blocked, it has an interest in getting that work item done, and maybe we can copy the data directly to the user's address space and avoid queuing in the socket at all.

Make TCP syncache optional

For small systems, the syncache is complete overkill and should be made optional.

Virtual Network Stacks

Collect all the global data for an instance of a network stack (excluding AF_LOCAL). This includes the routing table, data for multiple domains and their protocols, and the mutexes needed for regulating access to it all. Indeed, a brane is an instance of a networking stack.
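
A sketch of what gets collected; every member name here is illustrative.

struct brane {
	kmutex_t	b_lock;		/* regulates access to the rest */
	pt_tree_t	b_rttable;	/* this stack's routing table */
	struct domain	*b_domains;	/* per-domain/protocol global data */
	uint32_t	b_refcnt;	/* interfaces and processes attached */
};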

An interface belongs to a brane, as do processes. This can be considered chroot(2) for networking. A chbrane(2)

Radical Thought #1

LWPs in user space don't need a kernel stack

Those pages are only being used in case an exception happens. Interrupts are probably going to their own dedicated stack. One could just keep a set of kernel stacks around: each CPU has one, and when a user exception happens, that stack is assigned to the current LWP and removed as the CPU's active stack. When that CPU next returns to user space, the kernel stack it was using is saved to be used for the next user exception. The idle lwp would just use the current kernel stack.

LWPs waiting for kernel condition shouldn't need a kernel stack

If a LWP is waiting on a kernel condition variable, it is expecting to be inactive for some time, possibly a long time. During this inactivity, it doesn't really need a kernel stack.

When the exception handler gets a usermode exception, it sets the LWP's restartable flag to indicate that the exception is restartable, and then services the exception as normal. As routines are called, they can clear the LWP's restartable flag as needed. When an LWP needs to block for a long time, instead of calling cv_wait it calls cv_restart. If cv_restart returns false, the LWP's restartable flag was clear, so cv_restart acted just like cv_wait. Otherwise, the LWP and CV have been tied together (big hand wave), the lock has been released, and the routine should return ERESTART. cv_restart could also wait for a small amount of time, like .5 seconds, and only take the restart path if the timeout expires.
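
The proposed idiom, as a sketch: cv_restart is the routine described above (it does not exist today), and the surrounding names are illustrative.

int
wait_for_data(struct foo *f)
{
	mutex_enter(&f->f_lock);
	while (!f->f_ready) {
		if (cv_restart(&f->f_cv, &f->f_lock)) {
			/* Tied to the CV; the lock is already released. */
			return ERESTART;
		}
		/* Acted like cv_wait(): lock reheld, recheck the condition. */
	}
	mutex_exit(&f->f_lock);
	return 0;
}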

As the stack unwinds, it will eventually return to the exception handler. The handler will see that the LWP has a bound CV, save the LWP's user state into the PCB, set the LWP to sleeping, mark the lwp's stack as idle, and call the scheduler to find more work. When called, cpu_switchto will notice that the stack is marked idle and detach it from the LWP.

When the condition times out or is signalled, the first lwp attached to the condition variable is marked runnable and detached from the CV. When the cpu_switchto routine is called, it will notice the lack of a stack, grab one, restore the trapframe, and reinvoke the exception handler.