Wine-NSPA – Gamma Channel Dispatcher

Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync channels + aggregate-wait + TRY_RECV2 | 2026-05-01
Author: Jordan Johnston
Status: production dispatcher architecture; aggregate-wait Phase 3, post-1011 TRY_RECV2 burst-drain, and the immediate hot-path tuning follow-ons are all part of the shipped path.

This page explains the current wineserver request path for a Wine process: how requests enter the gamma channel, how the dispatcher owns the reply path, and how the post-1010 aggregate-wait loop fits into that design.

Table of Contents

  1. Overview
  2. Architecture
  3. Design Goals
  4. Kernel Channel Object and ioctls
  5. Sender State Machine
  6. Dispatcher State Machine
  7. Priority-Inheritance Semantics
  8. Phase B Lock-Drop Integration
  9. Thread-Token Pass-Through (T1/T2/T3)
  10. NT Semantics Preservation
  11. Validation and Performance
  12. Predecessors and Their Failure Modes
  13. Bug History and Audits
  14. References

1. Overview

Gamma is the third generation of Wine-NSPA’s client-to-wineserver IPC fast path. It replaces the v1.5 per-thread pthread dispatcher (and the v2.4 cached-CAS / futex-wake hybrid that briefly extended it) with a single per-process kernel-mediated request channel built on top of the NTSync NTSYNC_TYPE_CHANNEL object.

Every Wine client process has exactly one channel fd, opened by the wineserver during process attach and shipped to the client via SCM_RIGHTS in the init_first_thread reply. Client threads issue NTSYNC_IOC_CHANNEL_SEND_PI to atomically enqueue a request, boost the dispatcher pthread to the sender’s priority, and block for reply, all in one syscall.

The wineserver now runs one dispatcher context per client process, not just one bare receive loop. That context owns the channel fd, a shutdown eventfd, and the per-process nspa_uring_instance.

On post-1010 kernels the dispatcher blocks in NTSYNC_IOC_AGGREGATE_WAIT over (channel object, uring eventfd if active, shutdown eventfd). It follows a channel wake with NTSYNC_IOC_CHANNEL_RECV2, runs the existing read_request_shm handler under global_lock, and calls NTSYNC_IOC_CHANNEL_REPLY to wake the originator and drain its PI boost. On 1011 kernels, if NSPA_TRY_RECV2 is left at its default-on setting, the dispatcher then issues NTSYNC_IOC_CHANNEL_TRY_RECV2 in a tight loop to drain any additional ready entries from the same wake. On pre-1010 kernels it falls back permanently to the legacy direct CHANNEL_RECV2 / RECV loop for that dispatcher, and on pre-1011 kernels the burst-drain feature gates itself off via -ENOTTY.

The key win over the legacy designs: priority inheritance is now kernel-atomic. There is no userspace TID-read-vs-sched_setscheduler race window, no pthread_setschedparam call against a thread that may have already exited, and no userspace bookkeeping of “who is currently boosted to what”. The kernel’s apply_event_pi_boost / consume_event_pi_boost machinery (introduced in ntsync patch 1008 deferred-boost) handles all of it inside the same lock that orders the queue.

The published shmem-ipc.gen.html describes v1.5 and v2.4. That document is superseded by this one. Gamma plus the aggregate-wait dispatcher loop is the architecture in production today.


2. Architecture

2.1 Component diagram

The gamma path involves four cooperating components:

Component Location Role
Kernel channel object drivers/misc/ntsync.c Priority rbtree of pending entries; SEND_PI / RECV2 / TRY_RECV2 / REPLY PI machinery
Aggregate-wait primitive drivers/misc/ntsync.c + patch 1010 Heterogeneous wait over channel object + fd sources
Sender shim dlls/ntdll/unix/server.c nspa_send_request_channel: copy header, SEND_PI, copy reply
Dispatcher context server/nspa/shmem_channel.c channel fd + shutdown eventfd + per-process nspa_uring_instance
Per-thread shmem unchanged from v1.5 Holds request payload and reply payload (zero-copy)

The channel fd is created at process attach, and the dispatcher context is allocated alongside it. The detached pthread is spawned with explicit RT scheduler attrs when NSPA_SRV_RT_PRIO > 0. The channel fd is then shipped to the client over SCM_RIGHTS alongside the existing per-thread request_shm fds in the init_first_thread reply; the client stashes it in nspa_request_channel_fd and from then on uses it for every server_call_unlocked whose request fits in the per-thread shmem window.
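The SCM_RIGHTS transfer itself is standard POSIX ancillary-data fd passing. The sketch below is generic fd-passing code illustrating that mechanism; it is not Wine's actual attach-path helper, and all function names are illustrative.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Generic SCM_RIGHTS fd passing: ship one fd over a unix socket
 * alongside one dummy payload byte.  This models the attach-time
 * transfer of the channel fd in the init_first_thread reply. */
static int send_fd(int sock, int fd_to_send)
{
    char data = 'x';
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive side: returns the duplicated fd, or -1 on error. */
static int recv_fd(int sock)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received fd is a kernel-level duplicate, which is what lets the client hold its own long-lived reference to the channel independent of the wineserver's copy.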

[Diagram: gamma channel topology.] Many sender threads in the client process share one `nspa_request_channel_fd`; each thread builds its request header, copies the payload, and knows its caller's RT prio. The per-thread `request_shm` pages keep request and reply bytes zero-copy; the channel carries scheduling metadata only. In the kernel, the ntsync channel object holds a priority queue of entries (sender_tid, prio, payload_off, thread_token): `SEND_PI` enqueues, boosts, and blocks; `RECV2` dequeues with `TRY_RECV2` burst drain; `REPLY` wakes the sender and drains / re-applies PI. In the wineserver, one dispatcher pthread per client process runs the aggregate-wait loop at `NSPA_SRV_RT_PRIO`, aware of `RECV2` / `TRY_RECV2` / uring / shutdown sources; existing request handlers run `read_request_shm()` under `global_lock` and write the reply back into `request_shm`. Flow: 1. `SEND_PI` metadata, 2. `AGG_WAIT` -> `RECV2` / `TRY_RECV2`, 3. dispatch handler, 4. `REPLY(entry_id)`, 5. wake sender. Attach-time path: the wineserver creates the channel fd and transfers it to the client with `SCM_RIGHTS` in `init_first_thread`; request and reply bytes remain in `request_shm`.

2.2 Aggregate-wait topology

Post-1010/1011 gamma is no longer just a blocking RECV2 loop. The dispatcher waits on three sources and selects the work type from the aggregate-wait result; on 1011 kernels it can then keep draining ready channel work with TRY_RECV2 before it sleeps again.

[Diagram: gamma dispatcher on post-1011 kernels.] `NTSYNC_IOC_AGGREGATE_WAIT` takes sources = { channel object, uring eventfd if active, shutdown eventfd }: one waiter, the same dispatcher thread, the same process-owned ring. When the channel source fires, the dispatcher follows with `CHANNEL_RECV2`, runs the handler under `global_lock`, replies, then TRY_RECV2-drains while entries are ready (1011 + gate); an async-capable handler may submit an SQE and return to the wait. When the uring eventfd fires, the dispatcher drains the eventfd counter and runs `nspa_uring_drain()` inline, so the CQE callback finishes deferred work and `CHANNEL_REPLY` is issued from this same RT thread. When the shutdown eventfd fires (the destroy path wrote `1`), the aggregate wait returns, the dispatcher closes its ring state, frees its own context, and exits. Load-bearing invariant: the same RT thread receives, drains completions, and signals the reply. Aggregate-wait opt-out or `-ENOTTY` selects the legacy direct receive loop; `NSPA_TRY_RECV2=0` or `-ENOTTY` keeps one dequeue per wake.

2.3 Per-request data flow

For one request, the payload stays in request_shm while the channel only carries scheduling metadata and reply ownership:

[Diagram: single-request sequence across sender thread, kernel channel object, and dispatcher thread. The data plane stays in `request_shm`; request and reply bytes are zero-copy, and the channel carries only metadata and reply ownership.]

  1. Build request: copy header and payload into `request_shm`.
  2. `SEND_PI`: submit metadata, boost the dispatcher, block the sender. The channel entry is queued with tid, prio, payload_off, thread_token.
  3. `RECV2` dequeues the winner; burst mode may continue with `TRY_RECV2`.
  4. The handler runs: reads `request_shm`, writes reply bytes back.
  5. `REPLY(entry_id)`: the same thread completes the wake and reply signal; reply ownership returns, and the kernel wakes the sender and drains / re-applies PI.
  6. The sender resumes: `SEND_PI` returns and the caller copies the reply locally.

Step 2’s enqueue-and-boost and the unboost-and-reboost inside the kernel REPLY handler are atomic with respect to each other under the channel’s internal spinlock. There is no observable interval where the dispatcher is running unboosted while another high-prio entry sits ready in the queue.

2.4 End-to-end flow diagram

The following inline SVG shows a burst-drain lifecycle through gamma. Two senders are shown at differing priorities to illustrate the rbtree’s strict-priority ordering, REPLY’s automatic re-boost, and the 1011 TRY_RECV2 follow-on that keeps draining ready work without returning to AGG_WAIT.

[Diagram: gamma channel burst-drain — two senders, one wake, one dispatcher. Client thread A at FIFO 80 (audio), client thread B at FIFO 50 (gui), kernel channel (priority rbtree + PI boost), dispatcher pthread at SCHED_FIFO base 64. All kernel ops run under the channel->lock spinlock.]

  1. B memcpys its request into request_shm[B_tid] and issues SEND_PI(prio=50). The kernel enqueues B at prio 50; apply_event_pi_boost asks "dispatcher: 64 -> 50?" NO (50 < 64), so the dispatcher stays at base.
  2. The dispatcher wakes; RECV2 returns entry B and re-boosts the dispatcher to 64.
  3. A memcpys its request into request_shm[A_tid] and issues SEND_PI(prio=80). The kernel enqueues A at prio 80; apply_event_pi_boost raises the dispatcher 64 -> 80 IMMEDIATELY. A blocks waiting for REPLY.
  4. Under global_lock, read_request_shm(B) runs at 80 due to A's boost, then global_unlock.
  5. REPLY(B) completes B.reply_done, drains B's PI contribution, and re-boosts from the queue head: 80 (A is now head). B wakes, memcpys its reply from shmem[B], and its SEND_PI returns.
  6. TRY_RECV2 returns entry A; no re-boost needed (same prio 80). read_request_shm(A) runs under global_lock.
  7. REPLY(A) completes A.reply_done and drains A's PI; the queue is empty, so the dispatcher drops back to base 64. A wakes, memcpys its reply from shmem[A], and SEND_PI returns.

Key: A at prio 80, B at prio 50; REPLY = wake + drain + auto-reboost; an empty queue returns the dispatcher to base prio.

The two-sender scenario shows the property that matters on 1011: between B’s REPLY and the dispatcher’s TRY_RECV2, the dispatcher stays at FIFO 80 because the kernel re-boosted from the new queue head atomically inside REPLY, and the dispatcher can keep draining ready work without returning to AGG_WAIT. The legacy v1.5 design would have unboosted the dispatcher to 64 at reply time and then re-boosted to 80 only when A’s sched_setscheduler landed – which would have raced with the dispatcher’s own RECV-side runqueue insertion. Gamma closes that gap by construction, and 1011 removes the extra wake/round-trip once the next entry is already queued.


3. Design Goals

The gamma redesign was scoped tightly:

The gating env vars on the current production path are NSPA_AGG_WAIT (aggregate-wait dispatcher loop; default on), NSPA_TRY_RECV2 (1011 burst drain; default on), NSPA_OPENFD_LOCKDROP (Phase B lock-drop; default on), NSPA_DISPATCHER_USE_TOKEN (T3 token consumption; default on), NSPA_SRV_RT_PRIO (dispatcher RT priority; RT attrs applied when > 0), and NSPA_DISABLE_EPOLL.

Gamma itself remains the default transport whenever the channel ioctls are present.


4. Kernel Channel Object and ioctls

The kernel side lives in drivers/misc/ntsync.c (Linux-NSPA tree at /home/ninez/pkgbuilds/Linux-NSPA-pkgbuild/linux-nspa-6.19.11-1.src/linux-nspa/src/linux-6.19.11/drivers/misc/ntsync.c, lines 1190-1494 for the channel object). Each NTSync channel is:

struct ntsync_channel {
    struct ntsync_obj   obj;          /* base */
    spinlock_t          lock;         /* serialises queue + boost state */
    struct rb_root      entries;      /* priority-ordered by entry->prio */
    u32                 max_depth;
    struct hlist_head   thread_tokens;/* (tid -> struct thread *) registry */
    ...
};

struct ntsync_channel_entry {
    struct rb_node      node;
    u32                 prio;
    u32                 sender_tid;
    u64                 payload_off;
    u64                 reply_off;
    u64                 thread_token;
    struct task_struct *sender;
    struct completion   reply_done;
    refcount_t          refs;          /* added by patch 1009 (see audit) */
};

The channel exposes seven ioctls. Six are core to gamma’s hot path; one is for opening a channel during process attach.

ioctl Direction Patch Purpose
NTSYNC_IOC_CREATE_CHANNEL wineserver 1004 Open a new channel, return fd. max_depth caps queued entries.
NTSYNC_IOC_CHANNEL_SEND_PI client 1004 Enqueue + boost dispatcher + block for reply, atomically.
NTSYNC_IOC_CHANNEL_RECV dispatcher 1004 Dequeue highest-prio entry; boost dispatcher to that prio; return metadata.
NTSYNC_IOC_CHANNEL_RECV2 dispatcher 1005 Same as RECV but additionally returns thread_token.
NTSYNC_IOC_CHANNEL_TRY_RECV2 dispatcher 1011 Same payload as RECV2, but non-blocking; used for post-dispatch burst drain.
NTSYNC_IOC_CHANNEL_REPLY dispatcher 1004 Wake the matching entry’s sender; drain our PI boost from that entry; auto-re-boost to the next pending entry’s prio if any.
NTSYNC_IOC_CHANNEL_REGISTER_THREAD / DEREGISTER_THREAD wineserver 1005 Register (tid -> struct thread *) for token pass-through.

The userspace UAPI structs are defined in linux/ntsync.h and fall-back-defined in both dlls/ntdll/unix/server.c:339-347 and server/nspa/shmem_channel.c:60-107 for builds against a kernel header that predates the patches. The fall-back blocks are guarded by #ifndef NTSYNC_IOC_CREATE_CHANNEL, so they activate exactly when the build host’s headers are stale; once the kernel headers carry the definitions, the fall-backs compile away.

Operationally the channel’s policy is strict-priority + FIFO inside each priority class. The rbtree key is (prio_desc, enqueue_seq_asc). A SCHED_FIFO sender at prio 70 always drains before any sender at prio 65; among prio-70 senders they drain in arrival order. SCHED_OTHER senders pass prio = 0 and the kernel routes them at the bottom of the tree.
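The (prio_desc, enqueue_seq_asc) ordering can be sketched as a comparator over that key. This is a userspace model for illustration; the struct and function names are not the kernel's.

```c
#include <stdint.h>

/* Model of the channel rbtree key: strict priority (higher prio
 * drains first), FIFO within a priority class via an ascending
 * enqueue sequence number assigned under channel->lock. */
struct entry_key {
    uint32_t prio;        /* sender's RT prio; 0 = SCHED_OTHER */
    uint64_t enqueue_seq; /* monotonically increasing at enqueue */
};

/* Returns <0 if a drains before b, >0 if b drains before a.
 * Sequence numbers are unique, so equality never occurs. */
static int channel_entry_cmp(const struct entry_key *a,
                             const struct entry_key *b)
{
    if (a->prio != b->prio)
        return a->prio > b->prio ? -1 : 1;  /* higher prio first */
    return a->enqueue_seq < b->enqueue_seq ? -1 : 1;  /* FIFO in class */
}
```

Note how SCHED_OTHER senders (prio 0) sort after every RT sender without any special casing, matching the "bottom of the tree" routing.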


5. Sender State Machine

The client-side entry point is nspa_send_request_channel in dlls/ntdll/unix/server.c:349. The function is invoked from server_call_unlocked (line 442) when all three preconditions hold:

If any precondition fails, server_call_unlocked falls through to the upstream socket path (send_request + wait_reply). This is the ungated, transparent fallback.

The state machine for the gamma path is:

1. memcpy req->u.req into request_shm->u.req
2. for each req->data[i]:
       memcpy into request_shm[after-header]
3. read data->nspa_rt_cached_prio (set by nspa_rt_apply_tid)
   if > 0:
       args.policy = data->nspa_rt_cached_policy
       args.prio   = data->nspa_rt_cached_prio
   else:
       args.policy = 0; args.prio = 0    /* SCHED_OTHER, no boost */
4. args.payload_off = GetCurrentThreadId()
   args.reply_off   = same  (channel is metadata-only)
5. data_ptr  = request_shm + sizeof(req) + request_size
   copy_limit = end-of-shmem - data_ptr
   /* Computed BEFORE the SEND_PI: req->u.req and req->u.reply
      share union storage, so post-reply reads of request_size
      would actually return reply_size. */
6. ioctl SEND_PI            <-- blocks until REPLY
   on EINTR: fall through to read reply (server already wrote it)
   on any other error: return STATUS_INTERNAL_ERROR
7. memcpy request_shm->u.reply -> req->u.reply
8. if reply_size > copy_limit:
       split: copy first copy_limit bytes from shmem
              read remainder via socket fallback (read_reply_data)
   else:
       memcpy reply_size bytes from shmem
9. return req->u.reply.reply_header.error
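The copy_limit arithmetic in steps 5, 7, and 8 can be modeled as a small helper. This is a sketch under assumed sizes; the struct and function names are illustrative, not the actual server.c code.

```c
#include <stddef.h>

/* Model of steps 5/7/8: copy_limit is computed from request_size
 * BEFORE the SEND_PI ioctl, because req->u.req and req->u.reply
 * share union storage and request_size is clobbered once the
 * reply lands in the union. */
struct reply_copy_plan {
    size_t from_shmem;  /* bytes memcpy'd out of request_shm */
    size_t via_socket;  /* remainder fetched via the socket fallback */
};

static struct reply_copy_plan plan_reply_copy(size_t shmem_size,
                                              size_t header_size,
                                              size_t request_size,
                                              size_t reply_size)
{
    /* data_ptr = request_shm + header + request payload;
     * copy_limit = end-of-shmem - data_ptr */
    size_t copy_limit = shmem_size - (header_size + request_size);
    struct reply_copy_plan p;

    if (reply_size > copy_limit) {
        p.from_shmem = copy_limit;               /* split copy */
        p.via_socket = reply_size - copy_limit;  /* read_reply_data() */
    } else {
        p.from_shmem = reply_size;
        p.via_socket = 0;
    }
    return p;
}
```

The split branch is the rare case; most replies fit entirely in the shmem window and never touch the socket.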

Two subtleties worth highlighting: copy_limit (step 5) must be computed before the SEND_PI, because req->u.req and req->u.reply share union storage, so a post-reply read of request_size would actually return reply_size; and EINTR from SEND_PI (step 6) is not a failure, since the server has already written the reply, so the sender falls through and copies it.

The non-RT case (prio = 0) is interesting: the kernel still enqueues the entry at the bottom of the rbtree and wakes the dispatcher, but it skips the boost machinery entirely. SCHED_OTHER clients pay a single ioctl and a single memcpy round-trip – no sched_setscheduler syscalls, no userspace PI bookkeeping. Even on the cold (non-RT) path gamma is cheaper than v1.5.


6. Dispatcher State Machine

The dispatcher pthread is still born detached with explicit SCHED_FIFO attrs when NSPA_SRV_RT_PRIO > 0, but the runtime loop is now selected in layers:

  1. prefer aggregate-wait if patch 1010 is present and NSPA_AGG_WAIT is not set to 0
  2. if aggregate-wait returns -ENOTTY, permanently fall back to the legacy direct receive loop for this dispatcher
  3. inside that loop, prefer CHANNEL_RECV2; if the kernel predates 1005, permanently fall back to CHANNEL_RECV
  4. after each successful dispatch, if patch 1011 is present and NSPA_TRY_RECV2 is not set to 0, issue non-blocking TRY_RECV2 until the channel is empty

The post-1010 loop is structurally:

for (;;) {
    build sources[] from:
        channel object
        uring eventfd (if ring active)
        shutdown eventfd

    ret = ioctl(dev_fd, NTSYNC_IOC_AGGREGATE_WAIT, &agg);
    if (ret < 0 && errno == ENOTTY) {
        agg_supported = 0;
        continue;   /* use legacy path on next iteration */
    }
    if (ret < 0) {
        if (errno == EINTR) continue;
        break;
    }

    if (source == shutdown_efd)
        break;

    if (source == uring_efd) {
        drain eventfd counter;
        pi_mutex_lock(&global_lock);
        nspa_uring_drain(&ctx->uring);
        pi_mutex_unlock(&global_lock);
        continue;
    }

    /* source == channel */
    ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_RECV2, &recv);
    if (ret < 0 && errno == ENOTTY) {
        recv2_state = 0;    /* pre-1005 kernel: use RECV from now on */
        ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_RECV, &recv);
    }
    if (ret == 0)
        dispatch request;
    while (try_recv2_state) {
        ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_TRY_RECV2, &recv);
        if (ret == 0) {
            dispatch request;
            continue;
        }
        if (errno == ENOTTY)
            try_recv2_state = 0;   /* pre-1011 kernel: gate off for good */
        break;                     /* EAGAIN: queue drained, back to AGG_WAIT */
    }
}

The legacy fallback path is still the old direct RECV2 / RECV receive loop. That is now compatibility logic, not the preferred production shape.

Key invariants:

The detached-thread exit property remains the same: destroy wakes the dispatcher, the dispatcher cleans up its own context, and no join is required.


7. Priority-Inheritance Semantics

Gamma’s PI guarantee is the most important property of the design. The promise is:

While a request from sender S (priority P_S) is pending or in flight, the dispatcher pthread runs at priority max(P_dispatcher_base, max {P_S' : S' enqueued or being handled}). There is no observable interval where the dispatcher runs at a lower priority while a higher-priority sender’s entry is queued.

This holds because of three kernel-side properties of NTSYNC_TYPE_CHANNEL:

On post-1010 kernels there is one extra requirement: a dispatcher blocked in NTSYNC_IOC_AGGREGATE_WAIT on the channel source must still be visible to the channel’s SEND_PI wake/boost logic. The production 1010 follow-up (072bfee) is part of gamma’s correctness story for exactly that reason; without it, aggregate-wait would have reintroduced a priority gap on the receive side.

7.1 SEND_PI atomically boosts the dispatcher

When SEND_PI fires, the kernel acquires channel->lock, inserts the entry into the rbtree, and – under the same spinlock – compares the entry’s prio against the current dispatcher boost level. If the new entry is higher prio, it calls apply_event_pi_boost(channel, entry->prio) which raises the dispatcher’s effective prio via the underlying task_struct. The boost happens before SEND_PI sleeps the sender, so by the time the sender is blocked the dispatcher is already running at (at least) the sender’s prio.

7.2 RECV2 re-boosts to the popped entry’s prio

When the dispatcher pops the highest-prio entry, the kernel recalculates the boost cap from the new queue head and the popped entry’s prio. The dispatcher’s boost is “rooted” in the popped entry for the duration of the handler – if a lower-prio sender arrives while the handler runs, it does not raise the dispatcher’s prio; if a higher-prio sender arrives, it does (apply_event_pi_boost is re-entrant in the safe direction).

7.3 REPLY drains the popped entry’s contribution and re-boosts

NTSYNC_IOC_CHANNEL_REPLY is the most subtle ioctl. In one critical section under channel->lock it:

  1. removes the entry from the per-channel “in-flight” list and frees its slot;
  2. completes the entry’s reply_done (waking the sender);
  3. drops the entry’s contribution to the dispatcher’s PI boost;
  4. re-applies a boost from the new queue head if one exists.

Step 4 is what closes the gap. Without it, REPLY would return the dispatcher to base priority for the duration of the next RECV syscall, during which a high-prio sender that arrived during the just-completed handler would be stranded behind the dispatcher’s self-rescheduling. Step 4 stitches the boost forward from one entry to the next inside the same ioctl that wakes the previous sender. This is the deferred-boost mechanism introduced in ntsync patch 1008; gamma was redesigned mid-2026-04 to require it.
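The boost arithmetic across 7.1-7.3 can be modeled in userspace: the dispatcher's effective priority is the max of its base, the queue head's prio, and the in-flight entry's prio, and REPLY swaps the in-flight contribution for the new head's inside one critical section. The real accounting lives in drivers/misc/ntsync.c under channel->lock; all names here are illustrative.

```c
#include <stdint.h>

/* Userspace model of the dispatcher's PI boost state.  SEND_PI
 * raises `head`, RECV2 moves the popped prio into `inflight`,
 * and REPLY clears `inflight` while re-reading `head` -- all of
 * which the kernel does atomically under channel->lock. */
struct boost_model {
    uint32_t base;      /* dispatcher's own SCHED_FIFO prio */
    uint32_t head;      /* prio of highest queued entry, 0 if empty */
    uint32_t inflight;  /* prio of entry being handled, 0 if none */
};

static uint32_t effective_prio(const struct boost_model *m)
{
    uint32_t p = m->base;
    if (m->head > p)     p = m->head;
    if (m->inflight > p) p = m->inflight;
    return p;
}
```

Step 4 of REPLY is exactly the transition from {inflight = old, head = new} to {inflight = 0, head = new} with no intermediate state in which both contributions are dropped.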

7.4 Why kernel-atomic PI is strictly better than userspace PI

The legacy v1.5/v2.4 design had three orthogonal hand-rolled pieces: the boost call itself (sched_setscheduler), the bookkeeping cache (nspa_dispatcher_current_prio), and the unboost call. Any of the three could desync from the others under churn:

Userspace PI failure mode Kernel-atomic equivalent
Boost lands on wrong tid (TID race) Impossible: boost is keyed off the channel’s task_struct pointer, set at dispatcher pthread spawn time
Cache says “boosted to 80” but actual policy is RR/40 Impossible: kernel owns the boost
Two senders racing the cache leave dispatcher unboosted Impossible: apply_event_pi_boost is serialised by channel->lock
Dispatcher exits between cache read and unboost call N/A: dispatcher exit closes the channel; pending sends fail with EBADF

The only remaining consideration is interaction with NTSync’s other PI machinery (events, mutexes). Channels share the same apply_* / drain_* primitives so a dispatcher that holds an event boost from one source and a channel boost from another sees correctly summed priority. We have observed no PI-summing bugs in production since the channel landed.


8. Phase B Lock-Drop Integration

Phase B is the second-most important integration consumer of gamma. It lives in server/nspa/fd_lockdrop.c and reshapes how the dispatcher cooperates with slow filesystem syscalls.

8.1 The problem

The wineserver’s create_file handler ultimately does an openat() syscall against the host filesystem. On a cold-cache disk read this can take tens of milliseconds. With the v1.5 design each dispatcher held only one thread’s global_lock so a slow openat only blocked one client’s queue. With gamma there is one dispatcher per process: a slow openat blocks the entire process’s request queue.

In a DAW, the audio thread issuing a NtQueryPerformanceCounter or a futex syscall lookup is now stuck behind the GUI thread’s multi-millisecond LoadLibrary chain. That is a reliable xrun on drum-track-load-while-playing.

8.2 The Phase B fix

nspa_openat_lockdrop (line 47) reorganises the openat critical section into a “drop, syscall, re-acquire” pattern:

/* Inside server/fd.c create_file_obj path */
...
{
    struct thread *saved_current = current;
    unsigned int saved_error = saved_current->error;
    struct object *fd_ref = grab_object(fd_object);
    struct object *root_ref = root_object ? grab_object(root_object) : NULL;
    int local_errno;

    pi_mutex_unlock(&global_lock);

    unix_fd = do_openat(...);
    local_errno = errno;   /* snapshot: libc calls inside pi_mutex_lock may clobber errno */

    pi_mutex_lock(&global_lock);

    current = saved_current;
    if (saved_current) saved_current->error = saved_error;
    errno = local_errno;
    if (root_ref) release_object(root_ref);
    if (fd_ref)   release_object(fd_ref);
}

While the lock is dropped the dispatcher’s priority is whatever the kernel last boosted it to (the pending sender’s prio). Any other sender – including the audio thread – can have its request popped by a different mechanism… except there isn’t one: the dispatcher is in the middle of this handler. Phase B is therefore narrower than its name suggests: it lets the kernel schedule other processes' threads (and the host’s RT audio path) while we are blocked in openat(), but it does not let other entries in this process’s queue jump ahead.

That sounds like it does nothing useful, but the Linux scheduler’s PI propagation is what makes it work: while we hold global_lock under FIFO 80 (boosted), other RT threads in this process are at their own FIFO prio (typically 80 for the audio thread), and they are CPU-blocked behind us only insofar as we hold the CPU. Dropping the lock lets us also be IO-blocked, at which point the audio thread can preempt us via the kernel scheduler. The dispatcher is still single-threaded with respect to gamma’s own queue.

8.3 Save/restore discipline

Several pieces of per-request state are global-ish and must be preserved across the lock-drop window:

State Why it must be saved
current (per-request thread pointer; server/request.c:121) Another handler running in our unlocked window will overwrite it
current->error Belongs to our request; read by the reply path. Must not pick up a stranger’s error
fd_object refcount Just-allocated by alloc_fd_object, only the caller knows it; grab_object makes the unlocked window bullet-proof
root_object refcount Held by caller’s handler; pinning means a concurrent close-handle of root cannot free it during our syscall
errno Per-thread, so naturally preserved; we still snapshot to local_errno to insulate from libc calls in pi_mutex_lock etc.

The restore order is the inverse: re-lock, restore current, restore current->error, drop refs.
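The discipline can be exercised with a toy model: `current_thread` stands in for the wineserver's global `current`, a "stranger" handler running inside the lock-drop window clobbers it, and the restore sequence puts the outer request's view back. All names here are illustrative, not Wine's.

```c
#include <stddef.h>

/* Toy model of the 8.3 save/restore discipline around a
 * drop/syscall/re-acquire window. */
struct thread_model { unsigned int error; };

static struct thread_model *current_thread;  /* models `current` */

static unsigned int lockdrop_window(struct thread_model *stranger,
                                    unsigned int stranger_error)
{
    struct thread_model *saved_current = current_thread;
    unsigned int saved_error = saved_current->error;

    /* --- global_lock dropped: another handler runs here --- */
    current_thread = stranger;
    current_thread->error = stranger_error;
    /* --- global_lock re-acquired --- */

    current_thread = saved_current;       /* restore current first... */
    current_thread->error = saved_error;  /* ...then its error */
    return current_thread->error;
}
```

The ordering matters: restoring `error` before `current` would write the saved error into the stranger's thread.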

8.4 Gating

Phase B is default-on as of 2026-04-26, gated by NSPA_OPENFD_LOCKDROP=0 for A/B testing or as a panic switch. It originally shipped default-off after a host lockup on the first validation run; the lockup was eventually traced to the ntsync driver’s kfree-under-raw_spinlock_t bug (fixed in ntsync-patches/1006-ntsync-rt-alloc-hoist.patch), not Phase B itself. Re-validated post-1006 with Ableton drum-track-load-while-playing – the file-open-burst workload Phase B targets – with clean results.

The cached env-var read at lines 67-79 follows the same one-shot getenv pattern as the other gamma gates (NSPA_DISPATCHER_USE_TOKEN, NSPA_DISABLE_EPOLL).
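The one-shot getenv pattern shared by those gates can be sketched as follows: read the variable exactly once, cache the result, and let the hot path only ever test the cache. The function name and the -1 "unread" sentinel are illustrative, not the shipped code.

```c
#include <stdlib.h>

/* One-shot env-var gate: getenv() runs at most once per cache.
 * An unset or empty variable takes the compiled-in default;
 * a leading '0' disables, anything else enables. */
static int nspa_gate_enabled(const char *name, int default_on, int *cache)
{
    if (*cache < 0) {  /* -1 means "not read yet" */
        const char *v = getenv(name);
        *cache = (v && *v) ? (v[0] != '0') : default_on;
    }
    return *cache;
}
```

Because the cache is written once and then only read, the pattern needs no locking on the dispatcher's hot path.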


9. Thread-Token Pass-Through (T1/T2/T3)

The thread-token mechanism is a steady-state CPU optimisation introduced by ntsync patch 1005 and consumed by the dispatcher. It removes a hash-table lookup on the dispatcher’s hot path.

9.1 The bottleneck

Pre-token, the dispatcher mapped payload_off (which is the sender’s Wine thread_id_t) to a struct thread * via get_thread_from_id, which walks a hash table under thread_id_lock. Per the perf trace from 2026-04-26 this call was ~10% of dispatcher CPU in mixed-load steady state. Eliminating it is worth the kernel-side complexity.

9.2 The protocol

The optimisation is split across three deployment phases:

Phase Patch What changes
T1 1005 kernel patch Channel object grows a (tid -> token) hash; new ioctls REGISTER_THREAD / DEREGISTER_THREAD / RECV2
T2 wineserver plumbing Wineserver registers (unix_tid -> (struct thread *)) from req_init_first_thread and req_init_thread; deregisters from destroy_thread
T3 dispatcher consumes token channel_dispatcher calls RECV2 and uses the token directly, skipping get_thread_from_id when it is non-zero

T1 and T2 ship behaviour-neutral (the kernel stamps tokens and the wineserver registers them, but nobody reads the token). T3 flips the dispatcher to consume them and is gated NSPA_DISPATCHER_USE_TOKEN (default on, set to 0 to fall back to the legacy get_thread_from_id lookup for A/B testing).

9.3 Lifetime safety

The token is (struct thread *) cast to __u64. Dereferencing it in the dispatcher requires the registration to happen before any client send that would resolve to that thread, and the deregistration to happen after the last reply. Both invariants are satisfied naturally:

The dispatcher does not take a ref on the token-resolved thread (line 222 in shmem_channel.c: if (!recv.thread_token) release_object(thread)). It “borrows” the registration’s ref. That is sound because the registration’s ref is held until deregister-after-last-reply, and the dispatcher is the entity that processes those replies – the deregister cannot race with the dispatcher doing the work.

If a sender’s thread happens to be unregistered (very early pre-init traffic, or a build against an old kernel without 1005), recv.thread_token is zero and the dispatcher falls back to get_thread_from_id + release_object. The fallback path is identical to the pre-token behaviour and is exercised every time RECV2 returns ENOTTY (lines 161-166).
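The borrow-vs-own split can be expressed as a small resolution step. This is a sketch of the T3 consumption logic under assumed types; `struct thread` is opaque here and `resolve_sender` is an illustrative stand-in, not the shmem_channel.c function.

```c
#include <stddef.h>
#include <stdint.h>

struct thread;  /* opaque: stands in for wineserver's struct thread */

struct resolved_thread {
    struct thread *thread;
    int owns_ref;  /* 1 => caller must release_object() afterwards */
};

/* A non-zero thread_token is a borrowed (struct thread *) from the
 * kernel registry and needs no release; a zero token falls back to
 * the hash lookup, whose result the caller owns and must release. */
static struct resolved_thread
resolve_sender(uint64_t thread_token,
               struct thread *(*lookup_by_tid)(uint32_t tid), uint32_t tid)
{
    struct resolved_thread r;
    if (thread_token) {
        r.thread = (struct thread *)(uintptr_t)thread_token;
        r.owns_ref = 0;                 /* borrow the registration's ref */
    } else {
        r.thread = lookup_by_tid(tid);  /* legacy get_thread_from_id path */
        r.owns_ref = 1;
    }
    return r;
}
```

Keeping the ownership flag next to the pointer makes the asymmetric release at reply time (line 222's conditional release_object) hard to get wrong.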

9.4 Performance

Per the 2026-04-26 perf run, with T3 enabled:


10. NT Semantics Preservation

A redesign of the IPC fast path must not change observable Win32 semantics. Two ordering guarantees must be preserved:

10.1 Per-thread request ordering

Win32 guarantees that within a single thread, request k is serialised before request k+1. Gamma preserves this trivially because every request blocks the issuing thread until its reply is delivered (SEND_PI returns only after REPLY). Thread T cannot have request k+1 outstanding while k is still in flight; the kernel-side rbtree never holds two entries from the same thread simultaneously.

10.2 Cross-thread ordering

Win32 is silent on cross-thread request ordering – threads race the wineserver, and whichever request reaches the server first wins. The upstream socket dispatcher serialises by epoll-readiness order (roughly arrival order plus kernel scheduling latency). The v1.5 per-thread-pthread design serialised by “first dispatcher pthread to acquire global_lock” (essentially random under contention). Gamma serialises by strict sender priority, FIFO inside priority.

This is strictly stronger than either legacy design. An app that relied on a specific cross-thread ordering would already be racy on upstream Wine; gamma’s priority-ordered shape is observationally indistinguishable from a faster machine reaching the upstream ordering. Notably, gamma never violates a happens-before relationship the app could observe through synchronisation primitives, because those primitives also flow through the wineserver and are subject to the same ordering – a high-prio thread’s signal arrives at the wineserver in priority order along with everyone else’s traffic.

10.3 Reply data shape

The reply is byte-identical to the upstream socket reply. Same reply_header.error codes, same payload layout, same handle allocations. Apps that probe wineserver-internal state (none should, but Wine’s own conformance tests do) see the same values.


11. Validation and Performance

11.1 Functional

11.2 Performance

Ableton 30s busy capture

Symbol (wineserver-relative) Before After Delta
channel_dispatcher 14.51% 0.70% −13.81pp / −95%
main_loop_epoll 7.24% 2.68% −4.56pp
nspa_queue_bypass_shm 2.77% absent inlined into call sites
req_get_update_region 4.92% absent gone from top symbols
nspa_redraw_ring_drain 2.88% absent gone from top symbols

System-wide samples: 38,588 -> 19,415 per 30s.

This profile shift is the combined effect of 1d85c558 (dispatcher ACQ_REL fences + inline accessor) and 01d528f5 (TRY_RECV2 burst-drain) on top of the 1011 kernel primitive.

Post-ship hot-path follow-ons

Commit Implemented change Exact observed effect
c0f5c515cd7 + 2870c9629ce gate mark_block_* poison and the paired valgrind annotations behind NSPA_DEBUG_POISON_ALLOCS mark_block_uninitialized was sampled at 1.34% wineserver-relative under dispatcher-burst; the combined change reclaims the full 1.34pp and drops the symbol out of the top-20
0802dadc750 inline read_request_shm at the dispatcher call site read_request_shm was sampled at 3.55% wineserver-relative under dispatcher-burst; after inlining it disappears from the symbol table and saves ~1pp more on the dispatcher path

These follow-ons do not change the dispatcher architecture. They remove residual per-RPC overhead that remained after the bigger structural landing (AGG_WAIT, TRY_RECV2, inline queue accessor, lighter fences) was already in place.

PE-side dispatcher-burst A/B

Metric TRY_RECV2 on TRY_RECV2 off Delta
burst ops/sec (wall) 841,765 555,567 +52% / 1.5x
burst worst max ns 23,014,325 31,843,082 −28%
steady avg ns 35,202 33,405 flat (no burst)

Steady-state is flat both ways, exactly as designed. The win is concentrated in burst load where the dispatcher can drain N queued entries per AGG_WAIT wake instead of paying N round-trips.

11.3 Configuration validated for production

For the 2026-04-30 production validation:


12. Predecessors and Their Failure Modes

12.1 Alpha (v1.5) – per-thread pthread + userspace sched_setscheduler

The original Torge Matthies forward-port spawned one dispatcher pthread per client thread. Each pthread owned a thread-private request_shm page and watched a futex word inside it. When the client wrote a request, it raised the word and FUTEX_WAKE-ed the dispatcher; the dispatcher locked global_lock, ran the handler, wrote the reply, and lowered the word so the client’s FUTEX_WAIT returned.

Priority inheritance was bolted on in userspace. Before sending, the client did sched_setscheduler(dispatcher_tid, RT_POLICY, our_prio) to boost the dispatcher to the caller’s level. After reply, the dispatcher reset its own scheduler attrs.

The pain points:

12.2 Beta (v2.4) – cached-CAS + manual prio cache

v2.4 narrowed the steady-state cost: senders cached their RT prio in ntdll_thread_data, CASed a request-state word, issued a single FUTEX_WAKE, and fell back to sched_setscheduler only when the cached dispatcher prio was below ours. This eliminated four syscalls per request on the steady-state hot path but left every architectural problem of v1.5 in place: still one dispatcher per thread, still the userspace TID-read-vs-setscheduler race, still hand-rolled PI arithmetic. The “cache” added a third place where boost state could desync.

12.3 The case for moving boost into the kernel

Once NTSync gained an event PI primitive (patch 1006, eventually deferred-boost in 1008), it was clear that PI for IPC could ride the same machinery. The legacy machinery had three structural problems no amount of userspace engineering could fix:

| Structural problem | Gamma resolution |
| --- | --- |
| N pthreads per process contending on global_lock | One dispatcher per process; contention is O(1) per process |
| TID-read vs sched_setscheduler race window | Kernel boosts dispatcher inside the same syscall that enqueues |
| Userspace PI accounting drift | Kernel owns the boost state; userspace never reads or writes it |

Gamma is the smallest design that closes all three.


13. Bug History and Audits

Gamma has been validated under sustained stress and through several KASAN-caught bugs. Tracking them here for completeness.

13.1 The 2026-04-26 read-only audit (Wine commit 75a3c534d5f)

A static audit of server/nspa/shmem_channel.c found no latent correctness bugs after the baf088c290f refcount + process-membership patch. The handler runs under global_lock exactly as v1.5 did, so handler-internal correctness is inherited from upstream Wine. The dispatcher loop has no spin-loops, no missing locks, and no lifetime races. The full audit lives at wine/nspa/docs/gamma-dispatcher-audit-and-split-plan.md.

13.2 ntsync patch 1007 – channel exclusive recv (priority inversion)

Pre-1007, the channel’s RECV path used a non-exclusive wake_up_interruptible_all on enqueue, which woke every waiter and let the kernel pick one. Under multiple-dispatcher scenarios (which gamma does not actually use, but the test-channel-stress harness does) the wake-all caused a real priority inversion: a low-prio waiter could win the race and delay the high-prio waiter behind a sleep. Patch 1007 narrowed RECV to wait_event_interruptible_exclusive + wake_up_interruptible. Audit doc at wine/nspa/docs/ntsync-rt-audit.md.

13.3 ntsync patch 1008 – EVENT_SET_PI deferred boost

The pre-1008 EVENT_SET_PI boost was applied immediately under raw_spinlock_t, which blocked other RT operations. 1008 deferred the boost to a per-CPU pi_work pool drained outside the spinlock. Gamma channel REPLY uses the same machinery via consume_event_pi_boost / apply_event_pi_boost – the deferred-boost queue is what makes “drain previous, re-boost from new head” atomic-feeling without holding the raw spinlock through the actual task_struct boost call.

13.4 ntsync patch 1009 – channel_entry refcount UAF

KASAN caught a use-after-free on struct ntsync_channel_entry in test-channel-stress: a REPLY’s wake_up_all raced with SEND_PI’s kfree(entry). Same bug class as the rolled-back 1008/1009 wave. The clean fix was a refcount_t refs on ntsync_channel_entry, incremented on enqueue and decremented at REPLY completion and at sender wakeup; ~15 LOC. Patch 1009 in tree. No production user has ever observed this bug (gamma has only one dispatcher per channel, which keeps the path single-consumer); but the channel UAPI is shared with other potential consumers and the fix is unconditional.

13.5 The lockup audit (2026-04-27)

After the ~370M-ops ntsync validation proved the kernel sound, the lockup investigation moved to wine-NSPA userspace. The audit doc at wine/nspa/docs/wine-nspa-lockup-audit-20260427.md covers F1-F9 wineserver-side findings and MR1-MR8 msg_ring findings; gamma itself was scored clean. The shipped fixes (MR1 reply-slot ABA, MR2 FUTEX_PRIVATE on shared memfd, MR4 POST wake-loss) are all in dlls/win32u/nspa/msg_ring.c and orthogonal to gamma.

13.6 Don’t-shotgun-the-audit feedback

A separate behavioural-feedback note (feedback_dont_shotgun_audit_into_unfound_bug) documents that the ntsync 1007-1011 wave originally shipped five patches as “audit findings” without ever tracing the original EVENT_SET_PI slab UAF; the wave was rolled back, reduced to the four genuinely needed fixes (1006/1007/1008/1009), and re-shipped. The lesson: KASAN / trace first, audit second. Gamma’s design is small enough that this discipline applies to its own future evolution as well.


14. References

14.1 Wine-NSPA source

| File | Lines | Role |
| --- | --- | --- |
| wine/dlls/ntdll/unix/server.c | 311-436 | Sender shim nspa_send_request_channel + UAPI fallback |
| wine/dlls/ntdll/unix/server.c | 442-461 | server_call_unlocked gating logic |
| wine/server/nspa/shmem_channel.c | 60-139 | UAPI fallback for pre-1005 / pre-1010 kernel headers |
| wine/server/nspa/shmem_channel.c | 158-390 | dispatcher context + aggregate-wait loop + legacy fallback loop |
| wine/server/nspa/shmem_channel.c | 474-581 | dispatcher create/destroy path, shutdown eventfd lifetime |
| wine/server/nspa/shmem_channel.c | 310-340 | T2 thread-token register/deregister |
| wine/server/nspa/uring.h | | per-process nspa_uring_instance API consumed by Phase 2 / Phase 3 |
| wine/server/nspa/shmem_channel.h | 1-48 | Public header |
| wine/server/nspa/fd_lockdrop.c | 47-125 | Phase B nspa_openat_lockdrop – lock-drop integration |
| wine/nspa/docs/gamma-dispatcher-audit-and-split-plan.md | | Audit + future router/handler split plan |
| wine/nspa/docs/wine-nspa-lockup-audit-20260427.md | | F1-F9 + MR1-MR8 lockup-investigation findings |
| wine/nspa/docs/ntsync-rt-audit.md | | ntsync 1007/1008/1009 audit |

14.2 Kernel source

| File | Role |
| --- | --- |
| drivers/misc/ntsync.c | Channel object plus aggregate-wait registration / wake path |
| ntsync-patches/1004-ntsync-channel.patch | Channel object + core ioctls |
| ntsync-patches/1005-ntsync-channel-thread-token.patch | RECV2 + REGISTER_THREAD + DEREGISTER_THREAD |
| ntsync-patches/1006-ntsync-rt-alloc-hoist.patch | kfree-under-raw_spinlock fix; unblocked Phase B default-on |
| ntsync-patches/1007-ntsync-channel-exclusive-recv.patch | Channel exclusive recv – priority inversion fix |
| ntsync-patches/1008-ntsync-event-set-pi-deferred-boost.patch | Deferred boost machinery (consumed by REPLY) |
| ntsync-patches/1009-ntsync-channel-entry-refcount.patch | refcount_t on ntsync_channel_entry (KASAN UAF fix) |
| ntsync-patches/1010-ntsync-aggregate-wait.patch | heterogeneous wait primitive used by the post-1010 dispatcher |
| ntsync-patches/1011-ntsync-channel-try-recv2.patch | non-blocking RECV2 used for post-dispatch burst drain |

14.3 Memory / handoff documents

| Doc | Topic |
| --- | --- |
| project_gamma_dispatcher_audit_and_split_plan.md | 2026-04-26 audit + T1/T2/T3 + router/handler split plan |
| project_msg_ring_v2_mr1_mr2_mr4_shipped_20260427.md | MR1/MR2/MR4 + Ableton run-3 config |
| project_ntsync_session_20260427_results.md | 30M-ops cumulative validation, 4 bugs fixed |
| project_ntsync_kfree_under_raw_spinlock.md | 1006 alloc-hoist (unblocked Phase B default-on) |
| feedback_dont_shotgun_audit_into_unfound_bug.md | KASAN-first / audit-second discipline |

14.4 Predecessor docs

The published shmem-ipc.gen.html describes v1.5 (per-thread dispatcher) and v2.4 (cached-CAS + manual prio cache) and is superseded by this document. It is retained for historical reference and for the comparison diagrams. The CS-PI design (cs-pi.gen.html) is orthogonal to gamma and continues to apply unchanged: gamma improves the IPC path; CS-PI improves the in-process critical-section path; they coexist without interaction.