Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync channels + aggregate-wait + TRY_RECV2 | 2026-05-01
Author: Jordan Johnston
Status: production dispatcher architecture; aggregate-wait Phase 3, post-1011 TRY_RECV2 burst-drain, and the immediate hot-path tuning follow-ons are all part of the shipped path.
This page explains the current wineserver request path for a Wine process: how requests enter the gamma channel, how the dispatcher owns the reply path, and how the post-1010 aggregate-wait loop fits into that design.
Gamma is the third generation of Wine-NSPA’s client-to-wineserver IPC fast
path. It replaces the v1.5 per-thread pthread dispatcher (and the v2.4
cached-CAS / futex-wake hybrid that briefly extended it) with a single
per-process kernel-mediated request channel built on top of the
NTSync NTSYNC_TYPE_CHANNEL object.
Every Wine client process has exactly one channel fd, opened by the
wineserver during process attach and shipped to the client via
SCM_RIGHTS in the init_first_thread reply. Client threads issue
NTSYNC_IOC_CHANNEL_SEND_PI to atomically enqueue a request, boost the
dispatcher pthread to the sender’s priority, and block for reply, all
in one syscall.
The wineserver now runs one dispatcher context per client process, not just one bare receive loop. That context owns:
- the channel fd,
- the shutdown eventfd, and
- the per-process nspa_uring_instance.

On post-1010 kernels the dispatcher blocks in
NTSYNC_IOC_AGGREGATE_WAIT over (channel object, uring eventfd if
active, shutdown eventfd), follows a channel wake with
NTSYNC_IOC_CHANNEL_RECV2, runs the existing read_request_shm
handler under global_lock, and calls NTSYNC_IOC_CHANNEL_REPLY to
wake the originator and drain its PI boost. On 1011 kernels, if
NSPA_TRY_RECV2 is left at its default-on setting, the dispatcher
then issues NTSYNC_IOC_CHANNEL_TRY_RECV2 in a tight loop to drain any
additional ready entries from the same wake. On pre-1010 kernels it
falls back permanently to the legacy direct CHANNEL_RECV2 / RECV
loop for that dispatcher, and on pre-1011 kernels the burst-drain
feature gates itself off via -ENOTTY.
The key win over the legacy designs: priority inheritance is now
kernel-atomic. There is no userspace TID-read-vs-sched_setscheduler
race window, no pthread_setschedparam call against a thread that may
have already exited, and no userspace bookkeeping of “who is currently
boosted to what”. The kernel’s apply_event_pi_boost /
consume_event_pi_boost machinery (introduced in ntsync patch 1008
deferred-boost) handles all of it inside the same lock that orders the
queue.
The published shmem-ipc.gen.html describes v1.5 and v2.4. That
document is superseded by this one. Gamma plus the aggregate-wait
dispatcher loop is the architecture in production today.
The gamma path involves five cooperating components:

| Component | Location | Role |
|---|---|---|
| Kernel channel object | drivers/misc/ntsync.c | Priority rbtree of pending entries; SEND_PI / RECV2 / TRY_RECV2 / REPLY PI machinery |
| Aggregate-wait primitive | drivers/misc/ntsync.c + patch 1010 | Heterogeneous wait over channel object + fd sources |
| Sender shim | dlls/ntdll/unix/server.c | nspa_send_request_channel: copy header, SEND_PI, copy reply |
| Dispatcher context | server/nspa/shmem_channel.c | channel fd + shutdown eventfd + per-process nspa_uring_instance |
| Per-thread shmem | unchanged from v1.5 | Holds request payload and reply payload (zero-copy) |
The channel fd is created in process attach, the dispatcher context is
allocated alongside it, the detached pthread is spawned with explicit RT
scheduler attrs when NSPA_SRV_RT_PRIO > 0, and the channel fd is
shipped to the client over SCM_RIGHTS alongside the existing
per-thread request_shm fds in the init_first_thread reply. The
client stashes it in nspa_request_channel_fd and from then on uses it
for every server_call_unlocked whose request fits in the per-thread
shmem window.
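The SCM_RIGHTS leg of that exchange is ordinary POSIX fd passing. A minimal, self-contained sketch of the receive side is below; the function name and the surrounding reply handling are illustrative only, not Wine's actual init_first_thread plumbing:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Illustrative only: pull one passed fd out of the ancillary data that
 * rides alongside a reply on a unix socket. */
static int recv_reply_with_fd( int sock, void *reply, size_t reply_size )
{
    char control[CMSG_SPACE( sizeof(int) )];
    struct iovec iov = { .iov_base = reply, .iov_len = reply_size };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = control, .msg_controllen = sizeof(control) };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg( sock, &msg, 0 ) < 0) return -1;
    for (cmsg = CMSG_FIRSTHDR( &msg ); cmsg; cmsg = CMSG_NXTHDR( &msg, cmsg ))
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
            memcpy( &fd, CMSG_DATA( cmsg ), sizeof(int) );
    return fd;  /* the client would stash this in nspa_request_channel_fd */
}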
Post-1010/1011 gamma is no longer just a blocking RECV2 loop. The
dispatcher waits on three sources and selects the work type from the
aggregate-wait result; on 1011 kernels it can then keep draining ready
channel work with TRY_RECV2 before it sleeps again.
For one request, the payload stays in request_shm while the channel
only carries scheduling metadata and reply ownership.
The kernel-side enqueue/boost in SEND_PI and the unboost-and-reboost inside the kernel REPLY handler are atomic with respect to each other under the channel’s internal spinlock. There is no observable interval where the dispatcher is running unboosted while another high-prio entry sits ready in the queue.
The following inline SVG shows a burst-drain lifecycle through gamma.
Two senders are shown at differing priorities to illustrate the
rbtree’s strict-priority ordering, REPLY’s automatic re-boost, and the
1011 TRY_RECV2 follow-on that keeps draining ready work without
returning to AGG_WAIT.
The two-sender scenario shows the property that matters on 1011:
between B’s REPLY and the dispatcher’s TRY_RECV2, the dispatcher
stays at FIFO 80 because the kernel re-boosted from the new
queue head atomically inside REPLY, and the dispatcher can keep
draining ready work without returning to AGG_WAIT. The legacy v1.5
design would have unboosted the dispatcher to 64 at unboost-time and
then re-boosted to 80 only when A’s sched_setscheduler landed –
which would have raced with the dispatcher’s own RECV-side runqueue
insertion. Gamma closes that gap by construction, and 1011 removes the
extra wake/round-trip once the next entry is already queued.
The gamma redesign was scoped tightly:
- No router/handler split yet (that is deferred to project_gamma_dispatcher_audit_and_split_plan.md). Just one pthread that drains the channel sequentially.
- The per-thread request_shm page is kept exactly as-is. The channel only carries metadata (TID + priority), never request data.

The gating env vars on the current production path are:
- NSPA_DISPATCHER_USE_TOKEN=0 – A/B for T3 thread-token consumption
- NSPA_AGG_WAIT=0 – opt out of the post-1010 aggregate-wait loop and force the old direct CHANNEL_RECV2 path for that dispatcher
- NSPA_TRY_RECV2=0 – keep one dequeue per wake even on 1011 kernels

Gamma itself remains the default transport whenever the channel ioctls are present.
The kernel side lives in drivers/misc/ntsync.c (Linux-NSPA tree at
/home/ninez/pkgbuilds/Linux-NSPA-pkgbuild/linux-nspa-6.19.11-1.src/linux-nspa/src/linux-6.19.11/drivers/misc/ntsync.c,
lines 1190-1494 for the channel object). Each NTSync channel is:
struct ntsync_channel {
struct ntsync_obj obj; /* base */
spinlock_t lock; /* serialises queue + boost state */
struct rb_root entries; /* priority-ordered by entry->prio */
u32 max_depth;
struct hlist_head thread_tokens;/* (tid -> struct thread *) registry */
...
};
struct ntsync_channel_entry {
struct rb_node node;
u32 prio;
u32 sender_tid;
u64 payload_off;
u64 reply_off;
u64 thread_token;
struct task_struct *sender;
struct completion reply_done;
refcount_t refs; /* added by patch 1009 (see audit) */
};
The channel exposes eight ioctls. Five sit on gamma’s hot path (SEND_PI, RECV, RECV2, TRY_RECV2, REPLY); CREATE_CHANNEL runs once at process attach, and REGISTER_THREAD / DEREGISTER_THREAD run at thread init and destroy.
| ioctl | Direction | Patch | Purpose |
|---|---|---|---|
| NTSYNC_IOC_CREATE_CHANNEL | wineserver | 1004 | Open a new channel, return fd. max_depth caps queued entries. |
| NTSYNC_IOC_CHANNEL_SEND_PI | client | 1004 | Enqueue + boost dispatcher + block for reply, atomically. |
| NTSYNC_IOC_CHANNEL_RECV | dispatcher | 1004 | Dequeue highest-prio entry; boost dispatcher to that prio; return metadata. |
| NTSYNC_IOC_CHANNEL_RECV2 | dispatcher | 1005 | Same as RECV but additionally returns thread_token. |
| NTSYNC_IOC_CHANNEL_TRY_RECV2 | dispatcher | 1011 | Same payload as RECV2, but non-blocking; used for post-dispatch burst drain. |
| NTSYNC_IOC_CHANNEL_REPLY | dispatcher | 1004 | Wake the matching entry’s sender; drain our PI boost from that entry; auto-re-boost to the next pending entry’s prio if any. |
| NTSYNC_IOC_CHANNEL_REGISTER_THREAD / DEREGISTER_THREAD | wineserver | 1005 | Register (tid -> struct thread *) for token pass-through. |
The userspace UAPI structs are defined in linux/ntsync.h and
fall-back-defined in both dlls/ntdll/unix/server.c:339-347 and
server/nspa/shmem_channel.c:60-107 for clients running against a
kernel header that predates the patches. The fall-back blocks
#ifndef NTSYNC_IOC_CREATE_CHANNEL so they activate exactly when the
build host’s headers are stale; once the kernel headers carry the
definitions the fall-back is silently ignored.
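The shape of those fall-back blocks is the usual #ifndef-guarded UAPI mirror. The sketch below shows the pattern only; the struct layout and ioctl numbers here are placeholders, not the real values from linux/ntsync.h:

#ifndef NTSYNC_IOC_CREATE_CHANNEL
/* Placeholder mirror of the channel UAPI; field names follow the args used
 * by nspa_send_request_channel, but the layout and ioctl numbers below are
 * illustrative, not the shipped definitions. */
struct ntsync_channel_send_args
{
    __u32 policy;       /* sender's cached RT policy, 0 for SCHED_OTHER */
    __u32 prio;         /* sender's cached RT prio, 0 = no boost        */
    __u64 payload_off;  /* gamma passes the sender's thread id here     */
    __u64 reply_off;
};
#define NTSYNC_IOC_CREATE_CHANNEL  _IOWR('N', 0x90, __u32)                           /* placeholder */
#define NTSYNC_IOC_CHANNEL_SEND_PI _IOWR('N', 0x91, struct ntsync_channel_send_args) /* placeholder */
#endif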
Operationally the channel’s policy is strict-priority + FIFO inside
each priority class. The rbtree key is (prio_desc, enqueue_seq_asc).
A SCHED_FIFO sender at prio 70 always drains before any sender at
prio 65; among prio-70 senders they drain in arrival order.
SCHED_OTHER senders pass prio = 0 and the kernel routes them at
the bottom of the tree.
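As a concrete illustration of that ordering policy (not the driver’s actual insert path; the enqueue_seq field and helper names below are assumptions), the insertion comparator looks like this:

/* Strict priority, FIFO inside a priority class. */
static bool channel_entry_before( const struct ntsync_channel_entry *a,
                                  const struct ntsync_channel_entry *b )
{
    if (a->prio != b->prio)
        return a->prio > b->prio;          /* higher prio drains first        */
    return a->enqueue_seq < b->enqueue_seq; /* equal prio drains arrival order */
}

static void channel_insert_entry( struct ntsync_channel *chan,
                                  struct ntsync_channel_entry *entry )
{
    struct rb_node **link = &chan->entries.rb_node, *parent = NULL;

    while (*link)
    {
        struct ntsync_channel_entry *cur =
            rb_entry( *link, struct ntsync_channel_entry, node );
        parent = *link;
        link = channel_entry_before( entry, cur ) ? &(*link)->rb_left
                                                  : &(*link)->rb_right;
    }
    rb_link_node( &entry->node, parent, link );
    rb_insert_color( &entry->node, &chan->entries );
}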
The client-side entry point is nspa_send_request_channel in
dlls/ntdll/unix/server.c:349. The function is invoked from
server_call_unlocked (line 442) when all three preconditions hold:
- nspa_request_channel_fd >= 0 – the channel was successfully opened by the wineserver and the fd survived the SCM_RIGHTS exchange;
- ntdll_get_thread_data()->request_shm is non-NULL – per-thread shmem is mapped (set up during init_thread);
- sizeof(req->u.req) + req->u.req.request_header.request_size < NSPA_REQUEST_SHM_SIZE – the request fits in the zero-copy window.

If any precondition fails, server_call_unlocked falls through to the
upstream socket path (send_request + wait_reply). This is the
ungated, transparent fallback.
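Condensed into one predicate, the gate looks roughly like this (nspa_can_use_channel is a name made up for this document; the shipped check is written inline in server_call_unlocked):

static inline int nspa_can_use_channel( const struct __server_request_info *req )
{
    return nspa_request_channel_fd >= 0 &&                  /* channel fd arrived via SCM_RIGHTS */
           ntdll_get_thread_data()->request_shm != NULL &&  /* per-thread shmem is mapped        */
           sizeof(req->u.req) + req->u.req.request_header.request_size
               < NSPA_REQUEST_SHM_SIZE;                     /* fits the zero-copy window         */
}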
The state machine for the gamma path is:
1. memcpy req->u.req into request_shm->u.req
2. for each req->data[i]:
memcpy into request_shm[after-header]
3. read data->nspa_rt_cached_prio (set by nspa_rt_apply_tid)
if > 0:
args.policy = data->nspa_rt_cached_policy
args.prio = data->nspa_rt_cached_prio
else:
args.policy = 0; args.prio = 0 /* SCHED_OTHER, no boost */
4. args.payload_off = GetCurrentThreadId()
args.reply_off = same (channel is metadata-only)
5. data_ptr = request_shm + sizeof(req) + request_size
copy_limit = end-of-shmem - data_ptr
/* Computed BEFORE the SEND_PI: req->u.req and req->u.reply
share union storage, so post-reply reads of request_size
would actually return reply_size. */
6. ioctl SEND_PI <-- blocks until REPLY
on EINTR: fall through to read reply (server already wrote it)
on any other error: return STATUS_INTERNAL_ERROR
7. memcpy request_shm->u.reply -> req->u.reply
8. if reply_size > copy_limit:
split: copy first copy_limit bytes from shmem
read remainder via socket fallback (read_reply_data)
else:
memcpy reply_size bytes from shmem
9. return req->u.reply.reply_header.error
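The step-5 comment about union aliasing is worth making concrete. A stand-alone toy (simplified layouts, not Wine’s generated headers) shows why a post-reply read of request_size is meaningless:

#include <stdio.h>

struct request_header { unsigned int req;   unsigned int request_size; };
struct reply_header   { unsigned int error; unsigned int reply_size;  };

union msg { struct request_header req; struct reply_header reply; };

int main( void )
{
    union msg m;
    m.req.request_size = 64;   /* what the sender wrote before SEND_PI */
    m.reply.reply_size = 8;    /* what the server's reply overwrote    */
    /* Same byte offset in the union: this prints 8, not 64. */
    printf( "request_size now reads %u\n", m.req.request_size );
    return 0;
}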
Two subtleties worth highlighting:
- request_shm->u.req and request_shm->u.reply share union storage; reading request_header.request_size post-reply actually reads reply_header.reply_size (same byte offset in the C union) and drives data_ptr to the wrong place. That is why copy_limit is computed before SEND_PI.
- On EINTR from SEND_PI the server has already written the reply into request_shm. We fall through to copy it out as if SEND_PI had returned 0.

The non-RT case (prio = 0) is interesting: the kernel still enqueues
the entry at the bottom of the rbtree and wakes the dispatcher, but it
skips the boost machinery entirely. SCHED_OTHER clients pay a single
ioctl and a single memcpy round-trip – no sched_setscheduler
syscalls, no userspace PI bookkeeping. Even on the cold (non-RT) path
gamma is cheaper than v1.5.
The dispatcher pthread is still born detached with explicit
SCHED_FIFO attrs when NSPA_SRV_RT_PRIO > 0, but the runtime loop is
now selected in layers:
- If NSPA_AGG_WAIT is not set to 0, use the post-1010 aggregate-wait loop; on -ENOTTY, permanently fall back to the legacy direct receive loop for this dispatcher.
- Within either loop, prefer CHANNEL_RECV2; if the kernel predates 1005, permanently fall back to CHANNEL_RECV.
- After each dispatched request, if NSPA_TRY_RECV2 is not set to 0, issue non-blocking TRY_RECV2 until the channel is empty.

The post-1010 loop is structurally:
for (;;) {
build sources[] from:
channel object
uring eventfd (if ring active)
shutdown eventfd
ret = ioctl(dev_fd, NTSYNC_IOC_AGGREGATE_WAIT, &agg);
if (ret < 0 && errno == ENOTTY) {
agg_supported = 0;
continue; /* use legacy path on next iteration */
}
if (ret < 0) {
if (errno == EINTR) continue;
break;
}
if (source == shutdown_efd)
break;
if (source == uring_efd) {
drain eventfd counter;
pi_mutex_lock(&global_lock);
nspa_uring_drain(&ctx->uring);
pi_mutex_unlock(&global_lock);
continue;
}
/* source == channel */
ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_RECV2, &recv);
if (ret < 0 && errno == ENOTTY)
recv2_state = 0;
dispatch request;
while (try_recv2_state) {
ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_TRY_RECV2, &recv);
if (ret == 0) {
dispatch request;
continue;
}
if (errno == ENOTTY)
try_recv2_state = 0;
break;
}
}
The legacy fallback path is still the old direct RECV2 / RECV
receive loop. That is now compatibility logic, not the preferred
production shape.
Key invariants:
- Every dequeued entry is completed with a CHANNEL_REPLY.
- global_lock remains the only server lock the dispatcher takes. Phase 3 changes the wait primitive and completion ownership; it does not change handler locking discipline.
- Request and reply payloads still travel through the sender’s per-thread request_shm mapping.
- Shutdown is requested only by signalling shutdown_efd.
- If a dispatcher ever sees -ENOTTY on aggregate-wait, RECV2, or TRY_RECV2, it stays on the older compatible path for the rest of its lifetime.

The detached-thread exit property remains the same: destroy wakes the dispatcher, the dispatcher cleans up its own context, and no join is required.
Gamma’s PI guarantee is the most important property of the design. The promise is:
While a request from sender S (priority P_S) is pending or in flight, the dispatcher pthread runs at priority
max(P_dispatcher_base, max {P_S' : S' enqueued or being handled}). There is no observable interval where the dispatcher runs at a lower priority while a higher-priority sender’s entry is queued.
This holds because of three kernel-side properties of
NTSYNC_TYPE_CHANNEL:
On post-1010 kernels there is one extra requirement: a dispatcher
blocked in NTSYNC_IOC_AGGREGATE_WAIT on the channel source must still
be visible to the channel’s SEND_PI wake/boost logic. The production
1010 follow-up (072bfee) is part of gamma’s correctness story for
exactly that reason; without it, aggregate-wait would have reintroduced
a priority gap on the receive side.
When SEND_PI fires, the kernel acquires channel->lock, inserts the
entry into the rbtree, and – under the same spinlock – compares the
entry’s prio against the current dispatcher boost level. If the new
entry is higher prio, it calls apply_event_pi_boost(channel,
entry->prio) which raises the dispatcher’s effective prio via the
underlying task_struct. The boost happens before SEND_PI sleeps the
sender, so by the time the sender is blocked the dispatcher is
already running at (at least) the sender’s prio.
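In kernel terms the ordering is simply “insert, maybe boost, then let the sender sleep”, all rooted in the same spinlock. A simplified sketch (the current_boost field, recv_wq, and channel_insert_entry are assumed shapes, not the driver’s actual code):

static int channel_send_pi( struct ntsync_channel *chan,
                            struct ntsync_channel_entry *entry )
{
    unsigned long flags;

    spin_lock_irqsave( &chan->lock, flags );
    channel_insert_entry( chan, entry );           /* rbtree keyed (prio desc, seq asc) */
    if (entry->prio > chan->current_boost)
        apply_event_pi_boost( chan, entry->prio ); /* boost the dispatcher now          */
    spin_unlock_irqrestore( &chan->lock, flags );

    wake_up_interruptible( &chan->recv_wq );       /* dispatcher is already owed the boost */
    return wait_for_completion_interruptible( &entry->reply_done );
}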
When the dispatcher pops the highest-prio entry, the kernel
recalculates the boost cap from the new queue head and the popped
entry’s prio. The dispatcher’s boost is “rooted” in the popped entry
for the duration of the handler – if a lower-prio sender arrives
while the handler runs, it does not raise the dispatcher’s prio; if
a higher-prio sender arrives, it does (apply_event_pi_boost is
re-entrant in the safe direction).
NTSYNC_IOC_CHANNEL_REPLY is the most subtle ioctl. In one critical
section under channel->lock it:
1. looks up the entry that matches the reply;
2. drains the dispatcher’s PI boost rooted in that entry;
3. completes reply_done (waking the sender);
4. re-boosts the dispatcher to the new queue head’s prio, if any entry is still pending.

Step 4 is what closes the gap. Without it, REPLY would return the dispatcher to base priority for the duration of the next RECV syscall, during which a high-prio sender that arrived during the just-completed handler would be stranded behind the dispatcher’s self-rescheduling. Step 4 stitches the boost forward from one entry to the next inside the same ioctl that wakes the previous sender. This is the deferred-boost mechanism introduced in ntsync patch 1008; gamma was redesigned mid-2026-04 to require it.
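A sketch of that critical section, in the same spirit (helper names and the exact lookup key are assumptions; the boost calls queue deferred pi_work per patch 1008 rather than touching the task_struct under the raw lock):

static int channel_reply( struct ntsync_channel *chan, u32 sender_tid )
{
    struct ntsync_channel_entry *entry, *next;
    unsigned long flags;

    spin_lock_irqsave( &chan->lock, flags );
    entry = channel_find_inflight( chan, sender_tid );
    if (!entry) {
        spin_unlock_irqrestore( &chan->lock, flags );
        return -ENOENT;
    }
    consume_event_pi_boost( chan, entry->prio );  /* drain the boost rooted in this entry     */
    next = channel_peek_head( chan );             /* highest-prio entry still pending, if any */
    if (next)
        apply_event_pi_boost( chan, next->prio ); /* stitch the boost forward (step 4)        */
    complete( &entry->reply_done );               /* wake the original sender                 */
    spin_unlock_irqrestore( &chan->lock, flags );
    return 0;
}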
The legacy v1.5/v2.4 design had three orthogonal hand-rolled pieces:
the boost call itself (sched_setscheduler), the bookkeeping cache
(nspa_dispatcher_current_prio), and the unboost call. Any of the
three could desync from the others under churn:
| Userspace PI failure mode | Kernel-atomic equivalent |
|---|---|
| Boost lands on wrong tid (TID race) | Impossible: boost is keyed off the channel’s task_struct pointer, set at dispatcher pthread spawn time |
| Cache says “boosted to 80” but actual policy is RR/40 | Impossible: kernel owns the boost |
| Two senders racing the cache leave dispatcher unboosted | Impossible: apply_event_pi_boost is serialised by channel->lock |
| Dispatcher exits between cache read and unboost call | N/A: dispatcher exit closes the channel; pending sends fail with EBADF |
The only remaining consideration is interaction with NTSync’s other
PI machinery (events, mutexes). Channels share the same apply_* /
drain_* primitives so a dispatcher that holds an event boost from
one source and a channel boost from another sees correctly summed
priority. We have observed no PI-summing bugs in production since
the channel landed.
Phase B is the second-most important integration consumer of gamma.
It lives in server/nspa/fd_lockdrop.c and reshapes how the
dispatcher cooperates with slow filesystem syscalls.
The wineserver’s create_file handler ultimately does an
openat() syscall against the host filesystem. On a cold-cache
disk read this can take tens of milliseconds. With the v1.5 design
each dispatcher held only one thread’s global_lock so a slow
openat only blocked one client’s queue. With gamma there is one
dispatcher per process: a slow openat blocks the entire
process’s request queue.
In a DAW, the audio thread issuing a NtQueryPerformanceCounter or
a futex syscall lookup is now stuck behind the GUI thread’s
multi-millisecond LoadLibrary chain. That is a reliable xrun on
drum-track-load-while-playing.
nspa_openat_lockdrop (line 47) reorganises the openat critical
section into a “drop, syscall, re-acquire” pattern:
/* Inside server/fd.c create_file_obj path */
...
{
    struct thread *saved_current = current;               /* per-request globals we must restore */
    unsigned int saved_error = saved_current->error;
    struct object *fd_ref = grab_object(fd_object);       /* pin objects across the unlocked window */
    struct object *root_ref = root_object ? grab_object(root_object) : NULL;

    pi_mutex_unlock(&global_lock);                         /* drop: the rest of the system can run */
    unix_fd = do_openat(...);                              /* the slow, possibly cold-cache syscall */
    pi_mutex_lock(&global_lock);                           /* re-acquire before touching server state */

    current = saved_current;                               /* restore in inverse order of save */
    if (saved_current) saved_current->error = saved_error;
    if (root_ref) release_object(root_ref);
    if (fd_ref) release_object(fd_ref);
}
While the lock is dropped the dispatcher’s priority is whatever the
kernel last boosted it to (the pending sender’s prio). Any other
sender – including the audio thread – can have its request popped
by a different mechanism… except there isn’t one: the dispatcher
is in the middle of this handler. Phase B is therefore narrower
than its name suggests: it lets the kernel schedule other
processes' threads (and the host’s RT audio path) while we are
blocked in openat(), but it does not let other entries in this
process’s queue jump ahead.
That sounds like it does nothing useful, but the Linux scheduler’s
PI propagation is what makes it work: while we hold global_lock
under FIFO 80 (boosted), other RT threads in this process are at
their own FIFO prio (typically 80 for the audio thread), and they
are CPU-blocked behind us only insofar as we hold the CPU. Dropping
the lock lets us also be IO-blocked, at which point the audio
thread can preempt us via the kernel scheduler. The dispatcher is
still single-threaded with respect to gamma’s own queue.
Several pieces of per-request state are global-ish and must be preserved across the lock-drop window:
| State | Why it must be saved |
|---|---|
| current (per-request thread pointer; server/request.c:121) | Another handler running in our unlocked window will overwrite it |
| current->error | Belongs to our request; read by the reply path. Must not pick up a stranger’s error |
| fd_object refcount | Just-allocated by alloc_fd_object, only the caller knows it; grab_object makes the unlocked window bullet-proof |
| root_object refcount | Held by caller’s handler; pinning means a concurrent close-handle of root cannot free it during our syscall |
| errno | Per-thread, so naturally preserved; we still snapshot to local_errno to insulate from libc calls in pi_mutex_lock etc. |
The restore order is the inverse: re-lock, restore current,
restore current->error, drop refs.
Phase B is default-on as of 2026-04-26, gated by
NSPA_OPENFD_LOCKDROP=0 for A/B testing or as a panic switch.
Originally shipped default-off after a host lockup on the first
validation run; the lockup was eventually traced to the ntsync
driver’s kfree-under-raw_spinlock_t bug (fixed in
ntsync-patches/1006-ntsync-rt-alloc-hoist.patch), not Phase B
itself. Re-validated post-1006 with Ableton drum-track-load-while-playing
– the file-open-burst workload Phase B targets – with clean results.
The cached env-var read at lines 67-79 follows the same one-shot
getenv pattern as the other gamma gates (NSPA_DISPATCHER_USE_TOKEN,
NSPA_DISABLE_EPOLL).
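For reference, the one-shot pattern reduced to its shape (the helper name here is illustrative):

#include <stdlib.h>

static int nspa_openfd_lockdrop_enabled( void )
{
    static int cached = -1;                    /* -1 = not resolved yet */

    if (cached < 0)
    {
        const char *v = getenv( "NSPA_OPENFD_LOCKDROP" );
        cached = !(v && v[0] == '0' && !v[1]); /* only "0" disables; default on */
    }
    return cached;                             /* subsequent calls never touch getenv */
}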
The thread-token mechanism is a steady-state CPU optimisation introduced by ntsync patch 1005 and consumed by the dispatcher. It removes a hash-table lookup on the dispatcher’s hot path.
Pre-token, the dispatcher mapped payload_off (which is the
sender’s Wine thread_id_t) to a struct thread * via
get_thread_from_id, which walks a hash table under
thread_id_lock. Per the perf trace from 2026-04-26 this call was
~10% of dispatcher CPU in mixed-load steady state. Eliminating
it is worth the kernel-side complexity.
The optimisation is split across three deployment phases:
| Phase | Patch | What changes |
|---|---|---|
| T1 | 1005 kernel patch | Channel object grows a (tid -> token) hash; new ioctls REGISTER_THREAD / DEREGISTER_THREAD / RECV2 |
| T2 | wineserver plumbing | Wineserver registers (unix_tid -> (struct thread *)) from req_init_first_thread and req_init_thread; deregisters from destroy_thread |
| T3 | dispatcher consumes token | channel_dispatcher calls RECV2 and uses the token directly, skipping get_thread_from_id when it is non-zero |
T1 and T2 ship behaviour-neutral (the kernel stamps tokens and the
wineserver registers them, but nobody reads the token). T3 flips
the dispatcher to consume them and is gated NSPA_DISPATCHER_USE_TOKEN
(default on, set to 0 to fall back to the legacy
get_thread_from_id lookup for A/B testing).
The token is (struct thread *) cast to __u64. Dereferencing it
in the dispatcher requires the registration to happen before any
client send that would resolve to that thread, and the deregistration
to happen after the last reply. Both invariants are satisfied
naturally:
- Registration happens in req_init_first_thread / req_init_thread, both of which are server handlers that complete before the client sees the reply that lets it issue further requests.
- Deregistration happens in destroy_thread, which is called after the thread’s last reference drops. By that point no further sends are possible (the thread is gone).

The dispatcher does not take a ref on the token-resolved thread
(line 222 in shmem_channel.c: if (!recv.thread_token)
release_object(thread)). It “borrows” the registration’s ref. That
is sound because the registration’s ref is held until deregister-
after-last-reply, and the dispatcher is the entity that processes
those replies – the deregister cannot race with the dispatcher
doing the work.
If a sender’s thread happens to be unregistered (very early
pre-init traffic, or a build against an old kernel without 1005),
recv.thread_token is zero and the dispatcher falls back to
get_thread_from_id + release_object. The fallback path is
identical to the pre-token behaviour and is exercised every time
RECV2 returns ENOTTY (line 161-166).
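Put together, the consumption branch looks roughly like this (local names and the dispatch helper are illustrative; the shipped logic is at shmem_channel.c:161-230):

struct thread *thread;
int from_token = use_token && recv.thread_token != 0;

if (from_token)
    thread = (struct thread *)(uintptr_t)recv.thread_token; /* borrow the registration's ref  */
else
    thread = get_thread_from_id( recv.payload_off );        /* legacy hash lookup, takes a ref */

if (thread)
{
    dispatch_channel_request( thread, &recv );               /* read_request_shm handler path   */
    if (!from_token)
        release_object( thread );                            /* only the lookup path took a ref */
}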
Per the 2026-04-26 perf run, with T3 enabled, get_ptid_entry drops from ~10% of dispatcher CPU to ~0%.

A redesign of the IPC fast path must not change observable Win32 semantics. Two ordering guarantees must be preserved:
Win32 guarantees that within a single thread, request k is
serialised before request k+1. Gamma preserves this trivially
because every request blocks the issuing thread until its reply is
delivered (SEND_PI returns only after REPLY). Thread T cannot
have request k+1 outstanding while k is still in flight; the
kernel-side rbtree never holds two entries from the same thread
simultaneously.
Win32 is silent on cross-thread request ordering – threads race the
wineserver, and whichever request reaches the server first wins. The
upstream socket dispatcher serialises by epoll-readiness order
(roughly arrival order plus kernel scheduling latency). The v1.5
per-thread-pthread design serialised by “first dispatcher pthread to
acquire global_lock” (essentially random under contention). Gamma
serialises by strict sender priority, FIFO inside priority.
This is strictly stronger than either legacy design. An app that relied on a specific cross-thread ordering would already be racy on upstream Wine; gamma’s priority-ordered shape is observationally indistinguishable from a faster machine reaching the upstream ordering. Notably, gamma never violates a happens-before relationship the app could observe through synchronisation primitives, because those primitives also flow through the wineserver and are subject to the same ordering – a high-prio thread’s signal arrives at the wineserver in priority order along with everyone else’s traffic.
The reply is byte-identical to the upstream socket reply. Same
reply_header.error codes, same payload layout, same handle
allocations. Apps that probe wineserver-internal state (none should,
but Wine’s own conformance tests do) see the same values.
Validation status:
- Kernel-side selftests (test-event-set-pi, test-channel-recv-exclusive, test-aggregate-wait) pass.
- test-aggregate-wait: 9/9 PASS, including the channel-notify and channel-PI propagation sub-tests added for the Phase 3 path, and the kitchen-sink path with 86,528 wakes / 0 timeouts / 0 errors.
- dispatcher-burst was added to the baseline + RT runner. dispatcher-burst matters because the rest of the PE matrix mostly goes through inproc_wait -> ntsync ioctls directly and does not hit the dispatcher hot path.
- On pre-1011 kernels the TRY_RECV2 loop gates itself off via -ENOTTY; the dispatcher remains functionally correct and simply consumes one entry per wake.
- Ableton validation run with NSPA_AGG_WAIT=1, default-on NSPA_TRY_RECV2, and default-on async create_file: clean cold-start, plugin scan, drum-track-load-while-playing, and clean shutdown.

| Symbol (wineserver-relative) | Before | After | Delta |
|---|---|---|---|
| channel_dispatcher | 14.51% | 0.70% | −13.81pp / −95% |
| main_loop_epoll | 7.24% | 2.68% | −4.56pp |
| nspa_queue_bypass_shm | 2.77% | absent | inlined into call sites |
| req_get_update_region | 4.92% | absent | gone from top symbols |
| nspa_redraw_ring_drain | 2.88% | absent | gone from top symbols |
System-wide samples: 38,588 -> 19,415 per 30s.
This profile shift is the combined effect of
1d85c558
(dispatcher ACQ_REL fences + inline accessor) and
01d528f5
(TRY_RECV2 burst-drain) on top of the 1011 kernel primitive.
| Commit | Implemented change | Exact observed effect |
|---|---|---|
| c0f5c515cd7 + 2870c9629ce | gate mark_block_* poison and the paired valgrind annotations behind NSPA_DEBUG_POISON_ALLOCS | mark_block_uninitialized was sampled at 1.34% wineserver-relative under dispatcher-burst; the combined change reclaims the full 1.34pp and drops the symbol out of the top-20 |
| 0802dadc750 | inline read_request_shm at the dispatcher call site | read_request_shm was sampled at 3.55% wineserver-relative under dispatcher-burst; after inlining it disappears from the symbol table and saves ~1pp more on the dispatcher path |
These follow-ons do not change the dispatcher architecture. They remove
residual per-RPC overhead that remained after the bigger structural
landing (AGG_WAIT, TRY_RECV2, inline queue accessor, lighter
fences) was already in place.
dispatcher-burst A/B:

| Metric | TRY_RECV2 on | TRY_RECV2 off | Delta |
|---|---|---|---|
| burst ops/sec (wall) | 841,765 | 555,567 | +52% / 1.5x |
| burst worst max ns | 23,014,325 | 31,843,082 | −28% |
| steady avg ns | 35,202 | 33,405 | flat (no burst) |
Steady-state is flat both ways, exactly as designed. The win is
concentrated in burst load where the dispatcher can drain N queued
entries per AGG_WAIT wake instead of paying N round-trips.
For the 2026-04-30 production validation:
- 10124FB81FDC76797EF1F91
- NSPA_RT_POLICY=FF
- NSPA_OPENFD_LOCKDROP unset -> default ON
- NSPA_DISPATCHER_USE_TOKEN unset -> default ON
- NSPA_AGG_WAIT unset -> default ON
- NSPA_TRY_RECV2 unset -> default ON
- NSPA_ENABLE_ASYNC_CREATE_FILE unset -> default ON
- no userspace sched_setscheduler calls on the request path

The original Torge Matthies forward-port spawned one dispatcher
pthread per client thread. Each pthread owned a thread-private
request_shm page and watched a futex word inside it. When the client
wrote a request, it raised the word and FUTEX_WAKE-ed the dispatcher;
the dispatcher locked global_lock, ran the handler, wrote the reply,
and lowered the word so the client’s FUTEX_WAIT returned.
Priority inheritance was bolted on in userspace. Before sending, the
client did sched_setscheduler(dispatcher_tid, RT_POLICY, our_prio)
to boost the dispatcher to the caller’s level. After reply, the
dispatcher reset its own scheduler attrs.
The pain points:
- Every dispatcher pthread contended on the single global_lock. A 60-thread DAW had 60 dispatcher pthreads contending for one mutex.
- The client read dispatcher_tid from a shared field, then called sched_setscheduler. Between the read and the syscall the dispatcher could exit and another thread could be assigned the same tid by the kernel; the boost would land on a random thread. We never observed this in production but it was a real correctness hole.
- The boost/unboost pair meant cap_sys_nice-bearing syscalls on every request.

v2.4 narrowed the steady-state cost: senders cached their RT prio in
ntdll_thread_data, did a CAS on a request-state word, did a single
FUTEX_WAKE, and only fell back to sched_setscheduler when the
cached dispatcher prio was below ours. This eliminated four syscalls
per request on the steady-state hot path but left every architectural
problem of v1.5 in place: still one dispatcher per thread, still
userspace TID-read-vs-setscheduler racing, still hand-rolled PI
arithmetic. The “cache” added a third place where boost state could
desync.
Once NTSync gained an event PI primitive (patch 1006, eventually deferred-boost in 1008), it was clear that PI for IPC could ride the same machinery. The legacy machinery had three structural problems no amount of userspace engineering could fix:
| Structural problem | Gamma resolution |
|---|---|
| N pthreads per process contending on global_lock | One dispatcher per process; contention is O(1) per process |
| TID-read vs sched_setscheduler race window | Kernel boosts the dispatcher inside the same syscall that enqueues |
| Userspace PI accounting drift | Kernel owns the boost state; userspace never reads or writes it |
Gamma is the smallest design that closes all three.
Gamma has been validated under sustained stress and through several KASAN-caught bugs. Tracking them here for completeness.
A static audit of server/nspa/shmem_channel.c found no latent
correctness bugs after the baf088c290f refcount + process-
membership patch. The handler runs under global_lock exactly as
v1.5 did, so handler-internal correctness is inherited from upstream
Wine. The dispatcher loop has no spin-loops, no missing locks, and
no lifetime races. The full audit lives at
wine/nspa/docs/gamma-dispatcher-audit-and-split-plan.md.
Pre-1007, the channel’s RECV path used a non-exclusive
wake_up_interruptible_all on enqueue, which woke every waiter and
let the kernel pick one. Under multiple-dispatcher scenarios (which
gamma does not actually use, but the test-channel-stress harness does)
the wake-all caused a real priority inversion: a low-prio waiter
could win the race and delay the high-prio waiter behind a sleep.
Patch 1007 narrowed RECV to wait_event_interruptible_exclusive +
wake_up_interruptible. Audit doc at wine/nspa/docs/ntsync-rt-audit.md.
The pre-1008 EVENT_SET_PI boost was applied immediately under
raw_spinlock_t, which blocked other RT operations. 1008 deferred
the boost to a per-CPU pi_work pool drained outside the spinlock.
Gamma channel REPLY uses the same machinery via
consume_event_pi_boost / apply_event_pi_boost – the deferred-
boost queue is what makes “drain previous, re-boost from new head”
atomic-feeling without holding the raw spinlock through the actual
task_struct boost call.
KASAN caught a use-after-free on struct ntsync_channel_entry in
test-channel-stress: a REPLY’s wake_up_all raced with SEND_PI’s
kfree(entry). Same bug class as the rolled-back 1008/1009 wave.
The clean fix was a refcount_t refs on ntsync_channel_entry,
incremented on enqueue and decremented at REPLY completion and at
sender wakeup; ~15 LOC. Patch 1009 in tree. No production user has
ever observed this bug (gamma has only one dispatcher per channel,
which keeps the path single-consumer); but the channel UAPI is
shared with other potential consumers and the fix is unconditional.
After the ~370M-ops ntsync validation proved the kernel sound, the
lockup investigation moved to wine-NSPA userspace. The audit doc at
wine/nspa/docs/wine-nspa-lockup-audit-20260427.md covers F1-F9
wineserver-side findings and MR1-MR8 msg_ring findings; gamma
itself was scored clean. The shipped fixes (MR1 reply-slot ABA, MR2
FUTEX_PRIVATE on shared memfd, MR4 POST wake-loss) are all in
dlls/win32u/nspa/msg_ring.c and orthogonal to gamma.
A separate behavioural-feedback note
(feedback_dont_shotgun_audit_into_unfound_bug) documents that
ntsync patches 1007-1011 originally shipped five patches as “audit
findings” without ever tracing the original EVENT_SET_PI slab
UAF; they were rolled back, reduced to the four genuinely-needed
fixes (1006/1007/1008/1009), and re-shipped. The lesson: KASAN /
trace first, audit second. Gamma’s design is small enough that
this discipline applies to its own future evolution as well.
| File | Lines | Role |
|---|---|---|
| wine/dlls/ntdll/unix/server.c | 311-436 | Sender shim nspa_send_request_channel + UAPI fallback |
| wine/dlls/ntdll/unix/server.c | 442-461 | server_call_unlocked gating logic |
| wine/server/nspa/shmem_channel.c | 60-139 | UAPI fallback for pre-1005 / pre-1010 kernel headers |
| wine/server/nspa/shmem_channel.c | 158-390 | Dispatcher context + aggregate-wait loop + legacy fallback loop |
| wine/server/nspa/shmem_channel.c | 474-581 | Dispatcher create/destroy path, shutdown eventfd lifetime |
| wine/server/nspa/shmem_channel.c | 310-340 | T2 thread-token register/deregister |
| wine/server/nspa/uring.h | – | Per-process nspa_uring_instance API consumed by Phase 2 / Phase 3 |
| wine/server/nspa/shmem_channel.h | 1-48 | Public header |
| wine/server/nspa/fd_lockdrop.c | 47-125 | Phase B nspa_openat_lockdrop – lock-drop integration |
| wine/nspa/docs/gamma-dispatcher-audit-and-split-plan.md | – | Audit + future router/handler split plan |
| wine/nspa/docs/wine-nspa-lockup-audit-20260427.md | – | F1-F9 + MR1-MR8 lockup-investigation findings |
| wine/nspa/docs/ntsync-rt-audit.md | – | ntsync 1007/1008/1009 audit |
| File | Lines | Role |
|---|---|---|
| drivers/misc/ntsync.c | – | Channel object plus aggregate-wait registration / wake path |
| ntsync-patches/1004-ntsync-channel.patch | – | Channel object + core ioctls |
| ntsync-patches/1005-ntsync-channel-thread-token.patch | – | RECV2 + REGISTER_THREAD + DEREGISTER_THREAD |
| ntsync-patches/1006-ntsync-rt-alloc-hoist.patch | – | kfree-under-raw_spinlock fix; unblocked Phase B default-on |
| ntsync-patches/1007-ntsync-channel-exclusive-recv.patch | – | Channel exclusive recv – priority inversion fix |
| ntsync-patches/1008-ntsync-event-set-pi-deferred-boost.patch | – | Deferred boost machinery (consumed by REPLY) |
| ntsync-patches/1009-ntsync-channel-entry-refcount.patch | – | refcount_t on ntsync_channel_entry (KASAN UAF fix) |
| ntsync-patches/1010-ntsync-aggregate-wait.patch | – | Heterogeneous wait primitive used by the post-1010 dispatcher |
| ntsync-patches/1011-ntsync-channel-try-recv2.patch | – | Non-blocking RECV2 used for post-dispatch burst drain |
| Doc | Topic |
|---|---|
| project_gamma_dispatcher_audit_and_split_plan.md | 2026-04-26 audit + T1/T2/T3 + router/handler split plan |
| project_msg_ring_v2_mr1_mr2_mr4_shipped_20260427.md | MR1/MR2/MR4 + Ableton run-3 config |
| project_ntsync_session_20260427_results.md | 30M-ops cumulative validation, 4 bugs fixed |
| project_ntsync_kfree_under_raw_spinlock.md | 1006 alloc-hoist (unblocked Phase B default-on) |
| feedback_dont_shotgun_audit_into_unfound_bug.md | KASAN-first / audit-second discipline |
The published shmem-ipc.gen.html describes v1.5 (per-thread
dispatcher) and v2.4 (cached-CAS + manual prio cache) and is
superseded by this document. It is retained for historical
reference and for the comparison diagrams. The CS-PI design
(cs-pi.gen.html) is orthogonal to gamma and continues to apply
unchanged: gamma improves the IPC path; CS-PI improves the in-
process critical-section path; they coexist without interaction.