Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync channels + aggregate-wait + TRY_RECV2 | 2026-05-01
Author: Jordan Johnston
Status: production dispatcher architecture; aggregate-wait Phase 3, post-1011 TRY_RECV2 burst-drain, and the immediate hot-path tuning follow-ons are all part of the shipped path.
This page explains the current wineserver request path for a Wine process: how requests enter the gamma channel, how the dispatcher owns the reply path, and how the post-1010 aggregate-wait loop fits into that design.
Gamma is the third generation of Wine-NSPA’s client-to-wineserver IPC fast
path. It replaces the v1.5 per-thread pthread dispatcher (and the v2.4
cached-CAS / futex-wake hybrid that briefly extended it) with a single
per-process kernel-mediated request channel built on top of the
NTSync NTSYNC_TYPE_CHANNEL object.
Every Wine client process has exactly one channel fd, opened by the
wineserver during process attach and shipped to the client via
SCM_RIGHTS in the init_first_thread reply. Client threads issue
NTSYNC_IOC_CHANNEL_SEND_PI to atomically enqueue a request, boost the
dispatcher pthread to the sender’s priority, and block for reply, all
in one syscall.
The wineserver now runs one dispatcher context per client process, not just one bare receive loop. That context owns:
- the channel fd,
- the shutdown eventfd, and
- the per-process nspa_uring_instance.

On post-1010 kernels the dispatcher blocks in
NTSYNC_IOC_AGGREGATE_WAIT over (channel object, uring eventfd if
active, shutdown eventfd), follows a channel wake with
NTSYNC_IOC_CHANNEL_RECV2, runs the existing read_request_shm
handler under global_lock, and calls NTSYNC_IOC_CHANNEL_REPLY to
wake the originator and drain its PI boost. On 1011 kernels, if
NSPA_TRY_RECV2 is left at its default-on setting, the dispatcher
then issues NTSYNC_IOC_CHANNEL_TRY_RECV2 in a tight loop to drain any
additional ready entries from the same wake. On pre-1010 kernels it
falls back permanently to the legacy direct CHANNEL_RECV2 / RECV
loop for that dispatcher, and on pre-1011 kernels the burst-drain
feature gates itself off via -ENOTTY.
The key win over the legacy designs: priority inheritance is now
kernel-atomic. There is no userspace TID-read-vs-sched_setscheduler
race window, no pthread_setschedparam call against a thread that may
have already exited, and no userspace bookkeeping of “who is currently
boosted to what”. The kernel’s apply_event_pi_boost /
consume_event_pi_boost machinery (introduced in ntsync patch 1008
deferred-boost) handles all of it inside the same lock that orders the
queue.
The published shmem-ipc.gen.html describes v1.5 and v2.4. That
document is superseded by this one. Gamma plus the aggregate-wait
dispatcher loop is the architecture in production today.
The gamma path involves five cooperating components:

| Component | Location | Role |
|---|---|---|
| Kernel channel object | drivers/misc/ntsync.c | Priority rbtree of pending entries; SEND_PI / RECV2 / TRY_RECV2 / REPLY PI machinery |
| Aggregate-wait primitive | drivers/misc/ntsync.c + patch 1010 | Heterogeneous wait over channel object + fd sources |
| Sender shim | dlls/ntdll/unix/server.c | nspa_send_request_channel: copy header, SEND_PI, copy reply |
| Dispatcher context | server/nspa/shmem_channel.c | channel fd + shutdown eventfd + per-process nspa_uring_instance |
| Per-thread shmem | unchanged from v1.5 | Holds request payload and reply payload (zero-copy) |
The channel fd is created in process attach, the dispatcher context is
allocated alongside it, the detached pthread is spawned with explicit RT
scheduler attrs when NSPA_SRV_RT_PRIO > 0, and the channel fd is
shipped to the client over SCM_RIGHTS alongside the existing
per-thread request_shm fds in the init_first_thread reply. The
client stashes it in nspa_request_channel_fd and from then on uses it
for every server_call_unlocked whose request fits in the per-thread
shmem window.
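The SCM_RIGHTS leg of that exchange is ordinary POSIX fd passing. A minimal, self-contained sketch of the receive side is below; the function name and the surrounding reply handling are illustrative only, not Wine's actual init_first_thread plumbing:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Illustrative only: pull one passed fd out of the ancillary data that
 * rides alongside a reply on a unix socket. */
static int recv_reply_with_fd( int sock, void *reply, size_t reply_size )
{
    char control[CMSG_SPACE( sizeof(int) )];
    struct iovec iov = { .iov_base = reply, .iov_len = reply_size };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = control, .msg_controllen = sizeof(control) };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg( sock, &msg, 0 ) < 0) return -1;
    for (cmsg = CMSG_FIRSTHDR( &msg ); cmsg; cmsg = CMSG_NXTHDR( &msg, cmsg ))
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
            memcpy( &fd, CMSG_DATA( cmsg ), sizeof(int) );
    return fd;  /* the client would stash this in nspa_request_channel_fd */
}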
Post-1010/1011 gamma is no longer just a blocking RECV2 loop. The
dispatcher waits on three sources and selects the work type from the
aggregate-wait result; on 1011 kernels it can then keep draining ready
channel work with TRY_RECV2 before it sleeps again.
For one request, the payload stays in request_shm while the channel
only carries scheduling metadata and reply ownership.
The kernel-side enqueue/boost in SEND_PI and the unboost-and-reboost inside the kernel REPLY handler are atomic with respect to each other under the channel’s internal spinlock. There is no observable interval where the dispatcher is running unboosted while another high-prio entry sits ready in the queue.
The following inline SVG shows a burst-drain lifecycle through gamma.
Two senders are shown at differing priorities to illustrate the
rbtree’s strict-priority ordering, REPLY’s automatic re-boost, and the
1011 TRY_RECV2 follow-on that keeps draining ready work without
returning to AGG_WAIT.
The two-sender scenario shows the property that matters on 1011:
between B’s REPLY and the dispatcher’s TRY_RECV2, the dispatcher
stays at FIFO 80 because the kernel re-boosted from the new
queue head atomically inside REPLY, and the dispatcher can keep
draining ready work without returning to AGG_WAIT. The legacy v1.5
design would have unboosted the dispatcher to 64 at unboost-time and
then re-boosted to 80 only when A’s sched_setscheduler landed –
which would have raced with the dispatcher’s own RECV-side runqueue
insertion. Gamma closes that gap by construction, and 1011 removes the
extra wake/round-trip once the next entry is already queued.
The gamma redesign was scoped tightly:
- No router/handler split yet (that is deferred to project_gamma_dispatcher_audit_and_split_plan.md). Just one pthread that drains the channel sequentially.
- The per-thread request_shm page is kept exactly as-is. The channel only carries metadata (TID + priority), never request data.

The gating env vars on the current production path are:
- NSPA_DISPATCHER_USE_TOKEN=0 – A/B for T3 thread-token consumption
- NSPA_AGG_WAIT=0 – opt out of the post-1010 aggregate-wait loop and force the old direct CHANNEL_RECV2 path for that dispatcher
- NSPA_TRY_RECV2=0 – keep one dequeue per wake even on 1011 kernels

Gamma itself remains the default transport whenever the channel ioctls are present.
The kernel side lives in drivers/misc/ntsync.c (Linux-NSPA tree at
/home/ninez/pkgbuilds/Linux-NSPA-pkgbuild/linux-nspa-6.19.11-1.src/linux-nspa/src/linux-6.19.11/drivers/misc/ntsync.c,
lines 1190-1494 for the channel object). Each NTSync channel is:
struct ntsync_channel {
struct ntsync_obj obj; /* base */
spinlock_t lock; /* serialises queue + boost state */
struct rb_root entries; /* priority-ordered by entry->prio */
u32 max_depth;
struct hlist_head thread_tokens;/* (tid -> struct thread *) registry */
...
};
struct ntsync_channel_entry {
struct rb_node node;
u32 prio;
u32 sender_tid;
u64 payload_off;
u64 reply_off;
u64 thread_token;
struct task_struct *sender;
struct completion reply_done;
refcount_t refs; /* added by patch 1009 (see audit) */
};
The channel exposes eight ioctls. Five sit on gamma’s hot path (SEND_PI, RECV, RECV2, TRY_RECV2, REPLY); CREATE_CHANNEL runs once at process attach, and REGISTER_THREAD / DEREGISTER_THREAD run at thread init and destroy.
| ioctl | Direction | Patch | Purpose |
|---|---|---|---|
| NTSYNC_IOC_CREATE_CHANNEL | wineserver | 1004 | Open a new channel, return fd. max_depth caps queued entries. |
| NTSYNC_IOC_CHANNEL_SEND_PI | client | 1004 | Enqueue + boost dispatcher + block for reply, atomically. |
| NTSYNC_IOC_CHANNEL_RECV | dispatcher | 1004 | Dequeue highest-prio entry; boost dispatcher to that prio; return metadata. |
| NTSYNC_IOC_CHANNEL_RECV2 | dispatcher | 1005 | Same as RECV but additionally returns thread_token. |
| NTSYNC_IOC_CHANNEL_TRY_RECV2 | dispatcher | 1011 | Same payload as RECV2, but non-blocking; used for post-dispatch burst drain. |
| NTSYNC_IOC_CHANNEL_REPLY | dispatcher | 1004 | Wake the matching entry’s sender; drain our PI boost from that entry; auto-re-boost to the next pending entry’s prio if any. |
| NTSYNC_IOC_CHANNEL_REGISTER_THREAD / DEREGISTER_THREAD | wineserver | 1005 | Register (tid -> struct thread *) for token pass-through. |
The userspace UAPI structs are defined in linux/ntsync.h and
fall-back-defined in both dlls/ntdll/unix/server.c:339-347 and
server/nspa/shmem_channel.c:60-107 for clients running against a
kernel header that predates the patches. The fall-back blocks
#ifndef NTSYNC_IOC_CREATE_CHANNEL so they activate exactly when the
build host’s headers are stale; once the kernel headers carry the
definitions the fall-back is silently ignored.
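The shape of those fall-back blocks is the usual #ifndef-guarded UAPI mirror. The sketch below shows the pattern only; the struct layout and ioctl numbers here are placeholders, not the real values from linux/ntsync.h:

#ifndef NTSYNC_IOC_CREATE_CHANNEL
/* Placeholder mirror of the channel UAPI; field names follow the args used
 * by nspa_send_request_channel, but the layout and ioctl numbers below are
 * illustrative, not the shipped definitions. */
struct ntsync_channel_send_args
{
    __u32 policy;       /* sender's cached RT policy, 0 for SCHED_OTHER */
    __u32 prio;         /* sender's cached RT prio, 0 = no boost        */
    __u64 payload_off;  /* gamma passes the sender's thread id here     */
    __u64 reply_off;
};
#define NTSYNC_IOC_CREATE_CHANNEL  _IOWR('N', 0x90, __u32)                           /* placeholder */
#define NTSYNC_IOC_CHANNEL_SEND_PI _IOWR('N', 0x91, struct ntsync_channel_send_args) /* placeholder */
#endif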
Operationally the channel’s policy is strict-priority + FIFO inside
each priority class. The rbtree key is (prio_desc, enqueue_seq_asc).
A SCHED_FIFO sender at prio 70 always drains before any sender at
prio 65; among prio-70 senders they drain in arrival order.
SCHED_OTHER senders pass prio = 0 and the kernel routes them at
the bottom of the tree.
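As a concrete illustration of that ordering policy (not the driver’s actual insert path; the enqueue_seq field and helper names below are assumptions), the insertion comparator looks like this:

/* Strict priority, FIFO inside a priority class. */
static bool channel_entry_before( const struct ntsync_channel_entry *a,
                                  const struct ntsync_channel_entry *b )
{
    if (a->prio != b->prio)
        return a->prio > b->prio;          /* higher prio drains first        */
    return a->enqueue_seq < b->enqueue_seq; /* equal prio drains arrival order */
}

static void channel_insert_entry( struct ntsync_channel *chan,
                                  struct ntsync_channel_entry *entry )
{
    struct rb_node **link = &chan->entries.rb_node, *parent = NULL;

    while (*link)
    {
        struct ntsync_channel_entry *cur =
            rb_entry( *link, struct ntsync_channel_entry, node );
        parent = *link;
        link = channel_entry_before( entry, cur ) ? &(*link)->rb_left
                                                  : &(*link)->rb_right;
    }
    rb_link_node( &entry->node, parent, link );
    rb_insert_color( &entry->node, &chan->entries );
}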
The client-side entry point is nspa_send_request_channel in
dlls/ntdll/unix/server.c:349. The function is invoked from
server_call_unlocked (line 442) when all three preconditions hold:
- nspa_request_channel_fd >= 0 – the channel was successfully opened by the wineserver and the fd survived the SCM_RIGHTS exchange;
- ntdll_get_thread_data()->request_shm is non-NULL – per-thread shmem is mapped (set up during init_thread);
- sizeof(req->u.req) + req->u.req.request_header.request_size < NSPA_REQUEST_SHM_SIZE – the request fits in the zero-copy window.

If any precondition fails, server_call_unlocked falls through to the
upstream socket path (send_request + wait_reply). This is the
ungated, transparent fallback.
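Condensed into one predicate, the gate looks roughly like this (nspa_can_use_channel is a name made up for this document; the shipped check is written inline in server_call_unlocked):

static inline int nspa_can_use_channel( const struct __server_request_info *req )
{
    return nspa_request_channel_fd >= 0 &&                  /* channel fd arrived via SCM_RIGHTS */
           ntdll_get_thread_data()->request_shm != NULL &&  /* per-thread shmem is mapped        */
           sizeof(req->u.req) + req->u.req.request_header.request_size
               < NSPA_REQUEST_SHM_SIZE;                     /* fits the zero-copy window         */
}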
The state machine for the gamma path is:
1. memcpy req->u.req into request_shm->u.req
2. for each req->data[i]:
memcpy into request_shm[after-header]
3. read data->nspa_rt_cached_prio (set by nspa_rt_apply_tid)
if > 0:
args.policy = data->nspa_rt_cached_policy
args.prio = data->nspa_rt_cached_prio
else:
args.policy = 0; args.prio = 0 /* SCHED_OTHER, no boost */
4. args.payload_off = GetCurrentThreadId()
args.reply_off = same (channel is metadata-only)
5. data_ptr = request_shm + sizeof(req) + request_size
copy_limit = end-of-shmem - data_ptr
/* Computed BEFORE the SEND_PI: req->u.req and req->u.reply
share union storage, so post-reply reads of request_size
would actually return reply_size. */
6. ioctl SEND_PI <-- blocks until REPLY
on EINTR: fall through to read reply (server already wrote it)
on any other error: return STATUS_INTERNAL_ERROR
7. memcpy request_shm->u.reply -> req->u.reply
8. if reply_size > copy_limit:
split: copy first copy_limit bytes from shmem
read remainder via socket fallback (read_reply_data)
else:
memcpy reply_size bytes from shmem
9. return req->u.reply.reply_header.error
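The step-5 comment about union aliasing is worth making concrete. A stand-alone toy (simplified layouts, not Wine’s generated headers) shows why a post-reply read of request_size is meaningless:

#include <stdio.h>

struct request_header { unsigned int req;   unsigned int request_size; };
struct reply_header   { unsigned int error; unsigned int reply_size;  };

union msg { struct request_header req; struct reply_header reply; };

int main( void )
{
    union msg m;
    m.req.request_size = 64;   /* what the sender wrote before SEND_PI */
    m.reply.reply_size = 8;    /* what the server's reply overwrote    */
    /* Same byte offset in the union: this prints 8, not 64. */
    printf( "request_size now reads %u\n", m.req.request_size );
    return 0;
}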
Two subtleties worth highlighting:
- request_shm->u.req and request_shm->u.reply share union storage; reading request_header.request_size post-reply actually reads reply_header.reply_size (same byte offset in the C union) and drives data_ptr to the wrong place. That is why copy_limit is computed before SEND_PI.
- On EINTR from SEND_PI the server has already written the reply into request_shm. We fall through to copy it out as if SEND_PI had returned 0.

The non-RT case (prio = 0) is interesting: the kernel still enqueues
the entry at the bottom of the rbtree and wakes the dispatcher, but it
skips the boost machinery entirely. SCHED_OTHER clients pay a single
ioctl and a single memcpy round-trip – no sched_setscheduler
syscalls, no userspace PI bookkeeping. Even on the cold (non-RT) path
gamma is cheaper than v1.5.
The dispatcher pthread is still born detached with explicit
SCHED_FIFO attrs when NSPA_SRV_RT_PRIO > 0, but the runtime loop is
now selected in layers:
- If NSPA_AGG_WAIT is not set to 0, use the post-1010 aggregate-wait loop; on -ENOTTY, permanently fall back to the legacy direct receive loop for this dispatcher.
- Within either loop, prefer CHANNEL_RECV2; if the kernel predates 1005, permanently fall back to CHANNEL_RECV.
- After each dispatched request, if NSPA_TRY_RECV2 is not set to 0, issue non-blocking TRY_RECV2 until the channel is empty.

The post-1010 loop is structurally:
for (;;) {
build sources[] from:
channel object
uring eventfd (if ring active)
shutdown eventfd
ret = ioctl(dev_fd, NTSYNC_IOC_AGGREGATE_WAIT, &agg);
if (ret < 0 && errno == ENOTTY) {
agg_supported = 0;
continue; /* use legacy path on next iteration */
}
if (ret < 0) {
if (errno == EINTR) continue;
break;
}
if (source == shutdown_efd)
break;
if (source == uring_efd) {
drain eventfd counter;
pi_mutex_lock(&global_lock);
nspa_uring_drain(&ctx->uring);
pi_mutex_unlock(&global_lock);
continue;
}
/* source == channel */
ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_RECV2, &recv);
if (ret < 0 && errno == ENOTTY)
recv2_state = 0;
dispatch request;
while (try_recv2_state) {
ret = ioctl(channel_fd, NTSYNC_IOC_CHANNEL_TRY_RECV2, &recv);
if (ret == 0) {
dispatch request;
continue;
}
if (errno == ENOTTY)
try_recv2_state = 0;
break;
}
}
The legacy fallback path is still the old direct RECV2 / RECV
receive loop. That is now compatibility logic, not the preferred
production shape.
Key invariants:
- Every dequeued entry is completed with a CHANNEL_REPLY.
- global_lock remains the only server lock the dispatcher takes. Phase 3 changes the wait primitive and completion ownership; it does not change handler locking discipline.
- Request and reply payloads still travel through the sender’s per-thread request_shm mapping.
- Shutdown is requested only by signalling shutdown_efd.
- If a dispatcher ever sees -ENOTTY on aggregate-wait, RECV2, or TRY_RECV2, it stays on the older compatible path for the rest of its lifetime.

The detached-thread exit property remains the same: destroy wakes the dispatcher, the dispatcher cleans up its own context, and no join is required.
Gamma’s PI guarantee is the most important property of the design. The promise is:
While a request from sender S (priority P_S) is pending or in flight, the dispatcher pthread runs at priority
max(P_dispatcher_base, max {P_S' : S' enqueued or being handled}). There is no observable interval where the dispatcher runs at a lower priority while a higher-priority sender’s entry is queued.
This holds because of three kernel-side properties of
NTSYNC_TYPE_CHANNEL:
On post-1010 kernels there is one extra requirement: a dispatcher
blocked in NTSYNC_IOC_AGGREGATE_WAIT on the channel source must still
be visible to the channel’s SEND_PI wake/boost logic. The production
1010 follow-up (072bfee) is part of gamma’s correctness story for
exactly that reason; without it, aggregate-wait would have reintroduced
a priority gap on the receive side.
When SEND_PI fires, the kernel acquires channel->lock, inserts the
entry into the rbtree, and – under the same spinlock – compares the
entry’s prio against the current dispatcher boost level. If the new
entry is higher prio, it calls apply_event_pi_boost(channel,
entry->prio) which raises the dispatcher’s effective prio via the
underlying task_struct. The boost happens before SEND_PI sleeps the
sender, so by the time the sender is blocked the dispatcher is
already running at (at least) the sender’s prio.
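In kernel terms the ordering is simply “insert, maybe boost, then let the sender sleep”, all rooted in the same spinlock. A simplified sketch (the current_boost field, recv_wq, and channel_insert_entry are assumed shapes, not the driver’s actual code):

static int channel_send_pi( struct ntsync_channel *chan,
                            struct ntsync_channel_entry *entry )
{
    unsigned long flags;

    spin_lock_irqsave( &chan->lock, flags );
    channel_insert_entry( chan, entry );           /* rbtree keyed (prio desc, seq asc) */
    if (entry->prio > chan->current_boost)
        apply_event_pi_boost( chan, entry->prio ); /* boost the dispatcher now          */
    spin_unlock_irqrestore( &chan->lock, flags );

    wake_up_interruptible( &chan->recv_wq );       /* dispatcher is already owed the boost */
    return wait_for_completion_interruptible( &entry->reply_done );
}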
When the dispatcher pops the highest-prio entry, the kernel
recalculates the boost cap from the new queue head and the popped
entry’s prio. The dispatcher’s boost is “rooted” in the popped entry
for the duration of the handler – if a lower-prio sender arrives
while the handler runs, it does not raise the dispatcher’s prio; if
a higher-prio sender arrives, it does (apply_event_pi_boost is
re-entrant in the safe direction).
NTSYNC_IOC_CHANNEL_REPLY is the most subtle ioctl. In one critical
section under channel->lock it:
1. looks up the entry that matches the reply;
2. drains the dispatcher’s PI boost rooted in that entry;
3. completes reply_done (waking the sender);
4. re-boosts the dispatcher to the new queue head’s prio, if any entry is still pending.

Step 4 is what closes the gap. Without it, REPLY would return the dispatcher to base priority for the duration of the next RECV syscall, during which a high-prio sender that arrived during the just-completed handler would be stranded behind the dispatcher’s self-rescheduling. Step 4 stitches the boost forward from one entry to the next inside the same ioctl that wakes the previous sender. This is the deferred-boost mechanism introduced in ntsync patch 1008; gamma was redesigned mid-2026-04 to require it.
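A sketch of that critical section, in the same spirit (helper names and the exact lookup key are assumptions; the boost calls queue deferred pi_work per patch 1008 rather than touching the task_struct under the raw lock):

static int channel_reply( struct ntsync_channel *chan, u32 sender_tid )
{
    struct ntsync_channel_entry *entry, *next;
    unsigned long flags;

    spin_lock_irqsave( &chan->lock, flags );
    entry = channel_find_inflight( chan, sender_tid );
    if (!entry) {
        spin_unlock_irqrestore( &chan->lock, flags );
        return -ENOENT;
    }
    consume_event_pi_boost( chan, entry->prio );  /* drain the boost rooted in this entry     */
    next = channel_peek_head( chan );             /* highest-prio entry still pending, if any */
    if (next)
        apply_event_pi_boost( chan, next->prio ); /* stitch the boost forward (step 4)        */
    complete( &entry->reply_done );               /* wake the original sender                 */
    spin_unlock_irqrestore( &chan->lock, flags );
    return 0;
}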
The legacy v1.5/v2.4 design had three orthogonal hand-rolled pieces:
the boost call itself (sched_setscheduler), the bookkeeping cache
(nspa_dispatcher_current_prio), and the unboost call. Any of the
three could desync from the others under churn:
| Userspace PI failure mode | Kernel-atomic equivalent |
|---|---|
| Boost lands on wrong tid (TID race) | Impossible: boost is keyed off the channel’s task_struct pointer, set at dispatcher pthread spawn time |
| Cache says “boosted to 80” but actual policy is RR/40 | Impossible: kernel owns the boost |
| Two senders racing the cache leave dispatcher unboosted | Impossible: apply_event_pi_boost is serialised by channel->lock |
| Dispatcher exits between cache read and unboost call | N/A: dispatcher exit closes the channel; pending sends fail with EBADF |
The only remaining consideration is interaction with NTSync’s other
PI machinery (events, mutexes). Channels share the same apply_* /
drain_* primitives so a dispatcher that holds an event boost from
one source and a channel boost from another sees correctly summed
priority. We have observed no PI-summing bugs in production since
the channel landed.
Phase B is the second-most important integration consumer of gamma.
It lives in server/nspa/fd_lockdrop.c and reshapes how the
dispatcher cooperates with slow filesystem syscalls.
The wineserver’s create_file handler ultimately does an
openat() syscall against the host filesystem. On a cold-cache
disk read this can take tens of milliseconds. With the v1.5 design
each dispatcher held only one thread’s global_lock so a slow
openat only blocked one client’s queue. With gamma there is one
dispatcher per process: a slow openat blocks the entire
process’s request queue.
In a DAW, the audio thread issuing a NtQueryPerformanceCounter or
a futex syscall lookup is now stuck behind the GUI thread’s
multi-millisecond LoadLibrary chain. That is a reliable xrun on
drum-track-load-while-playing.
nspa_openat_lockdrop (line 47) reorganises the openat critical
section into a “drop, syscall, re-acquire” pattern:
/* Inside server/fd.c create_file_obj path */
...
{
    struct thread *saved_current = current;               /* per-request globals we must restore */
    unsigned int saved_error = saved_current->error;
    struct object *fd_ref = grab_object(fd_object);       /* pin objects across the unlocked window */
    struct object *root_ref = root_object ? grab_object(root_object) : NULL;

    pi_mutex_unlock(&global_lock);                         /* drop: the rest of the system can run */
    unix_fd = do_openat(...);                              /* the slow, possibly cold-cache syscall */
    pi_mutex_lock(&global_lock);                           /* re-acquire before touching server state */

    current = saved_current;                               /* restore in inverse order of save */
    if (saved_current) saved_current->error = saved_error;
    if (root_ref) release_object(root_ref);
    if (fd_ref) release_object(fd_ref);
}
While the lock is dropped the dispatcher’s priority is whatever the
kernel last boosted it to (the pending sender’s prio). Any other
sender – including the audio thread – can have its request popped
by a different mechanism… except there isn’t one: the dispatcher
is in the middle of this handler. Phase B is therefore narrower
than its name suggests: it lets the kernel schedule other
processes' threads (and the host’s RT audio path) while we are
blocked in openat(), but it does not let other entries in this
process’s queue jump ahead.
That sounds like it does nothing useful, but the Linux scheduler’s
PI propagation is what makes it work: while we hold global_lock
under FIFO 80 (boosted), other RT threads in this process are at
their own FIFO prio (typically 80 for the audio thread), and they
are CPU-blocked behind us only insofar as we hold the CPU. Dropping
the lock lets us also be IO-blocked, at which point the audio
thread can preempt us via the kernel scheduler. The dispatcher is
still single-threaded with respect to gamma’s own queue.
Several pieces of per-request state are global-ish and must be preserved across the lock-drop window:
| State | Why it must be saved |
|---|---|
| current (per-request thread pointer; server/request.c:121) | Another handler running in our unlocked window will overwrite it |
| current->error | Belongs to our request; read by the reply path. Must not pick up a stranger’s error |
| fd_object refcount | Just-allocated by alloc_fd_object, only the caller knows it; grab_object makes the unlocked window bullet-proof |
| root_object refcount | Held by caller’s handler; pinning means a concurrent close-handle of root cannot free it during our syscall |
| errno | Per-thread, so naturally preserved; we still snapshot to local_errno to insulate from libc calls in pi_mutex_lock etc. |
The restore order is the inverse: re-lock, restore current,
restore current->error, drop refs.
Phase B is default-on as of 2026-04-26, gated by
NSPA_OPENFD_LOCKDROP=0 for A/B testing or as a panic switch.
Originally shipped default-off after a host lockup on the first
validation run; the lockup was eventually traced to the ntsync
driver’s kfree-under-raw_spinlock_t bug (fixed in
ntsync-patches/1006-ntsync-rt-alloc-hoist.patch), not Phase B
itself. Re-validated post-1006 with Ableton drum-track-load-while-playing
– the file-open-burst workload Phase B targets – with clean results.
The cached env-var read at lines 67-79 follows the same one-shot
getenv pattern as the other gamma gates (NSPA_DISPATCHER_USE_TOKEN,
NSPA_DISABLE_EPOLL).
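For reference, the one-shot pattern reduced to its shape (the helper name here is illustrative):

#include <stdlib.h>

static int nspa_openfd_lockdrop_enabled( void )
{
    static int cached = -1;                    /* -1 = not resolved yet */

    if (cached < 0)
    {
        const char *v = getenv( "NSPA_OPENFD_LOCKDROP" );
        cached = !(v && v[0] == '0' && !v[1]); /* only "0" disables; default on */
    }
    return cached;                             /* subsequent calls never touch getenv */
}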
The thread-token mechanism is a steady-state CPU optimisation introduced by ntsync patch 1005 and consumed by the dispatcher. It removes a hash-table lookup on the dispatcher’s hot path.
Pre-token, the dispatcher mapped payload_off (which is the
sender’s Wine thread_id_t) to a struct thread * via
get_thread_from_id, which walks a hash table under
thread_id_lock. Per the perf trace from 2026-04-26 this call was
~10% of dispatcher CPU in mixed-load steady state. Eliminating
it is worth the kernel-side complexity.
The optimisation is split across three deployment phases:
| Phase | Patch | What changes |
|---|---|---|
| T1 | 1005 kernel patch | Channel object grows a (tid -> token) hash; new ioctls REGISTER_THREAD / DEREGISTER_THREAD / RECV2 |
| T2 | wineserver plumbing | Wineserver registers (unix_tid -> (struct thread *)) from req_init_first_thread and req_init_thread; deregisters from destroy_thread |
| T3 | dispatcher consumes token | channel_dispatcher calls RECV2 and uses the token directly, skipping get_thread_from_id when it is non-zero |
T1 and T2 ship behaviour-neutral (the kernel stamps tokens and the
wineserver registers them, but nobody reads the token). T3 flips
the dispatcher to consume them and is gated NSPA_DISPATCHER_USE_TOKEN
(default on, set to 0 to fall back to the legacy
get_thread_from_id lookup for A/B testing).
The token is (struct thread *) cast to __u64. Dereferencing it
in the dispatcher requires the registration to happen before any
client send that would resolve to that thread, and the deregistration
to happen after the last reply. Both invariants are satisfied
naturally:
- Registration happens in req_init_first_thread / req_init_thread, both of which are server handlers that complete before the client sees the reply that lets it issue further requests.
- Deregistration happens in destroy_thread, which is called after the thread’s last reference drops. By that point no further sends are possible (the thread is gone).

The dispatcher does not take a ref on the token-resolved thread
(line 222 in shmem_channel.c: if (!recv.thread_token)
release_object(thread)). It “borrows” the registration’s ref. That
is sound because the registration’s ref is held until deregister-
after-last-reply, and the dispatcher is the entity that processes
those replies – the deregister cannot race with the dispatcher
doing the work.
If a sender’s thread happens to be unregistered (very early
pre-init traffic, or a build against an old kernel without 1005),
recv.thread_token is zero and the dispatcher falls back to
get_thread_from_id + release_object. The fallback path is
identical to the pre-token behaviour and is exercised every time
RECV2 returns ENOTTY (line 161-166).
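Put together, the consumption branch looks roughly like this (local names and the dispatch helper are illustrative; the shipped logic is at shmem_channel.c:161-230):

struct thread *thread;
int from_token = use_token && recv.thread_token != 0;

if (from_token)
    thread = (struct thread *)(uintptr_t)recv.thread_token; /* borrow the registration's ref  */
else
    thread = get_thread_from_id( recv.payload_off );        /* legacy hash lookup, takes a ref */

if (thread)
{
    dispatch_channel_request( thread, &recv );               /* read_request_shm handler path   */
    if (!from_token)
        release_object( thread );                            /* only the lookup path took a ref */
}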
Per the 2026-04-26 perf run, with T3 enabled, get_ptid_entry drops from ~10% of dispatcher CPU to ~0%.

A redesign of the IPC fast path must not change observable Win32 semantics. Two ordering guarantees must be preserved:
Win32 guarantees that within a single thread, request k is
serialised before request k+1. Gamma preserves this trivially
because every request blocks the issuing thread until its reply is
delivered (SEND_PI returns only after REPLY). Thread T cannot
have request k+1 outstanding while k is still in flight; the
kernel-side rbtree never holds two entries from the same thread
simultaneously.
Win32 is silent on cross-thread request ordering – threads race the
wineserver, and whichever request reaches the server first wins. The
upstream socket dispatcher serialises by epoll-readiness order
(roughly arrival order plus kernel scheduling latency). The v1.5
per-thread-pthread design serialised by “first dispatcher pthread to
acquire global_lock” (essentially random under contention). Gamma
serialises by strict sender priority, FIFO inside priority.
This is strictly stronger than either legacy design. An app that relied on a specific cross-thread ordering would already be racy on upstream Wine; gamma’s priority-ordered shape is observationally indistinguishable from a faster machine reaching the upstream ordering. Notably, gamma never violates a happens-before relationship the app could observe through synchronisation primitives, because those primitives also flow through the wineserver and are subject to the same ordering – a high-prio thread’s signal arrives at the wineserver in priority order along with everyone else’s traffic.
The reply is byte-identical to the upstream socket reply. Same
reply_header.error codes, same payload layout, same handle
allocations. Apps that probe wineserver-internal state (none should,
but Wine’s own conformance tests do) see the same values.
Validation status:
- Kernel-side selftests (test-event-set-pi, test-channel-recv-exclusive, test-aggregate-wait) pass.
- test-aggregate-wait: 9/9 PASS, including the channel-notify and channel-PI propagation sub-tests added for the Phase 3 path, and the kitchen-sink path with 86,528 wakes / 0 timeouts / 0 errors.
- dispatcher-burst was added to the baseline + RT runner. dispatcher-burst matters because the rest of the PE matrix mostly goes through inproc_wait -> ntsync ioctls directly and does not hit the dispatcher hot path.
- On pre-1011 kernels the TRY_RECV2 loop gates itself off via -ENOTTY; the dispatcher remains functionally correct and simply consumes one entry per wake.
- Ableton validation run with NSPA_AGG_WAIT=1, default-on NSPA_TRY_RECV2, and default-on async create_file: clean cold-start, plugin scan, drum-track-load-while-playing, and clean shutdown.

| Symbol (wineserver-relative) | Before | After | Delta |
|---|---|---|---|
| channel_dispatcher | 14.51% | 0.70% | −13.81pp / −95% |
| main_loop_epoll | 7.24% | 2.68% | −4.56pp |
| nspa_queue_bypass_shm | 2.77% | absent | inlined into call sites |
| req_get_update_region | 4.92% | absent | gone from top symbols |
| nspa_redraw_ring_drain | 2.88% | absent | gone from top symbols |
System-wide samples: 38,588 -> 19,415 per 30s.
This profile shift is the combined effect of
1d85c558
(dispatcher ACQ_REL fences + inline accessor) and
01d528f5
(TRY_RECV2 burst-drain) on top of the 1011 kernel primitive.
| Commit | Implemented change | Exact observed effect |
|---|---|---|
| c0f5c515cd7 + 2870c9629ce | gate mark_block_* poison and the paired valgrind annotations behind NSPA_DEBUG_POISON_ALLOCS | mark_block_uninitialized was sampled at 1.34% wineserver-relative under dispatcher-burst; the combined change reclaims the full 1.34pp and drops the symbol out of the top-20 |
| 0802dadc750 | inline read_request_shm at the dispatcher call site | read_request_shm was sampled at 3.55% wineserver-relative under dispatcher-burst; after inlining it disappears from the symbol table and saves ~1pp more on the dispatcher path |
These follow-ons do not change the dispatcher architecture. They remove
residual per-RPC overhead that remained after the bigger structural
landing (AGG_WAIT, TRY_RECV2, inline queue accessor, lighter
fences) was already in place.
dispatcher-burst A/B:

| Metric | TRY_RECV2 on | TRY_RECV2 off | Delta |
|---|---|---|---|
| burst ops/sec (wall) | 841,765 | 555,567 | +52% / 1.5x |
| burst worst max ns | 23,014,325 | 31,843,082 | −28% |
| steady avg ns | 35,202 | 33,405 | flat (no burst) |
Steady-state is flat both ways, exactly as designed. The win is
concentrated in burst load where the dispatcher can drain N queued
entries per AGG_WAIT wake instead of paying N round-trips.
For the 2026-04-30 production validation:
- 10124FB81FDC76797EF1F91
- NSPA_RT_POLICY=FF
- NSPA_OPENFD_LOCKDROP unset -> default ON
- NSPA_DISPATCHER_USE_TOKEN unset -> default ON
- NSPA_AGG_WAIT unset -> default ON
- NSPA_TRY_RECV2 unset -> default ON
- NSPA_ENABLE_ASYNC_CREATE_FILE unset -> default ON
- no userspace sched_setscheduler calls on the request path

The original Torge Matthies forward-port spawned one dispatcher
pthread per client thread. Each pthread owned a thread-private
request_shm page and watched a futex word inside it. When the client
wrote a request, it raised the word and FUTEX_WAKE-ed the dispatcher;
the dispatcher locked global_lock, ran the handler, wrote the reply,
and lowered the word so the client’s FUTEX_WAIT returned.
Priority inheritance was bolted on in userspace. Before sending, the
client did sched_setscheduler(dispatcher_tid, RT_POLICY, our_prio)
to boost the dispatcher to the caller’s level. After reply, the
dispatcher reset its own scheduler attrs.
The pain points:
- Every dispatcher pthread contended on the single global_lock. A 60-thread DAW had 60 dispatcher pthreads contending for one mutex.
- The client read dispatcher_tid from a shared field, then called sched_setscheduler. Between the read and the syscall the dispatcher could exit and another thread could be assigned the same tid by the kernel; the boost would land on a random thread. We never observed this in production but it was a real correctness hole.
- The boost/unboost pair meant cap_sys_nice-bearing syscalls on every request.

v2.4 narrowed the steady-state cost: senders cached their RT prio in
ntdll_thread_data, did a CAS on a request-state word, did a single
FUTEX_WAKE, and only fell back to sched_setscheduler when the
cached dispatcher prio was below ours. This eliminated four syscalls
per request on the steady-state hot path but left every architectural
problem of v1.5 in place: still one dispatcher per thread, still
userspace TID-read-vs-setscheduler racing, still hand-rolled PI
arithmetic. The “cache” added a third place where boost state could
desync.
Once NTSync gained an event PI primitive (patch 1006, eventually deferred-boost in 1008), it was clear that PI for IPC could ride the same machinery. The legacy machinery had three structural problems no amount of userspace engineering could fix:
| Structural problem | Gamma resolution |
|---|---|
| N pthreads per process contending on global_lock | One dispatcher per process; contention is O(1) per process |
| TID-read vs sched_setscheduler race window | Kernel boosts the dispatcher inside the same syscall that enqueues |
| Userspace PI accounting drift | Kernel owns the boost state; userspace never reads or writes it |
Gamma is the smallest design that closes all three.
Gamma has been validated under sustained stress and through several KASAN-caught bugs. Tracking them here for completeness.
A static audit of server/nspa/shmem_channel.c found no latent
correctness bugs after the baf088c290f refcount + process-
membership patch. The handler runs under global_lock exactly as
v1.5 did, so handler-internal correctness is inherited from upstream
Wine. The dispatcher loop has no spin-loops, no missing locks, and
no lifetime races. The full audit lives at
wine/nspa/docs/gamma-dispatcher-audit-and-split-plan.md.
Pre-1007, the channel’s RECV path used a non-exclusive
wake_up_interruptible_all on enqueue, which woke every waiter and
let the kernel pick one. Under multiple-dispatcher scenarios (which
gamma does not actually use, but the test-channel-stress harness does)
the wake-all caused a real priority inversion: a low-prio waiter
could win the race and delay the high-prio waiter behind a sleep.
Patch 1007 narrowed RECV to wait_event_interruptible_exclusive +
wake_up_interruptible. Audit doc at wine/nspa/docs/ntsync-rt-audit.md.
The pre-1008 EVENT_SET_PI boost was applied immediately under
raw_spinlock_t, which blocked other RT operations. 1008 deferred
the boost to a per-CPU pi_work pool drained outside the spinlock.
Gamma channel REPLY uses the same machinery via
consume_event_pi_boost / apply_event_pi_boost – the deferred-
boost queue is what makes “drain previous, re-boost from new head”
atomic-feeling without holding the raw spinlock through the actual
task_struct boost call.
KASAN caught a use-after-free on struct ntsync_channel_entry in
test-channel-stress: a REPLY’s wake_up_all raced with SEND_PI’s
kfree(entry). Same bug class as the rolled-back 1008/1009 wave.
The clean fix was a refcount_t refs on ntsync_channel_entry,
incremented on enqueue and decremented at REPLY completion and at
sender wakeup; ~15 LOC. Patch 1009 in tree. No production user has
ever observed this bug (gamma has only one dispatcher per channel,
which keeps the path single-consumer); but the channel UAPI is
shared with other potential consumers and the fix is unconditional.
After the ~370M-ops ntsync validation proved the kernel sound, the
lockup investigation moved to wine-NSPA userspace. The audit doc at
wine/nspa/docs/wine-nspa-lockup-audit-20260427.md covers F1-F9
wineserver-side findings and MR1-MR8 msg_ring findings; gamma
itself was scored clean. The shipped fixes (MR1 reply-slot ABA, MR2
FUTEX_PRIVATE on shared memfd, MR4 POST wake-loss) are all in
dlls/win32u/nspa/msg_ring.c and orthogonal to gamma.
A separate behavioural-feedback note
(feedback_dont_shotgun_audit_into_unfound_bug) documents that
ntsync patches 1007-1011 originally shipped five patches as “audit
findings” without ever tracing the original EVENT_SET_PI slab
UAF; they were rolled back, reduced to the four genuinely-needed
fixes (1006/1007/1008/1009), and re-shipped. The lesson: KASAN /
trace first, audit second. Gamma’s design is small enough that
this discipline applies to its own future evolution as well.
| File | Lines | Role |
|---|---|---|
| wine/dlls/ntdll/unix/server.c | 311-436 | Sender shim nspa_send_request_channel + UAPI fallback |
| wine/dlls/ntdll/unix/server.c | 442-461 | server_call_unlocked gating logic |
| wine/server/nspa/shmem_channel.c | 60-139 | UAPI fallback for pre-1005 / pre-1010 kernel headers |
| wine/server/nspa/shmem_channel.c | 158-390 | Dispatcher context + aggregate-wait loop + legacy fallback loop |
| wine/server/nspa/shmem_channel.c | 474-581 | Dispatcher create/destroy path, shutdown eventfd lifetime |
| wine/server/nspa/shmem_channel.c | 310-340 | T2 thread-token register/deregister |
| wine/server/nspa/uring.h | – | Per-process nspa_uring_instance API consumed by Phase 2 / Phase 3 |
| wine/server/nspa/shmem_channel.h | 1-48 | Public header |
| wine/server/nspa/fd_lockdrop.c | 47-125 | Phase B nspa_openat_lockdrop – lock-drop integration |
| wine/nspa/docs/gamma-dispatcher-audit-and-split-plan.md | – | Audit + future router/handler split plan |
| wine/nspa/docs/wine-nspa-lockup-audit-20260427.md | – | F1-F9 + MR1-MR8 lockup-investigation findings |
| wine/nspa/docs/ntsync-rt-audit.md | – | ntsync 1007/1008/1009 audit |
| File | Lines | Role |
|---|---|---|
| drivers/misc/ntsync.c | – | Channel object plus aggregate-wait registration / wake path |
| ntsync-patches/1004-ntsync-channel.patch | – | Channel object + core ioctls |
| ntsync-patches/1005-ntsync-channel-thread-token.patch | – | RECV2 + REGISTER_THREAD + DEREGISTER_THREAD |
| ntsync-patches/1006-ntsync-rt-alloc-hoist.patch | – | kfree-under-raw_spinlock fix; unblocked Phase B default-on |
| ntsync-patches/1007-ntsync-channel-exclusive-recv.patch | – | Channel exclusive recv – priority inversion fix |
| ntsync-patches/1008-ntsync-event-set-pi-deferred-boost.patch | – | Deferred boost machinery (consumed by REPLY) |
| ntsync-patches/1009-ntsync-channel-entry-refcount.patch | – | refcount_t on ntsync_channel_entry (KASAN UAF fix) |
| ntsync-patches/1010-ntsync-aggregate-wait.patch | – | Heterogeneous wait primitive used by the post-1010 dispatcher |
| ntsync-patches/1011-ntsync-channel-try-recv2.patch | – | Non-blocking RECV2 used for post-dispatch burst drain |
| Doc | Topic |
|---|---|
| project_gamma_dispatcher_audit_and_split_plan.md | 2026-04-26 audit + T1/T2/T3 + router/handler split plan |
| project_msg_ring_v2_mr1_mr2_mr4_shipped_20260427.md | MR1/MR2/MR4 + Ableton run-3 config |
| project_ntsync_session_20260427_results.md | 30M-ops cumulative validation, 4 bugs fixed |
| project_ntsync_kfree_under_raw_spinlock.md | 1006 alloc-hoist (unblocked Phase B default-on) |
| feedback_dont_shotgun_audit_into_unfound_bug.md | KASAN-first / audit-second discipline |
The published shmem-ipc.gen.html describes v1.5 (per-thread
dispatcher) and v2.4 (cached-CAS + manual prio cache) and is
superseded by this document. It is retained for historical
reference and for the comparison diagrams. The CS-PI design
(cs-pi.gen.html) is orthogonal to gamma and continues to apply
unchanged: gamma improves the IPC path; CS-PI improves the in-
process critical-section path; they coexist without interaction.