Wine-NSPA – Aggregate-Wait and Async Completion

Wine 11.6 + NSPA RT patchset | Kernel patch 1010 + Gamma dispatcher Phase 2/3 | 2026-04-30 | Author: Jordan Johnston

This page documents the landed aggregate-wait slice in Wine-NSPA: kernel patch 1010 (NTSYNC_IOC_AGGREGATE_WAIT) plus the first userspace consumer shape that uses it, namely dispatcher Phase 2 (per-process dispatcher-owned io_uring) and Phase 3 (gamma waits on channel + uring eventfd + shutdown eventfd and drains CQEs inline on the same RT thread).

Status: shipped and validated. NSPA_AGG_WAIT is default-on as of 2026-04-29. The 2026-04-30 follow-ons (NSPA_ENABLE_ASYNC_CREATE_FILE=1 and, on 1011 kernels, NSPA_TRY_RECV2=1) now ship on top of this same foundation.


Table of Contents

  1. Overview
  2. Scope of this page
  3. Why the old bridge was wrong
  4. Kernel patch 1010
  5. Wine-NSPA Phase 2 and Phase 3
  6. Validation and deployment
  7. Relationship to the broader decomposition plan
  8. References

1. Overview

Aggregate-wait is the kernel-side wait primitive that lets the gamma dispatcher block on request traffic, deferred-completion wakeups, and teardown wakeups in one place while keeping receive, CQE drain, and reply signaling on the same RT thread.

That is the architectural role of patch 1010 plus the dispatcher Phase 2/3 userspace work. Gamma already gave Wine-NSPA the correct request-side priority inheritance story: client threads do CHANNEL_SEND_PI, the kernel enqueues by priority, and the wineserver dispatcher runs the handler at the right effective priority. What gamma lacked was the matching async completion-side wait primitive.

The first async-completion prototype used the wineserver main thread as the CQE drain site. That proved the basic mechanism but broke the more important invariant: the thread that received the request was no longer the thread that completed and replied to it.

Patch 1010 and the accompanying dispatcher restructure fix that. The dispatcher now owns all three parts of the async path:

  1. receive request from the channel
  2. submit deferred work to its per-process io_uring
  3. drain completion and issue CHANNEL_REPLY

The same RT thread handles the full lifecycle.

What shipped

Layer     | Landed change                                      | Why it matters
Kernel    | NTSYNC_IOC_AGGREGATE_WAIT                          | one wait covers NTSync objects plus pollable fds
Kernel    | channel notify-only support inside aggregate-wait  | lets the dispatcher block on the channel without consuming the entry in the aggregate ioctl itself
Kernel    | follow-up PI fixes (072bfee)                       | stable boost propagation for aggregate-waiting dispatchers
Userspace | struct nspa_uring_instance per process             | dispatcher-local ring + eventfd + fixed pending pool
Userspace | struct nspa_dispatcher_ctx                         | single owner for channel fd, shutdown eventfd, and ring lifetime
Userspace | aggregate-wait dispatcher loop                     | same-thread request receive, CQE drain, and reply

2. Scope of this page

This page stays focused on the 1010 / Phase 2 / Phase 3 slice itself: the kernel wait primitive, the dispatcher-owned ring, and the same-thread completion/reply invariant that those pieces established.

The later follow-ons are intentionally not expanded here. Phase 4 async create_file is a later consumer of the same dispatcher-owned ring, and 1011 TRY_RECV2 is a later queue-drain optimization on top of the already-landed dispatcher shape. Those are part of the current shipped system, but they belong in the pages that track the dispatcher hot path and current production state: gamma-channel-dispatcher, io_uring-architecture, and current-state.


3. Why the old bridge was wrong

The rejected shape was:

Dispatcher pthread:
  1. CHANNEL_RECV2: the dispatcher owns the request entry
  2. the handler submits an SQE; the deferred path returns to the channel receive loop, and the request is no longer owned by the dispatcher

Wineserver main thread:
  3. main_loop_epoll owns the uring wake; the CQE arrives later
  4. the main thread drains the CQE; the callback restores state and writes the reply
  5. CHANNEL_REPLY is issued

The ownership jump happens after SQE submit, so completion timing depends on the main-thread wake path.

Why it was rejected:
  - request receive, CQE drain, and reply signaling no longer live on one RT thread
  - reply ordering depends on main-thread wake timing and contention instead of dispatcher availability

The problem was not that the code path was impossible. The problem was that it was the wrong ownership model for an RT request path: receive, completion, and reply were split across two threads, so reply timing depended on main-thread wake behavior rather than dispatcher availability.

That shape showed up exactly where expected: real workloads tolerated it structurally, but timing-sensitive application behavior did not.


4. Kernel patch 1010

Patch 1010 adds NTSYNC_IOC_AGGREGATE_WAIT: a heterogeneous wait that combines NTSync object sources, pollable fd sources, and an optional absolute deadline.

The dispatcher is the first consumer, but the primitive is intentionally general.

Patch 1010: heterogeneous wait surface

Object sources:
  - event / mutex / semaphore
  - channel notify-only source
  - PI-visible registration for NTSync-backed sources

FD sources:
  - uring eventfd
  - future fd-poll / timer wake sources
  - poll semantics, no intrinsic PI owner

NTSYNC_IOC_AGGREGATE_WAIT flow:
  1. copy the source array
  2. register object waits and poll waits
  3. sleep once
  4. return fired_index + fired_events
  5. deadline expiry returns NTSYNC_AGG_TIMEOUT

Kernel follow-up required for production stability: 072bfee added the SEND_PI any-waiters fallback and wake-after-boost ordering so aggregate-waiting dispatchers inherit priority correctly.

UAPI shape


struct ntsync_aggregate_source {
    __u32 type;          /* NTSYNC_AGG_OBJECT | NTSYNC_AGG_FD */
    __u32 events;        /* FD source: POLLIN / POLLOUT / POLLERR / POLLHUP */
    __u64 handle_or_fd;  /* ntsync object handle, or unix fd */
};

struct ntsync_aggregate_wait_args {
    __u32 nb_sources;
    __u32 reserved;
    __u64 sources;       /* user pointer to struct ntsync_aggregate_source[] */
    struct __kernel_timespec deadline; /* CLOCK_MONOTONIC ABSTIME or {0,0} */
    __u32 fired_index;
    __u32 fired_events;
    __u32 flags;
    __u32 owner;
};

#define NTSYNC_AGG_OBJECT        0x1
#define NTSYNC_AGG_FD            0x2
#define NTSYNC_AGG_MAX           64
#define NTSYNC_AGG_FLAG_REALTIME 0x1
#define NTSYNC_AGG_TIMEOUT       0xFFFFFFFFu
#define NTSYNC_IOC_AGGREGATE_WAIT _IOWR('N', 0x95, struct ntsync_aggregate_wait_args)

Semantics that matter for gamma

  - the channel participates as a notify-only source: the wake tells the dispatcher a request is available, and the dispatcher follows with CHANNEL_RECV2, so the aggregate ioctl itself never consumes the entry
  - fired_index and fired_events report exactly which source fired, so the loop branches without re-polling every source
  - NTSync-backed sources register PI-visibly, so an aggregate-waiting dispatcher remains eligible for SEND_PI boost propagation
  - on kernels without patch 1010, the ioctl returns -ENOTTY and the dispatcher keeps the legacy direct CHANNEL_RECV2 loop

That last point is operationally important: public docs can describe the new default without pretending the code lost its rollback path.


5. Wine-NSPA Phase 2 and Phase 3

5.1 Phase 2: dispatcher-owned io_uring

Phase 2 did not make handlers async by itself. It put the ring and its state in the correct ownership domain first.

The old global-ring direction was abandoned. The landed design keeps one nspa_uring_instance per gamma channel / per Wine process, stored alongside the dispatcher context.


struct nspa_dispatcher_ctx {
    int channel_fd;
    int shutdown_efd;
    struct nspa_uring_instance uring;
};

Key properties:

  - one nspa_uring_instance per gamma channel / per Wine process, stored alongside the dispatcher context
  - the instance bundles the ring, its eventfd, and a fixed pending pool
  - nspa_dispatcher_ctx is the single owner of the channel fd, the shutdown eventfd, and the ring lifetime
  - the dispatcher frees its own context on exit, so ring teardown never leaves the dispatcher thread

5.2 Phase 3: aggregate-wait dispatcher loop

The dispatcher now waits on three sources:

  1. channel object: request available
  2. uring eventfd: completion available
  3. shutdown eventfd: process teardown requested

Phase 3 dispatcher topology:

Dispatcher context: channel fd + shutdown eventfd + nspa_uring_instance; one context per Wine process, freed by the dispatcher on exit.

  - source 0 (channel object): aggregate-wait fires, the dispatcher follows with CHANNEL_RECV2, and the handler runs under the existing global_lock discipline
  - source 1 (uring eventfd): the dispatcher drains the eventfd, nspa_uring_drain() runs inline on the dispatcher, and the CQE callback issues CHANNEL_REPLY on that same thread
  - source 2 (shutdown eventfd): the destroy path writes 1, aggregate-wait returns, and the dispatcher drains and frees its own context

Operational invariants: the same RT thread receives the request, drains completion, and signals the reply; aggregate-wait -ENOTTY selects the legacy direct CHANNEL_RECV2 loop.

5.3 Dispatcher behavior

The loop is now:

  1. build the aggregate source table from {channel, uring eventfd if active, shutdown eventfd}
  2. call NTSYNC_IOC_AGGREGATE_WAIT
  3. if the fired source is the channel: issue CHANNEL_RECV2, then run the handler under the existing global_lock discipline
  4. if the fired source is the uring eventfd: drain the eventfd, run nspa_uring_drain() inline, and let the CQE callback issue CHANNEL_REPLY on this same thread
  5. if the fired source is shutdown_efd: drain and free the dispatcher context, then exit the loop

5.4 Fallback behavior

Userspace still handles two older-kernel shapes:

  - kernels without patch 1010: NTSYNC_IOC_AGGREGATE_WAIT returns -ENOTTY, so the dispatcher disables aggregate-wait and runs the direct CHANNEL_RECV2 loop
  - kernels without patch 1005 thread-token support: CHANNEL_RECV2 returns -ENOTTY, so the dispatcher falls back to CHANNEL_RECV

That logic is runtime feature detection, not a release ladder:

Dispatcher compatibility decisions (probed once at dispatcher startup / first wait; the supported receive shape is cached in the dispatcher context):

  1. try NTSYNC_IOC_AGGREGATE_WAIT on channel object + uring eventfd + shutdown eventfd. This is the production path on post-1010 kernels. If it returns -ENOTTY, the kernel lacks patch 1010; disable aggregate-wait for this dispatcher.
  2. use the direct CHANNEL_RECV2 loop, the legacy pre-1010 dispatcher wait shape; the channel transport is still intact. If CHANNEL_RECV2 returns -ENOTTY, the kernel lacks patch 1005 thread-token support; disable RECV2 for this dispatcher.
  3. fall back to CHANNEL_RECV, the oldest supported channel shape; no thread token is carried in the receive result.

Steady-state production loop: aggregate-wait blocks once on channel, uring, and shutdown; the same thread receives, drains CQEs, and replies.

6. Validation and deployment

Production state

Item                                 | Value
Kernel module srcversion             | 10124FB81FDC76797EF1F91
Wine userspace state                 | Phase 2 + Phase 3 landed; Phase 4 create_file now uses the same ring
Default gate                         | NSPA_AGG_WAIT=1
Opt-out                              | NSPA_AGG_WAIT=0
Follow-on gates on top of this base  | NSPA_ENABLE_ASYNC_CREATE_FILE=1; NSPA_TRY_RECV2=1 on 1011 kernels

Validation results

Test                                       | Result
test-aggregate-wait                        | 9/9 PASS
channel-PI propagation sub-test            | PASS
1k mixed-concurrency stress                | PASS
30k stress + full native ntsync suite      | PASS, dmesg clean
PE matrix                                  | 24 PASS / 0 FAIL / 0 TIMEOUT, including dispatcher-burst
Ableton level 2/3 with NSPA_AGG_WAIT=1     | PASS
Phase 3 default-on under Ableton           | PASS

The follow-up kernel fixes in 072bfee matter here. The first 1010 cut exposed exactly the kind of PI edge that the dispatcher cannot tolerate: an aggregate-waiting dispatcher must still be visible to SEND_PI wake/boost logic and must not be woken before the new boost state is established. The production module includes those corrections.


7. Relationship to the broader decomposition plan

The public decomposition plan still has queued work in front of it, but the aggregate-wait story is no longer purely hypothetical.

Already shipped: the kernel wait primitive (patch 1010 plus the 072bfee PI fixes), the dispatcher-owned ring (Phase 2), the same-thread aggregate-wait loop (Phase 3), and the follow-on consumers noted earlier (Phase 4 async create_file and, on 1011 kernels, TRY_RECV2).

Still queued:

So the right interpretation is that aggregate-wait has moved from a planned primitive to a shipped foundation. That is a better architectural state than the earlier plan assumed: future work no longer needs to prove the syscall shape from scratch; it can build on a production consumer.


8. References