Wine-NSPA – Aggregate-Wait and Async Completion

Wine 11.6 + NSPA RT patchset | Kernel patch 1010 + Gamma dispatcher Phase 2/3 | 2026-04-30 | Author: Jordan Johnston

This page documents the landed aggregate-wait slice in Wine-NSPA: kernel patch 1010 (NTSYNC_IOC_AGGREGATE_WAIT) plus the first userspace consumer shape that uses it, namely dispatcher Phase 2 (per-process dispatcher-owned io_uring) and Phase 3 (gamma waits on channel + uring eventfd + shutdown eventfd and drains CQEs inline on the same RT thread).

Status: shipped and validated. NSPA_AGG_WAIT is default-on as of 2026-04-29. The 2026-04-30 follow-ons (NSPA_ENABLE_ASYNC_CREATE_FILE=1 and, on 1011 kernels, NSPA_TRY_RECV2=1) now ship on top of this same foundation.


Table of Contents

  1. Overview
  2. Scope of this page
  3. Why the old bridge was wrong
  4. Kernel patch 1010
  5. Wine-NSPA Phase 2 and Phase 3
  6. Validation and deployment
  7. Relationship to the broader decomposition plan
  8. References

1. Overview

Aggregate-wait is the kernel-side wait primitive that lets the gamma dispatcher block on request traffic, deferred-completion wakeups, and teardown wakeups in one place while keeping receive, CQE drain, and reply signaling on the same RT thread.

That is the architectural role of patch 1010 plus the dispatcher Phase 2/3 userspace work. Gamma already gave Wine-NSPA the correct request-side priority inheritance story: client threads do CHANNEL_SEND_PI, the kernel enqueues by priority, and the wineserver dispatcher runs the handler at the right effective priority. What gamma lacked was the matching async completion-side wait primitive.

The first async-completion prototype used the wineserver main thread as the CQE drain site. That proved the basic mechanism but broke the more important invariant: the thread that received the request was no longer the thread that completed and replied to it.

Patch 1010 and the accompanying dispatcher restructure fix that. The dispatcher now owns all three parts of the async path:

  1. receive request from the channel
  2. submit deferred work to its per-process io_uring
  3. drain completion and issue CHANNEL_REPLY

The same RT thread handles the full lifecycle.

What shipped

Layer     | Landed change                                      | Why it matters
Kernel    | NTSYNC_IOC_AGGREGATE_WAIT                          | one wait covers NTSync objects plus pollable fds
Kernel    | channel notify-only support inside aggregate-wait  | lets the dispatcher block on the channel without consuming the entry in the aggregate ioctl itself
Kernel    | follow-up PI fixes (072bfee)                       | stable boost propagation for aggregate-waiting dispatchers
Userspace | struct nspa_uring_instance per process             | dispatcher-local ring + eventfd + fixed pending pool
Userspace | struct nspa_dispatcher_ctx                         | single owner for channel fd, shutdown eventfd, and ring lifetime
Userspace | aggregate-wait dispatcher loop                     | same-thread request receive, CQE drain, and reply

2. Scope of this page

This page stays focused on the 1010 / Phase 2 / Phase 3 slice itself: the kernel wait primitive, the dispatcher-owned ring, and the same-thread completion/reply invariant that those pieces established.

The later follow-ons are intentionally not expanded here. Phase 4 async create_file is a later consumer of the same dispatcher-owned ring, and 1011 TRY_RECV2 is a later queue-drain optimization on top of the already-landed dispatcher shape. Those are part of the current shipped system, but they belong in the pages that track the dispatcher hot path and current production state: gamma-channel-dispatcher, io_uring-architecture, and current-state.


3. Why the old bridge was wrong

The rejected shape was:

Dispatcher pthread:
  1. CHANNEL_RECV2: the dispatcher owns the request entry
  2. the handler submits an SQE; the deferred path returns to the channel receive loop, and the request is no longer owned by the dispatcher

Wineserver main thread:
  3. main_loop_epoll owns the uring wake; the CQE arrives later
  4. the main thread drains the CQE; the callback restores state and writes the reply
  5. CHANNEL_REPLY is issued

The ownership jump happens after SQE submit, so completion timing depends on the main-thread wake path.

Why it was rejected:
  - request receive, CQE drain, and reply signaling no longer live on one RT thread
  - reply ordering depends on main-thread wake timing and contention instead of dispatcher availability

The problem was not that the code path was impossible. The problem was that it was the wrong ownership model for an RT request path: receive, completion, and reply were split across two threads, so reply timing depended on main-thread wake behavior rather than dispatcher availability.

That shape showed up exactly where expected: real workloads tolerated it structurally, but timing-sensitive application behavior did not.


4. Kernel patch 1010

Patch 1010 adds NTSYNC_IOC_AGGREGATE_WAIT: a heterogeneous wait that combines NTSync object sources, pollable fd sources, and an optional absolute deadline.

The dispatcher is the first consumer, but the primitive is intentionally general.

Patch 1010: heterogeneous wait surface

Object sources:
  - event / mutex / semaphore
  - channel notify-only source
  - PI-visible registration for NTSync-backed sources

FD sources:
  - uring eventfd
  - future fd-poll / timer wake sources
  - poll semantics, no intrinsic PI owner

NTSYNC_IOC_AGGREGATE_WAIT flow:
  1. copy the source array
  2. register object waits and poll waits
  3. sleep once
  4. return fired_index + fired_events
  5. deadline expiry returns NTSYNC_AGG_TIMEOUT

Kernel follow-up required for production stability: 072bfee added the SEND_PI any-waiters fallback and wake-after-boost ordering so aggregate-waiting dispatchers inherit priority correctly.

UAPI shape


struct ntsync_aggregate_source {
    __u32 type;          /* NTSYNC_AGG_OBJECT | NTSYNC_AGG_FD */
    __u32 events;        /* FD source: POLLIN / POLLOUT / POLLERR / POLLHUP */
    __u64 handle_or_fd;  /* ntsync object handle, or unix fd */
};

struct ntsync_aggregate_wait_args {
    __u32 nb_sources;
    __u32 reserved;
    __u64 sources;       /* user pointer to struct ntsync_aggregate_source[] */
    struct __kernel_timespec deadline; /* CLOCK_MONOTONIC ABSTIME or {0,0} */
    __u32 fired_index;
    __u32 fired_events;
    __u32 flags;
    __u32 owner;
};

#define NTSYNC_AGG_OBJECT        0x1
#define NTSYNC_AGG_FD            0x2
#define NTSYNC_AGG_MAX           64
#define NTSYNC_AGG_FLAG_REALTIME 0x1
#define NTSYNC_AGG_TIMEOUT       0xFFFFFFFFu
#define NTSYNC_IOC_AGGREGATE_WAIT _IOWR('N', 0x95, struct ntsync_aggregate_wait_args)

Semantics that matter for gamma

  - the channel participates as a notify-only source: the wake tells the dispatcher a request is available, and the dispatcher follows with CHANNEL_RECV2, so the aggregate ioctl itself never consumes the entry
  - fired_index and fired_events report exactly which source fired, so the loop branches without re-polling every source
  - NTSync-backed sources register PI-visibly, so an aggregate-waiting dispatcher remains eligible for SEND_PI boost propagation
  - on kernels without patch 1010, the ioctl returns -ENOTTY and the dispatcher keeps the legacy direct CHANNEL_RECV2 loop

That last point is operationally important: public docs can describe the new default without pretending the code lost its rollback path.


5. Wine-NSPA Phase 2 and Phase 3

5.1 Phase 2: dispatcher-owned io_uring

Phase 2 did not make handlers async by itself. It put the ring and its state in the correct ownership domain first.

The old global-ring direction was abandoned. The landed design keeps one nspa_uring_instance per gamma channel / per Wine process, stored alongside the dispatcher context.


struct nspa_dispatcher_ctx {
    int channel_fd;
    int shutdown_efd;
    struct nspa_uring_instance uring;
};

Key properties:

  - one nspa_uring_instance per gamma channel / per Wine process, stored alongside the dispatcher context
  - the instance bundles the ring, its eventfd, and a fixed pending pool
  - nspa_dispatcher_ctx is the single owner of the channel fd, the shutdown eventfd, and the ring lifetime
  - the dispatcher frees its own context on exit, so ring teardown never leaves the dispatcher thread

5.2 Phase 3: aggregate-wait dispatcher loop

The dispatcher now waits on three sources:

  1. channel object: request available
  2. uring eventfd: completion available
  3. shutdown eventfd: process teardown requested

Phase 3 dispatcher topology:

Dispatcher context: channel fd + shutdown eventfd + nspa_uring_instance; one context per Wine process, freed by the dispatcher on exit.

  - source 0 (channel object): aggregate-wait fires, the dispatcher follows with CHANNEL_RECV2, and the handler runs under the existing global_lock discipline
  - source 1 (uring eventfd): the dispatcher drains the eventfd, nspa_uring_drain() runs inline on the dispatcher, and the CQE callback issues CHANNEL_REPLY on that same thread
  - source 2 (shutdown eventfd): the destroy path writes 1, aggregate-wait returns, and the dispatcher drains and frees its own context

Operational invariants: the same RT thread receives the request, drains completion, and signals the reply; aggregate-wait -ENOTTY selects the legacy direct CHANNEL_RECV2 loop.

5.3 Dispatcher behavior

The loop is now:

  1. build the aggregate source table from {channel, uring eventfd if active, shutdown eventfd}
  2. call NTSYNC_IOC_AGGREGATE_WAIT
  3. if the fired source is the channel: issue CHANNEL_RECV2, then run the handler under the existing global_lock discipline
  4. if the fired source is the uring eventfd: drain the eventfd, run nspa_uring_drain() inline, and let the CQE callback issue CHANNEL_REPLY on this same thread
  5. if the fired source is shutdown_efd: drain and free the dispatcher context, then exit the loop

5.4 Fallback behavior

Userspace still handles two older-kernel shapes:

  - kernels without patch 1010: NTSYNC_IOC_AGGREGATE_WAIT returns -ENOTTY, so the dispatcher disables aggregate-wait and runs the direct CHANNEL_RECV2 loop
  - kernels without patch 1005 thread-token support: CHANNEL_RECV2 returns -ENOTTY, so the dispatcher falls back to CHANNEL_RECV

That logic is runtime feature detection, not a release ladder:

Dispatcher compatibility decisions (probed once at dispatcher startup / first wait; the supported receive shape is cached in the dispatcher context):

  1. try NTSYNC_IOC_AGGREGATE_WAIT on channel object + uring eventfd + shutdown eventfd. This is the production path on post-1010 kernels. If it returns -ENOTTY, the kernel lacks patch 1010; disable aggregate-wait for this dispatcher.
  2. use the direct CHANNEL_RECV2 loop, the legacy pre-1010 dispatcher wait shape; the channel transport is still intact. If CHANNEL_RECV2 returns -ENOTTY, the kernel lacks patch 1005 thread-token support; disable RECV2 for this dispatcher.
  3. fall back to CHANNEL_RECV, the oldest supported channel shape; no thread token is carried in the receive result.

Steady-state production loop: aggregate-wait blocks once on channel, uring, and shutdown; the same thread receives, drains CQEs, and replies.

6. Validation and deployment

Production state

Item                                 | Value
Kernel module srcversion             | 10124FB81FDC76797EF1F91
Wine userspace state                 | Phase 2 + Phase 3 landed; Phase 4 create_file now uses the same ring
Default gate                         | NSPA_AGG_WAIT=1
Opt-out                              | NSPA_AGG_WAIT=0
Follow-on gates on top of this base  | NSPA_ENABLE_ASYNC_CREATE_FILE=1; NSPA_TRY_RECV2=1 on 1011 kernels

Validation results

Test                                       | Result
test-aggregate-wait                        | 9/9 PASS
channel-PI propagation sub-test            | PASS
1k mixed-concurrency stress                | PASS
30k stress + full native ntsync suite      | PASS, dmesg clean
PE matrix                                  | 24 PASS / 0 FAIL / 0 TIMEOUT, including dispatcher-burst
Ableton level 2/3 with NSPA_AGG_WAIT=1     | PASS
Phase 3 default-on under Ableton           | PASS

The follow-up kernel fixes in 072bfee matter here. The first 1010 cut exposed exactly the kind of PI edge that the dispatcher cannot tolerate: an aggregate-waiting dispatcher must still be visible to SEND_PI wake/boost logic and must not be woken before the new boost state is established. The production module includes those corrections.


7. Relationship to the broader decomposition plan

The public decomposition plan still has queued work in front of it, but the aggregate-wait story is no longer purely hypothetical.

Already shipped: the kernel wait primitive (patch 1010 plus the 072bfee PI fixes), the dispatcher-owned ring (Phase 2), the same-thread aggregate-wait loop (Phase 3), and the follow-on consumers noted earlier (Phase 4 async create_file and, on 1011 kernels, TRY_RECV2).

Still queued:

So the right interpretation is that aggregate-wait has moved from a planned primitive to a shipped foundation. That is a better architectural state than the earlier plan assumed: future work no longer needs to prove the syscall shape from scratch; it can build on a production consumer.


8. References