Wine-NSPA – Wineserver Decomposition: The Long-Horizon Plan

Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-28
Author: Jordan Johnston
Status: long-horizon architecture plan; aggregate-wait is already a shipped slice, while the broader timer/fd-poll splits remain future work.

This page explains the residual wineserver architecture problem after the existing bypasses, which decomposition slices have already landed, and which ones are still roadmap material.

Table of Contents

  1. What this doc is
  2. Where wineserver is today
  3. Two complementary directions
  4. NTSync extension proposals
  5. Decomposition path proposals
  6. What MUST stay in wineserver
  7. Phasing
  8. Why this isn’t a full rewrite
  9. Open questions
  10. Phase ladder diagram
  11. Cross-references

1. What this doc is

The NSPA bypass catalog describes the trajectories along which NT-API state moves out of wineserver. This doc is the other half: what eventually happens to wineserver itself, after enough state has migrated.

The high-level NSPA strategy has two coordinated parts. First, bypasses move specific classes of state client-side while retaining wineserver as fallback for cases the bypass cannot model. Second, decomposition reduces the residual wineserver to a smaller set of cooperating threads with subsystem-scoped locks instead of a single global_lock.

This document describes the decomposition side of that plan: the target wineserver shape and the dependency relationship between bypass work and server-internal restructuring.

A few framing notes before getting into the details:

Audience: a developer who has read the bypass overview, has skimmed the gamma-channel-dispatcher and ntsync-driver docs, and wants to understand the architectural arc that the bypass work enables. If you’re implementing a single bypass and want a checklist, read the bypass detail doc for that bypass; if you’re implementing a single phase from this roadmap, read the in-tree handoff doc (wine/nspa/docs/wineserver-decomposition-plan.md), which has line-level kernel landmarks. This doc is the why, not the how.


2. Where wineserver is today

Wineserver runs two RT threads in current NSPA configurations. Both serialize on a single pi_mutex_t global_lock, and that’s the dominant bottleneck.

Thread: Main loop
  Scheduler: SCHED_FIFO, nspa_srv_rt_prio (default 64)
  Holds global_lock: yes, around handler dispatch
  Wakes on: poll() / epoll_wait() over wineserver fds

Thread: Gamma channel dispatcher (one per client process)
  Scheduler: SCHED_FIFO, nspa_srv_rt_prio (64)
  Holds global_lock: yes, around handler dispatch
  Wakes on: NTSYNC_IOC_CHANNEL_RECV (futex-backed)

(There is no separate timer thread today; timers are handled inside the main loop’s get_next_timeout.)

The shape worth noting: the main loop’s wait primitive is poll() / epoll_wait() bounded by get_next_timeout(). The same syscall returns either when an fd is ready or when the next NT timer is due. Time-driven and event-driven processing are conflated into one wait primitive. That conflation is the seam Phase 3 is going to split along.

The dispatcher pthread runs one per client process: the kernel-mediated request channel (the “gamma channel” – see gamma-channel-dispatcher.md) replaces the older per-thread request dispatcher fan-out. So if a Wine application has 50 threads, there is still exactly one dispatcher pthread in wineserver handling all of them, and that dispatcher takes global_lock once per request.

Both RT threads serialize on the same lock. Adding more RT threads under the same lock doesn’t help, because the lock serializes them anyway. Adding RT threads with finer-grained locks does help, but only after we know which subsets of state are independent enough to lock separately.

A perf capture from 2026-04-26 (PREEMPT_RT, Ableton steady-state playback workload) shows the wineserver-resident hot symbols:

All of those run under global_lock. The wineserver process itself sits around 1% CPU at steady state – it’s a very lightly loaded process by throughput. But the question for an RT workload isn’t “how busy is the server” – it’s latency under contention, which is exactly what a single global lock tilts against. Every handler runs to completion under the lock; the variance of “how long is the lock held” propagates into every other request.

The open_fd lock-drop work shipped in Phase 1 attacked one specific instance of this – the long lock-holder during openat – and it measurably improved drum-track-load-while-playing because it carved out a window where the lock could be released around the slow syscall. Similar surgical fixes exist elsewhere, but the lock-drop pattern is fundamentally a workaround for the wrong-grain-of-locking problem. The real fix is to either move the work out (bypasses) or split the lock (Phase 4).


3. Two complementary directions

NSPA addresses the wineserver bottleneck along two complementary directions, and this doc is about the second one. They are not alternatives – they compose.

Direction A: move state out. Each NSPA bypass picks a class of NT-API state, hosts it in the client process via a local stub or kernel-mediated primitive, and falls back to the server only when the bypass envelope is exceeded. Sync primitives go to NTSync. File I/O goes to io_uring. Hooks get a Tier 1+2 cache. Read-only file opens go to local_file. Timers go to a per-process timer dispatcher. Cross-thread same-process messages go through msg-ring. Each bypass shrinks the residual surface that wineserver still has to authoritatively serve.

Direction B: restructure what remains. Once the residual surface is small enough, wineserver can be split into multiple cooperating threads with finer-grained locks. The split has three components:

  1. Kernel-side primitives. Extend NTSync with the wait/dispatch primitives wineserver needs to do its work without the main-loop conflation (aggregate-wait, thread-token pass-through). These are kernel patches in ntsync-patches/.
  2. Userspace dispatcher decomposition. Split the gamma dispatcher’s RECV → handler → REPLY into router (fast-path classifier) + handler (slow-path), and split the main loop’s poll into a non-RT FD polling thread that hands off to RT handlers.
  3. Lock partitioning. Target state: per-subsystem locks (windows, hooks, files, sync, processes) instead of one global_lock.

This is the doc for direction B. The two directions interact in a specific way: Direction A reduces the amount of state still under the lock; Direction B reduces the overhead per access to what’s left. Direction A is incremental, parallelizable across many bypasses, and starts paying immediately. Direction B has higher per-step risk and more design surface, and it pays its big dividends only after the surface has already been pruned. That ordering is the central design choice of the whole roadmap.

Direction A reduces the amount of state still owned by wineserver. Direction B reduces the dispatch and locking cost of the state that remains. The ordering matters: reducing the residual surface first lowers the risk and audit cost of later server-internal restructuring.

There is a third direction worth naming explicitly even though it’s not a separate workstream: lock discipline inside existing handlers. The Phase B open_fd lock-drop is the canonical example – a single handler that holds global_lock across a slow blocking syscall, fixed by carefully releasing the lock around the syscall and reacquiring with a generation check. That kind of work doesn’t move state out (Direction A) and doesn’t restructure the threading (Direction B); it just reduces the lock-hold duration of one specific handler. It’s surgical and labour-intensive, but several handlers benefit from it and the wins are immediate. It compounds with both other directions: a handler whose lock-hold has been minimized is a smaller obstacle once aggregate-wait or lock partitioning lands.
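The lock-drop pattern can be sketched in a few lines. This is a hedged miniature with hypothetical names (the real server uses a pi_mutex_t and real fd-table state): snapshot a generation counter under the lock, release the lock across the slow syscall, reacquire, and check whether the state the handler depends on changed in the window.

```c
#include <pthread.h>

/* Stand-ins for global_lock (a pi_mutex_t in the real server) and the
 * fd-table state it guards. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int fd_table_generation;   /* bumped on any fd-table mutation */

/* What every other handler does when it mutates the fd table. */
static void mutate_fd_table(void)
{
    pthread_mutex_lock(&global_lock);
    fd_table_generation++;
    pthread_mutex_unlock(&global_lock);
}

/* The lock-drop pattern. Returns 1 if the pre-syscall snapshot is still
 * valid on reacquire, 0 if a concurrent mutation means the handler must
 * re-resolve its state. simulate_race stands in for another handler
 * running while the lock is dropped. */
static int open_with_lock_drop(int simulate_race)
{
    pthread_mutex_lock(&global_lock);
    unsigned int gen = fd_table_generation;   /* snapshot under the lock */
    pthread_mutex_unlock(&global_lock);       /* drop around the slow openat() */

    if (simulate_race)
        mutate_fd_table();                    /* a concurrent handler ran here */

    pthread_mutex_lock(&global_lock);
    int still_valid = (fd_table_generation == gen);
    pthread_mutex_unlock(&global_lock);
    return still_valid;
}
```

The labour-intensive part in the real handlers is not this skeleton but proving what "re-resolve" means for each piece of state read before the drop.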

The decomposition arc treats lock-discipline patches as Phase 1 – “individually surgical fixes to the worst lock-holders” – and otherwise leaves them as ongoing work that ships independently. The Phase 1 row in the phase table represents the entire family, not just open_fd.

[Diagram: wineserver decomposition arc – from one lock domain to a narrowed metadata core]

Today: the current wineserver process – main loop (poll/epoll + get_next_timeout + handlers) and gamma dispatcher (CHANNEL_RECV -> handler -> REPLY), all inside a single global_lock domain (windows, hooks, files, timers, queues, sync, registration); every request serializes here.

Bypass-led transition: state moves out first (msg-ring, local-file, io_uring, hooks, timers); kernel primitives grow (channel RECV2, aggregate-wait, PI handoff); dispatch becomes separable (timer split, FD poll split; router/handler split begins to pay); lock-hold patches continue shipping.

Target shape: long-horizon wineserver – an RT handler tier (aggregate-wait over channel + fd queue + timer queue), non-RT helpers (fd polling, timer wake sources, router staging), and a metadata core only (naming, lifecycle, inheritance/handle coordination, named sync registration, NT path rules); lock partitioning only after the surface shrinks.

Ordering: move state out first, then split waits/threads, then partition residual locks.

3.1 Priority inheritance across the splits

A reasonable concern when introducing more threads into wineserver is: does priority inheritance still propagate correctly? The answer depends on what’s holding what.

In current NSPA wineserver, PI propagates through two paths. First, the gamma channel: a client SEND_PI boosts the dispatcher pthread to the sender’s priority (kernel-mediated), and the kernel re-boosts on each RECV pop to the popped entry’s priority for the duration of the handler – so PI tracks the highest-priority pending request automatically. Second, global_lock itself is a pi_mutex_t; any thread blocked on it boosts the holder.

Phase 3 introduces three new threads. PI behaviour for each:

The pattern that emerges: as long as every thread-to-thread handoff inside wineserver goes through an NTSync primitive that carries priority (channel, event with SET_PI), PI propagates end-to-end. As soon as a handoff goes through a bare userspace queue (a pi_mutex_t-protected list with no PI signal), priority propagation breaks and the highest-priority pending request can be starved by lower-priority work. That observation alone is a design constraint on Phase 3 / Phase 4: every handoff queue needs an NTSync event as its waiter primitive, not just a bare condition variable.

This is also why the aggregate-wait extension matters strategically. With aggregate-wait, a handler thread can wait on (incoming channel events, FD-event queue NTSync event, timer-deadline NTSync event) and the kernel keeps PI consistent across all of them. Without aggregate-wait, we either fragment the wait primitives (one thread per wait shape, more handoff queues, more places to drop PI) or lose the cleanliness of the boost propagation.


4. NTSync extension proposals

NTSync is the kernel module NSPA owns and extends (ntsync-patches/, ntsync-driver.gen.html). Two extensions are relevant to wineserver decomposition. The first is fully shipped; the second has shipped as a kernel/userspace slice, with its broader decomposition consumers still ahead.

4.1 Thread-token pass-through (shipped)

Status: shipped 2026-04-26 (T1/T2/T3, default-on as of post-1006 unblocking). Listed here for completeness; the implementation is described in gamma-channel-dispatcher.md.

The problem this solved: on every channel request, the dispatcher called get_thread_from_id((thread_id_t)recv.payload_off), which called get_ptid_entry(id), an indexed array lookup with a possible cache miss in process.c:547. At 10% of dispatcher CPU in steady-state playback, this was meaningful overhead and – more importantly – a cache-miss-prone source of latency variance on every channel request.

The fix: extend NTSYNC_IOC_CHANNEL_RECV to return a thread_token that wineserver populates at thread create time. Wineserver registers (tid, struct thread *) via the new NTSYNC_IOC_CHANNEL_REGISTER_THREAD ioctl on thread create, deregisters on thread die, and on the receiving side reads the kernel-stamped token directly with no userspace lookup. Lifetime safety is preserved by the register-before-first-send / deregister-after-last-reply invariants.

Why it lives in this doc: it’s the first NTSync extension specifically targeted at making wineserver do less per-request, and it’s the prototype for the 4.2 aggregate-wait extension that would follow. The pattern (register a userspace pointer with the kernel; have the kernel hand it back at the dispatch event) is the same pattern aggregate-wait would extend to wider state.

The trust model – which generalizes to 4.2 – is “wineserver is trusted by the kernel because wineserver provided the registration; the client cannot influence what’s stored.” That’s the right design for kernel objects whose userspace owner is privileged in the relevant sense (the wineserver process, which runs as the same UID as its clients but is the source of truth for the cross-process semantics layered on top of NTSync).

4.2 Aggregate-wait primitive (shipped slice)

Status: kernel primitive + first userspace consumer shipped 2026-04-29. The broader decomposition consumers (timer-thread split + FD poll thread split) remain queued, but NTSYNC_IOC_AGGREGATE_WAIT itself is no longer hypothetical and is already default-on in the gamma dispatcher via NSPA_AGG_WAIT.

The problem: wineserver’s main loop today waits via poll() / epoll_wait() over wineserver fds. It does not compose with NTSync object waits. The dispatcher pthread waits via NTSYNC_IOC_CHANNEL_RECV (futex-backed); it does not compose with fd readiness or NT timer deadlines. Each thread has exactly one wait primitive and one shape of wakeup, and the two shapes can’t merge into a single waiter.

That fragmentation is workable today because wineserver is not yet thread-decomposed. The main loop doesn’t need to wait on NTSync objects; the dispatcher doesn’t need to wait on fds. Once we move toward “single RT thread does everything except FD polling” (Phase 3) the unification matters: the RT thread needs to wait on the gamma channel, on a poll-set of wineserver fds, and on the next NT timer deadline, all in one syscall, with PI propagation from the channel sender.

The landed ioctl is NTSYNC_IOC_AGGREGATE_WAIT, which takes a heterogeneous source set:

The wait wakes on whichever source fires first, and reports which source fired and (for FDs) which events. PI propagates from NTSync sources that carry a sender priority (channel SEND_PI, event SET_PI, etc.); FD readiness has no inherent priority.

The cost is moderate:

The win is moderate-to-large but bounded by the lock. Even with one unified waiter, every handler still serializes on global_lock. That honest accounting did not change when the primitive shipped: aggregate-wait is useful today because it fixes the gamma async-completion ownership problem, but it becomes strategically larger once the timer and fd-poll splits also compose with it.

A separate consideration is the PREEMPT_RT epoll question. The runtime gate NSPA_DISABLE_EPOLL lets us A/B plain poll() against epoll_wait() on PREEMPT_RT. If epoll behaves cleanly under the workload (no priority inversions on its internal RT-mutex-converted locks), the urgency on the FD polling thread split (5.3) drops, and aggregate-wait may be the right unification anyway – but for the right reasons (composition with NTSync, not avoiding epoll). The decision belongs in the same session that designs the aggregate-wait API.
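The A/B gate itself is a small amount of plumbing. A hedged sketch in the spirit of NSPA_DISABLE_EPOLL (the function name is illustrative, and a real server would keep one persistent epoll instance instead of creating one per call): one wait entry point, two interchangeable backends.

```c
#include <poll.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Wait for one fd to become readable, via poll() when the gate is set and
 * epoll otherwise. Returns >0 if ready, 0 on timeout, <0 on error. The
 * per-call epoll_create1 is sketch-only brevity; a real main loop keeps a
 * long-lived epoll set it mutates with EPOLL_CTL_ADD/DEL. */
static int wait_fd_ready(int fd, int timeout_ms)
{
    if (getenv("NSPA_DISABLE_EPOLL"))
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        return poll(&pfd, 1, timeout_ms);
    }
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    return n;
}
```

Because the two backends answer the same question, the A/B comparison on PREEMPT_RT is purely a latency-distribution measurement, not a behavioural one.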


5. Decomposition path proposals

These four splits describe the userspace side of the road map. They are independent in implementation but ordered in the phasing because the risk profile varies and because some splits depend on prior infrastructure being in place.

5.1 Timer thread split

Status: queued for Phase 3. Behaviour-preserving structural change; first safe split.

Today, wineserver’s main loop computes get_next_timeout() from the head of the NT timer queue, passes that timeout to poll(), and processes timer expirations when poll returns due to timeout (rather than fd readiness). That couples timer-driven wakeups to fd-driven wakeups in the same syscall.

The proposal: a dedicated timer thread that owns the NT timer queue.

This is the first safe split for several reasons:

The risk is medium. Timer expiration must happen under global_lock to avoid races with handlers that read or write the same NT timer state, so the new thread is another lock contender. Today there are two RT lock-takers on the lock (main loop, dispatcher); after the split there are three. That doesn’t directly hurt – the lock is held briefly during timer processing – but it is one more thread whose latency is sensitive to lock contention. Pairing this with the aggregate-wait extension (4.2) lets us evaluate whether the timer thread can use aggregate-wait to also watch for timer-add notifications, simplifying the wakeup signaling.

A subtlety worth flagging: NT timer semantics are mutable. NT code can create, modify, or cancel a timer at any moment. The timer thread needs to react to deadline changes between iterations. The cleanest signal is the one in the proposal (pthread_kill to interrupt the sleep, recompute, sleep again). An alternative is for the timer thread to also wait on a futex that fires on add/cancel; either works, and the choice is a design detail rather than a fundamental question.

5.2 Router / handler split

Status: queued for Phase 4 (long horizon). Pays its design cost as state migrates out.

Today, the gamma dispatcher pthread does CHANNEL_RECV → grab global_lock → read_request_shm → run handler → release lock → CHANNEL_REPLY in a tight loop. Every request takes the lock. Most requests touch only a small subset of wineserver state.

The proposal is to split the dispatcher into two tiers:

Today, the fast-path classifier would return “slow path” for every request type. There is no request that doesn’t go through the existing handler. So the split is initially behaviour-neutral: every request still ends up running under global_lock, just with one more queue hop.

The split pays over time, as state migrates out. Once enough state lives client-side (NT-local stubs, redirect tables in shared memory, hook caches), more request types qualify for the fast path. Trivial queries – “is this handle valid?” “what’s the size of this object?” “is this thread alive?” – become candidates. Cross-process queries that can be answered from shared metadata become candidates. Each migration is a small change to the classifier: add a request type to the fast-path set, validate, ship.

The reason this is a Phase 4 item rather than a Phase 3 item: it has zero immediate impact. The fast-path set is empty today. Designing the classifier framework before there are clients for it risks over-engineering. Better to wait until a few obvious fast-path candidates exist (the bypasses ahead make this likely – e.g. the GetMessage bypass turns a class of message-pump traffic into a candidate; the redraw push ring already shifts state shapes that could be queried fast-path).

The risk of the split itself is low (it’s mechanical). The actual hard work is the per-request-type fast-path classification: deciding whether request type X is eligible, validating that the eligibility logic is correct under all envelope conditions, A/B'ing.

5.3 FD polling thread split

Status: queued for Phase 3. Decision contingent on PREEMPT_RT epoll experiment outcome.

Today, the main loop is RT and spends most of its time blocked in poll() or epoll_wait(). The wait itself doesn’t actually need RT priority – only the response to the wait does. RT priority matters for the work that happens after the wait returns, not for the act of sleeping in the kernel.

The proposal: separate the FD polling from the FD-event handling.

The reason for non-RT polling: RT priority on a thread that’s sleeping in the kernel doesn’t change wakeup latency. The kernel wakes the thread when an fd is ready, regardless of scheduler class. RT priority helps once the thread is awake and competing for CPU – but at that point we’ve already done a context switch into the polling thread; the cost is paid. Having the polling thread immediately hand off to a separate RT thread keeps the RT scheduler attention focused on the work that benefits from it.
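The handoff shape can be sketched with two threads and a one-slot queue. This is a hedged miniature with illustrative names: in a real NSPA build the queue's waiter primitive would be an NTSync event so PI carries across the handoff (see 3.1), whereas a bare condvar stands in here.

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* One-slot handoff queue between the polling tier and the handler tier. */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
static int ready_fd = -1;

/* The (would-be non-RT) polling thread: blocks in the kernel, then
 * enqueues the ready fd and wakes the handler tier. */
static void *poller(void *arg)
{
    struct pollfd pfd = { .fd = *(int *)arg, .events = POLLIN };
    poll(&pfd, 1, -1);
    pthread_mutex_lock(&q_lock);
    ready_fd = pfd.fd;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

/* The (would-be RT) handler thread: the only tier that does request work.
 * Stores the byte it drained into *arg so callers can observe it. */
static void *handler(void *arg)
{
    char buf = 0;
    pthread_mutex_lock(&q_lock);
    while (ready_fd < 0)
        pthread_cond_wait(&q_nonempty, &q_lock);
    int fd = ready_fd;
    pthread_mutex_unlock(&q_lock);
    if (read(fd, &buf, 1) == 1)     /* drain, then dispatch the request */
        *(char *)arg = buf;
    return NULL;
}
```

The extra hop discussed below is visible in the sketch: fd ready, poller wakes, enqueue, handler wakes, work runs.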

The win compounds with the timer split (5.1) and aggregate-wait (4.2): after both, the wineserver main loop becomes a pure handler loop with no poll() calls of its own. The handler loop is the natural home for an aggregate-wait that watches the gamma channel, the FD-event queue, and the timer queue at once.

The risk is moderate. The handoff queue adds an extra context switch per fd-driven request: “fd ready” → polling thread wakes → enqueues → handler thread wakes → runs. Today that’s “fd ready → main loop wakes → runs” – one fewer context switch. Whether that latency increase matters depends on which fds carry latency-critical traffic. Most wineserver fds are control plane (request channels, sockets to clients), not data plane; the latency of “client request enqueued” to “server starts processing” is dominated by the existing channel + lock costs, not by an extra wakeup hop.

The other risk is the PREEMPT_RT epoll behavior. If epoll on PREEMPT_RT is adequate for the workload (the runtime A/B via NSPA_DISABLE_EPOLL will determine this), the urgency on this split drops. If epoll shows real priority inversions on its internal locks, the split becomes both an architectural and correctness requirement.

5.4 Lock partitioning

Status: long horizon, Phase 4. Don’t start until 2-3 subsystems have already been pruned.

Current lock state: one global_lock (a pi_mutex_t) covers all wineserver state. Every handler takes it. The lock is a serialization point for every Win32 process running on the system.

The proposal: per-subsystem locks. Windows, hooks, files, sync objects, processes, message queues – each with its own lock. Handlers grab only the lock(s) for the subsystem they touch. Cross-subsystem operations (rare) take multiple locks in a canonical order to avoid deadlock.

This is the only thing that lets multiple handlers run concurrently on the same wineserver process. Until it lands, every other split is ultimately bottlenecked at the lock; multi-threaded wineserver under one global lock is no better at throughput than single-threaded wineserver under one global lock, and is worse at latency variance because more threads contend for the same lock.

It’s also, by a wide margin, the hardest split. The reasons it is the last thing to do:

The recommendation: do not start this work until at least 2-3 subsystems (probably files, sync, hooks) have been moved fully out of wineserver and the remaining lock-holders are identifiable as a small, audit-able set. Until then, ship bypasses, ship the other splits, and let the surface shrink. When the time comes, lock partitioning is the surgical conclusion of the whole strangler arc – not its centerpiece.


6. What MUST stay in wineserver

These are the surfaces for which wineserver remains the source of truth and which no bypass or kernel primitive eliminates. The residual wineserver remains a metadata service for these:

[Diagram: what remains in wineserver after the bypass arc]

Residual wineserver metadata core: cross-process object naming; process/thread lifecycle; handle inheritance and duplication coordination; named sync registration; NT path and object-directory rules; object types with no clean Linux analogue.

Already moving out: sync waits, file I/O, hooks. Client-local transports: msg-ring, local-file, local timers. Kernel assist: NTSync, futex PI, io_uring. Future split points: aggregate-wait, handler tiers, lock partitions.

Design goal: the server stops being the default execution path and becomes the authoritative broker for the small set of semantics that must stay centralized.

These are small relative to what can move out. Windows, hooks, file inodes, message queues, timers, sync primitives, and file I/O are already in flight or shipped client-side. The residual wineserver becomes a thin metadata service that answers cross-process naming questions and brokers lifecycle events, not an application server that runs handlers for every NT call.

This list is also why the strategy is “decompose, not delete.” A from-scratch replacement would have to re-implement all of the above plus everything that hasn’t been moved yet. Decomposition keeps the existing implementations of the must-stay items and just rearranges how they’re locked and dispatched.


7. Phasing

The single canonical phase table for the decomposition arc. This table covers the four phases of decomposition itself; bypass trajectories are tracked separately in their own subsystem docs.

Phase 1: Phase B open_fd lock-drop. Status: shipped, default-on.
Phase 2: NTSync §2.1 thread-token pass-through (T1/T2/T3). Status: shipped, default-on.
Phase 3: Timer thread split (5.1) + FD poll thread split (5.3), composed around shipped aggregate-wait (4.2). Status: queued.
Phase 4: Router/handler split (5.2) + lock partitioning (5.4). Status: long horizon.

Each phase ships discrete, testable, revertible wins. The architecture direction stays clear (less wineserver, less global_lock, more event-driven RT primitives) but every phase is independently valuable.

A few notes about the ordering:

Most importantly: each phase ships independently. There is no big-bang. If Phase 3 stalls, Phase 4 doesn’t unblock anything that Phase 1+2 didn’t already unblock; the bypasses keep shipping in parallel. The decomposition arc and the bypass arc progress independently, each sometimes accelerating the other but neither blocking it.


8. Why this isn’t a full rewrite

The natural alternative to this plan is: rewrite wineserver from scratch with the architecture you wish it had. Multi-threaded by design, fine-grained locks, modern wait primitives, no global_lock. Wine’s existing wineserver has a lot of accumulated assumptions (“nothing else changes during my handler”) that a clean-slate rewrite could just not have.

There are real reasons NSPA chose decomposition over rewrite:

The cost of decomposition is a slight cost in design uniformity. Each phase has its own approach, its own gating env var, its own validation discipline. There’s no single “wineserver 2.0” that you can point at; instead there’s a wineserver that’s been progressively reshaped. That is, on net, the right trade for a project that has to ship usable improvements continuously rather than commit to a multi-quarter rewrite.

A useful framing: bypasses and decomposition use the same incremental-migration discipline on different surfaces. Bypasses move NT-API state; decomposition restructures the remaining wineserver internals.

8.1 The validation discipline

A constraint that runs through the whole arc: each phase has to pass real-workload validation before it can flip default-on. The default workload is Ableton-on-PREEMPT_RT under realistic plugin load – it exercises the message pump, the file I/O paths, the sync primitives, the timer paths, and the audio RT thread all simultaneously. If a change breaks Ableton or introduces measurable xrun regressions, it stays default-off until the cause is found and fixed.

This discipline has caught real bugs. The post-1006 ntsync work re-validated several “shipped” bypasses against a kernel module that finally didn’t lock the host; the validation found that some of the lockup attribution had been wrong (Phase B open_fd was blamed for a lockup that turned out to be an unrelated NTSync slab corruption). Without re-validation under stable conditions, the wrong bypass would have stayed gated.

The implication for Phase 3: every component split needs its own gate (NSPA_TIMER_THREAD_SPLIT=1, NSPA_FD_POLL_THREAD=1, and so on), its own validation plan, and independent combination testing. NSPA_AGG_WAIT already followed that path and flipped default-on after validation; the remaining pieces should be held to the same discipline.


9. Open questions

These are the unresolved design questions ahead of Phase 3. None blocks Phase 1 or Phase 2 (both already shipped), but each wants an answer before the corresponding piece of Phase 3 ships.

  1. NTSync trust model for thread-token registration. Does the kernel hold a strong ref on the thread struct so the token is always valid when returned, or does wineserver clear the token atomically with thread destroy? The former is simpler; the latter has lower kernel memory footprint. Phase 2 chose “wineserver-side clear-on-destroy” on the strength of the register-before-first-send / deregister-after-last-reply invariant. The same question recurs for any future NTSync-side token registry (sections, timers, IOCP completions).
  2. Aggregate-wait fairness. If multiple sources are ready simultaneously, how are they ordered? For NTSync-object sources, “priority of the waker” is the obvious answer (it’s how SEND_PI / SET_PI already work). For FD readiness, there is no waker priority. The aggregate-wait API needs a tie-break rule – probably “object sources first, ordered by waker priority; FD sources second, ordered by registration order” – but the call has not been made.
  3. Timer thread vs NT timer mutability. NT timers can be created, modified, or destroyed at any time. The timer thread needs to react to deadline changes between iterations. Two clean signals: pthread_kill(timer_thread, SIGRTMIN) to interrupt clock_nanosleep and force recompute, or have the timer thread also wait on a futex that fires on add/cancel. Aggregate-wait (4.2) makes this trivial: the timer thread waits on (NT timer queue head deadline, futex on add/cancel) and reacts to whichever fires. So this is partially a question of “does timer-split land before aggregate-wait or after?”
  4. Strangler vs growth. As the wineserver-decomposition direction continues, do we keep wineserver largely stable while pruning, or actively rewrite the parts that remain? The default recommendation is strangler – keep the existing handler bodies, change only the dispatch and locking. But there are individual subsystems where a partial rewrite of the handler (not the architecture) might be cleaner once it’s been pruned to a small surface. That call is per-subsystem and shouldn’t be made up front.
  5. PREEMPT_RT epoll experiment outcome. NSPA_DISABLE_EPOLL (90231fc8d21) lets us A/B plain poll() vs epoll_wait() on PREEMPT_RT without rebuilding. If epoll behaves cleanly under the workload, the urgency on the FD poll thread split (5.3) drops; if it shows priority inversions on its internal RT-mutex-converted locks, the split moves up the priority list. The experiment should land before Phase 3 design is finalized.
  6. Where does inproc_sync fit? The in-tree server/inproc_sync.c already handles a class of intra-process sync operations without round-tripping through the dispatcher. Some of its design lessons – per-process state, ioctl-direct dispatch – generalize to other request types, and the question is whether inproc_sync becomes a model for further router/handler-split fast paths or stays a one-off.
  7. Handler queue priority discipline. If the gamma dispatcher splits into router + handler tiers (5.2), the handoff queue between them needs a priority-respecting drain order. NTSync gives us PI on the channel; once a request is on a userspace queue inside wineserver, PI doesn’t automatically follow. The queue drain probably needs to use NTSync as its waiter primitive (an event per handler, signalled from the router) so PI re-applies on the handoff. Not a blocker; a design detail.
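The drain-order rule in question 7 is simple to state in code. A hedged sketch with illustrative names (a real version would signal an NTSync event on enqueue so PI re-applies; only the ordering rule is shown here): highest priority first, FIFO within a priority.

```c
/* One router-to-handler request: channel priority plus arrival order. */
#define QMAX 64
struct request { int prio; int seq; };
static struct request queue[QMAX];
static int qlen;

static void enqueue(int prio, int seq)
{
    queue[qlen].prio = prio;
    queue[qlen].seq = seq;
    qlen++;
}

/* Pop the highest-priority request; because the scan takes the first
 * maximum and entries are stored in arrival order, ties drain FIFO. */
static struct request dequeue(void)
{
    int best = 0;
    for (int i = 1; i < qlen; i++)
        if (queue[i].prio > queue[best].prio)
            best = i;
    struct request r = queue[best];
    for (int i = best; i < qlen - 1; i++)
        queue[i] = queue[i + 1];
    qlen--;
    return r;
}
```

An O(n) scan is fine at wineserver queue depths; the interesting property is the ordering contract, which the NTSync-event handoff must preserve.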

10. Phase ladder diagram

A vertical phase ladder. Phases 1 and 2 are below the line (“done”); Phases 3 and 4 are above the line (“ahead”). The components of each phase are listed inside the phase block.

[Diagram: wineserver decomposition phase ladder – bottom = shipped, top = horizon]

Phase 1 (SHIPPED, default-on 2026-04-26) – surgical lock release: Phase B open_fd lock-drop; release global_lock around openat (long syscall); drum-track-load-while-playing xrun fix; pattern for NSPA-side lock-discipline patches in handlers.

Phase 2 (SHIPPED, default-on 2026-04-26) – first NTSync extension: NTSync §2.1 thread-token pass-through (T1: ioctls; T2: register/deregister; T3: dispatcher consumes); drops get_ptid_entry from ~10% of dispatcher CPU; prototype for further kernel-side primitives.

Phase 3 (QUEUED) – co-designed thread split + kernel primitive (aggregate-wait + decomposition): 5.1 timer thread split (separate time-driven from event-driven wakeup); 4.2 NTSync §2.2 aggregate-wait (unified waiter: NTSync objects + FDs + deadline); 5.3 FD poll thread split (non-RT polling, RT handler handoff).

Phase 4 (LONG HORIZON; begins after 2-3 subsystems have moved out via bypasses) – surgical conclusion: 5.2 router/handler split (fast-path classifier, pays as state migrates out; handler queue with NTSync-mediated handoff); 5.4 lock partitioning (per-subsystem locks: windows/hooks/files/sync; late-stage split with a massive audit surface).

Each phase ships independently; bypasses progress in parallel.

The visual point of the ladder: the bottom two phases are done, the middle phase is the next major piece of architectural work, and the top phase only starts once the bypass arc has shrunk the surface enough to make it tractable. Each rung is independently valuable; nothing requires the rung above it before it can ship.


11. Cross-references