Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-28
Author: Jordan Johnston
Status: long-horizon architecture plan; aggregate-wait is already a shipped slice, while the broader timer/fd-poll splits remain future work.
This page explains the residual wineserver architecture problem after the existing bypasses, which decomposition slices have already landed, and which ones are still roadmap material.
The NSPA bypass catalog describes the trajectories along which NT-API state moves out of wineserver. This doc is the other half: what eventually happens to wineserver itself, after enough state has migrated.
The high-level NSPA strategy has two coordinated parts. First, bypasses move specific classes of state client-side while retaining wineserver as fallback for cases the bypass cannot model. Second, decomposition reduces the residual wineserver to a smaller set of cooperating threads with subsystem-scoped locks instead of a single global_lock.
This document describes the decomposition side of that plan: the target wineserver shape and the dependency relationship between bypass work and server-internal restructuring.
A few framing notes before getting into the details:
Status snapshot: two decomposition slices are already shipped default-on (the Phase 1 open_fd lock-drop, and the NTSync §2.1 thread-token pass-through), and one kernel/userspace slice that used to live under Phase 3 is now also shipped: NTSync aggregate-wait plus the gamma dispatcher consumer. The phase table in section 7 is the canonical status; the body sections describe each component split in isolation.

The audience this doc is written for: a developer who has read the bypass overview, has skimmed the gamma-channel-dispatcher and ntsync-driver docs, and wants to understand the architectural arc that the bypass work enables. If you’re implementing a single bypass and want a checklist, read the bypass detail doc for that bypass; if you’re implementing a single phase from this road map, read the in-tree handoff doc (wine/nspa/docs/wineserver-decomposition-plan.md), which has line-level kernel landmarks. This doc is the why, not the how.
Wineserver runs two RT threads in current NSPA configurations. Both serialize on a single pi_mutex_t global_lock, and that’s the dominant bottleneck.
| Thread | Scheduler | Priority | Holds global_lock | Wakes on |
|---|---|---|---|---|
| Main loop | SCHED_FIFO | nspa_srv_rt_prio (default 64) | yes, around handler dispatch | poll() / epoll_wait() over wineserver fds |
| Gamma channel dispatcher (1 per client process) | SCHED_FIFO | nspa_srv_rt_prio (64) | yes, around handler dispatch | NTSYNC_IOC_CHANNEL_RECV (futex-backed) |
| (no separate timer thread today) | – | – | – | timers handled inside main loop’s get_next_timeout |
The shape worth noting: the main loop’s wait primitive is poll() / epoll_wait() bounded by get_next_timeout(). The same syscall returns either when an fd is ready or when the next NT timer is due. Time-driven and event-driven processing are conflated into one wait primitive. That conflation is the seam Phase 3 is going to split along.
The dispatcher pthread runs one per client process: the kernel-mediated request channel (the “gamma channel” – see gamma-channel-dispatcher.md) replaces the older per-thread request dispatcher fan-out. So if a Wine application has 50 threads, there is still exactly one dispatcher pthread in wineserver handling all of them, and that dispatcher takes global_lock once per request.
Both RT threads serialize on the same lock. Adding more RT threads under the same lock doesn’t help, because the lock serialises them anyway. Adding RT threads with finer-grained locks helps, but only after we know which subsets of state are independent enough to lock separately.
A perf capture from 2026-04-26 (PREEMPT_RT, Ableton steady-state playback workload) shows the wineserver-resident hot symbols:
- channel_dispatcher – 6-11%
- get_ptid_entry – 1-10% (called from get_thread_from_id in the dispatcher)
- main_loop_epoll – 2-7%
- ioctl – 5-7%
- read_request_shm – 2-3%
- nspa_redraw_ring_drain – 1-4%
- get_next_timeout – 2-3%

All of those run under global_lock. The wineserver process itself sits around 1% CPU at steady state – it’s a very lightly loaded process by throughput. But the question for an RT workload isn’t “how busy is the server” – it’s latency under contention, which is exactly what a single global lock tilts against. Every handler runs to completion under the lock; the variance of “how long is the lock held” propagates into every other request.
The open_fd lock-drop work shipped in Phase 1 attacked one specific instance of this – the long lock-holder during openat – and it measurably improved drum-track-load-while-playing because it carved out a window where the lock could be released around the slow syscall. Similar surgical fixes exist elsewhere, but the lock-drop pattern is fundamentally a workaround for the wrong-grain-of-locking problem. The real fix is to either move the work out (bypasses) or split the lock (Phase 4).
NSPA addresses the wineserver bottleneck along two complementary directions, and this doc is about the second one. They are not alternatives – they compose.
Direction A: move state out. Each NSPA bypass picks a class of NT-API state, hosts it in the client process via a local stub or kernel-mediated primitive, and falls back to the server only when the bypass envelope is exceeded. Sync primitives go to NTSync. File I/O goes to io_uring. Hooks get a Tier 1+2 cache. Read-only file opens go to local_file. Timers go to a per-process timer dispatcher. Cross-thread same-process messages go through msg-ring. Each bypass shrinks the residual surface that wineserver still has to authoritatively serve.
Direction B: restructure what remains. Once the residual surface is small enough, wineserver can be split into multiple cooperating threads with finer-grained locks. The split has three components:
- Kernel-side NTSync extensions (section 4) that give wineserver threads richer wait and dispatch primitives; these live in ntsync-patches/.
- Userspace thread splits (sections 5.1–5.3) that separate timer, routing, and FD-polling work into dedicated threads.
- Lock partitioning (section 5.4) that replaces the single global_lock.

This is the doc for direction B. The two directions interact in a specific way: Direction A reduces the amount of state still under the lock; Direction B reduces the overhead per access to what’s left. Direction A is incremental, parallelizable across many bypasses, and starts paying immediately. Direction B has higher per-step risk and more design surface, and it pays its big dividends only after the surface has already been pruned. That ordering is the central design choice of the whole roadmap.
Direction A reduces the amount of state still owned by wineserver. Direction B reduces the dispatch and locking cost of the state that remains. The ordering matters: reducing the residual surface first lowers the risk and audit cost of later server-internal restructuring.
There is a third direction worth naming explicitly even though it’s not a separate workstream: lock discipline inside existing handlers. The Phase B open_fd lock-drop is the canonical example – a single handler that holds global_lock across a slow blocking syscall, fixed by carefully releasing the lock around the syscall and reacquiring with a generation check. That kind of work doesn’t move state out (Direction A) and doesn’t restructure the threading (Direction B); it just reduces the lock-hold duration of one specific handler. It’s surgical and labour-intensive, but several handlers benefit from it and the wins are immediate. It compounds with both other directions: a handler whose lock-hold has been minimized is a smaller obstacle once aggregate-wait or lock partitioning lands.
The decomposition arc treats lock-discipline patches as Phase 1 – “individually surgical fixes to the worst lock-holders” – and otherwise leaves them as ongoing work that ships independently. The Phase 1 row in the phase table represents the entire family, not just open_fd.
A reasonable concern when introducing more threads into wineserver is: does priority inheritance still propagate correctly? The answer depends on what’s holding what.
In current NSPA wineserver, PI propagates through two paths. First, the gamma channel: a client SEND_PI boosts the dispatcher pthread to the sender’s priority (kernel-mediated), and the kernel re-boosts on each RECV pop to the popped entry’s priority for the duration of the handler – so PI tracks the highest-priority pending request automatically. Second, global_lock itself is a pi_mutex_t; any thread blocked on it boosts the holder.
Phase 3 introduces three new threads. PI behaviour for each:
- The timer thread takes global_lock like everyone else. PI on its global_lock blocking is normal pi_mutex_t behaviour. No new propagation is needed.
- The non-RT FD polling thread never takes global_lock, so it never inverts anything.
- The FD-event handler thread is the subtle one. The handoff queue itself is short-lived; the queue-drain wakeup signals an event that the handler thread waits on, and PI on that event needs to come from somewhere – probably from the FD readiness itself (which has no inherent priority) plus the highest-priority pending FD-driven request (which we’d have to compute). This is a design detail still open.

The pattern that emerges: as long as every thread-to-thread handoff inside wineserver goes through an NTSync primitive that carries priority (channel, event with SET_PI), PI propagates end-to-end. As soon as a handoff goes through a bare userspace queue (a pi_mutex_t-protected list with no PI signal), priority propagation breaks and the highest-priority pending request can be starved by lower-priority work. That observation alone is a design constraint on Phase 3 / Phase 4: every handoff queue needs an NTSync event as its waiter primitive, not just a bare condition variable.
This is also why the aggregate-wait extension matters strategically. With aggregate-wait, a handler thread can wait on (incoming channel events, FD-event queue NTSync event, timer-deadline NTSync event) and the kernel keeps PI consistent across all of them. Without aggregate-wait, we either fragment the wait primitives (one thread per wait shape, more handoff queues, more places to drop PI) or lose the cleanliness of the boost propagation.
NTSync is the kernel module NSPA owns and extends (ntsync-patches/, ntsync-driver.gen.html). Two extensions are relevant to wineserver decomposition. The first is shipped; the second now exists as a shipped kernel/userspace slice with broader decomposition consumers still ahead.
Status: shipped 2026-04-26 (T1/T2/T3, default-on as of post-1006 unblocking). Listed here for completeness; the implementation is described in gamma-channel-dispatcher.md.
The problem this solved: every channel request, the dispatcher called get_thread_from_id((thread_id_t)recv.payload_off) which called get_ptid_entry(id), an indexed array lookup with a possible cache miss in process.c:547. At 10% of dispatcher CPU in steady-state playback, this was meaningful overhead and – more importantly – a cache-miss-prone source of latency variance on every channel request.
The fix: extend NTSYNC_IOC_CHANNEL_RECV to return a thread_token that wineserver populated at thread create time. Wineserver registers (tid, struct thread *) via the new NTSYNC_IOC_CHANNEL_REGISTER_THREAD ioctl on thread create, deregisters on thread die, and on the receiving side reads the kernel-stamped token directly with no userspace lookup. Lifetime safety is preserved by the register-before-first-send / deregister-after-last-reply invariants.
Why it lives in this doc: it’s the first NTSync extension specifically targeted at making wineserver do less per-request, and it’s the prototype for the 4.2 aggregate-wait extension that would follow. The pattern (register a userspace pointer with the kernel; have the kernel hand it back at the dispatch event) is the same pattern aggregate-wait would extend to wider state.
The trust model – which generalizes to 4.2 – is “wineserver is trusted by the kernel because wineserver provided the registration; the client cannot influence what’s stored.” That’s the right design for kernel objects whose userspace owner is privileged in the relevant sense (the wineserver process, which runs as the same UID as its clients but is the source of truth for the cross-process semantics layered on top of NTSync).
Status: kernel primitive + first userspace consumer shipped 2026-04-29. The broader decomposition consumers (timer-thread split + FD poll thread split) remain queued, but NTSYNC_IOC_AGGREGATE_WAIT itself is no longer hypothetical and is already default-on in the gamma dispatcher via NSPA_AGG_WAIT.
The problem: wineserver’s main loop today waits via poll() / epoll_wait() over wineserver fds. It does not compose with NTSync object waits. The dispatcher pthread waits via NTSYNC_IOC_CHANNEL_RECV (futex-backed); it does not compose with fd readiness or NT timer deadlines. Each thread has exactly one wait primitive and one shape of wakeup, and the two shapes can’t merge into a single waiter.
That fragmentation is workable today because wineserver is not yet thread-decomposed. The main loop doesn’t need to wait on NTSync objects; the dispatcher doesn’t need to wait on fds. Once we move toward “single RT thread does everything except FD polling” (Phase 3) the unification matters: the RT thread needs to wait on the gamma channel, on a poll-set of wineserver fds, and on the next NT timer deadline, all in one syscall, with PI propagation from the channel sender.
The landed ioctl is NTSYNC_IOC_AGGREGATE_WAIT, which takes a heterogeneous source set:

- NTSync objects (the gamma channel and the other NTSync wait sources),
- a poll-set of file descriptors with per-fd event masks,
- an optional absolute timer deadline.
Wakes on whichever source fires first. Reports back which source fired and (for FDs) what events. PI propagates from NTSync sources where the source carries a sender priority (channel SEND_PI, event SET_PI, etc.); FD readiness has no inherent priority.
The cost is moderate:
- Kernel side: unify the poll_wait machinery for the FD half with the existing NTSync wait machinery for the object half. Existing kernel infrastructure handles both, but the unification work is its own design.
- Userspace side: rewrite the main_loop_epoll body and dispatcher RECV loop. Mechanical once the kernel side is in.

The win is moderate-large but bounded by the lock. Even with one unified waiter, every handler still serialises on global_lock. That honest accounting did not change when the primitive shipped: aggregate-wait is useful today because it fixes the gamma async-completion ownership problem, but it becomes strategically larger once the timer and fd-poll splits also compose with it.
A separate consideration is the PREEMPT_RT epoll question. The runtime gate NSPA_DISABLE_EPOLL lets us A/B plain poll() against epoll_wait() on PREEMPT_RT. If epoll behaves cleanly under the workload (no priority inversions on its internal RT-mutex-converted locks), the urgency on the FD polling thread split (5.3) drops, and aggregate-wait may be the right unification anyway – but for the right reasons (composition with NTSync, not avoiding epoll). The decision belongs in the same session that designs the aggregate-wait API.
These four splits describe the userspace side of the road map. They are independent in implementation but ordered in the phasing because the risk profile varies and because some splits depend on prior infrastructure being in place.
Status: queued for Phase 3. Behaviour-preserving structural change; first safe split.
Today, wineserver’s main loop computes get_next_timeout() from the head of the NT timer queue, passes that timeout to poll(), and processes timer expirations when poll returns due to timeout (rather than fd readiness). That couples timer-driven wakeups to fd-driven wakeups in the same syscall.
The proposal: a dedicated timer thread that owns the NT timer queue.
- Sleeps in clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline) where deadline is the next NT timer expiration.
- On wake, takes global_lock, processes timer expirations, releases, recomputes the deadline, sleeps again.
- Timer adds and cancels from other threads signal it to recompute (pthread_kill(timer_thread, SIGRTMIN) to interrupt the sleep, or a dedicated futex).

This is the first safe split for several reasons:
- The sleep primitive is clock_nanosleep ABSTIME, which is exactly the right primitive for “wake at deadline X” – no math, no conflation.
- The main loop’s poll() wakes only on fd readiness, and get_next_timeout (currently 2-3% of wineserver CPU) moves to the timer thread.
- Timer expirations still run under global_lock; the only thing that moves is when they’re processed.

The risk is medium. Timer expiration must happen under global_lock to avoid races with handlers that read or write the same NT timer state, so the new thread is another lock contender. Today there are two RT lock-takers on the lock (main loop, dispatcher); after the split there are three. That doesn’t directly hurt – the lock is held briefly during timer processing – but it is one more thread whose latency is sensitive to lock contention. Pairing this with the aggregate-wait extension (4.2) lets us evaluate whether the timer thread can use aggregate-wait to also watch for timer-add notifications, simplifying the wakeup signaling.
A subtlety worth flagging: NT timer semantics are mutable. NT code can create, modify, or cancel a timer at any moment. The timer thread needs to react to deadline changes between iterations. The cleanest signal is the one in the proposal (pthread_kill to interrupt the sleep, recompute, sleep again). An alternative is for the timer thread to also wait on a futex that fires on add/cancel; either works, and the choice is a design detail rather than a fundamental question.
Status: queued for Phase 4 (long horizon). Pays its design cost as state migrates out.
Today, the gamma dispatcher pthread does CHANNEL_RECV → grab global_lock → read_request_shm → run handler → release lock → CHANNEL_REPLY in a tight loop. Every request takes the lock. Most requests touch only a small subset of wineserver state.
The proposal is to split the dispatcher into two tiers:
- Router tier (RT): blocks in CHANNEL_RECV. For each request, classifies it. If the request is fast-path eligible – synthesizable locally without wineserver state, or answerable directly from client-side shm – it replies immediately via CHANNEL_REPLY without taking global_lock. Otherwise it queues the request to the handler tier.
- Handler tier: takes global_lock, runs the existing handler logic, replies.

Today, the fast-path classifier would return “slow path” for every request type. There is no request that doesn’t go through the existing handler. So the split is initially behaviour-neutral: every request still ends up running under global_lock, just with one more queue hop.
The split pays over time, as state migrates out. Once enough state lives client-side (NT-local stubs, redirect tables in shared memory, hook caches), more request types qualify for the fast path. Trivial queries – “is this handle valid?” “what’s the size of this object?” “is this thread alive?” – become candidates. Cross-process queries that can be answered from shared metadata become candidates. Each migration is a small change to the classifier: add a request type to the fast-path set, validate, ship.
The reason this is a Phase 4 item rather than a Phase 3 item: it has zero immediate impact. The fast-path set is empty today. Designing the classifier framework before there are clients for it risks over-engineering. Better to wait until a few obvious fast-path candidates exist (the bypasses ahead make this likely – e.g. the GetMessage bypass turns a class of message-pump traffic into a candidate; the redraw push ring already shifts state shapes that could be queried fast-path).
The risk of the split itself is low (it’s mechanical). The actual hard work is the per-request-type fast-path classification: deciding whether request type X is eligible, validating that the eligibility logic is correct under all envelope conditions, A/B'ing.
Status: queued for Phase 3. Decision contingent on PREEMPT_RT epoll experiment outcome.
Today, the main loop is RT and spends most of its time blocked in poll() or epoll_wait(). The wait itself doesn’t actually need RT priority – only the response to the wait does. RT priority matters for the work that happens after the wait returns, not for the act of sleeping in the kernel.
The proposal: separate the FD polling from the FD-event handling.
- Polling thread (non-RT): blocks in poll() / epoll_wait() / io_uring_enter() over wineserver fds. Spends ~all its time sleeping in the kernel. On wake-up, doesn’t run the handler – it queues the fd-event to a handler thread.
- Handler thread (RT): takes global_lock, runs the per-fd handler, releases.

The reason for non-RT polling: RT priority on a thread that’s sleeping in the kernel doesn’t change wakeup latency. The kernel wakes the thread when an fd is ready, regardless of scheduler class. RT priority helps once the thread is awake and competing for CPU – but at that point we’ve already done a context switch into the polling thread; the cost is paid. Having the polling thread immediately hand off to a separate RT thread keeps the RT scheduler attention focused on the work that benefits from it.
The win compounds with the timer split (5.1) and aggregate-wait (4.2): after both, the wineserver main loop becomes a pure handler loop with no poll() calls of its own. The handler loop is the natural home for an aggregate-wait that watches the gamma channel, the FD-event queue, and the timer queue at once.
The risk is moderate. The handoff queue adds an extra context switch per fd-driven request: “fd ready” → polling thread wakes → enqueues → handler thread wakes → runs. Today that’s “fd ready → main loop wakes → runs” – one fewer context switch. Whether that latency increase matters depends on which fds carry latency-critical traffic. Most wineserver fds are control plane (request channels, sockets to clients), not data plane; the latency of “client request enqueued” to “server starts processing” is dominated by the existing channel + lock costs, not by an extra wakeup hop.
The other risk is the PREEMPT_RT epoll behavior. If epoll on PREEMPT_RT is adequate for the workload (the runtime A/B via NSPA_DISABLE_EPOLL will determine this), the urgency on this split drops. If epoll shows real priority inversions on its internal locks, the split becomes both an architectural and correctness requirement.
Status: long horizon, Phase 4. Don’t start until 2-3 subsystems have already been pruned.
Current lock state: one global_lock (a pi_mutex_t) covers all wineserver state. Every handler takes it. The lock is a serialization point for every Win32 process running on the system.
The proposal: per-subsystem locks. Windows, hooks, files, sync objects, processes, message queues – each with its own lock. Handlers grab only the lock(s) for the subsystem they touch. Cross-subsystem operations (rare) take multiple locks in a canonical order to avoid deadlock.
This is the only thing that lets multiple handlers run concurrently on the same wineserver process. Until it lands, every other split is ultimately bottlenecked at the lock; multi-threaded wineserver under one global lock is no better at throughput than single-threaded wineserver under one global lock, and is worse at latency variance because more threads contend for the same lock.
It’s also, by a wide margin, the hardest split, which is why it comes last: every existing handler was written under the assumption that nothing else changes during its run, and auditing that assumption across the full handler set only becomes tractable once the bypasses have shrunk the set of handlers that still matter.
The recommendation: do not start this work until at least 2-3 subsystems (probably files, sync, hooks) have been moved fully out of wineserver and the remaining lock-holders are identifiable as a small, audit-able set. Until then, ship bypasses, ship the other splits, and let the surface shrink. When the time comes, lock partitioning is the surgical conclusion of the whole strangler arc – not its centerpiece.
These are the surfaces for which wineserver remains the source of truth and which no bypass or kernel primitive eliminates. The residual wineserver remains a metadata service for these:
- Cross-process named objects: \BaseNamedObjects\Foo and the NT object directory tree are shared across processes. Someone has to be the source of truth for “what’s the object that handle H in process P refers to?” when H or its name is shared with another process.
- Process and thread lifecycle: WaitForSingleObject on a process or thread handle. NT semantics are server-mediated and the cross-process visibility requires a centralized authority.
- Handle duplication and inheritance: DuplicateHandle between processes, inherited handles at process create, and the bookkeeping that ensures handles in the parent’s table appear correctly in the child’s table at the right moment. The handle table itself can be partitioned per-process; the coordination is cross-process.
- NT namespace semantics: \??\ paths, NT object directory hierarchy, some reparse-point handling, cross-process name redirection. These are NT-specific path rules without a Linux equivalent; they have to live somewhere and the only honest home is the source of truth for the NT name space.

These are small relative to what can move out. Windows, hooks, file inodes, message queues, timers, sync primitives, and file I/O are already in flight or shipped client-side. The residual wineserver becomes a thin metadata service that answers cross-process naming questions and brokers lifecycle events, not an application server that runs handlers for every NT call.
This list is also why the strategy is “decompose, not delete.” A from-scratch replacement would have to re-implement all of the above plus everything that hasn’t been moved yet. Decomposition keeps the existing implementations of the must-stay items and just rearranges how they’re locked and dispatched.
The single canonical phase table for the decomposition arc. This table covers the four phases of decomposition itself; bypass trajectories are tracked separately in their own subsystem docs.
| Phase | Items | Status |
|---|---|---|
| 1 | Phase B open_fd lock-drop | shipped default-on |
| 2 | NTSync §2.1 thread-token pass-through (T1/T2/T3) | shipped default-on |
| 3 | Timer thread split (5.1) + FD poll thread split (5.3), composed around shipped aggregate-wait (4.2) | queued |
| 4 | Router/handler split (5.2) + lock partitioning (5.4) | long horizon |
Each phase ships discrete, testable, revertible wins. The architecture direction stays clear (less wineserver, less global_lock, more event-driven RT primitives) but every phase is independently valuable.
A few notes about the ordering:
- Phase 1 is lock discipline, not restructuring (it only shortens one handler’s hold of global_lock). It’s not a “decomposition” in the architectural sense; it’s a targeted release of the lock around a known-slow critical section. Listed as Phase 1 because it was the first piece of decomposition-direction work to ship, and because the pattern it establishes (NSPA-side lock-discipline patches inside server handlers) generalizes to other long lock-holders we may identify later.

Most importantly: each phase ships independently. There is no big-bang. If Phase 3 stalls, Phase 4 doesn’t unblock anything that Phase 1+2 didn’t already unblock; the bypasses keep shipping in parallel. The decomposition arc and the bypass arc progress independently, with each sometimes accelerating the other but neither blocking it.
The natural alternative to this plan is: rewrite wineserver from scratch with the architecture you wish it had. Multi-threaded by design, fine-grained locks, modern wait primitives, no global_lock. Wine’s existing wineserver has a lot of accumulated assumptions (“nothing else changes during my handler”) that a clean-slate rewrite could just not have.
There are real reasons NSPA chose decomposition over rewrite, the biggest being the one section 6 makes concrete: a from-scratch replacement would have to re-implement the entire must-stay surface plus everything that hasn’t been moved yet before it could ship anything usable, while decomposition ships a usable improvement at every phase.
The cost of decomposition is a slight cost in design uniformity. Each phase has its own approach, its own gating env var, its own validation discipline. There’s no single “wineserver 2.0” that you can point at; instead there’s a wineserver that’s been progressively reshaped. That is, on net, the right trade for a project that has to ship usable improvements continuously rather than commit to a multi-quarter rewrite.
A useful framing: bypasses and decomposition use the same incremental-migration discipline on different surfaces. Bypasses move NT-API state; decomposition restructures the remaining wineserver internals.
A constraint that runs through the whole arc: each phase has to pass real-workload validation before it can flip default-on. The default workload is Ableton-on-PREEMPT_RT under realistic plugin load – it exercises the message pump, the file I/O paths, the sync primitives, the timer paths, and the audio RT thread all simultaneously. If a change breaks Ableton or introduces measurable xrun regressions, it stays default-off until the cause is found and fixed.
This discipline has caught real bugs. The post-1006 ntsync work re-validated several “shipped” bypasses against a kernel module that finally didn’t lock the host; the validation found that some of the lockup attribution had been wrong (Phase B open_fd was blamed for a lockup that turned out to be an unrelated NTSync slab corruption). Without re-validation under stable conditions, the wrong bypass would have stayed gated.
The implication for Phase 3: every component split needs its own gate (NSPA_TIMER_THREAD_SPLIT=1, NSPA_FD_POLL_THREAD=1, and so on), its own validation plan, and independent combination testing. NSPA_AGG_WAIT already followed that path and flipped default-on after validation; the remaining pieces should be held to the same discipline.
These are the unresolved design questions ahead of Phase 3. None block Phase 1 or Phase 2 (already shipped) but each one wants an answer before the corresponding piece of Phase 3 ships.
- Timer-thread wakeup signalling: pthread_kill(timer_thread, SIGRTMIN) to interrupt clock_nanosleep and force a recompute, or have the timer thread also wait on a futex that fires on add/cancel. Aggregate-wait (4.2) makes this trivial: the timer thread waits on (NT timer queue head deadline, futex on add/cancel) and reacts to whichever fires. With aggregate-wait now shipped, this reduces to whether the timer thread adopts it from day one or starts with the simpler signal-based wakeup.
- PREEMPT_RT epoll behaviour: NSPA_DISABLE_EPOLL (90231fc8d21) lets us A/B plain poll() vs epoll_wait() on PREEMPT_RT without rebuilding. If epoll behaves cleanly under the workload, the urgency on the FD poll thread split (5.3) drops; if it shows priority inversions on its internal RT-mutex-converted locks, the split moves up the priority list. The experiment should land before Phase 3 design is finalized.
- Where does inproc_sync fit? The in-tree server/inproc_sync.c already handles a class of intra-process sync operations without round-tripping through the dispatcher. Some of its design lessons – per-process state, ioctl-direct dispatch – generalize to other request types, and the question is whether inproc_sync becomes a model for further router/handler-split fast paths or stays a one-off.

A vertical phase ladder. Phases 1 and 2 are below the line (“done”); Phases 3 and 4 are above the line (“ahead”). The components of each phase are listed inside the phase block.
The visual point of the ladder: the bottom two phases are done, the middle phase is the next major piece of architectural work, and the top phase only starts once the bypass arc has shrunk the surface enough to make it tractable. Each rung is independently valuable; nothing requires the rung above it before it can ship.
- gamma-channel-dispatcher.md – the existing gamma dispatcher, which 5.2’s router/handler split decomposes. Also the home of the Phase 2 thread-token pass-through implementation.
- nt-local-stubs.md – the architectural pattern for client-resident handlers. Section 6’s “what stays in wineserver” defines the floor that nt-local stubs and bypasses converge toward.
- ntsync-driver.gen.html – the kernel module that hosts the NTSync primitives. Section 4’s extensions live in this module’s patch series.
- nspa-local-file-architecture.gen.html, msg-ring-architecture.gen.html, io_uring-architecture.gen.html, cs-pi.gen.html, condvar-pi-requeue.gen.html – the per-subsystem detail docs whose work composes with the decomposition arc.
- architecture.gen.html – the integrated NSPA architecture overview; this doc is the wineserver-internals lens of that architecture.
- wine/nspa/docs/wineserver-decomposition-plan.md – the session-handoff version of this plan, with line-level kernel landmarks and active-session details. Use that doc when implementing Phase 3 / Phase 4; use this doc when reasoning about the trajectory.