Wine-NSPA – Architecture & Design Reference

Wine 11.6 + NSPA RT patchset | Kernel 6.19.x-rt with NTSync PI | 2026-04-15 | Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Vanilla Wine vs Wine-NSPA
  3. Wine Process Model
  4. RT Priority Architecture
  5. Synchronization Architecture
  6. NTSync Kernel Patches
  7. io_uring I/O Architecture
  8. Audio Stack Architecture
  9. QPC & Timing
  10. Memory & Large Pages
  11. msvcrt SIMD Optimizations
  12. Version History

1. Overview

Wine-NSPA is a real-time optimized fork of Wine designed to run on PREEMPT_RT Linux kernels. It bridges the gap between Wine’s SCHED_OTHER threading model and the deterministic scheduling guarantees that RT workloads require – priority inheritance, bounded lock hold times, priority-ordered wakeups, and correct NT-to-Linux priority mapping.

While professional audio (DAWs, VST plugins, ASIO applications) is the primary motivation, Wine-NSPA’s RT infrastructure benefits any latency-sensitive Win32 application: real-time simulation, industrial control, low-latency trading, or any scenario where a Windows application must coexist with Linux’s RT scheduler without priority inversion or unbounded blocking.

Design Philosophy

What PREEMPT_RT Changes

On a standard kernel, spinlock_t disables preemption and mutex is a sleeping lock. Under PREEMPT_RT, spinlock_t becomes a sleeping rt_mutex (fully preemptible), and only raw_spinlock_t disables preemption. Code that assumed spinlock critical sections were non-preemptible must therefore be audited, and truly atomic paths converted to raw_spinlock_t (see Patch 1 in Section 6).

Component Summary

Component Layer Status
RT priority mapping (v1/v1.2) Wine ntdll + wineserver SHIPPED
Wineserver self-promotion (v1.1) Wine wineserver SHIPPED
Shmem wineserver IPC (v1.5) Wine ntdll + wineserver SHIPPED
Wineserver global_lock PI Wine wineserver SHIPPED
CS-PI / FUTEX_LOCK_PI (v2.3) Wine ntdll (PE + Unix) SHIPPED
librtpi vendoring (v2.0) Wine libs/ SHIPPED
NTSync PI kernel patches Linux kernel driver SHIPPED
Client-side NTSync creation Wine ntdll Unix SHIPPED
winejack.drv (MIDI + audio) Wine driver SHIPPED
nspaASIO bridge Wine DLL SHIPPED
io_uring I/O bypass (Phase 1-3) Wine ntdll Unix + server SHIPPED
ntsync uring_fd kernel extension Linux kernel driver SHIPPED
msvcrt SIMD (AVX/SSE2) Wine msvcrt SHIPPED
SRW lock spin phase Wine ntdll SHIPPED
pi_cond requeue-PI Wine libs/librtpi SHIPPED
Win32 condvar PI (requeue-PI) Wine ntdll PE + Unix SHIPPED
CoWaitForMultipleHandles rewrite Wine combase SHIPPED
QPC rdTSC bypass Wine ntdll Unix SHIPPED
Large/huge pages Wine ntdll Unix SHIPPED

2. Vanilla Wine vs Wine-NSPA

Side-by-side comparison of how vanilla Wine and Wine-NSPA handle the same architectural components. Every NSPA change is additive – the vanilla behavior is preserved when NSPA_RT_PRIO is unset.

  wineserver: vanilla SCHED_OTHER (nice 0), single-threaded, handles all IPC; NSPA SCHED_FIFO 64 (self-promoted, below all client RT)
  Thread scheduling: vanilla all threads SCHED_OTHER, TIME_CRITICAL = nice -11, no RT, no priority inheritance; NSPA RT priority mapping (v1/v1.2): REALTIME class → SCHED_FIFO 65-80, TC ceiling clamp (NT 31 = NSPA_RT_PRIO), cross-thread map + lenient tier 1 path
  Synchronization: vanilla CS → futex (no PI), Mutex → wineserver round-trip, FIFO wakeup order; NSPA four PI paths: CS → FUTEX_LOCK_PI (priority inheritance), Mutex → NTSync PI (5 kernel patches), condvar → requeue-PI (unix + Win32), SRW spin (256 iters, RT skip)
  NTSync: vanilla upstream driver with spinlock_t, FIFO waiters, no PI; NSPA + 5 kernel patches: raw_spinlock, prio-ordered waiters, PI boost v2, uring_fd, kmalloc fix
  QPC timing: vanilla clock_gettime (~1μs); NSPA rdtsc direct read (~10ns, 100x faster)
  Memory: vanilla 4KB pages only; NSPA 2MB large pages + 1GB hugepages, TLB reduction
  Linux kernel: vanilla PREEMPT_VOLUNTARY / PREEMPT; NSPA PREEMPT_RT: fully preemptible, rt_mutex PI chains, NTSync PI

Key differences: Vanilla Wine runs entirely under SCHED_OTHER – wineserver, all threads, all sync primitives. There is no priority mapping, no PI, no RT scheduling. Wine-NSPA adds 6 layers of RT infrastructure: wineserver self-promotion, NT→FIFO priority mapping, CS-PI via FUTEX_LOCK_PI, client-side NTSync with PI kernel patches, rdTSC QPC bypass, and large page support. All layers are opt-in via NSPA_RT_PRIO.


3. Wine Process Model

Wine implements the Windows NT process model on top of Linux. Understanding this architecture is essential for knowing where NSPA hooks in and why certain design choices were made.

Architecture Diagram

(Diagram: a Wine process split into a PE side – app code, ntdll.dll, kernelbase.dll, the CS-PI fast path (TID CAS), SRW spin (256 iters), condvar PI mapping table – and a Unix side – ntdll.so, sched_setscheduler(), futex() syscalls, ioctl(/dev/ntsync), FUTEX_LOCK_PI slow path, RT self-promotion, per-thread io_uring for file + socket I/O bypass – sharing one address space. The process talks to the single event-driven wineserver (v1.1: SCHED_FIFO @ NSPA_RT_PRIO-16) over Unix-socket IPC or the v1.5 shmem path, and to the kernel's ntsync, futex, io_uring, and PREEMPT_RT scheduler subsystems directly. The handle table caches fds: server handles grow from index 0 upward, NSPA client handles from ~524K downward; client handles bypass wineserver entirely for anonymous objects.)

Key Concepts

Wineserver

A single long-lived process per Wine prefix that acts as the NT kernel analog. It owns the handle table, manages process/thread lifetime, and arbitrates named synchronization objects. All cross-process operations (handle inheritance, named mutexes, process creation) go through wineserver via Unix domain sockets.

Bottleneck: Wineserver is single-threaded and event-driven. An RT thread that makes a wineserver round-trip blocks behind the server’s serialization. v1.5’s shmem path and client-side NTSync creation reduce the number of operations that require this round-trip. The v1.5 shmem dispatchers serialize on a global_lock (now PI-aware via pi_mutex_t / FUTEX_LOCK_PI) so that high-priority dispatcher contention propagates priority through the kernel’s rt_mutex PI chain.

PE / Unix Split

Every Wine process has two halves sharing the same address space. The PE side runs Win32 code (app binaries, PE DLLs like ntdll.dll, kernelbase.dll). The Unix side runs native Linux code (ntdll.so, driver Unixlibs). The PE side cannot make raw Linux syscalls – it must cross into the Unix side via the Unixlib dispatcher. This split is architecturally important for NSPA: CS-PI’s fast path (CAS on LockSemaphore) lives on the PE side, while the FUTEX_LOCK_PI slow path lives on the Unix side.

NTSync

Wine 11.x’s primary synchronization backend. The /dev/ntsync kernel device (Elizabeth Figura, mainlined) implements Win32 semaphores, mutexes, and events with correct NT semantics (abandoned mutexes, WaitForMultipleObjects atomicity, etc.) directly in the kernel. NSPA adds five patches to this driver for PI support, io_uring integration, and PREEMPT_RT safety.

Client-Side Object Creation

NSPA adds a client-side fast path for creating anonymous sync objects (unnamed mutexes and semaphores). Instead of a wineserver round-trip, the client calls ioctl(NTSYNC_IOC_CREATE_*) directly and allocates a handle from the client range (index ~524K downward). Named objects still go through wineserver for namespace resolution. This eliminates a round-trip per object creation – significant for apps that create thousands of mutexes at startup.
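
The two-ended handle-index scheme can be sketched as a pair of bump allocators: server handles grow from index 0 upward, client handles from a high watermark (~524K) downward. All names and limits here are illustrative, not Wine-NSPA's actual implementation:

```c
#include <stdbool.h>

#define CLIENT_START_INDEX (512 * 1024)    /* ~524K, grows downward */

static int next_server_index = 0;                   /* wineserver-assigned */
static int next_client_index = CLIENT_START_INDEX;  /* NSPA client-side    */

/* Allocate a server-side handle index (named objects, anonymous events). */
static int alloc_server_index(void)
{
    return next_server_index++;
}

/* Allocate a client-side handle index (anonymous mutexes/semaphores
 * created via ioctl(NTSYNC_IOC_CREATE_*), no wineserver round-trip). */
static int alloc_client_index(void)
{
    return next_client_index--;
}

/* The two ranges grow toward each other; a real implementation must fail
 * or fall back once they would collide. */
static bool ranges_collided(void)
{
    return next_client_index < next_server_index;
}
```

Keeping the ranges disjoint lets any handle be classified as server- or client-owned from its index alone, with no extra bookkeeping.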

Events are excluded from client-side creation (a7b00453978). Client-created event handles destabilized Ableton Live, causing crashes in the handle lifecycle. Anonymous events remain on wineserver, where the server-side lifecycle management is proven stable. Mutexes and semaphores are stable client-side.

(Diagram: NtCreateMutant and NtCreateSemaphore take the client path – ioctl(NTSYNC_IOC_CREATE_*) direct to /dev/ntsync, zero server round-trips. Anonymous NtCreateEvent takes the server path – wine_server_call(), the server creates the object on /dev/ntsync and sends the fd back to the client. Both converge on the same wait path: inproc_wait() / linux_wait_objs() → ioctl(NTSYNC_IOC_WAIT_ANY). Events are excluded because client-created event handles crashed Ableton Live (a7b00453978); server-side lifecycle management is stable for events, while mutexes and semaphores are stable client-side.)

Shared Memory Wineserver IPC (v1.5)

Forward-port of Torge Matthies’s 2022 shmem wineserver IPC patch. Replaces socket-based request/reply with per-thread shared memory + futex signaling for small requests. The server side spawns a pthread per client thread that sits in FUTEX_WAIT and dispatches via the existing req_handlers[] array, serialized by a PI-aware global_lock around the main poll loop. v2.4 client-side boost raises the dispatcher’s priority to match the client’s RT priority for the duration of each request; the PI lock ensures this boost propagates through contention with other dispatchers or the main thread.

(Diagram, vanilla IPC vs shmem IPC: vanilla – ntdll's send_request() over a socket, the wineserver main loop's epoll → read → dispatch, reply over the socket, wait_reply() – costs ~10-50μs per round-trip (2 context switches, 2 socket ops, epoll wake). Shmem – the client writes the request into its per-thread 1MB shmem region and FUTEX_WAKEs its dedicated dispatcher pthread, which dispatches under the global_lock and wakes the client's FUTEX_WAIT on the reply – costs ~2-5μs (no socket, no epoll). Oversized requests (>1MB) and shmem init failure fall back to the socket path transparently.)
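
The shmem handshake can be sketched in miniature as a single shared slot and one dispatcher thread. The 3-state protocol, field names, and the uppercase "dispatch" below are illustrative, not Wine-NSPA's actual layout (which uses per-thread 1MB regions and the server's req_handlers[] table); only the FUTEX_WAIT/FUTEX_WAKE signaling pattern is the point:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
#include <ctype.h>
#include <string.h>

struct shm_slot {
    atomic_int state;   /* 0 = idle, 1 = request posted, 2 = reply ready */
    char buf[64];       /* request and reply share one buffer */
};

static struct shm_slot slot;

static void futex_wait(atomic_int *addr, int expected)
{
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

/* Dispatcher thread: sleeps in FUTEX_WAIT until a request is posted,
 * "dispatches" it (here: uppercases the payload), then posts the reply. */
static void *dispatcher(void *arg)
{
    (void)arg;
    while (atomic_load(&slot.state) != 1)
        futex_wait(&slot.state, 0);
    for (char *p = slot.buf; *p; p++)
        *p = (char)toupper((unsigned char)*p);
    atomic_store(&slot.state, 2);
    futex_wake(&slot.state);
    return NULL;
}

/* Client side: write request into shmem, wake dispatcher, wait for reply. */
static const char *shmem_call(const char *request)
{
    pthread_t t;
    pthread_create(&t, NULL, dispatcher, NULL);

    strcpy(slot.buf, request);
    atomic_store(&slot.state, 1);
    futex_wake(&slot.state);

    while (atomic_load(&slot.state) != 2)
        futex_wait(&slot.state, 1);
    pthread_join(t, NULL);
    return slot.buf;
}
```

The reason this beats a socket round-trip is visible in the code: the only kernel involvement is the futex wake/wait pair, with the payload already sitting in shared memory.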

Why shmem matters for RT: NTSync captures the sync-wait hot path, but non-sync wineserver traffic (handle operations, registry, file metadata, loader calls) still uses IPC. During VST plugin loading, the loader holds Wine’s loader lock while dispatching dozens of wineserver calls. Shorter round-trips = shorter lock-hold times = less contention bleeding into RT threads sharing those locks.

Limitations: The global_lock serializes all shmem dispatchers, so this is a latency improvement, not a throughput one. Dispatcher threads run at SCHED_OTHER by default (v1.1's wineserver RT promotion handles the scheduling side independently).

Shmem PI v2.5: Cached Scheduling State

v2.4’s manual PI boost called sched_getscheduler() + sched_getparam() + sched_setscheduler() on every request (6 syscalls per RT request including unboost). v2.5 caches the RT thread’s scheduling policy and priority in TLS (nspa_rt_cached_policy, nspa_rt_cached_prio), eliminating the read syscalls. Only sched_setscheduler() fires — 2 syscalls per request (boost + unboost). Committed as 17621ba494c.

(Diagram: v2.4 per RT request – getscheduler(), getparam(), setscheduler(boost), dispatch, setscheduler(restore): 6 syscalls. v2.5 – setscheduler(boost), dispatch (unchanged), setscheduler(restore): 2 syscalls. TLS holds nspa_rt_cached_policy + nspa_rt_cached_prio, set once at thread RT init and read on every boost; sched_getscheduler + sched_getparam are eliminated because policy/prio don't change between requests.)
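
The caching idea can be shown with stubbed sched_* calls that count "syscalls", making the 6-vs-2 difference checkable. The stub names and the lazy cache fill are illustrative (the real cache is populated at thread RT init, per the text); only the TLS variable names come from the document:

```c
#include <stdbool.h>

static int syscall_count;   /* stub: counts sched_* "syscalls" issued */

static int  stub_sched_getscheduler(void) { syscall_count++; return 1;  /* SCHED_FIFO */ }
static int  stub_sched_getparam(void)     { syscall_count++; return 75; }
static void stub_sched_setscheduler(int policy, int prio)
{ syscall_count++; (void)policy; (void)prio; }

static __thread int  nspa_rt_cached_policy;
static __thread int  nspa_rt_cached_prio;
static __thread bool nspa_rt_cache_valid;

/* v2.4 shape: read policy/prio around both boost and restore – 6 syscalls. */
static void dispatch_v24(int boost_policy, int boost_prio)
{
    int policy = stub_sched_getscheduler();
    int prio   = stub_sched_getparam();
    (void)policy; (void)prio;
    stub_sched_setscheduler(boost_policy, boost_prio);  /* boost dispatcher */
    /* ... dispatch request ... */
    policy = stub_sched_getscheduler();
    prio   = stub_sched_getparam();
    stub_sched_setscheduler(policy, prio);              /* restore */
}

/* v2.5 shape: policy/prio cached in TLS – only the 2 setscheduler calls fire. */
static void dispatch_v25(int boost_policy, int boost_prio)
{
    if (!nspa_rt_cache_valid) {    /* real code fills this at thread RT init */
        nspa_rt_cached_policy = stub_sched_getscheduler();
        nspa_rt_cached_prio   = stub_sched_getparam();
        nspa_rt_cache_valid   = true;
    }
    stub_sched_setscheduler(boost_policy, boost_prio);  /* boost */
    /* ... dispatch request ... */
    stub_sched_setscheduler(nspa_rt_cached_policy, nspa_rt_cached_prio);
}
```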

4. RT Priority Architecture

NSPA maps Win32 thread priorities to Linux SCHED_FIFO priorities, giving audio threads deterministic scheduling. The mapping is controlled by two environment variables and implemented in a two-tier architecture.

Priority Mapping Diagram

Win32 NT band         Linux SCHED_FIFO     Role
(reserved)            99                   Kernel threads only
                      88-89                JACK / PipeWire callbacks
TIME_CRITICAL (31)    80 = NSPA_RT_PRIO    Audio callback ceiling, always SCHED_FIFO (realtime +6)
30                    79
HIGHEST, RT class (26) 75
NORMAL, RT class (24)  73                  Policy from NSPA_RT_POLICY
IDLE, RT class (16)    65                  RT band boundary (NT 16)
(wineserver)          64                   v1.1: NSPA_RT_PRIO - 16, just below the entire RT band
NT 1..15              SCHED_OTHER (FIFO 1..63 unused)  nice-based

Ceiling mapping: fifo = NSPA_RT_PRIO - (31 - nt_band). NSPA_RT_PRIO is the ceiling: NT 31 maps there, and lower RT bands scale linearly below it. Example with NSPA_RT_PRIO=80: NT 31 = 80, NT 24 = 73, NT 16 = 65.
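
The ceiling mapping is a one-liner; as a function (sketch only – the real code lives in dlls/ntdll/unix/thread.c, and the function name here is made up):

```c
/* Map an NT priority in the RT band [16..31] to a SCHED_FIFO priority.
 * `ceiling` is NSPA_RT_PRIO; NT 31 maps to the ceiling itself and lower
 * bands scale linearly below it. */
static int nt_to_fifo(int nt_band, int ceiling)
{
    return ceiling - (31 - nt_band);
}
```

With NSPA_RT_PRIO=80 this reproduces the table above: NT 31 → 80, NT 24 → 73, NT 16 → 65.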

Two-Tier Promotion

Tier 1: Client-side self-promotion (zero round-trip)

When a thread calls SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL), ntdll’s Unix side detects this in NtSetInformationThread(ThreadBasePriority) and calls sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, ...) directly. No wineserver round-trip needed. The priority change still forwards to wineserver for bookkeeping so that GetThreadPriority from other processes returns the correct value.

Implementation: dlls/ntdll/unix/thread.c, function nspa_rt_apply_tid().

Tier 2: Server-side cross-process promotion

When a thread calls SetThreadPriority(hOtherThread, ...) targeting a thread in another process, the request goes through wineserver. The server’s apply_thread_priority() in server/thread.c calls sched_setscheduler(thread->unix_tid, ...) on the target thread. This covers bulk updates from SetPriorityClass and any cross-process priority manipulation.

v1.2: Cross-thread map

Tier 1 extended with a HANDLE-to-unix_tid map (256-slot open-addressing hash table protected by a PI mutex). When a thread calls SetThreadPriority(hThread, ...) where hThread is another thread in the same process, the client looks up the target’s unix_tid in the map and applies the scheduling change locally, avoiding a wineserver round-trip. The map is populated at NtCreateThreadEx time.

Environment Variables

Variable Values Default Effect
NSPA_RT_PRIO Integer in [min..max-1] unset = RT dormant Master switch + ceiling FIFO priority. NT 31 maps here.
NSPA_RT_POLICY FF, RR, TS FF Scheduler policy for NT [16..30]. TC (NT 31) always FIFO.
NSPA_SRV_RT_PRIO Integer NSPA_RT_PRIO - 16 Override wineserver’s FIFO priority.
NSPA_SRV_RT_POLICY FF, RR FF Wineserver scheduler policy.

Lenient path (v1 feature): For client-side self/same-process promotion, THREAD_PRIORITY_TIME_CRITICAL is treated as a special-case ceiling promotion even when the process is not in REALTIME priority class. This covers the common audio pattern where apps call SetThreadPriority(..., 15) without first calling SetPriorityClass(REALTIME). The lower realtime band still follows the normal process-class rules; the lenient exception is specifically for TIME_CRITICAL.


5. Synchronization Architecture

Priority inversion is the primary failure mode for RT audio under Wine. An RT thread blocked on a lock held by a normal-priority thread can be delayed indefinitely while CFS time-slices the holder against dozens of other threads. NSPA addresses this with PI (priority inheritance) on two independent sync paths.

4-Path Synchronization Diagram

Path A – CS-PI: entry EnterCriticalSection → RtlEnterCriticalSection, TID CAS fast path / LOCK_PI slow path → kernel FUTEX_LOCK_PI (rt_mutex, transitive PI, deadlock detection; +1 CAS ≈ 5ns). PI effect: holder boosted to waiter priority. Release: LeaveCriticalSection – CAS(tid, 0) or FUTEX_UNLOCK_PI. Scope: all Win32 CriticalSections.
Path B – NTSync PI: entry WaitForSingleObject → NtWaitForSingleObject, inproc_wait → ioctl(/dev/ntsync) (5 patches, prio-ordered queues). PI effect: owner boosted via sched_setattr, priority-ordered wakeup, boost_count. Release: ReleaseMutex → ntsync_pi_drop, restore scheduling. Scope: Win32 Mutex/Sem/Event.
Path C – pi_cond PI: entry pi_cond_wait() (unix) → librtpi (header-only), unlock mutex → sleep on condvar → FUTEX_WAIT_REQUEUE_PI, atomic requeue onto the PI mutex. PI effect: zero gap – wake → own mutex, kernel requeues atomically. Signal: pi_cond_signal → FUTEX_CMP_REQUEUE_PI (wake 1). Scope: audio, gstreamer, winebus.
Path D – Win32 condvar PI: entry SleepConditionVariableCS → NtNspaCondWaitPI (condvar→mutex map + 3 syscalls) → FUTEX_WAIT_REQUEUE_PI, atomic requeue onto the CS PI mutex. PI effect: zero gap – wake → own CS. Signal: WakeConditionVariable → NtNspaCondSignalPI (map lookup). Scope: Win32 CondVar + CS.

Not covered: SleepConditionVariableSRW – SRW locks have no ownership, so there is no PI target (unsolved even in the Linux kernel). All 4 paths are gated on NSPA_RT_PRIO – when unset, every code path is byte-identical to upstream Wine.

Path A: CRITICAL_SECTION PI (CS-PI)

Win32 CRITICAL_SECTION is the most contended lock in typical Wine workloads – used by heap operations, loader, DllMain serialization, GDI, and most app/plugin code. NSPA’s CS-PI repurposes the LockSemaphore field of RTL_CRITICAL_SECTION as a FUTEX_LOCK_PI word.

Protocol

  1. Acquire (uncontended): PE side does InterlockedIncrement(&LockCount). If we won (-1 → 0), CAS our Linux TID into LockSemaphore. Done – never leaves user space.
  2. Acquire (contended): PE side calls NtNspaLockCriticalSectionPI(address), crossing to Unix side. Unix side calls futex(&LockSemaphore, FUTEX_LOCK_PI_PRIVATE, ...). The kernel sees the owner TID in the futex word, boosts the owner to the waiter’s scheduling priority, and blocks the waiter on an rt_mutex.
  3. Release (no waiters): PE side does CAS(LockSemaphore, my_tid, 0). Pure user-space.
  4. Release (waiters present): If FUTEX_WAITERS bit is set, PE side calls NtNspaUnlockCriticalSectionPI(address). Kernel transfers ownership to the highest-priority waiter and drops the boost.
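
The uncontended halves of this protocol are pure user-space atomics and can be sketched directly; the contended halves (steps 2 and 4) would cross into the Unix side for FUTEX_LOCK_PI / FUTEX_UNLOCK_PI and are only marked in comments here. Names and the standalone lock word are illustrative, standing in for RTL_CRITICAL_SECTION.LockSemaphore:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FUTEX_WAITERS 0x80000000u   /* kernel sets this on contention */

static atomic_uint lock_semaphore;  /* stands in for LockSemaphore */

static bool cs_try_acquire(uint32_t my_tid)
{
    unsigned int expected = 0;
    /* step 1: CAS our TID into the futex word – never leaves user space.
     * On failure the real code calls NtNspaLockCriticalSectionPI, and the
     * kernel boosts the owner (FUTEX_LOCK_PI). */
    return atomic_compare_exchange_strong(&lock_semaphore, &expected, my_tid);
}

static bool cs_release(uint32_t my_tid)
{
    unsigned int expected = my_tid;
    /* step 3: no waiters – CAS(tid, 0), pure user space */
    if (atomic_compare_exchange_strong(&lock_semaphore, &expected, 0))
        return true;
    /* step 4: FUTEX_WAITERS bit is set – the real code calls
     * NtNspaUnlockCriticalSectionPI and the kernel hands ownership to the
     * highest-priority waiter and drops the boost */
    return false;
}
```

The key property is that the same 32-bit word serves both the user-space CAS fast path and the kernel's PI futex protocol, so no extra state is needed to switch between them.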

TID source: PE code cannot call syscall(SYS_gettid) directly. NSPA adds NtNspaGetUnixTid() which returns the Linux kernel TID from ntdll_thread_data->nspa_unix_tid. PE caches this in a __thread variable – the syscall fires at most once per thread.

Path B: NTSync Mutex PI

Win32 CreateMutex + WaitForSingleObject goes through the /dev/ntsync kernel driver. NSPA’s five kernel patches add PI, uring_fd, and PREEMPT_RT fixes to this path (see Section 6). The flow:

  1. NtWaitForSingleObject –> inproc_wait() resolves the mutex’s fd from the handle cache.
  2. linux_wait_objs() calls ioctl(device, NTSYNC_IOC_WAIT_ANY, &args).
  3. Inside the kernel, ntsync_insert_waiter() inserts the waiter into a priority-ordered queue (patch 2).
  4. ntsync_pi_recalc() scans both any_waiters and all_waiters for the highest-priority waiter. Compares against the owner’s saved original priority (not normal_prio, which changes after boost). If boost needed, looks up or creates a per-task ntsync_pi_owner tracking entry, increments boost_count, and boosts via sched_setattr_nocheck() (patch 3 v2).
  5. On release, try_wake_any_mutex() wakes the highest-priority waiter. ntsync_pi_drop() decrements boost_count; only restores original scheduling when the last boost on that task is removed (multi-object safe).

Fallback Behavior

Both PI paths degrade gracefully. CS-PI falls back to the legacy keyed-event wait if FUTEX_LOCK_PI returns ENOSYS. NTSync PI requires the patched kernel driver – without it, waiters are FIFO-ordered (upstream default) with no owner boost.

SRW Lock Spin Phase

Windows SRW locks (Slim Reader/Writer) spin briefly before entering the kernel wait path. NSPA implements this behavior: a bounded spin of 256 iterations on the lock word before falling back to the futex wait, skipped for RT threads (a SCHED_FIFO spinner would starve a lower-priority holder).

This matches Windows NT behavior where SRWLock uses a brief spin phase before blocking. The 256-iteration count is empirically chosen to cover the common case of short critical sections (sub-microsecond) without burning excessive cycles on longer holds.
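
A spin-then-block acquire in this spirit looks like the following. The 256-iteration count matches the text; the lock-word encoding and the stubbed blocking path are illustrative, not Wine's dlls/ntdll/sync.c code:

```c
#include <stdatomic.h>

#define SRW_SPIN_COUNT 256

static atomic_int srw_word;   /* 0 = free, 1 = held exclusively */
static int blocked_waits;     /* counts falls into the kernel wait path */

static void srw_acquire_exclusive(void)
{
    /* spin phase: covers sub-microsecond critical sections entirely in
     * user space (skipped for RT threads in the real implementation) */
    for (int i = 0; i < SRW_SPIN_COUNT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&srw_word, &expected, 1))
            return;             /* acquired during the spin phase */
        /* real code would pause/cpu_relax() here */
    }
    /* spin exhausted: enter the futex wait path (stub: just record it) */
    blocked_waits++;
    int expected = 0;
    while (!atomic_compare_exchange_weak(&srw_word, &expected, 1))
        expected = 0;
}

static void srw_release_exclusive(void)
{
    atomic_store(&srw_word, 0);
}
```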

Impact: Reduces kernel transitions for uncontended or briefly-contended SRW locks. The v5 NTSync d4 rapid throughput improved from 232K to 259K ops/s (+11.6%), consistent with fewer futex syscalls in the lock path.

Implementation: dlls/ntdll/sync.c (SRW acquire path).

pi_cond Requeue-PI (Unix-Side Condvars)

Wine-NSPA’s pi_cond_t (condition variable with PI support, from librtpi) uses FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI to close the PI gap in condition variable wakeup. The waiter transitions directly from “blocked on condvar” to “blocked on PI mutex with priority inheritance” – no gap.

Implementation: libs/librtpi/rtpi.h (header-only inline, pi_cond_wait, pi_cond_signal, pi_cond_broadcast).

Win32 Condvar PI (RtlSleepConditionVariableCS)

The same requeue-PI mechanism applied to PE-side Win32 condvars. When CS-PI is active and the CS is held non-recursively, RtlSleepConditionVariableCS takes the PI path: capture value, register condvar→mutex mapping, NtNspaCondWaitPI (unix: FUTEX_UNLOCK_PI + FUTEX_WAIT_REQUEUE_PI). On signal, kernel atomically requeues waiter onto PI mutex – zero gap.

Scope: CS-backed condvars only. SRW-backed condvars remain without PI (unsolved problem even in the Linux kernel).

PI Coverage Path Mechanism Scope
CS-PI FUTEX_LOCK_PI on LockSemaphore Win32 CriticalSection enter/leave
NTSync PI Kernel ntsync driver (priority-ordered wakeup) Win32 Mutex/Semaphore/Event
pi_cond requeue-PI FUTEX_WAIT_REQUEUE_PI in librtpi Unix-side condvars (audio, gstreamer)
Win32 condvar PI FUTEX_WAIT_REQUEUE_PI for RtlSleepConditionVariableCS Win32 SleepConditionVariableCS

Condvar Requeue-PI Diagram

Both pi_cond (unix-side) and Win32 condvar PI (PE-side) use FUTEX_WAIT_REQUEUE_PI / FUTEX_CMP_REQUEUE_PI to atomically move waiters from the condvar futex onto the PI mutex chain — eliminating the priority inversion gap between wake and mutex reacquire.

Old – plain FUTEX_WAIT (PI gap): the RT thread's pi_cond_wait(cond, mutex) does pi_mutex_unlock(mutex) (dropping its PI boost), then FUTEX_WAIT(&cond, seq) to sleep on the condvar. The signaler's FUTEX_WAKE makes it runnable – but here is the PI GAP: the thread is runnable with no boost and can be preempted by lower-priority threads before its manual pi_mutex_lock(mutex) reacquires the mutex and restores PI via the kernel rt_mutex. Measured worst case under RT load: 53.8μs.

New – FUTEX_WAIT_REQUEUE_PI (no gap): pi_cond_wait(cond, mutex) does pi_mutex_unlock(mutex), then FUTEX_WAIT_REQUEUE_PI(&cond, &mutex). The signaler's FUTEX_CMP_REQUEUE_PI requeues the waiter ATOMICALLY onto the PI mutex chain – the kernel transfers the waiter directly, with no userspace gap – so the mutex is acquired with PI already in effect: unbroken PI through the entire path. Measured worst case under RT load: 31.6μs (-41% vs. old).

Full details: Win32 Condvar PI documentation (architecture, syscall interface, mapping table design, 2 SVG diagrams).


6. NTSync Kernel Patches

Five patches applied to drivers/misc/ntsync.c in the NSPA kernel tree. Patches 1-3 make the NTSync driver safe on PREEMPT_RT and add Windows-faithful priority semantics with PI boost. Patches 4-5 integrate ntsync with io_uring for CQE wakeup and fix a PREEMPT_RT allocation bug.

Patch 1: raw_spinlock + rt_mutex hardening

Problem: Upstream ntsync uses spinlock_t, which on PREEMPT_RT becomes a sleeping rt_mutex. This is correct for general use but changes the timing characteristics and makes some code paths that assume non-preemptibility incorrect.

Fix: Convert obj->lock to raw_spinlock_t with raw_spin_lock() / raw_spin_unlock(). This preserves true spin semantics even on PREEMPT_RT kernels, matching the driver’s design assumption of short critical sections around object state updates.

Patch 2: Priority-ordered waiter queues

Problem: Upstream ntsync uses list_add_tail() for all waiters – FIFO order. Windows NT uses strict priority ordering: the highest-priority waiter is always woken first on object release.

Fix: Replace list_add_tail() with ntsync_insert_waiter(), which walks the waiter list and inserts the new entry before the first entry with lower scheduling priority (task->prio). Same-priority waiters are FIFO within their level. This matches Windows NT scheduling semantics exactly.

static void ntsync_insert_waiter(struct ntsync_q_entry *new_entry,
                                 struct list_head *head)
{
    struct ntsync_q_entry *entry;
    list_for_each_entry(entry, head, node) {
        if (new_entry->q->task->prio < entry->q->task->prio) {
            list_add_tail(&new_entry->node, &entry->node);
            return;
        }
    }
    list_add_tail(&new_entry->node, head);
}
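
A user-space analog of this insert (using an explicit circular doubly linked list instead of the kernel's list.h, with an `id` field added so FIFO order among equals is observable) behaves the same way – lower `prio` means higher scheduling priority, as with task->prio:

```c
#include <stddef.h>

struct waiter {
    int prio;                 /* lower value = higher priority */
    int id;                   /* illustrative: to observe FIFO order */
    struct waiter *prev, *next;
};

static struct waiter head = { 0, 0, &head, &head };   /* circular sentinel */

static void insert_waiter(struct waiter *new_entry)
{
    struct waiter *entry;
    for (entry = head.next; entry != &head; entry = entry->next) {
        if (new_entry->prio < entry->prio) {
            /* insert before the first lower-priority entry */
            new_entry->prev = entry->prev;
            new_entry->next = entry;
            entry->prev->next = new_entry;
            entry->prev = new_entry;
            return;
        }
    }
    /* lowest priority so far (or empty list): append at the tail,
     * which also keeps same-priority waiters FIFO within their level */
    new_entry->prev = head.prev;
    new_entry->next = &head;
    head.prev->next = new_entry;
    head.prev = new_entry;
}
```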

Patch 3: Mutex owner PI boost (v2, 2026-04-15)

Problem: When a SCHED_FIFO thread waits on a mutex held by a SCHED_OTHER thread, the holder may not get CPU time promptly (CFS time-sharing), causing unbounded priority inversion.

Fix (v2): ntsync_pi_recalc() scans both any_waiters and all_waiters for the highest-priority waiter. Compares against the owner’s saved original priority (via per-device ntsync_pi_owner tracking), not normal_prio (which changes after boost). Per-task boost_count ensures original scheduling is saved once and restored only when ALL mutexes stop boosting that task. Conservative over-boosting between first and last removal is safe (never under-boosts).

v2 fixes three bugs from v1: (a) multi-object PI corruption when a task held multiple boosted mutexes, (b) zero PI for WaitForMultipleObjects(bWaitAll=TRUE), (c) stale normal_prio comparison causing boost/unboost thrashing. Test results: philosophers RT max wait 1620→865us (-46.6%), ntsync d8 PI contention 479→239ms (-50.1%).

/* Per-task tracking — saves orig_attr once, restores when boost_count reaches 0 */
po = find_pi_owner(dev, owner);
base_prio = po ? po->orig_normal_prio : owner->normal_prio;

if (highest_prio < base_prio && !was_boosted) {
    if (!po) {
        po = kzalloc(sizeof(*po), GFP_ATOMIC);
        po->orig_attr = capture_sched(owner);
        po->orig_normal_prio = owner->normal_prio;
        list_add(&po->node, &dev->boosted_owners);
    }
    po->boost_count++;
    obj->u.mutex.pi_boosted = true;
}
if (needs_boost && highest_prio < owner->prio)
    sched_setattr_nocheck(owner, &boost);
(Diagram: each /dev/ntsync fd (ntsync_device) holds wait_all_lock (rt_mutex), boost_lock (raw_spinlock), and a boosted_owners list of per-task ntsync_pi_owner entries – owner task_struct*, saved orig_attr (e.g. SCHED_OTHER), orig_normal_prio (e.g. 120, CFS), and boost_count (e.g. 2 for mutexes M1 + M2). ntsync_pi_recalc() flow: scan any_waiters, scan all_waiters, compare against orig_normal_prio, find or create the pi_owner, increment/decrement boost_count; boost via setattr, unboost only when boost_count reaches 0. Multi-object safe: the owner stays boosted until ALL mutexes stop contributing (conservative over-boost).)

Scaling Characteristics

Tested with transitive PI chains up to depth 12. RT wait time does not increase with chain depth beyond the tail holder’s work time (~235ms for a 100M-iter CPU loop). The per-hop increment (~50ms) is visible in individual holder elapsed times but does not accumulate in the RT thread’s total wait. ntsync_pi_recalc() scales to at least depth 12 without degradation.

Patch 4: uring_fd extension (io_uring CQE wakeup)

Problem: When a thread is blocked in NTSYNC_IOC_WAIT_ANY/ALL and an io_uring CQE arrives (e.g. socket data ready), there’s no mechanism to wake the thread — the CQE sits in the ring until the ntsync wait times out or another object is signaled.

Fix: Repurpose the pad field in ntsync_wait_args as uring_fd. When set to a valid eventfd (registered with IORING_REGISTER_EVENTFD), the ntsync wait ioctl monitors it via poll_initwait/vfs_poll alongside the ntsync objects. When the eventfd fires (io_uring posted a CQE), the ioctl returns NTSYNC_INDEX_URING_READY (0xFFFFFFFE). The client-side retry loop in sync.c drains CQEs and re-enters the wait.
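
The client-side retry loop's shape can be sketched with the wait ioctl and the CQE drain stubbed out. Only the constant 0xFFFFFFFE comes from the text; the function and variable names are illustrative, not the actual sync.c code:

```c
#include <stdint.h>

#define NTSYNC_INDEX_URING_READY 0xFFFFFFFEu   /* from the kernel extension */

static int pending_cqes;   /* stub: CQEs sitting in the ring */
static int drained_cqes;   /* how many the loop has processed */

/* Stub ntsync wait ioctl: reports URING_READY while CQEs are pending
 * (the registered eventfd fired), then pretends object 0 was signaled. */
static uint32_t stub_wait_ioctl(void)
{
    return pending_cqes ? NTSYNC_INDEX_URING_READY : 0;
}

static void stub_drain_cqes(void)
{
    drained_cqes += pending_cqes;
    pending_cqes = 0;
}

/* The retry loop: drain io_uring completions, then re-enter the wait. */
static uint32_t wait_with_uring_retry(void)
{
    for (;;) {
        uint32_t index = stub_wait_ioctl();
        if (index == NTSYNC_INDEX_URING_READY) {
            stub_drain_cqes();   /* process completions while not blocked */
            continue;            /* re-enter the ntsync wait */
        }
        return index;            /* a real ntsync object was signaled */
    }
}
```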

Patch 5: PI kmalloc pre-allocation fix

Problem: ntsync_pi_recalc() called kzalloc(GFP_ATOMIC) while holding a raw_spinlock. On PREEMPT_RT, the slab allocator’s internal rt_spin_lock can sleep, triggering __schedule_bug (BUG: scheduling while atomic).

Fix: Pre-allocate the ntsync_pi_owner struct before acquiring the raw spinlock. Pass the pre-allocated pointer into ntsync_pi_recalc() and consume it only if a new owner entry is needed; otherwise free it after releasing the lock.
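
The allocate-before-lock pattern generalizes beyond this driver; in user-space miniature (a statically initialized pthread mutex stands in for the raw_spinlock, and all names are illustrative) it looks like this:

```c
#include <pthread.h>
#include <stdlib.h>
#include <stdbool.h>

struct pi_owner { int boost_count; };

static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;  /* "raw_spinlock" */
static struct pi_owner *current_owner;   /* entry created on first boost */

/* Analog of ntsync_pi_recalc(): never allocates under the lock – it either
 * consumes the caller's pre-allocation or hands it back for freeing. */
static void pi_recalc(bool need_new_owner, struct pi_owner *prealloc,
                      struct pi_owner **unused_out)
{
    pthread_mutex_lock(&obj_lock);
    if (need_new_owner && !current_owner) {
        current_owner = prealloc;        /* consume the pre-allocation */
        *unused_out = NULL;
    } else {
        *unused_out = prealloc;          /* not needed: free after unlock */
    }
    pthread_mutex_unlock(&obj_lock);
}

static void boost_owner(bool need_new_owner)
{
    /* allocation happens here, before the lock, where sleeping is legal
     * (on PREEMPT_RT the slab allocator may sleep internally) */
    struct pi_owner *prealloc = calloc(1, sizeof(*prealloc));
    struct pi_owner *unused;
    pi_recalc(need_new_owner, prealloc, &unused);
    free(unused);                        /* free(NULL) is a no-op if consumed */
}
```

The occasional wasted allocation (pre-allocated but not needed) is the price for never calling the allocator inside an atomic section.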


7. io_uring I/O Architecture

Per-thread io_uring rings in ntdll’s Unix layer bypass the wineserver for file and socket I/O. See io_uring-architecture.html for the full design document with diagrams.

What It Replaces

Operation Before (server-mediated) After (io_uring)
Sync file poll poll() syscall in NtReadFile io_uring_prep_poll_add → 1 kernel transition
Async file read/write 2 server round-trips + epoll monitoring io_uring_prep_read/write → 0 server involvement
Socket recv/send (EAGAIN) Server epoll monitors fd, alerts client io_uring POLL_ADD → CQE → try_recv/try_send

Key Design Decisions

Test Results

Phase Test Result
Phase 1 Sync poll replacement All tests PASS
Phase 2 Async file I/O bypass All tests PASS
Phase 3 Socket I/O (sync + overlapped) 4000/4000, avg 95-113us

Files

File Purpose
dlls/ntdll/unix/io_uring.c (~760 LOC) Ring management, pool allocator, Phase 1-3 submit/complete
dlls/ntdll/unix/socket.c ALERTED interception, CQE handler, bitmap set/clear
dlls/ntdll/unix/sync.c ntsync uring_fd retry loop
server/sock.c E2 bitmap check in sock_get_poll_events

8. Audio Stack Architecture

Wine’s Linux audio stack has two layers: Windows-facing APIs such as WASAPI / WinMM / ASIO on top, and a Unix-side driver backend underneath. In upstream Wine, normal application audio typically flows through mmdevapi into winealsa.drv, then out through ALSA or PipeWire. In Wine-NSPA, winejack.drv replaces that backend with a direct JACK path for standard shared/exclusive audio, while nspaASIO is a separate bridge for hosts that specifically need ASIO semantics.

Audio Path Diagram

(Diagram, three panels.)

winealsa.drv (vanilla Wine): App → WASAPI → mmdevapi → winealsa.drv (NtDelayExecution timer loop) → ALSA PCM device → PipeWire (extra hop to JACK). pthread_mutex in the timer loop, no exclusive mode, no ASIO, scalar conversion loops, drift-prone timer.

winejack.drv WASAPI path (Wine-NSPA): App → WASAPI → mmdevapi → winejack.drv → JACK directly, no ALSA hop. Lock-free RT side, SSE2 conversion, PI mutex on the app side, hardware-synced callback, exclusive mode, per-channel fast path.

winejack.drv ASIO path – zero latency (Wine-NSPA), inside the JACK process_callback (RT thread, SCHED_FIFO):
  1. MIDI I/O (lock-free ringbuffer)
  2. JACK capture → ASIO input (memcpy)
  3. bufferSwitch() – host fills output, synchronized via futex for a same-period round-trip
  4. ASIO output → JACK ports (memcpy)
  5. WASAPI streams (games, media)
Output latency is exactly 1 JACK period (the theoretical minimum): data written in bufferSwitch is output in the SAME period, and MIDI is synchronized.

winejack.drv

A unified Wine audio/MIDI driver that connects directly to JACK, replacing winealsa.drv + winepulse.drv for systems using JACK or PipeWire.

Implementation: dlls/winejack.drv/jack.c (~2700 lines, Unix side) + dlls/winejack.drv/jackmidi.c (~700 lines, MIDI).

nspaASIO

A Wine PE DLL implementing the ASIO COM interface. Windows ASIO apps see a standard ASIO driver named “nspaASIO.” When JACK is active and float32 format is available, nspaASIO registers directly with winejack – bypassing WASAPI entirely. The play_thread uses futex synchronization with the JACK RT callback for same-period output.

Falls back to WASAPI exclusive mode when direct registration is unavailable (non-float32 format, JACK not running).

Comparison: winejack.drv vs winealsa.drv (vanilla Wine)

| Feature | winealsa.drv (vanilla) | winejack.drv (NSPA) |
|---|---|---|
| Backend | ALSA PCM | JACK (via PipeWire or native jackd) |
| RT callback | NtDelayExecution timer loop (drift-prone) | JACK process callback (hardware-synced) |
| Exclusive mode | Accepts flag, no real exclusive access | Dedicated JACK ports, proper period contract |
| ASIO support | None (needs separate wineasio) | Built-in via nspaASIO (Phase F) |
| ASIO latency | N/A (wineasio: 1 period, separate driver) | 1 period (same driver, same callback) |
| MIDI | ALSA sequencer | JACK MIDI (sub-period timestamps) |
| Format conversion | Scalar loops | SSE2 mono/stereo fast paths |
| Locking | pthread_mutex in timer loop | PI mutex (app side), lock-free (RT side) |
| Fast path | None | Per-channel double buffers (exclusive float32) |
| WASAPI+ASIO coexistence | Separate drivers, no coordination | Same JACK callback, same period |

9. QPC & Timing

Low-latency audio requires precise, low-overhead timing. NSPA makes several changes to Wine’s timing subsystem to reduce jitter and overhead.

NtQueryPerformanceCounter

Wine’s QPC calls monotonic_counter() which uses clock_gettime(CLOCK_BOOTTIME) (falling back to CLOCK_MONOTONIC). This is a vDSO-accelerated call on modern kernels (~26ns). Without vDSO it becomes a real syscall (~328ns, 12.5x slower). The preloader’s vDSO preservation port (Jinoh Kang series) ensures the vDSO is never deleted – if ASLR places it in a reserved address range, the preloader relocates it via mremap instead of removing it from the auxiliary vector. The TSC frequency is calibrated at init via /sys/devices/system/cpu/cpu*/cpufreq/base_frequency for accurate jiffies-to-TSC conversion.

static inline ULONGLONG monotonic_counter(void)
{
    struct timespec ts;
    if (!clock_gettime( CLOCK_BOOTTIME, &ts ))
        return ts.tv_sec * (ULONGLONG)TICKSPERSEC + ts.tv_nsec / 100;
    /* CLOCK_BOOTTIME unavailable: fall back to CLOCK_MONOTONIC */
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return ts.tv_sec * (ULONGLONG)TICKSPERSEC + ts.tv_nsec / 100;
}

PR_SET_TIMERSLACK

When an application calls NtSetTimerResolution (the backend for timeBeginPeriod), NSPA translates the requested resolution to a Linux timer slack value via prctl(PR_SET_TIMERSLACK, slack_ns). The default Linux timer slack is 50us; setting it to 1ns gives sub-millisecond timer precision for Sleep(), poll(), nanosleep(), and futex waits.

if (set)
{
    unsigned long slack_ns = (unsigned long)res * 100;  /* 100ns units -> ns */
    if (slack_ns < 1) slack_ns = 1;
    prctl( PR_SET_TIMERSLACK, slack_ns );
}
else
    prctl( PR_SET_TIMERSLACK, 0 );  /* reset to default */

Why this matters: Many audio apps call timeBeginPeriod(1) expecting 1ms timer precision. Without timer slack adjustment, Linux’s default 50us slack on top of CFS scheduling jitter can turn a Sleep(1) into a 2-3ms delay. With PR_SET_TIMERSLACK set to match the requested resolution, timer expirations fire at their requested time.

NtDelayExecution

Wine’s Sleep() implementation uses clock_nanosleep(CLOCK_MONOTONIC, ...) for high-precision delays, which benefits from the PREEMPT_RT kernel’s deterministic scheduling and the per-thread timer slack set above.

ExSetTimerResolution Forwarding

The kernel-mode ExSetTimerResolution API is forwarded from ntoskrnl.exe to the same NtSetTimerResolution path, ensuring that drivers setting timer resolution also benefit from the timer slack adjustment.


10. Memory & Large Pages

NSPA implements Windows VirtualAlloc(MEM_LARGE_PAGES) using Linux hugetlbfs, reducing TLB misses for large allocations common in audio sample buffers and plugin memory pools.

Implementation

| Windows API | Linux Implementation | Page Size |
|---|---|---|
| VirtualAlloc(MEM_LARGE_PAGES) | mmap(MAP_HUGETLB \| MAP_LOCKED, ...) | 2 MB (default hugepage) |
| VirtualAlloc(MEM_LARGE_PAGES) with 1 GB hint | mmap(MAP_HUGETLB \| MAP_HUGE_1GB \| MAP_LOCKED, ...) | 1 GB |
| QueryWorkingSetEx | PAGEMAP_SCAN ioctl / /proc/pid/pagemap | Reports LargePage flag |

Audio benefit: A typical VST plugin loads 50-500 MB of sample data. With 4KB pages, this requires 12,800-128,000 TLB entries. With 2MB pages, only 25-250 entries are needed – a 512x reduction in TLB pressure. For systems with 1GB hugepages pre-allocated, a single TLB entry covers the entire sample bank.

Key Implementation Details

Implementation: dlls/ntdll/unix/virtual.c, anon_mmap_alloc() / NtAllocateVirtualMemory().


11. msvcrt SIMD Optimizations

SIMD Memory/String Operations

Wine’s msvcrt provides the C runtime for Win32 applications, including memcpy, memmove, memchr, strlen, and memcmp. Upstream Wine uses scalar C implementations or minimal hand-written assembly. NSPA replaces these with SIMD-optimized implementations:

| Function | Implementation | Width | Fallback |
|---|---|---|---|
| memcpy | AVX _mm256_loadu_si256 / _mm256_storeu_si256 | 256-bit | SSE2 128-bit |
| memmove | AVX with overlap detection + reverse copy | 256-bit | SSE2 128-bit |
| memchr | SSE2 _mm_cmpeq_epi8 + _mm_movemask_epi8 | 128-bit | scalar |
| strlen | SSE2 _mm_cmpeq_epi8 null scan | 128-bit | scalar |
| memcmp | SSE2 _mm_cmpeq_epi8 + early exit | 128-bit | scalar |

Runtime CPU dispatch: At DLL init, memcpy/memmove check CPUID for AVX support and select the AVX path when available. SSE2 is the minimum baseline (guaranteed on x86_64). After init, each call costs only one indirect jump through a cached function pointer – no per-call feature checks.

Why it matters for RT: These functions are called thousands of times per audio buffer cycle – buffer copies in WASAPI/ASIO, socket I/O data movement, string parsing in API calls. The v5 test results show measurable impact: +4.7% CS throughput, +27% baseline socket-io throughput, -7% process startup time.

Implementation: dlls/msvcrt/string.c (all five functions), dlls/msvcrt/math.c (AVX detection), dlls/msvcrt/msvcrt.h (avx_supported declaration).


12. Version History

| Version | Scope | Key Files |
|---|---|---|
| v1 | RT priority mapping: two-tier promotion with NSPA_RT_PRIO as the client RT ceiling and linear scaling below it | dlls/ntdll/unix/thread.c, server/thread.c |
| v1.1 | Wineserver self-promotion to SCHED_FIFO at NSPA_RT_PRIO-16 (below the entire RT band) | server/thread.c |
| v1.2 | Cross-thread promotion via HANDLE-to-unix_tid map; NSPA_RT_POLICY=TS conservative mode | dlls/ntdll/unix/thread.c |
| v1.5 | Shmem wineserver IPC (Torge Matthies forward-port); reduces wineserver round-trip latency | dlls/ntdll/unix/server.c, server/thread.c |
| v2.0 | librtpi vendoring into libs/librtpi/; PI-aware mutexes/condvars for Unix-side Wine internals | libs/librtpi/*, include/rtpi.h |
| v2.3 | CS-PI: FUTEX_LOCK_PI on CRITICAL_SECTION via repurposed LockSemaphore field | dlls/ntdll/sync.c, dlls/ntdll/unix/sync.c |
| kernel | NTSync PI: 5 patches (raw_spinlock, priority-ordered queues, PI boost v2, uring_fd, kmalloc fix) | drivers/misc/ntsync.c |
| kernel | Client-side NTSync: anonymous sync object creation without wineserver round-trip | dlls/ntdll/unix/sync.c |
| io_uring | Phase 1+2: sync poll replacement + async file I/O server bypass + TLS pool allocator | dlls/ntdll/unix/io_uring.c, dlls/ntdll/unix/file.c |
| io_uring | Phase 3: socket I/O bypass via E2 bitmap + ALERTED-state interception (sync+overlapped) | dlls/ntdll/unix/socket.c, server/sock.c |
| kernel | ntsync uring_fd extension: CQE wakeup in ntsync waits + PI kmalloc pre-alloc fix | drivers/misc/ntsync.c |
| msvcrt | AVX memcpy/memmove, SSE2 memchr/strlen/memcmp with runtime CPU dispatch | dlls/msvcrt/string.c, dlls/msvcrt/math.c |
| sync | SRW lock spin phase (256 iters, skipped for RT threads, matching Windows behavior) | dlls/ntdll/sync.c |
| sync | pi_cond requeue-PI upgrade (FUTEX_WAIT_REQUEUE_PI / FUTEX_CMP_REQUEUE_PI) | libs/librtpi/rtpi.h |
| sync | Win32 condvar PI: FUTEX_WAIT_REQUEUE_PI for RtlSleepConditionVariableCS (condvar→mutex mapping table + 3 new syscalls) | dlls/ntdll/sync.c, dlls/ntdll/unix/sync.c, dlls/ntdll/ntsyscalls.h |

Environment Variable Quick Reference

| Variable | Default | Description |
|---|---|---|
| NSPA_RT_PRIO | unset (dormant) | Master RT switch + ceiling FIFO priority |
| NSPA_RT_POLICY | FF | FF/RR/TS for lower RT band [16..30] |
| NSPA_SRV_RT_PRIO | NSPA_RT_PRIO-16 | Wineserver FIFO priority override |
| NSPA_SRV_RT_POLICY | FF | Wineserver scheduler policy |

File Map

| File | Role |
|---|---|
| dlls/ntdll/unix/thread.c | Tier 1 RT promotion, v1.2 cross-thread map, priority mapping |
| server/thread.c | Tier 2 RT promotion, wineserver self-promotion, env var parsing |
| dlls/ntdll/sync.c | CS-PI fast path (PE side): TID CAS, recursion handling |
| dlls/ntdll/unix/sync.c | CS-PI slow path, NTSync wait/create, client-side handles, QPC, timer slack |
| dlls/ntdll/unix/virtual.c | Large/huge pages, MAP_HUGETLB allocation |
| dlls/ntdll/unix/server.c | Shmem wineserver IPC (v1.5) |
| dlls/winejack.drv/jack.c | JACK audio driver (WASAPI backend) |
| dlls/winejack.drv/jackmidi.c | JACK MIDI driver |
| libs/librtpi/ | Vendored PI-aware mutex/condvar library |
| dlls/ntdll/unix/io_uring.c | io_uring per-thread ring, pool allocator, Phase 1-3 submit/complete |
| dlls/msvcrt/string.c | SSE2 memchr, strlen, memcmp |
| dlls/msvcrt/mem.c | AVX/SSE2 memcpy, memmove with runtime CPU dispatch |
| libs/librtpi/pi_cond.c | Requeue-PI condition variable for unix-side consumers (FUTEX_WAIT_REQUEUE_PI) |
| dlls/ntdll/sync.c + unix/sync.c | Win32 condvar PI: requeue-PI for RtlSleepConditionVariableCS (3 new syscalls + mapping table) |
| dlls/combase/apartment.c | CoWaitForMultipleHandles correctness rewrite |
| drivers/misc/ntsync.c | NTSync kernel driver (in kernel tree) |

Wine-NSPA Architecture Reference | Generated 2026-04-16 | Wine 11.6 + NSPA RT patchset