Wine-NSPA – Critical Section Priority Inheritance (CS-PI)

Wine 11.6 + NSPA RT v2.3 | Kernel 6.19.x-rt with FUTEX_LOCK_PI | 2026-04-15 | Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Upstream Wine vs NSPA Comparison
  3. LockSemaphore Repurposing
  4. Fast Path (Uncontended)
  5. Slow Path (Contended)
  6. Release Path
  7. TID Source
  8. Gating Mechanism
  9. Recursive Locking
  10. Fallback Behavior
  11. SRW Lock Spin Phase
  12. Validation

1. Overview

RTL_CRITICAL_SECTION is the most contended lock primitive in Wine. Every heap allocation, loader operation, DllMain serialization, GDI call, and most application/plugin code exercises critical sections. In a typical DAW workload running 50-100 VST plugins, thousands of CS acquire/release pairs happen per audio callback period (typically 2-5 ms at 48 kHz with a 128-sample buffer).

The core problem: priority inversion. When an RT audio thread (SCHED_FIFO, priority 80+) blocks on a critical section held by a normal-priority thread (SCHED_OTHER), the holder cannot run because the RT thread is monopolizing the CPU. Under CFS, the holder competes with dozens of other SCHED_OTHER threads for time slices. The RT thread’s audio callback deadline passes. The result is an audible glitch – an xrun.

Windows does not implement priority inheritance on CRITICAL_SECTION. Windows' NT scheduler has its own mechanisms for mitigating inversion (priority boosting on wakeup, quantum donation), but these are heuristic and non-deterministic. NSPA’s CS-PI is novel: it grafts Linux’s kernel rt_mutex PI protocol onto Wine’s RTL_CRITICAL_SECTION, giving every CS acquire/release pair the same deterministic, transitive priority inheritance that POSIX pthread_mutex with PTHREAD_PRIO_INHERIT provides.

Key properties of CS-PI:

  - Deterministic, transitive priority inheritance via the kernel's rt_mutex protocol (FUTEX_LOCK_PI): the holder is boosted to the highest waiter's priority, through arbitrarily nested lock chains.
  - Uncontended acquires and releases stay entirely in userspace (~5 ns overhead versus upstream).
  - Gated on the NSPA_RT_PRIO environment variable; when unset, the upstream legacy path runs with only a single-branch overhead.
  - Soft kernel dependency: if FUTEX_LOCK_PI is unavailable (ENOSYS), CS-PI permanently disables itself and reverts to upstream behavior.

Source Files

File                            Lines               Content
dlls/ntdll/sync.c               149-495             PE-side CS-PI: design comment, state machine, fast/slow/release paths
dlls/ntdll/unix/sync.c          144-168, 4057-4156  Unix-side: futex helpers, NtNspaGetUnixTid, NtNspaLockCriticalSectionPI, NtNspaUnlockCriticalSectionPI
dlls/ntdll/unix/unix_private.h  150-181             nspa_unix_tid field in ntdll_thread_data, C_ASSERT offset validation

2. Upstream Wine vs NSPA Comparison

The following diagram shows the complete acquire and release flow for both upstream Wine and NSPA CS-PI, side by side. The left column is upstream Wine’s legacy path; the right column is NSPA’s PI-enabled path. Both start from the same RtlEnterCriticalSection entry point.

RTL_CRITICAL_SECTION: Upstream Wine vs NSPA CS-PI

Upstream Wine (no PI):

  1. RtlEnterCriticalSection(&cs): if SpinCount is set, spin up to N iterations.
  2. InterlockedIncrement(&LockCount): -1 -> 0 = won lock; >= 0 = contended.
  3. Uncontended: set OwningThread, RecursionCount. Done -- pure userspace.
  4. Contended: RtlpWaitForCriticalSection() -> keyed event wait via
     NtWaitForKeyedEvent(LockSemaphore). NO PRIORITY INHERITANCE.
  5. Release: RtlLeaveCriticalSection(&cs) -> InterlockedDecrement(&LockCount);
     < 0 = no waiters; >= 0 = RtlpUnWaitCriticalSection() ->
     NtReleaseKeyedEvent (FIFO wakeup).

  Problem -- priority inversion: the RT thread is blocked on a keyed event while
  the holder runs at SCHED_OTHER. The kernel has no knowledge of lock ownership,
  so no boost is possible. Result: unbounded inversion, audio xruns.

NSPA CS-PI (FUTEX_LOCK_PI):

  1. RtlEnterCriticalSection(&cs): nspa_cs_pi_active() checks NSPA_RT_PRIO.
  2. Fast path: CAS(LockSemaphore, 0, my_tid) via InterlockedCompareExchange --
     a single atomic op, ~5 ns overhead vs upstream, never leaves userspace.
  3. Won (CAS returned 0): set OwningThread, RecursionCount = 1.
  4. Contended (CAS failed): optional spin (SpinCount iterations), then
     NtNspaLockCriticalSectionPI(futex) crosses the PE/Unix boundary and issues
     futex(addr, FUTEX_LOCK_PI_PRIVATE, ...) -- the userspace/kernel boundary.
  5. Kernel rt_mutex PI chain: read the owner TID from the futex word, boost
     the owner to the waiter's priority. Transitive: chains propagate through
     nested locks.
  6. Release: RtlLeaveCriticalSection(&cs) -> nspa_cs_leave_pi(). No waiters:
     CAS(LockSemaphore, tid, 0) -- pure userspace, no syscall. FUTEX_WAITERS
     set: NtNspaUnlockCriticalSectionPI() -- the kernel hands off to the
     highest-priority waiter.

  Solution -- kernel PI chain: the holder is boosted to the RT waiter's
  scheduling priority. The kernel sees ownership via the futex word TID, so the
  boost is instant. Result: bounded inversion, deterministic audio.

Comparison Summary

Property              Upstream Wine                      NSPA CS-PI
Wait mechanism        Keyed event (NtWaitForKeyedEvent)  FUTEX_LOCK_PI (kernel rt_mutex)
Priority inheritance  None                               Full transitive PI via kernel
Ownership tracking    OwningThread (Win32 TID only)      Futex word (Linux TID) + OwningThread
Uncontended cost      1 atomic (InterlockedIncrement)    1 atomic + 1 CAS (~5 ns overhead)

3. LockSemaphore Repurposing

RTL_CRITICAL_SECTION has a LockSemaphore field typed as HANDLE (i.e., PVOID – pointer-sized). In upstream Wine, this stores a handle to a keyed event object, lazily created on first contention. The keyed event is used as the park/unpark mechanism: contended acquires call NtWaitForKeyedEvent(LockSemaphore), and releases with waiters call NtReleaseKeyedEvent(LockSemaphore).

Under CS-PI, LockSemaphore is repurposed as a FUTEX_LOCK_PI word. The field is still pointer-sized, but only the low 32 bits are used, matching the LONG that FUTEX_LOCK_PI operates on. The bit layout follows the kernel’s futex.h protocol exactly:

LockSemaphore as FUTEX_LOCK_PI word (32-bit layout):

  bit 31      FUTEX_WAITERS     0x80000000  Set by kernel when threads are blocked
  bit 30      FUTEX_OWNER_DIED  0x40000000  Set if owner exited without releasing
  bits 29..0  TID_MASK          0x3FFFFFFF  Linux kernel TID of current lock owner,
                                            validated by the kernel against its task list

  A value of 0 means the lock is free (no owner).

PE-side constants (dlls/ntdll/sync.c):

#define NSPA_CS_FUTEX_WAITERS  0x80000000U
#define NSPA_CS_FUTEX_TID_MASK 0x3fffffffU

Why this works without breaking the struct layout:

  - LockSemaphore is pointer-sized (HANDLE/PVOID), so a 32-bit futex word fits
    in its storage without changing the struct's size or any field offset. Code
    that embeds RTL_CRITICAL_SECTION sees an identical layout; only the
    interpretation of this one field changes.
  - FUTEX_LOCK_PI operates on an aligned 32-bit word, which the field's low 32
    bits provide on both 32- and 64-bit builds.
  - The interpretation is decided by the global gate (Section 8): with CS-PI
    active the field is a futex word; otherwise it holds the lazily created
    keyed-event handle exactly as in upstream Wine.

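The bit protocol above can be exercised with a pair of decode helpers. This is an illustrative sketch; the helper names are not from the Wine source, only the constants are.

```c
#include <assert.h>
#include <stdint.h>

#define NSPA_CS_FUTEX_WAITERS    0x80000000u
#define NSPA_CS_FUTEX_OWNER_DIED 0x40000000u
#define NSPA_CS_FUTEX_TID_MASK   0x3fffffffu

/* Owner TID occupies the low 30 bits; 0 means the lock is free. */
static inline uint32_t futex_owner_tid(uint32_t word)
{
    return word & NSPA_CS_FUTEX_TID_MASK;
}

/* The kernel sets FUTEX_WAITERS when at least one thread is blocked,
   which is what forces the release path into the kernel. */
static inline int futex_has_waiters(uint32_t word)
{
    return (word & NSPA_CS_FUTEX_WAITERS) != 0;
}
```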

4. Fast Path (Uncontended)

The fast path handles the common case: the critical section is free, and the caller acquires it without contention. This executes entirely on the PE side, never crosses to Unix code, and never enters the kernel.

Sequence

  1. Get Linux TID: nspa_get_unix_tid() reads the cached TID from the TEB (a single memory load, ~2 ns). See Section 7 for details.

  2. CAS the futex word: InterlockedCompareExchange(&LockSemaphore, my_tid, 0). If the field was 0 (lock free), it atomically writes my_tid and returns 0 (success). This is a single lock cmpxchg instruction on x86.

  3. Update bookkeeping: InterlockedIncrement(&LockCount), set OwningThread = win_tid, set RecursionCount = 1. These are for external compatibility with code that queries CS state.

Cost

Operation                       Time
Upstream uncontended acquire    ~3 ns (1 atomic: InterlockedIncrement)
NSPA CS-PI uncontended acquire  ~8 ns (1 memory load + 1 CAS + 1 atomic)
Overhead                        ~5 ns per uncontended acquire

The 5 ns overhead is the cost of the CAS on LockSemaphore plus the TEB memory load. This is acceptable: in a DAW running at 48 kHz / 128 samples, a single audio callback period is 2,666,667 ns. Even 10,000 CS acquire/release pairs per callback add only 50 us of overhead – under 2% of the callback budget.
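The budget arithmetic above can be checked directly with integer math (values taken from the text):

```c
#include <assert.h>

/* Callback period for a given buffer size and sample rate, in nanoseconds. */
static long long callback_period_ns(long long samples, long long rate_hz)
{
    return samples * 1000000000LL / rate_hz;
}

/* Total fast-path overhead for a given number of acquires. */
static long long acquire_overhead_ns(long long acquires, long long ns_each)
{
    return acquires * ns_each;
}
```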

Code Path

static inline BOOL nspa_cs_try_fast( RTL_CRITICAL_SECTION *crit, DWORD unix_tid )
{
    LONG *futex = (LONG *)&crit->LockSemaphore;
    return InterlockedCompareExchange( futex, (LONG)unix_tid, 0 ) == 0;
}

The function is inline – the compiler emits the lock cmpxchg directly at each call site. No function call overhead.
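Putting steps 1-3 together, the complete fast-path acquire can be sketched with C11 atomics standing in for the Interlocked* intrinsics. The struct and names below are simplified stand-ins, not the Wine definitions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct cs_sketch {
    atomic_int       lock_count;       /* LockCount bookkeeping */
    uint32_t         owning_thread;    /* Win32 TID, for compatibility */
    int              recursion_count;
    _Atomic uint32_t futex_word;       /* repurposed LockSemaphore (low 32 bits) */
};

/* Returns 1 on uncontended acquire, 0 if the caller must take the slow path. */
static int cs_enter_fast(struct cs_sketch *cs, uint32_t unix_tid, uint32_t win_tid)
{
    uint32_t expected = 0;
    /* Step 2: single CAS -- claim the free lock by writing our Linux TID. */
    if (!atomic_compare_exchange_strong(&cs->futex_word, &expected, unix_tid))
        return 0;
    /* Step 3: bookkeeping for external observers of CS state. */
    atomic_fetch_add(&cs->lock_count, 1);
    cs->owning_thread   = win_tid;
    cs->recursion_count = 1;
    return 1;
}
```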


5. Slow Path (Contended)

When the fast-path CAS fails (the futex word is non-zero, meaning another thread holds the lock), the slow path hands control to the kernel’s rt_mutex PI infrastructure.

Sequence

  1. Recursive check: Before going to the kernel, check if the current thread already owns this CS (by comparing OwningThread against the Win32 TID). If yes, bump RecursionCount and return. This avoids calling futex_lock_pi on a lock we already hold, which would return EDEADLK. See Section 9.

  2. Optional spin: If crit->SpinCount > 0, retry the CAS up to SpinCount times with YieldProcessor() between attempts. This catches short critical sections that release before the spin budget expires, avoiding the syscall overhead.

  3. Publish waiter count: InterlockedIncrement(&LockCount) before the syscall. This maintains LockCount semantics for external observers.

  4. Cross PE/Unix boundary: Call NtNspaLockCriticalSectionPI(futex). This is an Nt-style syscall that crosses from PE ntdll to Unix ntdll.

  5. Unix side: futex_lock_pi(futex) – a syscall(__NR_futex, addr, FUTEX_LOCK_PI_PRIVATE, ...). The kernel:

     - reads the owner TID from the futex word and looks up the owner task
       (find_task_by_vpid),
     - enqueues the caller in the rt_mutex's priority-ordered waiter tree,
     - boosts the owner to the highest waiter's priority, propagating the boost
       transitively through any locks the owner is itself waiting on,
     - blocks the caller until ownership is transferred.

  6. Return: When the owner releases and the kernel transfers ownership, futex_lock_pi returns 0. The futex word now contains the caller’s TID (possibly with FUTEX_WAITERS set if more threads are waiting). The PE side sets OwningThread and RecursionCount.
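A minimal standalone version of what the Unix-side helpers plausibly reduce to. The wrapper names are simplified; the real entry points are NtNspaLockCriticalSectionPI and NtNspaUnlockCriticalSectionPI:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* FUTEX_LOCK_PI: if *addr is 0 the kernel installs our TID atomically;
   otherwise it blocks us on the rt_mutex and boosts the owner. */
static long futex_lock_pi(uint32_t *addr)
{
    return syscall(SYS_futex, addr, FUTEX_LOCK_PI_PRIVATE, 0, NULL, NULL, 0);
}

/* FUTEX_UNLOCK_PI: hands the lock to the highest-priority waiter
   (or clears the word if none) and drops any boost we carried. */
static long futex_unlock_pi(uint32_t *addr)
{
    return syscall(SYS_futex, addr, FUTEX_UNLOCK_PI_PRIVATE, 0, NULL, NULL, 0);
}
```

Note that FUTEX_LOCK_PI succeeds immediately on an uncontended word (the kernel performs the 0 -> TID transition itself), which makes the wrappers testable single-threaded.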

PI Chain Diagram

PI chain: RT waiter to kernel boost (user PE -> user Unix -> kernel):

  1. RT waiter (SCHED_FIFO prio 80): CAS failed -- lock held. It reads the futex
     word &crit->LockSemaphore (TID=4567 | WAITERS=1) and identifies the lock
     holder: a SCHED_OTHER thread (TID 4567) doing work inside the CS.
  2. The waiter calls NtNspaLockCriticalSectionPI(), which issues
     futex_lock_pi(&LockSemaphore).
  3. Linux kernel rt_mutex PI infrastructure:
     1. Look up the owner task: find_task_by_vpid(TID).
     2. Create an rt_mutex PI waiter entry in the priority-ordered waiter tree.
     3. Boost the owner: SCHED_OTHER -> SCHED_FIFO 80.
  4. Boost in effect: the holder now runs at SCHED_FIFO 80 until release, while
     the RT waiter is blocked on the rt_mutex (not spinning). On release, the
     kernel transfers futex word ownership to the highest-priority waiter and
     drops the boost.

Error Handling

  - EDEADLK: the caller already owns the futex. This never reaches the kernel --
    recursion is detected in userspace first (Section 9).
  - ESRCH: the owner TID in the futex word is not a valid Linux TID. Prevented
    by sourcing TIDs from SYS_gettid rather than the Win32 thread ID (Section 7).
  - ENOSYS: the kernel lacks FUTEX_LOCK_PI. CS-PI disables itself permanently
    and all CS operations fall back to the legacy keyed-event path (Section 10).


6. Release Path

Release is the mirror of acquire, with the same fast/slow split.

Final Release (RecursionCount drops to 0)

  1. Clear bookkeeping: Set RecursionCount = 0, OwningThread = 0. These must be cleared before the futex word is released, because once the futex word is zero, another thread can acquire the lock and see stale values.

  2. Uncontended release (fast): InterlockedCompareExchange(&LockSemaphore, 0, my_tid). If the CAS succeeds (the old value was exactly my_tid with no FUTEX_WAITERS bit), the lock is free. No syscall. Decrement LockCount and return.

  3. Contended release (slow): If the CAS fails – typically because the kernel set the FUTEX_WAITERS bit (0x80000000) on the futex word while a waiter blocked – the old value is my_tid | FUTEX_WAITERS. The PE side calls NtNspaUnlockCriticalSectionPI(futex), which invokes futex(addr, FUTEX_UNLOCK_PI_PRIVATE, ...). The kernel:

     - selects the highest-priority waiter from the rt_mutex waiter tree,
     - writes that waiter's TID into the futex word (keeping FUTEX_WAITERS set
       if more threads remain) and wakes it,
     - drops the releasing thread's priority boost.

Recursive Release (RecursionCount > 1)

If RecursionCount > 1, this is not the final unlock. Just decrement RecursionCount and LockCount, then return. The futex word stays unchanged (still contains our TID).
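The final-release split can be sketched with C11 atomics. kernel_unlock is an illustrative stand-in for the NtNspaUnlockCriticalSectionPI call (passing a null pointer here simply skips it, for demonstration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FUTEX_WAITERS_BIT 0x80000000u

/* Returns 0 for a pure-userspace release, 1 when the kernel had to be
   involved because FUTEX_WAITERS was set on the word. */
static int cs_release_word(_Atomic uint32_t *futex_word, uint32_t my_tid,
                           void (*kernel_unlock)(_Atomic uint32_t *))
{
    uint32_t expected = my_tid;
    /* Fast: the word is exactly our TID (no waiters) -- clear it, done. */
    if (atomic_compare_exchange_strong(futex_word, &expected, 0))
        return 0;
    /* Slow: a waiter blocked and the kernel set FUTEX_WAITERS; the kernel
       must transfer ownership to the highest-priority waiter. */
    if (kernel_unlock)
        kernel_unlock(futex_word);
    return 1;
}
```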

Error Recovery

If NtNspaUnlockCriticalSectionPI returns STATUS_NOT_SUPPORTED (the extremely unlikely case where FUTEX_LOCK_PI worked but FUTEX_UNLOCK_PI returns ENOSYS), CS-PI logs an error, disables itself globally, and returns STATUS_SUCCESS. The futex word is stuck (contains a stale TID with FUTEX_WAITERS set), and future acquires on this specific CS will hang. This is accepted as a fatal diagnostic condition – it should never happen on a consistent kernel.


7. TID Source

FUTEX_LOCK_PI requires the owner field in the futex word to be a valid Linux kernel TID (pid_t from SYS_gettid). The kernel validates this against its task list – if the TID is invalid, futex_lock_pi returns ESRCH, and every contended acquire hangs.

Wine’s GetCurrentThreadId() returns the Win32 thread ID from TEB->ClientId.UniqueThread. This is a wineserver-assigned value, unrelated to the Linux kernel TID. Using it directly causes ESRCH.

Solution: Cached TID via TEB

The Unix-side ntdll_thread_data struct (embedded in the TEB’s GdiTebBatch region) has an nspa_unix_tid field. This is populated on first access via syscall(SYS_gettid) and cached for the thread’s lifetime.

PE-side read (zero-syscall hot path):

The PE side cannot include unix_private.h (it’s Unix-only). Instead, it reads the TID at a hardcoded byte offset from GdiTebBatch:

#ifdef _WIN64
#define NSPA_UNIX_TID_OFFSET 0xf8
#else
#define NSPA_UNIX_TID_OFFSET 0x88
#endif

static inline DWORD nspa_get_unix_tid(void)
{
    DWORD tid = *(volatile DWORD *)((char *)&NtCurrentTeb()->GdiTebBatch
                                    + NSPA_UNIX_TID_OFFSET);
    if (tid) return tid;
    return NtNspaGetUnixTid();  /* first call: populate via syscall */
}
Offset safety: C_ASSERT checks in unix_private.h verify that offsetof(struct ntdll_thread_data, nspa_unix_tid) matches the PE-side literal. If the struct layout changes, the build fails.

Cost:

  - Hot path (subsequent calls): ~2 ns (one memory load + branch-not-taken)
  - Cold path (first call per thread): ~200-500 ns (one syscall(SYS_gettid) round trip)

The cold path fires at most once per thread. In a typical DAW with 20-50 threads, this is 20-50 syscalls total across the entire process lifetime – negligible.
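The populate-then-cache pattern can be shown standalone. This sketch uses __thread storage; the real code stores the TID in ntdll_thread_data inside the TEB, and the function name is illustrative:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

static __thread unsigned int cached_unix_tid;  /* 0 until first use */

/* Cold path: one gettid syscall per thread. Hot path: a plain load. */
static unsigned int get_unix_tid(void)
{
    if (!cached_unix_tid)
        cached_unix_tid = (unsigned int)syscall(SYS_gettid);
    return cached_unix_tid;
}
```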


8. Gating Mechanism

CS-PI is gated on the NSPA_RT_PRIO environment variable. When the variable is unset, CS-PI is inactive and all CS functions execute the upstream Wine legacy path with zero overhead beyond a single branch on a cached state variable.

Tri-State Machine

nspa_cs_pi_state (static LONG):

   0 = uninitialized (first CS op on any thread triggers probe)
   1 = active (NSPA_RT_PRIO is set with a non-empty value)
  -1 = inactive (NSPA_RT_PRIO unset, or kernel returned ENOSYS)

Why Not RtlQueryEnvironmentVariable_U

The obvious approach – calling RtlQueryEnvironmentVariable_U to read the env var – causes a recursive stack overflow. That function internally acquires critical sections (the PEB lock, the process heap lock). Those CS operations re-enter nspa_cs_pi_active(), which re-enters RtlQueryEnvironmentVariable_U, and so on. This was observed as err:virtual:virtual_setup_exception crashes on every PE binary launch.

Direct PEB Scan

Instead, nspa_cs_pi_active() reads the PEB environment block directly:

    NtCurrentTeb()->Peb->ProcessParameters->Environment

This is a null-separated list of L"VAR=value\0" strings. The function walks the list comparing against L"NSPA_RT_PRIO=" character by character, using no Rtl functions and no locks. This mirrors how the Windows loader reads environment variables before kernel32 is loaded.
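The scan reduces to a prefix search over a null-separated block of wide strings. A generic sketch (env_find is an illustrative name; the array in the test stands in for Peb->ProcessParameters->Environment):

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Walk a null-separated block of L"VAR=value" strings looking for a
   prefix; no allocation, no locks. Returns a pointer to the value, or
   NULL if the variable is absent. The block ends with an empty string. */
static const wchar_t *env_find(const wchar_t *env, const wchar_t *prefix)
{
    size_t plen = wcslen(prefix);
    while (*env)
    {
        if (!wcsncmp(env, prefix, plen))
            return env + plen;          /* points at the value text */
        env += wcslen(env) + 1;         /* next L"VAR=value" entry */
    }
    return NULL;
}
```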

Race Safety

Multiple threads may call nspa_cs_pi_active() concurrently during the uninitialized phase. Each computes new_state independently, then publishes via InterlockedCompareExchange(&nspa_cs_pi_state, new_state, 0). The first writer wins; all subsequent readers see the published value. Since all threads observe the same PEB environment, they all compute the same answer – the race is benign.
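The publish step can be sketched as follows. env_set stands in for the result of the PEB environment scan; the function name is illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int cs_pi_state;   /* 0 = uninit, 1 = active, -1 = inactive */

/* Benign race: every probing thread computes the same answer from the
   same environment; the first CAS publishes it, and CAS losers (and all
   later callers) adopt the published value. */
static int cs_pi_active(int env_set)
{
    int s = atomic_load(&cs_pi_state);
    if (s == 0)
    {
        int new_state = env_set ? 1 : -1;
        int expected  = 0;
        if (!atomic_compare_exchange_strong(&cs_pi_state, &expected, new_state))
            new_state = expected;    /* someone else published first */
        s = new_state;
    }
    return s > 0;
}
```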


9. Recursive Locking

RTL_CRITICAL_SECTION supports recursive acquisition: the same thread can enter a CS multiple times, incrementing RecursionCount each time, and must leave the same number of times. CS-PI handles recursion without calling futex_lock_pi on a lock we already hold:

  1. On acquire: After the fast-path CAS fails, check crit->OwningThread == ULongToHandle(win_tid). If true, this is a recursive entry: bump RecursionCount and LockCount, return immediately. The futex word already contains our TID.

  2. On release: If RecursionCount > 1, decrement it and LockCount, return immediately. The futex word stays unchanged. Only when RecursionCount drops to 0 is the futex word CAS’d back to zero (or the kernel unlock path invoked).

Why This Matters

Calling futex_lock_pi when we already hold the futex would return EDEADLK (the kernel detects the self-deadlock via the rt_mutex chain). We must detect recursion in user space before the syscall. The OwningThread comparison is the canonical way – it uses the Win32 TID, matching both the legacy path’s check and external APIs like RtlIsCriticalSectionLockedByThread.
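The userspace recursion check can be sketched with simplified stand-in fields (not the Wine struct):

```c
#include <assert.h>
#include <stdint.h>

struct cs_rec {
    uint32_t owning_thread;    /* Win32 TID of the owner, 0 if free */
    int      recursion_count;
};

/* Returns 1 if this acquire was a recursive re-entry handled entirely in
   userspace; 0 means the caller proceeds to the futex slow path. The
   check must run before futex_lock_pi, which would return EDEADLK. */
static int cs_try_recursive(struct cs_rec *cs, uint32_t win_tid)
{
    if (cs->owning_thread != win_tid)
        return 0;
    cs->recursion_count++;     /* futex word already contains our TID */
    return 1;
}
```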


10. Fallback Behavior

CS-PI is a soft dependency on kernel FUTEX_LOCK_PI support. If the kernel returns ENOSYS (function not implemented), CS-PI disables itself permanently and all subsequent CS operations use upstream Wine’s legacy keyed-event path.

Trigger

The first NtNspaLockCriticalSectionPI call that receives ENOSYS returns STATUS_NOT_SUPPORTED to the PE side. The PE-side nspa_cs_enter_pi function:

  1. Decrements LockCount (undoing the waiter count bump)
  2. Calls InterlockedExchange(&nspa_cs_pi_state, -1) – permanently disabling CS-PI
  3. Returns STATUS_RETRY

The calling RtlEnterCriticalSection sees STATUS_RETRY and falls through to the legacy InterlockedIncrement / RtlpWaitForCriticalSection / keyed-event path.
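The disable-and-retry handshake can be sketched as follows. The status values and names here are illustrative stand-ins for the NT status codes, not the real definitions:

```c
#include <assert.h>
#include <stdatomic.h>

enum { STATUS_SUCCESS = 0, STATUS_NOT_SUPPORTED = 1, STATUS_RETRY = 2 };

static atomic_int pi_state = 1;   /* CS-PI currently active */
static atomic_int lock_count;

/* Called with the status from the Unix-side lock helper; ENOSYS arrives
   as STATUS_NOT_SUPPORTED. */
static int handle_lock_status(int status)
{
    if (status != STATUS_NOT_SUPPORTED)
        return status;
    atomic_fetch_sub(&lock_count, 1);   /* undo the waiter-count bump */
    atomic_store(&pi_state, -1);        /* permanently disable CS-PI */
    return STATUS_RETRY;                /* caller takes the legacy path */
}
```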

Scope

The disable is global and permanent (for the process lifetime). Once nspa_cs_pi_state is set to -1, nspa_cs_pi_active() returns FALSE on every subsequent call. All CS operations across all threads revert to upstream behavior.

When This Fires

FUTEX_LOCK_PI has been in the Linux kernel since 2.6.18 (September 2006). Any kernel from the last 20 years supports it. On PREEMPT_RT kernels (which NSPA requires), it is always available. The fallback exists as a safety net for unusual kernel configurations (e.g., stripped embedded kernels), not as an expected code path.


11. SRW Lock Spin Phase

RTL_SRWLOCK is the other major user-space lock in Wine, used by the process heap, loader, and application code. NSPA adds a bounded spin phase to SRW lock acquisition, complementing CS-PI. These are independent optimizations for different lock types.

Design

Windows SRW locks spin approximately 1024 iterations before parking via NtWaitForAlertByThreadId. Upstream Wine does zero spinning – every contended acquire immediately calls RtlWaitOnAddress, which translates to a futex syscall. NSPA adds 256 spin iterations for normal threads before falling through to the wait.

#define SRW_SPIN_COUNT 256

RT threads skip spinning entirely. An RT thread at SCHED_FIFO spinning on a lock held by a SCHED_OTHER thread would starve the holder – the holder cannot make progress while the RT thread monopolizes the CPU. Better to fall through to the futex wait immediately, allowing the scheduler to handle priority properly (or, for CS, allowing PI to boost the holder).

Single-CPU systems: Spinning is disabled on uniprocessor systems. The holder cannot make progress while the spinner runs on the same (only) core.
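The spin policy above can be sketched as a small helper. The lock word here is a simplified stand-in for the SRW word, caller_is_rt and ncpus are illustrative parameters, and the real code uses YieldProcessor() as the relax hint:

```c
#include <assert.h>
#include <stdatomic.h>

#define SRW_SPIN_COUNT 256

static inline void cpu_relax(void)
{
    __asm__ __volatile__("" ::: "memory");  /* stand-in for YieldProcessor() */
}

/* locked: 0 = free, 1 = held. Returns 1 if acquired during the spin
   phase; 0 means the caller falls through to the futex wait. RT threads
   and uniprocessor systems skip the spin entirely. */
static int srw_spin_try(_Atomic int *locked, int caller_is_rt, long ncpus)
{
    int i;
    if (caller_is_rt || ncpus < 2)
        return 0;
    for (i = 0; i < SRW_SPIN_COUNT; i++)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(locked, &expected, 1))
            return 1;
        cpu_relax();
    }
    return 0;
}
```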

Relationship to CS-PI

SRW locks and critical sections are separate primitives with different internal architectures:

Property               RTL_CRITICAL_SECTION      RTL_SRWLOCK
Ownership tracking     Yes (OwningThread)        No
Recursive entry        Yes (RecursionCount)      No
PI under NSPA          Yes (FUTEX_LOCK_PI)       No (no owner to boost)
Spin phase under NSPA  Via SpinCount (existing)  256 iters (new)
Wait mechanism         Keyed event / futex PI    RtlWaitOnAddress / futex

SRW locks cannot have PI because they do not track ownership – the kernel cannot know which thread to boost. The spin phase is the applicable optimization for SRW.


12. Validation

CS-PI is validated by three test programs in the NSPA RT test suite (nspa_rt_test.exe), run with NSPA_RT_PRIO=80 NSPA_RT_POLICY=FF on a PREEMPT_RT kernel.

Test: cs-contention

Purpose: Validates that PI boost matches uncontended work time. A SCHED_OTHER holder does 200M-iteration busywork inside a CS (approximately 475 ms of CPU time). An RT waiter (SCHED_FIFO 87) blocks on the CS. Four SCHED_OTHER load threads compete for CPU. Under PI, the holder is boosted and the waiter’s wait time matches the holder’s work time. Without PI, CFS time-slices the holder against the load threads, inflating the wait.

Results (v5, latest):

Metric                      Value
Hold time per iteration     ~475 ms (work loop)
Waiter wait time (with PI)  474-475 ms
Wait/hold ratio             ~1.00x (perfect)
Samples captured            3/3
Verdict                     PASS

The wait time matches the work time to within 1 ms, confirming the holder receives full CPU time under PI boost.

Test: rapidmutex

Purpose: Throughput stress test. Four threads (1 RT + 3 load) perform 500,000 CS acquire/release cycles each on a shared critical section. Measures throughput, per-thread max wait, and correctness (shared counter).

Results (v5, latest):

Metric          Baseline    RT (CS-PI)  Delta
Throughput      319K ops/s  327K ops/s  +2.5%
RT max wait     -           36 us       -
RT avg wait     -           1 us        -
Shared counter  2,000,000   2,000,000   correct

v4 to v5 improvement: RT throughput improved from 312K to 327K ops/s (+4.7%). RT max wait dropped from 46 us to 36 us. These gains are attributed to SIMD memcpy/memmove optimizations reducing overhead in the CS fast path.

Test: philosophers

Purpose: Dining philosophers with 5 diners, 2 forks each. Philosopher 0 is RT (SCHED_FIFO), philosophers 1-4 are SCHED_OTHER. Four background load threads. Validates transitive PI: philosopher 0 waiting on fork A, held by philosopher 1, who is waiting on fork B, held by philosopher 2 – the PI chain propagates through the rt_mutex infrastructure.

Results (v5, latest):

Metric                  Value
Total meals             250/250 (50 each)
Total elapsed           205 ms
RT max wait             1301 us
Spread (max-min meals)  0 (perfect fairness)
Verdict                 PASS

The RT max wait varies between runs due to CFS load placement (v4 measured 601 us, v5 measured 1301 us – both within acceptable range). The critical validation is that all meals complete without deadlock and the PI chain propagates correctly through nested lock acquisitions.

v4 to v5 Summary

Metric                      v4          v5                  Cause
rapidmutex RT throughput    312K ops/s  327K ops/s (+4.7%)  SIMD + SRW spin
rapidmutex RT max wait      46 us       36 us               Reduced lock transition overhead
cs-contention wait/hold     ~1.00x      ~1.00x              Stable -- PI correct
philosophers meals          250/250     250/250             Stable -- transitive PI correct
fork-mutex RT elapsed       1021 ms     948 ms (-7.1%)      SIMD string ops in process setup

Wine-NSPA CS-PI documentation. Source: dlls/ntdll/sync.c (PE), dlls/ntdll/unix/sync.c (Unix). Generated 2026-04-15.