Wine 11.6 + NSPA RT v2.3 | Kernel 6.19.x-rt with FUTEX_LOCK_PI | 2026-04-15 Author: Jordan Johnston
RTL_CRITICAL_SECTION is the most contended lock primitive in Wine. Every heap allocation, loader operation, DllMain serialization, GDI call, and most application/plugin code exercises critical sections. In a typical DAW workload running 50-100 VST plugins, thousands of CS acquire/release pairs happen per audio callback period (typically 2-5 ms at 48 kHz with a 128-sample buffer).
The core problem: priority inversion. When an RT audio thread (SCHED_FIFO, priority 80+) blocks on a critical section held by a normal-priority thread (SCHED_OTHER), the holder gets no scheduling help: under CFS it competes with dozens of other SCHED_OTHER threads for time slices while the sleeping RT thread can do nothing to speed it up. The RT thread’s audio callback deadline passes. The result is an audible glitch – an xrun.
Windows does not implement priority inheritance on CRITICAL_SECTION. Windows' NT scheduler has its own mechanisms for mitigating inversion (priority boosting on wakeup, quantum donation), but these are heuristic and non-deterministic. NSPA’s CS-PI is novel: it grafts Linux’s kernel rt_mutex PI protocol onto Wine’s RTL_CRITICAL_SECTION, giving every CS acquire/release pair the same deterministic, transitive priority inheritance that POSIX pthread_mutex with PTHREAD_PRIO_INHERIT provides.
Key properties of CS-PI:
- When NSPA_RT_PRIO is unset, every CS function short-circuits to upstream Wine’s legacy implementation. No CAS, no TID lookup, no branch misprediction penalty beyond the initial state check.
- If the kernel lacks FUTEX_LOCK_PI support (pre-2.6.18, effectively impossible in 2026), CS-PI permanently disables itself and falls through to the legacy keyed-event path.

| File | Content |
|---|---|
| dlls/ntdll/sync.c lines 149-495 | PE-side CS-PI: design comment, state machine, fast/slow/release paths |
| dlls/ntdll/unix/sync.c lines 144-168, 4057-4156 | Unix-side: futex helpers, NtNspaGetUnixTid, NtNspaLockCriticalSectionPI, NtNspaUnlockCriticalSectionPI |
| dlls/ntdll/unix/unix_private.h lines 150-181 | nspa_unix_tid field in ntdll_thread_data, C_ASSERT offset validation |
The following diagram shows the complete acquire and release flow for both upstream Wine and NSPA CS-PI, side by side. The left column is upstream Wine’s legacy path; the right column is NSPA’s PI-enabled path. Both start from the same RtlEnterCriticalSection entry point.
RTL_CRITICAL_SECTION has a LockSemaphore field typed as HANDLE (i.e., PVOID – pointer-sized). In upstream Wine, this stores a handle to a keyed event object, lazily created on first contention. The keyed event is used as the park/unpark mechanism: contended acquires call NtWaitForKeyedEvent(LockSemaphore), and releases with waiters call NtReleaseKeyedEvent(LockSemaphore).
Under CS-PI, LockSemaphore is repurposed as a FUTEX_LOCK_PI word. The field is still pointer-sized, but only the low 32 bits are used, matching the LONG that FUTEX_LOCK_PI operates on. The bit layout follows the kernel’s futex.h protocol exactly:
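A reference summary of that protocol (constant values as in the kernel’s include/uapi/linux/futex.h, restated here; this is not NSPA code):

```c
/* Kernel futex-word protocol:
 *
 *   word == 0                    unlocked
 *   word == TID                  locked, uncontended (user space only)
 *   word == TID | FUTEX_WAITERS  locked, kernel holds a waiter queue
 */
#define FUTEX_WAITERS     0x80000000u  /* kernel has queued waiters       */
#define FUTEX_OWNER_DIED  0x40000000u  /* owner exited holding the lock   */
#define FUTEX_TID_MASK    0x3fffffffu  /* owner's Linux TID (low 30 bits) */
```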
Why this works without breaking the struct layout:
- The RTL_CRITICAL_SECTION layout is ABI-frozen. LockSemaphore is at a fixed offset, and applications that take sizeof(RTL_CRITICAL_SECTION) are unaffected.
- The field is HANDLE-sized (PVOID): 8 bytes on x86_64 and 4 bytes on i386. The futex word uses only the low 32 bits. On x86_64, the upper 32 bits are implicitly zero (the field is cast to LONG *).
- Code reading LockSemaphore and expecting a valid HANDLE would instead see a small integer (a TID, typically in the range 1-32768). This is undocumented internal state; no known application introspects this field.
- LockCount is still maintained for external compatibility, so applications that poll LockCount to check whether a CS is contended continue to work. However, LockCount is no longer the atomic-primary ownership word – that role is now served by the futex word in LockSemaphore.

The fast path handles the common case: the critical section is free, and the caller acquires it without contention. This executes entirely on the PE side, never crosses to Unix code, and never enters the kernel.
Get Linux TID: nspa_get_unix_tid() reads the cached TID from the TEB (a single memory load, ~2 ns). See Section 7 for details.
CAS the futex word: InterlockedCompareExchange(&LockSemaphore, my_tid, 0). If the field was 0 (lock free), it atomically writes my_tid and returns 0 (success). This is a single lock cmpxchg instruction on x86.
Update bookkeeping: InterlockedIncrement(&LockCount), set OwningThread = win_tid, set RecursionCount = 1. These are for external compatibility with code that queries CS state.
| Operation | Time |
|---|---|
| Upstream uncontended acquire | ~3 ns (1 atomic: InterlockedIncrement) |
| NSPA CS-PI uncontended acquire | ~8 ns (1 memory load + 1 CAS + 1 atomic) |
| Overhead | ~5 ns per uncontended acquire |
The 5 ns overhead is the cost of the CAS on LockSemaphore plus the TEB memory load. This is acceptable: in a DAW running at 48 kHz / 128 samples, a single audio callback period is 2,666,667 ns. Even 10,000 CS acquire/release pairs per callback add only 50 us of overhead – under 2% of the callback budget.
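The budget arithmetic can be checked quickly. A tiny helper (name and parameterization are ours; the numbers are restated from the text, not measured here):

```c
/* Percentage of one audio callback period consumed by CS overhead:
 * pairs acquire/release pairs at per_pair_ns each, against a budget of
 * buffer_frames / rate_hz seconds. */
static double cs_overhead_pct(double pairs, double per_pair_ns,
                              double rate_hz, double buffer_frames)
{
    double period_ns = buffer_frames / rate_hz * 1e9;  /* ~2,666,667 ns at 48k/128 */
    return 100.0 * pairs * per_pair_ns / period_ns;
}
```

With 10,000 pairs at 5 ns each against a 128-frame / 48 kHz period this comes to about 1.9%, consistent with the "under 2%" claim.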
```c
static inline BOOL nspa_cs_try_fast( RTL_CRITICAL_SECTION *crit, DWORD unix_tid )
{
    LONG *futex = (LONG *)&crit->LockSemaphore;
    return InterlockedCompareExchange( futex, (LONG)unix_tid, 0 ) == 0;
}
```
The function is inline – the compiler emits the lock cmpxchg directly at each call site. No function call overhead.
When the fast-path CAS fails (the futex word is non-zero, meaning another thread holds the lock), the slow path hands control to the kernel’s rt_mutex PI infrastructure.
Recursive check: Before going to the kernel, check if the current thread already owns this CS (by comparing OwningThread against the Win32 TID). If yes, bump RecursionCount and return. This avoids calling futex_lock_pi on a lock we already hold, which would return EDEADLK. See Section 9.
Optional spin: If crit->SpinCount > 0, retry the CAS up to SpinCount times with YieldProcessor() between attempts. This catches short critical sections that release before the spin budget expires, avoiding the syscall overhead.
Publish waiter count: InterlockedIncrement(&LockCount) before the syscall. This maintains LockCount semantics for external observers.
Cross PE/Unix boundary: Call NtNspaLockCriticalSectionPI(futex). This is an Nt-style syscall that crosses from PE ntdll to Unix ntdll.
Unix side: futex_lock_pi(futex) – a syscall(__NR_futex, addr, FUTEX_LOCK_PI_PRIVATE, ...). The kernel:

- finds (or creates) the rt_mutex backing the futex
- enqueues the caller in the rt_mutex waiter tree (priority-ordered)
- boosts the owner – transitively, through any chain of blocked owners – and blocks the caller

Return: When the owner releases and the kernel transfers ownership, futex_lock_pi returns 0. The futex word now contains the caller’s TID (possibly with FUTEX_WAITERS set if more threads are waiting). The PE side sets OwningThread and RecursionCount.

Error handling:

- EINTR: The futex_lock_pi call is wrapped in a do { } while (ret == -1 && errno == EINTR) loop. Signal delivery restarts the wait.
- ENOSYS: The kernel lacks FUTEX_LOCK_PI. The call returns STATUS_NOT_SUPPORTED, which triggers the fallback path (see Section 10).
- Any other error maps to STATUS_UNSUCCESSFUL. The PE side calls RtlRaiseStatus() to raise an exception – this is a fatal condition indicating kernel-level corruption.

Release is the mirror of acquire, with the same fast/slow split.
Clear bookkeeping: Set RecursionCount = 0, OwningThread = 0. These must be cleared before the futex word is released, because once the futex word is zero, another thread can acquire the lock and see stale values.
Uncontended release (fast): InterlockedCompareExchange(&LockSemaphore, 0, my_tid). If the CAS succeeds (the old value was exactly my_tid with no FUTEX_WAITERS bit), the lock is free. No syscall. Decrement LockCount and return.
Contended release (slow): If the CAS fails – typically because the kernel set the FUTEX_WAITERS bit (0x80000000) on the futex word while a waiter blocked – the old value is my_tid | FUTEX_WAITERS. The PE side calls NtNspaUnlockCriticalSectionPI(futex), which invokes futex(addr, FUTEX_UNLOCK_PI_PRIVATE, ...). The kernel:

- picks the highest-priority waiter from the rt_mutex waiter tree
- transfers ownership, writing the new owner’s TID into the futex word
- removes the releasing thread’s priority boost

Recursive release: If RecursionCount > 1, this is not the final unlock. Just decrement RecursionCount and LockCount, then return. The futex word stays unchanged (still contains our TID).
If NtNspaUnlockCriticalSectionPI returns STATUS_NOT_SUPPORTED (the extremely unlikely case where FUTEX_LOCK_PI worked but FUTEX_UNLOCK_PI returns ENOSYS), CS-PI logs an error, disables itself globally, and returns STATUS_SUCCESS. The futex word is stuck (contains a stale TID with FUTEX_WAITERS set), and future acquires on this specific CS will hang. This is accepted as a fatal diagnostic condition – it should never happen on a consistent kernel.
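Both Unix-side entry points reduce to thin wrappers around futex(2). A sketch under the assumption of a Linux build – wrapper names are illustrative, and the mapping of errno values to NTSTATUS codes described above is omitted:

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Acquire: kernel-arbitrated PI lock. Signal delivery restarts the wait
 * (the EINTR loop described earlier). ENOSYS here drives the fallback. */
static int futex_pi_lock(uint32_t *word)
{
    long ret;
    do ret = syscall(SYS_futex, word, FUTEX_LOCK_PI_PRIVATE, 0, NULL, NULL, 0);
    while (ret == -1 && errno == EINTR);
    return ret == -1 ? errno : 0;
}

/* Release: kernel picks the top waiter, transfers ownership, de-boosts. */
static int futex_pi_unlock(uint32_t *word)
{
    long ret = syscall(SYS_futex, word, FUTEX_UNLOCK_PI_PRIVATE, 0, NULL, NULL, 0);
    return ret == -1 ? errno : 0;
}
```

Even with no waiters, the kernel handles the round trip correctly: lock on a zero word writes the caller’s TID, and unlock with an empty waiter tree clears the word back to zero.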
FUTEX_LOCK_PI requires the owner field in the futex word to be a valid Linux kernel TID (pid_t from SYS_gettid). The kernel validates this against its task list – if the TID is invalid, futex_lock_pi returns ESRCH, and every contended acquire hangs.
Wine’s GetCurrentThreadId() returns the Win32 thread ID from TEB->ClientId.UniqueThread. This is a wineserver-assigned value, unrelated to the Linux kernel TID. Using it directly causes ESRCH.
The Unix-side ntdll_thread_data struct (embedded in the TEB’s GdiTebBatch region) has an nspa_unix_tid field. This is populated on first access via syscall(SYS_gettid) and cached for the thread’s lifetime.
PE-side read (zero-syscall hot path):
The PE side cannot include unix_private.h (it’s Unix-only). Instead, it reads the TID at a hardcoded byte offset from GdiTebBatch:
```c
#ifdef _WIN64
#define NSPA_UNIX_TID_OFFSET 0xf8
#else
#define NSPA_UNIX_TID_OFFSET 0x88
#endif

static inline DWORD nspa_get_unix_tid(void)
{
    DWORD tid = *(volatile DWORD *)((char *)&NtCurrentTeb()->GdiTebBatch
                                    + NSPA_UNIX_TID_OFFSET);
    if (tid) return tid;
    return NtNspaGetUnixTid(); /* first call: populate via syscall */
}
```
Offset safety: C_ASSERT checks in unix_private.h verify that offsetof(struct ntdll_thread_data, nspa_unix_tid) matches the PE-side literal. If the struct layout changes, the build fails.
Cost:
- Hot path (subsequent calls): ~2 ns (one memory load + branch-not-taken)
- Cold path (first call per thread): ~200-500 ns (one syscall(SYS_gettid) round trip)
The cold path fires at most once per thread. In a typical DAW with 20-50 threads, this is 20-50 syscalls total across the entire process lifetime – negligible.
CS-PI is gated on the NSPA_RT_PRIO environment variable. When the variable is unset, CS-PI is inactive and all CS functions execute the upstream Wine legacy path with zero overhead beyond a single branch on a cached state variable.
nspa_cs_pi_state (static LONG):
0 = uninitialized (first CS op on any thread triggers probe)
1 = active (NSPA_RT_PRIO is set with a non-empty value)
-1 = inactive (NSPA_RT_PRIO unset, or kernel returned ENOSYS)
The obvious approach – calling RtlQueryEnvironmentVariable_U to read the env var – causes a recursive stack overflow. That function internally acquires critical sections (the PEB lock, the process heap lock). Those CS operations re-enter nspa_cs_pi_active(), which re-enters RtlQueryEnvironmentVariable_U, and so on. This was observed as err:virtual:virtual_setup_exception crashes on every PE binary launch.
Instead, nspa_cs_pi_active() reads the PEB environment block directly:
NtCurrentTeb()->Peb->ProcessParameters->Environment
This is a null-separated list of L"VAR=value\0" strings. The function walks the list comparing against L"NSPA_RT_PRIO=" character by character, using no Rtl functions and no locks. This mirrors how the Windows loader reads environment variables before kernel32 is loaded.
Multiple threads may call nspa_cs_pi_active() concurrently during the uninitialized phase. Each computes new_state independently, then publishes via InterlockedCompareExchange(&nspa_cs_pi_state, new_state, 0). The first writer wins; all subsequent readers see the published value. Since all threads observe the same PEB environment, they all compute the same answer – the race is benign.
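The lock-free walk can be sketched as follows. This is an illustrative re-implementation, not the source function: the helper name, the demo arrays, and the local WCHAR typedef (standing in for the PE-side type) are all ours.

```c
typedef unsigned short WCHAR;  /* stand-in for the PE-side WCHAR */

/* Walk a null-separated L"VAR=value" block with no Rtl calls and no
 * locks; return nonzero iff an entry starts with `prefix` and has a
 * non-empty value after it. */
static int env_block_has_nonempty(const WCHAR *env, const WCHAR *prefix)
{
    while (*env)
    {
        const WCHAR *p = env, *q = prefix;
        while (*q && *p == *q) { p++; q++; }
        if (!*q) return *p != 0;   /* full prefix matched: value non-empty? */
        while (*env) env++;        /* skip the rest of this entry */
        env++;                     /* and its terminating null */
    }
    return 0;                      /* hit the final double-null: not found */
}

/* L"PATH=/x\0NSPA_RT_PRIO=80\0\0", L"NSPA_RT_PRIO=\0\0", and the prefix */
static const WCHAR demo_env[]   = { 'P','A','T','H','=','/','x',0,
                                    'N','S','P','A','_','R','T','_','P','R','I','O','=','8','0',0, 0 };
static const WCHAR demo_empty[] = { 'N','S','P','A','_','R','T','_','P','R','I','O','=',0, 0 };
static const WCHAR demo_key[]   = { 'N','S','P','A','_','R','T','_','P','R','I','O','=',0 };
```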
RTL_CRITICAL_SECTION supports recursive acquisition: the same thread can enter a CS multiple times, incrementing RecursionCount each time, and must leave the same number of times.
CS-PI handles recursion without calling futex_lock_pi on a lock we already hold:
On acquire: After the fast-path CAS fails, check crit->OwningThread == ULongToHandle(win_tid). If true, this is a recursive entry: bump RecursionCount and LockCount, return immediately. The futex word already contains our TID.
On release: If RecursionCount > 1, decrement it and LockCount, return immediately. The futex word stays unchanged. Only when RecursionCount drops to 0 is the futex word CAS’d back to zero (or the kernel unlock path invoked).
Calling futex_lock_pi when we already hold the futex would return EDEADLK (the kernel detects the self-deadlock via the rt_mutex chain). We must detect recursion in user space before the syscall. The OwningThread comparison is the canonical way – it uses the Win32 TID, matching both the legacy path’s check and external APIs like RtlIsCriticalSectionLockedByThread.
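The recursion bookkeeping can be modeled in a few lines. This is a single-threaded toy, not the real implementation: the real code does the futex-word update with a CAS and the counters with Interlocked operations, and the contended path is omitted entirely.

```c
typedef struct
{
    long     lock_count;  /* LockCount analogue                 */
    long     recursion;   /* RecursionCount analogue            */
    unsigned owner_tid;   /* OwningThread analogue              */
    unsigned futex_word;  /* low 32 bits of LockSemaphore       */
} toy_cs;

static void toy_enter(toy_cs *cs, unsigned tid)
{
    if (cs->futex_word == 0)              /* fast path: lock was free */
    {
        cs->futex_word = tid;
        cs->owner_tid  = tid;
        cs->recursion  = 1;
        cs->lock_count = 1;
    }
    else if (cs->owner_tid == tid)        /* recursive entry: no syscall */
    {
        cs->recursion++;
        cs->lock_count++;
    }
    /* else: contended path (futex_lock_pi), omitted from this sketch */
}

static void toy_leave(toy_cs *cs)
{
    if (cs->recursion > 1)                /* not the final unlock */
    {
        cs->recursion--;
        cs->lock_count--;
        return;                           /* futex word unchanged */
    }
    cs->recursion  = 0;                   /* bookkeeping cleared first... */
    cs->owner_tid  = 0;
    cs->lock_count = 0;
    cs->futex_word = 0;                   /* ...then the futex word is released */
}
```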
CS-PI is a soft dependency on kernel FUTEX_LOCK_PI support. If the kernel returns ENOSYS (function not implemented), CS-PI disables itself permanently and all subsequent CS operations use upstream Wine’s legacy keyed-event path.
The first NtNspaLockCriticalSectionPI call that receives ENOSYS returns STATUS_NOT_SUPPORTED to the PE side. The PE-side nspa_cs_enter_pi function then:

- decrements LockCount (undoing the waiter count bump)
- executes InterlockedExchange(&nspa_cs_pi_state, -1) – permanently disabling CS-PI
- returns STATUS_RETRY

The calling RtlEnterCriticalSection sees STATUS_RETRY and falls through to the legacy InterlockedIncrement / RtlpWaitForCriticalSection / keyed-event path.
The disable is global and permanent (for the process lifetime). Once nspa_cs_pi_state is set to -1, nspa_cs_pi_active() returns FALSE on every subsequent call. All CS operations across all threads revert to upstream behavior.
FUTEX_LOCK_PI has been in the Linux kernel since 2.6.18 (September 2006). Any kernel from the last 20 years supports it. On PREEMPT_RT kernels (which NSPA requires), it is always available. The fallback exists as a safety net for unusual kernel configurations (e.g., stripped embedded kernels), not as an expected code path.
RTL_SRWLOCK is the other major user-space lock in Wine, used by the process heap, loader, and application code. NSPA adds a bounded spin phase to SRW lock acquisition, complementing CS-PI. These are independent optimizations for different lock types.
Windows SRW locks spin approximately 1024 iterations before parking via NtWaitForAlertByThreadId. Upstream Wine does zero spinning – every contended acquire immediately calls RtlWaitOnAddress, which translates to a futex syscall. NSPA adds 256 spin iterations for normal threads before falling through to the wait.
```c
#define SRW_SPIN_COUNT 256
```
RT threads skip spinning entirely. An RT thread at SCHED_FIFO spinning on a lock held by a SCHED_OTHER thread would starve the holder – the holder cannot make progress while the RT thread monopolizes the CPU. Better to fall through to the futex wait immediately, allowing the scheduler to handle priority properly (or, for CS, allowing PI to boost the holder).
Single-CPU systems: Spinning is disabled on uniprocessor systems. The holder cannot make progress while the spinner runs on the same (only) core.
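The gating logic above can be sketched as a bounded spin that either acquires the lock or tells the caller to fall through to the futex wait. The function name is ours, and the `rt_thread` / `ncpus` parameters stand in for the real code’s scheduler and CPU-topology queries (an assumption of this sketch):

```c
#include <stdatomic.h>

#define SRW_SPIN_COUNT 256

/* Returns 1 if the lock was acquired during the spin phase, 0 if the
 * caller should proceed to RtlWaitOnAddress (futex wait). */
static int spin_try_acquire(atomic_int *lock, int rt_thread, int ncpus)
{
    if (rt_thread || ncpus < 2)
        return 0;                         /* RT thread or uniprocessor: no spin */
    for (int i = 0; i < SRW_SPIN_COUNT; i++)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(lock, &expected, 1))
            return 1;                     /* acquired during the spin budget */
        /* real code issues a pause hint (YieldProcessor()) here */
    }
    return 0;
}
```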
SRW locks and critical sections are separate primitives with different internal architectures:
| Property | RTL_CRITICAL_SECTION | RTL_SRWLOCK |
|---|---|---|
| Ownership tracking | Yes (OwningThread) | No |
| Recursive entry | Yes (RecursionCount) | No |
| PI under NSPA | Yes (FUTEX_LOCK_PI) | No (no owner to boost) |
| Spin phase under NSPA | Via SpinCount (existing) | 256 iters (new) |
| Wait mechanism | Keyed event / futex PI | RtlWaitOnAddress / futex |
SRW locks cannot have PI because they do not track ownership – the kernel cannot know which thread to boost. The spin phase is the applicable optimization for SRW.
CS-PI is validated by three test programs in the NSPA RT test suite (nspa_rt_test.exe), run with NSPA_RT_PRIO=80 NSPA_RT_POLICY=FF on a PREEMPT_RT kernel.
Purpose: Validates that PI boost matches uncontended work time. A SCHED_OTHER holder does 200M-iteration busywork inside a CS (approximately 475 ms of CPU time). An RT waiter (SCHED_FIFO 87) blocks on the CS. Four SCHED_OTHER load threads compete for CPU. Under PI, the holder is boosted and the waiter’s wait time matches the holder’s work time. Without PI, CFS time-slices the holder against the load threads, inflating the wait.
Results (v5, latest):
| Metric | Value |
|---|---|
| Hold time per iteration | ~475 ms (work loop) |
| Waiter wait time (with PI) | 474-475 ms |
| Wait/hold ratio | ~1.00x (perfect) |
| Samples captured | 3/3 |
| Verdict | PASS |
The wait time matches the work time to within 1 ms, confirming the holder receives full CPU time under PI boost.
Purpose: Throughput stress test. Four threads (1 RT + 3 load) perform 500,000 CS acquire/release cycles each on a shared critical section. Measures throughput, per-thread max wait, and correctness (shared counter).
Results (v5, latest):
| Metric | Baseline | RT (CS-PI) | Delta |
|---|---|---|---|
| Throughput | 319K ops/s | 327K ops/s | +2.5% |
| RT max wait | – | 36 us | – |
| RT avg wait | – | 1 us | – |
| Shared counter | 2,000,000 | 2,000,000 | correct |
v4 to v5 improvement: RT throughput improved from 312K to 327K ops/s (+4.7%). RT max wait dropped from 46 us to 36 us. These gains are attributed to SIMD memcpy/memmove optimizations reducing overhead in the CS fast path.
Purpose: Dining philosophers with 5 diners, 2 forks each. Philosopher 0 is RT (SCHED_FIFO), philosophers 1-4 are SCHED_OTHER. Four background load threads. Validates transitive PI: philosopher 0 waiting on fork A, held by philosopher 1, who is waiting on fork B, held by philosopher 2 – the PI chain propagates through the rt_mutex infrastructure.
Results (v5, latest):
| Metric | Value |
|---|---|
| Total meals | 250/250 (50 each) |
| Total elapsed | 205 ms |
| RT max wait | 1301 us |
| Spread (max-min meals) | 0 (perfect fairness) |
| Verdict | PASS |
The RT max wait varies between runs due to CFS load placement (v4 measured 601 us, v5 measured 1301 us – both within acceptable range). The critical validation is that all meals complete without deadlock and the PI chain propagates correctly through nested lock acquisitions.
| Metric | v4 | v5 | Cause |
|---|---|---|---|
| rapidmutex RT throughput | 312K ops/s | 327K ops/s (+4.7%) | SIMD + SRW spin |
| rapidmutex RT max wait | 46 us | 36 us | Reduced lock transition overhead |
| cs-contention wait/hold ratio | ~1.00x | ~1.00x | Stable – PI correct |
| philosophers meals | 250/250 | 250/250 | Stable – transitive PI correct |
| fork-mutex RT elapsed | 1021 ms | 948 ms (-7.1%) | SIMD string ops in process setup |
Wine-NSPA CS-PI documentation. Source: dlls/ntdll/sync.c (PE), dlls/ntdll/unix/sync.c (Unix). Generated 2026-04-15.