Wine 11.6 + NSPA RT v2.3 | Kernel 6.19.x-rt with FUTEX_LOCK_PI | 2026-04-15 Author: Jordan Johnston
RTL_CRITICAL_SECTION is the most contended lock primitive in Wine. Every heap allocation, loader operation, DllMain serialization, GDI call, and most application/plugin code exercises critical sections. In a typical DAW workload running 50-100 VST plugins, thousands of CS acquire/release pairs happen per audio callback period (typically 2-5 ms at 48 kHz with a 128-sample buffer).
The core problem: priority inversion. When an RT audio thread (SCHED_FIFO, priority 80+) blocks on a critical section held by a normal-priority thread (SCHED_OTHER), the holder gets no scheduling help: under CFS it competes with dozens of other SCHED_OTHER threads for time slices while the sleeping RT thread can do nothing to speed it up. The RT thread’s audio callback deadline passes. The result is an audible glitch – an xrun.
Windows does not implement priority inheritance on CRITICAL_SECTION. Windows' NT scheduler has its own mechanisms for mitigating inversion (priority boosting on wakeup, quantum donation), but these are heuristic and non-deterministic. NSPA’s CS-PI is novel: it grafts Linux’s kernel rt_mutex PI protocol onto Wine’s RTL_CRITICAL_SECTION, giving every CS acquire/release pair the same deterministic, transitive priority inheritance that POSIX pthread_mutex with PTHREAD_PRIO_INHERIT provides.
Key properties of CS-PI:
- When NSPA_RT_PRIO is unset, every CS function short-circuits to upstream Wine’s legacy implementation. No CAS, no TID lookup, no branch misprediction penalty beyond the initial state check.
- If the kernel lacks FUTEX_LOCK_PI support (pre-2.6.18, effectively impossible in 2026), CS-PI permanently disables itself and falls through to the legacy keyed-event path.

| File | Content |
|---|---|
| dlls/ntdll/sync.c lines 149-495 | PE-side CS-PI: design comment, state machine, fast/slow/release paths |
| dlls/ntdll/unix/sync.c lines 144-168, 4057-4156 | Unix-side: futex helpers, NtNspaGetUnixTid, NtNspaLockCriticalSectionPI, NtNspaUnlockCriticalSectionPI |
| dlls/ntdll/unix/unix_private.h lines 150-181 | nspa_unix_tid field in ntdll_thread_data, C_ASSERT offset validation |
The following diagram shows the complete acquire and release flow for both upstream Wine and NSPA CS-PI, side by side. The left column is upstream Wine’s legacy path; the right column is NSPA’s PI-enabled path. Both start from the same RtlEnterCriticalSection entry point.
RTL_CRITICAL_SECTION has a LockSemaphore field typed as HANDLE (i.e., PVOID – pointer-sized). In upstream Wine, this stores a handle to a keyed event object, lazily created on first contention. The keyed event is used as the park/unpark mechanism: contended acquires call NtWaitForKeyedEvent(LockSemaphore), and releases with waiters call NtReleaseKeyedEvent(LockSemaphore).
Under CS-PI, LockSemaphore is repurposed as a FUTEX_LOCK_PI word. The field is still pointer-sized, but only the low 32 bits are used, matching the LONG that FUTEX_LOCK_PI operates on. The bit layout follows the kernel’s futex.h protocol exactly:
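A reference summary of that protocol (constant values as in the kernel’s include/uapi/linux/futex.h, restated here; this is not NSPA code):

```c
/* Kernel futex-word protocol:
 *
 *   word == 0                    unlocked
 *   word == TID                  locked, uncontended (user space only)
 *   word == TID | FUTEX_WAITERS  locked, kernel holds a waiter queue
 */
#define FUTEX_WAITERS     0x80000000u  /* kernel has queued waiters       */
#define FUTEX_OWNER_DIED  0x40000000u  /* owner exited holding the lock   */
#define FUTEX_TID_MASK    0x3fffffffu  /* owner's Linux TID (low 30 bits) */
```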
Why this works without breaking the struct layout:
- The RTL_CRITICAL_SECTION layout is ABI-frozen. LockSemaphore is at a fixed offset, and applications that take sizeof(RTL_CRITICAL_SECTION) are unaffected.
- The field is HANDLE-sized (PVOID): 8 bytes on x86_64 and 4 bytes on i386. The futex word uses only the low 32 bits. On x86_64, the upper 32 bits are implicitly zero (the field is cast to LONG *).
- Code reading LockSemaphore and expecting a valid HANDLE would instead see a small integer (a TID, typically in the range 1-32768). This is undocumented internal state; no known application introspects this field.
- LockCount is still maintained for external compatibility, so applications that poll LockCount to check whether a CS is contended continue to work. However, LockCount is no longer the atomic-primary ownership word – that role is now served by the futex word in LockSemaphore.

The fast path handles the common case: the critical section is free, and the caller acquires it without contention. This executes entirely on the PE side, never crosses to Unix code, and never enters the kernel.
Get Linux TID: nspa_get_unix_tid() reads the cached TID from the TEB (a single memory load, ~2 ns). See Section 7 for details.
CAS the futex word: InterlockedCompareExchange(&LockSemaphore, my_tid, 0). If the field was 0 (lock free), it atomically writes my_tid and returns 0 (success). This is a single lock cmpxchg instruction on x86.
Update bookkeeping: InterlockedIncrement(&LockCount), set OwningThread = win_tid, set RecursionCount = 1. These are for external compatibility with code that queries CS state.
| Operation | Time |
|---|---|
| Upstream uncontended acquire | ~3 ns (1 atomic: InterlockedIncrement) |
| NSPA CS-PI uncontended acquire | ~8 ns (1 memory load + 1 CAS + 1 atomic) |
| Overhead | ~5 ns per uncontended acquire |
The 5 ns overhead is the cost of the CAS on LockSemaphore plus the TEB memory load. This is acceptable: in a DAW running at 48 kHz / 128 samples, a single audio callback period is 2,666,667 ns. Even 10,000 CS acquire/release pairs per callback add only 50 us of overhead – under 2% of the callback budget.
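The budget arithmetic can be checked quickly. A tiny helper (name and parameterization are ours; the numbers are restated from the text, not measured here):

```c
/* Percentage of one audio callback period consumed by CS overhead:
 * pairs acquire/release pairs at per_pair_ns each, against a budget of
 * buffer_frames / rate_hz seconds. */
static double cs_overhead_pct(double pairs, double per_pair_ns,
                              double rate_hz, double buffer_frames)
{
    double period_ns = buffer_frames / rate_hz * 1e9;  /* ~2,666,667 ns at 48k/128 */
    return 100.0 * pairs * per_pair_ns / period_ns;
}
```

With 10,000 pairs at 5 ns each against a 128-frame / 48 kHz period this comes to about 1.9%, consistent with the "under 2%" claim.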
```c
static inline BOOL nspa_cs_try_fast( RTL_CRITICAL_SECTION *crit, DWORD unix_tid )
{
    LONG *futex = (LONG *)&crit->LockSemaphore;
    return InterlockedCompareExchange( futex, (LONG)unix_tid, 0 ) == 0;
}
```
The function is inline – the compiler emits the lock cmpxchg directly at each call site. No function call overhead.
When the fast-path CAS fails (the futex word is non-zero, meaning another thread holds the lock), the slow path hands control to the kernel’s rt_mutex PI infrastructure.
Recursive check: Before going to the kernel, check if the current thread already owns this CS (by comparing OwningThread against the Win32 TID). If yes, bump RecursionCount and return. This avoids calling futex_lock_pi on a lock we already hold, which would return EDEADLK. See Section 9.
Optional spin: If crit->SpinCount > 0, retry the CAS up to SpinCount times with YieldProcessor() between attempts. This catches short critical sections that release before the spin budget expires, avoiding the syscall overhead.
Publish waiter count: InterlockedIncrement(&LockCount) before the syscall. This maintains LockCount semantics for external observers.
Cross PE/Unix boundary: Call NtNspaLockCriticalSectionPI(futex). This is an Nt-style syscall that crosses from PE ntdll to Unix ntdll.
Unix side: futex_lock_pi(futex) – a syscall(__NR_futex, addr, FUTEX_LOCK_PI_PRIVATE, ...). The kernel:

- finds (or creates) the rt_mutex backing the futex
- enqueues the caller in the rt_mutex waiter tree (priority-ordered)
- boosts the owner – transitively, through any chain of blocked owners – and blocks the caller

Return: When the owner releases and the kernel transfers ownership, futex_lock_pi returns 0. The futex word now contains the caller’s TID (possibly with FUTEX_WAITERS set if more threads are waiting). The PE side sets OwningThread and RecursionCount.

Error handling:

- EINTR: The futex_lock_pi call is wrapped in a do { } while (ret == -1 && errno == EINTR) loop. Signal delivery restarts the wait.
- ENOSYS: The kernel lacks FUTEX_LOCK_PI. The call returns STATUS_NOT_SUPPORTED, which triggers the fallback path (see Section 10).
- Any other error maps to STATUS_UNSUCCESSFUL. The PE side calls RtlRaiseStatus() to raise an exception – this is a fatal condition indicating kernel-level corruption.

Release is the mirror of acquire, with the same fast/slow split.
Clear bookkeeping: Set RecursionCount = 0, OwningThread = 0. These must be cleared before the futex word is released, because once the futex word is zero, another thread can acquire the lock and see stale values.
Uncontended release (fast): InterlockedCompareExchange(&LockSemaphore, 0, my_tid). If the CAS succeeds (the old value was exactly my_tid with no FUTEX_WAITERS bit), the lock is free. No syscall. Decrement LockCount and return.
Contended release (slow): If the CAS fails – typically because the kernel set the FUTEX_WAITERS bit (0x80000000) on the futex word while a waiter blocked – the old value is my_tid | FUTEX_WAITERS. The PE side calls NtNspaUnlockCriticalSectionPI(futex), which invokes futex(addr, FUTEX_UNLOCK_PI_PRIVATE, ...). The kernel:

- picks the highest-priority waiter from the rt_mutex waiter tree
- transfers ownership, writing the new owner’s TID into the futex word
- removes the releasing thread’s priority boost

Recursive release: If RecursionCount > 1, this is not the final unlock. Just decrement RecursionCount and LockCount, then return. The futex word stays unchanged (still contains our TID).
If NtNspaUnlockCriticalSectionPI returns STATUS_NOT_SUPPORTED (the extremely unlikely case where FUTEX_LOCK_PI worked but FUTEX_UNLOCK_PI returns ENOSYS), CS-PI logs an error, disables itself globally, and returns STATUS_SUCCESS. The futex word is stuck (contains a stale TID with FUTEX_WAITERS set), and future acquires on this specific CS will hang. This is accepted as a fatal diagnostic condition – it should never happen on a consistent kernel.
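Both Unix-side entry points reduce to thin wrappers around futex(2). A sketch under the assumption of a Linux build – wrapper names are illustrative, and the mapping of errno values to NTSTATUS codes described above is omitted:

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Acquire: kernel-arbitrated PI lock. Signal delivery restarts the wait
 * (the EINTR loop described earlier). ENOSYS here drives the fallback. */
static int futex_pi_lock(uint32_t *word)
{
    long ret;
    do ret = syscall(SYS_futex, word, FUTEX_LOCK_PI_PRIVATE, 0, NULL, NULL, 0);
    while (ret == -1 && errno == EINTR);
    return ret == -1 ? errno : 0;
}

/* Release: kernel picks the top waiter, transfers ownership, de-boosts. */
static int futex_pi_unlock(uint32_t *word)
{
    long ret = syscall(SYS_futex, word, FUTEX_UNLOCK_PI_PRIVATE, 0, NULL, NULL, 0);
    return ret == -1 ? errno : 0;
}
```

Even with no waiters, the kernel handles the round trip correctly: lock on a zero word writes the caller’s TID, and unlock with an empty waiter tree clears the word back to zero.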
FUTEX_LOCK_PI requires the owner field in the futex word to be a valid Linux kernel TID (pid_t from SYS_gettid). The kernel validates this against its task list – if the TID is invalid, futex_lock_pi returns ESRCH, and every contended acquire hangs.
Wine’s GetCurrentThreadId() returns the Win32 thread ID from TEB->ClientId.UniqueThread. This is a wineserver-assigned value, unrelated to the Linux kernel TID. Using it directly causes ESRCH.
The Unix-side ntdll_thread_data struct (embedded in the TEB’s GdiTebBatch region) has an nspa_unix_tid field. This is populated on first access via syscall(SYS_gettid) and cached for the thread’s lifetime.
PE-side read (zero-syscall hot path):
The PE side cannot include unix_private.h (it’s Unix-only). Instead, it reads the TID at a hardcoded byte offset from GdiTebBatch:
```c
#ifdef _WIN64
#define NSPA_UNIX_TID_OFFSET 0xf8
#else
#define NSPA_UNIX_TID_OFFSET 0x88
#endif

static inline DWORD nspa_get_unix_tid(void)
{
    DWORD tid = *(volatile DWORD *)((char *)&NtCurrentTeb()->GdiTebBatch
                                    + NSPA_UNIX_TID_OFFSET);
    if (tid) return tid;
    return NtNspaGetUnixTid(); /* first call: populate via syscall */
}
```
Offset safety: C_ASSERT checks in unix_private.h verify that offsetof(struct ntdll_thread_data, nspa_unix_tid) matches the PE-side literal. If the struct layout changes, the build fails.
Cost:
- Hot path (subsequent calls): ~2 ns (one memory load + branch-not-taken)
- Cold path (first call per thread): ~200-500 ns (one syscall(SYS_gettid) round trip)
The cold path fires at most once per thread. In a typical DAW with 20-50 threads, this is 20-50 syscalls total across the entire process lifetime – negligible.
CS-PI is gated on the NSPA_RT_PRIO environment variable. When the variable is unset, CS-PI is inactive and all CS functions execute the upstream Wine legacy path with zero overhead beyond a single branch on a cached state variable.
nspa_cs_pi_state (static LONG):
0 = uninitialized (first CS op on any thread triggers probe)
1 = active (NSPA_RT_PRIO is set with a non-empty value)
-1 = inactive (NSPA_RT_PRIO unset, or kernel returned ENOSYS)
The obvious approach – calling RtlQueryEnvironmentVariable_U to read the env var – causes a recursive stack overflow. That function internally acquires critical sections (the PEB lock, the process heap lock). Those CS operations re-enter nspa_cs_pi_active(), which re-enters RtlQueryEnvironmentVariable_U, and so on. This was observed as err:virtual:virtual_setup_exception crashes on every PE binary launch.
Instead, nspa_cs_pi_active() reads the PEB environment block directly:
NtCurrentTeb()->Peb->ProcessParameters->Environment
This is a null-separated list of L"VAR=value\0" strings. The function walks the list comparing against L"NSPA_RT_PRIO=" character by character, using no Rtl functions and no locks. This mirrors how the Windows loader reads environment variables before kernel32 is loaded.
Multiple threads may call nspa_cs_pi_active() concurrently during the uninitialized phase. Each computes new_state independently, then publishes via InterlockedCompareExchange(&nspa_cs_pi_state, new_state, 0). The first writer wins; all subsequent readers see the published value. Since all threads observe the same PEB environment, they all compute the same answer – the race is benign.
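The lock-free walk can be sketched as follows. This is an illustrative re-implementation, not the source function: the helper name, the demo arrays, and the local WCHAR typedef (standing in for the PE-side type) are all ours.

```c
typedef unsigned short WCHAR;  /* stand-in for the PE-side WCHAR */

/* Walk a null-separated L"VAR=value" block with no Rtl calls and no
 * locks; return nonzero iff an entry starts with `prefix` and has a
 * non-empty value after it. */
static int env_block_has_nonempty(const WCHAR *env, const WCHAR *prefix)
{
    while (*env)
    {
        const WCHAR *p = env, *q = prefix;
        while (*q && *p == *q) { p++; q++; }
        if (!*q) return *p != 0;   /* full prefix matched: value non-empty? */
        while (*env) env++;        /* skip the rest of this entry */
        env++;                     /* and its terminating null */
    }
    return 0;                      /* hit the final double-null: not found */
}

/* L"PATH=/x\0NSPA_RT_PRIO=80\0\0", L"NSPA_RT_PRIO=\0\0", and the prefix */
static const WCHAR demo_env[]   = { 'P','A','T','H','=','/','x',0,
                                    'N','S','P','A','_','R','T','_','P','R','I','O','=','8','0',0, 0 };
static const WCHAR demo_empty[] = { 'N','S','P','A','_','R','T','_','P','R','I','O','=',0, 0 };
static const WCHAR demo_key[]   = { 'N','S','P','A','_','R','T','_','P','R','I','O','=',0 };
```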
RTL_CRITICAL_SECTION supports recursive acquisition: the same thread can enter a CS multiple times, incrementing RecursionCount each time, and must leave the same number of times.
CS-PI handles recursion without calling futex_lock_pi on a lock we already hold:
On acquire: After the fast-path CAS fails, check crit->OwningThread == ULongToHandle(win_tid). If true, this is a recursive entry: bump RecursionCount and LockCount, return immediately. The futex word already contains our TID.
On release: If RecursionCount > 1, decrement it and LockCount, return immediately. The futex word stays unchanged. Only when RecursionCount drops to 0 is the futex word CAS’d back to zero (or the kernel unlock path invoked).
Calling futex_lock_pi when we already hold the futex would return EDEADLK (the kernel detects the self-deadlock via the rt_mutex chain). We must detect recursion in user space before the syscall. The OwningThread comparison is the canonical way – it uses the Win32 TID, matching both the legacy path’s check and external APIs like RtlIsCriticalSectionLockedByThread.
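The recursion bookkeeping can be modeled in a few lines. This is a single-threaded toy, not the real implementation: the real code does the futex-word update with a CAS and the counters with Interlocked operations, and the contended path is omitted entirely.

```c
typedef struct
{
    long     lock_count;  /* LockCount analogue                 */
    long     recursion;   /* RecursionCount analogue            */
    unsigned owner_tid;   /* OwningThread analogue              */
    unsigned futex_word;  /* low 32 bits of LockSemaphore       */
} toy_cs;

static void toy_enter(toy_cs *cs, unsigned tid)
{
    if (cs->futex_word == 0)              /* fast path: lock was free */
    {
        cs->futex_word = tid;
        cs->owner_tid  = tid;
        cs->recursion  = 1;
        cs->lock_count = 1;
    }
    else if (cs->owner_tid == tid)        /* recursive entry: no syscall */
    {
        cs->recursion++;
        cs->lock_count++;
    }
    /* else: contended path (futex_lock_pi), omitted from this sketch */
}

static void toy_leave(toy_cs *cs)
{
    if (cs->recursion > 1)                /* not the final unlock */
    {
        cs->recursion--;
        cs->lock_count--;
        return;                           /* futex word unchanged */
    }
    cs->recursion  = 0;                   /* bookkeeping cleared first... */
    cs->owner_tid  = 0;
    cs->lock_count = 0;
    cs->futex_word = 0;                   /* ...then the futex word is released */
}
```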
CS-PI is a soft dependency on kernel FUTEX_LOCK_PI support. If the kernel returns ENOSYS (function not implemented), CS-PI disables itself permanently and all subsequent CS operations use upstream Wine’s legacy keyed-event path.
The first NtNspaLockCriticalSectionPI call that receives ENOSYS returns STATUS_NOT_SUPPORTED to the PE side. The PE-side nspa_cs_enter_pi function then:

- decrements LockCount (undoing the waiter count bump)
- executes InterlockedExchange(&nspa_cs_pi_state, -1) – permanently disabling CS-PI
- returns STATUS_RETRY

The calling RtlEnterCriticalSection sees STATUS_RETRY and falls through to the legacy InterlockedIncrement / RtlpWaitForCriticalSection / keyed-event path.
The disable is global and permanent (for the process lifetime). Once nspa_cs_pi_state is set to -1, nspa_cs_pi_active() returns FALSE on every subsequent call. All CS operations across all threads revert to upstream behavior.
FUTEX_LOCK_PI has been in the Linux kernel since 2.6.18 (September 2006). Any kernel from the last 20 years supports it. On PREEMPT_RT kernels (which NSPA requires), it is always available. The fallback exists as a safety net for unusual kernel configurations (e.g., stripped embedded kernels), not as an expected code path.
RTL_SRWLOCK is the other major user-space lock in Wine, used by the process heap, loader, and application code. NSPA adds a bounded spin phase to SRW lock acquisition, complementing CS-PI. These are independent optimizations for different lock types.
Windows SRW locks spin approximately 1024 iterations before parking via NtWaitForAlertByThreadId. Upstream Wine does zero spinning – every contended acquire immediately calls RtlWaitOnAddress, which translates to a futex syscall. NSPA adds 256 spin iterations for normal threads before falling through to the wait.
```c
#define SRW_SPIN_COUNT 256
```
RT threads skip spinning entirely. An RT thread at SCHED_FIFO spinning on a lock held by a SCHED_OTHER thread would starve the holder – the holder cannot make progress while the RT thread monopolizes the CPU. Better to fall through to the futex wait immediately, allowing the scheduler to handle priority properly (or, for CS, allowing PI to boost the holder).
Single-CPU systems: Spinning is disabled on uniprocessor systems. The holder cannot make progress while the spinner runs on the same (only) core.
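The gating logic above can be sketched as a bounded spin that either acquires the lock or tells the caller to fall through to the futex wait. The function name is ours, and the `rt_thread` / `ncpus` parameters stand in for the real code’s scheduler and CPU-topology queries (an assumption of this sketch):

```c
#include <stdatomic.h>

#define SRW_SPIN_COUNT 256

/* Returns 1 if the lock was acquired during the spin phase, 0 if the
 * caller should proceed to RtlWaitOnAddress (futex wait). */
static int spin_try_acquire(atomic_int *lock, int rt_thread, int ncpus)
{
    if (rt_thread || ncpus < 2)
        return 0;                         /* RT thread or uniprocessor: no spin */
    for (int i = 0; i < SRW_SPIN_COUNT; i++)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(lock, &expected, 1))
            return 1;                     /* acquired during the spin budget */
        /* real code issues a pause hint (YieldProcessor()) here */
    }
    return 0;
}
```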
SRW locks and critical sections are separate primitives with different internal architectures:
| Property | RTL_CRITICAL_SECTION | RTL_SRWLOCK |
|---|---|---|
| Ownership tracking | Yes (OwningThread) | No |
| Recursive entry | Yes (RecursionCount) | No |
| PI under NSPA | Yes (FUTEX_LOCK_PI) | No (no owner to boost) |
| Spin phase under NSPA | Via SpinCount (existing) | 256 iters (new) |
| Wait mechanism | Keyed event / futex PI | RtlWaitOnAddress / futex |
SRW locks cannot have PI because they do not track ownership – the kernel cannot know which thread to boost. The spin phase is the applicable optimization for SRW.
CS-PI is validated by three test programs in the NSPA RT test suite (nspa_rt_test.exe), run with NSPA_RT_PRIO=80 NSPA_RT_POLICY=FF on a PREEMPT_RT kernel.
Purpose: Validates that PI boost matches uncontended work time. A SCHED_OTHER holder does 200M-iteration busywork inside a CS (approximately 475 ms of CPU time). An RT waiter (SCHED_FIFO 87) blocks on the CS. Four SCHED_OTHER load threads compete for CPU. Under PI, the holder is boosted and the waiter’s wait time matches the holder’s work time. Without PI, CFS time-slices the holder against the load threads, inflating the wait.
Results (v5, latest):
| Metric | Value |
|---|---|
| Hold time per iteration | ~475 ms (work loop) |
| Waiter wait time (with PI) | 474-475 ms |
| Wait/hold ratio | ~1.00x (perfect) |
| Samples captured | 3/3 |
| Verdict | PASS |
The wait time matches the work time to within 1 ms, confirming the holder receives full CPU time under PI boost.
Purpose: Throughput stress test. Four threads (1 RT + 3 load) perform 500,000 CS acquire/release cycles each on a shared critical section. Measures throughput, per-thread max wait, and correctness (shared counter).
Results (v5, latest):
| Metric | Baseline | RT (CS-PI) | Delta |
|---|---|---|---|
| Throughput | 319K ops/s | 327K ops/s | +2.5% |
| RT max wait | – | 36 us | – |
| RT avg wait | – | 1 us | – |
| Shared counter | 2,000,000 | 2,000,000 | correct |
v4 to v5 improvement: RT throughput improved from 312K to 327K ops/s (+4.7%). RT max wait dropped from 46 us to 36 us. These gains are attributed to SIMD memcpy/memmove optimizations reducing overhead in the CS fast path.
Purpose: Dining philosophers with 5 diners, 2 forks each. Philosopher 0 is RT (SCHED_FIFO), philosophers 1-4 are SCHED_OTHER. Four background load threads. Validates transitive PI: philosopher 0 waiting on fork A, held by philosopher 1, who is waiting on fork B, held by philosopher 2 – the PI chain propagates through the rt_mutex infrastructure.
Results (v5, latest):
| Metric | Value |
|---|---|
| Total meals | 250/250 (50 each) |
| Total elapsed | 205 ms |
| RT max wait | 1301 us |
| Spread (max-min meals) | 0 (perfect fairness) |
| Verdict | PASS |
The RT max wait varies between runs due to CFS load placement (v4 measured 601 us, v5 measured 1301 us – both within acceptable range). The critical validation is that all meals complete without deadlock and the PI chain propagates correctly through nested lock acquisitions.
| Metric | v4 | v5 | Cause |
|---|---|---|---|
| rapidmutex RT throughput | 312K ops/s | 327K ops/s (+4.7%) | SIMD + SRW spin |
| rapidmutex RT max wait | 46 us | 36 us | Reduced lock transition overhead |
| cs-contention wait/hold ratio | ~1.00x | ~1.00x | Stable – PI correct |
| philosophers meals | 250/250 | 250/250 | Stable – transitive PI correct |
| fork-mutex RT elapsed | 1021 ms | 948 ms (-7.1%) | SIMD string ops in process setup |
Wine-NSPA CS-PI documentation. Source: dlls/ntdll/sync.c (PE), dlls/ntdll/unix/sync.c (Unix). Generated 2026-04-15.