Wine-NSPA – NTSync Kernel Driver

Linux-NSPA 6.19.11-rt1-1 (PREEMPT_RT), CONFIG_NTSYNC=m | 2026-04-15 Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Upstream vs NSPA Comparison
  3. Driver Architecture
  4. Patch 1: raw_spinlock + rt_mutex Hardening
  5. Patch 2: Priority-Ordered Waiter Queues
  6. Patch 3: Mutex Owner PI Boost v2
  7. Patch 4: uring_fd Extension
  8. Patch 5: GFP_ATOMIC Pre-Allocation
  9. Client-Side Object Creation
  10. Validation

1. Overview

NTSync is a Linux kernel driver (drivers/misc/ntsync.c, /dev/ntsync) that implements Windows NT synchronization primitives – mutexes, semaphores, and events – directly in the kernel. Upstream Wine uses it to replace the wineserver-mediated sync path for these objects, eliminating cross-process round-trips for wait/wake operations.

For Wine-NSPA, upstream ntsync is necessary but insufficient. The upstream driver uses FIFO waiter queues and has no priority inheritance, which means two things: an RT waiter can be woken after lower-priority waiters that happened to arrive earlier, and an RT waiter can be stalled indefinitely behind a SCHED_OTHER mutex holder that CFS declines to schedule.

Wine-NSPA applies five kernel patches to ntsync that make it RT-safe, priority-correct, and PI-aware. Together with the userspace CS-PI path (FUTEX_LOCK_PI on CRITICAL_SECTION), these patches close all priority inversion vectors in Wine’s synchronization stack.


2. Upstream vs NSPA Comparison

Aspect Upstream NTSync Wine-NSPA NTSync
obj->lock spinlock_t -- becomes a sleeping rt_mutex on PREEMPT_RT, breaking the non-preemptibility assumption raw_spinlock_t (Patch 1) -- true spin on PREEMPT_RT, never sleeps; short critical sections (obj state updates only)
wait_all_lock mutex (no PI) rt_mutex (Patch 1) -- PI-aware
Waiter queue FIFO: list_add_tail(&entry->node, &obj->any_waiters) -- longest-waiting thread wakes first, regardless of priority Priority-ordered (Patch 2): ntsync_insert_waiter(), sorted by task->prio -- highest-priority waiter always wakes first (NT semantics)
PI boost None -- RT waiter blocks on a mutex held by SCHED_OTHER; holder time-sliced by CFS, unbounded inversion Mutex owner PI boost v2 (Patch 3): ntsync_pi_recalc() + sched_setattr_nocheck(); per-task ntsync_pi_owner, boost_count, orig_attr save/restore
io_uring integration None -- args.pad reserved (must be zero); CQEs wait until timeout uring_fd extension (Patch 4): args.pad repurposed, io_uring eventfd wakes ntsync waits
PI allocation N/A (no PI allocation needed) GFP_ATOMIC pre-allocation before the raw spinlock (Patch 5)

Upstream limitations: FIFO wakeup (not priority-ordered), no PI boost (unbounded priority inversion), spinlock_t sleeps on PREEMPT_RT, no io_uring CQE wakeup.

NSPA additions: priority-ordered wakeup (NT-faithful), PI boost (owner promoted to waiter's priority), raw_spinlock + rt_mutex (PREEMPT_RT safe), io_uring CQE wakeup via uring_fd.

Impact, v4 (upstream-like) to v5 (NSPA patched):

  ntsync-d4: RT PI avg 387 to 270 ms (-30%)
  ntsync-d8: RT PI avg 419 to 201 ms (-52%)
  ntsync-d12: scales to depth 12
  philosophers: RT max_wait 1620 to 865 us (-46.6%)

3. Driver Architecture

The ntsync driver exposes a character device (/dev/ntsync) opened once per Wine process. Object creation returns file descriptors; wait/wake operations use ioctls on the device fd.

Object Types

Type Kernel struct field Signaled when Wake behavior
Mutex obj->u.mutex count == 0 (unowned) or owner re-acquires Ownership transfers to highest-priority waiter
Semaphore obj->u.sem count > 0 Decrement count, wake highest-priority waiter
Event obj->u.event signaled == true Manual: wake all, auto: wake one + reset

Wait/Wake Paths

Wait path (NTSYNC_IOC_WAIT_ANY / NTSYNC_IOC_WAIT_ALL):

  1. Userspace calls ioctl(device_fd, NTSYNC_IOC_WAIT_ANY, &args) with an array of object fds.
  2. Kernel allocates ntsync_q with one ntsync_q_entry per object.
  3. For wait-any: each entry is inserted into its object’s any_waiters list via ntsync_insert_waiter() (priority-ordered). For wait-all: entries go into all_waiters under wait_all_lock.
  4. If any object is immediately signaled, try_wake_any() / try_wake_all() fires synchronously.
  5. Otherwise, ntsync_schedule() puts the thread to sleep via schedule_hrtimeout_range_clock().
  6. On wakeup: entries are removed from waiter lists, PI is recalculated, objects are released.

Wake path (object state change):

  1. A release ioctl (e.g. NTSYNC_IOC_UNLOCK_MUTEX) changes object state.
  2. try_wake_any_mutex() walks the any_waiters list (highest-priority first due to sorted insertion).
  3. First satisfiable waiter gets ownership transferred: obj->u.mutex.owner = q->owner, wake_up_process(q->task).
  4. ntsync_pi_drop() removes the old owner’s boost; ntsync_pi_set_owner() recalculates for the new owner.

Locking Hierarchy

  raw_spinlock_t obj->lock       -- per-object, protects object state + waiter lists
  rt_mutex dev->wait_all_lock    -- device-wide, serializes wait-all operations
  raw_spinlock_t dev->boost_lock -- device-wide, protects boosted_owners list

The obj_lock() fast path acquires only obj->lock. When obj->dev_locked is set (another thread is doing a wait-all), obj_lock() falls back to acquiring wait_all_lock first. This avoids ABBA deadlocks between per-object and device-wide locks.


4. Patch 1: raw_spinlock + rt_mutex Hardening

Problem: Upstream ntsync uses spinlock_t for obj->lock and mutex for wait_all_lock. On PREEMPT_RT kernels, spinlock_t becomes a sleeping rt_mutex and mutex becomes a PI-less sleeping lock. This changes the driver’s timing characteristics – code paths that assume short non-preemptible critical sections (object state updates, waiter list manipulation) can now be preempted mid-update.

Fix: Convert to explicit RT-aware types:

Field Upstream NSPA
obj->lock spinlock_t raw_spinlock_t
dev->wait_all_lock struct mutex struct rt_mutex
/* Per-object: true spin semantics, even on PREEMPT_RT */
raw_spin_lock_init(&obj->lock);
raw_spin_lock(&obj->lock);
raw_spin_unlock(&obj->lock);

/* Device-wide: PI-aware sleeping lock for wait-all serialization */
rt_mutex_init(&dev->wait_all_lock);
rt_mutex_lock(&dev->wait_all_lock);
rt_mutex_unlock(&dev->wait_all_lock);

raw_spinlock_t is appropriate here because the critical sections are short (tens of instructions) and only manipulate in-memory state. The rt_mutex for wait_all_lock provides PI inheritance for the device-wide lock – a thread holding wait_all_lock is boosted if a higher-priority thread blocks on it.


5. Patch 2: Priority-Ordered Waiter Queues

Problem: Upstream ntsync uses list_add_tail() to append waiters to the queue. All wakeups are FIFO – the thread that called WaitForSingleObject first is woken first, regardless of scheduling priority. This violates Windows NT semantics, where the highest-priority waiting thread is always woken first.

Fix: Replace list_add_tail() with ntsync_insert_waiter(), which performs a sorted insertion based on task->prio (kernel internal priority, lower value = higher scheduling priority). Same-priority waiters maintain FIFO order within their priority level.

static void ntsync_insert_waiter(struct ntsync_q_entry *new_entry,
                                 struct list_head *head)
{
    struct ntsync_q_entry *entry;

    list_for_each_entry(entry, head, node) {
        if (new_entry->q->task->prio < entry->q->task->prio) {
            /* Insert before the first lower-priority waiter. */
            list_add_tail(&new_entry->node, &entry->node);
            return;
        }
    }
    list_add_tail(&new_entry->node, head);
}

The function walks the existing waiter list until it finds an entry with a numerically higher (lower scheduling priority) task->prio, then inserts before that entry. If no such entry exists (the new waiter has the lowest priority), it appends to the tail.

Priority-ordered waiter queue insertion, worked example:

Upstream (list_add_tail, FIFO order):
  Thread A  prio 120 (CFS)   arrived 1st
  Thread B  prio 49  (RT)    arrived 2nd
  Thread C  prio 120 (CFS)   arrived 3rd
  Thread D  prio 15  (FIFO)  arrives last -- highest priority, woken last!
  Wake order: A, B, C, D -- RT waits behind CFS.

NSPA (ntsync_insert_waiter, priority-ordered):
  Before insertion: B (49), A (120), C (120)
  Insert Thread D (prio 15): 15 < 49, so D is inserted before B.
  After insertion: D (15), B (49), A (120), C (120) -- A before C preserves FIFO within a priority level.
  Wake order: D, B, A, C.

Both any_waiters (wait-any) and all_waiters (wait-all) queues use sorted insertion. The try_wake_any_mutex() function naturally wakes the highest-priority waiter by taking the first satisfiable entry from the sorted list.


6. Patch 3: Mutex Owner PI Boost v2

Problem: When a SCHED_FIFO thread (e.g. prio 15) waits on a mutex held by a SCHED_OTHER thread (prio 120), the holder is time-sliced by CFS against all other normal-priority threads. The RT waiter’s bounded latency guarantee is violated – it can be delayed indefinitely while the holder fails to run.

Fix (v2): ntsync_pi_recalc() boosts the mutex holder to the scheduling priority of the highest-priority waiter via sched_setattr_nocheck(). A per-device tracking structure (ntsync_pi_owner) saves the holder’s original scheduling attributes and counts how many mutex objects are contributing boosts to that task.

PI Boost Chain

NTSync PI Boost v2, full chain:

  1. An RT thread calls WaitForSingleObject(hMutex). The waiter (SCHED_FIFO prio 15) is sorted into any_waiters by ntsync_insert_waiter().
  2. ntsync_pi_recalc(obj) scans any_waiters + all_waiters: highest_prio = 15 (the RT waiter's prio). Compared against orig_normal_prio (120, CFS): 15 < 120, so a boost is needed; find or create the ntsync_pi_owner entry and increment boost_count.
  3. Per-task PI owner tracking: ntsync_pi_owner holds the owner task_struct pointer, orig_attr ({ SCHED_OTHER, nice 0 }), orig_normal_prio (120), and boost_count (e.g. 2 when mutexes M1 and M2 are both boosting).
  4. Apply the boost: sched_setattr_nocheck(owner, { SCHED_FIFO, prio=84 }).
  5. On mutex release, ntsync_pi_drop(obj) clears obj.pi_boosted and decrements boost_count. If boost_count > 0, another mutex is still boosting: keep the boost (conservative over-boost, safe). If boost_count == 0, restore the original attributes via sched_setattr_nocheck(owner, &po->orig_attr), then list_del(&po->node) and kfree(po).

Key Design Decisions

Per-task tracking, not per-object: When a task holds mutexes M1 and M2, both with RT waiters, the task’s ntsync_pi_owner has boost_count=2. The original scheduling attributes are saved once (on the first boost) and restored only when boost_count reaches zero. Between removal of the first and last boost, the task is conservatively over-boosted – it runs at too-high priority, never too-low.

Compare against orig_normal_prio, not normal_prio: After sched_setattr_nocheck() boosts a task, normal_prio changes to reflect the boosted priority. Comparing waiters against the post-boost normal_prio would cause thrashing: the comparison would conclude no boost is needed (waiter prio is now equal to the boosted normal_prio), unboost, then re-boost on the next recalc. Using orig_normal_prio from the tracking struct is stable.

Lazy owner_task resolution: When Wine creates a mutex via the wineserver, current is the wineserver thread, not the Win32 owning thread. owner_task starts NULL and is resolved lazily on the first unlock ioctl (where current is the actual Win32 thread).

v2 Bug Fixes (3 bugs from v1)

Bug v1 Behavior v2 Fix
Multi-object PI corruption Single global orig_attr overwritten when 2nd mutex boosted Per-task ntsync_pi_owner with boost_count
Zero PI for WaitAll all_waiters list not scanned ntsync_pi_recalc() scans both any_waiters and all_waiters
Stale normal_prio comparison owner->normal_prio changes after boost, causing thrash Compare against po->orig_normal_prio (saved pre-boost value)

PI recalc() core logic

static void ntsync_pi_recalc(struct ntsync_obj *obj)
{
    struct task_struct *owner = obj->u.mutex.owner_task;
    int highest_prio = MAX_PRIO;

    /* Scan BOTH wait lists for the highest-priority waiter */
    list_for_each_entry(entry, &obj->any_waiters, node) {
        if (entry->q->task->prio < highest_prio)
            highest_prio = entry->q->task->prio;
    }
    list_for_each_entry(entry, &obj->all_waiters, node) {
        if (entry->q->task->prio < highest_prio)
            highest_prio = entry->q->task->prio;
    }

    raw_spin_lock(&dev->boost_lock);

    po = find_pi_owner(dev, owner);
    /* Compare against ORIGINAL priority, not current (which may be boosted) */
    base_prio = po ? po->orig_normal_prio : owner->normal_prio;

    if (highest_prio < base_prio && !was_boosted) {
        /* First boost for this object -- find or create per-task tracking */
        if (!po) {
            po = new_po;  /* pre-allocated (Patch 5) */
            po->orig_attr = capture_sched(owner);
            po->orig_normal_prio = owner->normal_prio;
            list_add(&po->node, &dev->boosted_owners);
            new_po = NULL;
        }
        po->boost_count++;
        obj->u.mutex.pi_boosted = true;
    }

    if (needs_boost && highest_prio < owner->prio)
        sched_setattr_nocheck(owner, &boost_attr);

    raw_spin_unlock(&dev->boost_lock);
    kfree(new_po);  /* free if not consumed */
}


7. Patch 4: uring_fd Extension

Problem: When a thread is blocked inside NTSYNC_IOC_WAIT_ANY (or WAIT_ALL) and an io_uring CQE arrives (e.g. socket data ready), there is no mechanism to wake the thread. The CQE sits in the completion ring until the ntsync wait times out or another ntsync object is signaled. For RT audio, this adds an entire timeout period of unnecessary latency to socket I/O completions.

Fix: Repurpose the reserved pad field in ntsync_wait_args as uring_fd. When userspace passes a valid eventfd (registered with IORING_REGISTER_EVENTFD on the thread’s io_uring ring), the ntsync wait ioctl monitors it alongside the ntsync objects using the standard poll_initwait/vfs_poll kernel pattern.

Protocol

  1. Userspace (ntdll linux_wait_objs()): sets args.pad = uring_fd (the io_uring eventfd).
  2. Kernel (ntsync_wait_any): calls fget(args.uring_fd) to get the eventfd’s struct file.
  3. Kernel (ntsync_schedule): calls poll_initwait(&pwq) + vfs_poll(uring_file, &pwq.pt) to register on the eventfd’s wait queue.
  4. In the schedule loop, each iteration checks vfs_poll(uring_file, NULL). If EPOLLIN is set (CQE available), the ioctl sets args.index = NTSYNC_INDEX_URING_READY (0xFFFFFFFE) and returns.
  5. Userspace reads and discards the eventfd counter, drains CQEs via process_completions(), and re-enters the ntsync wait.

Kernel changes

/* ntsync_schedule() – modified to accept uring_file */
static int ntsync_schedule(struct ntsync_q *q, const struct ntsync_wait_args *args,
                           struct file *uring_file)
{
    struct poll_wqueues uring_pwq;

    if (uring_file) {
        poll_initwait(&uring_pwq);
        vfs_poll(uring_file, &uring_pwq.pt);  /* register on eventfd wqh */
    }

    do {
        /* ... signal/timeout checks ... */

        if (uring_file) {
            __poll_t mask = vfs_poll(uring_file, NULL);
            if (mask & EPOLLIN) {
                int signaled = -1;
                atomic_try_cmpxchg(&q->signaled, &signaled,
                                   NTSYNC_INDEX_URING_READY);
                ret = 0;
                break;
            }
        }

        ret = schedule_hrtimeout_range_clock(...);
    } while (ret < 0);

    if (uring_file)
        poll_freewait(&uring_pwq);

    return ret;
}

Userspace handling

/* dlls/ntdll/unix/sync.c */

#define NTSYNC_INDEX_URING_READY 0xFFFFFFFEu

/* In linux_wait_objs(): */
args.pad = uring_fd > 0 ? uring_fd : 0;

/* After ioctl returns: */
if (args.index == NTSYNC_INDEX_URING_READY)
{
    uint64_t val;
    read(uring_fd, &val, sizeof(val));  /* consume eventfd counter */
    return STATUS_URING_COMPLETION;     /* caller drains CQEs, retries */
}

The pad field reuse is ABI-compatible: upstream requires pad == 0 and returns EINVAL otherwise. NSPA relaxes this check (args.flags & ~NTSYNC_WAIT_REALTIME only) and interprets nonzero pad as uring_fd.


8. Patch 5: GFP_ATOMIC Pre-Allocation

Problem: Patch 3’s ntsync_pi_recalc() originally called kzalloc(sizeof(*po), GFP_ATOMIC) while holding dev->boost_lock (a raw_spinlock). On PREEMPT_RT kernels, the slab allocator’s internal rt_spin_lock can sleep, which triggers a __schedule_bug (BUG: scheduling while atomic) when called under a raw spinlock.

GFP_ATOMIC prevents the allocator from sleeping on memory pressure – it does not prevent the PREEMPT_RT slab lock from sleeping. On non-RT kernels this works fine because slab locks are plain spinlocks (disable preemption). On PREEMPT_RT, they are sleeping locks.

Fix: Pre-allocate the ntsync_pi_owner struct before acquiring the raw spinlock. Pass the pre-allocated pointer into the locked section; consume it only if a new owner entry is actually needed. Free the unconsumed allocation after releasing the lock.

static void ntsync_pi_recalc(struct ntsync_obj *obj)
{
    struct ntsync_pi_owner *new_po = NULL;

    /* Pre-allocate BEFORE taking the raw spinlock.
     * On PREEMPT_RT, kzalloc can sleep (slab rt_spin_lock). */
    new_po = kzalloc(sizeof(*new_po), GFP_ATOMIC);

    /* ... scan waiter lists ... */

    raw_spin_lock(&dev->boost_lock);

    if (needs_boost && !was_boosted) {
        if (!po) {
            if (unlikely(!new_po))
                goto out;
            /* initialize and consume new_po */
            new_po->task = owner;
            new_po->orig_attr = ...;
            list_add(&new_po->node, &dev->boosted_owners);
            po = new_po;
            new_po = NULL;  /* consumed -- don't free below */
        }
        po->boost_count++;
    }

out:
    raw_spin_unlock(&dev->boost_lock);
    kfree(new_po);  /* free if pre-allocated but not consumed (NULL is safe) */
}

The allocation still uses GFP_ATOMIC so the allocator never blocks waiting for memory reclaim, keeping the recalc path's latency bounded. The key insight is that the allocation must be moved out of the atomic region entirely: on PREEMPT_RT, sleeping locks – including the slab allocator's internal rt_spin_lock – may not be acquired while any raw spinlock is held. Hoisting the kzalloc() above boost_lock performs it from preemptible context, and the consume-or-free pattern keeps the locked section allocation-free.


9. Client-Side Object Creation

Wine-NSPA creates anonymous ntsync objects client-side, bypassing the wineserver entirely for the fast path. This eliminates a server round-trip for every CreateMutex, CreateSemaphore, and CreateEvent call when the object does not need a name (the common case for application-internal synchronization).

alloc_client_handle() Path

Anonymous CreateMutex:
  alloc_client_handle()          – allocate a Wine handle from client-side pool
  ioctl(NTSYNC_IOC_CREATE_MUTEX) – kernel creates ntsync object, returns fd
  cache fd in handle table       – subsequent waits resolve fd without server
Named CreateMutex:
  wineserver creates obj         – needs server for namespace management
  passes fd back to client       – via SCM_RIGHTS over Unix socket
  cache fd in handle table       – subsequent waits same fast path

The client handle pool uses a simple atomic decrement (InterlockedDecrement(&client_handle_next)) to allocate negative handle values that do not collide with server-allocated handles. Wait operations (NtWaitForSingleObject) resolve the handle to a cached fd via inproc_wait(), then call linux_wait_objs() which issues the kernel ioctl directly.

Currently, client-side creation is enabled for mutexes and semaphores. Event object creation is client-capable at the kernel level but disabled in the Wine client due to stability issues with certain applications (Ableton Live).


10. Validation

Priority Wakeup Ordering

Tested with 5 waiters at different priorities on a single ntsync mutex. All waiters woke in correct priority order in both baseline and RT modes across all test runs (v4 and v5).

Waiter Priority Expected Wake Order Actual (Baseline) Actual (RT)
Thread 1 SCHED_FIFO prio 50 1st 1st 1st
Thread 2 SCHED_FIFO prio 40 2nd 2nd 2nd
Thread 3 SCHED_FIFO prio 30 3rd 3rd 3rd
Thread 4 SCHED_FIFO prio 20 4th 4th 4th
Thread 5 SCHED_OTHER (CFS) 5th 5th 5th

PI Contention: v4 to v5 Comparison

PI contention measures how long an RT thread waits on a mutex held by a CFS thread doing a CPU-bound work loop. Lower is better – it means the PI boost is getting the holder on-core faster.

Depth v4 RT avg v5 RT avg Delta Notes
d4 387 ms 270 ms -30.2% 8/8 samples, tight range
d8 419 ms 201 ms -52.0% v4 had CFS reversal (RT worse than baseline), v5 resolved
d12 scales scales flat Chain propagation correct, no degradation

PI Chain Scaling (v5)

Transitive PI chains tested up to depth 12. RT wait time does not increase with chain depth beyond the tail holder’s work time (~235ms for a 100M-iteration CPU loop).

Metric d4 d8 d12
RT PI avg wait 270 ms 201 ms ~235 ms
All sub-tests PASS 8/8 8/8 8/8
Priority wakeup correct Yes Yes Yes
Chain propagation correct Yes Yes Yes

Rapid Kernel Mutex Throughput

Metric v4 RT v5 RT Delta
Throughput 232K ops/s 259K ops/s +11.6%
RT max_wait 54 us 47 us -13.0%
Counter correctness 400K/400K 400K/400K correct

Philosophers (Integration Test)

The philosophers test exercises contention between RT and CFS threads on ntsync mutexes with random hold patterns.

Metric v4 RT v5 RT Delta
RT max_wait 1620 us 865 us -46.6%

Mixed WFMO

WaitForMultipleObjects (both wait-any and wait-all modes) tested with mixed object types (mutex + semaphore + event). All sub-tests PASS in both baseline and RT modes across d4/d8/d12.


File References

File Purpose
drivers/misc/ntsync.c (kernel) NTSync driver with all 5 NSPA patches
include/uapi/linux/ntsync.h (kernel) UAPI header: ioctls, ntsync_wait_args, NTSYNC_INDEX_URING_READY
dlls/ntdll/unix/sync.c (Wine) Client-side ntsync integration: linux_wait_objs(), alloc_client_handle(), uring_fd retry loop
dlls/ntdll/unix/io_uring.c (Wine) io_uring ring management, eventfd registration
programs/nspa_rt_test/main.c (Wine) RT test suite: PI contention, priority wakeup, chain scaling