Wine-NSPA – NTSync Kernel Driver

Linux-NSPA 6.19.11-rt1-1 (PREEMPT_RT), CONFIG_NTSYNC=m | 2026-04-15 Author: Jordan Johnston

Table of Contents

  1. Overview
  2. Upstream vs NSPA Comparison
  3. Driver Architecture
  4. Patch 1: raw_spinlock + rt_mutex Hardening
  5. Patch 2: Priority-Ordered Waiter Queues
  6. Patch 3: Mutex Owner PI Boost v2
  7. Patch 4: uring_fd Extension
  8. Patch 5: GFP_ATOMIC Pre-Allocation
  9. Client-Side Object Creation
  10. Validation

1. Overview

NTSync is a Linux kernel driver (drivers/misc/ntsync.c, /dev/ntsync) that implements Windows NT synchronization primitives – mutexes, semaphores, and events – directly in the kernel. Upstream Wine uses it to replace the wineserver-mediated sync path for these objects, eliminating cross-process round-trips for wait/wake operations.

For Wine-NSPA, upstream ntsync is necessary but insufficient. The upstream driver uses FIFO waiter queues and has no priority inheritance, which means two things: an RT waiter can be woken after lower-priority waiters that happened to arrive earlier, and an RT waiter can be stalled indefinitely behind a SCHED_OTHER mutex holder that CFS declines to schedule.

Wine-NSPA applies five kernel patches to ntsync that make it RT-safe, priority-correct, and PI-aware. Together with the userspace CS-PI path (FUTEX_LOCK_PI on CRITICAL_SECTION), these patches close all priority inversion vectors in Wine’s synchronization stack.


2. Upstream vs NSPA Comparison

Aspect Upstream NTSync Wine-NSPA NTSync
obj->lock spinlock_t -- becomes a sleeping rt_mutex on PREEMPT_RT, breaking the non-preemptibility assumption raw_spinlock_t (Patch 1) -- true spin on PREEMPT_RT, never sleeps; short critical sections (obj state updates only)
wait_all_lock mutex (no PI) rt_mutex (Patch 1) -- PI-aware
Waiter queue FIFO: list_add_tail(&entry->node, &obj->any_waiters) -- longest-waiting thread wakes first, regardless of priority Priority-ordered (Patch 2): ntsync_insert_waiter(), sorted by task->prio -- highest-priority waiter always wakes first (NT semantics)
PI boost None -- RT waiter blocks on a mutex held by SCHED_OTHER; holder time-sliced by CFS, unbounded inversion Mutex owner PI boost v2 (Patch 3): ntsync_pi_recalc() + sched_setattr_nocheck(); per-task ntsync_pi_owner, boost_count, orig_attr save/restore
io_uring integration None -- args.pad reserved (must be zero); CQEs wait until timeout uring_fd extension (Patch 4): args.pad repurposed, io_uring eventfd wakes ntsync waits
PI allocation N/A (no PI allocation needed) GFP_ATOMIC pre-allocation before the raw spinlock (Patch 5)

Upstream limitations: FIFO wakeup (not priority-ordered), no PI boost (unbounded priority inversion), spinlock_t sleeps on PREEMPT_RT, no io_uring CQE wakeup.

NSPA additions: priority-ordered wakeup (NT-faithful), PI boost (owner promoted to waiter's priority), raw_spinlock + rt_mutex (PREEMPT_RT safe), io_uring CQE wakeup via uring_fd.

Impact, v4 (upstream-like) to v5 (NSPA patched):

  ntsync-d4: RT PI avg 387 to 270 ms (-30%)
  ntsync-d8: RT PI avg 419 to 201 ms (-52%)
  ntsync-d12: scales to depth 12
  philosophers: RT max_wait 1620 to 865 us (-46.6%)

3. Driver Architecture

The ntsync driver exposes a character device (/dev/ntsync) opened once per Wine process. Object creation returns file descriptors; wait/wake operations use ioctls on the device fd.

Object Types

Type Kernel struct field Signaled when Wake behavior
Mutex obj->u.mutex count == 0 (unowned) or owner re-acquires Ownership transfers to highest-priority waiter
Semaphore obj->u.sem count > 0 Decrement count, wake highest-priority waiter
Event obj->u.event signaled == true Manual: wake all, auto: wake one + reset

Wait/Wake Paths

Wait path (NTSYNC_IOC_WAIT_ANY / NTSYNC_IOC_WAIT_ALL):

  1. Userspace calls ioctl(device_fd, NTSYNC_IOC_WAIT_ANY, &args) with an array of object fds.
  2. Kernel allocates ntsync_q with one ntsync_q_entry per object.
  3. For wait-any: each entry is inserted into its object’s any_waiters list via ntsync_insert_waiter() (priority-ordered). For wait-all: entries go into all_waiters under wait_all_lock.
  4. If any object is immediately signaled, try_wake_any() / try_wake_all() fires synchronously.
  5. Otherwise, ntsync_schedule() puts the thread to sleep via schedule_hrtimeout_range_clock().
  6. On wakeup: entries are removed from waiter lists, PI is recalculated, objects are released.

Wake path (object state change):

  1. A release ioctl (e.g. NTSYNC_IOC_UNLOCK_MUTEX) changes object state.
  2. try_wake_any_mutex() walks the any_waiters list (highest-priority first due to sorted insertion).
  3. First satisfiable waiter gets ownership transferred: obj->u.mutex.owner = q->owner, wake_up_process(q->task).
  4. ntsync_pi_drop() removes the old owner’s boost; ntsync_pi_set_owner() recalculates for the new owner.

Locking Hierarchy

  raw_spinlock_t obj->lock       -- per-object, protects object state + waiter lists
  rt_mutex dev->wait_all_lock    -- device-wide, serializes wait-all operations
  raw_spinlock_t dev->boost_lock -- device-wide, protects boosted_owners list

The obj_lock() fast path acquires only obj->lock. When obj->dev_locked is set (another thread is doing a wait-all), obj_lock() falls back to acquiring wait_all_lock first. This avoids ABBA deadlocks between per-object and device-wide locks.


4. Patch 1: raw_spinlock + rt_mutex Hardening

Problem: Upstream ntsync uses spinlock_t for obj->lock and mutex for wait_all_lock. On PREEMPT_RT kernels, spinlock_t becomes a sleeping rt_mutex and mutex becomes a PI-less sleeping lock. This changes the driver’s timing characteristics – code paths that assume short non-preemptible critical sections (object state updates, waiter list manipulation) can now be preempted mid-update.

Fix: Convert to explicit RT-aware types:

Field Upstream NSPA
obj->lock spinlock_t raw_spinlock_t
dev->wait_all_lock struct mutex struct rt_mutex
/* Per-object: true spin semantics, even on PREEMPT_RT */
raw_spin_lock_init(&obj->lock);
raw_spin_lock(&obj->lock);
raw_spin_unlock(&obj->lock);

/* Device-wide: PI-aware sleeping lock for wait-all serialization */
rt_mutex_init(&dev->wait_all_lock);
rt_mutex_lock(&dev->wait_all_lock);
rt_mutex_unlock(&dev->wait_all_lock);

raw_spinlock_t is appropriate here because the critical sections are short (tens of instructions) and only manipulate in-memory state. The rt_mutex for wait_all_lock provides PI inheritance for the device-wide lock – a thread holding wait_all_lock is boosted if a higher-priority thread blocks on it.


5. Patch 2: Priority-Ordered Waiter Queues

Problem: Upstream ntsync uses list_add_tail() to append waiters to the queue. All wakeups are FIFO – the thread that called WaitForSingleObject first is woken first, regardless of scheduling priority. This violates Windows NT semantics, where the highest-priority waiting thread is always woken first.

Fix: Replace list_add_tail() with ntsync_insert_waiter(), which performs a sorted insertion based on task->prio (kernel internal priority, lower value = higher scheduling priority). Same-priority waiters maintain FIFO order within their priority level.

static void ntsync_insert_waiter(struct ntsync_q_entry *new_entry,
                                 struct list_head *head)
{
    struct ntsync_q_entry *entry;

    list_for_each_entry(entry, head, node) {
        if (new_entry->q->task->prio < entry->q->task->prio) {
            /* Insert before the first lower-priority waiter. */
            list_add_tail(&new_entry->node, &entry->node);
            return;
        }
    }
    list_add_tail(&new_entry->node, head);
}

The function walks the existing waiter list until it finds an entry with a numerically higher (lower scheduling priority) task->prio, then inserts before that entry. If no such entry exists (the new waiter has the lowest priority), it appends to the tail.

Priority-ordered waiter queue insertion, worked example:

Upstream (list_add_tail, FIFO order):
  Thread A  prio 120 (CFS)   arrived 1st
  Thread B  prio 49  (RT)    arrived 2nd
  Thread C  prio 120 (CFS)   arrived 3rd
  Thread D  prio 15  (FIFO)  arrives last -- highest priority, woken last!
  Wake order: A, B, C, D -- RT waits behind CFS.

NSPA (ntsync_insert_waiter, priority-ordered):
  Before insertion: B (49), A (120), C (120)
  Insert Thread D (prio 15): 15 < 49, so D is inserted before B.
  After insertion: D (15), B (49), A (120), C (120) -- A before C preserves FIFO within a priority level.
  Wake order: D, B, A, C.

Both any_waiters (wait-any) and all_waiters (wait-all) queues use sorted insertion. The try_wake_any_mutex() function naturally wakes the highest-priority waiter by taking the first satisfiable entry from the sorted list.


6. Patch 3: Mutex Owner PI Boost v2

Problem: When a SCHED_FIFO thread (e.g. prio 15) waits on a mutex held by a SCHED_OTHER thread (prio 120), the holder is time-sliced by CFS against all other normal-priority threads. The RT waiter’s bounded latency guarantee is violated – it can be delayed indefinitely while the holder fails to run.

Fix (v2): ntsync_pi_recalc() boosts the mutex holder to the scheduling priority of the highest-priority waiter via sched_setattr_nocheck(). A per-device tracking structure (ntsync_pi_owner) saves the holder’s original scheduling attributes and counts how many mutex objects are contributing boosts to that task.

PI Boost Chain

NTSync PI Boost v2, full chain:

  1. An RT thread calls WaitForSingleObject(hMutex). The waiter (SCHED_FIFO prio 15) is sorted into any_waiters by ntsync_insert_waiter().
  2. ntsync_pi_recalc(obj) scans any_waiters + all_waiters: highest_prio = 15 (the RT waiter's prio). Compared against orig_normal_prio (120, CFS): 15 < 120, so a boost is needed; find or create the ntsync_pi_owner entry and increment boost_count.
  3. Per-task PI owner tracking: ntsync_pi_owner holds the owner task_struct pointer, orig_attr ({ SCHED_OTHER, nice 0 }), orig_normal_prio (120), and boost_count (e.g. 2 when mutexes M1 and M2 are both boosting).
  4. Apply the boost: sched_setattr_nocheck(owner, { SCHED_FIFO, prio=84 }).
  5. On mutex release, ntsync_pi_drop(obj) clears obj.pi_boosted and decrements boost_count. If boost_count > 0, another mutex is still boosting: keep the boost (conservative over-boost, safe). If boost_count == 0, restore the original attributes via sched_setattr_nocheck(owner, &po->orig_attr), then list_del(&po->node) and kfree(po).

Key Design Decisions

Per-task tracking, not per-object: When a task holds mutexes M1 and M2, both with RT waiters, the task’s ntsync_pi_owner has boost_count=2. The original scheduling attributes are saved once (on the first boost) and restored only when boost_count reaches zero. Between removal of the first and last boost, the task is conservatively over-boosted – it runs at too-high priority, never too-low.

Compare against orig_normal_prio, not normal_prio: After sched_setattr_nocheck() boosts a task, normal_prio changes to reflect the boosted priority. Comparing waiters against the post-boost normal_prio would cause thrashing: the comparison would conclude no boost is needed (waiter prio is now equal to the boosted normal_prio), unboost, then re-boost on the next recalc. Using orig_normal_prio from the tracking struct is stable.

Lazy owner_task resolution: When Wine creates a mutex via the wineserver, current is the wineserver thread, not the Win32 owning thread. owner_task starts NULL and is resolved lazily on the first unlock ioctl (where current is the actual Win32 thread).

v2 Bug Fixes (3 bugs from v1)

Bug v1 Behavior v2 Fix
Multi-object PI corruption Single global orig_attr overwritten when 2nd mutex boosted Per-task ntsync_pi_owner with boost_count
Zero PI for WaitAll all_waiters list not scanned ntsync_pi_recalc() scans both any_waiters and all_waiters
Stale normal_prio comparison owner->normal_prio changes after boost, causing thrash Compare against po->orig_normal_prio (saved pre-boost value)

PI recalc() core logic

static void ntsync_pi_recalc(struct ntsync_obj *obj)
{
    struct task_struct *owner = obj->u.mutex.owner_task;
    int highest_prio = MAX_PRIO;

    /* Scan BOTH wait lists for the highest-priority waiter */
    list_for_each_entry(entry, &obj->any_waiters, node) {
        if (entry->q->task->prio < highest_prio)
            highest_prio = entry->q->task->prio;
    }
    list_for_each_entry(entry, &obj->all_waiters, node) {
        if (entry->q->task->prio < highest_prio)
            highest_prio = entry->q->task->prio;
    }

    raw_spin_lock(&dev->boost_lock);

    po = find_pi_owner(dev, owner);
    /* Compare against ORIGINAL priority, not current (which may be boosted) */
    base_prio = po ? po->orig_normal_prio : owner->normal_prio;

    if (highest_prio < base_prio && !was_boosted) {
        /* First boost for this object -- find or create per-task tracking */
        if (!po) {
            po = new_po;  /* pre-allocated (Patch 5) */
            po->orig_attr = capture_sched(owner);
            po->orig_normal_prio = owner->normal_prio;
            list_add(&po->node, &dev->boosted_owners);
            new_po = NULL;
        }
        po->boost_count++;
        obj->u.mutex.pi_boosted = true;
    }

    if (needs_boost && highest_prio < owner->prio)
        sched_setattr_nocheck(owner, &boost_attr);

    raw_spin_unlock(&dev->boost_lock);
    kfree(new_po);  /* free if not consumed */
}


7. Patch 4: uring_fd Extension

Problem: When a thread is blocked inside NTSYNC_IOC_WAIT_ANY (or WAIT_ALL) and an io_uring CQE arrives (e.g. socket data ready), there is no mechanism to wake the thread. The CQE sits in the completion ring until the ntsync wait times out or another ntsync object is signaled. For RT audio, this adds an entire timeout period of unnecessary latency to socket I/O completions.

Fix: Repurpose the reserved pad field in ntsync_wait_args as uring_fd. When userspace passes a valid eventfd (registered with IORING_REGISTER_EVENTFD on the thread’s io_uring ring), the ntsync wait ioctl monitors it alongside the ntsync objects using the standard poll_initwait/vfs_poll kernel pattern.

Protocol

  1. Userspace (ntdll linux_wait_objs()): sets args.pad = uring_fd (the io_uring eventfd).
  2. Kernel (ntsync_wait_any): calls fget(args.uring_fd) to get the eventfd’s struct file.
  3. Kernel (ntsync_schedule): calls poll_initwait(&pwq) + vfs_poll(uring_file, &pwq.pt) to register on the eventfd’s wait queue.
  4. In the schedule loop, each iteration checks vfs_poll(uring_file, NULL). If EPOLLIN is set (CQE available), the ioctl sets args.index = NTSYNC_INDEX_URING_READY (0xFFFFFFFE) and returns.
  5. Userspace reads and discards the eventfd counter, drains CQEs via process_completions(), and re-enters the ntsync wait.

Kernel changes

/* ntsync_schedule() – modified to accept uring_file */
static int ntsync_schedule(struct ntsync_q *q, const struct ntsync_wait_args *args,
                           struct file *uring_file)
{
    struct poll_wqueues uring_pwq;

    if (uring_file) {
        poll_initwait(&uring_pwq);
        vfs_poll(uring_file, &uring_pwq.pt);  /* register on eventfd wqh */
    }

    do {
        /* ... signal/timeout checks ... */

        if (uring_file) {
            __poll_t mask = vfs_poll(uring_file, NULL);
            if (mask & EPOLLIN) {
                int signaled = -1;
                atomic_try_cmpxchg(&q->signaled, &signaled,
                                   NTSYNC_INDEX_URING_READY);
                ret = 0;
                break;
            }
        }

        ret = schedule_hrtimeout_range_clock(...);
    } while (ret < 0);

    if (uring_file)
        poll_freewait(&uring_pwq);

    return ret;
}

Userspace handling

/* dlls/ntdll/unix/sync.c */

#define NTSYNC_INDEX_URING_READY 0xFFFFFFFEu

/* In linux_wait_objs(): */
args.pad = uring_fd > 0 ? uring_fd : 0;

/* After ioctl returns: */
if (args.index == NTSYNC_INDEX_URING_READY)
{
    uint64_t val;
    read(uring_fd, &val, sizeof(val));  /* consume eventfd counter */
    return STATUS_URING_COMPLETION;     /* caller drains CQEs, retries */
}

The pad field reuse is ABI-compatible: upstream requires pad == 0 and returns EINVAL otherwise. NSPA relaxes this check (args.flags & ~NTSYNC_WAIT_REALTIME only) and interprets nonzero pad as uring_fd.


8. Patch 5: GFP_ATOMIC Pre-Allocation

Problem: Patch 3’s ntsync_pi_recalc() originally called kzalloc(sizeof(*po), GFP_ATOMIC) while holding dev->boost_lock (a raw_spinlock). On PREEMPT_RT kernels, the slab allocator’s internal rt_spin_lock can sleep, which triggers a __schedule_bug (BUG: scheduling while atomic) when called under a raw spinlock.

GFP_ATOMIC prevents the allocator from sleeping on memory pressure – it does not prevent the PREEMPT_RT slab lock from sleeping. On non-RT kernels this works fine because slab locks are plain spinlocks (disable preemption). On PREEMPT_RT, they are sleeping locks.

Fix: Pre-allocate the ntsync_pi_owner struct before acquiring the raw spinlock. Pass the pre-allocated pointer into the locked section; consume it only if a new owner entry is actually needed. Free the unconsumed allocation after releasing the lock.

static void ntsync_pi_recalc(struct ntsync_obj *obj)
{
    struct ntsync_pi_owner *new_po = NULL;

    /* Pre-allocate BEFORE taking the raw spinlock.
     * On PREEMPT_RT, kzalloc can sleep (slab rt_spin_lock). */
    new_po = kzalloc(sizeof(*new_po), GFP_ATOMIC);

    /* ... scan waiter lists ... */

    raw_spin_lock(&dev->boost_lock);

    if (needs_boost && !was_boosted) {
        if (!po) {
            if (unlikely(!new_po))
                goto out;
            /* initialize and consume new_po */
            new_po->task = owner;
            new_po->orig_attr = ...;
            list_add(&new_po->node, &dev->boosted_owners);
            po = new_po;
            new_po = NULL;  /* consumed -- don't free below */
        }
        po->boost_count++;
    }

out:
    raw_spin_unlock(&dev->boost_lock);
    kfree(new_po);  /* free if pre-allocated but not consumed (NULL is safe) */
}

The allocation still uses GFP_ATOMIC so the allocator never blocks waiting for memory reclaim, keeping the recalc path's latency bounded. The key insight is that the allocation must be moved out of the atomic region entirely: on PREEMPT_RT, sleeping locks – including the slab allocator's internal rt_spin_lock – may not be acquired while any raw spinlock is held. Hoisting the kzalloc() above boost_lock performs it from preemptible context, and the consume-or-free pattern keeps the locked section allocation-free.


9. Client-Side Object Creation

Wine-NSPA creates anonymous ntsync objects client-side, bypassing the wineserver entirely for the fast path. This eliminates a server round-trip for every CreateMutex, CreateSemaphore, and CreateEvent call when the object does not need a name (the common case for application-internal synchronization).

alloc_client_handle() Path

Anonymous CreateMutex:
  alloc_client_handle()          – allocate a Wine handle from client-side pool
  ioctl(NTSYNC_IOC_CREATE_MUTEX) – kernel creates ntsync object, returns fd
  cache fd in handle table       – subsequent waits resolve fd without server
Named CreateMutex:
  wineserver creates obj         – needs server for namespace management
  passes fd back to client       – via SCM_RIGHTS over Unix socket
  cache fd in handle table       – subsequent waits same fast path

The client handle pool uses a simple atomic decrement (InterlockedDecrement(&client_handle_next)) to allocate negative handle values that do not collide with server-allocated handles. Wait operations (NtWaitForSingleObject) resolve the handle to a cached fd via inproc_wait(), then call linux_wait_objs() which issues the kernel ioctl directly.

Currently, client-side creation is enabled for mutexes and semaphores. Event object creation is client-capable at the kernel level but disabled in the Wine client due to stability issues with certain applications (Ableton Live).


10. Validation

Priority Wakeup Ordering

Tested with 5 waiters at different priorities on a single ntsync mutex. All waiters woke in correct priority order in both baseline and RT modes across all test runs (v4 and v5).

Waiter Priority Expected Wake Order Actual (Baseline) Actual (RT)
Thread 1 SCHED_FIFO prio 50 1st 1st 1st
Thread 2 SCHED_FIFO prio 40 2nd 2nd 2nd
Thread 3 SCHED_FIFO prio 30 3rd 3rd 3rd
Thread 4 SCHED_FIFO prio 20 4th 4th 4th
Thread 5 SCHED_OTHER (CFS) 5th 5th 5th

PI Contention: v4 to v5 Comparison

PI contention measures how long an RT thread waits on a mutex held by a CFS thread doing a CPU-bound work loop. Lower is better – it means the PI boost is getting the holder on-core faster.

Depth v4 RT avg v5 RT avg Delta Notes
d4 387 ms 270 ms -30.2% 8/8 samples, tight range
d8 419 ms 201 ms -52.0% v4 had CFS reversal (RT worse than baseline), v5 resolved
d12 scales scales flat Chain propagation correct, no degradation

PI Chain Scaling (v5)

Transitive PI chains tested up to depth 12. RT wait time does not increase with chain depth beyond the tail holder’s work time (~235ms for a 100M-iteration CPU loop).

Metric d4 d8 d12
RT PI avg wait 270 ms 201 ms ~235 ms
All sub-tests PASS 8/8 8/8 8/8
Priority wakeup correct Yes Yes Yes
Chain propagation correct Yes Yes Yes

Rapid Kernel Mutex Throughput

Metric v4 RT v5 RT Delta
Throughput 232K ops/s 259K ops/s +11.6%
RT max_wait 54 us 47 us -13.0%
Counter correctness 400K/400K 400K/400K correct

Philosophers (Integration Test)

The philosophers test exercises contention between RT and CFS threads on ntsync mutexes with random hold patterns.

Metric v4 RT v5 RT Delta
RT max_wait 1620 us 865 us -46.6%

Mixed WFMO

WaitForMultipleObjects (both wait-any and wait-all modes) tested with mixed object types (mutex + semaphore + event). All sub-tests PASS in both baseline and RT modes across d4/d8/d12.


File References

File Purpose
drivers/misc/ntsync.c (kernel) NTSync driver with all 5 NSPA patches
include/uapi/linux/ntsync.h (kernel) UAPI header: ioctls, ntsync_wait_args, NTSYNC_INDEX_URING_READY
dlls/ntdll/unix/sync.c (Wine) Client-side ntsync integration: linux_wait_objs(), alloc_client_handle(), uring_fd retry loop
dlls/ntdll/unix/io_uring.c (Wine) io_uring ring management, eventfd registration
programs/nspa_rt_test/main.c (Wine) RT test suite: PI contention, priority wakeup, chain scaling