Linux-NSPA 6.19.11-rt1-1 (PREEMPT_RT), CONFIG_NTSYNC=m | 2026-04-15 Author: Jordan Johnston
NTSync is a Linux kernel driver (drivers/misc/ntsync.c, /dev/ntsync) that implements Windows NT synchronization primitives – mutexes, semaphores, and events – directly in the kernel. Upstream Wine uses it to replace the wineserver-mediated sync path for these objects, eliminating cross-process round-trips for wait/wake operations.
For Wine-NSPA, upstream ntsync is necessary but insufficient. The upstream driver uses FIFO waiter queues and has no priority inheritance, which means:
- spinlock_t becomes a sleeping lock on PREEMPT_RT, changing the driver's timing assumptions.
- The upstream mutex does not carry priority inheritance.

Wine-NSPA applies five kernel patches to ntsync that make it RT-safe, priority-correct, and PI-aware. Together with the userspace CS-PI path (FUTEX_LOCK_PI on CRITICAL_SECTION), these patches close all priority inversion vectors in Wine's synchronization stack.
The ntsync driver exposes a character device (/dev/ntsync) opened once per Wine process. Object creation returns file descriptors; wait/wake operations use ioctls on the device fd.
| Type | Kernel struct field | Signaled when | Wake behavior |
|---|---|---|---|
| Mutex | `obj->u.mutex` | `count == 0` (unowned) or owner re-acquires | Ownership transfers to highest-priority waiter |
| Semaphore | `obj->u.sem` | `count > 0` | Decrement count, wake highest-priority waiter |
| Event | `obj->u.event` | `signaled == true` | Manual: wake all; auto: wake one + reset |
Wait path (NTSYNC_IOC_WAIT_ANY / NTSYNC_IOC_WAIT_ALL):

1. Userspace calls ioctl(device_fd, NTSYNC_IOC_WAIT_ANY, &args) with an array of object fds.
2. The kernel allocates an ntsync_q with one ntsync_q_entry per object.
3. Each entry is inserted into the object's any_waiters list via ntsync_insert_waiter() (priority-ordered). For wait-all: entries go into all_waiters under wait_all_lock.
4. If an object is already signaled, try_wake_any() / try_wake_all() fires synchronously.
5. Otherwise, ntsync_schedule() puts the thread to sleep via schedule_hrtimeout_range_clock().

Wake path (object state change):

1. An unlock ioctl (e.g. NTSYNC_IOC_UNLOCK_MUTEX) changes object state.
2. try_wake_any_mutex() walks the any_waiters list (highest-priority first due to sorted insertion).
3. The first satisfiable waiter wins: obj->u.mutex.owner = q->owner, then wake_up_process(q->task).
4. ntsync_pi_drop() removes the old owner's boost; ntsync_pi_set_owner() recalculates for the new owner.
- raw_spinlock_t obj->lock -- per-object, protects object state + waiter lists
- rt_mutex dev->wait_all_lock -- device-wide, serializes wait-all operations
- raw_spinlock_t dev->boost_lock -- device-wide, protects boosted_owners list
The obj_lock() fast path acquires only obj->lock. When obj->dev_locked is set (another thread is doing a wait-all), obj_lock() falls back to acquiring wait_all_lock first. This avoids ABBA deadlocks between per-object and device-wide locks.
Problem: Upstream ntsync uses spinlock_t for obj->lock and mutex for wait_all_lock. On PREEMPT_RT kernels, spinlock_t becomes a sleeping rt_mutex and mutex becomes a PI-less sleeping lock. This changes the driver’s timing characteristics – code paths that assume short non-preemptible critical sections (object state updates, waiter list manipulation) can now be preempted mid-update.
Fix: Convert to explicit RT-aware types:
| Field | Upstream | NSPA |
|---|---|---|
| `obj->lock` | `spinlock_t` | `raw_spinlock_t` |
| `dev->wait_all_lock` | `struct mutex` | `struct rt_mutex` |
```c
/* Per-object: true spin semantics, even on PREEMPT_RT */
raw_spin_lock_init(&obj->lock);
raw_spin_lock(&obj->lock);
raw_spin_unlock(&obj->lock);

/* Device-wide: PI-aware sleeping lock for wait-all serialization */
rt_mutex_init(&dev->wait_all_lock);
rt_mutex_lock(&dev->wait_all_lock);
rt_mutex_unlock(&dev->wait_all_lock);
```
raw_spinlock_t is appropriate here because the critical sections are short (tens of instructions) and only manipulate in-memory state. The rt_mutex for wait_all_lock provides priority inheritance for the device-wide lock – a thread holding wait_all_lock is boosted if a higher-priority thread blocks on it.
Problem: Upstream ntsync uses list_add_tail() to append waiters to the queue. All wakeups are FIFO – the thread that called WaitForSingleObject first is woken first, regardless of scheduling priority. This violates Windows NT semantics, where the highest-priority waiting thread is always woken first.
Fix: Replace list_add_tail() with ntsync_insert_waiter(), which performs a sorted insertion based on task->prio (kernel internal priority, lower value = higher scheduling priority). Same-priority waiters maintain FIFO order within their priority level.
```c
static void ntsync_insert_waiter(struct ntsync_q_entry *new_entry,
				 struct list_head *head)
{
	struct ntsync_q_entry *entry;

	/* Walk until the first waiter with a numerically higher prio
	 * (= lower scheduling priority), then insert before it. */
	list_for_each_entry(entry, head, node) {
		if (new_entry->q->task->prio < entry->q->task->prio) {
			list_add_tail(&new_entry->node, &entry->node);
			return;
		}
	}
	/* Lowest priority so far: append at the tail (FIFO within level). */
	list_add_tail(&new_entry->node, head);
}
```
The function walks the existing waiter list until it finds an entry with a numerically higher (lower scheduling priority) task->prio, then inserts before that entry. If no such entry exists (the new waiter has the lowest priority), it appends to the tail.
Both any_waiters (wait-any) and all_waiters (wait-all) queues use sorted insertion. The try_wake_any_mutex() function naturally wakes the highest-priority waiter by taking the first satisfiable entry from the sorted list.
Problem: When a SCHED_FIFO thread (e.g. RT priority 15) waits on a mutex held by a SCHED_OTHER thread (kernel prio 120), the holder is time-sliced by CFS against all other normal-priority threads. The RT waiter's bounded-latency guarantee is violated – it can be delayed indefinitely while the holder fails to run.
Fix (v2): ntsync_pi_recalc() boosts the mutex holder to the scheduling priority of the highest-priority waiter via sched_setattr_nocheck(). A per-device tracking structure (ntsync_pi_owner) saves the holder’s original scheduling attributes and counts how many mutex objects are contributing boosts to that task.
Per-task tracking, not per-object: When a task holds mutexes M1 and M2, both with RT waiters, the task’s ntsync_pi_owner has boost_count=2. The original scheduling attributes are saved once (on the first boost) and restored only when boost_count reaches zero. Between removal of the first and last boost, the task is conservatively over-boosted – it runs at too-high priority, never too-low.
Compare against orig_normal_prio, not normal_prio: After sched_setattr_nocheck() boosts a task, normal_prio changes to reflect the boosted priority. Comparing waiters against the post-boost normal_prio would cause thrashing: the comparison would conclude no boost is needed (waiter prio is now equal to the boosted normal_prio), unboost, then re-boost on the next recalc. Using orig_normal_prio from the tracking struct is stable.
Lazy owner_task resolution: When Wine creates a mutex via the wineserver, current is the wineserver thread, not the Win32 owning thread. owner_task starts NULL and is resolved lazily on the first unlock ioctl (where current is the actual Win32 thread).
| Bug | v1 Behavior | v2 Fix |
|---|---|---|
| Multi-object PI corruption | Single global orig_attr overwritten when 2nd mutex boosted | Per-task ntsync_pi_owner with boost_count |
| Zero PI for WaitAll | all_waiters list not scanned | ntsync_pi_recalc() scans both any_waiters and all_waiters |
| Stale normal_prio comparison | owner->normal_prio changes after boost, causing thrash | Compare against po->orig_normal_prio (saved pre-boost value) |
```c
/* Condensed excerpt: dev, new_po, po, was_boosted, needs_boost and
 * boost_attr come from the surrounding (elided) function context. */
static void ntsync_pi_recalc(struct ntsync_obj *obj)
{
	struct task_struct *owner = obj->u.mutex.owner_task;
	struct ntsync_q_entry *entry;
	int highest_prio = MAX_PRIO;
	int base_prio;

	/* Scan BOTH wait lists for the highest-priority waiter */
	list_for_each_entry(entry, &obj->any_waiters, node) {
		if (entry->q->task->prio < highest_prio)
			highest_prio = entry->q->task->prio;
	}
	list_for_each_entry(entry, &obj->all_waiters, node) {
		if (entry->q->task->prio < highest_prio)
			highest_prio = entry->q->task->prio;
	}

	raw_spin_lock(&dev->boost_lock);
	po = find_pi_owner(dev, owner);
	/* Compare against ORIGINAL priority, not current (may be boosted) */
	base_prio = po ? po->orig_normal_prio : owner->normal_prio;
	if (highest_prio < base_prio && !was_boosted) {
		/* First boost for this object -- find or create tracking */
		if (!po) {
			po = new_po;	/* pre-allocated (Patch 5) */
			po->orig_attr = capture_sched(owner);
			po->orig_normal_prio = owner->normal_prio;
			list_add(&po->node, &dev->boosted_owners);
			new_po = NULL;
		}
		po->boost_count++;
		obj->u.mutex.pi_boosted = true;
	}
	if (needs_boost && highest_prio < owner->prio)
		sched_setattr_nocheck(owner, &boost_attr);
	raw_spin_unlock(&dev->boost_lock);

	kfree(new_po);	/* free if not consumed */
}
```
Problem: When a thread is blocked inside NTSYNC_IOC_WAIT_ANY (or WAIT_ALL) and an io_uring CQE arrives (e.g. socket data ready), there is no mechanism to wake the thread. The CQE sits in the completion ring until the ntsync wait times out or another ntsync object is signaled. For RT audio, this adds an entire timeout period of unnecessary latency to socket I/O completions.
Fix: Repurpose the reserved pad field in ntsync_wait_args as uring_fd. When userspace passes a valid eventfd (registered with IORING_REGISTER_EVENTFD on the thread’s io_uring ring), the ntsync wait ioctl monitors it alongside the ntsync objects using the standard poll_initwait/vfs_poll kernel pattern.
1. Userspace (linux_wait_objs()): sets args.pad = uring_fd (the io_uring eventfd).
2. Kernel (ntsync_wait_any): calls fget(args.uring_fd) to get the eventfd's struct file.
3. Kernel (ntsync_schedule): calls poll_initwait(&pwq) + vfs_poll(uring_file, &pwq.pt) to register on the eventfd's wait queue.
4. On each wakeup, the kernel re-checks with vfs_poll(uring_file, NULL). If EPOLLIN is set (CQE available), the ioctl sets args.index = NTSYNC_INDEX_URING_READY (0xFFFFFFFE) and returns.
5. Userspace drains the ring via process_completions(), and re-enters the ntsync wait.
```c
/* ntsync_schedule() -- modified to accept uring_file */
static int ntsync_schedule(struct ntsync_q *q, const struct ntsync_wait_args *args,
			   struct file *uring_file)
{
	struct poll_wqueues uring_pwq;
	int ret;

	if (uring_file) {
		poll_initwait(&uring_pwq);
		vfs_poll(uring_file, &uring_pwq.pt);	/* register on eventfd wqh */
	}

	do {
		/* ... signal/timeout checks ... */
		if (uring_file) {
			__poll_t mask = vfs_poll(uring_file, NULL);
			if (mask & EPOLLIN) {
				int signaled = -1;
				atomic_try_cmpxchg(&q->signaled, &signaled,
						   NTSYNC_INDEX_URING_READY);
				ret = 0;
				break;
			}
		}
		ret = schedule_hrtimeout_range_clock(...);
	} while (ret < 0);

	if (uring_file)
		poll_freewait(&uring_pwq);
	return ret;
}
```
```c
/* dlls/ntdll/unix/sync.c */
#define NTSYNC_INDEX_URING_READY 0xFFFFFFFEu

/* In linux_wait_objs(): */
args.pad = uring_fd > 0 ? uring_fd : 0;

/* After ioctl returns: */
if (args.index == NTSYNC_INDEX_URING_READY)
{
    uint64_t val;
    read(uring_fd, &val, sizeof(val));   /* consume eventfd counter */
    return STATUS_URING_COMPLETION;      /* caller drains CQEs, retries */
}
```
The pad field reuse is ABI-compatible: upstream requires pad == 0 and returns EINVAL otherwise. NSPA relaxes this check (rejecting only unknown flag bits, args.flags & ~NTSYNC_WAIT_REALTIME) and interprets a nonzero pad as uring_fd.
Problem: Patch 3’s ntsync_pi_recalc() originally called kzalloc(sizeof(*po), GFP_ATOMIC) while holding dev->boost_lock (a raw_spinlock). On PREEMPT_RT kernels, the slab allocator’s internal rt_spin_lock can sleep, which triggers a __schedule_bug (BUG: scheduling while atomic) when called under a raw spinlock.
GFP_ATOMIC prevents the allocator from sleeping on memory pressure – it does not prevent the PREEMPT_RT slab lock from sleeping. On non-RT kernels this works fine because slab locks are plain spinlocks (disable preemption). On PREEMPT_RT, they are sleeping locks.
Fix: Pre-allocate the ntsync_pi_owner struct before acquiring the raw spinlock. Pass the pre-allocated pointer into the locked section; consume it only if a new owner entry is actually needed. Free the unconsumed allocation after releasing the lock.
```c
/* Condensed excerpt: dev, po, owner, needs_boost and was_boosted come
 * from the surrounding (elided) function context. */
static void ntsync_pi_recalc(struct ntsync_obj *obj)
{
	struct ntsync_pi_owner *new_po = NULL;

	/* Pre-allocate BEFORE taking the raw spinlock.
	 * On PREEMPT_RT, kzalloc can sleep (slab rt_spin_lock). */
	new_po = kzalloc(sizeof(*new_po), GFP_ATOMIC);

	/* ... scan waiter lists ... */

	raw_spin_lock(&dev->boost_lock);
	if (needs_boost && !was_boosted) {
		if (!po) {
			if (unlikely(!new_po))
				goto out;	/* allocation failed: skip boost */
			/* initialize and consume new_po */
			new_po->task = owner;
			new_po->orig_attr = ...;
			list_add(&new_po->node, &dev->boosted_owners);
			po = new_po;
			new_po = NULL;	/* consumed -- don't free below */
		}
		po->boost_count++;
	}
out:
	raw_spin_unlock(&dev->boost_lock);
	kfree(new_po);	/* free if pre-allocated but not consumed (NULL is safe) */
}
```
The allocation still uses GFP_ATOMIC because ntsync_pi_recalc() is called from ioctl paths that must not sleep on memory pressure. The key insight is that moving the allocation before boost_lock is sufficient: the PREEMPT_RT rule is that no sleeping lock may be acquired while a raw spinlock is held, and the slab allocator's internal rt_spin_lock is a sleeping lock on RT. Performing the kzalloc before boost_lock is taken keeps the slab lock out of the raw-spinlock critical section.
Wine-NSPA creates anonymous ntsync objects client-side, bypassing the wineserver entirely for the fast path. This eliminates a server round-trip for every CreateMutex, CreateSemaphore, and CreateEvent call when the object does not need a name (the common case for application-internal synchronization).
Anonymous CreateMutex:

1. alloc_client_handle() – allocate a Wine handle from the client-side pool
2. ioctl(NTSYNC_IOC_CREATE_MUTEX) – kernel creates the ntsync object, returns fd
3. cache fd in handle table – subsequent waits resolve the fd without the server

Named CreateMutex:

1. wineserver creates the object – the server is needed for namespace management
2. passes the fd back to the client – via SCM_RIGHTS over a Unix socket
3. cache fd in handle table – subsequent waits use the same fast path
The client handle pool uses a simple atomic decrement (InterlockedDecrement(&client_handle_next)) to allocate negative handle values that do not collide with server-allocated handles. Wait operations (NtWaitForSingleObject) resolve the handle to a cached fd via inproc_wait(), then call linux_wait_objs() which issues the kernel ioctl directly.
Currently, client-side creation is enabled for mutexes and semaphores. Event object creation is client-capable at the kernel level but disabled in the Wine client due to stability issues with certain applications (Ableton Live).
Tested with 5 waiters at different priorities on a single ntsync mutex. All waiters woke in correct priority order in both baseline and RT modes across all test runs (v4 and v5).
| Waiter | Priority | Expected Wake Order | Actual (Baseline) | Actual (RT) |
|---|---|---|---|---|
| Thread 1 | SCHED_FIFO prio 50 | 1st | 1st | 1st |
| Thread 2 | SCHED_FIFO prio 40 | 2nd | 2nd | 2nd |
| Thread 3 | SCHED_FIFO prio 30 | 3rd | 3rd | 3rd |
| Thread 4 | SCHED_FIFO prio 20 | 4th | 4th | 4th |
| Thread 5 | SCHED_OTHER (CFS) | 5th | 5th | 5th |
PI contention measures how long an RT thread waits on a mutex held by a CFS thread doing a CPU-bound work loop. Lower is better – it means the PI boost is getting the holder on-core faster.
| Depth | v4 RT avg | v5 RT avg | Delta | Notes |
|---|---|---|---|---|
| d4 | 387 ms | 270 ms | -30.2% | 8/8 samples, tight range |
| d8 | 419 ms | 201 ms | -52.0% | v4 had CFS reversal (RT worse than baseline), v5 resolved |
| d12 | scales | scales | flat | Chain propagation correct, no degradation |
Transitive PI chains tested up to depth 12. RT wait time does not increase with chain depth beyond the tail holder’s work time (~235ms for a 100M-iteration CPU loop).
| Metric | d4 | d8 | d12 |
|---|---|---|---|
| RT PI avg wait | 270 ms | 201 ms | ~235 ms |
| All sub-tests PASS | 8/8 | 8/8 | 8/8 |
| Priority wakeup correct | Yes | Yes | Yes |
| Chain propagation correct | Yes | Yes | Yes |
| Metric | v4 RT | v5 RT | Delta |
|---|---|---|---|
| Throughput | 232K ops/s | 259K ops/s | +11.6% |
| RT max_wait | 54 us | 47 us | -13.0% |
| Counter correctness | 400K/400K | 400K/400K | correct |
The philosophers test exercises contention between RT and CFS threads on ntsync mutexes with random hold patterns.
| Metric | v4 RT | v5 RT | Delta |
|---|---|---|---|
| RT max_wait | 1620 us | 865 us | -46.6% |
WaitForMultipleObjects (both wait-any and wait-all modes) tested with mixed object types (mutex + semaphore + event). All sub-tests PASS in both baseline and RT modes across d4/d8/d12.
| File | Purpose |
|---|---|
| drivers/misc/ntsync.c (kernel) | NTSync driver with all 5 NSPA patches |
| include/uapi/linux/ntsync.h (kernel) | UAPI header: ioctls, ntsync_wait_args, NTSYNC_INDEX_URING_READY |
| dlls/ntdll/unix/sync.c (Wine) | Client-side ntsync integration: linux_wait_objs(), alloc_client_handle(), uring_fd retry loop |
| dlls/ntdll/unix/io_uring.c (Wine) | io_uring ring management, eventfd registration |
| programs/nspa_rt_test/main.c (Wine) | RT test suite: PI contention, priority wakeup, chain scaling |