Wine-NSPA – NTSync PI Kernel

This page documents the Wine-NSPA ntsync kernel overlay that backs PI waits, gamma channels, and aggregate-wait. The companion Wine-side half lives on NTSync Userspace Sync.

Table of Contents

  1. Overview
  2. Object Types
  3. Priority-inheritance baseline
  4. Channel object
  5. Thread-token pass-through
  6. RT alloc-hoist
  7. Exclusive receive wakeup fix
  8. Deferred event boost
  9. Channel-entry lifetime fix
  10. Aggregate-wait and burst drain
  11. Receive snapshot fix
  12. Dedicated slab caches
  13. Lockless SEND_PI target scan
  14. Wait-queue dedicated cache
  15. Validation
  16. Audit notes
  17. References

1. Overview

NTSync is a Linux kernel driver (drivers/misc/ntsync.c, /dev/ntsync) that implements Windows NT synchronization primitives – mutexes, semaphores, and events – directly in the kernel. Upstream Wine uses it to replace the wineserver-mediated sync path for these objects, eliminating cross-process round-trips for wait/wake operations.

For Wine-NSPA, upstream ntsync is necessary but insufficient. The upstream driver uses FIFO waiter queues, has no priority inheritance, and uses spinlock_t for the per-object lock – which becomes a sleeping rt_mutex on PREEMPT_RT. None of those characteristics is acceptable for an RT audio workload where the audio callback must wait deterministically on Wine’s primitives without inheriting unbounded inversion latency.

Wine-NSPA carries a kernel overlay that extends upstream ntsync.c in three broad layers: a priority-inheritance baseline (raw locks, priority-ordered waiter queues, mutex owner boosting), a channel-based request/reply transport with its extensions (thread-token pass-through, aggregate-wait, burst drain), and a series of RT-safety and hardening fixes (alloc-hoist, UAF closures, dedicated slab caches).

The current overlay on kernel 6.19.11-rt1-1-nspa includes the dedicated wait-queue cache plus SLAB_NO_MERGE across all four ntsync caches (see Section 14). The feature-by-feature detail below keeps the patch numbers for traceability, but the public reading order is by capability rather than by patch label.

This doc is the design and implementation reference for that kernel half: what each carried feature adds, what bug it closes, how it preserves NT semantics, and how it interacts with obj_lock and PREEMPT_RT.

NSPA overlay relationship

Wine-NSPA does not fork ntsync. The patches are diffs against upstream drivers/misc/ntsync.c and apply cleanly in series 1003 -> 1004 -> 1005 -> 1006 -> 1007 -> 1008 -> 1009 -> 1010 -> 1011 -> 1012 -> 1013 -> 1014 -> 1015. They live in wine-rt-claude/ntsync-patches/ as standalone unified diffs. The kernel build (linux-nspa) applies the stack at PKGBUILD time; the resulting .ko ships as part of the kernel package.

The patch numbering (1003- through 1015-) is local to NSPA. It bears no relationship to upstream NTSync revisions or any LKML series.

Feature map at a glance

Patch Feature Purpose ~LOC
1003 PI primitives raw_spinlock obj_lock, priority-ordered waiter queues, mutex owner PI boost, per-task tracking ~600
1004 Channel object New NTSYNC_TYPE_CHANNEL with CREATE, SEND_PI, RECV, REPLY ioctls ~530
1005 Thread-token Per-channel (tid -> token) registry + RECV2 ioctl, eliminates dispatcher userspace lookup ~340
1006 RT alloc-hoist Hoists 6 sites of kmalloc/kfree out of raw_spinlock_t (RT-illegal); pi_work pool ~750
1007 Channel exclusive recv wake_up_all priority-inversion fix: 3-LOC wait_event_interruptible_exclusive swap ~3
1008 EVENT_SET_PI deferred boost Closes fast-path race where consumer takes obj_lock first, sees signaled, returns unboosted ~80
1009 channel_entry refcount UAF KASAN-caught REPLY-vs-SEND_PI cleanup race; refcount_t on ntsync_channel_entry ~15
1010 Aggregate-wait NTSYNC_IOC_AGGREGATE_WAIT: heterogeneous object+fd wait, channel notify-only support ~400
1011 Channel TRY_RECV2 NTSYNC_IOC_CHANNEL_TRY_RECV2: non-blocking RECV2 for post-dispatch burst drain ~30
1012 Channel recv field-snapshot UAF fix Snapshot popped-entry fields under obj_lock before unlock, closes RECV/RECV2 vs sender-cleanup slab UAF ~15
1013 Dedicated kmem_caches ntsync_event_pi / ntsync_channel_entry / ntsync_pi_owner -> own kmem_caches with SLAB_HWCACHE_ALIGN ~120
1014 SEND_PI lockless target scan list_empty_careful fast-path skips wq->lock round-trip on empty waiter queues ~10
1014a kmem_cache_free NULL guard Site-2089 pending_pi.new_ep free is NULL-guarded; closes cache_from_obj deref under SLAB_FREELIST_HARDENED ~3
1015 Wait-queue dedicated cache struct ntsync_q -> own kmem_cache (≤16 entries + kmalloc fallback); SLAB_NO_MERGE retro-correction across all 4 ntsync caches ~120

Patches 1003-1006, 1010, 1011, 1013, and 1015 are feature/infrastructure work; 1007-1009, 1012, 1014, and 1014a are minimal surgical fixes for specific KASAN- or trace-confirmed bugs (1014 is also a measurable hot-path reduction in the audio SEND_PI path). The distinction matters: Section 16 discusses why.

[Figure: NTSync in Wine-NSPA – object families, PI paths, and patch layering. Wine callers (Win32 waits via WaitForSingleObject / WaitForMultipleObjects; the gamma dispatcher's AGG_WAIT -> RECV2 / TRY_RECV2 / REPLY loop) enter /dev/ntsync (drivers/misc/ntsync.c) through a shared ioctl entry. Per-object raw_spinlock obj->lock, the device rt_mutex wait_all_lock, and boosted-owner tracking via sched_setattr_nocheck sit on the PREEMPT_RT substrate (raw_spinlock_t stays raw; rt_mutex gives PI on wait_all_lock). The patch stack layers the 1003 PI baseline (mutex/semaphore/event), the 1004-1011 channel features (transport, RECV2 + thread-token, exclusive recv, TRY_RECV2 burst drain), the 1007-1009 surgical fixes found on KASAN/debug kernels, and the 1012-1015 carries: snapshot, slab caches, lockless scan, wait-q cache.]

2. Object Types

Wine-NSPA’s ntsync exposes four object types via /dev/ntsync (one character device opened once per Wine process; object creation returns FDs).

Type Win32 primitive Created via Wait via Wake / signal via
Mutex CreateMutex, WaitForSingleObject NTSYNC_IOC_CREATE_MUTEX NTSYNC_IOC_WAIT_ANY / WAIT_ALL NTSYNC_IOC_MUTEX_UNLOCK
Semaphore CreateSemaphore, ReleaseSemaphore NTSYNC_IOC_CREATE_SEM NTSYNC_IOC_WAIT_ANY / WAIT_ALL NTSYNC_IOC_SEM_RELEASE
Event CreateEvent, SetEvent, ResetEvent NTSYNC_IOC_CREATE_EVENT NTSYNC_IOC_WAIT_ANY / WAIT_ALL NTSYNC_IOC_EVENT_SET / _RESET / _PULSE / _SET_PI
Channel (no Win32 equivalent – NSPA-private IPC) NTSYNC_IOC_CREATE_CHANNEL NTSYNC_IOC_CHANNEL_RECV / _RECV2 / _TRY_RECV2 NTSYNC_IOC_CHANNEL_SEND_PI / _REPLY

Mutex / semaphore / event are upstream concepts; their semantics map 1:1 to Win32. The mutex tracks an owner TID for WAIT_ABANDONED semantics and abandoned-recovery; the semaphore is a counted resource pool; the event has both manual-reset and auto-reset variants plus the NSPA-private EVENT_SET_PI for cross-thread priority intent.

The channel is wholly NSPA-private. It does not map to any Win32 primitive. It is a transport for Wine-NSPA’s wineserver request-reply fast path – a kernel-mediated alternative to the legacy futex+manual-sched_setscheduler shm IPC. Channels do not participate in generic WAIT_ANY / WAIT_ALL; they are accessed through their own ioctls, and patch 1010 adds a separate aggregate-wait registration path that can observe channel readiness without consuming the entry. On 1011 kernels the current consumer shape is aggregate-wait, then CHANNEL_RECV2, then TRY_RECV2 until the ready queue is empty.

is_signaled by type

The driver’s central is_signaled() predicate (called from try_wake_any / try_wake_all) returns differently per type:

Type Signaled when
Mutex count == 0 (unowned) or owner matches current TID
Semaphore count > 0
Event signaled == true
Channel always false (channels never wake WAIT_ANY/ALL)

The channel case in is_signaled() is a deliberate hard-false: any caller that arrives via WAIT_ANY/ALL with a channel FD is misusing the API and the wait will time out. That remains true after 1010. The aggregate-wait path is different: it registers the channel as a notify-only source and returns “channel fired” to userspace, after which userspace follows with CHANNEL_RECV2 to consume the actual entry.
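In sketch form (a condensed view of the dispatch; the mutex arm follows the table above, and exact field names may differ from the production source):

static bool is_signaled(struct ntsync_obj *obj, __u32 owner)
{
    lockdep_assert_held(&obj->lock);

    switch (obj->type) {
    case NTSYNC_TYPE_SEM:
        return obj->u.sem.count > 0;
    case NTSYNC_TYPE_MUTEX:
        /* unowned, or a recursive acquire by the current owner */
        return obj->u.mutex.count == 0 || obj->u.mutex.owner == owner;
    case NTSYNC_TYPE_EVENT:
        return obj->u.event.signaled;
    case NTSYNC_TYPE_CHANNEL:
        return false;   /* deliberate: WAIT_ANY/WAIT_ALL never fires */
    }
    return false;
}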


3. Priority-inheritance baseline

The 1003 patch (originally three logical patches 1001/1002/1003, collapsed in this section for clarity) established the RT baseline that all subsequent patches build on.

Locking hierarchy

The driver has three locks. NSPA classifies them explicitly for PREEMPT_RT:

raw_spinlock_t obj->lock          per-object, protects state + waiter lists
rt_mutex       dev->wait_all_lock device-wide, serializes wait-all setup
raw_spinlock_t dev->boost_lock    device-wide, protects boosted_owners list

raw_spinlock_t keeps true spin semantics on PREEMPT_RT (does not become an rt_mutex). obj->lock is held only across short pointer-only state updates: rb-tree manipulation, list manipulation, signaled-flag flip, owner-TID write. dev->boost_lock is held only across boosted_owners list updates plus a single sched_setattr_nocheck() call. Both critical sections are short, bounded, and never sleep – the PREEMPT_RT contract.

dev->wait_all_lock is rt_mutex, not raw_spinlock_t, because wait-all setup is long: it walks all named objects to be waited on, may copy_from_user the FD array, and may need to take per-object locks. A raw spinlock is the wrong primitive for that. The rt_mutex carries PI – a high-priority thread blocked on wait_all_lock boosts whoever holds it.

The obj_lock() fast path acquires only obj->lock. When obj->dev_locked is set (another thread is doing a wait-all on this object), obj_lock() falls back to acquiring wait_all_lock first. This avoids ABBA deadlocks between per-object and device-wide locks.
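A minimal sketch of that fallback shape (the retry loop is condensed; the production helper also handles the wait-all unlock ordering):

static void obj_lock(struct ntsync_obj *obj)
{
    struct ntsync_device *dev = obj->dev;

    for (;;) {
        raw_spin_lock(&obj->lock);
        if (likely(!obj->dev_locked))
            return;                  /* fast path: obj->lock only */

        /*
         * A wait-all owns the device.  Drop obj->lock and block on
         * the PI-carrying rt_mutex instead; dev_locked is only ever
         * set with wait_all_lock held, so once we acquire and release
         * it the flag is clear and the retry succeeds.
         */
        raw_spin_unlock(&obj->lock);
        rt_mutex_lock(&dev->wait_all_lock);
        rt_mutex_unlock(&dev->wait_all_lock);
    }
}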

Priority-ordered waiter queues

Upstream ntsync uses list_add_tail() to append waiters: FIFO order. NSPA replaces this with ntsync_insert_waiter(), which performs a sorted insertion based on the kernel-internal task->prio (lower numeric value = higher scheduling priority).

static void ntsync_insert_waiter(struct ntsync_q_entry *new_entry,
                                 struct list_head *head)
{
    struct ntsync_q_entry *entry;

    /* Sorted insert: walk until the first lower-priority waiter
     * (higher numeric task->prio) and insert in front of it. */
    list_for_each_entry(entry, head, node) {
        if (new_entry->q->task->prio < entry->q->task->prio) {
            list_add_tail(&new_entry->node, &entry->node);
            return;
        }
    }
    /* No lower-priority waiter found: append at the tail, which
     * preserves FIFO order among equal-priority waiters. */
    list_add_tail(&new_entry->node, head);
}

Same-priority waiters maintain FIFO order within their priority level. try_wake_any_*() walks from the head, so the highest-priority satisfiable waiter wakes first. This restores NT semantics (highest-priority waiter wins) and is strictly stronger than upstream’s FIFO.

Mutex owner PI boost

When an RT thread (e.g. SCHED_FIFO prio 80) waits on a mutex held by a SCHED_OTHER thread (prio 120 in kernel terms), the holder is preempted by every running RT thread and time-sliced by CFS against every other normal thread. The RT waiter’s bounded-latency guarantee is violated.

ntsync_pi_recalc(obj, pi_work) (line 424 of the production source) handles this. Whenever a mutex’s wait list changes (insert, wake, unlock) it scans both any_waiters and all_waiters for the highest-priority waiter, then boosts the owner’s scheduling attributes via sched_setattr_nocheck() to match. Per-task tracking (struct ntsync_pi_owner, anchored in dev->boosted_owners) saves the original attributes once and counts how many of the task’s owned mutexes are contributing boosts. Restore happens only when the count drops to zero.
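A compressed sketch of that scan (tracker bookkeeping via the pi_work argument is elided, the prio-to-sched_attr mapping is simplified, and the list and field names are taken from the surrounding text):

static void ntsync_pi_recalc(struct ntsync_obj *mutex, struct ntsync_pi_work *w)
{
    struct task_struct *owner = mutex->u.mutex.owner_task;
    struct ntsync_q_entry *e;
    int top = MAX_PRIO;                 /* lower value = higher priority */

    /* caller holds obj->lock; scan BOTH lists (the v2 WaitAll lesson) */
    list_for_each_entry(e, &mutex->any_waiters, node)
        top = min(top, e->q->task->prio);
    list_for_each_entry(e, &mutex->all_waiters, node)
        top = min(top, e->q->task->prio);

    if (owner && top < owner->prio) {
        struct sched_attr attr = {
            .sched_policy   = SCHED_FIFO,
            .sched_priority = MAX_RT_PRIO - 1 - top,
        };
        /* record/refresh the ntsync_pi_owner tracker via w, then boost */
        sched_setattr_nocheck(owner, &attr);
    }
}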

The PI boost design has three v2 lessons baked in:

Bug v1 behaviour v2 fix
Multi-object PI corruption Single global orig_attr overwritten when 2nd mutex boosted Per-task ntsync_pi_owner with boost_count
Zero PI for WaitAll all_waiters not scanned Scan both any_waiters and all_waiters
Stale normal_prio thrash owner->normal_prio mutates after boost -> oscillation Compare against saved orig_normal_prio from tracker

The ntsync_pi_owner struct is the unit of bookkeeping. The pool/cleanup pattern that 1006 introduces (Section 6) is the unit of RT-safe allocation for that struct.

EVENT_SET_PI primitive (pre-1008 design)

EVENT_SET_PI was originally introduced in 1003 as the cross-thread priority-intent primitive: an RT thread sets an event, and along with the signal it carries a (policy, prio) boost that the kernel applies to the event’s first waiter. Wine-NSPA uses this for the audio-thread -> dispatcher SendMessage bypass: the audio callback sets a queue event with its own RT priority, and the dispatcher pthread is woken at that priority.

The original design walked event->any_waiters under obj_lock at EVENT_SET_PI time and applied the boost to the head waiter. This had a fast-path race that 1008 closes – see Section 8.

Per-task tracking, conservative over-boost

ntsync_pi_owner is allocated lazily on first boost and freed only when the last contributing object releases. Between the first removal and the last, the owner is conservatively over-boosted: it runs at too-high priority briefly, never too-low. That is the safe direction; under-boost would leak inversion. The lazy lifetime also means owner_task is resolved lazily on the first unlock (where current is the actual Win32-owning thread), since at create time current is the wineserver, not the eventual owner.
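The tracker itself, in sketch form (field names here are assumptions matching the behaviour described above):

struct ntsync_pi_owner {
    struct list_head    node;             /* on dev->boosted_owners, or a
                                           * pi_work pool/free list */
    struct task_struct *task;             /* the boosted owner */
    int                 boost_count;      /* owned objects contributing */
    struct sched_attr   orig_attr;        /* saved once on first boost */
    int                 orig_normal_prio; /* stable compare target */
};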


4. Channel object

1004-ntsync-channel.patch adds a new object type, NTSYNC_TYPE_CHANNEL. A channel is a bounded, kernel-side priority-ordered request/reply mailbox. It exists to replace Wine-NSPA’s user-space futex + manual sched_setscheduler shm-IPC fast path between client processes and the wineserver.

Why a kernel object

Wine’s wineserver protocol is fundamentally a request/reply RPC. Each client thread sends a request, blocks for the reply, and resumes. The legacy fast path used a process-shared futex on a request slot plus a sched_setscheduler call from the sending audio thread to lift the dispatcher pthread’s priority. That worked but had three problems:

  1. Priority transfer was a separate syscall. The audio thread had to know which pthread it was lifting and call sched_setscheduler on it explicitly, and the target could go stale on thread death – a race with no clean recovery.
  2. No priority queueing. When two senders raced, the futex woke one of them in roughly FIFO order; a higher-priority sender could wait behind a lower-priority one if the dispatcher was idle.
  3. No transactional priority drain. If the dispatcher returned without replying (signal, error path) the audio-thread-applied boost had no clear cleanup hook.

A kernel-mediated channel solves all three. The kernel:

  1. carries the priority transfer inside SEND_PI itself – the sender's (policy, prio) rides with the entry, and the kernel boosts the receiving dispatcher with no separate syscall;
  2. priority-orders pending requests in an rb-tree keyed (prio DESC, seq ASC), so a higher-priority sender is always popped first;
  3. drains the receiver boost transactionally – RECV applies it, and the next RECV or the REPLY releases it, so error paths have a defined cleanup hook.

The channel is purely a transport, not a protocol. The wineserver still drives the request/reply contract; the kernel multiplexes and priority-orders, and never reorders within a single sender (each sender blocks for reply, so per-thread ordering is preserved).

API

Four ioctls, all on a channel FD obtained via NTSYNC_IOC_CREATE_CHANNEL:

ioctl Caller Effect
NTSYNC_IOC_CREATE_CHANNEL wineserver Create channel with max_depth. Returns FD.
NTSYNC_IOC_CHANNEL_SEND_PI client thread Enqueue (prio, payload_off, reply_off); boost recv'er; sleep for reply.
NTSYNC_IOC_CHANNEL_RECV dispatcher pthread Pop highest-prio entry; auto-boost current to that priority.
NTSYNC_IOC_CHANNEL_REPLY dispatcher pthread Wake the sender of entry_id; drain receiver boost.

The payload_off and reply_off fields are opaque to the kernel; conventionally they are indices into a per-process shared-memory region the client and wineserver both map. The kernel transports the cookies; user space interprets them.

That is the 1004 base interface. The current production surface layers 1005’s CHANNEL_RECV2 on top for thread-token return, then 1011’s CHANNEL_TRY_RECV2 for non-blocking post-dispatch drain.

Internal state

The channel object’s per-instance state lives in obj->u.channel:

struct {
    struct rb_root  pending;     /* PENDING entries (prio DESC, seq ASC) */
    struct list_head dispatched; /* DISPATCHED entries (REPLY can find by id) */
    atomic64_t  next_id;
    atomic64_t  next_seq;
    __u32       depth;           /* current PENDING count */
    __u32       max_depth;
    wait_queue_head_t recv_wq;   /* blocked receivers */
    struct hlist_head thread_regs[64]; /* added by 1005 */
} channel;

Each entry is a struct ntsync_channel_entry:

struct ntsync_channel_entry {
    struct rb_node      rb;       /* in pending rb-tree */
    struct list_head    list;     /* in dispatched list */
    __u64               id, seq;
    __u32               prio, policy;
    __u64               payload_off, reply_off;
    __u32               sender_tid;
    enum  ntsync_channel_state state;  /* PENDING | DISPATCHED */
    bool                replied;
    wait_queue_head_t   wq;       /* sender sleeps on this */
    __u64               thread_token;  /* added by 1005 */
    refcount_t          refcnt;        /* added by 1009 */
};

The rb-tree key is (prio DESC, seq ASC): higher priority sorts first; ties break by enqueue order. channel_pending_insert() returns true iff the entry became the new tree minimum – i.e. it would be popped next. That return value drives the speculative-boost decision in SEND_PI.
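A sketch of that insert (the helper name matches the text; the rb-tree plumbing is the stock <linux/rbtree.h> idiom):

static bool channel_pending_insert(struct rb_root *root,
                                   struct ntsync_channel_entry *e)
{
    struct rb_node **p = &root->rb_node, *parent = NULL;
    bool leftmost = true;

    while (*p) {
        struct ntsync_channel_entry *cur =
            rb_entry(*p, struct ntsync_channel_entry, rb);

        parent = *p;
        if (e->prio > cur->prio ||
            (e->prio == cur->prio && e->seq < cur->seq)) {
            p = &(*p)->rb_left;      /* sorts earlier: higher prio wins */
        } else {
            p = &(*p)->rb_right;
            leftmost = false;
        }
    }
    rb_link_node(&e->rb, parent, p);
    rb_insert_color(&e->rb, root);
    return leftmost;                 /* true iff e is the next RECV pop */
}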

[Figure: channel object lifecycle – request ordering, dispatch, reply, and cleanup. SEND_PI inserts (payload_off, reply_off, prio) into the PENDING rb-tree keyed (prio DESC, seq ASC), bumps depth, and may speculatively boost and exclusively wake the head receiver (1007). RECV/RECV2 pops the tree minimum, marks it DISPATCHED, appends it to the dispatched list, decrements depth, and auto-boosts the handler to the entry's priority (1005's RECV2 adds the thread_token lookup). REPLY finds the entry by entry_id on the dispatched list, wakes the sender, and drains the handler boost; the 1009 refcount keeps the entry alive across the wake, and the entry is freed only at the final refcount drop. 1011's TRY_RECV2 lets the dispatcher drain further ready entries after each REPLY before sleeping again. Priority ordering is in-kernel; per-sender order stays serial because each SEND_PI blocks for REPLY.]

SEND_PI flow

  1. Validate (policy, prio). Pre-allocate e and new_ep (the boost tracking entry) with GFP_KERNEL outside any lock – slab on RT cannot be called under raw_spinlock_t.
  2. obj_lock(ch). Reject with -EAGAIN if depth >= max_depth. Insert into pending rb-tree; bump depth. Note whether this entry is the new minimum.
  3. obj_unlock(ch).
  4. If the new entry is the minimum and prio is set, peek the recv_wq head. Take a get_task_struct reference under wq->lock, then call apply_event_pi_boost() to boost that receiver to (policy, prio).
  5. wake_up(&ch->recv_wq) – wakes exactly the head receiver (1007 made this exclusive).
  6. Sleep on e->wq until e->replied is true or signal pending.
  7. On wake: obj_lock(ch), detach e from whichever list/tree it’s on, obj_unlock(ch). Drop refcount_dec_and_test(&e->refcnt); kfree if last ref (1009).

The cleanup path covers the case where the sender was interrupted (signal). The entry might still be PENDING (rb-tree) or DISPATCHED (list); we use e->state to dispatch correctly. depth is decremented only in the PENDING branch – DISPATCHED entries no longer count against max_depth.
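In sketch form, with the enum constant names assumed:

obj_lock(ch);
if (e->state == NTSYNC_CHANNEL_PENDING) {
    rb_erase(&e->rb, &ch->u.channel.pending);
    ch->u.channel.depth--;        /* only PENDING counts toward max_depth */
} else {                          /* DISPATCHED: on the dispatched list */
    list_del(&e->list);
}
obj_unlock(ch);
if (refcount_dec_and_test(&e->refcnt))   /* 1009: REPLY may hold a ref */
    kfree(e);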

RECV / RECV2 flow

  1. drain_event_pi_boosts(dev, current) – release any boost left over from a prior RECV cycle.
  2. Pre-allocate new_ep outside lock.
  3. obj_lock(ch). While pending is empty: obj_unlock, wait_event_interruptible_exclusive(recv_wq, !empty) (1007 made this exclusive), obj_lock again.
  4. Pop the rb-tree minimum; mark DISPATCHED; append to dispatched list; decrement depth.
  5. (1005 only, in RECV2:) e->thread_token = channel_lookup_token(ch, e->sender_tid). See Section 5.
  6. obj_unlock(ch).
  7. If e->prio, auto-boost current to (e->policy, e->prio) for the handler duration via apply_event_pi_boost(dev, current, ...). Boost releases at next RECV’s drain, or at REPLY’s drain.
  8. Copy (entry_id, payload_off, reply_off, sender_tid, prio[, thread_token]) to user space.

In the post-1011 dispatcher path, userspace follows the first successful RECV2 with TRY_RECV2 after each reply until the channel returns empty.

REPLY flow

  1. obj_lock(ch). Walk dispatched list for entry_id. If not found or already replied: -ENOENT.
  2. Set e->replied = true.
  3. refcount_inc(&e->refcnt) (1009 – keep the entry alive across wake_up_all).
  4. obj_unlock(ch).
  5. wake_up_all(&e->wq) – wakes the blocked sender. Outside obj_lock because wq’s internal lock is spinlock_t (becomes rt_mutex on PREEMPT_RT) and cannot nest under our raw_spinlock_t.
  6. drain_event_pi_boosts(dev, current) – handler is done, drop the receiver’s auto-boost.
  7. refcount_dec_and_test(&e->refcnt); kfree if last ref (1009).

Memory ordering

Kernel ioctl syscall entry/exit is a full memory barrier. So payload visibility from sender -> receiver and reply visibility from receiver -> sender is naturally serialised: the sender’s copy_from_user of the payload completed before SEND_PI returns from the syscall handler; the receiver’s copy_to_user happens-before RECV returns; the receiver’s writes to the reply region happen-before REPLY returns; the sender’s copy_from_user of the reply happens-after SEND_PI’s wake.

NT semantics

The kernel does not promise ordering across senders – it priority-orders, but a SCHED_OTHER sender behind a SCHED_FIFO sender will wait. Cross-thread ordering was never guaranteed under the prior per-thread dispatcher pthread shape, so this is strictly stronger semantically (no thread can starve while a higher-prio thread is waiting). Within a single sender, ordering is preserved: each SEND_PI blocks for reply, so back-to-back sends from the same TID are serialised.

Hot-path bound

obj_lock sections in SEND_PI / RECV / REPLY are bounded by tree height. With max_depth = 1024, that is 10 rb-tree comparisons. Zero allocation under lock. No memory copies under lock (the copy_to_user happens after obj_unlock).

Diagnostics: depth and channel emptiness

A channel can only be freed when both pending and dispatched are empty; otherwise senders or dispatchers still hold the file open via the syscall ref. ntsync_free_obj() WARN_ONs either non-empty list at free time – a useful canary if user space ever leaks a channel FD with active entries.


5. Thread-token pass-through

Once the channel was in production, a perf capture (2026-04-26) showed ~10% of dispatcher CPU sitting in a userspace get_thread_from_id() lookup inside the gamma dispatcher's hot loop. Every received request needed to map sender_tid -> struct thread * to dispatch. This patch eliminates that lookup by returning a wineserver-supplied opaque token with each entry at RECV2 time.

Mechanism

The wineserver registers (tid, token) per channel via a new ioctl. The kernel stores the mapping in a 64-bucket hash on the channel (hlist_head thread_regs[64], keyed by tid & 63, protected by the existing obj_lock). At RECV2 time the kernel looks up the token for e->sender_tid and returns it in extended args.

struct ntsync_channel_recv2_args {
    __u64 entry_id;
    __u64 payload_off;
    __u64 reply_off;
    __u32 sender_tid;
    __u32 prio;
    __u64 thread_token;  /* OUT: registered token (0 if unregistered) */
};

Two new ioctls:

ioctl Effect
NTSYNC_IOC_CHANNEL_REGISTER_THREAD Install or replace (tid, token)
NTSYNC_IOC_CHANNEL_DEREGISTER_THREAD Evict entry for tid (idempotent)

Plus NTSYNC_IOC_CHANNEL_RECV2 – same as RECV but returns an extra thread_token field. The older RECV ioctl still exists in the UAPI, but current Wine-NSPA userspace requires RECV2 and no longer ships the old fallback ladder.

v2 design: lookup at RECV2, not SEND_PI

The first version of this patch did the hash lookup in SEND_PI and stamped thread_token onto the entry there. v2 moved the lookup to RECV2. Two reasons:

  1. Audio-thread cost. The audio thread is the one paying the SEND_PI critical-section cost. Moving the lookup to RECV2 puts the cost on the dispatcher pthread instead – which is fine, the dispatcher is not deadline-bound.
  2. Stale-token correctness. A token snapshotted at SEND_PI could go stale if the sender died and the wineserver deregistered before the dispatcher RECV’d. RECV2-time lookup reflects current registration: a deregistered TID returns token = 0, and userspace falls back to get_thread_from_id (which will fail on a dead TID, and the request gets dropped by the existing logic).

The hash bucket count is fixed at 64 (no resize, no rhashtable). For a typical Wine process with dozens to a few hundred threads, that gives single-digit average chain lengths – well under the rb-tree key comparison cost in SEND_PI/RECV.
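A sketch of the lookup (the ntsync_thread_reg type is named in the free path below; its fields and the helper shape are assumptions consistent with the description above):

struct ntsync_thread_reg {
    struct hlist_node node;          /* in ch->u.channel.thread_regs[] */
    __u32 tid;
    __u64 token;
};

static __u64 channel_lookup_token(struct ntsync_obj *ch, __u32 tid)
{
    struct ntsync_thread_reg *reg;

    /* caller holds obj_lock; 64 buckets keyed by tid & 63 */
    hlist_for_each_entry(reg, &ch->u.channel.thread_regs[tid & 63], node)
        if (reg->tid == tid)
            return reg->token;
    return 0;                        /* unregistered: userspace falls back */
}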

Lifetime invariants

The wineserver enforces:

  1. Register before first send: a thread's (tid, token) pair is installed before that thread issues its first SEND_PI.
  2. Deregister only at thread death: the registration is evicted only once the thread can no longer send.

Together these ensure RECV2 always sees a non-zero token for a still-live thread. A momentarily-zero token (if registration races a fast first send) yields a userspace fallback that completes correctly – it is only a perf regression, not a correctness one.

channel_drain_thread_regs() on free

When a channel is freed, any leftover (tid, token) registrations are dropped. By construction the channel is unreachable at ntsync_free_obj() time (no senders, no dispatchers can have an FD), so no concurrent access is possible – a single pass through the buckets, kfreeing each ntsync_thread_reg.

Current runtime expectation

Old RECV entries still carry thread_token = 0 (initialized in kzalloc), so older consumers can continue using the legacy shape if they exist. Current Wine-NSPA userspace, however, assumes RECV2/TRY_RECV2 and resolves sender threads from the returned token on the normal path.


6. RT alloc-hoist

This is a safety patch, not a feature: it fixes six sites in the driver where slab kzalloc/kfree was being called under raw_spinlock_t on PREEMPT_RT – which is illegal. The bug was latent until 2026-04-26, when an Ableton workload hard-froze the host with a clean kernel oops.

The kernel oops

After installing the first thread-token ntsync.ko build, Ableton hard-froze the host 13 minutes into a session:

BUG: kernel NULL pointer dereference, address: 0x9a
RIP: ___slab_alloc+0x316  (xor (%rbx,%rdx,1),%rax  RBX=0x3a)
Call: __kmalloc_cache_noprof <- ntsync_obj_ioctl+0x427 [ntsync]
Comm: Ableton Web Con      PREEMPT_{RT,(lazy)}

Classic SLUB freelist corruption.

Root cause

obj->lock and dev->boost_lock are both raw_spinlock_t. On PREEMPT_RT, SLUB’s per-CPU fast path uses local_lock_t, which is spinlock_t – a sleeping lock under PREEMPT_RT (confirmed in include/linux/local_lock_internal.h). So kzalloc / kfree under any raw_spinlock_t is unsafe on RT, including GFP_ATOMIC (the GFP flag gates reclaim, not the local_lock).

This is a mechanically verifiable rule: CONFIG_DEBUG_ATOMIC_SLEEP will splat any sleeping function called from a non-sleepable context. The bug was not caught by that infrastructure only because the production kernel ships without it for performance reasons; the rule itself is unambiguous.

Six sites in ntsync.c violated this rule:

# Function Line Issue
1 ntsync_pi_recalc 345 kzalloc(GFP_ATOMIC) under raw
2 ntsync_pi_recalc 409 kfree under boost_lock
3 ntsync_pi_recalc 417 kfree under caller’s obj->lock
4 ntsync_pi_drop 441 kfree under boost_lock
5 ntsync_channel_register_thread 1614 kfree under obj_lock
6 ntsync_channel_deregister_thread 1639 kfree under obj_lock

Sites 1-4 had been latent since the 1003 PI patch landed; 5-6 were new in 1005 (thread-token registration). The Ableton lockup was almost certainly triggered by 5 or 6: T2 thread-token registration is always-on when channel + kernel support are present, and Ableton boot creates dozens of threads -> dozens of register/deregister calls -> poisoned freelist 13 minutes in. Sites 1-4 had likely also caused several previous unexplained host lockups in the earlier msg-ring, paint-cache, and instrumentation-related lockup series.

The pi_work pool/cleanup pattern

The fix introduces a stack-resident struct ntsync_pi_work that the caller pre-allocates and finishes outside any raw lock:

struct ntsync_pi_work {
    struct list_head new_po_pool;     /* pre-allocated; consumed on demand */
    struct list_head to_free_list;    /* removed entries to free post-unlock */
};

Four helpers:

void ntsync_pi_work_init(w);                  /* INIT_LIST_HEAD x2 */
void ntsync_pi_work_prealloc(w);              /* kzalloc + list_add to pool, OUTSIDE locks */
struct ntsync_pi_owner *ntsync_pi_work_take_new(w); /* pointer-only list_del under raw */
void ntsync_pi_work_finish(w);                /* kfree pool leftovers + to_free_list */

Lifecycle of a pi_owner via this struct:

kzalloc -> list_add to new_po_pool                  (caller, no lock)
consumed: list_del from pool, list_add to dev list  (pi_recalc, raw)
removed: list_move from dev list to to_free_list    (pi_recalc/_drop, raw)
kfree from new_po_pool + to_free_list               (caller, no lock)

Empty pool is a non-fatal fallback: pi_recalc skips the boost (transient priority inversion until next op), matching the prior GFP_ATOMIC behaviour. The hot path stays one slab op per ioctl – just hoisted past the lock, so no extra latency.
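A sketch of the helper bodies (the list-link field name is an assumption; 1013 later converts these kzalloc/kfree pairs to the dedicated cache):

static void ntsync_pi_work_prealloc(struct ntsync_pi_work *w)
{
    struct ntsync_pi_owner *po = kzalloc(sizeof(*po), GFP_KERNEL);

    if (po)                          /* empty pool is a tolerated fallback */
        list_add(&po->node, &w->new_po_pool);
}

static struct ntsync_pi_owner *ntsync_pi_work_take_new(struct ntsync_pi_work *w)
{
    struct ntsync_pi_owner *po;

    if (list_empty(&w->new_po_pool))
        return NULL;                 /* pi_recalc skips the boost */
    po = list_first_entry(&w->new_po_pool, struct ntsync_pi_owner, node);
    list_del_init(&po->node);        /* pointer-only: raw-lock safe */
    return po;
}

static void ntsync_pi_work_finish(struct ntsync_pi_work *w)
{
    struct ntsync_pi_owner *po, *tmp;

    /* runs outside all raw locks, so slab is legal again */
    list_for_each_entry_safe(po, tmp, &w->new_po_pool, node)
        kfree(po);
    list_for_each_entry_safe(po, tmp, &w->to_free_list, node)
        kfree(po);
}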

Caller pattern

Every ioctl entry that may invoke pi_recalc / pi_drop declares one of these on stack:

struct ntsync_pi_work pi_work;
ntsync_pi_work_init(&pi_work);
ntsync_pi_work_prealloc(&pi_work);

/* ... acquire raw locks, possibly call pi_recalc/pi_drop ... */
/* ... release all raw locks ... */

ntsync_pi_work_finish(&pi_work);

This pattern shows up in try_wake_any, try_wake_all_obj, release_mutex, wait_any, wait_all, event_set_pi, and several other entry points. Sites 5-6 (channel register/deregister) use a simpler local victim pointer pattern – a single removal per call doesn’t justify the pool.

NT semantics preserved exactly

Only observable difference: ntsync_pi_owner cleanup deferred by tens of nanoseconds past raw_spin_unlock. Mutex ownership transfers atomically with wake (cmpxchg unchanged). PI boost levels and stacking semantics unchanged. Channel priority ordering (DESC, seq ASC) unchanged. Token registration replace-or-insert unchanged. Wait-any/all wakeup ordering unchanged.

Why this fix mattered for everything that came after

1006 is a prerequisite for honest stress-testing of the channel path. Without it, every register/deregister churn in a stress test was rolling SLUB freelist dice. With it, KASAN under PREEMPT_RT became a useful tool: any splat is a real bug, not slab dust. That is what made 1009 (the channel_entry refcount UAF) catchable.

Open RT/safety items deferred from 1006

obj_lock() between prepare_to_wait and schedule in ntsync_channel_send_pi: rt_mutex_lock inside obj_lock would clobber TASK_INTERRUPTIBLE state if obj->dev_locked were set. Latent only – channels never participate in wait_all so dev_locked is never set on channels. Safe today; tighten when convenient.


7. Exclusive receive wakeup fix

Bug: ntsync_channel_send_pi speculatively boosts recv_wq.head to the sender’s priority before wake_up(), but wake_up() was waking all non-exclusive waiters because wait_event_interruptible adds non-exclusive waiters by default. Non-head receivers could win the entry-pop race -> the boosted head was stranded with high priority and no work; the winner had low priority and the actual work. A real production priority inversion.

This was the plausible root cause of unexplained gamma-dispatcher lockups previously (and incorrectly) blamed on userspace patches.

Three lines

-  ret = wait_event_interruptible(ch->u.channel.recv_wq,
+  /* Exclusive wait: wake_up() in SEND_PI walks the recv_wq and
+   * stops at the first exclusive waiter.  This makes the head
+   * (which SEND_PI speculatively boosted) the unique winner of
+   * the entry-pop race -- closes the priority-inversion window
+   * where a non-head receiver could pop the entry while the
+   * boosted head got stranded with high prio and no work. */
+  ret = wait_event_interruptible_exclusive(ch->u.channel.recv_wq,
        !RB_EMPTY_ROOT(&ch->u.channel.pending));

Applied in both ntsync_channel_recv and ntsync_channel_recv2.

Why this works

wake_up() is already exclusive-aware: it walks the wait queue and stops at the first exclusive waiter. So once both RECV and RECV2 register exclusive waiters, SEND_PI’s wake_up() wakes exactly the head – the boost target. The boost target becomes the unique race winner.

wait_event_interruptible_exclusive is a kernel primitive; it takes the wait queue lock, sets the waiter’s WQ_FLAG_EXCLUSIVE flag, and otherwise behaves identically to the non-exclusive variant. No new behaviour introduced; we just opted into the existing semantics.

Validation

Why this is the minimal correct fix

The rolled-back “Codex 1007-1011” patch series (Section 10) had attempted a much larger redesign of the channel path, including channel-rejection in setup_wait, cross-snapshot PI cleanup, and a pool/cleanup refactor of the channel allocations themselves. None of that was needed. Three lines suffice.


8. Deferred event boost

Bug: the original EVENT_SET_PI design (Section 3) walked event->any_waiters under obj_lock at signal time and applied the boost to the head waiter. This missed any consumer that took obj_lock first, saw signaled=true and returned without queueing – the standard wait fast-path. Result: ~4% of EVENT_SET_PI calls under PREEMPT_RT debug-kernel scheduling silently failed to apply the boost. A real RT-correctness hole.

The race

Thread A (consumer, fast path)         Thread B (signaler, EVENT_SET_PI)
obj_lock(event)
if (signaled) {                        kzalloc(new_ep)
   /* signaled=false set later */
   fast-path return (NO QUEUE)
}
obj_unlock(event)
                                       obj_lock(event)
                                       walk any_waiters: EMPTY
                                       target = NULL
                                       signaled = true
                                       obj_unlock(event)
                                       kfree(new_ep)  /* dropped! */

The signaler sets the event but has no target to boost; the consumer returns from wait_any having seen the signal but unboosted. The boost was lost.

This was hard to spot because most EVENT_SET_PI calls under PREEMPT_RT scheduling do find a queued waiter (the consumer hadn’t reached obj_lock yet). Only the fast-path race – consumer arrives just before signaler – silently dropped the boost. KASAN debug-kernel testing showed it as a ~4% flake rate on the test-event-set-pi test.

Redesign: stage on event, consume at wait-return

The fix flips ownership of the boost target. Instead of the signaler finding the target at EVENT_SET_PI time, the consumer applies the boost to itself at wait-return.

New per-event state in the event union:

struct {
    u32 policy;
    u32 prio;
    struct ntsync_event_pi *new_ep;   /* pre-allocated; consumer takes ownership */
} pending_pi;

Mechanism in five steps:

  1. Pre-allocate tracking entry outside any lock (slab on RT).
  2. Stage (policy, prio, new_ep) on the event under obj_lock; ALSO set signaled=true and wake any queued waiter.
  3. The first task to consume the signal – whether queued and woken, or fast-path (already-signaled) – applies the staged boost to itself via consume_event_pi_boost() at wait-return. This is race-free: the consumer is by definition the task whose wait_any/wait_all returned with this event as the signaled obj.
  4. Last-writer-wins if EVENT_SET_PI is called twice without an intervening consumption – the earlier staged new_ep is freed after obj_lock is released (RT-safe).
  5. EVENT_RESET clears the staging (signal cancelled, boost too).

Plus a 6th rule: ntsync_free_obj frees any leaked staging entry on object death (no leak if the event dies unconsumed).

consume_event_pi_boost()

Called from wait_any unqueue loop on the signaled obj if it is an event:

static void consume_event_pi_boost(struct ntsync_obj *event)
{
    struct ntsync_event_pi *new_ep = NULL;
    u32 policy = 0, prio = 0;
    bool valid = false, all;

    if (event->type != NTSYNC_TYPE_EVENT)
        return;

    all = ntsync_lock_obj(event->dev, event);
    if (event->u.event.pending_pi.new_ep) {
        new_ep = event->u.event.pending_pi.new_ep;
        policy = event->u.event.pending_pi.policy;
        prio   = event->u.event.pending_pi.prio;
        event->u.event.pending_pi.new_ep = NULL;
        valid = true;
    }
    ntsync_unlock_obj(event->dev, event, all);

    if (valid) {
        if (!apply_event_pi_boost(event->dev, current,
                                   policy, prio, new_ep))
            kfree(new_ep);
    }
}

The atomic capture-and-clear under obj_lock is the one-shot guarantee: the first consumer wins, subsequent consumers see new_ep == NULL and no-op. If EVENT_SET_PI is called again before consumption, the prior new_ep is freed under the same lock and replaced.

EVENT_SET_PI itself, simplified

The new ntsync_event_set_pi:

new_ep = kzalloc(sizeof(*new_ep), GFP_KERNEL);
if (!new_ep) return -ENOMEM;

ntsync_pi_work_init(&pi_work);
ntsync_pi_work_prealloc(&pi_work);

all = ntsync_lock_obj(dev, event);

/* Stage the boost.  Last-writer-wins. */
prior_new_ep = event->u.event.pending_pi.new_ep;
event->u.event.pending_pi.policy = args.policy;
event->u.event.pending_pi.prio   = args.prio;
event->u.event.pending_pi.new_ep = new_ep;

/* Signal: identical to EVENT_SET. */
event->u.event.signaled = true;
if (all)
    try_wake_all_obj(dev, event, &pi_work);
try_wake_any_event(event);

ntsync_unlock_obj(dev, event, all);
ntsync_pi_work_finish(&pi_work);

/* Free overwritten prior staging outside lock (slab on RT). */
kfree(prior_new_ep);

No more target = list_first_entry(...) walk under obj_lock. No more get_task_struct(target) ref management. The signaler just sets the event; whoever consumes it boosts themselves.

EVENT_RESET hook

Resetting the event cancels the signal, so it must cancel any pending boost too:

prior_new_ep = event->u.event.pending_pi.new_ep;
event->u.event.pending_pi.new_ep = NULL;
ntsync_unlock_obj(dev, event, all);
kfree(prior_new_ep);

ntsync_free_obj hook

If the event dies unconsumed, free the staging entry:

if (obj->type == NTSYNC_TYPE_EVENT)
    kfree(obj->u.event.pending_pi.new_ep);

wait_any consumer hook

Inside the wait_any unqueue loop, after the obj is unlocked but before put_obj:

if ((int)i == signaled && obj->type == NTSYNC_TYPE_EVENT)
    consume_event_pi_boost(obj);

The signaled index identifies which obj actually woke this wait. We consume only on that obj – non-signaled objs in a multi-object wait have nothing to apply.

wait_all TODO

ntsync_wait_all cannot call consume_event_pi_boost because that helper takes the obj’s wait-all lock path (via ntsync_lock_obj), and the unqueue loop already holds wait_all_lock. The audio-callback path uses wait_any so this gap is rare in practice; revisit if cross-event boost across wait_all becomes a workload concern. Comment in source:

/* NSPA: TODO -- wait_all consumer hook for EVENT_SET_PI deferred
 * boost.  Cannot call consume_event_pi_boost here because it
 * takes obj's wait-all lock path and we already hold
 * wait_all_lock.  Audio-callback path uses wait_any (handled in
 * the wait_any unqueue), so this is rare in practice; revisit
 * if cross-event boost becomes a workload concern. */

Validation

Cost

One extra atomic exchange under obj_lock per EVENT_SET_PI (the pending_pi store + signal flip). One extra obj_lock/obj_unlock per consume. The latter is the only new path; it runs only if the event has staged PI – so on workloads that don’t use EVENT_SET_PI it is a no-op (pending_pi.new_ep == NULL check is one load).


9. Channel-entry lifetime fix

Bug: KASAN-caught slab-use-after-free on ntsync_channel_entry under test-channel-stress 4x4 with thread-registration churn. REPLY’s wake_up_all on e->wq runs outside obj_lock (it must – wq’s internal lock is spinlock_t, becomes rt_mutex on PREEMPT_RT, can’t nest under our raw_spinlock_t). That creates a window where SEND_PI’s cleanup could kfree(e) between REPLY’s obj_unlock and REPLY’s wake_up_all reaching the freed wait queue.

The KASAN splat

BUG: KASAN: slab-use-after-free in do_raw_spin_lock+0x23c/0x270
Read of size 4 at addr ffff8882e30b2564 by task test-channel-st/51072

Call: __wake_up -> ntsync_obj_ioctl+0x8d5 [ntsync]

Allocated by task 51069: __kasan_kmalloc -> ntsync_obj_ioctl+0x941
Freed by task 51069:     kfree         -> ntsync_obj_ioctl+0x3e3c

Cache: kmalloc-256 (256-byte object), 248 bytes used.
Address is 100 bytes inside freed region.

Disassembly maps the faulting read to the spinlock embedded in e->wq: __wake_up taking the wait-queue lock of an already-freed entry, 100 bytes into the 256-byte slab object.

The race

Thread A (SEND_PI sleeper)              Thread B (REPLY)
                                        obj_lock(ch)
                                        find e in dispatched
                                        e->replied = true
                                        obj_unlock(ch)
loop iter: prepare_to_wait
loop iter: obj_lock(ch)
loop iter: e->replied is true, break
finish_wait
obj_lock(ch); list_del(&e->list);
obj_unlock(ch)
kfree(e)                                 wake_up_all(&e->wq)  <-- UAF

The wake_up_all outside obj_lock is necessary on PREEMPT_RT (wq’s internal lock cannot be taken under raw_spinlock obj_lock). But that creates the window where SEND_PI’s cleanup can free e between REPLY’s obj_unlock and REPLY’s wake_up_all.

The fix: refcount_t on channel_entry

Add refcount_t refcnt to struct ntsync_channel_entry. SEND_PI initializes it to 1 after queue insertion (the sleeping sender holds one ref). REPLY does refcount_inc under obj_lock before unlock, then wake_up_all, then refcount_dec_and_test+kfree-if-last. SEND_PI cleanup does refcount_dec_and_test+kfree-if-last. Whichever decrement reaches 0 frees.

Code addition is ~15 LOC:

struct ntsync_channel_entry {
    ...
    refcount_t refcnt;
};

/* In SEND_PI, after successful queue insertion: */
refcount_set(&e->refcnt, 1);  /* sleeper holds 1; REPLY will inc */

/* In SEND_PI cleanup, replacing kfree(e): */
if (refcount_dec_and_test(&e->refcnt))
    kfree(e);

/* In REPLY: set replied and take a ref under obj_lock, then unlock,
 * wake, and drop the ref: */
e->replied = true;
refcount_inc(&e->refcnt);
obj_unlock(ch);
wake_up_all(&e->wq);
drain_event_pi_boosts(ch->dev, current);
if (refcount_dec_and_test(&e->refcnt))
    kfree(e);

Why this is the minimal correct fix

There was a previous “Codex 1007-1011” patch series (rolled back; see Section 10) that targeted this same bug class but bundled it with a number of unrelated audit-derived changes (REPLY-fake-on-copy-fail, channel-reject in setup_wait, cross-boost cleanup refactor). The core fix – refcount on the entry – was correct in that series. Everything else was speculative noise that introduced its own bugs.

This patch is just the refcount.

Validation

Why this is the right shape, not the wrong shape

A common alternative for this class of bug is to take a sleepable lock around the wake. We can’t – the obj_lock that protects entry membership is raw_spinlock_t, and we cannot promote it to rt_mutex without losing the bounded-CS guarantee that the rest of the driver depends on. Refcount on the entry is the textbook fix for “object outlives its containing-collection lifetime due to async finishers” – no lock-order changes, no protocol changes, just two incs and three dec_and_tests in the right places.


10. Aggregate-wait and burst drain

Patch 1010 adds the heterogeneous wait primitive that the rest of the NSPA stack had been designing around: NTSYNC_IOC_AGGREGATE_WAIT.

The immediate consumer is the post-1010 gamma dispatcher. Instead of blocking in direct CHANNEL_RECV2 forever, the dispatcher can wait on regular NTSync objects (events, semaphores, mutexes), the channel itself as a notify-only source, and external pollable FDs (e.g. a uring eventfd) in one syscall, while still keeping channel PI visible.

[Figure: 1010 aggregate-wait, the dispatcher-facing kernel surface. NTSYNC_IOC_AGGREGATE_WAIT copies the source array, registers object waits plus poll waits over the source types (NTSync objects – events, semaphores, mutexes; the channel as a notify-only registration; FD sources such as a uring eventfd, with fd-poll/timer wake sources as future work), sleeps once, and returns fired_index + fired_events, or a timeout sentinel on deadline expiry. Load-bearing follow-ups: the installed build also carries a SEND_PI any-waiters fallback, and 1011 layers TRY_RECV2 burst drain on top of this wait surface.]

UAPI shape


struct ntsync_aggregate_source {
    __u32 type;
    __u32 events;
    __u64 handle_or_fd;
};

struct ntsync_aggregate_wait_args {
    __u32 nb_sources;
    __u32 reserved;
    __u64 sources;
    struct __kernel_timespec deadline;
    __u32 fired_index;
    __u32 fired_events;
    __u32 flags;
    __u32 owner;
};
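A userspace sketch of the dispatcher-side call; the NTSYNC_AGG_SOURCE_* constants and the device-FD target are assumptions, not confirmed UAPI names:

/* one wait over an NTSync event, a channel (notify-only), and a
 * pollable FD; constants assumed, real names live in the UAPI header */
static int dispatcher_wait(int device_fd, int event_fd, int channel_fd,
                           int uring_eventfd, struct __kernel_timespec dl)
{
    struct ntsync_aggregate_source srcs[3] = {
        { .type = NTSYNC_AGG_SOURCE_OBJ,     .handle_or_fd = event_fd      },
        { .type = NTSYNC_AGG_SOURCE_CHANNEL, .handle_or_fd = channel_fd    },
        { .type = NTSYNC_AGG_SOURCE_FD,      .events = POLLIN,
                                             .handle_or_fd = uring_eventfd },
    };
    struct ntsync_aggregate_wait_args args = {
        .nb_sources = 3,
        .sources    = (__u64)(uintptr_t)srcs,
        .deadline   = dl,              /* absolute deadline */
    };

    if (ioctl(device_fd, NTSYNC_IOC_AGGREGATE_WAIT, &args) < 0)
        return -1;
    /* a channel source is notify-only: if it fired, the caller must
     * follow with CHANNEL_RECV2 to consume the actual entry */
    return (int)args.fired_index;
}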

Why it is architecturally different from WAIT_ANY

WAIT_ANY consumes object state when it fires – it decrements the semaphore, claims the mutex, or clears an auto-reset event – and channels are a hard false in is_signaled(), so they can never participate. AGGREGATE_WAIT registers heterogeneous sources instead: NTSync objects, the channel as a notify-only source, and plain pollable FDs. It reports which source fired without consuming a channel entry; userspace follows with CHANNEL_RECV2 to consume.

Validation surface

1010 was not treated as a paper design or a future placeholder; it was validated with a dedicated native aggregate-wait suite.

The first result was the post-1009 base plus aggregate-wait and its PI-ordering follow-ups. The next overlay added burst drain on top, and the current overlay keeps both surfaces while adding the later hardening and cache-isolation work.

10.1 Burst drain with CHANNEL_TRY_RECV2

1011 adds NTSYNC_IOC_CHANNEL_TRY_RECV2, a non-blocking companion to CHANNEL_RECV2. It does not replace aggregate-wait; it is the follow-on that lets a woken dispatcher keep draining the ready list without paying one more AGG_WAIT round-trip per queued entry.

That is a small kernel change, but it is exactly the shape the gamma dispatcher needs under bursty server-bound RPC load: one aggregate-wait wakeup, then a TRY_RECV2 drain loop that pops every queued entry before the dispatcher sleeps again.

The ioctl is additive at the kernel interface level, but current Wine-NSPA userspace assumes it is present. The old sticky fallback ladder was retired once aggregate-wait became the project baseline.
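A dispatcher-side sketch of that shape (handle_request and send_reply are placeholders standing in for the wineserver request handler and the REPLY ioctl):

struct ntsync_channel_recv2_args req;

/* woken by AGGREGATE_WAIT: pop the first entry, blocking variant */
if (ioctl(channel_fd, NTSYNC_IOC_CHANNEL_RECV2, &req) == 0) {
    do {
        handle_request(&req);                 /* runs boosted to req.prio */
        send_reply(channel_fd, req.entry_id); /* NTSYNC_IOC_CHANNEL_REPLY */
        /* non-blocking: pops the next ready entry or fails if empty */
    } while (ioctl(channel_fd, NTSYNC_IOC_CHANNEL_TRY_RECV2, &req) == 0);
}
/* queue empty: go back to NTSYNC_IOC_AGGREGATE_WAIT */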


11. Receive snapshot fix

Bug: in ntsync_obj_ioctl paths for NTSYNC_IOC_CHANNEL_RECV and NTSYNC_IOC_CHANNEL_RECV2, the receiver popped a channel_entry *e under obj_lock, then unlocked, then read e->fields (payload_off, sender_tid, etc). Between the unlock and the field reads the sender thread – parked in wait_event_interruptible – could be signal-interrupted, run its cleanup path, and kfree(e) in that window. The receiver then read freed memory. KASAN reproducibly caught this on test-channel-stress and on real Ableton workloads.

The lock-drop is mandatory for the rest of the RECV/RECV2 path: copy_to_user, the apply_event_pi_boost call, and the receiver auto-boost cannot run under the raw_spinlock_t obj_lock. The fix therefore narrows what the post-unlock path needs to read off e.

Fix: snapshot relevant fields under obj_lock before unlocking

/* Pre-1012 (broken): */
spin_lock(&obj->obj_lock);
e = list_first_entry_or_null(&channel->pending, ...);
if (!e) { spin_unlock(...); return -EAGAIN; }
list_del(&e->link);
spin_unlock(&obj->obj_lock);
/* RACE: sender cleanup can kfree(e) here */
args.payload_off = e->payload_off;      /* reads freed slab memory */
args.reply_off   = e->reply_off;
args.sender_tid  = e->sender_tid;
copy_to_user(buf, &args, sizeof(args));

/* Post-1012 (fixed): */
spin_lock(&obj->obj_lock);
e = list_first_entry_or_null(&channel->pending, ...);
if (!e) { spin_unlock(...); return -EAGAIN; }
list_del(&e->link);
/* SNAPSHOT the scalar fields under the lock */
args.payload_off = e->payload_off;
args.reply_off   = e->reply_off;
args.sender_tid  = e->sender_tid;
spin_unlock(&obj->obj_lock);
copy_to_user(buf, &args, sizeof(args));

Why snapshot, not refcount

The 1009 fix used a refcount_t on the entry to keep it alive across REPLY’s wake_up_all. 1012 does not. Refcount would have worked here too, but it would have added two atomic ops (atomic_inc on entry, atomic_dec_and_test on exit) to every channel RECV / RECV2 on the audio dispatcher’s critical chain. Snapshotting collapses the lock-drop window to zero rather than extending the entry’s lifetime, costs zero atomics, and keeps the recv hot path one cacheline narrower. The snapshotted fields are small and well-bounded (a few words).

Coverage

A new test-channel-try-recv2-stress.c was added in the same change as a gap-filler for patch 1011: TRY_RECV2 had no dedicated stress test until then.


12. Dedicated slab caches

Pre-1013, three ntsync allocation classes lived in the system kmalloc pool: struct ntsync_event_pi (120B, kmalloc-128), struct ntsync_pi_owner (120B, kmalloc-128), and struct ntsync_channel_entry (192B, kmalloc-192).

That is functionally correct but architecturally weak for an RT-class hot path: two 120B objects in kmalloc-128 sit back-to-back so the tail of one and the head of the next can share a cacheline; an ntsync object can neighbour a network struct or fs metadata in the same kmalloc bucket; /proc/slabinfo lumps everything into kmalloc-128; kmem_cache_shrink, SLAB_FREELIST_HARDENED, and SLAB_HWCACHE_ALIGN cannot be applied to a subset of kmalloc-128.

Implementation

Three dedicated caches, each sized exactly to the struct, each with SLAB_HWCACHE_ALIGN:


ntsync_event_pi_cache = kmem_cache_create("ntsync_event_pi",
        sizeof(struct ntsync_event_pi),
        0, SLAB_HWCACHE_ALIGN, NULL);

ntsync_channel_entry_cache = kmem_cache_create("ntsync_channel_entry",
        sizeof(struct ntsync_channel_entry),
        0, SLAB_HWCACHE_ALIGN, NULL);

ntsync_pi_owner_cache = kmem_cache_create("ntsync_pi_owner",
        sizeof(struct ntsync_pi_owner),
        0, SLAB_HWCACHE_ALIGN, NULL);

All kzalloc / kfree callsites for the three structs are converted to kmem_cache_alloc / kmem_cache_free. The conversion is mechanical except for one subtle gotcha (see Section 13’s 1014a follow-up).

Init / exit lifecycle

Caches are constructed in ntsync_init in dependency order with mirrored unwind labels, and destroyed in ntsync_exit in reverse order.
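In sketch form, assuming upstream's misc-device registration (label names hypothetical; 1015 later adds the fourth cache and SLAB_NO_MERGE):

static int __init ntsync_init(void)
{
    int ret = -ENOMEM;

    ntsync_event_pi_cache = kmem_cache_create("ntsync_event_pi",
            sizeof(struct ntsync_event_pi), 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!ntsync_event_pi_cache)
        return -ENOMEM;

    ntsync_channel_entry_cache = kmem_cache_create("ntsync_channel_entry",
            sizeof(struct ntsync_channel_entry), 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!ntsync_channel_entry_cache)
        goto err_event_pi;

    ntsync_pi_owner_cache = kmem_cache_create("ntsync_pi_owner",
            sizeof(struct ntsync_pi_owner), 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!ntsync_pi_owner_cache)
        goto err_channel_entry;

    ret = misc_register(&ntsync_misc);
    if (ret)
        goto err_pi_owner;
    return 0;

err_pi_owner:
    kmem_cache_destroy(ntsync_pi_owner_cache);
err_channel_entry:
    kmem_cache_destroy(ntsync_channel_entry_cache);
err_event_pi:
    kmem_cache_destroy(ntsync_event_pi_cache);
    return ret;
}

static void __exit ntsync_exit(void)
{
    misc_deregister(&ntsync_misc);
    kmem_cache_destroy(ntsync_pi_owner_cache);      /* reverse order */
    kmem_cache_destroy(ntsync_channel_entry_cache);
    kmem_cache_destroy(ntsync_event_pi_cache);
}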

Structural value (always-on)

Each struct now has its own /proc/slabinfo line and /sys/kernel/slab entry; no ntsync object shares a cacheline or a slab page with foreign kernel allocations; and cache-level controls (kmem_cache_shrink, SLAB_HWCACHE_ALIGN, hardening flags) apply per-struct instead of to all of kmalloc-128.

Workload absorption (slabinfo, drum-load capture 2026-05-04)

cache idle drum-load delta size pre-1013 home
ntsync_event_pi 637 795 +158 120B kmalloc-128
ntsync_pi_owner 637 795 +158 120B kmalloc-128
ntsync_channel_entry 168 168 0 192B kmalloc-192
kmalloc-128 (system) 2240 2240 0 128B n/a

158 new event-PI staging pairs (one event_pi + one paired pi_owner) absorbed cleanly in the dedicated caches; kmalloc-128 stayed flat – isolation under real load. SLUB internal state moved in the expected direction: partial slabs filled (8 -> 2), per-CPU slabs went up (18 -> 24), matching “hot path picks up CPU-local allocations”.

Independence from 1014

1013 has no dependency on 1014 and vice versa; the patches are separately revertable.


13. Lockless SEND_PI target scan

Motivation

In ntsync_channel_send_pi, before staging the boost on a target waiter, the code scans the channel's wait_queue_head_t to pick a target. Pre-1014 that scan acquired wq->lock via spin_lock_irqsave – the wait-queue's own internal lock, distinct from obj_lock – even when the queue was empty. The empty case is the common one for an audio dispatcher under steady load: most SEND_PI fires hit a channel with no parked waiters. That is a wasted lock round-trip on the audio thread's hot path.

Implementation

Replace the unconditional lock+scan with a list_empty_careful peek first:

/* Pre-1014: */
spin_lock_irqsave(&wq->lock, flags);
list_for_each_entry(...) { ... }
spin_unlock_irqrestore(&wq->lock, flags);

/* Post-1014: */
if (list_empty_careful(&wq->head)) {
    /* fall through to any_waiters fallback path; no lock taken */
    goto no_target;
}
spin_lock_irqsave(&wq->lock, flags);
/* same as before */
spin_unlock_irqrestore(&wq->lock, flags);

Correctness

list_empty_careful() reads both head->next and head->prev, so it never misreads a half-completed concurrent add as a stable empty queue. If a receiver enqueues concurrently with the peek, the worst case is that this SEND_PI skips the speculative boost and takes the fallback path – the same outcome as if the sender had run first – and the subsequent wake_up() still reaches the new waiter, so no wakeup is lost.

RT-safety

Removes the wq->lock acquisition and its lock round-trip from the audio thread's SEND_PI hot path in the common (empty-queue) case – a measurable critical-path reduction in the path that matters most for audio jitter.

1014a: kmem_cache_free is not NULL-safe

The 1013 conversion left one kfree-style site for obj->u.event.pending_pi.new_ep un-NULL-guarded in ntsync_free_obj. The diff comment claimed kmem_cache_free is NULL-safe like kfree. The kernel source disagrees:

mm/slub.c:6900 (Linux 6.19.11):


void kmem_cache_free(struct kmem_cache *s, void *x)
{
    s = cache_from_obj(s, x);
    ...
}

static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
{
    if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
        !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
        return s;
    cachep = virt_to_cache(x);  /* DEREFS x */
    ...
}

Under SLAB_FREELIST_HARDENED (debug kernel: enabled) the short-circuit fails and virt_to_cache(NULL) runs, dereferencing offset 0x8 of NULL. kfree() early-outs on ZERO_OR_NULL_PTR(x); kmem_cache_free does not – the asymmetry the diff comment got wrong.

The crash signature on the debug kernel was a page fault at kmem_cache_free+0x5c with RAX = vmemmap[0] (the struct page for NULL) and CR2 = 0x...0008 (the slab->slab_cache deref at offset 8) – exact match.

Fix

if (obj->type == NTSYNC_TYPE_EVENT && obj->u.event.pending_pi.new_ep)
    kmem_cache_free(ntsync_event_pi_cache,
                    obj->u.event.pending_pi.new_ep);

The pattern matches the existing explicit-guard sites the same conversion already used at lines 1306, 1336, 1604, 1740, 1829, 1916. Site 2089 was simply missed in the original 1013 audit.

Audit summary that found 1014a

A four-dimension audit covering the entire post-1014 file surfaced the missing guard at site 2089.

[Figure: post-1011 carries – the 1012 + 1013 + 1014 + 1014a surface. 1012 (channel recv snapshot): the RECV/RECV2 path snapshots payload_off, reply_off, and sender_tid under obj_lock before unlocking, closing the cross-thread slab UAF. 1013 (dedicated kmem_caches): ntsync_event_pi (120B), ntsync_channel_entry (192B), and ntsync_pi_owner (120B) with SLAB_HWCACHE_ALIGN for isolation. 1014 (lockless target scan): SEND_PI peeks with list_empty_careful before taking wq->lock, skipping the lock round-trip on the common empty-queue case. 1014a: kmem_cache_free is not NULL-safe under SLAB_FREELIST_HARDENED (cache_from_obj derefs its operand), so an explicit NULL guard was added at site 2089 (the obj->u.event.pending_pi.new_ep free in ntsync_free_obj). The post-1014a build with dedicated caches and the lockless SEND_PI scan ran KASAN-clean over ~14M ops and has been on prod kernel 6.19.11-rt1-1-nspa since 2026-05-04.]

14. Wait-queue dedicated cache

Motivation

A post-1014a audit (2026-05-05) of the live driver enumerated all remaining kmalloc / kzalloc sites in ntsync.c. Six sites total; only two on the audio dispatcher hot path, both for the per-ioctl wait queue (struct ntsync_q) allocated by setup_wait and ntsync_aggregate_setup. Every WAIT_ANY / WAIT_ALL / AGGREGATE_WAIT ioctl pays one kmalloc(struct_size(...)) then one kfree.

Site (post-1014a line) Path Status
1974 ntsync_thread_reg per channel thread-register COLD – skip
2160 ntsync_obj per CreateEvent/Mutex/Sem COOL – marginal
2383 wait queue q per WAIT_ANY/WAIT_ALL ioctl HOT – target
2829 wait queue q (agg) per AGGREGATE_WAIT ioctl HOT – target
2840 fds array per wait-with-FDs (var-count) not eligible
3092 ntsync_device per chardev open COLD – skip

struct ntsync_q is the only HOT kmalloc class that survived 1013.

Variable-size design

struct ntsync_q has a flexible-array member entries[] whose count is total_count, ranging from 1 (the typical audio worker) up to NTSYNC_MAX_WAIT_COUNT+1 = 65 (the NtWaitForMultipleObjects cap) or NTSYNC_AGG_MAX = 64 (aggregate). Three sizing options were considered; the shipped design (a) caches the common small case: a fixed slot covering up to 16 entries, with a kmalloc fallback for anything larger.

16 entries comfortably covers the typical 1-8 audio wait depth; larger waits keep the kmalloc path with no regression. Slot size with SLAB_HWCACHE_ALIGN is 704B on x86_64 (32B header + 16 × 40B entries = 672B, rounded up to the next 64-byte cacheline).

Allocator routing

A bool from_cache field is added to struct ntsync_q, placed in the existing 2-byte trailing pad after bool ownerdead so sizeof(struct ntsync_q) is unchanged. Set by ntsync_alloc_q, read by ntsync_free_q:


static struct ntsync_q *ntsync_alloc_q(__u32 total_count)
{
    struct ntsync_q *q;

    if (total_count <= NTSYNC_Q_CACHE_MAX_ENTRIES) {
        q = kmem_cache_alloc(ntsync_wait_q_cache, GFP_KERNEL);
        if (q)
            q->from_cache = true;
    } else {
        q = kmalloc(struct_size(q, entries, total_count), GFP_KERNEL);
        if (q)
            q->from_cache = false;
    }
    return q;
}

static void ntsync_free_q(struct ntsync_q *q)
{
    if (!q)
        return;
    if (q->from_cache)
        kmem_cache_free(ntsync_wait_q_cache, q);
    else
        kfree(q);
}

ntsync_free_q is NULL-safe by design (early return). kmem_cache_free is not NULL-safe under SLAB_FREELIST_HARDENED (the 1014a lesson); centralising the guard in the wrapper makes per-site audit trivial. Two alloc-site conversions plus six free-site conversions complete the WAIT_* / AGGREGATE_WAIT path.

PREEMPT_RT discipline (1006 alloc-hoist invariant)

Both kmem_cache_alloc(..., GFP_KERNEL) and kmalloc(..., GFP_KERNEL) are sleep-prone (they may direct-reclaim). R1 from ntsync-rt-audit.md forbids sleeping operations under raw_spinlock_t. Verified at every call site:

Site Context when ntsync_alloc_q / ntsync_free_q runs
setup_wait 2383 Top of function, no locks held
ntsync_aggregate_setup 2829 Top of function, no locks held
setup_wait err 2421 Error cleanup, no locks held
ntsync_wait_any 2573 After unqueue and ntsync_pi_work_finish
ntsync_wait_all 2746 After wait_all_lock unlock and ntsync_pi_work_finish
ntsync_aggregate_setup err 2842 fds-alloc fail, no locks held
ntsync_aggregate_setup err 2891 Partial-init fail, no locks held
ntsync_aggregate_wait 3083 After unqueue and ntsync_pi_work_finish

The 1006 alloc-hoist invariant is preserved end-to-end.
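A minimal sketch of the discipline the table verifies, with hypothetical function and field names (not the actual ntsync.c sites): all GFP_KERNEL work sits strictly outside the raw_spinlock_t critical sections.

/* Illustrative shape of a WAIT_* path under the 1006 invariant.
 * kmem_cache_alloc(GFP_KERNEL) and kmalloc(GFP_KERNEL) may sleep,
 * so the q allocation precedes any obj_lock acquisition and the
 * free follows the final unlock. */
static int example_wait_path(struct ntsync_obj *obj, __u32 count)
{
    unsigned long flags;
    struct ntsync_q *q;

    q = ntsync_alloc_q(count);              /* may sleep: no locks held */
    if (!q)
        return -ENOMEM;

    raw_spin_lock_irqsave(&obj->lock, flags);
    /* ... publish q into the waiter queue; no alloc/free in here ... */
    raw_spin_unlock_irqrestore(&obj->lock, flags);

    /* ... sleep until woken, then unqueue under obj_lock ... */

    ntsync_free_q(q);                       /* again: no locks held */
    return 0;
}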

UAF / lifecycle (1012 lesson – N/A)

struct ntsync_q has a task-private lifecycle: allocated by the syscalling task, populated by the same task, published into wait queues under obj_lock, list_del’d under obj_lock during unqueue (mutually exclusive with try_wake_any_*), then freed. No cross-thread free path exists. The 1012 snapshot-vs-refcount lesson does not apply – there is no lock-drop window between mutator and freer.

SLAB_NO_MERGE retro-correction

The original 1015 patch only added SLAB_HWCACHE_ALIGN (mirroring 1013). First boot showed the new cache absent from /proc/slabinfo. /sys/kernel/slab/ revealed why:


ntsync_channel_entry -> :0000192    # merged
ntsync_event_pi      -> :0000128    # merged
ntsync_pi_owner      -> :0000128    # merged
ntsync_wait_q        -> :0000704    # merged (1015 alone)

All four ntsync caches had been merged by SLUB into generic kmalloc-N classes. The 1013 architectural promise of "isolation from kmalloc-128" had therefore not held on the prod kernel at any point since 1013 landed. It did hold on the debug kernel, because SLAB_FREELIST_HARDENED makes caches ineligible for merging: a debug-vs-prod config difference. Section 12's drum-load slabinfo absorption table was therefore debug-kernel evidence; on prod, those allocations had been going into kmalloc-128 the whole time.

Fix: add SLAB_NO_MERGE (available since kernel 6.4; prod runs 6.19) to all four kmem_cache_create calls, bundled into the 1015 patch as a retroactive correction:


ntsync_event_pi_cache      = kmem_cache_create("ntsync_event_pi",      ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
ntsync_channel_entry_cache = kmem_cache_create("ntsync_channel_entry", ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
ntsync_pi_owner_cache      = kmem_cache_create("ntsync_pi_owner",      ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
ntsync_wait_q_cache        = kmem_cache_create("ntsync_wait_q",        ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);

After the fix, /sys/kernel/slab/ntsync_*/ are all real directories; no symlinks, no merging. The 1013 isolation promise holds on prod.
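Spelled out for the new cache, one call looks roughly like the following (the size expression is assumed from the slot math above, not copied from the driver; align = 0 lets SLAB_HWCACHE_ALIGN pick the cacheline rounding, NULL means no constructor):

/* Illustrative full form of the 1015 cache creation (sizes assumed). */
ntsync_wait_q_cache = kmem_cache_create("ntsync_wait_q",
        sizeof(struct ntsync_q) +
        NTSYNC_Q_CACHE_MAX_ENTRIES * sizeof(struct ntsync_q_entry),
        0, SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
if (!ntsync_wait_q_cache)
    return -ENOMEM;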

Workload absorption (Ableton, prod kernel, 30s windows at 1Hz)

cache active_objs (steady-state) kmalloc-N delta during same window
ntsync_wait_q 184 kmalloc-1k delta = 0
ntsync_event_pi 256 covered in dedicated cache
ntsync_channel_entry 168 covered in dedicated cache
ntsync_pi_owner 256 covered in dedicated cache

184 active ntsync_wait_q objects is the steady-state concurrency of Ableton’s worker pool parked in NtWaitForMultipleObjects. Pre-1015 those 184 would have lived in kmalloc-1k; post-1015 they sit in the dedicated cache, with kmalloc-1k flat across the load window – isolation proven on the prod kernel for the first time since 1013.

The active_objs metric is concurrency, not throughput; a per-second alloc-rate proof would need /sys/kernel/slab/ntsync_wait_q/alloc_calls cumulative deltas or a slabtop snapshot pair. Not a gate, just a refinement for future evidence-gathering.

Independence

1015 has no dependency on 1012 / 1013 / 1014; the patches remain separately revertable. The SLAB_NO_MERGE retro-correction is bundled because both edits live in the same kmem_cache_create chain – landing it as a separate 1013a would have meant two patch applications for one logical change.

1015 at a glance

- Scope: ntsync_wait_q cache + SLAB_NO_MERGE on all four ntsync caches.
- Consumers: WAIT_ANY / WAIT_ALL / AGGREGATE_WAIT via setup_wait (q = ntsync_alloc_q(total_count)) and ntsync_aggregate_setup (q = ntsync_alloc_q(nb_obj)); total_count: 1 typical, 16 cache cap, 65 max.
- Routing: ntsync_wait_q 704B slot; ≤16 entries -> cache slot, >16 entries -> kmalloc fallback; routed via q->from_cache (in pad).
- Evidence: Ableton steady-state 184 active objs (worker-pool waits); pre-1015 home kmalloc-1k; kmalloc-1k delta = 0 across load.
- Pre-SLAB_NO_MERGE on the prod kernel: all four caches silently merged into generic kmalloc-N; /sys/kernel/slab/ntsync_* were symlinks to :0000128 / :0000192 / :0000704.
- Cache lineup (all SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE): ntsync_event_pi 128B slot (1013), ntsync_channel_entry 192B slot (1013), ntsync_pi_owner 128B slot (1013), ntsync_wait_q 704B slot (new in 1015; ≤16-entry q + fallback).
- Current build: post-1015 with `SLAB_NO_MERGE` on all four ntsync caches.

15. Validation

Overlay progression

Stage What landed Notes
PI baseline 1003 + 1004 + 1005 + 1006 priority inheritance, channel transport, thread-token return, and RT-safe alloc/free discipline
Channel wake correctness 1007 + 1008 + 1009 exclusive receive wakeup, deferred event boost, and channel-entry lifetime fix
Aggregate-wait 1010 heterogeneous wait over objects plus fds, with channel notify-only support
Burst drain 1011 non-blocking TRY_RECV2 after one aggregate-wait wake
Snapshot + cache hardening 1012 + 1013 + 1014 + 1014a receive snapshot fix, dedicated caches, lockless SEND_PI fast path, and the free-site NULL guard
Wait-queue cache isolation 1015 dedicated wait-queue cache plus SLAB_NO_MERGE across all four ntsync caches

The current module at /lib/modules/6.19.11-rt1-1-nspa/kernel/drivers/misc/ntsync.ko carries the full overlay above.

Stress validation (debug kernel, KASAN-on)

Test Build stage Ops KASAN Result
test-event-set-pi-stress 30s/4x4 deferred-boost fix build 1.5M signaler 0 PASS
test-event-set-pi-stress 60s/8x8 deferred-boost fix build 2.8M sig + 3.4M waiter 0 PASS
test-mutex-pi-stress 30s/8+4mtx deferred-boost fix build 726K acq+rel matched, 632K PI events 0 PASS
test-channel-stress 30s/4x4 deferred-boost fix build KASAN UAF caught at ~30s 1 EXPECTED FAIL (Bug 4 found)
test-channel-stress 30s/4x4 post-channel-entry fix build 819K SEND_PI = 819K REPLY 0 PASS
test-event-set-pi-stress 60s/8x8 post-channel-entry fix build 2.7M sig + 3.5M waiter 0 PASS
test-event-set-pi 20x sanity post-channel-entry fix build 20/20 PASS 0 PASS
test-channel-recv-exclusive 20x post-channel-entry fix build 20/20 PASS 0 PASS
test-mixed-load-stress 5min/13W post-channel-entry fix build ~10.3M ops, all paths 0 PASS
test-aggregate-wait 9/9 aggregate-wait build functional + PI sub-tests n/a PASS
aggregate-wait 1k mixed stress aggregate-wait build 1k iterations 0 PASS
aggregate-wait 30k + native suite aggregate-wait build long stress + full suite 0 PASS
test-channel-stress (post-1012) snapshot + cache-hardening build 1.34M ops (post-1012 KASAN re-soak) 0 PASS
test-channel-try-recv2-stress snapshot + cache-hardening build 2.6M TRY_RECV2 ops 0 PASS
test-mixed-load-stress 300s/13W snapshot + cache-hardening build 5.28M chan SEND/REPLY, 1.99M audio waits, 12.6M REG/DEREG 0 PASS
test-channel-stress 60s/4x4 (1014a) snapshot + cache-hardening build 1.40M SEND/REPLY, 1.40M RECV+RECV2 0 PASS
test-channel-try-recv2-stress 30s snapshot + cache-hardening build 62k SEND, 2.68M attempts, 97.68% EAGAIN 0 PASS

Cumulative debug-kernel: ~30 million operations through post-1009; post-1014a adds another ~14 million ntsync ops (channel SEND_PI hit ~21x more than the post-1012 validation window), zero KASAN splats, zero dmesg matches for BUG/KASAN/Oops/use-after-free/lockdep/warn.

Production validation after aggregate-wait and burst-drain

The aggregate-wait consumer path was validated on the production kernel/userspace pair rather than only in isolation.

This matters because 1010 is load-bearing only when the userspace dispatcher is actually blocked inside it. The build result therefore includes both the syscall itself and the post-1010 wake/boost ordering fixes.

Mixed-load-stress detail

13-thread/300s soak across every ntsync path concurrently against a single dev_fd:

Operation totals:

Path Ops Notes
audio multi-obj waits 8,757,969 100% wake rate
ui EVENT_SET_PI 139,513
ui EVENT_SET / RESET / PULSE 46,506 / 23,181 / 23,324
ui mutex acq=rel 137,297 / 137,297 perfect
chan SEND_PI / REPLY 308,546 / 308,548 perfect after 30 benign races
chan REGISTER / DEREGISTER 730,985 / 365,492
sem release/acquire/read 136,683 / 180,063 / 180,064
wait_all 3-obj acq=rel 71,855 / 71,855 perfect
syscall errors 0
KASAN/KCSAN splats 0
module refcnt post-soak 0

Production-kernel revalidation

After cross-build to the production kernel 6.19.11-rt1-1-nspa (no debug instrumentation, throughput 5x-149x higher than debug):

Layer Run Result Ops Errors
1 native sanity run-rt-suite.sh native 3/3 PASS small 0
1 stress event-set-pi 60s 8x8 PASS ~158M 0
1 stress mutex-pi 30s 8h+4mtx PASS ~12M 0
1 stress channel 30s 4x4 PASS ~52M 0
1 stress mixed-load 300s 13 workers PASS ~145M 0
2 PE matrix nspa_rt_test.exe baseline+rt 32 PASS / 0 FAIL / 0 TIMEOUT n/a 0

Cumulative on the production kernel: a ~370M-op post-channel-entry baseline, followed by the aggregate-wait, burst-drain, receive-snapshot, dedicated-cache-hardening, and wait-queue-cache/full-isolation carries; 0 syscall errors, 0 dmesg splats, refcnt=0 post-soak.

The post-1014a build was also re-validated with the full RT-suite v7 on prod kernel 6.19.11-rt1-1-nspa: 16/16 RT pass + 3/3 native ioctl pass; channel snapshot UAF and kmem_cache_free NULL-deref both closed; dedicated kmem_cache slabinfo evidence captured under real Ableton drum-load (158 new event-PI staging pairs absorbed in the dedicated caches with kmalloc-128 flat). Note: that drum-load slabinfo capture was on the debug kernel; on the prod kernel the 1013 caches had been SLUB-merged into kmalloc-128 the entire time, which is the issue 1015’s SLAB_NO_MERGE retro-correction fixes (Section 14).

Post-1015 validation (prod kernel)

The 1015 build was validated against the same prod kernel 6.19.11-rt1-1-nspa. The native ioctl soak (validate-1015.sh, which exercises both setup_wait and ntsync_aggregate_setup alloc paths through test-mixed-load-stress, test-channel-stress, test-channel-try-recv2-stress, and test-aggregate-wait) was not invoked this round: the prod kernel has no SLAB_FREELIST_HARDENED/KASAN tooling, so the soak’s signal value collapses to functional-only – which Ableton already provides at much higher rate. The actual correctness gate was the four-dimension audit plus the NULL-safe ntsync_free_q wrapper.

Empirical safety: Ableton booted clean both pre- and post-SLAB_NO_MERGE rebuild; audio-path WAIT_ANY ioctls drove the new alloc/free pair constantly with no GP-fault, so from_cache routing is correct in both directions.

Slabinfo absorption (validate-1015-slabinfo-watch.sh, Ableton 30s windows at 1Hz, project loaded, mixed transport activity) is tabulated in Section 14.

The 184 active ntsync_wait_q objects, which would have lived in kmalloc-1k on every prior build, combined with the flat kmalloc-1k row, are the isolation proof on prod. active_objs is concurrency, not throughput; a per-second alloc-rate proof would need /sys/kernel/slab/ntsync_wait_q/alloc_calls deltas (not a gate, just a refinement).

Only PASS/FAIL is authoritative across debug vs production kernels; throughput numbers aren’t directly comparable because the debug-kernel slub_debug=FZPU + kfence + KASAN tax dominates.

Original 1003-era PI metrics (still valid)

The PI contention / priority wakeup ordering / rapid mutex throughput / philosophers tests from the original single-page ntsync doc remain valid. None of the later channel or aggregate-wait carries changed the mutex PI path; the metrics are unchanged:

Metric / Test v4 RT v5 RT Delta
ntsync-d4 RT PI avg 387 ms 270 ms -30.2%
ntsync-d8 RT PI avg 419 ms 201 ms -52.0%
Rapid mutex throughput 232K ops/s 259K ops/s +11.6%
Rapid mutex RT max_wait 54 us 47 us -13.0%
Philosophers RT max_wait 1620 us 865 us -46.6%

Priority wakeup ordering is exact (5 waiters at distinct priorities wake in priority order, both baseline and RT modes, all test runs). PI chain propagation is correct up to depth 12.


16. Audit notes

The patches in this stack divide cleanly into two categories. The boundary matters because it dictates which patches were safe to ship in a flurry and which weren’t.

Mechanically verifiable correctness vs. code-review hypothesis

Patches in the mechanically verifiable category enforce a rule that has an oracle. If the rule is violated, kernel debug infra (CONFIG_DEBUG_ATOMIC_SLEEP, LOCKDEP, KASAN) will splat. The patch either makes the splat go away or it doesn’t; there is no ambiguity.

1013 (dedicated kmem_caches) is structural infrastructure, not a correctness fix. It is always-on (cacheline alignment, isolation, visibility) and does not change observable semantics; the cost was a single missed NULL guard caught and fixed in 1014a.

Patches in the code-review hypothesis category encode a reviewer’s argument that some code is buggy. There is no oracle. If the reviewer’s argument is wrong (or the bug is somewhere else), the patch ships new bugs without fixing the original one.

The rolled-back Codex 1007-1011 series

On 2026-04-26 there was an unfound EVENT_SET_PI slab UAF (___slab_alloc+0x316 GP-fault, ntsync_obj_ioctl+0x44e). KASAN was queued but not yet run. Codex’s review surfaced three “other issues” (cross-snapshot PI, non-exclusive RECV, channel-accept-in-setup_wait), and patches 1007-1011 (5 patches in 6 hours, including a 34KB rewrite) landed under the rationale that “(1) ∧ (2) explains the hang.”

That rationale was a theory, not a measured trace. The actual cause of the unfound slab UAF was what patch 1006 later fixed: a kfree under raw_spinlock_t in channel_register/deregister_thread. None of the rolled-back series' hypotheses was correct about the original symptom. Worse, the series introduced a new UAF (the CHANNEL_REPLY UAF that 1009 ultimately fixed) that only existed because channels had been added at all.

All of 1007-1011 were rolled back. The proper sequence was then:

  1. First, KASAN-clean the alloc/free sites under raw_spinlock_t (the actual bug). That became patch 1006.
  2. Then, with KASAN usable as an oracle, run the stress tests. Each splat or hang is a real bug, not slab dust.
  3. One bug per patch, surgical, with the test that found it as the validation gate. 1007 / 1008 / 1009 each fix exactly one KASAN- or test-confirmed bug.

Operating principle

When chasing an unidentified bug, narrow on the actual symptom (trace / KASAN / ftrace / repro) – do not pile speculative fixes from adjacent code review under the cover of “while I was in there, I noticed…”. Even when the audit is internally well-reasoned, the issues it surfaces are almost certainly unrelated to the observed symptom – and landing them piles new failure modes on top of the original one.

Independent CRIT findings can still be filed as separate tickets/patches, but they should not ship until the original symptom is understood. At minimum: do not ship them on the same day, on top of an unfound bug, in the same module.

A small surface area that is clearly correct in isolation (e.g. a refcount discipline patch with a real KASAN trace) can ship – but only after asking: “is this fixing damage I caused with adjacent work, or real upstream-relevant correctness?” 1009 was the latter.

This is also why 1006 was safe to ship in-flurry while the rolled-back 1007-1011 series wasn't: 1006 has an oracle (CONFIG_DEBUG_ATOMIC_SLEEP); the rolled-back series had only Codex's argument.


17. References

Patches (NSPA tree)

All in wine-rt-claude/ntsync-patches/:

Production source

Wine consumer

Tests

In wine/nspa/tests/:

Cross-references