Wine-NSPA – NTSync PI Kernel

This page documents the Wine-NSPA ntsync kernel overlay that backs PI waits, gamma channels, and aggregate-wait. The companion Wine-side half lives on NTSync Userspace Sync.

Table of Contents

  1. Overview
  2. Object Types
  3. Priority-inheritance baseline
  4. Channel object
  5. Thread-token pass-through
  6. RT alloc-hoist
  7. Exclusive receive wakeup fix
  8. Deferred event boost
  9. Channel-entry lifetime fix
  10. Aggregate-wait and burst drain
  11. Receive snapshot fix
  12. Dedicated slab caches
  13. Lockless SEND_PI target scan
  14. Wait-queue dedicated cache
  15. Validation
  16. Audit notes
  17. References

1. Overview

NTSync is a Linux kernel driver (drivers/misc/ntsync.c, /dev/ntsync) that implements Windows NT synchronization primitives – mutexes, semaphores, and events – directly in the kernel. Upstream Wine uses it to replace the wineserver-mediated sync path for these objects, eliminating cross-process round-trips for wait/wake operations.

For Wine-NSPA, upstream ntsync is necessary but insufficient. The upstream driver uses FIFO waiter queues, has no priority inheritance, and uses spinlock_t for the per-object lock – which becomes a sleeping rt_mutex on PREEMPT_RT. None of those characteristics is acceptable for an RT audio workload where the audio callback must wait deterministically on Wine’s primitives without inheriting unbounded inversion latency.

Wine-NSPA carries a kernel overlay that extends upstream ntsync.c in three broad layers: a priority-inheritance baseline (raw locks, priority-ordered waiter queues, mutex owner boosting), a channel-based request/reply transport with its extensions (thread-token pass-through, aggregate-wait, burst drain), and a series of RT-safety and hardening fixes (alloc-hoist, UAF closures, dedicated slab caches).

The current overlay on kernel 6.19.11-rt1-1-nspa includes the dedicated wait-queue cache plus SLAB_NO_MERGE across all four ntsync caches (see Section 14). The feature-by-feature detail below keeps the patch numbers for traceability, but the public reading order is by capability rather than by patch label.

This doc is the design and implementation reference for that kernel half: what each carried feature adds, what bug it closes, how it preserves NT semantics, and how it interacts with obj_lock and PREEMPT_RT.

NSPA overlay relationship

Wine-NSPA does not fork ntsync. The patches are diffs against upstream drivers/misc/ntsync.c and apply cleanly in series 1003 -> 1004 -> 1005 -> 1006 -> 1007 -> 1008 -> 1009 -> 1010 -> 1011 -> 1012 -> 1013 -> 1014 -> 1015. They live in wine-rt-claude/ntsync-patches/ as standalone unified diffs. The kernel build (linux-nspa) applies the stack at PKGBUILD time; the resulting .ko ships as part of the kernel package.

The patch numbering (1003- through 1015-) is local to NSPA. It bears no relationship to upstream NTSync revisions or any LKML series.

Feature map at a glance

Patch Feature Purpose ~LOC
1003 PI primitives raw_spinlock obj_lock, priority-ordered waiter queues, mutex owner PI boost, per-task tracking ~600
1004 Channel object New NTSYNC_TYPE_CHANNEL with CREATE, SEND_PI, RECV, REPLY ioctls ~530
1005 Thread-token Per-channel (tid -> token) registry + RECV2 ioctl, eliminates dispatcher userspace lookup ~340
1006 RT alloc-hoist Hoists 6 sites of kmalloc/kfree out of raw_spinlock_t (RT-illegal); pi_work pool ~750
1007 Channel exclusive recv wake_up_all priority-inversion fix: 3-LOC wait_event_interruptible_exclusive swap ~3
1008 EVENT_SET_PI deferred boost Closes fast-path race where consumer takes obj_lock first, sees signaled, returns unboosted ~80
1009 channel_entry refcount UAF KASAN-caught REPLY-vs-SEND_PI cleanup race; refcount_t on ntsync_channel_entry ~15
1010 Aggregate-wait NTSYNC_IOC_AGGREGATE_WAIT: heterogeneous object+fd wait, channel notify-only support ~400
1011 Channel TRY_RECV2 NTSYNC_IOC_CHANNEL_TRY_RECV2: non-blocking RECV2 for post-dispatch burst drain ~30
1012 Channel recv field-snapshot UAF fix Snapshot popped-entry fields under obj_lock before unlock, closes RECV/RECV2 vs sender-cleanup slab UAF ~15
1013 Dedicated kmem_caches ntsync_event_pi / ntsync_channel_entry / ntsync_pi_owner -> own kmem_caches with SLAB_HWCACHE_ALIGN ~120
1014 SEND_PI lockless target scan list_empty_careful fast-path skips wq->lock round-trip on empty waiter queues ~10
1014a kmem_cache_free NULL guard Site-2089 pending_pi.new_ep free is NULL-guarded; closes cache_from_obj deref under SLAB_FREELIST_HARDENED ~3
1015 Wait-queue dedicated cache struct ntsync_q -> own kmem_cache (≤16 entries + kmalloc fallback); SLAB_NO_MERGE retro-correction across all 4 ntsync caches ~120

Patches 1003-1006, 1010, 1011, 1013, and 1015 are feature/infrastructure work; 1007-1009, 1012, 1014, and 1014a are minimal surgical fixes for specific KASAN- or trace-confirmed bugs (1014 is also a measurable hot-path reduction in the audio SEND_PI path). The distinction matters: Section 16 discusses why.

[Figure: NTSync in Wine-NSPA – object families, PI paths, and patch layering. Wine callers (Win32 waits via WaitForSingleObject / WaitForMultipleObjects; the gamma dispatcher's AGG_WAIT -> RECV2 / TRY_RECV2 / REPLY loop) enter /dev/ntsync (drivers/misc/ntsync.c) through a shared ioctl entry. Per-object raw_spinlock obj->lock, the device rt_mutex wait_all_lock, and boosted-owner tracking via sched_setattr_nocheck sit on the PREEMPT_RT substrate (raw_spinlock_t stays raw; rt_mutex gives PI on wait_all_lock). The patch stack layers the 1003 PI baseline (mutex/semaphore/event), the 1004-1011 channel features (transport, RECV2 + thread-token, exclusive recv, TRY_RECV2 burst drain), the 1007-1009 surgical fixes found on KASAN/debug kernels, and the 1012-1015 carries: snapshot, slab caches, lockless scan, wait-q cache.]

2. Object Types

Wine-NSPA’s ntsync exposes four object types via /dev/ntsync (one character device opened once per Wine process; object creation returns FDs).

Type Win32 primitive Created via Wait via Wake / signal via
Mutex CreateMutex, WaitForSingleObject NTSYNC_IOC_CREATE_MUTEX NTSYNC_IOC_WAIT_ANY / WAIT_ALL NTSYNC_IOC_MUTEX_UNLOCK
Semaphore CreateSemaphore, ReleaseSemaphore NTSYNC_IOC_CREATE_SEM NTSYNC_IOC_WAIT_ANY / WAIT_ALL NTSYNC_IOC_SEM_RELEASE
Event CreateEvent, SetEvent, ResetEvent NTSYNC_IOC_CREATE_EVENT NTSYNC_IOC_WAIT_ANY / WAIT_ALL NTSYNC_IOC_EVENT_SET / _RESET / _PULSE / _SET_PI
Channel (no Win32 equivalent – NSPA-private IPC) NTSYNC_IOC_CREATE_CHANNEL NTSYNC_IOC_CHANNEL_RECV / _RECV2 / _TRY_RECV2 NTSYNC_IOC_CHANNEL_SEND_PI / _REPLY

Mutex / semaphore / event are upstream concepts; their semantics map 1:1 to Win32. The mutex tracks an owner TID for WAIT_ABANDONED semantics and abandoned-recovery; the semaphore is a counted resource pool; the event has both manual-reset and auto-reset variants plus the NSPA-private EVENT_SET_PI for cross-thread priority intent.

The channel is wholly NSPA-private. It does not map to any Win32 primitive. It is a transport for Wine-NSPA’s wineserver request-reply fast path – a kernel-mediated alternative to the legacy futex+manual-sched_setscheduler shm IPC. Channels do not participate in generic WAIT_ANY / WAIT_ALL; they are accessed through their own ioctls, and patch 1010 adds a separate aggregate-wait registration path that can observe channel readiness without consuming the entry. On 1011 kernels the current consumer shape is aggregate-wait, then CHANNEL_RECV2, then TRY_RECV2 until the ready queue is empty.

is_signaled by type

The driver’s central is_signaled() predicate (called from try_wake_any / try_wake_all) returns differently per type:

Type Signaled when
Mutex count == 0 (unowned) or owner matches current TID
Semaphore count > 0
Event signaled == true
Channel always false (channels never wake WAIT_ANY/ALL)

The channel case in is_signaled() is a deliberate hard-false: any caller that arrives via WAIT_ANY/ALL with a channel FD is misusing the API and the wait will time out. That remains true after 1010. The aggregate-wait path is different: it registers the channel as a notify-only source and returns “channel fired” to userspace, after which userspace follows with CHANNEL_RECV2 to consume the actual entry.
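In sketch form (a condensed view of the dispatch; the mutex arm follows the table above, and exact field names may differ from the production source):

static bool is_signaled(struct ntsync_obj *obj, __u32 owner)
{
    lockdep_assert_held(&obj->lock);

    switch (obj->type) {
    case NTSYNC_TYPE_SEM:
        return obj->u.sem.count > 0;
    case NTSYNC_TYPE_MUTEX:
        /* unowned, or a recursive acquire by the current owner */
        return obj->u.mutex.count == 0 || obj->u.mutex.owner == owner;
    case NTSYNC_TYPE_EVENT:
        return obj->u.event.signaled;
    case NTSYNC_TYPE_CHANNEL:
        return false;   /* deliberate: WAIT_ANY/WAIT_ALL never fires */
    }
    return false;
}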


3. Priority-inheritance baseline

The 1003 patch (originally three logical patches 1001/1002/1003, collapsed in this section for clarity) established the RT baseline that all subsequent patches build on.

Locking hierarchy

The driver has three locks. NSPA classifies them explicitly for PREEMPT_RT:

raw_spinlock_t obj->lock          per-object, protects state + waiter lists
rt_mutex       dev->wait_all_lock device-wide, serializes wait-all setup
raw_spinlock_t dev->boost_lock    device-wide, protects boosted_owners list

raw_spinlock_t keeps true spin semantics on PREEMPT_RT (does not become an rt_mutex). obj->lock is held only across short pointer-only state updates: rb-tree manipulation, list manipulation, signaled-flag flip, owner-TID write. dev->boost_lock is held only across boosted_owners list updates plus a single sched_setattr_nocheck() call. Both critical sections are short, bounded, and never sleep – the PREEMPT_RT contract.

dev->wait_all_lock is rt_mutex, not raw_spinlock_t, because wait-all setup is long: it walks all named objects to be waited on, may copy_from_user the FD array, and may need to take per-object locks. A raw spinlock is the wrong primitive for that. The rt_mutex carries PI – a high-priority thread blocked on wait_all_lock boosts whoever holds it.

The obj_lock() fast path acquires only obj->lock. When obj->dev_locked is set (another thread is doing a wait-all on this object), obj_lock() falls back to acquiring wait_all_lock first. This avoids ABBA deadlocks between per-object and device-wide locks.
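A minimal sketch of that fallback shape (the retry loop is condensed; the production helper also handles the wait-all unlock ordering):

static void obj_lock(struct ntsync_obj *obj)
{
    struct ntsync_device *dev = obj->dev;

    for (;;) {
        raw_spin_lock(&obj->lock);
        if (likely(!obj->dev_locked))
            return;                  /* fast path: obj->lock only */

        /*
         * A wait-all owns the device.  Drop obj->lock and block on
         * the PI-carrying rt_mutex instead; dev_locked is only ever
         * set with wait_all_lock held, so once we acquire and release
         * it the flag is clear and the retry succeeds.
         */
        raw_spin_unlock(&obj->lock);
        rt_mutex_lock(&dev->wait_all_lock);
        rt_mutex_unlock(&dev->wait_all_lock);
    }
}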

Priority-ordered waiter queues

Upstream ntsync uses list_add_tail() to append waiters: FIFO order. NSPA replaces this with ntsync_insert_waiter(), which performs a sorted insertion based on the kernel-internal task->prio (lower numeric value = higher scheduling priority).

static void ntsync_insert_waiter(struct ntsync_q_entry *new_entry,
                                 struct list_head *head)
{
    struct ntsync_q_entry *entry;

    /* Sorted insert: walk until the first lower-priority waiter
     * (higher numeric task->prio) and insert in front of it. */
    list_for_each_entry(entry, head, node) {
        if (new_entry->q->task->prio < entry->q->task->prio) {
            list_add_tail(&new_entry->node, &entry->node);
            return;
        }
    }
    /* No lower-priority waiter found: append at the tail, which
     * preserves FIFO order among equal-priority waiters. */
    list_add_tail(&new_entry->node, head);
}

Same-priority waiters maintain FIFO order within their priority level. try_wake_any_*() walks from the head, so the highest-priority satisfiable waiter wakes first. This restores NT semantics (highest-priority waiter wins) and is strictly stronger than upstream’s FIFO.

Mutex owner PI boost

When an RT thread (e.g. SCHED_FIFO prio 80) waits on a mutex held by a SCHED_OTHER thread (prio 120 in kernel terms), the holder is preempted by every running RT thread and time-sliced by CFS against every other normal thread. The RT waiter’s bounded-latency guarantee is violated.

ntsync_pi_recalc(obj, pi_work) (line 424 of the production source) handles this. Whenever a mutex’s wait list changes (insert, wake, unlock) it scans both any_waiters and all_waiters for the highest-priority waiter, then boosts the owner’s scheduling attributes via sched_setattr_nocheck() to match. Per-task tracking (struct ntsync_pi_owner, anchored in dev->boosted_owners) saves the original attributes once and counts how many of the task’s owned mutexes are contributing boosts. Restore happens only when the count drops to zero.
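A compressed sketch of that scan (tracker bookkeeping via the pi_work argument is elided, the prio-to-sched_attr mapping is simplified, and the list and field names are taken from the surrounding text):

static void ntsync_pi_recalc(struct ntsync_obj *mutex, struct ntsync_pi_work *w)
{
    struct task_struct *owner = mutex->u.mutex.owner_task;
    struct ntsync_q_entry *e;
    int top = MAX_PRIO;                 /* lower value = higher priority */

    /* caller holds obj->lock; scan BOTH lists (the v2 WaitAll lesson) */
    list_for_each_entry(e, &mutex->any_waiters, node)
        top = min(top, e->q->task->prio);
    list_for_each_entry(e, &mutex->all_waiters, node)
        top = min(top, e->q->task->prio);

    if (owner && top < owner->prio) {
        struct sched_attr attr = {
            .sched_policy   = SCHED_FIFO,
            .sched_priority = MAX_RT_PRIO - 1 - top,
        };
        /* record/refresh the ntsync_pi_owner tracker via w, then boost */
        sched_setattr_nocheck(owner, &attr);
    }
}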

The PI boost design has three v2 lessons baked in:

Bug v1 behaviour v2 fix
Multi-object PI corruption Single global orig_attr overwritten when 2nd mutex boosted Per-task ntsync_pi_owner with boost_count
Zero PI for WaitAll all_waiters not scanned Scan both any_waiters and all_waiters
Stale normal_prio thrash owner->normal_prio mutates after boost -> oscillation Compare against saved orig_normal_prio from tracker

The ntsync_pi_owner struct is the unit of bookkeeping. The pool/cleanup pattern that 1006 introduces (Section 6) is the unit of RT-safe allocation for that struct.

EVENT_SET_PI primitive (pre-1008 design)

EVENT_SET_PI was originally introduced in 1003 as the cross-thread priority-intent primitive: an RT thread sets an event, and along with the signal it carries a (policy, prio) boost that the kernel applies to the event’s first waiter. Wine-NSPA uses this for the audio-thread -> dispatcher SendMessage bypass: the audio callback sets a queue event with its own RT priority, and the dispatcher pthread is woken at that priority.

The original design walked event->any_waiters under obj_lock at EVENT_SET_PI time and applied the boost to the head waiter. This had a fast-path race that 1008 closes – see Section 8.

Per-task tracking, conservative over-boost

ntsync_pi_owner is allocated lazily on first boost and freed only when the last contributing object releases. Between the first removal and the last, the owner is conservatively over-boosted: it runs at too-high priority briefly, never too-low. That is the safe direction; under-boost would leak inversion. The lazy lifetime also means owner_task is resolved lazily on the first unlock (where current is the actual Win32-owning thread), since at create time current is the wineserver, not the eventual owner.
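The tracker itself, in sketch form (field names here are assumptions matching the behaviour described above):

struct ntsync_pi_owner {
    struct list_head    node;             /* on dev->boosted_owners, or a
                                           * pi_work pool/free list */
    struct task_struct *task;             /* the boosted owner */
    int                 boost_count;      /* owned objects contributing */
    struct sched_attr   orig_attr;        /* saved once on first boost */
    int                 orig_normal_prio; /* stable compare target */
};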


4. Channel object

1004-ntsync-channel.patch adds a new object type, NTSYNC_TYPE_CHANNEL. A channel is a bounded, kernel-side priority-ordered request/reply mailbox. It exists to replace Wine-NSPA’s user-space futex + manual sched_setscheduler shm-IPC fast path between client processes and the wineserver.

Why a kernel object

Wine’s wineserver protocol is fundamentally a request/reply RPC. Each client thread sends a request, blocks for the reply, and resumes. The legacy fast path used a process-shared futex on a request slot plus a sched_setscheduler call from the sending audio thread to lift the dispatcher pthread’s priority. That worked but had three problems:

  1. Priority transfer was a separate syscall. The audio thread had to know which pthread it was lifting and call sched_setscheduler on it explicitly, and the target could go stale on thread death – a race with no clean recovery.
  2. No priority queueing. When two senders raced, the futex woke one of them in roughly FIFO order; a higher-priority sender could wait behind a lower-priority one if the dispatcher was idle.
  3. No transactional priority drain. If the dispatcher returned without replying (signal, error path) the audio-thread-applied boost had no clear cleanup hook.

A kernel-mediated channel solves all three. The kernel:

  1. carries the priority transfer inside SEND_PI itself – the sender's (policy, prio) rides with the entry, and the kernel boosts the receiving dispatcher with no separate syscall;
  2. priority-orders pending requests in an rb-tree keyed (prio DESC, seq ASC), so a higher-priority sender is always popped first;
  3. drains the receiver boost transactionally – RECV applies it, and the next RECV or the REPLY releases it, so error paths have a defined cleanup hook.

The channel is purely a transport, not a protocol. The wineserver still drives the request/reply contract; the kernel multiplexes and priority-orders, and never reorders within a single sender (each sender blocks for reply, so per-thread ordering is preserved).

API

Four ioctls, all on a channel FD obtained via NTSYNC_IOC_CREATE_CHANNEL:

ioctl Caller Effect
NTSYNC_IOC_CREATE_CHANNEL wineserver Create channel with max_depth. Returns FD.
NTSYNC_IOC_CHANNEL_SEND_PI client thread Enqueue (prio, payload_off, reply_off); boost recv'er; sleep for reply.
NTSYNC_IOC_CHANNEL_RECV dispatcher pthread Pop highest-prio entry; auto-boost current to that priority.
NTSYNC_IOC_CHANNEL_REPLY dispatcher pthread Wake the sender of entry_id; drain receiver boost.

The payload_off and reply_off fields are opaque to the kernel; conventionally they are indices into a per-process shared-memory region the client and wineserver both map. The kernel transports the cookies; user space interprets them.

That is the 1004 base interface. The current production surface layers 1005’s CHANNEL_RECV2 on top for thread-token return, then 1011’s CHANNEL_TRY_RECV2 for non-blocking post-dispatch drain.

Internal state

The channel object’s per-instance state lives in obj->u.channel:

struct {
    struct rb_root  pending;     /* PENDING entries (prio DESC, seq ASC) */
    struct list_head dispatched; /* DISPATCHED entries (REPLY can find by id) */
    atomic64_t  next_id;
    atomic64_t  next_seq;
    __u32       depth;           /* current PENDING count */
    __u32       max_depth;
    wait_queue_head_t recv_wq;   /* blocked receivers */
    struct hlist_head thread_regs[64]; /* added by 1005 */
} channel;

Each entry is a struct ntsync_channel_entry:

struct ntsync_channel_entry {
    struct rb_node      rb;       /* in pending rb-tree */
    struct list_head    list;     /* in dispatched list */
    __u64               id, seq;
    __u32               prio, policy;
    __u64               payload_off, reply_off;
    __u32               sender_tid;
    enum  ntsync_channel_state state;  /* PENDING | DISPATCHED */
    bool                replied;
    wait_queue_head_t   wq;       /* sender sleeps on this */
    __u64               thread_token;  /* added by 1005 */
    refcount_t          refcnt;        /* added by 1009 */
};

The rb-tree key is (prio DESC, seq ASC): higher priority sorts first; ties break by enqueue order. channel_pending_insert() returns true iff the entry became the new tree minimum – i.e. it would be popped next. That return value drives the speculative-boost decision in SEND_PI.
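A sketch of that insert (the helper name matches the text; the rb-tree plumbing is the stock <linux/rbtree.h> idiom):

static bool channel_pending_insert(struct rb_root *root,
                                   struct ntsync_channel_entry *e)
{
    struct rb_node **p = &root->rb_node, *parent = NULL;
    bool leftmost = true;

    while (*p) {
        struct ntsync_channel_entry *cur =
            rb_entry(*p, struct ntsync_channel_entry, rb);

        parent = *p;
        if (e->prio > cur->prio ||
            (e->prio == cur->prio && e->seq < cur->seq)) {
            p = &(*p)->rb_left;      /* sorts earlier: higher prio wins */
        } else {
            p = &(*p)->rb_right;
            leftmost = false;
        }
    }
    rb_link_node(&e->rb, parent, p);
    rb_insert_color(&e->rb, root);
    return leftmost;                 /* true iff e is the next RECV pop */
}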

[Figure: channel object lifecycle – request ordering, dispatch, reply, and cleanup. SEND_PI inserts (payload_off, reply_off, prio) into the PENDING rb-tree keyed (prio DESC, seq ASC), bumps depth, and may speculatively boost and exclusively wake the head receiver (1007). RECV/RECV2 pops the tree minimum, marks it DISPATCHED, appends it to the dispatched list, decrements depth, and auto-boosts the handler to the entry's priority (1005's RECV2 adds the thread_token lookup). REPLY finds the entry by entry_id on the dispatched list, wakes the sender, and drains the handler boost; the 1009 refcount keeps the entry alive across the wake, and the entry is freed only at the final refcount drop. 1011's TRY_RECV2 lets the dispatcher drain further ready entries after each REPLY before sleeping again. Priority ordering is in-kernel; per-sender order stays serial because each SEND_PI blocks for REPLY.]

SEND_PI flow

  1. Validate (policy, prio). Pre-allocate e and new_ep (the boost tracking entry) with GFP_KERNEL outside any lock – slab on RT cannot be called under raw_spinlock_t.
  2. obj_lock(ch). Reject with -EAGAIN if depth >= max_depth. Insert into pending rb-tree; bump depth. Note whether this entry is the new minimum.
  3. obj_unlock(ch).
  4. If the new entry is the minimum and prio is set, peek the recv_wq head. Take a get_task_struct reference under wq->lock, then call apply_event_pi_boost() to boost that receiver to (policy, prio).
  5. wake_up(&ch->recv_wq) – wakes exactly the head receiver (1007 made this exclusive).
  6. Sleep on e->wq until e->replied is true or signal pending.
  7. On wake: obj_lock(ch), detach e from whichever list/tree it’s on, obj_unlock(ch). Drop refcount_dec_and_test(&e->refcnt); kfree if last ref (1009).

The cleanup path covers the case where the sender was interrupted (signal). The entry might still be PENDING (rb-tree) or DISPATCHED (list); we use e->state to dispatch correctly. depth is decremented only in the PENDING branch – DISPATCHED entries no longer count against max_depth.
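In sketch form, with the enum constant names assumed:

obj_lock(ch);
if (e->state == NTSYNC_CHANNEL_PENDING) {
    rb_erase(&e->rb, &ch->u.channel.pending);
    ch->u.channel.depth--;        /* only PENDING counts toward max_depth */
} else {                          /* DISPATCHED: on the dispatched list */
    list_del(&e->list);
}
obj_unlock(ch);
if (refcount_dec_and_test(&e->refcnt))   /* 1009: REPLY may hold a ref */
    kfree(e);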

RECV / RECV2 flow

  1. drain_event_pi_boosts(dev, current) – release any boost left over from a prior RECV cycle.
  2. Pre-allocate new_ep outside lock.
  3. obj_lock(ch). While pending is empty: obj_unlock, wait_event_interruptible_exclusive(recv_wq, !empty) (1007 made this exclusive), obj_lock again.
  4. Pop the rb-tree minimum; mark DISPATCHED; append to dispatched list; decrement depth.
  5. (1005 only, in RECV2:) e->thread_token = channel_lookup_token(ch, e->sender_tid). See Section 5.
  6. obj_unlock(ch).
  7. If e->prio, auto-boost current to (e->policy, e->prio) for the handler duration via apply_event_pi_boost(dev, current, ...). Boost releases at next RECV’s drain, or at REPLY’s drain.
  8. Copy (entry_id, payload_off, reply_off, sender_tid, prio[, thread_token]) to user space.

In the post-1011 dispatcher path, userspace follows the first successful RECV2 with TRY_RECV2 after each reply until the channel returns empty.

REPLY flow

  1. obj_lock(ch). Walk dispatched list for entry_id. If not found or already replied: -ENOENT.
  2. Set e->replied = true.
  3. refcount_inc(&e->refcnt) (1009 – keep the entry alive across wake_up_all).
  4. obj_unlock(ch).
  5. wake_up_all(&e->wq) – wakes the blocked sender. Outside obj_lock because wq’s internal lock is spinlock_t (becomes rt_mutex on PREEMPT_RT) and cannot nest under our raw_spinlock_t.
  6. drain_event_pi_boosts(dev, current) – handler is done, drop the receiver’s auto-boost.
  7. refcount_dec_and_test(&e->refcnt); kfree if last ref (1009).

Memory ordering

Kernel ioctl syscall entry/exit is a full memory barrier. So payload visibility from sender -> receiver and reply visibility from receiver -> sender is naturally serialised: the sender’s copy_from_user of the payload completed before SEND_PI returns from the syscall handler; the receiver’s copy_to_user happens-before RECV returns; the receiver’s writes to the reply region happen-before REPLY returns; the sender’s copy_from_user of the reply happens-after SEND_PI’s wake.

NT semantics

The kernel does not promise ordering across senders – it priority-orders, but a SCHED_OTHER sender behind a SCHED_FIFO sender will wait. Cross-thread ordering was never guaranteed under the prior per-thread dispatcher pthread shape, so this is strictly stronger semantically (no thread can starve while a higher-prio thread is waiting). Within a single sender, ordering is preserved: each SEND_PI blocks for reply, so back-to-back sends from the same TID are serialised.

Hot-path bound

obj_lock sections in SEND_PI / RECV / REPLY are bounded by tree height. With max_depth = 1024, that is 10 rb-tree comparisons. Zero allocation under lock. No memory copies under lock (the copy_to_user happens after obj_unlock).

Diagnostics: depth and channel emptiness

A channel can only be freed when both pending and dispatched are empty; otherwise senders or dispatchers still hold the file open via the syscall ref. ntsync_free_obj() WARN_ONs either non-empty list at free time – a useful canary if user space ever leaks a channel FD with active entries.


5. Thread-token pass-through

Once the channel was in production, a perf capture (2026-04-26) showed ~10% of dispatcher CPU sitting in a userspace get_thread_from_id() lookup inside the gamma dispatcher's hot loop. Every received request needed to map sender_tid -> struct thread * to dispatch. This patch eliminates that lookup by returning a wineserver-supplied opaque token with each entry at RECV2 time.

Mechanism

The wineserver registers (tid, token) per channel via a new ioctl. The kernel stores the mapping in a 64-bucket hash on the channel (hlist_head thread_regs[64], keyed by tid & 63, protected by the existing obj_lock). At RECV2 time the kernel looks up the token for e->sender_tid and returns it in extended args.

struct ntsync_channel_recv2_args {
    __u64 entry_id;
    __u64 payload_off;
    __u64 reply_off;
    __u32 sender_tid;
    __u32 prio;
    __u64 thread_token;  /* OUT: registered token (0 if unregistered) */
};

Two new ioctls:

ioctl Effect
NTSYNC_IOC_CHANNEL_REGISTER_THREAD Install or replace (tid, token)
NTSYNC_IOC_CHANNEL_DEREGISTER_THREAD Evict entry for tid (idempotent)

Plus NTSYNC_IOC_CHANNEL_RECV2 – same as RECV but returns an extra thread_token field. The older RECV ioctl still exists in the UAPI, but current Wine-NSPA userspace requires RECV2 and no longer ships the old fallback ladder.

v2 design: lookup at RECV2, not SEND_PI

The first version of this patch did the hash lookup in SEND_PI and stamped thread_token onto the entry there. v2 moved the lookup to RECV2. Two reasons:

  1. Audio-thread cost. The audio thread is the one paying the SEND_PI critical-section cost. Moving the lookup to RECV2 puts the cost on the dispatcher pthread instead – which is fine, the dispatcher is not deadline-bound.
  2. Stale-token correctness. A token snapshotted at SEND_PI could go stale if the sender died and the wineserver deregistered before the dispatcher RECV’d. RECV2-time lookup reflects current registration: a deregistered TID returns token = 0, and userspace falls back to get_thread_from_id (which will fail on a dead TID, and the request gets dropped by the existing logic).

The hash bucket count is fixed at 64 (no resize, no rhashtable). For a typical Wine process with dozens to a few hundred threads, that gives single-digit average chain lengths – well under the rb-tree key comparison cost in SEND_PI/RECV.
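A sketch of the lookup (the ntsync_thread_reg type is named in the free path below; its fields and the helper shape are assumptions consistent with the description above):

struct ntsync_thread_reg {
    struct hlist_node node;          /* in ch->u.channel.thread_regs[] */
    __u32 tid;
    __u64 token;
};

static __u64 channel_lookup_token(struct ntsync_obj *ch, __u32 tid)
{
    struct ntsync_thread_reg *reg;

    /* caller holds obj_lock; 64 buckets keyed by tid & 63 */
    hlist_for_each_entry(reg, &ch->u.channel.thread_regs[tid & 63], node)
        if (reg->tid == tid)
            return reg->token;
    return 0;                        /* unregistered: userspace falls back */
}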

Lifetime invariants

The wineserver enforces:

  1. Register before first send: a thread's (tid, token) pair is installed before that thread issues its first SEND_PI.
  2. Deregister only at thread death: the registration is evicted only once the thread can no longer send.

Together these ensure RECV2 always sees a non-zero token for a still-live thread. A momentarily-zero token (if registration races a fast first send) yields a userspace fallback that completes correctly – it is only a perf regression, not a correctness one.

channel_drain_thread_regs() on free

When a channel is freed, any leftover (tid, token) registrations are dropped. By construction the channel is unreachable at ntsync_free_obj() time (no senders, no dispatchers can have an FD), so no concurrent access is possible – a single pass through the buckets, kfreeing each ntsync_thread_reg.

Current runtime expectation

Old RECV entries still carry thread_token = 0 (initialized in kzalloc), so older consumers can continue using the legacy shape if they exist. Current Wine-NSPA userspace, however, assumes RECV2/TRY_RECV2 and resolves sender threads from the returned token on the normal path.


6. RT alloc-hoist

This is a safety patch, not a feature: it fixes six sites in the driver where slab kzalloc/kfree was being called under raw_spinlock_t on PREEMPT_RT – which is illegal. The bug was latent until 2026-04-26, when an Ableton workload hard-froze the host with a clean kernel oops.

The kernel oops

After installing the first thread-token ntsync.ko build, Ableton hard-froze the host 13 minutes into a session:

BUG: kernel NULL pointer dereference, address: 0x9a
RIP: ___slab_alloc+0x316  (xor (%rbx,%rdx,1),%rax  RBX=0x3a)
Call: __kmalloc_cache_noprof <- ntsync_obj_ioctl+0x427 [ntsync]
Comm: Ableton Web Con      PREEMPT_{RT,(lazy)}

Classic SLUB freelist corruption.

Root cause

obj->lock and dev->boost_lock are both raw_spinlock_t. On PREEMPT_RT, SLUB’s per-CPU fast path uses local_lock_t, which is spinlock_t – a sleeping lock under PREEMPT_RT (confirmed in include/linux/local_lock_internal.h). So kzalloc / kfree under any raw_spinlock_t is unsafe on RT, including GFP_ATOMIC (the GFP flag gates reclaim, not the local_lock).

This is a mechanically verifiable rule: CONFIG_DEBUG_ATOMIC_SLEEP will splat any sleeping function called from a non-sleepable context. The bug was not caught by that infrastructure only because the production kernel ships without it for performance reasons; the rule itself is unambiguous.

Six sites in ntsync.c violated this rule:

# Function Line Issue
1 ntsync_pi_recalc 345 kzalloc(GFP_ATOMIC) under raw
2 ntsync_pi_recalc 409 kfree under boost_lock
3 ntsync_pi_recalc 417 kfree under caller’s obj->lock
4 ntsync_pi_drop 441 kfree under boost_lock
5 ntsync_channel_register_thread 1614 kfree under obj_lock
6 ntsync_channel_deregister_thread 1639 kfree under obj_lock

Sites 1-4 had been latent since the 1003 PI patch landed; 5-6 were new in 1005 (thread-token registration). The Ableton lockup was almost certainly triggered by 5 or 6: T2 thread-token registration is always-on when channel + kernel support are present, and Ableton boot creates dozens of threads -> dozens of register/deregister calls -> poisoned freelist 13 minutes in. Sites 1-4 had likely also caused several previous unexplained host lockups in the earlier msg-ring, paint-cache, and instrumentation-related lockup series.

The pi_work pool/cleanup pattern

The fix introduces a stack-resident struct ntsync_pi_work that the caller pre-allocates and finishes outside any raw lock:

struct ntsync_pi_work {
    struct list_head new_po_pool;     /* pre-allocated; consumed on demand */
    struct list_head to_free_list;    /* removed entries to free post-unlock */
};

Four helpers:

void ntsync_pi_work_init(w);                  /* INIT_LIST_HEAD x2 */
void ntsync_pi_work_prealloc(w);              /* kzalloc + list_add to pool, OUTSIDE locks */
struct ntsync_pi_owner *ntsync_pi_work_take_new(w); /* pointer-only list_del under raw */
void ntsync_pi_work_finish(w);                /* kfree pool leftovers + to_free_list */

Lifecycle of a pi_owner via this struct:

kzalloc -> list_add to new_po_pool                  (caller, no lock)
consumed: list_del from pool, list_add to dev list  (pi_recalc, raw)
removed: list_move from dev list to to_free_list    (pi_recalc/_drop, raw)
kfree from new_po_pool + to_free_list               (caller, no lock)

Empty pool is a non-fatal fallback: pi_recalc skips the boost (transient priority inversion until next op), matching the prior GFP_ATOMIC behaviour. The hot path stays one slab op per ioctl – just hoisted past the lock, so no extra latency.
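A sketch of the helper bodies (the list-link field name is an assumption; 1013 later converts these kzalloc/kfree pairs to the dedicated cache):

static void ntsync_pi_work_prealloc(struct ntsync_pi_work *w)
{
    struct ntsync_pi_owner *po = kzalloc(sizeof(*po), GFP_KERNEL);

    if (po)                          /* empty pool is a tolerated fallback */
        list_add(&po->node, &w->new_po_pool);
}

static struct ntsync_pi_owner *ntsync_pi_work_take_new(struct ntsync_pi_work *w)
{
    struct ntsync_pi_owner *po;

    if (list_empty(&w->new_po_pool))
        return NULL;                 /* pi_recalc skips the boost */
    po = list_first_entry(&w->new_po_pool, struct ntsync_pi_owner, node);
    list_del_init(&po->node);        /* pointer-only: raw-lock safe */
    return po;
}

static void ntsync_pi_work_finish(struct ntsync_pi_work *w)
{
    struct ntsync_pi_owner *po, *tmp;

    /* runs outside all raw locks, so slab is legal again */
    list_for_each_entry_safe(po, tmp, &w->new_po_pool, node)
        kfree(po);
    list_for_each_entry_safe(po, tmp, &w->to_free_list, node)
        kfree(po);
}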

Caller pattern

Every ioctl entry that may invoke pi_recalc / pi_drop declares one of these on stack:

struct ntsync_pi_work pi_work;
ntsync_pi_work_init(&pi_work);
ntsync_pi_work_prealloc(&pi_work);

/* ... acquire raw locks, possibly call pi_recalc/pi_drop ... */
/* ... release all raw locks ... */

ntsync_pi_work_finish(&pi_work);

This pattern shows up in try_wake_any, try_wake_all_obj, release_mutex, wait_any, wait_all, event_set_pi, and several other entry points. Sites 5-6 (channel register/deregister) use a simpler local victim pointer pattern – a single removal per call doesn’t justify the pool.

NT semantics preserved exactly

Only observable difference: ntsync_pi_owner cleanup deferred by tens of nanoseconds past raw_spin_unlock. Mutex ownership transfers atomically with wake (cmpxchg unchanged). PI boost levels and stacking semantics unchanged. Channel priority ordering (DESC, seq ASC) unchanged. Token registration replace-or-insert unchanged. Wait-any/all wakeup ordering unchanged.

Why this fix mattered for everything that came after

1006 is a prerequisite for honest stress-testing of the channel path. Without it, every register/deregister churn in a stress test was rolling SLUB freelist dice. With it, KASAN under PREEMPT_RT became a useful tool: any splat is a real bug, not slab dust. That is what made 1009 (the channel_entry refcount UAF) catchable.

Open RT/safety items deferred from 1006

obj_lock() between prepare_to_wait and schedule in ntsync_channel_send_pi: rt_mutex_lock inside obj_lock would clobber TASK_INTERRUPTIBLE state if obj->dev_locked were set. Latent only – channels never participate in wait_all so dev_locked is never set on channels. Safe today; tighten when convenient.


7. Exclusive receive wakeup fix

Bug: ntsync_channel_send_pi speculatively boosts recv_wq.head to the sender’s priority before wake_up(), but wake_up() was waking all non-exclusive waiters because wait_event_interruptible adds non-exclusive waiters by default. Non-head receivers could win the entry-pop race -> the boosted head was stranded with high priority and no work; the winner had low priority and the actual work. A real production priority inversion.

This was the plausible root cause of unexplained gamma-dispatcher lockups previously (and incorrectly) blamed on userspace patches.

Three lines

-  ret = wait_event_interruptible(ch->u.channel.recv_wq,
+  /* Exclusive wait: wake_up() in SEND_PI walks the recv_wq and
+   * stops at the first exclusive waiter.  This makes the head
+   * (which SEND_PI speculatively boosted) the unique winner of
+   * the entry-pop race -- closes the priority-inversion window
+   * where a non-head receiver could pop the entry while the
+   * boosted head got stranded with high prio and no work. */
+  ret = wait_event_interruptible_exclusive(ch->u.channel.recv_wq,
        !RB_EMPTY_ROOT(&ch->u.channel.pending));

Applied in both ntsync_channel_recv and ntsync_channel_recv2.

Why this works

wake_up() is already exclusive-aware: it walks the wait queue and stops at the first exclusive waiter. So once both RECV and RECV2 register exclusive waiters, SEND_PI’s wake_up() wakes exactly the head – the boost target. The boost target becomes the unique race winner.

wait_event_interruptible_exclusive is a kernel primitive; it takes the wait queue lock, sets the waiter’s WQ_FLAG_EXCLUSIVE flag, and otherwise behaves identically to the non-exclusive variant. No new behaviour introduced; we just opted into the existing semantics.

Validation

Why this is the minimal correct fix

The rolled-back “Codex 1007-1011” patch series (Section 10) had attempted a much larger redesign of the channel path, including channel-rejection in setup_wait, cross-snapshot PI cleanup, and a pool/cleanup refactor of the channel allocations themselves. None of that was needed. Three lines suffice.


8. Deferred event boost

Bug: the original EVENT_SET_PI design (Section 3) walked event->any_waiters under obj_lock at signal time and applied the boost to the head waiter. This missed any consumer that took obj_lock first, saw signaled=true and returned without queueing – the standard wait fast-path. Result: ~4% of EVENT_SET_PI calls under PREEMPT_RT debug-kernel scheduling silently failed to apply the boost. A real RT-correctness hole.

The race

Thread A (consumer, fast path)         Thread B (signaler, EVENT_SET_PI)
obj_lock(event)
if (signaled) {                        kzalloc(new_ep)
   /* signaled=false set later */
   fast-path return (NO QUEUE)
}
obj_unlock(event)
                                       obj_lock(event)
                                       walk any_waiters: EMPTY
                                       target = NULL
                                       signaled = true
                                       obj_unlock(event)
                                       kfree(new_ep)  /* dropped! */

The signaler sets the event but has no target to boost; the consumer returns from wait_any having seen the signal but unboosted. The boost was lost.

This was hard to spot because most EVENT_SET_PI calls under PREEMPT_RT scheduling do find a queued waiter (the consumer hadn’t reached obj_lock yet). Only the fast-path race – consumer arrives just before signaler – silently dropped the boost. KASAN debug-kernel testing showed it as a ~4% flake rate on the test-event-set-pi test.

Redesign: stage on event, consume at wait-return

The fix flips ownership of the boost target. Instead of the signaler finding the target at EVENT_SET_PI time, the consumer applies the boost to itself at wait-return.

New per-event state in the event union:

struct {
    u32 policy;
    u32 prio;
    struct ntsync_event_pi *new_ep;   /* pre-allocated; consumer takes ownership */
} pending_pi;

Mechanism in five steps:

  1. Pre-allocate tracking entry outside any lock (slab on RT).
  2. Stage (policy, prio, new_ep) on the event under obj_lock; ALSO set signaled=true and wake any queued waiter.
  3. The first task to consume the signal – whether queued and woken, or fast-path (already-signaled) – applies the staged boost to itself via consume_event_pi_boost() at wait-return. This is race-free: the consumer is by definition the task whose wait_any/wait_all returned with this event as the signaled obj.
  4. Last-writer-wins if EVENT_SET_PI is called twice without an intervening consumption – the earlier staged new_ep is freed after obj_lock is released (RT-safe).
  5. EVENT_RESET clears the staging (signal cancelled, boost too).

Plus a 6th rule: ntsync_free_obj frees any leaked staging entry on object death (no leak if the event dies unconsumed).

consume_event_pi_boost()

Called from wait_any unqueue loop on the signaled obj if it is an event:

static void consume_event_pi_boost(struct ntsync_obj *event)
{
    struct ntsync_event_pi *new_ep = NULL;
    u32 policy = 0, prio = 0;
    bool valid = false, all;

    if (event->type != NTSYNC_TYPE_EVENT)
        return;

    all = ntsync_lock_obj(event->dev, event);
    if (event->u.event.pending_pi.new_ep) {
        new_ep = event->u.event.pending_pi.new_ep;
        policy = event->u.event.pending_pi.policy;
        prio   = event->u.event.pending_pi.prio;
        event->u.event.pending_pi.new_ep = NULL;
        valid = true;
    }
    ntsync_unlock_obj(event->dev, event, all);

    if (valid) {
        if (!apply_event_pi_boost(event->dev, current,
                                   policy, prio, new_ep))
            kfree(new_ep);
    }
}

The atomic capture-and-clear under obj_lock is the one-shot guarantee: the first consumer wins, subsequent consumers see new_ep == NULL and no-op. If EVENT_SET_PI is called again before consumption, the prior new_ep is freed under the same lock and replaced.

EVENT_SET_PI itself, simplified

The new ntsync_event_set_pi:

new_ep = kzalloc(sizeof(*new_ep), GFP_KERNEL);
if (!new_ep) return -ENOMEM;

ntsync_pi_work_init(&pi_work);
ntsync_pi_work_prealloc(&pi_work);

all = ntsync_lock_obj(dev, event);

/* Stage the boost.  Last-writer-wins. */
prior_new_ep = event->u.event.pending_pi.new_ep;
event->u.event.pending_pi.policy = args.policy;
event->u.event.pending_pi.prio   = args.prio;
event->u.event.pending_pi.new_ep = new_ep;

/* Signal: identical to EVENT_SET. */
event->u.event.signaled = true;
if (all)
    try_wake_all_obj(dev, event, &pi_work);
try_wake_any_event(event);

ntsync_unlock_obj(dev, event, all);
ntsync_pi_work_finish(&pi_work);

/* Free overwritten prior staging outside lock (slab on RT). */
kfree(prior_new_ep);

No more target = list_first_entry(...) walk under obj_lock. No more get_task_struct(target) ref management. The signaler just sets the event; whoever consumes it boosts themselves.

EVENT_RESET hook

Resetting the event cancels the signal, so it must cancel any pending boost too:

prior_new_ep = event->u.event.pending_pi.new_ep;
event->u.event.pending_pi.new_ep = NULL;
ntsync_unlock_obj(dev, event, all);
kfree(prior_new_ep);

ntsync_free_obj hook

If the event dies unconsumed, free the staging entry:

if (obj->type == NTSYNC_TYPE_EVENT)
    kfree(obj->u.event.pending_pi.new_ep);

wait_any consumer hook

Inside the wait_any unqueue loop, after the obj is unlocked but before put_obj:

if ((int)i == signaled && obj->type == NTSYNC_TYPE_EVENT)
    consume_event_pi_boost(obj);

The signaled index identifies which obj actually woke this wait. We consume only on that obj – non-signaled objs in a multi-object wait have nothing to apply.

wait_all TODO

ntsync_wait_all cannot call consume_event_pi_boost because that helper takes the obj’s wait-all lock path (via ntsync_lock_obj), and the unqueue loop already holds wait_all_lock. The audio-callback path uses wait_any so this gap is rare in practice; revisit if cross-event boost across wait_all becomes a workload concern. Comment in source:

/* NSPA: TODO -- wait_all consumer hook for EVENT_SET_PI deferred
 * boost.  Cannot call consume_event_pi_boost here because it
 * takes obj's wait-all lock path and we already hold
 * wait_all_lock.  Audio-callback path uses wait_any (handled in
 * the wait_any unqueue), so this is rare in practice; revisit
 * if cross-event boost becomes a workload concern. */

Validation

Cost

One extra atomic exchange under obj_lock per EVENT_SET_PI (the pending_pi store + signal flip). One extra obj_lock/obj_unlock per consume. The latter is the only new path; it runs only if the event has staged PI – so on workloads that don’t use EVENT_SET_PI it is a no-op (pending_pi.new_ep == NULL check is one load).


9. Channel-entry lifetime fix

Bug: KASAN-caught slab-use-after-free on ntsync_channel_entry under test-channel-stress 4x4 with thread-registration churn. REPLY’s wake_up_all on e->wq runs outside obj_lock (it must – wq’s internal lock is spinlock_t, becomes rt_mutex on PREEMPT_RT, can’t nest under our raw_spinlock_t). That creates a window where SEND_PI’s cleanup could kfree(e) between REPLY’s obj_unlock and REPLY’s wake_up_all reaching the freed wait queue.

The KASAN splat

BUG: KASAN: slab-use-after-free in do_raw_spin_lock+0x23c/0x270
Read of size 4 at addr ffff8882e30b2564 by task test-channel-st/51072

Call: __wake_up -> ntsync_obj_ioctl+0x8d5 [ntsync]

Allocated by task 51069: __kasan_kmalloc -> ntsync_obj_ioctl+0x941
Freed by task 51069:     kfree         -> ntsync_obj_ioctl+0x3e3c

Cache: kmalloc-256 (256-byte object), 248 bytes used.
Address is 100 bytes inside freed region.

Disassembly maps the faulting read to the spinlock embedded in e->wq: __wake_up taking the wait-queue lock of an already-freed entry, 100 bytes into the 256-byte slab object.

The race

Thread A (SEND_PI sleeper)              Thread B (REPLY)
                                        obj_lock(ch)
                                        find e in dispatched
                                        e->replied = true
                                        obj_unlock(ch)
loop iter: prepare_to_wait
loop iter: obj_lock(ch)
loop iter: e->replied is true, break
finish_wait
obj_lock(ch); list_del(&e->list);
obj_unlock(ch)
kfree(e)                                 wake_up_all(&e->wq)  <-- UAF

The wake_up_all outside obj_lock is necessary on PREEMPT_RT (wq’s internal lock cannot be taken under raw_spinlock obj_lock). But that creates the window where SEND_PI’s cleanup can free e between REPLY’s obj_unlock and REPLY’s wake_up_all.

The fix: refcount_t on channel_entry

Add refcount_t refcnt to struct ntsync_channel_entry. SEND_PI initializes it to 1 after queue insertion (the sleeping sender holds one ref). REPLY does refcount_inc under obj_lock before unlock, then wake_up_all, then refcount_dec_and_test+kfree-if-last. SEND_PI cleanup does refcount_dec_and_test+kfree-if-last. Whichever decrement reaches 0 frees.

Code addition is ~15 LOC:

struct ntsync_channel_entry {
    ...
    refcount_t refcnt;
};

/* In SEND_PI, after successful queue insertion: */
refcount_set(&e->refcnt, 1);  /* sleeper holds 1; REPLY will inc */

/* In SEND_PI cleanup, replacing kfree(e): */
if (refcount_dec_and_test(&e->refcnt))
    kfree(e);

/* In REPLY: set replied and take a ref under obj_lock, then unlock,
 * wake, and drop the ref: */
e->replied = true;
refcount_inc(&e->refcnt);
obj_unlock(ch);
wake_up_all(&e->wq);
drain_event_pi_boosts(ch->dev, current);
if (refcount_dec_and_test(&e->refcnt))
    kfree(e);

Why this is the minimal correct fix

There was a previous “Codex 1007-1011” patch series (rolled back; see Section 10) that targeted this same bug class but bundled it with a number of unrelated audit-derived changes (REPLY-fake-on-copy-fail, channel-reject in setup_wait, cross-boost cleanup refactor). The core fix – refcount on the entry – was correct in that series. Everything else was speculative noise that introduced its own bugs.

This patch is just the refcount.

Validation

Why this is the right shape, not the wrong shape

A common alternative for this class of bug is to take a sleepable lock around the wake. We can’t – the obj_lock that protects entry membership is raw_spinlock_t, and we cannot promote it to rt_mutex without losing the bounded-CS guarantee that the rest of the driver depends on. Refcount on the entry is the textbook fix for “object outlives its containing-collection lifetime due to async finishers” – no lock-order changes, no protocol changes, just two incs and three dec_and_tests in the right places.


10. Aggregate-wait and burst drain

Patch 1010 adds the heterogeneous wait primitive that the rest of the NSPA stack had been designing around: NTSYNC_IOC_AGGREGATE_WAIT.

The immediate consumer is the post-1010 gamma dispatcher. Instead of blocking in direct CHANNEL_RECV2 forever, the dispatcher can wait on regular NTSync objects (events, semaphores, mutexes), the channel itself as a notify-only source, and external pollable FDs (e.g. a uring eventfd) in one syscall, while still keeping channel PI visible.

[Figure: 1010 aggregate-wait, the dispatcher-facing kernel surface. NTSYNC_IOC_AGGREGATE_WAIT copies the source array, registers object waits plus poll waits over the source types (NTSync objects – events, semaphores, mutexes; the channel as a notify-only registration; FD sources such as a uring eventfd, with fd-poll/timer wake sources as future work), sleeps once, and returns fired_index + fired_events, or a timeout sentinel on deadline expiry. Load-bearing follow-ups: the installed build also carries a SEND_PI any-waiters fallback, and 1011 layers TRY_RECV2 burst drain on top of this wait surface.]

UAPI shape


struct ntsync_aggregate_source {
    __u32 type;
    __u32 events;
    __u64 handle_or_fd;
};

struct ntsync_aggregate_wait_args {
    __u32 nb_sources;
    __u32 reserved;
    __u64 sources;
    struct __kernel_timespec deadline;
    __u32 fired_index;
    __u32 fired_events;
    __u32 flags;
    __u32 owner;
};
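A userspace sketch of the dispatcher-side call; the NTSYNC_AGG_SOURCE_* constants and the device-FD target are assumptions, not confirmed UAPI names:

/* one wait over an NTSync event, a channel (notify-only), and a
 * pollable FD; constants assumed, real names live in the UAPI header */
static int dispatcher_wait(int device_fd, int event_fd, int channel_fd,
                           int uring_eventfd, struct __kernel_timespec dl)
{
    struct ntsync_aggregate_source srcs[3] = {
        { .type = NTSYNC_AGG_SOURCE_OBJ,     .handle_or_fd = event_fd      },
        { .type = NTSYNC_AGG_SOURCE_CHANNEL, .handle_or_fd = channel_fd    },
        { .type = NTSYNC_AGG_SOURCE_FD,      .events = POLLIN,
                                             .handle_or_fd = uring_eventfd },
    };
    struct ntsync_aggregate_wait_args args = {
        .nb_sources = 3,
        .sources    = (__u64)(uintptr_t)srcs,
        .deadline   = dl,              /* absolute deadline */
    };

    if (ioctl(device_fd, NTSYNC_IOC_AGGREGATE_WAIT, &args) < 0)
        return -1;
    /* a channel source is notify-only: if it fired, the caller must
     * follow with CHANNEL_RECV2 to consume the actual entry */
    return (int)args.fired_index;
}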

Why it is architecturally different from WAIT_ANY

WAIT_ANY consumes object state when it fires – it decrements the semaphore, claims the mutex, or clears an auto-reset event – and channels are a hard false in is_signaled(), so they can never participate. AGGREGATE_WAIT registers heterogeneous sources instead: NTSync objects, the channel as a notify-only source, and plain pollable FDs. It reports which source fired without consuming a channel entry; userspace follows with CHANNEL_RECV2 to consume.

Validation surface

1010 was not treated as a paper design or a future placeholder; it was validated with a dedicated native aggregate-wait suite.

The first result was the post-1009 base plus aggregate-wait and its PI-ordering follow-ups. The next overlay added burst drain on top, and the current overlay keeps both surfaces while adding the later hardening and cache-isolation work.

10.1 Burst drain with CHANNEL_TRY_RECV2

1011 adds NTSYNC_IOC_CHANNEL_TRY_RECV2, a non-blocking companion to CHANNEL_RECV2. It does not replace aggregate-wait; it is the follow-on that lets a woken dispatcher keep draining the ready list without paying one more AGG_WAIT round-trip per queued entry.

That is a small kernel change, but it is exactly the shape the gamma dispatcher needs under bursty server-bound RPC load: one aggregate-wait wakeup, then a TRY_RECV2 drain loop that pops every queued entry before the dispatcher sleeps again.

The ioctl is additive at the kernel interface level, but current Wine-NSPA userspace assumes it is present. The old sticky fallback ladder was retired once aggregate-wait became the project baseline.
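A dispatcher-side sketch of that shape (handle_request and send_reply are placeholders standing in for the wineserver request handler and the REPLY ioctl):

struct ntsync_channel_recv2_args req;

/* woken by AGGREGATE_WAIT: pop the first entry, blocking variant */
if (ioctl(channel_fd, NTSYNC_IOC_CHANNEL_RECV2, &req) == 0) {
    do {
        handle_request(&req);                 /* runs boosted to req.prio */
        send_reply(channel_fd, req.entry_id); /* NTSYNC_IOC_CHANNEL_REPLY */
        /* non-blocking: pops the next ready entry or fails if empty */
    } while (ioctl(channel_fd, NTSYNC_IOC_CHANNEL_TRY_RECV2, &req) == 0);
}
/* queue empty: go back to NTSYNC_IOC_AGGREGATE_WAIT */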


11. Receive snapshot fix

Bug: in ntsync_obj_ioctl paths for NTSYNC_IOC_CHANNEL_RECV and NTSYNC_IOC_CHANNEL_RECV2, the receiver popped a channel_entry *e under obj_lock, then unlocked, then read e->fields (payload_off, sender_tid, etc). Between the unlock and the field reads the sender thread – parked in wait_event_interruptible – could be signal-interrupted, run its cleanup path, and kfree(e) in that window. The receiver then read freed memory. KASAN reproducibly caught this on test-channel-stress and on real Ableton workloads.

The lock-drop is mandatory for the rest of the RECV/RECV2 path: copy_to_user, the apply_event_pi_boost call, and the receiver auto-boost cannot run under the raw_spinlock_t obj_lock. The fix therefore narrows what the post-unlock path needs to read off e.

Fix: snapshot relevant fields under obj_lock before unlocking

/* Pre-1012 (broken): */
spin_lock(&obj->obj_lock);
e = list_first_entry_or_null(&channel->pending, ...);
if (!e) { spin_unlock(...); return -EAGAIN; }
list_del(&e->link);
spin_unlock(&obj->obj_lock);
/* RACE: sender cleanup can kfree(e) here */
args.payload_off = e->payload_off;      /* reads freed slab memory */
args.reply_off   = e->reply_off;
args.sender_tid  = e->sender_tid;
copy_to_user(buf, &args, sizeof(args));

/* Post-1012 (fixed): */
spin_lock(&obj->obj_lock);
e = list_first_entry_or_null(&channel->pending, ...);
if (!e) { spin_unlock(...); return -EAGAIN; }
list_del(&e->link);
/* SNAPSHOT the scalar fields under the lock */
args.payload_off = e->payload_off;
args.reply_off   = e->reply_off;
args.sender_tid  = e->sender_tid;
spin_unlock(&obj->obj_lock);
copy_to_user(buf, &args, sizeof(args));

Why snapshot, not refcount

The 1009 fix used a refcount_t on the entry to keep it alive across REPLY’s wake_up_all. 1012 does not. Refcount would have worked here too, but it would have added two atomic ops (atomic_inc on entry, atomic_dec_and_test on exit) to every channel RECV / RECV2 on the audio dispatcher’s critical chain. Snapshotting collapses the lock-drop window to zero rather than extending the entry’s lifetime, costs zero atomics, and keeps the recv hot path one cacheline narrower. The snapshotted fields are small and well-bounded (a few words).

Coverage

A new test-channel-try-recv2-stress.c was added in the same change as a gap-filler for patch 1011: TRY_RECV2 had no dedicated stress test until then.


12. Dedicated slab caches

Pre-1013, three ntsync allocation classes lived in the system kmalloc pool: struct ntsync_event_pi (120B, kmalloc-128), struct ntsync_pi_owner (120B, kmalloc-128), and struct ntsync_channel_entry (192B, kmalloc-192).

That is functionally correct but architecturally weak for an RT-class hot path: two 120B objects in kmalloc-128 sit back-to-back so the tail of one and the head of the next can share a cacheline; an ntsync object can neighbour a network struct or fs metadata in the same kmalloc bucket; /proc/slabinfo lumps everything into kmalloc-128; kmem_cache_shrink, SLAB_FREELIST_HARDENED, and SLAB_HWCACHE_ALIGN cannot be applied to a subset of kmalloc-128.

Implementation

Three dedicated caches, each sized exactly to the struct, each with SLAB_HWCACHE_ALIGN:


ntsync_event_pi_cache = kmem_cache_create("ntsync_event_pi",
        sizeof(struct ntsync_event_pi),
        0, SLAB_HWCACHE_ALIGN, NULL);

ntsync_channel_entry_cache = kmem_cache_create("ntsync_channel_entry",
        sizeof(struct ntsync_channel_entry),
        0, SLAB_HWCACHE_ALIGN, NULL);

ntsync_pi_owner_cache = kmem_cache_create("ntsync_pi_owner",
        sizeof(struct ntsync_pi_owner),
        0, SLAB_HWCACHE_ALIGN, NULL);

All kzalloc / kfree callsites for the three structs are converted to kmem_cache_alloc / kmem_cache_free. The conversion is mechanical except for one subtle gotcha (see Section 13’s 1014a follow-up).

Init / exit lifecycle

Caches are constructed in ntsync_init in dependency order with mirrored unwind labels, and destroyed in ntsync_exit in reverse order.
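In sketch form, assuming upstream's misc-device registration (label names hypothetical; 1015 later adds the fourth cache and SLAB_NO_MERGE):

static int __init ntsync_init(void)
{
    int ret = -ENOMEM;

    ntsync_event_pi_cache = kmem_cache_create("ntsync_event_pi",
            sizeof(struct ntsync_event_pi), 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!ntsync_event_pi_cache)
        return -ENOMEM;

    ntsync_channel_entry_cache = kmem_cache_create("ntsync_channel_entry",
            sizeof(struct ntsync_channel_entry), 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!ntsync_channel_entry_cache)
        goto err_event_pi;

    ntsync_pi_owner_cache = kmem_cache_create("ntsync_pi_owner",
            sizeof(struct ntsync_pi_owner), 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!ntsync_pi_owner_cache)
        goto err_channel_entry;

    ret = misc_register(&ntsync_misc);
    if (ret)
        goto err_pi_owner;
    return 0;

err_pi_owner:
    kmem_cache_destroy(ntsync_pi_owner_cache);
err_channel_entry:
    kmem_cache_destroy(ntsync_channel_entry_cache);
err_event_pi:
    kmem_cache_destroy(ntsync_event_pi_cache);
    return ret;
}

static void __exit ntsync_exit(void)
{
    misc_deregister(&ntsync_misc);
    kmem_cache_destroy(ntsync_pi_owner_cache);      /* reverse order */
    kmem_cache_destroy(ntsync_channel_entry_cache);
    kmem_cache_destroy(ntsync_event_pi_cache);
}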

Structural value (always-on)

Each struct now has its own /proc/slabinfo line and /sys/kernel/slab entry; no ntsync object shares a cacheline or a slab page with foreign kernel allocations; and cache-level controls (kmem_cache_shrink, SLAB_HWCACHE_ALIGN, hardening flags) apply per-struct instead of to all of kmalloc-128.

Workload absorption (slabinfo, drum-load capture 2026-05-04)

cache idle drum-load delta size pre-1013 home
ntsync_event_pi 637 795 +158 120B kmalloc-128
ntsync_pi_owner 637 795 +158 120B kmalloc-128
ntsync_channel_entry 168 168 0 192B kmalloc-192
kmalloc-128 (system) 2240 2240 0 128B n/a

158 new event-PI staging pairs (one event_pi + one paired pi_owner) absorbed cleanly in the dedicated caches; kmalloc-128 stayed flat – isolation under real load. SLUB internal state moved in the expected direction: partial slabs filled (8 -> 2), per-CPU slabs went up (18 -> 24), matching “hot path picks up CPU-local allocations”.

Independence from 1014

1013 has no dependency on 1014 and vice versa; the patches are separately revertable.


13. Lockless SEND_PI target scan

Motivation

In ntsync_channel_send_pi, before staging the boost on a target waiter, the code scans the channel's wait_queue_head_t to pick a target. Pre-1014 that scan acquired wq->lock via spin_lock_irqsave – the wait-queue's own internal lock, distinct from obj_lock – even when the queue was empty. The empty case is the common one for an audio dispatcher under steady load: most SEND_PI fires hit a channel with no parked waiters. That is a wasted lock round-trip on the audio thread's hot path.

Implementation

Replace the unconditional lock+scan with a list_empty_careful peek first:

/* Pre-1014: */
spin_lock_irqsave(&wq->lock, flags);
list_for_each_entry(...) { ... }
spin_unlock_irqrestore(&wq->lock, flags);

/* Post-1014: */
if (list_empty_careful(&wq->head)) {
    /* fall through to any_waiters fallback path; no lock taken */
    goto no_target;
}
spin_lock_irqsave(&wq->lock, flags);
/* same as before */
spin_unlock_irqrestore(&wq->lock, flags);

Correctness

list_empty_careful() reads both head->next and head->prev, so it never misreads a half-completed concurrent add as a stable empty queue. If a receiver enqueues concurrently with the peek, the worst case is that this SEND_PI skips the speculative boost and takes the fallback path – the same outcome as if the sender had run first – and the subsequent wake_up() still reaches the new waiter, so no wakeup is lost.

RT-safety

Removes the wq->lock acquisition and its lock round-trip from the audio thread's SEND_PI hot path in the common (empty-queue) case – a measurable critical-path reduction in the path that matters most for audio jitter.

1014a: kmem_cache_free is not NULL-safe

The 1013 conversion left one kfree-style site for obj->u.event.pending_pi.new_ep un-NULL-guarded in ntsync_free_obj. The diff comment claimed kmem_cache_free is NULL-safe like kfree. The kernel source disagrees:

mm/slub.c:6900 (Linux 6.19.11):


void kmem_cache_free(struct kmem_cache *s, void *x)
{
    s = cache_from_obj(s, x);
    ...
}

static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
{
    if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
        !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
        return s;
    cachep = virt_to_cache(x);  /* DEREFS x */
    ...
}

Under SLAB_FREELIST_HARDENED (debug kernel: enabled) the short-circuit fails and virt_to_cache(NULL) runs, dereferencing offset 0x8 of NULL. kfree() early-outs on ZERO_OR_NULL_PTR(x); kmem_cache_free does not – the asymmetry the diff comment got wrong.

The crash signature on the debug kernel was a page fault at kmem_cache_free+0x5c with RAX = vmemmap[0] (the struct page for NULL) and CR2 = 0x...0008 (the slab->slab_cache deref at offset 8) – exact match.

Fix

if (obj->type == NTSYNC_TYPE_EVENT && obj->u.event.pending_pi.new_ep)
    kmem_cache_free(ntsync_event_pi_cache,
                    obj->u.event.pending_pi.new_ep);

The pattern matches the existing explicit-guard sites the same conversion already used at lines 1306, 1336, 1604, 1740, 1829, 1916. Site 2089 was simply missed in the original 1013 audit.

Audit summary that found 1014a

A four-dimension audit covering the entire post-1014 file surfaced the missing guard at site 2089.

[Figure: post-1011 carries – the 1012 + 1013 + 1014 + 1014a surface. 1012 (channel recv snapshot): the RECV/RECV2 path snapshots payload_off, reply_off, and sender_tid under obj_lock before unlocking, closing the cross-thread slab UAF. 1013 (dedicated kmem_caches): ntsync_event_pi (120B), ntsync_channel_entry (192B), and ntsync_pi_owner (120B) with SLAB_HWCACHE_ALIGN for isolation. 1014 (lockless target scan): SEND_PI peeks with list_empty_careful before taking wq->lock, skipping the lock round-trip on the common empty-queue case. 1014a: kmem_cache_free is not NULL-safe under SLAB_FREELIST_HARDENED (cache_from_obj derefs its operand), so an explicit NULL guard was added at site 2089 (the obj->u.event.pending_pi.new_ep free in ntsync_free_obj). The post-1014a build with dedicated caches and the lockless SEND_PI scan ran KASAN-clean over ~14M ops and has been on prod kernel 6.19.11-rt1-1-nspa since 2026-05-04.]

14. Wait-queue dedicated cache

Motivation

A post-1014a audit (2026-05-05) of the live driver enumerated all remaining kmalloc / kzalloc sites in ntsync.c. Six sites total; only two on the audio dispatcher hot path, both for the per-ioctl wait queue (struct ntsync_q) allocated by setup_wait and ntsync_aggregate_setup. Every WAIT_ANY / WAIT_ALL / AGGREGATE_WAIT ioctl pays one kmalloc(struct_size(...)) then one kfree.

Site (post-1014a line) Path Status
1974 ntsync_thread_reg per channel thread-register COLD – skip
2160 ntsync_obj per CreateEvent/Mutex/Sem COOL – marginal
2383 wait queue q per WAIT_ANY/WAIT_ALL ioctl HOT – target
2829 wait queue q (agg) per AGGREGATE_WAIT ioctl HOT – target
2840 fds array per wait-with-FDs (var-count) not eligible
3092 ntsync_device per chardev open COLD – skip

struct ntsync_q is the only HOT kmalloc class that survived 1013.

Variable-size design

struct ntsync_q has a flexible-array member entries[] whose count is total_count, ranging from 1 (the typical audio worker) up to NTSYNC_MAX_WAIT_COUNT+1 = 65 (the NtWaitForMultipleObjects cap) or NTSYNC_AGG_MAX = 64 (aggregate). Three sizing options were considered; the shipped design (a) caches the common small case: a fixed slot covering up to 16 entries, with a kmalloc fallback for anything larger.

16 entries comfortably covers the typical 1-8 audio wait depth; larger waits keep the kmalloc path with no regression. Slot size with SLAB_HWCACHE_ALIGN is 704B on x86_64 (32B header + 16 × 40B entries = 672B, rounded up to the next 64-byte cacheline).

Allocator routing

A bool from_cache field is added to struct ntsync_q, placed in the existing 2-byte trailing pad after bool ownerdead so sizeof(struct ntsync_q) is unchanged. Set by ntsync_alloc_q, read by ntsync_free_q:


static struct ntsync_q *ntsync_alloc_q(__u32 total_count)
{
    struct ntsync_q *q;

    if (total_count <= NTSYNC_Q_CACHE_MAX_ENTRIES) {
        q = kmem_cache_alloc(ntsync_wait_q_cache, GFP_KERNEL);
        if (q)
            q->from_cache = true;
    } else {
        q = kmalloc(struct_size(q, entries, total_count), GFP_KERNEL);
        if (q)
            q->from_cache = false;
    }
    return q;
}

static void ntsync_free_q(struct ntsync_q *q)
{
    if (!q)
        return;
    if (q->from_cache)
        kmem_cache_free(ntsync_wait_q_cache, q);
    else
        kfree(q);
}

ntsync_free_q is NULL-safe by design (early return). kmem_cache_free is not NULL-safe under SLAB_FREELIST_HARDENED (the 1014a lesson); centralising the guard in the wrapper makes per-site audit trivial. Two alloc-site conversions plus six free-site conversions complete the WAIT_* / AGGREGATE_WAIT path.

PREEMPT_RT discipline (1006 alloc-hoist invariant)

Both kmem_cache_alloc(..., GFP_KERNEL) and kmalloc(..., GFP_KERNEL) are sleep-prone (they may direct-reclaim). R1 from ntsync-rt-audit.md forbids sleeping operations under raw_spinlock_t. Verified at every call site:

Site Context when ntsync_alloc_q / ntsync_free_q runs
setup_wait 2383 Top of function, no locks held
ntsync_aggregate_setup 2829 Top of function, no locks held
setup_wait err 2421 Error cleanup, no locks held
ntsync_wait_any 2573 After unqueue and ntsync_pi_work_finish
ntsync_wait_all 2746 After wait_all_lock unlock and ntsync_pi_work_finish
ntsync_aggregate_setup err 2842 fds-alloc fail, no locks held
ntsync_aggregate_setup err 2891 Partial-init fail, no locks held
ntsync_aggregate_wait 3083 After unqueue and ntsync_pi_work_finish

The 1006 alloc-hoist invariant is preserved end-to-end.
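A minimal sketch of the discipline the table verifies, with hypothetical function and field names (not the actual ntsync.c sites): all GFP_KERNEL work sits strictly outside the raw_spinlock_t critical sections.

/* Illustrative shape of a WAIT_* path under the 1006 invariant.
 * kmem_cache_alloc(GFP_KERNEL) and kmalloc(GFP_KERNEL) may sleep,
 * so the q allocation precedes any obj_lock acquisition and the
 * free follows the final unlock. */
static int example_wait_path(struct ntsync_obj *obj, __u32 count)
{
    unsigned long flags;
    struct ntsync_q *q;

    q = ntsync_alloc_q(count);              /* may sleep: no locks held */
    if (!q)
        return -ENOMEM;

    raw_spin_lock_irqsave(&obj->lock, flags);
    /* ... publish q into the waiter queue; no alloc/free in here ... */
    raw_spin_unlock_irqrestore(&obj->lock, flags);

    /* ... sleep until woken, then unqueue under obj_lock ... */

    ntsync_free_q(q);                       /* again: no locks held */
    return 0;
}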

UAF / lifecycle (1012 lesson – N/A)

struct ntsync_q has a task-private lifecycle: allocated by the syscalling task, populated by the same task, published into wait queues under obj_lock, list_del’d under obj_lock during unqueue (mutually exclusive with try_wake_any_*), then freed. No cross-thread free path exists. The 1012 snapshot-vs-refcount lesson does not apply – there is no lock-drop window between mutator and freer.

SLAB_NO_MERGE retro-correction

The original 1015 patch only added SLAB_HWCACHE_ALIGN (mirroring 1013). First boot showed the new cache absent from /proc/slabinfo. /sys/kernel/slab/ revealed why:


ntsync_channel_entry -> :0000192    # merged
ntsync_event_pi      -> :0000128    # merged
ntsync_pi_owner      -> :0000128    # merged
ntsync_wait_q        -> :0000704    # merged (1015 alone)

All four ntsync caches had been merged by SLUB into generic kmalloc-N classes. The 1013 architectural promise of "isolation from kmalloc-128" had therefore not held on the prod kernel at any point since 1013 landed. It did hold on the debug kernel, because SLAB_FREELIST_HARDENED makes caches ineligible for merging: a debug-vs-prod config difference. Section 12's drum-load slabinfo absorption table was therefore debug-kernel evidence; on prod, those allocations had been going into kmalloc-128 the whole time.

Fix: add SLAB_NO_MERGE (available since kernel 6.4; prod runs 6.19) to all four kmem_cache_create calls, bundled into the 1015 patch as a retroactive correction:


ntsync_event_pi_cache      = kmem_cache_create("ntsync_event_pi",      ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
ntsync_channel_entry_cache = kmem_cache_create("ntsync_channel_entry", ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
ntsync_pi_owner_cache      = kmem_cache_create("ntsync_pi_owner",      ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
ntsync_wait_q_cache        = kmem_cache_create("ntsync_wait_q",        ..., SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);

After the fix, /sys/kernel/slab/ntsync_*/ are all real directories; no symlinks, no merging. The 1013 isolation promise holds on prod.
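Spelled out for the new cache, one call looks roughly like the following (the size expression is assumed from the slot math above, not copied from the driver; align = 0 lets SLAB_HWCACHE_ALIGN pick the cacheline rounding, NULL means no constructor):

/* Illustrative full form of the 1015 cache creation (sizes assumed). */
ntsync_wait_q_cache = kmem_cache_create("ntsync_wait_q",
        sizeof(struct ntsync_q) +
        NTSYNC_Q_CACHE_MAX_ENTRIES * sizeof(struct ntsync_q_entry),
        0, SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE, NULL);
if (!ntsync_wait_q_cache)
    return -ENOMEM;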

Workload absorption (Ableton, prod kernel, 30s windows at 1Hz)

cache active_objs (steady-state) kmalloc-N delta during same window
ntsync_wait_q 184 kmalloc-1k delta = 0
ntsync_event_pi 256 covered in dedicated cache
ntsync_channel_entry 168 covered in dedicated cache
ntsync_pi_owner 256 covered in dedicated cache

184 active ntsync_wait_q objects is the steady-state concurrency of Ableton’s worker pool parked in NtWaitForMultipleObjects. Pre-1015 those 184 would have lived in kmalloc-1k; post-1015 they sit in the dedicated cache, with kmalloc-1k flat across the load window – isolation proven on the prod kernel for the first time since 1013.

The active_objs metric is concurrency, not throughput; a per-second alloc-rate proof would need /sys/kernel/slab/ntsync_wait_q/alloc_calls cumulative deltas or a slabtop snapshot pair. Not a gate, just a refinement for future evidence-gathering.

Independence

1015 has no dependency on 1012 / 1013 / 1014; the patches remain separately revertable. The SLAB_NO_MERGE retro-correction is bundled because both edits live in the same kmem_cache_create chain – landing it as a separate 1013a would have meant two patch applications for one logical change.

1015 at a glance

- Scope: ntsync_wait_q cache + SLAB_NO_MERGE on all four ntsync caches.
- Consumers: WAIT_ANY / WAIT_ALL / AGGREGATE_WAIT via setup_wait (q = ntsync_alloc_q(total_count)) and ntsync_aggregate_setup (q = ntsync_alloc_q(nb_obj)); total_count: 1 typical, 16 cache cap, 65 max.
- Routing: ntsync_wait_q 704B slot; ≤16 entries -> cache slot, >16 entries -> kmalloc fallback; routed via q->from_cache (in pad).
- Evidence: Ableton steady-state 184 active objs (worker-pool waits); pre-1015 home kmalloc-1k; kmalloc-1k delta = 0 across load.
- Pre-SLAB_NO_MERGE on the prod kernel: all four caches silently merged into generic kmalloc-N; /sys/kernel/slab/ntsync_* were symlinks to :0000128 / :0000192 / :0000704.
- Cache lineup (all SLAB_HWCACHE_ALIGN | SLAB_NO_MERGE): ntsync_event_pi 128B slot (1013), ntsync_channel_entry 192B slot (1013), ntsync_pi_owner 128B slot (1013), ntsync_wait_q 704B slot (new in 1015; ≤16-entry q + fallback).
- Current build: post-1015 with `SLAB_NO_MERGE` on all four ntsync caches.

15. Validation

Overlay progression

Stage What landed Notes
PI baseline 1003 + 1004 + 1005 + 1006 priority inheritance, channel transport, thread-token return, and RT-safe alloc/free discipline
Channel wake correctness 1007 + 1008 + 1009 exclusive receive wakeup, deferred event boost, and channel-entry lifetime fix
Aggregate-wait 1010 heterogeneous wait over objects plus fds, with channel notify-only support
Burst drain 1011 non-blocking TRY_RECV2 after one aggregate-wait wake
Snapshot + cache hardening 1012 + 1013 + 1014 + 1014a receive snapshot fix, dedicated caches, lockless SEND_PI fast path, and the free-site NULL guard
Wait-queue cache isolation 1015 dedicated wait-queue cache plus SLAB_NO_MERGE across all four ntsync caches

The current module at /lib/modules/6.19.11-rt1-1-nspa/kernel/drivers/misc/ntsync.ko carries the full overlay above.

Stress validation (debug kernel, KASAN-on)

Test Build stage Ops KASAN Result
test-event-set-pi-stress 30s/4x4 deferred-boost fix build 1.5M signaler 0 PASS
test-event-set-pi-stress 60s/8x8 deferred-boost fix build 2.8M sig + 3.4M waiter 0 PASS
test-mutex-pi-stress 30s/8+4mtx deferred-boost fix build 726K acq+rel matched, 632K PI events 0 PASS
test-channel-stress 30s/4x4 deferred-boost fix build KASAN UAF caught at ~30s 1 EXPECTED FAIL (Bug 4 found)
test-channel-stress 30s/4x4 post-channel-entry fix build 819K SEND_PI = 819K REPLY 0 PASS
test-event-set-pi-stress 60s/8x8 post-channel-entry fix build 2.7M sig + 3.5M waiter 0 PASS
test-event-set-pi 20x sanity post-channel-entry fix build 20/20 PASS 0 PASS
test-channel-recv-exclusive 20x post-channel-entry fix build 20/20 PASS 0 PASS
test-mixed-load-stress 5min/13W post-channel-entry fix build ~10.3M ops, all paths 0 PASS
test-aggregate-wait 9/9 aggregate-wait build functional + PI sub-tests n/a PASS
aggregate-wait 1k mixed stress aggregate-wait build 1k iterations 0 PASS
aggregate-wait 30k + native suite aggregate-wait build long stress + full suite 0 PASS
test-channel-stress (post-1012) snapshot + cache-hardening build 1.34M ops (post-1012 KASAN re-soak) 0 PASS
test-channel-try-recv2-stress snapshot + cache-hardening build 2.6M TRY_RECV2 ops 0 PASS
test-mixed-load-stress 300s/13W snapshot + cache-hardening build 5.28M chan SEND/REPLY, 1.99M audio waits, 12.6M REG/DEREG 0 PASS
test-channel-stress 60s/4x4 (1014a) snapshot + cache-hardening build 1.40M SEND/REPLY, 1.40M RECV+RECV2 0 PASS
test-channel-try-recv2-stress 30s snapshot + cache-hardening build 62k SEND, 2.68M attempts, 97.68% EAGAIN 0 PASS

Cumulative debug-kernel: ~30 million operations through post-1009; post-1014a adds another ~14 million ntsync ops (channel SEND_PI hit ~21x more than the post-1012 validation window), zero KASAN splats, zero dmesg matches for BUG/KASAN/Oops/use-after-free/lockdep/warn.

Production validation after aggregate-wait and burst-drain

The aggregate-wait consumer path was validated on the production kernel/userspace pair rather than only in isolation.

This matters because 1010 is load-bearing only when the userspace dispatcher is actually blocked inside it. The build result therefore includes both the syscall itself and the post-1010 wake/boost ordering fixes.

Mixed-load-stress detail

13-thread/300s soak across every ntsync path concurrently against a single dev_fd:

Operation totals:

Path Ops Notes
audio multi-obj waits 8,757,969 100% wake rate
ui EVENT_SET_PI 139,513
ui EVENT_SET / RESET / PULSE 46,506 / 23,181 / 23,324
ui mutex acq=rel 137,297 / 137,297 perfect
chan SEND_PI / REPLY 308,546 / 308,548 perfect after 30 benign races
chan REGISTER / DEREGISTER 730,985 / 365,492
sem release/acquire/read 136,683 / 180,063 / 180,064
wait_all 3-obj acq=rel 71,855 / 71,855 perfect
syscall errors 0
KASAN/KCSAN splats 0
module refcnt post-soak 0

Production-kernel revalidation

After cross-build to the production kernel 6.19.11-rt1-1-nspa (no debug instrumentation, throughput 5x-149x higher than debug):

Layer Run Result Ops Errors
1 native sanity run-rt-suite.sh native 3/3 PASS small 0
1 stress event-set-pi 60s 8x8 PASS ~158M 0
1 stress mutex-pi 30s 8h+4mtx PASS ~12M 0
1 stress channel 30s 4x4 PASS ~52M 0
1 stress mixed-load 300s 13 workers PASS ~145M 0
2 PE matrix nspa_rt_test.exe baseline+rt 32 PASS / 0 FAIL / 0 TIMEOUT n/a 0

Cumulative on the production kernel: a ~370M-op post-channel-entry baseline, followed by the aggregate-wait, burst-drain, receive-snapshot, dedicated-cache-hardening, and wait-queue-cache/full-isolation carries; 0 syscall errors, 0 dmesg splats, refcnt=0 post-soak.

The post-1014a build was also re-validated with the full RT-suite v7 on prod kernel 6.19.11-rt1-1-nspa: 16/16 RT pass + 3/3 native ioctl pass; channel snapshot UAF and kmem_cache_free NULL-deref both closed; dedicated kmem_cache slabinfo evidence captured under real Ableton drum-load (158 new event-PI staging pairs absorbed in the dedicated caches with kmalloc-128 flat). Note: that drum-load slabinfo capture was on the debug kernel; on the prod kernel the 1013 caches had been SLUB-merged into kmalloc-128 the entire time, which is the issue 1015’s SLAB_NO_MERGE retro-correction fixes (Section 14).

Post-1015 validation (prod kernel)

The 1015 build was validated against the same prod kernel 6.19.11-rt1-1-nspa. The native ioctl soak (validate-1015.sh, which exercises both setup_wait and ntsync_aggregate_setup alloc paths through test-mixed-load-stress, test-channel-stress, test-channel-try-recv2-stress, and test-aggregate-wait) was not invoked this round: the prod kernel has no SLAB_FREELIST_HARDENED/KASAN tooling, so the soak’s signal value collapses to functional-only – which Ableton already provides at much higher rate. The actual correctness gate was the four-dimension audit plus the NULL-safe ntsync_free_q wrapper.

Empirical safety: Ableton booted clean both pre- and post-SLAB_NO_MERGE rebuild; audio-path WAIT_ANY ioctls drove the new alloc/free pair constantly with no GP-fault, so from_cache routing is correct in both directions.

Slabinfo absorption (validate-1015-slabinfo-watch.sh, Ableton 30s windows at 1Hz, project loaded, mixed transport activity) is tabulated in Section 14.

The 184 active ntsync_wait_q objects, which would have lived in kmalloc-1k on every prior build, combined with the flat kmalloc-1k row, are the isolation proof on prod. active_objs is concurrency, not throughput; a per-second alloc-rate proof would need /sys/kernel/slab/ntsync_wait_q/alloc_calls deltas (not a gate, just a refinement).

Only PASS/FAIL is authoritative across debug vs production kernels; throughput numbers aren’t directly comparable because the debug-kernel slub_debug=FZPU + kfence + KASAN tax dominates.

Original 1003-era PI metrics (still valid)

The PI contention / priority wakeup ordering / rapid mutex throughput / philosophers tests from the original single-page ntsync doc remain valid. None of the later channel or aggregate-wait carries changed the mutex PI path; the metrics are unchanged:

Metric / Test v4 RT v5 RT Delta
ntsync-d4 RT PI avg 387 ms 270 ms -30.2%
ntsync-d8 RT PI avg 419 ms 201 ms -52.0%
Rapid mutex throughput 232K ops/s 259K ops/s +11.6%
Rapid mutex RT max_wait 54 us 47 us -13.0%
Philosophers RT max_wait 1620 us 865 us -46.6%

Priority wakeup ordering is exact (5 waiters at distinct priorities wake in priority order, both baseline and RT modes, all test runs). PI chain propagation is correct up to depth 12.


16. Audit notes

The patches in this stack divide cleanly into two categories. The boundary matters because it dictates which patches were safe to ship in a flurry and which weren’t.

Mechanically verifiable correctness vs. code-review hypothesis

Patches in the mechanically verifiable category enforce a rule that has an oracle. If the rule is violated, kernel debug infra (CONFIG_DEBUG_ATOMIC_SLEEP, LOCKDEP, KASAN) will splat. The patch either makes the splat go away or it doesn’t; there is no ambiguity.

1013 (dedicated kmem_caches) is structural infrastructure, not a correctness fix. It is always-on (cacheline alignment, isolation, visibility) and does not change observable semantics; the cost was a single missed NULL guard caught and fixed in 1014a.

Patches in the code-review hypothesis category encode a reviewer’s argument that some code is buggy. There is no oracle. If the reviewer’s argument is wrong (or the bug is somewhere else), the patch ships new bugs without fixing the original one.

The rolled-back Codex 1007-1011 series

On 2026-04-26 there was an unfound EVENT_SET_PI slab UAF (___slab_alloc+0x316 GP-fault, ntsync_obj_ioctl+0x44e). KASAN was queued but not yet run. Codex’s review surfaced three “other issues” (cross-snapshot PI, non-exclusive RECV, channel-accept-in-setup_wait), and patches 1007-1011 (5 patches in 6 hours, including a 34KB rewrite) landed under the rationale that “(1) ∧ (2) explains the hang.”

That rationale was a theory, not a measured trace. The actual cause of the unfound slab UAF was what patch 1006 later fixed: a kfree under raw_spinlock_t in channel_register/deregister_thread. None of the rolled-back series' hypotheses was correct about the original symptom. Worse, the series introduced a new UAF (the CHANNEL_REPLY UAF that 1009 ultimately fixed) that only existed because channels had been added at all.

All of 1007-1011 were rolled back. The proper sequence was then:

  1. First, KASAN-clean the alloc/free sites under raw_spinlock_t (the actual bug). That became patch 1006.
  2. Then, with KASAN usable as an oracle, run the stress tests. Each splat or hang is a real bug, not slab dust.
  3. One bug per patch, surgical, with the test that found it as the validation gate. 1007 / 1008 / 1009 each fix exactly one KASAN- or test-confirmed bug.

Operating principle

When chasing an unidentified bug, narrow on the actual symptom (trace / KASAN / ftrace / repro) – do not pile speculative fixes from adjacent code review under the cover of “while I was in there, I noticed…”. Even when the audit is internally well-reasoned, the issues it surfaces are almost certainly unrelated to the observed symptom – and landing them piles new failure modes on top of the original one.

Independent CRIT findings can still be filed as separate tickets/patches, but they should not ship until the original symptom is understood. At minimum: do not ship them on the same day, on top of an unfound bug, in the same module.

A small surface area that is clearly correct in isolation (e.g. a refcount discipline patch with a real KASAN trace) can ship – but only after asking: “is this fixing damage I caused with adjacent work, or real upstream-relevant correctness?” 1009 was the latter.

This is also why 1006 was safe to ship in-flurry while the rolled-back 1007-1011 series wasn't: 1006 has an oracle (CONFIG_DEBUG_ATOMIC_SLEEP); the rolled-back series had only Codex's argument.


17. References

Patches (NSPA tree)

All in wine-rt-claude/ntsync-patches/:

Production source

Wine consumer

Tests

In wine/nspa/tests/:

Cross-references