Wine-NSPA – Validation Baselines and Comparison

This page preserves the archived RT-suite snapshots and the subsystem deltas that are still useful when reading older Wine-NSPA results.

Overview
Archived suite snapshots
Current archived boundary
Historical suite deltas worth keeping
Targeted subsystem A/Bs after the archive boundary
Methodology boundaries
References

1. Overview

The public test story has two layers:

archived full-suite snapshots, where the whole default matrix was rerun and versioned
targeted validators and workload A/Bs, which carry later subsystem work that did not get a new archived suite snapshot

This page keeps both. It does not try to turn every later A/B into a fake matrix version, but it also does not throw away the concrete per-test and before/after data that explains why individual carries mattered.

2. Archived suite snapshots

These are the public snapshots worth keeping as suite-level boundaries.

Snapshot	Date	Suite shape	Result	Why it matters
`v4`	`2026-04-15`	PE-only matrix, 10 tests x baseline + RT	`20/20 PASS`	first public socket `io_uring` numbers and post-PI-v2 kernel fixes
`v5`	`2026-04-15/16`	PE-only matrix, 10 tests x baseline + RT	`20/20 PASS`	SIMD / SRW-spin era with detailed per-test deltas
`v7`	`2026-04-28`	early two-layer	Layer 1 native suite added; Layer 2 `22/22 PASS`	native ntsync Layer 1 becomes part of the public suite
`v8`	`2026-04-30`	early two-layer	Layer 1 `3 PASS / 0 FAIL`; Layer 2 `24 PASS / 0 FAIL / 0 TIMEOUT`	`dispatcher-burst` enters the default PE matrix
`v9-validation-default`	`2026-05-03`	current archived two-layer default	Layer 1 `3 PASS / 0 FAIL / 0 SKIP`; Layer 2 `32 PASS / 0 FAIL / 0 TIMEOUT`	current archived baseline used by the public docs

Later carries are documented through targeted validators and workload A/Bs unless another full-suite archive supersedes v9-validation-default.

The last older-method whole-suite reference run before the methodology shift was v18. Its useful late-suite numbers are carried in 5.4 ntsync allocator and wait-queue follow-ons rather than being presented as a fake new archived boundary.

3. Current archived boundary

The current archived full-suite boundary is v9-validation-default (2026-05-03).

Layer	Result	Notes
Layer 1 native suite	`3 PASS / 0 FAIL / 0 SKIP`	`test-event-set-pi`, `test-channel-recv-exclusive`, `test-aggregate-wait`
Layer 2 PE matrix	`32 PASS / 0 FAIL / 0 TIMEOUT`	`16` default tests x `baseline` + `rt`

The archived default PE test set is:

Sync and contention	I/O and bypass	Timers and UI
`rapidmutex`	`socket-io`	`nt-timer`
`philosophers`	`rpc-bypass`	`wm-timer`
`fork-mutex`	`irot-bypass`	`dispatcher-burst`
`cs-contention`	`large-pages`	`signal-recursion`
`ntsync-d4`
`ntsync-d8`
`ntsync-d12`
`condvar-pi`

This is the baseline to use when a public page says “archived matrix” or “current archived validation boundary.”

4. Historical suite deltas worth keeping

4.1 `v3` -> `v4`

The v4 archive is still useful because it was the first full public matrix after the PI-v2 ntsync fixes and the socket io_uring overlapped path landed.

Metric	Earlier run	`v4`	Read
philosophers RT max wait	`1620 us`	`601 us`	`-63%`; PI-v2 bugfix removes boost/unboost thrash
socket-io B avg latency	—	`113.2 us`	first public overlapped `io_uring` result
socket-io B throughput	—	`8837 msg/s`	first public overlapped `io_uring` result
rapidmutex RT throughput	`~262K-301K ops/s`	`312K ops/s`	stable CS-PI hot path after PI-v2 cleanup

4.2 `v4` -> `v5`

v5 is still the densest archived example of how a micro-optimization bundle shows up in RT tests.

Metric	`v4`	`v5`	Read
rapidmutex RT throughput	`312K ops/s`	`327K ops/s`	`+4.7%`; lower CS-path copy overhead
rapidmutex RT max wait	`46 us`	`36 us`	lower transition overhead
fork-mutex RT elapsed	`1021 ms`	`948 ms`	`-7.1%`; process-start path trimmed
ntsync d4 rapid throughput	`232K ops/s`	`259K ops/s`	`+11.6%`; lower lock transition cost
ntsync d8 RT PI avg	`419 ms`	`201 ms`	`-52%`; the earlier reversal disappears
baseline socket-io B avg	`133.2 us`	`104.5 us`	`-21%`; overlapped path copy cost drops
baseline socket-io B throughput	`7506 msg/s`	`9568 msg/s`	`+27%`; same socket path improvement

4.3 Representative per-test numbers

Some individual test numbers are still worth keeping because later pages cite them directly.

Test	Metric	Value
`condvar-pi`	with PI max wait	`152 us`
`condvar-pi`	without PI max wait	`263 us`
`rapidmutex`	RT throughput (`v5`)	`327K ops/s`
`rapidmutex`	RT max wait (`v5`)	`36 us`
`philosophers`	meals served	`250/250`
`socket-io` B	overlapped avg (`v5` baseline)	`104.5 us`
`socket-io` B	overlapped throughput (`v5` baseline)	`9568 msg/s`

5. Targeted subsystem A/Bs after the archive boundary

The public docs kept moving after v9-validation-default, but not every carry was followed by a new archived suite snapshot. These are the concrete deltas worth keeping.

5.1 Dispatcher and message-path work

Feature	Before	After	Read
`dispatcher-burst` burst ops/sec	`555,567`	`841,765`	`+34% / 1.5x` with `TRY_RECV2` burst drain
`dispatcher-burst` burst worst max ns	`31,843,082`	`23,014,325`	`-28%`
`channel_dispatcher` samples	`14.51%`	`0.70%`	`-13.81pp / -95%`
`main_loop_epoll` samples	`7.24%`	`2.68%`	`-4.56pp`
`get_message` calls / 60 s	`3,880`	`866`	`-78%` with empty-poll cache
`get_message` direct handler time	`16.5 ms`	`2.2 ms`	`-87%`
`get_message` total handler time	`46.8 ms`	`36.9 ms`	`-21%`

5.2 Shared-state and zero-time waits

Feature	Before	After	Read
zero-time process wait	`~10000 ns/poll`	`~144 ns/poll`	`process_shm` short-circuit
zero-time thread wait	`~11940 ns/poll`	`~164 ns/poll`	`thread_shm` short-circuit
local sections `nspa_create_mapping_from_unix_fd` count	`2,664`	`~800`	`-70%`; mapping work stays local
local sections total handler time	`1,991 ms`	`1,077 ms`	lighter mapping path

5.3 x86_64 hot-path bundle

Metric	Before	After	Read
user-mode samples	`97K`	`86K`	`-11.3%`
iTLB misses	`229.7M`	`180.8M`	`-21.30%`
dTLB misses	`51.5M`	`42.4M`	`-17.69%`
branch misses	`348.3M`	`308.4M`	`-11.45%`
page faults	`130,349`	`71,754`	`-44.95%`
`NtGetTickCount` dispatcher entries	`3,081,551`	`0`	inline path confirmed end-to-end

5.4 ntsync allocator and wait-queue follow-ons

The later ntsync cache-isolation and wait-queue work was validated more through targeted stress and tail numbers than through a new public full-suite archive. The best late post-1011 whole-suite reference in that family was v18, so it is worth keeping its numbers explicit instead of collapsing them into prose.

Metric	post-1011 `v17`	post-1011 `v18`	later cache-isolated sample	Read
philosophers RT max wait	`4154 us`	`3039 us`	`857 us`	`v18` was the tighter post-1011 run; later cache isolation tightens the tail again
philosophers worst max wait	`13097 us`	`9963 us`	`1424 us`	same tail-tightening story
rapidmutex ops/sec	`862K`	`254K`	`270K`	no collapse after the later allocator/cache changes
dispatcher p99	`22.8M ns`	`17.0M ns`	`16.9M ns`	no regression on the dispatcher hot path
socket-io p99	`238 us`	`161 us`	`234 us`	later sample remains inside the earlier variance band

These numbers are useful as a no-regression and tail-tightening read, not as a replacement for a new archived suite boundary.

6. Methodology boundaries

The methodology note should be short but explicit.

Family	Snapshots	What changed	Comparison rule
PE-only matrix	`v3` through `v6`	one PE layer, smaller default set	compare within the family
Early two-layer	`v7`, `v8`	native Layer 1 added; `dispatcher-burst` joins the PE set	compare within the family
Current archived default	`v9-validation-default`	16-test default PE matrix with Layer 1 intact	current public baseline

So:

use archived totals inside the same family
keep later targeted A/Bs as feature evidence
do not pretend a targeted A/B is a new suite snapshot

For the current runner shape and default test list, see RT Test Harness.