Thread-local staging Mat (video_player.cpp:1400-1407) — single biggest win. Eliminates the 12 MB per-call malloc/free cycle.
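A minimal sketch of the staging idea, assuming a std::vector<std::uint8_t> stands in for the cv::Mat. stagingBuffer is a hypothetical helper for illustration, not the code at video_player.cpp:1400-1407.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a per-thread scratch buffer of at least `bytes`.
// The vector grows on first use (or when a larger frame arrives) and is reused
// afterwards, so steady-state calls on the same thread do no heap allocation.
inline std::uint8_t* stagingBuffer(std::size_t bytes) {
    thread_local std::vector<std::uint8_t> staging;
    if (staging.size() < bytes) {
        staging.resize(bytes);
    }
    return staging.data();
}
```

One 4K NV12 frame is 3840 * 2160 * 3 / 2 bytes, roughly 12 MB, so a call such as stagingBuffer(3840u * 2160u * 3u / 2u) pays for that allocation once per thread instead of once per call.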
Contiguous get_buffer2 allocator (video_decoder.cpp:35-102) — keeps the 3 bulk memcpys cache-friendly. Would also enable FAST/zero-copy for resolutions where visible_h % 64 == 0.
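The layout arithmetic behind a contiguous allocator can be sketched as below. This is only an illustration under stated assumptions: YUV420P with three planes, strides and row counts padded to 64, and hypothetical names (PlaneLayout, contiguousYuv420p). It is not the callback in video_decoder.cpp and makes no FFmpeg calls.

```cpp
#include <cstddef>

struct PlaneLayout {
    std::size_t offset[3];  // byte offset of Y, U, V inside the single block
    std::size_t stride[3];  // bytes per row of each plane
    std::size_t rows[3];    // allocated rows of each plane (including padding)
    std::size_t total;      // size of the whole allocation in bytes
};

constexpr std::size_t alignUp(std::size_t v, std::size_t a) {
    return (v + a - 1) / a * a;
}

// Hypothetical layout: the three planes sit back to back in one block, so
// three bulk copies walk one contiguous address range.
// `align` = 64 is an assumption made for this sketch.
inline PlaneLayout contiguousYuv420p(std::size_t width, std::size_t visibleH,
                                     std::size_t align = 64) {
    PlaneLayout p{};
    const std::size_t paddedH = alignUp(visibleH, align);
    p.stride[0] = alignUp(width, align);
    p.stride[1] = p.stride[2] = alignUp((width + 1) / 2, align);
    p.rows[0] = paddedH;
    p.rows[1] = p.rows[2] = paddedH / 2;
    p.offset[0] = 0;
    p.offset[1] = p.offset[0] + p.stride[0] * p.rows[0];
    p.offset[2] = p.offset[1] + p.stride[1] * p.rows[1];
    p.total = p.offset[2] + p.stride[2] * p.rows[2];
    return p;
}
```

In this sketch, a height with visible_h % 64 == 0 gets no padding rows, so each plane ends exactly where the next begins. That is one plausible reading of why such resolutions could qualify for a zero-copy path.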
SW-decoder thread config (video_decoder.cpp:528-540) — thread_count=0, thread_type=FRAME|SLICE. FRAME is downgraded to SLICE-only by AV_CODEC_FLAG_LOW_DELAY, but decode throughput is sufficient for your input rate.
SetTargetFPS(100) delivery throttle (already there) — caps onVideoFrame post-decode work at 10 FPS. Keeps the caller path warm-cached.
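The throttle's internals are not shown here. As a rough sketch of how such a cap could work, assuming it enforces a minimum interval between deliveries, a hypothetical DeliveryThrottle class might look like this:

```cpp
#include <chrono>

// Hypothetical throttle: frames arriving sooner than `minInterval` after the
// last delivered frame are skipped, so post-decode work runs at a bounded rate.
class DeliveryThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit DeliveryThrottle(Clock::duration minInterval)
        : minInterval_(minInterval) {}

    // Returns true if the frame arriving at `now` should be delivered.
    bool shouldDeliver(Clock::time_point now) {
        if (hasLast_ && now - last_ < minInterval_) {
            return false;  // too soon after the previous delivery
        }
        last_ = now;
        hasLast_ = true;
        return true;
    }

private:
    Clock::duration minInterval_;
    Clock::time_point last_{};
    bool hasLast_ = false;
};
```

A 100 ms minimum interval corresponds to at most 10 deliveries per second. Passing the timestamp in, rather than reading the clock inside the class, keeps the sketch deterministic to test.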
Instrumentation — [MEDIA_DecInit] / [MEDIA_Convert] / [MEDIA_SWDec] / [MEDIA_Timing] / [MEDIA_JpegTiming] — always-on regression detector, zero cost when ANSCORE_DEBUGVIEW=OFF.
getImage() previously held _mutex across the 4K NV12->BGR sws_scale in
avframeToCVMat, blocking the decoder callback (onVideoFrame) for 100-300 ms
per frame. Under multi-camera load this cascaded into 5-21 s frame stalls
and STALE PTS events in the log.
- avframeToCVMat: drop outer _mutex. NV12/YUV420P paths touch no shared
state; avframeAnyToCvmat still locks internally for swsCtx.
- getImage: split into two short locked phases with the BGR conversion
unlocked between them. Decoder callbacks can push new frames and run
the CUDA HW capture path in parallel with the reader's conversion.
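
The two-phase locking pattern described above can be sketched as follows. This is a simplified illustration, not the actual getImage: Frame is a byte-vector stand-in for the AVFrame / cv::Mat pair, convert is a placeholder for the slow NV12->BGR step, and FrameSource, seq_ and cachedSeq_ are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

using Frame = std::vector<std::uint8_t>;  // stand-in for AVFrame / cv::Mat

class FrameSource {
public:
    // Decoder side: publish a new frame under a short lock.
    void onVideoFrame(std::shared_ptr<const Frame> f) {
        std::lock_guard<std::mutex> lk(mutex_);
        latest_ = std::move(f);
        ++seq_;
    }

    // Reader side: snapshot under lock, convert unlocked, publish under lock.
    Frame getImage() {
        std::shared_ptr<const Frame> src;
        std::uint64_t seq = 0;
        {   // Phase 1: grab a reference to the newest frame.
            std::lock_guard<std::mutex> lk(mutex_);
            if (!latest_ || seq_ == cachedSeq_) return cached_;
            src = latest_;
            seq = seq_;
        }
        Frame out = convert(*src);  // slow step, no lock held
        {   // Phase 2: store the result unless a newer one is already cached.
            std::lock_guard<std::mutex> lk(mutex_);
            if (seq > cachedSeq_) {
                cached_ = out;
                cachedSeq_ = seq;
            }
        }
        return out;
    }

private:
    // Placeholder for the colour conversion: inverts every byte.
    static Frame convert(const Frame& in) {
        Frame out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = static_cast<std::uint8_t>(255 - in[i]);
        }
        return out;
    }

    std::mutex mutex_;
    std::shared_ptr<const Frame> latest_;
    std::uint64_t seq_ = 0;
    Frame cached_;
    std::uint64_t cachedSeq_ = 0;
};
```

Holding a shared_ptr to the source frame keeps it alive during the unlocked conversion even if the decoder publishes a newer frame in the meantime. The sequence check in phase 2 stops an older conversion from overwriting a newer cached result.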

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>