ANSCORE/NV12_GLOBAL_POOL_FIX_V2.md

# NV12 Global Slot Pool Fix — Complete Reference (v2)

## Problem Statement

When RTSP cameras disconnect in LabVIEW, the flow is `ReleaseANSRTSPHandle` → `Destroy()` → `delete` → `CreateANSRTSPHandle`. The old per-camera GPU buffer pool was destroyed during this cycle, causing:

1. Frozen inference — `forceReleaseByOwner` deleted `GpuFrameData` mid-inference
2. Processing spikes — `cudaDeviceSynchronize` blocked ALL GPU work (900ms+)
3. Crashes — inference read freed pool buffers after camera deletion

## Architecture: Global GpuNV12SlotPool

GPU buffer ownership is decoupled from camera lifetime:

- Buffers live in a process-wide singleton (`GpuNV12SlotPool`)
- Slots are recycled (never freed during camera `Destroy`)
- A 50ms cooldown prevents slot reuse while GPU kernels may still be reading
- A per-slot non-blocking CUDA stream avoids NULL-stream implicit sync
- A background `av_frame_free` thread removes SRW-lock blocking from the hot path

## Files Modified (from original codebase)

### NEW FILES (3 + copies)

| File | Purpose |
|---|---|
| `include/GpuNV12SlotPool.h` | Global pool header — `GpuNV12Slot` struct, `GpuNV12SlotPool` class |
| `modules/ANSCV/GpuNV12SlotPool.cpp` | Canonical singleton + `acquire()` implementation (CUDA) |
| `modules/ANSODEngine/GpuNV12SlotPool.cpp` | Cross-DLL resolver via `GetProcAddress` |
| `modules/ANSOCR/GpuNV12SlotPool.cpp` | Same resolver (copy of the ANSODEngine version) |
| `modules/ANSFR/GpuNV12SlotPool.cpp` | Same resolver (copy of the ANSODEngine version) |
| `modules/ANSLPR/GpuNV12SlotPool.cpp` | Same resolver (copy of the ANSODEngine version) |

### MODIFIED FILES

| File | Changes |
|---|---|
| `include/ANSGpuFrameRegistry.h` | Added `#include "GpuNV12SlotPool.h"`; `GpuNV12Slot* poolSlot` field in `GpuFrameData`; move constructor transfers `poolSlot`; `freeOwnedBuffers_locked()` calls `deferRelease(poolSlot)`; added `pushPendingFree_locked()`; debug macros guarded by `ANSCORE_GPU_DEBUG` |
| `modules/ANSCV/ANSGpuFrameOps.h` | `gpu_frame_attach_cuda()` rewritten: sync D2D on the per-slot stream, deferred `av_frame_free`, CPU snapshot only on fallback; background `av_frame_free` thread in `gpu_frame_evict_stale()`. Debug macros guarded. |
| `modules/ANSCV/ANSRTSP.h` | Removed `GpuNV12Pool` struct, `EnsureGpuPool()`, `DestroyGpuPool()`, `GetGpuPool()` |
| `modules/ANSCV/ANSRTSP.cpp` | Removed `EnsureGpuPool`/`DestroyGpuPool` implementations. `Destroy()` and `Reconnect()` simplified: no `forceReleaseByOwner`, no `cudaDeviceSynchronize`, no `DestroyGpuPool`. `GetRTSPCVImage()` uses `GpuNV12SlotPool::instance().acquire()`. Added SLOW FRAME timing log (>500ms, to both spdlog and DebugView). Debug macros guarded. |
| `modules/ANSODEngine/NV12PreprocessHelper.cpp` | Debug logging blocks guarded by `ANSCORE_GPU_DEBUG`. One-time `[NV12 ACTIVE]` log to DebugView when the NV12 fast path activates. |
| `modules/ANSODEngine/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to the source list |
| `modules/ANSFR/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to the source list |
| `modules/ANSLPR/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to the source list |

(ANSOCR uses a file GLOB — auto-included.)
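The per-module resolver copies exist because each DLL would otherwise instantiate its own copy of the singleton's static storage, defeating the "one process-wide pool" design. A minimal sketch of the pattern follows; the export name `ANSCV_GetGlobalNV12Pool` is hypothetical, and a plain function pointer stands in for the real `GetModuleHandle`/`GetProcAddress` lookup so the sketch stays portable.

```cpp
// Sketch: only the canonical module defines the pool; every other module
// resolves it through an exported C entry point instead of constructing its
// own static instance. In the real fix the lookup is dynamic
// (GetModuleHandleA("ANSCV") + GetProcAddress); here a function pointer
// stands in for that symbol lookup.

struct GpuNV12SlotPool {               // stand-in for the real pool type
    int slotCount = 0;
};

// --- canonical module (ANSCV) ---------------------------------------------
extern "C" GpuNV12SlotPool* ANSCV_GetGlobalNV12Pool() {
    static GpuNV12SlotPool pool;       // the ONE process-wide instance
    return &pool;
}

// --- secondary module (the ANSODEngine/ANSOCR/ANSFR/ANSLPR copies) --------
using PoolGetter = GpuNV12SlotPool* (*)();

GpuNV12SlotPool& resolvePool() {
    // Real code (Windows):
    //   reinterpret_cast<PoolGetter>(GetProcAddress(
    //       GetModuleHandleA("ANSCV"), "ANSCV_GetGlobalNV12Pool"));
    static PoolGetter getter = &ANSCV_GetGlobalNV12Pool;
    return *getter();                  // cached after first resolution
}
```

Caching the getter in a function-local static means the symbol lookup cost is paid once per module, not once per frame.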

## Key Design Decisions

| Decision | Rationale |
|---|---|
| Sync D2D on per-slot stream | A non-blocking stream avoids NULL-stream implicit sync with inference (which was causing 1-2s stalls). `cudaStreamSynchronize` waits only for the two copies (~1.5ms). The slot is held briefly, so the pool stays small (64 slots for 20+ cameras). |
| 50ms cooldown on slot reuse | GPU kernels complete in <10ms; 50ms gives a 5× safety margin. Prevents buffer overwrite while inference reads, yet is short enough to keep pool pressure low. |
| Background `av_frame_free` thread | `av_frame_free` on CUDA-mapped frames acquires the nvcuda64 SRW lock (5-20ms each). A background thread frees in batches every 50ms, removing all SRW-lock blocking from the camera hot path. |
| CPU NV12 snapshot deferred to fallback only | A 4K snapshot is ~12MB of malloc+memcpy+free per frame (~276MB/s). It is only needed for the cross-GPU fallback (rare); skipping it on the pool-success path eliminates heap churn. |
| Debug logging guarded by `ANSCORE_GPU_DEBUG` | 500-2000 `OutputDebugString` calls/sec caused process-wide mutex convoy stalls. Default off; add `-DANSCORE_GPU_DEBUG=1` to CMake to re-enable. |
| Always-on diagnostics | NEW slot allocation, POOL FULL, SLOW FRAME (>500ms), and NV12 ACTIVE path selection always log to DebugView (low volume, ~1-10 per session). |

## Data Flow

```
GetRTSPCVImage (camera thread):
  1. GetImage() → BGR frame (shallow copy)
  2. anscv_mat_replace → swap Mat pointer
  3. TryIncrementInFlight() → atomic guard
  4. GetCudaHWFrame() → NVDEC device pointers
  5. GetNV12Frame() → CPU NV12 AVFrame (cloned)
  6. slot = GpuNV12SlotPool::acquire(gpuIdx, w, h)
     └─ drainCooledSlots_locked() first (COOLING→FREE if >50ms)
  7. gpu_frame_attach_cuda(*image, cudaFrame, gpuIdx, pts, cpuNV12, slot):
     a. cudaMemcpy2DAsync(slot->bufY, ..., nvdecY, ..., slot->copyStream)
     b. cudaMemcpy2DAsync(slot->bufUV, ..., nvdecUV, ..., slot->copyStream)
     c. cudaStreamSynchronize(slot->copyStream) — waits ~1.5ms (copy only)
     d. data.poolSlot = slot
     e. DEFERRED: push cudaFrame+cpuNV12 to m_pendingFree (NOT av_frame_free)
     f. registry.attach(mat, data)
  8. Wire onReleaseFn → DecrementInFlight
  9. return (~3-5ms total)
```

```
Inference (engine thread):
  1. gpuFrame = lookup(*cvImage) → GpuFrameData*
  2. tl_currentGpuFrame() = gpuFrame
  3. tryNV12(): reads yPlane/uvPlane → slot buffers (data is valid, sync done)
  4. NV12→RGB kernel launch → reads from slot buffer
  5. Inference finishes → clone released → refcount→0
     → freeOwnedBuffers_locked → deferRelease(poolSlot) → COOLING
     → onReleaseFn → DecrementInFlight
```

```
Background av_frame_free thread (started once):
  - Every 50ms: drain m_pendingFree → av_frame_free each
  - Runs independently of camera/inference threads
  - SRW lock blocking happens HERE, not in hot path
```
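The deferred-free worker can be sketched as below, with a stub `freeFrame()` standing in for `av_frame_free` (the call that can block 5-20ms on the nvcuda64 SRW lock). The `DeferredFreeQueue` class name and the batch container are assumptions made for the sketch, not the production names.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Sketch of the background free thread: the camera hot path only appends to
// a mutex-guarded pending list (cheap, no driver calls); the worker drains
// the list every 50ms and does the slow frees outside the camera threads.
struct PendingFrame { /* AVFrame* in the real code */ int id; };

class DeferredFreeQueue {
public:
    void push(PendingFrame f) {                    // hot path: O(1), lock + append
        std::lock_guard<std::mutex> lk(m_mutex);
        m_pendingFree.push_back(f);
    }

    void start() {
        m_running = true;
        m_worker = std::thread([this] {
            while (m_running) {
                std::this_thread::sleep_for(std::chrono::milliseconds(50));
                std::vector<PendingFrame> batch;
                {   // swap under the lock, free outside it
                    std::lock_guard<std::mutex> lk(m_mutex);
                    batch.swap(m_pendingFree);
                }
                for (auto& f : batch) freeFrame(f); // slow SRW-lock work lives here
            }
        });
    }

    void stop() {
        m_running = false;
        if (m_worker.joinable()) m_worker.join();
    }

    int freedCount() const { return m_freed.load(); }

private:
    void freeFrame(PendingFrame&) { m_freed.fetch_add(1); } // av_frame_free stand-in

    std::mutex m_mutex;
    std::vector<PendingFrame> m_pendingFree;
    std::thread m_worker;
    std::atomic<bool> m_running{false};
    std::atomic<int> m_freed{0};
};
```

Swapping the whole vector under the lock keeps the critical section tiny, so a slow free never makes a camera thread wait on the mutex.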

```
Slot lifecycle:
  acquire() → STATE_ACTIVE
  refcount→0 → deferRelease → STATE_COOLING (cooldownStart = now)
  50ms later → drainCooledSlots_locked → STATE_FREE
  next acquire() → reuse
```

```
Destroy (camera thread) — LIGHTWEIGHT:
  1. _isPlaying = false
  2. Wait _inFlightFrames == 0 (fast — sync copy means in-flight = GetRTSPCVImage only)
  3. invalidateOwner(this) — prevent stale callbacks
  4. close() — destroys NVDEC decoder only
  *** NO forceReleaseByOwner ***
  *** NO cudaDeviceSynchronize ***
  *** NO DestroyGpuPool ***
  Pool slots survive — inference keeps reading safely.
```
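The guard that makes this lightweight teardown safe can be sketched with two atomics. The `TryIncrementInFlight`/`DecrementInFlight` names come from the flow above; the class wrapper and memory orderings are illustrative assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of the in-flight guard behind the lightweight Destroy(). Because
// the D2D copy is synchronous on the per-slot stream, a frame is only
// "in flight" for the few milliseconds GetRTSPCVImage() runs, so the drain
// loop returns quickly. No forceReleaseByOwner / cudaDeviceSynchronize /
// DestroyGpuPool: pool slots outlive the camera.
class InFlightGuard {
public:
    // Camera thread, start of GetRTSPCVImage(): increment first, then check
    // the flag, so StopAndDrain() never misses a frame that slipped through.
    bool TryIncrementInFlight() {
        m_inFlightFrames.fetch_add(1, std::memory_order_acq_rel);
        if (!m_isPlaying.load(std::memory_order_acquire)) {
            m_inFlightFrames.fetch_sub(1, std::memory_order_acq_rel);
            return false;              // camera stopping: reject the frame
        }
        return true;
    }

    void DecrementInFlight() {         // wired into onReleaseFn
        m_inFlightFrames.fetch_sub(1, std::memory_order_acq_rel);
    }

    // Destroy() steps 1-2: stop new frames, then wait for the counter to drain.
    void StopAndDrain() {
        m_isPlaying.store(false, std::memory_order_release);
        while (m_inFlightFrames.load(std::memory_order_acquire) != 0)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

private:
    std::atomic<bool> m_isPlaying{true};
    std::atomic<int>  m_inFlightFrames{0};
};
```

Incrementing before checking the flag closes the race where the flag flips between the check and the increment; the worst case is a transiently nonzero counter that the drain loop simply waits out.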

## DebugView Diagnostics (always-on)

```
[NV12Pool] NEW slot #1: 1920x1080 gpu=0 Y=0000001764000000 UV=... stream=...
[NV12Pool] NEW slot #2: 3840x2160 gpu=0 Y=... UV=... stream=...
[NV12 ACTIVE] ANSRTYOLO Path: CUDA_ZERO_COPY | isCuda=1 gpuMatch=1 decodeGpu=0 infGpu=0 frame=1920x1080
[GetRTSPCVImage] SLOW FRAME: total=523.1ms (getImage=2.1ms cuda=521.0ms) 3840x2160
[NV12Pool] POOL FULL (64 slots) — fallback to CPU
```

## Build Configuration

- Production (default): debug logging OFF. Only slot allocation, POOL FULL, SLOW FRAME, and NV12 ACTIVE entries are visible in DebugView.
- Debug: add `-DANSCORE_GPU_DEBUG=1` to CMake. Enables per-frame verbose logging (WARNING: degrades performance through `OutputDebugString` lock contention at high frame rates).

## Test Checklist

- Start multiple RTSP cameras with HW decoding + multiple AI engines
- Verify DebugView shows NEW slot allocations and NV12 ACTIVE with CUDA_ZERO_COPY
- Verify zero POOL FULL entries
- Verify zero or very few SLOW FRAME entries (>500ms)
- Trigger a camera reconnect (disconnect the cable, or ReleaseHandle + CreateHandle)
- Verify no crash; inference continues on the remaining cameras
- Verify the processing-time chart is stable (no multi-second spikes)
- Check nvidia-smi: VRAM stable (slots recycled, not growing)
- Long run: 1+ hours with cameras reconnecting periodically