Disable NV12 path for ANSCV by default. Currenly use cv::Mat** directly

2026-04-04 10:09:47 +11:00
parent 445abefebe
commit 3a21026790
19 changed files with 575 additions and 232 deletions
--- a/NV12_GLOBAL_POOL_FIX_V2.md
+++ b/NV12_GLOBAL_POOL_FIX_V2.md
@@ -0,0 +1,133 @@
+# NV12 Global Slot Pool Fix — Complete Reference (v2)
+
+## Problem Statement
+When RTSP cameras disconnect in LabVIEW, the flow is `ReleaseANSRTSPHandle → Destroy() → delete → CreateANSRTSPHandle`. The old per-camera GPU buffer pool was destroyed during this cycle, causing:
+1. **Frozen inference** — `forceReleaseByOwner` deleted GpuFrameData mid-inference
+2. **Processing spikes** — `cudaDeviceSynchronize` blocked ALL GPU work (900ms+)
+3. **Crashes** — inference read freed pool buffers after camera deletion
+
+## Architecture: Global GpuNV12SlotPool
+
+GPU buffer ownership is **decoupled from camera lifetime**:
+- Buffers live in a **process-wide singleton** (`GpuNV12SlotPool`)
+- Slots are **recycled** (never freed during camera Destroy)
+- **50ms cooldown** prevents slot reuse while GPU kernels still read
+- **Per-slot non-blocking CUDA stream** avoids NULL-stream implicit sync
+- **Background av_frame_free thread** removes SRW lock blocking from hot path
+
+## Files Modified (from original codebase)
+
+### NEW FILES (3 + copies)
+
+| File | Purpose |
+|------|---------|
+| `include/GpuNV12SlotPool.h` | Global pool header — GpuNV12Slot struct, GpuNV12SlotPool class |
+| `modules/ANSCV/GpuNV12SlotPool.cpp` | Canonical singleton + acquire() implementation (CUDA) |
+| `modules/ANSODEngine/GpuNV12SlotPool.cpp` | Cross-DLL resolver via GetProcAddress |
+| `modules/ANSOCR/GpuNV12SlotPool.cpp` | Same resolver (copy of ANSODEngine version) |
+| `modules/ANSFR/GpuNV12SlotPool.cpp` | Same resolver (copy of ANSODEngine version) |
+| `modules/ANSLPR/GpuNV12SlotPool.cpp` | Same resolver (copy of ANSODEngine version) |
+
+### MODIFIED FILES
+
+| File | Changes |
+|------|---------|
+| `include/ANSGpuFrameRegistry.h` | Added `#include "GpuNV12SlotPool.h"`, `GpuNV12Slot* poolSlot` field in GpuFrameData, move constructor transfers poolSlot, `freeOwnedBuffers_locked()` calls `deferRelease(poolSlot)`, added `pushPendingFree_locked()`, debug macros guarded by `ANSCORE_GPU_DEBUG` |
+| `modules/ANSCV/ANSGpuFrameOps.h` | `gpu_frame_attach_cuda()` rewritten: sync D2D on per-slot stream, deferred av_frame_free, CPU snapshot only on fallback, background av_frame_free thread in `gpu_frame_evict_stale()`. Debug macros guarded. |
+| `modules/ANSCV/ANSRTSP.h` | Removed `GpuNV12Pool` struct, `EnsureGpuPool()`, `DestroyGpuPool()`, `GetGpuPool()` |
+| `modules/ANSCV/ANSRTSP.cpp` | Removed `EnsureGpuPool`/`DestroyGpuPool` implementations. `Destroy()` and `Reconnect()` simplified: no `forceReleaseByOwner`, no `cudaDeviceSynchronize`, no `DestroyGpuPool`. `GetRTSPCVImage()` uses `GpuNV12SlotPool::instance().acquire()`. Added SLOW FRAME timing log (>500ms, to both spdlog and DebugView). Debug macros guarded. |
+| `modules/ANSODEngine/NV12PreprocessHelper.cpp` | Debug logging blocks guarded by `ANSCORE_GPU_DEBUG`. One-time `[NV12 ACTIVE]` log to DebugView when NV12 fast path activates. |
+| `modules/ANSODEngine/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to source list |
+| `modules/ANSFR/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to source list |
+| `modules/ANSLPR/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to source list |
+| (ANSOCR uses file GLOB — auto-included) | |
+
+## Key Design Decisions
+
+| Decision | Rationale |
+|----------|-----------|
+| **Sync D2D on per-slot stream** | Non-blocking stream avoids NULL-stream implicit sync with inference (was causing 1-2s stalls). `cudaStreamSynchronize` waits only for the 2 copies (~1.5ms). Slot held briefly → pool stays small (64 slots for 20+ cameras). |
+| **50ms cooldown on slot reuse** | GPU kernels complete in <10ms. 50ms = 5× safety margin. Prevents buffer overwrite while inference reads. Short enough to keep pool pressure low. |
+| **Background av_frame_free thread** | `av_frame_free` on CUDA-mapped frames acquires nvcuda64 SRW lock (5-20ms each). Background thread frees in batches every 50ms, removing all SRW lock blocking from camera hot path. |
+| **CPU NV12 snapshot deferred to fallback only** | 4K snapshot = ~12MB malloc+memcpy+free per frame (~276MB/s). Only needed for cross-GPU fallback (rare). Skipping on pool-success path eliminates heap churn. |
+| **Debug logging guarded by ANSCORE_GPU_DEBUG** | 500-2000 OutputDebugString calls/sec caused process-wide mutex convoy stalls. Default off. Add `-DANSCORE_GPU_DEBUG=1` to CMake to re-enable. |
+| **Always-on diagnostics** | NEW slot allocation, POOL FULL, SLOW FRAME (>500ms), and NV12 ACTIVE path selection always log to DebugView (low volume, ~1-10 per session). |
+
+## Data Flow
+
+```
+GetRTSPCVImage (camera thread):
+  1. GetImage() → BGR frame (shallow copy)
+  2. anscv_mat_replace → swap Mat pointer
+  3. TryIncrementInFlight() → atomic guard
+  4. GetCudaHWFrame() → NVDEC device pointers
+  5. GetNV12Frame() → CPU NV12 AVFrame (cloned)
+  6. slot = GpuNV12SlotPool::acquire(gpuIdx, w, h)
+     └─ drainCooledSlots_locked() first (COOLING→FREE if >50ms)
+  7. gpu_frame_attach_cuda(*image, cudaFrame, gpuIdx, pts, cpuNV12, slot):
+     a. cudaMemcpy2DAsync(slot->bufY, ..., nvdecY, ..., slot->copyStream)
+     b. cudaMemcpy2DAsync(slot->bufUV, ..., nvdecUV, ..., slot->copyStream)
+     c. cudaStreamSynchronize(slot->copyStream) — waits ~1.5ms (copy only)
+     d. data.poolSlot = slot
+     e. DEFERRED: push cudaFrame+cpuNV12 to m_pendingFree (NOT av_frame_free)
+     f. registry.attach(mat, data)
+  8. Wire onReleaseFn → DecrementInFlight
+  9. return (~3-5ms total)
+
+Inference (engine thread):
+  1. gpuFrame = lookup(*cvImage) → GpuFrameData*
+  2. tl_currentGpuFrame() = gpuFrame
+  3. tryNV12(): reads yPlane/uvPlane → slot buffers (data is valid, sync done)
+  4. NV12→RGB kernel launch → reads from slot buffer
+  5. Inference finishes → clone released → refcount→0
+     → freeOwnedBuffers_locked → deferRelease(poolSlot) → COOLING
+     → onReleaseFn → DecrementInFlight
+
+Background av_frame_free thread (started once):
+  - Every 50ms: drain m_pendingFree → av_frame_free each
+  - Runs independently of camera/inference threads
+  - SRW lock blocking happens HERE, not in hot path
+
+Slot lifecycle:
+  acquire() → STATE_ACTIVE
+  refcount→0 → deferRelease → STATE_COOLING (cooldownStart = now)
+  50ms later → drainCooledSlots_locked → STATE_FREE
+  next acquire() → reuse
+
+Destroy (camera thread) — LIGHTWEIGHT:
+  1. _isPlaying = false
+  2. Wait _inFlightFrames == 0 (fast — sync copy means in-flight = GetRTSPCVImage only)
+  3. invalidateOwner(this) — prevent stale callbacks
+  4. close() — destroys NVDEC decoder only
+  *** NO forceReleaseByOwner ***
+  *** NO cudaDeviceSynchronize ***
+  *** NO DestroyGpuPool ***
+  Pool slots survive — inference keeps reading safely.
+```
+
+## DebugView Diagnostics (always-on)
+
+```
+[NV12Pool] NEW slot #1: 1920x1080 gpu=0 Y=0000001764000000 UV=... stream=...
+[NV12Pool] NEW slot #2: 3840x2160 gpu=0 Y=... UV=... stream=...
+[NV12 ACTIVE] ANSRTYOLO Path: CUDA_ZERO_COPY | isCuda=1 gpuMatch=1 decodeGpu=0 infGpu=0 frame=1920x1080
+[GetRTSPCVImage] SLOW FRAME: total=523.1ms (getImage=2.1ms cuda=521.0ms) 3840x2160
+[NV12Pool] POOL FULL (64 slots) — fallback to CPU
+```
+
+## Build Configuration
+
+- **Production (default):** Debug logging OFF. Only slot allocation, POOL FULL, SLOW FRAME, and NV12 ACTIVE visible in DebugView.
+- **Debug:** Add `-DANSCORE_GPU_DEBUG=1` to CMake. Enables per-frame verbose logging (WARNING: causes performance degradation from OutputDebugString lock contention at high frame rates).
+
+## Test Checklist
+
+- [ ] Start multiple RTSP cameras with HW decoding + multiple AI engines
+- [ ] Verify DebugView shows: NEW slot allocations, NV12 ACTIVE with CUDA_ZERO_COPY
+- [ ] Verify: zero POOL FULL entries
+- [ ] Verify: zero or very few SLOW FRAME entries (>500ms)
+- [ ] Trigger camera reconnect (disconnect cable or ReleaseHandle+CreateHandle)
+- [ ] Verify: no crash, inference continues on remaining cameras
+- [ ] Verify: processing time chart stable (no multi-second spikes)
+- [ ] Check nvidia-smi: VRAM stable (slots recycled, not growing)
+- [ ] Long run: 1+ hours with cameras reconnecting periodically