Files
ANSCORE/NV12_GLOBAL_POOL_FIX_V2.md

134 lines
7.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# NV12 Global Slot Pool Fix — Complete Reference (v2)
## Problem Statement
When RTSP cameras disconnect in LabVIEW, the flow is `ReleaseANSRTSPHandle → Destroy() → delete → CreateANSRTSPHandle`. The old per-camera GPU buffer pool was destroyed during this cycle, causing:
1. **Frozen inference**`forceReleaseByOwner` deleted GpuFrameData mid-inference
2. **Processing spikes**`cudaDeviceSynchronize` blocked ALL GPU work (900ms+)
3. **Crashes** — inference read freed pool buffers after camera deletion
## Architecture: Global GpuNV12SlotPool
GPU buffer ownership is **decoupled from camera lifetime**:
- Buffers live in a **process-wide singleton** (`GpuNV12SlotPool`)
- Slots are **recycled** (never freed during camera Destroy)
- **50ms cooldown** prevents slot reuse while GPU kernels still read
- **Per-slot non-blocking CUDA stream** avoids NULL-stream implicit sync
- **Background av_frame_free thread** removes SRW lock blocking from hot path
## Files Modified (from original codebase)
### NEW FILES (3 + copies)
| File | Purpose |
|------|---------|
| `include/GpuNV12SlotPool.h` | Global pool header — GpuNV12Slot struct, GpuNV12SlotPool class |
| `modules/ANSCV/GpuNV12SlotPool.cpp` | Canonical singleton + acquire() implementation (CUDA) |
| `modules/ANSODEngine/GpuNV12SlotPool.cpp` | Cross-DLL resolver via GetProcAddress |
| `modules/ANSOCR/GpuNV12SlotPool.cpp` | Same resolver (copy of ANSODEngine version) |
| `modules/ANSFR/GpuNV12SlotPool.cpp` | Same resolver (copy of ANSODEngine version) |
| `modules/ANSLPR/GpuNV12SlotPool.cpp` | Same resolver (copy of ANSODEngine version) |
### MODIFIED FILES
| File | Changes |
|------|---------|
| `include/ANSGpuFrameRegistry.h` | Added `#include "GpuNV12SlotPool.h"`, `GpuNV12Slot* poolSlot` field in GpuFrameData, move constructor transfers poolSlot, `freeOwnedBuffers_locked()` calls `deferRelease(poolSlot)`, added `pushPendingFree_locked()`, debug macros guarded by `ANSCORE_GPU_DEBUG` |
| `modules/ANSCV/ANSGpuFrameOps.h` | `gpu_frame_attach_cuda()` rewritten: sync D2D on per-slot stream, deferred av_frame_free, CPU snapshot only on fallback, background av_frame_free thread in `gpu_frame_evict_stale()`. Debug macros guarded. |
| `modules/ANSCV/ANSRTSP.h` | Removed `GpuNV12Pool` struct, `EnsureGpuPool()`, `DestroyGpuPool()`, `GetGpuPool()` |
| `modules/ANSCV/ANSRTSP.cpp` | Removed `EnsureGpuPool`/`DestroyGpuPool` implementations. `Destroy()` and `Reconnect()` simplified: no `forceReleaseByOwner`, no `cudaDeviceSynchronize`, no `DestroyGpuPool`. `GetRTSPCVImage()` uses `GpuNV12SlotPool::instance().acquire()`. Added SLOW FRAME timing log (>500ms, to both spdlog and DebugView). Debug macros guarded. |
| `modules/ANSODEngine/NV12PreprocessHelper.cpp` | Debug logging blocks guarded by `ANSCORE_GPU_DEBUG`. One-time `[NV12 ACTIVE]` log to DebugView when NV12 fast path activates. |
| `modules/ANSODEngine/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to source list |
| `modules/ANSFR/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to source list |
| `modules/ANSLPR/CMakeLists.txt` | Added `GpuNV12SlotPool.cpp` to source list |
| (ANSOCR uses file GLOB — auto-included) | |
## Key Design Decisions
| Decision | Rationale |
|----------|-----------|
| **Sync D2D on per-slot stream** | Non-blocking stream avoids NULL-stream implicit sync with inference (was causing 1-2s stalls). `cudaStreamSynchronize` waits only for the 2 copies (~1.5ms). Slot held briefly → pool stays small (64 slots for 20+ cameras). |
| **50ms cooldown on slot reuse** | GPU kernels complete in <10ms. 50ms = 5× safety margin. Prevents buffer overwrite while inference reads. Short enough to keep pool pressure low. |
| **Background av_frame_free thread** | `av_frame_free` on CUDA-mapped frames acquires nvcuda64 SRW lock (5-20ms each). Background thread frees in batches every 50ms, removing all SRW lock blocking from camera hot path. |
| **CPU NV12 snapshot deferred to fallback only** | 4K snapshot = ~12MB malloc+memcpy+free per frame (~276MB/s). Only needed for cross-GPU fallback (rare). Skipping on pool-success path eliminates heap churn. |
| **Debug logging guarded by ANSCORE_GPU_DEBUG** | 500-2000 OutputDebugString calls/sec caused process-wide mutex convoy stalls. Default off. Add `-DANSCORE_GPU_DEBUG=1` to CMake to re-enable. |
| **Always-on diagnostics** | NEW slot allocation, POOL FULL, SLOW FRAME (>500ms), and NV12 ACTIVE path selection always log to DebugView (low volume, ~1-10 per session). |
## Data Flow
```
GetRTSPCVImage (camera thread):
1. GetImage() → BGR frame (shallow copy)
2. anscv_mat_replace → swap Mat pointer
3. TryIncrementInFlight() → atomic guard
4. GetCudaHWFrame() → NVDEC device pointers
5. GetNV12Frame() → CPU NV12 AVFrame (cloned)
6. slot = GpuNV12SlotPool::acquire(gpuIdx, w, h)
└─ drainCooledSlots_locked() first (COOLING→FREE if >50ms)
7. gpu_frame_attach_cuda(*image, cudaFrame, gpuIdx, pts, cpuNV12, slot):
a. cudaMemcpy2DAsync(slot->bufY, ..., nvdecY, ..., slot->copyStream)
b. cudaMemcpy2DAsync(slot->bufUV, ..., nvdecUV, ..., slot->copyStream)
c. cudaStreamSynchronize(slot->copyStream) — waits ~1.5ms (copy only)
d. data.poolSlot = slot
e. DEFERRED: push cudaFrame+cpuNV12 to m_pendingFree (NOT av_frame_free)
f. registry.attach(mat, data)
8. Wire onReleaseFn → DecrementInFlight
9. return (~3-5ms total)
Inference (engine thread):
1. gpuFrame = lookup(*cvImage) → GpuFrameData*
2. tl_currentGpuFrame() = gpuFrame
3. tryNV12(): reads yPlane/uvPlane → slot buffers (data is valid, sync done)
4. NV12→RGB kernel launch → reads from slot buffer
5. Inference finishes → clone released → refcount→0
→ freeOwnedBuffers_locked → deferRelease(poolSlot) → COOLING
→ onReleaseFn → DecrementInFlight
Background av_frame_free thread (started once):
- Every 50ms: drain m_pendingFree → av_frame_free each
- Runs independently of camera/inference threads
- SRW lock blocking happens HERE, not in hot path
Slot lifecycle:
acquire() → STATE_ACTIVE
refcount→0 → deferRelease → STATE_COOLING (cooldownStart = now)
50ms later → drainCooledSlots_locked → STATE_FREE
next acquire() → reuse
Destroy (camera thread) — LIGHTWEIGHT:
1. _isPlaying = false
2. Wait _inFlightFrames == 0 (fast — sync copy means in-flight = GetRTSPCVImage only)
3. invalidateOwner(this) — prevent stale callbacks
4. close() — destroys NVDEC decoder only
*** NO forceReleaseByOwner ***
*** NO cudaDeviceSynchronize ***
*** NO DestroyGpuPool ***
Pool slots survive — inference keeps reading safely.
```
## DebugView Diagnostics (always-on)
```
[NV12Pool] NEW slot #1: 1920x1080 gpu=0 Y=0000001764000000 UV=... stream=...
[NV12Pool] NEW slot #2: 3840x2160 gpu=0 Y=... UV=... stream=...
[NV12 ACTIVE] ANSRTYOLO Path: CUDA_ZERO_COPY | isCuda=1 gpuMatch=1 decodeGpu=0 infGpu=0 frame=1920x1080
[GetRTSPCVImage] SLOW FRAME: total=523.1ms (getImage=2.1ms cuda=521.0ms) 3840x2160
[NV12Pool] POOL FULL (64 slots) — fallback to CPU
```
## Build Configuration
- **Production (default):** Debug logging OFF. Only slot allocation, POOL FULL, SLOW FRAME, and NV12 ACTIVE visible in DebugView.
- **Debug:** Add `-DANSCORE_GPU_DEBUG=1` to CMake. Enables per-frame verbose logging (WARNING: causes performance degradation from OutputDebugString lock contention at high frame rates).
## Test Checklist
- [ ] Start multiple RTSP cameras with HW decoding + multiple AI engines
- [ ] Verify DebugView shows: NEW slot allocations, NV12 ACTIVE with CUDA_ZERO_COPY
- [ ] Verify: zero POOL FULL entries
- [ ] Verify: zero or very few SLOW FRAME entries (>500ms)
- [ ] Trigger camera reconnect (disconnect cable or ReleaseHandle+CreateHandle)
- [ ] Verify: no crash, inference continues on remaining cameras
- [ ] Verify: processing time chart stable (no multi-second spikes)
- [ ] Check nvidia-smi: VRAM stable (slots recycled, not growing)
- [ ] Long run: 1+ hours with cameras reconnecting periodically