# NV12 Camera Reconnect Fix — Complete Summary

## Original Issue

When an RTSP camera reconnects (stop → destroy → release → create) while inference engines (ANSALPR, ANSFR, ANSOCR, ANSODEngine) are running, the camera process destroys NV12 GPU memory (NVDEC surfaces) that inference is actively reading, crashing LabVIEW.

## Root Causes Discovered (in order of discovery)

### 1. No inference guard in Reconnect()

`Reconnect()` called `close()` immediately without waiting for in-flight inference. `Destroy()` had the guard but `Reconnect()` did not.

### 2. Race window between GetCudaHWFrame() and IncrementInFlight()

`_inFlightFrames` was incremented AFTER `GetCudaHWFrame()` returned NVDEC pointers. Reconnect could see counter=0 and call `close()` while a D2D copy was in progress.

**Log proof:** `attach_cuda: START` with no `D2D OK` (ANSLEGION.log line 52859).

### 3. NVDEC surfaces tied to decoder lifetime

Even with proper ref counting, `player->close()` destroys the NVDEC decoder context, invalidating ALL surfaces at the driver level regardless of AVFrame references. Raw NVDEC device pointers in GpuFrameData become dangling.

### 4. Cross-stream cudaFree race (cudaErrorIllegalAddress 700)

`cudaFree` on the default stream (stream 0) doesn't wait for NV12 kernels running on a `cv::cuda::Stream`. Freeing a buffer while a kernel reads it → error 700 → CUDA context permanently corrupted → all subsequent CUDA operations fail.

**Log proof:** ANSLEGION.3log.log line 75769.

### 5. cudaDeviceSynchronize blocking (900–1800ms spikes)

Adding `cudaDeviceSynchronize()` before `cudaFree` fixed error 700 but blocked the entire GPU pipeline for 900ms+ per call, causing processing-time spikes.

### 6. GPU memory exhaustion (cudaMallocPitch FAILED e1=2)

Per-frame `cudaMallocPitch`/`cudaFree` caused VRAM to fill up when cudaFree was deferred. Error 2 = `cudaErrorMemoryAllocation` (OOM).

**Log proof:** ANSLEGION.7.log at timestamp 457s.

### 7. Unreleased clones (70 images) on shutdown

LabVIEW AI tasks hold cloned `cv::Mat*` references. On quit, 70+ clones are not released, so their GPU buffers (gpuCacheY/UV) are never freed. After `close()` destroys the CUDA context, a later `cudaFree` crashes.

**Log proof:** dump log showing `WriteFlatTransferData` access violation.

### 8. WDDM nvcuda64 SRW lock contention (45–90s freeze)

Even with async copy, `cudaStreamSynchronize` blocks when another camera's `close()` holds the nvcuda64 exclusive SRW lock. All CUDA operations from other cameras stall.

**Log proof:** ANSLEGION9.log — 45s gap between START and D2D OK.

---

## Files Modified

### `include/ANSGpuFrameRegistry.h`

- Added `GpuPendingFreeEntry` struct with `{ptr, deviceIdx, queuedAt}` for time-based safe free
- Added `forceReleaseByOwner(void* client)` — force-cleans all frames owned by a camera
- Changed `drain_gpu_pending(int minAgeMs=0)` — supports time-based filtering (only free entries >100ms old)
- Added `REG_DBG` logging macro (OutputDebugString + fprintf)
- Added debug logging to `attach()`, `release()`, `freeOwnedBuffers_locked()`

### `modules/ANSCV/ANSGpuFrameOps.h`

- Added `#include ` and `GPU_FRAME_DBG` logging macro
- Added `drainAndFreeGpuPending(bool forceAll=false)` — time-based safe free (100ms threshold) or forced (with cudaDeviceSynchronize for Destroy/Reconnect)
- **Rewrote `gpu_frame_attach_cuda()`**:
  - Accepts optional pool buffers: `poolY, poolYPitch, poolUV, poolUVPitch, poolStream`
  - **Pool path (ANSRTSP):** `cudaMemcpy2DAsync` on a non-blocking stream, no sync, keeps the AVFrame alive
  - **Legacy path (other modules):** per-frame `cudaMallocPitch` + `cudaMemcpy2D` (backward compatible)
  - Both paths: snapshot CPU NV12 as cross-GPU fallback, set `isCudaDevicePtr=true`
- `gpu_frame_remove()` — no cudaFree in the hot path (deferred to eviction/cleanup)
- `gpu_frame_evict_stale()` — calls `drainAndFreeGpuPending(false)` with time-based safety

### `modules/ANSCV/ANSRTSP.h`

- Added `GpuNV12Pool` struct:
  ping-pong buffers (Y[2], UV[2]), pitch, resolution, GPU index, cudaStream
- Added `EnsureGpuPool()`, `DestroyGpuPool()`, `GetGpuPool()` methods
- Added `TryIncrementInFlight()` — atomically checks `_isPlaying` AND increments `_inFlightFrames` under the same mutex

### `modules/ANSCV/ANSRTSP.cpp`

- Added `RTSP_DBG` logging macro (OutputDebugString + fprintf)
- Implemented `EnsureGpuPool()`: allocates 2 Y + 2 UV buffers plus a non-blocking CUDA stream on the first CUDA frame
- Implemented `DestroyGpuPool()`: syncs the pool stream, cudaFrees all 4 buffers, destroys the stream
- **`Reconnect()`**: added inference guard (wait for `_inFlightFrames`), forceReleaseByOwner, DestroyGpuPool before close()
- **`Destroy()`**: added forceReleaseByOwner, DestroyGpuPool before close()
- **`GetRTSPCVImage()`**: calls `TryIncrementInFlight()` BEFORE `GetCudaHWFrame()`, uses pool buffers + pool stream for the D2D copy

### `modules/ANSODEngine/NV12PreprocessHelper.cpp`

- Added debug logging before tryNV12 (pointer state, refcount, gpuCacheY) and at kernel launch

---

## Architecture After Fix

```
GetRTSPCVImage (camera thread):
  1. TryIncrementInFlight() — atomic check _isPlaying + increment (blocks Reconnect)
  2. GetCudaHWFrame() — gets NVDEC device pointers
  3. EnsureGpuPool() — allocate 2 ping-pong buffer pairs (first frame only)
  4. pool.next() — get next ping-pong slot
  5. gpu_frame_attach_cuda(..., poolY, poolUV, poolStream):
     a. cudaMemcpy2DAsync: NVDEC → pool buffer (non-blocking, no sync)
     b. Keep AVFrame alive (stored in GpuFrameData.avframe)
     c. yPlane = poolY, uvPlane = poolUV, isCudaDevicePtr = true
     d. gpuCacheY = nullptr (pool owns buffers)
  6. Registry callback: DecrementInFlight on last clone release

Inference (engine thread):
  1. tl_currentGpuFrame() = lookup(*cvImage) — raw GpuFrameData pointer
  2. tryNV12(): reads yPlane/uvPlane (pool buffer, CUDA device ptr)
  3. useZeroCopy = true → wraps as GpuMat → CUDA kernel launch
  4. Kernel reads from pool buffer — same data, independent of NVDEC

Reconnect (camera thread):
  1. _isPlaying = false — blocks new TryIncrementInFlight
  2. Wait for _inFlightFrames == 0 (5s timeout)
  3. forceReleaseByOwner(this) — free all registry entries for this camera
  4. DestroyGpuPool() — cudaStreamSynchronize + cudaFree 4 buffers + destroy stream
  5. close() — destroy NVDEC decoder (safe, no dependencies)
  6. Setup() + play() — re-create camera
  7. Next frame: EnsureGpuPool re-allocates

Destroy (camera thread):
  Same as Reconnect steps 1-5, then delete the object.
```

## Key Design Decisions

| Decision | Why |
|----------|-----|
| D2D copy instead of raw NVDEC pointers | Decouples NV12 lifetime from the decoder — close() is always safe |
| Ping-pong GPU buffer pool (2 pairs) | Eliminates per-frame cudaMalloc/cudaFree — constant VRAM |
| cudaMemcpy2DAsync + no sync | Avoids the 45–90s WDDM SRW lock freeze during close() |
| Keep AVFrame alive during async copy | NVDEC source stays valid until the GPU finishes the copy |
| TryIncrementInFlight before GetCudaHWFrame | Closes the race window — Reconnect must wait before close() |
| forceReleaseByOwner on Destroy/Reconnect | Cleans up unreleased clones before the CUDA context is destroyed |
| Time-based cudaFree (100ms threshold) | Frees legacy-path buffers without cudaDeviceSynchronize |
| No cudaFree in hot path (attach/remove) | Prevents cross-stream races and processing spikes |

## VRAM Usage Per Camera

- 4K (3840×2160): ~26MB fixed (2 × Y + 2 × UV with pitch alignment)
- 1080p (1920×1080): ~6.5MB fixed
- 720p (1280×720): ~3MB fixed
- Zero growth during normal operation

## Status of Each Fix (chronological)

| # | Fix | Log Verified | Result |
|---|-----|-------------|--------|
| 1 | Inference guard in Reconnect | ANSLEGION.log | Closed race but didn't fix root cause |
| 2 | D2D copy (owned GPU buffers) | ANSLEGION.1log | Fixed NVDEC dependency, new race found |
| 3 | TryIncrementInFlight | ANSLEGION.2log | Fixed race window, ran 85min stable |
| 4 | cudaDeviceSynchronize before cudaFree | ANSLEGION.3log→4 | Fixed err=700, caused 900ms spikes |
| 5 | Remove cudaFree from hot path | ANSLEGION.5→6 | Removed spikes, caused memory leak |
| 6 | GPU buffer pool (ping-pong) | ANSLEGION.7→8 | Fixed OOM, zero CUDA errors, 7hr run |
| 7 | Time-based safe cudaFree (100ms) | ANSLEGION.8 | Fixed memory leak from legacy path |
| 8 | Async D2D + no sync | ANSLEGION.9→10 | Fixed WDDM SRW lock freeze |
| 9 | Keep AVFrame alive during async | Current | Ensures NVDEC source stays valid for the async copy |

## Files NOT Modified (transparent to engines)

- All engine DLLs (ANSODEngine, ANSLPR, ANSFR, ANSOCR) — no changes
- NV12PreprocessHelper logic — zero-copy path unchanged (isCudaDevicePtr=true)
- CUDA kernels (nv12_to_rgb.cu) — read GPU pointers regardless of source
- TensorRT engine (EngineRunInference.inl) — no changes
- Other camera modules (ANSFLV, ANSSRT, ANSMJPEG, ANSRTMP) — use the legacy path, backward compatible

## Testing Checklist

- [ ] Start multiple RTSP cameras with HW decoding + multiple AI engines
- [ ] Trigger camera reconnect (disconnect network cable or call ReconnectRTSP)
- [ ] Verify: no crash, inference continues after reconnection
- [ ] Verify: processing time stable (no 900ms+ spikes)
- [ ] Verify: VRAM stable (no growth over time, check nvidia-smi)
- [ ] Verify: clean shutdown (no crash on LabVIEW quit)
- [ ] Verify logs: `D2D OK (pool)` on every frame, `async=1`
- [ ] Verify logs: no `cudaFree FAILED`, no `FALLBACK CPU`, no `err=700`
- [ ] Long-run test: 1+ hours with cameras reconnecting periodically
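The TryIncrementInFlight / Reconnect handshake described above can be sketched in portable C++. This is a minimal illustration of the pattern, not the actual ANSRTSP code: the class name `InFlightGuard` and method `StopAndDrain` are invented here; the real implementation lives behind `TryIncrementInFlight()` and the guard inside `Reconnect()`/`Destroy()`.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch: _isPlaying and _inFlightFrames are checked/updated under ONE
// mutex, so Reconnect() can never observe counter==0 in the gap between
// GetCudaHWFrame() returning NVDEC pointers and the increment.
class InFlightGuard {
public:
    // Called by the camera thread BEFORE touching NVDEC pointers.
    bool TryIncrementInFlight() {
        std::lock_guard<std::mutex> lock(_m);
        if (!_isPlaying) return false;   // Reconnect/Destroy already started
        ++_inFlightFrames;
        return true;
    }

    // Called when the last clone of a frame is released.
    void DecrementInFlight() {
        std::lock_guard<std::mutex> lock(_m);
        if (--_inFlightFrames == 0) _cv.notify_all();
    }

    // Reconnect/Destroy side: stop new frames, then wait (with timeout)
    // for in-flight inference to drain before close() may run.
    bool StopAndDrain(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(_m);
        _isPlaying = false;              // blocks new TryIncrementInFlight
        return _cv.wait_for(lock, timeout,
                            [this] { return _inFlightFrames == 0; });
    }

    void Restart() {
        std::lock_guard<std::mutex> lock(_m);
        _isPlaying = true;
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    bool _isPlaying = true;
    int _inFlightFrames = 0;
};
```

The key point is that the play-state check and the increment are a single atomic step; a plain `std::atomic<int>` counter next to a separate boolean flag would reintroduce the race from root cause #2.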
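The time-based safe free (fix #7) can likewise be illustrated without CUDA. This sketch mirrors `GpuPendingFreeEntry`/`drain_gpu_pending(minAgeMs)`; the injected free function stands in for `cudaFree`, and the `PendingFreeQueue` wrapper and its method names are hypothetical.

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <utility>

// Sketch of the deferred-free queue: buffers are never freed in the hot
// path. They are queued with a timestamp and freed only once no kernel can
// still be reading them (age > threshold), or unconditionally on
// Destroy/Reconnect (minAgeMs = 0, i.e. forceAll).
struct GpuPendingFreeEntry {
    void* ptr;
    std::chrono::steady_clock::time_point queuedAt;
};

class PendingFreeQueue {
public:
    explicit PendingFreeQueue(std::function<void(void*)> freeFn)
        : _freeFn(std::move(freeFn)) {}

    void queueFree(void* ptr) {
        _pending.push_back({ptr, std::chrono::steady_clock::now()});
    }

    // drain(0) frees everything; drain(100) frees only entries >= 100ms old.
    int drain(int minAgeMs) {
        const auto now = std::chrono::steady_clock::now();
        int freed = 0;
        std::deque<GpuPendingFreeEntry> keep;
        for (const auto& e : _pending) {
            const auto ageMs =
                std::chrono::duration_cast<std::chrono::milliseconds>(
                    now - e.queuedAt).count();
            if (ageMs >= minAgeMs) { _freeFn(e.ptr); ++freed; }
            else                   { keep.push_back(e); }
        }
        _pending.swap(keep);
        return freed;
    }

private:
    std::function<void(void*)> _freeFn;
    std::deque<GpuPendingFreeEntry> _pending;
};
```

The 100ms threshold is a heuristic: any kernel launched against the buffer has long since completed by then, so the free cannot race a reader even without `cudaDeviceSynchronize()`.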
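The per-camera VRAM figures follow directly from pitch-aligned NV12 plane sizes times two ping-pong pairs. A back-of-the-envelope sketch, assuming `cudaMallocPitch` rounds each row up to a 512-byte pitch (the real alignment is device-dependent, so these are estimates):

```cpp
#include <cstddef>

// Round v up to a multiple of a.
constexpr std::size_t alignUp(std::size_t v, std::size_t a) {
    return (v + a - 1) / a * a;
}

// Fixed VRAM cost of the 2-pair ping-pong NV12 pool for one camera.
// NV12 = full-res Y plane (1 byte/pixel) + half-height interleaved UV plane.
constexpr std::size_t nv12PoolBytes(std::size_t width, std::size_t height,
                                    std::size_t pitchAlign = 512) {
    const std::size_t pitch   = alignUp(width, pitchAlign);
    const std::size_t yPlane  = pitch * height;        // luma
    const std::size_t uvPlane = pitch * (height / 2);  // interleaved chroma
    return 2 * (yPlane + uvPlane);                     // 2 ping-pong pairs
}
```

For 4K this gives 2 × (4096×2160 + 4096×1080) ≈ 26.5MB, matching the ~26MB figure above; 1080p and 720p work out to ~6.6MB and ~3.3MB respectively.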
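Finally, "keep AVFrame alive during async copy" (fix #9) is at heart an ownership rule: the in-flight copy holds its own reference to the source, so the caller releasing the frame cannot invalidate it mid-copy. A CUDA-free sketch of that rule, with `FakeFrame` standing in for the ref-counted AVFrame and `std::async` standing in for `cudaMemcpy2DAsync` plus stream completion (all names here are invented for illustration):

```cpp
#include <future>
#include <memory>
#include <vector>

// FakeFrame stands in for an AVFrame wrapping an NVDEC surface.
struct FakeFrame {
    std::vector<unsigned char> nv12;  // stands in for the NVDEC surface data
};

// Sketch of the async D2D copy: the lambda captures the shared_ptr by
// value, so the source frame stays alive until the "copy" (the worker
// thread) has finished reading it, even if the caller drops its reference
// immediately after launching the copy.
std::future<std::vector<unsigned char>>
asyncD2DCopy(std::shared_ptr<FakeFrame> src) {
    return std::async(std::launch::async, [src] {
        return src->nv12;             // "pool buffer" receives the data
    });
}
```

In the real pool path the same idea appears as storing the AVFrame in `GpuFrameData.avframe` for the lifetime of the registry entry, so the NVDEC source outlives the `cudaMemcpy2DAsync` that reads it.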