ANSCORE/NV12_CAMERA_RECONNECT_FIX_SUMMARY.md
2026-04-03 14:51:52 +11:00

NV12 Camera Reconnect Fix — Complete Summary

Original Issue

When an RTSP camera reconnects (stop→destroy→release→create) while inference engines (ANSALPR, ANSFR, ANSOCR, ANSODEngine) are running, the camera process destroys the NV12 GPU memory (NVDEC surfaces) that inference is actively reading, crashing LabVIEW.

Root Causes Discovered (in order of discovery)

1. No inference guard in Reconnect()

Reconnect() called close() immediately, without waiting for in-flight inference to finish. Destroy() had the guard but Reconnect() did not.

2. Race window between GetCudaHWFrame() and IncrementInFlight()

_inFlightFrames was incremented AFTER GetCudaHWFrame() returned NVDEC pointers, so Reconnect could observe counter=0 and call close() while a D2D copy was still in progress. Log proof: attach_cuda: START with no D2D OK (ANSLEGION.log line 52859).

3. NVDEC surfaces tied to decoder lifetime

Even with proper ref counting, player->close() destroys the NVDEC decoder context, invalidating ALL surfaces at the driver level regardless of AVFrame references. Raw NVDEC device pointers in GpuFrameData become dangling.

4. Cross-stream cudaFree race (cudaErrorIllegalAddress 700)

cudaFree on the default stream (stream 0) doesn't wait for NV12 kernels running on a cv::cuda::Stream. Freeing a buffer while a kernel is still reading it → error 700 → the CUDA context is permanently corrupted → all subsequent CUDA operations fail. Log proof: ANSLEGION.3log.log line 75769.

5. cudaDeviceSynchronize blocking (900-1800ms spikes)

Adding cudaDeviceSynchronize() before cudaFree fixed error 700 but blocked the entire GPU pipeline for 900ms+ per call, causing processing time spikes.

6. GPU memory exhaustion (cudaMallocPitch FAILED e1=2)

Per-frame cudaMallocPitch/cudaFree caused VRAM to fill up when cudaFree was deferred. Error 2 = cudaErrorMemoryAllocation (OOM). Log proof: ANSLEGION.7.log at timestamp 457s.

7. Unreleased clones (70 images) on shutdown

LabVIEW AI tasks hold cloned cv::Mat* references. On quit, 70+ clones are never released, so their GPU buffers (gpuCacheY/UV) are never freed. After close() destroys the CUDA context, a later cudaFree crashes. Log proof: dump log showing a WriteFlatTransferData access violation.

8. WDDM nvcuda64 SRW lock contention (45-90s freeze)

Even with async copy, cudaStreamSynchronize blocks when another camera's close() holds the nvcuda64 exclusive SRW lock. All CUDA operations from other cameras stall. Log proof: ANSLEGION9.log — 45s gap between START and D2D OK.


Files Modified

include/ANSGpuFrameRegistry.h

  • Added GpuPendingFreeEntry struct with {ptr, deviceIdx, queuedAt} for time-based safe free
  • Added forceReleaseByOwner(void* client) — force-cleans all frames owned by a camera
  • Changed drain_gpu_pending(int minAgeMs=0) — supports time-based filtering (only free entries >100ms old)
  • Added REG_DBG logging macro (OutputDebugString + fprintf)
  • Added debug logging to attach(), release(), freeOwnedBuffers_locked()
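The time-based pending-free mechanism above can be sketched in plain C++. Names mirror GpuPendingFreeEntry and drain_gpu_pending but the class and its methods are illustrative, not the real registry; a callback stands in for cudaFree so the logic is testable without a GPU:

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <vector>

using Clock = std::chrono::steady_clock;

// Mirrors GpuPendingFreeEntry: pointer, device index, enqueue time.
struct GpuPendingFreeEntry {
    void* ptr;
    int deviceIdx;
    Clock::time_point queuedAt;
};

class PendingFreeList {
public:
    void enqueue(void* ptr, int deviceIdx) {
        std::lock_guard<std::mutex> lk(mtx_);
        pending_.push_back({ptr, deviceIdx, Clock::now()});
    }

    // Like drain_gpu_pending(minAgeMs): free only entries older than minAgeMs.
    // minAgeMs == 0 frees everything (the forced drain on Destroy/Reconnect).
    // freeFn stands in for cudaFree in this sketch. Returns the freed count.
    size_t drain(int minAgeMs, const std::function<void(void*)>& freeFn) {
        std::vector<GpuPendingFreeEntry> toFree;
        {
            std::lock_guard<std::mutex> lk(mtx_);
            auto now = Clock::now();
            auto it = pending_.begin();
            while (it != pending_.end()) {
                auto ageMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                                 now - it->queuedAt).count();
                if (ageMs >= minAgeMs) {
                    toFree.push_back(*it);
                    it = pending_.erase(it);
                } else {
                    ++it;
                }
            }
        }
        for (auto& e : toFree) freeFn(e.ptr);  // free outside the lock
        return toFree.size();
    }

private:
    std::mutex mtx_;
    std::vector<GpuPendingFreeEntry> pending_;
};
```

Freeing outside the lock matters in the real code: a slow cudaFree must not stall threads that only want to enqueue.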

modules/ANSCV/ANSGpuFrameOps.h

  • Added #include <cuda_runtime.h> and GPU_FRAME_DBG logging macro
  • Added drainAndFreeGpuPending(bool forceAll=false) — time-based safe free (100ms threshold) or forced (with cudaDeviceSynchronize for Destroy/Reconnect)
  • Rewrote gpu_frame_attach_cuda():
    • Accepts optional pool buffers: poolY, poolYPitch, poolUV, poolUVPitch, poolStream
    • Pool path (ANSRTSP): cudaMemcpy2DAsync on non-blocking stream, no sync, keeps AVFrame alive
    • Legacy path (other modules): per-frame cudaMallocPitch + cudaMemcpy2D (backward compatible)
    • Both paths: snapshot CPU NV12 as cross-GPU fallback, set isCudaDevicePtr=true
  • gpu_frame_remove() — no cudaFree in hot path (deferred to eviction/cleanup)
  • gpu_frame_evict_stale() — calls drainAndFreeGpuPending(false) with time-based safety

modules/ANSCV/ANSRTSP.h

  • Added GpuNV12Pool struct: ping-pong buffers (Y[2], UV[2]), pitch, resolution, GPU index, cudaStream
  • Added EnsureGpuPool(), DestroyGpuPool(), GetGpuPool() methods
  • Added TryIncrementInFlight() — atomically checks _isPlaying AND increments _inFlightFrames under same mutex
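The TryIncrementInFlight() idea is that the playing flag is checked and the in-flight counter incremented under the same mutex, so Reconnect can never observe counter=0 while a frame grab is mid-flight. A minimal sketch (class and member names are illustrative, not the real ANSRTSP members):

```cpp
#include <mutex>

// Sketch of the TryIncrementInFlight pattern: check _isPlaying and bump
// _inFlightFrames atomically under ONE mutex, closing the race window
// between GetCudaHWFrame() and the counter increment.
class InFlightGuard {
public:
    // Producer side: returns false once playback has been stopped.
    bool tryIncrementInFlight() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (!isPlaying_) return false;  // Reconnect already started
        ++inFlightFrames_;
        return true;
    }

    void decrementInFlight() {
        std::lock_guard<std::mutex> lk(mtx_);
        --inFlightFrames_;
    }

    // Reconnect side, step 1: stop admitting new frames.
    void stopPlaying() {
        std::lock_guard<std::mutex> lk(mtx_);
        isPlaying_ = false;
    }

    void startPlaying() {
        std::lock_guard<std::mutex> lk(mtx_);
        isPlaying_ = true;
    }

    int inFlight() {
        std::lock_guard<std::mutex> lk(mtx_);
        return inFlightFrames_;
    }

private:
    std::mutex mtx_;
    bool isPlaying_ = true;
    int inFlightFrames_ = 0;
};
```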

modules/ANSCV/ANSRTSP.cpp

  • Added RTSP_DBG logging macro (OutputDebugString + fprintf)
  • Implemented EnsureGpuPool(): allocates 2 Y + 2 UV buffers + non-blocking CUDA stream on first CUDA frame
  • Implemented DestroyGpuPool(): syncs pool stream, cudaFree all 4 buffers, destroy stream
  • Reconnect(): added inference guard (wait for _inFlightFrames), forceReleaseByOwner, DestroyGpuPool before close()
  • Destroy(): added forceReleaseByOwner, DestroyGpuPool before close()
  • GetRTSPCVImage(): calls TryIncrementInFlight() BEFORE GetCudaHWFrame(), uses pool buffers + pool stream for D2D copy
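The ping-pong rotation in GpuNV12Pool can be sketched as follows. Field and method names are hypothetical; in the real code the buffers come from cudaMallocPitch and a resolution mismatch triggers reallocation:

```cpp
#include <cstddef>

// Sketch of the GpuNV12Pool ping-pong bookkeeping (names illustrative).
// Two Y/UV buffer pairs alternate, so the frame being written never
// aliases the frame an inference engine may still be reading.
struct GpuNV12PoolSketch {
    void* y[2]  = {nullptr, nullptr};  // cudaMallocPitch'd in the real code
    void* uv[2] = {nullptr, nullptr};
    size_t yPitch = 0, uvPitch = 0;
    int width = 0, height = 0;
    int slot = -1;                     // last slot handed out

    // EnsureGpuPool would reallocate when the stream resolution changes.
    bool matches(int w, int h) const { return w == width && h == height; }

    // Hand out the other slot on each call: 0, 1, 0, 1, ...
    int next() {
        slot = (slot + 1) & 1;
        return slot;
    }
};
```

Two slots suffice because at most one frame is being copied while the previous one is consumed; a deeper pipeline would need more slots.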

modules/ANSODEngine/NV12PreprocessHelper.cpp

  • Added debug logging before tryNV12 (pointer state, refcount, gpuCacheY) and at kernel launch

Architecture After Fix

GetRTSPCVImage (camera thread):
  1. TryIncrementInFlight() — atomic check _isPlaying + increment (blocks Reconnect)
  2. GetCudaHWFrame() — gets NVDEC device pointers
  3. EnsureGpuPool() — allocate 2 ping-pong buffer pairs (first frame only)
  4. pool.next() — get next ping-pong slot
  5. gpu_frame_attach_cuda(..., poolY, poolUV, poolStream):
     a. cudaMemcpy2DAsync: NVDEC → pool buffer (non-blocking, no sync)
     b. Keep AVFrame alive (stored in GpuFrameData.avframe)
     c. yPlane = poolY, uvPlane = poolUV, isCudaDevicePtr = true
     d. gpuCacheY = nullptr (pool owns buffers)
  6. Registry callback: DecrementInFlight on last clone release

Inference (engine thread):
  1. tl_currentGpuFrame() = lookup(*cvImage) — raw GpuFrameData pointer
  2. tryNV12(): reads yPlane/uvPlane (pool buffer, CUDA device ptr)
  3. useZeroCopy = true → wraps as GpuMat → CUDA kernel launch
  4. Kernel reads from pool buffer — same data, independent of NVDEC

Reconnect (camera thread):
  1. _isPlaying = false — blocks new TryIncrementInFlight
  2. Wait _inFlightFrames == 0 (5s timeout)
  3. forceReleaseByOwner(this) — free all registry entries for this camera
  4. DestroyGpuPool() — cudaStreamSynchronize + cudaFree 4 buffers + destroy stream
  5. close() — destroy NVDEC decoder (safe, no dependencies)
  6. Setup() + play() — re-create camera
  7. Next frame: EnsureGpuPool re-allocates

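Steps 1-2 of the sequence above amount to a drain-with-timeout. A minimal sketch of that guard, using a condition variable (names are illustrative, and the real code pairs this with the _isPlaying flag):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch of the Reconnect() inference guard: after new frames are blocked,
// wait up to a timeout (5 s in the real code) for in-flight frames to
// drain before calling close().
class ReconnectGuard {
public:
    void frameStarted() {
        std::lock_guard<std::mutex> lk(mtx_);
        ++inFlight_;
    }

    void frameFinished() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (--inFlight_ == 0) cv_.notify_all();
    }

    // Returns true if all in-flight frames drained within the timeout.
    bool waitForDrain(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(mtx_);
        return cv_.wait_for(lk, timeout, [this] { return inFlight_ == 0; });
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    int inFlight_ = 0;
};
```

On timeout the real code still proceeds to forceReleaseByOwner, which is exactly why that force-clean step exists.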

Destroy (camera thread):
  Same as Reconnect steps 1-5, then delete the object.

Key Design Decisions

| Decision | Why |
|----------|-----|
| D2D copy instead of raw NVDEC pointers | Decouples NV12 lifetime from decoder — close() is always safe |
| Ping-pong GPU buffer pool (2 pairs) | Eliminates per-frame cudaMalloc/cudaFree — constant VRAM |
| cudaMemcpy2DAsync + no sync | Avoids 45-90s WDDM SRW lock freeze during close() |
| Keep AVFrame alive during async copy | NVDEC source valid until GPU finishes the copy |
| TryIncrementInFlight before GetCudaHWFrame | Closes race window — Reconnect must wait before close() |
| forceReleaseByOwner on Destroy/Reconnect | Cleans up unreleased clones before CUDA context destroyed |
| Time-based cudaFree (100ms threshold) | Frees legacy-path buffers without cudaDeviceSynchronize |
| No cudaFree in hot path (attach/remove) | Prevents cross-stream races and processing spikes |

VRAM Usage Per Camera

  • 4K (3840×2160): ~26MB fixed (2 × Y + 2 × UV with pitch alignment)
  • 1080p (1920×1080): ~6.5MB fixed
  • 720p (1280×720): ~3MB fixed
  • Zero growth during normal operation
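The figures above can be sanity-checked with a small computation. This sketch assumes cudaMallocPitch aligns each row to 512 bytes; the actual pitch is driver-dependent, so the real numbers may differ slightly:

```cpp
#include <cstddef>

// Round v up to the next multiple of a (pitch alignment).
constexpr size_t alignUp(size_t v, size_t a) { return (v + a - 1) / a * a; }

// NV12 pool: full-height Y plane plus half-height interleaved UV plane,
// two ping-pong copies of each, rows padded to the assumed 512-byte pitch.
constexpr size_t poolBytes(size_t w, size_t h) {
    size_t pitch = alignUp(w, 512);
    return 2 * (pitch * h) + 2 * (pitch * (h / 2));
}
```

With this assumption, poolBytes(3840, 2160) is 26,542,080 bytes (≈26MB), poolBytes(1920, 1080) is 6,635,520 (≈6.5MB), and poolBytes(1280, 720) is 3,317,760 (≈3MB), matching the list above.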

Status of Each Fix (chronological)

| # | Fix | Log Verified | Result |
|---|-----|--------------|--------|
| 1 | Inference guard in Reconnect | ANSLEGION.log | Closed race but didn't fix root cause |
| 2 | D2D copy (owned GPU buffers) | ANSLEGION.1log | Fixed NVDEC dependency, new race found |
| 3 | TryIncrementInFlight | ANSLEGION.2log | Fixed race window, ran 85min stable |
| 4 | cudaDeviceSynchronize before cudaFree | ANSLEGION.3log→4 | Fixed err=700, caused 900ms spikes |
| 5 | Remove cudaFree from hot path | ANSLEGION.5→6 | Removed spikes, caused memory leak |
| 6 | GPU buffer pool (ping-pong) | ANSLEGION.7→8 | Fixed OOM, zero CUDA errors, 7hr run |
| 7 | Time-based safe cudaFree (100ms) | ANSLEGION.8 | Fixed memory leak from legacy path |
| 8 | Async D2D + no sync | ANSLEGION.9→10 | Fix for WDDM SRW lock freeze |
| 9 | Keep AVFrame alive during async | Current | Ensures NVDEC source valid for async copy |

Files NOT Modified (transparent to engines)

  • All engine DLLs (ANSODEngine, ANSLPR, ANSFR, ANSOCR) — no changes
  • NV12PreprocessHelper logic — zero-copy path unchanged (isCudaDevicePtr=true)
  • CUDA kernels (nv12_to_rgb.cu) — reads GPU pointers regardless of source
  • TensorRT engine (EngineRunInference.inl) — no changes
  • Other camera modules (ANSFLV, ANSSRT, ANSMJPEG, ANSRTMP) — use legacy path, backward compatible

Testing Checklist

  • Start multiple RTSP cameras with HW decoding + multiple AI engines
  • Trigger a camera reconnect (disconnect the network cable or call ReconnectRTSP)
  • Verify: no crash, inference continues after reconnection
  • Verify: processing time stable (no 900ms+ spikes)
  • Verify: VRAM stable (no growth over time, check nvidia-smi)
  • Verify: clean shutdown (no crash on LabVIEW quit)
  • Verify logs: D2D OK (pool) on every frame, async=1
  • Verify logs: no cudaFree FAILED, no FALLBACK CPU, no err=700
  • Long run test: 1+ hours with cameras reconnecting periodically