NV12 Camera Reconnect Fix — Complete Summary
Original Issue
When an RTSP camera reconnects (stop→destroy→release→create) while inference engines (ANSALPR, ANSFR, ANSOCR, ANSODEngine) are running, the camera process destroys the NV12 GPU memory (NVDEC surfaces) that inference is actively reading, crashing LabVIEW.
Root Causes Discovered (in order of discovery)
1. No inference guard in Reconnect()
Reconnect() called close() immediately without waiting for in-flight inference.
Destroy() had the guard but Reconnect() did not.
2. Race window between GetCudaHWFrame() and IncrementInFlight()
_inFlightFrames was incremented AFTER GetCudaHWFrame() returned NVDEC pointers.
Reconnect could see counter=0 and call close() while D2D copy was in progress.
Log proof: attach_cuda: START with no D2D OK (ANSLEGION.log line 52859).
3. NVDEC surfaces tied to decoder lifetime
Even with proper ref counting, player->close() destroys the NVDEC decoder context,
invalidating ALL surfaces at the driver level regardless of AVFrame references.
Raw NVDEC device pointers in GpuFrameData become dangling.
4. Cross-stream cudaFree race (cudaErrorIllegalAddress 700)
cudaFree on default stream (stream 0) doesn't wait for NV12 kernels on
cv::cuda::Stream. Freeing a buffer while a kernel reads it → error 700 →
CUDA context permanently corrupted → all subsequent CUDA operations fail.
Log proof: ANSLEGION.3log.log line 75769.
5. cudaDeviceSynchronize blocking (900-1800ms spikes)
Adding cudaDeviceSynchronize() before cudaFree fixed error 700 but blocked
the entire GPU pipeline for 900ms+ per call, causing processing time spikes.
6. GPU memory exhaustion (cudaMallocPitch FAILED e1=2)
Per-frame cudaMallocPitch/cudaFree caused VRAM to fill up when cudaFree was
deferred. Error 2 = cudaErrorMemoryAllocation (OOM).
Log proof: ANSLEGION.7.log at timestamp 457s.
7. Unreleased clones (70 images) on shutdown
LabVIEW AI tasks hold cloned cv::Mat* references. On quit, 70+ clones are not
released. Their GPU buffers (gpuCacheY/UV) are never freed. After close() destroys
CUDA context, later cudaFree crashes.
Log proof: dump log showing WriteFlatTransferData access violation.
8. WDDM nvcuda64 SRW lock contention (45-90s freeze)
Even with async copy, cudaStreamSynchronize blocks when another camera's close()
holds the nvcuda64 exclusive SRW lock. All CUDA operations from other cameras stall.
Log proof: ANSLEGION9.log — 45s gap between START and D2D OK.
Files Modified
include/ANSGpuFrameRegistry.h
- Added `GpuPendingFreeEntry` struct with `{ptr, deviceIdx, queuedAt}` for time-based safe free
- Added `forceReleaseByOwner(void* client)` — force-cleans all frames owned by a camera
- Changed `drain_gpu_pending(int minAgeMs=0)` — supports time-based filtering (only free entries >100ms old)
- Added `REG_DBG` logging macro (OutputDebugString + fprintf)
- Added debug logging to `attach()`, `release()`, `freeOwnedBuffers_locked()`
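The registry changes above hinge on one idea: a retired GPU pointer is not freed immediately but queued with a timestamp, and the drain call only frees entries older than a threshold, so any kernel still reading the buffer on another stream has finished by then. A minimal sketch of that queue logic, assuming illustrative names (`PendingFreeQueue`, the `freeFn` callback); the real registry calls `cudaFree` where `freeFn` is invoked:

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>

// Illustrative stand-in for the registry's pending-free list.
// queuedAt mirrors the GpuPendingFreeEntry field described above.
struct PendingEntry {
    void* ptr;
    int deviceIdx;
    std::chrono::steady_clock::time_point queuedAt;
};

class PendingFreeQueue {
public:
    // Called from the hot path instead of cudaFree: O(1), no GPU sync.
    void queueFree(void* ptr, int deviceIdx) {
        std::lock_guard<std::mutex> lk(_m);
        _pending.push_back({ptr, deviceIdx, std::chrono::steady_clock::now()});
    }

    // Frees only entries older than minAgeMs; minAgeMs=0 forces everything.
    // Returns the number of entries released.
    int drain(int minAgeMs, const std::function<void(void*, int)>& freeFn) {
        std::lock_guard<std::mutex> lk(_m);
        const auto now = std::chrono::steady_clock::now();
        int freed = 0;
        for (auto it = _pending.begin(); it != _pending.end();) {
            const auto ageMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                                   now - it->queuedAt).count();
            if (ageMs >= minAgeMs) {
                freeFn(it->ptr, it->deviceIdx);  // real code: cudaFree on deviceIdx
                it = _pending.erase(it);
                ++freed;
            } else {
                ++it;
            }
        }
        return freed;
    }

private:
    std::mutex _m;
    std::deque<PendingEntry> _pending;
};
```

The 100ms threshold is a heuristic: it trades a short deferral window for never needing `cudaDeviceSynchronize` in the steady state.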
modules/ANSCV/ANSGpuFrameOps.h
- Added `#include <cuda_runtime.h>` and `GPU_FRAME_DBG` logging macro
- Added `drainAndFreeGpuPending(bool forceAll=false)` — time-based safe free (100ms threshold) or forced (with cudaDeviceSynchronize for Destroy/Reconnect)
- Rewrote `gpu_frame_attach_cuda()`:
  - Accepts optional pool buffers: `poolY`, `poolYPitch`, `poolUV`, `poolUVPitch`, `poolStream`
  - Pool path (ANSRTSP): `cudaMemcpy2DAsync` on a non-blocking stream, no sync, keeps the AVFrame alive
  - Legacy path (other modules): per-frame `cudaMallocPitch` + `cudaMemcpy2D` (backward compatible)
  - Both paths: snapshot CPU NV12 as a cross-GPU fallback, set `isCudaDevicePtr=true`
- `gpu_frame_remove()` — no cudaFree in hot path (deferred to eviction/cleanup)
- `gpu_frame_evict_stale()` — calls `drainAndFreeGpuPending(false)` with time-based safety
modules/ANSCV/ANSRTSP.h
- Added `GpuNV12Pool` struct: ping-pong buffers (Y[2], UV[2]), pitch, resolution, GPU index, CUDA stream
- Added `EnsureGpuPool()`, `DestroyGpuPool()`, `GetGpuPool()` methods
- Added `TryIncrementInFlight()` — atomically checks `_isPlaying` AND increments `_inFlightFrames` under the same mutex
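The race in root cause #2 existed because the playing check and the in-flight increment were separate steps. The fix folds them into one critical section, and Reconnect waits on the same state before tearing anything down. A minimal sketch of the pattern, assuming a condition-variable layout and class name (`InFlightGuard`) that are illustrative, not the actual ANSRTSP internals; the member names and the 5s timeout follow this summary:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch of the in-flight guard: TryIncrementInFlight() can never succeed
// after BeginReconnect() has flipped _isPlaying, so close() cannot run
// while a D2D copy or inference read is in progress.
class InFlightGuard {
public:
    // Camera thread, BEFORE GetCudaHWFrame(): check + increment atomically.
    bool TryIncrementInFlight() {
        std::lock_guard<std::mutex> lk(_m);
        if (!_isPlaying) return false;   // reconnect in progress: drop frame
        ++_inFlightFrames;
        return true;
    }

    // Registry callback when the last clone of a frame is released.
    void DecrementInFlight() {
        std::lock_guard<std::mutex> lk(_m);
        if (--_inFlightFrames == 0) _cv.notify_all();
    }

    // Reconnect/Destroy: block new frames, then wait for drain (5s timeout).
    // Returns false if in-flight work did not finish in time.
    bool BeginReconnect() {
        std::unique_lock<std::mutex> lk(_m);
        _isPlaying = false;              // blocks new TryIncrementInFlight
        return _cv.wait_for(lk, std::chrono::seconds(5),
                            [this] { return _inFlightFrames == 0; });
    }

    void ResumePlaying() {
        std::lock_guard<std::mutex> lk(_m);
        _isPlaying = true;
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    bool _isPlaying = true;
    int _inFlightFrames = 0;
};
```

Because both operations take the same mutex, there is no window where Reconnect observes a zero counter while a frame grab is between the check and the increment.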
modules/ANSCV/ANSRTSP.cpp
- Added `RTSP_DBG` logging macro (OutputDebugString + fprintf)
- Implemented `EnsureGpuPool()`: allocates 2 Y + 2 UV buffers plus a non-blocking CUDA stream on the first CUDA frame
- Implemented `DestroyGpuPool()`: syncs the pool stream, cudaFrees all 4 buffers, destroys the stream
- `Reconnect()`: added inference guard (wait for `_inFlightFrames`), `forceReleaseByOwner`, `DestroyGpuPool` before `close()`
- `Destroy()`: added `forceReleaseByOwner`, `DestroyGpuPool` before `close()`
- `GetRTSPCVImage()`: calls `TryIncrementInFlight()` BEFORE `GetCudaHWFrame()`, uses pool buffers + pool stream for the D2D copy
modules/ANSODEngine/NV12PreprocessHelper.cpp
- Added debug logging before tryNV12 (pointer state, refcount, gpuCacheY) and at kernel launch
Architecture After Fix
GetRTSPCVImage (camera thread):
1. TryIncrementInFlight() — atomic check _isPlaying + increment (blocks Reconnect)
2. GetCudaHWFrame() — gets NVDEC device pointers
3. EnsureGpuPool() — allocate 2 ping-pong buffer pairs (first frame only)
4. pool.next() — get next ping-pong slot
5. gpu_frame_attach_cuda(..., poolY, poolUV, poolStream):
a. cudaMemcpy2DAsync: NVDEC → pool buffer (non-blocking, no sync)
b. Keep AVFrame alive (stored in GpuFrameData.avframe)
c. yPlane = poolY, uvPlane = poolUV, isCudaDevicePtr = true
d. gpuCacheY = nullptr (pool owns buffers)
6. Registry callback: DecrementInFlight on last clone release
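Step 4 above alternates between the two pre-allocated buffer pairs, so frame N's copy lands in one slot while inference may still be reading the other. A sketch of just that slot selection, with allocation and `cudaMemcpy2DAsync` omitted and the struct name (`PingPongPool`) illustrative:

```cpp
#include <array>
#include <atomic>

// Sketch of the ping-pong slot selection in the GPU NV12 pool.
// In the real pool each slot holds a pitched Y and UV device buffer;
// here only the alternation that keeps per-camera VRAM constant is shown.
struct PingPongPool {
    std::array<void*, 2> y{};    // two Y-plane device buffers
    std::array<void*, 2> uv{};   // two UV-plane device buffers
    std::atomic<unsigned> counter{0};

    // Returns the slot for the next frame: 0, 1, 0, 1, ...
    // Frame N writes slot N%2 while inference may still read slot (N-1)%2.
    unsigned next() { return counter.fetch_add(1) % 2; }
};
```

Two slots suffice here because a frame's clones are released (and `DecrementInFlight` fires) before more than one newer frame can be queued into the same slot; a deeper pipeline would need more slots.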
Inference (engine thread):
1. tl_currentGpuFrame() = lookup(*cvImage) — raw GpuFrameData pointer
2. tryNV12(): reads yPlane/uvPlane (pool buffer, CUDA device ptr)
3. useZeroCopy = true → wraps as GpuMat → CUDA kernel launch
4. Kernel reads from pool buffer — same data, independent of NVDEC
Reconnect (camera thread):
1. _isPlaying = false — blocks new TryIncrementInFlight
2. Wait _inFlightFrames == 0 (5s timeout)
3. forceReleaseByOwner(this) — free all registry entries for this camera
4. DestroyGpuPool() — cudaStreamSynchronize + cudaFree 4 buffers + destroy stream
5. close() — destroy NVDEC decoder (safe, no dependencies)
6. Setup() + play() — re-create camera
7. Next frame: EnsureGpuPool re-allocates
Destroy (camera thread):
Same as Reconnect steps 1-5, then delete object.
Key Design Decisions
| Decision | Why |
|---|---|
| D2D copy instead of raw NVDEC pointers | Decouples NV12 lifetime from decoder — close() is always safe |
| Ping-pong GPU buffer pool (2 pairs) | Eliminates per-frame cudaMalloc/cudaFree — constant VRAM |
| cudaMemcpy2DAsync + no sync | Avoids 45-90s WDDM SRW lock freeze during close() |
| Keep AVFrame alive during async copy | NVDEC source valid until GPU finishes the copy |
| TryIncrementInFlight before GetCudaHWFrame | Closes race window — Reconnect must wait before close() |
| forceReleaseByOwner on Destroy/Reconnect | Cleans up unreleased clones before CUDA context destroyed |
| Time-based cudaFree (100ms threshold) | Frees legacy-path buffers without cudaDeviceSynchronize |
| No cudaFree in hot path (attach/remove) | Prevents cross-stream races and processing spikes |
VRAM Usage Per Camera
- 4K (3840×2160): ~26MB fixed (2 × Y + 2 × UV with pitch alignment)
- 1080p (1920×1080): ~6.5MB fixed
- 720p (1280×720): ~3MB fixed
- Zero growth during normal operation
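These fixed sizes follow directly from the NV12 layout: the Y plane is width × height bytes, the interleaved UV plane is width × height/2, and each row is padded to the pitch returned by `cudaMallocPitch`. A back-of-the-envelope check, assuming a 512-byte pitch alignment (an assumption for illustration; the actual pitch is driver-dependent):

```cpp
#include <cstddef>

// Per-camera VRAM for the ping-pong pool: 2 Y + 2 UV NV12 planes,
// each row padded to the pitch. kPitchAlign = 512 is an assumption;
// the real pitch comes from cudaMallocPitch.
constexpr std::size_t kPitchAlign = 512;

constexpr std::size_t alignUp(std::size_t v, std::size_t a) {
    return (v + a - 1) / a * a;
}

constexpr std::size_t poolBytes(std::size_t width, std::size_t height) {
    const std::size_t pitch   = alignUp(width, kPitchAlign);
    const std::size_t yPlane  = pitch * height;        // full-res luma
    const std::size_t uvPlane = pitch * (height / 2);  // interleaved chroma, half height
    return 2 * (yPlane + uvPlane);                     // ping-pong: 2 of each
}
```

With the 512-byte assumption this reproduces the figures above (about 26.5 MB for 4K, 6.6 MB for 1080p, 3.3 MB for 720p); exact numbers vary with the pitch the driver picks.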
Status of Each Fix (chronological)
| # | Fix | Log Verified | Result |
|---|---|---|---|
| 1 | Inference guard in Reconnect | ANSLEGION.log | Closed race but didn't fix root cause |
| 2 | D2D copy (owned GPU buffers) | ANSLEGION.1log | Fixed NVDEC dependency, new race found |
| 3 | TryIncrementInFlight | ANSLEGION.2log | Fixed race window, ran 85min stable |
| 4 | cudaDeviceSynchronize before cudaFree | ANSLEGION.3log→4 | Fixed err=700, caused 900ms spikes |
| 5 | Remove cudaFree from hot path | ANSLEGION.5→6 | Removed spikes, caused memory leak |
| 6 | GPU buffer pool (ping-pong) | ANSLEGION.7→8 | Fixed OOM, zero CUDA errors, 7hr run |
| 7 | Time-based safe cudaFree (100ms) | ANSLEGION.8 | Fixed memory leak from legacy path |
| 8 | Async D2D + no sync | ANSLEGION.9→10 | Fix for WDDM SRW lock freeze |
| 9 | Keep AVFrame alive during async | Current | Ensures NVDEC source valid for async copy |
Files NOT Modified (transparent to engines)
- All engine DLLs (ANSODEngine, ANSLPR, ANSFR, ANSOCR) — no changes
- NV12PreprocessHelper logic — zero-copy path unchanged (isCudaDevicePtr=true)
- CUDA kernels (nv12_to_rgb.cu) — reads GPU pointers regardless of source
- TensorRT engine (EngineRunInference.inl) — no changes
- Other camera modules (ANSFLV, ANSSRT, ANSMJPEG, ANSRTMP) — use legacy path, backward compatible
Testing Checklist
- Start multiple RTSP cameras with HW decoding + multiple AI engines
- Trigger camera reconnect (disconnect network cable or call ReconnectRTSP)
- Verify: no crash, inference continues after reconnection
- Verify: processing time stable (no 900ms+ spikes)
- Verify: VRAM stable (no growth over time, check nvidia-smi)
- Verify: clean shutdown (no crash on LabVIEW quit)
- Verify logs: `D2D OK (pool)` on every frame, `async=1`
- Verify logs: no `cudaFree FAILED`, no `FALLBACK CPU`, no `err=700`
- Long run test: 1+ hours with cameras reconnecting periodically