ANSCORE/NV12_CAMERA_RECONNECT_FIX_SUMMARY.md
2026-04-03 14:51:52 +11:00

# NV12 Camera Reconnect Fix — Complete Summary
## Original Issue
When an RTSP camera reconnects (stop→destroy→release→create) while inference engines
(ANSALPR, ANSFR, ANSOCR, ANSODEngine) are running, the camera process destroys NV12
GPU memory (NVDEC surfaces) that inference is actively reading → LabVIEW crash.
## Root Causes Discovered (in order of discovery)
### 1. No inference guard in Reconnect()
`Reconnect()` called `close()` immediately without waiting for in-flight inference.
`Destroy()` had the guard but `Reconnect()` did not.
### 2. Race window between GetCudaHWFrame() and IncrementInFlight()
`_inFlightFrames` was incremented AFTER `GetCudaHWFrame()` returned NVDEC pointers.
Reconnect could see counter=0 and call `close()` while D2D copy was in progress.
**Log proof:** `attach_cuda: START` with no `D2D OK` (ANSLEGION.log line 52859).
### 3. NVDEC surfaces tied to decoder lifetime
Even with proper ref counting, `player->close()` destroys the NVDEC decoder context,
invalidating ALL surfaces at the driver level regardless of AVFrame references.
Raw NVDEC device pointers in GpuFrameData become dangling.
### 4. Cross-stream cudaFree race (cudaErrorIllegalAddress 700)
`cudaFree` on the default stream (stream 0) doesn't wait for NV12 kernels running on a
cv::cuda::Stream. Freeing a buffer while a kernel is still reading it → error 700 →
CUDA context permanently corrupted → all subsequent CUDA operations fail.
**Log proof:** ANSLEGION.3log.log line 75769.
### 5. cudaDeviceSynchronize blocking (900-1800ms spikes)
Adding `cudaDeviceSynchronize()` before `cudaFree` fixed error 700 but blocked
the entire GPU pipeline for 900ms+ per call, causing processing-time spikes.
### 6. GPU memory exhaustion (cudaMallocPitch FAILED e1=2)
Per-frame `cudaMallocPitch`/`cudaFree` caused VRAM to fill up when cudaFree was
deferred. Error 2 = `cudaErrorMemoryAllocation` (OOM).
**Log proof:** ANSLEGION.7.log at timestamp 457s.
### 7. Unreleased clones (70 images) on shutdown
LabVIEW AI tasks hold cloned cv::Mat* references. On quit, 70+ clones are never
released, so their GPU buffers (gpuCacheY/UV) are never freed. After `close()` destroys
the CUDA context, a later `cudaFree` on those buffers crashes.
**Log proof:** dump log showing `WriteFlatTransferData` access violation.
### 8. WDDM nvcuda64 SRW lock contention (45-90s freeze)
Even with async copy, `cudaStreamSynchronize` blocks when another camera's `close()`
holds the nvcuda64 exclusive SRW lock. All CUDA operations from other cameras stall.
**Log proof:** ANSLEGION9.log — 45s gap between START and D2D OK.
---
## Files Modified
### `include/ANSGpuFrameRegistry.h`
- Added `GpuPendingFreeEntry` struct with `{ptr, deviceIdx, queuedAt}` for time-based safe free
- Added `forceReleaseByOwner(void* client)` — force-cleans all frames owned by a camera
- Changed `drain_gpu_pending(int minAgeMs=0)` — supports time-based filtering (only frees entries older than 100 ms)
- Added `REG_DBG` logging macro (OutputDebugString + fprintf)
- Added debug logging to `attach()`, `release()`, `freeOwnedBuffers_locked()`
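
The time-based safe-free mechanism above can be sketched as follows. This is a minimal model, not the actual registry code: the real `cudaFree` (preceded by `cudaSetDevice`) is abstracted behind a callback so the aging logic is testable without a GPU, and `PendingFreeQueue`/`queueFree` are illustrative names not taken from the summary:

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>

// Mirrors the struct added to ANSGpuFrameRegistry.h (sketch).
struct GpuPendingFreeEntry {
    void* ptr;
    int deviceIdx;
    std::chrono::steady_clock::time_point queuedAt;
};

class PendingFreeQueue {
public:
    void queueFree(void* ptr, int deviceIdx) {
        std::lock_guard<std::mutex> lk(mtx_);
        pending_.push_back({ptr, deviceIdx, std::chrono::steady_clock::now()});
    }

    // Frees only entries at least minAgeMs old; minAgeMs=0 drains everything.
    // freeFn stands in for cudaFree (injected here so the sketch runs CPU-only).
    size_t drain_gpu_pending(int minAgeMs, const std::function<void(void*)>& freeFn) {
        std::lock_guard<std::mutex> lk(mtx_);
        const auto now = std::chrono::steady_clock::now();
        size_t freed = 0;
        for (auto it = pending_.begin(); it != pending_.end();) {
            auto ageMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                             now - it->queuedAt).count();
            if (ageMs >= minAgeMs) {
                freeFn(it->ptr);  // real code: cudaSetDevice(it->deviceIdx); cudaFree(it->ptr);
                it = pending_.erase(it);
                ++freed;
            } else {
                ++it;
            }
        }
        return freed;
    }

private:
    std::mutex mtx_;
    std::deque<GpuPendingFreeEntry> pending_;
};
```

The idea behind the 100 ms threshold: any kernel still reading a queued buffer finishes well inside that window, so an age check replaces the expensive `cudaDeviceSynchronize`; `minAgeMs=0` (the forced path) is reserved for Destroy/Reconnect, where a full sync precedes the drain.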
### `modules/ANSCV/ANSGpuFrameOps.h`
- Added `#include <cuda_runtime.h>` and `GPU_FRAME_DBG` logging macro
- Added `drainAndFreeGpuPending(bool forceAll=false)` — time-based safe free (100ms threshold) or forced (with cudaDeviceSynchronize for Destroy/Reconnect)
- **Rewrote `gpu_frame_attach_cuda()`**:
- Accepts optional pool buffers: `poolY, poolYPitch, poolUV, poolUVPitch, poolStream`
- **Pool path (ANSRTSP):** `cudaMemcpy2DAsync` on non-blocking stream, no sync, keeps AVFrame alive
- **Legacy path (other modules):** per-frame `cudaMallocPitch` + `cudaMemcpy2D` (backward compatible)
- Both paths: snapshot CPU NV12 as cross-GPU fallback, set `isCudaDevicePtr=true`
- `gpu_frame_remove()` — no cudaFree in hot path (deferred to eviction/cleanup)
- `gpu_frame_evict_stale()` — calls `drainAndFreeGpuPending(false)` with time-based safety
### `modules/ANSCV/ANSRTSP.h`
- Added `GpuNV12Pool` struct: ping-pong buffers (Y[2], UV[2]), pitch, resolution, GPU index, cudaStream
- Added `EnsureGpuPool()`, `DestroyGpuPool()`, `GetGpuPool()` methods
- Added `TryIncrementInFlight()` — atomically checks `_isPlaying` AND increments `_inFlightFrames` under same mutex
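
The key property of `TryIncrementInFlight()` is that the playing flag and the in-flight counter are checked and updated under the same mutex, so Reconnect (which clears the flag, then waits on the counter) can never miss a reader. A minimal sketch of that idea, with illustrative names where the summary doesn't give them:

```cpp
#include <mutex>

// Sketch of the TryIncrementInFlight() contract from ANSRTSP.h.
class InFlightGuard {
public:
    // Succeeds (and increments) only while the camera is still playing.
    bool TryIncrementInFlight() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (!isPlaying_) return false;  // Reconnect/Destroy already started: refuse
        ++inFlight_;
        return true;
    }

    void DecrementInFlight() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (inFlight_ > 0) --inFlight_;
    }

    // Called by Reconnect()/Destroy(): stop admitting new readers.
    void StopPlaying() {
        std::lock_guard<std::mutex> lk(mtx_);
        isPlaying_ = false;
    }

    int InFlight() {
        std::lock_guard<std::mutex> lk(mtx_);
        return inFlight_;
    }

private:
    std::mutex mtx_;
    bool isPlaying_ = true;
    int inFlight_ = 0;
};
```

Checking the flag and incrementing in two separate atomic steps would reopen the §2 race window; the single critical section is what closes it.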
### `modules/ANSCV/ANSRTSP.cpp`
- Added `RTSP_DBG` logging macro (OutputDebugString + fprintf)
- Implemented `EnsureGpuPool()`: allocates 2 Y + 2 UV buffers + non-blocking CUDA stream on first CUDA frame
- Implemented `DestroyGpuPool()`: syncs pool stream, cudaFree all 4 buffers, destroy stream
- **`Reconnect()`**: added inference guard (wait for _inFlightFrames), forceReleaseByOwner, DestroyGpuPool before close()
- **`Destroy()`**: added forceReleaseByOwner, DestroyGpuPool before close()
- **`GetRTSPCVImage()`**: calls `TryIncrementInFlight()` BEFORE `GetCudaHWFrame()`, uses pool buffers + pool stream for D2D copy
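
The ping-pong pool can be modeled like this (allocation and the CUDA stream are omitted; the point is the slot alternation — the D2D copy for frame N+1 lands in the other buffer pair, so it never overwrites data inference may still be reading for frame N; field names beyond those in the summary are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of GpuNV12Pool from ANSRTSP.h: two Y buffers, two UV buffers,
// their pitches, and a counter driving the ping-pong slot selection.
struct GpuNV12Pool {
    void*  y[2]  = {nullptr, nullptr};   // device pointers in the real code
    void*  uv[2] = {nullptr, nullptr};
    size_t yPitch = 0, uvPitch = 0;
    int    width = 0, height = 0, gpuIndex = 0;
    uint64_t frameCount = 0;             // illustrative: drives the alternation

    // Slot for the next frame: 0, 1, 0, 1, ...
    int next() { return static_cast<int>(frameCount++ & 1); }
};
```

Two slots suffice because at most one frame per camera is being copied while the previous one is consumed; deeper pipelining would need more slots or per-slot fences.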
### `modules/ANSODEngine/NV12PreprocessHelper.cpp`
- Added debug logging before tryNV12 (pointer state, refcount, gpuCacheY) and at kernel launch
---
## Architecture After Fix
```
GetRTSPCVImage (camera thread):
1. TryIncrementInFlight() — atomic check _isPlaying + increment (blocks Reconnect)
2. GetCudaHWFrame() — gets NVDEC device pointers
3. EnsureGpuPool() — allocate 2 ping-pong buffer pairs (first frame only)
4. pool.next() — get next ping-pong slot
5. gpu_frame_attach_cuda(..., poolY, poolUV, poolStream):
a. cudaMemcpy2DAsync: NVDEC → pool buffer (non-blocking, no sync)
b. Keep AVFrame alive (stored in GpuFrameData.avframe)
c. yPlane = poolY, uvPlane = poolUV, isCudaDevicePtr = true
d. gpuCacheY = nullptr (pool owns buffers)
6. Registry callback: DecrementInFlight on last clone release
Inference (engine thread):
1. tl_currentGpuFrame() = lookup(*cvImage) — raw GpuFrameData pointer
2. tryNV12(): reads yPlane/uvPlane (pool buffer, CUDA device ptr)
3. useZeroCopy = true → wraps as GpuMat → CUDA kernel launch
4. Kernel reads from pool buffer — same data, independent of NVDEC
Reconnect (camera thread):
1. _isPlaying = false — blocks new TryIncrementInFlight
2. Wait _inFlightFrames == 0 (5s timeout)
3. forceReleaseByOwner(this) — free all registry entries for this camera
4. DestroyGpuPool() — cudaStreamSynchronize + cudaFree 4 buffers + destroy stream
5. close() — destroy NVDEC decoder (safe, no dependencies)
6. Setup() + play() — re-create camera
7. Next frame: EnsureGpuPool re-allocates
Destroy (camera thread):
Same as Reconnect steps 1-5, then delete object.
```
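
The inference guard in Reconnect step 2 amounts to a bounded wait on `_inFlightFrames`. A minimal polling sketch (the real code may well use a condition variable instead; the 5 s timeout matches the summary):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Waits until inFlight reaches zero or the timeout elapses.
// Returns true if all in-flight inference drained in time.
bool WaitForInFlightDrain(std::atomic<int>& inFlight,
                          std::chrono::milliseconds timeout = std::chrono::milliseconds(5000)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (inFlight.load(std::memory_order_acquire) > 0) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // timed out: caller proceeds anyway as a last resort
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

Ordering matters: `_isPlaying = false` must precede this wait, otherwise a new `TryIncrementInFlight()` could slip in after the counter reads zero.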
## Key Design Decisions
| Decision | Why |
|----------|-----|
| D2D copy instead of raw NVDEC pointers | Decouples NV12 lifetime from decoder — close() is always safe |
| Ping-pong GPU buffer pool (2 pairs) | Eliminates per-frame cudaMalloc/cudaFree — constant VRAM |
| cudaMemcpy2DAsync + no sync | Avoids 45-90s WDDM SRW lock freeze during close() |
| Keep AVFrame alive during async copy | NVDEC source valid until GPU finishes the copy |
| TryIncrementInFlight before GetCudaHWFrame | Closes race window — Reconnect must wait before close() |
| forceReleaseByOwner on Destroy/Reconnect | Cleans up unreleased clones before CUDA context destroyed |
| Time-based cudaFree (100ms threshold) | Frees legacy-path buffers without cudaDeviceSynchronize |
| No cudaFree in hot path (attach/remove) | Prevents cross-stream races and processing spikes |
## VRAM Usage Per Camera
- 4K (3840×2160): ~26MB fixed (2 × Y + 2 × UV with pitch alignment)
- 1080p (1920×1080): ~6.5MB fixed
- 720p (1280×720): ~3MB fixed
- Zero growth during normal operation
## Status of Each Fix (chronological)
| # | Fix | Log Verified | Result |
|---|-----|-------------|--------|
| 1 | Inference guard in Reconnect | ANSLEGION.log | Closed race but didn't fix root cause |
| 2 | D2D copy (owned GPU buffers) | ANSLEGION.1log | Fixed NVDEC dependency, new race found |
| 3 | TryIncrementInFlight | ANSLEGION.2log | Fixed race window, ran 85min stable |
| 4 | cudaDeviceSynchronize before cudaFree | ANSLEGION.3log→4 | Fixed err=700, caused 900ms spikes |
| 5 | Remove cudaFree from hot path | ANSLEGION.5→6 | Removed spikes, caused memory leak |
| 6 | GPU buffer pool (ping-pong) | ANSLEGION.7→8 | Fixed OOM, zero CUDA errors, 7hr run |
| 7 | Time-based safe cudaFree (100ms) | ANSLEGION.8 | Fixed memory leak from legacy path |
| 8 | Async D2D + no sync | ANSLEGION.9→10 | Fix for WDDM SRW lock freeze |
| 9 | Keep AVFrame alive during async | Current | Ensures NVDEC source valid for async copy |
## Files NOT Modified (transparent to engines)
- All engine DLLs (ANSODEngine, ANSLPR, ANSFR, ANSOCR) — no changes
- NV12PreprocessHelper logic — zero-copy path unchanged (isCudaDevicePtr=true)
- CUDA kernels (nv12_to_rgb.cu) — reads GPU pointers regardless of source
- TensorRT engine (EngineRunInference.inl) — no changes
- Other camera modules (ANSFLV, ANSSRT, ANSMJPEG, ANSRTMP) — use legacy path, backward compatible
## Testing Checklist
- [ ] Start multiple RTSP cameras with HW decoding + multiple AI engines
- [ ] Trigger camera reconnect (disconnect network cable or call ReconnectRTSP)
- [ ] Verify: no crash, inference continues after reconnection
- [ ] Verify: processing time stable (no 900ms+ spikes)
- [ ] Verify: VRAM stable (no growth over time, check nvidia-smi)
- [ ] Verify: clean shutdown (no crash on LabVIEW quit)
- [ ] Verify logs: `D2D OK (pool)` on every frame, `async=1`
- [ ] Verify logs: no `cudaFree FAILED`, no `FALLBACK CPU`, no `err=700`
- [ ] Long run test: 1+ hours with cameras reconnecting periodically