diff --git a/.claude/settings.local.json b/.claude/settings.local.json
index aaec1c5..35748c3 100644
--- a/.claude/settings.local.json
+++ b/.claude/settings.local.json
@@ -75,7 +75,8 @@
       "Bash(powershell -Command \"\\(Get-Content ''C:\\\\Users\\\\nghia\\\\Downloads\\\\ANSLEGION40.log''\\).Count\")",
       "Bash(python -c \":*)",
       "Bash(find /c/Projects/CLionProjects/ANSCORE -type d -name *ANSODEngine*)",
-      "Bash(powershell -Command \"\\(Get-Content ''C:\\\\Users\\\\nghia\\\\Downloads\\\\ANSLEGION41.log''\\).Count\")"
+      "Bash(powershell -Command \"\\(Get-Content ''C:\\\\Users\\\\nghia\\\\Downloads\\\\ANSLEGION41.log''\\).Count\")",
+      "Bash(powershell -Command \"\\(Get-Content ''C:\\\\Users\\\\nghia\\\\Downloads\\\\ANSLEGION42.log''\\).Count\")"
     ]
   }
 }
diff --git a/ANSFrame_Multi_Resolution_Plan.md b/ANSFrame_Multi_Resolution_Plan.md
new file mode 100644
index 0000000..34d4c06
--- /dev/null
+++ b/ANSFrame_Multi_Resolution_Plan.md
@@ -0,0 +1,864 @@
+# ANSFrame Multi-Resolution Architecture Plan
+
+## Detailed Comparison: Current vs ANSFrame
+
+### Test Configuration
+- GPU: RTX 4070 Laptop (8 GB VRAM)
+- Cameras: 5 running (3840×2160, 2880×1620, 1920×1080)
+- AI Tasks: 12 (subscribing to cameras)
+- Engines: 7 TRT engines
+- Decode: Software (YUV420P, CPU)
+- Frame rate: SetTargetFPS(100) = ~10 FPS per camera
+- Baseline: 8.4 hour stable run (ANSLEGION42), 1,048,325 inferences
+
+### Per-Frame Generation (inside GetImage)
+
+| Step | Current | ANSFrame |
+|---|---|---|
+| Decode YUV420P | ~5ms (CPU) | ~5ms (same) |
+| Full res BGR (cvtColor) | ~4-8ms (4K YUV→BGR) | ~4-8ms (same) |
+| 640×640 letterbox | Not done here | ~0.8ms (resize YUV planes + cvtColor, done ONCE) |
+| 1080p display | Not done (returns 4K) | ~1.5ms (resize YUV planes + cvtColor, done ONCE) |
+| **Total GetImage time** | **~4-8ms** | **~6-10ms** (+2ms for 2 extra sizes) |
+
+### Clone & Dispatch to AI (per clone × 12 AI tasks)
+
+| Step | Current (4K clone) | ANSFrame (1080p clone) |
+|---|---|---|
+| Clone image size | 3840×2160×3 = **24.9 MB** | 1920×1080×3 = **6.2 MB** |
+| memcpy time per clone | **3-5ms** | **0.8-1.2ms** |
+| 12 clones total size | **299 MB** | **74 MB** |
+| 12 clones total time | **36-60ms** | **10-14ms** |
+
+### AI Preprocessing (per AI task)
+
+| Step | Current | ANSFrame |
+|---|---|---|
+| Receive image | 4K BGR (24.9 MB) | 1080p BGR (6.2 MB) + ANSFrame ref |
+| Local clone in engine | 24.9 MB memcpy (~3ms) | 6.2 MB memcpy (~0.8ms) |
+| CPU letterbox resize | 4K→640×640 (~2-3ms) | **SKIP** (use ANSFrame.inference, 0ms) |
+| BGR→RGB | 640×640 (~0.3ms) | 640×640 (~0.3ms) |
+| GPU upload | 1.2 MB (~0.1ms) | 1.2 MB (~0.1ms) |
+| **Total preprocess per task** | **~5-6ms** | **~1.2ms** |
+| **12 tasks total** | **~60-72ms** | **~14ms** |
+
+### Pipeline Crop Quality (ALPR, Face Recognition)
+
+| Step | Current | ANSFrame |
+|---|---|---|
+| Detection from | 4K image | 1080p display (detection) → ANSFrame.fullRes (crop) |
+| Crop source | Same 4K image | ANSFrame.fullRes (4K original) |
+| **Crop quality** | **4K** | **4K** (identical) |
+
+### Total Processing Time Per Frame Cycle
+
+| Phase | Current (4K) | ANSFrame | Savings |
+|---|---|---|---|
+| GetImage generation | 4-8ms | 6-10ms | -2ms |
+| Clone × 12 | 36-60ms | 10-14ms | **+22-46ms** |
+| AI preprocess × 12 | 60-72ms | 14ms | **+46-58ms** |
+| TRT inference × 12 | 60-600ms | 60-600ms | Same |
+| Postprocess × 12 | 12-60ms | 12-60ms | Same |
+| **Total CPU overhead** | **112-200ms** | **42-98ms** | **~70-100ms saved** |
+| **Total with inference** | **172-800ms** | **102-698ms** | **~70-100ms saved** |
+
+### RAM Usage
+
+| Resource | Current | ANSFrame |
+|---|---|---|
+| GetImage output (per camera) | 24.9 MB (4K BGR) | 32.3 MB (3 images) |
+| Clones in flight (12 tasks) | 12 × 24.9 = **299 MB** | 12 × 6.2 = **74 MB** |
+| AI local clone (12 tasks) | 12 × 24.9 = **299 MB** | 12 × 6.2 = **74 MB** |
+| ANSFrame shared data | 0 | 32.3 MB (shared, refcounted) |
+| **Total RAM per frame cycle** | **~623 MB** | **~213 MB** |
+| **Peak RAM (5 cams, 2-3 cycles)** | **~1.2-1.9 GB** | **~0.4-0.6 GB** |
+
+### GPU / VRAM Usage
+
+| Resource | Current | ANSFrame |
+|---|---|---|
+| VRAM (engines + workspace) | ~5.9 GB | ~5.9 GB (same) |
+| GPU upload per inference | 1.2 MB | 1.2 MB (same) |
+| PCIe bandwidth | ~300 MB/s | ~300 MB/s (same) |
+| SM utilization | 0-34% | 0-34% (same) |
+
+### CPU Usage
+
+| Component | Current | ANSFrame |
+|---|---|---|
+| SW decode (5 cameras) | ~3-5 cores | ~3-5 cores (same) |
+| YUV→BGR generation | ~0.3 cores | ~0.42 cores (+0.12 for extra sizes) |
+| CPU resize per AI task | ~1.5 cores | **0 cores (pre-computed)** |
+| Clone memcpy | ~2.4 cores | **~0.5 cores** |
+| **Total CPU for pipeline** | **~7-9 cores** | **~4-6 cores** |
+| **CPU savings** | — | **~3 cores freed** |
+
+### LabVIEW Thread Scheduling Impact
+
+| Factor | Current | ANSFrame |
+|---|---|---|
+| Data per task dispatch | 24.9 MB | **6.2 MB** (4x less) |
+| Memory allocation pressure | 299 MB in flight | **74 MB** (4x less) |
+| Cache efficiency | Poor (24.9 MB > L3) | **Better** (6.2 MB closer to L3) |
+| **Processing time (LabVIEW)** | **100-500ms** | **70-300ms (~30-40% faster)** |
+
+### Final Summary
+
+| Metric | Current (8h stable) | ANSFrame (projected) | Improvement |
+|---|---|---|---|
+| Clone time (12 tasks) | 36-60ms | 10-14ms | **3-4x faster** |
+| Preprocess (12 tasks) | 60-72ms | 14ms | **4-5x faster** |
+| CPU overhead total | 112-200ms | 42-98ms | **~2-3x less** |
+| RAM usage (frames) | ~1.2-1.9 GB | ~0.4-0.6 GB | **66% less** |
+| CPU cores for pipeline | ~7-9 cores | ~4-6 cores | **3 cores freed** |
+| VRAM | No change | No change | — |
+| Crop quality (ALPR/FR) | 4K | 4K | Same |
+| Processing time | 100-500ms | 70-300ms | **~30-40% faster** |
+| Code complexity | Simple | Medium | +~500 lines |
+
+---
+
+## Overview
+
+Replace the single-resolution `cv::Mat**` output from `GetImage` with a multi-resolution `ANSFrame` that contains 3 pre-computed images generated from the YUV420P decoded frame. This eliminates redundant resizing across AI tasks and reduces clone/memcpy overhead by about 4x (24.9 MB → 6.2 MB per clone).
+
+## Current Flow (Inefficient)
+
+```
+Decoder → YUV420P → cvtColor → 4K BGR (25 MB) → GetImage returns 4K
+LabVIEW: clone 4K (25 MB × 12 tasks = 300 MB memcpy)
+Each AI task: CPU resize 4K → 640×640 (redundant × 12 tasks)
+```
+
+## New Flow (Optimized)
+
+```
+Decoder → YUV420P → generate 3 images from YUV planes:
+  ├─ Full resolution BGR (for crop/pipeline)
+  ├─ 640×640 letterbox BGR (for detection inference)
+  └─ 1080p BGR (for display, configurable)
+
+GetImage returns ANSFrame (contains refs to all 3 images)
+LabVIEW: clone 1080p only (6.2 MB × 12 = 74 MB, was 300 MB)
+AI task: uses 640×640 directly (1.2 MB clone, no resize needed)
+Pipeline crop: uses full resolution image (no upscaling artifacts)
+```
+
+## Performance Comparison
+
+### Resize from YUV420P planes (Option A — RECOMMENDED)
+
+```
+4K YUV420P frame (12.4 MB in 3 planes)
+  │
+  ├─ Full res: cvtColor(Y+U+V → BGR)              ~4-8ms
+  │    Result: 3840×2160 BGR (24.9 MB)
+  │
+  ├─ 640×640: resize Y(640×360) + U,V(320×180)    ~0.5ms
+  │    + pad bottom + cvtColor                     ~0.3ms
+  │    Result: 640×640 BGR (1.2 MB)                Total: ~0.8ms
+  │
+  └─ 1080p: resize Y(1920×1080) + U,V(960×540)    ~1.0ms
+       + cvtColor                                  ~0.5ms
+       Result: 1920×1080 BGR (6.2 MB)              Total: ~1.5ms
+
+Total generation time: ~6-10ms (all 3 images)
+```
+
+### Resize from BGR (Option B — slower)
+
+```
+4K YUV420P → cvtColor → 4K BGR (24.9 MB)                ~4-8ms
+  │
+  ├─ Full res: already done                             ~0ms
+  ├─ 640×640: cv::resize(4K BGR → 640×640) + letterbox  ~2-3ms
+  └─ 1080p: cv::resize(4K BGR → 1080p)                  ~1-2ms
+
+Total generation time: ~7-13ms (all 3 images)
+```
+
+### Recommendation: Option A (resize YUV planes)
+
+Option A is ~30% faster because YUV420P is 1.5 bytes/pixel vs 3 bytes/pixel for BGR, so there is half as much data to resize. The YUV plane resize produces equivalent quality because `cvtColor` is applied after the resize (the same way GPU NV12 resize works).
+
+## ANSFrame Structure
+
+### Header: `include/ANSFrame.h`
+
+```cpp
+#pragma once
+#include <opencv2/core.hpp>
+#include <atomic>
+#include <cstdint>
+
+// ANSFrame holds pre-computed multi-resolution images from a single decoded frame.
+// Generated once in avframeYUV420PToCvMat, shared across all AI tasks via registry.
+// Eliminates per-task resize and reduces clone size by ~4x (24.9 MB → 6.2 MB).
+struct ANSFrame {
+    // --- Pre-computed images (all BGR, CPU RAM) ---
+    cv::Mat fullRes;    // Original resolution (e.g., 3840×2160) — for crop/pipeline
+    cv::Mat inference;  // Model input size (e.g., 640×640 letterbox) — for detection
+    cv::Mat display;    // Display resolution (e.g., 1920×1080) — for LabVIEW UI
+
+    // --- Metadata ---
+    int originalWidth = 0;       // Original frame width before any resize
+    int originalHeight = 0;      // Original frame height before any resize
+    int inferenceWidth = 0;      // Inference image width (e.g., 640)
+    int inferenceHeight = 0;     // Inference image height (e.g., 640)
+    float letterboxRatio = 1.0f; // Scale ratio used for letterbox (for coordinate mapping)
+    int64_t pts = 0;             // Presentation timestamp
+
+    // --- Configuration (set per camera) ---
+    int displayMaxHeight = 1080; // Configurable display resolution
+    int inferenceSize = 640;     // Configurable inference size (default 640)
+
+    // --- Lifecycle ---
+    std::atomic<int> refcount{1};
+};
+```
+
+### Changes to `ANSFrame` Registry
+
+The existing `ANSGpuFrameRegistry` can be extended, or a new `ANSFrameRegistry` created, to map `cv::Mat*` (the display image pointer) to its parent `ANSFrame`. When LabVIEW clones the display image and sends it to AI, the AI task can look up the parent `ANSFrame` to access the inference or full-res image.
+
+```cpp
+// Registry: maps display cv::Mat* → ANSFrame*
+class ANSFrameRegistry {
+    std::unordered_map<const uchar*, ANSFrame*> m_map; // key = Mat.data pointer
+    std::mutex m_mutex;
+public:
+    void attach(cv::Mat* displayMat, ANSFrame* frame);
+    ANSFrame* lookup(const cv::Mat& mat); // lookup by data pointer
+    void release(cv::Mat* mat);
+};
+```
+
+## Implementation Steps
+
+### Step 1: Create ANSFrame structure and registry
+
+**Files to create:**
+- `include/ANSFrame.h` — ANSFrame struct definition
+- `modules/ANSCV/ANSFrameRegistry.h` — Registry mapping display Mat → ANSFrame
+- `modules/ANSCV/ANSFrameRegistry.cpp` — Registry implementation
+
+**Key design decisions:**
+- ANSFrame is allocated per decoded frame, shared across all clones
+- refcount tracks how many clones reference this frame
+- When refcount → 0, all 3 images are freed
+
+### Step 2: Generate multi-resolution images in avframeYUV420PToCvMat
+
+**File to modify:** `MediaClient/media/video_player.cpp`
+
+Replace the current `avframeYUV420PToCvMat`, which returns a single BGR image, with a new version that populates an ANSFrame with 3 images.
+
+```cpp
+ANSFrame* CVideoPlayer::generateANSFrame(const AVFrame* frame) {
+    auto* ansFrame = new ANSFrame();
+    const int W = frame->width;
+    const int H = frame->height;
+    ansFrame->originalWidth = W;
+    ansFrame->originalHeight = H;
+    ansFrame->pts = frame->pts; // GetImage reads pts from the ANSFrame
+
+    // --- Resize YUV planes for each resolution ---
+
+    // 1. Full resolution: direct cvtColor (no resize)
+    cv::Mat yuv(H * 3/2, W, CV_8UC1);
+    // ... copy planes ...
+    cv::cvtColor(yuv, ansFrame->fullRes, cv::COLOR_YUV2BGR_I420);
+
+    // 2. Inference size (640×640 letterbox from YUV planes)
+    int infSize = ansFrame->inferenceSize; // default 640
+    float r = std::min((float)infSize / W, (float)infSize / H);
+    int unpadW = (int)(r * W), unpadH = (int)(r * H);
+    ansFrame->letterboxRatio = 1.0f / r;
+
+    // Resize Y plane
+    cv::Mat yFull(H, W, CV_8UC1, frame->data[0], frame->linesize[0]);
+    cv::Mat yResized;
+    cv::resize(yFull, yResized, cv::Size(unpadW, unpadH));
+
+    // Resize U, V planes
+    cv::Mat uFull(H/2, W/2, CV_8UC1, frame->data[1], frame->linesize[1]);
+    cv::Mat vFull(H/2, W/2, CV_8UC1, frame->data[2], frame->linesize[2]);
+    cv::Mat uResized, vResized;
+    cv::resize(uFull, uResized, cv::Size(unpadW/2, unpadH/2));
+    cv::resize(vFull, vResized, cv::Size(unpadW/2, unpadH/2));
+
+    // Assemble padded I420 buffer
+    cv::Mat yuvInf(infSize * 3/2, infSize, CV_8UC1, cv::Scalar(114)); // gray padding
+    yResized.copyTo(yuvInf(cv::Rect(0, 0, unpadW, unpadH)));
+    // ... copy U, V with padding ...
+
+    cv::cvtColor(yuvInf, ansFrame->inference, cv::COLOR_YUV2BGR_I420);
+    ansFrame->inferenceWidth = infSize;
+    ansFrame->inferenceHeight = infSize;
+
+    // 3. Display resolution (1080p from YUV planes)
+    int dispH = ansFrame->displayMaxHeight;
+    float dispScale = (float)dispH / H;
+    int dispW = (int)(W * dispScale);
+
+    cv::Mat yDisp, uDisp, vDisp;
+    cv::resize(yFull, yDisp, cv::Size(dispW, dispH));
+    cv::resize(uFull, uDisp, cv::Size(dispW/2, dispH/2));
+    cv::resize(vFull, vDisp, cv::Size(dispW/2, dispH/2));
+
+    cv::Mat yuvDisp(dispH * 3/2, dispW, CV_8UC1);
+    // ... assemble I420 ...
+    cv::cvtColor(yuvDisp, ansFrame->display, cv::COLOR_YUV2BGR_I420);
+
+    return ansFrame;
+}
+```
+
+### Step 3: Modify GetImage to return display image + attach ANSFrame
+
+**File to modify:** `modules/ANSCV/ANSRTSP.cpp` (and other ANSCV classes)
+
+```cpp
+cv::Mat ANSRTSPClient::GetImage(int& width, int& height, int64_t& pts) {
+    // ... existing logic to get frame from player ...
+
+    // GetImage returns the DISPLAY image (1080p)
+    // ANSFrame is attached to the Mat via registry
+    ANSFrame* frame = _currentANSFrame;
+    width = frame->display.cols;
+    height = frame->display.rows;
+    pts = frame->pts;
+
+    // Register: display Mat's data pointer → ANSFrame
+    ANSFrameRegistry::instance().attach(&frame->display, frame);
+
+    return frame->display; // 1080p, ~6.2 MB (was 25 MB for 4K)
+}
+```
+
+### Step 4: Modify ANSCV_CloneImage_S to link clone to ANSFrame
+
+**File to modify:** `modules/ANSCV/ANSOpenCV.cpp`
+
+```cpp
+int ANSCV_CloneImage_S(cv::Mat** imageIn, cv::Mat** imageOut) {
+    *imageOut = anscv_mat_new(**imageIn); // clone display image (6.2 MB, was 25 MB)
+
+    // Link clone to same ANSFrame (refcount++)
+    ANSFrame* frame = ANSFrameRegistry::instance().lookup(**imageIn);
+    if (frame) {
+        frame->refcount++;
+        ANSFrameRegistry::instance().attach(*imageOut, frame);
+    }
+
+    return 1;
+}
+```
+
+### Step 5: Modify engine Preprocess to use ANSFrame inference image
+
+**Files to modify:** All engine Preprocess functions
+
+```cpp
+// In ANSRTYOLO::Preprocess (same pattern for all other engines):
+std::vector<std::vector<cv::cuda::GpuMat>> ANSRTYOLO::Preprocess(
+    const cv::Mat& inputImage, ImageMetadata& outMeta) {
+
+    // Try to get pre-resized inference image from ANSFrame
+    ANSFrame* frame = ANSFrameRegistry::instance().lookup(inputImage);
+
+    cv::Mat srcForInference;
+    if (frame && !frame->inference.empty() &&
+        inputImage.cols <= frame->inferenceWidth) {
+        // Use pre-computed 640×640 — ZERO resize needed
+        srcForInference = frame->inference;
+        outMeta.imgHeight = frame->originalHeight;
+        outMeta.imgWidth = frame->originalWidth;
+        outMeta.ratio = frame->letterboxRatio;
+    } else if (frame && !frame->fullRes.empty() &&
+               inputImage.cols > frame->inferenceWidth) {
+        // Need larger than inference size — use full resolution
+        srcForInference = frame->fullRes;
+        // ... resize to model input from full res ...
+    } else {
+        // Fallback: use input image directly (backward compat)
+        srcForInference = inputImage;
+    }
+
+    // Convert BGR → RGB
+    cv::Mat cpuRGB;
+    cv::cvtColor(srcForInference, cpuRGB, cv::COLOR_BGR2RGB);
+
+    // Upload small image to GPU
+    cv::cuda::GpuMat gpuResized;
+    gpuResized.upload(cpuRGB, stream);
+    // ...
+}
+```
+
+### Step 6: Pipeline crop uses full resolution
+
+```cpp
+// In ANSLPR or any pipeline that crops detected objects:
+// Instead of cropping from display image (1080p, upscaling artifacts):
+ANSFrame* frame = ANSFrameRegistry::instance().lookup(inputImage);
+cv::Mat cropSource = (frame && !frame->fullRes.empty())
+    ? frame->fullRes  // Full 4K quality for face/plate recognition
+    : inputImage;     // Fallback
+
+// Scale bbox from display coords (inputImage) to full-res coords
+float scaleX = (float)cropSource.cols / inputImage.cols;
+float scaleY = (float)cropSource.rows / inputImage.rows;
+cv::Rect fullResBbox((int)(bbox.x * scaleX), (int)(bbox.y * scaleY),
+                     (int)(bbox.width * scaleX), (int)(bbox.height * scaleY));
+// Clamp to the source bounds so the ROI is always valid
+fullResBbox &= cv::Rect(0, 0, cropSource.cols, cropSource.rows);
+cv::Mat crop = cropSource(fullResBbox).clone();
+```
+
+### Step 7: Configuration API
+
+```cpp
+// Set inference size (default 640) — before StartRTSP
+void SetRTSPInferenceSize(ANSRTSPClient** Handle, int size); // 640, 320, 1280
+
+// Set display resolution (default 1080) — before StartRTSP
+void SetRTSPDisplayResolution(ANSRTSPClient** Handle, int width, int height);
+
+// Check if ANSFrame is available for a cloned image
+int HasANSFrame(cv::Mat** image); // returns 1 if ANSFrame attached
+
+// Get specific resolution from ANSFrame
+int GetANSFrameInference(cv::Mat** displayImage, cv::Mat** inferenceImage);
+int GetANSFrameFullRes(cv::Mat** displayImage, cv::Mat** fullResImage);
+```
+
+## Memory & Performance Impact
+
+### Per-Frame Memory
+
+| Image | Resolution | Size | Before (single 4K) |
+|---|---|---|---|
+| Full resolution | 3840×2160 | 24.9 MB | 24.9 MB |
+| Inference | 640×640 | 1.2 MB | (generated per AI task) |
+| Display | 1920×1080 | 6.2 MB | (was part of 4K) |
+| **Total per frame** | | **32.3 MB** | **24.9 MB** |
+
++7.4 MB per frame for pre-computed images, BUT:
+
+### Clone Savings (12 AI tasks)
+
+| | Before | After |
+|---|---|---|
+| Clone size per task | 24.9 MB | 6.2 MB (display only) |
+| 12 clones total | 299 MB | 74 MB |
+| Clone time | 36-60ms | 10-14ms |
+| Resize per task | 2-3ms × 12 = 24-36ms | 0ms (pre-computed) |
+| **Total savings** | | **~250 MB RAM, ~50ms CPU** |
+
+### Generation Time (one-time per frame)
+
+| Step | Time |
+|---|---|
+| Full res: cvtColor YUV→BGR | ~4-8ms |
+| 640×640: resize YUV planes + cvtColor | ~0.8ms |
+| 1080p: resize YUV planes + cvtColor | ~1.5ms |
+| **Total** | **~6-10ms** |
+
+vs Current: cvtColor 4K = ~4-8ms + resize per task = ~2-3ms × 12 = ~28-44ms total
+
+**Net savings: ~20-35ms per frame cycle across all tasks.**
+
+## Files to Create/Modify
+
+### New files:
+1. `include/ANSFrame.h` — ANSFrame struct
+2. `modules/ANSCV/ANSFrameRegistry.h` — Registry header
+3. `modules/ANSCV/ANSFrameRegistry.cpp` — Registry implementation
+
+### Modified files:
+4. `MediaClient/media/video_player.h` — Add generateANSFrame declaration
+5. `MediaClient/media/video_player.cpp` — Implement generateANSFrame, modify getImage
+6. `modules/ANSCV/ANSRTSP.h` — Add ANSFrame member, SetInferenceSize
+7. `modules/ANSCV/ANSRTSP.cpp` — Modify GetImage to return display + attach ANSFrame
+8. `modules/ANSCV/ANSOpenCV.cpp` — Modify CloneImage_S to link to ANSFrame
+9. `modules/ANSCV/ANSMatRegistry.h` — Optional: integrate ANSFrame into mat registry
+10. `modules/ANSODEngine/ANSRTYOLO.cpp` — Use ANSFrame inference image
+11. `modules/ANSODEngine/ANSTENSORTRTOD.cpp` — Same
+12. `modules/ANSODEngine/ANSTENSORRTPOSE.cpp` — Same
+13. `modules/ANSODEngine/ANSTENSORRTSEG.cpp` — Same
+14. `modules/ANSODEngine/ANSTENSORRTCL.cpp` — Same
+15. `modules/ANSODEngine/ANSYOLOV12RTOD.cpp` — Same
+16. `modules/ANSODEngine/ANSYOLOV10RTOD.cpp` — Same
+17. `modules/ANSODEngine/SCRFDFaceDetector.cpp` — Same
+18. `modules/ANSODEngine/dllmain.cpp` — Set tl_currentANSFrame for pipeline lookup
+19. `modules/ANSLPR/ANSLPR_OD.cpp` — Use fullRes for plate crop
+20. `modules/ANSFR/ARCFaceRT.cpp` — Use fullRes for face crop
+21. `modules/ANSFR/ANSFaceRecognizer.cpp` — Use fullRes for face crop
+
+### Apply to other ANSCV classes:
+22. `modules/ANSCV/ANSFLV.h/.cpp` — Same pattern as ANSRTSP
+23. `modules/ANSCV/ANSMJPEG.h/.cpp` — Same
+24. `modules/ANSCV/ANSRTMP.h/.cpp` — Same
+25. `modules/ANSCV/ANSSRT.h/.cpp` — Same
+
+## Clone-to-ANSFrame Mapping (Critical Design)
+
+### The Problem
+
+LabVIEW calls `ANSCV_CloneImage_S` to create a deep copy of the 1080p display image. The clone has a **different `data` pointer** than the original, so a simple pointer lookup won't find the ANSFrame.
+
+```
+GetImage returns display Mat:   data = 0xAAAA → registry: 0xAAAA → ANSFrame #1
+CloneImage creates deep copy:   data = 0xBBBB → registry: ??? (not registered)
+AI task tries lookup(0xBBBB):   NOT FOUND — fallback to slow path
+```
+
+### The Solution
+
+Register the clone's `data` pointer to the same ANSFrame during `ANSCV_CloneImage_S`:
+
+```
+GetImage:      data = 0xAAAA → registry: 0xAAAA → ANSFrame #1 (refcount=1)
+CloneImage:    data = 0xBBBB → registry: 0xBBBB → ANSFrame #1 (refcount=2)
+CloneImage:    data = 0xCCCC → registry: 0xCCCC → ANSFrame #1 (refcount=3)
+ReleaseImage:  remove 0xBBBB → ANSFrame #1 (refcount=2)
+ReleaseImage:  remove 0xCCCC → ANSFrame #1 (refcount=1)
+Next GetImage: remove 0xAAAA → ANSFrame #1 (refcount=0) → FREE all 3 images
+```
+
+### Implementation in ANSCV_CloneImage_S (ANSOpenCV.cpp)
+
+```cpp
+int ANSCV_CloneImage_S(cv::Mat** imageIn, cv::Mat** imageOut) {
+    *imageOut = anscv_mat_new(**imageIn);                     // deep copy display (6.2 MB)
+    gpu_frame_addref(*imageIn, *imageOut);                    // existing: link GpuFrameData
+    ANSFrameRegistry::instance().addRef(*imageIn, *imageOut); // NEW: link ANSFrame
+    return 1;
+}
+```
+
+### Implementation in ANSCV_ReleaseImage_S (ANSOpenCV.cpp)
+
+```cpp
+int ANSCV_ReleaseImage_S(cv::Mat** imageIn) {
+    ANSFrameRegistry::instance().release(*imageIn); // NEW: refcount--, free if 0
+    anscv_mat_delete(imageIn);                      // existing: free Mat
+    return 1;
+}
+```
+
+### Implementation in ANSFrameRegistry
+
+```cpp
+class ANSFrameRegistry {
+    std::unordered_map<const uchar*, ANSFrame*> m_map; // Mat.data → ANSFrame
+    std::mutex m_mutex;
+public:
+    // Register original display Mat → ANSFrame
+    void attach(const cv::Mat* mat, ANSFrame* frame) {
+        std::lock_guard<std::mutex> lock(m_mutex);
+        // Remove old mapping if exists
+        auto it = m_map.find(mat->data);
+        if (it != m_map.end() && it->second != frame) {
+            if (--it->second->refcount <= 0) delete it->second;
+        }
+        m_map[mat->data] = frame;
+    }
+
+    // Link clone to same ANSFrame (called from CloneImage)
+    void addRef(const cv::Mat* src, const cv::Mat* dst) {
+        std::lock_guard<std::mutex> lock(m_mutex);
+        auto it = m_map.find(src->data);
+        if (it == m_map.end()) return;
+        ANSFrame* frame = it->second;
+        frame->refcount++;
+        m_map[dst->data] = frame;
+    }
+
+    // Lookup by any Mat (original or clone)
+    ANSFrame* lookup(const cv::Mat& mat) {
+        std::lock_guard<std::mutex> lock(m_mutex);
+        auto it = m_map.find(mat.data);
+        return (it != m_map.end()) ? it->second : nullptr;
+    }
+
+    // Release mapping (called from ReleaseImage)
+    void release(const cv::Mat* mat) {
+        std::lock_guard<std::mutex> lock(m_mutex);
+        auto it = m_map.find(mat->data);
+        if (it == m_map.end()) return;
+        ANSFrame* frame = it->second;
+        m_map.erase(it);
+        if (--frame->refcount <= 0) delete frame;
+    }
+};
+```
+
+### Thread Safety
+
+- Registry uses `std::mutex` — same pattern as `ANSGpuFrameRegistry`
+- ANSFrame images (fullRes, inference, display) are **immutable after creation** — safe to read from any thread
+- Only `refcount` is modified concurrently — uses `std::atomic`
+- ANSFrame is freed only when refcount reaches 0 (all clones released)
+
+### Lifecycle Diagram
+
+```
+Camera Thread:                    AI Task 1:              AI Task 2:
+
+generateANSFrame()
+  → fullRes, inference, display
+  → refcount = 1
+  → registry: display.data → AF
+
+GetImage returns display
+
+CloneImage(display, &clone1) ─────► clone1
+  → registry: clone1.data → AF
+  → refcount = 2
+
+CloneImage(display, &clone2) ──────────────────────────► clone2
+  → registry: clone2.data → AF
+  → refcount = 3
+
+                                  lookup(clone1) → AF
+                                  use AF->inference
+                                  use AF->fullRes (crop)
+                                  ReleaseImage(clone1)
+                                    → refcount = 2
+
+                                                          lookup(clone2) → AF
+                                                          use AF->inference
+                                                          ReleaseImage(clone2)
+                                                            → refcount = 1
+
+Next GetImage:
+  → new ANSFrame
+  → old display.data removed
+  → refcount = 0 → FREE old AF
+```
+
+## Leak Prevention (Critical)
+
+### Leak Scenarios
+
+| Scenario | What leaks | Size per leak |
+|---|---|---|
+| LabVIEW forgets to call ReleaseImage | ANSFrame (fullRes + inference + display) | ~32 MB |
+| Camera reconnect while clones exist | Old ANSFrame stays alive until clones released | ~32 MB |
+| LabVIEW crash/abort | All ANSFrames in registry | ~32 MB × N frames |
+| AI task throws exception, skips Release | ANSFrame refcount never reaches 0 | ~32 MB |
+
+### Protection 1: TTL-Based Auto-Eviction
+
+Same pattern as `ANSGpuFrameRegistry::evictStaleFrames()`: periodically scan for old ANSFrames and force-free them.
+
+```cpp
+class ANSFrameRegistry {
+    static constexpr int FRAME_TTL_SECONDS = 5;    // Max lifetime of any ANSFrame
+    static constexpr int EVICT_INTERVAL_MS = 1000; // Check every 1 second
+
+    struct Entry {
+        ANSFrame* frame;
+        std::chrono::steady_clock::time_point createdAt;
+    };
+    std::vector<Entry> m_frames;                         // live frames, oldest first
+    std::chrono::steady_clock::time_point m_lastEvict{}; // time of last eviction pass
+
+    void evictStale() {
+        auto now = std::chrono::steady_clock::now();
+        // Lock first: m_lastEvict is shared between camera threads
+        std::lock_guard<std::mutex> lock(m_mutex);
+        // Throttle: only run every EVICT_INTERVAL_MS
+        if (now - m_lastEvict < std::chrono::milliseconds(EVICT_INTERVAL_MS)) return;
+        m_lastEvict = now;
+
+        for (auto it = m_frames.begin(); it != m_frames.end(); ) {
+            double ageSec = std::chrono::duration<double>(now - it->createdAt).count();
+            if (ageSec > FRAME_TTL_SECONDS) {
+                // Force-free: remove all Mat* mappings to this frame
+                ANSFrame* frame = it->frame;
+                for (auto mit = m_map.begin(); mit != m_map.end(); ) {
+                    if (mit->second == frame) mit = m_map.erase(mit);
+                    else ++mit;
+                }
+                delete frame;
+                it = m_frames.erase(it);
+            } else {
+                ++it;
+            }
+        }
+    }
+};
+```
+
+Call `evictStale()` from `GetImage()` (piggybacked on camera thread activity — same as `gpu_frame_evict_stale()`).
+
+### Protection 2: Max ANSFrame Pool Size
+
+Limit the total number of live ANSFrames. If the pool is full, force-free the oldest before creating a new one.
+
+```cpp
+static constexpr int MAX_ANSFRAMES = 100; // Max live frames across all cameras
+
+ANSFrame* createANSFrame(...) {
+    evictStale(); // Clean up expired frames first
+
+    // If still over limit, force-free oldest
+    while (m_frames.size() >= MAX_ANSFRAMES) {
+        auto oldest = m_frames.begin();
+        // ... force-remove all mappings + delete ...
+    }
+
+    auto* frame = new ANSFrame();
+    // ... populate ...
+    m_frames.push_back({frame, std::chrono::steady_clock::now()});
+    return frame;
+}
+```
+
+### Protection 3: Camera-Scoped Cleanup
+
+When a camera is stopped or destroyed, force-free ALL ANSFrames belonging to that camera (regardless of refcount).
+
+```cpp
+// In ANSRTSPClient::Stop() and Destroy():
+ANSFrameRegistry::instance().releaseByOwner(this);
+
+// In ANSFrameRegistry:
+void releaseByOwner(void* owner) {
+    std::lock_guard<std::mutex> lock(m_mutex);
+    for (auto it = m_frames.begin(); it != m_frames.end(); ) {
+        if (it->frame->owner == owner) {
+            // Remove all Mat* mappings
+            for (auto mit = m_map.begin(); mit != m_map.end(); ) {
+                if (mit->second == it->frame) mit = m_map.erase(mit);
+                else ++mit;
+            }
+            delete it->frame;
+            it = m_frames.erase(it);
+        } else {
+            ++it;
+        }
+    }
+}
+```
+
+### Protection 4: One ANSFrame Per Camera (Ring Buffer)
+
+Each camera keeps only the **latest** ANSFrame. When a new frame arrives, the previous ANSFrame is marked for cleanup (refcount decremented). This bounds memory to 1 ANSFrame per camera.
+
+```cpp
+class ANSRTSPClient {
+    ANSFrame* _currentANSFrame = nullptr;
+
+    void onNewFrame(AVFrame* decoded) {
+        ANSFrame* newFrame = generateANSFrame(decoded);
+        newFrame->owner = this;
+
+        // Replace old frame — decrement refcount
+        if (_currentANSFrame) {
+            ANSFrameRegistry::instance().detachOwner(_currentANSFrame);
+            // If refcount reaches 0, freed immediately
+            // If clones still hold refs, freed when they release
+        }
+        _currentANSFrame = newFrame;
+        ANSFrameRegistry::instance().attach(&newFrame->display, newFrame);
+    }
+};
+```
+
+### Protection 5: ANSFrame Struct with Owner Tracking
+
+```cpp
+struct ANSFrame {
+    // ... existing fields ...
+
+    // Leak protection
+    void* owner = nullptr; // Camera that created this frame
+    std::chrono::steady_clock::time_point createdAt;
+    std::atomic<int> refcount{1};
+
+    ~ANSFrame() {
+        // Images are cv::Mat — automatically freed by OpenCV refcount
+        // No manual cleanup needed for fullRes, inference, display
+    }
+};
+```
+
+### Memory Budget Analysis
+
+With all protections:
+
+| Cameras | Max ANSFrames | Memory (worst case) |
+|---|---|---|
+| 5 running | 5 current + ~10 in-flight clones | 5 × 32 MB = 160 MB |
+| 20 running | 20 current + ~40 in-flight clones | 20 × 32 MB = 640 MB |
+| 100 created, 5 running | 5 current + ~10 in-flight | 5 × 32 MB = 160 MB |
+| 100 created, 95 stopped | 0 (stopped cameras free ANSFrame) | 0 MB |
+
+**Worst case bounded by:** `running_cameras × 32 MB` — predictable, no growth over time.
+
+### TTL Guarantee
+
+Even if all the other protections fail, the 5-second TTL eviction ensures:
+- Maximum leak duration: 5 seconds
+- Maximum leaked memory: `cameras × 5 seconds × 10 FPS × 32 MB / frame` — but with ring buffer (1 per camera), it's just `cameras × 32 MB`
+- Periodic cleanup on every `GetImage` call ensures no accumulation
+
+## Replacing GpuFrameRegistry
+
+### Current State (wasteful with NV12 disabled)
+
+With `_useNV12FastPath = false` (current default), `GpuFrameRegistry` is never populated because `gpu_frame_attach` is never called. But `gpu_frame_addref`, `gpu_frame_remove`, and `gpu_frame_evict_stale` still run on every clone/release/replace, doing empty lookups that waste CPU cycles.
+
+```
+Current code paths that run but do nothing:
+  ANSCV_CloneImage_S   → gpu_frame_addref      → lookup → NOT FOUND → no-op
+  ANSCV_ReleaseImage_S → gpu_frame_remove      → lookup → NOT FOUND → no-op
+  anscv_mat_replace    → gpu_frame_remove      → lookup → NOT FOUND → no-op
+  anscv_mat_replace    → gpu_frame_evict_stale → scans empty registry → no-op
+```
+
+### Plan: ANSFrameRegistry replaces GpuFrameRegistry
+
+ANSFrameRegistry serves the same purpose (mapping `cv::Mat*` → frame metadata) but without GPU complexity:
+
+| Feature | GpuFrameRegistry | ANSFrameRegistry |
+|---|---|---|
+| Maps Mat* to | GpuFrameData (NV12 GPU pointers) | ANSFrame (3 CPU images) |
+| Used when | NV12 fast path enabled | Always (SW or HW decode) |
+| GPU dependency | CUDA, pool slots, D2D copy | None |
+| Thread safety | mutex + atomic refcount | mutex + atomic refcount |
+| Cleanup | TTL eviction + pool cooldown | TTL eviction (simpler) |
+
+### Migration Path
+
+1. **Phase 1 (implement ANSFrame):** ANSFrameRegistry runs alongside GpuFrameRegistry
+   - `CloneImage`: calls both `gpu_frame_addref` + `ansframe_addref`
+   - `ReleaseImage`: calls both `gpu_frame_remove` + `ansframe_release`
+   - Safe: both registries handle NOT FOUND gracefully
+
+2. **Phase 2 (NV12 disabled permanently):** Remove GpuFrameRegistry calls
+   - Remove `gpu_frame_addref` from `CloneImage`
+   - Remove `gpu_frame_remove` from `ReleaseImage` and `anscv_mat_replace`
+   - Remove `gpu_frame_evict_stale` from `anscv_mat_replace`
+   - Keep GpuFrameRegistry code for future NV12 re-enablement
+
+3. **Phase 3 (optional, if NV12 re-enabled):** Merge into single registry
+   - ANSFrame struct gains optional GPU fields (yPlane, uvPlane, poolSlot)
+   - Single registry, single refcount, single lookup
+
+### Recommended: Implement Phase 1 first, Phase 2 after testing
+
+## Backward Compatibility
+
+- If ANSFrame is not available (e.g., old camera module), engines fall back to current behavior (resize input image)
+- The `cv::Mat**` API stays the same — LabVIEW doesn't need changes
+- ANSFrame is transparent to LabVIEW — it only sees the display image
+- The `GetANSFrameInference` / `GetANSFrameFullRes` APIs are optional for advanced use
+
+## Risk Assessment
+
+| Risk | Mitigation |
+|---|---|
+| Extra 7.4 MB RAM per frame | Negligible vs 250 MB clone savings |
+| ANSFrame lifecycle (refcount) | Same pattern as GpuFrameData — proven |
+| Coordinate mapping errors | letterboxRatio stored in ANSFrame — deterministic |
+| YUV plane resize quality | Same as GPU NV12 resize — proven equivalent |
+| Thread safety | ANSFrame is immutable after creation — safe to share |
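
The coordinate-mapping mitigation in the risk table comes down to a single multiplication by `letterboxRatio`. The sketch below shows one way a detection box found on the 640×640 inference image could be mapped back to full-resolution pixels. It assumes the letterbox layout from Step 2 (resized image anchored at the top-left, padding only on the right/bottom, so there is no offset to subtract) and `letterboxRatio = 1 / r`. `Box` and `mapToOriginal` are hypothetical names used only for illustration; they are not part of the planned API.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical plain box type for illustration; real code would likely use cv::Rect.
struct Box { int x, y, w, h; };

// Map a box from letterboxed inference coordinates to original-frame coordinates.
// Assumption: top-left-anchored letterbox (no x/y padding offset) and
// letterboxRatio = 1 / r, where r = min(inferenceSize / W, inferenceSize / H).
Box mapToOriginal(const Box& b, float letterboxRatio, int origW, int origH) {
    assert(letterboxRatio > 0.0f);

    // Scale both corners, then derive width/height from them so rounding
    // cannot push the far edge beyond what the corners imply.
    int x0 = static_cast<int>(b.x * letterboxRatio);
    int y0 = static_cast<int>(b.y * letterboxRatio);
    int x1 = static_cast<int>((b.x + b.w) * letterboxRatio);
    int y1 = static_cast<int>((b.y + b.h) * letterboxRatio);

    // Clamp to the frame so the result is always a valid crop region,
    // even when the detector reports a box that reaches into the padding.
    x0 = std::max(0, std::min(x0, origW));
    y0 = std::max(0, std::min(y0, origH));
    x1 = std::max(x0, std::min(x1, origW));
    y1 = std::max(y0, std::min(y1, origH));

    return Box{x0, y0, x1 - x0, y1 - y0};
}
```

For a 3840×2160 source and an inference size of 640, `r = 640 / 3840`, so `letterboxRatio = 6`: a box at (100, 50) of size 64×32 on the inference image maps to (600, 300) of size 384×192 on the full-resolution frame. Clamping matters for pipeline crops because a box that extends into the gray padding would otherwise produce an out-of-range ROI.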