ANSFrame Multi-Resolution Architecture Plan
Detailed Comparison: Current vs ANSFrame
Test Configuration
- GPU: RTX 4070 Laptop (8 GB VRAM)
- Cameras: 5 running (3840×2160, 2880×1620, 1920×1080)
- AI Tasks: 12 (subscribing to cameras)
- Engines: 7 TRT engines
- Decode: Software (YUV420P, CPU)
- Frame rate: SetTargetFPS(100) = ~10 FPS per camera
- Baseline: 8.4 hour stable run (ANSLEGION42), 1,048,325 inferences
Per-Frame Generation (inside GetImage)
| Step | Current | ANSFrame |
|---|---|---|
| Decode YUV420P | ~5ms (CPU) | ~5ms (same) |
| Full res BGR (cvtColor) | ~4-8ms (4K YUV→BGR) | ~4-8ms (same) |
| 640×640 letterbox | Not done here | ~0.8ms (resize YUV planes + cvtColor, done ONCE) |
| 1080p display | Not done (returns 4K) | ~1.5ms (resize YUV planes + cvtColor, done ONCE) |
| Total GetImage time | ~4-8ms | ~6-10ms (+2ms for 2 extra sizes) |
Clone & Dispatch to AI (per clone × 12 AI tasks)
| Step | Current (4K clone) | ANSFrame (1080p clone) |
|---|---|---|
| Clone image size | 3840×2160×3 = 24.9 MB | 1920×1080×3 = 6.2 MB |
| memcpy time per clone | 3-5ms | 0.8-1.2ms |
| 12 clones total size | 299 MB | 74 MB |
| 12 clones total time | 36-60ms | 10-14ms |
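The clone sizes in the table follow from width × height × 3 bytes for packed 8-bit BGR. A minimal sketch of that arithmetic is shown here; the helper names `bgrBytes` and `totalCloneBytes` are illustrative and not part of the codebase.

```cpp
#include <cstdint>

// Bytes in one packed 8-bit BGR image (3 bytes per pixel).
constexpr std::int64_t bgrBytes(int width, int height) {
    return static_cast<std::int64_t>(width) * height * 3;
}

// Bytes copied when every AI task deep-clones the same image once.
constexpr std::int64_t totalCloneBytes(int width, int height, int tasks) {
    return bgrBytes(width, height) * tasks;
}
```

With 12 tasks this gives 298,598,400 bytes (about 299 MB) for a 3840×2160 frame and 74,649,600 bytes (about 74 MB) for a 1920×1080 frame, matching the table.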
AI Preprocessing (per AI task)
| Step | Current | ANSFrame |
|---|---|---|
| Receive image | 4K BGR (24.9 MB) | 1080p BGR (6.2 MB) + ANSFrame ref |
| Local clone in engine | 24.9 MB memcpy (~3ms) | 6.2 MB memcpy (~0.8ms) |
| CPU letterbox resize | 4K→640×640 (~2-3ms) | SKIP (use ANSFrame.inference, 0ms) |
| BGR→RGB | 640×640 (~0.3ms) | 640×640 (~0.3ms) |
| GPU upload | 1.2 MB (~0.1ms) | 1.2 MB (~0.1ms) |
| Total preprocess per task | ~5-6ms | ~1.2ms |
| 12 tasks total | ~60-72ms | ~14ms |
Pipeline Crop Quality (ALPR, Face Recognition)
| Step | Current | ANSFrame |
|---|---|---|
| Detection from | 4K image | 1080p display (detection) → ANSFrame.fullRes (crop) |
| Crop source | Same 4K image | ANSFrame.fullRes (4K original) |
| Crop quality | 4K | 4K (identical) |
Total Processing Time Per Frame Cycle
| Phase | Current (4K) | ANSFrame | Savings |
|---|---|---|---|
| GetImage generation | 4-8ms | 6-10ms | -2ms |
| Clone × 12 | 36-60ms | 10-14ms | +22-46ms |
| AI preprocess × 12 | 60-72ms | 14ms | +46-58ms |
| TRT inference × 12 | 60-600ms | 60-600ms | Same |
| Postprocess × 12 | 12-60ms | 12-60ms | Same |
| Total CPU overhead | 112-200ms | 42-98ms | ~70-100ms saved |
| Total with inference | 172-800ms | 102-698ms | ~70-100ms saved |
RAM Usage
| Resource | Current | ANSFrame |
|---|---|---|
| GetImage output (per camera) | 24.9 MB (4K BGR) | 32.3 MB (3 images) |
| Clones in flight (12 tasks) | 12 × 24.9 = 299 MB | 12 × 6.2 = 74 MB |
| AI local clone (12 tasks) | 12 × 24.9 = 299 MB | 12 × 6.2 = 74 MB |
| ANSFrame shared data | 0 | 32.3 MB (shared, refcounted) |
| Total RAM per frame cycle | ~623 MB | ~213 MB |
| Peak RAM (5 cams, 2-3 cycles) | ~1.2-1.9 GB | ~0.4-0.6 GB |
GPU / VRAM Usage
| Resource | Current | ANSFrame |
|---|---|---|
| VRAM (engines + workspace) | ~5.9 GB | ~5.9 GB (same) |
| GPU upload per inference | 1.2 MB | 1.2 MB (same) |
| PCIe bandwidth | ~300 MB/s | ~300 MB/s (same) |
| SM utilization | 0-34% | 0-34% (same) |
CPU Usage
| Component | Current | ANSFrame |
|---|---|---|
| SW decode (5 cameras) | ~3-5 cores | ~3-5 cores (same) |
| YUV→BGR generation | ~0.3 cores | ~0.42 cores (+0.12 for extra sizes) |
| CPU resize per AI task | ~1.5 cores | 0 cores (pre-computed) |
| Clone memcpy | ~2.4 cores | ~0.5 cores |
| Total CPU for pipeline | ~7-9 cores | ~4-6 cores |
| CPU savings | — | ~3 cores freed |
LabVIEW Thread Scheduling Impact
| Factor | Current | ANSFrame |
|---|---|---|
| Data per task dispatch | 24.9 MB | 6.2 MB (4x less) |
| Memory allocation pressure | 299 MB in flight | 74 MB (4x less) |
| Cache efficiency | Poor (24.9 MB > L3) | Better (6.2 MB closer to L3) |
| Processing time (LabVIEW) | 100-500ms | 70-300ms (~30-40% faster) |
Final Summary
| Metric | Current (8h stable) | ANSFrame (projected) | Improvement |
|---|---|---|---|
| Clone time (12 tasks) | 36-60ms | 10-14ms | 3-4x faster |
| Preprocess (12 tasks) | 60-72ms | 14ms | 4-5x faster |
| CPU overhead total | 112-200ms | 42-98ms | ~2-2.7x less |
| RAM usage (frames) | ~1.2-1.9 GB | ~0.4-0.6 GB | 66% less |
| CPU cores for pipeline | ~7-9 cores | ~4-6 cores | 3 cores freed |
| VRAM | No change | No change | — |
| Crop quality (ALPR/FR) | 4K | 4K | Same |
| Processing time | 100-500ms | 70-300ms | ~30-40% faster |
| Code complexity | Simple | Medium | +~500 lines |
Overview
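Replace the single-resolution cv::Mat** output from GetImage with a multi-resolution ANSFrame that contains 3 pre-computed images generated from the decoded YUV420P frame. This eliminates redundant resizing across AI tasks and cuts the per-clone memcpy size about 4x (24.9 MB at 4K → 6.2 MB at 1080p). An AI task that reads the 640×640 inference image touches about 20x less data than a 4K frame (1.2 MB vs 24.9 MB).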
Current Flow (Inefficient)
Decoder → YUV420P → cvtColor → 4K BGR (25 MB) → GetImage returns 4K
LabVIEW: clone 4K (25 MB × 12 tasks = 300 MB memcpy)
Each AI task: CPU resize 4K → 640×640 (redundant × 12 tasks)
New Flow (Optimized)
Decoder → YUV420P → generate 3 images from YUV planes:
├─ Full resolution BGR (for crop/pipeline)
├─ 640×640 letterbox BGR (for detection inference)
└─ 1080p BGR (for display, configurable)
GetImage returns ANSFrame (contains refs to all 3 images)
LabVIEW: clone 1080p only (6.2 MB × 12 = 74 MB, was 300 MB)
AI task: uses 640×640 directly (1.2 MB, no resize needed)
Pipeline crop: uses full resolution image (no upscaling artifacts)
Performance Comparison
Resize from YUV420P planes (Option A — RECOMMENDED)
4K YUV420P frame (12.4 MB in 3 planes)
│
├─ Full res: cvtColor(Y+U+V → BGR) ~4-8ms
│ Result: 3840×2160 BGR (24.9 MB)
│
├─ 640×640: resize Y(640×360) + U,V(320×180) ~0.5ms
│ + pad bottom + cvtColor ~0.3ms
│ Result: 640×640 BGR (1.2 MB) Total: ~0.8ms
│
└─ 1080p: resize Y(1920×1080) + U,V(960×540) ~1.0ms
+ cvtColor ~0.5ms
Result: 1920×1080 BGR (6.2 MB) Total: ~1.5ms
Total generation time: ~6-10ms (all 3 images)
Resize from BGR (Option B — slower)
4K YUV420P → cvtColor → 4K BGR (24.9 MB) ~4-8ms
│
├─ Full res: already done ~0ms
├─ 640×640: cv::resize(4K BGR → 640×640) + letterbox ~2-3ms
└─ 1080p: cv::resize(4K BGR → 1080p) ~1-2ms
Total generation time: ~7-13ms (all 3 images)
Recommendation: Option A (resize YUV planes)
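Option A is faster (~6-10ms vs ~7-13ms for all 3 images) because YUV420P is 1.5 bytes/pixel versus 3 bytes/pixel for BGR, so the resize touches half the data. Quality should be visually equivalent because cvtColor is applied after the resize (the same order a GPU NV12 resize uses), although the output is not guaranteed to be bit-identical to resizing in BGR.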
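The "half the data" argument can be checked from the plane layout. The sketch below computes the plane dimensions and total size of a YUV420P (I420) image; `I420Planes`, `i420Planes` and `i420Bytes` are hypothetical helper names, and even width and height are assumed.

```cpp
#include <cstdint>

// Dimensions of the luma plane (yW × yH) and of each chroma plane (cW × cH).
struct I420Planes { int yW, yH, cW, cH; };

// Chroma planes are subsampled 2x in each direction.
// Assumes even width and height, as I420 requires.
constexpr I420Planes i420Planes(int width, int height) {
    return {width, height, width / 2, height / 2};
}

// Total bytes: one full-size Y plane plus two quarter-size chroma planes.
constexpr std::int64_t i420Bytes(int width, int height) {
    return static_cast<std::int64_t>(width) * height
         + 2 * static_cast<std::int64_t>(width / 2) * (height / 2);
}
```

For 3840×2160 this is 12,441,600 bytes, exactly half of the 24,883,200 bytes of the equivalent BGR image. For the 640×360 letterbox content the chroma planes are 320×180, as in the diagram for Option A.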
ANSFrame Structure
Header: include/ANSFrame.h
#pragma once
#include <opencv2/core/mat.hpp>
#include <atomic>
#include <cstdint>
// ANSFrame holds pre-computed multi-resolution images from a single decoded frame.
// Generated once in avframeYUV420PToCvMat, shared across all AI tasks via registry.
// Eliminates per-task resize and reduces clone size ~4x (4K → 1080p display).
struct ANSFrame {
// --- Pre-computed images (all BGR, CPU RAM) ---
cv::Mat fullRes; // Original resolution (e.g., 3840×2160) — for crop/pipeline
cv::Mat inference; // Model input size (e.g., 640×640 letterbox) — for detection
cv::Mat display; // Display resolution (e.g., 1920×1080) — for LabVIEW UI
// --- Metadata ---
int originalWidth = 0; // Original frame width before any resize
int originalHeight = 0; // Original frame height before any resize
int inferenceWidth = 0; // Inference image width (e.g., 640)
int inferenceHeight = 0;// Inference image height (e.g., 640)
float letterboxRatio = 1.0f; // Scale ratio used for letterbox (for coordinate mapping)
int64_t pts = 0; // Presentation timestamp
// --- Configuration (set per camera) ---
int displayMaxHeight = 1080; // Configurable display resolution
int inferenceSize = 640; // Configurable inference size (default 640)
// --- Lifecycle ---
std::atomic<int> refcount{1};
};
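The letterboxRatio field exists so that detections made on the inference image can be mapped back to original-frame coordinates. A minimal sketch of that mapping follows. It assumes the resized content is placed at the top-left of the letterbox (padding on the right and bottom only), so no offset is subtracted. `BoxF` and `toOriginal` are hypothetical names.

```cpp
// Hypothetical axis-aligned box in floating-point pixel coordinates.
struct BoxF { float x, y, w, h; };

// Map a box from letterboxed inference coordinates to original-frame
// coordinates. letterboxRatio is original size / resized size (for example,
// 6.0 when a 3840-wide frame is resized to 640 wide).
inline BoxF toOriginal(const BoxF& b, float letterboxRatio) {
    return {b.x * letterboxRatio, b.y * letterboxRatio,
            b.w * letterboxRatio, b.h * letterboxRatio};
}
```

If the letterbox were centered instead, the padding offset would have to be subtracted before scaling.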
Changes to ANSFrame Registry
The existing ANSGpuFrameRegistry can be extended or a new ANSFrameRegistry created to map cv::Mat* (the display image pointer) to its parent ANSFrame. When LabVIEW clones the display image and sends it to AI, the AI task can look up the parent ANSFrame to access the inference or full-res image.
// Registry: maps display cv::Mat* → ANSFrame*
class ANSFrameRegistry {
std::unordered_map<const uchar*, ANSFrame*> m_map; // key = Mat.data pointer
std::mutex m_mutex;
public:
void attach(cv::Mat* displayMat, ANSFrame* frame);
ANSFrame* lookup(const cv::Mat& mat); // lookup by data pointer
void release(cv::Mat* mat);
};
Implementation Steps
Step 1: Create ANSFrame structure and registry
Files to create:
- include/ANSFrame.h: ANSFrame struct definition
- modules/ANSCV/ANSFrameRegistry.h: Registry mapping display Mat → ANSFrame
- modules/ANSCV/ANSFrameRegistry.cpp: Registry implementation
Key design decisions:
- ANSFrame is allocated per decoded frame, shared across all clones
- refcount tracks how many clones reference this frame
- When refcount → 0, all 3 images are freed
Step 2: Generate multi-resolution images in avframeYUV420PToCvMat
File to modify: MediaClient/media/video_player.cpp
Replace the current avframeYUV420PToCvMat, which returns a single BGR image, with a new version that populates an ANSFrame with 3 images.
ANSFrame* CVideoPlayer::generateANSFrame(const AVFrame* frame) {
    auto* ansFrame = new ANSFrame();
    const int W = frame->width;
    const int H = frame->height;
    ansFrame->originalWidth = W;
    ansFrame->originalHeight = H;
    ansFrame->pts = frame->pts;
    // --- Resize YUV planes for each resolution ---
    // 1. Full resolution: direct cvtColor (no resize)
    cv::Mat yuv(H * 3/2, W, CV_8UC1);
    // ... copy planes ...
    cv::cvtColor(yuv, ansFrame->fullRes, cv::COLOR_YUV2BGR_I420);
    // 2. Inference size (640×640 letterbox from YUV planes)
    int infSize = ansFrame->inferenceSize; // default 640
    float r = std::min((float)infSize / W, (float)infSize / H);
    // Round down to even sizes so the half-size chroma planes stay integral
    int unpadW = (int)(r * W) & ~1, unpadH = (int)(r * H) & ~1;
    ansFrame->letterboxRatio = 1.0f / r;
    // Resize Y plane (wraps the decoder buffer, no copy)
    cv::Mat yFull(H, W, CV_8UC1, frame->data[0], frame->linesize[0]);
    cv::Mat yResized;
    cv::resize(yFull, yResized, cv::Size(unpadW, unpadH));
    // Resize U, V planes
    cv::Mat uFull(H/2, W/2, CV_8UC1, frame->data[1], frame->linesize[1]);
    cv::Mat vFull(H/2, W/2, CV_8UC1, frame->data[2], frame->linesize[2]);
    cv::Mat uResized, vResized;
    cv::resize(uFull, uResized, cv::Size(unpadW/2, unpadH/2));
    cv::resize(vFull, vResized, cv::Size(unpadW/2, unpadH/2));
    // Assemble padded I420 buffer. Gray padding needs Y = 114 and
    // U = V = 128 (neutral chroma); filling U/V with 114 would tint it.
    cv::Mat yuvInf(infSize * 3/2, infSize, CV_8UC1, cv::Scalar(128));
    yuvInf.rowRange(0, infSize).setTo(cv::Scalar(114)); // luma rows only
    yResized.copyTo(yuvInf(cv::Rect(0, 0, unpadW, unpadH)));
    // ... copy U, V with padding ...
    cv::cvtColor(yuvInf, ansFrame->inference, cv::COLOR_YUV2BGR_I420);
    ansFrame->inferenceWidth = infSize;
    ansFrame->inferenceHeight = infSize;
    // 3. Display resolution (1080p from YUV planes); never upscale
    int dispH = std::min(ansFrame->displayMaxHeight, H) & ~1;
    float dispScale = (float)dispH / H;
    int dispW = (int)(W * dispScale) & ~1;
    cv::Mat yDisp, uDisp, vDisp;
    cv::resize(yFull, yDisp, cv::Size(dispW, dispH));
    cv::resize(uFull, uDisp, cv::Size(dispW/2, dispH/2));
    cv::resize(vFull, vDisp, cv::Size(dispW/2, dispH/2));
    cv::Mat yuvDisp(dispH * 3/2, dispW, CV_8UC1);
    // ... assemble I420 ...
    cv::cvtColor(yuvDisp, ansFrame->display, cv::COLOR_YUV2BGR_I420);
    return ansFrame;
}
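The letterbox geometry used for the inference image can be checked in isolation, without OpenCV. The sketch below is one way to compute it, assuming round-to-nearest followed by rounding down to an even size (so the half-size chroma planes have integral dimensions). `LetterboxGeom` and `letterboxGeom` are hypothetical names, not existing code.

```cpp
#include <algorithm>
#include <cmath>

struct LetterboxGeom {
    int unpadW;    // width of the resized content inside the square
    int unpadH;    // height of the resized content inside the square
    double ratio;  // multiply inference coordinates by this to get source coordinates
};

// Fit a srcW × srcH frame into a target × target square, keeping aspect ratio.
inline LetterboxGeom letterboxGeom(int srcW, int srcH, int target) {
    const double r = std::min(static_cast<double>(target) / srcW,
                              static_cast<double>(target) / srcH);
    // Round to nearest, then clear the lowest bit to force an even size.
    const int w = static_cast<int>(std::lround(srcW * r)) & ~1;
    const int h = static_cast<int>(std::lround(srcH * r)) & ~1;
    return {w, h, 1.0 / r};
}
```

For a 3840×2160 source and a 640 target this yields 640×360 content with a ratio of 6, which leaves 280 rows of padding at the bottom of the 640×640 image.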
Step 3: Modify GetImage to return display image + attach ANSFrame
File to modify: modules/ANSCV/ANSRTSP.cpp (and other ANSCV classes)
cv::Mat ANSRTSPClient::GetImage(int& width, int& height, int64_t& pts) {
    // ... existing logic to get frame from player ...
    // GetImage returns the DISPLAY image (1080p)
    // ANSFrame is attached to the Mat via registry
    ANSFrame* frame = _currentANSFrame;
    if (!frame) { // no frame decoded yet: return an empty Mat
        width = height = 0;
        pts = 0;
        return cv::Mat();
    }
    width = frame->display.cols;
    height = frame->display.rows;
    pts = frame->pts;
    // Register: display Mat's data pointer → ANSFrame
    ANSFrameRegistry::instance().attach(&frame->display, frame);
    return frame->display; // 1080p, ~6.2 MB (was 25 MB for 4K)
}
Step 4: Modify ANSCV_CloneImage_S to link clone to ANSFrame
File to modify: modules/ANSCV/ANSOpenCV.cpp
int ANSCV_CloneImage_S(cv::Mat** imageIn, cv::Mat** imageOut) {
    *imageOut = anscv_mat_new(**imageIn); // clone display image (6.2 MB, was 25 MB)
    // Link clone to same ANSFrame (refcount++). addRef performs the lookup,
    // the increment and the registration under one lock, so the frame cannot
    // be freed between the lookup and the increment.
    ANSFrameRegistry::instance().addRef(*imageIn, *imageOut);
    return 1;
}
Step 5: Modify engine Preprocess to use ANSFrame inference image
Files to modify: All engine Preprocess functions
// In ANSRTYOLO::Preprocess (and all other engines):
std::vector<std::vector<cv::cuda::GpuMat>> ANSRTYOLO::Preprocess(
        const cv::Mat& inputImage, ImageMetadata& outMeta) {
    // Try to get pre-resized inference image from ANSFrame
    ANSFrame* frame = ANSFrameRegistry::instance().lookup(inputImage);
    cv::Mat srcForInference;
    // modelInputW is a placeholder for the network input width the engine
    // already knows; it is not an existing variable. The check must not use
    // inputImage.cols: the input is the 1080p display clone, which is always
    // wider than the inference image, so that test would never pass.
    if (frame && !frame->inference.empty() &&
        modelInputW == frame->inferenceWidth) {
        // Use pre-computed 640×640: ZERO resize needed
        srcForInference = frame->inference;
        outMeta.imgHeight = frame->originalHeight;
        outMeta.imgWidth = frame->originalWidth;
        outMeta.ratio = frame->letterboxRatio;
    } else if (frame && !frame->fullRes.empty()) {
        // Model input differs from the pre-computed size: use full resolution
        srcForInference = frame->fullRes;
        // ... resize to model input from full res ...
    } else {
        // Fallback: use input image directly (backward compat)
        srcForInference = inputImage;
    }
    // Convert BGR → RGB
    cv::Mat cpuRGB;
    cv::cvtColor(srcForInference, cpuRGB, cv::COLOR_BGR2RGB);
    // Upload small image to GPU
    cv::cuda::GpuMat gpuResized;
    gpuResized.upload(cpuRGB, stream);
    // ...
}
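The branch structure in Preprocess amounts to a three-way choice of source image. The sketch below isolates that decision so it can be tested without OpenCV or a registry. It assumes the choice is keyed on the model's input width; `InferenceSource`, `chooseSource` and the parameter names are hypothetical.

```cpp
enum class InferenceSource { Precomputed, FullRes, InputImage };

// hasFrame:     the registry lookup found an ANSFrame
// hasInference: the ANSFrame carries a pre-computed inference image
// hasFullRes:   the ANSFrame carries the full-resolution image
// modelInputW:  width the network expects (assumed to be known to the engine)
// inferenceW:   width of the pre-computed inference image
inline InferenceSource chooseSource(bool hasFrame, bool hasInference,
                                    bool hasFullRes, int modelInputW,
                                    int inferenceW) {
    if (hasFrame && hasInference && modelInputW == inferenceW)
        return InferenceSource::Precomputed;  // sizes match, no resize needed
    if (hasFrame && hasFullRes)
        return InferenceSource::FullRes;      // resize from the best source
    return InferenceSource::InputImage;       // backward-compatible fallback
}
```

Keeping the decision in a pure function like this makes it straightforward to share across engines and to unit-test the fallback path.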
Step 6: Pipeline crop uses full resolution
// In ANSLPR or any pipeline that crops detected objects:
// Crop from the full-resolution image instead of the 1080p display image,
// which would have to be upscaled and would lose detail:
ANSFrame* frame = ANSFrameRegistry::instance().lookup(inputImage);
cv::Mat cropSource = (frame && !frame->fullRes.empty())
    ? frame->fullRes // Full 4K quality for face/plate recognition
    : inputImage; // Fallback
// Scale bbox from display coords to full-res coords
// (inputImage is the display image the bbox refers to)
float scaleX = (float)cropSource.cols / inputImage.cols;
float scaleY = (float)cropSource.rows / inputImage.rows;
cv::Rect fullResBbox((int)(bbox.x * scaleX), (int)(bbox.y * scaleY),
    (int)(bbox.width * scaleX), (int)(bbox.height * scaleY));
// Clamp to the image so a box touching the border cannot throw
fullResBbox &= cv::Rect(0, 0, cropSource.cols, cropSource.rows);
cv::Mat crop = cropSource(fullResBbox).clone();
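The bounding-box scaling is easy to get subtly wrong at the image border, so it helps to test it separately. The following sketch scales a box from display coordinates to full-resolution coordinates and clamps it to the image. `Box` and `scaleAndClamp` are hypothetical stand-ins for cv::Rect and the pipeline code.

```cpp
#include <algorithm>

// Hypothetical integer box: top-left corner plus size.
struct Box { int x, y, w, h; };

// Scale a box from display coordinates to full-resolution coordinates
// and clamp it so it never extends outside the full-resolution image.
inline Box scaleAndClamp(const Box& b, int dispW, int dispH,
                         int fullW, int fullH) {
    const double sx = static_cast<double>(fullW) / dispW;
    const double sy = static_cast<double>(fullH) / dispH;
    // Scale both corners, then clamp each corner independently.
    int x0 = static_cast<int>(b.x * sx);
    int y0 = static_cast<int>(b.y * sy);
    int x1 = static_cast<int>((b.x + b.w) * sx);
    int y1 = static_cast<int>((b.y + b.h) * sy);
    x0 = std::max(0, std::min(x0, fullW));
    y0 = std::max(0, std::min(y0, fullH));
    x1 = std::max(x0, std::min(x1, fullW));
    y1 = std::max(y0, std::min(y1, fullH));
    return {x0, y0, x1 - x0, y1 - y0};
}
```

Scaling the two corners rather than the width and height keeps the right and bottom edges consistent after truncation to integers.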
Step 7: Configuration API
// Set inference size (default 640) — before StartRTSP
void SetRTSPInferenceSize(ANSRTSPClient** Handle, int size); // 640, 320, 1280
// Set display resolution (default 1080) — before StartRTSP
void SetRTSPDisplayResolution(ANSRTSPClient** Handle, int width, int height);
// Check if ANSFrame is available for a cloned image
int HasANSFrame(cv::Mat** image); // returns 1 if ANSFrame attached
// Get specific resolution from ANSFrame
int GetANSFrameInference(cv::Mat** displayImage, cv::Mat** inferenceImage);
int GetANSFrameFullRes(cv::Mat** displayImage, cv::Mat** fullResImage);
Memory & Performance Impact
Per-Frame Memory
| Image | Resolution | Size | Before (single 4K) |
|---|---|---|---|
| Full resolution | 3840×2160 | 24.9 MB | 24.9 MB |
| Inference | 640×640 | 1.2 MB | (generated per AI task) |
| Display | 1920×1080 | 6.2 MB | (was part of 4K) |
| Total per frame | | 32.3 MB | 24.9 MB |
The pre-computed images add 7.4 MB per frame, but this is outweighed by the clone savings:
Clone Savings (12 AI tasks)
| | Before | After |
|---|---|---|
| Clone size per task | 24.9 MB | 6.2 MB (display only) |
| 12 clones total | 299 MB | 74 MB |
| Clone time | 36-60ms | 10-14ms |
| Resize per task | 2-3ms × 12 = 24-36ms | 0ms (pre-computed) |
| Total savings | | ~250 MB RAM, ~50ms CPU |
Generation Time (one-time per frame)
| Step | Time |
|---|---|
| Full res: cvtColor YUV→BGR | ~4-8ms |
| 640×640: resize YUV planes + cvtColor | ~0.8ms |
| 1080p: resize YUV planes + cvtColor | ~1.5ms |
| Total | ~6-10ms |
vs Current: cvtColor 4K = ~4-8ms + resize per task = ~2-3ms × 12 = ~28-44ms total
Net savings: ~20-35ms per frame cycle across all tasks.
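The comparison above is a simple cost model: the current path pays one 4K colour conversion plus one resize per task, while the ANSFrame path pays the conversion plus two one-time resizes. A small sketch of that model, using only the estimates quoted in this plan (the function names are illustrative):

```cpp
// Per-frame-cycle CPU time (ms) for colour conversion and resizing.
// All inputs are the plan's estimates, not measurements.
constexpr double currentCostMs(double cvt4kMs, double resizePerTaskMs, int tasks) {
    return cvt4kMs + resizePerTaskMs * tasks;
}

// ANSFrame path: one conversion plus the two extra sizes, generated once.
constexpr double ansFrameCostMs(double cvt4kMs, double inferenceMs, double displayMs) {
    return cvt4kMs + inferenceMs + displayMs;
}
```

With 12 tasks, the low and high estimates give 28ms and 44ms for the current path, against roughly 6.3ms and 10.3ms for the ANSFrame path.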
Files to Create/Modify
New files:
- include/ANSFrame.h: ANSFrame struct
- modules/ANSCV/ANSFrameRegistry.h: Registry header
- modules/ANSCV/ANSFrameRegistry.cpp: Registry implementation
Modified files:
- MediaClient/media/video_player.h: Add generateANSFrame declaration
- MediaClient/media/video_player.cpp: Implement generateANSFrame, modify getImage
- modules/ANSCV/ANSRTSP.h: Add ANSFrame member, SetInferenceSize
- modules/ANSCV/ANSRTSP.cpp: Modify GetImage to return display + attach ANSFrame
- modules/ANSCV/ANSOpenCV.cpp: Modify CloneImage_S to link to ANSFrame
- modules/ANSCV/ANSMatRegistry.h: Optionally integrate ANSFrame into mat registry
- modules/ANSODEngine/ANSRTYOLO.cpp: Use ANSFrame inference image
- modules/ANSODEngine/ANSTENSORTRTOD.cpp: Same
- modules/ANSODEngine/ANSTENSORRTPOSE.cpp: Same
- modules/ANSODEngine/ANSTENSORRTSEG.cpp: Same
- modules/ANSODEngine/ANSTENSORRTCL.cpp: Same
- modules/ANSODEngine/ANSYOLOV12RTOD.cpp: Same
- modules/ANSODEngine/ANSYOLOV10RTOD.cpp: Same
- modules/ANSODEngine/SCRFDFaceDetector.cpp: Same
- modules/ANSODEngine/dllmain.cpp: Set tl_currentANSFrame for pipeline lookup
- modules/ANSLPR/ANSLPR_OD.cpp: Use fullRes for plate crop
- modules/ANSFR/ARCFaceRT.cpp: Use fullRes for face crop
- modules/ANSFR/ANSFaceRecognizer.cpp: Use fullRes for face crop
Apply to other ANSCV classes:
- modules/ANSCV/ANSFLV.h/.cpp: Same pattern as ANSRTSP
- modules/ANSCV/ANSMJPEG.h/.cpp: Same
- modules/ANSCV/ANSRTMP.h/.cpp: Same
- modules/ANSCV/ANSSRT.h/.cpp: Same
Clone-to-ANSFrame Mapping (Critical Design)
The Problem
LabVIEW calls ANSCV_CloneImage_S to create a deep copy of the 1080p display image. The clone has a different data pointer than the original — so a simple pointer lookup won't find the ANSFrame.
GetImage returns display Mat: data = 0xAAAA → registry: 0xAAAA → ANSFrame #1
CloneImage creates deep copy: data = 0xBBBB → registry: ??? (not registered)
AI task tries lookup(0xBBBB): NOT FOUND — fallback to slow path
The Solution
Register the clone's data pointer to the same ANSFrame during ANSCV_CloneImage_S:
GetImage: data = 0xAAAA → registry: 0xAAAA → ANSFrame #1 (refcount=1)
CloneImage: data = 0xBBBB → registry: 0xBBBB → ANSFrame #1 (refcount=2)
CloneImage: data = 0xCCCC → registry: 0xCCCC → ANSFrame #1 (refcount=3)
ReleaseImage: remove 0xBBBB → ANSFrame #1 (refcount=2)
ReleaseImage: remove 0xCCCC → ANSFrame #1 (refcount=1)
Next GetImage: remove 0xAAAA → ANSFrame #1 (refcount=0) → FREE all 3 images
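The refcount sequence above can be reproduced with a small stand-alone model. The sketch below is a simplified illustration, not the proposed registry: `MiniFrame` and `MiniRegistry` are hypothetical names, keys are opaque pointers rather than Mat.data, and a `freedCount()` counter is added purely so the behaviour can be observed.

```cpp
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for ANSFrame: only the refcount matters here.
struct MiniFrame {
    int refcount = 1;  // the original display buffer holds the first reference
};

// Hypothetical, simplified registry keyed by an opaque data pointer.
class MiniRegistry {
    std::unordered_map<const void*, MiniFrame*> map_;
    std::mutex mutex_;
    int freed_ = 0;  // frames deleted so far, for inspection only

public:
    // Register the original buffer; it owns the frame's initial reference.
    void attach(const void* key, MiniFrame* frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        map_[key] = frame;
    }

    // Link a clone's buffer to the frame already registered for src.
    bool addRef(const void* src, const void* dst) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(src);
        if (it == map_.end()) return false;
        MiniFrame* frame = it->second;  // copy first: inserting may rehash
        ++frame->refcount;
        map_[dst] = frame;
        return true;
    }

    // Drop one mapping; delete the frame when the last reference goes.
    void release(const void* key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end()) return;
        MiniFrame* frame = it->second;
        map_.erase(it);
        if (--frame->refcount == 0) { delete frame; ++freed_; }
    }

    // Current refcount for a key, or -1 if the key is not registered.
    int refcountOf(const void* key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        return it == map_.end() ? -1 : it->second->refcount;
    }

    int freedCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return freed_;
    }
};
```

Attaching one key, adding two clones and then releasing all three walks the refcount through 1, 2, 3, 2, 1, 0, and the frame is deleted on the last release.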
Implementation in ANSCV_CloneImage_S (ANSOpenCV.cpp)
int ANSCV_CloneImage_S(cv::Mat** imageIn, cv::Mat** imageOut) {
*imageOut = anscv_mat_new(**imageIn); // deep copy display (6.2 MB)
gpu_frame_addref(*imageIn, *imageOut); // existing: link GpuFrameData
ANSFrameRegistry::instance().addRef(*imageIn, *imageOut); // NEW: link ANSFrame
return 1;
}
Implementation in ANSCV_ReleaseImage_S (ANSOpenCV.cpp)
int ANSCV_ReleaseImage_S(cv::Mat** imageIn) {
ANSFrameRegistry::instance().release(*imageIn); // NEW: refcount--, free if 0
anscv_mat_delete(imageIn); // existing: free Mat
return 1;
}
Implementation in ANSFrameRegistry
#include <mutex>
#include <unordered_map>
class ANSFrameRegistry {
    std::unordered_map<const uchar*, ANSFrame*> m_map; // Mat.data → ANSFrame
    std::mutex m_mutex;
public:
    // Process-wide singleton used by every call site in this plan
    static ANSFrameRegistry& instance() {
        static ANSFrameRegistry registry;
        return registry;
    }
    // Register original display Mat → ANSFrame
    void attach(const cv::Mat* mat, ANSFrame* frame) {
        if (!mat || !frame) return;
        std::lock_guard<std::mutex> lock(m_mutex);
        // Remove old mapping if exists
        auto it = m_map.find(mat->data);
        if (it != m_map.end() && it->second != frame) {
            if (--it->second->refcount <= 0) delete it->second;
        }
        m_map[mat->data] = frame;
    }
    // Link clone to same ANSFrame (called from CloneImage)
    void addRef(const cv::Mat* src, const cv::Mat* dst) {
        if (!src || !dst) return;
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(src->data);
        if (it == m_map.end()) return;
        ANSFrame* frame = it->second;
        frame->refcount++;
        m_map[dst->data] = frame;
    }
    // Lookup by any Mat (original or clone)
    ANSFrame* lookup(const cv::Mat& mat) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(mat.data);
        return (it != m_map.end()) ? it->second : nullptr;
    }
    // Release mapping (called from ReleaseImage)
    void release(const cv::Mat* mat) {
        if (!mat) return;
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(mat->data);
        if (it == m_map.end()) return;
        ANSFrame* frame = it->second;
        m_map.erase(it);
        if (--frame->refcount <= 0) delete frame;
    }
};
Thread Safety
- Registry uses std::mutex, the same pattern as ANSGpuFrameRegistry
- ANSFrame images (fullRes, inference, display) are immutable after creation, so they are safe to read from any thread
- Only refcount is modified concurrently; it uses std::atomic<int>
- ANSFrame is normally freed only when refcount reaches 0 (all clones released); the leak protections described under Leak Prevention can force-free it regardless of refcount
Lifecycle Diagram
Camera Thread: AI Task 1: AI Task 2:
generateANSFrame()
→ fullRes, inference, display
→ refcount = 1
→ registry: display.data → AF
GetImage returns display
CloneImage(display, &clone1) ─────► clone1
→ registry: clone1.data → AF
→ refcount = 2
CloneImage(display, &clone2) ──────────────────────────► clone2
→ registry: clone2.data → AF
→ refcount = 3
lookup(clone1) → AF
use AF->inference
use AF->fullRes (crop)
ReleaseImage(clone1)
→ refcount = 2
lookup(clone2) → AF
use AF->inference
ReleaseImage(clone2)
→ refcount = 1
Next GetImage:
→ new ANSFrame
→ old display.data removed
→ refcount = 0 → FREE old AF
Leak Prevention (Critical)
Leak Scenarios
| Scenario | What leaks | Size per leak |
|---|---|---|
| LabVIEW forgets to call ReleaseImage | ANSFrame (fullRes + inference + display) | ~32 MB |
| Camera reconnect while clones exist | Old ANSFrame stays alive until clones released | ~32 MB |
| LabVIEW crash/abort | All ANSFrames in registry | ~32 MB × N frames |
| AI task throws exception, skips Release | ANSFrame refcount never reaches 0 | ~32 MB |
Protection 1: TTL-Based Auto-Eviction
Same pattern as ANSGpuFrameRegistry::evictStaleFrames() — periodically scan for old ANSFrames and force-free them.
class ANSFrameRegistry {
    static constexpr int FRAME_TTL_SECONDS = 5; // Max lifetime of any ANSFrame
    static constexpr int EVICT_INTERVAL_MS = 1000; // Check every 1 second
    struct Entry {
        ANSFrame* frame;
        std::chrono::steady_clock::time_point createdAt;
    };
    std::vector<Entry> m_frames; // all live frames, oldest first
    std::chrono::steady_clock::time_point m_lastEvict;
    void evictStale() {
        auto now = std::chrono::steady_clock::now();
        // Lock first: m_lastEvict is shared state too
        std::lock_guard<std::mutex> lock(m_mutex);
        // Throttle: only run every EVICT_INTERVAL_MS
        if (now - m_lastEvict < std::chrono::milliseconds(EVICT_INTERVAL_MS)) return;
        m_lastEvict = now;
        for (auto it = m_frames.begin(); it != m_frames.end(); ) {
            double ageSec = std::chrono::duration<double>(now - it->createdAt).count();
            if (ageSec > FRAME_TTL_SECONDS) {
                // Force-free: remove all Mat* mappings to this frame.
                // This ignores refcount, so a task still holding the raw
                // ANSFrame* would dangle; the TTL must comfortably exceed
                // the longest task.
                ANSFrame* frame = it->frame;
                for (auto mit = m_map.begin(); mit != m_map.end(); ) {
                    if (mit->second == frame) mit = m_map.erase(mit);
                    else ++mit;
                }
                delete frame;
                it = m_frames.erase(it);
            } else {
                ++it;
            }
        }
    }
};
Call evictStale() from GetImage() (piggybacked on camera thread activity — same as gpu_frame_evict_stale()).
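The TTL rule itself is easy to test if the current time is passed in rather than read inside the function. The sketch below shows that idea on a plain vector. `TtlEntry` and `evictOlderThan` are hypothetical names, and an integer id stands in for the ANSFrame pointer.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical entry: an id stands in for the ANSFrame pointer.
struct TtlEntry {
    int id;
    Clock::time_point createdAt;
};

// Remove every entry whose age at `now` exceeds ttl; returns how many
// were evicted. Passing `now` in keeps the function deterministic.
inline std::size_t evictOlderThan(std::vector<TtlEntry>& entries,
                                  Clock::time_point now,
                                  Clock::duration ttl) {
    std::size_t evicted = 0;
    for (auto it = entries.begin(); it != entries.end();) {
        if (now - it->createdAt > ttl) {
            it = entries.erase(it);  // erase returns the next valid iterator
            ++evicted;
        } else {
            ++it;
        }
    }
    return evicted;
}
```

A registry could call such a helper with `Clock::now()` in production and with fixed time points in unit tests.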
Protection 2: Max ANSFrame Pool Size
Limit total number of live ANSFrames. If pool is full, force-free the oldest before creating a new one.
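static constexpr size_t MAX_ANSFRAMES = 100; // Max live frames across all cameras
ANSFrame* createANSFrame(...) {
    evictStale(); // Clean up expired frames first (takes m_mutex itself)
    std::lock_guard<std::mutex> lock(m_mutex);
    // If still over limit, force-free oldest
    while (m_frames.size() >= MAX_ANSFRAMES) {
        ANSFrame* oldest = m_frames.front().frame;
        // Remove all Mat* mappings to the oldest frame, then delete it
        for (auto mit = m_map.begin(); mit != m_map.end(); ) {
            if (mit->second == oldest) mit = m_map.erase(mit);
            else ++mit;
        }
        delete oldest;
        m_frames.erase(m_frames.begin());
    }
    auto* frame = new ANSFrame();
    // ... populate ...
    m_frames.push_back({frame, std::chrono::steady_clock::now()});
    return frame;
}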
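The pool limit is an evict-oldest-on-insert policy. A minimal stand-alone sketch of that policy is shown below. `BoundedFramePool` is a hypothetical class, integer ids stand in for ANSFrame pointers, and a capacity of at least 1 is assumed.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical pool of frame ids, oldest at the front.
class BoundedFramePool {
    std::deque<int> ids_;
    std::size_t capacity_;  // assumed to be >= 1

public:
    explicit BoundedFramePool(std::size_t capacity) : capacity_(capacity) {}

    // Insert a new id, evicting the oldest entries first if the pool is full.
    // Returns the number of entries evicted to make room.
    std::size_t push(int id) {
        std::size_t evicted = 0;
        while (!ids_.empty() && ids_.size() >= capacity_) {
            ids_.pop_front();
            ++evicted;
        }
        ids_.push_back(id);
        return evicted;
    }

    std::size_t size() const { return ids_.size(); }

    // Precondition: size() > 0.
    int oldest() const { return ids_.front(); }
};
```

A deque makes removal of the oldest entry constant-time, whereas erasing the first element of a vector shifts every remaining entry.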
Protection 3: Camera-Scoped Cleanup
When a camera is stopped or destroyed, force-free ALL ANSFrames belonging to that camera (regardless of refcount).
// In ANSRTSPClient::Stop() and Destroy():
ANSFrameRegistry::instance().releaseByOwner(this);
// In ANSFrameRegistry:
void releaseByOwner(void* owner) {
std::lock_guard<std::mutex> lock(m_mutex);
for (auto it = m_frames.begin(); it != m_frames.end(); ) {
if (it->frame->owner == owner) {
// Remove all Mat* mappings
for (auto mit = m_map.begin(); mit != m_map.end(); ) {
if (mit->second == it->frame) mit = m_map.erase(mit);
else ++mit;
}
delete it->frame;
it = m_frames.erase(it);
} else {
++it;
}
}
}
Protection 4: One ANSFrame Per Camera (Ring Buffer)
Each camera keeps only the latest ANSFrame. When a new frame arrives, the previous ANSFrame is marked for cleanup (refcount decremented). This bounds memory to 1 ANSFrame per camera.
class ANSRTSPClient {
ANSFrame* _currentANSFrame = nullptr;
void onNewFrame(AVFrame* decoded) {
ANSFrame* newFrame = generateANSFrame(decoded);
newFrame->owner = this;
// Replace old frame — decrement refcount
if (_currentANSFrame) {
ANSFrameRegistry::instance().detachOwner(_currentANSFrame);
// If refcount reaches 0, freed immediately
// If clones still hold refs, freed when they release
}
_currentANSFrame = newFrame;
ANSFrameRegistry::instance().attach(&newFrame->display, newFrame);
}
};
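The "freed immediately, or when the last clone releases" behaviour is what std::shared_ptr provides out of the box. The sketch below uses shared_ptr in place of the plan's manual refcount, purely to illustrate the lifetime rule. `SlotFrame` and `LatestFrameSlot` are hypothetical names.

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for ANSFrame.
struct SlotFrame { int id; };

// Holds only the latest frame for one camera.
class LatestFrameSlot {
    std::shared_ptr<const SlotFrame> current_;
    mutable std::mutex mutex_;

public:
    // Publish a new frame. The slot drops its reference to the previous
    // frame, which is destroyed as soon as no reader still holds it.
    void publish(std::shared_ptr<const SlotFrame> frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(frame);
    }

    // Take a reference that keeps the frame alive while the caller uses it.
    std::shared_ptr<const SlotFrame> acquire() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }
};
```

A reader that acquired frame 1 before frame 2 was published keeps frame 1 alive until it drops its pointer. This is the same guarantee the manual refcount aims for, without a raw pointer that can dangle.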
Protection 5: ANSFrame Struct with Owner Tracking
struct ANSFrame {
// ... existing fields ...
// Leak protection
void* owner = nullptr; // Camera that created this frame
std::chrono::steady_clock::time_point createdAt = std::chrono::steady_clock::now(); // set at construction so TTL checks are valid
std::atomic<int> refcount{1};
~ANSFrame() {
// Images are cv::Mat — automatically freed by OpenCV refcount
// No manual cleanup needed for fullRes, inference, display
}
};
Memory Budget Analysis
With all protections:
| Cameras | Max ANSFrames | Memory (worst case) |
|---|---|---|
| 5 running | 5 current + ~10 in-flight clones | 5 × 32 MB = 160 MB |
| 20 running | 20 current + ~40 in-flight clones | 20 × 32 MB = 640 MB |
| 100 created, 5 running | 5 current + ~10 in-flight | 5 × 32 MB = 160 MB |
| 100 created, 95 stopped | 0 (stopped cameras free ANSFrame) | 0 MB |
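Steady state is about running_cameras × 32 MB and does not grow over time. Older frames that in-flight clones still reference add to this temporarily, up to the MAX_ANSFRAMES pool limit.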
TTL Guarantee
Even if all other protections fail, the 5-second TTL eviction ensures:
- Maximum leak duration: 5 seconds
- Maximum leaked memory: cameras × 5 seconds × 10 FPS × 32 MB/frame; with the ring buffer (1 per camera), it is just cameras × 32 MB
- Periodic cleanup on every GetImage call ensures no accumulation
Replacing GpuFrameRegistry
Current State (wasteful with NV12 disabled)
With _useNV12FastPath = false (current default), GpuFrameRegistry is never populated — no gpu_frame_attach is called. But gpu_frame_addref, gpu_frame_remove, and gpu_frame_evict_stale still run on every clone/release/replace — doing empty lookups that waste CPU cycles.
Current code paths that run but do nothing:
ANSCV_CloneImage_S → gpu_frame_addref → lookup → NOT FOUND → no-op
ANSCV_ReleaseImage_S → gpu_frame_remove → lookup → NOT FOUND → no-op
anscv_mat_replace → gpu_frame_remove → lookup → NOT FOUND → no-op
anscv_mat_replace → gpu_frame_evict_stale → scans empty registry → no-op
Plan: ANSFrameRegistry replaces GpuFrameRegistry
ANSFrameRegistry serves the same purpose (mapping cv::Mat* → frame metadata) but without GPU complexity:
| Feature | GpuFrameRegistry | ANSFrameRegistry |
|---|---|---|
| Maps Mat* to | GpuFrameData (NV12 GPU pointers) | ANSFrame (3 CPU images) |
| Used when | NV12 fast path enabled | Always (SW or HW decode) |
| GPU dependency | CUDA, pool slots, D2D copy | None |
| Thread safety | mutex + atomic refcount | mutex + atomic refcount |
| Cleanup | TTL eviction + pool cooldown | TTL eviction (simpler) |
Migration Path
- Phase 1 (implement ANSFrame): ANSFrameRegistry runs alongside GpuFrameRegistry
  - CloneImage: calls both gpu_frame_addref + ansframe_addref
  - ReleaseImage: calls both gpu_frame_remove + ansframe_release
  - Safe: both registries handle NOT FOUND gracefully
- Phase 2 (NV12 disabled permanently): Remove GpuFrameRegistry calls
  - Remove gpu_frame_addref from CloneImage
  - Remove gpu_frame_remove from ReleaseImage and anscv_mat_replace
  - Remove gpu_frame_evict_stale from anscv_mat_replace
  - Keep GpuFrameRegistry code for future NV12 re-enablement
- Phase 3 (optional, if NV12 re-enabled): Merge into single registry
  - ANSFrame struct gains optional GPU fields (yPlane, uvPlane, poolSlot)
  - Single registry, single refcount, single lookup
Recommended: Implement Phase 1 first, Phase 2 after testing
Backward Compatibility
- If ANSFrame is not available (e.g., old camera module), engines fall back to current behavior (resize input image)
- The cv::Mat** API stays the same, so LabVIEW doesn't need changes
- ANSFrame is transparent to LabVIEW, which only sees the display image
- The GetANSFrameInference / GetANSFrameFullRes APIs are optional for advanced use
Risk Assessment
| Risk | Mitigation |
|---|---|
| Extra 7.4 MB RAM per frame | Negligible vs 250 MB clone savings |
| ANSFrame lifecycle (refcount) | Same pattern as GpuFrameData — proven |
| Coordinate mapping errors | letterboxRatio stored in ANSFrame — deterministic |
| YUV plane resize quality | Same as GPU NV12 resize — proven equivalent |
| Thread safety | ANSFrame is immutable after creation — safe to share |