ANSCENTER/ANSCORE

Tuan Nghia Nguyen db089c3697 Use CPU resize before upload to GPU to remove PCIe bottleneck

2026-04-05 06:55:15 +10:00

ANSFrame Multi-Resolution Architecture Plan

Detailed Comparison: Current vs ANSFrame

Test Configuration

GPU: RTX 4070 Laptop (8 GB VRAM)
Cameras: 5 running (3840×2160, 2880×1620, 1920×1080)
AI Tasks: 12 (subscribing to cameras)
Engines: 7 TRT engines
Decode: Software (YUV420P, CPU)
Frame rate: SetTargetFPS(100) = ~10 FPS per camera
Baseline: 8.4 hour stable run (ANSLEGION42), 1,048,325 inferences

Per-Frame Generation (inside GetImage)

Step	Current	ANSFrame
Decode YUV420P	~5ms (CPU)	~5ms (same)
Full res BGR (cvtColor)	~4-8ms (4K YUV→BGR)	~4-8ms (same)
640×640 letterbox	Not done here	~0.8ms (resize YUV planes + cvtColor, done ONCE)
1080p display	Not done (returns 4K)	~1.5ms (resize YUV planes + cvtColor, done ONCE)
Total GetImage time	~4-8ms	~6-10ms (+2ms for 2 extra sizes)

Clone & Dispatch to AI (per clone × 12 AI tasks)

Step	Current (4K clone)	ANSFrame (1080p clone)
Clone image size	3840×2160×3 = 24.9 MB	1920×1080×3 = 6.2 MB
memcpy time per clone	3-5ms	0.8-1.2ms
12 clones total size	299 MB	74 MB
12 clones total time	36-60ms	10-14ms
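
The sizes in this table follow from width × height × 3 bytes for packed 8-bit BGR. A minimal sketch of that arithmetic, assuming MB means 10^6 bytes (which is what the table's figures correspond to); the helper names are illustrative and not part of the codebase:

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one packed 8-bit BGR frame (3 bytes per pixel).
// Illustrative helper, not an existing ANSCORE function.
inline std::uint64_t bgrFrameBytes(int width, int height) {
    assert(width > 0 && height > 0);
    return static_cast<std::uint64_t>(width) *
           static_cast<std::uint64_t>(height) * 3u;
}

// Total bytes copied when every AI task receives its own deep copy.
inline std::uint64_t cloneTrafficBytes(int width, int height, int tasks) {
    assert(tasks >= 0);
    return bgrFrameBytes(width, height) * static_cast<std::uint64_t>(tasks);
}
```

For the test configuration this gives 24,883,200 bytes per 4K clone and 298,598,400 bytes for 12 clones, versus 6,220,800 and 74,649,600 bytes at 1080p, which is where the 24.9 / 299 / 6.2 / 74 MB rows come from.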

AI Preprocessing (per AI task)

Step	Current	ANSFrame
Receive image	4K BGR (24.9 MB)	1080p BGR (6.2 MB) + ANSFrame ref
Local clone in engine	24.9 MB memcpy (~3ms)	6.2 MB memcpy (~0.8ms)
CPU letterbox resize	4K→640×640 (~2-3ms)	SKIP (use ANSFrame.inference, 0ms)
BGR→RGB	640×640 (~0.3ms)	640×640 (~0.3ms)
GPU upload	1.2 MB (~0.1ms)	1.2 MB (~0.1ms)
Total preprocess per task	~5-6ms	~1.2ms
12 tasks total	~60-72ms	~14ms

Pipeline Crop Quality (ALPR, Face Recognition)

Step	Current	ANSFrame
Detection from	4K image	1080p display (detection) → ANSFrame.fullRes (crop)
Crop source	Same 4K image	ANSFrame.fullRes (4K original)
Crop quality	4K	4K (identical)

Total Processing Time Per Frame Cycle

Phase	Current (4K)	ANSFrame	Savings
GetImage generation	4-8ms	6-10ms	-2ms
Clone × 12	36-60ms	10-14ms	+22-46ms
AI preprocess × 12	60-72ms	14ms	+46-58ms
TRT inference × 12	60-600ms	60-600ms	Same
Postprocess × 12	12-60ms	12-60ms	Same
Total CPU overhead	112-200ms	42-98ms	~70-100ms saved
Total with inference	172-800ms	102-698ms	~70-100ms saved

RAM Usage

Resource	Current	ANSFrame
GetImage output (per camera)	24.9 MB (4K BGR)	32.3 MB (3 images)
Clones in flight (12 tasks)	12 × 24.9 = 299 MB	12 × 6.2 = 74 MB
AI local clone (12 tasks)	12 × 24.9 = 299 MB	12 × 6.2 = 74 MB
ANSFrame shared data	0	32.3 MB (shared, refcounted)
Total RAM per frame cycle	~623 MB	~213 MB
Peak RAM (5 cams, 2-3 cycles)	~1.2-1.9 GB	~0.4-0.6 GB

GPU / VRAM Usage

Resource	Current	ANSFrame
VRAM (engines + workspace)	~5.9 GB	~5.9 GB (same)
GPU upload per inference	1.2 MB	1.2 MB (same)
PCIe bandwidth	~300 MB/s	~300 MB/s (same)
SM utilization	0-34%	0-34% (same)

CPU Usage

Component	Current	ANSFrame
SW decode (5 cameras)	~3-5 cores	~3-5 cores (same)
YUV→BGR generation	~0.3 cores	~0.42 cores (+0.12 for extra sizes)
CPU resize per AI task	~1.5 cores	0 cores (pre-computed)
Clone memcpy	~2.4 cores	~0.5 cores
Total CPU for pipeline	~7-9 cores	~4-6 cores
CPU savings	—	~3 cores freed

LabVIEW Thread Scheduling Impact

Factor	Current	ANSFrame
Data per task dispatch	24.9 MB	6.2 MB (4x less)
Memory allocation pressure	299 MB in flight	74 MB (4x less)
Cache efficiency	Poor (24.9 MB > L3)	Better (6.2 MB closer to L3)
Processing time (LabVIEW)	100-500ms	70-300ms (~30-40% faster)

Final Summary

Metric	Current (8h stable)	ANSFrame (projected)	Improvement
Clone time (12 tasks)	36-60ms	10-14ms	3-4x faster
Preprocess (12 tasks)	60-72ms	14ms	4-5x faster
CPU overhead total	112-200ms	42-98ms	~2-3x less
RAM usage (frames)	~1.2-1.9 GB	~0.4-0.6 GB	66% less
CPU cores for pipeline	~7-9 cores	~4-6 cores	3 cores freed
VRAM	No change	No change	—
Crop quality (ALPR/FR)	4K	4K	Same
Processing time	100-500ms	70-300ms	~30-40% faster
Code complexity	Simple	Medium	+~500 lines

Overview

Replace the single-resolution cv::Mat** output from GetImage with a multi-resolution ANSFrame that contains 3 pre-computed images generated from the YUV420P decoded frame. This eliminates redundant resizing across AI tasks and cuts the per-clone memcpy size by about 4x (24.9 MB → 6.2 MB).

Current Flow (Inefficient)

Decoder → YUV420P → cvtColor → 4K BGR (25 MB) → GetImage returns 4K
LabVIEW: clone 4K (25 MB × 12 tasks = 300 MB memcpy)
Each AI task: CPU resize 4K → 640×640 (redundant × 12 tasks)

New Flow (Optimized)

Decoder → YUV420P → generate 3 images from YUV planes:
  ├─ Full resolution BGR (for crop/pipeline)
  ├─ 640×640 letterbox BGR (for detection inference)
  └─ 1080p BGR (for display, configurable)

GetImage returns ANSFrame (contains refs to all 3 images)
LabVIEW: clone 1080p only (6.2 MB × 12 = 74 MB, was 300 MB)
AI task: uses 640×640 directly (1.2 MB, no resize needed)
Pipeline crop: uses full resolution image (no upscaling artifacts)

Performance Comparison

Resize from YUV420P planes (Option A — RECOMMENDED)

4K YUV420P frame (12.4 MB in 3 planes)
  │
  ├─ Full res: cvtColor(Y+U+V → BGR)                    ~4-8ms
  │  Result: 3840×2160 BGR (24.9 MB)
  │
  ├─ 640×640: resize Y(640×360) + U,V(320×180)           ~0.5ms
  │            + pad bottom + cvtColor                    ~0.3ms
  │  Result: 640×640 BGR (1.2 MB)                   Total: ~0.8ms
  │
  └─ 1080p: resize Y(1920×1080) + U,V(960×540)           ~1.0ms
             + cvtColor                                   ~0.5ms
     Result: 1920×1080 BGR (6.2 MB)                 Total: ~1.5ms

Total generation time: ~6-10ms (all 3 images)

Resize from BGR (Option B — slower)

4K YUV420P → cvtColor → 4K BGR (24.9 MB)                 ~4-8ms
  │
  ├─ Full res: already done                               ~0ms
  ├─ 640×640: cv::resize(4K BGR → 640×640) + letterbox    ~2-3ms
  └─ 1080p: cv::resize(4K BGR → 1080p)                   ~1-2ms

Total generation time: ~7-13ms (all 3 images)

Recommendation: Option A (resize YUV planes)

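Option A is faster (roughly 15-25% by the estimates above: ~6-10ms vs ~7-13ms) because YUV420P is 1.5 bytes/pixel versus 3 bytes/pixel for BGR, so the resize touches half the data. Because cvtColor is applied after the resize (the same order a GPU NV12 resize uses), the output should be visually equivalent to resizing in BGR, though not necessarily bit-identical.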
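
The "half the data" argument can be checked from the plane layout: a YUV420P frame has one full-size luma plane and two quarter-size chroma planes. A small sketch of that calculation (the struct and function names are illustrative assumptions, not existing code):

```cpp
#include <cassert>
#include <cstdint>

// Byte counts of the three planes in a YUV420P (I420) frame.
struct PlaneSizes {
    std::uint64_t y;  // luma, width x height
    std::uint64_t u;  // chroma U, (width/2) x (height/2)
    std::uint64_t v;  // chroma V, (width/2) x (height/2)
    std::uint64_t total() const { return y + u + v; }
};

// Illustrative helper; requires even dimensions, as 4:2:0 subsampling does.
inline PlaneSizes yuv420pPlaneSizes(int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    const std::uint64_t luma =
        static_cast<std::uint64_t>(width) * static_cast<std::uint64_t>(height);
    const std::uint64_t chroma = luma / 4;
    return {luma, chroma, chroma};
}
```

For 3840×2160 this yields 12,441,600 bytes (the 12.4 MB quoted above), exactly half of the 24,883,200 bytes of the same frame in BGR.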

ANSFrame Structure

Header: `include/ANSFrame.h`

#pragma once
#include <opencv2/core/mat.hpp>
#include <atomic>
#include <cstdint>

// ANSFrame holds pre-computed multi-resolution images from a single decoded frame.
// Generated once in avframeYUV420PToCvMat, shared across all AI tasks via registry.
// Eliminates per-task resize and reduces clone size by ~4x (4K → 1080p display).
struct ANSFrame {
    // --- Pre-computed images (all BGR, CPU RAM) ---
    cv::Mat fullRes;        // Original resolution (e.g., 3840×2160) — for crop/pipeline
    cv::Mat inference;      // Model input size (e.g., 640×640 letterbox) — for detection
    cv::Mat display;        // Display resolution (e.g., 1920×1080) — for LabVIEW UI

    // --- Metadata ---
    int originalWidth = 0;  // Original frame width before any resize
    int originalHeight = 0; // Original frame height before any resize
    int inferenceWidth = 0; // Inference image width (e.g., 640)
    int inferenceHeight = 0;// Inference image height (e.g., 640)
    float letterboxRatio = 1.0f; // Scale ratio used for letterbox (for coordinate mapping)
    int64_t pts = 0;        // Presentation timestamp

    // --- Configuration (set per camera) ---
    int displayMaxHeight = 1080;  // Configurable display resolution
    int inferenceSize = 640;      // Configurable inference size (default 640)

    // --- Lifecycle ---
    std::atomic<int> refcount{1};
};

Changes to `ANSFrame` Registry

The existing ANSGpuFrameRegistry can be extended or a new ANSFrameRegistry created to map cv::Mat* (the display image pointer) to its parent ANSFrame. When LabVIEW clones the display image and sends it to AI, the AI task can look up the parent ANSFrame to access the inference or full-res image.

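// Registry: maps a Mat's data pointer (display image or clone) → ANSFrame*
#include <mutex>
#include <unordered_map>

class ANSFrameRegistry {
    std::unordered_map<const uchar*, ANSFrame*> m_map;  // key = Mat.data pointer
    std::mutex m_mutex;
public:
    static ANSFrameRegistry& instance();  // process-wide singleton used by the call sites below
    void attach(const cv::Mat* displayMat, ANSFrame* frame);
    ANSFrame* lookup(const cv::Mat& mat);  // lookup by data pointer
    void release(const cv::Mat* mat);
};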

Implementation Steps

Step 1: Create ANSFrame structure and registry

Files to create:

include/ANSFrame.h — ANSFrame struct definition
modules/ANSCV/ANSFrameRegistry.h — Registry mapping display Mat → ANSFrame
modules/ANSCV/ANSFrameRegistry.cpp — Registry implementation

Key design decisions:

ANSFrame is allocated per decoded frame, shared across all clones
refcount tracks how many clones reference this frame
When refcount → 0, all 3 images are freed

Step 2: Generate multi-resolution images in avframeYUV420PToCvMat

File to modify: MediaClient/media/video_player.cpp

Replace the current avframeYUV420PToCvMat, which returns a single BGR image, with a new version that populates an ANSFrame with 3 images.

ANSFrame* CVideoPlayer::generateANSFrame(const AVFrame* frame) {
    auto* ansFrame = new ANSFrame();
    const int W = frame->width;
    const int H = frame->height;
    ansFrame->originalWidth = W;
    ansFrame->originalHeight = H;
    ansFrame->pts = frame->pts;  // GetImage reports this timestamp

    // --- Resize YUV planes for each resolution ---

    // 1. Full resolution: direct cvtColor (no resize)
    cv::Mat yuv(H * 3/2, W, CV_8UC1);
    // ... copy planes ...
    cv::cvtColor(yuv, ansFrame->fullRes, cv::COLOR_YUV2BGR_I420);

    // 2. Inference size (640×640 letterbox from YUV planes)
    int infSize = ansFrame->inferenceSize;  // default 640
    float r = std::min((float)infSize / W, (float)infSize / H);
    // I420 needs even dimensions (chroma planes are half-size), so round down to even
    int unpadW = (int)(r * W) & ~1, unpadH = (int)(r * H) & ~1;
    ansFrame->letterboxRatio = 1.0f / r;

    // Resize Y plane
    cv::Mat yFull(H, W, CV_8UC1, frame->data[0], frame->linesize[0]);
    cv::Mat yResized;
    cv::resize(yFull, yResized, cv::Size(unpadW, unpadH));

    // Resize U, V planes
    cv::Mat uFull(H/2, W/2, CV_8UC1, frame->data[1], frame->linesize[1]);
    cv::Mat vFull(H/2, W/2, CV_8UC1, frame->data[2], frame->linesize[2]);
    cv::Mat uResized, vResized;
    cv::resize(uFull, uResized, cv::Size(unpadW/2, unpadH/2));
    cv::resize(vFull, vResized, cv::Size(unpadW/2, unpadH/2));

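    // Assemble padded I420 buffer: Y padding = 114 (gray), U/V padding = 128 (neutral chroma).
    // Filling the chroma planes with 114 as well would tint the padding instead of leaving it gray.
    cv::Mat yuvInf(infSize * 3/2, infSize, CV_8UC1, cv::Scalar(128));
    yuvInf.rowRange(0, infSize).setTo(cv::Scalar(114));
    yResized.copyTo(yuvInf(cv::Rect(0, 0, unpadW, unpadH)));
    // ... copy U, V with padding ...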

    cv::cvtColor(yuvInf, ansFrame->inference, cv::COLOR_YUV2BGR_I420);
    ansFrame->inferenceWidth = infSize;
    ansFrame->inferenceHeight = infSize;

    // 3. Display resolution (1080p from YUV planes); never upscale, keep dimensions even for I420
    int dispH = std::min(ansFrame->displayMaxHeight, H) & ~1;
    int dispW = (W * dispH / H) & ~1;  // integer math avoids float rounding surprises

    cv::Mat yDisp, uDisp, vDisp;
    cv::resize(yFull, yDisp, cv::Size(dispW, dispH));
    cv::resize(uFull, uDisp, cv::Size(dispW/2, dispH/2));
    cv::resize(vFull, vDisp, cv::Size(dispW/2, dispH/2));

    cv::Mat yuvDisp(dispH * 3/2, dispW, CV_8UC1);
    // ... assemble I420 ...
    cv::cvtColor(yuvDisp, ansFrame->display, cv::COLOR_YUV2BGR_I420);

    return ansFrame;
}
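
The elided "copy U, V with padding" step comes down to placing each resized plane in the top-left corner of its region in one contiguous I420 buffer (Y plane, then U, then V). The following is a sketch of that packing without OpenCV, assuming tightly packed input planes; all names are illustrative, and the letterbox size is computed with integer math, which is the same formula as `r = infSize / max(W, H)`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LetterboxGeometry {
    int unpadW;   // resized width inside the square (even)
    int unpadH;   // resized height inside the square (even)
    float ratio;  // multiply inference coordinates by this to get original coordinates
};

// Illustrative helper: fit srcW x srcH into an infSize square, keeping aspect ratio.
inline LetterboxGeometry computeLetterbox(int srcW, int srcH, int infSize) {
    assert(srcW > 0 && srcH > 0 && infSize > 0 && infSize % 2 == 0);
    const int longSide = std::max(srcW, srcH);
    const int w = static_cast<int>(static_cast<long long>(srcW) * infSize / longSide) & ~1;
    const int h = static_cast<int>(static_cast<long long>(srcH) * infSize / longSide) & ~1;
    return {w, h, static_cast<float>(longSide) / static_cast<float>(infSize)};
}

// Copy a tightly packed w x h plane into the top-left of a plane with row stride dstStride.
inline void copyPlane(const std::vector<std::uint8_t>& src, int w, int h,
                      std::uint8_t* dst, int dstStride) {
    assert(src.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    for (int row = 0; row < h; ++row) {
        std::copy(src.begin() + static_cast<std::ptrdiff_t>(row) * w,
                  src.begin() + static_cast<std::ptrdiff_t>(row + 1) * w,
                  dst + static_cast<std::ptrdiff_t>(row) * dstStride);
    }
}

// Build the padded infSize x infSize I420 buffer from already-resized planes.
inline std::vector<std::uint8_t> packLetterboxI420(
        const std::vector<std::uint8_t>& y, const std::vector<std::uint8_t>& u,
        const std::vector<std::uint8_t>& v, const LetterboxGeometry& g, int infSize) {
    const std::size_t ySize = static_cast<std::size_t>(infSize) * static_cast<std::size_t>(infSize);
    const std::size_t cSize = ySize / 4;
    std::vector<std::uint8_t> buf(ySize + 2 * cSize, std::uint8_t{128});  // chroma padding: neutral
    std::fill(buf.begin(), buf.begin() + static_cast<std::ptrdiff_t>(ySize),
              std::uint8_t{114});                                         // luma padding: gray
    copyPlane(y, g.unpadW, g.unpadH, buf.data(), infSize);
    copyPlane(u, g.unpadW / 2, g.unpadH / 2, buf.data() + ySize, infSize / 2);
    copyPlane(v, g.unpadW / 2, g.unpadH / 2, buf.data() + ySize + cSize, infSize / 2);
    return buf;
}
```

For a 3840×2160 source and a 640 inference size, the geometry is 640×360 with a ratio of 6, and the buffer is 614,400 bytes. In the real function the same layout would live in the `yuvInf` Mat, and source planes with a `linesize` larger than their width would need a stride-aware copy.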

Step 3: Modify GetImage to return display image + attach ANSFrame

File to modify: modules/ANSCV/ANSRTSP.cpp (and other ANSCV classes)

cv::Mat ANSRTSPClient::GetImage(int& width, int& height, int64_t& pts) {
    // ... existing logic to get frame from player ...

    // GetImage returns the DISPLAY image (1080p)
    // ANSFrame is attached to the Mat via registry
    ANSFrame* frame = _currentANSFrame;
    if (!frame) {               // no frame decoded yet
        width = height = 0;
        pts = 0;
        return cv::Mat();
    }
    width = frame->display.cols;
    height = frame->display.rows;
    pts = frame->pts;

    // Register: display Mat's data pointer → ANSFrame
    ANSFrameRegistry::instance().attach(&frame->display, frame);

    return frame->display;  // 1080p, ~6.2 MB (was 25 MB for 4K)
}

Step 4: Modify ANSCV_CloneImage_S to link clone to ANSFrame

File to modify: modules/ANSCV/ANSOpenCV.cpp

int ANSCV_CloneImage_S(cv::Mat** imageIn, cv::Mat** imageOut) {
    *imageOut = anscv_mat_new(**imageIn);  // clone display image (6.2 MB, was 25 MB)

    // Link clone to same ANSFrame. addRef (see ANSFrameRegistry implementation) does the
    // lookup, refcount++ and mapping under one lock, so the frame cannot be freed in between.
    ANSFrameRegistry::instance().addRef(*imageIn, *imageOut);

    return 1;
}

Step 5: Modify engine Preprocess to use ANSFrame inference image

Files to modify: All engine Preprocess functions

// In ANSRTYOLO::Preprocess (and the equivalent function in all other engines):
std::vector<std::vector<cv::cuda::GpuMat>> ANSRTYOLO::Preprocess(
    const cv::Mat& inputImage, ImageMetadata& outMeta) {

    // Try to get pre-resized inference image from ANSFrame
    ANSFrame* frame = ANSFrameRegistry::instance().lookup(inputImage);

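    // modelInputW / modelInputH below stand for this engine's network input size
    // (hypothetical names; use whatever the engine already stores). Comparing against
    // inputImage.cols would never match, because inputImage is the 1080p display clone.
    cv::Mat srcForInference;
    if (frame && !frame->inference.empty() &&
        modelInputW == frame->inferenceWidth &&
        modelInputH == frame->inferenceHeight) {
        // Pre-computed letterbox matches the model input: no resize needed
        srcForInference = frame->inference;
        outMeta.imgHeight = frame->originalHeight;
        outMeta.imgWidth = frame->originalWidth;
        outMeta.ratio = frame->letterboxRatio;
    } else if (frame && !frame->fullRes.empty()) {
        // Model input differs from the pre-computed size: resize from full resolution
        srcForInference = frame->fullRes;
        // ... resize to model input from full res ...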
    } else {
        // Fallback: use input image directly (backward compat)
        srcForInference = inputImage;
    }

    // Convert BGR → RGB
    cv::Mat cpuRGB;
    cv::cvtColor(srcForInference, cpuRGB, cv::COLOR_BGR2RGB);

    // Upload small image to GPU
    cv::cuda::GpuMat gpuResized;
    gpuResized.upload(cpuRGB, stream);
    // ...
}
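
Because detections now come back in 640×640 letterbox coordinates, each box has to be mapped to original-frame pixels with `letterboxRatio`. A minimal sketch of that mapping, assuming the resized image sits at the top-left of the square with padding only on the right and bottom (as in generateANSFrame), so no offset has to be subtracted; `BoxF` and the function name are illustrative:

```cpp
#include <algorithm>
#include <cassert>

struct BoxF { float x, y, w, h; };  // illustrative box type, top-left plus size

// Map a box from letterboxed inference coordinates to original-frame pixels,
// clamping to the frame so boxes that spill into the padding stay valid.
inline BoxF inferenceToOriginal(const BoxF& box, float letterboxRatio,
                                int originalWidth, int originalHeight) {
    assert(letterboxRatio > 0.0f && originalWidth > 0 && originalHeight > 0);
    const float maxX = static_cast<float>(originalWidth);
    const float maxY = static_cast<float>(originalHeight);
    const float x0 = std::min(std::max(box.x * letterboxRatio, 0.0f), maxX);
    const float y0 = std::min(std::max(box.y * letterboxRatio, 0.0f), maxY);
    const float x1 = std::min(std::max((box.x + box.w) * letterboxRatio, 0.0f), maxX);
    const float y1 = std::min(std::max((box.y + box.h) * letterboxRatio, 0.0f), maxY);
    return {x0, y0, x1 - x0, y1 - y0};
}
```

With a 4K source and a 640 inference size the ratio is 6, so a box at (100, 50) of size 40×20 maps to (600, 300) of size 240×120.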

Step 6: Pipeline crop uses full resolution

// In ANSLPR or any pipeline that crops detected objects:
// Instead of cropping from display image (1080p, upscaling artifacts):
ANSFrame* frame = ANSFrameRegistry::instance().lookup(inputImage);
cv::Mat cropSource = (frame && !frame->fullRes.empty())
    ? frame->fullRes     // Full 4K quality for face/plate recognition
    : inputImage;        // Fallback

// Scale bbox from display coords to full-res coords (scale is 1.0 on the fallback path)
float scaleX = (float)cropSource.cols / inputImage.cols;
float scaleY = (float)cropSource.rows / inputImage.rows;
cv::Rect fullResBbox(cvRound(bbox.x * scaleX), cvRound(bbox.y * scaleY),
                     cvRound(bbox.width * scaleX), cvRound(bbox.height * scaleY));
fullResBbox &= cv::Rect(0, 0, cropSource.cols, cropSource.rows);  // clamp to image bounds
cv::Mat crop = cropSource(fullResBbox).clone();
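
The display-to-full-resolution scaling can be exercised on its own. The sketch below uses a plain struct in place of cv::Rect so it compiles without OpenCV, and scales the two corners rather than the width and height so the clamped result never extends past the frame; the names are illustrative assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct RectI { int x, y, width, height; };  // stand-in for cv::Rect

// Scale a box detected on the display image to full-resolution pixels and clamp it.
inline RectI scaleToFullRes(const RectI& bbox, int dispW, int dispH, int fullW, int fullH) {
    assert(dispW > 0 && dispH > 0 && fullW > 0 && fullH > 0);
    const double sx = static_cast<double>(fullW) / dispW;
    const double sy = static_cast<double>(fullH) / dispH;
    int x0 = static_cast<int>(std::lround(bbox.x * sx));
    int y0 = static_cast<int>(std::lround(bbox.y * sy));
    int x1 = static_cast<int>(std::lround((bbox.x + bbox.width) * sx));
    int y1 = static_cast<int>(std::lround((bbox.y + bbox.height) * sy));
    x0 = std::max(0, std::min(x0, fullW));
    y0 = std::max(0, std::min(y0, fullH));
    x1 = std::max(x0, std::min(x1, fullW));
    y1 = std::max(y0, std::min(y1, fullH));
    return {x0, y0, x1 - x0, y1 - y0};
}
```

A 200×100 box at (100, 50) on a 1920×1080 display becomes a 400×200 box at (200, 100) on the 4K frame, so a plate or face crop keeps four times as many pixels as the display crop would.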

Step 7: Configuration API

// Set inference size (default 640) — before StartRTSP
void SetRTSPInferenceSize(ANSRTSPClient** Handle, int size);  // 640, 320, 1280

// Set display resolution (default 1080) — before StartRTSP
void SetRTSPDisplayResolution(ANSRTSPClient** Handle, int width, int height);

// Check if ANSFrame is available for a cloned image
int HasANSFrame(cv::Mat** image);  // returns 1 if ANSFrame attached

// Get specific resolution from ANSFrame
int GetANSFrameInference(cv::Mat** displayImage, cv::Mat** inferenceImage);
int GetANSFrameFullRes(cv::Mat** displayImage, cv::Mat** fullResImage);

Memory & Performance Impact

Per-Frame Memory

Image	Resolution	Size	Before (single 4K)
Full resolution	3840×2160	24.9 MB	24.9 MB
Inference	640×640	1.2 MB	(generated per AI task)
Display	1920×1080	6.2 MB	(was part of 4K)
Total per frame		32.3 MB	24.9 MB

+7.4 MB per frame for pre-computed images, BUT:

Clone Savings (12 AI tasks)

	Before	After
Clone size per task	24.9 MB	6.2 MB (display only)
12 clones total	299 MB	74 MB
Clone time	36-60ms	10-14ms
Resize per task	2-3ms × 12 = 24-36ms	0ms (pre-computed)
Total savings		~225 MB RAM, ~50ms CPU

Generation Time (one-time per frame)

Step	Time
Full res: cvtColor YUV→BGR	~4-8ms
640×640: resize YUV planes + cvtColor	~0.8ms
1080p: resize YUV planes + cvtColor	~1.5ms
Total	~6-10ms

vs Current: cvtColor 4K = ~4-8ms + resize per task = ~2-3ms × 12 = ~28-44ms total

Net savings: ~20-35ms per frame cycle across all tasks.

Files to Create/Modify

New files:

include/ANSFrame.h — ANSFrame struct
modules/ANSCV/ANSFrameRegistry.h — Registry header
modules/ANSCV/ANSFrameRegistry.cpp — Registry implementation

Modified files:

MediaClient/media/video_player.h — Add generateANSFrame declaration
MediaClient/media/video_player.cpp — Implement generateANSFrame, modify getImage
modules/ANSCV/ANSRTSP.h — Add ANSFrame member, SetInferenceSize
modules/ANSCV/ANSRTSP.cpp — Modify GetImage to return display + attach ANSFrame
modules/ANSCV/ANSOpenCV.cpp — Modify CloneImage_S to link to ANSFrame
modules/ANSCV/ANSMatRegistry.h — Optional: integrate ANSFrame into mat registry
modules/ANSODEngine/ANSRTYOLO.cpp — Use ANSFrame inference image
modules/ANSODEngine/ANSTENSORTRTOD.cpp — Same
modules/ANSODEngine/ANSTENSORRTPOSE.cpp — Same
modules/ANSODEngine/ANSTENSORRTSEG.cpp — Same
modules/ANSODEngine/ANSTENSORRTCL.cpp — Same
modules/ANSODEngine/ANSYOLOV12RTOD.cpp — Same
modules/ANSODEngine/ANSYOLOV10RTOD.cpp — Same
modules/ANSODEngine/SCRFDFaceDetector.cpp — Same
modules/ANSODEngine/dllmain.cpp — Set tl_currentANSFrame for pipeline lookup
modules/ANSLPR/ANSLPR_OD.cpp — Use fullRes for plate crop
modules/ANSFR/ARCFaceRT.cpp — Use fullRes for face crop
modules/ANSFR/ANSFaceRecognizer.cpp — Use fullRes for face crop

Apply to other ANSCV classes:

modules/ANSCV/ANSFLV.h/.cpp — Same pattern as ANSRTSP
modules/ANSCV/ANSMJPEG.h/.cpp — Same
modules/ANSCV/ANSRTMP.h/.cpp — Same
modules/ANSCV/ANSSRT.h/.cpp — Same

Clone-to-ANSFrame Mapping (Critical Design)

The Problem

LabVIEW calls ANSCV_CloneImage_S to create a deep copy of the 1080p display image. The clone has a different data pointer than the original — so a simple pointer lookup won't find the ANSFrame.

GetImage returns display Mat:   data = 0xAAAA → registry: 0xAAAA → ANSFrame #1
CloneImage creates deep copy:   data = 0xBBBB → registry: ??? (not registered)
AI task tries lookup(0xBBBB):   NOT FOUND — fallback to slow path

The Solution

GetImage:     data = 0xAAAA → registry: 0xAAAA → ANSFrame #1 (refcount=1)
CloneImage:   data = 0xBBBB → registry: 0xBBBB → ANSFrame #1 (refcount=2)
CloneImage:   data = 0xCCCC → registry: 0xCCCC → ANSFrame #1 (refcount=3)
ReleaseImage: remove 0xBBBB → ANSFrame #1 (refcount=2)
ReleaseImage: remove 0xCCCC → ANSFrame #1 (refcount=1)
Next GetImage: remove 0xAAAA → ANSFrame #1 (refcount=0) → FREE all 3 images

Implementation in ANSCV_CloneImage_S (ANSOpenCV.cpp)

int ANSCV_CloneImage_S(cv::Mat** imageIn, cv::Mat** imageOut) {
    *imageOut = anscv_mat_new(**imageIn);          // deep copy display (6.2 MB)
    gpu_frame_addref(*imageIn, *imageOut);          // existing: link GpuFrameData
    ANSFrameRegistry::instance().addRef(*imageIn, *imageOut);  // NEW: link ANSFrame
    return 1;
}

Implementation in ANSCV_ReleaseImage_S (ANSOpenCV.cpp)

int ANSCV_ReleaseImage_S(cv::Mat** imageIn) {
    ANSFrameRegistry::instance().release(*imageIn);  // NEW: refcount--, free if 0
    anscv_mat_delete(imageIn);                        // existing: free Mat
    return 1;
}

Implementation in ANSFrameRegistry

class ANSFrameRegistry {
    std::unordered_map<const uchar*, ANSFrame*> m_map;  // Mat.data → ANSFrame
    std::mutex m_mutex;
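public:
    // Process-wide singleton used by all call sites
    // (initialization of a function-local static is thread-safe since C++11)
    static ANSFrameRegistry& instance() {
        static ANSFrameRegistry registry;
        return registry;
    }
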
    // Register original display Mat → ANSFrame
    void attach(const cv::Mat* mat, ANSFrame* frame) {
        std::lock_guard<std::mutex> lock(m_mutex);
        // Remove old mapping if exists
        auto it = m_map.find(mat->data);
        if (it != m_map.end() && it->second != frame) {
            if (--it->second->refcount <= 0) delete it->second;
        }
        m_map[mat->data] = frame;
    }

    // Link clone to same ANSFrame (called from CloneImage)
    void addRef(const cv::Mat* src, const cv::Mat* dst) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(src->data);
        if (it == m_map.end()) return;
        ANSFrame* frame = it->second;
        frame->refcount++;
        m_map[dst->data] = frame;
    }

    // Lookup by any Mat (original or clone)
    ANSFrame* lookup(const cv::Mat& mat) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(mat.data);
        return (it != m_map.end()) ? it->second : nullptr;
    }

    // Release mapping (called from ReleaseImage)
    void release(const cv::Mat* mat) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(mat->data);
        if (it == m_map.end()) return;
        ANSFrame* frame = it->second;
        m_map.erase(it);
        if (--frame->refcount <= 0) delete frame;
    }
};
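
One possible variation on the registry above is to let std::shared_ptr own the frame, so the reference count and the final delete are handled by the standard library rather than by hand. This is a sketch of an alternative, not the plan's design: `FrameStub` stands in for ANSFrame, and keys are plain `const void*` in place of `cv::Mat::data`, both assumptions made so the example compiles without OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

struct FrameStub { int id = 0; };  // stand-in for ANSFrame (assumption)

class SharedFrameRegistry {
    std::unordered_map<const void*, std::shared_ptr<FrameStub>> m_map;
    mutable std::mutex m_mutex;
public:
    // Register the original display buffer for a frame.
    void attach(const void* key, std::shared_ptr<FrameStub> frame) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map[key] = std::move(frame);
    }

    // Link a clone's buffer to the same frame; returns false if src is unknown.
    bool addRef(const void* srcKey, const void* dstKey) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(srcKey);
        if (it == m_map.end()) return false;
        std::shared_ptr<FrameStub> frame = it->second;  // copy first: operator[] may rehash
        m_map[dstKey] = std::move(frame);
        return true;
    }

    // The returned shared_ptr keeps the frame alive while the caller uses it.
    std::shared_ptr<FrameStub> lookup(const void* key) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_map.find(key);
        return (it != m_map.end()) ? it->second : nullptr;
    }

    // Drop one mapping; the frame is destroyed when the last owner goes away.
    void release(const void* key) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_map.erase(key);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_map.size();
    }
};
```

The trade-off is an extra atomic increment per lookup. In return, a reader that obtained the frame through lookup cannot have it deleted underneath it, even if another thread erases every mapping in the meantime.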

Thread Safety

Registry uses std::mutex — same pattern as ANSGpuFrameRegistry
ANSFrame images (fullRes, inference, display) are immutable after creation — safe to read from any thread
Only refcount is modified concurrently — uses std::atomic<int>
ANSFrame is normally freed only when refcount reaches 0 (all clones released); the protections described under Leak Prevention can force-free it earlier

Lifecycle Diagram

Camera Thread:                    AI Task 1:              AI Task 2:

generateANSFrame()
  → fullRes, inference, display
  → refcount = 1
  → registry: display.data → AF

GetImage returns display

CloneImage(display, &clone1) ─────► clone1
  → registry: clone1.data → AF
  → refcount = 2

CloneImage(display, &clone2) ──────────────────────────► clone2
  → registry: clone2.data → AF
  → refcount = 3

                                  lookup(clone1) → AF
                                  use AF->inference
                                  use AF->fullRes (crop)
                                  ReleaseImage(clone1)
                                  → refcount = 2

                                                          lookup(clone2) → AF
                                                          use AF->inference
                                                          ReleaseImage(clone2)
                                                          → refcount = 1

Next GetImage:
  → new ANSFrame
  → old display.data removed
  → refcount = 0 → FREE old AF

Leak Prevention (Critical)

Leak Scenarios

Scenario	What leaks	Size per leak
LabVIEW forgets to call ReleaseImage	ANSFrame (fullRes + inference + display)	~32 MB
Camera reconnect while clones exist	Old ANSFrame stays alive until clones released	~32 MB
LabVIEW crash/abort	All ANSFrames in registry	~32 MB × N frames
AI task throws exception, skips Release	ANSFrame refcount never reaches 0	~32 MB

Protection 1: TTL-Based Auto-Eviction

Same pattern as ANSGpuFrameRegistry::evictStaleFrames() — periodically scan for old ANSFrames and force-free them.

class ANSFrameRegistry {
    static constexpr int FRAME_TTL_SECONDS = 5;  // Max lifetime of any ANSFrame
    static constexpr int EVICT_INTERVAL_MS = 1000;  // Check every 1 second

    struct Entry {
        ANSFrame* frame;
        std::chrono::steady_clock::time_point createdAt;
    };

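    std::vector<Entry> m_frames;                          // live frames, oldest first
    std::chrono::steady_clock::time_point m_lastEvict{};  // guarded by m_mutex

    void evictStale() {
        auto now = std::chrono::steady_clock::now();
        // Take the lock before touching m_lastEvict, which is shared between threads
        std::lock_guard<std::mutex> lock(m_mutex);
        // Throttle: only run every EVICT_INTERVAL_MS
        if (now - m_lastEvict < std::chrono::milliseconds(EVICT_INTERVAL_MS)) return;
        m_lastEvict = now;
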
        for (auto it = m_frames.begin(); it != m_frames.end(); ) {
            double ageSec = std::chrono::duration<double>(now - it->createdAt).count();
            if (ageSec > FRAME_TTL_SECONDS) {
                // Force-free: remove all Mat* mappings to this frame
                ANSFrame* frame = it->frame;
                for (auto mit = m_map.begin(); mit != m_map.end(); ) {
                    if (mit->second == frame) mit = m_map.erase(mit);
                    else ++mit;
                }
                delete frame;
                it = m_frames.erase(it);
            } else {
                ++it;
            }
        }
    }
};

Call evictStale() from GetImage() (piggybacked on camera thread activity — same as gpu_frame_evict_stale()).
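
The age check at the core of the eviction can be tested in isolation by passing the current time in as a parameter instead of reading the clock inside the function. A small sketch under that assumption; `TimedEntry` and `evictOlderThan` are illustrative names, and the entry holds only an id rather than a real ANSFrame pointer:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

struct TimedEntry {
    int frameId;                   // stand-in for ANSFrame* (assumption)
    Clock::time_point createdAt;
};

// Remove every entry older than ttl and return how many were evicted.
// `now` is injected so the behaviour can be verified without sleeping.
inline std::size_t evictOlderThan(std::vector<TimedEntry>& entries,
                                  Clock::time_point now,
                                  std::chrono::seconds ttl) {
    assert(ttl.count() >= 0);
    const std::size_t before = entries.size();
    entries.erase(std::remove_if(entries.begin(), entries.end(),
                                 [&](const TimedEntry& e) {
                                     return now - e.createdAt > ttl;
                                 }),
                  entries.end());
    return before - entries.size();
}
```

In the registry itself, the same predicate would also have to remove the frame's entries from `m_map` and run under `m_mutex`, as in evictStale() above.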

Protection 2: Max ANSFrame Pool Size

Limit total number of live ANSFrames. If pool is full, force-free the oldest before creating a new one.

static constexpr int MAX_ANSFRAMES = 100;  // Max live frames across all cameras

ANSFrame* createANSFrame(...) {
    evictStale();  // Clean up expired frames first

    // If still over limit, force-free oldest
    while (m_frames.size() >= MAX_ANSFRAMES) {
        auto oldest = m_frames.begin();
        // ... force-remove all mappings + delete ...
    }

    auto* frame = new ANSFrame();
    // ... populate ...
    m_frames.push_back({frame, std::chrono::steady_clock::now()});
    return frame;
}

Protection 3: Camera-Scoped Cleanup

When a camera is stopped or destroyed, force-free ALL ANSFrames belonging to that camera (regardless of refcount).

// In ANSRTSPClient::Stop() and Destroy():
ANSFrameRegistry::instance().releaseByOwner(this);

// In ANSFrameRegistry:
void releaseByOwner(void* owner) {
    std::lock_guard<std::mutex> lock(m_mutex);
    for (auto it = m_frames.begin(); it != m_frames.end(); ) {
        if (it->frame->owner == owner) {
            // Remove all Mat* mappings
            for (auto mit = m_map.begin(); mit != m_map.end(); ) {
                if (mit->second == it->frame) mit = m_map.erase(mit);
                else ++mit;
            }
            delete it->frame;
            it = m_frames.erase(it);
        } else {
            ++it;
        }
    }
}

Protection 4: One ANSFrame Per Camera (Ring Buffer)

Each camera keeps only the latest ANSFrame. When a new frame arrives, the previous ANSFrame is marked for cleanup (refcount decremented). This bounds steady-state memory to 1 ANSFrame per camera, plus any older frames that in-flight clones still reference until they are released.

class ANSRTSPClient {
    ANSFrame* _currentANSFrame = nullptr;

    void onNewFrame(AVFrame* decoded) {
        ANSFrame* newFrame = generateANSFrame(decoded);
        newFrame->owner = this;

        // Replace old frame — decrement refcount
        if (_currentANSFrame) {
            ANSFrameRegistry::instance().detachOwner(_currentANSFrame);
            // If refcount reaches 0, freed immediately
            // If clones still hold refs, freed when they release
        }
        _currentANSFrame = newFrame;
        ANSFrameRegistry::instance().attach(&newFrame->display, newFrame);
    }
};

Protection 5: ANSFrame Struct with Owner Tracking

struct ANSFrame {
    // ... existing fields ...

    // Leak protection
    void* owner = nullptr;          // Camera that created this frame
    std::chrono::steady_clock::time_point createdAt;
    std::atomic<int> refcount{1};

    ~ANSFrame() {
        // Images are cv::Mat — automatically freed by OpenCV refcount
        // No manual cleanup needed for fullRes, inference, display
    }
};

Memory Budget Analysis

With all protections:

Cameras	Max ANSFrames	Memory (worst case)
5 running	5 current + ~10 in-flight clones	5 × 32 MB = 160 MB
20 running	20 current + ~40 in-flight clones	20 × 32 MB = 640 MB
100 created, 5 running	5 current + ~10 in-flight	5 × 32 MB = 160 MB
100 created, 95 stopped	0 (stopped cameras free ANSFrame)	0 MB

Worst case bounded by: running_cameras × 32 MB — predictable, no growth over time.

TTL Guarantee

Even if ALL protections fail, the 5-second TTL eviction ensures:

Maximum leak duration: 5 seconds
Maximum leaked memory: cameras × 5 seconds × 10 FPS × 32 MB / frame — but with ring buffer (1 per camera), it's just cameras × 32 MB
Periodic cleanup on every GetImage call ensures no accumulation

Replacing GpuFrameRegistry

Current State (wasteful with NV12 disabled)

With _useNV12FastPath = false (current default), GpuFrameRegistry is never populated — no gpu_frame_attach is called. But gpu_frame_addref, gpu_frame_remove, and gpu_frame_evict_stale still run on every clone/release/replace — doing empty lookups that waste CPU cycles.

Current code paths that run but do nothing:
  ANSCV_CloneImage_S   → gpu_frame_addref → lookup → NOT FOUND → no-op
  ANSCV_ReleaseImage_S → gpu_frame_remove → lookup → NOT FOUND → no-op
  anscv_mat_replace    → gpu_frame_remove → lookup → NOT FOUND → no-op
  anscv_mat_replace    → gpu_frame_evict_stale → scans empty registry → no-op

Plan: ANSFrameRegistry replaces GpuFrameRegistry

ANSFrameRegistry serves the same purpose (mapping cv::Mat* → frame metadata) but without GPU complexity:

Feature	GpuFrameRegistry	ANSFrameRegistry
Maps Mat* to	GpuFrameData (NV12 GPU pointers)	ANSFrame (3 CPU images)
Used when	NV12 fast path enabled	Always (SW or HW decode)
GPU dependency	CUDA, pool slots, D2D copy	None
Thread safety	mutex + atomic refcount	mutex + atomic refcount
Cleanup	TTL eviction + pool cooldown	TTL eviction (simpler)

Migration Path

Phase 1 (implement ANSFrame): ANSFrameRegistry runs alongside GpuFrameRegistry
- CloneImage: calls both gpu_frame_addref + ansframe_addref
- ReleaseImage: calls both gpu_frame_remove + ansframe_release
- Safe: both registries handle NOT FOUND gracefully
Phase 2 (NV12 disabled permanently): Remove GpuFrameRegistry calls
- Remove gpu_frame_addref from CloneImage
- Remove gpu_frame_remove from ReleaseImage and anscv_mat_replace
- Remove gpu_frame_evict_stale from anscv_mat_replace
- Keep GpuFrameRegistry code for future NV12 re-enablement
Phase 3 (optional, if NV12 re-enabled): Merge into single registry
- ANSFrame struct gains optional GPU fields (yPlane, uvPlane, poolSlot)
- Single registry, single refcount, single lookup

Recommended: Implement Phase 1 first, Phase 2 after testing

Backward Compatibility

If ANSFrame is not available (e.g., old camera module), engines fall back to current behavior (resize input image)
The cv::Mat** API stays the same — LabVIEW doesn't need changes
ANSFrame is transparent to LabVIEW — it only sees the display image
The GetANSFrameInference / GetANSFrameFullRes APIs are optional for advanced use

Risk Assessment

Risk	Mitigation
Extra 7.4 MB RAM per frame	Negligible vs ~225 MB clone savings
ANSFrame lifecycle (refcount)	Same pattern as GpuFrameData — proven
Coordinate mapping errors	letterboxRatio stored in ANSFrame — deterministic
YUV plane resize quality	Same as GPU NV12 resize — proven equivalent
Thread safety	ANSFrame is immutable after creation — safe to share
