Thread-local staging Mat (video_player.cpp:1400-1407) — single biggest win. Eliminates the 12 MB per-call malloc/free cycle (see the sketch after this list).
Contiguous get_buffer2 allocator (video_decoder.cpp:35-102) — keeps the 3 bulk memcpys cache-friendly. Would also enable FAST/zero-copy for resolutions where visible_h % 64 == 0.
SW-decoder thread config (video_decoder.cpp:528-540) — thread_count=0, thread_type=FRAME|SLICE. FRAME is downgraded to SLICE-only by AV_CODEC_FLAG_LOW_DELAY, but decode throughput is sufficient for your input rate.
SetTargetFPS(100) delivery throttle (already there) — caps onVideoFrame post-decode work at 10 FPS. Keeps the caller path warm-cached.
Instrumentation — [MEDIA_DecInit] / [MEDIA_Convert] / [MEDIA_SWDec] / [MEDIA_Timing] / [MEDIA_JpegTiming] — always-on regression detector, zero cost when ANSCORE_DEBUGVIEW=OFF.
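For the thread-local staging Mat in the first item, a minimal sketch of the pattern, assuming a conversion helper that previously allocated a fresh BGR Mat on every call; the names here are illustrative, not the actual video_player.cpp symbols:

    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>

    // One reusable staging buffer per thread instead of a ~12 MB malloc/free per call.
    // cv::Mat::create() is a no-op once the size/type already match, so after the
    // first frame every conversion writes into warm, already-mapped memory.
    static cv::Mat& stagingBgr()
    {
        thread_local cv::Mat buf;                            // lives for the thread's lifetime
        return buf;
    }

    void convertNv12ToBgr(const cv::Mat& nv12)               // nv12 is (h * 3 / 2) x w, CV_8UC1
    {
        cv::Mat& dst = stagingBgr();
        dst.create(nv12.rows * 2 / 3, nv12.cols, CV_8UC3);   // no-op after the first call
        cv::cvtColor(nv12, dst, cv::COLOR_YUV2BGR_NV12);     // fills the reused buffer
        // hand `dst` downstream (clone only if the consumer outlives this thread)
    }
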
When the decoder hasn't produced a frame in 5s, skip the call to
_playerClient->getImage() entirely and return the cached frame with
unchanged _pts. LabVIEW sees STALE PTS one poll earlier and can
trigger reconnect sooner.
Threshold matches the existing checks on the duplicate-PTS branch and
in areImagesIdentical() so all three stale paths agree. Near-zero cost:
one getLastFrameAgeMs() call before the main path.
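A rough sketch of the short-circuit; getImage(), getLastFrameAgeMs(), _pts and the 5 s threshold come from the change itself, while the poll wrapper, the cached-frame member and the exact signatures are assumptions for illustration:

    // Matches the thresholds used by the duplicate-PTS branch and areImagesIdentical().
    static constexpr int64_t kStaleThresholdMs = 5000;

    bool VideoPlayer::pollImage(cv::Mat& outImage, int64_t& outPts)   // hypothetical wrapper
    {
        // One cheap age query before the main path: if the decoder has been silent
        // for >= 5 s, skip getImage() and return the cached frame with the same PTS,
        // so LabVIEW sees STALE PTS one poll earlier and can reconnect sooner.
        if (_playerClient->getLastFrameAgeMs() >= kStaleThresholdMs) {
            outImage = _cachedImage;   // assumed member holding the last delivered frame
            outPts   = _pts;           // unchanged PTS triggers the caller's stale detection
            return true;
        }
        return _playerClient->getImage(outImage, outPts);             // normal path
    }
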
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
getImage() previously held _mutex across the 4K NV12->BGR sws_scale in
avframeToCVMat, blocking the decoder callback (onVideoFrame) for 100-300ms
per frame. Under multi-camera load this cascaded into 5-21s frame stalls
and STALE PTS events in the log.
- avframeToCVMat: drop outer _mutex. NV12/YUV420P paths touch no shared
state; avframeAnyToCvmat still locks internally for swsCtx.
- getImage: split into two short locked phases with the BGR conversion
unlocked between them. Decoder callbacks can push new frames and run
the CUDA HW capture path in parallel with the reader's conversion.
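A condensed sketch of the two-phase getImage(); member names other than _mutex, _pts and avframeToCVMat are assumptions. The essential point is that the NV12->BGR conversion runs on a ref-counted frame clone taken under the lock, not while holding it:

    extern "C" {
    #include <libavutil/frame.h>
    }
    #include <mutex>
    #include <opencv2/core.hpp>

    bool PlayerClient::getImage(cv::Mat& out, int64_t& pts)
    {
        AVFrame* frame = nullptr;

        {   // Phase 1 (locked, short): snapshot the newest decoded frame.
            std::lock_guard<std::mutex> lk(_mutex);
            if (!_latestFrame) return false;
            frame = av_frame_clone(_latestFrame);   // ref-counted clone, no pixel copy
            pts   = _latestPts;
        }

        // Unlocked: the expensive 4K NV12->BGR conversion. onVideoFrame can keep
        // pushing frames and running the CUDA HW capture path in parallel.
        cv::Mat bgr = avframeToCVMat(frame);        // no longer takes _mutex itself
        av_frame_free(&frame);

        {   // Phase 2 (locked, short): publish the shared state.
            std::lock_guard<std::mutex> lk(_mutex);
            _pts = pts;
        }
        out = std::move(bgr);
        return true;
    }
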
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removed the Fix 1 ClearAllHeaders/ClearAllQueryParams block from all 8
upload entry points. CkRest.AddHeader already overwrites by header name,
so the clears were unnecessary; they also wipe the internal state
Chilkat's SigV4 signer relies on between requests, which caused AWS to
reject every PUT with HTTP 400 MaxMessageLengthExceeded (max 35240 B).
The per-step ClearAllQueryParams inside the multipart path is kept, since each
clear is immediately followed by AddQueryParam.
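Illustrative shape of an upload entry point after the change (header names and variables here are examples, not the actual code); the point is only that a fresh AddHeader with the same name replaces the previous value, so no ClearAllHeaders/ClearAllQueryParams is needed between requests:

    // Per request: just (re)set what this PUT needs. AddHeader overwrites any
    // existing header with the same name, and the SigV4 signer's internal
    // state between requests is left untouched.
    rest.AddHeader("Content-Type", "application/octet-stream");
    rest.AddHeader("x-amz-acl", "private");
    // (no ClearAllHeaders()/ClearAllQueryParams() block here any more)

    // Multipart steps still clear and immediately rebuild their query params:
    rest.ClearAllQueryParams();
    rest.AddQueryParam("partNumber", std::to_string(partNumber).c_str());
    rest.AddQueryParam("uploadId", uploadId.c_str());
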
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleaned up verbose engine telemetry emitted to stdout/stderr and the
Windows Event Viewer. Removed logEngineEvent/logEvent calls (and their
diagnostic-only locals) across the TensorRT engine load, build, run,
multi-GPU, and pool-manager paths, plus the now-unused logEvent helper
in EnginePoolManager.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NvJpegPool: singleton pool of 4 NvJpegCompressor instances with
lock-free slot acquisition (~160MB VRAM). Threads that can't grab
a slot fall back to TurboJPEG with zero wait (see the sketch after this list).
- JPEG passthrough: BmpToJpeg now checks if input is already JPEG
(FF D8 FF magic) and copies directly without re-encoding.
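A rough sketch of the slot acquisition and the passthrough check; class and member names are assumptions, and the real pool additionally owns the per-slot nvJPEG state and GPU buffers that account for the ~160MB of VRAM:

    #include <array>
    #include <atomic>
    #include <cstddef>

    class NvJpegPool {
    public:
        static NvJpegPool& instance() { static NvJpegPool pool; return pool; }

        // Non-blocking: try each of the 4 slots once; -1 means "all busy",
        // and the caller immediately falls back to TurboJPEG instead of waiting.
        int tryAcquire() {
            for (int i = 0; i < kSlots; ++i) {
                bool expected = false;
                if (_busy[i].compare_exchange_strong(expected, true,
                                                     std::memory_order_acquire))
                    return i;
            }
            return -1;
        }
        void release(int slot) { _busy[slot].store(false, std::memory_order_release); }
        // NvJpegCompressor& compressor(int slot);  // per-slot encoder + reusable GPU buffer

    private:
        NvJpegPool() { for (auto& b : _busy) b.store(false, std::memory_order_relaxed); }
        static constexpr int kSlots = 4;
        std::array<std::atomic<bool>, kSlots> _busy;   // lock-free slot flags
    };

    // Passthrough check used by BmpToJpeg: SOI marker FF D8 FF means the input
    // is already a JPEG and can be copied out without re-encoding.
    inline bool looksLikeJpeg(const unsigned char* p, std::size_t n) {
        return n >= 3 && p[0] == 0xFF && p[1] == 0xD8 && p[2] == 0xFF;
    }
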
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BmpToJpeg was slow (~25-45ms for 4K) due to two bottlenecks:
1. cv::imdecode for BMP parsing (unnecessary for uncompressed BMP)
2. TurboJPEG CPU encoding (~11ms for 4K)
Fix 1: Zero-copy BMP parsing — parse header directly and wrap pixel
data in cv::Mat without allocation or copy. Eliminates ~47MB of heap
allocations per 4K frame.
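A sketch of the zero-copy parse for the common case (24-bit, uncompressed BMP); the header offsets follow the standard BITMAPFILEHEADER/BITMAPINFOHEADER layout, while the function name and error handling are illustrative:

    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <opencv2/core.hpp>

    // Wrap the pixel data of an uncompressed 24-bit BMP in a cv::Mat without
    // allocating or copying. The Mat borrows `bmp`, so the buffer must outlive it
    // (true on our frame path, where the JPEG encode happens inline).
    bool wrapBmp(uint8_t* bmp, size_t size, cv::Mat& out)
    {
        if (size < 54 || bmp[0] != 'B' || bmp[1] != 'M') return false;

        uint32_t pixelOffset; std::memcpy(&pixelOffset, bmp + 10, 4);
        int32_t  width;       std::memcpy(&width,       bmp + 18, 4);
        int32_t  height;      std::memcpy(&height,      bmp + 22, 4);
        uint16_t bitCount;    std::memcpy(&bitCount,    bmp + 28, 2);
        uint32_t compression; std::memcpy(&compression, bmp + 30, 4);
        if (bitCount != 24 || compression != 0) return false;   // only the uncompressed path

        const int    rows   = std::abs(height);
        const size_t stride = ((static_cast<size_t>(width) * 3 + 3) / 4) * 4;  // rows 4-byte aligned
        if (width <= 0 || size < pixelOffset + stride * rows) return false;

        // Borrow the pixels in place. A positive `height` means bottom-up row
        // order, which the caller still has to account for (omitted here).
        out = cv::Mat(rows, width, CV_8UC3, bmp + pixelOffset, stride);
        return true;
    }
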
Fix 2: NvJpegCompressor class using nvJPEG hardware encoder on NVIDIA
GPUs (~1-2ms for 4K). Integrated into CompressJpegToString so all 5
JPEG encoding callsites benefit automatically. Reusable GPU buffer
avoids per-frame cudaMalloc/cudaFree. Silent fallback to TurboJPEG
on Intel/AMD or if nvJPEG fails.
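A condensed sketch of the encode path, assuming the class shape described above; nvJPEG handle/state/params creation and most error handling are trimmed, and the grow-only device buffer is what removes the per-frame cudaMalloc/cudaFree:

    #include <cuda_runtime.h>
    #include <nvjpeg.h>
    #include <opencv2/core.hpp>
    #include <string>

    class NvJpegCompressor {
    public:
        // Returns false on any CUDA/nvJPEG error; the caller then falls back to TurboJPEG.
        bool encodeBgr(const cv::Mat& bgr, std::string& jpegOut, int quality = 90) {
            if (!bgr.isContinuous()) return false;           // sketch assumes a packed BGR Mat
            const size_t bytes = bgr.total() * bgr.elemSize();
            if (bytes > _capacity) {                         // grow-only, reused across frames
                if (_dev) cudaFree(_dev);
                if (cudaMalloc(&_dev, bytes) != cudaSuccess) { _capacity = 0; return false; }
                _capacity = bytes;
            }
            if (cudaMemcpy(_dev, bgr.data, bytes, cudaMemcpyHostToDevice) != cudaSuccess)
                return false;

            nvjpegEncoderParamsSetQuality(_params, quality, nullptr);
            nvjpegImage_t img{};
            img.channel[0] = static_cast<unsigned char*>(_dev);
            img.pitch[0]   = static_cast<unsigned int>(bgr.step);

            if (nvjpegEncodeImage(_handle, _state, _params, &img, NVJPEG_INPUT_BGRI,
                                  bgr.cols, bgr.rows, nullptr) != NVJPEG_STATUS_SUCCESS)
                return false;

            size_t len = 0;
            nvjpegEncodeRetrieveBitstream(_handle, _state, nullptr, &len, nullptr);
            jpegOut.resize(len);
            nvjpegEncodeRetrieveBitstream(_handle, _state,
                                          reinterpret_cast<unsigned char*>(&jpegOut[0]),
                                          &len, nullptr);
            return true;
        }
        // Constructor: nvjpegCreateSimple + nvjpegEncoderStateCreate + nvjpegEncoderParamsCreate,
        // plus 4:2:0 sampling. Destructor releases them and the device buffer.
    private:
        nvjpegHandle_t        _handle{};
        nvjpegEncoderState_t  _state{};
        nvjpegEncoderParams_t _params{};
        void*  _dev      = nullptr;
        size_t _capacity = 0;
    };
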
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix 1 — Chunk oversized bucket groups (the correctness fix)
ONNXOCRRecognizer::RecognizeBatch now slices each bucket group into chunks of ≤ kRecMaxBatch before submitting to TRT. A frame with 30 crops in bucket 320 produces two back-to-back batched calls (24 + 6), both within the profile, both on the fast path.
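A minimal sketch of the chunking loop; kRecMaxBatch is from the change, while the container type and runBatch() stand in for the existing single-batch TRT submission:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    constexpr int kRecMaxBatch = 24;                      // new profile max

    void runBatch(const std::vector<int>& cropIndices);   // existing single-batch TRT call (assumed)

    // Slice one bucket group so every submission stays <= kRecMaxBatch and thus
    // inside the optimization profile. 30 crops in bucket 320 -> calls of 24 + 6.
    void recognizeBucketGroup(const std::vector<int>& group)
    {
        for (std::size_t start = 0; start < group.size(); start += kRecMaxBatch) {
            const std::size_t count = std::min<std::size_t>(kRecMaxBatch, group.size() - start);
            std::vector<int> chunk(group.begin() + start, group.begin() + start + count);
            runBatch(chunk);
        }
    }
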
Fix 2 — Raise the profile max from 16 to 24 (the performance fix)
The old profile max was 16; your real scenes routinely hit 24. Raising the profile max to 24 means the common 12-plate scene (24 crops) fits in a single batched call with no chunking needed. Scenes with > 24 crops now use chunking, but that's rare.