- NvJpegPool: singleton pool of 4 NvJpegCompressor instances
(~160MB VRAM) with lock-free slot acquisition. Threads that can't
grab a slot fall back to TurboJPEG with zero wait.
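A minimal sketch of the lock-free slot acquisition described above. `SlotPool`, `TryAcquire`, and `Release` are hypothetical names, not the actual NvJpegPool interface; only the slot bookkeeping is shown, with no GPU state.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for the pool: tracks which of N slots are in use.
template <std::size_t N>
class SlotPool {
public:
    // Try each slot once. Returns the slot index on success, or nullopt if
    // every slot is busy. Never blocks or spins, so a caller that gets
    // nullopt can fall back to the CPU encoder immediately.
    std::optional<std::size_t> TryAcquire() {
        for (std::size_t i = 0; i < N; ++i) {
            bool expected = false;
            if (busy_[i].compare_exchange_strong(expected, true,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
                return i;
            }
        }
        return std::nullopt;
    }

    // Mark a previously acquired slot as free again.
    void Release(std::size_t slot) {
        busy_[slot].store(false, std::memory_order_release);
    }

private:
    std::array<std::atomic<bool>, N> busy_{};  // all slots start free
};
```

With `SlotPool<4>`, a fifth concurrent caller gets `std::nullopt` right away, which is where the TurboJPEG fallback would be taken.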
- JPEG passthrough: BmpToJpeg now checks if input is already JPEG
(FF D8 FF magic) and copies directly without re-encoding.
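The passthrough check can be sketched as below. `LooksLikeJpeg` is a hypothetical helper name; the real check inside BmpToJpeg may differ.

```cpp
#include <cstddef>
#include <cstdint>

// True if the buffer starts with the JPEG signature FF D8 FF.
// Hypothetical helper; only inspects the first three bytes.
inline bool LooksLikeJpeg(const std::uint8_t* data, std::size_t size) {
    return data != nullptr && size >= 3 &&
           data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF;
}
```

When the check succeeds, the input bytes can be copied to the output as-is, skipping decode and re-encode.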
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BmpToJpeg was slow (~25-45ms for 4K) due to two bottlenecks:
1. cv::imdecode for BMP parsing (unnecessary for uncompressed BMP)
2. TurboJPEG CPU encoding (~11ms for 4K)
Fix 1: Zero-copy BMP parsing — parse header directly and wrap pixel
data in cv::Mat without allocation or copy. Eliminates ~47MB of heap
allocations per 4K frame.
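A rough sketch of the header-only parsing idea, assuming an uncompressed 24- or 32-bit BMP with a standard 54-byte header. `BmpView` and `ParseBmp` are hypothetical names. To stay self-contained the sketch returns a plain view struct rather than a cv::Mat; in OpenCV the same pointer and stride could be passed to the `cv::Mat(rows, cols, type, data, step)` constructor, which wraps existing memory without copying. Handling of bottom-up row order is not covered here.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>

// Non-owning view of BMP pixel data (hypothetical type).
struct BmpView {
    const std::uint8_t* pixels;  // points into the caller's buffer, no copy
    int width;
    int height;
    std::size_t stride;          // bytes per row, padded to a multiple of 4
    int channels;                // 3 for 24-bit, 4 for 32-bit
    bool bottomUp;               // true when rows are stored last-to-first
};

// Read a little-endian 32-bit value.
inline std::uint32_t ReadLe32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Parse the file and info headers and return a view, or nullopt if the
// buffer is not an uncompressed 24/32-bit BMP or is too short.
inline std::optional<BmpView> ParseBmp(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size < 54 || data[0] != 'B' || data[1] != 'M') {
        return std::nullopt;
    }
    const std::uint32_t offset = ReadLe32(data + 10);  // start of pixel data
    const auto width = static_cast<std::int32_t>(ReadLe32(data + 18));
    std::int64_t height = static_cast<std::int32_t>(ReadLe32(data + 22));
    const unsigned bpp = data[28] | (data[29] << 8);
    const std::uint32_t compression = ReadLe32(data + 30);

    const bool bottomUp = height > 0;  // negative height means top-down rows
    if (height < 0) height = -height;
    if (width <= 0 || height == 0 || height > std::numeric_limits<int>::max()) {
        return std::nullopt;
    }
    if (compression != 0 || (bpp != 24 && bpp != 32)) return std::nullopt;

    // Rows are padded to 4-byte boundaries.
    const std::uint64_t stride = ((static_cast<std::uint64_t>(width) * bpp + 31) / 32) * 4;
    if (offset > size || static_cast<std::uint64_t>(height) > (size - offset) / stride) {
        return std::nullopt;  // pixel data would run past the end of the buffer
    }
    return BmpView{data + offset, width, static_cast<int>(height),
                   static_cast<std::size_t>(stride), static_cast<int>(bpp / 8), bottomUp};
}
```

The view is only valid while the input buffer is alive, which is the usual trade-off of wrapping data instead of copying it.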
Fix 2: NvJpegCompressor class using nvJPEG hardware encoder on NVIDIA
GPUs (~1-2ms for 4K). Integrated into CompressJpegToString so all 5
JPEG encoding callsites benefit automatically. Reusable GPU buffer
avoids per-frame cudaMalloc/cudaFree. Silent fallback to TurboJPEG
on Intel/AMD or if nvJPEG fails.
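The silent-fallback behaviour might look roughly like this. `Encoder` and `EncodeWithFallback` are hypothetical; the encoders are passed in as callables so the sketch does not depend on nvJPEG or TurboJPEG, and the real CompressJpegToString signature is not shown in this message.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical encoder signature: writes JPEG bytes to `out`, returns success.
using Encoder = std::function<bool(std::string& out)>;

// Try the GPU encoder first. If it is unavailable (empty) or fails, fall
// back to the CPU encoder without surfacing the GPU error to the caller.
inline bool EncodeWithFallback(const Encoder& gpu, const Encoder& cpu, std::string& out) {
    if (gpu) {
        std::string tmp;  // keep `out` untouched if the GPU path fails midway
        if (gpu(tmp)) {
            out = std::move(tmp);
            return true;
        }
    }
    return cpu && cpu(out);
}
```

Putting the choice inside one shared function is what lets every callsite pick up the faster path without changes.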
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix 1 — Chunk oversized bucket groups (the correctness fix)
ONNXOCRRecognizer::RecognizeBatch now slices each bucket group into chunks of ≤ kRecMaxBatch before submitting to TRT. A frame with 30 crops in bucket 320 produces two back-to-back batched calls (24 + 6), both within the profile, both on the fast path.
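The chunking step can be sketched as follows. `ChunkRanges` is a hypothetical helper, and the value 24 for `kRecMaxBatch` is taken from the example in this message rather than from the code.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed value, matching the 24 + 6 example in the message.
constexpr std::size_t kRecMaxBatch = 24;

// Split `total` items into consecutive (offset, count) ranges of at most
// `maxBatch` items each. Hypothetical helper, not the actual implementation.
inline std::vector<std::pair<std::size_t, std::size_t>>
ChunkRanges(std::size_t total, std::size_t maxBatch) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (maxBatch == 0) return chunks;  // avoid an infinite loop on bad input
    for (std::size_t off = 0; off < total; off += maxBatch) {
        chunks.emplace_back(off, std::min(maxBatch, total - off));
    }
    return chunks;
}
```

For 30 crops, `ChunkRanges(30, kRecMaxBatch)` yields the ranges (0, 24) and (24, 6), each of which would be submitted as its own batched call.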
Fix 2 — Raise the profile max from 16 to 24 (the performance fix)
The old profile max was 16, but real scenes routinely reach 24 crops. With the profile max raised to 24, the common 12-plate scene (24 crops) fits in a single batched call with no chunking needed. Scenes with more than 24 crops still go through chunking, but they are rare.