- NvJpegPool: singleton pool of 4 NvJpegCompressor instances with
lock-free slot acquisition (~160MB VRAM). Threads that can't grab
a slot fall back to TurboJPEG with zero wait.
- JPEG passthrough: BmpToJpeg now checks if input is already JPEG
(FF D8 FF magic) and copies directly without re-encoding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BmpToJpeg was slow (~25-45ms for 4K) due to two bottlenecks:
1. cv::imdecode for BMP parsing (unnecessary for uncompressed BMP)
2. TurboJPEG CPU encoding (~11ms for 4K)
Fix 1: Zero-copy BMP parsing — parse header directly and wrap pixel
data in cv::Mat without allocation or copy. Eliminates ~47MB of heap
allocations per 4K frame.
Fix 2: NvJpegCompressor class using nvJPEG hardware encoder on NVIDIA
GPUs (~1-2ms for 4K). Integrated into CompressJpegToString so all 5
JPEG encoding callsites benefit automatically. Reusable GPU buffer
avoids per-frame cudaMalloc/cudaFree. Silent fallback to TurboJPEG
on Intel/AMD or if nvJPEG fails.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>