Remove the Fix 1 ClearAllHeaders/ClearAllQueryParams block from all 8
upload entry points. CkRest.AddHeader already overwrites by header name,
so the clears were unnecessary. They also wiped the internal state that
Chilkat's SigV4 signer relies on between requests, which caused AWS to
reject every PUT with HTTP 400 MaxMessageLengthExceeded (max 35240 B).
The per-step ClearAllQueryParams calls inside multipart are kept, since
each is followed immediately by AddQueryParam.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clean up verbose engine telemetry emitted to stdout/stderr and the
Windows Event Viewer. Remove logEngineEvent/logEvent calls (and their
diagnostic-only locals) across the TensorRT engine load, build, run,
multi-GPU, and pool-manager paths, plus the now-unused logEvent helper
in EnginePoolManager.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NvJpegPool: singleton pool of 4 NvJpegCompressor instances with
lock-free slot acquisition (~160MB VRAM). Threads that can't grab
a slot fall back to TurboJPEG with zero wait.
- JPEG passthrough: BmpToJpeg now checks if input is already JPEG
(FF D8 FF magic) and copies directly without re-encoding.
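
The two items above can be sketched with the standard library alone. This is a minimal illustration, not the actual NvJpegPool or BmpToJpeg code: SlotPool, TryAcquire, Release, and LooksLikeJpeg are hypothetical names, and the one-atomic-flag-per-slot layout is an assumption about how lock-free slot acquisition might be done.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size pool: one atomic "busy" flag per compressor slot.
template <std::size_t N>
class SlotPool {
public:
    SlotPool() {
        for (auto& flag : busy_) flag.store(false, std::memory_order_relaxed);
    }

    // Scans the slots once and never blocks.
    // Returns the claimed slot index, or -1 if every slot is busy.
    int TryAcquire() {
        for (std::size_t i = 0; i < N; ++i) {
            bool expected = false;
            if (busy_[i].compare_exchange_strong(expected, true,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
                return static_cast<int>(i);
            }
        }
        return -1;  // caller falls back to the CPU encoder immediately
    }

    // Marks a previously acquired slot as free again.
    void Release(int slot) {
        assert(slot >= 0 && static_cast<std::size_t>(slot) < N);
        busy_[static_cast<std::size_t>(slot)].store(false, std::memory_order_release);
    }

private:
    std::array<std::atomic<bool>, N> busy_;
};

// True if the buffer starts with the JPEG magic bytes FF D8 FF.
inline bool LooksLikeJpeg(const std::uint8_t* data, std::size_t size) {
    return data != nullptr && size >= 3 &&
           data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF;
}
```

A caller would try TryAcquire() first, take the TurboJPEG path when it returns -1, and call Release() once the GPU encode finishes. The magic-byte check runs before any decoding, so an input that is already JPEG can be copied through unchanged.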
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BmpToJpeg was slow (~25-45ms for 4K) due to two bottlenecks:
1. cv::imdecode for BMP parsing (unnecessary for uncompressed BMP)
2. TurboJPEG CPU encoding (~11ms for 4K)
Fix 1: Zero-copy BMP parsing — parse header directly and wrap pixel
data in cv::Mat without allocation or copy. Eliminates ~47MB of heap
allocations per 4K frame.
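
A rough sketch of what header-only BMP parsing could look like, assuming an uncompressed 24- or 32-bit file with a BITMAPINFOHEADER. BmpView and ParseBmpView are hypothetical names, not the functions in this change, and the sketch stops at producing a non-owning view instead of constructing the cv::Mat itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Non-owning view into the pixel area of a BMP buffer (hypothetical type).
struct BmpView {
    const std::uint8_t* pixels = nullptr;
    int width = 0;
    int height = 0;
    int channels = 0;        // 3 for 24-bit, 4 for 32-bit
    std::size_t stride = 0;  // bytes per row, padded to a multiple of 4
    bool bottom_up = false;  // positive BMP height means rows are stored bottom-up
};

static std::uint16_t ReadU16LE(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t ReadU32LE(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Validates the headers and fills *out without allocating or copying pixels.
inline bool ParseBmpView(const std::uint8_t* data, std::size_t size, BmpView* out) {
    assert(out != nullptr);
    constexpr std::size_t kMinHeader = 14 + 40;  // file header + BITMAPINFOHEADER
    if (data == nullptr || size < kMinHeader) return false;
    if (data[0] != 'B' || data[1] != 'M') return false;

    const std::uint32_t pixel_offset = ReadU32LE(data + 10);
    const std::uint32_t dib_size = ReadU32LE(data + 14);
    const auto width = static_cast<std::int32_t>(ReadU32LE(data + 18));
    const auto height = static_cast<std::int32_t>(ReadU32LE(data + 22));
    const std::uint16_t bpp = ReadU16LE(data + 28);
    const std::uint32_t compression = ReadU32LE(data + 30);

    if (dib_size < 40 || width <= 0 || height == 0 || height == INT32_MIN) return false;
    if (compression != 0 || (bpp != 24 && bpp != 32)) return false;  // uncompressed only
    if (pixel_offset < kMinHeader || pixel_offset > size) return false;

    const std::uint64_t rows = static_cast<std::uint64_t>(height < 0 ? -height : height);
    const std::uint64_t stride = ((static_cast<std::uint64_t>(width) * bpp + 31) / 32) * 4;
    if (rows > (size - pixel_offset) / stride) return false;  // truncated pixel data

    out->pixels = data + pixel_offset;
    out->width = width;
    out->height = static_cast<int>(rows);
    out->channels = bpp / 8;
    out->stride = static_cast<std::size_t>(stride);
    out->bottom_up = height > 0;
    return true;
}
```

A view like this can be handed to the cv::Mat constructor that takes an external data pointer and a row step, which wraps the buffer without copying. How the real code deals with bottom-up row order is not stated here, so the sketch only records it in bottom_up.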
Fix 2: NvJpegCompressor class using nvJPEG hardware encoder on NVIDIA
GPUs (~1-2ms for 4K). Integrated into CompressJpegToString so all 5
JPEG encoding callsites benefit automatically. Reusable GPU buffer
avoids per-frame cudaMalloc/cudaFree. Silent fallback to TurboJPEG
on Intel/AMD or if nvJPEG fails.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix 1 — Chunk oversized bucket groups (the correctness fix)
ONNXOCRRecognizer::RecognizeBatch now slices each bucket group into chunks of ≤ kRecMaxBatch before submitting to TRT. A frame with 30 crops in bucket 320 produces two back-to-back batched calls (24 + 6), both within the profile, both on the fast path.
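
The slicing step can be illustrated with a small helper. This is a sketch under assumptions: ChunkRanges is a hypothetical name, and kRecMaxBatch is set to 24 here only because that matches the 24 + 6 split described above, not because the constant's definition is shown in this change.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed value, chosen to match the 24 + 6 example in the text.
constexpr std::size_t kRecMaxBatch = 24;

// Splits `count` items into consecutive [begin, end) index ranges,
// each holding at most `max_batch` items.
inline std::vector<std::pair<std::size_t, std::size_t>>
ChunkRanges(std::size_t count, std::size_t max_batch = kRecMaxBatch) {
    assert(max_batch > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < count; begin += max_batch) {
        ranges.emplace_back(begin, std::min(begin + max_batch, count));
    }
    return ranges;
}
```

For a bucket group of 30 crops this yields the ranges [0, 24) and [24, 30), so each range can be submitted as one batched call that stays within the profile.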
Fix 2 — Raise the profile max from 16 to 24 (the performance fix)
The old profile max was 16, but real scenes routinely hit 24 crops. Raising the profile max to 24 means the common 12-plate scene (24 crops) fits in a single batched call with no chunking needed. Scenes with more than 24 crops still go through chunking, but those are rare.