From 90 to 11 Seconds: How Chunkr Achieved an 8x Performance Boost in Document Processing
Introduction
Hi, I'm Danial Hasan, founder of LegalFlow. We use Chunkr heavily to power our AI systems, and I was over the moon when I saw the processing speeds of the new API. Processing times dropped from 90 seconds to 11 seconds on the legal documents we benchmarked (25–30 page lawsuits and contracts). That speed boost supercharges our extraction and classification pipelines, ensuring our data flows fast and reliably. But how did they do it?
The Challenge
Chunkr's document pipeline involves multiple resource‑intensive steps:
- Converting documents to PDFs
- Rendering pages as images (can use up to 1 GB of RAM per page)
- Performing OCR and layout detection
- Chunking content for LLM consumption
- Post‑processing with AI models
When executed sequentially, these steps compound, stretching total processing time to roughly 90 seconds per document.
So, how do you speed up the pipeline without losing accuracy?
The answer: a complete architectural rewrite that decouples services and modularizes the codebase, combined with OCR model quantization and a parallelized inference engine.
Architectural Evolution
From Monolithic to Modular
The old codebase was monolithic—server, client, and worker logic were tangled together. Debugging and optimizing were a nightmare. The new structure separates concerns clearly:
Old Structure:
/src
  /server
    - main.rs       # Mixed concerns
    - handlers.rs   # Everything in one place
  /client
    - ui.rs         # Tightly coupled UI

New Structure:
/core
  /src
    /models                       # Domain objects
      /chunkr
        - task.rs                 # Clear task lifecycle
        - open_ai.rs              # LLM integration
    /pipeline                     # Processing steps
      - segmentation_and_ocr.rs
      - segment_processing.rs
    /workers
      - task.rs                   # Worker orchestration
/apps
  /web                            # Separated frontend
This modularization enables independent debugging, isolated optimization, and scalable components. It also enables observability, which can make or break your debugging efforts when dealing with finicky AI systems.
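To make the observability point concrete, here is a minimal sketch of per-stage instrumentation with the tracing crate. It illustrates the pattern rather than Chunkr's actual telemetry; the process_pages helper and the doc.id field are my assumptions.

// Hypothetical per-stage instrumentation with the tracing crate (illustrative only).
// Each pipeline step logs inside its own span, so slow stages stand out in whatever
// subscriber/exporter you attach.
use tracing::{info, instrument};

#[instrument(skip(doc), fields(doc_id = %doc.id))] // doc.id is an assumed field
async fn run_pipeline(doc: Document) -> Result<Output> {
    let pdf = convert_to_pdf(doc).await?;
    info!("pdf conversion complete");

    let pages = process_pages(&pdf).await?; // assumed helper wrapping the page stage
    info!(page_count = pages.len(), "page processing complete");

    let chunks = create_chunks_in_parallel(pages);
    info!(chunk_count = chunks.len(), "chunking complete");

    process_chunks_in_parallel(chunks).await
}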
Key Improvements Deep Dive
1. Modular, Parallelized Pipeline
The biggest change was restructuring the pipeline from sequential to parallel using Rayon, a Rust data parallelism library.
Old Approach (Sequential):
// Old monolithic processing: every stage waits for the previous one to finish
async fn process_document(doc: Document) -> Result<Output> {
    let pdf = convert_to_pdf(doc).await?;
    let images = render_to_images(pdf).await?;
    let text = perform_ocr(images).await?;
    let chunks = create_chunks(text).await?;
    process_chunks(chunks).await
}
New Approach (Parallel with Rayon):
use rayon::prelude::*;

// New parallel processing pipeline
async fn process_document(doc: Document) -> Result<Output> {
    let pdf = convert_to_pdf(doc).await?;

    // Render, OCR, and layout-detect every page in parallel
    let pages: Vec<ProcessedPage> = pdf.pages.par_iter()
        .map(|page| {
            let image = render_to_image(page)?;
            let ocr_result = perform_ocr(image)?;
            let segments = detect_layout(ocr_result)?;
            Ok(ProcessedPage::new(segments))
        })
        .collect::<Result<Vec<_>>>()?;

    // Concurrent chunk processing
    let chunks = pages.into_par_iter()
        .flat_map(|page| page.create_chunks())
        .collect::<Vec<_>>();

    process_chunks_in_parallel(chunks).await
}

// Create and process chunks in parallel using Rayon
pub fn create_chunks_in_parallel(pages: Vec<ProcessedPage>) -> Vec<Chunk> {
    pages.into_par_iter()
        .flat_map(|page| {
            page.segments
                .into_par_iter()
                .flat_map(|segment| segment.create_chunks())
        })
        .collect()
}
Performance Impact of Rayon
Parallelizing the page-level work was the biggest single contributor to the overall speedup; the full stage-by-stage before/after breakdown is in the Performance Metrics table near the end of this post.
2. OCR Model Quantization
Quantizing the OCR models shaved down inference time and model size while keeping accuracy nearly intact.
// OCR model configuration with quantization
use std::path::PathBuf;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct OCRConfig {
    model_path: PathBuf,
    quantization: QuantizationConfig,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct QuantizationConfig {
    enabled: bool,
    bits: u8,
    scheme: QuantizationScheme,
}

impl OCRModel {
    pub fn new(config: OCRConfig) -> Self {
        let model = if config.quantization.enabled {
            // Load the quantized weights for faster, smaller inference
            Model::load_quantized(
                config.model_path,
                config.quantization.bits,
                config.quantization.scheme,
            )
        } else {
            // Fall back to the full-precision weights
            Model::load(config.model_path)
        };
        Self { model }
    }
}
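For reference, here is roughly how that configuration might be constructed. The file path, the 8-bit setting, and the QuantizationScheme::Symmetric variant are illustrative assumptions, not Chunkr's actual defaults.

// Hypothetical usage of the config above (illustrative values only)
let config = OCRConfig {
    model_path: PathBuf::from("models/ocr-quantized.bin"), // assumed path
    quantization: QuantizationConfig {
        enabled: true,
        bits: 8,
        scheme: QuantizationScheme::Symmetric, // assumed variant name
    },
};
let ocr = OCRModel::new(config);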
- Model size: 2.3 GB → 600 MB
- Inference time per page: 3.2s → 0.8s
- Accuracy: 99.1% → 98.9%
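To make those numbers concrete, here is a minimal, generic sketch of symmetric 8-bit weight quantization. This is not Chunkr's code (production quantization typically works per-tensor or per-channel with calibrated scales), but it shows where the roughly 4x size reduction and the small accuracy dip come from.

// Minimal sketch of symmetric int8 quantization (illustrative, not Chunkr's code).
// Each f32 weight becomes an i8 plus a shared scale factor, cutting storage
// roughly 4x and enabling faster integer matrix multiplies.
fn quantize_symmetric(weights: &[f32]) -> (Vec<i8>, f32) {
    // Scale so the largest-magnitude weight maps to 127
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

fn dequantize(quantized: &[i8], scale: f32) -> Vec<f32> {
    // Approximate reconstruction; the small rounding error is why accuracy
    // dips slightly (e.g. 99.1% -> 98.9%) instead of staying identical
    quantized.iter().map(|&v| v as f32 * scale).collect()
}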
3. Rewritten Inference Engine
The inference engine was overhauled to handle concurrency like a champ.
Before: Basic Sequential Processing
// Old implementation: one model instance, pages processed strictly one at a time
pub struct OldInferenceEngine {
    model: OCRModel,
}

impl OldInferenceEngine {
    pub async fn process_page(&self, page: Page) -> Result<OCRResult> {
        self.model.process_page(page).await
    }

    pub async fn process_batch(&self, pages: Vec<Page>) -> Result<Vec<OCRResult>> {
        let mut results = Vec::new();
        for page in pages {
            results.push(self.process_page(page).await?);
        }
        Ok(results)
    }
}
Issues encountered:
- Single model instance bottleneck
- No connection pooling or rate limiting
- Sequential batch processing that left available cores idle
After: Sophisticated Concurrency Management
use std::cmp::min;
use std::sync::Arc;
use futures::stream::{self, StreamExt, TryStreamExt};

pub struct InferenceEngine {
    pool: Pool<OCRModel>,           // Pooled model instances
    rate_limiter: Arc<RateLimiter>, // Rate limiting
    timeout_config: TimeoutConfig,  // Timeout management
    retry_policy: RetryPolicy,      // Retry handling
    max_batch_size: usize,          // Upper bound for adaptive batching
}

impl InferenceEngine {
    pub async fn process_page(&self, page: Page) -> Result<OCRResult> {
        // Acquire a rate-limit permit, giving up after the configured timeout
        let _permit = self.rate_limiter
            .acquire_permit_with_timeout(self.timeout_config.acquire_timeout)
            .await?;

        // Get a model from the pool, retrying with backoff if none are free
        let model = self.pool.get_with_retry(&self.retry_policy).await?;

        // Process with a per-page timeout and automatic retry
        // (the timeout's Elapsed error converts into the domain Error)
        let result = retry_with_backoff(|| async {
            tokio::time::timeout(
                self.timeout_config.process_timeout,
                model.process_page(page.clone()),
            )
            .await?
        }, &self.retry_policy)
        .await?;

        Ok(result)
    }

    pub async fn process_batch(&self, pages: Vec<Page>) -> Result<Vec<OCRResult>> {
        // Pick a batch size based on current system load and pool availability
        let batch_size = self.calculate_optimal_batch_size().await;

        // Process each batch concurrently, one pooled model per page
        let batches: Vec<Vec<OCRResult>> = stream::iter(pages)
            .chunks(batch_size)
            .then(|batch| async move {
                let models = self.pool.get_multiple(batch.len()).await?;
                let jobs = batch
                    .into_iter()
                    .zip(models)
                    .map(|(page, model)| self.process_with_model(page, model));
                futures::future::try_join_all(jobs).await
            })
            .try_collect()
            .await?;

        Ok(batches.into_iter().flatten().collect())
    }

    async fn calculate_optimal_batch_size(&self) -> usize {
        let system_load = self.get_system_metrics().await;
        let pool_stats = self.pool.stats().await;
        min(
            min(system_load.available_cores, pool_stats.available_connections),
            self.max_batch_size,
        )
    }
}
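The retry_with_backoff helper used above isn't shown in the post, so here is a minimal sketch of what such a helper could look like, assuming a Tokio runtime and a RetryPolicy with max_retries and base_delay fields (both field names are my assumptions).

// Hypothetical helper: retry an async operation with exponential backoff.
// RetryPolicy's max_retries / base_delay fields are assumed for illustration.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F, policy: &RetryPolicy) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut attempt = 0u32;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller
            Err(err) if attempt >= policy.max_retries => return Err(err),
            Err(_) => {
                // Double the delay on every failed attempt: base, 2x, 4x, ...
                let delay = policy.base_delay * 2u32.pow(attempt);
                tokio::time::sleep(delay).await;
                attempt += 1;
            }
        }
    }
}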
Key Improvements:
- Smart Resource Pooling: Dynamic pool of model instances prevents resource exhaustion.
- Adaptive Batch Processing: Batch size adjusts based on system load and pool availability.
- Intelligent Rate Limiting: Distributed rate limiting (using Redis) prevents overload (see the sketch after this list).
- Sophisticated Error Handling: Automatic retries with exponential backoff and timeouts.
- Performance Monitoring: Metrics for processing times, error rates, and connection pool utilization.
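The rate limiter itself isn't shown in the post, so here is a minimal sketch of one way to do distributed rate limiting on Redis: a fixed-window counter via the redis crate. The key naming, window size, and limit are assumptions for illustration, not Chunkr's implementation.

// Hypothetical fixed-window rate limiter backed by Redis (illustrative only).
// Every worker shares the same counter key, so the limit applies cluster-wide.
use redis::AsyncCommands;

pub struct RedisRateLimiter {
    client: redis::Client,
    max_per_second: i64,
}

impl RedisRateLimiter {
    pub async fn try_acquire(&self) -> redis::RedisResult<bool> {
        let mut conn = self.client.get_multiplexed_async_connection().await?;

        // One counter per one-second window, e.g. "ocr_rate:1712345678"
        let window = std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs();
        let key = format!("ocr_rate:{window}");

        // Bump the shared counter and make sure the key eventually expires
        let count: i64 = conn.incr(&key, 1).await?;
        if count == 1 {
            let _: bool = conn.expire(&key, 2).await?;
        }
        Ok(count <= self.max_per_second)
    }
}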
The result?
- 75% reduction in memory usage
- 4x increase in concurrent processing capacity
- 99.9% reduction in timeout errors
- Zero resource exhaustion incidents
- Consistent performance under varying load
After migrating to the new API, LegalFlow's data pipelines and AI systems have been fed high‑quality chunks, fast as hell, every time.
Impact on Development and Debugging
The modular architecture simplified isolated testing and error handling:
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_ocr_pipeline() {
        let config = OCRConfig::for_testing();
        let engine = InferenceEngine::new(config);

        // A valid page should process successfully
        let result = engine.process_page(test_page()).await;
        assert!(result.is_ok());

        // A corrupt page should surface a typed, matchable error
        let result = engine.process_page(corrupt_page()).await;
        assert!(matches!(result, Err(Error::InvalidImage)));
    }
}
Better error propagation also makes debugging far easier: failures come back as typed errors you can match on rather than opaque strings.
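As a rough illustration of what that looks like, here is a minimal sketch of a typed error enum using the thiserror crate; only Error::InvalidImage comes from the test above, and the other variants are my assumptions.

// Hypothetical error type; only InvalidImage is taken from the test above,
// the other variants and the thiserror derive are illustrative assumptions.
use thiserror::Error;

#[derive(Debug, Error)]
pub enum Error {
    #[error("page image is corrupt or unreadable")]
    InvalidImage,

    #[error("OCR inference timed out")]
    Timeout,

    #[error("model pool exhausted")]
    PoolExhausted,

    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
}

// Callers can match on specific failure modes instead of parsing strings
pub type Result<T> = std::result::Result<T, Error>;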
Performance Metrics
Document Processing Times (15–20 pages):
┌────────────────┬────────┬───────┬─────────────┐
│ Stage          │ Before │ After │ Improvement │
├────────────────┼────────┼───────┼─────────────┤
│ PDF Conversion │ 15s    │ 2s    │ 7.5x        │
│ OCR            │ 45s    │ 5s    │ 9x          │
│ Chunking       │ 20s    │ 3s    │ 6.7x        │
│ Post‑process   │ 10s    │ 1s    │ 10x         │
├────────────────┼────────┼───────┼─────────────┤
│ Total          │ 90s    │ 11s   │ 8.2x        │
└────────────────┴────────┴───────┴─────────────┘
Conclusion
The Chunkr upgrade shows that thoughtful architectural changes and targeted optimizations can transform system performance. By focusing on modular design, parallel processing, model quantization, and efficient resource management, the team achieved an 8x speedup that boosts both raw performance and developer productivity.
Looking Forward
I can't say exactly what the team will roll out next, but likely areas for further improvement include:
- Deeper model quantization opportunities
- Enhanced caching strategies
- Additional parallelization
- Smarter resource allocation
As new techniques emerge, the team will adopt them to ensure Chunkr keeps delivering cutting‑edge performance.
Thanks for reading! I'm Danial Hasan, founder of LegalFlow, collaborating with Centrai to share knowledge learned at the frontier of applied AI. We're building the best AI systems for the real estate industry. More insights shared here!