From 90 to 11 Seconds: How Chunkr Achieved an 8x Performance Boost in Document Processing
Introduction
Hi, I'm Danial Hasan, founder of LegalFlow. We use Chunkr heavily to power our AI systems, and I was over the moon when I saw the processing speeds of the new API. Processing times dropped from 90 seconds to 11 seconds on the legal documents we benchmarked (25–30 page lawsuits and contracts). That speed boost supercharges our extraction and classification pipelines, ensuring our data flows fast and reliably. But how did they do it?
The Challenge
Chunkr's document pipeline involves multiple resource‑intensive steps:
- Converting documents to PDFs
- Rendering pages as images (can use up to 1 GB of RAM per page)
- Performing OCR and layout detection
- Chunking content for LLM consumption
- Post‑processing with AI models
When executed sequentially, these steps compound, stretching total processing time to roughly 90 seconds per document.
So, how do you speed up the pipeline without losing accuracy?
The answer: a complete architectural rewrite that decouples services and modularizes the codebase, combined with OCR model quantization and a parallelized inference engine.
Architectural Evolution
From Monolithic to Modular
The old codebase was monolithic—server, client, and worker logic were tangled together. Debugging and optimizing were a nightmare. The new structure separates concerns clearly:
Old Structure:
/src
  /server
    - main.rs       # Mixed concerns
    - handlers.rs   # Everything in one place
  /client
    - ui.rs         # Tightly coupled UI

New Structure:
/core
  /src
    /models                       # Domain objects
      /chunkr
        - task.rs                 # Clear task lifecycle
        - open_ai.rs              # LLM integration
    /pipeline                     # Processing steps
      - segmentation_and_ocr.rs
      - segment_processing.rs
    /workers
      - task.rs                   # Worker orchestration
/apps
  /web                            # Separated frontend
This modularization enables independent debugging, isolated optimization, and scalable components. It also enables observability, which can make or break your debugging efforts when dealing with finicky AI systems.
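To make the observability point concrete, here is a minimal sketch of per-stage instrumentation with the tracing crate. It illustrates the pattern rather than Chunkr's actual telemetry; the process_pages helper and the doc.id field are my assumptions.

// Hypothetical per-stage instrumentation with the tracing crate (illustrative only).
// Each pipeline step logs inside its own span, so slow stages stand out in whatever
// subscriber/exporter you attach.
use tracing::{info, instrument};

#[instrument(skip(doc), fields(doc_id = %doc.id))] // doc.id is an assumed field
async fn run_pipeline(doc: Document) -> Result<Output> {
    let pdf = convert_to_pdf(doc).await?;
    info!("pdf conversion complete");

    let pages = process_pages(&pdf).await?; // assumed helper wrapping the page stage
    info!(page_count = pages.len(), "page processing complete");

    let chunks = create_chunks_in_parallel(pages);
    info!(chunk_count = chunks.len(), "chunking complete");

    process_chunks_in_parallel(chunks).await
}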
Key Improvements Deep Dive
1. Modular, Parallelized Pipeline
The biggest change was restructuring the pipeline from sequential to parallel using Rayon, a Rust data parallelism library.
Old Approach (Sequential):
// Old monolithic processing: every stage waits for the previous one to finish
async fn process_document(doc: Document) -> Result<Output> {
    let pdf = convert_to_pdf(doc).await?;
    let images = render_to_images(pdf).await?;
    let text = perform_ocr(images).await?;
    let chunks = create_chunks(text).await?;
    process_chunks(chunks).await
}
New Approach (Parallel with Rayon):
use rayon::prelude::*;

// New parallel processing pipeline
async fn process_document(doc: Document) -> Result<Output> {
    let pdf = convert_to_pdf(doc).await?;

    // Render, OCR, and layout-detect every page in parallel
    let pages: Vec<ProcessedPage> = pdf.pages.par_iter()
        .map(|page| {
            let image = render_to_image(page)?;
            let ocr_result = perform_ocr(image)?;
            let segments = detect_layout(ocr_result)?;
            Ok(ProcessedPage::new(segments))
        })
        .collect::<Result<Vec<_>>>()?;

    // Concurrent chunk processing
    let chunks = pages.into_par_iter()
        .flat_map(|page| page.create_chunks())
        .collect::<Vec<_>>();

    process_chunks_in_parallel(chunks).await
}

// Create and process chunks in parallel using Rayon
pub fn create_chunks_in_parallel(pages: Vec<ProcessedPage>) -> Vec<Chunk> {
    pages.into_par_iter()
        .flat_map(|page| {
            page.segments
                .into_par_iter()
                .flat_map(|segment| segment.create_chunks())
        })
        .collect()
}
Performance Impact of Rayon
Parallelizing the page-level work was the biggest single contributor to the overall speedup; the full stage-by-stage before/after breakdown is in the Performance Metrics table near the end of this post.
2. OCR Model Quantization
Quantizing the OCR models shaved down inference time and model size while keeping accuracy nearly intact.
// OCR model configuration with quantization
use std::path::PathBuf;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct OCRConfig {
    model_path: PathBuf,
    quantization: QuantizationConfig,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct QuantizationConfig {
    enabled: bool,
    bits: u8,
    scheme: QuantizationScheme,
}

impl OCRModel {
    pub fn new(config: OCRConfig) -> Self {
        let model = if config.quantization.enabled {
            // Load the quantized weights for faster, smaller inference
            Model::load_quantized(
                config.model_path,
                config.quantization.bits,
                config.quantization.scheme,
            )
        } else {
            // Fall back to the full-precision weights
            Model::load(config.model_path)
        };
        Self { model }
    }
}
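For reference, here is roughly how that configuration might be constructed. The file path, the 8-bit setting, and the QuantizationScheme::Symmetric variant are illustrative assumptions, not Chunkr's actual defaults.

// Hypothetical usage of the config above (illustrative values only)
let config = OCRConfig {
    model_path: PathBuf::from("models/ocr-quantized.bin"), // assumed path
    quantization: QuantizationConfig {
        enabled: true,
        bits: 8,
        scheme: QuantizationScheme::Symmetric, // assumed variant name
    },
};
let ocr = OCRModel::new(config);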
- Model size: 2.3 GB → 600 MB
- Inference time per page: 3.2s → 0.8s
- Accuracy: 99.1% → 98.9%
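To make those numbers concrete, here is a minimal, generic sketch of symmetric 8-bit weight quantization. This is not Chunkr's code (production quantization typically works per-tensor or per-channel with calibrated scales), but it shows where the roughly 4x size reduction and the small accuracy dip come from.

// Minimal sketch of symmetric int8 quantization (illustrative, not Chunkr's code).
// Each f32 weight becomes an i8 plus a shared scale factor, cutting storage
// roughly 4x and enabling faster integer matrix multiplies.
fn quantize_symmetric(weights: &[f32]) -> (Vec<i8>, f32) {
    // Scale so the largest-magnitude weight maps to 127
    let max_abs = weights.iter().fold(0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

fn dequantize(quantized: &[i8], scale: f32) -> Vec<f32> {
    // Approximate reconstruction; the small rounding error is why accuracy
    // dips slightly (e.g. 99.1% -> 98.9%) instead of staying identical
    quantized.iter().map(|&v| v as f32 * scale).collect()
}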
3. Rewritten Inference Engine
The inference engine was overhauled to handle concurrency like a champ.
Before: Basic Sequential Processing
// Old implementation: one model instance, pages processed strictly one at a time
pub struct OldInferenceEngine {
    model: OCRModel,
}

impl OldInferenceEngine {
    pub async fn process_page(&self, page: Page) -> Result<OCRResult> {
        self.model.process_page(page).await
    }

    pub async fn process_batch(&self, pages: Vec<Page>) -> Result<Vec<OCRResult>> {
        let mut results = Vec::new();
        for page in pages {
            results.push(self.process_page(page).await?);
        }
        Ok(results)
    }
}
Issues encountered:
- Single model instance bottleneck
- No connection pooling or rate limiting
- Sequential batch processing that left available cores idle
After: Sophisticated Concurrency Management
use std::cmp::min;
use std::sync::Arc;
use futures::stream::{self, StreamExt, TryStreamExt};

pub struct InferenceEngine {
    pool: Pool<OCRModel>,           // Pooled model instances
    rate_limiter: Arc<RateLimiter>, // Rate limiting
    timeout_config: TimeoutConfig,  // Timeout management
    retry_policy: RetryPolicy,      // Retry handling
    max_batch_size: usize,          // Upper bound for adaptive batching
}

impl InferenceEngine {
    pub async fn process_page(&self, page: Page) -> Result<OCRResult> {
        // Acquire a rate-limit permit, giving up after the configured timeout
        let _permit = self.rate_limiter
            .acquire_permit_with_timeout(self.timeout_config.acquire_timeout)
            .await?;

        // Get a model from the pool, retrying with backoff if none are free
        let model = self.pool.get_with_retry(&self.retry_policy).await?;

        // Process with a per-page timeout and automatic retry
        // (the timeout's Elapsed error converts into the domain Error)
        let result = retry_with_backoff(|| async {
            tokio::time::timeout(
                self.timeout_config.process_timeout,
                model.process_page(page.clone()),
            )
            .await?
        }, &self.retry_policy)
        .await?;

        Ok(result)
    }

    pub async fn process_batch(&self, pages: Vec<Page>) -> Result<Vec<OCRResult>> {
        // Pick a batch size based on current system load and pool availability
        let batch_size = self.calculate_optimal_batch_size().await;

        // Process each batch concurrently, one pooled model per page
        let batches: Vec<Vec<OCRResult>> = stream::iter(pages)
            .chunks(batch_size)
            .then(|batch| async move {
                let models = self.pool.get_multiple(batch.len()).await?;
                let jobs = batch
                    .into_iter()
                    .zip(models)
                    .map(|(page, model)| self.process_with_model(page, model));
                futures::future::try_join_all(jobs).await
            })
            .try_collect()
            .await?;

        Ok(batches.into_iter().flatten().collect())
    }

    async fn calculate_optimal_batch_size(&self) -> usize {
        let system_load = self.get_system_metrics().await;
        let pool_stats = self.pool.stats().await;
        min(
            min(system_load.available_cores, pool_stats.available_connections),
            self.max_batch_size,
        )
    }
}
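The retry_with_backoff helper used above isn't shown in the post, so here is a minimal sketch of what such a helper could look like, assuming a Tokio runtime and a RetryPolicy with max_retries and base_delay fields (both field names are my assumptions).

// Hypothetical helper: retry an async operation with exponential backoff.
// RetryPolicy's max_retries / base_delay fields are assumed for illustration.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F, policy: &RetryPolicy) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut attempt = 0u32;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller
            Err(err) if attempt >= policy.max_retries => return Err(err),
            Err(_) => {
                // Double the delay on every failed attempt: base, 2x, 4x, ...
                let delay = policy.base_delay * 2u32.pow(attempt);
                tokio::time::sleep(delay).await;
                attempt += 1;
            }
        }
    }
}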
Key Improvements:
- Smart Resource Pooling: Dynamic pool of model instances prevents resource exhaustion.
- Adaptive Batch Processing: Batch size adjusts based on system load and pool availability.
- Intelligent Rate Limiting: Distributed rate limiting (using Redis) prevents overload (see the sketch after this list).
- Sophisticated Error Handling: Automatic retries with exponential backoff and timeouts.
- Performance Monitoring: Metrics for processing times, error rates, and connection pool utilization.
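The rate limiter itself isn't shown in the post, so here is a minimal sketch of one way to do distributed rate limiting on Redis: a fixed-window counter via the redis crate. The key naming, window size, and limit are assumptions for illustration, not Chunkr's implementation.

// Hypothetical fixed-window rate limiter backed by Redis (illustrative only).
// Every worker shares the same counter key, so the limit applies cluster-wide.
use redis::AsyncCommands;

pub struct RedisRateLimiter {
    client: redis::Client,
    max_per_second: i64,
}

impl RedisRateLimiter {
    pub async fn try_acquire(&self) -> redis::RedisResult<bool> {
        let mut conn = self.client.get_multiplexed_async_connection().await?;

        // One counter per one-second window, e.g. "ocr_rate:1712345678"
        let window = std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs();
        let key = format!("ocr_rate:{window}");

        // Bump the shared counter and make sure the key eventually expires
        let count: i64 = conn.incr(&key, 1).await?;
        if count == 1 {
            let _: bool = conn.expire(&key, 2).await?;
        }
        Ok(count <= self.max_per_second)
    }
}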
The result?
- 75% reduction in memory usage
- 4x increase in concurrent processing capacity
- 99.9% reduction in timeout errors
- Zero resource exhaustion incidents
- Consistent performance under varying load
After migrating to the new API, LegalFlow's data pipelines and AI systems have been fed high‑quality chunks, fast as hell, every time.
Impact on Development and Debugging
The modular architecture simplified isolated testing and error handling:
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_ocr_pipeline() {
        let config = OCRConfig::for_testing();
        let engine = InferenceEngine::new(config);

        // A valid page should process successfully
        let result = engine.process_page(test_page()).await;
        assert!(result.is_ok());

        // A corrupt page should surface a typed, matchable error
        let result = engine.process_page(corrupt_page()).await;
        assert!(matches!(result, Err(Error::InvalidImage)));
    }
}
Better error propagation also makes debugging far easier: failures come back as typed errors you can match on rather than opaque strings.
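As a rough illustration of what that looks like, here is a minimal sketch of a typed error enum using the thiserror crate; only Error::InvalidImage comes from the test above, and the other variants are my assumptions.

// Hypothetical error type; only InvalidImage is taken from the test above,
// the other variants and the thiserror derive are illustrative assumptions.
use thiserror::Error;

#[derive(Debug, Error)]
pub enum Error {
    #[error("page image is corrupt or unreadable")]
    InvalidImage,

    #[error("OCR inference timed out")]
    Timeout,

    #[error("model pool exhausted")]
    PoolExhausted,

    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
}

// Callers can match on specific failure modes instead of parsing strings
pub type Result<T> = std::result::Result<T, Error>;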
Performance Metrics
Document Processing Times (15–20 pages):
┌────────────────┬────────┬───────┬─────────────┐
│ Stage          │ Before │ After │ Improvement │
├────────────────┼────────┼───────┼─────────────┤
│ PDF Conversion │ 15s    │ 2s    │ 7.5x        │
│ OCR            │ 45s    │ 5s    │ 9x          │
│ Chunking       │ 20s    │ 3s    │ 6.7x        │
│ Post‑process   │ 10s    │ 1s    │ 10x         │
├────────────────┼────────┼───────┼─────────────┤
│ Total          │ 90s    │ 11s   │ 8.2x        │
└────────────────┴────────┴───────┴─────────────┘
Conclusion
The Chunkr upgrade shows that thoughtful architectural changes and targeted optimizations can transform system performance. By focusing on modular design, parallel processing, model quantization, and efficient resource management, the team achieved an 8x speedup that boosts both raw performance and developer productivity.
Looking Forward
I can't say exactly what the team will roll out next, but likely areas for further improvement include:
- Deeper model quantization opportunities
- Enhanced caching strategies
- Additional parallelization
- Smarter resource allocation
As new techniques emerge, the team will adopt them to ensure Chunkr keeps delivering cutting‑edge performance.
Thanks for reading! I'm Danial Hasan, founder of LegalFlow, collaborating with Centrai to share knowledge learned at the frontier of applied AI. We're building the best AI systems for the real estate industry. More insights shared here!