
7. Performance & Scalability

7.1 Benchmarks

Performance metrics from production workloads in Batho v1.1.0:

| Metric | Value | Notes |
|---|---|---|
| Indexing throughput | ~1,000 files/sec | 8 workers, cached AST |
| Full build (100K files) | ~3 minutes | Cold start, Python repository |
| Incremental patch (50 files) | ~1.5 seconds | Content-hash based patch |
| AST Cache hit rate | >95% | Typical pull request size |
| Memory footprint | ~1.5GB | 100K Python files |
| Arrow Bundle size | ~45MB | Compressed Arrow IPC Bundle database |
| Agent BSG export size | ~3.5MB | 12K token budget |

7.2 Scaling Dimensions

| Dimension | Strategy | Limit |
|---|---|---|
| Files | Parallel extraction + caching | 200,000 default |
| Workers | CPU × 2, capped at 32 | Auto-detected |
| File size | Configurable max (default 500KB) | Per-file |
| Runs | GC cleanup & vacuum policies | Configurable retention |
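The worker rule in the table (CPU × 2, capped at 32) can be sketched as below. This is a minimal illustration, not Batho's actual implementation; the function name `default_worker_count` and the constant `MAX_WORKERS` are hypothetical.

```python
import os

MAX_WORKERS = 32  # cap taken from the Scaling Dimensions table


def default_worker_count(cpu_count=None):
    """Return CPU count x 2, capped at MAX_WORKERS and never below 1.

    Hypothetical helper: the name and signature are assumptions,
    only the "CPU x 2, capped at 32" rule comes from the table.
    """
    if cpu_count is None:
        # os.cpu_count() can return None when the count is undetermined.
        cpu_count = os.cpu_count() or 1
    return max(1, min(cpu_count * 2, MAX_WORKERS))


if __name__ == "__main__":
    print(default_worker_count())
```

Under this rule a 4-core machine would get 8 workers, which matches the worker count quoted for the indexing-throughput benchmark, and any machine with 16 or more cores would hit the cap.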

Resource Requirements

| Repository Size | CPU | Memory | Disk |
|---|---|---|---|
| Small (≤10K files) | 2 cores | 512MB | 100MB |
| Medium (10K-50K) | 4 cores | 1.5GB | 500MB |
| Large (50K-200K) | 8+ cores | 4GB+ | 2GB |
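For capacity planning, the sizing table can be encoded as a small lookup. The sketch below is illustrative only: the `Tier` type and `recommend_tier` function are hypothetical names, and treating a repository of exactly 50,000 files as Medium is an assumption, since the table leaves that boundary ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_files: int  # inclusive upper bound (boundary handling is an assumption)
    cpu: str
    memory: str
    disk: str


# Values copied from the Resource Requirements table.
TIERS = (
    Tier("Small", 10_000, "2 cores", "512MB", "100MB"),
    Tier("Medium", 50_000, "4 cores", "1.5GB", "500MB"),
    Tier("Large", 200_000, "8+ cores", "4GB+", "2GB"),
)


def recommend_tier(file_count):
    """Return the first tier whose upper bound covers file_count."""
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    for tier in TIERS:
        if file_count <= tier.max_files:
            return tier
    # 200,000 files is the documented default limit.
    raise ValueError("file_count exceeds the 200,000-file default limit")


if __name__ == "__main__":
    print(recommend_tier(100_000))
```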

7.3 Cache Strategy

The caching strategy minimizes redundant work:

Figure 17: Cache Strategy - Flowchart showing the caching logic that minimizes redundant parsing through mtime and SHA-256 validation.

Cache Layers

| Layer | Technology | TTL | Purpose |
|---|---|---|---|
| AST Cache | msgpack | 30 days | Persisted tree-sitter parsed entity structure |
| Dependency Cache | msgpack | 90 days | Shared third-party symbol resolution mapping |
| BSG Cache | Arrow IPC | 30 days | Rendered views stored in file_artifacts |
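The per-layer TTLs above imply an age check on each cache entry. The following is a rough sketch of such a check, assuming each entry records a UTC creation timestamp; the layer keys, `CACHE_TTL`, and `is_expired` are hypothetical names rather than part of Batho's API.

```python
from datetime import datetime, timedelta, timezone

# TTLs from the Cache Layers table; the dictionary keys are illustrative.
CACHE_TTL = {
    "ast": timedelta(days=30),
    "dependency": timedelta(days=90),
    "bsg": timedelta(days=30),
}


def is_expired(layer, created_at, now=None):
    """Return True if an entry created at `created_at` has outlived its TTL.

    `created_at` and `now` are expected to be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at > CACHE_TTL[layer]


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    checked = datetime(2024, 2, 15, tzinfo=timezone.utc)  # 45 days later
    print(is_expired("ast", created, checked))         # True: past 30 days
    print(is_expired("dependency", created, checked))  # False: within 90 days
```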

Cache Invalidation & Maintenance

  • mtime + SHA-256 Verification: File changes are detected by comparing each file's modification time (mtime) and SHA-256 content hash against the values recorded in the database tracking schema.
  • Garbage Collection: Outdated runs and cache entries are removed with batho gc. Running batho gc vacuum additionally reclaims database space by triggering vacuum operations on the Arrow IPC files.
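The mtime + SHA-256 check can be sketched as follows. This is a minimal illustration under stated assumptions, not Batho's implementation: it assumes mtime is used as a cheap first pass and the hash is computed only when mtime differs, and it models the tracking record as a plain dict with hypothetical keys `mtime_ns` and `sha256`.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reparse(path, record):
    """Decide whether a cached parse of `path` is stale.

    `record` is a hypothetical tracking entry: None if the file has never
    been seen, else a dict with 'mtime_ns' and 'sha256'.
    """
    if record is None:
        return True  # never indexed
    if os.stat(path).st_mtime_ns == record["mtime_ns"]:
        return False  # fast path: mtime unchanged, skip hashing
    # mtime changed: fall back to the content hash, so a file that was only
    # touched (same bytes, new mtime) still counts as a cache hit.
    return sha256_of(path) != record["sha256"]
```

Checking mtime first keeps the common case to a single stat call, while the hash comparison avoids re-parsing files whose timestamps changed but whose contents did not, for example after a fresh checkout.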