10. Performance & Scalability

10.1 Benchmarks

Performance metrics from production workloads:

| Metric | Value | Notes |
| --- | --- | --- |
| Indexing throughput | ~1,000 files/sec | 8 workers, cached |
| Full index (100K files) | ~3 minutes | Cold start, Python repo |
| Incremental patch (50 files) | ~2 seconds | Snapshot-based |
| Cache hit rate | >95% | PR-sized changes |
| Memory footprint | ~2GB | 100K Python files |
| Graph JSON size | ~150MB | 100K files, uncompressed |
| BSG compressed | ~5MB | 12K token budget |

10.2 Scaling Dimensions

| Dimension | Strategy | Limit |
| --- | --- | --- |
| Files | Parallel extraction + caching | 200,000 default |
| Workers | CPU count × 2, capped at 32 | Auto-detected |
| File size | Configurable max (default 500KB) | Per-file |
| Snapshots | Deduplication + retention policy | 500 default |
| Patches | Chain compression + retention | 5,000 default |
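The worker auto-detection rule in the table above (CPU count × 2, capped at 32) can be sketched in a few lines of Python. The function name is illustrative only, not part of batho's API:

```python
import os

def default_worker_count(cap: int = 32) -> int:
    """Auto-detect a worker count: CPU count x 2, capped at `cap`."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return min(cpus * 2, cap)
```

On a 4-core machine this yields 8 workers; anything at or above 16 cores hits the cap of 32.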

Resource Requirements

| Repository Size | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Small (≤10K files) | 2 cores | 512MB | 1GB |
| Medium (10K–50K files) | 4 cores | 1GB | 5GB |
| Large (50K–200K files) | 8+ cores | 4GB+ | 20GB+ |

10.3 Cache Strategy

The caching strategy minimizes redundant work:

Figure 20: Cache Strategy - Flowchart showing the caching logic that minimizes redundant parsing through mtime and SHA-256 validation.

Cache Layers

| Layer | Technology | TTL | Purpose |
| --- | --- | --- | --- |
| AST Cache | SQLite | 90 days | Parsed entity cache |
| Symbol Cache | SQLite | 90 days | Cross-file resolution |
| BSG Cache | JSON files | 90 days | Rendered graphs |
| Snapshot Cache | JSON files | Configurable | Time-travel snapshots |

Cache Invalidation

  • mtime-based: Skip unchanged files
  • SHA-256 validation: Detect content changes
  • Manual invalidation: `batho cache invalidate "*.pyc"`
  • Full clear: `batho cache clear`
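The first two bullets describe a two-stage check: a cheap mtime comparison first, with a SHA-256 content hash as the fallback when the mtime differs. A minimal sketch of that logic (the function and its signature are illustrative, not batho's internals):

```python
import hashlib
import os

def file_changed(path: str, cached_mtime: float, cached_sha: str) -> bool:
    """Return True if `path` needs re-parsing, using mtime then SHA-256."""
    mtime = os.path.getmtime(path)
    if mtime == cached_mtime:
        return False  # fast path: unchanged mtime, skip the file
    # mtime differs (e.g. a touch or checkout); verify content actually changed
    with open(path, "rb") as f:
        sha = hashlib.sha256(f.read()).hexdigest()
    return sha != cached_sha
```

The fallback matters because mtime changes on operations like `git checkout` even when file contents are identical, so hashing avoids spurious re-parses.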

10.4 Performance Tuning

Worker Configuration

```yaml
# batho.yaml
indexer:
  max_workers: 8   # Default: CPU count × 2
  batch_size: 100  # Files per batch
```
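The `batch_size` setting implies the file list is split into fixed-size chunks before being handed to workers. A sketch of that partitioning, assuming a simple slicing approach (hypothetical helper, not batho's internals):

```python
from typing import Iterator, List

def batches(files: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield successive `batch_size`-sized chunks of the file list."""
    for i in range(0, len(files), batch_size):
        yield files[i:i + batch_size]
```

With 250 files and the default batch size of 100, this yields three batches of 100, 100, and 50 files.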

Memory Optimization

```bash
# Limit memory usage
batho index --root . --max-memory 2G

# Enable streaming mode for large repos
batho index --root . --stream
```