
1. Architecture Overview

1.1 High-Level System Architecture

Batho's architecture is layered, with clear separation between the extraction, indexing, intelligence, and output layers. Processing is designed to be deterministic: the same inputs always produce the same result, which is what makes caching and incremental updates reliable.

Figure 2: High-Level System Architecture. Flowchart of the layered architecture, from source inputs through the core engine to output interfaces. Components: Source Inputs (Git Repository, batho.yaml); Batho Core Engine (AST Extractor, InMemoryGraph, AST Cache, SymbolIndex, IncrementalGraphUpdater); Intelligence Layer (BSGMap, BSG Rule Plugins); Output Interfaces (Arrow IPC Bundle, batho CLI).

1.2 Data Flow Pipeline

The data flow pipeline ensures deterministic processing with built-in caching and validation:

Figure 3: Data Flow Pipeline. Sequence diagram of the deterministic indexing process, including its caching and validation steps. Flow: the user triggers the batho CLI build command; the CLI discovers files, respecting gitignore; the Extractor checks the cache using mtime and a SHA-256 hash; parallel extraction parses files with tree-sitter and emits entities to the InMemoryGraph; the graph resolves imports via the SymbolIndex; BSGMap builds a flat symbol index; Rule Plugins apply a semantic overlay; and the output is serialized to the Arrow Bundle Store.
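The cache check in the flow above (mtime first, then a SHA-256 hash) can be sketched as follows. This is a minimal illustration under stated assumptions, not Batho's actual implementation: `CacheEntry`, `file_sha256`, and `needs_extraction` are hypothetical names, and the order of the two checks is an assumption about how such a cache is typically built.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """Hypothetical cache record; the field names are illustrative only."""
    mtime_ns: int
    sha256: str


def file_sha256(path: str) -> str:
    """Hash file contents in chunks so large files are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_extraction(path: str, cache: dict[str, CacheEntry]) -> bool:
    """Return True when the file must be re-parsed; keep the cache up to date."""
    mtime_ns = os.stat(path).st_mtime_ns
    entry = cache.get(path)
    if entry is not None and entry.mtime_ns == mtime_ns:
        return False  # fast path: mtime unchanged, so hashing is skipped
    digest = file_sha256(path)
    if entry is not None and entry.sha256 == digest:
        entry.mtime_ns = mtime_ns  # file was touched but its content is identical
        return False
    cache[path] = CacheEntry(mtime_ns, digest)  # new or changed file
    return True
```

Checking mtime first keeps the common case cheap, while the hash guards against files that were touched without changing.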

1.3 Component Responsibilities

Core Engine Components

| Component | Purpose | Key Features |
| --- | --- | --- |
| AST Extractor | Multi-language parsing via tree-sitter | 40+ language support, parallel processing, mtime tracking |
| InMemoryGraph | Hypergraph storage | Lazy adjacency indexing, relationship deduplication, cross-file resolution |
| AST Cache | Persistent entity cache | msgpack-backed, SHA-256 validation, automatic invalidation |
| SymbolIndex | Cross-file symbol resolution | Two-pass resolution, unresolved target tracking |
| IncrementalUpdater | Patch application | Diff-based updates, content-hash comparisons, rollback support |
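As an illustration of the two-pass resolution and unresolved target tracking attributed to SymbolIndex, the sketch below collects every definition in a first pass and links imports in a second. The class layout, the `resolve` method, and the `defines`/`imports` input shape are assumptions made for this example, not Batho's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolIndex:
    """Illustrative index only; not Batho's real class layout."""
    definitions: dict[str, str] = field(default_factory=dict)  # symbol -> defining file
    unresolved: list[tuple[str, str]] = field(default_factory=list)  # (importing file, symbol)

    def resolve(self, files: dict[str, dict[str, list[str]]]) -> list[tuple[str, str, str]]:
        # Pass 1: record every definition, so lookups do not depend on file order.
        for path in sorted(files):
            for name in files[path]["defines"]:
                self.definitions.setdefault(name, path)
        # Pass 2: link each import to its definition, or track it as unresolved.
        edges = []
        for path in sorted(files):
            for name in files[path]["imports"]:
                target = self.definitions.get(name)
                if target is None:
                    self.unresolved.append((path, name))
                else:
                    edges.append((path, name, target))
        return edges


files = {
    "b.py": {"defines": ["helper"], "imports": []},
    "a.py": {"defines": ["main"], "imports": ["helper", "requests"]},
}
index = SymbolIndex()
edges = index.resolve(files)
```

Here `a.py` imports `helper` before `b.py` would be visited in a single pass over unsorted input, which is the case the first pass exists to handle; `requests` has no definition in the repository and ends up in `unresolved`.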

Intelligence Layer Components

| Component | Purpose | Key Features |
| --- | --- | --- |
| BSGMap | Structured graph representation | Flat symbol index, priority scoring, rendering modes |
| Rule Plugins | Semantic analysis | YAML-defined rules, plugin architecture, tag-based annotation |
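To make tag-based annotation concrete, the sketch below applies a small rule set to extracted entities. The rule schema (`match`, `kind`, `name`, `tag`) and the `apply_rules` function are hypothetical; the rules are written as the dictionaries a YAML loader would produce, so the example needs no YAML library.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, shown as the dicts a YAML loader would return.
RULES = [
    {"match": {"kind": "function", "name": "test_*"}, "tag": "test"},
    {"match": {"kind": "class", "name": "*Controller"}, "tag": "entrypoint"},
]


def apply_rules(entities, rules):
    """Map each entity id to the sorted list of tags from every matching rule."""
    tags = {}
    for entity in entities:
        matched = {
            rule["tag"]
            for rule in rules
            if entity["kind"] == rule["match"]["kind"]
            and fnmatchcase(entity["name"], rule["match"]["name"])
        }
        tags[entity["id"]] = sorted(matched)  # sorted for a deterministic result
    return tags


entities = [
    {"id": "e1", "kind": "function", "name": "test_parse"},
    {"id": "e2", "kind": "class", "name": "UserController"},
    {"id": "e3", "kind": "function", "name": "parse"},
]
annotations = apply_rules(entities, RULES)
```

Keeping the rules as plain data means new annotations can be added without touching the matching code, which is the usual motivation for a rule-file design.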

1.4 Output Interfaces

| Interface | Transport | Purpose |
| --- | --- | --- |
| CLI | Terminal | Direct control, scripting, automation, history diffs, integrity repair, gc |
| Arrow IPC Bundle | Arrow / IPC (.batho) | High-performance serialized storage of entities, dependencies, and BSG views |
| JSON Export | JSON Stream | Standard representation for LLM context injection and downstream tool integrations |
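The JSON export could look roughly like the sketch below, which writes one JSON object per line. The newline-delimited layout, the entity fields, and the `export_json_stream` name are assumptions made for illustration; the actual export format is not specified here.

```python
import io
import json


def export_json_stream(entities, out):
    """Write one JSON object per line, in a stable order and with sorted keys."""
    for entity in sorted(entities, key=lambda e: e["id"]):
        out.write(json.dumps(entity, sort_keys=True, separators=(",", ":")))
        out.write("\n")


buf = io.StringIO()
export_json_stream(
    [{"id": "b", "kind": "class"}, {"id": "a", "kind": "function"}],
    buf,
)
```

Sorting both the records and their keys keeps the output byte-identical across runs, in line with the deterministic processing goal.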