# ADR-014: Use Tree-sitter for AST-Based Code Chunking

| Field | Value |
|---|---|
| **Status** | Accepted |
| **Date** | 2026-06-23 |
| **Deciders** | Ai Team |
| **Supersedes** | – |
| **Superseded by** | – |

---

## Context

The ingestion pipeline currently parses source code files using language-specific regular expressions to identify top-level definitions such as functions, classes, constants, and variables.

This approach has several limitations:

- Only Python, JavaScript, TypeScript, and Go are supported currently.
- Parsing relies on heuristic pattern matching rather than actual language syntax.
- Decorators, annotations, comments, multiline signatures, and other language constructs may be assigned to incorrect chunks.
- Nested definitions and complex syntax are difficult to handle reliably.
- Supporting additional programming languages requires implementing and maintaining custom regex patterns.

As the project grows and additional languages are expected to be supported, maintaining regex-based parsing becomes increasingly difficult and error-prone.

Tree-sitter is a parser framework that generates concrete syntax trees (CSTs) for source code and supports more than 100 programming languages through reusable grammars. It provides a language-aware mechanism for identifying top-level code structures while using a consistent API across languages.

---

## Decision Drivers

- Improve correctness of code chunk boundaries.
- Correctly associate decorators, annotations, comments, and multiline signatures with their definitions.
- Support multiple programming languages through a unified parsing approach.
- Reduce maintenance overhead caused by language-specific regex patterns.
- Preserve existing ingestion behavior for non-code files.
- Maintain acceptable ingestion performance.
- Enable future language support without significant implementation effort.
- Enrich chunk metadata with symbol information (e.g. function or class names).
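
To make the metadata driver concrete, the sketch below shows one possible shape for a chunk record. Only `symbol_name` and `symbol_kind` are named in this ADR (under Consequences); the class name `CodeChunk` and every other field are assumptions for illustration, not the pipeline's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeChunk:
    """Hypothetical chunk record; only symbol_name and symbol_kind come from the ADR."""

    text: str
    language: str
    start_line: int  # 1-based, inclusive (assumed convention)
    end_line: int    # 1-based, inclusive (assumed convention)
    symbol_name: Optional[str] = None  # e.g. "parse_config"
    symbol_kind: Optional[str] = None  # e.g. "function" or "class"


# A chunk produced for a top-level function definition.
chunk = CodeChunk(
    text="def parse_config(path):\n    return {}\n",
    language="python",
    start_line=10,
    end_line=11,
    symbol_name="parse_config",
    symbol_kind="function",
)

# A preamble chunk (imports, module comments) carries no symbol information.
preamble = CodeChunk(text="import os\n", language="python", start_line=1, end_line=1)
```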

---

## Considered Options

- **Option A** – Continue using regex-based parsing.
- **Option B** – Use language-specific AST parsers for each supported language.
- **Option C** – Use Tree-sitter as the primary code parser with regex fallback.

---

## Decision

**Chosen option: Option C – Use Tree-sitter as the primary code parser with regex fallback.**

Tree-sitter provides accurate syntax-aware code boundaries across multiple languages while preserving a consistent implementation model. Existing regex parsing will remain available as a fallback for unsupported languages.

---

## Rationale

The current regex-based implementation is simple but fundamentally limited because it cannot understand source code structure.

Language-specific parsers would provide accurate results but require maintaining different parsing implementations and APIs for each language. This would increase complexity and make adding new languages more expensive.

Tree-sitter provides a balance between correctness, maintainability, and extensibility:

- It accurately identifies top-level definitions using the language grammar.
- It correctly handles decorators, annotations, multiline signatures, and nested constructs.
- It provides a common API across languages.
- Additional languages can be supported by adding grammars rather than implementing new parsers.
- Existing chunking logic, metadata generation, and preamble handling can largely be retained.

Keeping the existing regex parser as a fallback minimizes migration risk and ensures unsupported languages continue to be processed.

---

## Pros and Cons of the Options

### Option A – Regex-Based Parsing

#### Pros

- Simple implementation.
- No additional dependencies.
- Fast execution.

#### Cons

- Incorrect handling of decorators, annotations, and multiline declarations.
- Difficult to extend reliably.
- Requires custom maintenance for every language.
- Not syntax-aware.
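
The decorator problem listed above can be reproduced with a small sketch. The pattern `DEF_START` and the helper `regex_chunks` are hypothetical stand-ins, not the pipeline's actual regexes; they only assume that a chunk starts at each top-level `def` or `class` line.

```python
import re

# Hypothetical stand-in for a regex chunker: a chunk starts at every
# top-level "def" or "class" line.
DEF_START = re.compile(r"^(?:def|class)\s+\w+", re.MULTILINE)

SOURCE = '''\
import functools

def first():
    return 1

@functools.lru_cache(maxsize=None)
def second(
    x,
):
    return x
'''


def regex_chunks(source: str) -> list[str]:
    """Split source at each regex match; text before the first match is the preamble."""
    starts = [m.start() for m in DEF_START.finditer(source)]
    bounds = [0] + starts + [len(source)]
    # Drop empty or whitespace-only slices (e.g. when the file starts with a def).
    return [source[a:b] for a, b in zip(bounds, bounds[1:]) if source[a:b].strip()]


chunks = regex_chunks(SOURCE)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk, end="")
```

In this sketch the `@functools.lru_cache` line lands at the end of the chunk for `first()`, because the pattern only recognizes the `def` line as a boundary. A syntax-aware parser can treat the decorator and the definition as one unit, which is the behavior this ADR is after.
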

### Option B – Language-Specific AST Parsers

#### Pros

- Accurate parsing for each language.
- Full access to language-specific syntax information.

#### Cons

- Requires different implementations per language.
- Higher maintenance burden.
- Inconsistent APIs and parsing behavior.
- More difficult to add new languages.

### Option C – Tree-sitter with Regex Fallback

#### Pros

- Accurate syntax-aware parsing.
- Unified API across languages.
- Supports current languages and many future languages.
- Correct handling of decorators, annotations, comments, and multiline declarations.
- Easier long-term maintenance.
- Allows enrichment of metadata with symbol names and symbol types.
- Existing regex parser can remain as fallback.

#### Cons

- Introduces additional dependencies and grammar management.
- Slightly higher parsing overhead.
- Requires initial implementation effort and test coverage.

---

## Consequences

### Positive

- More accurate code chunk boundaries.
- Improved retrieval quality due to better chunk semantics.
- Decorators, comments, and annotations remain attached to their corresponding definitions.
- Consistent support for Python, JavaScript, TypeScript, and Go.
- Easier addition of future languages.
- Ability to store symbol metadata such as:
  - `symbol_name`
  - `symbol_kind`

### Negative / Trade-offs

- Additional dependency on Tree-sitter and language grammars.
- Increased implementation complexity compared to regex parsing.
- Slightly increased parsing time during ingestion.
- Grammar versions must be maintained and updated.

---

## Links