ADR-014: Use Tree-sitter for AST-Based Code Chunking
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-06-23 |
| Deciders | Ai Team |
| Supersedes | – |
| Superseded by | – |
Context
The ingestion pipeline currently parses source code files using language-specific regular expressions to identify top-level definitions such as functions, classes, constants, and variables.
This approach has several limitations:
Only Python, JavaScript, TypeScript, and Go are supported currently.
Parsing relies on heuristic pattern matching rather than actual language syntax.
Decorators, annotations, comments, multiline signatures, and other language constructs may be assigned to incorrect chunks.
Nested definitions and complex syntax are difficult to handle reliably.
Supporting additional programming languages requires implementing and maintaining custom regex patterns.
As the project grows and more languages need to be supported, maintaining regex-based parsing becomes increasingly difficult and error-prone.
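The decorator problem listed above can be reproduced with a small sketch. The pattern and the sample source below are hypothetical and are not the pipeline's actual regexes; they only mimic the general approach of starting a new chunk at every unindented `def` or `class` line.

```python
import re

# Hypothetical heuristic: a chunk starts at every unindented `def` or `class`.
TOP_LEVEL_DEF = re.compile(r"^(?:def|class)\s+\w+", re.MULTILINE)

SOURCE = """\
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

@functools.cache
def square(n):
    return n * n
"""


def regex_chunks(source: str) -> list[str]:
    """Split at every match; text before the first match becomes the preamble."""
    starts = [m.start() for m in TOP_LEVEL_DEF.finditer(source)]
    bounds = [0] + starts + [len(source)]
    return [source[a:b] for a, b in zip(bounds, bounds[1:]) if source[a:b]]


chunks = regex_chunks(SOURCE)
# chunks[0] is the preamble, but it also swallows fib's decorator.
# chunks[1] starts at `def fib` and ends with square's decorator.
# chunks[2] starts at `def square` without its decorator.
```

Because the pattern only sees the `def` line, each decorator ends up in the chunk before the definition it belongs to.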
Tree-sitter is a parser generator and incremental parsing library that builds concrete syntax trees (CSTs) for source code and supports more than 100 programming languages through reusable grammars. It provides a language-aware mechanism for identifying top-level code structures, with the same API for every language.
Decision Drivers
Improve correctness of code chunk boundaries.
Correctly associate decorators, annotations, comments, and multiline signatures with their definitions.
Support multiple programming languages through a unified parsing approach.
Reduce maintenance overhead caused by language-specific regex patterns.
Preserve existing ingestion behavior for non-code files.
Maintain acceptable ingestion performance.
Enable future language support without significant implementation effort.
Enrich chunk metadata with symbol information (e.g. function or class names).
Considered Options
Option A – Continue using regex-based parsing.
Option B – Use language-specific AST parsers for each supported language.
Option C – Use Tree-sitter as the primary code parser with regex fallback.
Decision
Chosen option: Option C – Use Tree-sitter as the primary code parser with regex fallback.
Tree-sitter provides accurate, syntax-aware chunk boundaries across multiple languages through a single parsing API. The existing regex parser remains available as a fallback for languages that the Tree-sitter path does not yet cover.
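A minimal sketch of the primary-plus-fallback dispatch follows, assuming a registry keyed by language name. All names here (`SYNTAX_CHUNKERS`, `chunk_code`, `regex_chunker`, `whole_file_chunker`) are hypothetical, and both chunkers are trivial stand-ins so that the example runs without Tree-sitter installed.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def regex_chunker(source: str) -> list[str]:
    """Stand-in for the existing regex parser (here: split on blank lines)."""
    return [part for part in source.split("\n\n") if part.strip()]


def whole_file_chunker(source: str) -> list[str]:
    """Placeholder for a Tree-sitter-backed chunker; returns a single chunk."""
    return [source]


# Hypothetical registry of languages that have a syntax-aware chunker.
SYNTAX_CHUNKERS: dict[str, Chunker] = {"python": whole_file_chunker}


def chunk_code(source: str, language: str) -> tuple[str, list[str]]:
    """Return (parser_used, chunks).

    The regex parser is used only when no syntax-aware chunker is registered
    for the language.
    """
    chunker = SYNTAX_CHUNKERS.get(language)
    if chunker is None:
        return "regex", regex_chunker(source)
    return "tree-sitter", chunker(source)
```

Returning the parser name alongside the chunks is one possible way to record which path produced a chunk, which may help when comparing the two parsers during migration.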
Rationale
The current regex-based implementation is simple but fundamentally limited: it matches text patterns and has no model of the language's syntax.
Language-specific parsers would provide accurate results but require maintaining different parsing implementations and APIs for each language. This would increase complexity and make adding new languages more expensive.
Tree-sitter provides a balance between correctness, maintainability, and extensibility:
It accurately identifies top-level definitions using the language grammar.
It correctly handles decorators, annotations, multiline signatures, and nested constructs.
It provides a common API across languages.
Additional languages can be supported by adding grammars rather than implementing new parsers.
Existing chunking logic, metadata generation, and preamble handling can largely be retained.
Keeping the existing regex parser as a fallback minimizes migration risk and ensures unsupported languages continue to be processed.
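The "common API" argument can be sketched as one walker driven by a per-language table of node types. The walker reads only `type`, `children`, `start_byte`, and `end_byte`, which mirror attributes of Tree-sitter syntax nodes in the Python bindings. The node type names are assumed from the published grammars and should be verified against each grammar's `node-types.json`. `FakeNode`, `TOP_LEVEL_NODE_TYPES`, and `top_level_spans` are hypothetical names, and folding leading comments into a definition is an illustrative policy, not a decided one.

```python
from dataclasses import dataclass, field

# Assumed node type names per grammar; verify against node-types.json.
TOP_LEVEL_NODE_TYPES: dict[str, frozenset[str]] = {
    "python": frozenset(
        {"function_definition", "class_definition", "decorated_definition"}
    ),
    "javascript": frozenset({"function_declaration", "class_declaration"}),
    "typescript": frozenset(
        {"function_declaration", "class_declaration", "interface_declaration"}
    ),
    "go": frozenset(
        {"function_declaration", "method_declaration", "type_declaration"}
    ),
}


@dataclass
class FakeNode:
    """Test double exposing only the node attributes the walker reads."""
    type: str
    start_byte: int
    end_byte: int
    children: list["FakeNode"] = field(default_factory=list)


def top_level_spans(root, language: str) -> list[tuple[int, int]]:
    """Return (start_byte, end_byte) for each top-level definition.

    A run of comment nodes immediately preceding a definition is folded
    into that definition's span.
    """
    wanted = TOP_LEVEL_NODE_TYPES[language]
    spans: list[tuple[int, int]] = []
    comment_start = None
    for child in root.children:
        if child.type == "comment":
            if comment_start is None:
                comment_start = child.start_byte
            continue
        if child.type in wanted:
            start = comment_start if comment_start is not None else child.start_byte
            spans.append((start, child.end_byte))
        comment_start = None  # any non-comment node ends the comment run
    return spans


root = FakeNode("module", 0, 100, [
    FakeNode("import_statement", 0, 10),
    FakeNode("comment", 12, 30),
    FakeNode("decorated_definition", 31, 80),
    FakeNode("expression_statement", 81, 90),
    FakeNode("function_definition", 91, 100),
])
spans = top_level_spans(root, "python")
```

In the Python grammar a decorated function is wrapped in a `decorated_definition` node, so its decorators fall inside the span without extra handling. Under this design, supporting another language means adding a table entry rather than writing a new walker.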
Pros and Cons of the Options
Option A – Regex-Based Parsing
Pros
Simple implementation.
No additional dependencies.
Fast execution.
Cons
Incorrect handling of decorators, annotations, and multiline declarations.
Difficult to extend reliably.
Requires custom maintenance for every language.
Not syntax-aware.
Option B – Language-Specific AST Parsers
Pros
Accurate parsing for each language.
Full access to language-specific syntax information.
Cons
Requires different implementations per language.
Higher maintenance burden.
Inconsistent APIs and parsing behavior.
More difficult to add new languages.
Option C – Tree-sitter with Regex Fallback
Pros
Accurate syntax-aware parsing.
Unified API across languages.
Supports current languages and many future languages.
Correct handling of decorators, annotations, comments, and multiline declarations.
Easier long-term maintenance.
Allows enrichment of metadata with symbol names and symbol types.
Existing regex parser can remain as fallback.
Cons
Introduces additional dependencies and grammar management.
Slightly higher parsing overhead.
Requires initial implementation effort and test coverage.
Consequences
Positive
More accurate code chunk boundaries.
Improved retrieval quality due to better chunk semantics.
Decorators, comments, and annotations remain attached to their corresponding definitions.
Consistent support for Python, JavaScript, TypeScript, and Go.
Easier addition of future languages.
Ability to store symbol metadata such as:
symbol_name
symbol_kind
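One possible shape for the enriched chunk metadata is sketched below. Only `symbol_name` and `symbol_kind` come from this ADR; the class name and the remaining fields are hypothetical placeholders for whatever the existing metadata already carries.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    # Hypothetical existing fields.
    path: str
    language: str
    start_line: int
    end_line: int
    # New fields; None for preamble chunks or chunks from the regex fallback.
    symbol_name: Optional[str] = None  # e.g. "fib"
    symbol_kind: Optional[str] = None  # e.g. "function" or "class"


meta = ChunkMetadata(
    path="app/math_utils.py",
    language="python",
    start_line=3,
    end_line=5,
    symbol_name="fib",
    symbol_kind="function",
)
record = asdict(meta)  # plain dict, ready to store alongside the chunk
```

Keeping both symbol fields optional lets chunks produced by the regex fallback share the same schema.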
Negative / Trade-offs
Additional dependency on Tree-sitter and language grammars.
Increased implementation complexity compared to regex parsing.
Slightly increased parsing time during ingestion.
Grammar versions must be maintained and updated.