# ADR-014: Use Tree-sitter for AST-Based Code Chunking

| Field | Value |
|---|---|
| **Status** | Accepted |
| **Date** | 2026-06-23 |
| **Deciders** | Ai Team |
| **Supersedes** | – |
| **Superseded by** | – |

---

## Context

The ingestion pipeline currently parses source code files using language-specific regular expressions to identify top-level definitions such as functions, classes, constants, and variables.

This approach has several limitations:

- Only Python, JavaScript, TypeScript, and Go are supported currently.
- Parsing relies on heuristic pattern matching rather than actual language syntax.
- Decorators, annotations, comments, multiline signatures, and other language constructs may be assigned to incorrect chunks.
- Nested definitions and complex syntax are difficult to handle reliably.
- Supporting additional programming languages requires implementing and maintaining custom regex patterns.

As the project grows and additional languages are expected to be supported, maintaining regex-based parsing becomes increasingly difficult and error-prone.

Tree-sitter is a parser framework that generates concrete syntax trees (CSTs) for source code and supports more than 100 programming languages through reusable grammars. It provides a language-aware mechanism for identifying top-level code structures while using a consistent API across languages.

---

## Decision Drivers

- Improve correctness of code chunk boundaries.
- Correctly associate decorators, annotations, comments, and multiline signatures with their definitions.
- Support multiple programming languages through a unified parsing approach.
- Reduce maintenance overhead caused by language-specific regex patterns.
- Preserve existing ingestion behavior for non-code files.
- Maintain acceptable ingestion performance.
- Enable future language support without significant implementation effort.
- Enrich chunk metadata with symbol information (e.g. function or class names).

---

## Considered Options

- **Option A** – Continue using regex-based parsing.
- **Option B** – Use language-specific AST parsers for each supported language.
- **Option C** – Use Tree-sitter as the primary code parser with regex fallback.

---

## Decision

**Chosen option: Option C – Use Tree-sitter as the primary code parser with regex fallback.**

Tree-sitter provides accurate syntax-aware code boundaries across multiple languages while preserving a consistent implementation model. Existing regex parsing will remain available as a fallback for unsupported languages.

---

## Rationale

The current regex-based implementation is simple but fundamentally limited because it cannot understand source code structure.

Language-specific parsers would provide accurate results but require maintaining different parsing implementations and APIs for each language. This would increase complexity and make adding new languages more expensive.

Tree-sitter provides a balance between correctness, maintainability, and extensibility:

- It accurately identifies top-level definitions using the language grammar.
- It correctly handles decorators, annotations, multiline signatures, and nested constructs.
- It provides a common API across languages.
- Additional languages can be supported by adding grammars rather than implementing new parsers.
- Existing chunking logic, metadata generation, and preamble handling can largely be retained.

Keeping the existing regex parser as a fallback minimizes migration risk and ensures unsupported languages continue to be processed.

---

## Pros and Cons of the Options

### Option A – Regex-Based Parsing

#### Pros

- Simple implementation.
- No additional dependencies.
- Fast execution.

#### Cons

- Incorrect handling of decorators, annotations, and multiline declarations.
- Difficult to extend reliably.
- Requires custom maintenance for every language.
- Not syntax-aware.

### Option B – Language-Specific AST Parsers

#### Pros

- Accurate parsing for each language.
- Full access to language-specific syntax information.

#### Cons

- Requires different implementations per language.
- Higher maintenance burden.
- Inconsistent APIs and parsing behavior.
- More difficult to add new languages.

### Option C – Tree-sitter with Regex Fallback

#### Pros

- Accurate syntax-aware parsing.
- Unified API across languages.
- Supports current languages and many future languages.
- Correct handling of decorators, annotations, comments, and multiline declarations.
- Easier long-term maintenance.
- Allows enrichment of metadata with symbol names and symbol types.
- Existing regex parser can remain as fallback.

#### Cons

- Introduces additional dependencies and grammar management.
- Slightly higher parsing overhead.
- Requires initial implementation effort and test coverage.

---

## Consequences

### Positive

- More accurate code chunk boundaries.
- Improved retrieval quality due to better chunk semantics.
- Decorators, comments, and annotations remain attached to their corresponding definitions.
- Consistent support for Python, JavaScript, TypeScript, and Go.
- Easier addition of future languages.
- Ability to store symbol metadata such as:
  - `symbol_name`
  - `symbol_kind`

### Negative / Trade-offs

- Additional dependency on Tree-sitter and language grammars.
- Increased implementation complexity compared to regex parsing.
- Slightly increased parsing time during ingestion.
- Grammar versions must be maintained and updated.

---

## Links