Arbor-Vitae: A Code-Understanding System Combining Static Graph Analysis with Hybrid Lexical-Semantic Retrieval

Ryu, Jae-uk

Publication: Preprint · Under curation · English · Publisher: Zenodo

Authors: Ryu, Jae-uk

doi: 10.5281/zenodo.20549779

Abstract

We present Arbor-Vitae, a code-understanding system that integrates a deterministic static-graph analyzer with a hybrid lexical-semantic retrieval pipeline. Arbor indexes codebases across six languages (Rust, TypeScript, JavaScript, Java, Python, Go) using tree-sitter parsers; builds an in-memory typed symbol graph enriched with call, inheritance, re-export, and (new in this work) method-override edges; and exposes the resulting analysis through a twelve-tool MCP interface suitable for LLM-driven coding agents. We report results along four independent axes. (i) Retrieval quality: On a 24-repository, six-language benchmark, Arbor's hybrid pipeline (BM25 + dense RRF + definition boost + file coherence) reaches NDCG@10 = 0.836, compared with Semble's 0.850. The observed gap is small relative to the per-repository variance, and Arbor exceeds Semble on symbol queries by 9.3 pp. All components are pure Rust; no Python dependency is required at inference time. (ii) Graph accuracy: On a self-annotated impact-analysis benchmark (n=8 cases), Arbor achieves F1 = 0.816 after the typed-graph additions. (iii) Agent token efficiency: A controlled comparison of a shell-tools agent against an Arbor-CLI agent on 12 code-understanding tasks finds that the Arbor agent uses 3.3x fewer input tokens on average (12.8x on the both-correct subset). F1 also trends higher (0.70 vs. 0.47 in a single-shot pilot), but single-shot F1 can flip between identical runs, so we treat the token result as the primary efficiency claim. (iv) Integration gap: A system-prompt nudge designed to improve agent behavior produces a severe regression (F1 drops from 1.00 to 0.00 on one task), motivating an architectural fix rather than further prompt engineering. We additionally report that path-based file penalties improve a benchmark by +0.013 NDCG@10 but produce a -0.011 regression on MTEB CodeSearchNetRetrieval, illustrating how ground-truth design assumptions constrain generalizability.
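To make the "BM25 + dense RRF" step concrete, the following is a minimal sketch of reciprocal rank fusion over two ranked lists, written in Rust to match the system's stated implementation language. It is not Arbor's code: the function name `rrf_fuse`, the file-path document ids, and the constant k = 60 (a commonly used default) are assumptions for illustration, and the definition boost and file-coherence stages named in the abstract are not modeled because the abstract does not describe them.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion (RRF): each document's fused score is the sum,
/// over all input rankings, of 1 / (k + rank), where rank is 1-based.
/// `rrf_fuse` and its signature are hypothetical, not Arbor's API.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            let rank = (i + 1) as f64; // convert 0-based index to 1-based rank
            *scores.entry((*doc).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by document id for determinism.
    fused.sort_by(|a, b| {
        b.1.partial_cmp(&a.1)
            .unwrap()
            .then_with(|| a.0.cmp(&b.0))
    });
    fused
}
```

Calling `rrf_fuse(&[bm25_ranking, dense_ranking], 60.0)` with one lexical and one dense ranking returns the union of both lists ordered by fused score. Because RRF uses only ranks, it needs no calibration between BM25 scores and embedding similarities, which is one plausible reason to choose it for a hybrid pipeline.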
