Engineering · March 2026

GitNexus vs Claude Code's Native Exploration: A Real Benchmark

TL;DR

I ran the same architectural question through two different codebase exploration approaches on a mid-sized Flutter project. First: Claude Code's built-in Explore subagent (very thorough mode). Second: GitNexus, a graph-indexed code intelligence layer backed by KuzuDB. The graph approach was 25% cheaper ($0.39 vs $0.52), used 68% fewer tool calls (~12 vs 38), and consumed 43% fewer tokens, with no loss in answer quality. Both approaches read roughly the same source files. The difference was how fast each figured out which files to read.

Testing retrieval tooling for architectural questions is tricky, because most retrieval benchmarks aren't designed for code.

The typical benchmark uses document collections (legal texts, support tickets, documentation) where finding the right answer is largely a vocabulary problem. Search for "authentication" and you surface documents about authentication. Close enough works.

Code questions are different. The questions that actually matter to someone reasoning about an architecture aren't about vocabulary, they're about structure: which components depend on this interface, what path connects a user action to the database, what breaks if I change this abstraction. These are graph traversal problems. Whether a retrieval system can answer them depends on whether it preserved the relational structure of the codebase when it built its index, not just the text content.

So I ran both approaches against exactly that kind of structural question and measured what happened.

Why vector search misses the point for code

Vector retrieval converts text into points in a high-dimensional space and finds matches by geometric proximity. It works well for prose, where similar vocabulary generally means similar meaning.

Three things about source code make this break down.

Chunking cuts cross-file relationships

To build a vector index, you have to split files into chunks. That splitting destroys exactly what makes code meaningful: the caller-callee links, import chains, interface implementations. A function's chunk captures what the function does. It doesn't capture that it's the only write path in the entire app, because that fact lives in its callers, not in the function itself.

Closely related code doesn't always look similar

Two components can be tightly coupled at runtime while sharing almost no vocabulary. An authentication lifecycle handler and a sync connection provider don't use the same terms, yet one directly triggers the other at a critical point in the app flow. That relationship is an edge in a call graph. You can't recover it from text similarity.

Imprecise retrieval leads to context bloat

When retrieval can't pinpoint the right files, whether it's vector search or iterative keyword search, the workaround is the same: load more files and let the language model sort through them. You can see this in the benchmark numbers: cache reads were 88-96% of total token consumption across both runs. The model wasn't fetching targeted information; it was scanning a large context window looking for what mattered.

How Claude Code's built-in exploration works

Before getting to GitNexus, it's worth being precise about the baseline. Claude Code's Explore subagent isn't doing random file traversal. It runs a structured, iterative search using three tools: Glob for finding files by pattern, Grep for searching file contents by keyword or symbol, and Read for loading files.

The process is iterative. It starts with search terms derived from the question, runs Glob and Grep, reads the files it finds, extracts new candidate symbols and filenames from those files, then searches again. This continues until coverage seems sufficient, with each discovered file added to the model's context as the session goes on.
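
For concreteness, here's a minimal sketch of that loop's shape. Everything in it (the toy repo, the grep and extract_terms helpers) is a hypothetical stand-in for the real tools and inference steps, not Claude Code's implementation:

```python
# Toy three-file repo standing in for a real codebase.
REPO = {
    "lib/auth.dart": "onLogin() { sync.connect(); }",
    "lib/sync.dart": "connect() { db.write(); }",
    "lib/db.dart":   "write() { /* persist */ }",
}

def grep(terms: set[str]) -> set[str]:
    """Stand-in for the Grep tool: files whose contents mention any term."""
    return {p for p, src in REPO.items() if any(t in src for t in terms)}

def extract_terms(contents: dict[str, str]) -> set[str]:
    """Stand-in for the inference step that mines read files for new symbols."""
    words = (w.strip("(){};./*") for src in contents.values()
             for w in src.replace(".", " ").split())
    return {w for w in words if w}

def explore(seed_terms: set[str], max_rounds: int = 10) -> dict[str, str]:
    context: dict[str, str] = {}        # every file read so far
    terms = seed_terms                   # derived from the question
    for _ in range(max_rounds):
        new_files = grep(terms) - context.keys()
        if not new_files:
            break                        # coverage looks sufficient
        for path in new_files:
            context[path] = REPO[path]   # Read: whole file into context
        terms = extract_terms(context)   # costs another inference round
    return context

print(sorted(explore({"sync"})))  # takes three rounds to reach lib/db.dart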

It's an effective, general approach. The cost is that every round of discovery requires an inference step to decide what to look for next, and every file has to be loaded in full to yield new search terms. The more indirect the connections between relevant components, the more rounds it takes.

What GitNexus does differently

GitNexus indexes a codebase as a property graph rather than a vector store or a flat file corpus. Functions, classes, interfaces, and modules become nodes. Calls, imports, inheritance, and interface implementations become directed, labeled edges. Execution flows, where indexed, become traversable paths.

This is the structural information that chunked vector indexes throw away. When you query the graph, you get back the nodes matching a concept and their full relationship neighborhood: not "text similar to this description" but "the exact files involved in this feature, and here's how they connect" [1].

GitNexus exposes this through an MCP server, so a language model can issue structured graph queries during a session instead of running iterative file searches.
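
As a sketch of what that looks like from the model's side: KuzuDB ships a Python API, so a single structured query can stand in for a whole Glob/Grep/Read round. The index path and the Symbol/CALLS schema below are my assumptions, not GitNexus's documented layout:

```python
import kuzu  # pip install kuzu

# Assumes a graph index already built on disk. Hypothetical schema:
# (:Symbol {name, file, kind}) nodes linked by [:CALLS] edges.
db = kuzu.Database("./code_graph")   # embedded: in-process, no server
conn = kuzu.Connection(db)

# One structured query replaces a round of Glob/Grep/Read: which symbols
# participate in the sync feature, and how do they connect?
result = conn.execute("""
    MATCH (a:Symbol)-[:CALLS]->(b:Symbol)
    WHERE contains(a.file, 'sync') OR contains(b.file, 'sync')
    RETURN DISTINCT a.file, a.name, b.file, b.name
""")
while result.has_next():
    print(result.get_next())  # e.g. ['lib/auth.dart', 'onLogin', 'lib/sync.dart', 'connect']
```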

The storage engine: KuzuDB

GitNexus's query performance comes largely from its storage engine: KuzuDB, an embedded columnar property graph database out of the University of Waterloo [1]. Think of it as DuckDB for graphs, an in-process engine that skips the network overhead of client-server databases like Neo4j.

How it stores graph structure

KuzuDB uses Compressed Sparse Row (CSR) adjacency indexes to store edges [1]. Node properties sit in columnar disk files. For each source node, the destination node IDs and edge properties are stored contiguously on disk. This means traversing edges takes time proportional to how many connections a node has, not the size of the whole graph, because you're not doing hash lookups or B-tree scans to find neighbors.

For code intelligence, this matters. A query like "find every caller of function f across the whole codebase" is a reverse-edge traversal plus a property filter. KuzuDB runs that without touching unrelated parts of the symbol table.
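
A toy version of the idea (not KuzuDB's actual on-disk format) makes the access pattern obvious: two flat arrays, and answering "who calls f?" is a contiguous slice whose cost depends only on f's in-degree:

```python
# Toy CSR adjacency index over 4 symbols (0..3), storing reverse
# "called-by" edges. offsets[i]:offsets[i+1] slices the neighbor array
# for node i, so listing a node's callers is an O(degree) contiguous
# scan, independent of total graph size. No hash lookups, no B-trees.
offsets = [0, 2, 3, 3, 5]   # len = num_nodes + 1
callers = [1, 3, 0, 0, 2]   # concatenated caller lists, node by node

def callers_of(node: int) -> list[int]:
    return callers[offsets[node]:offsets[node + 1]]

print(callers_of(0))  # [1, 3]: symbols 1 and 3 call symbol 0
print(callers_of(2))  # []: nothing calls symbol 2
print(callers_of(3))  # [0, 2]
```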

Factorized query execution

The more interesting piece is KuzuDB's query execution model. Traditional join execution builds a full intermediate result table at each join step. In graph queries with many-to-many relationships, those intermediate tables can get exponentially larger than the final answer, a known performance problem with graph workloads [1].

KuzuDB's multiway ASP-Join operator (Accumulate-Semijoin-Probe) processes joins column-by-column rather than table-by-table, keeping intermediate results in a compressed form [1, 2, 3]. Redundant copies are suppressed. On multi-hop queries, intermediate result sets compress by 50-100x compared to standard join execution. This is directly relevant to GitNexus: questions like "give me the full dependency subgraph of component X" or "trace all paths from entry point A to the persistence layer" are exactly the multi-hop, cyclic patterns this handles well.
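
A toy illustration of why this matters, in plain Python rather than KuzuDB's operator: for a two-hop pattern through a single hub symbol with 100 callers and 100 callees, the flat join materializes every combination, while a factorized form keeps the lists separate:

```python
# Two-hop pattern (a)-[:CALLS]->(hub)-[:CALLS]->(c) through one hub symbol.
callers = [f"a{i}" for i in range(100)]   # 100 edges into the hub
callees = [f"c{j}" for j in range(100)]   # 100 edges out of the hub

# Flat join execution materializes the full cross product: 10,000 tuples
# (30,000 stored values) just as an intermediate result.
flat = [(a, "hub", c) for a in callers for c in callees]
print(len(flat))  # 10000

# A factorized representation keeps the same information as a product of
# independent lists: 201 stored values, roughly 150x smaller here.
factorized = (callers, ["hub"], callees)
print(sum(len(part) for part in factorized))  # 201
```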

Vectorized batches and in-process execution

Query execution uses vectorized batches (2,048 values per batch, sized to fit L3 cache), which enables CPU-level SIMD optimization [4]. Multi-core work is scheduled with minimal coordination overhead.
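
As a loose analogy in Python (KuzuDB's internals are C++, and the details here are illustrative only), batch-at-a-time execution means operators work on fixed-size arrays of values instead of one tuple at a time:

```python
import numpy as np

BATCH = 2048  # values per batch; KuzuDB sizes batches to stay in L3 cache

def filter_batches(values: np.ndarray, threshold: int):
    # One operator call per fixed-size batch: the comparison below is a
    # single vectorized (SIMD-friendly) operation over 2,048 values,
    # not a per-tuple function call.
    for start in range(0, len(values), BATCH):
        batch = values[start:start + BATCH]
        yield batch[batch > threshold]

column = np.arange(10_000)
kept = np.concatenate(list(filter_batches(column, 9_000)))
print(kept.size)  # 999 values pass the filter
```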

The in-process part matters: KuzuDB runs inside the GitNexus MCP server with no network round-trips to a separate database. In an agent session where tool call latency accumulates, that's a real difference.

How GitNexus builds and uses the index

GitNexus runs a multi-phase pipeline over the repository: tree-sitter parses source files into abstract syntax trees, which get mapped into a property graph schema (files, symbols, typed relationship edges) and stored in KuzuDB, with precomputed metadata like community assignments and confidence scores attached. At query time, the MCP server translates a natural-language concept into a Cypher query against KuzuDB, returning the relevant symbol nodes and their relationship neighborhoods directly, without loading any source file into the model's context.
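
To make the storage half of that concrete, here's roughly what declaring and populating such a schema looks like through KuzuDB's Python API. The table names, properties, and values are my guesses at the shape, not GitNexus's real schema, and the tree-sitter parsing step is elided:

```python
import kuzu

db = kuzu.Database("./code_graph")
conn = kuzu.Connection(db)

# Hypothetical schema: one node table for symbols, typed edge tables for
# the relationships the parser extracted from the ASTs.
conn.execute("""
    CREATE NODE TABLE Symbol(
        name STRING, file STRING, kind STRING, confidence DOUBLE,
        PRIMARY KEY (name)
    )
""")
for rel in ("CALLS", "IMPORTS", "IMPLEMENTS"):
    conn.execute(f"CREATE REL TABLE {rel}(FROM Symbol TO Symbol)")

# As the indexer walks each parse tree, it emits nodes and edges:
conn.execute("CREATE (:Symbol {name: 'onLogin', file: 'lib/auth.dart', "
             "kind: 'function', confidence: 0.9})")
conn.execute("CREATE (:Symbol {name: 'connect', file: 'lib/sync.dart', "
             "kind: 'function', confidence: 0.9})")
conn.execute("""
    MATCH (a:Symbol {name: 'onLogin'}), (b:Symbol {name: 'connect'})
    CREATE (a)-[:CALLS]->(b)
""")
```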

What I tested

The codebase was a mid-sized Flutter project with an offline-first sync architecture: bidirectional real-time sync with a remote service, local SQLite persistence through a typed ORM layer, and Riverpod-managed state. The question was architectural, asking how the sync integration works end-to-end, from app startup through credential acquisition to upstream and downstream data exchange.

I ran two sessions on the same question. First, Claude Code's Explore subagent in very thorough mode. Second, GitNexus as the primary discovery layer, reading source files only after the graph query identified them as relevant.

Token consumption and cost were measured as deltas from usage snapshots taken immediately before and after each run, so the numbers reflect only that session's activity.

Results

| Metric | Claude Code Explore (baseline) | GitNexus |
|---|---|---|
| Total session cost | $0.52 | $0.39 |
| Total tokens consumed | 1,573,096 | 889,126 |
| Cache read tokens (% of total) | 1,386,972 (88%) | 858,166 (96%) |
| Tool calls issued | 38 | ~12 |
| Elapsed time | ~105 seconds | ~2-3 minutes |
| Source files read | ~10 | ~10 |

The most telling number is that both sessions read roughly the same source files. The token savings didn't come from reading less, they came from spending fewer inference steps figuring out what to read.

The Explore subagent's 38 tool calls are the iterative discovery process in action: Glob runs across naming patterns, Grep runs across keyword sets, Read operations on candidate files, then more searches based on what those files revealed. GitNexus issued a single gitnexus_query call that returned 9 relevant files directly, with no false positives to filter out. The remaining tool calls were targeted Read operations on those files.

The baseline session also spun up a secondary Haiku subagent for the indexing pass: 158k cache tokens created and 1M cache tokens read, which accounted for $0.32 of the $0.52 total. GitNexus didn't need a secondary model at all. That discovery work happened at index build time, not at query time.

Where the current index hits its ceiling

The GitNexus index at the time of this test had 338 symbols and 322 relationship edges, but 0 indexed execution flows.

Execution flows are the highest-value thing a graph index can provide for architectural reasoning. A complete flow for the sync initialization path, covering the auth event through connection establishment to active data exchange, would let the architectural question get answered in a single path traversal, without opening any source file. That's what KuzuDB's CSR adjacency structures and factorized join execution are built to handle efficiently [1].
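
Under the same hypothetical schema as earlier, plus the assumption that flows are walkable over CALLS edges, that single traversal would look something like this variable-length path query:

```python
import kuzu

conn = kuzu.Connection(kuzu.Database("./code_graph"))

# Hypothetical flow query: every call chain of up to 6 hops from the auth
# lifecycle handler into the persistence layer, answered from the graph
# alone, with no source files opened.
result = conn.execute("""
    MATCH (s:Symbol {name: 'onLogin'})-[:CALLS*1..6]->(e:Symbol)
    WHERE e.kind = 'persistence'
    RETURN DISTINCT e.name, e.file
""")
while result.has_next():
    print(result.get_next())
```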

Without flows, the graph acted as a precise file locator rather than a full architectural reasoning layer. It cut the iterative discovery overhead, but didn't cut the reading step. Synthesizing the end-to-end answer still required going through source files, just more targeted ones than the baseline.

With flows fully indexed, the 5-6 source file reads needed for architectural synthesis could be replaced by a single path query. Estimated cost for the same task would likely drop to the $0.25-0.30 range, with proportional reductions in time and reasoning load.

Checking index health before relying on flows

Before using GitNexus for execution flow queries, check that the index was built with process analysis enabled. Run gitnexus analyze and look at the reported process count. An index with 0 processes will still give you the file discovery benefit shown here, but won't support path traversal queries.
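
If you'd rather inspect the index directly, a count query does the same sanity check. This assumes a hypothetical Process node table for indexed flows; substitute whatever GitNexus actually names them:

```python
import kuzu

conn = kuzu.Connection(kuzu.Database("./code_graph"))

# Hypothetical: indexed flows stored as (:Process) nodes. Zero means file
# discovery still works, but path-traversal queries have nothing to walk.
result = conn.execute("MATCH (p:Process) RETURN count(p)")
print(result.get_next())  # [0] on the index tested in this post
```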

What this means in practice

Claude Code's native exploration is solid for general-purpose investigation. The limitation is structural: because it discovers relationships by searching iteratively, it pays an inference cost proportional to how many indirect connections exist between the relevant components. On complex architectures (distributed sync systems, multi-layered dependency graphs, event-driven flows), that cost compounds quickly.

Graph-indexed retrieval solves this at the index level. GitNexus precomputes the symbol graph using KuzuDB's CSR adjacency structures and factorized joins [1], so each query session starts with the relational structure already built. It doesn't reconstruct that structure from scratch through iterative tool calls.

The tradeoff is upfront cost: index build time, keeping the index current across commits, tooling integration. That's a fixed investment that pays off more the more frequently you ask architectural questions. For smaller codebases or single-file queries, the difference is marginal.

For the kinds of projects I work on (offline-first mobile apps, distributed sync architectures, multi-service backends with layered dependencies), the questions that come up most are relational: "what calls this," "what depends on this," "where does this data come from." Those are graph questions. The right retrieval layer for graph questions is a graph index.

References

[1] X. Feng, G. Jin, Z. Chen, C. Liu, and S. Salihoglu, "KÙZU: Graph Database Management System," in Proceedings of the Conference on Innovative Data Systems Research (CIDR), Amsterdam, Netherlands, 2023.
[2] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra, "Skew Strikes Back: New Developments in the Theory of Join Algorithms," ACM SIGMOD Record, vol. 42, no. 4, pp. 5-16, 2014.
[3] D. Olteanu and J. Závodný, "Factorised Representations of Query Results," in Proceedings of the 15th International Conference on Database Theory (ICDT), Berlin, Germany, 2012.
[4] D. J. Abadi, S. Madden, and N. Hachem, "Column-Stores vs. Row-Stores: How Different Are They Really?" in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, 2008.
