Building a Claude Code Multi-Agent System with TEI + Qdrant
TL;DR
I vectorized code with TEI (Text Embeddings Inference), stored it in Qdrant, and built a system where 9 Claude Code agents search code instead of reading it.
1. Why I Built This
Limitations of Claude Code
Claude Code is powerful, but when it needs to understand the overall codebase, it opens and reads files one by one.
Read src/auth.ts → 200 lines consumed
Read src/middleware.ts → 150 lines consumed
Read src/utils/token.ts → 100 lines consumed
...
There are unavoidable cases where the entire codebase must be read, such as code quality reviews or security audits, and the token consumption adds up quickly.
I figured that agents handling code reviews or security checks could understand the code without the Read tool, reducing token costs. So I gave it a try.
2. Architecture
| Layer | Components |
|---|---|
| Claude Code | 9 agents (Morpheus, Neo, Seraph, ...) |
| ↓ MCP Protocol (stdio) | |
| mcp-code-rag | Rust binary. Indexing/search bridge |
| ↓ HTTP / gRPC | |
| TEI (e5-base) | Text → vector conversion |
| Qdrant | Vector storage + similarity search |
Components
| Component | Role | Location |
|---|---|---|
| mcp-code-rag | MCP server. Handles indexing/search requests | Local (Rust binary) |
| TEI (e5-base) | Text → 768-dimensional vector conversion | Docker |
| Qdrant | Vector storage + cosine similarity search | Docker |
| Claude Code Agents | Search via MCP tools when exploring code | Local |
3. Why This Stack
The goal was not to find the latest embedding model or the optimal combination. The goal was to test whether a RAG pipeline actually works on a GPU-less NAS. So I chose a lightweight, easy-to-self-host stack.
TEI + e5-base
- TEI: the official Hugging Face embedding server. One Docker command and it's deployed
- e5-base: 768 dimensions. Practical enough at ~50ms/query on CPU
- Supports asymmetric search with `query:` / `passage:` prefixes, which is advantageous for matching short queries against long code chunks
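As a minimal sketch of that asymmetric prefixing (the helper names here are illustrative, not the actual mcp-code-rag API):

```rust
/// e5-family models expect asymmetric prefixes before embedding:
/// short search queries get "query: " and indexed code chunks get
/// "passage: ". Both are then compared by cosine similarity.
/// Illustrative helpers, not the real mcp-code-rag functions.
fn as_query(text: &str) -> String {
    format!("query: {}", text.trim())
}

fn as_passage(chunk: &str) -> String {
    format!("passage: {}", chunk.trim())
}
```

Embedding "query: authentication logic" against chunks stored as "passage: fn verify_token(...) ..." is what lets a short natural-language question land on a long code block.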
Why the MCP Server Is Written in Rust
- Single binary: deploy one executable. Zero runtime dependencies
- Memory: ~15MB resident. Python servers are 200MB+
- Claude Code MCP: stdio-based JSON-RPC. Rust's serde + tokio is ideal
- External communication: async HTTP + gRPC handling
4. Core Flows
4.1 Indexing (Code → Vectors)
```
Project directory
    ↓
[walkdir] File traversal (max depth: 10)
    ↓
[filter] 52 directories + 60 extensions excluded
         (node_modules, target, .git, images, binaries...)
    ↓
[incremental check] Compare file modification times
                    → process only changed files
    ↓
[chunk] Split into 1500-char units (300-char overlap)
    ↓
[TEI] Each chunk → "passage: {content}" → 768-dim vector
    ↓
[Qdrant] UUID v5 based upsert (batches of 100)
         Collection: "code-rag-{project_name}"
```

Why 1500-char chunking?
- Too large → search precision drops (noise)
- Too small → a single function splits across multiple chunks, losing context
- 1500 chars + 300-char overlap → prevents splitting at function boundaries
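The chunking step can be sketched as follows. This version splits purely on character counts; the real indexer may additionally respect line or function boundaries:

```rust
/// Split text into fixed-size chunks with overlap, as in the indexing
/// pipeline above (1500 chars with a 300-char overlap). Each new chunk
/// starts `overlap` characters before the previous one ended, so a
/// function cut at a chunk boundary still appears whole in one chunk.
fn chunk(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size);
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap; // step back to create the overlap window
    }
    chunks
}
```

With `size = 1500` and `overlap = 300`, a 3000-char file yields three chunks covering 0..1500, 1200..2700, and 2400..3000.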
4.2 Search Example: Code Review Agent
```
[code_review_agent] Start security audit
    ↓
search_codebase("authentication logic", project_name="my-app")
    ↓
Search results received (Top 10, similarity ≥ 0.5 only)
    ↓
search_codebase("SQL query generation", project_name="my-app")
    ↓
search_codebase("user input validation", project_name="my-app")
    ↓
Generate security report from collected code snippets
```

Result: 0 Reads, 3 searches to complete the report. It analyzes security vulnerabilities across the codebase without opening a single file.
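The "Top 10, similarity ≥ 0.5" filtering above can be sketched like this; the `Hit` struct and function name are illustrative, not the actual mcp-code-rag types:

```rust
/// Post-filter search hits: keep only results at or above the
/// similarity threshold, sorted best-first, truncated to a limit.
/// Mirrors the "Top 10, similarity >= 0.5" behavior described above.
#[derive(Debug, Clone)]
struct Hit {
    path: String,
    score: f32, // cosine similarity from Qdrant
}

fn top_hits(mut hits: Vec<Hit>, min_score: f32, limit: usize) -> Vec<Hit> {
    hits.retain(|h| h.score >= min_score);
    // Scores are finite floats, so total_cmp gives a safe total order.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits.truncate(limit);
    hits
}
```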
4.3 Auto-Indexing
```
Agent: search_codebase("error handling", project_name="my-app")
    ↓
[Collection check] Does "my-app" exist?
    ↓
    ├─ No  → trigger auto-indexing → search after indexing completes
    │
    └─ Yes → search immediately
```

Agents don't need to call indexing directly. The index is created automatically on the first search.
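The collection check reduces to a simple lookup. In this sketch an in-memory set stands in for the real Qdrant collection listing, and `ensure_indexed` is an illustrative name:

```rust
use std::collections::HashSet;

/// Auto-indexing decision: if the project's collection is missing,
/// trigger indexing before the first search; otherwise search at once.
/// Collection naming follows the "code-rag-{project_name}" scheme
/// from the indexing section. Returns true if indexing was triggered.
fn ensure_indexed(existing: &mut HashSet<String>, project: &str) -> bool {
    let collection = format!("code-rag-{}", project);
    if existing.contains(&collection) {
        false // collection exists: search immediately
    } else {
        // The real server would run the full indexing pipeline here.
        existing.insert(collection);
        true
    }
}
```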
5. Real-World Numbers
| Metric | Value |
|---|---|
| TEI embedding speed | ~50ms/query (CPU) |
| Indexing (1000 files) | ~3 min (incremental: changed files only) |
| Search response | ~200ms (cache miss), ~5ms (cache hit) |
| mcp-code-rag memory | ~15MB |
| Vector dimensions | 768 (e5-base) |
| Cache TTL | 5 min |
| Max indexable files | 5,000 |
| Chunk size | 1,500 chars (300-char overlap) |
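The ~5ms cache hits come from a short-lived response cache with a 5-minute TTL. A sketch of such a cache (struct and method names are illustrative, not the actual internals):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Query-response cache with a fixed TTL, matching the "5 min TTL"
/// figure above: entries older than the TTL are treated as misses.
struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, Vec<String>)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns cached results only while the entry is still fresh.
    fn get(&self, query: &str) -> Option<&Vec<String>> {
        self.entries
            .get(query)
            .filter(|entry| entry.0.elapsed() < self.ttl)
            .map(|entry| &entry.1)
    }

    fn put(&mut self, query: String, results: Vec<String>) {
        self.entries.insert(query, (Instant::now(), results));
    }
}
```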
6. MCP Tools
mcp-code-rag provides 2 MCP tools to Claude Code.
| Tool | Description |
|---|---|
| search_codebase | Search code with natural language queries. Pass a project name to auto-switch collections + auto-index if not yet indexed |
| refresh_index | Manually index by specifying a project path. Incremental β only processes changed files |
By enforcing a rule that agents call search_codebase first when exploring code, you can prevent the token waste of traversing files with the Read tool.
Strengths
- Index once, search instantly thereafter: indexing is slow only the first time; subsequent runs are incremental and process only changed files. A search itself takes ~200ms
- Zero token consumption: no files are opened with Read, so the context window isn't spent on searches
- Code exploration via natural language: find relevant code immediately with queries like "authentication handling" or "SQL queries"
- Strong for codebase-wide tasks: well-suited for broad pattern analysis like security audits and code reviews
Limitations
Questions that demand precise semantics, like "Is error handling properly implemented?", cannot be answered by vector similarity alone. For such cases you need a rerank model, or a hybrid approach that combines keyword search (BM25) with vector search.
| Use Case | Current System | Needs Addition |
|---|---|---|
| Understanding overall code structure | Suitable | None |
| Security pattern scanning | Suitable | None |
| Precise logic search | Insufficient | Rerank, hybrid search |
| Error handling tracing | Insufficient | Rerank + AST analysis |
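If hybrid search were added, one common way to merge a BM25 ranking with a vector-search ranking is Reciprocal Rank Fusion (RRF). This is a sketch of a possible extension, not part of the current system:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each document scores the sum over ranked
/// lists of 1 / (k + rank), with 1-based ranks. Documents that appear
/// high in both the keyword and the vector ranking float to the top.
/// k (often 60) damps the influence of any single list.
fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in rankings {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1));
    fused
}
```

A document ranked first by both BM25 and vector search beats one ranked first by only a single list, which is exactly the behavior a hybrid setup wants.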
7. Getting Started
Prerequisites
- Docker environment (for running Qdrant, TEI)
- Rust (for building mcp-code-rag)
- Claude Code (Pro plan)
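As a sketch, registering the server in a project-level `.mcp.json` might look like the following; the binary path and server name are placeholders, not values from the project:

```json
{
  "mcpServers": {
    "code-rag": {
      "command": "/usr/local/bin/mcp-code-rag",
      "args": []
    }
  }
}
```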
Once everything is installed and the MCP server is registered with Claude Code, it fetches only the code you need via semantic search, with zero token waste.
Log
- 2026-01-29: created