voyage-code-3 (Managed) and jina-embeddings-v2-base-code (Local) are the defaults
Cross-corpus benchmarks · 2026-04-19 · FG-38 Managed-tier + FG-40 Local-tier
Jump to: Managed default · Local default
On google/leveldb 1.23 (a canonical small C++ systems codebase, 131 files, 1,555 Function nodes), Voyage voyage-code-3 achieves MRR@10 0.793 against a 20-query intent-bucketed benchmark — outperforming OpenAI text-embedding-3-small (0.481) by +31.2 percentage points. Voyage also clears a false-confidence calibration gate (max adversarial top-1 sim 0.517 vs a 0.60 threshold).
As of forge-api 0.8.0,
POST /v1/embed {"mode":"managed"} without an explicit
provider field defaults to Voyage voyage-code-3. OpenAI remains
selectable via ?provider=openai.
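The default-selection behaviour described above can be sketched as a small resolver. This is our reading of the 0.8.0 semantics, not forge-api source; `resolve_managed_provider` and the provider table are hypothetical names.

```python
# Sketch of the forge-api 0.8.0 managed-tier default, as described above:
# {"mode": "managed"} with no provider field resolves to Voyage voyage-code-3,
# and an explicit ?provider=openai query parameter overrides it.
# resolve_managed_provider is a hypothetical helper, not forge-api code.

MANAGED_DEFAULT = ("voyage", "voyage-code-3")
MANAGED_PROVIDERS = {
    "voyage": ("voyage", "voyage-code-3"),
    "openai": ("openai", "text-embedding-3-small"),
}

def resolve_managed_provider(body: dict, query: dict) -> tuple[str, str]:
    """Resolve (provider, model) for POST /v1/embed in managed mode."""
    if body.get("mode") != "managed":
        raise ValueError("managed resolver called for non-managed mode")
    override = query.get("provider")
    if override is not None:
        return MANAGED_PROVIDERS[override]  # explicit ?provider= wins
    return MANAGED_DEFAULT                  # 0.8.0 default: Voyage

print(resolve_managed_provider({"mode": "managed"}, {}))
print(resolve_managed_provider({"mode": "managed"}, {"provider": "openai"}))
```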
| Provider | MRR@10 | Recall@10 |
|---|---|---|
| Ollama qodo-embed-1-1.5b (Local) | 0.408 | 0.70 |
| OpenAI text-embedding-3-small | 0.481 | 0.75 |
| Voyage voyage-code-3 | 0.793 | 0.95 |
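The two headline metrics are the standard IR definitions; the harness internals are not shown in this note, so the following is a sketch of how they are computed per query.

```python
# MRR@10: reciprocal rank of the first relevant hit within the top 10
# (0 if none). Recall@10: fraction of queries with any relevant hit in
# the top 10. Standard IR definitions, averaged over the query set.

def mrr_at_10(ranked_relevance: list[list[bool]]) -> float:
    """ranked_relevance[q][i] is True if hit i (0-based) answers query q."""
    total = 0.0
    for hits in ranked_relevance:
        for rank, relevant in enumerate(hits[:10], start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def recall_at_10(ranked_relevance: list[list[bool]]) -> float:
    return sum(any(hits[:10]) for hits in ranked_relevance) / len(ranked_relevance)

# Two toy queries: first relevant hit at rank 1, then at rank 4.
runs = [[True, False], [False, False, False, True]]
print(mrr_at_10(runs))     # (1/1 + 1/4) / 2 = 0.625
print(recall_at_10(runs))  # 1.0
```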
The 20 primary queries are split across five LevelDB-specific intent buckets. Voyage wins all five categories outright.
| Category (n) | Ollama | OpenAI | Voyage |
|---|---|---|---|
| memtable (4) | 0.050 | 0.396 | 0.875 |
| sstable (5) | 0.233 | 0.600 | 0.700 |
| wal (4) | 0.633 | 0.625 | 0.800 |
| compaction (4) | 0.542 | 0.425 | 0.833 |
| cache (3) | 0.700 | 0.278 | 0.778 |
An earlier internal benchmark on a different C++ codebase produced the same provider ordering: voyage 0.752, openai 0.586, ollama-qodo 0.232. Two independent corpora agreeing on the ranking makes a corpus artifact unlikely; this is a provider-quality signal, not a LevelDB quirk.
google/leveldb at tag
1.23, commit 99b3c03b (released 2021-02-23,
BSD-3 licensed). Ingested by DKE-Forge into an isolated benchmark
tenant: 131 files, 4,360 graph nodes, 12,528 edges.
The three adversarial queries (“parse HTTP headers”, “train a neural network”, “run Raft consensus”) probe whether a provider will return confident matches for tasks LevelDB does not perform. The calibration metric is the top-1 similarity of the best-matching hit: a well-calibrated provider keeps it low on queries the corpus cannot answer, and the Managed gate requires the maximum across the three to stay under 0.60.
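The gate itself reduces to a threshold check on the worst-case adversarial similarity. A minimal sketch, using the max-top1 values reported in this note (the gate function name is ours):

```python
# Calibration gate as described above: pass the provider only if its
# maximum top-1 similarity across the out-of-scope queries stays under
# 0.60. Values below are the per-provider maxima reported in this note.

def passes_calibration_gate(max_adversarial_top1: float,
                            threshold: float = 0.60) -> bool:
    """True if the provider avoids confident matches on out-of-scope queries."""
    return max_adversarial_top1 < threshold

max_top1 = {
    "voyage-code-3": 0.517,                 # Managed winner: passes
    "qodo-embed-1-1.5b": 0.959,             # Local incumbent: fails badly
    "jina-embeddings-v2-base-code": 0.408,  # Local challenger: passes
}
for name, sim in max_top1.items():
    print(name, passes_calibration_gate(sim))
```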
On the same LevelDB corpus, under the inverse constraint (Local-mode
Ollama only, zero external egress), we swept three challengers against
the incumbent qodo-embed-1-1.5b:
mxbai-embed-large (334M, 1024-dim),
nomic-embed-text:v1.5 (137M, 768-dim), and
jina-embeddings-v2-base-code (160M, 768-dim, Q4 GGUF),
re-measuring qodo itself for an apples-to-apples baseline. Each
candidate runs on a single GTX 1650 Ti (4 GB VRAM) on tower-l1’s
embedder host over WireGuard, with the HNSW index wiped between sweeps.
| Candidate | MRR@10 | Recall@10 | Adv max-top1 |
|---|---|---|---|
| Ollama qodo-embed-1-1.5b (incumbent) | 0.408 | 0.70 | 0.959 |
| Ollama mxbai-embed-large | 0.434 | 0.85 | 0.633 |
| Ollama nomic-embed-text:v1.5 | 0.113 | 0.30 | 0.606 |
| Ollama jina-embeddings-v2-base-code | 0.444 | 0.80 | 0.408 |
Jina improves on qodo along all three axes: MRR@10 (+0.036), Recall@10 (+0.10), and max adversarial top-1 (−0.551, cut by more than half). The MRR gain alone is modest: under a pure “beat the incumbent by 5 MRR points” rule this candidate would fail to promote. The adversarial calibration win is what decides the flip. Qodo’s 0.959 adversarial top-1 on out-of-scope queries is a silent-failure mode: the provider confidently returns a nearest neighbour for “parse HTTP headers” in a LevelDB corpus that has no such function. Jina’s 0.408 lands in the OpenAI/Voyage calibrated range; the model knows when it has no answer. That behaviour, not the MRR margin, is the substantive correctness improvement the Local tier needed.
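The flip logic can be encoded as a small decision rule: improve on the incumbent along every axis, and land under the adversarial ceiling. The rule encoding (all-axes improvement plus a 0.60 ceiling, matching the Managed-tier gate) is our reading of this section, not forge code.

```python
# Promotion decision sketch: jina promotes not on MRR margin but because
# it beats the incumbent on all three axes AND is calibrated. Numbers are
# from the sweep table above; the 0.60 ceiling is an assumption borrowed
# from the Managed-tier calibration gate.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mrr: float
    recall: float
    adv_max_top1: float  # max top-1 sim on out-of-scope queries; lower is better

INCUMBENT = Candidate("qodo-embed-1-1.5b", 0.408, 0.70, 0.959)

def should_promote(c: Candidate, incumbent: Candidate = INCUMBENT,
                   adv_ceiling: float = 0.60) -> bool:
    better_everywhere = (c.mrr > incumbent.mrr and
                         c.recall > incumbent.recall and
                         c.adv_max_top1 < incumbent.adv_max_top1)
    return better_everywhere and c.adv_max_top1 < adv_ceiling

jina = Candidate("jina-embeddings-v2-base-code", 0.444, 0.80, 0.408)
mxbai = Candidate("mxbai-embed-large", 0.434, 0.85, 0.633)
print(should_promote(jina))   # True: better on all axes, calibrated
print(should_promote(mxbai))  # False: 0.633 exceeds the 0.60 ceiling
```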
Qodo also collapses in the memtable category: per-category
MRR@10 of 0.050, essentially no better than random on
memtable intent queries. Jina covers that category at 0.411, and mxbai at
0.411 as well. Aggregate MRR@10 masks this: qodo compensates with
wal (0.633) and cache (0.700) wins that happen to
balance the near-zero memtable bucket. A Local default with a quiet
category-level blind spot is a product risk; a Local default that
degrades gracefully across categories is not.
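The masking claim checks out arithmetically: qodo's per-category scores, weighted by query count, reproduce its 0.408 aggregate despite the memtable hole. All numbers are from the tables above.

```python
# Worked check: the query-count-weighted mean of qodo's per-category
# MRR@10 values reproduces its aggregate 0.408, even with memtable at
# 0.050. This is why the aggregate alone hides the blind spot.

per_category = {  # category: (n_queries, qodo MRR@10)
    "memtable":   (4, 0.050),
    "sstable":    (5, 0.233),
    "wal":        (4, 0.633),
    "compaction": (4, 0.542),
    "cache":      (3, 0.700),
}

n_total = sum(n for n, _ in per_category.values())
aggregate = sum(n * mrr for n, mrr in per_category.values()) / n_total
print(round(aggregate, 3))  # 0.408: the wal/cache wins cancel the memtable hole
```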
nomic-embed-text:v1.5 led on no axis and collapsed in
three categories (memtable 0.000, sstable 0.075, compaction 0.000). Out of contention.
mxbai-embed-large was the runner-up: it too beats qodo on all
three axes, but with a weaker adversarial delta (−0.326 vs
jina’s −0.551) and a weaker memtable and cache per-category
spread. Jina wins jointly on calibration and category balance.
As of forge-api 0.8.2,
POST /v1/embed {"mode":"local"} defaults to
ollama/hf.co/second-state/jina-embeddings-v2-base-code-GGUF:Q4_0
at 768-dim. Existing tenants with HNSW sidecars from the previous
1536-dim default will need a wipe + re-embed to pick up the new dim —
embedding-dimension validation (FG-34 EM2) refuses to mix dims in one
sidecar by design.
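The refusal behaviour that forces the wipe + re-embed can be sketched as follows. This is a minimal model of what a dimension guard like FG-34 EM2 does by design; the class and exception names are ours, not DKE-Forge code.

```python
# Sketch of an embedding-dimension guard: the sidecar pins the dimension
# of its first vector and refuses any later vector of a different
# dimension, so switching the Local default from 1536-dim qodo to
# 768-dim jina requires wiping and re-embedding the index.

class DimensionMismatch(ValueError):
    pass

class HnswSidecar:
    def __init__(self) -> None:
        self.dim: int | None = None
        self.vectors: list[list[float]] = []

    def insert(self, vec: list[float]) -> None:
        if self.dim is None:
            self.dim = len(vec)          # first insert pins the dimension
        elif len(vec) != self.dim:
            raise DimensionMismatch(
                f"sidecar is {self.dim}-dim; refusing {len(vec)}-dim vector")
        self.vectors.append(vec)

    def wipe(self) -> None:              # the wipe + re-embed path
        self.dim, self.vectors = None, []

sidecar = HnswSidecar()
sidecar.insert([0.0] * 1536)             # vectors from the old default
try:
    sidecar.insert([0.0] * 768)          # new 768-dim default: rejected
except DimensionMismatch as e:
    print(e)
sidecar.wipe()
sidecar.insert([0.0] * 768)              # after the wipe, 768-dim is accepted
```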