Why voyage-code-3 (Managed) and jina-embeddings-v2-base-code (Local) are the defaults

Cross-corpus benchmarks · 2026-04-19 · FG-38 Managed-tier + FG-40 Local-tier

Jump to: Managed default · Local default

Managed default: voyage-code-3 (FG-38)

TL;DR

On google/leveldb 1.23 (a canonical small C++ systems codebase, 131 files, 1,555 Function nodes), Voyage voyage-code-3 achieves MRR@10 0.793 against a 20-query intent-bucketed benchmark — outperforming OpenAI text-embedding-3-small (0.481) by +31.2 percentage points. Voyage also clears a false-confidence calibration gate (max adversarial top-1 sim 0.517 vs a 0.60 threshold).

As of forge-api 0.8.0, POST /v1/embed {"mode":"managed"} without an explicit provider field defaults to Voyage voyage-code-3. OpenAI remains selectable via ?provider=openai.

Per-provider overall (primary queries, n=20)

Provider MRR@10 Recall@10
Ollama qodo-embed-1-1.5b (Local) 0.408 0.70
OpenAI text-embedding-3-small 0.481 0.75
Voyage voyage-code-3 0.793 0.95
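For reference, the two metrics above can be computed as follows — a minimal sketch assuming each benchmark query has a single gold target (the helper names are illustrative):

```python
def mrr_at_10(ranked_ids: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank; rank counted within the top 10, else contributes 0."""
    total = 0.0
    for hits, target in zip(ranked_ids, gold):
        for rank, doc_id in enumerate(hits[:10], start=1):
            if doc_id == target:
                total += 1.0 / rank
                break
    return total / len(gold)

def recall_at_10(ranked_ids: list[list[str]], gold: list[str]) -> float:
    """Fraction of queries whose gold target appears anywhere in the top 10."""
    found = sum(g in hits[:10] for hits, g in zip(ranked_ids, gold))
    return found / len(gold)

# Two-query toy run: gold "a" at rank 1, gold "b" at rank 2.
print(mrr_at_10([["a", "b"], ["x", "b"]], ["a", "b"]))     # 0.75
print(recall_at_10([["a", "b"], ["x", "b"]], ["a", "b"]))  # 1.0
```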

Per-category MRR@10

The 20 primary queries split across five LevelDB-specific intent buckets. Voyage wins all five categories outright.

Category (n) Ollama OpenAI Voyage
memtable (4) 0.050 0.396 0.875
sstable (5) 0.233 0.600 0.700
wal (4) 0.633 0.625 0.800
compaction (4) 0.542 0.425 0.833
cache (3) 0.700 0.278 0.778

Cross-corpus consistency

An earlier internal benchmark on a different C++ codebase produced the same provider ordering: voyage 0.752, openai 0.586, ollama-qodo 0.232. Two independent corpora agreeing on the ranking rules out a corpus artifact — this is a provider-quality signal, not a LevelDB quirk.

Methodology

Adversarial calibration

The three adversarial queries (“parse HTTP headers”, “train a neural network”, “run Raft consensus”) probe whether a provider will return confident matches for tasks LevelDB does not perform. A provider passes when the top-1 similarity of its best-matching hit stays below the 0.60 gate; Voyage’s maximum across the three queries was 0.517, and per-candidate figures for the Local tier are listed in the FG-40 table below.
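The gate itself is mechanical: embed the adversarial queries, take each query’s best cosine similarity against the corpus, and require the maximum to stay under 0.60. A minimal sketch (the function names are illustrative, and the embeddings are assumed to be plain row vectors):

```python
import numpy as np

GATE = 0.60  # calibration threshold from this benchmark

def adversarial_max_top1(query_vecs: np.ndarray, corpus_vecs: np.ndarray) -> float:
    """Max over adversarial queries of the best cosine similarity in the corpus."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T                     # (n_queries, n_corpus) cosine matrix
    return float(sims.max(axis=1).max())

def passes_gate(query_vecs: np.ndarray, corpus_vecs: np.ndarray) -> bool:
    return adversarial_max_top1(query_vecs, corpus_vecs) < GATE

# Toy 2-dim example: best cosine is cos(45°) ≈ 0.707, which fails the gate.
q = np.array([[1.0, 0.0]])
c = np.array([[0.0, 1.0], [1.0, 1.0]])
print(adversarial_max_top1(q, c), passes_gate(q, c))
```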

Local default: jina-embeddings-v2-base-code (FG-40)

On the same LevelDB corpus, under the inverse constraint (Local-mode Ollama only, zero external egress), we swept three candidates against the incumbent qodo-embed-1-1.5b: mxbai-embed-large (334M, 1024-dim), nomic-embed-text:v1.5 (137M, 768-dim), and jina-embeddings-v2-base-code (160M, 768-dim, Q4 GGUF), re-measuring qodo itself for an apples-to-apples baseline. Each candidate ran on a single GTX 1650 Ti (4 GB VRAM) on tower-l1’s embedder host over WireGuard, with the HNSW index wiped between sweeps.

Candidate MRR@10 Recall@10 Adv max-top1
Ollama qodo-embed-1-1.5b (incumbent) 0.408 0.70 0.959
Ollama mxbai-embed-large 0.434 0.85 0.633
Ollama nomic-embed-text:v1.5 0.113 0.30 0.606
Ollama jina-embeddings-v2-base-code 0.444 0.80 0.408

Winner-selection reasoning

Jina improves on qodo along all three axes: MRR@10 (+0.036), Recall@10 (+0.10), and max adversarial top-1 (−0.551, cut to less than half). The MRR gain alone is modest; under a pure “beat the incumbent by 5 points of MRR” rule this candidate would fail to promote. The adversarial calibration win is what decides the flip. Qodo’s 0.959 adversarial top-1 on out-of-scope queries is a silent-failure mode: the provider confidently returns a nearest neighbour for “parse HTTP headers” in a LevelDB corpus that has no such function. Jina’s 0.408 lands in the OpenAI/Voyage calibrated range: the model knows when it has no answer. That behaviour, not the MRR margin, is the substantive correctness improvement the Local tier needed.

Qodo also has a memtable category collapse: per-category MRR@10 of 0.050, essentially no better than random on memtable intent queries. Jina covers that category at 0.411, as does mxbai. Aggregate MRR@10 masks this: qodo compensates with wal (0.633) and cache (0.700) wins that happen to balance the memtable near-zero. A Local default with a quiet category-level blind spot is a product risk; one that degrades gracefully across categories is not.
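The masking effect is easy to surface mechanically: report the weakest bucket alongside the aggregate. A minimal sketch using qodo’s per-category figures from the table above (the helper name is illustrative):

```python
def worst_category_mrr(per_category: dict[str, float]) -> tuple[str, float]:
    """Weakest intent bucket -- aggregate MRR can hide a collapse here."""
    cat = min(per_category, key=per_category.get)
    return cat, per_category[cat]

# qodo's per-category MRR@10 from the Ollama column of the category table.
qodo = {"memtable": 0.050, "sstable": 0.233, "wal": 0.633,
        "compaction": 0.542, "cache": 0.700}
print(worst_category_mrr(qodo))  # ('memtable', 0.05)
```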

nomic-embed-text:v1.5 led on no axis and collapsed on three categories (memtable 0.000, sstable 0.075, compaction 0.000), so it was out of contention. mxbai-embed-large was the runner-up: also better than qodo on all three axes, but with a weaker adversarial delta (−0.326 vs jina’s −0.551) and a weaker memtable and cache per-category spread. Jina wins jointly on calibration and category balance.

As of forge-api 0.8.2, POST /v1/embed {"mode":"local"} defaults to ollama/hf.co/second-state/jina-embeddings-v2-base-code-GGUF:Q4_0 at 768-dim. Existing tenants with HNSW sidecars from the previous 1536-dim default will need a wipe + re-embed to pick up the new dim — embedding-dimension validation (FG-34 EM2) refuses to mix dims in one sidecar by design.
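The wipe-and-re-embed requirement follows from the dimension check. A minimal sketch of the refuse-to-mix behaviour — class and exception names here are hypothetical stand-ins, not the actual FG-34 EM2 implementation:

```python
class DimensionMismatch(ValueError):
    """Raised when a vector's dim does not match the sidecar's fixed dim."""

class HnswSidecar:
    def __init__(self, dim: int):
        self.dim = dim       # fixed at creation; 1536 under the old default
        self.vectors = []

    def insert(self, vec: list[float]) -> None:
        # Mixing dims in one index is refused by design: the only path to a
        # new embedding model with a different dim is wipe + re-embed.
        if len(vec) != self.dim:
            raise DimensionMismatch(
                f"sidecar is {self.dim}-dim; got {len(vec)}-dim vector")
        self.vectors.append(vec)

sidecar = HnswSidecar(1536)
sidecar.insert([0.0] * 1536)   # old-default vector: accepted
```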