Why voyage-code-3 (Managed) and jina-embeddings-v2-base-code (Local) are the defaults

Cross-corpus benchmarks · 2026-04-19 · FG-38 Managed-tier + FG-40 Local-tier

Jump to: Managed default · Local default

Managed default: voyage-code-3 (FG-38)

TL;DR

On google/leveldb 1.23 (a canonical small C++ systems codebase, 131 files, 1,555 Function nodes), Voyage voyage-code-3 achieves MRR@10 0.793 against a 20-query intent-bucketed benchmark — outperforming OpenAI text-embedding-3-small (0.481) by +31.2 percentage points. Voyage also clears a false-confidence calibration gate (max adversarial top-1 sim 0.517 vs a 0.60 threshold).

As of forge-api 0.8.0, POST /v1/embed {"mode":"managed"} without an explicit provider field defaults to Voyage voyage-code-3. OpenAI remains selectable via ?provider=openai.

Per-provider overall (primary queries, n=20)

Provider MRR@10 Recall@10
Ollama qodo-embed-1-1.5b (Local) 0.408 0.70
OpenAI text-embedding-3-small 0.481 0.75
Voyage voyage-code-3 0.793 0.95
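For reference, the two metrics above can be computed as follows — a minimal sketch assuming each benchmark query has a single gold target (the helper names are illustrative):

```python
def mrr_at_10(ranked_ids: list[list[str]], gold: list[str]) -> float:
    """Mean reciprocal rank; rank counted within the top 10, else contributes 0."""
    total = 0.0
    for hits, target in zip(ranked_ids, gold):
        for rank, doc_id in enumerate(hits[:10], start=1):
            if doc_id == target:
                total += 1.0 / rank
                break
    return total / len(gold)

def recall_at_10(ranked_ids: list[list[str]], gold: list[str]) -> float:
    """Fraction of queries whose gold target appears anywhere in the top 10."""
    found = sum(g in hits[:10] for hits, g in zip(ranked_ids, gold))
    return found / len(gold)

# Two-query toy run: gold "a" at rank 1, gold "b" at rank 2.
print(mrr_at_10([["a", "b"], ["x", "b"]], ["a", "b"]))     # 0.75
print(recall_at_10([["a", "b"], ["x", "b"]], ["a", "b"]))  # 1.0
```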

Per-category MRR@10

The 20 primary queries split across five LevelDB-specific intent buckets. Voyage wins all five categories outright.

Category (n) Ollama OpenAI Voyage
memtable (4) 0.050 0.396 0.875
sstable (5) 0.233 0.600 0.700
wal (4) 0.633 0.625 0.800
compaction (4) 0.542 0.425 0.833
cache (3) 0.700 0.278 0.778

Cross-corpus consistency

An earlier internal benchmark on a different C++ codebase produced the same provider ordering: voyage 0.752, openai 0.586, ollama-qodo 0.232. Two independent corpora agreeing on the ranking rules out a corpus artifact — this is a provider-quality signal, not a LevelDB quirk.

Methodology

Adversarial calibration

The three adversarial queries (“parse HTTP headers”, “train a neural network”, “run Raft consensus”) probe whether a provider will return confident matches for tasks LevelDB does not perform. A provider passes when the top-1 similarity of its best-matching hit stays below the 0.60 gate; Voyage’s maximum across the three queries was 0.517, and per-candidate figures for the Local tier are listed in the FG-40 table below.
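The gate itself is mechanical: embed the adversarial queries, take each query’s best cosine similarity against the corpus, and require the maximum to stay under 0.60. A minimal sketch (the function names are illustrative, and the embeddings are assumed to be plain row vectors):

```python
import numpy as np

GATE = 0.60  # calibration threshold from this benchmark

def adversarial_max_top1(query_vecs: np.ndarray, corpus_vecs: np.ndarray) -> float:
    """Max over adversarial queries of the best cosine similarity in the corpus."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T                     # (n_queries, n_corpus) cosine matrix
    return float(sims.max(axis=1).max())

def passes_gate(query_vecs: np.ndarray, corpus_vecs: np.ndarray) -> bool:
    return adversarial_max_top1(query_vecs, corpus_vecs) < GATE

# Toy 2-dim example: best cosine is cos(45°) ≈ 0.707, which fails the gate.
q = np.array([[1.0, 0.0]])
c = np.array([[0.0, 1.0], [1.0, 1.0]])
print(adversarial_max_top1(q, c), passes_gate(q, c))
```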

Local default: jina-embeddings-v2-base-code (FG-40)

On the same LevelDB corpus, under the inverse constraint (Local-mode Ollama only, zero external egress), we swept three candidates against the incumbent qodo-embed-1-1.5b: mxbai-embed-large (334M, 1024-dim), nomic-embed-text:v1.5 (137M, 768-dim), and jina-embeddings-v2-base-code (160M, 768-dim, Q4 GGUF), re-measuring qodo itself for an apples-to-apples baseline. Each candidate ran on a single GTX 1650 Ti (4 GB VRAM) on tower-l1’s embedder host over WireGuard, with the HNSW index wiped between sweeps.

Candidate MRR@10 Recall@10 Adv max-top1
Ollama qodo-embed-1-1.5b (incumbent) 0.408 0.70 0.959
Ollama mxbai-embed-large 0.434 0.85 0.633
Ollama nomic-embed-text:v1.5 0.113 0.30 0.606
Ollama jina-embeddings-v2-base-code 0.444 0.80 0.408

Winner-selection reasoning

Jina improves on qodo along all three axes: MRR@10 (+0.036), Recall@10 (+0.10), and max adversarial top-1 (−0.551, cut to less than half). The MRR gain alone is modest; under a pure “beat the incumbent by 5 points of MRR” rule this candidate would fail to promote. The adversarial calibration win is what decides the flip. Qodo’s 0.959 adversarial top-1 on out-of-scope queries is a silent-failure mode: the provider confidently returns a nearest neighbour for “parse HTTP headers” in a LevelDB corpus that has no such function. Jina’s 0.408 lands in the OpenAI/Voyage calibrated range: the model knows when it has no answer. That behaviour, not the MRR margin, is the substantive correctness improvement the Local tier needed.

Qodo also has a memtable category collapse: per-category MRR@10 of 0.050, essentially no better than random on memtable intent queries. Jina covers that category at 0.411, as does mxbai. Aggregate MRR@10 masks this: qodo compensates with wal (0.633) and cache (0.700) wins that happen to balance the memtable near-zero. A Local default with a quiet category-level blind spot is a product risk; one that degrades gracefully across categories is not.
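The masking effect is easy to surface mechanically: report the weakest bucket alongside the aggregate. A minimal sketch using qodo’s per-category figures from the table above (the helper name is illustrative):

```python
def worst_category_mrr(per_category: dict[str, float]) -> tuple[str, float]:
    """Weakest intent bucket -- aggregate MRR can hide a collapse here."""
    cat = min(per_category, key=per_category.get)
    return cat, per_category[cat]

# qodo's per-category MRR@10 from the Ollama column of the category table.
qodo = {"memtable": 0.050, "sstable": 0.233, "wal": 0.633,
        "compaction": 0.542, "cache": 0.700}
print(worst_category_mrr(qodo))  # ('memtable', 0.05)
```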

nomic-embed-text:v1.5 led on no axis and collapsed on three categories (memtable 0.000, sstable 0.075, compaction 0.000), so it was out of contention. mxbai-embed-large was the runner-up: also better than qodo on all three axes, but with a weaker adversarial delta (−0.326 vs jina’s −0.551) and a weaker memtable and cache per-category spread. Jina wins jointly on calibration and category balance.

As of forge-api 0.8.2, POST /v1/embed {"mode":"local"} defaults to ollama/hf.co/second-state/jina-embeddings-v2-base-code-GGUF:Q4_0 at 768-dim. Existing tenants with HNSW sidecars from the previous 1536-dim default will need a wipe + re-embed to pick up the new dim — embedding-dimension validation (FG-34 EM2) refuses to mix dims in one sidecar by design.
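The wipe-and-re-embed requirement follows from the dimension check. A minimal sketch of the refuse-to-mix behaviour — class and exception names here are hypothetical stand-ins, not the actual FG-34 EM2 implementation:

```python
class DimensionMismatch(ValueError):
    """Raised when a vector's dim does not match the sidecar's fixed dim."""

class HnswSidecar:
    def __init__(self, dim: int):
        self.dim = dim       # fixed at creation; 1536 under the old default
        self.vectors = []

    def insert(self, vec: list[float]) -> None:
        # Mixing dims in one index is refused by design: the only path to a
        # new embedding model with a different dim is wipe + re-embed.
        if len(vec) != self.dim:
            raise DimensionMismatch(
                f"sidecar is {self.dim}-dim; got {len(vec)}-dim vector")
        self.vectors.append(vec)

sidecar = HnswSidecar(1536)
sidecar.insert([0.0] * 1536)   # old-default vector: accepted
```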