# Benchmarking & Quality

PomaiDB benchmarks exist to earn trust, not to market speed. This document covers our trust model, recall methodology, and the latest live results.
## 1. Benchmark Trust Model

We strictly define what PomaiDB guarantees versus what it does not.

### Guarantees vs Benchmarks

| Guarantee | What it means | Benchmark | Enforcement |
| --- | --- | --- | --- |
| Correctness | Approximate search results match a brute-force oracle. | Recall Correctness | Recall@1/10/100 must be ≥ 0.94. |
| Tail latency | Latency under mixed load is visible and does not hide p999. | Mixed Load Tail | p50/p95/p99/p999 reported. |
| Crash safety | Data is not lost after SIGKILL during ingest. | Crash Recovery | Recovered count validated. |
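The Mixed Load Tail row can be made concrete with a percentile pass over raw latency samples. Below is a minimal sketch using nearest-rank percentiles; it is not PomaiDB's actual reporting code, and the real suite may use a different interpolation:

```python
# Minimal nearest-rank percentile reporter for latency samples (microseconds).
# Illustrates the p50/p95/p99/p999 report named in the trust table above.
import math

def percentile(samples, q):
    """Nearest-rank percentile: q in (0, 100]."""
    xs = sorted(samples)
    rank = math.ceil(q / 100.0 * len(xs))  # 1-based rank into the sorted list
    return xs[rank - 1]

def tail_report(latencies_us):
    # Keys become "p50", "p95", "p99", "p999".
    return {f"p{str(q).replace('.', '')}": percentile(latencies_us, q)
            for q in (50, 95, 99, 99.9)}

# Example: 1000 samples where only the last 1% is slow.
samples = [100] * 990 + [5000] * 10
print(tail_report(samples))
```

Note how p999 surfaces the 5000 µs tail that p50/p95/p99 hide, which is exactly why the suite reports it.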
## 2. Recall & Methodology

PomaiDB targets Recall@10 ≥ 0.95 for production search workloads.
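Recall@k here means the overlap between the approximate top-k and the brute-force oracle's top-k, averaged over queries. A minimal sketch of the computation (not the actual harness):

```python
# Recall@k against a brute-force oracle:
#   recall@k = |approx_top_k ∩ exact_top_k| / k, averaged over queries.

def recall_at_k(approx_ids, exact_ids, k):
    """approx_ids / exact_ids: per-query lists of result ids, length >= k."""
    hits = sum(len(set(a[:k]) & set(e[:k]))
               for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))

# Two queries, k=2: the first query matches both ids, the second one of two.
approx = [[1, 2], [3, 9]]
exact  = [[1, 2], [3, 4]]
assert recall_at_k(approx, exact, 2) == 0.75
```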
### Search Implementation

- **Indexing:** IvfCoarse with KMeans++ initialization. Vectors are buffered until sufficient samples are collected for robust training.
- **Search:** Query the top `nprobe` centroids, then perform an exact scan (SIMD dot product) on the candidate buckets.
- **Tie-Breaking:** Uses `VectorId` (ascending) to guarantee determinism.
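The search steps above can be sketched in plain Python. The data layout (`centroids`, `buckets`) and the function name are hypothetical illustrations, not PomaiDB's API; the real engine uses SIMD kernels and trained IVF structures:

```python
# Sketch of an IVF coarse search: probe the top-nprobe centroids, exact-scan
# their buckets, and break score ties by ascending vector id for determinism.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ivf_search(query, centroids, buckets, nprobe, k):
    """centroids: list of vectors; buckets[i]: list of (vector_id, vector)."""
    # 1. Rank centroids by similarity to the query, keep the top nprobe.
    probe = sorted(range(len(centroids)),
                   key=lambda i: -dot(query, centroids[i]))[:nprobe]
    # 2. Exact scan over the candidate buckets (a SIMD dot product in the
    #    real engine; plain Python here).
    candidates = [(dot(query, v), vid) for i in probe for vid, v in buckets[i]]
    # 3. Sort by score descending, then vector id ascending for ties.
    candidates.sort(key=lambda sv: (-sv[0], sv[1]))
    return [vid for _, vid in candidates[:k]]

# Two identical-score vectors in the probed bucket: id 2 wins the tie.
centroids = [[1, 0], [0, 1]]
buckets = [[(5, [1, 0]), (2, [1, 0])], [(3, [0, 1])]]
assert ivf_search([1, 0], centroids, buckets, nprobe=1, k=2) == [2, 5]
```

Raising `nprobe` in this sketch grows the candidate list (more buckets scanned), which is precisely the latency-for-recall trade described in the tuning advice.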
### Tuning Advice

- **Low recall?** Increase `nprobe` to search more buckets. This trades latency for accuracy.
- **High latency?** Reduce `nprobe` or increase `nlist` (more clusters, smaller buckets).
## 3. Benchmarking Guide

### Quick Start

```sh
# Trust benchmarks (standard scripts)
./scripts/pomai-bench recall
./scripts/pomai-bench mixed-load
./scripts/pomai-bench crash-recovery

# Comprehensive benchmark (build from source)
cmake --build build --target comprehensive_bench
./build/comprehensive_bench --dataset small
```
### Dataset Sizes

| Size | Vectors | Dimensions | Queries |
| --- | --- | --- | --- |
| small | 10,000 | 128 | 1,000 |
| medium | 100,000 | 256 | 5,000 |
| large | 1,000,000 | 768 | 10,000 |
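For capacity planning, the raw vector footprint of each dataset is easy to estimate. The sketch below assumes float32 (4 bytes per dimension) and counts raw vectors only, ignoring index structures and metadata:

```python
# Back-of-envelope raw vector footprint, assuming float32 storage.
def raw_footprint_gib(vectors, dims, bytes_per_dim=4):
    return vectors * dims * bytes_per_dim / 2**30

for name, n, d in [("small", 10_000, 128),
                   ("medium", 100_000, 256),
                   ("large", 1_000_000, 768)]:
    print(f"{name}: {raw_footprint_gib(n, d):.2f} GiB")
# The large dataset works out to roughly 2.86 GiB of raw vectors alone,
# a useful floor when reading the RSS numbers in the scaling results.
```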
### Interpreting Results

Good performance baseline (medium dataset):

- ✅ P99 latency < 1 ms
- ✅ Throughput > 5K QPS
- ✅ Recall@10 > 0.90
## 4. Live Results (CBR-S Suite)

Performance metrics from the latest CI run, comparing Fanout (baseline) vs CBR-S (smart routing). The suite tracks three axes:

- **Tail latency:** P99 / P999, in microseconds, not vibes.
- **Quality:** Recall@k vs a brute-force oracle.
- **Scaling:** QPS and RSS (throughput + memory).
| Mode | Recall@10 | P99 (µs) | Query QPS | Shards Visited | Verdict |
| --- | --- | --- | --- | --- | --- |
| fanout | 1.0000 | 3,740.5 | 103.8 | 4.00 | PASS |
| cbrs | 1.0000 | 8,390.8 | 71.1 | 1.06 | PASS |
| cbrs_no_dual | 1.0000 | 8,488.5 | 70.9 | 1.06 | PASS |
| Mode | Recall@10 | P99 (µs) | Query QPS | Shards Visited | Verdict |
| --- | --- | --- | --- | --- | --- |
| fanout | 1.0000 | 13,579.1 | 30.2 | 4.00 | PASS |
| cbrs | 1.0000 | 22,433.8 | 25.3 | 1.70 | PASS |
| cbrs_no_dual | 1.0000 | 22,626.6 | 25.2 | 1.70 | PASS |
| Mode | Recall@10 | P99 (µs) | Query QPS | Shards Visited | Verdict |
| --- | --- | --- | --- | --- | --- |
| fanout | 1.0000 | 29,113.1 | 11.1 | 8.00 | PASS |
| cbrs | 1.0000 | 58,733.7 | 10.3 | 2.00 | PASS |
| cbrs_no_dual | 1.0000 | 63,280.6 | 10.6 | 2.00 | PASS |
| Mode | Recall@1 | P99 (µs) | Query QPS | Shards Visited | Verdict |
| --- | --- | --- | --- | --- | --- |
| fanout | 0.1000 | 30,522.0 | 14.5 | 4.00 | PASS |
| cbrs | 0.1000 | 73,674.3 | 9.4 | 1.87 | PASS |
| cbrs_no_dual | 0.1000 | 68,675.5 | 9.6 | 1.87 | PASS |
| Mode | Recall@10 | P99 (µs) | Query QPS | Shards Visited | Verdict |
| --- | --- | --- | --- | --- | --- |
| fanout | 1.0000 | 11,061.3 | 37.4 | 4.00 | PASS |
| cbrs | 1.0000 | 25,408.4 | 25.1 | 1.74 | PASS |
| cbrs_no_dual | 1.0000 | 25,199.8 | 25.1 | 1.74 | PASS |
| Mode | Recall@10 | P99 (µs) | Query QPS | Shards Visited | Verdict |
| --- | --- | --- | --- | --- | --- |
| fanout | 1.0000 | 17,004.5 | 31.5 | 4.00 | PASS |
| cbrs | 0.9703 | 25,772.1 | 28.1 | 1.00 | PASS |
| cbrs_no_dual | 0.5000 | 24,737.5 | 28.5 | 1.00 | WARN |
### Interpretation Guide

Good signs:

- P99 stays close to P50 (no tail explosions).
- QPS increases with threads until expected saturation.
- Recall@10 is stable across dataset modes and routing epochs.
- `routed_shards_avg` drops significantly below the shard count, without recall loss.

Red flags:

- P99 ≫ P50 (10x or more): tail variance, often contention or IO.
- QPS decreases as threads are added: thrash or lock contention.
- Recall drops under overlap or epoch drift: routing or snapshot-semantics issues.
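The first red flag is easy to turn into a mechanical check. A trivial sketch, where the 10x threshold mirrors the guide above and should be tuned to the workload:

```python
# Flag a run whose tail latency blows out relative to the median.
def tail_explosion(p50_us, p99_us, max_ratio=10.0):
    """True when P99 exceeds max_ratio * P50, i.e. a tail explosion."""
    return p99_us / p50_us > max_ratio

assert not tail_explosion(p50_us=100, p99_us=500)   # healthy tail
assert tail_explosion(p50_us=100, p99_us=1500)      # red flag: 15x
```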