Scope, Definitions, and Survey Method
Goals and Scope (Part A — survey report)
This report systematically surveys the field of vector databases and vector systems, including similarity search, ANN/ANNS (approximate nearest neighbor search), and embedding stores. The scope covers: (i) theoretical foundations of k-nearest neighbor (kNN) and ANNS, (ii) index structures and search algorithms (exact vs approximate; hashing, trees, quantization, graphs, disk-based indexes), (iii) scalability and distributed/cloud-managed architectures, (iv) durability, consistency, and dynamic/real-time updates, (v) integration with ML/LLM pipelines (hybrid search, RAG, multimodal retrieval), and (vi) safety–privacy–fairness and legal/ethical risks. [47] [48] [35]
Working Definitions and Concept Boundaries
A vector system is any software system that supports: (1) storing/reading vectors (embeddings), and (2) executing kNN/ANNS queries under a distance metric or similarity (L2, cosine, inner product/dot product, etc.), typically with index optimizations to avoid full linear scans. [37] [30]
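As a concrete baseline for this definition, the query side of a vector system can be sketched as brute-force kNN under the three common measures (a minimal illustration in numpy; any real system replaces this linear scan with an index):

```python
import numpy as np

def knn(query: np.ndarray, vectors: np.ndarray, k: int = 3, metric: str = "l2") -> np.ndarray:
    """Exact (brute-force) kNN over a matrix of vectors: the linear scan
    every index structure tries to avoid."""
    if metric == "l2":
        scores = np.linalg.norm(vectors - query, axis=1)   # smaller is better
        return np.argsort(scores)[:k]
    if metric == "ip":
        scores = vectors @ query                           # larger is better
        return np.argsort(-scores)[:k]
    if metric == "cosine":
        q = query / np.linalg.norm(query)
        v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        return np.argsort(-(v @ q))[:k]
    raise ValueError(f"unknown metric: {metric}")

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)
print(knn(q, base, k=5, metric="cosine"))
```

The `knn` helper and its signature are illustrative, not taken from any of the surveyed systems.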
A vector database (vector DB / VDBMS) is broader than a vector index library: beyond ANN indexes, it provides DBMS-grade data management such as CRUD/upsert/delete, metadata filtering, partitions/namespaces, access control, horizontal scaling, durability (persistence), and distributed operations (replication, fault tolerance). This distinction aligns with recent VDBMS surveys and vendor technical documentation. [47] [12]
Survey Methodology and Selection Criteria
We use a “systematic narrative survey” approach with a default time window of 2000–2025, covering the rapid evolution of ANNS from algorithms (LSH, PQ, HNSW, DiskANN) to systems integration (VDBMS, vector-enabled databases, serverless operation, filtered vector search). Earlier foundations (k-d tree, metric trees/vp-tree) are included as historical anchors. [1] [3] [4] [48]
- Primary sources: arXiv, ACL Anthology, NeurIPS proceedings, PVLDB, official project documentation (GitHub/Docs), and benchmark sites (ANN-Benchmarks, BigANN). [10] [11] [12]
- Query keywords (examples): “vector database management system”, “approximate nearest neighbor search”, “HNSW”, “product quantization”, “filtered vector search”, “disk-based ANN”, “hybrid retrieval”, “serverless vector database”, “metadata filtering”, “RAG retrieval augmented generation”, “multimodal retrieval CLIP”. [35] [37] [48]
- Selection criteria: prioritize primary/seminal papers and official system docs; for commercial systems, prefer official whitepapers, vendor research papers, or peer-reviewed publications by the providers when available. [25]
- Limitations and transparency: some digital libraries may restrict access; when necessary, we use preprints/mirrors. If a trustworthy, primary benchmark number is unavailable, we mark it as “unknown” and explain why. [11]
Theoretical Foundations and Index Taxonomy
Problem Definition and Similarity Metrics
For a query vector q, kNN/ANNS typically returns the top-k points x that minimize a distance d(q, x) (e.g., L2) or maximize a similarity (dot/inner product). Cosine similarity is commonly reduced to dot product after L2-normalization. [30] [7]
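The normalization identity mentioned above can be checked numerically: after L2-normalizing both vectors, cosine similarity and dot product coincide, which is why inner-product indexes can serve cosine queries.

```python
import numpy as np

# Cosine similarity equals dot product on the unit sphere.
rng = np.random.default_rng(1)
x, q = rng.standard_normal(128), rng.standard_normal(128)

cos = (x @ q) / (np.linalg.norm(x) * np.linalg.norm(q))
xn, qn = x / np.linalg.norm(x), q / np.linalg.norm(q)
assert np.isclose(cos, xn @ qn)
```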
From Exact NN to Approximate NN (ANN/ANNS)
In high dimensions and at large scale, exact NN often becomes impractical, motivating ANNS that trades accuracy for speed and memory. A major theoretical branch uses locality-sensitive hashing (LSH) with approximation guarantees under certain assumptions. [4] [5]
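To make the hashing idea concrete, here is a toy random-hyperplane (angular) LSH signature — a simpler relative of the p-stable schemes cited, not their implementation. Nearby vectors (small angle) are likely to agree on most sign bits, so hash buckets concentrate candidates:

```python
import numpy as np

def simhash_signature(v, planes):
    """Sign pattern of projections onto random hyperplanes: vectors at a
    small angle are likely to share most sign bits."""
    return tuple((planes @ v > 0).astype(int))

rng = np.random.default_rng(2)
planes = rng.standard_normal((16, 64))       # 16 hash bits for 64-d vectors

v = rng.standard_normal(64)
near = v + 0.05 * rng.standard_normal(64)    # small perturbation of v
far = rng.standard_normal(64)                # unrelated vector

sig = lambda u: np.array(simhash_signature(u, planes))
print("bits shared with near:", (sig(v) == sig(near)).sum())
print("bits shared with far: ", (sig(v) == sig(far)).sum())
```

In a real LSH index many such tables are maintained and only colliding buckets are scanned, which is exactly the memory cost noted in the taxonomy below.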
Historical Milestones (Selected)
- Trees / partitioning: k-d tree (axis-aligned splits), ball tree (hyperspheres), vp-tree/metric trees for general metric spaces. [1] [2] [3]
- LSH: foundational theory for approximate NN, including p-stable LSH for l_p norms (notably l_2). [4] [5]
- Quantization: Product Quantization (PQ) as a key milestone for compressing vectors and accelerating approximate distance computation. [6]
- Graph-based: HNSW as a widely deployed multi-layer proximity graph index due to strong recall–latency behavior and incremental inserts. [7]
- Disk-based ANN: DiskANN demonstrates high accuracy at low latency and high throughput at billion scale on SSDs. [14]
Algorithm / Index Taxonomy by Core Mechanism
- Exact scan / exact indexes (baseline): linear scan; exact trees in low dimensions. [1]
- Hashing-based: LSH (including p-stable). Pros: theoretical guarantees in some models; cons: many tables, memory-heavy, can underperform on modern embedding distributions. [4][5]
- Tree-based / space partitioning: k-d tree, ball tree, vp-tree; often strong in low-to-mid dimensions or structured data. [1][2][3]
- Quantization-based (PQ / IVF-PQ / scalar quantization): short codes and lookup tables, good for memory compression and speed. [6][8]
- Graph-based (HNSW and variants): excellent empirical performance with tunable parameters (M, efConstruction, efSearch). [7]
- Disk / out-of-core: DiskANN/SSD-optimized graphs for billion-scale retrieval with modest RAM footprints. [14]
- Hybrid & filtered vector search: combine metadata filtering and/or lexical retrieval with vector search; a major current research bottleneck. [48]
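The quantization entry above can be illustrated with a toy PQ: split dimensions into m subspaces, run a small k-means per subspace, store one byte per subspace, and answer queries with asymmetric distance computation (ADC) via lookup tables. This is a sketch of the core mechanism only, not the FAISS IVF-PQ pipeline; all function names are ours.

```python
import numpy as np

def train_pq(X, m=4, ks=16, iters=10, seed=0):
    """Toy PQ training: a few Lloyd (k-means) steps per subspace."""
    rng = np.random.default_rng(seed)
    ds = X.shape[1] // m
    codebooks = []
    for j in range(m):
        sub = X[:, j*ds:(j+1)*ds]
        C = sub[rng.choice(len(sub), ks, replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((sub[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
            for c in range(ks):
                if (assign == c).any():
                    C[c] = sub[assign == c].mean(0)
        codebooks.append(C)
    return codebooks

def encode(X, codebooks):
    """One centroid id (here: one byte) per subspace per vector."""
    ds = X.shape[1] // len(codebooks)
    return np.stack(
        [np.argmin(((X[:, j*ds:(j+1)*ds][:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
         for j, cb in enumerate(codebooks)], axis=1).astype(np.uint8)

def adc_distances(q, codes, codebooks):
    """ADC: precompute query-to-centroid tables per subspace, then distances
    to all codes are just table sums (no decompression)."""
    ds = len(q) // len(codebooks)
    tables = [((cb - q[j*ds:(j+1)*ds]) ** 2).sum(-1) for j, cb in enumerate(codebooks)]
    return sum(tables[j][codes[:, j]] for j in range(len(codebooks)))

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 16)).astype(np.float32)
cbs = train_pq(X)
codes = encode(X, cbs)           # 4 bytes per vector instead of 64
d = adc_distances(X[0], codes, cbs)
```

Note the compression ratio: 16 float32 dims (64 bytes) become m = 4 code bytes, at the cost of quantization error in the distances.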
System Architecture and Vector Data Management
From Index Libraries to Vector DBMS (VDBMS)
The ecosystem commonly falls into three layers: (i) libraries (algorithms + index APIs), (ii) engines/services (vector search service runtime), (iii) VDBMS or vector-enabled databases (durability, distribution, governance, security). Recent surveys emphasize a shift from “algorithm-centric” to “system-centric” driven by LLM/RAG demand. [47] [35]
Typical Cloud-Native VDBMS Architecture
- Storage/compute separation and componentization (ingest nodes, query nodes, index builders, metadata store, object storage/WAL) are common patterns for scaling and fault tolerance. [15] [17]
- Dynamic updates and reindexing: many systems use segment-based designs (growing vs sealed), asynchronous index building, and sometimes temporary indexes for “query immediately after ingest”. [15]
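The growing/sealed pattern can be sketched as follows (an assumed simplification, not any specific system's design): inserts land in a growing segment that is scanned directly, so data is queryable immediately, and full segments are sealed, where a real system would build an ANN index asynchronously (brute force stands in for the index here).

```python
import numpy as np

class SegmentedStore:
    """Toy segment-based vector store: growing segment = scan, sealed
    segments = where an ANN index would be built asynchronously."""

    def __init__(self, seal_threshold=4):
        self.threshold = seal_threshold
        self.growing = []          # list of (id, vector)
        self.sealed = []           # list of (ids array, vectors matrix)
        self.next_id = 0

    def insert(self, v):
        self.growing.append((self.next_id, np.asarray(v, np.float32)))
        self.next_id += 1
        if len(self.growing) >= self.threshold:   # seal; real systems index here
            ids, vecs = zip(*self.growing)
            self.sealed.append((np.array(ids), np.stack(vecs)))
            self.growing = []

    def search(self, q, k=3):
        segments = list(self.sealed)
        if self.growing:                          # fresh data is still searchable
            ids, vecs = zip(*self.growing)
            segments.append((np.array(ids), np.stack(vecs)))
        hits = []
        for ids, vecs in segments:                # per-segment candidates, merged
            d = np.linalg.norm(vecs - np.asarray(q, np.float32), axis=1)
            hits += list(zip(d, ids))
        return sorted(hits)[:k]
```

The query path merging per-segment top-k candidates is the part that survives into real designs; the synchronous sealing is the simplification.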
Examples (Academic–Industrial)
- Purpose-built VDBMS papers report dynamic updates and real-time queries with snapshot-like consistency, separating query execution, GPU acceleration, and storage, with distributed scaling. [15]
- Serverless commercial research highlights metadata filtering as a central algorithm+systems challenge in serverless vector databases. [25] [48]
- Vector search inside traditional search engines commonly “co-resides” with inverted indexes (BM25), enabling practical hybrid retrieval. [21] [22] [23]
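The metadata-filtering challenge mentioned above comes down to where the predicate is applied. A minimal pre-filtering sketch (ours, with brute-force search standing in for the index):

```python
import numpy as np

def prefilter_search(X, metadata, q, predicate, k):
    """Pre-filtering: apply the metadata predicate first, then search only
    surviving rows (a real system pushes the filter into the ANN index).
    Post-filtering instead searches first and discards non-matching hits,
    which can return fewer than k results when the filter is selective."""
    keep = np.fromiter((predicate(m) for m in metadata), dtype=bool, count=len(metadata))
    idx = np.flatnonzero(keep)
    d = np.linalg.norm(X[idx] - q, axis=1)
    return idx[np.argsort(d)[:k]]

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 8)).astype(np.float32)
meta = [{"lang": "en" if i % 2 == 0 else "de"} for i in range(100)]
hits = prefilter_search(X, meta, X[1], lambda m: m["lang"] == "de", k=5)
```

Making this efficient inside a graph or quantized index — rather than on a brute-force scan — is precisely the open problem the cited work addresses. [48]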
Evaluation, Datasets, and Comparative Analysis
Core Metrics
- Recall@k / Recall@r: fraction of queries where true NN appears in top-k/top-r. [10]
- Latency (p50/p95/p99) and Throughput/QPS: online vs batch behavior. [12]
- Index build time and ingest/update cost: critical under streaming updates. [15]
- Memory footprint / index size / disk usage: operational cost and scaling. [14]
- Cost per query (managed/serverless): compute + storage + network; BigANN emphasizes cost/power normalization. [12]
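The recall metric above is easy to compute once ground-truth neighbors are available (e.g., from an exact scan); a minimal sketch:

```python
import numpy as np

def recall_at_k(retrieved, ground_truth, k):
    """Recall@k as defined above: fraction of queries whose true nearest
    neighbor appears among the top-k returned ids. (A common variant instead
    measures overlap between the returned top-k and the true top-k sets.)"""
    return float(np.mean([gt[0] in list(r[:k]) for r, gt in zip(retrieved, ground_truth)]))

retrieved = [[3, 1, 2], [7, 8, 9]]   # ANN results for two queries
truth = [[1], [4]]                   # true nearest neighbor per query
print(recall_at_k(retrieved, truth, k=3))
```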
Benchmark Methodology and Reproducibility
- ANN-Benchmarks standardizes quality–performance evaluation across many ANN libraries. [10]
- A reproducibility protocol recommends publishing full configs/hardware/versions to reduce bias from parameter/hardware differences. [11]
- BigANN targets billion-scale workloads with defined datasets, reference hardware, and QPS–recall measurement. [12]
Recommended Datasets (by common practice)
- BIGANN/SIFT1B: billion-scale SIFT descriptors used by BigANN. [12]
- SIFT1M: smaller-scale SIFT commonly used in ANN-Benchmarks. [10]
- Deep1B: billion-scale deep descriptors (CVPR 2016). [13]
- GloVe embeddings: standard word embeddings (ANN-Benchmarks includes glove-100-angular variants). [31][10]
- MS MARCO: QA/ranking dataset used in dense retrieval and RAG pipelines. [32][36]
- LAION-400M: large multimodal dataset (CLIP embeddings widely used for multimodal retrieval). [33][37]
- ImageNet features: a common vision benchmark foundation. [34]
Integration Patterns and Safety/Privacy
Hybrid Search (BM25 + Vectors)
BM25 remains a lexical retrieval baseline; dense retrieval can outperform BM25 on some tasks and is commonly used as the retriever in RAG. Sparse neural retrieval methods (e.g., SPLADE) offer another route toward hybrid retrieval (dense + sparse). [38] [36] [39]
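One common, score-free way to combine a BM25 result list with a dense result list is Reciprocal Rank Fusion (RRF); the sketch below is illustrative and not tied to any particular engine's hybrid mode:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over result lists of
    1 / (k + rank(d)). Rank-based, so BM25 and vector scores need no
    calibration against each other; k=60 is a conventional default."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # lexical top-3
dense_hits = ["d1", "d9", "d3"]   # vector top-3
print(rrf_fuse([bm25_hits, dense_hits]))
```

Documents appearing in both lists (d1, d3) are rewarded, which is the behavior hybrid retrieval is after.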
RAG (Retrieval-Augmented Generation)
RAG combines parametric memory (a generative model) with non-parametric memory (a dense vector index) to improve factuality and updatability. This makes vector databases a core infrastructure component for many GenAI applications, especially when provenance (evidence) matters. [35]
Multimodal Retrieval
CLIP provides a shared embedding space for text and images; LAION-400M releases data that supports large-scale multimodal similarity search. [37] [33]
Streaming Ingestion and Real-Time Updates
Real systems often separate index building from serving queries and use log/WAL-based ingestion to support recovery and shard-level consistency. [15] [17]
Security, Privacy, and Fairness
- Encryption and access control: production deployments typically require transport encryption (TLS), at-rest encryption (e.g., AES-256), and RBAC/API keys—important because embeddings may encode sensitive information. [26] [27] [29]
- Privacy inference risks: membership inference and model inversion show that models/outputs can leak information about training data. In vector DB contexts, embeddings can also leak membership or sensitive attributes; access control and leakage mitigation should be considered. [42] [43]
- Differential Privacy (DP): a foundational framework for limiting individual data leakage; DP-SGD adapts DP to deep learning training. [40] [41]
- Bias/fairness in embeddings: work on word embedding bias shows embeddings can reflect or amplify social bias, creating risks in high-stakes use cases. [44]
- Legal frameworks: GDPR imposes personal data protections; the EU AI Act introduces risk-tiered AI rules that can affect storage, retrieval, and auditability in AI systems using vector DBs. [45] [46]
Open Challenges and Research Directions
- Benchmark vs real workload gap: dynamic data, metadata filtering, and mixed structured+vector workloads can drastically change trade-offs compared to “pure vector” benchmarks. [48] [47]
- Theory guarantees vs engineering knobs: LSH offers theoretical approximation factors, but many production systems pick HNSW/graph/quantization indexes for their empirical performance; bridging formal guarantees to parameters such as efSearch, M, or PQ code length remains an open area. [4] [7] [6]
- Index maintenance under high update rates: deletes/updates in graph indexes cause fragmentation and quality drift, motivating compaction/rebuild strategies and better dynamic algorithms. [7] [15]
- Energy/cost-aware ANNS: move from recall–latency to quality–latency–cost–energy optimization. [12]
- Explainability/provenance for embedding retrieval: harder to explain “why retrieved” than lexical methods; RAG adds pressure for provenance and auditable retrieval. [35] [38]
- Benchmark standardization: lacks unified, widely adopted protocols for filtered search, streaming updates, multi-vector queries, multimodal, and cloud cost. [10] [11] [12]
Proposed Title
Vector Databases and Similarity Search Systems: A Systematic Survey, Taxonomy, Comparative Analysis, and Open Challenges for Modern AI Infrastructure [47][35]
Abstract
Vector databases and similarity search systems have become core infrastructure for modern AI applications such as semantic search, recommendation, and especially Retrieval-Augmented Generation (RAG). Rapid progress in this area is driven by algorithmic advances (LSH, product quantization, HNSW, SSD-based indexes like DiskANN) and systems advances (cloud-native VDBMS, vector-in-RDBMS/search engines, efficient metadata filtering, serverless operation). This paper presents: (i) a systematic narrative survey for 2000–2025, (ii) a unified taxonomy across algorithms, system architectures, and application patterns, (iii) qualitative and semi-quantitative comparative analysis grounded in primary literature and public benchmarks, and (iv) open challenges in theoretical guarantees, streaming updates, filtered vector search, benchmarking standardization, and security–privacy–fairness. [47] [10] [12] [48]
Keywords
vector database; approximate nearest neighbor search; HNSW; product quantization; filtered vector search; hybrid retrieval; RAG; multimodal retrieval; benchmarking; privacy. [7] [6] [48] [35] [10]
Proposed Figures (Embed-Ready Blocks)
Figure A — Timeline (Mermaid)
timeline
title Milestones in vector indexing and vector systems (selected)
1970s : k-d tree (Bentley)
1980s : Ball tree (Omohundro)
1990s : VP-tree / metric trees (Yianilos)
1998 : LSH foundations (Indyk–Motwani)
2004 : p-stable LSH (Datar et al.)
2011 : Product Quantization (PQ)
2016 : HNSW (preprint); ANN-Benchmarks emerges
2017 : FAISS GPU similarity search (preprint)
2019 : DiskANN (NeurIPS)
2020 : RAG; DPR; vector search inside search engines
2021 : Milvus (SIGMOD) — purpose-built VDBMS
2024 : VDBMS survey; filtered vector search accelerates
2025 : Filtering/metadata research + industrial scaling
Figure B — Taxonomy (Mermaid)
Figure C — Distributed Vector System Architecture (Mermaid)
Figure D — Hybrid Search + RAG Pipeline (Mermaid)
Figure E — Recall vs Latency (Python snippet)
import matplotlib.pyplot as plt
# NOTE:
# - DiskANN numbers should come from the DiskANN paper or your reproduced runs.
# - Others below are placeholders: replace with your measured results under a fixed protocol.
labels = ["DiskANN (paper)", "HNSW (placeholder)", "IVF-PQ (placeholder)", "ScaNN/AVQ (placeholder)"]
latency_ms = [2.8, 5.0, 8.0, 4.0] # replace with your measurements
recall = [0.95, 0.92, 0.90, 0.93] # replace with your measurements
plt.figure()
plt.scatter(latency_ms, recall)
for x, y, s in zip(latency_ms, recall, labels):
plt.text(x, y, s)
plt.xlabel("Latency (ms) — lower is better")
plt.ylabel("Recall@1 (or Recall@k) — higher is better")
plt.title("Illustrative Recall-Latency trade-off (replace placeholders with reproducible results)")
plt.ylim(0.7, 1.0)
plt.xlim(0, max(latency_ms) + 2)
plt.show()
Figure F — Reproducible Experiment Workflow (Mermaid)
Table 1 — Algorithm Comparison (Minimum)
| Algorithm / Index | Group | Core Idea | Query Complexity (qualitative) | Memory | Dynamic Updates | Best Fit | Primary Source |
|---|---|---|---|---|---|---|---|
| k-d tree | Tree-based | Axis-aligned partitioning | Strong at low dimension; degrades as dimension increases | Medium | Yes | Exact/approx NN in low dimensions | [1] |
| Ball tree | Tree-based | Nodes are hyperspheres | Often better than k-d tree for some distributions | Medium | Yes | Metric L2, mid dimension | [2] |
| VP-tree / Metric tree | Tree-based | Vantage points for general metric spaces | Data-dependent; good for general metric spaces | Medium | Yes | Metric spaces beyond Euclidean | [3] |
| LSH (foundations) | Hashing-based | Probabilistic hashing preserves “near” | Expected sublinear under assumptions | High | Yes | When theory guarantees are required | [4] |
| p-stable LSH | Hashing-based | p-stable distributions for lp (esp. l2) | Good for l2 with proper tuning | High | Yes | Approx NN under lp norms | [5] |
| Product Quantization (PQ) | Quantization | Subspace quantization + lookup tables | Fast table lookup; depends on code settings | Low | Sometimes | Large compression, fast serving | [6] |
| HNSW | Graph-based | Multi-layer navigable small world graph | Very fast in practice; tunable via ef/M | Higher than PQ | Yes (incremental) | Diverse workloads; high recall at low latency | [7] |
| DiskANN | Disk-based graph | SSD-optimized graph layout + search | Low latency, high QPS at billion scale on SSD | Low RAM + SSD | Depends | Billion-scale out-of-core | [14] |
Table 2 — System Comparison (Minimum 12)
Performance numbers are often not directly comparable across systems because datasets, hardware, and tuning differ; publication-grade comparisons should follow a fixed protocol (ANN-Benchmarks / BigANN). [10][11][12]
| System | Group | License | Core Methods | Metrics | Dynamic Updates | Distributed | GPU | Notes | Primary Source |
|---|---|---|---|---|---|---|---|---|---|
| FAISS | Library | MIT | IVF/PQ, HNSW, many index types | L2, IP, cosine | Depends on index | No (library) | Yes | Widely used baseline library | [8] |
| Annoy | Library | Apache-2.0 | Random projection trees; mmap index | Angular/L2 (practice) | Mostly offline | No | No | Memory-mapped read-optimized | (project docs) |
| hnswlib | Library | Apache-2.0 | HNSW | Configurable | Yes | No | No | Popular lightweight HNSW impl | [7] |
| NMSLIB | Library/toolkit | Apache-2.0 | Multiple methods incl. HNSW | Multiple spaces | Yes | No | No | Toolkit for many ANN methods | (project docs) |
| ScaNN | Library | Apache-2.0 | AVQ + pruning + reorder | MIPS/L2 | Yes | No | CPU-optimized | Strong quality–speed when tuned | [9] |
| Milvus | VDBMS | Apache-2.0 | IVF/PQ, HNSW; segment-based | Multiple | Yes | Yes | Yes | Purpose-built VDBMS | [15] |
| Weaviate | VDBMS | BSD-3-Clause | HNSW, flat; dynamic switching | Configurable | Yes | Yes (cloud-native) | Varies | Popular open-source VDBMS | [47] |
| Qdrant | VDBMS | (repo) | HNSW + payload filtering; quantization | Configurable | Yes | Deployment options | Varies | Filtering-first design | (project docs) |
| pgvector | DB extension | (repo) | HNSW, IVFFlat | L2, IP, cosine, L1, etc. | Yes | Via Postgres | No | Vector search inside Postgres | [30] |
| Vespa | Search + Vector | Apache-2.0 | HNSW + ranking pipeline | Configurable | Yes (real-time) | Yes | Varies | Hybrid ranking at scale | (project docs) |
| Elasticsearch kNN (Lucene) | Search + Vector | Multiple | HNSW per segment (Lucene) | kNN metrics | Yes | Yes (cluster) | No | Strong hybrid story with BM25 | [21][23] |
| Pinecone | Managed VDB | Commercial | Not fully disclosed | Dot/L2/cosine (docs) | Yes | Yes | Varies | Notable research on serverless filtering | [25][26] |
Recommended Reproducible Benchmark Protocol (Summary)
To compare fairly, report at least one “medium” dataset (SIFT1M or GloVe) and one “large” dataset (Deep1B or SIFT1B); fix dtype/normalization/metric; tune key parameters (HNSW: M, efConstruction, efSearch; IVF: lists/probes; PQ: code size; ScaNN: tree + quantization + reorder); and report latency p50/p95/p99, QPS under defined concurrency, recall@k against ground truth, and resource usage (index size, RSS, disk footprint, build time). Publish configs/logs/checksums. [10][11][12]
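The latency-reporting step of this protocol is a one-liner with numpy; the synthetic samples below are placeholders for measured per-query latencies:

```python
import numpy as np

def latency_summary(samples_ms):
    """p50/p95/p99 as recommended above; tail percentiles need enough
    samples to be meaningful (e.g., thousands of queries for p99)."""
    p50, p95, p99 = np.percentile(samples_ms, [50, 95, 99])
    return {"p50_ms": float(p50), "p95_ms": float(p95), "p99_ms": float(p99),
            "qps_single_thread": 1000.0 / float(np.mean(samples_ms))}

rng = np.random.default_rng(0)
lat = rng.gamma(shape=2.0, scale=2.0, size=10_000)   # synthetic latencies (ms)
print(latency_summary(lat))
```

Note that single-thread QPS derived from mean latency is not the same as QPS under defined concurrency; the latter must be measured with a load generator.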
References
[1] Bentley, J. L. Multidimensional Binary Search Trees in Database Applications. IEEE TSE, 1979.
[2] Omohundro, S. M. Five Balltree Construction Algorithms. 1989.
[3] Yianilos, P. N. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. 1993. DOI: 10.1145/313559.313789.
[4] Indyk, P., Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
[5] Datar, M., Immorlica, N., Indyk, P., Mirrokni, V. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. SoCG 2004. DOI: 10.1145/997817.997857.
[6] Jégou, H., Douze, M., Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE TPAMI 33(1), 2011. DOI: 10.1109/TPAMI.2010.57.
[7] Malkov, Y. A., Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. arXiv:1603.09320.
[8] Johnson, J., Douze, M., Jégou, H. Billion-Scale Similarity Search with GPUs. arXiv:1702.08734 (FAISS).
[9] Guo, R., Sun, P., Lindgren, E., et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. arXiv:1908.10396 (ScaNN/AVQ).
[10] Aumüller, M., Bernhardsson, E., Faithfull, A. A Benchmarking Tool for Approximate Nearest Neighbor Search. arXiv:1807.05614 (ANN-Benchmarks).
[11] Aumüller, M., Bernhardsson, E., Faithfull, A. Reproducibility Protocol for ANN-Benchmarks. 2022.
[12] BigANN / Big-ANN Benchmarks (NeurIPS competition tracks): https://big-ann-benchmarks.com/
[13] Babenko, A., Lempitsky, V. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. CVPR 2016 (Deep1B).
[14] Subramanya, S. J., et al. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS 2019.
[15] Wang, J., Yi, X., Guo, R., et al. Milvus: A Purpose-Built Vector Data Management System. SIGMOD 2021. DOI: 10.1145/3448016.3457550.
[16] Milvus Documentation — Index types overview.
[17] Milvus Documentation — Architecture overview (streaming/query/data nodes and storage).
[18] Weaviate Documentation — Vector indexing (HNSW/flat, dynamic switching).
[19] Weaviate GitHub — license and system description.
[20] Vespa Documentation — Approximate nearest neighbor search using HNSW.
[21] Elastic Documentation — kNN search in Elasticsearch.
[22] Elastic Documentation — Optimize approximate kNN search (HNSW per segment).
[23] Apache Lucene — Lucene99HnswVectorsFormat (HNSW vectors format).
[24] Pinecone Documentation — Hybrid search.
[25] Ingber, A., Liberty, E. Accurate and Efficient Metadata Filtering in Pinecone’s Serverless Vector Database. ICML 2025.
[26] Pinecone Documentation — Security overview (TLS/at-rest encryption).
[27] Pinecone — Technical & Organizational Security Measures (PDF).
[28] Qdrant GitHub — features (payload filtering, quantization).
[29] Qdrant Documentation — Security (API key, TLS).
[30] pgvector GitHub — HNSW/IVFFlat and parameters.
[31] Pennington, J., Socher, R., Manning, C. GloVe: Global Vectors for Word Representation. EMNLP 2014. DOI: 10.3115/v1/D14-1162.
[32] Bajaj, P., Campos, D., Craswell, N., et al. MS MARCO. arXiv:1611.09268.
[33] Schuhmann, C., Vencu, R., Beaumont, R., et al. LAION-400M. arXiv:2111.02114. DOI: 10.48550/arXiv.2111.02114.
[34] Russakovsky, O., Deng, J., Su, H., et al. ImageNet Large Scale Visual Recognition Challenge. IJCV 2015. DOI: 10.1007/s11263-015-0816-y.
[35] Lewis, P., Perez, E., Piktus, A., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[36] Karpukhin, V., Oğuz, B., Min, S., et al. Dense Passage Retrieval for Open-Domain QA. EMNLP 2020.
[37] Radford, A., Kim, J. W., Hallacy, C., et al. CLIP. ICML 2021.
[38] Robertson, S., Zaragoza, H. BM25 and Beyond. FnT IR 2009. DOI: 10.1561/1500000019.
[39] Formal, T., Piwowarski, B., Clinchant, S. SPLADE. SIGIR 2021. DOI: 10.1145/3404835.3463098.
[40] Dwork, C., Roth, A. The Algorithmic Foundations of Differential Privacy. 2014.
[41] Abadi, M., Chu, A., Goodfellow, I., et al. Deep Learning with Differential Privacy. CCS 2016.
[42] Shokri, R., Stronati, M., Song, C., Shmatikov, V. Membership Inference Attacks Against ML Models. IEEE S&P 2017.
[43] Fredrikson, M., Jha, S., Ristenpart, T. Model Inversion Attacks. CCS 2015. DOI: 10.1145/2810103.2813677.
[44] Bolukbasi, T., Chang, K.-W., Zou, J., et al. Debiasing Word Embeddings. NeurIPS 2016.
[45] Regulation (EU) 2016/679 (GDPR).
[46] Regulation (EU) 2024/1689 (EU AI Act).
[47] Pan, J. J., et al. Survey of Vector Database Management Systems. VLDB Journal 2024.
[48] Chronis, Y., Caminal, P., et al. Filtered Vector Search: State-of-the-Art and Research Directions. PVLDB 2025.
[49] Elastic — Licensing FAQ (AGPL option; licensing changes).
[50] Elastic Blog — “Elasticsearch Is Open Source. Again!” (AGPL option).
End of document.