Pomai Research

A Comprehensive Survey of Vector Databases and Similarity Search Systems

Scope, taxonomy, system architecture, benchmarking, ML/LLM integration, and safety/privacy (2000–2025).

Scope, Definitions, and Survey Method

Goals and Scope (Part A — survey report)

This report systematically surveys the field of vector databases and vector systems, including similarity search, ANN/ANNS (approximate nearest neighbor search), and embedding stores. The scope covers: (i) theoretical foundations of k-nearest neighbor (kNN) and ANNS, (ii) index structures and search algorithms (exact vs approximate; hashing, trees, quantization, graphs, disk-based indexes), (iii) scalability and distributed/cloud-managed architectures, (iv) durability, consistency, and dynamic/real-time updates, (v) integration with ML/LLM pipelines (hybrid search, RAG, multimodal retrieval), and (vi) safety–privacy–fairness and legal/ethical risks. [47] [48] [35]

Working Definitions and Concept Boundaries

A vector system is any software system that supports: (1) storing/reading vectors (embeddings), and (2) executing kNN/ANNS queries under a distance or similarity function (L2, cosine, inner/dot product, etc.), typically with index optimizations to avoid full linear scans. [37] [30]

A vector database (vector DB / VDBMS) is broader than a vector index library: beyond ANN indexes, it provides DBMS-grade data management such as CRUD/upsert/delete, metadata filtering, partitions/namespaces, access control, horizontal scaling, durability (persistence), and distributed operations (replication, fault tolerance). This distinction aligns with recent VDBMS surveys and vendor technical documentation. [47] [12]

Survey Methodology and Selection Criteria

We use a “systematic narrative survey” approach with a default time window of 2000–2025, covering the rapid evolution of ANNS from algorithms (LSH, PQ, HNSW, DiskANN) to systems integration (VDBMS, vector-enabled DBs, serverless filtering, filtered vector search). Earlier foundations (k-d tree, metric trees/vp-tree) are included as historical anchors. [1] [3] [4] [48]

Theoretical Foundations and Index Taxonomy

Problem Definition and Similarity Metrics

For a query vector q, kNN/ANNS typically returns top-k points x that minimize distance d(q, x) (e.g., L2) or maximize similarity (dot/inner product). Cosine similarity is commonly reduced to dot product after L2-normalization. [30] [7]
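The reduction of cosine similarity to an inner product can be checked directly: after L2-normalizing both vectors, their dot product equals their cosine similarity. A minimal pure-Python sketch (the helper names are ours, not from the cited sources):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity on raw (unnormalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

q = [1.0, 2.0, 3.0]
x = [2.0, 0.5, 1.0]

# Cosine on raw vectors equals dot product on L2-normalized vectors.
assert abs(cosine(q, x) - dot(l2_normalize(q), l2_normalize(x))) < 1e-12
```

This is why many systems store normalized vectors and serve cosine queries with a plain inner-product index.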

From Exact NN to Approximate NN (ANN/ANNS)

In high dimensions and at large scale, exact NN often becomes impractical, motivating ANNS that trades accuracy for speed and memory. A major theoretical branch uses locality-sensitive hashing (LSH) with approximation guarantees under certain assumptions. [4] [5]
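To make the LSH idea concrete, the toy sketch below uses the random-hyperplane (SimHash-style) family for angular similarity: each random hyperplane contributes one hash bit, and vectors at a small angle agree on most bits with high probability. This family is chosen for brevity; the p-stable family cited above targets lp norms and differs in detail.

```python
import random

random.seed(0)  # fixed seed so the toy run is deterministic

def random_hyperplanes(dim, n_bits):
    """Sample n_bits Gaussian hyperplanes; each yields one hash bit."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(planes, v):
    """Bit i is 1 iff v lies on the positive side of hyperplane i.
    For two vectors at angle theta, each bit collides with prob. 1 - theta/pi."""
    return tuple(int(sum(p_j * v_j for p_j, v_j in zip(p, v)) >= 0) for p in planes)

def hamming(h1, h2):
    """Number of differing hash bits; a proxy for angular distance."""
    return sum(x != y for x, y in zip(h1, h2))

planes = random_hyperplanes(dim=4, n_bits=16)
a = [1.0, 0.9, 0.1, 0.0]
b = [1.0, 1.0, 0.0, 0.1]    # close in angle to a
c = [-1.0, 0.2, -0.9, 0.5]  # far from a

# The similar pair should disagree on far fewer bits than the dissimilar pair.
print(hamming(hash_vector(planes, a), hash_vector(planes, b)),
      hamming(hash_vector(planes, a), hash_vector(planes, c)))
```

Production LSH indexes use many such tables and bucket vectors by hash key to avoid scanning the full collection; the memory cost of those tables is the "cons" noted in the taxonomy below.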

Historical Milestones (Selected)

Algorithm / Index Taxonomy by Core Mechanism

  1. Exact scan / exact indexes (baseline): linear scan; exact trees in low dimensions. [1]
  2. Hashing-based: LSH (including p-stable). Pros: theoretical guarantees in some models; cons: many tables, memory-heavy, can underperform on modern embedding distributions. [4][5]
  3. Tree-based / space partitioning: k-d tree, ball tree, vp-tree; often strong in low-to-mid dimensions or structured data. [1][2][3]
  4. Quantization-based (PQ / IVF-PQ / scalar quantization): short codes and lookup tables, good for memory compression and speed. [6][8]
  5. Graph-based (HNSW and variants): excellent empirical performance with tunable parameters (M, efConstruction, efSearch). [7]
  6. Disk / out-of-core: DiskANN/SSD-optimized graphs for billion-scale retrieval with modest RAM footprints. [14]
  7. Hybrid & filtered vector search: combine metadata filtering and/or lexical retrieval with vector search; a major current research bottleneck. [48]
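The exact linear scan in item 1 of this taxonomy is more than a baseline: it also produces the ground-truth neighbor lists against which every approximate method's recall is measured. A minimal sketch:

```python
import heapq
import math

def linear_scan_knn(query, points, k):
    """Exact top-k by brute-force L2 distance; O(n * d) per query.
    Serves as ground truth when computing recall@k for ANN indexes."""
    def dist(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, points[i])))
    # heapq.nsmallest keeps the scan single-pass with an O(k)-sized heap.
    return heapq.nsmallest(k, range(len(points)), key=dist)

points = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]]
print(linear_scan_knn([0.0, 0.0], points, k=2))  # → [0, 2]
```

Every index family below trades away the O(n·d) cost of this scan in a different way: hashing buckets candidates, trees prune regions, quantization shrinks distance evaluations, and graphs restrict search to a navigable neighborhood.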

System Architecture and Vector Data Management

From Index Libraries to Vector DBMS (VDBMS)

The ecosystem commonly falls into three layers: (i) libraries (algorithms + index APIs), (ii) engines/services (vector search service runtime), (iii) VDBMS or vector-enabled databases (durability, distribution, governance, security). Recent surveys emphasize a shift from “algorithm-centric” to “system-centric” driven by LLM/RAG demand. [47] [35]

Typical Cloud-Native VDBMS Architecture

Examples (Academic–Industrial)

Evaluation, Datasets, and Comparative Analysis

Core Metrics

Benchmark Methodology and Reproducibility

Recommended Datasets (by common practice)

Integration Patterns and Safety/Privacy

Hybrid Search (BM25 + Vectors)

BM25 remains a lexical retrieval baseline; dense retrieval can outperform BM25 on some tasks and is commonly used as the retriever in RAG. Sparse neural retrieval methods (e.g., SPLADE) offer another route toward hybrid retrieval (dense + sparse). [38] [36] [39]
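One widely used, parameter-light way to merge a lexical ranking and a dense ranking is reciprocal rank fusion (RRF). The sketch below is illustrative and not drawn from the cited sources; the constant k=60 is the value commonly used in the RRF literature, not a prescription.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids.
    Each doc scores sum(1 / (k + rank)) over the lists that contain it,
    so agreement across retrievers outweighs a high rank in any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # lexical ranking (e.g., BM25)
dense_top = ["d1", "d9", "d3"]  # vector ANN ranking
print(reciprocal_rank_fusion([bm25_top, dense_top]))  # → ['d1', 'd3', 'd9', 'd7']
```

Because RRF consumes only ranks, it sidesteps the score-calibration problem of mixing BM25 scores with cosine similarities; learned reweighting or cross-encoder reranking can then refine the fused list.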

RAG (Retrieval-Augmented Generation)

RAG combines parametric memory (a generative model) with non-parametric memory (a dense vector index) to improve factuality and updatability. This makes vector databases a core infrastructure component for many GenAI applications, especially when provenance (evidence) matters. [35]

Multimodal Retrieval

CLIP learns a shared embedding space for text and images; the LAION-400M release supplies an openly available dataset at a scale suited to multimodal similarity search. [37] [33]

Streaming Ingestion and Real-Time Updates

Real systems often separate index building from serving queries and use log/WAL-based ingestion to support recovery and shard-level consistency. [15] [17]
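The separation of ingest (append to a durable log) from serving (apply or replay the log) can be sketched as a toy write-ahead log. This is purely illustrative: real systems such as Milvus's log-broker design add fsync, segmentation, compaction, and shard-level sequencing that this sketch omits.

```python
import json

class ToyWAL:
    """Append-only log of upsert/delete records; replay rebuilds serving state.
    Illustrative only -- no durability, segmentation, or compaction."""

    def __init__(self):
        self.log = []  # stand-in for a durable log file / log broker

    def append(self, op, key, vector=None):
        """Ingest path: record the operation before it touches any index."""
        self.log.append(json.dumps({"op": op, "key": key, "vec": vector}))

    def replay(self):
        """Recovery path: reapply the log in order to rebuild state."""
        state = {}
        for line in self.log:
            rec = json.loads(line)
            if rec["op"] == "upsert":
                state[rec["key"]] = rec["vec"]
            elif rec["op"] == "delete":
                state.pop(rec["key"], None)
        return state

wal = ToyWAL()
wal.append("upsert", "a", [0.1, 0.2])
wal.append("upsert", "b", [0.3, 0.4])
wal.append("delete", "a")
print(wal.replay())  # only "b" survives replay
```

The key property the sketch preserves is that replaying the log from the beginning always reconstructs the same state, which is what lets query nodes and index builders recover independently after a crash.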

Security, Privacy, and Fairness

Open Challenges and Research Directions

Proposed Title

Vector Databases and Similarity Search Systems: A Systematic Survey, Taxonomy, Comparative Analysis, and Open Challenges for Modern AI Infrastructure [47][35]

Abstract

Vector databases and similarity search systems have become core infrastructure for modern AI applications such as semantic search, recommendation, and especially Retrieval-Augmented Generation (RAG). Rapid progress in this area is driven by algorithmic advances (LSH, product quantization, HNSW, SSD-based indexes like DiskANN) and systems advances (cloud-native VDBMS, vector-in-RDBMS/search engines, efficient metadata filtering, serverless operation). This paper presents: (i) a systematic narrative survey for 2000–2025, (ii) a unified taxonomy across algorithms, system architectures, and application patterns, (iii) qualitative and semi-quantitative comparative analysis grounded in primary literature and public benchmarks, and (iv) open challenges in theoretical guarantees, streaming updates, filtered vector search, benchmarking standardization, and security–privacy–fairness. [47] [10] [12] [48]

Keywords

vector database; approximate nearest neighbor search; HNSW; product quantization; filtered vector search; hybrid retrieval; RAG; multimodal retrieval; benchmarking; privacy. [7] [6] [48] [35] [10]

Proposed Figures (Embed-Ready Blocks)

Figure A — Timeline (Mermaid)

timeline
  title Milestones in vector indexing and vector systems (selected)
  1970s : k-d tree (Bentley)
  1980s : Ball tree (Omohundro)
  1990s : VP-tree / metric trees (Yianilos)
  1998  : LSH foundations (Indyk–Motwani)
  2004  : p-stable LSH (Datar et al.)
  2011  : Product Quantization (PQ)
  2016  : HNSW (preprint); ANN-Benchmarks emerges
  2017  : FAISS GPU similarity search (preprint)
  2019  : DiskANN (NeurIPS)
  2020  : RAG; DPR; vector search inside search engines
  2021  : Milvus (SIGMOD) — purpose-built VDBMS
  2024  : VDBMS survey; filtered vector search accelerates
  2025  : Filtering/metadata research + industrial scaling

Figure B — Taxonomy (Mermaid)

flowchart TB
  %% Science-style class definitions
  classDef default fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a,font-family:monospace;
  classDef primary fill:#fff,stroke:#0055ff,stroke-width:2px,color:#0055ff,font-weight:bold;
  A["Vector Search Problem"]:::primary --> B1["Exact"]
  A --> B2["Approximate (ANNS)"]
  B1 --> T1["Scan / Exact Index"]
  B1 --> T2["Low-dim Trees"]
  B2 --> C1["Hashing-based"]
  B2 --> C2["Tree / Partition-based"]
  B2 --> C3["Quantization-based"]
  B2 --> C4["Graph-based"]
  B2 --> C5["Disk / Out-of-core"]
  B2 --> C6["Filtered / Hybrid"]
  C1 --> LSH["LSH (p-stable...)"]
  C2 --> KD["k-d tree / ball tree / vp-tree"]
  C3 --> PQ["PQ / IVF-PQ / Scalar Quant."]
  C4 --> HNSW["HNSW and variants"]
  C5 --> DISK["DiskANN / SSD graphs"]
  C6 --> HYB["Hybrid lexical+vector; metadata filtering"]

Figure C — Distributed Vector System Architecture (Mermaid)

flowchart LR
  classDef default fill:#fff,stroke:#333,stroke-width:1px,color:#1a1a1a,font-family:monospace;
  classDef storage fill:#f9f9f9,stroke:#888,stroke-dasharray: 5 5;
  subgraph Client
    Q["Query / Upsert API"]
  end
  subgraph ControlPlane
    META["Metadata / Schema / Access Control"]
    SCHED["Scheduler / Coordinator"]
  end
  subgraph DataPlane
    ING["Ingest / Streaming Node"]
    SEG["Segment Manager"]
    IDX["Index Builder"]
    QN["Query Node"]
  end
  subgraph Storage
    WAL["WAL / Log Broker"]:::storage
    OBJ["Object Storage / Filesystem"]:::storage
    KV["Meta store (KV / RDB)"]:::storage
  end
  Q --> META
  Q --> QN
  Q --> ING
  ING --> WAL
  ING --> SEG
  SEG --> IDX
  IDX --> OBJ
  QN --> OBJ
  QN --> META
  META --> KV
  WAL --> OBJ
  SCHED --> ING
  SCHED --> QN
  SCHED --> IDX

Figure D — Hybrid Search + RAG Pipeline (Mermaid)

flowchart TB
  classDef default fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a,font-family:monospace;
  classDef highlight fill:#f0f7ff,stroke:#0055ff,stroke-width:2px;
  U["User Query"]:::highlight --> E["Embed Query (Dense)"]
  U --> S["Sparse Encoder / BM25 Query"]
  subgraph Retrieval
    direction TB
    V["Vector ANN Search"] --> C1["Top-N Candidates"]
    K["Keyword / Sparse Search"] --> C2["Top-M Candidates"]
  end
  E --> V
  S --> K
  C1 --> MRG["Merge and Reweight"]
  C2 --> MRG
  MRG --> RR["Reranker (Cross-encoder / LLM)"]
  RR --> CTX["Context Selection"]
  CTX --> LLM["LLM Generation"]
  LLM --> A["Answer + Citations / Provenance"]:::highlight

Figure E — Recall vs Latency (Python snippet)


import matplotlib.pyplot as plt

# NOTE:
# - DiskANN numbers should come from the DiskANN paper or your reproduced runs.
# - Others below are placeholders: replace with your measured results under a fixed protocol.
labels = ["DiskANN (paper)", "HNSW (placeholder)", "IVF-PQ (placeholder)", "ScaNN/AVQ (placeholder)"]
latency_ms = [2.8, 5.0, 8.0, 4.0]     # replace with your measurements
recall =     [0.95, 0.92, 0.90, 0.93] # replace with your measurements

plt.figure()
plt.scatter(latency_ms, recall)
for x, y, s in zip(latency_ms, recall, labels):
    plt.text(x, y, s)

plt.xlabel("Latency (ms) — lower is better")
plt.ylabel("Recall@1 (or Recall@k) — higher is better")
plt.title("Illustrative Recall-Latency trade-off (replace placeholders with reproducible results)")
plt.ylim(0.7, 1.0)
plt.xlim(0, max(latency_ms) + 2)
plt.show()

Figure F — Reproducible Experiment Workflow (Mermaid)

flowchart TD
  %% Science-style (lab-style) class definitions
  classDef default fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a,font-family:monospace;
  classDef highlight fill:#fff,stroke:#0055ff,stroke-width:2px,color:#0055ff,font-weight:bold;
  classDef storage fill:#f9f9f9,stroke:#888,stroke-dasharray: 5 5;
  A["Choose Dataset + Query Set + Ground Truth"]:::highlight --> B["Normalize Vectors: dtype, norm, metric"]
  B --> C["Choose Algorithm + Parameters (Grid/Random Search)"]
  C --> D["Build Index + Measure Build Time + Size"]
  D --> E["Warm-up + Benchmark (Latency/QPS/Recall)"]
  E --> F["Measure p50/p95/p99 + CPU/GPU/Mem/Disk"]
  F --> G["Record Hardware + Versions + Seed + Config"]
  G --> H["Publish Scripts + Configs + Logs + Checksums"]:::highlight
  %% Closed-loop workflow note
  H -.-> |Validation| A

Table 1 — Algorithm Comparison (Minimum)

| Algorithm / Index | Group | Core Idea | Query Complexity (qualitative) | Memory | Dynamic Updates | Best Fit | Primary Source |
|---|---|---|---|---|---|---|---|
| k-d tree | Tree-based | Axis-aligned partitioning | Strong at low dimension; degrades as dimension increases | Medium | Yes | Exact/approx NN in low dimensions | [1] |
| Ball tree | Tree-based | Nodes are hyperspheres | Often better than k-d tree for some distributions | Medium | Yes | Metric L2, mid dimension | [2] |
| VP-tree / Metric tree | Tree-based | Vantage points for general metric spaces | Data-dependent; good for general metric spaces | Medium | Yes | Metric spaces beyond Euclidean | [3] |
| LSH (foundations) | Hashing-based | Probabilistic hashing preserves "near" | Expected sublinear under assumptions | High | Yes | When theory guarantees are required | [4] |
| p-stable LSH | Hashing-based | p-stable distributions for lp (esp. l2) | Good for l2 with proper tuning | High | Yes | Approx NN under lp norms | [5] |
| Product Quantization (PQ) | Quantization | Subspace quantization + lookup tables | Fast table lookup; depends on code settings | Low | Sometimes | Large compression, fast serving | [6] |
| HNSW | Graph-based | Multi-layer navigable small world graph | Very fast in practice; tunable via ef/M | Higher than PQ | Yes (incremental) | Diverse workloads; high recall at low latency | [7] |
| DiskANN | Disk-based graph | SSD-optimized graph layout + search | Low latency, high QPS at billion scale on SSD | Low RAM + SSD | Depends | Billion-scale out-of-core | [14] |

Table 2 — System Comparison (Minimum 12)

Performance numbers are often not directly comparable across systems because datasets, hardware, and tuning differ; for publication-grade claims, measure under a fixed protocol (ANN-Benchmarks / BigANN). [10][11][12]

| System | Group | License | Core Methods | Metrics | Dynamic Updates | Distributed | GPU | Notes | Primary Source |
|---|---|---|---|---|---|---|---|---|---|
| FAISS | Library | MIT | IVF/PQ, HNSW, many index types | L2, IP, cosine | Depends on index | No (library) | Yes | Widely used baseline library | [8] |
| Annoy | Library | Apache-2.0 | Random projection trees; mmap index | Angular/L2 (practice) | Mostly offline | No | No | Memory-mapped, read-optimized | (project docs) |
| hnswlib | Library | Apache-2.0 | HNSW | Configurable | Yes | No | No | Popular lightweight HNSW impl | [7] |
| NMSLIB | Library/toolkit | Apache-2.0 | Multiple methods incl. HNSW | Multiple spaces | Yes | No | No | Toolkit for many ANN methods | (project docs) |
| ScaNN | Library | Apache-2.0 | AVQ + pruning + reorder | MIPS/L2 | Yes | No | CPU-optimized | Strong quality–speed when tuned | [9] |
| Milvus | VDBMS | Apache-2.0 | IVF/PQ, HNSW; segment-based | Multiple | Yes | Yes | Yes | Purpose-built VDBMS | [15] |
| Weaviate | VDBMS | BSD-3-Clause | HNSW, flat; dynamic switching | Configurable | Yes | Yes (cloud-native) | Varies | Popular open-source VDBMS | [47] |
| Qdrant | VDBMS | (repo) | HNSW + payload filtering; quantization | Configurable | Yes | Deployment options | Varies | Filtering-first design | (project docs) |
| pgvector | DB extension | (repo) | HNSW, IVFFlat | L2, IP, cosine, L1, etc. | Yes | Via Postgres | No | Vector search inside Postgres | [30] |
| Vespa | Search + Vector | Apache-2.0 | HNSW + ranking pipeline | Configurable | Yes (real-time) | Yes | Varies | Hybrid ranking at scale | (project docs) |
| Elasticsearch kNN (Lucene) | Search + Vector | Multiple | HNSW per segment (Lucene) | kNN metrics | Yes | Yes (cluster) | No | Strong hybrid story with BM25 | [21][23] |
| Pinecone | Managed VDB | Commercial | Not fully disclosed | Dot/L2/cosine (docs) | Yes | Yes | Varies | Notable research on serverless filtering | [25][26] |

Recommended Reproducible Benchmark Protocol (Summary)

To compare fairly, report at least one “medium” dataset (SIFT1M or GloVe) and one “large” dataset (Deep1B or SIFT1B), fix dtype/normalization/metric, tune key parameters (HNSW: M, efConstruction, efSearch; IVF: lists/probes; PQ: code size; ScaNN: tree+quant+reorder), and report latency p50/p95/p99, QPS under defined concurrency, recall@k against ground truth, and resource usage (index size, RSS, disk footprint, build time). Publish configs/logs/checksums. [10][11][12]
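Recall@k against exact ground truth is the central quality metric in this protocol. A minimal computation, assuming the ground-truth ids come from an exact scan over the same dataset and metric:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Average fraction of the true top-k that the ANN index also returned.
    approx_ids and exact_ids are per-query lists of neighbor ids."""
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        total += len(set(approx[:k]) & set(exact[:k])) / k
    return total / len(exact_ids)

# Two queries, k = 3: ANN results vs exact top-3 from a linear scan.
approx = [[1, 4, 7], [2, 3, 9]]
exact  = [[1, 4, 5], [2, 3, 9]]
print(recall_at_k(approx, exact, k=3))  # (2/3 + 3/3) / 2
```

Reporting recall this way, alongside p50/p95/p99 latency measured at the same parameter settings, is what makes the recall-latency trade-off curves in Figure E meaningful.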


References


  1. Bentley, J. L. Multidimensional Binary Search Trees in Database Applications. IEEE TSE, 1979.
  2. Omohundro, S. M. Five Balltree Construction Algorithms. 1989.
  3. Yianilos, P. N. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. SODA 1993. DOI: 10.1145/313559.313789.
  4. Indyk, P., Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
  5. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. SoCG 2004. DOI: 10.1145/997817.997857.
  6. Jégou, H., Douze, M., Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE TPAMI 33(1), 2011. DOI: 10.1109/TPAMI.2010.57.
  7. Malkov, Y. A., Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. arXiv:1603.09320.
  8. Johnson, J., Douze, M., Jégou, H. Billion-Scale Similarity Search with GPUs. arXiv:1702.08734 (FAISS).
  9. Guo, R., Sun, P., Lindgren, E., et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. arXiv:1908.10396 (ScaNN/AVQ).
  10. Aumüller, M., Bernhardsson, E., Faithfull, A. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Search. arXiv:1807.05614.
  11. Aumüller, M., Bernhardsson, E., Faithfull, A. Reproducibility Protocol for ANN-Benchmarks. 2022.
  12. BigANN / Big-ANN Benchmarks (NeurIPS tracks): https://big-ann-benchmarks.com/
  13. Babenko, A., Lempitsky, V. Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. CVPR 2016 (Deep1B).
  14. Subramanya, S. J., et al. DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node. NeurIPS 2019.
  15. Wang, J., Yi, X., Guo, R., et al. Milvus: A Purpose-Built Vector Data Management System. SIGMOD 2021. DOI: 10.1145/3448016.3457550.
  16. Milvus Documentation — Index types overview.
  17. Milvus Documentation — Architecture overview (streaming/query/data nodes & storage).
  18. Weaviate Documentation — Vector indexing (HNSW/flat, dynamic switching).
  19. Weaviate GitHub — license and system description.
  20. Vespa Documentation — Approximate NN search using HNSW.
  21. Elastic Docs — kNN search in Elasticsearch.
  22. Elastic Docs — Optimize approximate kNN search (HNSW per segment).
  23. Apache Lucene — Lucene99HnswVectorsFormat (HNSW vectors format).
  24. Pinecone Docs — Hybrid search.
  25. Ingber, A., Liberty, E. Accurate and Efficient Metadata Filtering in Pinecone's Serverless Vector Database. ICML 2025.
  26. Pinecone Docs — Security overview (TLS/at-rest encryption).
  27. Pinecone — Technical & Organizational Security Measures (PDF).
  28. Qdrant GitHub — features (payload filtering, quantization).
  29. Qdrant Documentation — Security (API key, TLS).
  30. pgvector GitHub — HNSW/IVFFlat and parameters.
  31. Pennington, J., Socher, R., Manning, C. GloVe: Global Vectors for Word Representation. EMNLP 2014. DOI: 10.3115/v1/D14-1162.
  32. Bajaj, P., Campos, D., Craswell, N., et al. MS MARCO. arXiv:1611.09268.
  33. Schuhmann, C., Vencu, R., Beaumont, R., et al. LAION-400M. arXiv:2111.02114. DOI: 10.48550/arXiv.2111.02114.
  34. Russakovsky, O., Deng, J., Su, H., et al. ImageNet Large Scale Visual Recognition Challenge. IJCV 2015. DOI: 10.1007/s11263-015-0816-y.
  35. Lewis, P., Perez, E., Piktus, A., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
  36. Karpukhin, V., Oğuz, B., Min, S., et al. Dense Passage Retrieval for Open-Domain QA. EMNLP 2020.
  37. Radford, A., Kim, J. W., Hallacy, C., et al. CLIP. ICML 2021.
  38. Robertson, S., Zaragoza, H. BM25 and Beyond. FnT IR 2009. DOI: 10.1561/1500000019.
  39. Formal, T., Piwowarski, B., Clinchant, S. SPLADE. SIGIR 2021. DOI: 10.1145/3404835.3463098.
  40. Dwork, C., Roth, A. The Algorithmic Foundations of Differential Privacy. 2014.
  41. Abadi, M., Chu, A., Goodfellow, I., et al. Deep Learning with Differential Privacy. CCS 2016.
  42. Shokri, R., Stronati, M., Song, C., Shmatikov, V. Membership Inference Attacks Against ML Models. IEEE S&P 2017.
  43. Fredrikson, M., Jha, S., Ristenpart, T. Model Inversion Attacks. CCS 2015. DOI: 10.1145/2810103.2813677.
  44. Bolukbasi, T., Chang, K.-W., Zou, J., et al. Debiasing Word Embeddings. NeurIPS 2016.
  45. Regulation (EU) 2016/679 (GDPR).
  46. Regulation (EU) 2024/1689 (EU AI Act).
  47. Pan, J. J., et al. Survey of Vector Database Management Systems. VLDB Journal 2024.
  48. Chronis, Y., Caminal, P., et al. Filtered Vector Search: State-of-the-art and Research Directions. PVLDB 2025.
  49. Elastic — Licensing FAQ (AGPL option; licensing changes).
  50. Elastic Blog — "Elasticsearch Is Open Source. Again!" (AGPL option).

End of document.