vector-search-class-notes

Class notes for the course "Long Term Memory in AI - Vector Search and Databases", COS 597A @ Princeton, Fall 2023


Long Term Memory in AI - Vector Search and Databases

COS 597A is given at Princeton during the Fall 2023 semester by

  • Edo Liberty, the Founder and CEO of Pinecone, the world's leading Vector Database.

  • Matthijs Douze, the architect and main developer of FAISS, the most popular and advanced open-source library for vector search.

The course covers the core concepts, algorithms, and data structures used for modern vector search systems and platforms. An advanced undergraduate or graduate student with some hands-on experience in linear algebra, probability, algorithms, and data structures should be able to follow this course.

Abstract

Long Term Memory is a fundamental capability in the modern AI stack. At their core, these systems use vector search. Vector search is also a fundamental tool for systems that manipulate large collections of media such as search engines, knowledge bases, content moderation tools, and recommendation systems. As such, the discipline lies at the intersection of Artificial Intelligence and Database Management Systems. This course will cover the scientific foundations and practical implementation of vector search applications, algorithms, and systems. The course will be evaluated with a project and an in-class presentation.

Syllabus

The class contents below are tentative.

  1. Introduction to Vector Search [Matthijs]

    • 1/2h of intro to the course (w/ Edo)
    • Embeddings as an information bottleneck. Instead of learning end-to-end, use embeddings as an intermediate representation
    • Advantages: scalability, instant updates, and explainability
    • Typical volumes of data and scalability. Embeddings are the only way to manage / access large databases
    • The embedding contract: the embedding extractor and embedding indexer agree on the meaning of the distance. Separation of concerns.
    • The vector space model in information retrieval
    • Vector embeddings in machine learning
    • Define vector, vector search, ranking -- retrieval -- recall (a short brute-force search and recall sketch follows this unit)
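
As referenced in the last bullet, here is a minimal sketch of exact brute-force k-NN search and of recall@k as an evaluation measure. It is an illustration only, not course code; the sizes and names (database, queries, k) are assumptions.

import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((2000, 32)).astype("float32")   # n vectors of dimension d
queries = rng.standard_normal((50, 32)).astype("float32")
k = 10

def knn_bruteforce(xq, xb, k):
    """Return, for each query, the indices of its k nearest database vectors (L2)."""
    dists = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1)   # (nq, nb) squared distances
    return np.argsort(dists, axis=1)[:, :k]

def recall_at_k(ground_truth, retrieved):
    """Fraction of the true k nearest neighbors that an (approximate) search returned."""
    hits = sum(len(set(gt) & set(r)) for gt, r in zip(ground_truth, retrieved))
    return hits / ground_truth.size

gt = knn_bruteforce(queries, database, k)
print("recall of exact search against itself:", recall_at_k(gt, gt))   # 1.0
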
  2. Text embeddings [Matthijs]

    • 2-layer word embeddings. Word2vec and fastText, obtained via a factorization of a co-occurrence matrix. Embedding arithmetic: king + woman - man = queen (already based on similarity search; a toy sketch follows this unit)
    • Sentence embeddings: How to train, masked LM. Properties of sentence embeddings.
    • Large Language Models: reasoning as an emerging property of a LM. What happens when the training set = the whole web
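
As a toy illustration of the embedding arithmetic mentioned above, the sketch below searches for the word whose vector is most similar to king + woman - man. The tiny hand-made 4-dimensional vectors are assumptions for illustration, not real word2vec or fastText embeddings; with trained embeddings the mechanics are the same.

import numpy as np

emb = {   # hypothetical toy "embeddings"
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def nearest(vec, vocab, exclude=()):
    """Similarity search: the vocabulary word with the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(vec, vocab[w]))

target = emb["king"] + emb["woman"] - emb["man"]
print(nearest(target, emb, exclude={"king", "woman", "man"}))   # -> "queen"
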
  3. Image embeddings [Matthijs]

    • Pixel structures of images. Early works on direct pixel indexing
    • Traditional CV models. Global descriptors (GIST). Local descriptors (SIFT and friends). Direct indexing of local descriptors for image matching, local descriptor pooling (Fisher, VLAD)
    • Convolutional Neural Nets. Off-the-shelf models. Trained specifically (contrastive learning, self-supervised learning)
    • Modern Computer Vision models
  4. Practical indexing [Edo]

    • How an index works: basic functionalities (search, add). Optional functionalities: snapshot, incremental add, remove
    • k-NN search vs. range search
    • maintaining top-k results (a heap-based sketch follows this unit)
    • Criteria: Speed / accuracy / memory usage / updateability / index construction time
    • Early works inspired by bag-of-visual-words, based on quantization
    • Voronoi diagram with search buckets
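
The heap-based sketch referenced above: a single scan of the database that maintains the k best (smallest-distance) results seen so far. The synthetic data and the use of heapq are assumptions for illustration.

import heapq
import numpy as np

def topk_scan(query, database, k):
    """One pass over the database, keeping the k smallest distances in a max-heap."""
    heap = []   # max-heap on distance, implemented with negated keys
    for idx, vec in enumerate(database):
        d = float(((query - vec) ** 2).sum())
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif -heap[0][0] > d:            # current worst kept result is farther than d
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)   # (distance, index), ascending

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 16))
print(topk_scan(rng.standard_normal(16), xb, k=5))
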
  5. Low Dimensional Vector Search [Edo]

    • k-d tree, space partitioning based algorithms, proof, structures, and asymptotic behavior
    • Probabilistic inequalities. Recap of basic inequalities: Markov, Chernoff, Hoeffding
    • Concentration of Measure phenomena. Orthogonality of random vectors (illustrated in the short sketch after this unit)
    • Curse of dimensionality. The failure of space partitioning
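
The sketch referenced above illustrates concentration of measure: inner products of independent random unit vectors shrink like 1/sqrt(d), so in high dimension random vectors are nearly orthogonal. Dimensions and sample sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000, 10000):
    x = rng.standard_normal((1000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)        # 1000 random unit vectors
    u, v = x[:500], x[500:]
    dots = np.abs((u * v).sum(axis=1))                   # |<u, v>| over 500 independent pairs
    print(f"d={d:6d}   mean |<u,v>| = {dots.mean():.4f}")   # decreases roughly as 1/sqrt(d)
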
  6. Dimensionality Reduction [Edo]

    • Principal Components Analysis, optimal dimension reduction in the sum of squared distance measure
    • Fast PCA algorithms
    • Random Projections. Gaussian random i.i.d. dimension reduction (a small sketch follows this unit)
    • Fast Random Projections. Accelerating the above to near linear time
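
A minimal sketch of the Gaussian i.i.d. random projection mentioned above: project onto a random matrix with entries scaled so that pairwise distances are approximately preserved. The sizes n, d, m are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 1024, 64                        # n points, original dimension d, target dimension m
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, m)) / np.sqrt(m)   # Gaussian projection, scaled to preserve norms in expectation
Y = X @ R

i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))   # the two distances are close
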
  7. Locality Sensitive Hashing [Edo]

    • Definition of Approximate Nearest Neighbor Search (ANNS)
    • Criteria: Speed / accuracy / memory usage / updateability / index construction time
    • Definition of Locality Sensitive Hashing and examples (a toy random-hyperplane example follows this unit)
    • The LSH Algorithm
    • LSH Analysis, proof of correctness, and asymptotics
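
The toy example referenced above uses one classic LSH family, random hyperplanes (SimHash) for angular similarity: a vector is hashed to the signs of a few random projections, so nearby vectors tend to collide in the same bucket. The number of bits and the data are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 128, 16
hyperplanes = rng.standard_normal((n_bits, d))

def lsh_key(x):
    """Hash a vector to an n_bits-bit bucket key given by the signs of random projections."""
    return tuple((hyperplanes @ x > 0).astype(int))

x = rng.standard_normal(d)
near = x + 0.05 * rng.standard_normal(d)        # a small perturbation of x
far = rng.standard_normal(d)                    # an unrelated vector
print(sum(a != b for a, b in zip(lsh_key(x), lsh_key(near))),   # few differing bits
      sum(a != b for a, b in zip(lsh_key(x), lsh_key(far))))    # roughly n_bits / 2
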
  8. Clustering [Edo]

    • Semantic clustering: properties (purity, as an aid for annotation)
    • Clustering from a similarity graph (spectral clustering, agglomerative clustering)
    • Vector clustering: mean squared error criterion. Tradeoff with number of clusters
    • Relationship between vector clustering and quantization (OOD extension)
    • The k-means clustering measure and Lloyd's algorithm (a short sketch follows this unit)
    • Lloyd's optimality conditions
    • Initialization strategies (kmeans++, progressive dimensions with PCA)
    • The inverted file model. Relationship with sparse matrices
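
The short sketch referenced above implements Lloyd's algorithm for k-means: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. The plain random initialization (instead of kmeans++) and the data sizes are assumptions for illustration.

import numpy as np

def lloyd(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # random initialization
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 8))
centroids, assign = lloyd(X, k=16)
print("mean squared error:", ((X - centroids[assign]) ** 2).sum(-1).mean())
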
  9. Quantization for lossy vector compression [Matthijs]

    • Vector quantization as a topline (it directly optimizes the objective)
    • Binary quantization and Hamming comparison
    • Product quantization. Chunked vector quantization. Optimized vector quantization (a small PQ sketch follows this unit)
    • Additive quantization. Extension of product quantization. Difficult to train exactly; practical approximations (residual quantization, CQ, TQ, LSQ, etc.)
    • Cost of coarse quantization vs. inverted list scanning
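
The small PQ sketch referenced above: split each vector into chunks, run k-means independently per chunk, and encode a vector as the list of its nearest centroid ids. The chunk count, codebook size, and reuse of plain Lloyd iterations are assumptions for illustration.

import numpy as np

def train_pq(X, n_chunks=4, ksub=16, n_iter=10, seed=0):
    """One k-means codebook per chunk of the vector."""
    rng = np.random.default_rng(seed)
    codebooks = []
    for C in np.split(X, n_chunks, axis=1):
        cents = C[rng.choice(len(C), ksub, replace=False)]
        for _ in range(n_iter):                                   # Lloyd iterations per chunk
            assign = ((C[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
            for j in range(ksub):
                if (assign == j).any():
                    cents[j] = C[assign == j].mean(0)
        codebooks.append(cents)
    return codebooks

def pq_encode(x, codebooks):
    """Compress one vector to n_chunks small integer codes."""
    parts = np.split(x, len(codebooks))
    return [int(((c - p) ** 2).sum(-1).argmin()) for p, c in zip(parts, codebooks)]

def pq_decode(codes, codebooks):
    """Reconstruct an approximation of the vector from its codes."""
    return np.concatenate([c[code] for code, c in zip(codes, codebooks)])

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 32))
books = train_pq(X)
x = X[0]
print("reconstruction error:", np.linalg.norm(x - pq_decode(pq_encode(x, books), books)))
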
  10. Graph based indexes [Guest lecture]

    • Early works: hierarchical k-means
    • Neighborhood graphs. How to construct them. Nearest Neighbor Descent
    • Greedy search in neighborhood graphs. On its own this does not work -- long-range jumps are needed (a small sketch follows this unit)
    • HNSW. A practical hierarchical graph-based index
    • NSG. Evolving a k-NN graph
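
The sketch referenced above shows greedy best-first search on a k-NN neighborhood graph, the basic routine behind graph indexes such as HNSW and NSG. Building the graph by brute force and the parameters are assumptions for illustration; the walk can stop at a local minimum, which is exactly why practical indexes add long-range links or a hierarchy.

import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((2000, 16))
k_graph = 10

# Build a k-NN neighborhood graph by brute force (real systems use NN-descent or similar).
sq = (xb ** 2).sum(1)
D = sq[:, None] + sq[None, :] - 2 * xb @ xb.T
neighbors = np.argsort(D, axis=1)[:, 1:k_graph + 1]     # skip each point itself at position 0

def greedy_search(query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    cur_d = ((xb[current] - query) ** 2).sum()
    while True:
        cand = neighbors[current]
        cand_d = ((xb[cand] - query) ** 2).sum(1)
        best = cand_d.argmin()
        if cand_d[best] >= cur_d:          # no neighbor improves: stop (possibly a local minimum)
            return current, cur_d
        current, cur_d = cand[best], cand_d[best]

q = rng.standard_normal(16)
found, _ = greedy_search(q)
print("greedy result:", found, "  true nearest:", ((xb - q) ** 2).sum(1).argmin())
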
  11. Computing Hardware and Vector Search [Guest lecture]

    • Computing platform: local vs. service / CPU vs. GPU
    • efficient implementation of brute force search (a matrix-multiplication sketch follows this unit)
    • distance computations for product quantization -- tradeoffs. SIMD implementation
    • Parallelization and distribution -- sharding vs. inverted list distribution
    • Using co-processors (GPUs)
    • Using a hierarchy of memory types (RAM + SSD or RAM + GPU RAM)
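
The matrix-multiplication sketch referenced above illustrates an efficient brute-force search: expanding ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y> turns the dominant cost into one matrix multiply, which maps well onto BLAS on CPUs and onto GPUs. Sizes are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((10000, 64)).astype("float32")   # database
xq = rng.standard_normal((32, 64)).astype("float32")      # queries
k = 10

xb_sq = (xb ** 2).sum(1)                  # database norms, computed once
xq_sq = (xq ** 2).sum(1)
dots = xq @ xb.T                          # (nq, nb): the heavy, BLAS-friendly part
dists = xq_sq[:, None] + xb_sq[None, :] - 2 * dots
I = np.argpartition(dists, k, axis=1)[:, :k]    # indices of the k nearest per query (unordered)
print(I.shape)
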
  12. Advanced topics -- articles to be presented by students [all]

    • Vectors in Generative AI
    • Neural quantization
    • Vector Databases
    • Beyond feature extraction and indexing: neural indexing
  13. In class project presentations

Selected literature

Build

On unix-like systems with bibtex and pdflatex available you should be able to do this:

git clone git@github.com:edoliberty/vector-search-class-notes.git
cd vector-search-class-notes
./build

Contribute

These class notes are intended to be used freely by academics anywhere, students and professors alike. Please feel free to contribute by opening pull requests or issues.