What is a vector database?

In my PhD work, I am thinking about ways of encoding data that are inherently quantum mechanical, so it pays to keep up to date with new ways of storing data. I was told about vector databases (roughly, that they are used in machine learning applications), but I don’t know much more than this. In this post I try to answer the questions below. These are rough notes typed up over 1-2 hours of reading.

Questions:

  • What sort of data is stored in vector databases and how is it stored?
  • What are the advantages of using vector databases?
  • Who is using vector databases?

I am reading the resources provided by Pinecone, a company that builds a vector database: https://www.pinecone.io/learn/

Vector Embeddings

  • Representing objects in your dataset as vectors allows you to process them nicely using Linear Algebra and Machine Learning tricks. The mapping of your datapoints to a vector space is called a vector embedding.
  • The vector embedding should be chosen so that relationships between vectors mean something. For example, we might want similar objects to be represented by vectors that are close together, e.g. a small Euclidean distance or, for vectors normalised to unit length, a large inner product. Techniques for creating vector embeddings are called vector embedding models.
  • Example vector embedding models: word2vec, or taking the outputs of the last hidden layer of a trained neural network as the embedding.
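
To make "closer together" concrete, here is a minimal sketch using NumPy. The three-dimensional vectors are made up for illustration and do not come from a real embedding model; the point is only that cosine similarity (the inner product after normalising to unit length) gives a higher score to the pair we intend to be similar.

```python
import numpy as np

# Hypothetical toy embeddings, invented for illustration only.
# A real embedding model would produce hundreds of dimensions.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Inner product of u and v after scaling both to unit length."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# "cat" should score higher against "dog" than against "car".
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

Without the normalisation step, a long vector can have a large inner product with almost anything, which is why "closer" and "larger inner product" only coincide for unit-length vectors.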

Vector Search

  • Keyword Search often misses results that Semantic Search would find. Keyword Search looks for matches between the words in the user's search query and the text in the corpus being searched over. This can miss results that are relevant to the query but do not use the words in the query. Semantic Search instead looks for items in the corpus that are semantically close to the query, thus finding more of the relevant results. Semantic Search can be done by using vector embeddings to represent both the items in the database and the query, so that Vector Search can be used to retrieve the relevant items.
  • Vector Search is generally performed as an Approximate Nearest Neighbour (ANN) search. The search achieves a reduced running time by being approximate rather than exact.
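
As a baseline for what the approximate methods are speeding up, here is a minimal sketch of exact (brute-force) vector search in NumPy. The corpus is random data standing in for real embeddings, and `exact_top_k` is a hypothetical helper name of my own. It compares the query against every stored vector, so the cost grows linearly with the size of the corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random unit vectors standing in for a corpus of 1000 embedded items.
corpus = rng.normal(size=(1000, 16))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def exact_top_k(query, vectors, k=3):
    """Return the indices of the k vectors with the largest inner product with query."""
    scores = vectors @ query          # one inner product per stored vector
    return np.argsort(-scores)[:k]    # highest scores first

# Build a query that is a slightly perturbed copy of item 42,
# so we know which item the search ought to return first.
query = corpus[42] + 0.01 * rng.normal(size=16)
query /= np.linalg.norm(query)

top = exact_top_k(query, corpus)
print(top)
```

An approximate search would return nearly the same neighbours while looking at only a fraction of the corpus.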

Vector Indices

  • Vector Indices are a precursor to Vector Databases. They support some features, such as fast vector search, but lack other features of database management systems, like scalability, security, easy data management, easy software updates, etc.
  • Facebook AI Similarity Search (FAISS) is one example of a Vector Index. It uses tricks like partitioning the vector space into cells and, when a query comes in, determining which cell to search for a match (an inverted file index), or splitting each vector into subvectors and clustering those (product quantisation).
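
The cell-partitioning trick can be sketched in plain NumPy. This is my own toy version under simple assumptions (random data, a few rounds of k-means to choose the cells), not how FAISS itself is implemented; `build_cells`, `search` and their parameters are hypothetical names for illustration.

```python
import numpy as np

def build_cells(vectors, n_cells, n_iter, rng):
    """Partition vectors into cells using a few rounds of k-means."""
    start = rng.choice(len(vectors), size=n_cells, replace=False)
    centroids = vectors[start].copy()
    for _ in range(n_iter):
        # Assign every vector to its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (empty cells stay put).
        for c in range(n_cells):
            members = vectors[assignment == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    # Final assignment, consistent with the final centroids.
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

def search(query, vectors, centroids, assignment, nprobe=1):
    """Search only the nprobe cells whose centroids are closest to the query."""
    cells = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assignment, cells))
    if candidates.size == 0:
        return None  # the probed cells happened to be empty
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return int(candidates[dists.argmin()])

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 8))
centroids, assignment = build_cells(vectors, n_cells=16, n_iter=10, rng=rng)

# Probing one cell inspects only a fraction of the 2000 vectors.
print(search(vectors[7], vectors, centroids, assignment, nprobe=1))
```

With `nprobe=1` the search can miss the true nearest neighbour when it sits just across a cell boundary; raising `nprobe` trades speed for accuracy, and probing every cell recovers the exact answer.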

Vector Databases

  • “a vector database provides a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices, such as scalability challenges, cumbersome integration processes, and the absence of real-time updates and built-in security measures, ensuring a more effective and streamlined data management experience.”
  • More tricks to improve efficiency of similarity queries are used (like the one discussed above under vector indices). See here for descriptions: https://www.pinecone.io/learn/vector-database/#how-does-a-vector-database-work

Who uses Vector Databases?

  • Tech companies like Google, Meta, Spotify, etc.
