Vector Databases & Embeddings

Vector databases and embeddings are hot topics in AI right now.

Vector database company Pinecone recently raised $100 million at a valuation of about $1 billion.

Companies like Shopify, Brex, and HubSpot use them for their AI applications.

But what are they, how do they work, and why are they so important in AI?


First, what are vector embeddings?
A simple explanation is:

Embeddings are just N-dimensional vectors of numbers. They can represent anything, such as text, music, videos, etc. We will focus on text.


The process of creating embeddings is simple. It involves an embedding model (e.g., Ada from OpenAI).

You send the text to the model, and it creates a vector representation of that data for you to store and use later.
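Here's a minimal sketch of what that call looks like, assuming the official `openai` Python package (v1+) and an OPENAI_API_KEY in your environment; the exact client syntax varies between library versions:

```python
# Minimal sketch: turn a piece of text into an embedding vector.
# Assumes the official `openai` package and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # the Ada embedding model mentioned above
        input=text,
    )
    return response.data[0].embedding  # a plain list of floats

vector = embed("Vector databases are hot right now.")
print(len(vector))  # Ada returns 1536-dimensional vectors
```

You store that list of numbers (plus the original text) and reuse it later for search.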

The reason vector databases are important is that they give us the ability to do semantic search: searching by similarity of meaning rather than by exact keywords.

In this example, we can model a man, king, woman, and queen on a vector plane and easily see their relationships to each other.
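To make this concrete, here's a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions); the numbers are invented purely to show that "king - man + woman" lands near "queen":

```python
import numpy as np

# Made-up 3-dimensional vectors, purely for illustration.
# Real embedding models produce much higher-dimensional vectors.
man   = np.array([1.0, 0.2, 0.1])
woman = np.array([1.0, 0.9, 0.1])
king  = np.array([1.0, 0.2, 0.9])
queen = np.array([1.0, 0.9, 0.9])

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more similar in direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land close to queen.
analogy = king - man + woman
print(cosine(analogy, queen))  # ~1.0 with these toy numbers
print(cosine(analogy, man))    # noticeably lower
```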


Here's a more direct example: Imagine you're a child with a big box of toys. Now you want to find similar toys, like toy cars and toy buses. They are both vehicles, so they are similar.

This is called "semantic similarity" - when things have similar meanings/ideas.

Now suppose there are two related but different toys, like a toy car and a toy road. They are not the same, but they are still related, because cars typically drive on roads.

So why are they important? Well, it's because of the context limitation of large language models (LLMs).

Ideally, we could put an unlimited number of words into an LLM prompt. But we can't. Right now, it's limited to ~4096-32k tokens.

This token limit acts as the LLM's "memory": it caps how much text we can show the model at once, so we are strictly limited in how we interact with it.

That's why you can't copy and paste a whole PDF into ChatGPT and ask it to summarize it. (Maybe you can now with GPT-4 32k.)
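If you want to see the limit for yourself, here's a quick sketch using tiktoken (OpenAI's open-source tokenizer library) to count how many tokens a document takes up; the file name is just a placeholder:

```python
# Count how many tokens a document would consume in a prompt.
# Requires `pip install tiktoken`; the file name below is a placeholder.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

document = open("hearing_transcript.txt").read()  # hypothetical file
print(count_tokens(document))
# Anything above the model's context limit (~4k to 32k tokens, depending
# on the model) simply cannot be pasted into a single prompt.
```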

So how does it all come together? We can use vector embeddings to inject relevant text into the context window of an LLM. Let's look at an example:

Suppose you have a huge PDF, maybe one of the congressional hearings (hehe), and you're lazy, so you don't want to read the whole thing, and you can't paste the whole thing because it's a billion pages long.

You first take the PDF, create vector embeddings from it, and store them in a database. Now you ask a question: "What did they say to xyz?" The first step is to create an embedding for that question as well.

Now we have embeddings on both sides: one vector for your question and one vector for each chunk of the PDF. Then we use similarity search to compare the question vector against the chunk vectors. OpenAI recommends cosine similarity.
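Here's a sketch of that similarity search, assuming the chunk embeddings are already sitting in memory as NumPy arrays; a hosted vector database like Pinecone does essentially the same comparison, just at scale and with proper indexing:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(question_vec: np.ndarray, chunk_vecs: list[np.ndarray], k: int = 3) -> list[int]:
    """Return the indices of the k chunks most similar to the question."""
    scores = [cosine_similarity(question_vec, c) for c in chunk_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# question_vec and chunk_vecs would come from the embedding model shown earlier;
# a real vector DB replaces this brute-force loop with an approximate
# nearest-neighbour index so it stays fast over millions of chunks.
```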


The search returns, say, the 3 most relevant chunks (their embeddings plus the original text). We can then take that text and feed it to the LLM through some prompt engineering.

The most common prompt tells the model to answer the user's question based only on the provided context, and to reply "I don't know" if it can't. The LLM then reads the relevant text chunks from the PDF and tries to answer the question truthfully.
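A sketch of that prompt engineering step, again assuming the `openai` Python package; the prompt wording and model name are illustrative, not prescriptive:

```python
# Feed the retrieved chunks plus the question to the chat model.
from openai import OpenAI

client = OpenAI()

def answer(question: str, relevant_chunks: list[str]) -> str:
    """Ask the model to answer only from the retrieved context."""
    context = "\n\n".join(relevant_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the model from going off track (see below)
    )
    return response.choices[0].message.content
```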

This is a basic explanation of how embeddings and LLMs together provide chat-like functionality over any form of data. It's also how all those "chat with your site/PDF/blah blah" products work! They aren't doing fine-tuning; when they say so, it's just a marketing term.

Embeddings are n-dimensional vector representations of data, whether it's text, images, audio, or video.

Chat-like products mainly use text embeddings.

The process:

  1. Split the PDF into small text fragments and create embeddings for each using OpenAI's Ada model, storing them in a local or remote vector database (a minimal chunking sketch follows after this list).

  2. Create an embedding for the user's question and compare it with the previously created PDF vectors using semantic similarity search (cosine similarity) to find the most relevant text fragments. This beats keyword search because it doesn't require exact keyword matches and can surface related concepts, like cars and roads.

  3. Send the user's question and the relevant text fragments to OpenAI, writing a prompt to instruct ChatGPT to generate an answer based on the given content. If there are no relevant texts or the relevance is low, the answer will be "I don't know."
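Here's a naive sketch of the chunking in step 1, assuming the PDF's text has already been extracted into a plain string (e.g., with a PDF parsing library); real pipelines usually split on sentences or paragraphs and keep some overlap between fragments:

```python
# Naive fixed-size chunking with overlap; the sizes here are arbitrary examples.
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

# Each chunk is then embedded (as shown earlier) and stored next to its text,
# either locally or in a hosted vector database.
```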

To prevent ChatGPT from going off track, the temperature is usually set very low or even to 0.

Note: This method is still limited in practice and the quality is not great. There is room for optimization (better text slicing, question-answer pairs).

For now, this approach is necessary because ChatGPT's context window is limited (32k tokens helps, but it's still finite), and we can't directly feed it extremely long documents for analysis.
