Vector databases: how to protect your private data when using AI?

by Angus Innes ✦ 22 October 2023 ✦ 5 Min Read

If you want to leverage the power of Large Language Models (LLMs) in your digital product or application but are concerned about keeping your customers’ and company’s data private, then you need to know about vector databases.

LLMs often need context or data which is not in their original training set in order to give the right answer to your queries. By making your company data available at the time that an LLM generates its response, you can improve its ability to give more useful, personalised answers.

But how can you do this safely and securely and provide your data in a way that an LLM can use it effectively?

The answer lies in translating your data into a format that can be understood by an LLM. You'll need to store the data accessibly and ensure your application consults this data source before generating answers.

Read on to learn how and why to do this.

Understanding how LLMs work

First, we need to understand a little more about how LLMs work. You can argue that LLMs, and applications that use them such as ChatGPT, are glorified autocomplete. ChatGPT draws on patterns learned from its training data to predict a likely response to the question and context you have provided.

LLMs are able to do this because of “embeddings”.

Embeddings are an important aspect of how LLMs “understand” the world.

Embedding algorithms are able to capture the semantic similarity of data. In other words, they encode the meaning of text, images and other data and how they relate to one another. They achieve this by creating numerical representations or “vectors” that can then be used to assess how close or far apart concepts are.

For example, animals and fruits are semantically distinct groupings and so 'cow' and 'pig' are grouped more closely than 'pig' and 'apple'.
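To make the idea of "closeness" concrete, here is a minimal sketch that compares toy vectors using cosine similarity, a common measure of how closely two vectors point in the same direction. The three-dimensional vectors are hand-picked for illustration and are an assumption; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-assigned vectors. The axes loosely stand for
# "animal-ness", "fruit-ness" and "found on a farm". They do not come
# from a real embedding model.
toy_embeddings = {
    "cow": [0.9, 0.1, 0.8],
    "pig": [0.8, 0.1, 0.9],
    "apple": [0.1, 0.9, 0.3],
}

cow_pig = cosine_similarity(toy_embeddings["cow"], toy_embeddings["pig"])
pig_apple = cosine_similarity(toy_embeddings["pig"], toy_embeddings["apple"])

print(f"cow vs pig:   {cow_pig:.2f}")
print(f"pig vs apple: {pig_apple:.2f}")
```

With these toy values, 'cow' and 'pig' score much higher than 'pig' and 'apple', which mirrors the grouping described above.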

This same process can capture the relationship and meaning between pieces of more complex, unstructured data, such as the content of podcasts or user reviews for products.

Embeddings represented in a vector space

Using semantic search to create effective recommendations

This is where the idea of semantic search comes in. Data stored as embeddings can be compared for similarity in “meaning”, so users can search intuitively, in the same way we might ask a real person where to find something – with descriptive phrases rather than by keywords.

This is incredibly powerful because we tend to have an idea of what we’re looking for, but rarely do we know precisely which keywords appear in what we want.

For example, with semantic search we can find documents related to "healthy recipes" even if they don't explicitly contain the keyword "healthy".
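The "healthy recipes" case can be sketched as follows. This is an illustration only: the document titles, the query vector and the document vectors are all made up by hand, whereas in practice both the query and the documents would be passed through the same embedding model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents with hand-assigned vectors.
# Axes (assumed for illustration): nutrition, cooking, finance.
documents = [
    ("Low-sugar overnight oats with berries", [0.9, 0.8, 0.0]),
    ("Deep-fried chocolate bar dessert", [0.1, 0.9, 0.0]),
    ("Quarterly budgeting tips for startups", [0.0, 0.0, 0.9]),
]

query_text = "healthy recipes"
query_vector = [0.8, 0.7, 0.0]  # hand-assigned stand-in for an embedded query

# Keyword search: no title contains the word "healthy", so nothing matches.
keyword_hits = [title for title, _ in documents if "healthy" in title.lower()]

# Semantic search: rank every document by similarity to the query vector.
ranked = sorted(
    documents,
    key=lambda doc: cosine_similarity(query_vector, doc[1]),
    reverse=True,
)

print("Keyword hits:", keyword_hits)
print("Best semantic match:", ranked[0][0])
```

The keyword search comes back empty, while the similarity ranking puts the oats recipe first even though its title never mentions "healthy".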

Semantic search is how recommendation engines can be so effective: it lets them measure how similar new items are to the ones you have already told them you like.

I wrote about how Spotify and Skin Rocks are effectively using semantic search in this article on the best use cases of generative AI.

Vector search has been big tech’s tightly kept secret for years. When you find Amazon suggesting that product you never knew you wanted, that isn’t magic; it’s vector search.

Pinecone, a leading vector database provider

Using private data with LLMs

Okay, so now we understand how LLMs interpret and “understand” the world to effectively predict the output users want from queries.

So how do you feed in private data securely, so that your application can use the power of LLMs to complete tasks for end users?

Use an embedding algorithm (such as the one offered by OpenAI’s API) to create embeddings for your private data and store them in a vector database. Then, when a user makes a query, the most relevant entries are retrieved from that database and provided to the LLM for use in its reasoning when generating a response. These two stages are shown as steps 1 and 2 in the diagram below.
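The store-then-retrieve flow can be sketched with a tiny in-memory stand-in for a vector database. Everything named here is hypothetical: `toy_embed` simply counts a few vocabulary words and is not semantic at all, and `InMemoryVectorStore` is not the API of any real product. In a real system you would swap in an embedding model and a proper vector database.

```python
import math
import re
from dataclasses import dataclass, field

# Assumed vocabulary for the toy embedding; purely illustrative.
VOCABULARY = ["refund", "delivery", "password", "invoice"]


def toy_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a vector with no vocabulary words matches nothing
    return dot / (norm_a * norm_b)


@dataclass
class InMemoryVectorStore:
    """Hypothetical minimal store: keeps (vector, text) pairs in a list."""

    records: list = field(default_factory=list)

    def add(self, text):
        # Step 1: embed the private data and store the vector with its text.
        self.records.append((toy_embed(text), text))

    def query(self, question, top_k=1):
        # Step 2: embed the question and return the closest stored texts.
        question_vector = toy_embed(question)
        ranked = sorted(
            self.records,
            key=lambda record: cosine_similarity(question_vector, record[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]


store = InMemoryVectorStore()
store.add("Refund requests are processed within 14 days.")
store.add("Delivery takes 3 to 5 working days.")
store.add("Reset your password from the account page.")

best = store.query("How long does a refund take?")
print(best)
```

The returned passage is what your application would hand to the LLM alongside the user's question.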

Data storage and retrieval with embeddings and LLMs

This architecture is commonly referred to as “retrieval augmented generation” (RAG).

As outlined in the diagram below, users' queries are directed to your retrieval system before they reach the LLM, so that your data is added to the context the LLM uses to return an accurate answer.
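The "augmentation" step can be as simple as placing the retrieved passages in front of the user's question. The sketch below assumes the passages have already been retrieved; the function name `build_augmented_prompt` and the prompt wording are illustrative choices, not a prescribed format, and the resulting string would then be sent to whichever LLM you use.

```python
def build_augmented_prompt(question, passages):
    """Combine retrieved passages and the user's question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question using only the context below.\n"
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical passages, as if returned by the retrieval system.
retrieved = [
    "Blue kettle: 12 units in stock.",
    "Red kettle: out of stock.",
]

prompt = build_augmented_prompt("Is the blue kettle in stock?", retrieved)
print(prompt)
```

Instructing the model to rely on the supplied context is a common design choice here, since it encourages answers grounded in your data rather than in the model's training set.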

A simplified representation of augmenting user queries with a retrieval system

You can then write code to automatically update the database so that the LLM has near real-time insight into your company information such as documentation, stock availability or customer history.
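One way to keep the database current is an "upsert" keyed by a stable document ID, so that a changed record replaces its stale version instead of sitting alongside it. The sketch below is a hypothetical in-memory illustration: the `DocumentIndex` class, the IDs and the `placeholder_embed` function are assumptions, and a real vector database would provide its own equivalent operations.

```python
def placeholder_embed(text):
    """Stand-in for a real embedding model (assumption for illustration)."""
    return [float(len(text))]


class DocumentIndex:
    """Hypothetical index that stores one (vector, text) pair per document ID."""

    def __init__(self, embed):
        self._embed = embed
        self._records = {}

    def upsert(self, doc_id, text):
        # Insert a new record, or overwrite the existing one with the same ID.
        self._records[doc_id] = (self._embed(text), text)

    def delete(self, doc_id):
        # Remove a record that no longer exists in the source system.
        self._records.pop(doc_id, None)

    def texts(self):
        return [text for _, text in self._records.values()]


index = DocumentIndex(embed=placeholder_embed)
index.upsert("sku-1", "Blue kettle: 12 units in stock.")
index.upsert("sku-2", "Red kettle: 4 units in stock.")

# Stock levels change, so the same IDs are written again or removed.
index.upsert("sku-1", "Blue kettle: 3 units in stock.")
index.delete("sku-2")

print(index.texts())
```

Calling `upsert` from whatever process already records changes to your stock or documentation is one way to get the near real-time behaviour described above.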

These same architectures can also help work around the context limits imposed by LLMs: rather than passing in all of your data with every query, you retrieve and supply only the passages most relevant to it.
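In practice, working within an LLM's limits often comes down to sending only as much retrieved text as fits. The sketch below assumes the passages are already ranked from most to least relevant and uses a word count as a rough stand-in for tokens; real models count tokens differently, so treat the budget as illustrative.

```python
def select_within_budget(ranked_passages, max_words):
    """Pick passages in relevance order, skipping any that would exceed the budget."""
    selected = []
    used = 0
    for passage in ranked_passages:
        cost = len(passage.split())  # rough proxy for token count (assumption)
        if used + cost > max_words:
            continue  # too long for the remaining budget; try the next one
        selected.append(passage)
        used += cost
    return selected


# Hypothetical passages, most relevant first.
ranked_passages = [
    "Refunds take 14 days.",
    "Our full refund policy covers damaged, late and unwanted items in detail.",
    "Contact support by email.",
]

chosen = select_within_budget(ranked_passages, max_words=8)
print(chosen)
```

Skipping an over-long passage rather than stopping outright lets shorter, lower-ranked passages still make it into the prompt; stopping at the first miss is an equally reasonable choice if strict relevance order matters more.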

Protecting your company's private data

With vector databases, you can go beyond API calls to GPT and add advanced features to your AI applications.

Innovations like semantic information retrieval and retrieval augmented generation simultaneously increase the accuracy and reliability of responses to queries, whilst circumventing the context constraints of LLMs and keeping your company data private.

Thinking about using generative AI in your product or business?

Whether you or your company plan to build or buy with generative AI (or do a combination of both) we believe there’s a significant opportunity to innovate in this space.

Generative AI can help make your technology more intuitive and open up new use cases for your products and services.

If you’re interested in exploring how AI could unlock new ideas on your roadmap then get in touch. We’re running free half-day workshops to answer all your questions and collaborate on what opportunities exist. We’re experienced with building, buying and everything in between.

Product Agony Aunt, shipping little and often to solve your product problems one step at a time.
