
AI & Machine Learning

Getting started with retrieval augmented generation on BigQuery with LangChain

June 4, 2024
Jeff Nelson

Developer Advocate

Ashley Xu

Software Engineer, Google


The ability of large language models (LLMs) to process and generate human language continues to revolutionize many aspects of business. But an LLM’s knowledge is limited to the data it was trained on, which can be a drawback when dealing with specific company information or nuanced industry contexts. Retrieval-augmented generation (RAG) offers a powerful solution to this limitation: it connects LLMs with your own data sources so they can pull from internal knowledge bases, enabling new business processes grounded in the specifics of your data.

BigQuery now allows you to generate embeddings and execute powerful vector search at scale, enabling RAG workflows within BigQuery. By leveraging LangChain, a framework designed for developing applications with LLMs, you can seamlessly build RAG applications tailor-made for your business needs. 

In this blog, we’ll provide a practical guide to implement RAG using BigQuery and LangChain and provide you with a framework to get started with your own data.

Limitations of LLMs

Imagine a scenario where we want to ask questions about the 2024 Cymbal Starlight, a fictional automobile. We might ask: “How many miles until I need to change my oil?” or “I broke down on the highway; where can I get help?” Traditionally, we might consult the owner’s manual and page through it until we find an answer.

We could also simply pose a question to an LLM:

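A minimal sketch of posing that question directly, using LangChain’s Vertex AI integration, might look like the following; the model name is an illustrative assumption, not a detail from the original post.

```python
# Asking the model directly, with no retrieval or extra context.
# Assumes the langchain-google-vertexai package and an authenticated
# Google Cloud project; the model name is an illustrative choice.
from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-1.5-flash-001")

question = (
    "How many miles can I drive my 2024 Cymbal Starlight "
    "before I need to change the oil?"
)
print(llm.invoke(question))
```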

Unfortunately, this response doesn’t answer our question. This is no surprise: the 2024 Cymbal Starlight is a fictional vehicle, and its owner’s manual wasn’t included in the LLM’s training data. To work around this constraint, we can use retrieval-augmented generation, which augments the LLM with proprietary or first-party data, like the 2024 Cymbal Starlight owner’s manual!

Enter retrieval augmented generation (RAG)

LLMs are powerful tools, but can be limited by their internal knowledge. RAG addresses this by incorporating data from external sources, allowing LLMs to access relevant information in real-time and without having to fine-tune or retrain a model. A simple RAG pipeline has two main components:

  • Data preprocessing: 
    • Input data like documents are split into smaller chunks, converted into vector embeddings, and sent to a vector store for later retrieval
  • Query and retrieval:
    • A user asks a question in natural language. This is turned into an embedding, and relevant context is retrieved via vector search
    • The context is provided to an LLM to augment its knowledge
    • The LLM generates a response that weaves together retrieved chunks with its pretrained knowledge and summarization capabilities

LangChain

LangChain is an open-source orchestration framework for working with LLMs, enabling developers to quickly build generative AI applications on their data. Google Cloud contributed a new LangChain integration with BigQuery that makes it simple to preprocess your data, generate and store embeddings, and run vector search, all using BigQuery.

In this demo, we’ll handle both the pre-processing and runtime steps with LangChain. Let’s take a look!

Building a RAG pipeline with BigQuery and LangChain

This blog post highlights a few of the major steps in building a simple RAG pipeline using BigQuery and LangChain. To follow along and view additional steps, you can make a copy of the notebook, Augment Q&A Generation using LangChain and BigQuery Vector Search, which lets you run the following example in Colab using your own Google Cloud environment.

Data preprocessing

We begin by reading our document, the 2024 Cymbal Starlight Owner’s Manual, into memory using a LangChain document loader, PyPDFLoader, which reads the PDF from Google Cloud Storage.

Once loaded, we split the document into smaller chunks. Chunking makes RAG more efficient, as chunks allow for more targeted retrieval of relevant information and a reduced computational load. This improves the accuracy and contextual relevance of generated responses as well as response time. We use LangChain’s RecursiveCharacterTextSplitter, which splits text based on rules we define.

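A sketch of these two steps might look like the following; the Cloud Storage path and the chunking parameters are illustrative placeholders rather than values from the original notebook.

```python
# A sketch of loading the owner's manual and splitting it into chunks.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder URL for the owner's manual PDF in Cloud Storage.
loader = PyPDFLoader(
    "https://storage.googleapis.com/your-bucket/cymbal-starlight-2024.pdf"
)
documents = loader.load()

# Chunk sizes below are illustrative defaults, not tuned values.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
doc_splits = text_splitter.split_documents(documents)

# Tag each chunk so it can be traced back after retrieval.
for idx, split in enumerate(doc_splits):
    split.metadata["chunk"] = idx

print(f"Created {len(doc_splits)} document chunks")
```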

With the text chunks stored in doc_splits, we now need to generate embeddings for each chunk and store them in BigQuery. To do so, we first initialize a LangChain vector store using the new BigQueryVectorSearch class. This requires some information about your Google Cloud and BigQuery environment, as well as an embedding model; we’ll use a textembedding-gecko model from Vertex AI.

Lastly, we call the vector store (bq_vector_cars_manual) and pass it all of the document chunks. LangChain facilitates turning these chunks into embeddings and sending them to BigQuery.

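A sketch of that setup follows; the project, dataset, and table names are placeholders for your own environment, and the exact constructor arguments may vary across LangChain versions.

```python
# A sketch of creating the BigQuery-backed vector store and loading the
# document chunks into it. Names below are placeholders.
from langchain_community.vectorstores import BigQueryVectorSearch
from langchain_google_vertexai import VertexAIEmbeddings

embedding_model = VertexAIEmbeddings(
    model_name="textembedding-gecko@003",  # Vertex AI embedding model
)

bq_vector_cars_manual = BigQueryVectorSearch(
    project_id="your-project-id",         # placeholder project
    dataset_name="cymbal_rag_dataset",    # placeholder dataset
    table_name="cars_manual_embeddings",  # placeholder table
    location="US",
    embedding=embedding_model,
)

# LangChain embeds each chunk with the Vertex AI model and writes the
# content, metadata, and embedding vectors to the BigQuery table.
bq_vector_cars_manual.add_documents(doc_splits)
```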

We can inspect the BigQuery table and confirm that it contains the document metadata, content, and text embedding.

https://storage.googleapis.com/gweb-cloudblog-publish/images/table_1.max-1000x1000.png

Query and retrieval

Now that our text embedding data exists in BigQuery, we can search for relevant chunks and ground our generated answers in them. This pattern is often called RAG. We’ll begin by initializing a Vertex AI LLM and a LangChain retriever that fetches documents using BigQuery Vector Search.

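The sketch below shows one way to set these up; the Gemini model name and the number of retrieved chunks (k) are illustrative assumptions.

```python
# A sketch of the runtime components: a Vertex AI LLM for generation
# and a retriever backed by the BigQuery vector store created earlier.
from langchain_google_vertexai import VertexAI

# Model name is an illustrative choice; any Vertex AI text model works.
llm = VertexAI(model_name="gemini-1.5-flash-001")

# Expose the vector store as a LangChain retriever; k controls how many
# chunks are fetched per query (5 is an arbitrary illustrative value).
retriever = bq_vector_cars_manual.as_retriever(search_kwargs={"k": 5})
```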

For Q&A chains, our retriever is passed directly to the chain and can be used without further configuration. When we ask a question, the following happens behind the scenes:

  • Our question is turned into a text embedding
  • A vector search runs on BigQuery and the relevant document chunks are retrieved
  • These chunks are passed to the prompt used by the LLM to augment its knowledge and generate a concise answer

Let’s take a look at a basic example using LangChain’s RetrievalQA chain.

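Here is a minimal sketch of that basic chain, reusing the llm and retriever objects from above; the exact question wording is illustrative.

```python
# A sketch of a simple RAG Q&A chain: retrieve chunks from BigQuery
# Vector Search, then let the LLM answer using them as context.
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
)

result = qa_chain.invoke(
    {"query": "How many miles can I drive my 2024 Cymbal Starlight "
              "before I need to change the oil?"}
)
print(result["result"])
```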

The LLM now provides us with a concrete answer! We should change the oil every 5,000 miles on this vehicle.

Now let’s take a slightly more sophisticated example. We will use the ConversationalRetrievalChain. This still uses BigQuery Vector Search, but persists previous conversation history in memory and adds it as context to the LLM response. This provides a conversational capability with your data.

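Below is a sketch of one way to wire this up; the memory configuration is a common LangChain pattern and an assumption here, not necessarily the setup used in the original notebook, and the question is illustrative.

```python
# A sketch of a conversational chain that keeps prior turns in memory.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# ConversationBufferMemory is one common choice for persisting history.
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
)

response = conversational_chain.invoke(
    {"question": "What should I do if I break down on the highway?"}
)
print(response["answer"])
```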

We can then ask a follow-up question without needing to provide much additional context, because the previous question and answer are already passed through.

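A hypothetical follow-up might look like this; because the previous turn is held in memory, the chain can resolve the vague reference against the earlier exchange.

```python
# "it" is resolved using the conversation history held in memory.
follow_up = conversational_chain.invoke(
    {"question": "And who should I contact about it?"}
)
print(follow_up["answer"])
```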

Recall that initially, the LLM was unable to answer any questions about the 2024 Cymbal Starlight. But in a few steps, we used BigQuery Vector Search and LangChain to build a simple RAG Q&A application that provides us with useful information grounded in our own documents!

Get started

Google Cloud offers many tools to store embeddings and run vector search. BigQuery Vector Search is optimized for large-scale analytical workloads and incorporates many of the features you expect from BigQuery. It’s fully managed and serverless, scaling up and down without any infrastructure to manage, and it includes capabilities like governance and fine-grained access control.

Get started building a RAG application today with BigQuery and LangChain! Check out the sample notebook to follow the example above with greater depth, or read the new BigQuery Vector Search LangChain documentation to begin building an application on your data.

For additional approaches and resources on building RAG applications on Google Cloud, check out the following:
