1. Getting Started with RAG
The Structure of RAG¶
Typically, RAG consists of two components: Indexing, and Retrieval and Generation. Indexing builds an index over the data offline; Retrieval and Generation takes the user's query at runtime, searches the index for relevant content, and hands it to the model to generate an answer.
Indexing¶
- Load: first load the data with a Document Loader; the source can be a CSV, a PDF, and so on. LangChain abstracts a document as a `Document` object, which holds `page_content` (a string with the content) and `metadata` (a dict of arbitrary metadata). A minimal example:

  ```python
  from langchain_core.documents import Document

  documents = [
      Document(
          page_content="Dogs are great companions, known for their loyalty and friendliness.",
          metadata={"source": "mammal-pets-doc"},
      ),
      Document(
          page_content="Cats are independent pets that often enjoy their own space.",
          metadata={"source": "mammal-pets-doc"},
      ),
  ]
  ```
- Split: a text splitter breaks large `Document`s into smaller chunks. This is useful both for indexing and for passing data to the model, since large chunks are harder to search over and won't fit in a model's finite context window. In a PDF, for instance, each page may be one `Document`, while a chunk holds only about 1k tokens, and adjacent chunks overlap so that important information is never confined to a single piece of context.
- Store: we need somewhere to store and index our splits so they can be searched later. This is usually done with a VectorStore and an Embeddings model: the embedding model projects text/image/... data into vectors, which are stored in a single space, so similarity can be queried by vector distance and the statements nearest in embedding space are returned (the indexing code behind the example below is sketched after its output).
For example, a similarity search with scores against a store indexed from a Nike 10-K PDF:

```python
results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
```

The output:

```
Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
```
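For reference, here is a minimal sketch of the indexing side that could back the search above, assuming OpenAI embeddings, an in-memory store, and the PDF from the example output (`add_start_index=True` is what produces the `start_index` metadata shown above):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load: one Document per PDF page
docs = PyPDFLoader("../example_data/nke-10k-2023.pdf").load()

# Split: ~1000-character chunks with 200-character overlap between neighbors
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = splitter.split_documents(docs)

# Store: embed each chunk and index it by vector
vector_store = InMemoryVectorStore(OpenAIEmbeddings())
vector_store.add_documents(documents=splits)
```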
Retrieval and Generation¶
- Retrieve: given the user input, use a Retriever to fetch the relevant splits from storage.
- Generate: a ChatModel / LLM generates the answer from a prompt that includes both the question and the retrieved data.
A Quick Preview¶
A basic RAG takes only a little over 50 lines of code. (The chat model, embedding model, and vector store initialized at the top are interchangeable; the choices below are only examples.)
```python
import bs4
from langchain import hub
from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Chat model, embedding model, and vector store
# (the models here are just examples; any provider works)
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
vector_store = InMemoryVectorStore(OpenAIEmbeddings())

# Scrape the blog post content
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()  # converted into Document objects

# Split the Documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Embed the chunks and store them in the vector store
_ = vector_store.add_documents(documents=all_splits)

# Pull a prompt template for RAG question answering
prompt = hub.pull("rlm/rag-prompt")

# Define the state schema
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Retrieve: similarity-search the vector store with the question
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

# Generate: feed the retrieved documents, together with the question, to the LLM
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Build the state graph: retrieve first, then generate
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
```
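Once compiled, the graph can be invoked with a question; `retrieve` fills `state["context"]` and `generate` returns the answer. For example (the question here is illustrative):

```python
result = graph.invoke({"question": "What is Task Decomposition?"})
print(result["answer"])
```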
Adding Message History¶
To support multi-turn conversation, retrieval can run as a tool whose results land in the message history; `generate` then rebuilds its prompt from the most recent tool messages in the state:
```python
from langchain_core.messages import SystemMessage
from langgraph.graph import MessagesState

def generate(state: MessagesState):
    # Collect the most recent tool messages (the retrieved context)
    recent_tool_messages = []
    for message in reversed(state["messages"]):
        if message.type == "tool":
            recent_tool_messages.append(message)
        else:
            break
    tool_messages = recent_tool_messages[::-1]

    # Assemble the system prompt around the retrieved content
    docs_content = "\n\n".join(doc.content for doc in tool_messages)
    system_message_content = (
        "You are an assistant for question-answering tasks. "
        "Use the following pieces of retrieved context to answer "
        "the question. If you don't know the answer, say that you "
        "don't know. Use three sentences maximum and keep the "
        "answer concise."
        "\n\n"
        f"{docs_content}"
    )

    # All history except the AI messages that made tool calls
    conversation_messages = [
        message
        for message in state["messages"]
        if message.type in ("human", "system")
        or (message.type == "ai" and not message.tool_calls)
    ]

    # Combine this turn's retrieved content with the conversation history
    prompt = [SystemMessage(system_message_content)] + conversation_messages

    # Run
    response = llm.invoke(prompt)
    return {"messages": [response]}
```