1. Getting Started with RAG
The Structure of RAG¶
Typically, RAG consists of two components: Indexing, and Retrieval and Generation. Indexing builds an index over the data offline; Retrieval and Generation takes the user's query at runtime, searches the index for relevant content, and hands it to the model to generate an answer.
Indexing¶
- Load: first load the data with a Document Loader; the source can be a CSV, a PDF, and so on. LangChain abstracts a document as a `Document` object, which holds `page_content` (a string with the content) and `metadata` (a dict of arbitrary metadata). A minimal example:

  ```python
  from langchain_core.documents import Document

  documents = [
      Document(
          page_content="Dogs are great companions, known for their loyalty and friendliness.",
          metadata={"source": "mammal-pets-doc"},
      ),
      Document(
          page_content="Cats are independent pets that often enjoy their own space.",
          metadata={"source": "mammal-pets-doc"},
      ),
  ]
  ```
- Split: a text splitter breaks large `Document`s into smaller chunks. This is useful both for indexing and for passing data to the model, since large chunks are harder to search over and won't fit in a model's finite context window. In a PDF, for instance, each page may be one `Document`, while a chunk holds only about 1k tokens, and adjacent chunks overlap so that important information is never confined to a single piece of context.
- Store: we need somewhere to store and index our splits so they can be searched later. This is usually done with a VectorStore and an Embeddings model: the embedding model projects text/image/... data into vectors, which are stored in a single space, so similarity can be queried by vector distance and the statements nearest in embedding space are returned (the indexing code behind the example below is sketched after its output).
For example, a similarity search with scores against a store indexed from a Nike 10-K PDF:

```python
results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
```

The output:

```
Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
```
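For reference, here is a minimal sketch of the indexing side that could back the search above, assuming OpenAI embeddings, an in-memory store, and the PDF from the example output (`add_start_index=True` is what produces the `start_index` metadata shown above):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load: one Document per PDF page
docs = PyPDFLoader("../example_data/nke-10k-2023.pdf").load()

# Split: ~1000-character chunks with 200-character overlap between neighbors
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = splitter.split_documents(docs)

# Store: embed each chunk and index it by vector
vector_store = InMemoryVectorStore(OpenAIEmbeddings())
vector_store.add_documents(documents=splits)
```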
Retrieval and Generation¶
- Retrieve: given the user input, use a Retriever to fetch the relevant splits from storage.
- Generate: a ChatModel / LLM generates the answer from a prompt that includes both the question and the retrieved data.
A Quick Preview¶
A basic RAG takes only a little over 50 lines of code. (The chat model, embedding model, and vector store initialized at the top are interchangeable; the choices below are only examples.)
```python
import bs4
from langchain import hub
from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Chat model, embedding model, and vector store
# (the models here are just examples; any provider works)
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
vector_store = InMemoryVectorStore(OpenAIEmbeddings())

# Scrape the blog post content
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()  # converted into Document objects

# Split the Documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

# Embed the chunks and store them in the vector store
_ = vector_store.add_documents(documents=all_splits)

# Pull a prompt template for RAG question answering
prompt = hub.pull("rlm/rag-prompt")

# Define the state schema
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Retrieve: similarity-search the vector store with the question
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}

# Generate: feed the retrieved documents, together with the question, to the LLM
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

# Build the state graph: retrieve first, then generate
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
```
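Once compiled, the graph can be invoked with a question; `retrieve` fills `state["context"]` and `generate` returns the answer. For example (the question here is illustrative):

```python
result = graph.invoke({"question": "What is Task Decomposition?"})
print(result["answer"])
```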
Adding Message History¶
To support multi-turn conversation, retrieval can run as a tool whose results land in the message history; `generate` then rebuilds its prompt from the most recent tool messages in the state:
```python
from langchain_core.messages import SystemMessage
from langgraph.graph import MessagesState

def generate(state: MessagesState):
    # Collect the most recent tool messages (the retrieved context)
    recent_tool_messages = []
    for message in reversed(state["messages"]):
        if message.type == "tool":
            recent_tool_messages.append(message)
        else:
            break
    tool_messages = recent_tool_messages[::-1]

    # Assemble the system prompt around the retrieved content
    docs_content = "\n\n".join(doc.content for doc in tool_messages)
    system_message_content = (
        "You are an assistant for question-answering tasks. "
        "Use the following pieces of retrieved context to answer "
        "the question. If you don't know the answer, say that you "
        "don't know. Use three sentences maximum and keep the "
        "answer concise."
        "\n\n"
        f"{docs_content}"
    )

    # All history except the AI messages that made tool calls
    conversation_messages = [
        message
        for message in state["messages"]
        if message.type in ("human", "system")
        or (message.type == "ai" and not message.tool_calls)
    ]

    # Combine this turn's retrieved content with the conversation history
    prompt = [SystemMessage(system_message_content)] + conversation_messages

    # Run
    response = llm.invoke(prompt)
    return {"messages": [response]}
```