Return documents retrieved by `create_retriever_tool` when used in `create_openai_tools_agent` #17398

AndreaBrg · 2024-02-12T09:38:06Z

AndreaBrg
Feb 12, 2024

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

tool = create_retriever_tool(
    vector_store_retriever,
    name="search_products",
    description="Searches and returns information about the product in the catalog.",
)
agent = create_openai_tools_agent(llm_model, [tool], hub.pull("hwchase17/openai-tools-agent"))

AgentExecutor(agent=agent, tools=tools, memory=memory)

Description

I'd like to extract the documents retrieved by create_retriever_tool when this is used to create an OpenAI agent with create_openai_tools_agent.

Currently, when invoked the agent returns:

The input
The output
The chat history

How can I retrieve the documents used to create the output?

Thanks

System Info

langchain==0.1.6
langchain-community==0.0.19
langchain-core==0.1.22
langchain-openai==0.0.5
langchainhub==0.1.14
langsmith==0.0.87

@AndreaBrg · 2024-02-12T09:38:18Z

dosubot[bot]
bot Feb 12, 2024

🤖

Hey @AndreaBrg, great to see you diving into another challenge with LangChain! Always a pleasure to have you back. How's everything going?

Based on the information you've provided and the context from the LangChain repository, it seems you want to extract the documents retrieved by the create_retriever_tool when it is used in create_openai_tools_agent.

The create_retriever_tool function in LangChain creates a tool for retrieving documents. The documents retrieved by this tool can be extracted from the return value of the function that this tool wraps, which is either _get_relevant_documents or _aget_relevant_documents. Both of these functions retrieve documents using the get_relevant_documents or aget_relevant_documents method of the retriever object respectively, format them using the format_document function, and then join them into a single string using the document_separator.

To extract the documents retrieved by the create_retriever_tool when it is used in create_openai_tools_agent, you would need to parse the string returned by the tool's function or coroutine, splitting it by the document_separator to get the individual documents.

However, it seems like you want to access the source documents directly from the agent's response. In that case, you might find the discussion on the issue RetrievalQA chain return source_documents when using it as a Tool for an Agent helpful. In this issue, the user was able to return the source documents by setting return_source_documents=True when creating the RetrievalQA chain.

Here is the relevant code from that discussion:

chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever, return_source_documents=True, verbose=True, input_key="question")

And then using this chain in the tool:

Tool(
    name="Search summaries",
    func=lambda query: chain({"question": query}),
    description="useful for when you need to answer questions. Input should be a fully formed question."
)

Please note that this might make the chain look a bit clunky, but it should allow you to access the source documents directly from the agent's response.

I hope this helps! If you have any further questions, feel free to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

0 replies

AndreaBrg · 2024-02-12T14:50:14Z

AndreaBrg
Feb 12, 2024
Author

@dosu-bot so there is no way to extract the source documents used to generate the output natively when using create_retriever_tool?

0 replies

@dosu-bot · 2024-02-12T14:59:18Z

dosubot[bot]
bot Feb 12, 2024

🤖

Based on the current implementation of the LangChain framework, there is no built-in method to extract the source documents used by the create_retriever_tool function when it's used in create_openai_tools_agent.

The create_retriever_tool function uses a BaseRetriever object to retrieve relevant documents based on a query. However, the source documents used by the BaseRetriever are not exposed or returned by the create_retriever_tool function. The function only returns a Tool object, which encapsulates the retrieval process but does not provide direct access to the source documents. You can see this in the source code of the create_retriever_tool function.

Similarly, the create_openai_tools_agent function does not provide any mechanism to access the source documents used by the tools it utilizes. The function only binds the tools to the language model and uses them in the created agent, but it does not expose any functionality for accessing the source documents of these tools. This is evident in the source code of the create_openai_tools_agent function.

Therefore, if you need to access the source documents used by the create_retriever_tool function, you might need to modify the BaseRetriever class or the create_retriever_tool function to expose these documents. However, please note that this could potentially affect the encapsulation and abstraction of the LangChain framework, and it's recommended to carefully consider the implications before making such modifications.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

0 replies

AndreaBrg · 2024-02-12T16:01:01Z

AndreaBrg
Feb 12, 2024
Author

@dosu-bot Could you suggest how to change my code to support the extraction of source documents?

from langchain import hub
from langchain.agents import create_react_agent, create_openai_tools_agent, AgentExecutor
from langchain.tools.retriever import create_retriever_tool
from langchain_core.language_models import BaseLanguageModel
from langchain_core.memory import BaseMemory
from langchain_core.vectorstores import VectorStoreRetriever

from brainybot.llm.chain.settings import ChainAgentSettings


def make_agent_tools(vector_store_retriever: VectorStoreRetriever):
    # TODO make this better
    tool = create_retriever_tool(
        vector_store_retriever,
        name="search_vinodoo_products",
        description="Searches and returns information about the product in the Vinodoo's catalog.",
    )
    return [tool]


def create_conversation_chain(llm_model_type: str, llm_model: BaseLanguageModel,
                              vector_store_retriever: VectorStoreRetriever, memory: BaseMemory,
                              agent_settings: ChainAgentSettings | None) -> AgentExecutor:
    # Create agent tools
    tools = make_agent_tools(vector_store_retriever)
    # Based on the llm model type we create a general or a specific agent

    match llm_model_type.lower():
        case "openai":
            # TODO make better prompt instead of hub.pull("hwchase17/openai-tools-agent")
            agent = create_openai_tools_agent(llm_model, tools, hub.pull("hwchase17/openai-tools-agent"))
            if agent_settings is None:
                agent_settings = ChainAgentSettings.from_openai_default()
        case _:
            # TODO: Get the prompt to use - you can modify this!
            agent = create_react_agent(llm_model, tools, hub.pull("hwchase17/react-chat"))
            if agent_settings is None:
                agent_settings = ChainAgentSettings.from_default()

    # Return the agent
    return AgentExecutor(agent=agent, tools=tools, memory=memory,
                         max_iterations=agent_settings.max_iterations,
                         max_execution_time=agent_settings.max_execution_time,
                         early_stopping_method=agent_settings.early_stopping_method,
                         handle_parsing_errors=agent_settings.handle_parsing_errors)

0 replies

@dosu-bot · 2024-02-12T16:12:39Z

dosubot[bot]
bot Feb 12, 2024

🤖

To extract the source documents used by the create_retriever_tool function, you can modify the _get_relevant_documents and _aget_relevant_documents functions to return both the formatted documents and the original documents. Here's how you can do it:

def _get_relevant_documents(
    query: str,
    retriever: BaseRetriever,
    document_prompt: BasePromptTemplate,
    document_separator: str,
    callbacks: Callbacks = None,
) -> (str, list):
    docs = retriever.get_relevant_documents(query, callbacks=callbacks)
    return document_separator.join(
        format_document(doc, document_prompt) for doc in docs
    ), docs


async def _aget_relevant_documents(
    query: str,
    retriever: BaseRetriever,
    document_prompt: BasePromptTemplate,
    document_separator: str,
    callbacks: Callbacks = None,
) -> (str, list):
    docs = await retriever.aget_relevant_documents(query, callbacks=callbacks)
    return document_separator.join(
        format_document(doc, document_prompt) for doc in docs
    ), docs

Then, in the create_retriever_tool function, you need to adjust the func and afunc partial functions to handle the new return values:

def create_retriever_tool(
    retriever: BaseRetriever,
    name: str,
    description: str,
    *,
    document_prompt: Optional[BasePromptTemplate] = None,
    document_separator: str = "\n\n",
) -> Tool:
    document_prompt = document_prompt or PromptTemplate.from_template("{page_content}")
    func = partial(
        _get_relevant_documents,
        retriever=retriever,
        document_prompt=document_prompt,
        document_separator=document_separator,
    )
    afunc = partial(
        _aget_relevant_documents,
        retriever=retriever,
        document_prompt=document_prompt,
        document_separator=document_separator,
    )
    return Tool(
        name=name,
        description=description,
        func=lambda query: func(query)[0],  # Only return the formatted documents
        coroutine=lambda query: afunc(query)[0],  # Only return the formatted documents
        args_schema=RetrieverInput,
        original_docs_func=lambda query: func(query)[1],  # Add a new method to get the original documents
        original_docs_coroutine=lambda query: afunc(query)[1],  # Add a new method to get the original documents
    )

Now, when you create a tool using create_retriever_tool, you can use the original_docs_func and original_docs_coroutine methods to get the original documents used by the retriever.

Please note that this modification is not part of the official LangChain repository and is a custom solution to your specific requirement. It's always recommended to keep your LangChain packages up-to-date with the official releases for the best compatibility and support.

For more details, you can refer to the LangChain retriever tool source code.

Sources

libs/langchain/langchain/tools/retriever.py

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

0 replies

zmartin1 · 2024-02-12T17:55:10Z

zmartin1
Feb 12, 2024

@AndreaBrg Pass an additional argument like this:
tool = create_retriever_tool(
vector_store_retriever,
name="search_products",
description="Searches and returns information about the product in the catalog.",
document_prompt=PromptTemplate.from_template(
"Page chunk {page_chunk}: {page_content}")
)
and then follow the instructions in this example: https://python.langchain.com/docs/modules/agents/how_to/agent_structured

1 reply

kiranramanna May 21, 2024

just to be clear, page_chunk and page_content has to be in the metadata of documents which gets added to retriever.

mzhadigerov · 2024-05-23T09:34:41Z

mzhadigerov
May 23, 2024

Hi! Were you be able to figure it out?

0 replies

mzhadigerov · 2024-05-23T11:32:47Z

mzhadigerov
May 23, 2024

@zmartin1 From which place are you following the example? It is not working for me. I'm getting an empty output

0 replies

sebatinoco · 2024-07-18T20:04:00Z

sebatinoco
Jul 18, 2024

Hi! Any updates on this issue? I followed @zmartin1 instructions and I'm now able to retrieve documents/metadata from the documents, but this is not always the case as the info retrieved is generated by the LLM (sometimes it returns the values I want, sometimes it returns nothing). Any help with this?

1 reply

gopidon Sep 26, 2024

Hi, see my solution below.

JACKEYED · 2024-07-28T05:16:59Z

JACKEYED
Jul 28, 2024

@dosu-bot

0 replies

JACKEYED · 2024-07-28T05:21:13Z

JACKEYED
Jul 28, 2024

@dosu-bot please read my code segment and correct it. It error but i don't know why it error and how i can correct it

from langchain.tools.retriever import create_retriever_tool
import bs4
from langchain import hub
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = WebBaseLoader(
web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
bs_kwargs=dict(
parse_only=bs4.SoupStrainer(
class_=("post-content", "post-title", "post-header")
)
),
)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits,embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

tool = create_retriever_tool(
retriever,
"blog_post_retriever",
"Searches and returns excerpts from the Autonomous Agents blog post."
)

Traceback (most recent call last):
File "D:\PyCharm\python_code\chatbot\ConversationRAG.py", line 181, in
tool = create_retriever_tool(
File "D:\Anaconda\envs\DL\lib\site-packages\langchain_core\tools.py", line 1356, in create_retriever_tool
return Tool(
File "D:\Anaconda\envs\DL\lib\site-packages\langchain_core\tools.py", line 852, in init
super(Tool, self).init( # type: ignore[call-arg]
File "D:\Anaconda\envs\DL\lib\site-packages\langchain_core\tools.py", line 394, in init
raise TypeError(
TypeError: args_schema must be a subclass of pydantic BaseModel. Got: <class 'langchain_core.tools.RetrieverInput'>.

0 replies

gopidon · 2024-09-26T06:29:57Z

gopidon
Sep 26, 2024

We need to create a custom tool for this. Stop using create_retriever_tool

class RetrieverInput(BaseModel):
    """Input to the retriever."""
    query: str = Field(description="query to look up in retriever")

def get_documents(query):
    docs = some_retriever.invoke(query)
    #Return a tuple here
    return "\n\n".join(doc.page_content for doc in docs), docs

#Pass the option response_format as "content_and_artifact" while creating the tool below:

retriever_tool = Tool(
    name="retriever_name",
    func=get_documents,
    description="Search and return information about something....",
    args_schema=RetrieverInput,
    response_format="content_and_artifact"
)

Now use the retriever tool in the ToolNode as usual like so

tools = [retriever_tool]
graph_builder.add_node("tools", ToolNode(tools))

The ToolMessage returned by the "tools" node will now have another property by name "artifact" along with the "content" property.
This artifact property will have all the documents that were returned by you in the get_documents method above. They provide the context you are looking for.

1 reply

kauly Sep 26, 2024

Hey @gopidon

Do you know if changes are necessary in the AgentExecutor to get this artifact at the agent outputs?

sidh4rth7 · 2024-11-12T09:27:40Z

sidh4rth7
Nov 12, 2024

This is another version that doesn't use response_format="content_and_artifact" but its default version response_format="content" and this could be useful when you want your RAG agent to pick out relevant metadata [can be URLS, etc.,] to build context along with the Document -page_content.
FYI - metadata is included as the combined ToolMessage content and surprisingly it worked better for my use.

Model: GPT-4o-mini

from langchain_core.tools.simple import Tool
from pydantic import BaseModel, Field

class RetrieverInput(BaseModel):
    """Input to the retriever."""
    query: str = Field(description="Query to look up in the retriever")

def get_documents(query, retriever):
    docs = retriever.invoke(query)
    
    # Combine the content of all documents
    combined_content = "\n\n".join(doc.page_content for doc in docs)
    
    # Combine the metadata of all documents into a single dictionary or list
    combined_metadata = {}
    for doc in docs:
        for key, value in doc.metadata.items():
            # Append metadata values to lists for each key
            if key in combined_metadata:
                combined_metadata[key].append(value)
            else:
                combined_metadata[key] = [value]

    # Create a combined document with combined content and metadata
    combined_doc = {
        "content": combined_content,
        "metadata": combined_metadata
    }

    return combined_doc

def create_retriever_tool(name: str, description: str, retriever) -> Tool:
    """Creates a Tool for a given retriever with custom name and description."""
    return Tool(
        name=name,
        func=lambda query: get_documents(query, retriever),
        description=description,
        args_schema=RetrieverInput,
        # response_format="content_and_artifact"
    )

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return documents retrieved by `create_retriever_tool` when used in `create_openai_tools_agent` #17398

{{title}}

Replies: 13 comments 3 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

About Dosu

{{title}}

{{title}}

About Dosu

{{title}}

{{title}}

About Dosu

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Return documents retrieved by create_retriever_tool when used in create_openai_tools_agent #17398

Checked other resources

Commit to Help

Example Code

Description

System Info

Replies: 13 comments · 3 replies

dosubot[bot] bot Feb 12, 2024

Sources

About Dosu

AndreaBrg Feb 12, 2024 Author

dosubot[bot] bot Feb 12, 2024

Sources

About Dosu

AndreaBrg Feb 12, 2024 Author

dosubot[bot] bot Feb 12, 2024

Sources

About Dosu

Return documents retrieved by `create_retriever_tool` when used in `create_openai_tools_agent` #17398

Replies: 13 comments 3 replies

dosubot[bot]
bot Feb 12, 2024

AndreaBrg
Feb 12, 2024
Author

dosubot[bot]
bot Feb 12, 2024

AndreaBrg
Feb 12, 2024
Author

dosubot[bot]
bot Feb 12, 2024