Inconsistent performance from Groq's new tool-calling functionality #19990
-
It seems you're facing two primary challenges with Groq's new tool-calling functionality: parsing errors and inaccuracies in tool-calling. Here are some targeted strategies to address these issues:
Further strategies include:
These suggestions aim to improve the accuracy and reliability of tool-calling with Groq models. If you encounter further issues or need more specific advice, please provide additional details about the challenges you're facing.
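As an illustration of the general approach (my own sketch, not one of the strategies listed above): since the failures are usually the model wrapping its JSON in extra prose, a simple retry loop around the parse step can recover many calls. The model name, prompt, and helper name here are all assumptions.

```python
# Illustrative sketch only: retry a Groq chain when the model wraps its JSON
# in extra prose and parsing fails. Nothing here is an official fix.
from langchain_core.exceptions import OutputParserException
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

llm = ChatGroq(model="mixtral-8x7b-32768", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Respond with a single JSON object only. No prose before or after it."),
    ("human", "{input}"),
])
chain = prompt | llm | JsonOutputParser()

def invoke_with_retries(text: str, max_retries: int = 3):
    """Hypothetical helper: re-ask the model when its output isn't clean JSON."""
    for _ in range(max_retries):
        try:
            return chain.invoke({"input": text})
        except OutputParserException:
            continue  # Groq is fast enough that a retry is cheap
    raise RuntimeError("Model never returned parseable JSON.")
```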
-
facing same issue
-
Me too: sometimes function calling with llm.with_structured_output(...) works, sometimes it doesn't. Very weird.
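For reference, this is roughly the pattern in question; the schema and model name are made up for the sketch, and behavior varies between runs as described above.

```python
# Minimal sketch of the intermittent with_structured_output(...) pattern.
# The Person schema and model name are assumptions, not from this thread.
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_groq import ChatGroq

class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description="The person's age")

llm = ChatGroq(model="llama3-70b-8192", temperature=0)
structured_llm = llm.with_structured_output(Person)

# Sometimes this returns a Person instance; other runs fail because the
# model emits prose around (or instead of) the underlying tool call.
result = structured_llm.invoke("Alice is 30 years old.")
print(result)
```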
-
I have the same issue. Sometimes I get a good response, sometimes the response is embedded in the message. For llama3-70b-8192 the tool use is okay, but llama3-8b-8192 shows the problem more.
Good example:
Bad example:
Request:
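To make the two cases concrete, here is a minimal sketch (the weather tool is a placeholder, and it assumes a langchain-core recent enough to expose AIMessage.tool_calls): in the good case the response carries structured tool calls, in the bad case the call only shows up as text in the message content.

```python
# Sketch of the failure mode: "good" responses populate response.tool_calls,
# "bad" responses bury the call as plain text in response.content.
from langchain_core.tools import tool
from langchain_groq import ChatGroq

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city (placeholder tool)."""
    return f"Sunny in {city}"

llm = ChatGroq(model="llama3-8b-8192", temperature=0)
llm_with_tools = llm.bind_tools([get_weather])

response = llm_with_tools.invoke("What's the weather in Paris?")

if response.tool_calls:
    print("good: structured tool call ->", response.tool_calls)
else:
    print("bad: call embedded in text ->", response.content)
```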
-
Checked other resources
Commit to Help
Example Code
Description
First of all, kudos to the LangChain team for shipping integration with Groq's updated tool-calling support on day one (https://console.groq.com/docs/tool-use).
My use case for fast inference is tool-calling agents, like the email agent above. So I swapped the LLM from GPT-4-turbo to Groq to test its tool calling.
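Roughly, the swap looks like the sketch below. The email tool here is a placeholder (not the actual agent's tool), and create_tool_calling_agent may need a newer langchain release than the versions pinned under System Info.

```python
# Rough sketch of the LLM swap: same tool-calling agent, ChatGroq dropped in
# where ChatOpenAI(model="gpt-4-turbo") used to be. The tool is a placeholder.
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_groq import ChatGroq

@tool
def search_inbox(query: str) -> str:
    """Search the inbox for emails matching the query (placeholder)."""
    return "No matching emails found."

# llm = ChatOpenAI(model="gpt-4-turbo")        # before
llm = ChatGroq(model="mixtral-8x7b-32768")     # after

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful email assistant."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

agent = create_tool_calling_agent(llm, [search_inbox], prompt)
executor = AgentExecutor(agent=agent, tools=[search_inbox])
executor.invoke({"input": "Find emails about the Q2 budget."})
```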
The results are pretty underwhelming: with Mixtral 8x7B, I got frequent parsing errors (usually because the LLM outputs text like "sure, here's your JSON object" before the actual JSON). Other times the agent outputs the correct format but fails to understand the instructions, or even lies about whether it used a tool. Overall, I just don't think the OSS models offered by Groq are very well tuned for OpenAI-style function calling at the moment.
I tested in both single-turn and multi-turn settings. Single-turn accuracy is okay if the tool is simple, like an extractor, but more complex (API-calling) tools and multi-turn conversations do really badly.
The speed is lightning fast as promised (end to end, maybe 5x-10x faster than GPT-4-turbo for me), but it's really a shame that this new functionality fails for most non-toy use cases. I would love some suggestions (prompting techniques, etc.) on how to improve tool-calling accuracy with the Groq models.
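One prompting tweak worth trying (an assumption on my part, not a verified fix) is to spell out in the system message that the model must either call a tool through the tool-calling interface or answer plainly, and must never wrap JSON in commentary or claim tool use it didn't perform:

```python
# Illustrative prompting tweak, not a verified fix: an explicit system message
# forbidding prose around tool calls and fabricated tool-use claims.
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

SYSTEM = (
    "You are a tool-calling assistant. When a tool is appropriate, call it "
    "through the tool-calling interface only. Never describe the call in text, "
    "never add commentary before or after JSON, and never claim to have used "
    "a tool you did not actually call."
)

prompt = ChatPromptTemplate.from_messages([("system", SYSTEM), ("human", "{input}")])
llm = ChatGroq(model="mixtral-8x7b-32768", temperature=0)
chain = prompt | llm  # bind_tools(...) would be layered on top in a real agent
```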
System Info
langchain-groq==0.1.0
langchain==0.1.14