Commit

Merge remote-tracking branch 'origin/main' into swarmdoc
marklysze committed Nov 20, 2024
2 parents 658ddee + 762045a commit 5eacb20
Showing 17 changed files with 129 additions and 93 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -12,6 +12,6 @@

## Checks

- [ ] I've included any doc changes needed for https://ag2ai.github.io/autogen/. See https://ag2ai.github.io/ag2/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've included any doc changes needed for https://ag2ai.github.io/ag2/. See https://ag2ai.github.io/ag2/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
6 changes: 2 additions & 4 deletions autogen/agentchat/contrib/agent_eval/README.md
@@ -1,9 +1,7 @@
Agents for running the [AgentEval](https://ag2ai.github.io/autogen/blog/2023/11/20/AgentEval/) pipeline.
Agents for running the [AgentEval](https://ag2ai.github.io/ag2/blog/2023/11/20/AgentEval/) pipeline.

AgentEval is a process for evaluating an LLM-based system's performance on a given task.

When given a task to evaluate and a few example runs, the critic and subcritic agents create evaluation criteria for evaluating a system's solution. Once the criteria have been created, the quantifier agent can evaluate subsequent task solutions based on the generated criteria.

For more information see: [AgentEval Integration Roadmap](https://github.com/microsoft/autogen/issues/2162)

See our [blog post](https://ag2ai.github.io/autogen/blog/2024/06/21/AgentEval) for usage examples and general explanations.
See our [blog post](https://ag2ai.github.io/ag2/blog/2024/06/21/AgentEval) for usage examples and general explanations.
2 changes: 1 addition & 1 deletion notebook/JSON_mode_example.ipynb
@@ -19,7 +19,7 @@
"\n",
"\n",
"Please find documentation about this feature in OpenAI [here](https://platform.openai.com/docs/guides/text-generation/json-mode).\n",
"More information about Agent Descriptions is located [here](https://ag2ai.github.io/autogen/blog/2023/12/29/AgentDescriptions/)\n",
"More information about Agent Descriptions is located [here](https://ag2ai.github.io/ag2/blog/2023/12/29/AgentDescriptions/)\n",
"\n",
"Benefits\n",
"- This contribution provides a method to implement precise speaker transitions based on content of the input message. The example can prevent Prompt hacks that use coersive language.\n",
2 changes: 1 addition & 1 deletion notebook/agentchat_MathChat.ipynb
@@ -9,7 +9,7 @@
"\n",
"AutoGen offers conversable agents powered by LLM, tool or human, which can be used to perform tasks collectively via automated chat. This framework allows tool use and human participation through multi-agent conversation. Please find documentation about this feature [here](https://ag2ai.github.io/ag2/docs/Use-Cases/agent_chat).\n",
"\n",
"MathChat is an experimental conversational framework for math problem solving. In this notebook, we demonstrate how to use MathChat to solve math problems. MathChat uses the `AssistantAgent` and `MathUserProxyAgent`, which is similar to the usage of `AssistantAgent` and `UserProxyAgent` in other notebooks (e.g., [Automated Task Solving with Code Generation, Execution & Debugging](https://github.com/ag2ai/ag2/blob/main/notebook/agentchat_auto_feedback_from_code_execution.ipynb)). Essentially, `MathUserProxyAgent` implements a different auto reply mechanism corresponding to the MathChat prompts. You can find more details in the paper [An Empirical Study on Challenging Math Problem Solving with GPT-4](https://arxiv.org/abs/2306.01337) or the [blogpost](https://ag2ai.github.io/autogen/blog/2023/06/28/MathChat).\n",
"MathChat is an experimental conversational framework for math problem solving. In this notebook, we demonstrate how to use MathChat to solve math problems. MathChat uses the `AssistantAgent` and `MathUserProxyAgent`, which is similar to the usage of `AssistantAgent` and `UserProxyAgent` in other notebooks (e.g., [Automated Task Solving with Code Generation, Execution & Debugging](https://github.com/ag2ai/ag2/blob/main/notebook/agentchat_auto_feedback_from_code_execution.ipynb)). Essentially, `MathUserProxyAgent` implements a different auto reply mechanism corresponding to the MathChat prompts. You can find more details in the paper [An Empirical Study on Challenging Math Problem Solving with GPT-4](https://arxiv.org/abs/2306.01337) or the [blogpost](https://ag2ai.github.io/ag2/blog/2023/06/28/MathChat).\n",
"\n",
"````{=mdx}\n",
":::info Requirements\n",
156 changes: 97 additions & 59 deletions notebook/agentchat_cost_token_tracking.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion notebook/agenteval_cq_math.ipynb
@@ -17,7 +17,7 @@
"\n",
"![AgentEval](https://media.githubusercontent.com/media/ag2ai/ag2/main/website/blog/2023-11-20-AgentEval/img/agenteval-CQ.png)\n",
"\n",
"For more detailed explanations, please refer to the accompanying [blog post](https://ag2ai.github.io/autogen/blog/2023/11/20/AgentEval)\n",
"For more detailed explanations, please refer to the accompanying [blog post](https://ag2ai.github.io/ag2/blog/2023/11/20/AgentEval)\n",
"\n",
"## Requirements\n",
"\n",
12 changes: 6 additions & 6 deletions notebook/autogen_uniformed_api_calling.ipynb
@@ -22,7 +22,7 @@
"\n",
"... and more to come!\n",
"\n",
"You can also [plug in your local deployed LLM](https://ag2ai.github.io/autogen/blog/2024/01/26/Custom-Models) into AutoGen if needed."
"You can also [plug in your local deployed LLM](https://ag2ai.github.io/ag2/blog/2024/01/26/Custom-Models) into AutoGen if needed."
]
},
{
@@ -376,11 +376,11 @@
],
"metadata": {
"front_matter": {
"description": "Uniform interface to call different LLM.",
"tags": [
"integration",
"custom model"
]
"description": "Uniform interface to call different LLM.",
"tags": [
"integration",
"custom model"
]
},
"kernelspec": {
"display_name": "autodev",
2 changes: 1 addition & 1 deletion notebook/oai_chatgpt_gpt4.ipynb
@@ -33,7 +33,7 @@
"\n",
"In this notebook, we tune OpenAI ChatGPT (both GPT-3.5 and GPT-4) models for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought) for measuring mathematical problem solving on competition math problems with chain-of-thoughts style reasoning.\n",
"\n",
"Related link: [Blogpost](https://ag2ai.github.io/autogen/blog/2023/04/21/LLM-tuning-math) based on this experiment.\n",
"Related link: [Blogpost](https://ag2ai.github.io/ag2/blog/2023/04/21/LLM-tuning-math) based on this experiment.\n",
"\n",
"## Requirements\n",
"\n",
10 changes: 5 additions & 5 deletions test/agentchat/contrib/test_web_surfer.py
@@ -21,8 +21,8 @@
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
from test_assistant_agent import KEY_LOC, OAI_CONFIG_LIST # noqa: E402

BLOG_POST_URL = "https://ag2ai.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
BLOG_POST_TITLE = "Does Model and Inference Parameter Matter in LLM Applications? - A Case Study for MATH | AutoGen"
BLOG_POST_URL = "https://ag2ai.github.io/ag2/blog/2023/04/21/LLM-tuning-math"
BLOG_POST_TITLE = "Does Model and Inference Parameter Matter in LLM Applications? - A Case Study for MATH | AG2"
BING_QUERY = "Microsoft"

try:
@@ -54,7 +54,7 @@ def test_web_surfer() -> None:
page_size = 4096
web_surfer = WebSurferAgent(
"web_surfer",
llm_config={"model": "gpt-4", "config_list": []},
llm_config={"model": "gpt-4o", "config_list": []},
browser_config={"viewport_size": page_size},
)

@@ -110,7 +110,7 @@ def test_web_surfer_oai() -> None:
llm_config = {"config_list": config_list, "timeout": 180, "cache_seed": 42}

# adding Azure name variations to the model list
model = ["gpt-3.5-turbo-1106", "gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-16k"]
model = ["gpt-4o", "gpt-4o-mini"]
model += [m.replace(".", "") for m in model]

summarizer_llm_config = {
@@ -160,7 +160,7 @@ def test_web_surfer_bing() -> None:
llm_config={
"config_list": [
{
"model": "gpt-3.5-turbo-16k",
"model": "gpt-4o",
"api_key": "sk-PLACEHOLDER_KEY",
}
]
6 changes: 3 additions & 3 deletions test/test_browser_utils.py
@@ -16,15 +16,15 @@
import requests
from agentchat.test_assistant_agent import KEY_LOC # noqa: E402

BLOG_POST_URL = "https://ag2ai.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
BLOG_POST_TITLE = "Does Model and Inference Parameter Matter in LLM Applications? - A Case Study for MATH | AutoGen"
BLOG_POST_URL = "https://ag2ai.github.io/ag2/blog/2023/04/21/LLM-tuning-math"
BLOG_POST_TITLE = "Does Model and Inference Parameter Matter in LLM Applications? - A Case Study for MATH | AG2"
BLOG_POST_STRING = "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?"

WIKIPEDIA_URL = "https://en.wikipedia.org/wiki/Microsoft"
WIKIPEDIA_TITLE = "Microsoft - Wikipedia"
WIKIPEDIA_STRING = "Redmond"

PLAIN_TEXT_URL = "https://raw.githubusercontent.com/microsoft/autogen/main/README.md"
PLAIN_TEXT_URL = "https://raw.githubusercontent.com/ag2ai/ag2/main/README.md"
IMAGE_URL = "https://github.com/afourney.png"

PDF_URL = "https://arxiv.org/pdf/2308.08155.pdf"
2 changes: 1 addition & 1 deletion website/blog/2023-11-20-AgentEval/index.mdx
@@ -14,7 +14,7 @@ tags: [LLM, GPT, evaluation, task utility]
**TL;DR:**
* As a developer of an LLM-powered application, how can you assess the utility it brings to end users while helping them with their tasks?
* To shed light on the question above, we introduce `AgentEval` — the first version of the framework to assess the utility of any LLM-powered application crafted to assist users in specific tasks. AgentEval aims to simplify the evaluation process by automatically proposing a set of criteria tailored to the unique purpose of your application. This allows for a comprehensive assessment, quantifying the utility of your application against the suggested criteria.
* We demonstrate how `AgentEval` work using [math problems dataset](https://ag2ai.github.io/autogen/blog/2023/06/28/MathChat) as an example in the [following notebook](https://github.com/ag2ai/ag2/blob/main/notebook/agenteval_cq_math.ipynb). Any feedback would be useful for future development. Please contact us on our [Discord](http://aka.ms/autogen-dc).
* We demonstrate how `AgentEval` work using [math problems dataset](https://ag2ai.github.io/ag2/blog/2023/06/28/MathChat) as an example in the [following notebook](https://github.com/ag2ai/ag2/blob/main/notebook/agenteval_cq_math.ipynb). Any feedback would be useful for future development. Please contact us on our [Discord](http://aka.ms/autogen-dc).


## Introduction
4 changes: 2 additions & 2 deletions website/blog/2024-01-25-AutoGenBench/index.mdx
@@ -42,7 +42,7 @@ autogenbench tabulate Results/human_eval_two_agents

## Introduction

Measurement and evaluation are core components of every major AI or ML research project. The same is true for AutoGen. To this end, today we are releasing AutoGenBench, a standalone command line tool that we have been using to guide development of AutoGen. Conveniently, AutoGenBench handles: downloading, configuring, running, and reporting results of agents on various public benchmark datasets. In addition to reporting top-line numbers, each AutoGenBench run produces a comprehensive set of logs and telemetry that can be used for debugging, profiling, computing custom metrics, and as input to [AgentEval](https://ag2ai.github.io/autogen/blog/2023/11/20/AgentEval). In the remainder of this blog post, we outline core design principles for AutoGenBench (key to understanding its operation); present a guide to installing and running AutoGenBench; outline a roadmap for evaluation; and conclude with an open call for contributions.
Measurement and evaluation are core components of every major AI or ML research project. The same is true for AutoGen. To this end, today we are releasing AutoGenBench, a standalone command line tool that we have been using to guide development of AutoGen. Conveniently, AutoGenBench handles: downloading, configuring, running, and reporting results of agents on various public benchmark datasets. In addition to reporting top-line numbers, each AutoGenBench run produces a comprehensive set of logs and telemetry that can be used for debugging, profiling, computing custom metrics, and as input to [AgentEval](https://ag2ai.github.io/ag2/blog/2023/11/20/AgentEval). In the remainder of this blog post, we outline core design principles for AutoGenBench (key to understanding its operation); present a guide to installing and running AutoGenBench; outline a roadmap for evaluation; and conclude with an open call for contributions.

## Design Principles

@@ -52,7 +52,7 @@ AutoGenBench is designed around three core design principles. Knowing these prin

- **Isolation:** Agents interact with their worlds in both subtle and overt ways. For example an agent may install a python library or write a file to disk. This can lead to ordering effects that can impact future measurements. Consider, for example, comparing two agents on a common benchmark. One agent may appear more efficient than the other simply because it ran second, and benefitted from the hard work the first agent did in installing and debugging necessary Python libraries. To address this, AutoGenBench isolates each task in its own Docker container. This ensures that all runs start with the same initial conditions. (Docker is also a _much safer way to run agent-produced code_, in general.)

- **Instrumentation:** While top-line metrics are great for comparing agents or models, we often want much more information about how the agents are performing, where they are getting stuck, and how they can be improved. We may also later think of new research questions that require computing a different set of metrics. To this end, AutoGenBench is designed to log everything, and to compute metrics from those logs. This ensures that one can always go back to the logs to answer questions about what happened, run profiling software, or feed the logs into tools like [AgentEval](https://ag2ai.github.io/autogen/blog/2023/11/20/AgentEval).
- **Instrumentation:** While top-line metrics are great for comparing agents or models, we often want much more information about how the agents are performing, where they are getting stuck, and how they can be improved. We may also later think of new research questions that require computing a different set of metrics. To this end, AutoGenBench is designed to log everything, and to compute metrics from those logs. This ensures that one can always go back to the logs to answer questions about what happened, run profiling software, or feed the logs into tools like [AgentEval](https://ag2ai.github.io/ag2/blog/2023/11/20/AgentEval).

## Installing and Running AutoGenBench

2 changes: 1 addition & 1 deletion website/blog/2024-05-24-Agent/index.mdx
@@ -143,7 +143,7 @@ better with low cost. [EcoAssistant](/blog/2023/11/09/EcoAssistant) is a good ex

There are certainly tradeoffs to make. The large design space of multi-agents offers these tradeoffs and opens up new opportunities for optimization.

> Over a year since the debut of Ask AT&T, the generative AI platform to which we’ve onboarded over 80,000 users, AT&T has been enhancing its capabilities by incorporating 'AI Agents'. These agents, powered by the Autogen framework pioneered by Microsoft (https://ag2ai.github.io/autogen/blog/2023/12/01/AutoGenStudio/), are designed to tackle complicated workflows and tasks that traditional language models find challenging. To drive collaboration, AT&T is contributing back to the open-source project by introducing features that facilitate enhanced security and role-based access for various projects and data.
> Over a year since the debut of Ask AT&T, the generative AI platform to which we’ve onboarded over 80,000 users, AT&T has been enhancing its capabilities by incorporating 'AI Agents'. These agents, powered by the Autogen framework pioneered by Microsoft (https://ag2ai.github.io/ag2/blog/2023/12/01/AutoGenStudio/), are designed to tackle complicated workflows and tasks that traditional language models find challenging. To drive collaboration, AT&T is contributing back to the open-source project by introducing features that facilitate enhanced security and role-based access for various projects and data.
>
> > Andy Markus, Chief Data Officer at AT&T
4 changes: 2 additions & 2 deletions website/blog/2024-06-21-AgentEval/index.mdx
@@ -15,13 +15,13 @@ tags: [LLM, GPT, evaluation, task utility]

TL;DR:
* As a developer, how can you assess the utility and effectiveness of an LLM-powered application in helping end users with their tasks?
* To shed light on the question above, we previously introduced [`AgentEval`](https://ag2ai.github.io/autogen/blog/2023/11/20/AgentEval/) — a framework to assess the multi-dimensional utility of any LLM-powered application crafted to assist users in specific tasks. We have now embedded it as part of the AutoGen library to ease developer adoption.
* To shed light on the question above, we previously introduced [`AgentEval`](https://ag2ai.github.io/ag2/blog/2023/11/20/AgentEval/) — a framework to assess the multi-dimensional utility of any LLM-powered application crafted to assist users in specific tasks. We have now embedded it as part of the AutoGen library to ease developer adoption.
* Here, we introduce an updated version of AgentEval that includes a verification process to estimate the robustness of the QuantifierAgent. More details can be found in [this paper](https://arxiv.org/abs/2405.02178).


## Introduction

Previously introduced [`AgentEval`](https://ag2ai.github.io/autogen/blog/2023/11/20/AgentEval/) is a comprehensive framework designed to bridge the gap in assessing the utility of LLM-powered applications. It leverages recent advancements in LLMs to offer a scalable and cost-effective alternative to traditional human evaluations. The framework comprises three main agents: `CriticAgent`, `QuantifierAgent`, and `VerifierAgent`, each playing a crucial role in assessing the task utility of an application.
Previously introduced [`AgentEval`](https://ag2ai.github.io/ag2/blog/2023/11/20/AgentEval/) is a comprehensive framework designed to bridge the gap in assessing the utility of LLM-powered applications. It leverages recent advancements in LLMs to offer a scalable and cost-effective alternative to traditional human evaluations. The framework comprises three main agents: `CriticAgent`, `QuantifierAgent`, and `VerifierAgent`, each playing a crucial role in assessing the task utility of an application.

**CriticAgent: Defining the Criteria**

4 changes: 2 additions & 2 deletions website/docs/FAQ.mdx
@@ -34,8 +34,8 @@ In version >=1, OpenAI renamed their `api_base` parameter to `base_url`. So for

Yes. You currently have two options:

- Autogen can work with any API endpoint which complies with OpenAI-compatible RESTful APIs - e.g. serving local LLM via FastChat or LM Studio. Please check https://ag2ai.github.io/autogen/blog/2023/07/14/Local-LLMs for an example.
- You can supply your own custom model implementation and use it with Autogen. Please check https://ag2ai.github.io/autogen/blog/2024/01/26/Custom-Models for more information.
- Autogen can work with any API endpoint which complies with OpenAI-compatible RESTful APIs - e.g. serving local LLM via FastChat or LM Studio. Please check https://ag2ai.github.io/ag2/blog/2023/07/14/Local-LLMs for an example.
- You can supply your own custom model implementation and use it with Autogen. Please check https://ag2ai.github.io/ag2/blog/2024/01/26/Custom-Models for more information.

## Handle Rate Limit Error and Timeout Error

