Telemetry (Phase 1) #296

Draft · wants to merge 14 commits into main
Conversation

@marklysze (Collaborator) commented Dec 27, 2024

-- WORK IN PROGRESS --

Why are these changes needed?

Telemetry for AG2 workflows, heavily based on OpenTelemetry.

Implementation of a core telemetry library that provides a protocol for building telemetry providers.

Telemetry providers included:

  • OpenTelemetry - sends your telemetry data to an OpenTelemetry collector
  • Cost Tracker - tracks events with associated costs
  • Mermaid diagram - (experimental) creates a Mermaid sequence diagram of the conversation

Telemetry providers are created by implementing the TelemetryProvider protocol.
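
As an illustration, a custom provider might look like the following sketch. The method names here (`start_trace`, `record_event`) are assumptions for illustration only; the actual TelemetryProvider protocol is defined in this PR's branch.

```python
from typing import Any


class ListProvider:
    """Toy telemetry provider that collects everything into a list.

    The start_trace/record_event hooks are hypothetical stand-ins for
    whatever the real TelemetryProvider protocol requires.
    """

    def __init__(self) -> None:
        self.records: list[tuple[str, dict[str, Any]]] = []

    def start_trace(self, name: str, attributes: dict[str, Any]) -> None:
        # Record the start of a traced workflow.
        self.records.append(("trace:" + name, attributes))

    def record_event(self, kind: str, attributes: dict[str, Any]) -> None:
        # Record a point-in-time event within the current span.
        self.records.append(("event:" + kind, attributes))
```

Once registered via instrumentation_manager.register_provider(provider), the manager would fan each captured span and event out to every registered provider.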

Phases:

  1. (this) Establish core telemetry architecture and data capture, along with an OpenTelemetry provider and cost tracking provider
  2. Add additional providers, e.g. log to file, log to database
  3. Add more advanced real-time GUI provider

To use telemetry:

  1. Instantiate an InstrumentationManager
  2. Instantiate telemetry providers and register them with your InstrumentationManager
  3. Use a context manager (telemetry_context) to wrap the workflow you want to instrument
from autogen.telemetry import InstrumentationManager, telemetry_context, OpenTelemetryProvider, CostTrackerProvider, MermaidDiagramProvider

instrumentation_manager = InstrumentationManager()
otel_provider = OpenTelemetryProvider(
    protocol="http",
    collector_endpoint="http://192.168.0.1:4318/v1/traces",
    service_name="ag2"
)
instrumentation_manager.register_provider(otel_provider)

cost_provider = CostTrackerProvider()
instrumentation_manager.register_provider(cost_provider)

mermaid_provider = MermaidDiagramProvider()
instrumentation_manager.register_provider(mermaid_provider)

with telemetry_context(instrumentation_manager, "My Group Chat", {"workflow.id": "123"}) as telemetry:
    # ... AG2 code ...

    print(f"Cost summary: {cost_provider.total_cost:.6f} ({len(cost_provider.cost_history)} cost events)")

    mermaid_markup = mermaid_provider.generate_sequence_diagram()
    print(mermaid_markup)

I am using Jaeger as my OpenTelemetry collector and viewer (https://www.jaegertracing.io/).
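
For a quick local collector, Jaeger's all-in-one image can be run with its OTLP/HTTP receiver exposed. The flag and ports below follow Jaeger's published defaults, but check your Jaeger version's docs:

```shell
# Jaeger all-in-one: UI on 16686, OTLP/HTTP trace ingest on 4318.
# COLLECTOR_OTLP_ENABLED is required on older Jaeger releases;
# recent releases enable OTLP by default.
docker run --rm -d \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

Point collector_endpoint at http://&lt;host&gt;:4318/v1/traces (as in the example above) and browse traces at http://localhost:16686.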

Still to do:

  • Feedback and more feedback
  • Run all tests to ensure nothing is broken
  • Test more scenarios
  • Correctly identify Nested Chat spans
  • Correctly identify Swarm ON_CONDITION-based transitions

Ideas on future telemetry providers:

  • File logger
  • Database logger
  • GUI event feed (for debug and live visualisation)

Related issue number

Closes #103
Closes #173
Closes #298


@marklysze (Collaborator, Author) commented Dec 27, 2024

Sample program:

from autogen import ConversableAgent
from autogen.telemetry import InstrumentationManager, telemetry_context, OpenTelemetryProvider, CostTrackerProvider, MermaidDiagramProvider

# Initialize
instrumentation_manager = InstrumentationManager()
otel_provider = OpenTelemetryProvider(
    protocol="http",
    collector_endpoint="http://192.168.0.1:4318/v1/traces",
    service_name="ag2"
)
instrumentation_manager.register_provider(otel_provider)

cost_provider = CostTrackerProvider()
instrumentation_manager.register_provider(cost_provider)

mermaid_provider = MermaidDiagramProvider()
instrumentation_manager.register_provider(mermaid_provider)

with telemetry_context(instrumentation_manager, "My Workflow", {"workflow.id": "123"}) as telemetry:

    # llm_config is assumed to be defined earlier, e.g. {"config_list": [...]}
    jack = ConversableAgent(
        "Jack",
        llm_config=llm_config,
        system_message="Your name is Jack and you are a comedian in a two-person comedy show. Always say: FINISH",
        is_termination_msg=lambda x: "FINISH" in (x.get("content") or ""),
        human_input_mode="NEVER",
    )
    emma = ConversableAgent(
        "Emma",
        llm_config=llm_config,
        system_message="Your name is Emma and you are a comedian in a two-person comedy show. Say the word FINISH ONLY AFTER you've heard 2 of Jack's jokes.",
        is_termination_msg=lambda x: "FINISH" in (x.get("content") or ""),
        human_input_mode="NEVER",
    )

    # Initiate the chat
    chat_result = jack.initiate_chat(
        emma,
        message="Emma, tell me a joke about goldfish and peanut butter.",
        max_turns=4,
    )

    print(f"Cost summary: {cost_provider.total_cost:.6f} ({len(cost_provider.cost_history)} cost events)")

    print(mermaid_provider.generate_sequence_diagram())

Example output:

Cost summary: 0.000069 (2 cost events)
sequenceDiagram
    participant Jack
    participant Emma
    Jack->>Emma: Emma, tell me a joke about goldfish and peanut but...
    Emma->>Jack: Sure, here it goes! Why did the goldfish break up ...
    Jack->>Emma: Alright, Emma! Why did the goldfish start a podcas...

@marklysze (Collaborator, Author) commented:
What are we capturing? Detail to be expanded upon, but here are the spans and event types so far.

from enum import Enum


class SpanKind(Enum):
    """Enumeration of span kinds

    Spans represent a unit of work within a trace. Typically spans have child spans and events.

    Descriptions:
        WORKFLOW: Main workflow span
        CHATS: Multiple chats, e.g. initiate_chats
        CHAT: Single chat, e.g. initiate_chat
        NESTED_CHAT: A nested chat, e.g. _summary_from_nested_chats
        GROUP_CHAT: A groupchat, e.g. run_chat
        ROUND: A round within a chat (where chats have max_round or max_turn)
        REPLY: Agent replying to another agent
        REPLY_FUNCTION: Functions executed during a reply, such as generate_oai_reply and check_termination_and_human_reply
        SUMMARY: Summarization of a chat
        REASONING: Agent reasoning step, for advanced agents like ReasoningAgent and CaptainAgent
        GROUPCHAT_SELECT_SPEAKER: GroupChat speaker selection (covers all selection methods)
        SWARM_ON_CONDITION: Swarm-specific, ON_CONDITION hand off
    """


class EventKind(Enum):
    """Enumeration of span event kinds

    Events represent a singular point in time within a span, capturing specific moments or actions.

    Descriptions:
        AGENT_TRANSITION: Transition moved from one agent to another
        AGENT_CREATION: Creation of an Agent
        GROUPCHAT_CREATION: Creation of a GroupChat
        LLM_CREATE: LLM execution
        AGENT_SEND_MSG: Agent sending a message to another agent
        TOOL_EXECUTION: Tool or Function execution
        COST: Cost event
        SWARM_TRANSITION: Swarm-specific, transition reason (e.g. ON_CONDITION)
        CONSOLE_PRINT: Console output (TBD)
    """
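
As a sketch of how a provider can consume these events, here is a toy cost tracker. The record_event hook and the enum values are assumptions for illustration; total_cost and cost_history mirror the attributes CostTrackerProvider exposes in the examples above.

```python
from enum import Enum
from typing import Any


class EventKind(Enum):
    # Illustrative subset of the event kinds documented above; values are placeholders.
    LLM_CREATE = "llm_create"
    COST = "cost"


class MiniCostTracker:
    """Toy stand-in for CostTrackerProvider: sums COST events.

    record_event is a hypothetical protocol hook, not the PR's actual API;
    total_cost and cost_history mirror the attributes used in the sample program.
    """

    def __init__(self) -> None:
        self.total_cost = 0.0
        self.cost_history: list[dict[str, Any]] = []

    def record_event(self, kind: EventKind, attributes: dict[str, Any]) -> None:
        # Only COST events contribute to the running total.
        if kind is EventKind.COST:
            self.cost_history.append(attributes)
            self.total_cost += attributes.get("cost", 0.0)
```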

Labels: enhancement, telemetry