Knowledge #1567

Merged
merged 49 commits into from
Nov 20, 2024
Changes from 11 commits
49 commits
75322b2
initial knowledge
joaomdmoura Nov 4, 2024
dc314c1
Merge branch 'main' into knowledge
bhancockio Nov 4, 2024
a8a2f80
WIP
bhancockio Nov 5, 2024
1a35114
Adding core knowledge sources
bhancockio Nov 6, 2024
6131dba
Improve types and better support for file paths
bhancockio Nov 6, 2024
617ee98
added additional sources
bhancockio Nov 6, 2024
4af263c
Merge branch 'main' into knowledge
bhancockio Nov 7, 2024
59165cb
fix linting
bhancockio Nov 7, 2024
86ede83
update yaml to include optional deps
bhancockio Nov 7, 2024
7b59c5b
adding in lorenze feedback
bhancockio Nov 7, 2024
98a708c
Merge branch 'main' of github.com:crewAIInc/crewAI into knowledge
lorenzejay Nov 14, 2024
10f445e
ensure embeddings are persisted
lorenzejay Nov 15, 2024
cb03ee6
improvements all around Knowledge class
lorenzejay Nov 15, 2024
cdf5233
Merge branch 'main' of github.com:crewAIInc/crewAI into knowledge
lorenzejay Nov 15, 2024
b907938
return this
lorenzejay Nov 15, 2024
352d053
properly reset memory
lorenzejay Nov 18, 2024
b2c06d5
properly reset memory+knowledge
lorenzejay Nov 18, 2024
cbfcde7
consolodation and improvements
lorenzejay Nov 18, 2024
4831dcb
Merge branch 'main' of github.com:crewAIInc/crewAI into knowledge
lorenzejay Nov 18, 2024
d579c5a
linted
lorenzejay Nov 18, 2024
b104404
cleanup rm unused embedder
lorenzejay Nov 19, 2024
70910dd
fix test
lorenzejay Nov 19, 2024
c8bf242
fix duplicate
lorenzejay Nov 19, 2024
cbfdbe3
generating cassettes for knowledge test
lorenzejay Nov 19, 2024
e882725
updated default embedder
lorenzejay Nov 19, 2024
efa8a37
None embedder to use default on pipeline cloning
lorenzejay Nov 19, 2024
de742c8
improvements
lorenzejay Nov 19, 2024
914067d
fixed text_file_knowledge
lorenzejay Nov 19, 2024
0c5b6f2
mypysrc fixes
lorenzejay Nov 19, 2024
705ee16
type check fixes
lorenzejay Nov 19, 2024
58bf2d5
added extra cassette
lorenzejay Nov 19, 2024
ec2fe6f
just mocks
lorenzejay Nov 19, 2024
8373c9b
linted
lorenzejay Nov 19, 2024
e7d816f
Merge branch 'main' of github.com:crewAIInc/crewAI into knowledge
lorenzejay Nov 19, 2024
787f2ea
mock knowledge query to not spin up db
lorenzejay Nov 20, 2024
b185b9e
linted
lorenzejay Nov 20, 2024
4663997
verbose run
lorenzejay Nov 20, 2024
76da972
put a flag
lorenzejay Nov 20, 2024
fe18da5
fix
lorenzejay Nov 20, 2024
23276cb
adding docs
lorenzejay Nov 20, 2024
3c4504b
better docs
lorenzejay Nov 20, 2024
44ab749
improvements from review
lorenzejay Nov 20, 2024
52189a4
more docs
lorenzejay Nov 20, 2024
8a54042
linted
lorenzejay Nov 20, 2024
8564f55
rm print
lorenzejay Nov 20, 2024
38c0d61
more fixes
lorenzejay Nov 20, 2024
9329119
clearer docs
lorenzejay Nov 20, 2024
6359b64
added docstrings and type hints for cli
lorenzejay Nov 20, 2024
c0ad457
Merge branch 'main' of github.com:crewAIInc/crewAI into knowledge
lorenzejay Nov 20, 2024
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -26,7 +26,7 @@ jobs:
         run: uv python install 3.11.9

       - name: Install the project
-        run: uv sync --dev
+        run: uv sync --dev --all-extras
Collaborator

Question: Do you think we need the --all-extras option in this case? It seems like we'll have to install all the optional dependencies to be able to run our tests. What do you think?

Collaborator

Yes, this PR brings in a number of optional dependencies, such as pdfplumber for our PdfKnowledgeSource.


       - name: Run tests
         run: uv run pytest tests
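
As a possible alternative to installing every extra in CI, tests that depend on an optional package can be skipped when that package is missing. The following is a minimal standard-library sketch of that idea, not something this PR does; the helper `has_module` and the test class are hypothetical names.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


class PdfSourceTests(unittest.TestCase):
    # Skipped (not failed) when the optional pdfplumber extra is absent.
    @unittest.skipUnless(has_module("pdfplumber"), "pdfplumber extra not installed")
    def test_pdf_extra_importable(self):
        import pdfplumber  # noqa: F401  (only runs when the extra is present)
```

With this pattern the default `uv sync --dev` environment would still run the rest of the suite, at the cost of the skipped tests silently not covering the optional code paths.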
32 changes: 32 additions & 0 deletions path/to/src/crewai/knowledge/source/base_knowledge_source.py
@@ -0,0 +1,32 @@
from abc import ABC, abstractmethod
Collaborator

Question: I imagine that the path of this file is not correct.
path/to/src/crewai/knowledge/

Collaborator

Looks right? The abstract class could live inside the source dir.

from typing import List

from crewai.knowledge.embedder.base_embedder import BaseEmbedder


class BaseKnowledgeSource(ABC):
    """Abstract base class for different types of knowledge sources."""

    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunks: List[str] = []

    @abstractmethod
    def load_content(self):
        """Load and preprocess content from the source."""
        pass

    @abstractmethod
    def add(self, embedder: BaseEmbedder) -> None:
        """Add content to the knowledge base, chunk it, and compute embeddings."""
        pass

    @abstractmethod
    def query(self, embedder: BaseEmbedder, query: str, top_k: int = 3) -> str:
        """Query the knowledge base using semantic search."""
        pass
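
To illustrate how chunk_size and chunk_overlap might interact in a concrete source, here is a sketch of a plain sliding-window character split. It is an assumption for illustration only, not the chunking this PR implements; the function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults above, each chunk would share its last 200 characters with the start of the next one, which is the usual reason for an overlap: text that straddles a boundary still appears whole in at least one chunk.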
10 changes: 10 additions & 0 deletions pyproject.toml
@@ -39,6 +39,16 @@ Repository = "https://github.com/crewAIInc/crewAI"
 [project.optional-dependencies]
 tools = ["crewai-tools>=0.14.0"]
 agentops = ["agentops>=0.3.0"]
+fastembed = ["fastembed>=0.4.1"]
+pdfplumber = [
+    "pdfplumber>=0.11.4",
+]
+pandas = [
Collaborator

Suggestion: I'm wondering if we need to keep "pandas" as an optional dependency. I took a look at the code, and it seems we're only using it to read Excel files and save them as CSVs. Maybe we could find a lighter library to handle that? Just a thought!

If the library is still required, maybe we should go with "polars":

Polars: ~8.5MB
Pandas: ~12MB

Polars: ~70ms
NumPy: ~104ms
Pandas: ~520ms

Collaborator

These are optional deps; maybe this can be a fast follow?

+    "pandas>=2.2.3",
+]
+openpyxl = [
+    "openpyxl>=3.1.5",
+]
 mem0 = ["mem0ai>=0.1.29"]

 [tool.uv]
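
The review suggestion above about reading Excel files without pandas could look roughly like the sketch below, which pairs openpyxl (already listed as an extra in this hunk) with the standard-library csv module. This is an illustrative assumption, not code from the PR; `rows_to_csv` and `excel_to_csv` are hypothetical names.

```python
import csv
import io
from typing import Any, Iterable, Sequence


def rows_to_csv(rows: Iterable[Sequence[Any]]) -> str:
    """Serialize rows to CSV text; None cells become empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def excel_to_csv(path: str) -> str:
    """Convert the active worksheet of an .xlsx file to CSV text."""
    # Imported lazily so this module still loads when the openpyxl extra is absent.
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        sheet = workbook.active
        return rows_to_csv(sheet.iter_rows(values_only=True))
    finally:
        workbook.close()
```

Only the first (active) sheet is handled here; a multi-sheet workbook would need a loop over the workbook's worksheets.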
14 changes: 13 additions & 1 deletion src/crewai/__init__.py
@@ -1,7 +1,9 @@
 import warnings
+
 from crewai.agent import Agent
 from crewai.crew import Crew
 from crewai.flow.flow import Flow
+from crewai.knowledge.knowledge import Knowledge
 from crewai.llm import LLM
 from crewai.pipeline import Pipeline
 from crewai.process import Process
@@ -15,4 +17,14 @@
     module="pydantic.main",
 )
 __version__ = "0.79.4"
-__all__ = ["Agent", "Crew", "Process", "Task", "Pipeline", "Router", "LLM", "Flow"]
+__all__ = [
+    "Agent",
+    "Crew",
+    "Process",
+    "Task",
+    "Pipeline",
+    "Router",
+    "LLM",
+    "Flow",
+    "Knowledge",
+]
25 changes: 24 additions & 1 deletion src/crewai/agent.py
@@ -9,10 +9,12 @@
 from crewai.agents.agent_builder.base_agent import BaseAgent
 from crewai.agents.crew_agent_executor import CrewAgentExecutor
 from crewai.cli.constants import ENV_VARS
+from crewai.knowledge.knowledge import Knowledge
+from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
 from crewai.llm import LLM
 from crewai.memory.contextual.contextual_memory import ContextualMemory
-from crewai.tools.agent_tools.agent_tools import AgentTools
 from crewai.tools import BaseTool
+from crewai.tools.agent_tools.agent_tools import AgentTools
 from crewai.utilities import Converter, Prompts
 from crewai.utilities.constants import TRAINED_AGENTS_DATA_FILE, TRAINING_DATA_FILE
 from crewai.utilities.token_counter_callback import TokenCalcHandler
@@ -52,6 +54,7 @@ class Agent(BaseAgent):
             role: The role of the agent.
             goal: The objective of the agent.
             backstory: The backstory of the agent.
+            knowledge: The knowledge base of the agent.
             config: Dict representation of agent configuration.
             llm: The language model that will run the agent.
             function_calling_llm: The language model that will handle the tool calling for this agent, it overrides the crew function_calling_llm.
@@ -85,6 +88,10 @@ class Agent(BaseAgent):
     llm: Union[str, InstanceOf[LLM], Any] = Field(
         description="Language model that will run the agent.", default=None
     )
+    knowledge_sources: Optional[List[BaseKnowledgeSource]] = Field(
+        default=None,
+        description="Knowledge sources for the agent.",
Collaborator

It would be better to declare this on the Crew class. The task prompt would then query the relevant knowledge, which trickles down to the agent level, rather than defining it here on the agent level.

+    )
     function_calling_llm: Optional[Any] = Field(
         description="Language model that will run the agent.", default=None
     )
@@ -119,6 +126,8 @@ class Agent(BaseAgent):
         default="safe",
         description="Mode for code execution: 'safe' (using Docker) or 'unsafe' (direct execution).",
     )
+    # TODO: Lorenze add knowledge_embedder. Support direct class or config dict.
+    _knowledge: Optional[Knowledge] = PrivateAttr(default=None)

     @model_validator(mode="after")
     def post_init_setup(self):
@@ -227,6 +236,12 @@ def post_init_setup(self):
         if self.allow_code_execution:
             self._validate_docker_installation()

+        # Initialize the Knowledge object if knowledge_sources are provided
Collaborator

Nitpick: Here you could instead do

self._knowledge = None
if self.crew and self.crew.knowledge_store:
    self._knowledge = self.crew.knowledge_store

Or even remove the = None, since the default from the model is None.

+        if self.knowledge_sources:
+            self._knowledge = Knowledge(sources=self.knowledge_sources)
+        else:
+            self._knowledge = None
+
         return self

     def _setup_agent_executor(self):
@@ -272,6 +287,14 @@ def execute_task(
             if memory.strip() != "":
                 task_prompt += self.i18n.slice("memory").format(memory=memory)

+        # Integrate the knowledge base
+        if self._knowledge:
+            # Query the knowledge base for relevant information
+            knowledge_snippets = self._knowledge.query(query=task.prompt())
+            if knowledge_snippets:
+                formatted_knowledge = "\n".join(knowledge_snippets)
+                task_prompt += f"\n\nAdditional Information:\n{formatted_knowledge}"
+
         tools = tools or self.tools or []
         self.create_agent_executor(tools=tools, task=task)

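
The prompt augmentation in the hunk above can be isolated as a small pure function, which makes its behavior easier to check. The sketch below assumes the query result is a list of strings (joining a single string would put a newline between every character); `augment_prompt` is a hypothetical name, not part of the PR.

```python
from typing import List, Optional


def augment_prompt(task_prompt: str, knowledge_snippets: Optional[List[str]]) -> str:
    """Append retrieved snippets under an 'Additional Information' heading, if any."""
    if not knowledge_snippets:
        return task_prompt  # nothing retrieved: leave the prompt untouched
    formatted_knowledge = "\n".join(knowledge_snippets)
    return f"{task_prompt}\n\nAdditional Information:\n{formatted_knowledge}"
```

An empty or missing result leaves the prompt unchanged, matching the `if knowledge_snippets:` guard in the diff.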
Empty file.
Empty file.
55 changes: 55 additions & 0 deletions src/crewai/knowledge/embedder/base_embedder.py
@@ -0,0 +1,55 @@
from abc import ABC, abstractmethod
from typing import List

import numpy as np


class BaseEmbedder(ABC):
    """
    Abstract base class for text embedding models
    """

    @abstractmethod
    def embed_chunks(self, chunks: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of text chunks

        Args:
            chunks: List of text chunks to embed

        Returns:
            Array of embeddings
        """
        pass

    @abstractmethod
    def embed_texts(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts

        Args:
            texts: List of texts to embed

        Returns:
            Array of embeddings
        """
        pass

    @abstractmethod
    def embed_text(self, text: str) -> np.ndarray:
        """
        Generate embedding for a single text

        Args:
            text: Text to embed

        Returns:
            Embedding array
        """
        pass

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Get the dimension of the embeddings"""
        pass
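
A review comment further down in this PR suggests letting users bring their own embedder function. One way that could be layered on an interface like the one above is a thin adapter around a callable. The sketch below is a hypothetical, reduced version that uses plain lists of floats instead of np.ndarray so it stays standard-library only; `SimpleEmbedder`, `FunctionEmbedder`, and `toy_embed` are illustrative names, not part of the PR.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

Vector = List[float]  # stand-in for np.ndarray to keep the sketch stdlib-only


class SimpleEmbedder(ABC):
    """Reduced, hypothetical version of the interface above."""

    @abstractmethod
    def embed_texts(self, texts: List[str]) -> List[Vector]: ...

    def embed_text(self, text: str) -> Vector:
        return self.embed_texts([text])[0]

    @property
    def dimension(self) -> int:
        return len(self.embed_text("test"))


class FunctionEmbedder(SimpleEmbedder):
    """Adapter that turns any batch embedding function into an embedder."""

    def __init__(self, fn: Callable[[List[str]], List[Vector]]):
        self._fn = fn

    def embed_texts(self, texts: List[str]) -> List[Vector]:
        return self._fn(texts)


def toy_embed(texts: List[str]) -> List[Vector]:
    """Toy function: [length, vowel count]. Not a real embedding model."""
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


embedder = FunctionEmbedder(toy_embed)
```

The design choice here is that only the batch method is abstract; the single-text and dimension helpers are derived from it, so a user-supplied function has just one signature to satisfy.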
93 changes: 93 additions & 0 deletions src/crewai/knowledge/embedder/fastembed.py
@@ -0,0 +1,93 @@
from pathlib import Path
from typing import List, Optional, Union

import numpy as np

from .base_embedder import BaseEmbedder

try:
    from fastembed_gpu import TextEmbedding  # type: ignore

    FASTEMBED_AVAILABLE = True
except ImportError:
    try:
        from fastembed import TextEmbedding

        FASTEMBED_AVAILABLE = True
    except ImportError:
        FASTEMBED_AVAILABLE = False


class FastEmbed(BaseEmbedder):
    """
    A wrapper class for text embedding models using FastEmbed
    """

    def __init__(
        self,
        model_name: str = "BAAI/bge-small-en-v1.5",
        cache_dir: Optional[Union[str, Path]] = None,
    ):
        """
        Initialize the embedding model

        Args:
            model_name: Name of the model to use
            cache_dir: Directory to cache the model
            gpu: Whether to use GPU acceleration
        """
        if not FASTEMBED_AVAILABLE:
            raise ImportError(
                "FastEmbed is not installed. Please install it with: "
                "pip install fastembed or pip install fastembed-gpu for GPU support"
            )

        self.model = TextEmbedding(
            model_name=model_name,
            cache_dir=str(cache_dir) if cache_dir else None,
        )

    def embed_chunks(self, chunks: List[str]) -> List[np.ndarray]:
        """
        Generate embeddings for a list of text chunks

        Args:
            chunks: List of text chunks to embed

        Returns:
            List of embeddings
        """
        embeddings = list(self.model.embed(chunks))
        return embeddings

    def embed_texts(self, texts: List[str]) -> List[np.ndarray]:
        """
        Generate embeddings for a list of texts

        Args:
            texts: List of texts to embed

        Returns:
            List of embeddings
        """
        embeddings = list(self.model.embed(texts))
        return embeddings

    def embed_text(self, text: str) -> np.ndarray:
        """
        Generate embedding for a single text

        Args:
            text: Text to embed

        Returns:
            Embedding array
        """
        return self.embed_texts([text])[0]

    @property
    def dimension(self) -> int:
        """Get the dimension of the embeddings"""
        # Generate a test embedding to get dimensions
        test_embed = self.embed_text("test")
        return len(test_embed)
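
The dimension property above embeds a test string on every access. If that cost matters, one option is to compute it once and cache it. The sketch below shows the idea with functools.cached_property on a hypothetical stand-in class; `CountingEmbedder` and its fixed vector size are assumptions for illustration, not the FastEmbed wrapper from this PR.

```python
from functools import cached_property
from typing import List


class CountingEmbedder:
    """Hypothetical stand-in for an embedder whose embed call is expensive."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_text(self, text: str) -> List[float]:
        self.calls += 1  # track how often the "model" is invoked
        return [0.0] * 384  # arbitrary illustrative vector size

    @cached_property
    def dimension(self) -> int:
        # Computed on first access, then stored on the instance.
        return len(self.embed_text("test"))


embedder = CountingEmbedder()
first, second = embedder.dimension, embedder.dimension
```

The second access reads the stored value, so the underlying embed call runs only once per instance.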
82 changes: 82 additions & 0 deletions src/crewai/knowledge/embedder/ollama.py
@@ -0,0 +1,82 @@
import os
Collaborator

I'd drop the Ollama version. Support OpenAI, then let anyone bring their own embedder function (super easy), and set up knowledge_config the same way embedder_config is set up for our RAG storage.

from typing import List, Optional

import numpy as np
from openai import OpenAI

from .base_embedder import BaseEmbedder


class OllamaEmbedder(BaseEmbedder):
    """
    A wrapper class for text embedding models using Ollama's API
    """

    def __init__(
        self,
        model_name: str,
        api_key: Optional[str] = None,
        base_url: str = "http://localhost:11434/v1",
    ):
        """
        Initialize the embedding model

        Args:
            model_name: Name of the model to use
            api_key: API key (defaults to 'ollama' or environment variable 'OLLAMA_API_KEY')
            base_url: Base URL for the Ollama API (default is 'http://localhost:11434/v1')
        """
        self.model_name = model_name
        self.api_key = api_key or os.getenv("OLLAMA_API_KEY") or "ollama"
        self.base_url = base_url
        self.client = OpenAI(base_url=self.base_url, api_key=self.api_key)

    def embed_chunks(self, chunks: List[str]) -> List[np.ndarray]:
        """
        Generate embeddings for a list of text chunks

        Args:
            chunks: List of text chunks to embed

        Returns:
            List of embeddings
        """
        return self.embed_texts(chunks)

    def embed_texts(self, texts: List[str]) -> List[np.ndarray]:
        """
        Generate embeddings for a list of texts

        Args:
            texts: List of texts to embed

        Returns:
            List of embeddings
        """
        embeddings = []
        max_batch_size = 2048  # Adjust batch size if necessary
        for i in range(0, len(texts), max_batch_size):
            batch = texts[i : i + max_batch_size]
            response = self.client.embeddings.create(input=batch, model=self.model_name)
            batch_embeddings = [np.array(item.embedding) for item in response.data]
            embeddings.extend(batch_embeddings)
        return embeddings

    def embed_text(self, text: str) -> np.ndarray:
        """
        Generate embedding for a single text

        Args:
            text: Text to embed

        Returns:
            Embedding array
        """
        return self.embed_texts([text])[0]

    @property
    def dimension(self) -> int:
        """Get the dimension of the embeddings"""
        # Embedding dimensions may vary; we'll determine it dynamically
        test_embed = self.embed_text("test")
        return len(test_embed)
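
The batching loop in embed_texts above can be factored into a small generator, which keeps the slicing logic testable without a running API. This is a sketch under that assumption; `batched` is a hypothetical helper, not part of the PR.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most max_batch_size items."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start : start + max_batch_size])


batches = list(batched(["a", "b", "c", "d", "e"], max_batch_size=2))
```

Each yielded batch would then be passed to a single embeddings request, with the results concatenated in order as the loop in the diff does.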