Knowledge #1567
Conversation
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource


class Knowledge(BaseModel):
Let's write some docs about:
- how to use this
- setting your own custom embedder for this
@@ -0,0 +1,82 @@
import os
I'd drop the Ollama version: support OpenAI, then let anyone bring their own embedder function (super easy), and have the knowledge_config set up like the embedder_config setup for our RAG storage.
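A minimal sketch of what "bring your own embedder function" could look like; the exact hook crewai would expect for a custom embedder is not settled in this PR, so treat the function signature below as an assumption:

```python
from typing import List

import numpy as np


def my_embedder(chunks: List[str]) -> List[np.ndarray]:
    """Toy bring-your-own embedder: one fixed-size vector per chunk.

    A real implementation would call OpenAI (or any other backend) here;
    this deterministic stand-in only shows the expected shape of the function.
    """
    return [
        np.frombuffer(chunk.encode("utf-8").ljust(64, b"\0")[:64], dtype=np.uint8).astype(np.float32)
        for chunk in chunks
    ]
```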
@abstractmethod
def add(self, embedder: BaseEmbedder) -> None:
    """Process content, chunk it, compute embeddings, and save them."""
    pass


def get_embeddings(self) -> List[np.ndarray]:
    """Return the list of embeddings for the chunks."""
Let's make this save to the project directory instead of the root.
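One possible shape for that, assuming a project-local storage directory; the `.crewai/knowledge` name here is purely hypothetical:

```python
from pathlib import Path


def knowledge_storage_dir(project_root: Path | None = None) -> Path:
    """Resolve a knowledge storage path under the project instead of the user's root/home.

    The ".crewai/knowledge" directory name is only an example, not the PR's actual layout.
    """
    root = project_root or Path.cwd()
    path = root / ".crewai" / "knowledge"
    path.mkdir(parents=True, exist_ok=True)
    return path
```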
# Compute embeddings for the new chunks
new_embeddings = embedder.embed_chunks(new_chunks)
# Save the embeddings
self.chunk_embeddings.extend(new_embeddings)
We should also be saving this to a DB and persisting it, like our RAG storage.
We should do this so we can generate the embeddings for files once, then just query whether they already exist. Otherwise, users will spend tokens generating embeddings that already exist.
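A rough sketch of that dedup idea; `storage.exists` and `storage.save` are hypothetical stand-ins for whatever persistence layer lands in this PR, while `embed_chunks` comes from the BaseEmbedder interface shown above:

```python
import hashlib
from typing import List


def embed_only_new_chunks(chunks: List[str], embedder, storage) -> None:
    """Embed only chunks whose content hash is not already persisted.

    `storage.exists` / `storage.save` are hypothetical helpers; the point is that
    a content hash lets us skip paying for embeddings we already have.
    """
    new_chunks = [
        chunk
        for chunk in chunks
        if not storage.exists(hashlib.sha256(chunk.encode("utf-8")).hexdigest())
    ]
    if new_chunks:
        embeddings = embedder.embed_chunks(new_chunks)
        storage.save(new_chunks, embeddings)
```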
I'll help with this.
from pydantic import BaseModel, ConfigDict, Field

from crewai.knowledge.embedder.base_embedder import BaseEmbedder
Extending this to save embeddings to a DB, then using the Knowledge class to query from there.
src/crewai/agent.py
Outdated
@@ -85,6 +88,10 @@ class Agent(BaseAgent):
    llm: Union[str, InstanceOf[LLM], Any] = Field(
        description="Language model that will run the agent.", default=None
    )
    knowledge_sources: Optional[List[BaseKnowledgeSource]] = Field(
        default=None,
        description="Knowledge sources for the agent.",
It would be better to declare this on the Crew class: the task prompt will query from the relevant knowledge, trickling down to the agent level, rather than defining it here at the agent level.
Hey @lorenzejay, left some comments; let me know if you have any questions on those 😄
docs/concepts/knowledge.mdx
Outdated
)

# Create a knowledge store with a list of sources and metadata
knowledge = Knowledge(sources=[string_source], metadata={"preference": "personal"})
Question: Should knowledge be used explicitly in some of the code, or does it work behind the curtains?
Crew(knowledge_sources = knowledge, ...)?
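For context, a minimal sketch of what that crew-level wiring could look like; whether the parameter takes raw sources or a Knowledge object is exactly what's being discussed here, so treat the `knowledge_sources` argument as an assumption:

```python
from crewai import Agent, Crew, Task
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

# A tiny text-based knowledge source, mirroring the docs snippet above.
string_source = StringKnowledgeSource(
    content="John lives in San Francisco and prefers short answers."
)

agent = Agent(
    role="Personal assistant",
    goal="Answer questions about the user",
    backstory="You know the user's stored preferences.",
)
task = Task(
    description="Where does John live?",
    expected_output="The city John lives in.",
    agent=agent,
)

# Hypothetical crew-level declaration, per the suggestion above.
crew = Crew(agents=[agent], tasks=[task], knowledge_sources=[string_source])
```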
docs/concepts/knowledge.mdx
Outdated
```python
from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource
```
Question: I saw that StringKnowledgeSource was imported but not used here. Is that the intention?
You're right, that was not intended.
@@ -26,7 +26,7 @@ jobs:
        run: uv python install 3.11.9

      - name: Install the project
-       run: uv sync --dev
+       run: uv sync --dev --all-extras
Question: Do you think we need the --all-extras option in this case? It seems like we'll have to install all the optional dependencies to be able to run our tests. What do you think?
Yes, there are a bunch of optional deps that were brought in, like pdfplumber for our PDFKnowledgeSource.
@@ -0,0 +1,30 @@
from abc import ABC, abstractmethod
Question: I imagine that the path of this file is not correct: path/to/src/crewai/knowledge/
Looks right? The abstract class could be inside the source dir.
Question: Are you planning to upload this .pdf to the repository?
Yes, as we are using it for the PDFKnowledgeSource.
storage: KnowledgeStorage = Field(default_factory=KnowledgeStorage)
metadata: Dict[str, Any] = Field(default_factory=dict)

def model_post_init(self, context):
Question: I noticed that context is declared as a parameter but not used in the method. Is that expected?
Resolved. This was a Pydantic-specific feature (https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_post_init), so I named the parameter `_` instead.
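A minimal illustration of that hook: Pydantic v2 calls model_post_init after validation and passes a context argument that is often unused, hence the `_`. The simplified `Knowledge` model below is just a stand-in for the real class:

```python
from pydantic import BaseModel


class Knowledge(BaseModel):
    sources: list = []
    chunks: list = []

    def model_post_init(self, _):
        # Pydantic v2 invokes this hook after model validation; the second
        # argument is the validation context, unused here, so it is named `_`.
        self.chunks = list(self.sources)


Knowledge(sources=["some source"])  # model_post_init runs automatically
```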
def save_documents(self, metadata: Dict[str, Any]):
    """Save the documents to the storage."""
    chunk_metadatas = [metadata.copy() for _ in self.chunks]
Question: Could you help me understand what this line of code does? Thanks!
Before adding documents to the vector DB, each one needs a metadata object. The metadata can be the same for every chunk, since each chunk is just a piece of the same source. When a source gets chunked (let's say a big PDF), it yields some number of pieces of content, so the defined metadata should be cloned and passed along with each piece. Overall, we need one metadata dict to go with every chunk; see the sketch below.
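A small illustration of that cloning step; the `collection.add` call in the comment is just the typical shape of a vector-store API (e.g. Chroma), not a claim about this PR's storage layer:

```python
# Pieces of one big source (say, a chunked PDF).
chunks = ["chunk 1 text", "chunk 2 text", "chunk 3 text"]
metadata = {"preference": "personal"}

# One independent metadata dict per chunk; .copy() keeps them from sharing state.
chunk_metadatas = [metadata.copy() for _ in chunks]

# Vector stores typically take parallel lists, roughly:
#   collection.add(documents=chunks, metadatas=chunk_metadatas, ids=[...])
assert len(chunk_metadatas) == len(chunks)
```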
LGTM ✅
Amazing Work