
Knowledge #1567

Merged
merged 49 commits into from
Nov 20, 2024
Conversation

bhancockio
Collaborator

No description provided.

@lorenzejay lorenzejay self-requested a review November 14, 2024 20:22
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource


class Knowledge(BaseModel):
Let's write some docs about:

  1. how to use this
  2. setting your own custom embedder for this
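A docs example for the custom-embedder point could look roughly like this. This is a minimal self-contained sketch: `BaseEmbedder` here is a stand-in for the interface discussed in this PR, and the hashing "embedder" is a toy placeholder for a real embedding model, used only so the example runs without external services.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import List


class BaseEmbedder(ABC):
    """Stand-in for the embedder interface discussed in this PR (sketch)."""

    @abstractmethod
    def embed_chunks(self, chunks: List[str]) -> List[List[float]]:
        """Return one embedding vector per chunk."""


class ToyHashEmbedder(BaseEmbedder):
    """A deterministic toy embedder: NOT semantic, purely for illustration."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed_chunks(self, chunks: List[str]) -> List[List[float]]:
        vectors = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk.encode()).digest()
            # Map the first `dim` digest bytes to floats in [0, 1).
            vectors.append([b / 256 for b in digest[: self.dim]])
        return vectors


embedder = ToyHashEmbedder(dim=8)
vectors = embedder.embed_chunks(["hello", "world"])
```

The point of the docs would be that any object satisfying the `embed_chunks` contract can be plugged in, whether it wraps OpenAI, Ollama, or something custom.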

@@ -0,0 +1,82 @@
import os

I'd drop the Ollama version. Support OpenAI, then let anyone bring their own embedder function (super easy), then have the knowledge_config setup mirror the embedder_config setup for our RAG storage.

Comment on lines 25 to 31
@abstractmethod
def add(self, embedder: BaseEmbedder) -> None:
"""Process content, chunk it, compute embeddings, and save them."""
pass

def get_embeddings(self) -> List[np.ndarray]:
"""Return the list of embeddings for the chunks."""

Let's make this save to the project directory instead of the root.

# Compute embeddings for the new chunks
new_embeddings = embedder.embed_chunks(new_chunks)
# Save the embeddings
self.chunk_embeddings.extend(new_embeddings)

We should also be saving this to a DB and persisting it, like our RAG storage.


We should do this: we can generate the embeddings for files once, then just query whether they already exist. Otherwise, users will spend tokens generating embeddings that already exist.
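The generate-once idea can be sketched as a content-hash cache in front of the embedder. Everything here (the cache shape, the fake embedder) is hypothetical scaffolding to show the control flow, not the PR's implementation:

```python
import hashlib
from typing import Callable, Dict, List


def embed_with_cache(
    chunks: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    cache: Dict[str, List[float]],
) -> List[List[float]]:
    """Embed only chunks whose content hash is not already cached."""
    key = lambda c: hashlib.sha256(c.encode()).hexdigest()
    missing = [c for c in chunks if key(c) not in cache]
    if missing:
        # Only the uncached chunks hit the (token-costing) embedder.
        for chunk, vec in zip(missing, embed_fn(missing)):
            cache[key(chunk)] = vec
    return [cache[key(c)] for c in chunks]


calls = []

def fake_embedder(batch: List[str]) -> List[List[float]]:
    calls.append(len(batch))  # track how many chunks hit the "API"
    return [[float(len(c))] for c in batch]

cache: Dict[str, List[float]] = {}
embed_with_cache(["a", "bb"], fake_embedder, cache)
embed_with_cache(["a", "bb", "ccc"], fake_embedder, cache)  # only "ccc" is new
```

In a real setup the cache would live in the persisted storage rather than an in-memory dict, so restarts also skip re-embedding.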


I'll help with this.

from pydantic import BaseModel, ConfigDict, Field

from crewai.knowledge.embedder.base_embedder import BaseEmbedder


Extending this to save embeddings to the DB, then using the Knowledge class to query from there.
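The save-then-query flow could be pictured with this self-contained sketch. `InMemoryKnowledgeStorage` is a toy stand-in, not the PR's KnowledgeStorage; a real version would back this with a vector DB:

```python
import math
from typing import Dict, List, Tuple


class InMemoryKnowledgeStorage:
    """Toy stand-in for a persistent vector store (sketch only)."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, List[float], Dict[str, str]]] = []

    def save(self, chunk: str, embedding: List[float], metadata: Dict[str, str]) -> None:
        self._records.append((chunk, embedding, metadata))

    def query(self, embedding: List[float], top_k: int = 1) -> List[str]:
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        # Rank stored chunks by cosine similarity to the query embedding.
        ranked = sorted(self._records, key=lambda r: cosine(r[1], embedding), reverse=True)
        return [chunk for chunk, _, _ in ranked[:top_k]]


storage = InMemoryKnowledgeStorage()
storage.save("cats purr", [1.0, 0.0], {"source": "demo"})
storage.save("dogs bark", [0.0, 1.0], {"source": "demo"})
```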

@@ -85,6 +88,10 @@ class Agent(BaseAgent):
llm: Union[str, InstanceOf[LLM], Any] = Field(
description="Language model that will run the agent.", default=None
)
knowledge_sources: Optional[List[BaseKnowledgeSource]] = Field(
default=None,
description="Knowledge sources for the agent.",

It would be better to declare this on the Crew class: the task prompt can query from the relevant knowledge, trickling down to the agent level, rather than defining it here on the agent level.

@lorenzejay

Tests passing locally:

[Screenshot: passing test run, 2024-11-19]

@pythonbyte left a comment
Hey @lorenzejay, I left some comments; let me know if you have any questions on those 😄

)

# Create a knowledge store with a list of sources and metadata
knowledge = Knowledge(sources=[string_source], metadata={"preference": "personal"})

Question: Should knowledge be used explicitly in some of the code, or does it work behind the curtains?


Crew(knowledge_sources=knowledge, ...)?


```python
from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource
```

Question: I saw that StringKnowledgeSource was imported but not used here. Is that the intention?


You're right, that was not intended.

@@ -26,7 +26,7 @@ jobs:
run: uv python install 3.11.9

- name: Install the project
run: uv sync --dev
run: uv sync --dev --all-extras

Question: Do you think we need the --all-extras option in this case? It seems like we'll have to install all the optional dependencies to be able to run our tests. What do you think?


Yes, there are a bunch of optional dependencies that were brought in, like pdfplumber for our PDFKnowledgeSource.

@@ -0,0 +1,30 @@
from abc import ABC, abstractmethod

Question: I imagine that the path of this file is not correct.
path/to/src/crewai/knowledge/


Looks right? The abstract class could live inside the source dir.

src/crewai/knowledge/knowledge.py (outdated, resolved)

Question: Are you planning to upload this .pdf to the repository?


Yes, as we are using it for the PDFKnowledgeSource.

storage: KnowledgeStorage = Field(default_factory=KnowledgeStorage)
metadata: Dict[str, Any] = Field(default_factory=dict)

def model_post_init(self, context):

Question: I noticed that context is declared as a parameter but not used in the method; is that expected?


Resolved: this is a Pydantic-specific feature: https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_post_init

So I used _ instead.


def save_documents(self, metadata: Dict[str, Any]):
"""Save the documents to the storage."""
chunk_metadatas = [metadata.copy() for _ in self.chunks]

Question: Could you help me understand what this line of code does? Thanks!


Before adding documents to the vector DB, each one needs a metadata object. The metadata can be the same for every chunk, as it is a representation of the source itself.

When something big (say, a large PDF) gets chunked, it produces some amount of content, so the defined metadata should be cloned and passed along with each chunk.

Overall, we need one metadata dict per chunk to go with it.
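The clone-per-chunk behavior can be seen in a tiny standalone example (using a plain dict for metadata, as in the quoted snippet):

```python
chunks = ["chunk one", "chunk two", "chunk three"]
metadata = {"preference": "personal"}

# One independent metadata dict per chunk, mirroring the quoted line.
chunk_metadatas = [metadata.copy() for _ in chunks]

# Tagging one chunk's metadata does not leak into the others,
# which is why copies are made instead of reusing one dict.
chunk_metadatas[0]["chunk_index"] = 0
```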

@pythonbyte left a comment
LGTM ✅
Amazing work!

@lorenzejay lorenzejay merged commit 14a36d3 into main Nov 20, 2024
4 checks passed