-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Omni-Modal Embedding, Vector Index and Retriever #13551
base: main
Are you sure you want to change the base?
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
embeddings (List[List[float]]): List of embeddings. | ||
|
||
""" | ||
|
||
chunks: List[str] | ||
chunks: Sequence[object] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is required to satisfy the type checker when logging chunks for non-text data. Don't think this would break anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that logging the data as objects directly can be very slow depending on the modality. I'm reverting this back to List[str]
and will instead stringify the data objects before logging them.
Edit: Seems that it's mostly slow because Pydantic is validating the embedding list. Still, using str(data)
would allow a more user-friendly display than just storing the object directly.
from llama_index.core.schema import NodeWithScore | ||
|
||
|
||
class OmniModalQueryEngine(BaseQueryEngine, Generic[KD, KQ]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly a copy of MultiModalQueryEngine
. Once we have OmniModalLLM
, we can generalize this class to other modalities as well.
|
||
return await super()._aretrieve_from_object(obj, query_bundle, score) | ||
|
||
def _handle_recursive_retrieval( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike in MultiModalVectorIndexRetriever
, composite nodes are nominally supported for non-text modalities, but this feature has yet to be tested.
These are some pretty core changes. Thanks for taking a stab at this, we will have to spend some time digging into the structure here and ensure it fits with existing multimodal plans |
callback_manager = callback_manager_from_settings_or_context(Settings, None) | ||
|
||
# Distinguish from the case where an empty sequence is provided. | ||
if transformations is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also apply this change to the BaseIndex
? Imo it's unexpected behaviour that passing transformations=[]
fails to actually override the default settings.
@@ -48,7 +48,7 @@ def __init__( | |||
index_struct: Optional[IS] = None, | |||
storage_context: Optional[StorageContext] = None, | |||
callback_manager: Optional[CallbackManager] = None, | |||
transformations: Optional[List[TransformComponent]] = None, | |||
transformations: Optional[Sequence[TransformComponent]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BaseIndex
does not require that transformations
specifically be a list. Existing subclasses that assume that it is a list should remain unaffected (in terms of type safety) as long as they specify transformations
as a list in the subclass's initializer.
@DarkLight1337 ok, finally had some time to look at this. In general, this is a lot of work, and I'm definitely thankful for this! However, for multi-modalities, we are really wanting to make this more core to the library (i.e. everything is multimodal by default, rather than having specific subclasses) For example
Here's some of our current planning
|
Thanks for the detailed roadmap! I totally agree that this should be built into the core functionality; I used separate subclasses here to avoid editing the core code without first pinning down the details. This PR mainly works on items 3 and 6:
Regarding item 5, the current node architecture requires us to create a bunch of subclasses just to support each modality combination (for example, by having a mixin for each modality and then mix-and-matching them). This is clearly not very scalable. we can leverage the existing node composition logic to create a root note that contains multiple modalities (by having child nodes, each of a single modality). Based on this, we can add a modality parameter to Side note, I think we should open a new issue for this multi-modality refactoring roadmap so that it can get more visibility. |
cls, | ||
path: str, | ||
*, | ||
query_loader: Callable[[Dict[str, Any]], QueryBundle] = QueryBundle.from_dict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could we deserialize multi-modal queries and documents (and any class that contains them, like this OmniModalEmbeddingQAFinetuneDataset
) without requiring the user to pass in the concrete classes to load?
0240f39
to
5bb6331
Compare
Looks like we can't have any tests for multi-modal retriever since that requires CLIP which is not installed in the CI environment. |
Description
This PR lays the groundwork for extending the multi-modal support to other modalities (such as audio). The main components of this PR are:
Modality
: Encapsulates the embedding-agnostic information about each modalityBaseNode
andQueryBundle
s that belong to that modality.OmniModalEmbedding
: Base class for Embedding component that supports any modality, not just text and image.document_modalities
andquery_modalities
, and implement_get_embedding
(and related methods) for those modalities accordingly.OmniModalEmbeddingBundle
: Composite ofOmniModalEmbedding
where multiple embedding models can be combined together.OmniModalVectorStoreIndex
: Index component that stores documents usingOmniModalEmbeddingBundle
. It is meant to be a drop-in replacement forMultiModalVectorStoreIndex
.BaseNode
s. The modality is inferred automatically based on the class type.OmniModalVectorStoreIndex.load_from_storage
instead ofllama_index.core.load_index_from_storage
, since we do not serialize the details of each modality.OmniModalVectorIndexRetriever
: Retriever component that queries documents usingOmniModalEmbeddingBundle
.QueryBundle
. (May be changed in the future to automatically detect the modality in a manner similar to the case for document nodes)As you may have guessed, I took inspiration from the recently released
GPT-4o
(where "o" stands for "omni") when naming these components, to distinguish from the existingMultiModal*
components for text-image retrieval. I am open to other naming suggestions.Type of Change
Please delete options that are not relevant.
This change intentionally leaves the existing code untouched at the expense of some code duplication. Future PRs may work on the following:
BaseEmbedding
class withOmniModalEmbedding
(since it's more general).Modality
into existingBaseNode
andQueryBundle
classes. That way, we can replaceModality
with a string key which can be serialized and deserialized easily.How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
I have added basic unit tests for the internals of
OmniModalEmbedding
andOmniModalEmbeddingBundle
.It appears that the original multi-modal index (#8709) and retriever (#8787) don't have any unit tests. I am not sure what would be the best approach for testing their functionality. Perhaps @hatianzhang would have some ideas?
To demonstrate the compatibility between
OmniModalVectorStoreIndex
andMultiModalVectorStoreIndex
, I have createdomni_modal_retrieval.ipynb
which is basically the same asmulti_modal_retrieval.ipynb
except thatMultiModal*
components are replaced withOmniModal*
ones.Future PRs can work on adding new modality types. In particular, audio and video support would complement GPT-4o well (unfortunately we probably can't use GPT-4o directly to generate embeddings).
Suggested Checklist:
make format; make lint
to appease the lint godsI will update the code documentation once the details are finalized.