[Core] Omni-Modal Embedding, Vector Index and Retriever #13551
Conversation
embeddings (List[List[float]]): List of embeddings.
"""

- chunks: List[str]
+ chunks: Sequence[object]
This change is required to satisfy the type checker when logging chunks for non-text data. Don't think this would break anything.
I found that logging the data as objects directly can be very slow depending on the modality. I'm reverting this back to `List[str]` and will instead stringify the data objects before logging them.

Edit: Seems that it's mostly slow because Pydantic is validating the embedding list. Still, using `str(data)` would allow a more user-friendly display than just storing the object directly.
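For illustration, a minimal sketch of the stringify-before-logging idea; `to_loggable_chunks` is a hypothetical helper, not the PR's actual code:

```python
from typing import Any, List, Sequence


def to_loggable_chunks(data: Sequence[Any]) -> List[str]:
    """Convert arbitrary data objects into strings for the chunks payload."""
    # Text passes through unchanged; other modalities (images, audio, ...) get a
    # short human-readable repr instead of the raw object, which keeps the logged
    # payload cheap to validate and easy to display.
    return [item if isinstance(item, str) else str(item) for item in data]
```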
from llama_index.core.schema import NodeWithScore


class OmniModalQueryEngine(BaseQueryEngine, Generic[KD, KQ]):
Mostly a copy of `MultiModalQueryEngine`. Once we have `OmniModalLLM`, we can generalize this class to other modalities as well.
return await super()._aretrieve_from_object(obj, query_bundle, score)

def _handle_recursive_retrieval(
Unlike in `MultiModalVectorIndexRetriever`, composite nodes are nominally supported for non-text modalities, but this feature has yet to be tested.
These are some pretty core changes. Thanks for taking a stab at this; we will have to spend some time digging into the structure here and ensuring it fits with existing multimodal plans.
callback_manager = callback_manager_from_settings_or_context(Settings, None)

# Distinguish from the case where an empty sequence is provided.
if transformations is None:
Should we also apply this change to `BaseIndex`? IMO it's unexpected behaviour that passing `transformations=[]` fails to actually override the default settings.
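To illustrate the distinction being discussed, here is a minimal sketch (hypothetical helper, not llama_index's actual code): a truthiness check like `transformations or default` silently treats `[]` the same as `None`, whereas an explicit `is None` check lets an empty sequence override the defaults.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def resolve_transformations(
    transformations: Optional[Sequence[T]],
    default: Sequence[T],
) -> Sequence[T]:
    # `None` means "not provided", so fall back to the default settings.
    if transformations is None:
        return default
    # An explicitly passed empty sequence means "apply no transformations".
    return transformations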
@@ -48,7 +48,7 @@ def __init__(
    index_struct: Optional[IS] = None,
    storage_context: Optional[StorageContext] = None,
    callback_manager: Optional[CallbackManager] = None,
-   transformations: Optional[List[TransformComponent]] = None,
+   transformations: Optional[Sequence[TransformComponent]] = None,
`BaseIndex` does not require that `transformations` specifically be a list. Existing subclasses that assume that it is a list should remain unaffected (in terms of type safety) as long as they specify `transformations` as a list in the subclass's initializer.
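A small illustration of that point (the class names here are made up, not llama_index's): the base accepts any `Sequence`, while a subclass that relies on list-only behaviour can keep annotating its own initializer with `List`.

```python
from typing import List, Optional, Sequence


class FakeBaseIndex:
    def __init__(self, transformations: Optional[Sequence[str]] = None) -> None:
        self._transformations = transformations


class FakeListIndex(FakeBaseIndex):
    def __init__(self, transformations: Optional[List[str]] = None) -> None:
        if transformations is not None:
            # .append() is fine here: this subclass's own annotation guarantees
            # that callers pass a concrete list.
            transformations.append("subclass-specific-step")
        super().__init__(transformations)
```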
@DarkLight1337 ok, finally had some time to look at this. In general, this is a lot of work, and I'm definitely thankful for this! However, for multi-modalities, we really want to make this more core to the library (i.e. everything is multimodal by default, rather than having specific subclasses). For example:

Here's some of our current planning:
Thanks for the detailed roadmap! I totally agree that this should be built into the core functionality; I used separate subclasses here to avoid editing the core code without first pinning down the details. This PR mainly works on items 3 and 6:
Regarding item 5, the current node architecture requires us to create a bunch of subclasses just to support each modality combination (for example, by having a mixin for each modality and then mix-and-matching them). This is clearly not very scalable. Instead, we can leverage the existing node composition logic to create a root node that contains multiple modalities (by having child nodes, each of a single modality), as sketched below. Based on this, we can add a modality parameter to [...].

Side note: I think we should open a new issue for this multi-modality refactoring roadmap so that it can get more visibility.
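A rough sketch of that composition idea using node APIs that already exist in `llama_index.core.schema`; using a `TextNode` as the root is just a placeholder, and a dedicated composite node type could look different:

```python
from llama_index.core.schema import (
    ImageNode,
    NodeRelationship,
    RelatedNodeInfo,
    TextNode,
)

# One child node per modality.
text_child = TextNode(text="A golden retriever playing fetch in a park.")
image_child = ImageNode(image_path="golden_retriever.jpg")

# A root node that ties the modalities together via child relationships.
root = TextNode(text="")  # placeholder root; a dedicated composite type may replace this
root.relationships[NodeRelationship.CHILD] = [
    RelatedNodeInfo(node_id=text_child.node_id),
    RelatedNodeInfo(node_id=image_child.node_id),
]
for child in (text_child, image_child):
    child.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=root.node_id)
```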
cls,
path: str,
*,
query_loader: Callable[[Dict[str, Any]], QueryBundle] = QueryBundle.from_dict,
How could we deserialize multi-modal queries and documents (and any class that contains them, like this `OmniModalEmbeddingQAFinetuneDataset`) without requiring the user to pass in the concrete classes to load?
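For context, a minimal sketch of the loader-callable pattern shown in the diff above; `load_queries` and the JSON layout are hypothetical stand-ins for the dataset's loading method. The open question is whether this user-supplied callable can be avoided entirely for non-text query types.

```python
import json
from typing import Any, Callable, Dict, List

from llama_index.core.schema import QueryBundle


def load_queries(
    path: str,
    *,
    query_loader: Callable[[Dict[str, Any]], QueryBundle] = QueryBundle.from_dict,
) -> List[QueryBundle]:
    with open(path) as f:
        raw = json.load(f)
    # The caller decides how each serialized dict becomes a concrete query object;
    # an image-query loader, for instance, could rebuild image paths here.
    return [query_loader(item) for item in raw["queries"]]
```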
Looks like we can't have any tests for the multi-modal retriever, since that requires CLIP, which is not installed in the CI environment.
Description
This PR lays the groundwork for extending multi-modal support to other modalities (such as audio). The main components of this PR are listed below, followed by a short usage sketch:

- `Modality`: Encapsulates the embedding-agnostic information about each modality, i.e. the `BaseNode` and `QueryBundle`s that belong to that modality.
- `OmniModalEmbedding`: Base class for the embedding component that supports any modality, not just text and image. Subclasses declare their `document_modalities` and `query_modalities`, and implement `_get_embedding` (and related methods) for those modalities accordingly.
- `OmniModalEmbeddingBundle`: Composite of `OmniModalEmbedding` where multiple embedding models can be combined together.
- `OmniModalVectorStoreIndex`: Index component that stores documents using `OmniModalEmbeddingBundle`. It is meant to be a drop-in replacement for `MultiModalVectorStoreIndex`.
  - Documents are provided as `BaseNode`s; the modality is inferred automatically based on the class type.
  - To load a persisted index, use `OmniModalVectorStoreIndex.load_from_storage` instead of `llama_index.core.load_index_from_storage`, since we do not serialize the details of each modality.
- `OmniModalVectorIndexRetriever`: Retriever component that queries documents using `OmniModalEmbeddingBundle`.
  - Queries are provided as a `QueryBundle`. (This may be changed in the future to automatically detect the modality in a manner similar to the case for document nodes.)
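For orientation, here is a hedged usage sketch based on the drop-in-replacement claim above. The `MultiModal*` flow is the existing one from `multi_modal_retrieval.ipynb`-style examples; the `OmniModal*` import path and the exact call pattern are assumptions about this PR's API, not confirmed signatures.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex  # existing component
from llama_index.core.schema import QueryBundle

# Hypothetical import for the replacement introduced in this PR:
# from llama_index.core.indices.omni_modal import OmniModalVectorStoreIndex

# A folder containing both text files and images (placeholder path).
documents = SimpleDirectoryReader("./data_with_text_and_images").load_data()

# Existing text+image flow.
index = MultiModalVectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve(QueryBundle(query_str="What animals appear in the images?"))

# With this PR, the same flow should work by swapping in OmniModalVectorStoreIndex:
# node modalities are inferred from their class, and persisted indexes are reloaded
# via OmniModalVectorStoreIndex.load_from_storage rather than load_index_from_storage.
```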
As you may have guessed, I took inspiration from the recently released GPT-4o (where "o" stands for "omni") when naming these components, to distinguish them from the existing `MultiModal*` components for text-image retrieval. I am open to other naming suggestions.

Type of Change
Please delete options that are not relevant.
This change intentionally leaves the existing code untouched at the expense of some code duplication. Future PRs may work on the following:

- Replacing the existing `BaseEmbedding` class with `OmniModalEmbedding` (since it's more general).
- Integrating `Modality` into the existing `BaseNode` and `QueryBundle` classes. That way, we can replace `Modality` with a string key which can be serialized and deserialized easily.

How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
I have added basic unit tests for the internals of `OmniModalEmbedding` and `OmniModalEmbeddingBundle`.

It appears that the original multi-modal index (#8709) and retriever (#8787) don't have any unit tests. I am not sure what would be the best approach for testing their functionality. Perhaps @hatianzhang would have some ideas?
To demonstrate the compatibility between `OmniModalVectorStoreIndex` and `MultiModalVectorStoreIndex`, I have created `omni_modal_retrieval.ipynb`, which is basically the same as `multi_modal_retrieval.ipynb` except that the `MultiModal*` components are replaced with `OmniModal*` ones.

Future PRs can work on adding new modality types. In particular, audio and video support would complement GPT-4o well (unfortunately, we probably can't use GPT-4o directly to generate embeddings).
Suggested Checklist:

- I ran `make format; make lint` to appease the lint gods.

I will update the code documentation once the details are finalized.