
[Core] Omni-Modal Embedding, Vector Index and Retriever #13551

Open · DarkLight1337 wants to merge 42 commits into main from the omni-modal branch.

Conversation

@DarkLight1337 (Contributor) commented on May 17, 2024

Description

This PR lays the groundwork for extending the multi-modal support to other modalities (such as audio). The main components of this PR are listed below, with a rough usage sketch after the list:

  • Modality: Encapsulates the embedding-agnostic information about each modality
    • e.g., how to read BaseNodes and QueryBundles that belong to that modality.
  • OmniModalEmbedding: Base class for Embedding component that supports any modality, not just text and image.
    • Concrete subclasses should define the supported document_modalities and query_modalities, and implement _get_embedding (and related methods) for those modalities accordingly.
  • OmniModalEmbeddingBundle: Composite of OmniModalEmbedding where multiple embedding models can be combined together.
    • To avoid ambiguity, only one model per document modality is allowed.
    • There can be multiple models per query modality (each covering a different document modality).
  • OmniModalVectorStoreIndex: Index component that stores documents using OmniModalEmbeddingBundle. It is meant to be a drop-in replacement for MultiModalVectorStoreIndex.
    • There is no need to specify the document modality when storing BaseNodes. The modality is inferred automatically based on the class type.
    • Note: To load a persisted index, use OmniModalVectorStoreIndex.load_from_storage instead of llama_index.core.load_index_from_storage, since we do not serialize the details of each modality.
  • OmniModalVectorIndexRetriever: Retriever component that queries documents using OmniModalEmbeddingBundle.
    • You can set top-k for each document modality.
    • You must specify the query modality when passing in the QueryBundle. (This may change in the future so that the modality is detected automatically, similar to how document nodes are handled.)
    • You can specify one or more document modalities to retrieve from, which may be different from the query modality.
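
To make the pieces above concrete, here is a rough usage sketch. The import paths, the bundle constructor, and several keyword arguments (e.g. image_similarity_top_k and query_modality=) are assumptions inferred from the descriptions above rather than the final API of this PR.

```python
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.schema import QueryBundle

# Imports for OmniModalEmbeddingBundle / OmniModalVectorStoreIndex are omitted
# because their module paths depend on where this PR places them.
# text_embedding and clip_embedding are assumed to be instances of concrete
# OmniModalEmbedding subclasses for the "text" and "image" modalities.
embed_model = OmniModalEmbeddingBundle([text_embedding, clip_embedding])  # hypothetical constructor

# Document modality is inferred from each node's class, so text and image
# documents can be indexed together without extra bookkeeping.
documents = SimpleDirectoryReader("./mixed_data/").load_data()
index = OmniModalVectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
    storage_context=StorageContext.from_defaults(),
)

# The query modality is given explicitly; top-k can be set per document
# modality, and the retrieved document modalities may differ from the
# query modality.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=2)
nodes = retriever.retrieve(
    QueryBundle(query_str="What does the chart show?"),
    query_modality="text",  # hypothetical keyword argument
)
```

A persisted index would then be reloaded with OmniModalVectorStoreIndex.load_from_storage rather than load_index_from_storage, per the note above.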

As you may have guessed, I took inspiration from the recently released GPT-4o (where "o" stands for "omni") when naming these components, to distinguish from the existing MultiModal* components for text-image retrieval. I am open to other naming suggestions.

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

This change intentionally leaves the existing code untouched at the expense of some code duplication. Future PRs may work on the following:

  • Replace the existing BaseEmbedding class with OmniModalEmbedding (since it's more general).
  • Integrate the function of extracting document/query data based on Modality into existing BaseNode and QueryBundle classes. That way, we can replace Modality with a string key which can be serialized and deserialized easily.
  • Some existing enums/constants need to be refactored to enable downstream developers to define their own modalities.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests

I have added basic unit tests for the internals of OmniModalEmbedding and OmniModalEmbeddingBundle.

It appears that the original multi-modal index (#8709) and retriever (#8787) don't have any unit tests. I am not sure what would be the best approach for testing their functionality. Perhaps @hatianzhang would have some ideas?

  • Added new notebook (that tests end-to-end)

To demonstrate the compatibility between OmniModalVectorStoreIndex and MultiModalVectorStoreIndex, I have created omni_modal_retrieval.ipynb which is basically the same as multi_modal_retrieval.ipynb except that MultiModal* components are replaced with OmniModal* ones.

Future PRs can work on adding new modality types. In particular, audio and video support would complement GPT-4o well (unfortunately we probably can't use GPT-4o directly to generate embeddings).

  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

I will update the code documentation once the details are finalized.

@dosubot (bot) added the label size:XXL (This PR changes 1000+ lines, ignoring generated files) on May 17, 2024

```diff
     embeddings (List[List[float]]): List of embeddings.

     """

-    chunks: List[str]
+    chunks: Sequence[object]
```
DarkLight1337 (Contributor, Author):

This change is required to satisfy the type checker when logging chunks for non-text data. Don't think this would break anything.

@DarkLight1337 (Contributor, Author) commented on May 27, 2024:

I found that logging the data as objects directly can be very slow depending on the modality. I'm reverting this back to List[str] and will instead stringify the data objects before logging them.

Edit: It seems this is mostly slow because Pydantic is validating the embedding list. Still, using str(data) allows a more user-friendly display than just storing the object directly.
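
A minimal sketch of the stringify-before-logging idea described here (illustrative only, not the actual diff): the payload keeps chunks as List[str], and non-text data objects are converted up front.

```python
from typing import List, Sequence


def to_log_chunks(data_items: Sequence[object]) -> List[str]:
    # str(data) keeps the logged payload as List[str] (satisfying the type
    # checker and avoiding Pydantic validation of large raw objects) while
    # still giving a readable representation of non-text data.
    return [str(data) for data in data_items]
```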

```python
from llama_index.core.schema import NodeWithScore


class OmniModalQueryEngine(BaseQueryEngine, Generic[KD, KQ]):
```
DarkLight1337 (Contributor, Author):

Mostly a copy of MultiModalQueryEngine. Once we have OmniModalLLM, we can generalize this class to other modalities as well.


```python
        return await super()._aretrieve_from_object(obj, query_bundle, score)

    def _handle_recursive_retrieval(
```
DarkLight1337 (Contributor, Author):

Unlike in MultiModalVectorIndexRetriever, composite nodes are nominally supported for non-text modalities, but this feature has yet to be tested.

@logan-markewich (Collaborator) commented:

These are some pretty core changes. Thanks for taking a stab at this; we will have to spend some time digging into the structure here and ensuring it fits with existing multimodal plans.

```python
        callback_manager = callback_manager_from_settings_or_context(Settings, None)

        # Distinguish from the case where an empty sequence is provided.
        if transformations is None:
```
@DarkLight1337 (Contributor, Author) commented on May 27, 2024:

Should we also apply this change to the BaseIndex? Imo it's unexpected behaviour that passing transformations=[] fails to actually override the default settings.
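
For reference, a minimal sketch of the distinction being discussed (Settings.transformations is assumed here to be the source of the defaults):

```python
from typing import Optional, Sequence

from llama_index.core import Settings
from llama_index.core.schema import TransformComponent


def resolve_transformations(
    transformations: Optional[Sequence[TransformComponent]] = None,
) -> Sequence[TransformComponent]:
    if transformations is None:
        # Not provided at all: fall back to the global defaults.
        return Settings.transformations
    # An explicit empty sequence means "apply no transformations" and should
    # not be silently replaced by the defaults.
    return list(transformations)
```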

```diff
@@ -48,7 +48,7 @@ def __init__(
         index_struct: Optional[IS] = None,
         storage_context: Optional[StorageContext] = None,
         callback_manager: Optional[CallbackManager] = None,
-        transformations: Optional[List[TransformComponent]] = None,
+        transformations: Optional[Sequence[TransformComponent]] = None,
```
DarkLight1337 (Contributor, Author):

BaseIndex does not require that transformations specifically be a list. Existing subclasses that assume that it is a list should remain unaffected (in terms of type safety) as long as they specify transformations as a list in the subclass's initializer.
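
A small illustration of this point with hypothetical classes (not the PR diff): widening the base initializer to Sequence still lets a subclass keep a List annotation for its own callers, since every List is a Sequence.

```python
from typing import List, Optional, Sequence

from llama_index.core.schema import TransformComponent


class BaseIndexSketch:
    # The base accepts any Sequence (list, tuple, ...).
    def __init__(
        self, transformations: Optional[Sequence[TransformComponent]] = None
    ) -> None:
        self._transformations = list(transformations or [])


class ListBasedIndexSketch(BaseIndexSketch):
    # An existing subclass can keep declaring a List; whatever it forwards to
    # the base is still a valid Sequence, so type safety is preserved.
    def __init__(
        self, transformations: Optional[List[TransformComponent]] = None
    ) -> None:
        super().__init__(transformations=transformations)
```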

@logan-markewich (Collaborator) commented:

@DarkLight1337 ok, finally had some time to look at this.

In general, this is a lot of work, and I'm definitely thankful for this!

However, for multi-modalities, we are really wanting to make this more core to the library (i.e. everything is multimodal by default, rather than having specific subclasses)

For example

  • chat messages should have a content type (the TS package has this already)
  • LLMs should automatically handle modalities rather than having separate base classes (i.e. unifying multi-modal-llms with the LLM base class, reducing code duplication)
  • embeddings should probably also not have a specific base class, probably a mixin makes more sense (and reduces duplication again)

Here's some of our current planning

  1. unified LLM interface
    • no more separate LLM class
    • do subclassing at the ChatMessage level instead (follows principle of composition over inheritance)
      • text chat message, and image chat message
  2. multimodal prompt interface
    • corollary of the above: if we’re interleaving images and text at the chat message level, then we’ll also need a way to format both images and text with our chat prompt template
  3. unified embedding interface (?)
    • this one i feel less strongly about, text and image embeddings are oftentimes not interchangeable
  4. refactor our structured output extraction
    • LLMProgram
      • takes in text and/or images, output result
  5. refactor our data representations
    • some of this will depend on what we do in the outer vector index abstraction
    • idea: have a node store both text and images (they technically already do this too)
      • e.g. for recursive retrieval, link a text representation to an image
  6. refactor multimodal vector store abstraction
    • i don’t have a clear idea what to do here yet. some ideas (some mutually exclusive, some not)
      • keep VectorStoreIndex as central interface
        • if image embedding model is specified, then use that to embed images
        • if text embedding model is specified, then use that to embed text
        • the derived query engine will be able to process both text and images depending on the retrieved nodes.
      • have vector index retriever auto-retrieve images if image is linked (similar to recursive retrieval for text)
      • have a multimodal response synthesizer (shove in images and text)
  7. multimodal agents (?)
    • i don’t know here yet

@DarkLight1337 (Contributor, Author) commented on Jun 4, 2024:

> @DarkLight1337 ok, finally had some time to look at this.
>
> In general, this is a lot of work, and I'm definitely thankful for this!
>
> However, for multi-modalities, we are really wanting to make this more core to the library (i.e. everything is multimodal by default, rather than having specific subclasses)
>
> For example
>
>   • chat messages should have a content type (the TS package has this already)
>   • LLMs should automatically handle modalities rather than having separate base classes (i.e. unifying multi-modal-llms with the LLM base class, reducing code duplication)
>   • embeddings should probably also not have a specific base class, probably a mixin makes more sense (and reduces duplication again)
>
> Here's some of our current planning...

Thanks for the detailed roadmap! I totally agree that this should be built into the core functionality; I used separate subclasses here to avoid editing the core code without first pinning down the details. This PR mainly works on items 3 and 6:

  • Item 3: OmniModalEmbeddingBundle provides an easy way to compose embeddings of different modalities.
  • Item 6: OmniModalVectorStoreIndex and OmniModalVectorIndexRetriever provide a unified interface for storing and retrieving (respectively) documents of any modality, identified by a string key. The OmniModalVectorIndexRetriever is already coded to handle recursive retrieval on other modalities (though this has not really been tested).

Regarding item 5, the current node architecture requires us to create a bunch of subclasses just to support each modality combination (for example, by having a mixin for each modality and then mixing and matching them). This is clearly not very scalable. Instead, we can leverage the existing node composition logic to create a root node that contains multiple modalities (by having child nodes, each of a single modality). Based on this, we can add a modality parameter to BaseNode.get_content to retrieve the data inside the (root) node by modality. WDYT?
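
A hypothetical sketch of that composition idea (illustrative only; get_content does not currently take a modality argument):

```python
from llama_index.core.schema import (
    ImageNode,
    NodeRelationship,
    RelatedNodeInfo,
    TextNode,
)

# Each child node carries a single modality.
text_child = TextNode(text="Caption describing the figure.")
image_child = ImageNode(image_path="figure.png")

# The root node composes them via the existing relationship mechanism.
root = TextNode(text="")
root.relationships[NodeRelationship.CHILD] = [
    RelatedNodeInfo(node_id=text_child.node_id),
    RelatedNodeInfo(node_id=image_child.node_id),
]

# The proposal: a modality-aware accessor on BaseNode, e.g.
#   root.get_content(modality="image")
# would resolve the image child, instead of requiring a new BaseNode subclass
# for every modality combination.
```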

Side note, I think we should open a new issue for this multi-modality refactoring roadmap so that it can get more visibility.

```python
        cls,
        path: str,
        *,
        query_loader: Callable[[Dict[str, Any]], QueryBundle] = QueryBundle.from_dict,
```
@DarkLight1337 (Contributor, Author) commented on Jun 6, 2024:

How could we deserialize multi-modal queries and documents (and any class that contains them, like this OmniModalEmbeddingQAFinetuneDataset) without requiring the user to pass in the concrete classes to load?

@DarkLight1337 force-pushed the omni-modal branch 2 times, most recently from 0240f39 to 5bb6331, on June 7, 2024 at 03:09.
@DarkLight1337 (Contributor, Author) commented:

Looks like we can't have any tests for the multi-modal retriever, since that requires CLIP, which is not installed in the CI environment.
