Embeddings included in Memory index OR serialized JSON index #834

Open
drush opened this issue Oct 22, 2024 · 3 comments


drush commented Oct 22, 2024

Describe the bug

When building an index with plugin-embeddings, the embeddings are not persisted to disk on save().

We build the index with plugin-astro, which we have extended to generate embeddings at indexing time. The resulting index.json file is the same size whether or not embeddings are specified. We have verified that embeddings are generated (by adding diagnostic logging), but the index file does not serialize the 'embeddings' key or any other embeddings data; the only trace is the 'embeddings' key declared in the schema.

To Reproduce

  • Stack: orama, tfjs-node, plugin-embeddings, plugin-astro
  • Run pnpm build
  • Inspect the resulting DB JSON file, or attempt a vector search on the DB

No embeddings can be found.
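
The "inspect the resulting DB JSON file" step can be done mechanically instead of by eye. Below is a minimal sketch, assuming vectors would be serialized as plain arrays of numbers (this is an assumption about the file layout, not something confirmed in this thread). `findVectorPaths` is a hypothetical helper, not part of Orama; it walks any parsed JSON value and reports where arrays of the expected dimension appear.

```javascript
// Hypothetical helper (not an Orama API): recursively collect the paths of
// numeric arrays whose length equals `dims`.
// Assumption: embeddings are stored as plain number arrays in the JSON file.
function findVectorPaths(value, dims, path = '$', found = []) {
  if (Array.isArray(value)) {
    const isVector =
      value.length === dims && value.every((n) => typeof n === 'number')
    if (isVector) {
      found.push(path)
      return found
    }
    // Not a vector itself, so look inside each element.
    value.forEach((item, i) =>
      findVectorPaths(item, dims, `${path}[${i}]`, found)
    )
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      findVectorPaths(child, dims, `${path}.${key}`, found)
    }
  }
  return found
}
```

To use it, parse the built index file with `JSON.parse` and call `findVectorPaths(parsed, 512)`, where 512 matches the `vector[512]` schema in this report. An empty result would be consistent with the behavior described above, although it could also mean the vectors are stored in a different shape.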

Expected behavior

I expected embeddings to be returned in search results, or the persisted file to be significantly larger.
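
To put a rough number on "significantly larger", the sketch below estimates how many characters a serialized vector would add to the file. It assumes vectors are written as plain JSON number arrays and uses synthetic float32 values, so the figures are only indicative. `estimateVectorJsonBytes` is a hypothetical helper, not part of Orama.

```javascript
// Hypothetical helper: estimate the JSON footprint of `docCount` vectors of
// `dims` dimensions. The sample values are synthetic float32 numbers; real
// embeddings will serialize to a similar, but not identical, length.
function estimateVectorJsonBytes(dims, docCount) {
  const sample = Array.from({ length: dims }, (_, i) =>
    Math.fround(Math.sin(i + 1))
  )
  const perVector = JSON.stringify(sample).length
  return { perVector, total: perVector * docCount }
}
```

Calling `estimateVectorJsonBytes(512, 2)` for the two-document reproduction gives a lower bound on the expected growth; if index.json does not change size at all between runs with and without embeddings, the vectors are almost certainly not being written.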

Environment Info

OS: macOS 15.1
Node: 22.10.0
Orama: 3.0.1
Astro: 4.16.6

Affected areas

Search, Serialization

Additional context

No response

Author

drush commented Oct 22, 2024

Code based on current documentation to reproduce errors:

import { create, insert, search } from '@orama/orama'
import { pluginEmbeddings } from '@orama/plugin-embeddings'
import '@tensorflow/tfjs-node'

const enableVectors = await pluginEmbeddings({
  embeddings: {
    defaultProperty: 'embedding',
    onInsert: {
      generate: true,
      properties: ['title'],
      verbose: true,
    },
  },
})

const db = create({
  schema: {
    title: 'string',
    embedding: 'vector[512]',
  },
  plugins: [enableVectors],
})

// When using this plugin, document insertion becomes async
await insert(db, { title: 'The quick brown fox jumps over the lazy dog' })
await insert(db, {
  title: "I've seen a lazy dog dreaming of jumping over a quick brown fox",
})

console.log('Async Indexing complete')

// To test persistence, import `persist` from '@orama/plugin-data-persistence' and uncomment:
// const index = await persist(db, 'json')
// console.log('Saved to disk', JSON.stringify(index, null, 2))

// This fails whether index is saved or not
const results = await search(db, {
  mode: 'vector',
  term: 'dog',
  includeVectors: true, // Defaults to `false`
  // similarity: 0.85, // Minimum similarity. Defaults to `0.8`
})

console.log(JSON.stringify(results, null, 2))
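
When checking the output of the snippet above, it can help to reduce each hit to a yes/no answer on whether a vector came back. The following is a minimal sketch, assuming the search result has the shape `{ hits: [{ id, score, document }] }`. `summarizeHits` is a hypothetical helper, not part of Orama.

```javascript
// Hypothetical helper: report, per hit, whether `document[property]` holds a
// vector of the expected dimension. Accepts plain arrays and typed arrays.
// Assumption: `results.hits` is an array of { id, score, document } objects.
function summarizeHits(results, property, dims) {
  const hits = results?.hits ?? []
  return hits.map((hit) => {
    const value = hit.document?.[property]
    const isArrayLike = Array.isArray(value) || ArrayBuffer.isView(value)
    return {
      id: hit.id,
      score: hit.score,
      hasVector: isArrayLike && value.length === dims,
    }
  })
}
```

With the reproduction above, `summarizeHits(results, 'embedding', 512)` returning an empty array means the vector search found nothing, while entries with `hasVector: false` would mean documents matched but their vectors were not returned despite `includeVectors: true`.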

drush changed the title from "Embeddings not serialized to JSON index" to "Embeddings included in Memory index OR serialized JSON index" on Oct 26, 2024
Author

drush commented Oct 26, 2024

@micheleriva Looks like you're in active development on some of the code related to this issue. The snippet above should allow you to reproduce the issue easily. I originally thought the bug was limited to the persisted version of the index, but it appears to affect the index in general (in memory as well). I've updated the title accordingly. Happy to help diagnose this further or assist with validating any fixes.

@micheleriva
Member

Hi @drush, thank you so much for opening this. It looks like a bug. We're on it.
