
512MB seems to be the max supported file size for disk persistence plugin #851

Open
bennyzen opened this issue Nov 30, 2024 · 7 comments

@bennyzen

Describe the bug

Using Orama with the persistence plugin, I seem to have hit a wall. While indexing some publications, everything was fine until the database grew. Now I keep getting the same error when persisting the database to file:

node:buffer:711
    slice: (buf, start, end) => buf.hexSlice(start, end),
                                    ^

Error: Cannot create a string longer than 0x1fffffe8 characters
    at Object.slice (node:buffer:711:37)
    at Buffer.toString (node:buffer:863:14)
    at persist (file:///home/node/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:60:45)
    at async persistToFile (file:///home/node/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
    at async addPDF (file:///home/node/lib/utils.mjs:149:3)
    at async indexPublications (file:///home/node/lib/utils.mjs:218:9)
    at async Command.<anonymous> (file:///home/node/dbtool:59:5) {
  code: 'ERR_STRING_TOO_LONG'
}

Node.js v22.11.0
node@81095f0d53a4:~$ ./dbtool stats
file:///home/node/lib/utils.mjs:173
    size: filesize(Buffer.byteLength(JSON.stringify(db))),
                                          ^

RangeError: Invalid string length
    at JSON.stringify (<anonymous>)
    at getStats (file:///home/node/lib/utils.mjs:173:43)
    at async Command.<anonymous> (file:///home/node/dbtool:32:19)
    
node@81095f0d53a4:~$ du -h db.orama 
512M    db.orama
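
For context, 0x1fffffe8 is V8's maximum string length: 536,870,888 characters (2**29 - 24), just under 512 MiB, which matches the file size du reports above. A minimal sketch in plain Node.js (no Orama involved) that hits the same wall:

// Converting a buffer at or past V8's string cap fails exactly like the
// trace above: 0x20000000 bytes (512 MiB) exceeds the 0x1fffffe8 limit.
const buf = Buffer.alloc(0x20000000)
buf.toString('utf8') // throws ERR_STRING_TOO_LONG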

To Reproduce

  1. Use Orama with the persistence plugin
  2. Ingest a lot of docs until you reach 512MB in size
  3. Watch your whole database go up in smoke

Expected behavior

Being able to reach more than 512MB in database size.

Environment Info

OS: Manjaro Linux 6.6.54
Node: v22.11.0
Orama: @orama/orama 3.0.2 @orama/plugin-data-persistence 3.0.2

Affected areas

Initialization, Data Insertion

Additional context

Only tried Linux so far, as it's my daily driver.

@micheleriva
Member

Hi @bennyzen,
how are you serializing the database? Via JSON, DPACK, or MessagePack?

@bennyzen
Author

bennyzen commented Dec 7, 2024

Ciao Michele,

first of all, thank you for this amazing project.

From my humble understanding (I haven't yet studied the internals of Orama), I simply followed the instructions in the docs and called the provided persistToFile() and restoreFromFile() methods, both with the "binary" argument.

There's a real chance I've been overly ambitious in ingesting so much data into the db; maybe it's just not made for such volumes.

BTW: Has anyone successfully persisted and restored more than 512MB of data, or is it just me running into this issue?

@micheleriva
Member

Can you try persisting this data in JSON format, using the json option instead of the binary one? Built-in JSON support in JavaScript is far superior to binary support via third-party libs like msgpack or dpack.

As far as I know, 512MB shouldn't really be a problem. Especially in JSON!
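
For illustration, the switch would look like this with the persistToFile/restoreFromFile signature used in the repro further down (the file name here is arbitrary):

const path = await persistToFile(db, 'json', 'db.json') // 'json' instead of 'binary'
const restored = await restoreFromFile('json', path)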

@bennyzen
Author

bennyzen commented Dec 7, 2024

Yes, I'll surely try to persist using JSON. But it will take some time to embed and ingest all those records again to reach that volume.

The only thing that still puzzles me is what I've come across here. If I understand it right, it means the maximum string length has regressed back to 0.5GB. But as always, please correct me if I'm wrong.
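
(A quick way to check that claim in plain Node.js, independent of Orama: string allocation past the cap fails however the string is built.)

'a'.repeat(0x1fffffe8)     // ok: exactly at the limit (536,870,888 chars)
'a'.repeat(0x1fffffe8 + 1) // RangeError: Invalid string length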

@bennyzen
Author

bennyzen commented Dec 8, 2024

Here's a quick'n'dirty, bare-bones reproduction in both binary and json modes:

import { create, insert } from '@orama/orama'
import {
  persistToFile,
  restoreFromFile,
} from '@orama/plugin-data-persistence/server'

const inserts = 512 * 10
const blockSize = 1048576 / 10 // 1MB / 10, as a whole 1MB block would cause another error
const mode = 'json'

// build a blockSize-character string of 'a's (~100KB per document)
const payload = () => 'a'.repeat(blockSize)

const db = create({
  schema: {
    payload: 'string',
  },
})

console.time('inserting')
for (let i = 0; i < inserts; i++) {
  await insert(db, {
    payload: payload(),
  })
}
console.timeEnd('inserting')

// persist the database to disk
console.time('persisting')
const path = await persistToFile(db, mode, 'db.dat')
console.timeEnd('persisting')

// restore the database from disk
console.time('restoring')
const restored = await restoreFromFile(mode, path)
console.timeEnd('restoring')

JSON mode yields this error:

inserting: 21.506s
file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:50
            serialized = JSON.stringify(dbExport);
                              ^

RangeError: Invalid string length
    at JSON.stringify (<anonymous>)
    at persist (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:50:31)
    at async persistToFile (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
    at async file:///home/ben/tmp/orama-persist-limit/main.mjs:35:14

Node.js v22.11.0

BINARY mode yields this error:

inserting: 21.573s
node:buffer:711
    slice: (buf, start, end) => buf.hexSlice(start, end),
                                    ^

Error: Cannot create a string longer than 0x1fffffe8 characters
    at Object.slice (node:buffer:711:37)
    at Buffer.toString (node:buffer:863:14)
    at persist (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/index.js:60:45)
    at async persistToFile (file:///home/ben/tmp/orama-persist-limit/node_modules/.pnpm/@[email protected]/node_modules/@orama/plugin-data-persistence/dist/server.js:16:24)
    at async file:///home/ben/tmp/orama-persist-limit/main.mjs:34:14 {
  code: 'ERR_STRING_TOO_LONG'
}

Node.js v22.11.0

So yes, the limit seems to be 512MB. Correct?

@micheleriva
Member

It shouldn't be. We're investigating; we'll keep you posted (cc @matijagaspar, @faustoq).

@bennyzen
Author

It's just an assumption and probably too vague to be useful, but couldn't this be mitigated by using, e.g., a streaming NDJSON parser/serializer? It would surely involve significant rework of the current codebase, but IMHO it would remove these constraining limitations and significantly reduce memory consumption for larger data volumes.
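
To make the idea concrete, here is a rough sketch in plain Node.js; persistNdjson, restoreNdjson, and the docs iterable are hypothetical and not part of Orama's API. Each record is serialized on its own line, so no single string ever approaches V8's limit:

import { createWriteStream, createReadStream } from 'node:fs'
import { createInterface } from 'node:readline'
import { once } from 'node:events'

// Hypothetical writer: streams one JSON record per line instead of
// building one giant string for the whole database.
async function persistNdjson(docs, path) {
  const out = createWriteStream(path)
  for (const doc of docs) {
    if (!out.write(JSON.stringify(doc) + '\n')) {
      await once(out, 'drain') // respect backpressure
    }
  }
  out.end()
  await once(out, 'finish')
}

// Hypothetical reader: parses line by line, never holding more than one
// record's worth of JSON text in memory at a time.
async function restoreNdjson(path) {
  const docs = []
  const rl = createInterface({ input: createReadStream(path), crlfDelay: Infinity })
  for await (const line of rl) {
    if (line) docs.push(JSON.parse(line))
  }
  return docs
}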

Another thing I noticed during testing: field size seems to be limited to 100KB (see the rudimentary code above). Admittedly, who puts 100KB of data into a single field? But that's perhaps material for a separate issue.
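
A hedged probe of that observation, reusing the create/insert calls from the repro above (the sizes are arbitrary test points, not documented limits; run as an .mjs module for top-level await):

import { create, insert } from '@orama/orama'

const db = create({ schema: { payload: 'string' } })
for (const kb of [50, 100, 200, 500, 1024]) {
  try {
    await insert(db, { payload: 'a'.repeat(kb * 1024) })
    console.log(`${kb}KB field: ok`)
  } catch (err) {
    console.log(`${kb}KB field: ${err.message}`)
  }
}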
