Cannot index 5MB PDF with default settings using bedrock #94

Open
dirkpetersen opened this issue Oct 27, 2024 · 5 comments

dirkpetersen commented Oct 27, 2024

I am trying to upload this file (5 MB, 2,384,000 characters) to LibreChat with the Bedrock API enabled:
https://pve.proxmox.com/pve-docs/pve-admin-guide.pdf

I tried the dev and dev-lite containers, but I get an upload error ("An Error occurred while uploading a file") in the LibreChat GUI and no real error in the logs, even with DEBUG_RAG_API=true. Strange.

If I set CHUNK_SIZE=5000, however, it works. These are my RAG settings:

DEBUG_RAG_API=true
RAG_USE_FULL_CONTEXT=true
PDF_EXTRACT_IMAGES=false # false is default
CHUNK_SIZE=5000 # 1500 is default

AWS_DEFAULT_REGION=us-west-2
AWS_ACCESS_KEY_ID=cc
AWS_SECRET_ACCESS_KEY=cc

EMBEDDINGS_PROVIDER=bedrock
EMBEDDINGS_MODEL=amazon.titan-embed-text-v1

RAG_API_URL=http://host-gateway:8000
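
To show roughly how CHUNK_SIZE changes the amount of embedding work for this document, here is a minimal sketch. It assumes a plain fixed-width character split; the `estimate_chunks` helper and its `overlap` parameter are hypothetical, and the splitter the RAG API actually uses may produce different counts because it can break on separators.

```python
def estimate_chunks(total_chars: int, chunk_size: int, overlap: int = 0) -> int:
    """Estimate how many chunks a fixed-width character splitter would produce.

    Assumption: every chunk except the last holds exactly `chunk_size`
    characters and consecutive chunks share `overlap` characters.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    remaining = total_chars - chunk_size
    # Ceiling division using integers only.
    return 1 + -(-remaining // step)


TOTAL_CHARS = 2_384_000  # character count reported for the PDF above

default_chunks = estimate_chunks(TOTAL_CHARS, 1500)  # 1590 chunks
larger_chunks = estimate_chunks(TOTAL_CHARS, 5000)   # 477 chunks
print(default_chunks, larger_chunks)
```

Under these assumptions the default setting needs about three times as many chunks to be embedded as CHUNK_SIZE=5000, which would be consistent with the larger setting finishing sooner.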

dirkpetersen commented Oct 27, 2024

Further testing shows that CHUNK_SIZE=5000 does not fully fix the issue; more testing is needed. ChatGPT accepts this document, but Claude says it is too big.

FinnConnor commented Oct 28, 2024

I tested with CHUNK_SIZE=1500, EMBEDDINGS_PROVIDER=bedrock, EMBEDDINGS_MODEL=amazon.titan-embed-text-v1, and PDF_EXTRACT_IMAGES=False.

I was unable to reproduce any issue with indexing this PDF (5 MB) and querying it, both with Docker and with only the rag_api running.

If you are getting a file upload error, I would run just the rag_api (and the database) and see whether you can use the /embed endpoint to upload the 5 MB PDF. This will help confirm whether the issue is with embedding the file or with something else.
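
A minimal sketch of such a standalone test, using only the Python standard library. The URL, the form field names (`file_id`, `file`), the `test-5mb-pdf` identifier, and both helper functions are assumptions for illustration; check the rag_api documentation for the exact request format and any authentication it requires.

```python
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes,
                    file_type="application/pdf"):
    """Encode plain form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {file_type}\r\n\r\n'.encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post_embed(pdf_path, url="http://localhost:8000/embed", timeout=600):
    """POST a PDF straight to the RAG API, bypassing LibreChat and NGINX.

    Assumption: the endpoint accepts "file_id" and "file" form fields.
    """
    with open(pdf_path, "rb") as handle:
        body, content_type = build_multipart(
            {"file_id": "test-5mb-pdf"}, "file", "pve-admin-guide.pdf",
            handle.read(),
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # A generous client timeout so a slow embedding run is not cut short.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", "replace")
```

Calling `post_embed("pve-admin-guide.pdf")` against a locally running rag_api and timing the call would show whether embedding itself fails or is merely slow.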

If you're not having an issue with that, the cause may be that you have RAG_USE_FULL_CONTEXT=true. This sends the entire context (all the text of the 5 MB PDF) to the chat, which very likely exceeds the maximum number of input tokens.
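
A back-of-the-envelope check of that claim, as a sketch only. The ratio of 4 characters per token is a common rough heuristic, not a real tokenizer, and the 200,000-token limit is an assumed example value; substitute the limit of the model you actually use.

```python
CHARS_PER_TOKEN = 4            # rough heuristic, not an actual tokenizer
ASSUMED_TOKEN_LIMIT = 200_000  # hypothetical context window for illustration


def estimate_tokens(total_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Approximate the token count of a text from its character count."""
    return total_chars // chars_per_token


def fits_in_context(total_chars: int, token_limit: int = ASSUMED_TOKEN_LIMIT) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(total_chars) <= token_limit


pdf_chars = 2_384_000  # character count reported for the PDF in this issue
print(estimate_tokens(pdf_chars))   # 596000
print(fits_in_context(pdf_chars))   # False
```

Even with generous error bars on the heuristic, the full text lands well above the assumed limit, so sending it whole would be expected to fail.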

Thanks for bringing this up @dirkpetersen

dirkpetersen commented Oct 28, 2024

Thanks, @ScarFX. I set RAG_USE_FULL_CONTEXT=false, but the problem persists.

LibreChat-NGINX   | 97.113.82.140 - - [28/Oct/2024:23:02:31 +0000] "POST /api/convos/gen_title HTTP/2.0" 200 54 "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36" "-"
LibreChat-NGINX   | 2024/10/28 23:02:38 [warn] 30#30: *1 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000001, client: 97.113.82.140, server: _, request: "POST /api/files HTTP/2.0", host: "ochat1028b.aws.internetchen.de", referrer: "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78"
rag_api-1         | /usr/local/lib/python3.10/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from this module in 48.0.0.
rag_api-1         |   from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
chat-mongodb      | {"t":{"$date":"2024-10-28T23:03:04.400+00:00"},"s":"I",  "c":"WTCHKPT",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":{"ts_sec":1730156584,"ts_usec":400378,"thread":"1:0xffff8cf8e6c0","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_CHECKPOINT_PROGRESS","category_id":7,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"saving checkpoint snapshot min: 21, snapshot max: 21 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 545"}}}
LibreChat-NGINX   | 2024/10/28 23:03:38 [error] 30#30: *1 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 97.113.82.140, server: _, request: "POST /api/files HTTP/2.0", upstream: "http://172.20.0.6:3080/api/files", host: "ochat1028b.aws.internetchen.de", referrer: "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78"
LibreChat-NGINX   | 97.113.82.140 - - [28/Oct/2024:23:03:38 +0000] "POST /api/files HTTP/2.0" 504 569 "https://ochat1028b.aws.internetchen.de/c/77d89899-5729-42cf-84e8-d8a8a228ce78" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36" "-"
rag_api-1         | 2024-10-28 23:03:38,768 - root - INFO - Request POST http://rag_api:8000/embed - 200

It seems there is a timeout. Next, I will try the RAG API standalone.
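
The timestamps in the log above can be compared directly. This sketch copies three timestamps from those log lines and measures the gaps between them; the variable names are only for illustration, and the NGINX entries have one-second resolution, so the result is approximate.

```python
from datetime import datetime

NGINX_FORMAT = "%Y/%m/%d %H:%M:%S"
RAG_API_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# NGINX warning: upload body buffered to a temporary file.
upload_buffered = datetime.strptime("2024/10/28 23:02:38", NGINX_FORMAT)
# NGINX error: upstream timed out while reading the response header.
upstream_timed_out = datetime.strptime("2024/10/28 23:03:38", NGINX_FORMAT)
# rag_api: POST /embed logged with status 200.
embed_finished = datetime.strptime("2024-10-28 23:03:38,768", RAG_API_FORMAT)

wait_before_timeout = (upstream_timed_out - upload_buffered).total_seconds()
finished_after_timeout = (embed_finished - upstream_timed_out).total_seconds()

print(wait_before_timeout)     # 60.0
print(finished_after_timeout)  # 0.768
```

The 60-second gap matches the documented default of NGINX's proxy_read_timeout directive, and the embed request logs a 200 within the same second as the timeout. One possible reading is that the embedding succeeded but took slightly longer than the proxy was willing to wait; whether the LibreChat NGINX configuration overrides that default is not shown here.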

FinnConnor commented

@dirkpetersen were you able to get RAG API to work?

dvejsada commented

We have been experiencing this issue as well. Starting at around 3 MB, the file upload fails; smaller files work fine. I am attaching one of the failing files for replication purposes (an article saved from a website as a PDF): Clanek_SeznamZpravy.pdf. @danny-avila, could you please have a look at what may be causing this? We use Azure OpenAI embeddings with the text-embedding-3-large model.
