Push Files to Index from Obsidian, Emacs & Desktop Clients using Multi-Part Forms Method #499

Merged
merged 24 commits into from
Oct 17, 2023

Conversation

debanjum

@debanjum debanjum commented Oct 13, 2023

Overview

  • Add ability to push data to index from the Emacs, Obsidian clients
  • Switch to the standard mechanism of syncing files via HTTP multipart/form-data requests. Previously the data was streamed as JSON
    • Benefits of the new mechanism
      • Clients and server don't need to manually parse the files being sent or received, as most have built-in mechanisms for multipart/form-data requests
      • The whole payload doesn't need to be kept in memory to parse content as JSON. As individual files arrive, they're automatically written to disk if required to conserve memory
      • Binary files don't need to be encoded on the client and decoded on the server
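
To make the binary-file benefit concrete, here is a minimal Python sketch (not code from this PR) that builds a multipart/form-data body by hand and compares it with a JSON payload that base64-encodes the same bytes. The form field name `files`, the helper names and the sample file are assumptions made for illustration only.

```python
import base64
import json
import uuid


def build_multipart_body(files: dict[str, bytes]) -> tuple[bytes, str]:
    """Encode files as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, content in files.items():
        # "files" is a hypothetical form field name; filenames are not escaped here.
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        # Raw bytes go into the body unchanged, no text encoding step needed.
        parts.append(header + content + b"\r\n")
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_json_body(files: dict[str, bytes]) -> bytes:
    """Hypothetical JSON alternative: binary content must be base64 encoded first."""
    encoded = {name: base64.b64encode(content).decode("ascii") for name, content in files.items()}
    return json.dumps(encoded).encode("utf-8")


sample = {"notes.pdf": bytes(range(256)) * 40}  # stand-in for binary file content
multipart_body, content_type = build_multipart_body(sample)
json_body = build_json_body(sample)
```

In practice an HTTP client library would build the multipart body; the hand-rolled version is only meant to show that the file bytes travel as-is, while the base64 route inflates them by roughly a third.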

Code Details

Major

  • 60e9a61 Use multi-part form to receive files to index on server
  • 68018ef Use multi-part form to send files to index on desktop client
  • fc99431 Send files to index on server from the khoj.el emacs client
    • 292f042 Send content for indexing on server at a regular interval from khoj.el
  • f2e293a Send files to index on server from the khoj obsidian client
  • bed3aff 6baaaaf Update tests to test multi-part/form method of pushing files to index

Minor

  • 6aa69da Put indexer API endpoint under /api path segment
  • bea196a Explicitly make GET request to /config/data from khoj.el:khoj-server-configure method
  • 9ba173b Improve emoji, message on content index updated via logger
  • 7b9161e Don't call khoj server on khoj.el load, only once khoj invoked explicitly by user
  • Improve indexing of binary files
    • 541cd59 Let fs_syncer pass PDF files directly as binary before indexing
    • d27dc71 Use encoding of each file set in indexer request to read file
  • 99a2c93 Add CORS policy to khoj server. Allow requests from khoj apps, obsidian & localhost
  • 84654ff Update indexer API endpoint URL to index/update from indexer/batch

Resolves #471 #243

Update FastAPI app router and desktop app to use the new URL path to the
batch indexer API endpoint

All api endpoints should exist under /api path segment
Use mailbox closed with flag down once content index completed.

Use the standard, existing logger messages in the new indexer when
files to index are sent by clients
- This uses existing HTTP affordance to process files
  - Better handling of binary file formats, as it removes the need to url encode/decode
  - Less memory utilization than streaming json as files get
    automatically written to disk once memory utilization exceeds preset limits
  - No manual parsing of raw files streams required
- Add typing for variables in for loop and other minor formatting clean-up
- Assume utf8 encoding for text files and binary for image, pdf files
- Add elisp variable to set API key to engage with the Khoj server
- Use multi-part form to POST the files to index to the indexer API
  endpoint on the khoj server
Instead of pushing data as the JSON payload of a POST request as before,
pass it as files to upload via multipart/form-data to the batch indexer API endpoint
- Allow indexing frequency to be configurable by user
- Ensure there is only one khoj indexing timer running
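
The last two points (a user-configurable interval and a single live timer) can be sketched as follows. This is a minimal Python illustration of the idea, not the khoj.el implementation; the class name, the `job` callback and the one hour default are assumptions.

```python
import threading
from typing import Callable, Optional


class IndexScheduler:
    """Runs `job` every `interval_seconds`, keeping at most one live timer."""

    def __init__(self, job: Callable[[], None], interval_seconds: float = 3600.0):
        self.job = job
        self.interval_seconds = interval_seconds  # user-configurable frequency
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def start(self) -> None:
        with self._lock:
            # Cancel any existing timer first so repeated calls never stack timers.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.interval_seconds, self._run)
            self._timer.daemon = True
            self._timer.start()

    def _run(self) -> None:
        self.job()
        self.start()  # re-arm for the next interval

    def stop(self) -> None:
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

Calling `start()` twice leaves exactly one pending timer, which is the property the commit message asks for.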
…configure method

Previously the global state of `url-request-method' would affect the
kind of request made to the api/config/data API endpoint, as it wasn't
being explicitly set before calling the API endpoint

This was done with the assumption that the default value of GET for
url-request-method wouldn't change globally

But in practice it can get changed in some cases. This was
causing khoj.el load to fail, as a POST request was being made
instead, which would throw an error

@sabaimran sabaimran left a comment


nice, cleaner standardization of the API spec!

- Pass payloads as unibyte. This was causing the request to fail for
  files with unicode characters
- Suppress messages with file content on index updates
- Fix rendering response from server on index update API call
- Extract code to populate body of index update HTTP request with files
@debanjum debanjum added the maintain Maintain code, documentation or project label Oct 14, 2023
This prevents Khoj from polling the Khoj server until explicitly
invoked via `khoj' entrypoint function.

Previously it'd make a request to the khoj server every time Emacs or
khoj.el was loaded

Closes #243
@debanjum debanjum force-pushed the update-index-via-file-push-from-clients branch from 7b9161e to f64fa06 Compare October 17, 2023 02:16
Use the multi-part/form-data request to sync Markdown, PDF files in
vault to index on khoj server

Run scheduled job to push updates to vault for indexing every 1 hour
- Keep state of previously synced files to identify files to be deleted
- Last synced files stored in settings for persistence of this data
  across Obsidian reboots
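
A minimal Python sketch of the bookkeeping described above, assuming the synced state is just a set of file paths persisted as a JSON list. The function names and sample paths are hypothetical; the plugin itself stores this in its Obsidian settings.

```python
import json


def files_to_delete(last_synced: set[str], current: set[str]) -> list[str]:
    """Files that were synced before but no longer exist in the vault."""
    return sorted(last_synced - current)


def save_state(current: set[str]) -> str:
    """Serialize the synced file list so it survives a restart."""
    return json.dumps(sorted(current))


def load_state(serialized: str) -> set[str]:
    """Restore the previously synced file list."""
    return set(json.loads(serialized))


# Example run: b.md was removed from the vault and c.md was added since the last sync.
last_synced = load_state(save_state({"a.md", "b.md", "paper.pdf"}))
current = {"a.md", "c.md", "paper.pdf"}
deleted = files_to_delete(last_synced, current)
```

Here `deleted` contains only `b.md`, which the client can then report to the server so the stale entry is dropped from the index.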
Get encoding type from multi-part/form-request body for each file
Read text files as utf-8 and pdfs, images as binary
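
One way to picture this, as a hedged Python sketch rather than the server code from this PR: the client labels each file with an encoding, and the reader uses that label to decide whether to decode the bytes. The suffix lists, the `"binary"` label and both function names are assumptions.

```python
from pathlib import PurePath
from typing import Union

# Hypothetical list of file types treated as binary by the client.
BINARY_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def encoding_for(filename: str) -> str:
    """Client side: label pdfs and images as binary, everything else as utf-8 text."""
    suffix = PurePath(filename).suffix.lower()
    return "binary" if suffix in BINARY_SUFFIXES else "utf-8"


def read_content(raw: bytes, encoding: str) -> Union[str, bytes]:
    """Server side: use the encoding sent with each file to read its content."""
    if encoding == "binary":
        return raw  # pass bytes through untouched, no base64 round trip
    return raw.decode(encoding)


text = read_content("héllo".encode("utf-8"), encoding_for("notes.md"))
pdf = read_content(b"%PDF-1.7", encoding_for("paper.PDF"))
```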
No need to base64 encode/decode pdf contents to pass them
for indexing from fs_syncer to pdf_to_jsonl
Using fetch from the Khoj Obsidian plugin was failing due to cross-origin
request restrictions, and mode: no-cors didn't allow passing the x-api-key custom
header. Using Obsidian's request with multipart/form-data wasn't
possible either.
Obsidian client now pushes vault files to index instead
Use the indexer/batch API endpoint to regenerate content index rather
than the previous pull based content indexing API endpoint
New URL follows action oriented endpoint naming convention used for
other Khoj API endpoints

Update desktop, obsidian and emacs client to call this new API
endpoint
New URL query params, `force' and `t' match name of query parameter in
existing Khoj API endpoints

Update Desktop, Obsidian and Emacs client to call using these new API
query params. Set `client' query param from each client for telemetry
visibility
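
As an illustration of how a client might assemble such a request URL, here is a small Python sketch. Only the `force`, `t` and `client` parameter names and the `index/update` path segment come from this PR; the exact `/api/index/update` path, the base URL and the parameter values are assumptions.

```python
from typing import Optional
from urllib.parse import urlencode


def index_update_url(
    base_url: str,
    client: str,
    content_type: Optional[str] = None,
    force: bool = False,
    endpoint: str = "/api/index/update",  # assumed path, check the server routes
) -> str:
    """Build the index update URL with the `client`, `force` and `t` query params."""
    params = {"client": client, "force": "true" if force else "false"}
    if content_type is not None:
        params["t"] = content_type  # restrict the update to one content type
    return f"{base_url.rstrip('/')}{endpoint}?{urlencode(params)}"


url = index_update_url("http://localhost:42110", client="emacs", content_type="markdown", force=True)
```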
@debanjum debanjum force-pushed the update-index-via-file-push-from-clients branch from fad990c to 6a4f1b2 Compare October 17, 2023 12:43
It passes locally on running individually but fails when run in
parallel on local or CI
@debanjum debanjum changed the title Push Files to Index from Emacs, Desktop Clients using Multi-Part Forms Method Push Files to Index from Obsidian, Emacs & Desktop Clients using Multi-Part Forms Method Oct 17, 2023
@debanjum debanjum merged commit ecc6fbf into master Oct 17, 2023
11 checks passed
@debanjum debanjum deleted the update-index-via-file-push-from-clients branch October 17, 2023 13:07
Labels
maintain Maintain code, documentation or project
Development

Successfully merging this pull request may close these issues.

Update Emacs, Obsidian clients to push files to the Khoj backend
2 participants