This is a project to generate meaningful titles for ingested paperless documents using AI. Sends the OCR text of the document to the OpenAI API and generates a title for the document. The title is then saved to the document's metadata.
- taxreturntranscript2022
- aigclaims_check_20230313
- prosharesultrabloombergtax2023
git clone https://github.com/sjafferali/paperless-titles-from-ai.git
cp -av paperless-titles-from-ai/.env.example paperless-titles-from-ai/.env
# Update .env file with the correct values
Update docker compose file with the correct path to the project directory.
services:
# ...
paperless-webserver:
# ...
volumes:
- /path/to/paperless-titles-from-ai:/usr/src/paperless/scripts
- /path/to/paperless-titles-from-ai/init:/custom-cont-init.d:ro
environment:
# ...
PAPERLESS_POST_CONSUME_SCRIPT: /usr/src/paperless/scripts/app/main.py
The init folder (used to ensure open package is installed) must be owned by root.
To back-fill titles on existing documents, run the helper cli from the project directory:
docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh [args] [single|all]
Arguments
Option | Required | Default | Description |
---|---|---|---|
--paperlessurl [URL] | Yes | https://paperless.local:8080 | Sets the URL of the paperless API endpoint. |
--paperlesskey [KEY] | Yes | Sets the API key to use when authenticating to paperless. | |
--openaimodel [MODEL] | No | gpt-4-turbo | Sets the OpenAI model used to generate title. Full list of supported models available at models. |
--openaibaseurl [API Endpoint] | No | Sets the OpenAI compatible endpoint to generate the title from. | |
--openaikey [KEY] | Yes | Sets the OpenAI key used to generate title. | |
--dry | No | False | Enables dry run which only prints out the changes that would be made. |
--loglevel [LEVEL] | No | INFO | Loglevel sets the desired loglevel. |
docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh [args] all [filter_args]
Arguments
Option | Required | Default | Description |
---|---|---|---|
--exclude [ID] | No | Excludes the document ID specified from being updated. This argument may be specified multiple times. | |
--filterstr [FILTERSTRING] | No | Filters the documents to be updated based on the URL filter string. |
docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh [args] single (document_id)
- The default OpenAI model used for generation is gpt-4-turbo. For a slightly less accurate title generation, but drastically reduced cost, use a GPT 3.5 model.
- The number of characters of the OCR text that is sent varies depending on the model being used. We try to send the maximum number the model supports to get the best generation we can.
Although the OpenAI API privacy document states that data sent to the OpenAI API is not used for training, other OpenAI compatible API endpoints are also supported by this post-consume script, which allows you to use a locally hosted LLM to generate titles.
- Create a GitHub issue for bug reports, feature requests, or questions.
- Add a ⭐️ star on GitHub.
- PRs are welcome