Prototype UI for handling automated document extraction tasks using WatsonX models on PDF files.
Create prompts from the UI for usage against uploaded documents.
A sample version deployed with Schematics (Terraform) is available here
Either use IBMid to log in or sign up using App ID.
For best results, create a new prompt that matches the type of document you are uploading. The prompt is applied to each page individually. This behavior can be enhanced in future iterations.
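The per-page behavior described above can be pictured with a minimal Python sketch. It is not the worker's actual code: `apply_prompt_per_page`, `run_model`, and `fake_model` are hypothetical names, and the model call is replaced by a stand-in so the example runs without any credentials.

```python
from typing import Callable, List


def apply_prompt_per_page(
    pages: List[str],
    prompt: str,
    run_model: Callable[[str], str],
) -> List[str]:
    """Run the same prompt against every page and collect one result per page."""
    results = []
    for page_text in pages:
        # Each page is sent on its own; no context is shared between pages.
        model_input = f"{prompt}\n\n{page_text}"
        results.append(run_model(model_input))
    return results


def fake_model(model_input: str) -> str:
    """Stand-in for a real model call (assumption, for illustration only)."""
    return f"{len(model_input)} chars processed"


page_results = apply_prompt_per_page(
    ["Text of page one", "Text of page two"],
    "Extract the invoice number.",
    fake_model,
)
print(page_results)
```

Because every page is processed in isolation, a prompt that relies on information spread across several pages may need to be rewritten so that it makes sense for a single page.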
This web application consists of the following components:
- React Web UI
webclient/client
- Express Web Server
webclient/server
- Python (Procrastinate) worker
server/procrastinateworker
- Sqitch database schema change management
sqitch
- PostgREST (provided by Docker Hub)
- MinIO S3 (provided by Docker Hub)
- Postgres DB (provided by Docker Hub)
- Rancher Desktop/Docker
- Pyenv
pip install virtualenv
- From root directory, run
cp .env.example .env
- From root directory run
cp ./server/procrastinateworker/.env_example ./server/procrastinateworker/.env
- Modify the WML_APIKEY, WML_ENDPOINT, and WML_PROJECT_ID values in the file
./server/procrastinateworker/.env
- From webclient directory, run
npm i
and then npm run setupenv
- this will copy the .env files for local development
- From project root directory, run
docker compose -f docker-compose.yaml up --force-recreate
- Wait for the Postgres DB to start up and for sqitch to finish provisioning the tables
- Press Ctrl-C to stop the process once the logs are quiet
- Run
docker compose -f docker-compose.yaml -f docker-compose.minio.yaml up
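Before starting the worker, it can help to confirm that the three WML values were actually filled in. The following is a minimal sketch, assuming the worker's .env file uses plain KEY=VALUE lines (an assumption; the real file may use other syntax). `parse_env` and `missing_keys` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict, List

REQUIRED_KEYS = ("WML_APIKEY", "WML_ENDPOINT", "WML_PROJECT_ID")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str) -> List[str]:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    # Path taken from the setup steps; run this from the project root.
    env_path = Path("server/procrastinateworker/.env")
    if env_path.exists():
        print("Missing:", missing_keys(env_path.read_text()) or "none")
    else:
        print(f"{env_path} not found; run the cp step first")
```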
- From
webclient
run npm i
- From
webclient
run npm run install:all
- From
webclient
directory, run npm run develop:server
- This starts the Express server, which monitors code changes in
webclient/server
- This server connects to S3/MinIO and PostgREST
- From
webclient
directory, run npm run develop:client
- This starts the web application, which monitors code changes in
webclient/client
- From project root folder make a new python virtual environment
python3 -m venv detpyvenv
(detpyvenv is the name of the virtual env)
- From project root folder, activate the new Python virtual environment
source ./detpyvenv/bin/activate
- From
server
directory, run pip install -r requirements.txt
- From
server
directory, run export PYTHONPATH=.
- From
server
directory, run procrastinate --verbose --app=procrastinateworker.worker.app worker
- This starts the Procrastinate worker that processes the PDF files
- Visit the web UI at http://localhost:3003/
- The PostgREST endpoint is at http://localhost:3000/
- The MinIO S3 admin console is available at http://localhost:9001/
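To confirm that the three local endpoints listed above are listening, a small check like the following can be used. This is a sketch, not part of the repository: it only tests that a TCP connection to each port succeeds and does not verify that the service behind it is healthy. `is_port_open` and `check_services` are hypothetical names.

```python
import socket
from typing import Callable, Dict

# Ports taken from the URLs listed above.
LOCAL_SERVICES: Dict[str, int] = {
    "web UI": 3003,
    "PostgREST": 3000,
    "MinIO console": 9001,
}


def is_port_open(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_services(probe: Callable[[int], bool] = is_port_open) -> Dict[str, bool]:
    """Probe every known service and report which ones accept connections."""
    return {name: probe(port) for name, port in LOCAL_SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```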
The database is only temporary; it can be reset by running docker compose -f docker-compose.yaml up --force-recreate
- Users/Roles are only placeholders and do not actually work.
- Install CLI and plugins - https://cloud.ibm.com/docs/schematics?topic=schematics-schematics-cli-reference
- Install Terraform - https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
- Create an IBM Cloud resource group
- Create a Code Engine Project in that same resource group
- Create a Schematics workspace
Schematics workspace settings:
Field | Description |
---|---|
URL | https://github.com/IBM/document-extraction-toolkit/tree/main/terraform |
Use full repository | Uncheck |
Personal Access Token | Not required (if you fork this repo, you will need one) |
Terraform version | terraform_v1.5 |
Resource group | Use the one created previously |
Once the workspace is created, please update the following variables:
Field | Description |
---|---|
project_name | Unique name of the project with no spaces. It will be used to prefix resources created by the template. |
resource_group | Name of the resource group previously created. |
ce_project_id | UUID of the Code Engine project. You can find this either on the URL path or the "Details" button. |
cr_namespace | Use a unique name. Do not use the default value! |
cr_registry | Use icr.io for global. The Terraform template does not yet provision into other regions, so do not use another value until this is fixed. If built images cannot be pushed to the registry (403 error), change the region here and apply the template again. |
region | Use the same region where your resource group was created. All new resources are deployed to this region. |
use_ssh_key | Determines if the SSH deploy key is to be used when pulling from the repo. Applies to private repos only. |
ssh_deploykey | SSH deploy key used for container registry and Code Engine deployments with a private repo. Always mark it as sensitive. You can leave it blank for public GitHub deployments, but still mark it as sensitive. |
wml_apikey | Your WatsonX API key. Sensitive. |
wml_endpoint | WatsonX service endpoint URL, e.g. https://us-south.ml.cloud.ibm.com/ml/v1-beta |
wml_project_id | Project UUID (Required) |
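The constraints in the table above can be checked before the values are entered into the workspace. The sketch below is a hypothetical pre-flight helper, not part of the template: it checks only what the table states (no spaces in project_name, UUIDs for the two project IDs, non-empty required fields). The sample values other than the endpoint URL are placeholders.

```python
import uuid
from typing import Dict, List


def is_uuid(value: str) -> bool:
    """Return True if the string can be parsed as a UUID."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def validate_variables(variables: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    name = variables.get("project_name", "")
    if not name or any(ch.isspace() for ch in name):
        problems.append("project_name must be non-empty and contain no spaces")
    for key in ("ce_project_id", "wml_project_id"):
        if not is_uuid(variables.get(key, "")):
            problems.append(f"{key} must be a UUID")
    for key in ("resource_group", "cr_namespace", "region", "wml_apikey", "wml_endpoint"):
        if not variables.get(key):
            problems.append(f"{key} is empty")
    return problems


# Placeholder values for illustration only.
sample = {
    "project_name": "det-demo",
    "resource_group": "my-resource-group",
    "ce_project_id": str(uuid.uuid4()),
    "cr_namespace": "det-demo-namespace",
    "region": "us-south",
    "wml_apikey": "placeholder-key",
    "wml_endpoint": "https://us-south.ml.cloud.ibm.com/ml/v1-beta",
    "wml_project_id": str(uuid.uuid4()),
}
print(validate_variables(sample))
```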
Verify all settings and run Generate plan. It should succeed, after which you can proceed to Apply plan. If Apply plan fails, you can try once more and it will start again. Some of the elements are timed, due to limitations of the provider. The plan should take less than 2 hours from start to finish and provisions all the necessary backend services (App ID, S3), the container registry, Code Engine image builds, and Code Engine deployments. You will have a fully functional web application after the template finishes.
The Code Engine application **-worker needs to be manually adjusted after the template executes to be a daemon instead of a task (due to a Terraform bug).
When using Schematics workspaces, the auth token could time out midway through the run. If that happens, you can apply the template again without any issue.
This application is provided as a template for building a web UI that allows users to create tasks for watsonx to process.
Component | Link |
---|---|
UI | https://marmelab.com/react-admin/ |
Database change management | https://sqitch.org/docs/manual/sqitchtutorial/ |
PostgREST | https://postgrest.org/ |
Procrastinate | https://procrastinate.readthedocs.io |