From 96c6aeed4ee2b62d45f5f3955a0a1f21a3fdd9ba Mon Sep 17 00:00:00 2001
From: Eva Khmelinskaya
Date: Thu, 18 Jul 2024 16:59:30 -0700
Subject: [PATCH] Linting fixes
---
 classify-split-extract-workflow/README.md | 81 +++++++++++------------
 1 file changed, 39 insertions(+), 42 deletions(-)
diff --git a/classify-split-extract-workflow/README.md b/classify-split-extract-workflow/README.md
index d308ec093..a89f63981 100644
--- a/classify-split-extract-workflow/README.md
+++ b/classify-split-extract-workflow/README.md
@@ -32,7 +32,7 @@
 This solution aims to streamline document Classification/Splitting and Extracting with all data being saved to the BigQuery using. The user can simply plugin their own custom maid Classifier/Splitter/Extractor by changing the configuration file (and can even do it real-time, since the file is stored in GCS) and specify output BigQuery table for each processor.
-For an example use case, the application is equipped to process an individual US Tax Return using the Lending Document AI Processors (out-of-the box Specialized processors).
+For an example use case, the application is equipped to process an individual US Tax Return using the Lending Document AI Processors (out-of-the box Specialized processors).
 However you absolutely can and should train your own custom Splitter/Classifier and Extractor. Then you can specify fields (labels) to be extracted and saved to bigQuery in the format you need.
 > NOTE: LDAI Splitter & Classifier in this Demo require allowlisting to use.
@@ -40,27 +40,27 @@ However you absolutely can and should train your own custom Splitter/Classifier
 ## Architecture
-![](img/architecture.png)
+![Architecture](img/architecture.png)
 ### Pipeline Steps
 All environment variables referred further on are defined in the [vars.sh](vars.sh) file.
-1) Pipeline execution is triggered by uploading document into the GCS bucket.
- - Note, there is a dedicated bucket assigned to send Pub/Sub notifications: `CLASSIFY_INPUT_BUCKET`
- - If `DATA_SYNC` environment variable is set to _true_, then any PDF document will trigger pipeline as long as it is uploaded inside the `CLASSIFY_INPUT_BUCKET`.
- - Otherwise only `START_PIPELINE` file will trigger BATCh processing (of all the documents inside the uploaded folder).
- - Do not upload files into the `splitter_output` sub-folder - it is the `system` directory to store the split sub-documents and therefore is ignored.
-2) Pub/Sub enet is forwarded to the [GCP Workflow execution](https://console.cloud.google.com/workflows/workflow/us-central1/classify-extract/).
-3) Workflow checks the file uploaded and triggers Cloud Run Job for the Document Classification and Splitting.
-4) [Classification Cloud Run Job](https://console.cloud.google.com/run/jobs/details/us-central1/classify-job):
+1. Pipeline execution is triggered by uploading document into the GCS bucket.
+ - Note, there is a dedicated bucket assigned to send Pub/Sub notifications: `CLASSIFY_INPUT_BUCKET`
+ - If `DATA_SYNC` environment variable is set to _true_, then any PDF document will trigger pipeline as long as it is uploaded inside the `CLASSIFY_INPUT_BUCKET`.
+ - Otherwise only `START_PIPELINE` file will trigger BATCH processing (of all the documents inside the uploaded folder).
+ - Do not upload files into the `splitter_output` sub-folder - it is the `system` directory to store the split sub-documents and therefore is ignored.
+2. Pub/Sub event is forwarded to the [GCP Workflow execution](https://console.cloud.google.com/workflows/workflow/us-central1/classify-extract/).
+3. Workflow checks the file uploaded and triggers Cloud Run Job for the Document Classification and Splitting.
+4. [Classification Cloud Run Job](https://console.cloud.google.com/run/jobs/details/us-central1/classify-job):
  - Uses Document AI Classifier or Splitter as it is defined in the [config.json](classify-job/config/config.json) file (`parser_config`/`classifier`).
- - For each document sent for the processing, determines `confidence` and `type` (Classifier), determines page boundaries and type of each page (Splitter) and does the splitting into the `splitter_output` sub-folder.
- - File `config.json` defines the relation between Classifier labels and Document Parsers to be used for those label (as well as the output BigQuery table for each model).
+ - For each document sent for the processing, determines `confidence` and `type` (Classifier), determines page boundaries and type of each page (Splitter) and does the splitting into the `splitter_output` sub-folder.
+ - File `config.json` defines the relation between Classifier labels and Document Parsers to be used for those labels (as well as the output BigQuery table for each model).
  - Creates a _json_ file inside `CLASSIFY_OUTPUT_BUCKET` bucket that is the result of the classification/splitting job and is used for the Extraction.
  - The path to this json file is sent back to the GCP Workflow in the callback when Classification job is completed.
  - Here is an example of the output json file:
-
+
 ```text
 [
 ...
@@ -77,21 +77,20 @@ All environment variables referred further on are defined in the [vars.sh](vars.
 ...
 ]
 ```
- - _object_table_name_ - contains all the documents that were classified/split and ended up having the same document type.
- - _model_name_ - corresponds to the Document AI Extractor MODEL
- - _out_table_name_ - is the output BigQuery table name to be used for the extraction
-
- - It also creates the BigQuery mlops tables required for the [ML.PROCESS_DOCUMENT](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document) function such as:
- - [Object tables](https://cloud.google.com/bigquery/docs/object-tables) for the GCS documents
- - MODEL for the Document AI parsers
- - And it assigns GCS [custom metadata](https://cloud.google.com/storage/docs/metadata#custom-metadata) to the input documents with values of classification result (`confidence` score and document `type`).
- - These metadata is then also saved along into the BigQuery.
- - Documents that were result of splitting, have metadata pointing to the original document.
-5) Entity Extraction is done by [ML.PROCESS_DOCUMENT function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document) as the GCP Workflow next step and saved into the BigQuery.
- - Uses `json` file created by the Classifier to run the Extraction.
-6) (Optional _Future_) Extracted data can be sent downstream to a 3rd party as an API call for the further integration as the final step of the Workflow Execution
+- _object_table_name_ - contains all the documents that were classified/split and ended up having the same document type.
+- _model_name_ - corresponds to the Document AI Extractor MODEL
+- _out_table_name_ - is the output BigQuery table name to be used for the extraction
+- It also creates the BigQuery mlops tables required for the [ML.PROCESS_DOCUMENT](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document) function such as:
+ - [Object tables](https://cloud.google.com/bigquery/docs/object-tables) for the GCS documents
+ - MODEL for the Document AI parsers
+- And it assigns GCS [custom metadata](https://cloud.google.com/storage/docs/metadata#custom-metadata) to the input documents with values of classification result (`confidence` score and document `type`).
+ - These metadata is then also saved along into the BigQuery.
+ - Documents that were result of splitting, have metadata pointing to the original document.
+5. Entity Extraction is done by [ML.PROCESS_DOCUMENT function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document) as the GCP Workflow next step and saved into the BigQuery.
+ - Uses `json` file created by the Classifier to run the Extraction.
+6. (Optional _Future_) Extracted data can be sent downstream to a 3rd party as an API call for the further integration as the final step of the Workflow Execution.
 ### Google Cloud Products Used
@@ -108,22 +107,20 @@ All environment variables referred further on are defined in the [vars.sh](vars.
 [3]: https://cloud.google.com/run
 [4]: https://cloud.google.com/workflows
-
 ### Quotas
 Default Quotas to be aware of:
 - Number of concurrent batch prediction requests - 10
 - Number of concurrent batch prediction requests processed using document processor (Single Region) per region - 5
 ### Big Query Tables
-The BigQuery Table schema is determined in the runtime based on the DocumentAI parser used.
+The BigQuery Table schema is determined in the runtime based on the DocumentAI parser used.
 - For the Generic Form parser and OCR parser the schema does not contain any specific fields labels (only the extracted json and metadata). So it is flexible on the usage and all core information is within the json field: `ml_process_document_result`.
- Therefore you can easily export data from both OCR and FORM parser into the same BigQuery table.
+ - Therefore you can easily export data from both OCR and FORM parser into the same BigQuery table.
 - For the Specialized Document parsers (like W-2 Parser) fields are predefined for you.
-- For the user defined Custom Document Extractor the schema corresponds to the labels defined by the user and is fixed once the table is created (thus if you need to make changes to the Extractor, you will need to either start using a new table or manually fix the schema).
+- For the user defined Custom Document Extractor the schema corresponds to the labels defined by the user and is fixed once the table is created (thus if you need to make changes to the Extractor, you will need to either start using a new table or manually fix the schema).
 - In order to use same BigQuery table for different Custom Document Extractors, they must use the same schema (and data types being extracted).
-
 #### Form/OCR Parser
 Form/OCR Parser Big Query Table Schema:
@@ -132,10 +129,8 @@ Form/OCR Parser Big Query Table Schema:
 Sample Extracted Data using Form/OCR parser:
 ![GENERIC FORMS DATA](img/generic_forms_data.png)
-
 ### Specialized Processors
-
 Corresponding Big Query Table Schema extracted:
 - TODO
@@ -153,14 +148,13 @@ Corresponding Big Query Table Schema extracted:
-
 ## Setup
 ### Preparation
-The goal is to be able to re-direct in the real time each document to the appropriate Document Extractor.
+The goal is to be able to re-direct in the real time each document to the appropriate Document Extractor.
-As a preparation, the user needs to:
-- Define which document types are expected and which document extractors are needed.
+As a preparation, the user needs to:
+- Define which document types are expected and which document extractors are needed.
 - Train a classifier or splitter, that would be able to predict document class (and optionally identify document boundaries).
 - Deploy (and possibly train) required document extractors.
@@ -171,25 +165,27 @@ As a preparation, the user needs to:
 3. Run `gcloud init`, create a new project, and [enable billing](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project)
 4. Setup application default authentication, run:
- - `gcloud auth application-default login`
+- `gcloud auth application-default login`
 ### Deployment
 #### Setup
 * Create new GCP Project
+
 ```shell
 export PROJECT_ID=..
 export DOCAI_PROJECT_ID=...
 gcloud config set project $PROJECT_ID
 ```
-* If you want to make use of the existing Document AI processors in another project for example, set env variable for the Project where processors are located. Otherwise skip this step.
-* This is needed to setup proper access rights.
+* If you want to make use of the existing Document AI processors in another project for example, set the env variable for the Project where processors are located. Otherwise, skip this step.
+* This is needed to set up proper access rights.
 ```shell
 export DOCAI_PROJECT_ID=...
 ```
 * Run infrastructure setup script using newly created Project:
+
 ```shell
 ./setup.sh
 ```
@@ -211,7 +207,6 @@ Following script will generate following Document AI processors and update [conf
 ./create_demo_rpocessors.sh
 ```
-
 #### BigQuery Reservations
 * Create BigQuery Reservations: Before working with ML.PROCESS_DOCUMENT, you’ll need to turn on BigQuery Editions using the Reservations functionality.
@@ -251,12 +246,12 @@ Here is the explanation of the structure of the [config.json](classify-job/confi
 * Modify [config.json](classify-job/config/config.json) file to match your needs or leave it as is for the demo with taxes.
 * Copy file to GCS:
+
 ```shell
 source vars.sh
 gsutil cp classify-job/config/config.json gs://$CONFIG_BUCKET/
 ```
-
 ## Running the Pipeline
 ### Out-of-the box demo
@@ -290,9 +285,11 @@ To trigger single document processing:
 - Modify [vars.sh](vars.sh) and set `DATA_SYNCH` to `true`
 - Redeploy:
+
 ```shell
 ./deploy.sh
 ```
+
 - Upload pdf document into the `CLASSIFY_INPUT_BUCKET` bucket (defined in [vars.sh](vars.sh))
 Be mindful fo quotas (5 concurrent API requests). Therefore when you upload more than five times, this would trigger separate Pub/Sub events for each file and you will easily reach the quota limit.