Commit 8cb3225: Linting fixes

evekhm committed Jul 19, 2024 (parent: 4d8fdea)

Changed file: classify-split-extract-workflow/README.md (51 additions, 41 deletions)

However, you absolutely can and should train your own custom Splitter/Classifier.

![Architecture](img/architecture.png)

### Pipeline Execution Steps

> All environment variables referenced below are defined in the [vars.sh](vars.sh) file.

#### 1. Pipeline Execution Trigger
- The pipeline is triggered by uploading a document into the GCS bucket.
- Note that a dedicated bucket, `CLASSIFY_INPUT_BUCKET`, is configured to send Pub/Sub notifications.
- If the `DATA_SYNC` environment variable is set to `true`, any PDF document uploaded into `CLASSIFY_INPUT_BUCKET` triggers the pipeline.
- Otherwise, only the `START_PIPELINE` file triggers batch processing of all documents inside the uploaded folder (see the sketch after this list).
- Do not upload files into the `splitter_output` sub-folder. This is a `system` directory used to store split sub-documents and is therefore ignored.
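
For illustration, here is a minimal sketch of both trigger modes. It assumes that sourcing `vars.sh` exports the bucket variables into the current shell; the folder name `my-batch/` and the file names are placeholders.

```shell
# Load the environment variables (CLASSIFY_INPUT_BUCKET, etc.) defined in vars.sh.
source ./vars.sh

# Single-document trigger (requires DATA_SYNC=true): uploading a PDF starts the pipeline.
gsutil cp invoice.pdf "gs://${CLASSIFY_INPUT_BUCKET}/my-batch/"

# Batch trigger: upload the documents first, then drop the START_PIPELINE marker file.
gsutil cp *.pdf "gs://${CLASSIFY_INPUT_BUCKET}/my-batch/"
touch START_PIPELINE
gsutil cp START_PIPELINE "gs://${CLASSIFY_INPUT_BUCKET}/my-batch/"
```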

#### 2. Pub/Sub Event Forwarding
- The Pub/Sub event is forwarded to the [GCP Workflow execution](https://console.cloud.google.com/workflows/workflow/us-central1/classify-extract/).
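
For reference, the event wiring on the input bucket can be inspected from the CLI. The setup scripts create the actual configuration, which may be implemented differently (for example via an Eventarc trigger), so treat this as an inspection sketch rather than the pipeline's exact mechanism.

```shell
# List any Pub/Sub notification configurations attached to the input bucket.
gsutil notification list "gs://${CLASSIFY_INPUT_BUCKET}"

# List the Pub/Sub topics in the project to find the one used by the pipeline.
gcloud pubsub topics list
```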

#### 3. Workflow Execution
- The workflow checks the uploaded file and triggers a Cloud Run Job for Document Classification and Splitting.
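
To follow a run from the CLI, the workflow executions and the Cloud Run job executions can be listed as sketched below; the workflow name `classify-extract`, the job name `classify-job`, and the `us-central1` region are taken from the console links in this section.

```shell
# Recent executions of the classify-extract workflow.
gcloud workflows executions list classify-extract --location=us-central1

# Executions of the classification Cloud Run job it triggers.
gcloud run jobs executions list --job=classify-job --region=us-central1
```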

#### 4. Classification/Splitting by the Cloud Run Job
- The [Classification Cloud Run Job](https://console.cloud.google.com/run/jobs/details/us-central1/classify-job) performs the following tasks:
  - Uses the Document AI Classifier or Splitter defined in the [config.json](classify-job/config/config.json) file (`parser_config`/`classifier`).
  - For each document sent for processing:
    - Determines `confidence` and `type` (Classifier).
    - Determines page boundaries and the type of each page (Splitter).
    - Performs the splitting into the `splitter_output` sub-folder.
  - The `config.json` file defines the relation between Classifier labels and the Document Parsers to be used for those labels, as well as the output BigQuery table for each model.
  - Creates a JSON file inside the `CLASSIFY_OUTPUT_BUCKET` bucket. This file is the result of the classification/splitting job and is used for extraction.
    - The path to this JSON file is sent back to the GCP Workflow in the callback when the classification job is completed.
  - It also creates the BigQuery `mlops` tables required by the [ML.PROCESS_DOCUMENT](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document) function, such as:
    - [Object tables](https://cloud.google.com/bigquery/docs/object-tables) for the GCS documents.
    - MODEL objects for the Document AI parsers.
  - It also assigns GCS [custom metadata](https://cloud.google.com/storage/docs/metadata#custom-metadata) to the input documents with the classification results (`confidence` score and document `type`); a quick way to inspect this metadata is sketched after the JSON example below.
    - This metadata is then also saved into BigQuery.
    - Documents produced by splitting have metadata pointing to the original document.
  - Here is an example of the output JSON file, where:
    - _object_table_name_ contains all the documents that were classified/split and ended up having the same document type.
    - _model_name_ corresponds to the Document AI Extractor MODEL.
    - _out_table_name_ is the output BigQuery table name to be used for the extraction.

```json
[
  ...
  {
    "object_table_name": "classify-extract-docai-01.mlops.GENERIC_FORM_DOCUMENTS_20240713_071014814006",
    "model_name": "classify-extract-docai-01.mlops.OCR_PARSER_MODEL",
    ...
  },
  {
    "object_table_name": "classify-extract-docai-01.mlops.MISC1099_FORM_DOCUMENTS_20240713_071014814006",
    "model_name": "classify-extract-docai-01.mlops.MISC1099_PARSER_MODEL",
    "out_table_name": "classify-extract-docai-01.processed_documents.MISC1099"
  },
  ...
]
```
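
As a quick check of the custom metadata described above, the object metadata of an already-processed document can be printed with `gsutil`; the object path below is a hypothetical example.

```shell
# Print the object details, including the custom metadata attached by the classification job.
gsutil stat "gs://${CLASSIFY_INPUT_BUCKET}/my-batch/invoice.pdf"
```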

#### 5. Entity Extraction
- Entity extraction is performed by the [ML.PROCESS_DOCUMENT function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document) as the next step of the GCP Workflow, and the results are saved into BigQuery.
- The extraction uses the JSON file created by the classification job (see the sketch below).
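
Here is a minimal sketch of such an extraction query, using the object table and model names from the JSON example above. The actual query is issued by the workflow, so the exact statement and options may differ.

```shell
# Run the Document AI extractor model over every document in the object table.
# In the pipeline, the results end up in the out_table_name table from the JSON file.
bq query --use_legacy_sql=false '
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `classify-extract-docai-01.mlops.MISC1099_PARSER_MODEL`,
  TABLE `classify-extract-docai-01.mlops.MISC1099_FORM_DOCUMENTS_20240713_071014814006`)'
```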

#### 6. (Optional, Future) Data Integration
- As the final step of the Workflow execution, the extracted data can be sent downstream to a third party via an API call for further integration.

### Google Cloud Products Used

- [Document AI Processors][1]
- `LDAI Splitter & Classifier`

The BigQuery table schema is determined at runtime based on the Document AI processor.

#### Form/OCR Parser
Form/OCR Parser BigQuery table schema:

<img src="img/generic_forms.png" width="400"/>
<img src="img/generic_forms.png" width="400" alt="Generic Forms"/>

Sample Extracted Data using Form/OCR parser:
![GENERIC FORMS DATA](img/generic_forms_data.png)
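
To browse such extracted rows from the CLI, a query along these lines can be used; the table name below is only a plausible placeholder, since the real name is the `out_table_name` reported in the classification JSON.

```shell
# Preview a few rows of the extracted data (table name is a placeholder;
# use the out_table_name value from the classification output JSON).
bq query --use_legacy_sql=false \
  'SELECT * FROM `classify-extract-docai-01.processed_documents.GENERIC_FORM` LIMIT 10'
```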
#### Custom Document Extractor
User-defined Labels in the DocumentAI console:

<img src="img/pa-forms_docai.png" width="450"/>
<img src="img/pa-forms_docai.png" width="450" alt="PA Forms"/>

The corresponding extracted BigQuery table schema:

<img src="img/pa-forms.png" width="450"/>
<img src="img/pa-forms.png" width="450" alt="PA Forms Data"/>

## Setup


### Deployment

#### Environment Variables
* Create a new GCP project:

```shell
gcloud config set project $PROJECT_ID
```

* If you want to reuse existing Document AI processors from another project, set the environment variable to the project where those processors are located; otherwise, skip this step.
* This is needed to set up the proper access rights.

```shell
export DOCAI_PROJECT_ID=...
```

#### Infrastructure Setup
* Run the infrastructure setup script using the newly created project.

