LLM-based classification
fscelliott committed Nov 19, 2024
1 parent 920f973 commit 74942fc
Showing 3 changed files with 40 additions and 61 deletions.
74 changes: 22 additions & 52 deletions openapi_classification.yml
@@ -27,21 +27,19 @@ paths:
summary: Classify document by type

description: |
Score a document's similarity to each document type you defined in your Sensible account and to each reference document in the highest-scoring type.
To retrieve the scores, poll the `download_link` in this endpoint's response until it returns a non-error response.
This endpoint is asynchronous. For more information about scores, expand the 200 response in the synchronous [classification](ref:classify-document-sync) endpoint.
Classify a document into one of the document types you defined in your Sensible account. For more information, see [Classifying documents by type](doc:classify).
Use this endpoint:
- In an extraction workflow. For example, determine which documents to extract prior to calling a Sensible extraction endpoint.
- Outside an extraction workflow. For example, to determine where to route each document or to label each document in a system of record.
- Outside an extraction workflow. For example, determine where to route each document or to label each document in a system of record.
To post the document bytes, specify the non-encoded document bytes as the entire request body, and specify the `Content-Type` header, for example, "application/pdf" or "image/jpeg".
For supported file sizes, see [Supported file types](doc:file-types).
For supported file size and types, see [Supported file types](doc:file-types).
requestBody:
$ref: '#/components/requestBodies/SupportedFileTypes'

@@ -77,7 +75,7 @@ paths:
**Note:** Use this Classify endpoint for testing. Use the asynchronous Classify endpoint for production.
Score a document's similarity to each document type you defined in your Sensible account. Get scores for the document's similarity to document types and to their reference documents.
Classify a document into one of the document types you defined in your Sensible account. For more information, see [Classifying documents by type](doc:classify).
Use this endpoint:
@@ -99,7 +97,7 @@ paths:
schema:
$ref: '#/components/schemas/ClassifySingleResponse'
description: |
The document type and reference documents in the Sensible account that are most similar to this document.
The document type in your Sensible account that's most similar to this document.
'401':
$ref: '#/components/responses/401'
'400':
@@ -111,12 +109,6 @@ paths:
'500':
$ref: '#/components/responses/500'







components:

responses:
@@ -172,8 +164,6 @@ components:
example: Sensible encountered an unknown error
#parameters:



securitySchemes:
bearerAuth: # arbitrary name for the security scheme
type: http
@@ -243,7 +233,8 @@ components:
type: string
description: File format of the document for which you requested classification.
download_link:
description: Poll until the download URL returns a non-error response. Links to a JSON download that contains the same response as from the synchronous Classify endpoint request.
description: |
Poll until the download URL returns a non-error response. Links to a JSON download that contains the same response as from the synchronous Classify endpoint request.
type: string
format: url
example:
@@ -256,7 +247,7 @@ components:
properties:
document_type:
type: object
description: Document type defined in the Sensible account that this doc is most similar to. To use a document type for classification, Sensible requires that the type contains at least one reference document.
description: The document type defined in your Sensible account that this document is most similar to.
properties:
id:
type: string
@@ -266,64 +257,43 @@ components:
description: User-friendly name for the document type.
score:
type: number
description: Similarity score comparing the document to the document type, between 0 and 1.
description: Deprecated. Similarity score comparing the document to the document type, where a score of 1 indicates a match.
reference_documents:
type: array
description: Reference documents uploaded to the Sensible account that this document is most similar to.
description: Deprecated. Scoring for embeddings-based classification, replaced by LLM-based classification.
items:
type: object
properties:
id:
type: string
description: Unique ID for the reference document.
description: Deprecated. Unique ID for the reference document.
name:
type: string
description: User-friendly name for the reference document.
description: Deprecated. User-friendly name for the reference document.
score:
type: number
description: Similarity score comparing the document to the reference document, between 0 and 1.
description: Deprecated. Similarity score comparing the document to the reference document, between 0 and 1.

classification_summary:
type: array
description: Scores for this document's similarity to each document type in the Sensible account, excluding document types Sensible created in your account as tutorials, such as `senseml_basics`.
description: Deprecated. Scoring for embeddings-based classification, replaced by LLM-based classification.
items:
type: object
properties:
id:
type: string
description: Unique ID for the document type.
description: Deprecated. Unique ID for the document type.
name:
type: string
description: User-friendly name for the document type.
description: Deprecated. User-friendly name for the document type.
score:
type: number
description: Similarity score comparing the document to the document type, between 0 and 1.
description: Deprecated. Similarity score comparing the document to the document type, between 0 and 1.

example:
document_type:
id: 77c2ab88-3389-4ea8-93c7-912c2bfd373a
name: 1040s
score: 0.9637581544082083
reference_documents:
- id: b4fbc822-de99-4916-b43a-2902131f2619
name: 1040_2020_sample
score: 0.999649884175599
- id: 23680cc8-7855-4698-b51f-6a054704fd1e
name: 1040_2019_sample
score: 0.983879165384638
- id: 58aef918-7017-4576-ad5c-f987b98b4ae7
name: 1040_2021_sample
score: 0.9670293766923486
- id: 161b27ab-5218-4650-a919-65df03de3454
name: senior_1040_2021_sample
score: 0.939401195335292
- id: fb9ed1c3-0545-4f79-bd00-565838bd96a4
name: 1040_2018_sample
score: 0.9288311504531641
classification_summary:
- id: 28eee728-e51b-471c-ba92-827c995476f6
name: home_policy_declaration_pages
score: 0.7760597095611765
- id: 16b06941-9486-475a-a6bf-120cf433f6f3
name: bank_statements
score: 0.7639481987557378
score: 1
reference_documents: []
classification_summary: []
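
After this change, a client would likely rely only on `document_type` and treat the deprecated fields as optional. A minimal Python sketch of that idea, assuming the response has already been parsed from JSON into a dict; the helper name `classified_type` is hypothetical and not part of any Sensible SDK:

```python
from typing import Optional, Tuple


def classified_type(response: dict) -> Optional[Tuple[str, str]]:
    """Return (id, name) of the matched document type, or None if absent.

    Deliberately ignores the deprecated `score`, `reference_documents`,
    and `classification_summary` fields.
    """
    doc_type = response.get("document_type") or {}
    type_id, name = doc_type.get("id"), doc_type.get("name")
    if type_id is None or name is None:
        return None
    return type_id, name


# Response shaped like the updated example in the schema above.
example = {
    "document_type": {
        "id": "77c2ab88-3389-4ea8-93c7-912c2bfd373a",
        "name": "1040s",
        "score": 1,
    },
    "reference_documents": [],
    "classification_summary": [],
}

print(classified_type(example))
```

Reading the deprecated fields with `.get()` and never requiring them keeps such a client working whether they arrive empty, as in the example, or are dropped later.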
18 changes: 12 additions & 6 deletions readme-sync/v0/document-type-classification/1000 - classify.md
@@ -11,18 +11,24 @@ Sensible supports two levels of document classification:

This topic covers classifying a document by its type.

For example, if you define a [bank statements](https://github.com/sensible-hq/sensible-configuration-library/tree/main/templates/Financial%20Services/Bank%20Statements) type and a [1040s](https://github.com/sensible-hq/sensible-configuration-library/tree/main/templates/Tax%20Forms/1040s) type in your account, you can classify 1040 forms, 1099 forms, Bank of America statements, Chase statements, and other documents, into those two types. In this scenario, for a `2023-1-1_bankofamerica_statement_jon_doe.pdf` document, Sensible:
Sensible classifies a document by comparing it to the types you define in your account. For example, you can classify 1040 forms and bank statements if you define the following types in your account:

- Classifies this document into the `bank_statements` document type.
- Classifies the statement doc by its similarity to reference documents in the `bank_statements` document type. The highest score is for [a Bank of America sample statement](https://github.com/sensible-hq/sensible-configuration-library/blob/main/templates/Financial%20Services/Bank%20Statements/refdocs/bank_of_america_sample.pdf).
- Provides metadata for the classification, including similarity scores for this document compared to each document type in your Sensible account and to each reference document in the `bank_statements` type.
- a [bank statements](https://github.com/sensible-hq/sensible-configuration-library/tree/main/templates/Financial%20Services/Bank%20Statements) type

- a [1040s](https://github.com/sensible-hq/sensible-configuration-library/tree/main/templates/Tax%20Forms/1040s) type

Sensible uses a document type's name and its description for LLM-based classification:

- If Sensible can't match your document to an existing document type in your account, it returns an error.
- Since Sensible doesn't use configs or reference documents for classification, Sensible can classify documents into your document types even if the document type lacks a config or example. For example, if you lack a `citibank` config or reference document in your `bank_statements` type, Sensible can still classify a `2023-1-1_citbank_statement_jon_doe.pdf` document as a bank statement.

To improve classification results, describe each document type in your account in its **Settings** tab. For examples of descriptions, see [Document type descriptions](doc:descriptions).

Use document type classification:

- Prior to an extraction workflow. For example, determine which documents to extract prior to calling a Sensible extraction endpoint.

- Independent from an extraction workflow. For example, determine where to route each document or to label each document in a system of record.

To improve classification results, Sensible recommends that a document type includes a sample set of reference documents that represent the diversity you expect to see in the document type. To use a document type for classification, Sensible requires that the type contains at least one reference document.

To classify documents, use the Sensible API or SDKs.
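
As a rough illustration of the API flow that the OpenAPI descriptions in this commit outline (post the raw document bytes with a `Content-Type` header, then poll the `download_link` until it returns a non-error response), here is a Python sketch. The function names, the endpoint URL parameter, the polling interval, and the reading of "non-error" as any 2xx status are all assumptions made for illustration, not Sensible's documented client code:

```python
import time
import urllib.request
from typing import Callable, Optional, Tuple


def build_classify_request(
    endpoint_url: str, api_key: str, document_bytes: bytes, content_type: str
) -> urllib.request.Request:
    """Build (but don't send) a POST whose entire body is the raw document bytes.

    `endpoint_url` is supplied by the caller; this sketch doesn't assume a URL.
    """
    return urllib.request.Request(
        endpoint_url,
        data=document_bytes,
        headers={
            "Authorization": f"Bearer {api_key}",  # bearerAuth security scheme
            "Content-Type": content_type,  # for example, "application/pdf"
        },
        method="POST",
    )


def poll_download_link(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url) until it returns a 2xx status, then return the body.

    Returns None if every attempt returned an error status.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if 200 <= status < 300:
            return body
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return None


# Usage with a stand-in fetcher: two error responses, then success.
responses = iter([(404, b""), (404, b""), (200, b'{"document_type": {"name": "1040s"}}')])
body = poll_download_link(
    lambda url: next(responses), "https://example.invalid/download", sleep=lambda s: None
)
print(body)
```

Passing `fetch` and `sleep` in as parameters keeps the polling loop testable without network access; in real use, `fetch` would wrap whatever HTTP client you already use.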

@@ -1,11 +1,14 @@
---
title: "LLM portfolio description"
title: "Document type descriptions"
hidden: false
---

Describe the document type to enable segmenting documents' page ranges from a [portfolio](doc:portfolio) file using LLMs. For example, describe a typical first page of a document type, a typical last page of a document type, and commonly found fields and their values.
Describe a document type in its **Settings** tab to:

Example of document type descriptions:
- Enable segmenting documents' page ranges from a [portfolio](doc:portfolio) file using LLMs. For example, describe a typical first page of a document type, a typical last page of a document type, and commonly found fields and their values.
- Improve [classifying](doc:classify) a document into an existing document type in your account.

Examples of document type descriptions:

- `To accurately classify this type of document look at the bottom left of each page of the document and if you see ACORD 131 then it is an instance of an Acord 131 form.`
- `This type of document is a scanned bank check. Usually only a single page.`