This Python application is a document classification tool that uses large language models (LLMs) on Amazon Bedrock to classify documents through in-context learning. The application allows users to upload PDF or image files and classify the content of these files into predefined labels or categories.
The `classifier.ipynb` notebook walks you through this solution. Alternatively, a Streamlit app provides a better user experience for classifying documents.
- File Upload: Users can upload PDF or image files through the Streamlit interface.
- Document Processing: The application handles the upload and processing of PDF and image files, including converting PDF files into individual image pages (to be used with Claude 3 Vision). You can select between Amazon Textract and Claude 3 Vision for document processing.
- Text Extraction: For PDF and image files, the application uses Amazon Textract to extract the text content from the documents. The extracted text is cached in an Amazon S3 bucket for future use.
- Document Classification: Users can provide a manifest file containing a list of possible labels or categories for the documents. The application prompts the selected Claude language model with the extracted text and the list of possible labels, and the model generates a response classifying the document content into one or more of the provided labels.
- Model Selection: The application supports various Claude models, including `claude-3-sonnet`, `claude-3-haiku`, `claude-instant-v1`, `claude-v2`, and `claude-v2:1`. Users can select the desired model through the Streamlit sidebar.
- Cost Calculation: The application calculates and displays the cost of using the selected model based on input and output token pricing defined in the `pricing.json` file.
- Caching and Persistence: The application caches the extracted text and Textract results in the S3 bucket to avoid redundant processing.
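The classification call can be sketched as below, assuming the Anthropic Messages API format that Bedrock uses for Claude 3 models. The prompt wording, label format, and function name are illustrative, not the application's exact implementation:

```python
import json

def build_classification_request(document_text, labels, max_tokens=200):
    """Build an Anthropic Messages API request body for Bedrock's
    invoke_model, asking the model to pick one of the provided labels."""
    label_list = "\n".join(f"- {label}" for label in labels)
    prompt = (
        "Classify the document below into exactly one of these labels:\n"
        f"{label_list}\n\n"
        f"<document>\n{document_text}\n</document>\n\n"
        "Respond with only the label name."
    )
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }

# Actually sending the request requires AWS credentials and Bedrock access:
# import boto3
# bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
# body = build_classification_request(text, ["W2", "PayStub"])
# response = bedrock.invoke_model(
#     modelId="anthropic.claude-3-haiku-20240307-v1:0",
#     body=json.dumps(body),
# )
# answer = json.loads(response["body"].read())["content"][0]["text"]
```

Note that the older `claude-instant-v1` and `claude-v2` models use a different (text-completion) request format; the sketch above covers only the Claude 3 Messages style.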
- Set up the necessary AWS resources:
- Create an S3 bucket (if you do not already have one) to store uploaded documents and Textract output.
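As a rough sketch of the Textract step, the response from `detect_document_text` can be reduced to plain text by joining its `LINE` blocks. The helper name is illustrative, not necessarily what this application uses:

```python
def textract_blocks_to_text(response):
    """Join the LINE blocks of a Textract detect_document_text response
    into a single plain-text string, preserving line order."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    )

# Calling Textract itself requires AWS credentials:
# import boto3
# textract = boto3.client("textract")
# with open("page-1.png", "rb") as f:
#     response = textract.detect_document_text(Document={"Bytes": f.read()})
# text = textract_blocks_to_text(response)
```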
The application's behavior can be customized by modifying the `config.json` file. Here are the available options:

- `Bucket_Name`: The name of the S3 bucket used for caching documents and extracted text.
- `max-output-token`: The maximum number of output tokens allowed for the AI assistant's response.
- `bedrock-region`: The AWS region where the Bedrock runtime is deployed.
- `s3_path_prefix`: The S3 bucket path where uploaded documents are stored (without the trailing forward slash).
- `textract_output`: The S3 bucket path where the content extracted by Textract is stored (without the trailing forward slash).
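A `config.json` with these options might look like the following; the bucket name, region, and path values here are placeholders, not defaults shipped with the project:

```json
{
  "Bucket_Name": "my-doc-classifier-bucket",
  "max-output-token": 500,
  "bedrock-region": "us-east-1",
  "s3_path_prefix": "uploads",
  "textract_output": "textract-cache"
}
```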
- Update the `pricing.json` file with the latest pricing information for the Claude models on Amazon Bedrock.
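The cost calculation can be sketched as below, assuming prices are expressed in USD per 1,000 tokens (the unit Bedrock publishes). The `pricing.json` schema and the price figures shown are illustrative assumptions; check the Amazon Bedrock pricing page for real numbers:

```python
def estimate_cost(input_tokens, output_tokens, model_pricing):
    """Estimate the cost of a single model call, assuming model_pricing
    holds USD-per-1,000-token rates for input and output."""
    return (
        input_tokens / 1000 * model_pricing["input"]
        + output_tokens / 1000 * model_pricing["output"]
    )

# Hypothetical pricing entry, mirroring a possible pricing.json layout.
pricing = {"claude-3-haiku": {"input": 0.00025, "output": 0.00125}}
cost = estimate_cost(2000, 400, pricing["claude-3-haiku"])
```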
If you already have a SageMaker Studio domain set up, skip the first item; however, item 2 is still required.
- Set Up SageMaker Studio
- The SageMaker execution role should have permissions to interact with Bedrock, Textract, and S3.
- Launch SageMaker Studio
- Clone this Git repo into Studio
- Open a system terminal by clicking on Amazon SageMaker Studio and then System Terminal as shown in the diagram below
- Navigate into the cloned repository directory using the `cd` command and run `pip install -r req.txt` to install the needed Python libraries.
- Run `python3 -m streamlit run classifier.py --server.enableXsrfProtection false --server.enableCORS false` to start the Streamlit server. Do not use the links generated by the command, as they won't work in Studio.
- To enter the Streamlit app, open and run the cell in the `StreamlitLink.ipynb` notebook. This will generate the appropriate link to enter your Streamlit app from SageMaker Studio. Click on the link to enter your Streamlit app.
- ⚠ Note: If you rerun the Streamlit server, it may use a different port. Take note of the port used (the port number is the last 4-digit number after the last `:` (colon)) and modify the `port` variable in the `StreamlitLink.ipynb` notebook to get the correct link.
To run this Streamlit app on AWS EC2 (tested on the Ubuntu image):
- Create a new EC2 instance.
- Expose TCP ports 8500-8510 for inbound connections in the security group attached to the EC2 instance. TCP port 8501 is needed for Streamlit to work. See image below.
- Connect to your EC2 instance.
- Run the appropriate commands to update the instance (`sudo apt update` and `sudo apt upgrade` for Ubuntu).
- Clone this Git repo: `git clone [github_link]`
- Install Python 3 and pip if not already installed.
- Ensure the EC2 instance profile role has the required permissions to access the services used by this application, as mentioned above.
- Install the dependencies in the `req.txt` file by running `sudo pip install -r req.txt`.
- Run `tmux new -s mysession`. Then, in the new session created, `cd` into the ChatBot directory and run `python3 -m streamlit run classifier.py` to start the Streamlit app. This allows you to run the Streamlit application in the background and keep it running even if you disconnect from the terminal session.
- Copy the External URL link generated and paste it in a new browser tab.
- To detach from the `tmux` session, in your EC2 terminal press `Ctrl+b`, then `d`. To kill the session, run `tmux kill-session -t mysession`.
- Launch the Streamlit application in your web browser.
- In the sidebar, select the desired Claude model and specify whether you want to classify labels per page or for the entire document.
- Upload a manifest file containing the list of possible labels and their descriptions: the label name, a colon, and the description, with one label-description pair per line, saved in a .txt file. For example:

```
drivers license : This is a US drivers license
W2 : This is a tax reporting form
Bank Statement : This is personal bank document
PayStub : This is an individual's pay info
```

- Upload the PDF or image files you want to classify.
- Click the "Classify" button to initiate the classification process.
- The application will display the classification results and the cost associated with using the selected model.
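Parsing the manifest format above can be sketched as follows; the function name is illustrative, not necessarily what the application uses internally:

```python
def parse_manifest(text):
    """Parse a manifest of 'label : description' lines into a dict,
    skipping lines without a colon and splitting on the first colon only."""
    labels = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        label, description = line.split(":", 1)
        labels[label.strip()] = description.strip()
    return labels
```

Splitting on the first colon only means descriptions may themselves contain colons without breaking the parse.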
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is licensed under the MIT License.