SambaNova AI Starter Kits

Data Extraction Examples

This kit includes a series of notebooks that demonstrate various methods for extracting text from documents in different input formats, including Markdown, PDF, CSV, RTF, DOCX, XLS, and HTML.

Deploy the starter kit

Option 1: Run through local virtual environment

Important: With this option, you must install some packages directly on your system:

pandoc (for loading local RTF files)

tesseract-ocr (for PDF OCR and table extraction)

poppler-utils (for PDF OCR and table extraction)
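Before installing the Python requirements, it can be useful to confirm that these system packages are actually available. The following is a minimal sketch, not part of the kit: the executable names are assumptions (tesseract-ocr usually provides `tesseract`, and poppler-utils usually provides `pdftoppm`), and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Package name -> executable expected on PATH.
# The executable names are assumptions and may differ on your platform.
REQUIRED_TOOLS = {
    "pandoc": "pandoc",
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system packages:", ", ".join(missing))
    else:
        print("All system packages found")
```

If a package is reported missing, install it with your system package manager and run the check again.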

Clone the repo.

git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git

(Recommended) Set up a venv or conda environment for installation.

cd ai-starter-kit
python3 -m venv data_extract_env
source data_extract_env/bin/activate
cd data_extraction
pip install -r requirements.txt

(Optional) Install the requirements for the Paddle utility. We recommend that you use a venv or conda environment for the installation.

If you want to use the Paddle OCR recipe for PDF OCR and table extraction, install from the requirementsPaddle.txt file instead of requirements.txt:

cd ai-starter-kit
python3 -m venv data_extract_env
source data_extract_env/bin/activate
cd data_extraction
pip install -r requirementsPaddle.txt

Some text extraction examples use the Unstructured library. Register at Unstructured.io to get a free API key, then create an environment file to store the API key and URL:

printf 'UNSTRUCTURED_API_KEY="your_API_key_here"\nUNSTRUCTURED_URL="your_API_url_here"\n' > .env

Alternatively, start the local parsing service and point the environment file at its URL. In this case the API key can be any value:

make start-parsing-service

printf 'UNSTRUCTURED_API_KEY="your_API_key_here"\nUNSTRUCTURED_URL="http://localhost:8005/general/v0/general"\n' > .env
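To confirm that the environment file was written as intended, it can help to read it back. The following is a minimal sketch, assuming the file holds simple KEY="value" lines. `load_env_file` is a hypothetical helper written for illustration; it is not part of the kit, which may load the file differently (for example with a package such as python-dotenv).

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY="value" lines into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    if Path(".env").exists():
        settings = load_env_file(".env")
        # Print only the variable names so the API key is never echoed.
        print("Variables found:", sorted(settings))
    else:
        print("No .env file in the current directory")
```

If the output lists a single variable, or a name that contains a literal `\n`, the shell did not expand the newline when the file was created and the file should be rewritten.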

Option 2: Run via Docker

With this option, all functionality and Jupyter notebooks are ready to use.

Ensure that you have the Docker engine installed (see the Docker installation guide).
Clone the repo.

git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git

Some text extraction examples use the Unstructured library. Register at Unstructured.io to get a free API key, then create an environment file to store the API key and URL:

printf 'UNSTRUCTURED_API_KEY="your_API_key_here"\nUNSTRUCTURED_URL="your_API_url_here"\n' > .env

Alternatively, start the local parsing service and point the environment file at its URL. In this case the API key can be any value:

make start-parsing-service

printf 'UNSTRUCTURED_API_KEY="your_API_key_here"\nUNSTRUCTURED_URL="http://host.docker.internal:8005/general/v0/general"\n' > .env

Run the data extraction Docker container:

sudo docker-compose up data_extraction_service

(Optional) Run the data extraction Docker container for the Paddle utility.

If you want to use the Paddle OCR recipe for PDF OCR and table extraction, start the data_extraction_paddle_service container instead:

sudo docker-compose up data_extraction_paddle_service

File loaders

The notebooks folder has several data extraction recipes and pipelines:

CSV Documents

csv_extraction.ipynb: Examples of text extraction from CSV files using different packages. Depending on your use case, some packages may perform better than others.
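As a rough illustration of what text extraction from a CSV file can look like, here is a minimal sketch that uses only Python's standard `csv` module. It is not taken from the notebook, which compares other packages; the `csv_rows_to_text` function and the sample data are made up for illustration.

```python
import csv
import io


def csv_rows_to_text(csv_source):
    """Turn each CSV row into one 'column: value, ...' line of text."""
    reader = csv.DictReader(csv_source)  # first row is treated as the header
    lines = []
    for row in reader:
        lines.append(", ".join(f"{col}: {val}" for col, val in row.items()))
    return lines


# In-memory sample; pass an open file object to read a real CSV file.
sample = io.StringIO("name,role\nAda,engineer\nLin,analyst\n")
for line in csv_rows_to_text(sample):
    print(line)
```

Keeping the column name next to each value is one simple way to preserve context when the rows are later split into chunks for retrieval.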

XLS/XLSX Documents

xls_extraction.ipynb: Examples of text extraction from files in different input formats using the Unstructured library. Section 2 includes two examples, one using the Unstructured API and the other using the local unstructured loader.

DOC/DOCX Documents

docx_extraction.ipynb: Examples of text extraction from files in different input formats using the Unstructured library. Section 3 includes two examples, one using the Unstructured API and the other using the local unstructured loader.

RTF Documents

rtf_extraction.ipynb: Examples of text extraction from files in different input formats using the Unstructured library. Section 4 includes two examples, one using the Unstructured API and the other using the local unstructured loader.

Markdown Documents

markdown_extraction.ipynb: Examples of text extraction from files in different input formats using the Unstructured library. Section 5 includes two examples, one using the Unstructured API and the other using the local unstructured loader.

HTML Documents

web_extraction.ipynb: Examples of text extraction from files in different input formats using the Unstructured library. Section 6 includes two loading examples, one using the Unstructured API and the other using the local unstructured loader.

PDF Documents

pdf_extraction.ipynb: Examples of text extraction from PDF documents using several OCR and non-OCR packages. Depending on your specific use case, some packages may perform better than others.
retrieval_from_pdf_tables.ipynb: Examples of a simple RAG retriever and a multivector RAG retriever for retrieval from PDFs that contain tables. For SambaNova model endpoint usage, refer to the top-level ai-starter-kit README.
qa_qc_util.ipynb: Simple utility for visualizing text boxes extracted using the Fitz package. This visualization can be particularly helpful when dealing with complex multi-column PDF documents, and in the debugging process.
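To show why text box positions matter in multi-column PDFs, here is a minimal sketch that reorders boxes into reading order. It is an illustration, not the utility's code: it assumes each box is an `(x0, y0, x1, y1, text)` tuple, that the page has exactly two columns split at its horizontal midpoint, and that y grows downward. The `two_column_reading_order` function and the sample coordinates are hypothetical.

```python
def two_column_reading_order(blocks, page_width):
    """Return box texts ordered left column first, then right, top to bottom."""
    midpoint = page_width / 2

    def sort_key(block):
        x0, y0 = block[0], block[1]
        # Boxes that start left of the midpoint belong to the first column.
        column = 0 if x0 < midpoint else 1
        return (column, y0, x0)

    return [block[4] for block in sorted(blocks, key=sort_key)]


# Hypothetical boxes, listed in the order a naive extractor might emit them.
boxes = [
    (310, 50, 580, 90, "right column, first paragraph"),
    (30, 120, 290, 160, "left column, second paragraph"),
    (30, 50, 290, 90, "left column, first paragraph"),
]
for text in two_column_reading_order(boxes, page_width=612):
    print(text)
```

Real layouts often include full-width titles, figures, or more than two columns, which is where visual inspection of the extracted boxes becomes useful.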

Included files

data: Sample data for running the notebooks. Also used as storage for intermediate steps.
src: Source code for some functionalities used in the notebooks.
docker: Docker file for the data extraction starter kit.

Third-party tools and data sources

All the packages/tools are listed in the requirements.txt file in the project directory.
