GitHub - yuvalsp-pelles/sycamore: 🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.

Sycamore is an open source, AI-powered document processing engine for ETL, RAG, LLM-based applications, and analytics on unstructured data. Sycamore can partition and enrich a wide range of document types including reports, presentations, transcripts, manuals, and more. It can analyze and chunk complex documents such as PDFs and images with embedded tables, figures, graphs, and other infographics. Check out an example notebook here.

For processing PDFs, Sycamore leverages the Aryn Partitioning Service, a serverless, GPU-powered API for segmenting and labeling documents, doing OCR, extracting tables and images, and more. It leverages Aryn's state-of-the-art, open source deep learning DETR AI model trained on 80k+ enterprise documents, and it can lead to 6x more accurate data chunking and 2x improved recall on hybrid search or RAG when compared to alternative systems. You can sign up for free here, or choose to run the Aryn Partitioner locally.

The Aryn Partitioning Service takes PDFs and returns the partitioned output in JSON, and you can use Sycamore for additional data extraction, enrichment, transforms, cleaning, and loading into downstream databases. You can choose the LLMs to use with these transforms.

Sycamore reliably loads your vector databases and hybrid search engines, including OpenSearch, Elasticsearch, Pinecone, DuckDB, and Weaviate, with higher quality data.

The Sycamore framework is built around a scalable and robust abstraction for document processing called a DocSet, and includes powerful high-level transformations in Python for data processing, enrichment, and cleaning. DocSets also encapsulate scalable data processing techniques, removing the undifferentiated heavy lifting of reliably loading chunks. The functional programming approach of DocSets allows you to rapidly customize and experiment with your chunking for better quality RAG results.
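To make the functional, chainable style concrete, here is a toy in-memory sketch. It is not Sycamore's DocSet API: `Doc`, `ToyDocSet`, `split_paragraphs`, and `add_length` are hypothetical names invented for illustration, and the real DocSet is distributed and far richer. The sketch only shows why a chain of small, side-effect-free transforms makes it cheap to swap one chunking strategy for another.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a document: text plus a properties dict."""
    text: str
    properties: dict = field(default_factory=dict)


class ToyDocSet:
    """Toy chainable collection of documents (NOT Sycamore's DocSet)."""

    def __init__(self, docs: Iterable[Doc]):
        self._docs = list(docs)

    def map(self, fn: Callable[[Doc], Doc]) -> "ToyDocSet":
        # Each transform returns a new collection; the input is untouched.
        return ToyDocSet(fn(d) for d in self._docs)

    def filter(self, pred: Callable[[Doc], bool]) -> "ToyDocSet":
        return ToyDocSet(d for d in self._docs if pred(d))

    def flat_map(self, fn: Callable[[Doc], Iterable[Doc]]) -> "ToyDocSet":
        # One document in, zero or more chunks out.
        return ToyDocSet(chunk for d in self._docs for chunk in fn(d))

    def take_all(self) -> List[Doc]:
        return list(self._docs)


def split_paragraphs(doc: Doc) -> List[Doc]:
    """Naive chunker: one chunk per blank-line-separated paragraph."""
    parts = [p.strip() for p in doc.text.split("\n\n") if p.strip()]
    return [Doc(p, {**doc.properties, "chunk": i}) for i, p in enumerate(parts)]


def add_length(doc: Doc) -> Doc:
    """Enrichment step: record the chunk length as a property."""
    return Doc(doc.text, {**doc.properties, "n_chars": len(doc.text)})


docs = ToyDocSet([Doc("Intro.\n\nA longer body paragraph.", {"source": "a.txt"})])
chunks = (
    docs.flat_map(split_paragraphs)   # chunk
    .map(add_length)                  # enrich
    .filter(lambda d: d.properties["n_chars"] > 10)  # clean
    .take_all()
)
```

Replacing `split_paragraphs` with a different chunking function changes the experiment without touching the rest of the chain, which is the property the paragraph above describes.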

Features

Integrated with the Aryn Partitioning Service, using a state-of-the-art vision AI model for segmentation and preserving the semantic structure of documents
DocSet abstraction to scalably and reliably transform and manipulate unstructured documents
High-quality table extraction, OCR, visual summarization, LLM-powered UDFs, and other performant Python data transforms
Quickly create vector embeddings using your choice of AI model
Helpful features like automatic data crawlers (Amazon S3 and HTTP), Jupyter notebook for writing and iterating on jobs, and an OpenSearch hybrid search and RAG engine for testing
Scalable Ray backend

Demo

Introduction to the Aryn Partitioning Service

Get Started

Sycamore currently runs on Linux and macOS. To install, run:

pip install sycamore-ai
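After installing, you may want to confirm that the package is visible to the Python interpreter you intend to use. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `sycamore-ai` is taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("sycamore-ai")
    if found is None:
        print("sycamore-ai is not installed in this environment")
    else:
        print(f"sycamore-ai {found} is installed")
```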

To use the Aryn Partitioning Service, sign up for free here and use the API key.
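One common way to make an API key available to a script without hard-coding it is an environment variable. The sketch below assumes the key is stored in a variable named `ARYN_API_KEY`; that name and the helper `load_api_key` are assumptions for illustration, so check the Sycamore documentation for how the key is actually supplied.

```python
import os


def load_api_key(env_var: str = "ARYN_API_KEY") -> str:
    """Read an API key from the environment, failing loudly if it is missing.

    The default variable name is an assumption, not confirmed by this README.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

Keeping the key out of source files also keeps it out of version control and shared notebooks.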

Next, you can choose to run a demo that prepares and ingests data from the Sort Benchmark website, crawl data from a public website, or write your own data preparation script.

For more info about writing Sycamore scripts, visit the Sycamore documentation.

Resources

Documentation: https://sycamore.readthedocs.io
Slack: https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg
Data preparation libraries (PyPI): https://pypi.org/project/sycamore-ai/
Contact us: [email protected]

Contributing

Check out our Contributing Guide for more information about how to contribute to Sycamore and set up your environment for development.
