Add Oracle Database Document Loader and Parser #7251

Open
wants to merge 23 commits into
base: main

@Minjun1Kim Minjun1Kim commented Nov 24, 2024

feat: Add Oracle Database Document Loader and Parser

Description

This PR introduces support for loading and parsing documents from Oracle databases into LangChain.js. By implementing this feature, we align the JavaScript version of LangChain with the Python library, providing a consistent API for Oracle database integration across both ecosystems.

Motivation

Currently, LangChain.js does not support Oracle databases, which keeps JavaScript developers from integrating Oracle data sources into their large language model (LLM) applications. This feature bridges that gap, enabling:

  • Seamless integration of Oracle-stored data into LangChain.js applications.
  • Standardized document processing for Oracle database content using LangChain's Document and BaseDocumentLoader abstractions.

Adding Oracle database support lets LangChain.js developers access Oracle data without relying on external tools or libraries beyond the oracledb driver.

Key Features

  1. Oracle Document Loader (OracleDocLoader)

    • Supports loading documents from:
      • Oracle Tables.
      • Local files (e.g., .html).
      • Directories containing multiple files.
    • Converts Oracle data into standardized Document objects.
  2. Metadata Parsing with ParseOracleDocMetadata

    • Extracts metadata from HTML documents, parsing <title> and <meta> tags.
    • Ensures compatibility with Oracle HTML documents and metadata extraction workflows.
  3. File Reader with OracleDocReader

    • Reads files and generates unique document IDs using timestamp and hash-based methods.
    • Processes data using Oracle’s dbms_vector_chain package.
  4. Integration with LangChain

    • Adheres to LangChain's design principles, extending BaseDocumentLoader to create a consistent API.
  5. Validation and Security

    • Validates SQL identifiers to prevent SQL injection attacks.
    • Implements error handling for missing parameters, invalid configurations, and unsupported data types.
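
As an illustration of the identifier validation mentioned above, here is a minimal sketch. It is not the code from this PR: the helper names isValidSqlIdentifier and assertValidSqlIdentifier are hypothetical, and the rule it enforces (unquoted identifiers that start with a letter, contain only letters, digits, _, $ or #, are at most 128 characters long, and carry at most one schema prefix) is an assumption about what a conservative check could look like. Identifiers cannot be passed as bind variables, so an allow-list check like this is a common way to guard table and column names before they are interpolated into SQL.

```typescript
// Hypothetical helpers for illustration only; not the PR's implementation.
// One identifier part: a letter, then letters, digits, _, $ or #,
// for a total length of at most 128 characters.
const IDENTIFIER_PART = /^[A-Za-z][A-Za-z0-9_$#]{0,127}$/;

// Returns true for "table" or "schema.table" style names made of safe parts.
function isValidSqlIdentifier(name: string): boolean {
  const parts = name.split(".");
  if (parts.length > 2) return false; // allow at most one schema prefix
  return parts.every((part) => IDENTIFIER_PART.test(part));
}

// Returns the name unchanged when it is safe, and throws otherwise.
function assertValidSqlIdentifier(name: string): string {
  if (!isValidSqlIdentifier(name)) {
    throw new Error(`Invalid SQL identifier: ${JSON.stringify(name)}`);
  }
  return name;
}
```

Values (as opposed to identifiers) would still go through bind parameters rather than string interpolation.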

Changes Made

  • Added OracleDocLoader, ParseOracleDocMetadata, and OracleDocReader classes.
  • Introduced support for Oracle-specific document loading and metadata extraction.
  • Created integration tests for file, directory, and table-based loading.
  • Added unit tests to validate parsing and loader functionality.
  • Created a walkthrough example using LangChain’s Cookbook format (in progress).

New Files

Source Code

  • langchain-ai/langchainjs/libs/langchain-community/src/document_loaders/web/oracleai.ts: Contains the implementations of OracleDocLoader, ParseOracleDocMetadata, and OracleDocReader.

Tests

  • langchain-ai/langchainjs/libs/langchain-community/src/document_loaders/tests/oracleai.test.ts:
    • Unit tests for ParseOracleDocMetadata.
    • Unit and integration tests for OracleDocLoader.

Example Data

  • langchain-ai/langchainjs/libs/langchain-community/src/document_loaders/tests/example_data/oracleai/: Sample data for testing file and directory loading.

Tests and Coverage

Unit Tests

  • Metadata Parsing Tests:

    • Validates parsing of <title> and <meta> tags.
    • Handles edge cases such as missing or empty tags.
  • Document Loading Tests:

    • Tests loading from files, directories, and tables.
    • Ensures proper error handling for invalid parameters and unsupported data types.
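
To make the metadata parsing cases above concrete, here is a minimal sketch of extracting <title> and <meta> values, including the missing and empty tag cases. It is an assumption-laden illustration rather than the PR's ParseOracleDocMetadata: the names parseHtmlMetadata and getAttribute are hypothetical, and it uses regular expressions for brevity, so attribute values containing ">" are not handled. A full HTML parser would be more robust.

```typescript
// Hypothetical sketch for illustration only; not the PR's ParseOracleDocMetadata.
type HtmlMetadata = Record<string, string>;

// Reads a double- or single-quoted attribute value from a single tag string.
function getAttribute(tag: string, attribute: string): string | undefined {
  const pattern = new RegExp(
    `\\s${attribute}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`,
    "i"
  );
  const match = pattern.exec(tag);
  return match ? match[1] ?? match[2] : undefined;
}

function parseHtmlMetadata(html: string): HtmlMetadata {
  const metadata: HtmlMetadata = {};

  // <title>...</title>: missing or whitespace-only titles are skipped.
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = titleMatch?.[1]?.trim();
  if (title) metadata["title"] = title;

  // <meta name="..." content="...">: attributes may appear in either order.
  // Tags without both a non-empty name and a content attribute are ignored.
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    const name = getAttribute(tag, "name")?.trim();
    const content = getAttribute(tag, "content");
    if (name && content !== undefined) metadata[name] = content;
  }
  return metadata;
}
```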

Integration Tests

  • Interacts with a live Oracle database instance to test:
    • Loading from Oracle tables with various data types and schemas.
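
Table loading has to turn rows with differing column types into Document objects, and that mapping can be exercised without a live database. The sketch below shows one way it might look. Everything in it is an assumption for illustration: DocumentLike is a local stand-in that mirrors the pageContent and metadata fields of LangChain's Document, rowToDocument is a hypothetical helper, and the set of supported types (string, number, Date, null) is chosen only to show the unsupported-type error path.

```typescript
// Hypothetical sketch for illustration only; not the PR's OracleDocLoader.
// Local stand-in with the same two fields as LangChain's Document.
interface DocumentLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

type Row = Record<string, unknown>;

// Converts one row into a document: the content column becomes pageContent,
// and the remaining columns plus the table name are kept as metadata.
function rowToDocument(
  row: Row,
  contentColumn: string,
  table: string
): DocumentLike {
  const value = row[contentColumn];
  let pageContent: string;
  if (typeof value === "string") {
    pageContent = value;
  } else if (typeof value === "number") {
    pageContent = String(value);
  } else if (value instanceof Date) {
    pageContent = value.toISOString();
  } else if (value === null || value === undefined) {
    pageContent = "";
  } else {
    throw new Error(`Unsupported data type in column ${contentColumn}`);
  }

  const metadata: Record<string, unknown> = { source: table };
  for (const [column, columnValue] of Object.entries(row)) {
    if (column !== contentColumn) metadata[column] = columnValue;
  }
  return { pageContent, metadata };
}
```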

Documentation

API Documentation (in progress)

  • Updated to include new classes and methods for Oracle database support.
  • Includes code examples for all new methods.

Community Engagement

  • Discussion Post: Initiated a discussion to propose the feature.
  • Issue Creation: Opened a feature request to track progress and gather feedback.
  • Approval: Received approval from LangChain.js maintainers.

Checklist

  • Implemented OracleDocLoader, ParseOracleDocMetadata, and OracleDocReader.
  • Added error handling and SQL identifier validation.
  • Created comprehensive unit and integration tests.
  • Ensured compliance with LangChain.js contributing guidelines.

Notes

  • Dependencies: Introduces oracledb as a new dependency for Oracle database connectivity.
  • No Breaking Changes: The implementation is self-contained and does not affect existing modules.

Thank you for your time and consideration. Please let me know if you’d like me to refine any part of this PR.

vercel bot commented Nov 24, 2024

The latest updates on your projects.

Name: langchainjs-docs · Status: ❌ Failed · Updated (UTC): Nov 27, 2024 4:39am

1 Skipped Deployment

Name: langchainjs-api-refs · Status: ⬜️ Ignored · Updated (UTC): Nov 27, 2024 4:39am

@dosubot dosubot bot added the size:XL and auto:enhancement labels Nov 24, 2024
@Minjun1Kim
Author

@jacoblee93 We're still working on the documentation and tidying up the code, but we wanted to get your thoughts and feedback on the changes we've made so far. Thank you!

@jacoblee93
Collaborator

Hey there! Thanks for this PR - it looks like the peer dep you're adding has some funny licensing. I don't fully understand the implications of it, and would be cautious because it's Oracle. Will put it on hold for now.

@jacoblee93 jacoblee93 added the hold On hold label Nov 25, 2024
@AshwinM1523

AshwinM1523 commented Nov 25, 2024

Hi Jacob, just wanted to confirm: are you referring to the licensing of OracleDB? I noticed that the Python LangChain library also uses it as a peer dependency for its Oracle implementation.

@dosubot dosubot bot added the size:XXL label and removed the size:XL label Nov 26, 2024
@Minjun1Kim
Author

@jacoblee93 Do you have any idea why the deployment to Preview-langchainjs-docs is failing? We looked at another PR thread and saw that relative paths should be replaced with absolute paths. We tried that fix, and we may be overlooking something, but we can't pinpoint what's causing the failure. A response would be greatly appreciated.

Labels: auto:enhancement, hold, size:XXL
5 participants