The Search API is a high-performance gRPC-based service designed exclusively for executing advanced search queries over indexed documents in Apache Solr. This API leverages semantic vector embeddings to enhance search accuracy and relevance, making it ideal for applications requiring intelligent and context-aware document retrieval.
Note: Document indexing is handled by a separate service. This API focuses solely on search functionalities.
- gRPC-Based Communication: High-performance, low-latency communication using gRPC.
- Semantic Vector Search: Utilize vector embeddings for semantic understanding and accurate search results.
- Keyword Search: Traditional keyword-based search capabilities.
- Caching Mechanism: In-memory caching of vector embeddings to optimize performance and reduce redundant calculations, with persistence to disk.
- Flexible Search Strategies: Combine semantic and keyword search strategies using logical operators.
- Highlighting: Highlight matched terms and snippets in search results.
- Faceted Search: Advanced faceting options for refined search results.
- Programming Language: Java 17+
- Framework: Micronaut
- Search Engine: Apache Solr (Version 9.7.0)
- Communication Protocol: gRPC
- Containerization: Docker
- Testing: JUnit 5, Testcontainers
- Build Tool: Maven
- Java: Ensure Java 17 or higher is installed.
- Docker: Required for running containerized services.
- Maven: For building the project.
- Clone the Repository:
    git clone https://github.com/krickert/search-api.git
    cd search-api
    ./mvnw install package
The Search API is designed to provide a comprehensive search functionality over indexed documents in a Solr-based search engine. The API will support both keyword and semantic searches and will integrate with an Embedding Service for vector-based querying.
- Inline Vectors: Support indexing where the vector representation is included within the same document.
- Embedded Documents: Allow chunking of fields and embedding documents within the main document.
- Outside Join: Index chunked fields into a new collection for advanced queries.
The API will support various query types to enhance search capabilities:
- Semantic Matching: Utilize vector-based search for retrieving documents that are semantically similar to the query.
- Keyword Matching: Perform traditional keyword-based search queries.
- Keyword with Semantic Boost: Combine keyword search with an additional boost from semantic vectors.
- Semantic with Keyword Boost: Boost semantic search results using keyword matching.
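The four query types above could be mapped to Solr request parameters roughly as follows. This is a minimal sketch rather than the project's actual implementation: it assumes Solr 9's `{!knn}` query parser for the semantic part and the `{!rerank}` parser for the boost. The class name `QueryStrategySketch`, the field name `vector`, and the `topK`, `reRankDocs`, and `reRankWeight` values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: maps each query type to Solr request parameters. */
final class QueryStrategySketch {

    enum Strategy { SEMANTIC, KEYWORD, KEYWORD_WITH_SEMANTIC_BOOST, SEMANTIC_WITH_KEYWORD_BOOST }

    // Assumed rerank settings; real values would come from configuration.
    private static final String RERANK = "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=2}";

    /** Formats a vector as a Solr knn query, e.g. {!knn f=vector topK=10}[0.1, 0.2]. */
    static String knn(String field, int topK, float[] vector) {
        StringJoiner values = new StringJoiner(", ", "[", "]");
        for (float v : vector) {
            values.add(Float.toString(v));
        }
        return "{!knn f=" + field + " topK=" + topK + "}" + values;
    }

    /** Builds the request parameters for one strategy. */
    static Map<String, String> params(Strategy strategy, String keywords, float[] vector) {
        String knnQuery = knn("vector", 10, vector); // "vector" is an assumed field name
        Map<String, String> p = new LinkedHashMap<>();
        switch (strategy) {
            case SEMANTIC:
                p.put("q", knnQuery);
                break;
            case KEYWORD:
                p.put("q", keywords);
                break;
            case KEYWORD_WITH_SEMANTIC_BOOST:
                // Keyword query first, then rerank the top hits by vector similarity.
                p.put("q", keywords);
                p.put("rq", RERANK);
                p.put("rqq", knnQuery);
                break;
            case SEMANTIC_WITH_KEYWORD_BOOST:
                // Vector query first, then rerank the top hits by keyword score.
                p.put("q", knnQuery);
                p.put("rq", RERANK);
                p.put("rqq", keywords);
                break;
        }
        return p;
    }
}
```

When `{!knn}` is used as the rerank query, it only contributes a score for documents that fall within its own `topK` nearest neighbors, so `topK` may need to be larger than in the pure semantic case.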
- Dynamic Configuration: Allow dynamic configuration of Solr collections and query parameters.
- Vector Configuration: Support inline, embedded, or external collection configurations for vector fields.
- Support for Filter Queries: Implement filtering (`fq`) for refined search results.
- Highlighting: Use Solr's highlighting capabilities to highlight matched snippets in search results.
- Matched Snippets: Return matched snippets from either chunked or inline text.
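As an illustration of the filter-query and highlighting requirements, the sketch below assembles a Solr query string using the standard `fq`, `hl`, `hl.fl`, and `hl.snippets` parameters. The class name `SearchParamsSketch` and the field names used in the usage note are hypothetical. A real implementation would more likely go through a Solr client library than build the string by hand.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds a Solr query string with filters and highlighting. */
final class SearchParamsSketch {

    static String toQueryString(String q, List<String> filters, String highlightField, int snippets) {
        List<String> parts = new ArrayList<>();
        parts.add(pair("q", q));
        // Each filter becomes its own fq parameter; Solr intersects them.
        for (String filter : filters) {
            parts.add(pair("fq", filter));
        }
        parts.add(pair("hl", "true"));                              // enable highlighting
        parts.add(pair("hl.fl", highlightField));                   // field to highlight
        parts.add(pair("hl.snippets", Integer.toString(snippets))); // max snippets per field
        return String.join("&", parts);
    }

    private static String pair(String name, String value) {
        return name + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, `SearchParamsSketch.toQueryString("solr", List.of("type:pdf"), "body", 2)` yields `q=solr&fq=type%3Apdf&hl=true&hl.fl=body&hl.snippets=2`, where `type` and `body` are assumed field names.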
- Embedding Service Integration: Integrate with an `EmbeddingService` to generate embeddings for text.
- gRPC Protocol: Use gRPC for communication with the embedding service.
- Caching: Implement caching for embeddings to minimize redundant calculations. The cache should:
  - Be in-memory and shared between tests.
  - Persist to disk at the end of the execution.
  - Use document IDs as keys for vector retrieval.
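The caching requirements above could be met along these lines. This is a sketch under stated assumptions, not the project's implementation: the class `EmbeddingCacheSketch` is hypothetical, the embedding call is represented by a plain `Supplier`, and Java serialization is used only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical in-memory embedding cache keyed by document ID, with disk persistence. */
final class EmbeddingCacheSketch {

    private final Map<String, float[]> vectors = new ConcurrentHashMap<>();

    /** Returns the cached vector, computing it at most once per document ID. */
    float[] get(String documentId, Supplier<float[]> embedder) {
        return vectors.computeIfAbsent(documentId, id -> embedder.get());
    }

    int size() {
        return vectors.size();
    }

    /** Writes a snapshot of the cache to disk, e.g. at the end of a test run. */
    void saveTo(Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new HashMap<>(vectors));
        }
    }

    /** Loads a previously saved cache, or returns an empty one if the file is missing. */
    @SuppressWarnings("unchecked")
    static EmbeddingCacheSketch loadFrom(Path file) throws IOException, ClassNotFoundException {
        EmbeddingCacheSketch cache = new EmbeddingCacheSketch();
        if (Files.exists(file)) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                cache.vectors.putAll((Map<String, float[]>) in.readObject());
            }
        }
        return cache;
    }
}
```

To share the cache between tests, a single instance could be held in a static field and `saveTo` called from a JVM shutdown hook or a suite-level teardown, so that the file is written once at the end of execution.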
- Scalability: Ensure the API can handle a large number of requests and documents efficiently.
- Asynchronous Processing: Support asynchronous processing for queries and embedding generation.
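One way to express the asynchronous flow is to chain embedding generation and the vector search with `CompletableFuture`. The sketch below is illustrative only: `AsyncSearchSketch` and its two function parameters are hypothetical stand-ins for the embedding service call and the Solr query, and an actual service might instead rely on gRPC's asynchronous stubs or the framework's reactive types.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

/** Hypothetical sketch: embed the query text, then search, without blocking the caller. */
final class AsyncSearchSketch {

    static CompletableFuture<List<String>> search(
            String queryText,
            Function<String, float[]> embed,                 // stand-in for the embedding service
            Function<float[], List<String>> searchByVector,  // stand-in for the Solr query
            Executor executor) {
        return CompletableFuture
                // Step 1: generate the query embedding on the supplied executor.
                .supplyAsync(() -> embed.apply(queryText), executor)
                // Step 2: run the vector search once the embedding is available.
                .thenApplyAsync(searchByVector, executor);
    }
}
```

The caller receives the future immediately and can attach further stages or call `join()` when the matching document IDs are needed.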
- Authentication: Implement secure authentication mechanisms for accessing the API.
- Data Protection: Ensure data integrity and protection during transmission.
- Logging: Implement logging for debugging and monitoring purposes.
- Metrics: Collect metrics for API usage, performance, and error rates.
- API Documentation: Provide clear and comprehensive documentation for all API endpoints and usage examples.
- Developer Guides: Include developer guides for setup, configuration, and integration.
- Search Engine: Apache Solr (Version 9.7.0)
- Programming Language: Java
- Frameworks: Micronaut for building the API.
- Containerization: Use Docker for service containerization.
- gRPC: For communication with the embedding service.
- Development Environment: Ensure a local development setup for testing and debugging.
- Test Containers: Utilize Testcontainers for integration testing with Solr.
- Component Tests: Ensure that each component of the API is unit tested for functionality.
- End-to-End Tests: Implement integration tests that cover end-to-end scenarios, including indexing, querying, and caching.
- Load Testing: Conduct load testing to ensure the API can handle expected traffic.
- User Feedback Integration: Gather user feedback to improve search relevance and API usability.
- Advanced Query Features: Explore advanced query features like faceting, sorting, and recommendation systems.
This document outlines the comprehensive requirements for the development of the Search API. The focus is on delivering a robust, efficient, and user-friendly API that leverages the power of Solr and embedding technologies for enhanced search capabilities.
Below is a set of requirements