SynSearch

Overview

SynSearch is a Python-based research paper analysis system that combines NLP techniques, clustering algorithms, and scientific text processing to help researchers analyze and summarize large collections of scientific literature.

📚 Table of Contents

  1. Core Features
  2. System Architecture
  3. Installation
  4. Configuration
  5. Usage Guide
  6. API Reference
  7. Development
  8. Testing
  9. Performance Optimization
  10. Troubleshooting

Core Features

📖 Document Processing

  • Multi-Dataset Support (see the dataset-loading sketch after this list)
    • XL-Sum dataset integration
    • ScisummNet dataset processing
    • Custom dataset handling capabilities
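
A minimal sketch of loading one of the supported datasets, assuming the Hugging Face datasets package and its public csebuetnlp/xlsum dataset; SynSearch's own DataLoader may wrap this differently:

from datasets import load_dataset

# Load the English split of XL-Sum (records carry "title", "text", and "summary" fields)
xlsum = load_dataset("csebuetnlp/xlsum", "english", split="train")
print(xlsum[0]["title"], "->", xlsum[0]["summary"][:100])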

🧠 Advanced Text Processing

  • Domain-Specific Processing (a toy example follows this list)
    • Scientific text preprocessing
    • Legal document handling
    • Metadata extraction
    • URL and special character normalization
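
A rough illustration of the normalization steps listed above (URL removal, special-character cleanup); this is a toy function, not the project's TextPreprocessor:

import re

def normalize_text(text: str) -> str:
    """Toy normalizer: strip URLs, drop non-printable characters, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)    # drop control/special characters
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

print(normalize_text("See https://arxiv.org/abs/1234.5678 for\tdetails"))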

🔄 Data Pipeline

  • Robust Data Loading (see the sketch after this list)
    • Batch processing support
    • Progress tracking
    • Automatic validation
    • Performance optimization
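
A minimal sketch of batched loading with progress tracking, assuming tqdm is installed; EnhancedDataLoader presumably layers validation and caching on top of something like this:

from tqdm import tqdm

def iter_batches(items, batch_size=32):
    """Yield fixed-size batches while showing a progress bar."""
    for start in tqdm(range(0, len(items), batch_size), desc="Loading batches"):
        yield items[start:start + batch_size]

documents = [f"document {i}" for i in range(1000)]
for batch in iter_batches(documents):
    pass  # validate / preprocess each batch here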

🎯 Clustering & Analysis

  • Dynamic Clustering (a minimal sketch follows this list)
    • HDBSCAN implementation
    • Silhouette score calculation
    • Cluster quality metrics
    • Adaptive cluster size
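
A minimal sketch of this clustering step using the hdbscan and scikit-learn packages directly; ClusterManager wraps logic of this kind with its own configuration and quality metrics:

import numpy as np
import hdbscan
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(200, 768)  # stand-in for real document embeddings

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3, metric="euclidean")
labels = clusterer.fit_predict(embeddings)

clustered = labels != -1  # silhouette is only defined for non-noise points
if clustered.any() and len(set(labels[clustered])) > 1:
    print("Silhouette score:", silhouette_score(embeddings[clustered], labels[clustered]))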

📊 Summarization

  • Hybrid Summarization System (see the sketch after this list)
    • Multiple summarization styles:
      • Technical summaries
      • Concise overviews
      • Detailed analyses
    • Batch processing support
    • GPU acceleration
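
A hedged sketch of multi-style summarization with the Hugging Face transformers pipeline and the configured t5-base model; HybridSummarizer's actual interface may differ:

from transformers import pipeline

# device=-1 runs on CPU; set device=0 to use the first GPU
summarizer = pipeline("summarization", model="t5-base", device=-1)

styles = {
    "concise":  {"max_length": 60,  "min_length": 20},
    "detailed": {"max_length": 150, "min_length": 50},
}

text = "Quantum computing uses superposition and entanglement to ..."
for style, params in styles.items():
    summary = summarizer(text, **params)[0]["summary_text"]
    print(f"{style}: {summary}")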

System Architecture

Directory Structure

synsearch/
├── src/
│   ├── api/                 # API integrations
│   ├── preprocessing/       # Text preprocessing
│   ├── clustering/          # Clustering algorithms
│   ├── summarization/       # Summary generation
│   ├── utils/               # Utility functions
│   └── visualization/       # Visualization tools
├── tests/                   # Test suite
├── config/                  # Configuration files
├── data/                    # Dataset storage
├── logs/                    # Log files
├── cache/                   # Cache storage
└── outputs/                 # Generated outputs

Key Components

1. Data Management

  • DataLoader: Handles dataset loading and validation
  • DataPreparator: Prepares and preprocesses text data
  • DataValidator: Ensures data quality and format

2. Text Processing

  • TextPreprocessor: Handles text cleaning and normalization
  • DomainAgnosticPreprocessor: Generic text preprocessing
  • EnhancedDataLoader: Optimized data loading

3. Analysis

  • ClusterManager: Manages document clustering
  • EnhancedEmbeddingGenerator: Generates text embeddings
  • HybridSummarizer: Multi-style text summarization

Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (optional)
  • 8GB RAM minimum (16GB recommended)

Setup Steps

# Clone repository
git clone https://github.com/stochastic-sisyphus/synsearch.git
cd synsearch

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download required datasets
make download-data

# Initialize system
python -m src.initialization

Configuration

Basic Configuration (config/config.yaml)

data:
  input_path: "data/raw"
  output_path: "data/processed"
  scisummnet_path: "data/scisummnet"
  batch_size: 32

preprocessing:
  min_length: 100
  max_length: 1000
  validation:
    min_words: 50

embedding:
  model_name: "bert-base-uncased"
  dimension: 768
  batch_size: 32
  max_seq_length: 512
  device: "cuda"

clustering:
  algorithm: "hdbscan"
  min_cluster_size: 5
  min_samples: 3
  metric: "euclidean"

summarization:
  model_name: "t5-base"
  max_length: 150
  min_length: 50
  batch_size: 16
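
These values are read by the pipeline components at startup. A minimal sketch of loading the file with PyYAML (the loader in src may differ):

import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["clustering"]["algorithm"])  # "hdbscan"
print(config["embedding"]["batch_size"])  # 32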

Advanced Settings

  • Performance optimization
  • Cache management
  • Logging configuration
  • Visualization options

Usage Guide

Basic Usage

from src.main import main

# Run complete pipeline
main()

Custom Pipeline

import yaml

from src.api.arxiv_api import ArxivAPI
from src.preprocessing.domain_agnostic_preprocessor import DomainAgnosticPreprocessor
from src.clustering.cluster_manager import ClusterManager

# Load pipeline settings (assumes the default config/config.yaml shown above)
with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

# Initialize components
api = ArxivAPI()
preprocessor = DomainAgnosticPreprocessor()
cluster_manager = ClusterManager(config)

# Process papers
papers = api.search("quantum computing", max_results=50)
processed_texts = preprocessor.preprocess_texts([p['text'] for p in papers])
clusters, metrics = cluster_manager.perform_clustering(processed_texts)

Development

Environment Setup

  • Use Python 3.8+ virtual environment
  • Install development dependencies: pip install -r requirements-dev.txt
  • Setup pre-commit hooks: pre-commit install

Code Style

  • Follow PEP 8 guidelines
  • Use type hints
  • Document using Google docstring format

Contributing

  1. Fork the repository
  2. Create feature branch
  3. Add tests
  4. Submit pull request

Testing

Running Tests

# Run all tests
pytest tests/

# Run specific test category
pytest tests/test_preprocessor.py
pytest tests/test_clustering.py

Test Coverage

  • Unit tests for all components (a minimal example follows this list)
  • Integration tests for pipelines
  • Performance benchmarks
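
A minimal example of the kind of unit test in the suite, assuming DomainAgnosticPreprocessor.preprocess_texts returns one processed string per input text (as used in the Usage Guide above); the real tests likely check much more:

from src.preprocessing.domain_agnostic_preprocessor import DomainAgnosticPreprocessor

def test_preprocess_texts_preserves_count():
    preprocessor = DomainAgnosticPreprocessor()
    texts = ["First   raw document.", "Second document with a URL: https://example.com"]
    processed = preprocessor.preprocess_texts(texts)
    assert len(processed) == len(texts)
    assert all(isinstance(t, str) for t in processed)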

Performance Optimization

Automatic Optimization

  • Batch size optimization (see the sketch after this list)
  • Worker count adjustment
  • GPU utilization
  • Memory management
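
A hedged sketch of the kind of batch-size heuristic this refers to, assuming PyTorch; the project's optimizer may use different thresholds:

import torch

def pick_batch_size(default: int = 32) -> int:
    """Scale the batch size down on CPU-only machines and small GPUs."""
    if not torch.cuda.is_available():
        return min(default, 16)
    free_bytes, _ = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    return default if free_gb >= 8 else max(4, default // 2)

print("Selected batch size:", pick_batch_size())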

Caching System

  • Embedding cache (see the sketch after this list)
  • Dataset cache
  • Results cache
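
A minimal sketch of an embedding cache keyed by a hash of the input texts, assuming NumPy; the project's cache layer may organize files differently:

import hashlib
from pathlib import Path
import numpy as np

CACHE_DIR = Path("cache/embeddings")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def load_or_compute(texts, compute_fn):
    """Return cached embeddings for these texts, computing and saving them on a miss."""
    key = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    embeddings = compute_fn(texts)
    np.save(path, embeddings)
    return embeddings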

Troubleshooting

Common Issues

  1. Memory errors
    • Reduce batch size
    • Enable disk caching
  2. GPU errors
    • Check CUDA installation (see the check after this list)
    • Reduce model size
  3. Dataset loading issues
    • Verify paths
    • Check file permissions
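
For GPU errors, a quick check that PyTorch can see CUDA at all (assuming PyTorch is the backend in use):

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))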

Logging

  • Logs stored in logs/synsearch.log
  • Debug level logging available (see the example after this list)
  • Performance metrics tracking
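
A hedged example of enabling debug-level logging to the same file with the standard logging module (the project may configure this elsewhere):

import logging

logging.basicConfig(
    filename="logs/synsearch.log",
    level=logging.DEBUG,  # switch to logging.INFO for normal runs
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger(__name__).debug("Debug logging enabled")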

License

[License information pending]

Contributors

  • @stochastic-sisyphus

Contact

[Contact information pending]
