SynSearch is a Python-based research paper analysis system that combines advanced NLP techniques, clustering algorithms, and scientific text processing to help researchers analyze and summarize large collections of scientific literature.
- Core Features
- System Architecture
- Installation
- Configuration
- Usage Guide
- API Reference
- Development
- Testing
- Performance Optimization
- Troubleshooting
- Multi-Dataset Support
  - XL-Sum dataset integration
  - ScisummNet dataset processing
  - Custom dataset handling capabilities
- Domain-Specific Processing
  - Scientific text preprocessing
  - Legal document handling
  - Metadata extraction
  - URL and special character normalization
- Robust Data Loading
  - Batch processing support
  - Progress tracking
  - Automatic validation
  - Performance optimization
- Dynamic Clustering
  - HDBSCAN implementation
  - Silhouette score calculation
  - Cluster quality metrics
  - Adaptive cluster sizing
- Hybrid Summarization System
  - Multiple summarization styles:
    - Technical summaries
    - Concise overviews
    - Detailed analyses
  - Batch processing support
  - GPU acceleration
```
synsearch/
├── src/
│   ├── api/             # API integrations
│   ├── preprocessing/   # Text preprocessing
│   ├── clustering/      # Clustering algorithms
│   ├── summarization/   # Summary generation
│   ├── utils/           # Utility functions
│   └── visualization/   # Visualization tools
├── tests/               # Test suite
├── config/              # Configuration files
├── data/                # Dataset storage
├── logs/                # Log files
├── cache/               # Cache storage
└── outputs/             # Generated outputs
```
- `DataLoader`: Handles dataset loading and validation
- `DataPreparator`: Prepares and preprocesses text data
- `DataValidator`: Ensures data quality and format

- `TextPreprocessor`: Handles text cleaning and normalization
- `DomainAgnosticPreprocessor`: Generic text preprocessing
- `EnhancedDataLoader`: Optimized data loading

- `ClusterManager`: Manages document clustering
- `EnhancedEmbeddingGenerator`: Generates text embeddings
- `HybridSummarizer`: Multi-style text summarization
- Python 3.8 or higher
- CUDA-compatible GPU (optional)
- 8GB RAM minimum (16GB recommended)
```bash
# Clone repository
git clone https://github.com/stochastic-sisyphus/synsearch.git
cd synsearch

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download required datasets
make download-data

# Initialize system
python -m src.initialization
```
```yaml
data:
  input_path: "data/raw"
  output_path: "data/processed"
  scisummnet_path: "data/scisummnet"
  batch_size: 32

preprocessing:
  min_length: 100
  max_length: 1000
  validation:
    min_words: 50

embedding:
  model_name: "bert-base-uncased"
  dimension: 768
  batch_size: 32
  max_seq_length: 512
  device: "cuda"

clustering:
  algorithm: "hdbscan"
  min_cluster_size: 5
  min_samples: 3
  metric: "euclidean"

summarization:
  model_name: "t5-base"
  max_length: 150
  min_length: 50
  batch_size: 16
```
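The configuration is plain YAML and can be loaded with PyYAML. A minimal sketch (the YAML is embedded here so the snippet is self-contained; in the project you would read it from a file under `config/`, whose exact name is an assumption):

```python
import yaml

# Fragment of a configuration like the one above
CONFIG_YAML = """
embedding:
  model_name: "bert-base-uncased"
  batch_size: 32
clustering:
  algorithm: "hdbscan"
  min_cluster_size: 5
"""

config = yaml.safe_load(CONFIG_YAML)

# Required keys are indexed directly; optional keys get a fallback
min_cluster_size = config["clustering"]["min_cluster_size"]
device = config.get("embedding", {}).get("device", "cpu")
```

Using `yaml.safe_load` (rather than `yaml.load`) avoids executing arbitrary Python tags embedded in a config file.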
- Performance optimization
- Cache management
- Logging configuration
- Visualization options
```python
from src.main import main

# Run the complete pipeline
main()
```
```python
from src.api.arxiv_api import ArxivAPI
from src.preprocessing.domain_agnostic_preprocessor import DomainAgnosticPreprocessor
from src.clustering.cluster_manager import ClusterManager

# Initialize components (config is the parsed configuration dictionary)
api = ArxivAPI()
preprocessor = DomainAgnosticPreprocessor()
cluster_manager = ClusterManager(config)

# Fetch, preprocess, and cluster papers
papers = api.search("quantum computing", max_results=50)
processed_texts = preprocessor.preprocess_texts([p['text'] for p in papers])
clusters, metrics = cluster_manager.perform_clustering(processed_texts)
```
- Use a Python 3.8+ virtual environment
- Install development dependencies:

  ```bash
  pip install -r requirements-dev.txt
  ```

- Set up pre-commit hooks:

  ```bash
  pre-commit install
  ```
- Follow PEP 8 guidelines
- Use type hints
- Document using Google docstring format
- Fork the repository
- Create feature branch
- Add tests
- Submit pull request
```bash
# Run all tests
pytest tests/

# Run specific test modules
pytest tests/test_preprocessor.py
pytest tests/test_clustering.py
```
- Unit tests for all components
- Integration tests for pipelines
- Performance benchmarks
- Batch size optimization
- Worker count adjustment
- GPU utilization
- Memory management
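Batch-size tuning amounts to processing inputs in fixed-size chunks so memory use stays bounded. A minimal batching helper (an illustrative sketch, not the project's API) looks like:

```python
from itertools import islice

def batched(items, batch_size=32):
    """Yield successive fixed-size batches from an iterable.

    The trailing batch may be smaller than batch_size.
    """
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

sizes = [len(b) for b in batched(range(100), batch_size=32)]
# sizes == [32, 32, 32, 4]
```

Smaller batches reduce peak memory (and GPU memory) at the cost of throughput; larger batches do the reverse, which is why `batch_size` is exposed per stage in the configuration.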
- Embedding cache
- Dataset cache
- Results cache
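The embedding cache can be pictured as a disk memo keyed by a hash of the input text. This is an illustrative sketch only; the project's actual cache layer (and its on-disk layout under `cache/`) may differ:

```python
import hashlib
import pickle
from pathlib import Path

def cached_embed(text, embed_fn, cache_dir):
    """Return embed_fn(text), memoizing the result on disk."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():  # cache hit: skip the expensive embedding call
        return pickle.loads(path.read_bytes())
    result = embed_fn(text)
    path.write_bytes(pickle.dumps(result))  # cache miss: store for next time
    return result
```

On a repeated run over the same corpus, every lookup becomes a cache hit and the embedding model is never invoked.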
- Memory errors
  - Reduce batch size
  - Enable disk caching
- GPU errors
  - Check CUDA installation
  - Reduce model size
- Dataset loading issues
  - Verify paths
  - Check file permissions
- Logs stored in `logs/synsearch.log`
- Debug-level logging available
- Performance metrics tracking
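Enabling debug-level logging follows the standard `logging` pattern. A sketch (SynSearch writes to `logs/synsearch.log`; an in-memory stream is used here so the example runs anywhere, and the exact format string is an assumption):

```python
import io
import logging

logger = logging.getLogger("synsearch-demo")
logger.setLevel(logging.DEBUG)

# Attach a handler; in the project this would be a FileHandler
# pointed at logs/synsearch.log
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

logger.debug("embedding batch complete")
```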
[License information pending]
- @stochastic-sisyphus
[Contact information pending]