COSMOS v3.0.0 Release Notes
Overview
COSMOS v3.0.0 introduces several major architectural changes. The primary feature is a new website reindexing system that lets COSMOS stay up to date with changes to source websites, addressing a key limitation of previous versions, in which a website could only be scraped once. This release includes updates to the data models, frontend interface, rule creation system, and backend processing, along with some bug fixes from v2.0.1.
The Environmental Justice (EJ) system has been significantly expanded, growing from fewer than 100 manually curated datasets to approximately 1,000 datasets through machine learning classification of NASA CMR records. This expansion is supported by a new modular processing suite that generates and extracts metadata using Subject Matter Expert (SME) criteria.
To support future machine learning integration, COSMOS now implements a sophisticated two-column system that allows fields to maintain both ML-generated classifications and manual curator overrides. This system has been seamlessly integrated into the data models, serializers, and APIs, ensuring that both automated and human-curated data can coexist while maintaining clear precedence rules.
To ensure reliability and maintainability of these major changes, this release includes extensive testing coverage with 213 new tests spanning URL processing, pattern management, Environmental Justice functionality, workflow triggers, and data migrations. Additionally, we've added comprehensive documentation across 15 new README files that cover everything from fundamental pattern system concepts to detailed API specifications and ML integration guidelines.
Major Features
Reindexing System
- New Data Models: Introduced DumpUrl, DeltaUrl, and CuratedUrl to support the reindexing workflow
- Automated Workflows:
  - New process to calculate deltas, deletions, and additions during migration
  - Automatic promotion of DeltaUrls to CuratedUrls
  - Status-based triggers for data ingestion and processing
- Duplicate Prevention: System now prevents duplicate patterns and URLs
- Enhanced Frontend:
  - Added reindexing status column to collection and URL list pages
  - New deletion tracking column on URL list page
  - Updated collection list to display delta URL counts
  - Improved URL list page accessibility via delta URL count
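The delta calculation described above can be illustrated with a minimal sketch. It assumes that a fresh dump is compared against the curated set by URL, and that a changed title counts as a modification. `UrlRecord` and `calculate_deltas` are hypothetical names used for illustration only; they are not the actual DumpUrl, DeltaUrl, or CuratedUrl models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRecord:
    """Hypothetical stand-in for a dump or curated URL row."""
    url: str
    title: str


def calculate_deltas(dump, curated):
    """Compare a fresh dump against the curated set.

    Returns three sorted lists of URL strings:
    (additions, modifications, deletions).
    Assumption: records are matched on their URL, and any field
    difference (here only the title) marks a modification.
    """
    dump_by_url = {record.url: record for record in dump}
    curated_by_url = {record.url: record for record in curated}

    # In the dump but not yet curated.
    additions = sorted(dump_by_url.keys() - curated_by_url.keys())
    # Curated but missing from the new dump.
    deletions = sorted(curated_by_url.keys() - dump_by_url.keys())
    # Present on both sides with differing content.
    modifications = sorted(
        url
        for url in dump_by_url.keys() & curated_by_url.keys()
        if dump_by_url[url] != curated_by_url[url]
    )
    return additions, modifications, deletions


dump = [
    UrlRecord("https://example.org/1", "One"),
    UrlRecord("https://example.org/2", "Two (updated)"),
    UrlRecord("https://example.org/4", "Four"),
]
curated = [
    UrlRecord("https://example.org/1", "One"),
    UrlRecord("https://example.org/2", "Two"),
    UrlRecord("https://example.org/3", "Three"),
]
additions, modifications, deletions = calculate_deltas(dump, curated)
```

In this sketch, unchanged URLs produce no delta at all, so only the three returned lists would need curator attention before promotion.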
Pattern System Improvements
- Complete modularization of the pattern system
- Enhanced handling of edge cases including overlapping patterns
- Improved unapply logic
- Functional inclusion rules
- Pattern precedence system: most specific pattern takes priority, with pattern length as tiebreaker
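The precedence rule can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: "most specific" is taken to mean the pattern that matches the fewest URLs in the collection, patterns are shell-style wildcards, and `select_pattern` is a hypothetical helper.

```python
import fnmatch


def select_pattern(url, patterns, all_urls):
    """Return the pattern that takes priority for `url`, or None.

    Assumptions (not confirmed by the release notes):
    - specificity is the number of URLs a pattern matches, fewer
      matches meaning more specific;
    - on a tie, the longer pattern string wins.
    """
    candidates = [p for p in patterns if fnmatch.fnmatchcase(url, p)]
    if not candidates:
        return None

    def match_count(pattern):
        return sum(fnmatch.fnmatchcase(u, pattern) for u in all_urls)

    # Sort key: fewest matches first, then longest pattern.
    return min(candidates, key=lambda p: (match_count(p), -len(p)))


urls = [
    "https://example.gov/docs/a.html",
    "https://example.gov/docs/b.html",
    "https://example.gov/blog/c.html",
]
broad = "https://example.gov/*"
narrow = "https://example.gov/docs/*"
winner = select_pattern(urls[0], [broad, narrow], urls)
```

Here `narrow` matches two URLs and `broad` matches three, so `narrow` is selected for the first URL.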
Environmental Justice (EJ) Enhancement
- Expanded from 92 manual datasets to 1063 ML-classified NASA CMR records
- New modular processing suite for metadata generation
- Enhanced API with multiple data sources:
  - Spreadsheet (original manual classifications)
  - ML Production
  - ML Testing
  - Combined (ML production with spreadsheet overrides)
- Custom processing suite for CMR metadata extraction
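One way to picture the Combined data source is as a merge in which spreadsheet entries replace ML production entries for the same dataset. The sketch below assumes records are keyed by a dataset identifier and overridden as whole records; the identifiers, field names, and `combine_sources` helper are hypothetical.

```python
def combine_sources(ml_production, spreadsheet):
    """Merge two {dataset_id: record} mappings.

    Assumption: a spreadsheet (manual) record replaces the whole
    ML production record for the same dataset_id. The inputs are
    left unmodified.
    """
    combined = dict(ml_production)
    combined.update(spreadsheet)
    return combined


ml_production = {
    "dataset-1": {"indicators": ["Climate Change"], "source": "ml_production"},
    "dataset-2": {"indicators": ["Disasters"], "source": "ml_production"},
}
spreadsheet = {
    "dataset-2": {"indicators": ["Water Availability"], "source": "spreadsheet"},
    "dataset-3": {"indicators": ["Food Availability"], "source": "spreadsheet"},
}
combined = combine_sources(ml_production, spreadsheet)
```

Datasets present only in one source pass through unchanged, so the combined view is the union of both sources with manual classifications taking precedence.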
Infrastructure Updates
- Streamlined database backup and restore
- Optimized Docker builds
- Fixed LetsEncrypt staging issues
- Modified Traefik timeouts for long-running jobs
- Updated Sinequa worker configuration:
  - Reduced worker count to 3 for neural workload optimization
  - Added neural indexing to all webcrawlers
- Removed deprecated version mappings
API Enhancements
- New endpoints for curated and delta URLs:
  - GET /curated-urls-api/<str:config_folder>/
  - GET /delta-urls-api/<str:config_folder>/
- Backwards compatibility through remapped CandidateUrl endpoint
- Updated Environmental Justice API with new data source parameter
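As a small illustration of how a client might address the two new endpoints, the sketch below builds the request URLs for a given collection config folder. The host name and the `endpoint_url` helper are hypothetical, and the response format is not described in these notes, so no request is sent or parsed.

```python
from urllib.parse import quote, urljoin

# Hypothetical host; substitute the address of your COSMOS deployment.
BASE_URL = "https://cosmos.example.org/"


def endpoint_url(kind, config_folder, base_url=BASE_URL):
    """Build the URL of the curated or delta URL endpoint.

    `kind` must be "curated" or "delta"; `config_folder` is the
    collection's config folder name, percent-encoded for safety.
    """
    if kind not in ("curated", "delta"):
        raise ValueError(f"unknown endpoint kind: {kind!r}")
    path = f"{kind}-urls-api/{quote(config_folder, safe='')}/"
    return urljoin(base_url, path)


curated_endpoint = endpoint_url("curated", "my_collection")
delta_endpoint = endpoint_url("delta", "my_collection")
```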
Technical Improvements
Two-Column System
- New architecture to support dual ML/manual classifications
- Seamless integration with models, serializers, and APIs
- Prioritization system for manual overrides
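The precedence between the two columns can be sketched with a Python descriptor. This is a simplified illustration, not the project's implementation: the `PairedField` class, the `_manual`/`_ml` attribute suffixes, and the `Record` class are assumptions made for the example.

```python
class PairedField:
    """Hypothetical descriptor over `<name>_manual` and `<name>_ml`.

    Reading the field returns the manual value when one is set and
    falls back to the ML value otherwise.
    """

    def __set_name__(self, owner, name):
        self.manual_attr = f"{name}_manual"
        self.ml_attr = f"{name}_ml"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        manual = getattr(instance, self.manual_attr, None)
        if manual is not None:
            return manual
        return getattr(instance, self.ml_attr, None)

    def __set__(self, instance, value):
        # Assumption: plain assignment is treated as a curator override,
        # so the ML value is preserved alongside it.
        setattr(instance, self.manual_attr, value)


class Record:
    """Hypothetical record with one paired field."""

    division = PairedField()

    def __init__(self, division_ml=None, division_manual=None):
        self.division_ml = division_ml
        self.division_manual = division_manual


record = Record(division_ml="astrophysics")
before_override = record.division
record.division = "heliophysics"
after_override = record.division
```

Keeping both values side by side means a later ML rerun can update the ML column without discarding a curator's decision.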
Testing
Added 213 new tests across multiple areas:
- URL APIs and processing (19 tests)
- Delta and pattern management (31 tests)
- Environmental Justice API (7 tests)
- Environmental Justice mappings and thresholding (58 tests)
- Workflow and status triggers (10 tests)
- Migration and promotion processes (31 tests)
- Field modifications and TDAMM tags (25 tests)
- Additional system functionality (30 tests)
Documentation
Added comprehensive documentation across 15 READMEs covering:
- Pattern system fundamentals and examples
- Reindexing statuses and triggers
- Model lifecycles and testing procedures
- URL inclusion/exclusion logic
- Environmental Justice classifier and API
- ML column functionality
- SQL dump restoration
Bug Fixes
- Fixed non-functional includes
- Resolved pagination issues for patterns (previously limited to 50)
- Eliminated ability to create duplicate URLs and patterns
- Corrected faulty unapply logic for modification patterns
- Fixed unrepeatable logic for overlapping patterns
- Allowed long-running jobs to complete without timeouts
UI Updates
- Renamed application from "SDE Indexing Helper" to "COSMOS"
- Refactored collection list code for easier column management
- Enhanced URL list page with new status and deletion tracking
- Improved navigation through delta URL count integration
Administrative Changes
- Added new admin panels for enhanced system management
- Updated installation requirements
- Enhanced database backup and restore functionality
What's Changed (PR Log)
- remove force reindexing from templates by @CarsonDavis in #1018
- point tree root to name by @CarsonDavis in #1027
- Change LRM dev configurations by @bishwaspraveen in #1034
- get URLs from scrapers folder for LRM servers by @bishwaspraveen in #1037
- change EnableNeuralIndexing to true in indexing template by @CarsonDavis in #1070
- Retrieve Full-Texts from Sinequa Dev Servers by @saifrk in #1077
- add per indicator thrsholding and new dump by @CarsonDavis in #1073
- 1051 backend model changes on cosmos to hold new incoming urls frontend by @dhanur-sharma in #1090
- 1051 backend model changes on cosmos to hold new incoming urls by @bishwaspraveen in #1069
- 1105 improve pattern application and exclusion management by @CarsonDavis in #1109
- remove destination_server and add datasource by @CarsonDavis in #1108
- Update cmr mappings by @CarsonDavis in #1102
- 1115 improve title processing and tests by @CarsonDavis in #1118
- Affected Delta URLs header added by @dhanur-sharma in #1117
- 3034 cosmos api test cases by @dhanur-sharma in #1114
- Pagination on the Sinequa sql.engine Api by @saifrk in #1104
- Updated page title to URLs by @dhanur-sharma in #1120
- Refresh page on workflow status change by @dhanur-sharma in #1124
- Refactor Two Column to work with Delta Urls by @Kirandawadi in #1103
- add initial reindexing statuses by @CarsonDavis in #1125
- View deleted URLs under Delta URLs page by @dhanur-sharma in #1121
- 1126 managepy command for database backups by @CarsonDavis in #1127
- 3055 optmize the retrieval of url counts on admin page by @bishwaspraveen in #1131
- Updated database restore command by @dhanur-sharma in #1130
- 1133 refactor indexing statuses logic by @CarsonDavis in #1134
- Updated dockerignore and gitignore by @dhanur-sharma in #1135
- 1139 resolve id conflict when promoting by @CarsonDavis in #1140
- Updated title pane to Delta URLs by @dhanur-sharma in #1141
- refactor readme for unapply logic and refactor unapply to account for overlapping patterns by @CarsonDavis in #1146
- Filters fixed by @dhanur-sharma in #1145
- add new field to reindexing statuses by @CarsonDavis in #1148
- Conditional anchor updated for 0 Delta URLs by @dhanur-sharma in #1161
- fixed paging on excludes and includes tabs by @bishwaspraveen in #1163
- 1150 status button color matches by @dhanur-sharma in #1162
- Add documentation for PairedFieldDescriptor implementation by @Kirandawadi in #1160
New Contributors
- @saifrk made their first contribution in #1077
- @dhanur-sharma made their first contribution in #1090
Full Changelog: 3f85f26...8df561a