Releases: NASA-IMPACT/COSMOS
v3.0.0
COSMOS v3.0.0 Release Notes
Overview
COSMOS v3.0.0 introduces several major architectural changes that fundamentally enhance the system's capabilities. The primary feature is a new website reindexing system that allows COSMOS to stay up-to-date with source website changes, addressing a key limitation of previous versions where websites could only be scraped once. This release includes comprehensive updates to the data models, frontend interface, rule creation system, and backend processing, along with some bug fixes from v2.0.1.
The Environmental Justice (EJ) system has been significantly expanded, growing from fewer than 100 manually curated datasets to approximately 1,000 datasets through the integration of machine learning classification of NASA CMR records. This expansion is supported by a new modular processing suite that generates and extracts metadata using Subject Matter Expert (SME) criteria.
To support future machine learning integration, COSMOS now implements a sophisticated two-column system that allows fields to maintain both ML-generated classifications and manual curator overrides. This system has been seamlessly integrated into the data models, serializers, and APIs, ensuring that both automated and human-curated data can coexist while maintaining clear precedence rules.
To ensure reliability and maintainability of these major changes, this release includes extensive testing coverage with 213 new tests spanning URL processing, pattern management, Environmental Justice functionality, workflow triggers, and data migrations. Additionally, we've added comprehensive documentation across 15 new README files that cover everything from fundamental pattern system concepts to detailed API specifications and ML integration guidelines.
Major Features
Reindexing System
- New Data Models: Introduced DumpUrl, DeltaUrl, and CuratedUrl to support the reindexing workflow
- Automated Workflows:
  - New process to calculate deltas, deletions, and additions during migration
  - Automatic promotion of DeltaUrls to CuratedUrls
  - Status-based triggers for data ingestion and processing
- Duplicate Prevention: System now prevents duplicate patterns and URLs
- Enhanced Frontend:
  - Added reindexing status column to collection and URL list pages
  - New deletion tracking column on URL list page
  - Updated collection list to display delta URL counts
  - Improved URL list page accessibility via delta URL count
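
The delta calculation listed under Automated Workflows can be pictured as a comparison between a freshly ingested dump and the existing curated set. What follows is a minimal sketch, not the COSMOS implementation: the name `compute_deltas`, the `DeltaResult` container, the simplified URL-to-title mapping, and the rule that a changed title counts as a modification are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DeltaResult:
    """Hypothetical container for the outcome of comparing a dump to curated data."""
    additions: dict[str, str] = field(default_factory=dict)
    modifications: dict[str, str] = field(default_factory=dict)
    deletions: set[str] = field(default_factory=set)


def compute_deltas(dump: dict[str, str], curated: dict[str, str]) -> DeltaResult:
    """Compare newly scraped URLs (dump) against the curated set.

    Both arguments map a URL to its scraped title (an assumed, simplified schema).
    """
    result = DeltaResult()
    for url, title in dump.items():
        if url not in curated:
            # URL is new since the last curation pass.
            result.additions[url] = title
        elif curated[url] != title:
            # URL is known but its scraped content changed.
            result.modifications[url] = title
    # Anything curated that no longer appears in the dump is a deletion.
    result.deletions = set(curated) - set(dump)
    return result


curated = {"https://example.com/a": "A", "https://example.com/b": "B"}
dump = {"https://example.com/a": "A (updated)", "https://example.com/c": "C"}
deltas = compute_deltas(dump, curated)
```

In this toy run, `/c` is an addition, `/a` is a modification, and `/b` is a deletion; unchanged URLs produce no delta at all.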
Pattern System Improvements
- Complete modularization of the pattern system
- Enhanced handling of edge cases including overlapping patterns
- Improved unapply logic
- Functional inclusion rules
- Pattern precedence system: most specific pattern takes priority, with pattern length as tiebreaker
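
The precedence rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: it assumes that "most specific" means "matches the fewest URLs in the collection", it uses shell-style wildcards from Python's `fnmatch` module as a stand-in for the real pattern syntax, and the helper names `matching_urls` and `winning_pattern` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional


def matching_urls(pattern: str, urls: list[str]) -> list[str]:
    """Return the URLs in the collection that a wildcard pattern matches."""
    return [u for u in urls if fnmatchcase(u, pattern)]


def winning_pattern(url: str, patterns: list[str], urls: list[str]) -> Optional[str]:
    """Pick the pattern that should apply to `url`.

    Assumed rule: the pattern matching the fewest URLs is the most specific;
    if two patterns match the same number of URLs, the longer pattern wins.
    """
    candidates = [p for p in patterns if fnmatchcase(url, p)]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (len(matching_urls(p, urls)), -len(p)))


urls = [
    "https://example.com/docs/a.html",
    "https://example.com/docs/b.html",
    "https://example.com/blog/c.html",
]
broad = "https://example.com/*"        # matches all three URLs
narrow = "https://example.com/docs/*"  # matches two URLs
short_tie = "*/docs/*"                 # also matches two URLs, but is shorter
chosen = winning_pattern(urls[0], [broad, narrow, short_tie], urls)
```

Here `narrow` beats `broad` because it matches fewer URLs, and it beats `short_tie` on the length tiebreaker.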
Environmental Justice (EJ) Enhancement
- Expanded from 92 manual datasets to 1063 ML-classified NASA CMR records
- New modular processing suite for metadata generation
- Enhanced API with multiple data sources:
  - Spreadsheet (original manual classifications)
  - ML Production
  - ML Testing
  - Combined (ML production with spreadsheet overrides)
- Custom processing suite for CMR metadata extraction
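
One way to read the "Combined" data source is as a merge in which spreadsheet entries replace ML production entries for the same dataset. The sketch below assumes records are keyed by a dataset identifier, that an override replaces the whole record rather than individual fields, and that spreadsheet-only datasets are kept; none of these details are stated in these notes, and `combine_sources` is a hypothetical name.

```python
def combine_sources(
    ml_production: dict[str, dict], spreadsheet: dict[str, dict]
) -> dict[str, dict]:
    """Merge two EJ data sources, letting spreadsheet records win on conflict.

    Keys are dataset identifiers (assumed); values are record dictionaries.
    """
    combined = dict(ml_production)  # start from the ML production records
    combined.update(spreadsheet)    # manual classifications override them
    return combined


ml_production = {
    "ds-1": {"indicator": "Climate Change", "source": "ML Production"},
    "ds-2": {"indicator": "Food Availability", "source": "ML Production"},
}
spreadsheet = {
    "ds-2": {"indicator": "Water Availability", "source": "Spreadsheet"},
}
combined = combine_sources(ml_production, spreadsheet)
```

The inputs are left unmodified, so the same ML production data can still be served on its own.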
Infrastructure Updates
- Streamlined database backup and restore
- Optimized Docker builds
- Fixed LetsEncrypt staging issues
- Extended Traefik timeouts for long-running jobs
- Updated Sinequa worker configuration:
  - Reduced worker count to 3 for neural workload optimization
  - Added neural indexing to all webcrawlers
- Removed deprecated version mappings
API Enhancements
- New endpoints for curated and delta URLs:
  - GET /curated-urls-api/<str:config_folder>/
  - GET /delta-urls-api/<str:config_folder>/
- Backwards compatibility through remapped CandidateUrl endpoint
- Updated Environmental Justice API with new data source parameter
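
As a rough illustration of how a client might address the two new URL endpoints, the sketch below only builds the request URL. The host `cosmos.example.org` is a placeholder, the helper `url_list_endpoint` is hypothetical, and the assumption that the endpoints sit at the site root comes only from the paths as listed; the response format is not described in these notes, so no request is made.

```python
from urllib.parse import quote, urljoin

# Placeholder host for illustration; substitute the real deployment URL.
BASE_URL = "https://cosmos.example.org/"


def url_list_endpoint(kind: str, config_folder: str, base_url: str = BASE_URL) -> str:
    """Build the URL of the curated or delta URL listing for one collection.

    `kind` is "curated" or "delta"; `config_folder` identifies the collection.
    """
    if kind not in {"curated", "delta"}:
        raise ValueError(f"unknown endpoint kind: {kind!r}")
    # Percent-encode the folder name so it is safe as a single path segment.
    folder = quote(config_folder, safe="")
    return urljoin(base_url, f"{kind}-urls-api/{folder}/")


curated_endpoint = url_list_endpoint("curated", "my_collection")
delta_endpoint = url_list_endpoint("delta", "my_collection")
```

The resulting strings can be passed to any HTTP client as the target of a GET request.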
Technical Improvements
Two-Column System
- New architecture to support dual ML/manual classifications
- Seamless integration with models, serializers, and APIs
- Prioritization system for manual overrides
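
The precedence between the two columns can be sketched with a plain Python descriptor. This is an assumption-laden illustration rather than the project's code: the class name `PairedField`, the `_manual` and `_ml` attribute suffixes, and the choice to treat plain assignment as a curator override are all hypothetical, and a real Django model would back both columns with database fields.

```python
class PairedField:
    """Hypothetical descriptor exposing one logical field backed by two columns.

    Reading returns the manual value when one is set, otherwise the ML value.
    """

    def __init__(self, name: str):
        self.manual_attr = f"{name}_manual"
        self.ml_attr = f"{name}_ml"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        manual = getattr(instance, self.manual_attr, None)
        if manual is not None:
            return manual  # curator override takes precedence
        return getattr(instance, self.ml_attr, None)

    def __set__(self, instance, value):
        # Assumption: assigning to the logical field records a manual override
        # and leaves the ML-generated value untouched.
        setattr(instance, self.manual_attr, value)


class Record:
    tag = PairedField("tag")

    def __init__(self, tag_ml=None, tag_manual=None):
        self.tag_ml = tag_ml
        self.tag_manual = tag_manual


record = Record(tag_ml="supernova")
before_override = record.tag
record.tag = "gamma-ray burst"
after_override = record.tag
```

Keeping the ML value intact after an override means the classifier output can still be inspected or re-evaluated later.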
Testing
Added 213 new tests across multiple areas:
- URL APIs and processing (19 tests)
- Delta and pattern management (31 tests)
- Environmental Justice API (7 tests)
- Environmental Justice mappings and thresholding (58 tests)
- Workflow and status triggers (10 tests)
- Migration and promotion processes (31 tests)
- Field modifications and TDAMM tags (25 tests)
- Additional system functionality (30 tests)
Documentation
Added comprehensive documentation across 15 READMEs covering:
- Pattern system fundamentals and examples
- Reindexing statuses and triggers
- Model lifecycles and testing procedures
- URL inclusion/exclusion logic
- Environmental Justice classifier and API
- ML column functionality
- SQL dump restoration
Bug Fixes
- Fixed non-functional includes
- Resolved pagination issues for patterns (previously limited to 50)
- Eliminated ability to create duplicate URLs and patterns
- Corrected faulty unapply logic for modification patterns
- Fixed unrepeatable logic for overlapping patterns
- Allowed long-running jobs to complete without timeouts
UI Updates
- Renamed application from "SDE Indexing Helper" to "COSMOS"
- Refactored collection list code for easier column management
- Enhanced URL list page with new status and deletion tracking
- Improved navigation through delta URL count integration
Administrative Changes
- Added new admin panels for enhanced system management
- Updated installation requirements
- Enhanced database backup and restore functionality
What's Changed (PR Log)
- remove force reindexing from templates by @CarsonDavis in #1018
- point tree root to name by @CarsonDavis in #1027
- Change LRM dev configurations by @bishwaspraveen in #1034
- get URLs from scrapers folder for LRM servers by @bishwaspraveen in #1037
- change EnableNeuralIndexing to true in indexing template by @CarsonDavis in #1070
- Retrieve Full-Texts from Sinequa Dev Servers by @saifrk in #1077
- add per indicator thresholding and new dump by @CarsonDavis in #1073
- 1051 backend model changes on cosmos to hold new incoming urls frontend by @dhanur-sharma in #1090
- 1051 backend model changes on cosmos to hold new incoming urls by @bishwaspraveen in #1069
- 1105 improve pattern application and exclusion management by @CarsonDavis in #1109
- remove destination_server and add datasource by @CarsonDavis in #1108
- Update cmr mappings by @CarsonDavis in #1102
- 1115 improve title processing and tests by @CarsonDavis in #1118
- Affected Delta URLs header added by @dhanur-sharma in #1117
- 3034 cosmos api test cases by @dhanur-sharma in #1114
- Pagination on the Sinequa sql.engine Api by @saifrk in #1104
- Updated page title to URLs by @dhanur-sharma in #1120
- Refresh page on workflow status change by @dhanur-sharma in #1124
- Refactor Two Column to work with Delta Urls by @Kirandawadi in #1103
- add initial reindexing statuses by @CarsonDavis in #1125
- View deleted URLs under Delta URLs page by @dhanur-sharma in #1121
- 1126 managepy command for database backups by @CarsonDavis in #1127
- 3055 optimize the retrieval of url counts on admin page by @bishwaspraveen in #1131
- Updated database restore command by @dhanur-sharma in #1130
- 1133 refactor indexing statuses logic by @CarsonDavis in #1134
- Updated dockerignore and gitignore by @dhanur-sharma in #1135
- 1139 resolve id conflict when promoting by @CarsonDavis in #1140
- Updated title pane to Delta URLs by @dhanur-sharma in #1141
- refactor readme for unapply logic and refactor unapply to account for overlapping patterns by @CarsonDavis in #1146
- Filters fixed by @dhanur-sharma in #1145
- add new field to reindexing statuses by @CarsonDavis in #1148
- Conditional anchor updated for 0 Delta URLs by @dhanur-sharma in #1161
- fixed paging on excludes and includes tabs by @bishwaspraveen in #1163
- 1150 status button color matches by @dhanur-sharma in #1162
- Add documentation for PairedFieldDescriptor implementation by @Kirandawadi in #1160
New Contributors
- @saifrk made their first contribution in #1077
- @dhanur-sharma made their first contribution in #1090
Full Changelog: 3f85f26...8df561a
v2.0.1
What's Changed
- Fix fake flake8 issues by @code-geek in #976
- Add LRM_QA_{USER, PASSWORD} variable to .django by @Kirandawadi in #985
- Make coding syntax consistent by @Kirandawadi in #990
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #979
- Add CONTRIBUTING.md file by @Kirandawadi in #996
- Add SQLDumpRestoration.md file by @Kirandawadi in #994
New Contributors
- @Kirandawadi made their first contribution in #985
- @pre-commit-ci made their first contribution in #979
Full Changelog: v2.0.0...v2.0.1
v2.0.0
What's Changed
- Feature Enhancements: Integrated several new features, including a "Push to GitHub" button for selected collections, a conversation history webapp, and a JSON indexing template. Enhanced the indexing process with dynamic plugin generation and updated URL indexing endpoints.
- Infrastructure and Configuration Updates: Improved project setup with updated configuration files and added mechanisms for automatic updates, such as using Celery to pull in URLs and updating collections based on production API. Switched Celery broker from Redis to SQS for better scalability.
- Bug Fixes and Stability Improvements: Addressed various bugs, including inference bug fixes, preventing tag duplication, and resolving CORS issues on the frontend. Reverted certain changes for better stability and fixed issues related to job creation and indexing.
- Codebase and API Updates: Introduced significant updates to the codebase, such as adding type hints, refreshing code libraries, and updating API endpoints to accommodate new features and feedback. Implemented functional tests using Selenium for enhanced reliability.
- Admin and User Interface Improvements: Enhanced the webapp user experience by refining the UI, including removing clutter, automating file creation at specific status changes, and aligning webapp status implementation with the current process. Added admin actions for better management and visibility.
New Contributors
- @RajashreeDahal4 made their first contribution in #379
- @anisbhsl made their first contribution in #361
- @bishwaspraveen made their first contribution in #409
- @Jmok19927 made their first contribution in #690
- @emshahh made their first contribution in #689
- @Kshaw362 made their first contribution in #691
Full Changelog: v1.1.0...v2.0.0
v1.1.0
What's Changed
- Add pytest config for vscode by @code-geek in #246
- Add code to pull in connector type by @code-geek in #252
- Deal with collections that dont have a sinequa configuration by @code-geek in #256
- Import metadata from Sinequa configs into collections on the webapp by @code-geek in #257
- Implement soft delete filtering on collection list by @code-geek in #262
- Check if pull request already exists and dont hit the create api if it does by @code-geek in #264
- Update collections fixture with latest data by @code-geek in #266
- Add django extensions to prod by @code-geek in #268
- Fix bugs in the GitHub pipeline by @code-geek in #270
- When trying to remove a title pattern by deleting it from the input box, it throws an error by @rajdangol0077 in #258
New Contributors
- @impact-github-bot made their first contribution in #239
Full Changelog: v1.0.0...v1.1.0
v1.0.0
What's Changed
- Exclude patterns by @code-geek in #25
- Feature update models by @CarsonDavis in #27
- Add machine name by @CarsonDavis in #31
- Add jupyter notebook to ingest from csv by @code-geek in #43
- Added minor features by @SauravUpadhyaya in #8
- Feature track req urls by @CarsonDavis in #62
- Feature sinequa scraper by @code-geek in #89
- Dropdown to change the status of a collection on the collection list page by @code-geek in #90
- Dropdown to change the user who is curating a collection from the collection list page by @code-geek in #91
- Dev to main by @code-geek in #93
- Remove broken fields from collection detail page by @code-geek in #95
- Bring in recently indexed collections and set the status as ready to clean by @code-geek in #101
- Turn on statesave for candidate urls table by @code-geek in #102
- Add an option to curation status by @code-geek in #104
- Refactor to Scrape by Indexing and Generate Jobs in Parallel by @CarsonDavis in #110
- API to Ingest Candidate URLs in bulk from the test server by @code-geek in #120
- API to ingest candidate urls -- improved by @code-geek in #121
- Add scripts to export the entire index by @code-geek in #123
- Add code to pull entire index from s3 by @code-geek in #126
- Remove sidebar by @code-geek in #128
- Enable stateSave and reduce the number of rows on the collection list page by @code-geek in #139
- change Curated to Visited by @code-geek in #140
- Add link to sinequa configuration on the collection detail page by @code-geek in #142
- Update code to export select collections and document processes by @code-geek in #148
- Add ability to navigate to a page with user input by @code-geek in #150
- Add link to github issue on each collection by @code-geek in #152
- Prevent moving to the top of the page when changing dropdowns by @code-geek in #156
- Avoid refreshing the page on dropdown change by @code-geek in #157
- Add a status called delete/combine collection by @rajdangol0077 in #154
- Add filter for status and allow export csv by @code-geek in #160
- Show Title Patterns in the admin by @code-geek in #164
- Add pattern type as a filter to TitlePattern admin by @code-geek in #165
- Candidate URLs page improvements by @code-geek in #167
- Change cursor to pointer when hovering on select by @code-geek in #173
- Improve DocumentTypePattern modal by @code-geek in #176
- Allow wildcards in match patterns by @code-geek in #96
- Add a searchdelay of 1000 ms by @code-geek in #185
- DataTables enhancements -- search panes and select by @code-geek in #186
- Add ranges to candidate URL search pane by @code-geek in #188
- Speed up pattern creation and add ability to deselect document type by @code-geek in #190
- Change traefik timeout to 5 minutes by @code-geek in #192
- Set curation status to "Ready to curate" when new URLs are available by @code-geek in #194
- Streamline loading candidate urls into the web app using the API by @code-geek in #203
- Add mechanism to pull in URLs from prod by @code-geek in #210
- Hide github issue link if there isn't one; Make the field editable by @rajdangol0077 in #211
- Create a models module since models.py is getting too big by @code-geek in #215
- Push generated xml to github from the web app by @code-geek in #204
- Generate XML files from patterns by @CarsonDavis in #162
- Add new status for Github_PR_Created by @code-geek in #219
- change match criteria to be surrounded by single quotes by @CarsonDavis in #231
- Make curation status update when changed by @code-geek in #235
Full Changelog: https://github.com/NASA-IMPACT/sde-indexing-helper/commits/v1.0.0