
Resolved merge conflicts
Your Name committed Jan 10, 2025
2 parents c258803 + 99a72f4 commit 6d0f670
Showing 60 changed files with 3,948 additions and 487 deletions.
10 changes: 9 additions & 1 deletion .dockerignore
@@ -7,5 +7,13 @@
.pre-commit-config.yaml
.readthedocs.yml
.travis.yml
.git

# ignore local python environments
venv
.venv

# prevent large backup files from being copied into the image
/backups
*.sql
*.gz
9 changes: 4 additions & 5 deletions .gitignore
@@ -292,8 +292,7 @@ config_generation/config.py
# Model's inference files
Document_Classifier_inference/model.pt

# Ignore Database Backup files
/backups
*.sql
*.gz
97 changes: 71 additions & 26 deletions README.md
@@ -18,7 +18,6 @@ $ docker-compose -f local.yml build
```bash
$ docker-compose -f local.yml up
```

### Non-Docker Local Setup

If you prefer to run the project without Docker, follow these steps:
@@ -69,57 +68,103 @@ $ docker-compose -f local.yml run --rm django python manage.py createsuperuser
#### Creating Additional Users

Create additional users through the admin interface (/admin).
## Database Backup and Restore

COSMOS provides dedicated management commands for backing up and restoring your PostgreSQL database. These commands handle both compressed and uncompressed backups and work seamlessly in both local and production environments using Docker.

### Backup Directory Structure

All backups are stored in the `/backups` directory at the root of your project. This directory is mounted as a volume in both local and production Docker configurations, making it easy to manage backups across different environments.

- Local development: `./backups/`
- Production server: `/path/to/project/backups/`

If the directory doesn't exist, create it:
```bash
mkdir backups
```

### Creating a Database Backup

To create a backup of your database:

```bash
# Create a compressed backup (recommended)
docker-compose -f local.yml run --rm django python manage.py database_backup

# Create an uncompressed backup
docker-compose -f local.yml run --rm django python manage.py database_backup --no-compress

# Specify custom output location within backups directory
docker-compose -f local.yml run --rm django python manage.py database_backup --output my_custom_backup.sql
```

The backup command will automatically:
- Detect your server environment (Production/Staging/Local)
- Use database credentials from your environment settings
- Generate a dated filename if no output path is specified
- Save the backup to the mounted `/backups` directory
- Compress the backup by default (can be disabled with --no-compress)
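
Under the hood, a command like this typically shells out to `pg_dump` with credentials pulled from Django's settings. The sketch below is a hypothetical, simplified illustration of that flow — the option handling mirrors the description above, but it is an assumption, not the actual COSMOS implementation:

```python
# Hypothetical sketch of a database_backup management command (not the actual source).
import gzip
import os
import shutil
import subprocess
from datetime import date

from django.conf import settings
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Dump the default PostgreSQL database into the mounted /backups directory."

    def add_arguments(self, parser):
        parser.add_argument("--output", help="File name to write inside the backups directory")
        parser.add_argument("--no-compress", action="store_true", help="Skip gzip compression")

    def handle(self, *args, **options):
        db = settings.DATABASES["default"]
        name = options["output"] or f"backup-{date.today():%Y%m%d}.sql"
        path = os.path.join("/backups", name)

        # pg_dump reads the password from the environment rather than the command line.
        env = {**os.environ, "PGPASSWORD": db["PASSWORD"]}
        subprocess.run(
            ["pg_dump", "-h", db["HOST"], "-p", str(db["PORT"]),
             "-U", db["USER"], "-d", db["NAME"], "-f", path],
            env=env, check=True,
        )

        if not options["no_compress"]:
            # Compress the dump and drop the uncompressed copy.
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)
            path += ".gz"

        self.stdout.write(self.style.SUCCESS(f"Backup written to {path}"))
```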

### Restoring from a Database Backup

To restore your database from a backup, the backup file must be located in the `/backups` directory. You can then run the following command:

```bash
# Restore from a backup (handles both .sql and .sql.gz files)
docker-compose -f local.yml run --rm django python manage.py database_restore backups/backup_file_name.sql.gz
```

The restore command will:
- Automatically detect if the backup is compressed (.gz)
- Terminate existing database connections
- Drop and recreate the database
- Restore all data from the backup
- Handle all database credentials from your environment settings
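
For reference, the steps above roughly correspond to the following sequence of PostgreSQL calls. This is a hedged sketch with assumed credentials and paths, not the command's actual source:

```python
# Rough sketch of the restore sequence described above (names and paths assumed).
import gzip
import os
import subprocess


def restore(backup_path, host, port, user, password, dbname):
    env = {**os.environ, "PGPASSWORD": password}
    base = ["psql", "-h", host, "-p", str(port), "-U", user]

    # 1. Terminate existing connections, then drop and recreate the target database.
    #    These statements run against the "postgres" maintenance database.
    for statement in (
        f"SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        f"WHERE datname = '{dbname}' AND pid <> pg_backend_pid();",
        f'DROP DATABASE IF EXISTS "{dbname}";',
        f'CREATE DATABASE "{dbname}";',
    ):
        subprocess.run(base + ["-d", "postgres", "-c", statement], env=env, check=True)

    # 2. Feed the (possibly gzip-compressed) SQL dump back into psql.
    #    A real implementation would stream instead of reading the dump into memory.
    opener = gzip.open if backup_path.endswith(".gz") else open
    with opener(backup_path, "rb") as dump:
        subprocess.run(base + ["-d", dbname], input=dump.read(), env=env, check=True)
```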

### Working with Remote Servers

When working with production or staging servers:

1. First, SSH into the appropriate server:
```bash
# For production
ssh user@production-server
cd /path/to/project
```

2. Create a backup on the remote server:
```bash
docker-compose -f production.yml run --rm django python manage.py database_backup
```

3. Copy the backup from the remote server's backup directory to your local machine:
```bash
scp user@remote-server:/path/to/project/backups/backup_name.sql.gz ./backups/
```

4. Restore locally:
```bash
docker-compose -f local.yml run --rm django python manage.py database_restore backups/backup_name.sql.gz
```

### Alternative Methods

While the database_backup and database_restore commands are the recommended approach, you can also use Django's built-in fixtures for smaller datasets:

```bash
# Create a backup excluding content types
docker-compose -f production.yml run --rm django python manage.py dumpdata \
--natural-foreign --natural-primary \
--exclude=contenttypes --exclude=auth.Permission \
--indent 2 \
--output backups/prod_backup-$(date +%Y%m%d).json

# Restore from a fixture
docker-compose -f local.yml run --rm django python manage.py loaddata backups/backup_name.json
```

Note: For large databases (>1.5GB), the database_backup and database_restore commands are strongly recommended over JSON fixtures, as they handle large datasets more efficiently. Manual SQL dump and restore steps are documented in [SQLDumpRestoration.md](./SQLDumpRestoration.md).

## Additional Commands

108 changes: 108 additions & 0 deletions RELEASE_NOTES.md
@@ -0,0 +1,108 @@
# COSMOS Release Notes
## v3.0.0 from v2.0.1

COSMOS v3.0.0 introduces several major architectural changes that fundamentally enhance the system's capabilities. The primary feature is a new website reindexing system that allows COSMOS to stay up-to-date with source website changes, addressing a key limitation of previous versions where websites could only be scraped once. This release includes comprehensive updates to the data models, frontend interface, rule creation system, and backend processing along with some bugfixes from v2.0.1.

The Environmental Justice (EJ) system has been significantly expanded, growing from fewer than 100 manually curated datasets to approximately 1,000 datasets through the integration of machine learning classification of NASA CMR records. This expansion is supported by a new modular processing suite that generates and extracts metadata using Subject Matter Expert (SME) criteria.

To support future machine learning integration, COSMOS now implements a sophisticated two-column system that allows fields to maintain both ML-generated classifications and manual curator overrides. This system has been seamlessly integrated into the data models, serializers, and APIs, ensuring that both automated and human-curated data can coexist while maintaining clear precedence rules.

To ensure reliability and maintainability of these major changes, this release includes extensive testing coverage with 213 new tests spanning URL processing, pattern management, Environmental Justice functionality, workflow triggers, and data migrations. Additionally, we've added comprehensive documentation across 15 new README files that cover everything from fundamental pattern system concepts to detailed API specifications and ML integration guidelines.


### Major Features

#### Reindexing System
- **New Data Models**: Introduced DumpUrl, DeltaUrl, and CuratedUrl to support the reindexing workflow
- **Automated Workflows**:
  - New process to calculate deltas, deletions, and additions during migration (see the sketch after this list)
- Automatic promotion of DeltaUrls to CuratedUrls
- Status-based triggers for data ingestion and processing
- **Duplicate Prevention**: System now prevents duplicate patterns and URLs
- **Enhanced Frontend**:
- Added reindexing status column to collection and URL list pages
- New deletion tracking column on URL list page
- Updated collection list to display delta URL counts
- Improved URL list page accessibility via delta URL count
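
To make the delta calculation above concrete, here is a simplified, hypothetical sketch: a fresh scrape (dump) is compared against the existing curated set by URL, and each record is classified as an addition, a deletion, or a change. The real migration logic also accounts for pattern effects and many more fields; only the core comparison is shown.

```python
# Simplified sketch of delta calculation between a fresh dump and the curated set.
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRecord:
    url: str
    title: str = ""


def calculate_deltas(dump_urls, curated_urls):
    """Classify a fresh scrape against the curated set, keyed by URL."""
    dump_by_url = {r.url: r for r in dump_urls}
    curated_by_url = {r.url: r for r in curated_urls}

    additions = [r for url, r in dump_by_url.items() if url not in curated_by_url]
    deletions = [r for url, r in curated_by_url.items() if url not in dump_by_url]
    # Records present in both sets become deltas only if something changed.
    changes = [
        dump_by_url[url]
        for url in dump_by_url.keys() & curated_by_url.keys()
        if dump_by_url[url] != curated_by_url[url]
    ]
    return additions, deletions, changes


dump = [UrlRecord("https://example.com/a", "A"), UrlRecord("https://example.com/b", "B v2")]
curated = [UrlRecord("https://example.com/b", "B"), UrlRecord("https://example.com/c", "C")]
print([len(group) for group in calculate_deltas(dump, curated)])  # [1, 1, 1]
```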

#### Pattern System Improvements
- Complete modularization of the pattern system
- Enhanced handling of edge cases including overlapping patterns
- Improved unapply logic
- Functional inclusion rules
- Pattern precedence system: most specific pattern takes priority, with pattern length as tiebreaker
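
A minimal illustration of that precedence rule follows. The specificity measure (number of concrete, non-wildcard path segments) is an assumption chosen for the example, not the project's actual definition:

```python
# Hypothetical illustration of "most specific pattern wins, length breaks ties".
def specificity(pattern: str) -> tuple:
    segments = [s for s in pattern.strip("/").split("/") if s]
    concrete = sum(1 for s in segments if "*" not in s)
    return (concrete, len(pattern))


def winning_pattern(patterns):
    """Return the pattern that takes precedence under the rule above."""
    return max(patterns, key=specificity)


print(winning_pattern(["/docs/*", "/docs/api/*", "/docs/api/v2/*"]))  # /docs/api/v2/*
```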

#### Environmental Justice (EJ) Enhancement
- Expanded from 92 manual datasets to 1063 ML-classified NASA CMR records
- New modular processing suite for metadata generation
- Enhanced API with multiple data sources:
- Spreadsheet (original manual classifications)
- ML Production
- ML Testing
- Combined (ML production with spreadsheet overrides)
- Custom processing suite for CMR metadata extraction

#### Infrastructure Updates
- Streamlined database backup and restore
- Optimized Docker builds
- Fixed LetsEncrypt staging issues
- Modified Traefik timeouts for long-running jobs
- Updated Sinequa worker configuration:
- Reduced worker count to 3 for neural workload optimization
- Added neural indexing to all webcrawlers
- Removed deprecated version mappings

#### API Enhancements
- New endpoints for curated and delta URLs (see the usage sketch after this list):
- GET /curated-urls-api/<str:config_folder>/
- GET /delta-urls-api/<str:config_folder>/
- Backwards compatibility through remapped CandidateUrl endpoint
- Updated Environmental Justice API with new data source parameter
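
As an illustration, the new URL endpoints can presumably be queried like any other COSMOS API route; the host, authentication scheme, and response shape below are assumptions:

```python
# Hypothetical usage of the new read-only URL endpoints (host and auth are assumed).
import requests

BASE = "https://cosmos.example.gov"                # assumed host
HEADERS = {"Authorization": "Token <your-token>"}  # assumed auth scheme

config_folder = "example_collection"
for endpoint in ("curated-urls-api", "delta-urls-api"):
    response = requests.get(f"{BASE}/{endpoint}/{config_folder}/", headers=HEADERS, timeout=30)
    response.raise_for_status()
    print(endpoint, len(response.json()))  # assumes a JSON list of URL records
```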

### Technical Improvements

#### Two-Column System
- New architecture to support dual ML/manual classifications
- Seamless integration with models, serializers, and APIs
- Prioritization system for manual overrides
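
A minimal sketch of the dual-column idea, with hypothetical names (the actual model fields and serializers differ): every classified attribute stores an ML value alongside an optional manual override, and readers resolve the manual value first.

```python
# Hypothetical sketch of the paired ML/manual value pattern (names are assumptions).
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifiedField:
    ml_value: Optional[str] = None       # written by the ML pipeline
    manual_value: Optional[str] = None   # written by a human curator

    @property
    def value(self) -> Optional[str]:
        """Manual curation takes precedence over the ML classification."""
        return self.manual_value if self.manual_value is not None else self.ml_value


doc_type = ClassifiedField(ml_value="Documentation")
print(doc_type.value)   # Documentation -- ML value used
doc_type.manual_value = "Data"
print(doc_type.value)   # Data -- the manual override wins
```

In the actual models this presumably corresponds to paired columns per field, with the serializers and APIs deciding which value to expose.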

#### Testing
Added 213 new tests across multiple areas:
- URL APIs and processing (19 tests)
- Delta and pattern management (31 tests)
- Environmental Justice API (7 tests)
- Environmental Justice Mappings and Thresholding (58 tests)
- Workflow and status triggers (10 tests)
- Migration and promotion processes (31 tests)
- Field modifications and TDAMM tags (25 tests)
- Additional system functionality (30 tests)


#### Documentation
Added comprehensive documentation across 15 READMEs covering:
- Pattern system fundamentals and examples
- Reindexing statuses and triggers
- Model lifecycles and testing procedures
- URL inclusion/exclusion logic
- Environmental Justice classifier and API
- ML column functionality
- SQL dump restoration

### Bug Fixes
- Fixed non-functional includes
- Resolved pagination issues for patterns (previously limited to 50)
- Eliminated ability to create duplicate URLs and patterns
- Corrected faulty unapply logic for modification patterns
- Fixed non-repeatable results when overlapping patterns were applied
- Allowed long-running jobs to complete without timeouts

### UI Updates
- Renamed application from "SDE Indexing Helper" to "COSMOS"
- Refactored collection list code for easier column management
- Enhanced URL list page with new status and deletion tracking
- Improved navigation through delta URL count integration

### Administrative Changes
- Added new admin panels for enhanced system management
- Updated installation requirements
- Enhanced database backup and restore functionality
109 changes: 108 additions & 1 deletion SQLDumpRestoration.md
@@ -82,4 +82,111 @@ docker-compose -f local.yml up
docker-compose -f local.yml run --rm django python manage.py createsuperuser
```

8. Log in to the COSMOS frontend to ensure that all data has been correctly populated in the UI.

# Making the Backup

```bash
ssh sde
cat .envs/.production/.postgres
```

Find the values of these variables:

```bash
POSTGRES_HOST=sde-indexing-helper-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=this_is_A_web_application_built_in_2023
```

```bash
docker ps
```

Note the ID of the postgres container (in this example, `b3fefa2c19fb`).

The generic form of the dump command is shown below. Note that you need to supply the database host and password explicitly, since the database is hosted on RDS rather than inside the container:
```bash
docker exec -t your_postgres_container_id pg_dump -U your_postgres_user -d your_database_name > backup.sql
```
```bash
docker exec -t container_id pg_dump -h host -U user -d database -W > prod_backup.sql
```

For example, on the production server:

```bash
docker exec -t b3fefa2c19fb env PGPASSWORD="this_is_A_web_application_built_in_2023" pg_dump -h sde-indexing-helper-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres > prod_backup.sql
```

# Moving the Backup to Local

Go back to your local computer and copy the file with `scp`:

```bash
scp sde:/home/ec2-user/sde_indexing_helper/prod_backup.sql .
```

To push the backup to the staging server:

```bash
scp prod_backup.sql sde_staging:/home/ec2-user/sde-indexing-helper
```

If you have trouble transferring the file, you can use rsync instead:

```bash
rsync -avzP prod_backup.sql sde_staging:/home/ec2-user/sde-indexing-helper/
```

# Restoring the Backup

Bring down the running containers, then start only the postgres service:
```bash
docker-compose -f local.yml down
docker-compose -f local.yml up postgres
docker ps
```

Find the ID of the postgres container (in this example, `c11d7bae2e56`).

Find the database variables for the target environment:

```bash
cat .envs/.production/.postgres
```

```bash
POSTGRES_HOST=sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com
POSTGRES_PORT=5432
POSTGRES_DB=sde_staging
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

Then shell into the postgres container:


```bash
docker exec -it <container id> bash
```
For example:

```bash
docker exec -it c11d7bae2e56 bash
```

## Drop and Recreate the Database


Connect to PostgreSQL:

```bash
psql -U <POSTGRES_USER> -d <POSTGRES_DB>
psql -U postgres -d sde_staging
```

Or, if you are on one of the servers:

```bash
psql -h sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres
```

Then switch to the maintenance database and recreate the target database:

```sql
\c postgres
DROP DATABASE sde_staging;
CREATE DATABASE sde_staging;
```

# Load the SQL Dump

```bash
docker cp prod_backup.sql c11d7bae2e56:/
docker exec -it c11d7bae2e56 bash
```

```bash
psql -U <POSTGRES_USER> -d <POSTGRES_DB> -f backup.sql
```

For example, from inside the container:

```bash
psql -U VnUvMKBSdkoFIETgLongnxYHrYVJKufn -d sde_indexing_helper -f prod_backup.sql
```

Or directly against the staging RDS instance:

```bash
psql -h sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres -f prod_backup.sql
```

If the dump was produced in a custom (non-plain) format, e.g. with `pg_dump -Fc`, use `pg_restore` instead:

```bash
pg_restore -h sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres prod_backup.sql
```



# Rebuild and Migrate

Finally, rebuild the images, apply migrations, and bring the stack back up:

```bash
docker-compose -f local.yml down
docker-compose -f local.yml build
docker-compose -f local.yml run --rm django python manage.py migrate
docker-compose -f local.yml up
```