
Resolved merge conflicts
Your Name committed Jan 10, 2025
2 parents c258803 + 99a72f4 commit 6d0f670
Showing 60 changed files with 3,948 additions and 487 deletions.
10 changes: 9 additions & 1 deletion .dockerignore
@@ -7,5 +7,13 @@
.pre-commit-config.yaml
.readthedocs.yml
.travis.yml
.git

# ignore local python environments
venv
.venv

# prevent large backup files from being copied into the image
/backups
*.sql
*.gz
9 changes: 4 additions & 5 deletions .gitignore
@@ -292,8 +292,7 @@ config_generation/config.py
# Model's inference files
Document_Classifier_inference/model.pt

# Ignore Database Backup files
/backups
*.sql
*.gz
97 changes: 71 additions & 26 deletions README.md
@@ -18,7 +18,6 @@ $ docker-compose -f local.yml build
```bash
$ docker-compose -f local.yml up
```

### Non-Docker Local Setup

If you prefer to run the project without Docker, follow these steps:
@@ -69,57 +68,103 @@ $ docker-compose -f local.yml run --rm django python manage.py createsuperuser
#### Creating Additional Users

Create additional users through the admin interface (/admin).
## Database Backup and Restore

COSMOS provides dedicated management commands for backing up and restoring your PostgreSQL database. These commands handle both compressed and uncompressed backups and work seamlessly in both local and production environments using Docker.

### Backup Directory Structure

All backups are stored in the `/backups` directory at the root of your project. This directory is mounted as a volume in both local and production Docker configurations, making it easy to manage backups across different environments.

- Local development: `./backups/`
- Production server: `/path/to/project/backups/`

If the directory doesn't exist, create it:
```bash
mkdir backups
```

### Creating a Database Backup

To create a backup of your database:

```bash
# Create a compressed backup (recommended)
docker-compose -f local.yml run --rm django python manage.py database_backup

# Create an uncompressed backup
docker-compose -f local.yml run --rm django python manage.py database_backup --no-compress

# Specify custom output location within backups directory
docker-compose -f local.yml run --rm django python manage.py database_backup --output my_custom_backup.sql
```

The backup command will automatically:
- Detect your server environment (Production/Staging/Local)
- Use database credentials from your environment settings
- Generate a dated filename if no output path is specified
- Save the backup to the mounted `/backups` directory
- Compress the backup by default (can be disabled with --no-compress)
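
Under the hood, a command like this typically shells out to `pg_dump` with credentials pulled from Django's settings. The sketch below is a hypothetical, simplified illustration of that flow — the option handling mirrors the description above, but it is an assumption, not the actual COSMOS implementation:

```python
# Hypothetical sketch of a database_backup management command (not the actual source).
import gzip
import os
import shutil
import subprocess
from datetime import date

from django.conf import settings
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Dump the default PostgreSQL database into the mounted /backups directory."

    def add_arguments(self, parser):
        parser.add_argument("--output", help="File name to write inside the backups directory")
        parser.add_argument("--no-compress", action="store_true", help="Skip gzip compression")

    def handle(self, *args, **options):
        db = settings.DATABASES["default"]
        name = options["output"] or f"backup-{date.today():%Y%m%d}.sql"
        path = os.path.join("/backups", name)

        # pg_dump reads the password from the environment rather than the command line.
        env = {**os.environ, "PGPASSWORD": db["PASSWORD"]}
        subprocess.run(
            ["pg_dump", "-h", db["HOST"], "-p", str(db["PORT"]),
             "-U", db["USER"], "-d", db["NAME"], "-f", path],
            env=env, check=True,
        )

        if not options["no_compress"]:
            # Compress the dump and drop the uncompressed copy.
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)
            path += ".gz"

        self.stdout.write(self.style.SUCCESS(f"Backup written to {path}"))
```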

### Restoring from a Database Backup

To restore your database from a backup, the backup file must be located in the `/backups` directory. You can then run the following command:

```bash
# Restore from a backup (handles both .sql and .sql.gz files)
docker-compose -f local.yml run --rm django python manage.py database_restore backups/backup_file_name.sql.gz
```

The restore command will:
- Automatically detect if the backup is compressed (.gz)
- Terminate existing database connections
- Drop and recreate the database
- Restore all data from the backup
- Handle all database credentials from your environment settings
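
For reference, the steps above roughly correspond to the following sequence of PostgreSQL calls. This is a hedged sketch with assumed credentials and paths, not the command's actual source:

```python
# Rough sketch of the restore sequence described above (names and paths assumed).
import gzip
import os
import subprocess


def restore(backup_path, host, port, user, password, dbname):
    env = {**os.environ, "PGPASSWORD": password}
    base = ["psql", "-h", host, "-p", str(port), "-U", user]

    # 1. Terminate existing connections, then drop and recreate the target database.
    #    These statements run against the "postgres" maintenance database.
    for statement in (
        f"SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        f"WHERE datname = '{dbname}' AND pid <> pg_backend_pid();",
        f'DROP DATABASE IF EXISTS "{dbname}";',
        f'CREATE DATABASE "{dbname}";',
    ):
        subprocess.run(base + ["-d", "postgres", "-c", statement], env=env, check=True)

    # 2. Feed the (possibly gzip-compressed) SQL dump back into psql.
    #    A real implementation would stream instead of reading the dump into memory.
    opener = gzip.open if backup_path.endswith(".gz") else open
    with opener(backup_path, "rb") as dump:
        subprocess.run(base + ["-d", dbname], input=dump.read(), env=env, check=True)
```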

### Working with Remote Servers

When working with production or staging servers:

1. First, SSH into the appropriate server:
```bash
# For production
ssh user@production-server
cd /path/to/project
```

2. Create a backup on the remote server:
```bash
docker-compose -f production.yml run --rm django python manage.py database_backup
```

3. Copy the backup from the remote server's backup directory to your local machine:
```bash
scp user@remote-server:/path/to/project/backups/backup_name.sql.gz ./backups/
```

4. Restore locally:
```bash
docker-compose -f local.yml run --rm django python manage.py database_restore backups/backup_name.sql.gz
```

### Alternative Methods

While the database_backup and database_restore commands are the recommended approach, you can also use Django's built-in fixtures for smaller datasets:

```bash
# Create a backup excluding content types
docker-compose -f production.yml run --rm django python manage.py dumpdata \
--natural-foreign --natural-primary \
--exclude=contenttypes --exclude=auth.Permission \
--indent 2 \
--output backups/prod_backup-$(date +%Y%m%d).json

# Restore from a fixture
docker-compose -f local.yml run --rm django python manage.py loaddata backups/backup_name.json
```

Note: For large databases (>1.5GB), the database_backup and database_restore commands are strongly recommended over JSON fixtures, as they handle large datasets more efficiently. Manual SQL dump and restore steps are documented in [SQLDumpRestoration.md](./SQLDumpRestoration.md).

## Additional Commands

108 changes: 108 additions & 0 deletions RELEASE_NOTES.md
@@ -0,0 +1,108 @@
# COSMOS Release Notes
## v3.0.0 from v2.0.1

COSMOS v3.0.0 introduces several major architectural changes that fundamentally enhance the system's capabilities. The primary feature is a new website reindexing system that allows COSMOS to stay up-to-date with source website changes, addressing a key limitation of previous versions where websites could only be scraped once. This release includes comprehensive updates to the data models, frontend interface, rule creation system, and backend processing along with some bugfixes from v2.0.1.

The Environmental Justice (EJ) system has been significantly expanded, growing from fewer than 100 manually curated datasets to approximately 1,000 datasets through the integration of machine learning classification of NASA CMR records. This expansion is supported by a new modular processing suite that generates and extracts metadata using Subject Matter Expert (SME) criteria.

To support future machine learning integration, COSMOS now implements a sophisticated two-column system that allows fields to maintain both ML-generated classifications and manual curator overrides. This system has been seamlessly integrated into the data models, serializers, and APIs, ensuring that both automated and human-curated data can coexist while maintaining clear precedence rules.

To ensure reliability and maintainability of these major changes, this release includes extensive testing coverage with 213 new tests spanning URL processing, pattern management, Environmental Justice functionality, workflow triggers, and data migrations. Additionally, we've added comprehensive documentation across 15 new README files that cover everything from fundamental pattern system concepts to detailed API specifications and ML integration guidelines.


### Major Features

#### Reindexing System
- **New Data Models**: Introduced DumpUrl, DeltaUrl, and CuratedUrl to support the reindexing workflow
- **Automated Workflows**:
  - New process to calculate deltas, deletions, and additions during migration (see the sketch after this list)
- Automatic promotion of DeltaUrls to CuratedUrls
- Status-based triggers for data ingestion and processing
- **Duplicate Prevention**: System now prevents duplicate patterns and URLs
- **Enhanced Frontend**:
- Added reindexing status column to collection and URL list pages
- New deletion tracking column on URL list page
- Updated collection list to display delta URL counts
- Improved URL list page accessibility via delta URL count
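
To make the delta calculation above concrete, here is a simplified, hypothetical sketch: a fresh scrape (dump) is compared against the existing curated set by URL, and each record is classified as an addition, a deletion, or a change. The real migration logic also accounts for pattern effects and many more fields; only the core comparison is shown.

```python
# Simplified sketch of delta calculation between a fresh dump and the curated set.
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRecord:
    url: str
    title: str = ""


def calculate_deltas(dump_urls, curated_urls):
    """Classify a fresh scrape against the curated set, keyed by URL."""
    dump_by_url = {r.url: r for r in dump_urls}
    curated_by_url = {r.url: r for r in curated_urls}

    additions = [r for url, r in dump_by_url.items() if url not in curated_by_url]
    deletions = [r for url, r in curated_by_url.items() if url not in dump_by_url]
    # Records present in both sets become deltas only if something changed.
    changes = [
        dump_by_url[url]
        for url in dump_by_url.keys() & curated_by_url.keys()
        if dump_by_url[url] != curated_by_url[url]
    ]
    return additions, deletions, changes


dump = [UrlRecord("https://example.com/a", "A"), UrlRecord("https://example.com/b", "B v2")]
curated = [UrlRecord("https://example.com/b", "B"), UrlRecord("https://example.com/c", "C")]
print([len(group) for group in calculate_deltas(dump, curated)])  # [1, 1, 1]
```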

#### Pattern System Improvements
- Complete modularization of the pattern system
- Enhanced handling of edge cases including overlapping patterns
- Improved unapply logic
- Functional inclusion rules
- Pattern precedence system: most specific pattern takes priority, with pattern length as tiebreaker
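
A minimal illustration of that precedence rule follows. The specificity measure (number of concrete, non-wildcard path segments) is an assumption chosen for the example, not the project's actual definition:

```python
# Hypothetical illustration of "most specific pattern wins, length breaks ties".
def specificity(pattern: str) -> tuple:
    segments = [s for s in pattern.strip("/").split("/") if s]
    concrete = sum(1 for s in segments if "*" not in s)
    return (concrete, len(pattern))


def winning_pattern(patterns):
    """Return the pattern that takes precedence under the rule above."""
    return max(patterns, key=specificity)


print(winning_pattern(["/docs/*", "/docs/api/*", "/docs/api/v2/*"]))  # /docs/api/v2/*
```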

#### Environmental Justice (EJ) Enhancement
- Expanded from 92 manual datasets to 1063 ML-classified NASA CMR records
- New modular processing suite for metadata generation
- Enhanced API with multiple data sources:
- Spreadsheet (original manual classifications)
- ML Production
- ML Testing
- Combined (ML production with spreadsheet overrides)
- Custom processing suite for CMR metadata extraction

#### Infrastructure Updates
- Streamlined database backup and restore
- Optimized Docker builds
- Fixed LetsEncrypt staging issues
- Modified Traefik timeouts for long-running jobs
- Updated Sinequa worker configuration:
- Reduced worker count to 3 for neural workload optimization
- Added neural indexing to all webcrawlers
- Removed deprecated version mappings

#### API Enhancements
- New endpoints for curated and delta URLs (see the usage sketch after this list):
- GET /curated-urls-api/<str:config_folder>/
- GET /delta-urls-api/<str:config_folder>/
- Backwards compatibility through remapped CandidateUrl endpoint
- Updated Environmental Justice API with new data source parameter
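
As an illustration, the new URL endpoints can presumably be queried like any other COSMOS API route; the host, authentication scheme, and response shape below are assumptions:

```python
# Hypothetical usage of the new read-only URL endpoints (host and auth are assumed).
import requests

BASE = "https://cosmos.example.gov"                # assumed host
HEADERS = {"Authorization": "Token <your-token>"}  # assumed auth scheme

config_folder = "example_collection"
for endpoint in ("curated-urls-api", "delta-urls-api"):
    response = requests.get(f"{BASE}/{endpoint}/{config_folder}/", headers=HEADERS, timeout=30)
    response.raise_for_status()
    print(endpoint, len(response.json()))  # assumes a JSON list of URL records
```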

### Technical Improvements

#### Two-Column System
- New architecture to support dual ML/manual classifications
- Seamless integration with models, serializers, and APIs
- Prioritization system for manual overrides
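
A minimal sketch of the dual-column idea, with hypothetical names (the actual model fields and serializers differ): every classified attribute stores an ML value alongside an optional manual override, and readers resolve the manual value first.

```python
# Hypothetical sketch of the paired ML/manual value pattern (names are assumptions).
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifiedField:
    ml_value: Optional[str] = None       # written by the ML pipeline
    manual_value: Optional[str] = None   # written by a human curator

    @property
    def value(self) -> Optional[str]:
        """Manual curation takes precedence over the ML classification."""
        return self.manual_value if self.manual_value is not None else self.ml_value


doc_type = ClassifiedField(ml_value="Documentation")
print(doc_type.value)   # Documentation -- ML value used
doc_type.manual_value = "Data"
print(doc_type.value)   # Data -- the manual override wins
```

In the actual models this presumably corresponds to paired columns per field, with the serializers and APIs deciding which value to expose.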

#### Testing
Added 213 new tests across multiple areas:
- URL APIs and processing (19 tests)
- Delta and pattern management (31 tests)
- Environmental Justice API (7 tests)
- Environmental Justice Mappings and Thresholding (58 tests)
- Workflow and status triggers (10 tests)
- Migration and promotion processes (31 tests)
- Field modifications and TDAMM tags (25 tests)
- Additional system functionality (30 tests)


#### Documentation
Added comprehensive documentation across 15 READMEs covering:
- Pattern system fundamentals and examples
- Reindexing statuses and triggers
- Model lifecycles and testing procedures
- URL inclusion/exclusion logic
- Environmental Justice classifier and API
- ML column functionality
- SQL dump restoration

### Bug Fixes
- Fixed non-functional includes
- Resolved pagination issues for patterns (previously limited to 50)
- Eliminated ability to create duplicate URLs and patterns
- Corrected faulty unapply logic for modification patterns
- Fixed non-repeatable results when overlapping patterns were applied
- Allowed long-running jobs to complete without timeouts

### UI Updates
- Renamed application from "SDE Indexing Helper" to "COSMOS"
- Refactored collection list code for easier column management
- Enhanced URL list page with new status and deletion tracking
- Improved navigation through delta URL count integration

### Administrative Changes
- Added new admin panels for enhanced system management
- Updated installation requirements
- Enhanced database backup and restore functionality
109 changes: 108 additions & 1 deletion SQLDumpRestoration.md
@@ -82,4 +82,111 @@ docker-compose -f local.yml up
docker-compose -f local.yml run --rm django python manage.py createsuperuser
```

8. Log in to the COSMOS frontend to ensure that all data has been correctly populated in the UI.

# Making the Backup

```bash
ssh sde
cat .envs/.production/.postgres
```

Find the values of these variables:

```bash
POSTGRES_HOST=sde-indexing-helper-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=this_is_A_web_application_built_in_2023
```

```bash
docker ps
```

Note the ID of the postgres container (in this example, `b3fefa2c19fb`).

The generic form of the dump command is shown below. Note that you need to supply the database host and password explicitly, since the database is hosted on RDS rather than inside the container:
```bash
docker exec -t your_postgres_container_id pg_dump -U your_postgres_user -d your_database_name > backup.sql
```
```bash
docker exec -t container_id pg_dump -h host -U user -d database -W > prod_backup.sql
```

For example, on the production server:

```bash
docker exec -t b3fefa2c19fb env PGPASSWORD="this_is_A_web_application_built_in_2023" pg_dump -h sde-indexing-helper-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres > prod_backup.sql
```

# Moving the Backup to Local

Go back to your local computer and copy the file with `scp`:

```bash
scp sde:/home/ec2-user/sde_indexing_helper/prod_backup.sql .
```

To push the backup to the staging server:

```bash
scp prod_backup.sql sde_staging:/home/ec2-user/sde-indexing-helper
```

If you have trouble transferring the file, you can use rsync instead:

```bash
rsync -avzP prod_backup.sql sde_staging:/home/ec2-user/sde-indexing-helper/
```

# Restoring the Backup

Bring down the running containers, then start only the postgres service:
```bash
docker-compose -f local.yml down
docker-compose -f local.yml up postgres
docker ps
```

Find the ID of the postgres container (in this example, `c11d7bae2e56`).

Find the database variables for the target environment:

```bash
cat .envs/.production/.postgres
```

```bash
POSTGRES_HOST=sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com
POSTGRES_PORT=5432
POSTGRES_DB=sde_staging
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

Then shell into the postgres container:


```bash
docker exec -it <container id> bash
```
For example:

```bash
docker exec -it c11d7bae2e56 bash
```

## Drop and Recreate the Database


Connect to PostgreSQL:

```bash
psql -U <POSTGRES_USER> -d <POSTGRES_DB>
psql -U postgres -d sde_staging
```

Or, if you are on one of the servers:

```bash
psql -h sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres
```

Then switch to the maintenance database and recreate the target database:

```sql
\c postgres
DROP DATABASE sde_staging;
CREATE DATABASE sde_staging;
```

# Load the SQL Dump

```bash
docker cp prod_backup.sql c11d7bae2e56:/
docker exec -it c11d7bae2e56 bash
```

```bash
psql -U <POSTGRES_USER> -d <POSTGRES_DB> -f backup.sql
```

For example, from inside the container:

```bash
psql -U VnUvMKBSdkoFIETgLongnxYHrYVJKufn -d sde_indexing_helper -f prod_backup.sql
```

Or directly against the staging RDS instance:

```bash
psql -h sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres -f prod_backup.sql
```

If the dump was produced in a custom (non-plain) format, e.g. with `pg_dump -Fc`, use `pg_restore` instead:

```bash
pg_restore -h sde-indexing-helper-staging-db.c3cr2yyh5zt0.us-east-1.rds.amazonaws.com -U postgres -d postgres prod_backup.sql
```



# Rebuild and Migrate

Finally, rebuild the images, apply migrations, and bring the stack back up:

```bash
docker-compose -f local.yml down
docker-compose -f local.yml build
docker-compose -f local.yml run --rm django python manage.py migrate
docker-compose -f local.yml up
```