Genomic Data Services

Flask-based web service providing genomic region search, based on regulomedb.org.

Installation Requirements:

To run this application locally you need to install Docker. To download the machine learning models you need Python 3.

Application Installation

Download machine learning models

The models are required for indexing. Tests can be run without them.

In a Python 3 virtual environment, install boto3:

pip install boto3

Download machine learning models:

python utils/download_files.py

Indexing

Using the compose file suitable for your machine:

docker-compose --file docker-compose-index-m1/intel.yml build
docker-compose --file docker-compose-index-m1/intel.yml up

After indexing has finished (it takes about 5 minutes), tear down:

docker-compose --file docker-compose-index-m1/intel.yml down --remove-orphans

Indexing populates the Elasticsearch (ES) database and creates a directory named esdata to store the indexes. The app reuses this directory (see Running the app below).
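
The compose file names in this README use `m1/intel` as a placeholder for the machine variant. A minimal Python sketch of picking the variant automatically is shown below. It assumes the files are named like `docker-compose-index-m1.yml` and `docker-compose-index-intel.yml`, which this README implies but does not state; `compose_file` is a hypothetical helper, not part of the repo.

```python
import platform
from typing import Optional


def compose_file(kind: str = "", machine: Optional[str] = None) -> str:
    """Return the assumed compose file name for this machine.

    kind is "" for the app, or "index" / "test" for the other stacks.
    The naming pattern is an assumption based on this README.
    """
    # platform.machine() reports "arm64" on Apple silicon macOS and
    # "aarch64" on ARM Linux; anything else is treated as Intel here.
    arch = (machine or platform.machine()).lower()
    variant = "m1" if arch in ("arm64", "aarch64") else "intel"
    middle = f"{kind}-" if kind else ""
    return f"docker-compose-{middle}{variant}.yml"


print(compose_file("index"))
```

The printed name can then be passed to `docker-compose --file`.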

Running the app

Using the compose file suitable for your machine:

docker-compose --file docker-compose-m1/intel.yml build
docker-compose --file docker-compose-m1/intel.yml up

The application is available at localhost:80.
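
To confirm that the app is answering before you use it, a small check such as the following sketch can help. It only tests that something responds over HTTP on the given host and port; `app_is_up` is a hypothetical helper, and requesting the path `/` is an assumption, since this README does not list the app's routes.

```python
import http.client


def app_is_up(host: str = "localhost", port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if anything answers an HTTP request on host:port."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse()  # any status code means a server is responding
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("app reachable:", app_is_up())
```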

Tear down:

docker-compose --file docker-compose-m1/intel.yml down --remove-orphans

Testing

Run the tests using the compose file suitable for your machine:

docker-compose --file docker-compose-test-m1/intel.yml --env-file ./docker_compose/test.env up --build

Tear down:

docker-compose --file docker-compose-test-m1/intel.yml down -v --remove-orphans

Automatic linting

This repo includes configuration for pre-commit hooks. To use pre-commit, install pre-commit, and activate the hooks:

pip install pre-commit==2.17.0
pre-commit install

From now on, every git commit runs the automatic checks against the changes you made.

AWS Deployment

A production grade data services deployment consists of three machines:

Main machine, which runs the Flask app and sends requests to the ES machines.
Regulome search ES
ENCODED region-search ES

Connecting to the instances

The instances have EC2 Instance Connect installed. You also need to install the EC2 Instance Connect CLI on your local machine to connect to them. Assuming the ID of the instance you want to connect to is i-foobarbaz123, connect with the command:

mssh ubuntu@i-foobarbaz123 --profile regulome --region us-west-2

Demo deployment

Make sure you have activated the virtual environment created above. If you need a demo deployment for Regulome or Encoded region search, first set the environment variable DEMO_INDEXER_PASSWORD; the deploy script uses it as the password for the indexer. Then run the command below, which launches a single machine hosting both the GDS Flask app and the Elasticsearch server.
```
python deploy/deploy.py --demo
```
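
If you prefer not to invent the indexer password by hand, the step above can be sketched in Python as follows. The helper names `demo_deploy_env` and `run_demo_deploy` are hypothetical; the only things taken from this README are the DEMO_INDEXER_PASSWORD variable and the `deploy/deploy.py --demo` command.

```python
import os
import secrets
import subprocess
import sys
from typing import Dict, Optional


def demo_deploy_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment, adding DEMO_INDEXER_PASSWORD if it is unset."""
    env = dict(os.environ if base_env is None else base_env)
    if not env.get("DEMO_INDEXER_PASSWORD"):
        # 24 random bytes, URL-safe encoded; hard to guess.
        env["DEMO_INDEXER_PASSWORD"] = secrets.token_urlsafe(24)
    return env


def run_demo_deploy() -> None:
    """Run the demo deploy from the repo root with the password set."""
    env = demo_deploy_env()
    # Printed so you can log in to the indexer later; avoid this on shared terminals.
    print("Indexer password:", env["DEMO_INDEXER_PASSWORD"])
    subprocess.run([sys.executable, "deploy/deploy.py", "--demo"], env=env, check=True)
```

Calling `run_demo_deploy()` from the repository root is then equivalent to exporting the variable and running the command above. An existing DEMO_INDEXER_PASSWORD value is left untouched.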

Start indexing on the machine. For RegulomeDB:

cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer.py

Or for Encode region search:

cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer_encode.py

You can monitor the indexing progress using the Flower dashboard at <public IP of the machine/indexer>. For demo purposes, the username and password for the indexer are already set in the deploy script.
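
For scripted checks of the dashboard, a request with credentials can be built as in the sketch below. It assumes the dashboard is protected with HTTP Basic authentication, which this README does not confirm for the demo setup; `basic_auth_header` and `dashboard_request` are hypothetical helpers, and the URL is whatever dashboard address applies to your machine.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def dashboard_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Return a Request for the dashboard URL carrying the credentials."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req
```

Passing the returned object to `urllib.request.urlopen(..., timeout=10)` sends the request; a 401 response would suggest the credentials or the authentication scheme differ from what is assumed here.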

Production grade deployment

The command below will deploy three machines: the GDS main machine, the Regulome ES machine, and the Encoded ES machine:
```
python deploy/deploy.py
```
On each ES machine create a password for accessing the indexer:
```
sudo mkdir -p /etc/apache2
sudo htpasswd -c /etc/apache2/.htpasswd <your-user-name>
```
You will use this login and password to access the Flower dashboard on the machines. The dashboard is accessible at <public IP of the ES machine/indexer>. It is exposed to the internet, so choose the login and password carefully (admin is a bad username because it is easy to guess).
On the main machine add the IP addresses of the ES machines into /home/ubuntu/genomic-data-service/config/production.cfg. Set the value of REGULOME_ES_HOSTS to the private IP address of the regulome data service machine, and the value of REGION_SEARCH_ES_HOSTS to the private IP address of the region search data service machine (note that in the normal case these values are lists with one item).
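
As an illustration, the two settings might look like the fragment below. This assumes production.cfg uses Python syntax, as Flask configuration files often do; that is not confirmed here, so follow the format of the existing file. The IP addresses are hypothetical placeholders.

```python
# Sketch of the two settings in config/production.cfg.
# The addresses are hypothetical; use the private IPs of your own ES machines.
REGULOME_ES_HOSTS = ["10.0.1.11"]       # regulome data service machine
REGION_SEARCH_ES_HOSTS = ["10.0.1.12"]  # region search data service machine
```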

Start each service on the main machine:

sudo systemctl daemon-reload
sudo systemctl enable --now genomic.socket
sudo systemctl enable genomic.service
sudo systemctl enable nginx.service
sudo systemctl start genomic
sudo systemctl start nginx

Start the regulome region indexer on the regulome ES machine:

cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer.py

Start the encoded region indexer on the encoded ES machine:

cd /home/ubuntu/genomic-data-service
source genomic-venv/bin/activate
python genomic_data_service/region_indexer_encode.py

You can monitor the indexing progress using the Flower dashboard. After indexing has finished (the region-search machine indexes in a few hours; the regulome machine takes a couple of days), the machines can be downsized. A good size for the regulome machine is t3a.2xlarge, and t2.xlarge is sufficient for the region-search machine. Do not forget to restart the services after resizing.
To deploy a regulome demo that uses your new deployment as backend, you need to edit https://github.com/ENCODE-DCC/regulome-encoded/blob/dev/ini-templates/production-template.ini and change the genomic_data_service_url to point to the instance running the flask app.
To deploy an encoded demo that uses your new deployment as the region-search backend, you need to edit https://github.com/ENCODE-DCC/encoded/blob/dev/conf/pyramid/demo.ini and change the genomic_data_service to point to the instance running the flask app.

ElasticSearch server only deployment

If you only want to deploy an Elasticsearch server, for RegulomeDB:
```
python deploy/deploy.py --es regulome
```
For Encode:
```
python deploy/deploy.py --es encode
```
Follow the instructions in Production grade deployment above to create a password for accessing the indexer and to start the indexer on this ES machine.
