For ease of access, some of the information available through the Harvard Personal Genome Project site and the GET-Evidence site has been consolidated into a small SQLite database (~120 MB uncompressed). This project is a collection of scripts that download the data, consolidate it into a SQLite database, upload it to an Arvados project, and create an HTML visualization front end for easy exploration of the data.
You can explore the most recent snapshot of the Harvard Personal Genome Project database through a Curoverse-hosted collection.
To grab the repository:
$ git clone https://github.com/abeconnelly/untap
$ cd untap
We need to run the application inside an HTTP server:
$ cd $HOME
$ sudo apt-get install nginx
$ sudo /etc/init.d/nginx start
$ sudo mkdir -p /var/www
$ sudo tee /etc/nginx/sites-enabled/untap > /dev/null <<EOF
server {
root /var/www;
location / {
}
}
EOF
$ sudo ln -s $HOME/untap /var/www/untap
$ sudo chmod -R 777 /var/www/untap
$ sudo nginx -s reload
$ cd html
$ python -m SimpleHTTPServer   # Python 2 only; with Python 3 use: python3 -m http.server
Now we need to obtain a dataset. Either 1) download the snapshot from the Untap collection hosted on Curoverse, or 2) follow the instructions in the following section to scrape Tapestry and build your own snapshot. In both cases, the database should be put in the root directory, i.e. /untap/hu-pgp.sqlite3.gz.
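A truncated or mislabeled download is an easy way to end up with an application that never shows any graphs, so it can be worth checking the snapshot before putting it in place. The following is a minimal sketch, not part of the repository: the helper name `check_snapshot` is hypothetical, and it only tests that the file is a valid gzip archive whose contents begin with the standard SQLite 3 file header.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of untap): succeed only if the file is a
# valid gzip archive whose contents start with the SQLite 3 file header.
check_snapshot() {
  local file="$1" header
  # gzip -t tests archive integrity without writing any output.
  gzip -t "$file" 2>/dev/null || { echo "not a valid gzip file: $file" >&2; return 1; }
  # Every SQLite 3 database begins with the string "SQLite format 3".
  header=$(gzip -dc "$file" 2>/dev/null | head -c 15)
  [ "$header" = "SQLite format 3" ] || { echo "not an SQLite 3 database: $file" >&2; return 1; }
}

# Example: check the snapshot (default name taken from the text above).
snapshot="${1:-hu-pgp.sqlite3.gz}"
if [ -f "$snapshot" ]; then
  check_snapshot "$snapshot" && echo "$snapshot looks OK"
fi
```

The check is deliberately shallow: it does not validate the tables inside the database, only that the file is the kind of file the application expects.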
Now if you go to Untap.html you should see the application running and tabs such as "Summary" should show graphs when you select a dropdown option (e.g. "allergies").
The Quick start uses a static snapshot of the database and may not be up-to-date. To re-scrape all the data yourself for a more up-to-date copy, see the following instructions.
You may need to install several dependencies if they are not already present.
$ sudo apt-get install jq
$ sudo add-apt-repository -y ppa:ethereum/ethereum
$ sudo apt-get install golang
$ mkdir -p ~/go; echo 'export GOPATH=$HOME/go' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HOME/go/bin:/usr/local/go/bin' >> ~/.bashrc
$ source ~/.bashrc
$ go get github.com/ericchiang/pup   # with Go 1.17 or later use: go install github.com/ericchiang/pup@latest
$ sudo apt-get install parallel
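Before starting a scrape, it can help to confirm that the tools installed above are actually on your PATH. The following is a minimal sketch; the function name `missing_tools` is hypothetical and not part of the repository, and the tool list simply mirrors the packages named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of untap): print each named command
# that cannot be found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools named in the dependency list above.
missing=$(missing_tools jq go pup parallel)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All dependencies found."
fi
```

If `pup` is reported missing right after installation, the likely cause is that the Go bin directory is not yet on PATH in the current shell.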
To download the database from my.pgp-hms.org and evidence.pgp-hms.org, run:
$ ./public-database-snapshot
If you would like to upload to an Arvados project (requires an account on an Arvados system and appropriate config files):
$ ./upload-to-arvados
Installing the html directory in the appropriate place will allow you to see the visualization. Care needs to be taken to make sure the SQLite database file gets copied over properly.
For a guided walkthrough of how to use this application, see Introduction.
Since the SQLite database is so small (~120 MB uncompressed), it can be loaded into the browser and explored directly. There are a few canned visualizations, explanations of the SQLite schema, and custom visualizations available. The database sometimes takes a while to load, so please be patient if you don't immediately see any graphs in the Summary, Variants or Custom section.
The Summary section includes some canned summary statistics for the Harvard Personal Genome Project cohort, including age distribution, gender, ethnicity, etc.
The Variants section shows a matrix of participants who have genomic data and variants.
The Custom section allows you to run your own queries. Some example queries can be selected in the lower right-hand corner.
This page gives the schema for the SQLite database provided.
This page gives some simple queries that allow you to explore the underlying tables that exist in the SQLite database.
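The same kind of table exploration can also be done outside the browser with the `sqlite3` command-line shell. The following is a rough sketch under a few assumptions: `sqlite3` is installed, the snapshot has been decompressed to `hu-pgp.sqlite3`, and the helper names `list_tables` and `show_schema` are hypothetical. It queries only `sqlite_master`, which exists in every SQLite database, so no table names from the snapshot are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of untap). They query only sqlite_master,
# the catalog table present in every SQLite database.

# Print the name of every table in the database, one per line.
list_tables() {
  sqlite3 "$1" "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
}

# Print the CREATE TABLE statement for one table.
# The table name is interpolated as-is, so pass only trusted names.
show_schema() {
  sqlite3 "$1" "SELECT sql FROM sqlite_master WHERE type='table' AND name='$2';"
}

# Example, assuming the snapshot was decompressed with
#   gzip -dc hu-pgp.sqlite3.gz > hu-pgp.sqlite3
db="${1:-hu-pgp.sqlite3}"
if [ -f "$db" ]; then
  list_tables "$db"
fi
```

Any table name printed by `list_tables` can then be passed to `show_schema` to see its columns before writing a query against it.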
Source code is provided under AGPLv3. All collected data from the Harvard Personal Genome Project is under CC0.