Instructions on how to update the PubMedDB annually and how to use the non-relational database.
- infotojson.py - converts the information in the baseline XML files into a single JSON document
- jsontodb.py - reads the JSON document into a database
- gettfidf.py - queries the database based on user input to obtain TF-IDFs and writes the results to a file
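To make the XML-to-JSON step more concrete, here is a minimal sketch of how one baseline record could be mapped to a JSON document. It is not the code in infotojson.py. The function names (`article_to_doc`, `xml_to_docs`) and the sample record are made up for illustration, the word/stem counting is left out, and keying `QualifierName` by its `UI` attribute is an assumption.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written stand-in for one record of a baseline file (illustrative values, not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">0</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract><AbstractText>Example abstract text.</AbstractText></Abstract>
      </Article>
      <MedlineJournalInfo><Country>Canada</Country></MedlineJournalInfo>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000818">Animals</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def _text(elem):
    """Return the full text of an element (including nested markup), or "" if it is missing."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def article_to_doc(article):
    """Map one <PubmedArticle> element to a dict with the main fields of a PubMed record."""
    citation = article.find("MedlineCitation")
    abstract_parts = [_text(n) for n in citation.findall("Article/Abstract/AbstractText")]
    mesh = {}
    for heading in citation.findall("MeshHeadingList/MeshHeading"):
        descriptor = heading.find("DescriptorName")
        mesh[descriptor.get("UI")] = {
            "DescriptorName": _text(descriptor),
            # Assumption: qualifiers are keyed by their UI attribute.
            "QualifierName": {q.get("UI"): _text(q) for q in heading.findall("QualifierName")},
        }
    return {
        "PMID": _text(citation.find("PMID")),
        "ArticleTitle": _text(citation.find("Article/ArticleTitle")),
        "Abstract": {"Text": " ".join(abstract_parts)},
        "Country": _text(citation.find("MedlineJournalInfo/Country")),
        "MeshHeading": mesh,
    }


def xml_to_docs(xml_text):
    """Parse an XML string and return one dict per <PubmedArticle>."""
    root = ET.fromstring(xml_text)
    return [article_to_doc(a) for a in root.iter("PubmedArticle")]


docs = xml_to_docs(SAMPLE_XML)
print(json.dumps(docs, indent=2))
```

For the real baseline files, which are large and gzipped, a streaming parser such as `ET.iterparse` would likely be a better fit than loading each file into memory at once.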
All packages are provided within the YML environment file. A conda environment named pubmeddb
can be created and activated using the following commands.
conda env create -f ./pubmeddb.yml
conda activate pubmeddb
Please use the DATA TRANSFER node of Sockeye to download the PubMed baseline (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) and gene2pubmed (https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz).
ssh <cwl>@dtn.sockeye.arc.ubc.ca
Run the download script:
bash ./utils/dl_pubmeddata.sh
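For orientation, gene2pubmed.gz is a gzipped, tab-separated file with a header line and three columns: `#tax_id`, `GeneID` and `PubMed_ID`. The sketch below shows one way such rows could be grouped into one record per gene. It is only an illustration under stated assumptions: `group_gene2pubmed` is a hypothetical helper, the sample rows are invented, IDs are stored as integers by choice, and no gene name is included because gene2pubmed has no name column.

```python
import gzip
import io


def group_gene2pubmed(lines):
    """Group (tax_id, GeneID, PubMed_ID) rows into one record per gene."""
    genes = {}
    for line in lines:
        # Skip the "#tax_id ..." header and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        tax_id, gene_id, pmid = line.rstrip("\n").split("\t")
        record = genes.setdefault(
            int(gene_id),
            {"GeneID": int(gene_id), "TaxonomyID": int(tax_id), "PubMedID": []},
        )
        record["PubMedID"].append(int(pmid))
    return list(genes.values())


# Invented sample rows, compressed in memory so the example runs without the real file.
sample = "#tax_id\tGeneID\tPubMed_ID\n1\t10\t100\n1\t10\t101\n2\t20\t100\n"
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    gene_docs = group_gene2pubmed(handle)

print(gene_docs)
```

With the downloaded file, the in-memory sample would be replaced by `gzip.open("<path to gene2pubmed.gz>", "rt")`.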
Please edit the PBS -M line in pubmed_submit.sh with your email address.
##PBS -M <email>
Run the following commands on the COMPUTE node to submit the script as a job from a temporary/scratch directory (currently, the project directory is only readable by the compute nodes).
ssh <cwl>@sockeye.arc.ubc.ca
cd <SCRATCH DIR>
qsub /project/st-wasserww-1/PubMed_DB/pubmed_submit.sh
PubMedID Collection | Gene Collection |
---|---|
{ "PMID": "XX", "ArticleTitle": "xx", "Abstract": { "Text": "XX", "Words": { "Word1": { "Stems": [xx, xx, xx], "Count": 1 }, "Word2": { "Stems": [xx, xx, xx], "Count": 1 } } }, "Country": "XX", "MeshHeading": { "MeshIdentifier (e.g. D000818)": { "DescriptorName": "XX", "QualifierName": {} } } } | { "GeneID": XX, "Name": XX, "TaxonomyID": XX, "PubMedID": [xx, xx, xx] } |
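The per-word `Count` values in the PubMedID collection are what a TF-IDF score can be built from. As a rough sketch, the function below computes the textbook variant, term frequency times log(N / document frequency), over a few invented documents in the shape shown in the table above. The exact weighting used by gettfidf.py is not described here, so treat this as an assumption; `tf_idf` and the sample documents are hypothetical, and the `Stems` lists are placeholders that the calculation does not use.

```python
import math


def tf_idf(docs, term):
    """Return {PMID: score} for one term, using tf * log(N / df) over the given documents."""
    containing = [d for d in docs if term in d["Abstract"]["Words"]]
    if not containing:
        return {}
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(docs) / len(containing))
    scores = {}
    for doc in containing:
        words = doc["Abstract"]["Words"]
        total = sum(entry["Count"] for entry in words.values())
        # Term frequency: share of this abstract's word occurrences taken by the term.
        tf = words[term]["Count"] / total
        scores[doc["PMID"]] = tf * idf
    return scores


# Invented documents; only the fields needed for the calculation are filled in.
sample_docs = [
    {"PMID": "A", "Abstract": {"Words": {
        "gene": {"Stems": ["gene"], "Count": 2},
        "mouse": {"Stems": ["mouse"], "Count": 2},
    }}},
    {"PMID": "B", "Abstract": {"Words": {
        "gene": {"Stems": ["gene"], "Count": 1},
        "protein": {"Stems": ["protein"], "Count": 3},
    }}},
    {"PMID": "C", "Abstract": {"Words": {
        "protein": {"Stems": ["protein"], "Count": 1},
    }}},
]

gene_scores = tf_idf(sample_docs, "gene")
mouse_scores = tf_idf(sample_docs, "mouse")
print(gene_scores)
print(mouse_scores)
```

To restrict the calculation to the papers linked to a gene, the `PubMedID` list of a Gene collection document could be used to select the matching PubMedID collection documents first.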