wassermanlab/pubmed_db

Instructions on how to update the PubMedDB annually and how to use the non-relational database.

Table of Contents

  1. Scripts
  2. Conda Environment
  3. Baseline
  4. XML to JSON
  5. Usage

Scripts

  • infotojson.py - converts the information in the baseline XML files into a single JSON document
  • jsontodb.py - reads the JSON document into a database
  • gettfidf.py - queries the database based on user input to obtain TF-IDF scores and writes the results to a file
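
For readers unfamiliar with TF-IDF, the sketch below shows one common way to compute it over tokenized abstracts. It is not taken from gettfidf.py: the function name `tf_idf`, the toy documents, and the particular weighting (term frequency as count divided by document length, inverse document frequency as log(N / df)) are all assumptions made for illustration, and the script may use a different variant.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists. Uses tf = count / len(doc) and
    idf = log(N / df), which is only one of several common variants.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus standing in for tokenized abstracts (hypothetical data).
docs = [
    ["gene", "expression", "liver"],
    ["gene", "mutation", "cancer"],
    ["liver", "cancer", "cancer"],
]
scores = tf_idf(docs)
```

A term that appears in every document gets a score of zero under this weighting, so terms specific to a few abstracts rank highest.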


Conda Environment

All required packages are listed in the YML environment file (pubmeddb.yml). Create and activate a conda environment named pubmeddb with the following commands.

conda env create -f ./pubmeddb.yml
conda activate pubmeddb


Baseline

Use the data transfer node of Sockeye to download the PubMed baseline (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) and gene2pubmed (https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz). Log in to the node first:

ssh <cwl>@dtn.sockeye.arc.ubc.ca

Then run the download script:

bash ./utils/dl_pubmeddata.sh
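
To check what a download should contain, it can help to list the baseline archives linked from the directory page. The sketch below is not part of dl_pubmeddata.sh. It assumes the listing is an HTML page whose links look like `href="<name>.xml.gz"`, and the function name `baseline_file_urls` and the sample file names are hypothetical.

```python
import re
from urllib.parse import urljoin

BASELINE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"


def baseline_file_urls(index_html, base_url=BASELINE_URL):
    """Return sorted absolute URLs of the .xml.gz files linked in a listing page."""
    # Match only links that end in .xml.gz, so checksum files are skipped.
    names = set(re.findall(r'href="([^"/]+\.xml\.gz)"', index_html))
    return [urljoin(base_url, name) for name in sorted(names)]


# Hypothetical listing excerpt; real file names change with each annual release.
sample = (
    '<a href="sample0002.xml.gz">sample0002.xml.gz</a>\n'
    '<a href="sample0002.xml.gz.md5">sample0002.xml.gz.md5</a>\n'
    '<a href="sample0001.xml.gz">sample0001.xml.gz</a>\n'
)
urls = baseline_file_urls(sample)
```

Comparing the number of URLs found against the number of files on disk is a quick way to spot an incomplete transfer.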


XML to JSON

Set the PBS -M line in pubmed_submit.sh to your email address.

##PBS -M <email>

Log in to the compute node and submit the script as a job from a temporary/scratch directory (the project directory is currently read-only on the compute nodes).

ssh <cwl>@sockeye.arc.ubc.ca
cd <SCRATCH DIR>
qsub /project/st-wasserww-1/PubMed_DB/pubmed_submit.sh
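
For orientation, the kind of mapping this step performs can be sketched with the standard library. This is not the code in infotojson.py. The function name `article_to_record` and the sample XML are hypothetical, the sample is a heavily reduced fragment using standard PubMed element names, storing qualifiers as a UI-to-name mapping is an assumption, and the per-word stems and counts are left out.

```python
import json
import xml.etree.ElementTree as ET

# Reduced, made-up fragment in the shape of a PubMed baseline file.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
      </Article>
      <MedlineJournalInfo><Country>Canada</Country></MedlineJournalInfo>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000818">Animals</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def article_to_record(article):
    """Map one <PubmedArticle> element to a dict shaped like the JSON fields."""
    citation = article.find("MedlineCitation")
    # Abstracts may be split into several AbstractText sections; join them.
    abstract_parts = [
        "".join(node.itertext())
        for node in citation.iterfind("Article/Abstract/AbstractText")
    ]
    mesh = {}
    for heading in citation.iterfind("MeshHeadingList/MeshHeading"):
        descriptor = heading.find("DescriptorName")
        mesh[descriptor.get("UI")] = {
            "DescriptorName": descriptor.text,
            # Assumption: qualifiers are keyed by their UI attribute.
            "QualifierName": {
                q.get("UI"): q.text for q in heading.iterfind("QualifierName")
            },
        }
    title = citation.find("Article/ArticleTitle")
    return {
        "PMID": citation.findtext("PMID"),
        "ArticleTitle": "".join(title.itertext()) if title is not None else "",
        "Abstract": {"Text": " ".join(abstract_parts)},
        "Country": citation.findtext("MedlineJournalInfo/Country", default=""),
        "MeshHeading": mesh,
    }


root = ET.fromstring(SAMPLE_XML)
records = [article_to_record(a) for a in root.iter("PubmedArticle")]
print(json.dumps(records[0], indent=2))
```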


JSON Fields

PubMedID Collection

{
    "PMID": "XX",
    "ArticleTitle": "xx",
    "Abstract": {
        "Text": "XX",
        "Words": {
            "Word1": {
                "Stems": [xx, xx, xx],
                "Count": 1
            },
            "Word2": {
                "Stems": [xx, xx, xx],
                "Count": 1
            }
        }
    },
    "Country": "XX",
    "MeshHeading": {
        "MeshIdentifier (Ex. D000818)": {
            "DescriptorName": "XX",
            "QualifierName": {}
        }
    }
}

Gene Collection

{
    "GeneID": XX,
    "Name": XX,
    "TaxonomyID": XX,
    "PubMedID": [xx, xx, xx]
}
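
As an illustration of how Gene Collection documents relate to gene2pubmed, the sketch below groups rows by gene. It is not the repository's code. It assumes the file is tab-separated with the columns tax_id, GeneID and PubMed_ID and a header line starting with `#`, the function name `gene_documents` and the sample rows are hypothetical, and the Name field is omitted because gene2pubmed does not carry gene names.

```python
import csv
import io
from collections import defaultdict


def gene_documents(lines):
    """Group gene2pubmed rows (tax_id, GeneID, PubMed_ID) into one document per gene."""
    taxonomy = {}
    pmids = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        tax_id, gene_id, pmid = (int(value) for value in row[:3])
        taxonomy[gene_id] = tax_id
        pmids[gene_id].append(pmid)
    return [
        {
            "GeneID": gene_id,
            "TaxonomyID": taxonomy[gene_id],
            "PubMedID": sorted(set(ids)),  # de-duplicate and order the PMIDs
        }
        for gene_id, ids in sorted(pmids.items())
    ]


# Made-up rows for illustration; real input would come from gene2pubmed.gz,
# for example through gzip.open(path, "rt").
sample = io.StringIO(
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1\t111\n"
    "9606\t1\t222\n"
    "10090\t2\t111\n"
)
docs = gene_documents(sample)
```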

Usage
