This section describes the steps required to convert the bioCADDIE benchmark and test collection data for indexing.
Convert bioCADDIE benchmark data to trectext format.
Download the bioCADDIE benchmark collection in JSON format:
mkdir -p /data/biocaddie/data
cd /data/biocaddie/data
wget https://biocaddie.org/sites/default/files/update_json_folder.zip
Convert data to TREC-text format:
cd ~/biocaddie
scripts/dats2trec.sh
Note: You may see the following error, which is expected:
java.lang.ClassCastException: com.google.gson.JsonNull cannot be cast to com.google.gson.JsonObject
at edu.gslis.biocaddie.util.DATSToTrecText.main(DATSToTrecText.java:61)
This converts the benchmark data to trectext format and produces the file /data/biocaddie/data/biocaddie_all.txt. You can remove the original benchmark data, if desired.
Build the indexes (see below).
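Before building the indexes, it can be worth checking that the conversion produced a plausible file, especially since the converter is expected to log an exception. The helper below is a minimal sketch, not part of the repository: `count_trec_docs` is a hypothetical name, and it assumes the converter writes each opening `<DOC>` tag at the start of a line, as is conventional for TREC-text.

```shell
# Count documents in a TREC-text file by counting opening <DOC> tags.
# Assumption: every document begins with "<DOC>" at the start of a line.
# The pattern does not match "<DOCNO>" because it requires the closing ">".
count_trec_docs() {
    grep -c '^<DOC>' "$1"
}

# Example (path taken from the conversion step above):
#   count_trec_docs /data/biocaddie/data/biocaddie_all.txt
```

A count of zero, or one far below the number of JSON records downloaded, would suggest the conversion stopped early.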
Converting PubMed data to trectext.
Download the PubMed oa_bulk datasets to /data/pubmed/oa_bulk:
mkdir -p /data/pubmed/oa_bulk
cd /data/pubmed/oa_bulk
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.0-9A-B.txt.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.C-H.txt.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.I-N.txt.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.O-Z.txt.tar.gz
cd ~/biocaddie
scripts/pmc2trec.sh
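The four downloads above differ only in the range embedded in the archive name, so they can also be written as a loop. This is a sketch under that assumption: `pmc_archive_urls` and `fetch_pmc_archives` are hypothetical helper names, and the `wget` options used are `-c` (resume a partial download) and `-P` (set the target directory).

```shell
# Print the four oa_bulk archive URLs listed above, one per line.
pmc_archive_urls() {
    base="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk"
    for part in 0-9A-B C-H I-N O-Z; do
        echo "$base/non_comm_use.$part.txt.tar.gz"
    done
}

# Download every archive into a target directory (hypothetical helper).
# The default directory matches the path used in the steps above.
fetch_pmc_archives() {
    dest="${1:-/data/pubmed/oa_bulk}"
    mkdir -p "$dest"
    pmc_archive_urls | while read -r url; do
        wget -c -P "$dest" "$url"
    done
}

# Example:
#   fetch_pmc_archives /data/pubmed/oa_bulk
```

Resuming with `-c` is a convenience for large archives; rerunning the function after an interrupted transfer continues where it stopped rather than starting over.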
This section describes the steps required to build Indri, Lucene and ElasticSearch indexes.
Use IndriBuildIndex to build the biocaddie_all index (customize paths as needed):
mkdir -p /data/biocaddie/indexes
cd ~/biocaddie
IndriBuildIndex index/build_index.biocaddie.params
The following command will build a Krovetz-stemmed index:
mkdir -p /data/biocaddie/indexes
cd ~/biocaddie
IndriBuildIndex index/build_index.biocaddie.krovetz.params
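When customizing paths, it helps to know roughly what an IndriBuildIndex parameter file looks like. The sketch below generates one from the shell. It is not the content of the repository's `.params` files, which may set additional options (fields, memory, stopwords). `write_indri_params` is a hypothetical helper, and the example paths are assumptions based on the directories used above.

```shell
# Write a minimal IndriBuildIndex parameter file.
#   $1 = output parameter file, $2 = corpus path,
#   $3 = index path,            $4 = optional stemmer name (e.g. krovetz)
write_indri_params() {
    out="$1"; corpus="$2"; index="$3"; stemmer="${4:-}"
    {
        echo "<parameters>"
        echo "  <index>$index</index>"
        echo "  <corpus>"
        echo "    <path>$corpus</path>"
        echo "    <class>trectext</class>"
        echo "  </corpus>"
        # Emit a stemmer element only when a stemmer name was given.
        if [ -n "$stemmer" ]; then
            echo "  <stemmer><name>$stemmer</name></stemmer>"
        fi
        echo "</parameters>"
    } > "$out"
}

# Example (illustrative paths, not taken from the repository):
#   write_indri_params /tmp/my.params \
#       /data/biocaddie/data/biocaddie_all.txt \
#       /data/biocaddie/indexes/biocaddie_all krovetz
#   IndriBuildIndex /tmp/my.params
```

Omitting the fourth argument yields an unstemmed index, which mirrors the difference between the two commands above.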
The following command will build a Lucene 6.5 index with no stemming:
mkdir -p /data/biocaddie/lucene/
cd ~/biocaddie
scripts/run.sh edu.gslis.lucene.main.LuceneBuildIndex index/lucene_biocaddie.yaml
The following command will build a Lucene 6.5 index with Snowball stemming:
mkdir -p /data/biocaddie/lucene/
cd ~/biocaddie
scripts/run.sh edu.gslis.lucene.main.LuceneBuildIndex index/lucene_biocaddie_snowball.yaml
To build the ElasticSearch index (this assumes a running ElasticSearch instance):
cd ~/biocaddie/elasticsearch/biocaddie
./create-index.sh
./index-biocaddie.sh
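Since the scripts above need a reachable ElasticSearch instance, a quick check beforehand can save a failed run. The function below is a sketch only: `es_is_up` is a hypothetical name, and the default URL assumes ElasticSearch is listening on its standard HTTP port 9200 on the local machine, which may not match your setup or the URL the scripts use.

```shell
# Return 0 if an HTTP endpoint answers successfully, non-zero otherwise.
#   $1 = base URL (assumed default: http://localhost:9200)
# curl options: -s silent, -f fail on HTTP errors, -o discard the body,
# --max-time bounds the wait so the check cannot hang.
es_is_up() {
    url="${1:-http://localhost:9200}"
    curl -s -f -o /dev/null --max-time 5 "$url"
}

# Example:
#   es_is_up && ./create-index.sh && ./index-biocaddie.sh
```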
The PubMed conversion (scripts/pmc2trec.sh, described above) produces output in /data/pubmed/trecText/ containing the TREC-formatted documents.
Create the pubmed index:
mkdir -p /data/pubmed/indexes
cd ~/biocaddie
IndriBuildIndex index/build_index.pubmed.params
This will create an Indri index in /data/pubmed/indexes/pubmed.
The following command will build a Lucene 6.5 index of the PubMed data with no stemming:
mkdir -p /data/biocaddie/lucene/
cd ~/biocaddie
scripts/run.sh edu.gslis.lucene.main.LuceneBuildIndex index/lucene_pubmed.yaml