Extraction scripts for transforming the Orlando XML data into Linked Data (CIDOC edition) (cidoc-revisions branch)
Note: The CWRC version of these extraction scripts can be found on the Classic Branch
You must have Python 3.8 or later installed.
You must have a CWRC account with the appropriate permissions to download the source data. (Sign up here)
In Root folder:
- Create a Virtual Environment:
python3 -m venv venv
- Start Virtual Environment:
source ./venv/bin/activate
- Install modules:
pip install -r requirements.txt
- Create an .env file with username=XXX and password=YYY, replacing XXX and YYY with the respective credentials.
Example file:
username=John Doe
password=mySuperSecretpassword12!
- Run script:
python3 islandora_auth.py
(By default, this only downloads the Entries.)
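To check that the .env file is well formed before running the script, something like the following could be used. This is only a sketch: read_env is a hypothetical helper, not part of this repository, and it assumes the file holds plain key=value lines as in the example above. How islandora_auth.py itself loads the credentials may differ.

```python
from pathlib import Path


def read_env(path=".env"):
    """Parse simple key=value lines into a dict (hypothetical helper)."""
    credentials = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        if sep:
            credentials[key.strip()] = value.strip()
    return credentials


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        creds = read_env(env_file)
        # Report only whether each key is present, never the values.
        print("username set:", "username" in creds)
        print("password set:", "password" in creds)
    else:
        print("No .env file found in the current directory")
```

The check prints only whether each key is present, so the credentials never appear in the terminal output.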
Run these commands in the Biography folder (cd Biography):
- Update the default directory field within testcases.json to match the location of your source data files
- Create a Virtual Environment:
python3 -m venv venv
- Start Virtual Environment:
source ./venv/bin/activate
- Install modules:
pip install -r requirements.txt
- Run script:
python3 bio_extraction.py
Run python3 bio_extraction.py -h for a list of available options:
No particular testcases available, please add to testcases.json
usage: bio_extraction.py [-h] [-qa | -s | -g | -i | -id ORLANDO | -f FILE | -d DIRECTORY | -r [RANDOM] | -l [LAST] | -fi [FIRST]] [-v {0,1,2,3}] [-fmt {rdf,rdf/xml,ttl,turtle,json-ld,nt,trix,n3,all}] [-u UPDATE] [-p]
Extract the Majority of biography related data information from selection of orlando xml documents
optional arguments:
-h, --help show this help message and exit
-qa will run through qa test cases that are related to www.github.com/cwrc/testData/tree/master/qa, Which currently are:'aguigr', 'alcolo', 'atwoma', 'bronch', 'bronem', 'levyam', 'seacma',
'shakwi', 'woolvi'
-s, -special will run through special cases that are of particular interest atm which currently are: 'fielmi'
-g, -graffles, -graffle
will run through cases related to our graffles'seacma', 'lel___', 'edgema', 'blesma', 'leonan'
-i, -ignored will run through files that are currently being ignored which currently include: 'fielmi'
-id ORLANDO, -orlando ORLANDO, --orlando ORLANDO
entry id of a single orlando document to run extraction upon, ex. woolvi
-f FILE, -file FILE, --file FILE
single orlando xml document to run extraction upon
-d DIRECTORY, -directory DIRECTORY, --directory DIRECTORY
directory of files to run extraction upon
-r [RANDOM], -random [RANDOM], --random [RANDOM]
chooses {RANDOM} random file(s) to run extraction upon
-l [LAST], -last [LAST], --last [LAST]
chooses {last} file(s) to run extraction upon, ex. the last 20 files
-fi [FIRST], -first [FIRST], --first [FIRST]
chooses {first} file(s) to run extraction upon, ex. the first 20 files
-v {0,1,2,3}, --verbosity {0,1,2,3}
increase output verbosity
-fmt {rdf,rdf/xml,ttl,turtle,json-ld,nt,trix,n3,all}, --format {rdf,rdf/xml,ttl,turtle,json-ld,nt,trix,n3,all}
-p, -pause, --pause pause after every entry to examine output and be prompted to continue/quit
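The usage line above shows the file-selection options as alternatives separated by |, which is how argparse displays a mutually exclusive group. The sketch below illustrates how such a group might be declared. It is not the actual parser from bio_extraction.py: build_parser is a hypothetical name, only a few of the options are reproduced, and the const=1 fallback for a bare -r is an assumption.

```python
import argparse


def build_parser():
    """Build a reduced, illustrative parser (not the real one)."""
    parser = argparse.ArgumentParser(
        prog="bio_extraction.py",
        description="Sketch of a mutually exclusive file-selection group",
    )
    # Only one option from this group may be given per invocation.
    selection = parser.add_mutually_exclusive_group()
    selection.add_argument("-id", "--orlando",
                           help="entry id of a single document, ex. woolvi")
    selection.add_argument("-f", "--file", help="single XML document")
    selection.add_argument("-d", "--directory", help="directory of files")
    # nargs="?" makes the count optional; const=1 (an assumption) is used
    # when -r is given without a number.
    selection.add_argument("-r", "--random", nargs="?", type=int, const=1,
                           help="number of random files to process")
    parser.add_argument("-v", "--verbosity", type=int,
                        choices=[0, 1, 2, 3], default=0)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-r", "3", "-v", "2"])
    print(args.random, args.verbosity)
```

With a group declared this way, combining two selection options (for example -f and -d) makes argparse exit with a usage error instead of silently picking one.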
Each script within Biography/ can be run on its own. bio_extraction.py is the current main driver, which calls the needed functions within the separate scripts. The same arguments apply to those scripts.
Example:
If you only want to test the extraction of cultural forms, you can run python3 culturalForm.py -r 1
This extracts only from culturalform tags, from 1 random source file. Running the scripts individually allows for better testing and more modular classes.
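For illustration, choosing a number of random source files could look roughly like the sketch below. This is not the repository's implementation: pick_random_files is a hypothetical helper, and it assumes the source data is a flat directory of .xml files.

```python
import random
from pathlib import Path


def pick_random_files(directory, count=1, seed=None):
    """Return up to `count` randomly chosen .xml files from `directory`."""
    # Sort first so that a fixed seed gives a reproducible selection.
    files = sorted(Path(directory).glob("*.xml"))
    rng = random.Random(seed)
    # Never ask for more files than exist, which would raise ValueError.
    return rng.sample(files, min(count, len(files)))


if __name__ == "__main__":
    for path in pick_random_files(".", count=1):
        print(path.name)
```

Passing a seed is optional, but it makes a random test run repeatable when a particular file exposes a problem.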