ID_Extractor (ID_Ex) for extracting IDs and references from `jats` article files

Introductory remarks

Several scientific journals edited by the German Archaeological Institute use JATS XML so that they can be displayed in an instance of eLife Lens 2.0.0 (for example Archäologischer Anzeiger, see: https://publications.dainst.org/journals/aa). The articles are enriched with bibliographic and geographic authority data as well as other references to specific information resources of the institute's information infrastructure.

Approach

ID_Ex browses the .xml files stored in the article repository folder and extracts the pre-defined references. The results are stored in separate sqlite3 tables that relate each extracted record to the DOI of the article. The references point, for example, to

bibliographic records (zenon-IDs, see https://zenon.dainst.org/),
geographic authority data (gazetteer-IDs, see https://gazetteer.dainst.org/),
or records of other entities like objects (iDAI.objects-IDs, see https://arachne.dainst.org/) or records from archaeological fieldwork documentation systems (iDAI.field-IDs, see https://field.idai.world/).

ID_Ex is written in Python 3.12.0 and uses bs4 (the Beautiful Soup library), so it can easily be adapted for your own purposes.
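
To illustrate the approach, here is a minimal sketch of the extraction step. It is not the code used in ID_Extractor.py: to stay free of dependencies it uses the standard library's xml.etree.ElementTree instead of bs4, and it assumes that references appear as `ext-link` elements with an `xlink:href` attribute and that the DOI sits in an `article-id` element with `pub-id-type="doi"`. The URL patterns, the function name `extract_ids` and the sample article are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Attribute name of xlink:href once ElementTree has expanded the namespace prefix.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical URL patterns; the real link formats may differ.
ID_PATTERNS = {
    "zenon": re.compile(r"zenon\.dainst\.org/Record/(\w+)"),
    "gazetteer": re.compile(r"gazetteer\.dainst\.org/place/(\d+)"),
    "objects": re.compile(r"arachne\.dainst\.org/entity/(\d+)"),
}


def extract_ids(jats_xml: str) -> dict:
    """Return the article DOI and all IDs found in ext-link targets."""
    root = ET.fromstring(jats_xml)

    # The DOI identifies the article in every table.
    doi_node = root.find(".//article-id[@pub-id-type='doi']")
    doi = doi_node.text.strip() if doi_node is not None and doi_node.text else None

    found = {name: [] for name in ID_PATTERNS}
    for link in root.iter("ext-link"):
        href = link.get(XLINK_HREF, "")
        for name, pattern in ID_PATTERNS.items():
            match = pattern.search(href)
            if match:
                found[name].append(match.group(1))
    return {"doi": doi, "ids": found}


# Invented sample article, for demonstration only.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/example-doi</article-id>
  </article-meta></front>
  <body><p>
    <ext-link ext-link-type="uri" xlink:href="https://gazetteer.dainst.org/place/2281530">Place</ext-link>
    <ext-link ext-link-type="uri" xlink:href="https://zenon.dainst.org/Record/000123456">Book</ext-link>
  </p></body>
</article>"""

result = extract_ids(SAMPLE)
print(result)
```

With bs4, the same loop could be written with `find_all("ext-link")` on a parsed document; the pattern matching would stay the same.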

Mode of operation - and things to be done

When started for the first time, ID_Ex creates the required sqlite3 tables in a subfolder ("db_folder") if they do not exist yet. In the initial version of ID_Ex you have to enter the path to the repository folder containing the .jats files manually. ID_Ex extracts the data and saves them in these sqlite3 tables. To avoid duplicates, ID_Ex uses the DOI to check whether an article is already recorded and, if so, skips it. In addition, ID_Ex writes a detailed .txt log file listing the file names and the IDs extracted from them to a subfolder ("_ID_Ex_LOG"). With minor modifications, ID_Ex can be run at regular intervals (for example as a cron job) to keep the corpus up to date automatically.
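
The table creation and the DOI-based duplicate check described above can be sketched as follows. This is an illustration under assumptions, not the schema used by ID_Ex: the table names, column names and function names are hypothetical, only one ID table is shown, and an in-memory database stands in for the file in "db_folder".

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the tables only if they do not exist yet (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (doi TEXT PRIMARY KEY, file_name TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gazetteer_ids "
        "(doi TEXT NOT NULL, gazetteer_id TEXT NOT NULL)"
    )


def is_recorded(conn: sqlite3.Connection, doi: str) -> bool:
    """Check by DOI whether the article has already been processed."""
    row = conn.execute("SELECT 1 FROM articles WHERE doi = ?", (doi,)).fetchone()
    return row is not None


def record_article(conn, doi, file_name, gazetteer_ids) -> bool:
    """Store an article and its IDs; return False if the DOI is already known."""
    if is_recorded(conn, doi):
        return False
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO articles VALUES (?, ?)", (doi, file_name))
        conn.executemany(
            "INSERT INTO gazetteer_ids VALUES (?, ?)",
            [(doi, gid) for gid in gazetteer_ids],
        )
    return True


conn = sqlite3.connect(":memory:")  # a file path would be used in practice
init_db(conn)
first = record_article(conn, "10.0000/example-doi", "article_01.xml", ["2281530"])
second = record_article(conn, "10.0000/example-doi", "article_01.xml", ["2281530"])
stored = conn.execute("SELECT COUNT(*) FROM gazetteer_ids").fetchone()[0]
print(first, second, stored)
```

Because the second call finds the DOI in the `articles` table, it returns early and no rows are duplicated.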

New in v1.2.0:

Added a GUI to use the application more comfortably
Improved log-handling for multiple runs
Merged all tables into a single database ("ID_Ex_database.db")

New in v1.1.0:

A menu allows you to export the records of a selected table to a .txt file in the log subfolder, either directly after the extraction process or as a query against a previously generated database
Improved handling of the parameters needed for sqlite3 operations, using a dict that contains all necessary information to minimize repetition
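
The two points above can be sketched together: a single dict describes each table, and both table creation and the .txt export read their parameters from it. The dict layout, the names `TABLES`, `create_tables` and `export_table`, and the tab-separated output format are assumptions made for this example, not the structures used in ID_Ex.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical parameter dict: one entry per table, reused by every operation.
TABLES = {
    "gazetteer": {"table": "gazetteer_ids", "id_column": "gazetteer_id"},
    "zenon": {"table": "zenon_ids", "id_column": "zenon_id"},
}


def create_tables(conn: sqlite3.Connection) -> None:
    """Create every table described in TABLES."""
    for spec in TABLES.values():
        # Identifiers come from the fixed dict above, never from user input.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {spec['table']} "
            f"(doi TEXT NOT NULL, {spec['id_column']} TEXT NOT NULL)"
        )


def export_table(conn: sqlite3.Connection, key: str, target: Path) -> int:
    """Write all records of the selected table to a text file, one per line."""
    spec = TABLES[key]
    rows = conn.execute(
        f"SELECT doi, {spec['id_column']} FROM {spec['table']} "
        f"ORDER BY doi, {spec['id_column']}"
    ).fetchall()
    lines = [f"{doi}\t{record_id}\n" for doi, record_id in rows]
    target.write_text("".join(lines), encoding="utf-8")
    return len(rows)


conn = sqlite3.connect(":memory:")
create_tables(conn)
conn.execute(
    "INSERT INTO gazetteer_ids VALUES (?, ?)", ("10.0000/example-doi", "2281530")
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "gazetteer_export.txt"
    exported = export_table(conn, "gazetteer", out_file)
    content = out_file.read_text(encoding="utf-8")

print(exported, repr(content))
```

Adding a further table then only requires a new entry in the dict rather than another set of SQL statements.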

To be done:

Enable automatic scraping of scattered repositories containing .jats article files
Add, step by step, features to export the records as .json files or in other formats
Enable ID_Ex to handle more complex queries and requests
Implement an autonomous mode so that ID_Ex can be run from a cron job
Improve the GUI, especially its exception handling

Technical remarks

Python 3.12.0
bs4 (Beautiful Soup)
sqlite3
Tested on Windows (not yet on Linux)
