Web app for transforming entries of the Latin-Bulgarian dictionary into XML format which follows the TEI Lex-0 standard. It works with XML files, converted from DOCX by OxGarage.
The app is deployed on Google Cloud Run.
This app was commissioned by Sofia University to aid the creation of a digital Latin-Bulgarian dictionary. The original dictionary entries exist in DOCX format. With the help of the OxGarage tool they can be converted to XML which retains information about the original DOCX formatting. The result of this conversion is in turn transformed by the app into XML files which follow the TEI Lex-0 standard.
- Python 3.8+
- Pipenv
Download and install Pipenv. In the project directory, install all the dependencies through pipenv:
$ pipenv install
Settings for VSCode are provided in launch.json.
- Select the virtual environment created by Pipenv as your interpreter: Command Palette (Ctrl+Shift+P) > Python: Select Interpreter.
- Go to View > Run (Ctrl + Shift + D) and select "Python: Flask" configuration.
- Go to Run > Run Without Debugging (Ctrl + F5). Make sure your working directory is set to the project folder.
- The UI is available at http://localhost:5000/.
Use pipenv to set up the proper environment for the Python interpreter:
$ pipenv run python app.py
Alternatively, use pipenv shell to activate the virtualenv and then simply run python app.py.
The app is deployed on Google Cloud Run as a Docker container. The easiest way to build the container is by using docker-compose:
$ docker-compose build
or by using docker directly:
$ docker build -t gcr.io/xml-to-teilex/xml-to-teilex .
Use
$ docker-compose up -d
to start the container locally. The UI will be available at http://localhost:5000.
Configuration is done with environment variables. Currently, the following options are supported:
Name | Description | Default |
---|---|---|
SECRET_KEY | Secret key used for sessions by Flask | "secret_key" |
Make sure to set this in production to prevent session stealing.
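As an illustration, the setting could be read and a strong value generated roughly as follows. This is a minimal sketch using only the standard library; `load_secret_key` and `generate_secret_key` are hypothetical helper names and are not taken from app.py.

```python
import os
import secrets

# Default documented in the table above; it is not safe for production use.
DEFAULT_SECRET_KEY = "secret_key"


def load_secret_key(environ=None):
    """Return SECRET_KEY from the environment, or the documented default."""
    if environ is None:
        environ = os.environ
    return environ.get("SECRET_KEY", DEFAULT_SECRET_KEY)


def generate_secret_key():
    """Return a random 64-character hex string suitable as a secret key."""
    return secrets.token_hex(32)


if __name__ == "__main__":
    # Print a fresh key that can be exported as SECRET_KEY before starting the app.
    print(generate_secret_key())
```

The generated value can then be exported as the SECRET_KEY environment variable in the deployment environment before the app starts.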
The application is split into two parts: the transform engine (transform.py) and the frontend UI (app.py).
The transformation workflow follows a few steps:
- Load the input XML and the output template XML.
- Preprocess the input XML to fix irregularities in the markup.
- Loop through the input XML nodes and process each dictionary entry into its own output file (using the output template).
- Determine the part of speech of the currently processed entry.
- Based on the part of speech encode the morphological section (the output varies by POS).
- Encode the lexical part (senses and examples).
- For unrecognized entries (for example, entries with invalid syntax or too little content), try to encode them partially; as a fallback, append the entry to the output template for human intervention.
- Create a zip archive from all the generated output files.
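The overall shape of this workflow can be sketched with the standard library alone. The code below is a simplified illustration, not the contents of transform.py: the function names, the assumption that each entry is a plain (un-namespaced) `<p>` element, and the `<entry>` output element are all hypothetical, and the part-of-speech and sense encoding steps are reduced to copying the entry text.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def preprocess(root):
    """Hypothetical clean-up step: remove <p> elements that contain no text."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "p" and not "".join(child.itertext()).strip():
                parent.remove(child)
    return root


def encode_entry(node, template_xml):
    """Encode one entry into a fresh copy of the output template."""
    output = ET.fromstring(template_xml)  # parse the template anew for each entry
    entry = ET.SubElement(output, "entry")
    # Placeholder for the real morphological and lexical encoding.
    entry.text = "".join(node.itertext()).strip()
    return ET.tostring(output, encoding="unicode")


def build_archive(input_xml, template_xml):
    """Return the bytes of a zip archive with one XML file per entry."""
    root = preprocess(ET.fromstring(input_xml))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, node in enumerate(root.iter("p"), start=1):
            archive.writestr(f"entry_{index}.xml", encode_entry(node, template_xml))
    return buffer.getvalue()
```

Building the archive in memory means the result can be handed straight to the frontend for download without writing temporary files.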
The frontend is a simple Flask application which runs the transformation and serves the generated zip file.
The code is released under the MIT License. Contributions are welcome.