This repository contains the graphic-word tokenized texts of the following two repositories (I also provide them in zipped format):
The texts have been generated completely automatically from those original XML files that are well-formed and CTS-compliant (some are not). Some known conversion errors can be traced to annotation inconsistencies or errors in the original files, which I have not tried to fix. For example, an inconsistent CTS URN location in the XML file, or missing numbering for the verses of a poem, will generate errors (typically missing text).
Check the XQuery module in the `scripts` folder for details.
Each file contains the following information:
- the `@p` attribute lists the passage (the full CTS URN derives from merging this value and the CTS URN of the text, given in the `@text-cts` attribute of the `text` element)
- the `@n` attribute shows the running number id for each word (numbering restarts when the passage changes)
- the `text()` of each `t` element contains the word form
- the optional `@join` attribute specifies whether a punctuation mark should be attached to either the preceding (`b`) or the following (`a`) word
- the optional `@tag` attribute shows some special elements which contained the given word: more precisely, the `add`, `del`, `unclear`, `surplus`, `supplied`, and `seg` elements, which can be of interest to identify editorial interventions
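The attributes listed above can be read with any XML library. The following is a minimal Python sketch, not the repository's own code. The sample document, its URN, the nesting of `t` directly under `text`, the absence of XML namespaces, and the use of `:` to merge the text URN with `@p` are all assumptions made for illustration; the helper names `read_tokens` and `detokenize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: URN, nesting, and lack of namespaces are assumptions.
SAMPLE = """<text text-cts="urn:cts:greekLit:tlg0000.tlg000.example-grc1">
  <t p="1.1" n="1">μῆνιν</t>
  <t p="1.1" n="2">ἄειδε</t>
  <t p="1.1" n="3" join="b">,</t>
  <t p="1.1" n="4">θεά</t>
  <t p="1.2" n="1" tag="supplied">οὐλομένην</t>
</text>"""


def read_tokens(xml_string):
    """Return one dict per <t> element, with the merged passage URN."""
    root = ET.fromstring(xml_string)
    base_urn = root.get("text-cts")
    tokens = []
    for t in root.iter("t"):
        tokens.append({
            # Assumption: text URN and @p are merged with a colon.
            "urn": f"{base_urn}:{t.get('p')}",
            "n": int(t.get("n")),
            "form": t.text or "",
            "join": t.get("join"),  # "b", "a", or None
            "tag": t.get("tag"),    # e.g. "supplied", or None
        })
    return tokens


def detokenize(tokens):
    """Rebuild running text, honouring @join as described above:
    "b" attaches the token to the preceding word, "a" to the following one."""
    out = ""
    glue_next = False
    for tok in tokens:
        if out and not glue_next and tok["join"] != "b":
            out += " "
        out += tok["form"]
        glue_next = tok["join"] == "a"
    return out


tokens = read_tokens(SAMPLE)
line = detokenize(tokens[:4])
```

Filtering on the `tag` value (for instance keeping only tokens where `tag` is `None`) is one way to separate the transmitted text from editorial interventions.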
From release 1.0.0:
- Correction to the CTS URN structure, which now also takes the `seg` and `p` elements into account (currently `div`, `seg`, `p`, and `l` are considered)
- Addition of sentence splitting (on the basis of the following characters: ".", "·", ";", ":")
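As a rough illustration of character-based sentence splitting, the sketch below closes a sentence whenever a token is one of the four characters listed above. It is an assumption about how such a rule could work, not a description of the actual XQuery module; the function name `split_sentences` is hypothetical, and the raised dot is copied as it appears in this README (the source texts may encode it differently).

```python
# Sentence-final characters as listed in the release notes above.
SENTENCE_END = {".", "·", ";", ":"}


def split_sentences(forms):
    """Group a flat list of word forms into sentences.

    A sentence ends at any token found in SENTENCE_END; trailing tokens
    without a final punctuation mark form a last, unterminated sentence.
    """
    sentences = []
    current = []
    for form in forms:
        current.append(form)
        if form in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


example = split_sentences(["a", "b", ".", "c", ";", "d"])
```

Note that a purely character-based rule treats every ";" (the Greek question mark) and ":" as a boundary, so abbreviations or numerals containing a period would need extra handling.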
Cite the following work thus:
- Giuseppe G. A. Celano. (2017). Tokenized and sentence-splitted CTSized Ancient Greek texts (v1.1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.438311
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.