(global) Convert HTML entities to UTF-8 characters #1513

Closed
wants to merge 18 commits into from

npmccallum
Contributor

@npmccallum npmccallum commented Oct 24, 2023

This helps XML parsers which don't handle HTML entities.

Files changed here:
data/tlg0057/tlg010/tlg0057.tlg010.perseus-eng1.xml
data/tlg0086/tlg009/tlg0086.tlg009.perseus-eng1.xml
data/tlg0086/tlg010/tlg0086.tlg010.perseus-eng1.xml
data/tlg0086/tlg025/tlg0086.tlg025.perseus-eng1.xml
data/tlg0086/tlg029/tlg0086.tlg029.perseus-eng1.xml
data/tlg0086/tlg034/tlg0086.tlg034.perseus-eng1.xml
data/tlg0086/tlg038/tlg0086.tlg038.perseus-eng1.xml
data/tlg0094/tlg001/tlg0094.tlg001.perseus-eng1.xml
data/tlg0094/tlg002/tlg0094.tlg002.perseus-eng1.xml
data/tlg0094/tlg003/tlg0094.tlg003.perseus-eng1.xml
data/tlg0099/tlg001/tlg0099.tlg001.perseus-eng2.xml
data/tlg0099/tlg001/tlg0099.tlg001.perseus-eng1.xml
data/tlg0526/tlg001/tlg0526.tlg001.perseus-eng1.xml
data/tlg0526/tlg002/tlg0526.tlg002.perseus-eng1.xml
data/tlg0526/tlg003/tlg0526.tlg003.perseus-eng1.xml

data/tlg0543/tlg001/tlg0543.tlg001.perseus-eng1.xml
data/tlg0555/tlg004/tlg0555.tlg004.perseus-grc1..xml
data/tlg1799/tlg001/tlg1799.tlg001.perseus-eng1.xml
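
For readers who want to reproduce this kind of conversion, here is a minimal Python sketch. It is not the script used for this PR, which is not shown in the thread. The function name `convert_entities` is hypothetical. The sketch assumes that the five XML predefined entities should stay escaped and that unknown names should be left alone rather than guessed at.

```python
import re
from html.entities import html5  # maps names such as "mdash;" to characters

# The five entities every XML parser understands; these stay escaped.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def convert_entities(text):
    """Replace HTML named entities with the characters they stand for."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # keep &lt;, &amp;, ... as they are
        # Names missing from the HTML5 table are returned unchanged.
        return html5.get(name + ";", match.group(0))

    return ENTITY_RE.sub(replace, text)


# Hypothetical sample modelled on the excerpts quoted later in this thread.
sample = "H&lt;*&gt;&abreve;j&imacr; Khalfa, 13th &mdash;14th c."
print(convert_entities(sample))
```

Keeping `&lt;` and `&gt;` escaped matters here because sequences such as `&lt;*&gt;` would otherwise become literal angle brackets and break well-formedness.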

@lcerrato
Collaborator

@npmccallum
Thank you.
As we update works to CTS and EpiDoc compliance, entities are removed as part of that workflow. Any file ending in "grc1" is likely not updated to the current best practices. This includes work such as header completion, encoding, and quality control issues.

There are cases where entity conversions done out of order can cause problems. This is particularly true of the beta code file in this batch, which is preserved for backup purposes and will not be fixed in the short term; I would recommend against any edits to this file.

Large multi-file PRs are not encouraged due to other ongoing conversion work and potential for conflicts. We generally work one corpus or author at a time.

I can check and see if there are any potential conflicts here, but I would request that no changes to diod.hist09-10_gk.xml be made at this time. (#1405) I have added a PR with a comment on this file.

@npmccallum
Contributor Author

@lcerrato Thanks for the summary!

I caught the issue with diod.hist09-10_gk.xml and wrote a script to convert it from beta code to unicode. This work does not update the file to best TEI practices. But it is at least forward progress. Would you like me to submit this?

This led me down a rabbit trail where I found a bunch of other beta code in other files. Is there a description somewhere of which files should be considered to meet quality standards?

Would you like me to split this into multiple PRs?

@lcerrato
Collaborator

@npmccallum
Thank you, but the unconverted file would need to be merged with other data, as it is a partial work. So it is not a matter of conversion per se, but rather of data that was not ported from the old collection to GitHub.
Ideally, all of this work would be addressed within the usual conversion workflow.
If the file is going to cause issues, I can move it to another repo for preservation (it was moved here for summer employees who would not otherwise have access).

@lcerrato
Collaborator

@npmccallum
For works that have been transitioned to the latest encoding, the change logs will say if the file has been updated to CTS and EpiDoc compliance and the file will be visible in the Scaife Viewer. As a rule of thumb, anything that is grc1 is likely not updated to best practices, but there will be files that were converted early on that could use refinement (sometimes these have things like beta code in footnotes that was missed or other irregularities).

Any file with a plain text equivalent in the package release is compliant.
https://github.com/PerseusDL/canonical-greekLit/releases

Additionally, the HookTest output (found under Actions) shows passing (= updated) files.

@lcerrato
Collaborator

@npmccallum
I am going to move this file out of here as it is just going to cause confusion. Perhaps you can either close and resubmit minus anything related to diod.hist09-10_gk.xml, or I can close this out and simply work on the global entity fix in another pass.
(Our testing regime seems broken at the moment so no changes as yet.)

Happy to continue the conversation and answer any questions or discuss other contributions.

@npmccallum npmccallum force-pushed the master branch 2 times, most recently from 5f9fcca to dd66bb4 on October 27, 2023 19:51
@npmccallum
Contributor Author

@lcerrato I have removed data/tlg0060/tlg001/diod.hist09-10_gk.xml from the commit. Please review.

Collaborator

@lcerrato lcerrato left a comment

Several of these files have too many changes to comprehensively review in this format.

I do see errant spaces around things like em dashes, which would typically be fixed at the time of conversion. I also see ligatures that do not require preservation. Middots all appear errant (but I could not check all works to confirm).

So a global change missed some of this refinement. It is good to have a "spot check" as it typically points to refinements we would not otherwise spot absent a careful read-through.
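
A spot check of this kind can be partly automated. The following is a rough sketch, not part of the project's tooling. The helper names `flag_spaced_dashes` and `count_middots` are hypothetical. It only reports candidate lines and leaves the decision about whether a space or middot is really errant to a human reader.

```python
import re

EM_DASH = "\u2014"
MIDDOT = "\u00b7"

# A whitespace character directly before or directly after an em dash.
SPACED_DASH_RE = re.compile(r"\s" + EM_DASH + "|" + EM_DASH + r"\s")


def flag_spaced_dashes(lines):
    """Return (line_number, line) pairs where a space touches an em dash."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if SPACED_DASH_RE.search(line)
    ]


def count_middots(lines):
    """Return the total number of middle dots, as a first hint for review."""
    return sum(line.count(MIDDOT) for line in lines)


# Hypothetical input: the first line has a space before the dash.
lines = ["13th \u201414th c.", "13th\u201414th c."]
for number, line in flag_spaced_dashes(lines):
    print(number, line)
```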

data/tlg0526/tlg002/tlg0526.tlg002.perseus-eng1.xml (outdated)
<p>We read<note n="4" place="unspecified" anchored="yes">Casiri, <title>Bibliotheca Arabico-Hispana Escurialensis</title>, I. p. 339. Casiri's source is alQifti (d. 1248), the author of the <emph>Ta'rīkh al-H&lt;*&gt;ukamā</emph>, a collection of biographies of philosophers, mathematicians, astronomers etc.</note> that <quote>Euclid, son of Naucrates, grandson of Zenarchus<note n="5" place="unspecified" anchored="yes">The <title>Fihrist</title> says <quote>son of Naucrates, the son of Berenice (?)</quote>
(see Suter's translation in <title>Abhandlungen zur Gesch</title>. <emph>d</emph>. <title>Math</title>. VI. Heft, 1892, p. 16).</note>, called the author of geometry, a philosopher of somewhat ancient date, a Greek by nationality domiciled at Damascus, born at Tyre, most learned in the science of geometry, published a most excellent and most useful work entitled the foundation or elements of geometry, a subject in which no more general treatise existed before among the Greeks: nay, there was no one even of later date who did not walk in his footsteps and frankly profess his doctrine. Hence also Greek, Roman and Arabian geometers not a few, who undertook the task of illustrating this work, published commentaries, scholia, and notes upon it, and made an abridgment of the work itself. For this reason the Greek philosophers used to post up on the doors of their schools the well-known notice: Let no one come to our school, who has not first learned the elements of Euclid.</quote>
The details at the beginning of this extract cannot be derived from Greek sources, for even Proclus did not know anything about Euclid's father, while it was not the Greek habit to record the names of grandfathers, as the Arabians commonly did. Damascus and Tyre were no doubt brought in to gratify a desire which the Arabians always showed to connect famous Greeks in some way or other with the East. Thus Nas&lt;*&gt;īraddīn, the translator of the <title>Elements</title>, who was of T&lt;*&gt;ūs in Khurāsān, actually makes Euclid out to have been <quote>Thusinus</quote>
also<note n="6" place="unspecified" anchored="yes">The same predilection made the Arabs describe Pythagoras as a pupil of the wise Salomo, Hipparchus as the exponent of Chaldaean philosophy or as the Chaldaean, Archimedes as an Egyptian etc. (H&lt;*&gt;ăjī Khalfa, <title>Lexicon Bibliographicum</title>, and Casiri).</note>. The readiness of the Arabians to run away with an idea is illustrated by the last words <pb n="5"/>of the extract. Everyone knows the story of Plato's inscription over the porch of the Academy: <quote>let no one unversed in geometry enter my doors</quote>
Collaborator

There is an error mark here <*> which has not been preserved. I believe there is an underdot with the H.

Collaborator

It might be worthwhile to fix the missing underdot throughout the text if it is always found as <*>
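
Before fixing the mark throughout the text, it may help to see where it occurs and which letter it follows. This is a minimal sketch under the assumption that the mark always appears in the file as the escaped sequence `&lt;*&gt;` directly after the affected letter, as in the excerpts quoted here. The function names are hypothetical, and the sketch deliberately does not insert an underdot itself.

```python
import re
from collections import Counter

# One letter followed by the escaped data-entry mark.
MARKER_RE = re.compile(r"(\w)&lt;\*&gt;")


def find_markers(text, context=15):
    """Return (letter, snippet) pairs for every mark found in the text."""
    hits = []
    for match in MARKER_RE.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append((match.group(1), text[start:end]))
    return hits


def count_by_letter(text):
    """Count how often the mark follows each letter."""
    return Counter(letter for letter, _ in find_markers(text))


# Sample assembled from the words quoted in this review.
sample = (
    "al-H&lt;*&gt;ukam\u0101, Nas&lt;*&gt;\u012bradd\u012bn, "
    "T&lt;*&gt;\u016bs, H&lt;*&gt;\u0103j\u012b Khalfa"
)
print(count_by_letter(sample))
```

If the counts show the mark only after a small set of letters, a reviewer can check those against the print edition before any global replacement.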

also<note n="6" place="unspecified" anchored="yes">The same predilection made the Arabs describe Pythagoras as a pupil of the wise Salomo, Hipparchus as the exponent of Chaldaean philosophy or as the Chaldaean, Archimedes as an Egyptian etc. (H&lt;*&gt;&abreve;j&imacr; Khalfa, <title>Lexicon Bibliographicum</title>, and Casiri).</note>. The readiness of the Arabians to run away with an idea is illustrated by the last words <pb n="5"/>of the extract. Everyone knows the story of Plato's inscription over the porch of the Academy: <quote>let no one unversed in geometry enter my doors</quote>
<p>We read<note n="4" place="unspecified" anchored="yes">Casiri, <title>Bibliotheca Arabico-Hispana Escurialensis</title>, I. p. 339. Casiri's source is alQifti (d. 1248), the author of the <emph>Ta'rīkh al-H&lt;*&gt;ukamā</emph>, a collection of biographies of philosophers, mathematicians, astronomers etc.</note> that <quote>Euclid, son of Naucrates, grandson of Zenarchus<note n="5" place="unspecified" anchored="yes">The <title>Fihrist</title> says <quote>son of Naucrates, the son of Berenice (?)</quote>
(see Suter's translation in <title>Abhandlungen zur Gesch</title>. <emph>d</emph>. <title>Math</title>. VI. Heft, 1892, p. 16).</note>, called the author of geometry, a philosopher of somewhat ancient date, a Greek by nationality domiciled at Damascus, born at Tyre, most learned in the science of geometry, published a most excellent and most useful work entitled the foundation or elements of geometry, a subject in which no more general treatise existed before among the Greeks: nay, there was no one even of later date who did not walk in his footsteps and frankly profess his doctrine. Hence also Greek, Roman and Arabian geometers not a few, who undertook the task of illustrating this work, published commentaries, scholia, and notes upon it, and made an abridgment of the work itself. For this reason the Greek philosophers used to post up on the doors of their schools the well-known notice: ’Let no one come to our school, who has not first learned the elements of Euclid.’</quote>
The details at the beginning of this extract cannot be derived from Greek sources, for even Proclus did not know anything about Euclid's father, while it was not the Greek habit to record the names of grandfathers, as the Arabians commonly did. Damascus and Tyre were no doubt brought in to gratify a desire which the Arabians always showed to connect famous Greeks in some way or other with the East. Thus Nas&lt;*&gt;īraddīn, the translator of the <title>Elements</title>, who was of T&lt;*&gt;ūs in Khurāsān, actually makes Euclid out to have been <quote>Thusinus</quote>
Collaborator

&lt;*&gt; appears twice here (that is, <*>). In both cases it appears that data entry used the mark because an underdot appears beneath these letters.

<p>V^{b} seldom has scholia common with the other older sources; for the most part they either appear in V^{b} alone or only in the later sources as v or F^{2} (later scholia in F), some being original, others not. In Book X. V^{b} has three series of numerical examples, (1) with Greek numerals, (2) alternatives added later, also mostly with Greek numerals, (3) with Arabic numerals. The last class were probably the work of the copyist himself. These examples (cf. p. 74 below) show the facility with which the Byzantines made calculations at the date of the MS. (12th c.). They prove also that the use of the Arabic numerals (in the East-Arabian form) was thoroughly established in the 12th c.; they were actually known to the Byzantines a century earlier, since they appear, in the first hand, in an Escurial MS. of the 11th c.</p>
<p>Of collections in other hands in V distinguished by Heiberg (see preface to Vol. v.), V^{1} has very few scholia which are found in other sources, the greater part being original; V^{2}, V^{3} are the work of the copyist himself; V^{4} are so in part only, and contain several scholia from Schol. Vat. and other sources. V^{3} and V^{4} are later than 13th &mdash;14th c., since they are not found in f (cod. Laurent. XXVIII, 6) which was copied from V and contains, besides V^{a} V^{b}, the greater part of V^{1} and VI. No. 20 of V^{2} (in the text).</p>
<p>Of collections in other hands in V distinguished by Heiberg (see preface to Vol. v.), V^{1} has very few scholia which are found in other sources, the greater part being original; V^{2}, V^{3} are the work of the copyist himself; V^{4} are so in part only, and contain several scholia from Schol. Vat. and other sources. V^{3} and V^{4} are later than 13th 14th c., since they are not found in f (cod. Laurent. XXVIII, 6) which was copied from V and contains, besides V^{a} V^{b}, the greater part of V^{1} and VI. No. 20 of V^{2} (in the text).</p>
Collaborator

Several badly rendered manuscript sigla (superscripts such as V^{b} and F^{2}).

Contributor Author

I think your request here was to remove the space before the mdash. But I'm not sure.

Collaborator

This file has several sets of encoding issues which is to be expected in a mathematical text. I have logged ones that caught my eye, but mainly for further work and these changes are not required here.
NB that <*> was the mark our data entry teams used when they could not read the print or did not know how to encode something. Whoever converted this file was unaware of that and has rendered it as the escaped text &lt;*&gt;, which makes it harder to know what is going on.

Collaborator

I did not complete a comprehensive review: too many changes.

@lcerrato
Collaborator

@npmccallum
A pull from master would be required due to updates to the testing regime made on the weekend.

@lcerrato
Collaborator

@npmccallum
I appreciate all of the work that went into this. Please let me know if there are specific works that you would like prioritized for further updates.

Convert HTML entities to UTF-8 characters
@npmccallum
Contributor Author

  1. I rebased the changes on the latest master.
  2. I split each file into a separate commit. This is a prelude to splitting into separate PRs, if you want.
  3. I fixed most of the cleanups mentioned above.
  4. I did not address the <*> or superscript issues as I'm not sure what precisely you want me to do with those, or the scope of the requested changes.

@lcerrato lcerrato changed the title from "Convert HTML entities to UTF-8 characters" to "(global) Convert HTML entities to UTF-8 characters" on Nov 22, 2023
@lcerrato
Collaborator

@npmccallum
I'll take a look. I think this is great for now.

@npmccallum
Contributor Author

@lcerrato I filed this issue for <*> markers: #1529

@npmccallum
Contributor Author

@lcerrato I filed this issue for the superscripts: #1530

@npmccallum npmccallum closed this Nov 22, 2023
@lcerrato
Collaborator

I don't understand why this was closed when it was ready for merging?

@lcerrato lcerrato reopened this Nov 22, 2023
@npmccallum
Contributor Author

@lcerrato I found some other issues which are fixed in the split PRs. I think each one deserves individual review.

@lcerrato
Collaborator

@npmccallum Ok, thanks for the clarification. I was going to merge pending the earlier check.

Some/most of the files are too large to review in the GUI, so that probably won't be feasible on any scale in the short term, but almost all of these files will require additional review at time of conversion (they are mostly translations) so I was ready to merge here because I could not do a comprehensive check.
(Work on tlg0555 corpus is underway at present)

@lcerrato lcerrato closed this Nov 22, 2023
@npmccallum
Contributor Author

@lcerrato Helpful tip: review of these diffs is MUCH easier using git diff --word-diff-regex=..
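
As I read the tip, it passes `.` as the regex to git's `--word-diff-regex` option (assuming the second dot is sentence punctuation), so every character counts as a word and only the changed characters are highlighted. The following Python sketch is a rough analogue of that idea using `difflib`. It is not git's algorithm, and the function name `char_diff` is hypothetical. It borrows the `[-old-]{+new+}` notation that git uses for word diffs.

```python
from difflib import SequenceMatcher


def char_diff(old, new):
    """Mark character-level changes as [-removed-] and {+added+}."""
    pieces = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(old[i1:i2])
            continue
        if i1 != i2:  # characters present only in the old text
            pieces.append("[-" + old[i1:i2] + "-]")
        if j1 != j2:  # characters present only in the new text
            pieces.append("{+" + new[j1:j2] + "+}")
    return "".join(pieces)


# Hypothetical before/after pair modelled on the entity conversion in this PR.
print(char_diff("13th &mdash;14th c.", "13th \u201414th c."))
```

For long XML lines like the ones in this PR, a character-level view makes a one-entity change visible without rereading the whole paragraph.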
