(global) Convert HTML entities to UTF-8 characters #1513
@npmccallum There are cases where entity conversions done out of order can cause problems. This is particularly true of the beta code file in this batch, which is preserved for backup purposes and will not be fixed in the short term; I would recommend against any edits to this file. Large multi-file PRs are not encouraged due to other ongoing conversion work and the potential for conflicts. We generally work one corpus or author at a time. I can check whether there are any potential conflicts here, but I would request that no changes to diod.hist09-10_gk.xml be made at this time. (#1405) I have added a PR with a comment on this file.
@lcerrato Thanks for the summary! I caught the issue with This led me down a rabbit trail where I found a bunch of other beta code in other files. Is there a description somewhere of which files should be considered to meet quality standards? Would you like me to split this into multiple PRs?
@npmccallum Any file with a plain text equivalent in the package release is compliant. Additionally, the HookTest output (found under Actions) shows passing (= updated) files.
@npmccallum Happy to continue the conversation and answer any questions or discuss other contributions.
5f9fcca to dd66bb4
@lcerrato I have removed
Several of these files have too many changes to comprehensively review in this format.
I do see errant spaces around things like em dashes, which would typically be fixed at the time of conversion. I also see ligatures that do not require preservation. Middots all appear errant (but I could not check all works to confirm).
So a global change missed some of this refinement. It is good to have a "spot check" as it typically points to refinements we would not otherwise spot absent a careful read-through.
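A spot check of this kind could be partly automated. The following is only a minimal sketch: the function name `spot_check`, the check labels, and the choice of which characters count as ligatures are all assumptions made for illustration, not project conventions.

```python
import re

# Hypothetical checks; the ligature set in particular is an assumption.
CHECKS = {
    "space around em dash": re.compile("\\s\u2014|\u2014\\s"),
    "middot": re.compile("\u00b7"),
    "ligature": re.compile("[\ufb00-\ufb06\u00e6\u0153\u00c6\u0152]"),
}


def spot_check(text):
    """Return (line number, label) pairs for lines that trip a check."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

The output is a list of candidates for a human to look at, not a list of errors: some middots or ligatures may be intentional in a given work.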
<p>We read<note n="4" place="unspecified" anchored="yes">Casiri, <title>Bibliotheca Arabico-Hispana Escurialensis</title>, I. p. 339. Casiri's source is alQifti (d. 1248), the author of the <emph>Ta'rīkh al-H<*>ukamā</emph>, a collection of biographies of philosophers, mathematicians, astronomers etc.</note> that <quote>Euclid, son of Naucrates, grandson of Zenarchus<note n="5" place="unspecified" anchored="yes">The <title>Fihrist</title> says <quote>son of Naucrates, the son of Berenice (?)</quote>
(see Suter's translation in <title>Abhandlungen zur Gesch</title>. <emph>d</emph>. <title>Math</title>. VI. Heft, 1892, p. 16).</note>, called the author of geometry, a philosopher of somewhat ancient date, a Greek by nationality domiciled at Damascus, born at Tyre, most learned in the science of geometry, published a most excellent and most useful work entitled the foundation or elements of geometry, a subject in which no more general treatise existed before among the Greeks: nay, there was no one even of later date who did not walk in his footsteps and frankly profess his doctrine. Hence also Greek, Roman and Arabian geometers not a few, who undertook the task of illustrating this work, published commentaries, scholia, and notes upon it, and made an abridgment of the work itself. For this reason the Greek philosophers used to post up on the doors of their schools the well-known notice: ’Let no one come to our school, who has not first learned the elements of Euclid.’</quote>
The details at the beginning of this extract cannot be derived from Greek sources, for even Proclus did not know anything about Euclid's father, while it was not the Greek habit to record the names of grandfathers, as the Arabians commonly did. Damascus and Tyre were no doubt brought in to gratify a desire which the Arabians always showed to connect famous Greeks in some way or other with the East. Thus Nas<*>īraddīn, the translator of the <title>Elements</title>, who was of T<*>ūs in Khurāsān, actually makes Euclid out to have been <quote>Thusinus</quote>
also<note n="6" place="unspecified" anchored="yes">The same predilection made the Arabs describe Pythagoras as a pupil of the wise Salomo, Hipparchus as the exponent of Chaldaean philosophy or as the Chaldaean, Archimedes as an Egyptian etc. (H<*>ăjī Khalfa, <title>Lexicon Bibliographicum</title>, and Casiri).</note>. The readiness of the Arabians to run away with an idea is illustrated by the last words <pb n="5"/>of the extract. Everyone knows the story of Plato's inscription over the porch of the Academy: <quote>let no one unversed in geometry enter my doors</quote>
There is an error mark here, <*>, which has not been preserved. I believe there is an underdot with the H.
It might be worthwhile to fix the missing underdot throughout the text if it is always found as <*>
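One cautious way to approach this is to tally which letters carry the mark before changing anything. The sketch below assumes the mark is stored escaped as `&lt;*&gt;` in the XML source and that an underdot would be encoded as U+0323 (combining dot below); both are assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: the data-entry mark is stored escaped as &lt;*&gt; in the source.
MARK_RE = re.compile(r"(\w)&lt;\*&gt;")
# Assumption: an underdot is encoded as U+0323 COMBINING DOT BELOW.
COMBINING_DOT_BELOW = "\u0323"


def count_marked_letters(text):
    """Tally the letters that are immediately followed by the mark."""
    return Counter(m.group(1) for m in MARK_RE.finditer(text))


def apply_underdot(text, letters):
    """Replace the mark with an underdot, but only after the given letters."""
    def repl(m):
        letter = m.group(1)
        if letter in letters:
            return letter + COMBINING_DOT_BELOW
        return m.group(0)  # leave unconfirmed cases untouched
    return MARK_RE.sub(repl, text)
```

Restricting the replacement to an explicit set of letters keeps the change reviewable: since the mark was also used for unreadable print, each letter should be confirmed against the page images before it is added to the set.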
The details at the beginning of this extract cannot be derived from Greek sources, for even Proclus did not know anything about Euclid's father, while it was not the Greek habit to record the names of grandfathers, as the Arabians commonly did. Damascus and Tyre were no doubt brought in to gratify a desire which the Arabians always showed to connect famous Greeks in some way or other with the East. Thus Nas<*>īraddīn, the translator of the <title>Elements</title>, who was of T<*>ūs in Khurāsān, actually makes Euclid out to have been <quote>Thusinus</quote>
&lt;*&gt; appears twice (= <*>), and in both cases it appears that data entry used the mark because an underdot appears beneath these letters.
<p>V^{b} seldom has scholia common with the other older sources; for the most part they either appear in V^{b} alone or only in the later sources as v or F^{2} (later scholia in F), some being original, others not. In Book X. V^{b} has three series of numerical examples, (1) with Greek numerals, (2) alternatives added later, also mostly with Greek numerals, (3) with Arabic numerals. The last class were probably the work of the copyist himself. These examples (cf. p. 74 below) show the facility with which the Byzantines made calculations at the date of the MS. (12th c.). They prove also that the use of the Arabic numerals (in the East-Arabian form) was thoroughly established in the 12th c.; they were actually known to the Byzantines a century earlier, since they appear, in the first hand, in an Escurial MS. of the 11th c.</p>
<p>Of collections in other hands in V distinguished by Heiberg (see preface to Vol. v.), V^{1} has very few scholia which are found in other sources, the greater part being original; V^{2}, V^{3} are the work of the copyist himself; V^{4} are so in part only, and contain several scholia from Schol. Vat. and other sources. V^{3} and V^{4} are later than 13th —14th c., since they are not found in f (cod. Laurent. XXVIII, 6) which was copied from V and contains, besides V^{a} V^{b}, the greater part of V^{1} and VI. No. 20 of V^{2} (in the text).</p>
Several badly rendered manuscripts.
I think your request here was to remove the space before the mdash. But I'm not sure.
This file has several sets of encoding issues, which is to be expected in a mathematical text. I have logged the ones that caught my eye, but mainly for further work; these changes are not required here.
NB that <*> was the mark our data entry teams used when they could not read the print or did not know how to encode something. Whoever converted this file was unaware of that and has rendered it as &lt;*&gt;, which makes it harder to know what is going on.
I did not complete a comprehensive review: too many changes.
Convert HTML entities to UTF-8 characters
I don't understand why this was closed when it was ready for merging?
@lcerrato I found some other issues which are fixed in the split PRs. I think each one deserves individual review.
@npmccallum Ok, thanks for the clarification. I was going to merge pending the earlier check. Some/most of the files are too large to review in the GUI, so that probably won't be feasible on any scale in the short term. Almost all of these files will require additional review at the time of conversion (they are mostly translations), so I was ready to merge here because I could not do a comprehensive check.
@lcerrato Helpful tip: review of these diffs is MUCH easier using
This helps XML parsers which don't handle HTML entities.
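For illustration only, a conversion of this kind might look like the sketch below. It is not the script used for this PR; the function name `convert_entities` is hypothetical, and the sketch assumes that entities which decode to XML markup characters (`&lt;`, `&gt;`, `&amp;`, `&quot;`, `&apos;`, and their numeric forms) must stay escaped so the files remain well-formed.

```python
import re
from html.entities import html5

# Characters that must stay escaped for the XML to remain well-formed.
MARKUP_CHARS = set("<>&\"'")

# Named, decimal, and hexadecimal references, matched in one left-to-right pass.
ENTITY_RE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def convert_entities(text):
    """Replace HTML entities with their characters, except markup characters."""
    def repl(match):
        body = match.group(1)
        if body.startswith("#"):
            cp = int(body[2:], 16) if body[1] == "x" else int(body[1:])
            if not 0 < cp <= 0x10FFFF:
                return match.group(0)  # out of range: leave untouched
            ch = chr(cp)
        else:
            ch = html5.get(body + ";")
            if ch is None:
                return match.group(0)  # unknown name: leave untouched
        if MARKUP_CHARS & set(ch):
            return match.group(0)  # keep &lt;, &amp;, &#60; and similar escaped
        return ch
    return ENTITY_RE.sub(repl, text)
```

Doing the substitution in a single pass sidesteps one ordering problem: a sequence such as `&amp;mdash;` is left as it is, whereas decoding `&amp;` first and named entities afterwards would decode it twice.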
Files changed here:
data/tlg0057/tlg010/tlg0057.tlg010.perseus-eng1.xml
data/tlg0086/tlg009/tlg0086.tlg009.perseus-eng1.xml
data/tlg0086/tlg010/tlg0086.tlg010.perseus-eng1.xml
data/tlg0086/tlg025/tlg0086.tlg025.perseus-eng1.xml
data/tlg0086/tlg029/tlg0086.tlg029.perseus-eng1.xml
data/tlg0086/tlg034/tlg0086.tlg034.perseus-eng1.xml
data/tlg0086/tlg038/tlg0086.tlg038.perseus-eng1.xml
data/tlg0094/tlg001/tlg0094.tlg001.perseus-eng1.xml
data/tlg0094/tlg002/tlg0094.tlg002.perseus-eng1.xml
data/tlg0094/tlg003/tlg0094.tlg003.perseus-eng1.xml
data/tlg0099/tlg001/tlg0099.tlg001.perseus-eng2.xml
data/tlg0099/tlg001/tlg0099.tlg001.perseus-eng1.xml
data/tlg0526/tlg001/tlg0526.tlg001.perseus-eng1.xml
data/tlg0526/tlg002/tlg0526.tlg002.perseus-eng1.xml
data/tlg0526/tlg003/tlg0526.tlg003.perseus-eng1.xml
data/tlg0543/tlg001/tlg0543.tlg001.perseus-eng1.xml
data/tlg0555/tlg004/tlg0555.tlg004.perseus-grc1..xml
data/tlg1799/tlg001/tlg1799.tlg001.perseus-eng1.xml