(global) Convert HTML entities to UTF-8 characters #1513

npmccallum · 2023-10-24T17:28:26Z

This helps XML parsers which don't handle HTML entities.

Files changed here:
data/tlg0057/tlg010/tlg0057.tlg010.perseus-eng1.xml
data/tlg0086/tlg009/tlg0086.tlg009.perseus-eng1.xml
data/tlg0086/tlg010/tlg0086.tlg010.perseus-eng1.xml
data/tlg0086/tlg025/tlg0086.tlg025.perseus-eng1.xml
data/tlg0086/tlg029/tlg0086.tlg029.perseus-eng1.xml
data/tlg0086/tlg034/tlg0086.tlg034.perseus-eng1.xml
data/tlg0086/tlg038/tlg0086.tlg038.perseus-eng1.xml
data/tlg0094/tlg001/tlg0094.tlg001.perseus-eng1.xml
data/tlg0094/tlg002/tlg0094.tlg002.perseus-eng1.xml
data/tlg0094/tlg003/tlg0094.tlg003.perseus-eng1.xml
data/tlg0099/tlg001/tlg0099.tlg001.perseus-eng2.xml
data/tlg0099/tlg001/tlg0099.tlg001.perseus-eng1.xml
data/tlg0526/tlg001/tlg0526.tlg001.perseus-eng1.xml
data/tlg0526/tlg002/tlg0526.tlg002.perseus-eng1.xml
data/tlg0526/tlg003/tlg0526.tlg003.perseus-eng1.xml

data/tlg0543/tlg001/tlg0543.tlg001.perseus-eng1.xml
data/tlg0555/tlg004/tlg0555.tlg004.perseus-grc1..xml
data/tlg1799/tlg001/tlg1799.tlg001.perseus-eng1.xml
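
For reference, the kind of conversion described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this PR: the function name `convert_entities` is hypothetical, and it assumes that only named HTML entities need replacing, while the five entities predefined by XML (and numeric character references, which XML parsers already accept) are left untouched.

```python
import re
from html.entities import html5  # stdlib table of HTML5 named entities

# The five entities predefined by XML itself must stay escaped,
# otherwise the markup would no longer be well-formed.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named entities only, e.g. "&mdash;". Numeric references such as
# "&#8212;" are not matched and are therefore left alone.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def convert_entities(text: str) -> str:
    """Replace named HTML entities with their Unicode characters.

    Hypothetical helper: unknown names and XML-predefined entities
    are returned unchanged.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        # Keys in html5 carry the trailing semicolon, e.g. "mdash;".
        return html5.get(name + ";", match.group(0))

    return ENTITY_RE.sub(replace, text)
```

Writing the result back with `encoding="utf-8"` would then give a file that a plain XML parser can read without an HTML entity table.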

lcerrato · 2023-10-24T20:05:53Z

@npmccallum
Thank you.
As we update works to CTS and EpiDoc compliance, entities are removed as part of that workflow. Any file ending in "grc1" is likely not updated to the current best practices. This includes work such as header completion, encoding, and quality control issues.

There are cases where entity conversions done out of order can cause problems. This is particularly true of the beta code file in this batch, which is preserved for backup purposes and will not be fixed in the short term. I would recommend against any edits to this file.

Large multi-file PRs are not encouraged due to other ongoing conversion work and potential for conflicts. We generally work one corpus or author at a time.

I can check and see if there are any potential conflicts here, but I would request that no changes to diod.hist09-10_gk.xml be made at this time. (#1405) I have added a PR with a comment on this file.

npmccallum · 2023-10-24T20:16:59Z

@lcerrato Thanks for the summary!

I caught the issue with diod.hist09-10_gk.xml and wrote a script to convert it from beta code to unicode. This work does not update the file to best TEI practices. But it is at least forward progress. Would you like me to submit this?

This led me down a rabbit trail where I found a bunch of other beta code in other files. Is there a description somewhere of which files should be considered to meet quality standards?

Would you like me to split this into multiple PRs?

lcerrato · 2023-10-27T18:34:26Z

@npmccallum
Thank you, but the unconverted file is a partial work and would need to be merged with other data. So it is not a matter of conversion per se; the data was simply never ported from the old collection to GitHub.
Ideally, all of this work would be addressed within the usual conversion workflow.
If the file is going to cause issues, I can move it to another repo for preservation (it was moved here for summer employees who would not otherwise have access).

lcerrato · 2023-10-27T18:41:56Z

@npmccallum
For works that have been transitioned to the latest encoding, the change logs will say if the file has been updated to CTS and EpiDoc compliance and the file will be visible in the Scaife Viewer. As a rule of thumb, anything that is grc1 is likely not updated to best practices, but there will be files that were converted early on that could use refinement (sometimes these have things like beta code in footnotes that was missed or other irregularities).

Any file with a plain text equivalent in the package release is compliant.
https://github.com/PerseusDL/canonical-greekLit/releases

Additionally, the HookTest output (found under Actions) shows passing (= updated) files.
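
As a rough illustration of the grc1 rule of thumb mentioned above, a filename check might look like the sketch below. The function name `likely_unconverted` and the filename pattern are assumptions made for this example; the check only looks at the name, so it says nothing reliable about early conversions that still need refinement, and the change logs, release package, and HookTest output remain the authoritative sources.

```python
import re
from pathlib import PurePosixPath

# Assumed filename shape: <textgroup>.<work>.<source>-<lang><n>.xml,
# e.g. "tlg0057.tlg010.perseus-eng1.xml". Extra dots before "xml" are
# tolerated so that a name like "perseus-grc1..xml" still parses.
VERSION_RE = re.compile(r"\.(?P<version>[A-Za-z0-9]+-[a-z]{3}\d+)\.+xml$")


def likely_unconverted(path: str) -> bool:
    """Heuristic only: True when the version label ends in "grc1".

    Files whose names do not follow the assumed pattern return False,
    which means "unknown", not "compliant".
    """
    name = PurePosixPath(path).name
    match = VERSION_RE.search(name)
    return bool(match) and match.group("version").endswith("grc1")
```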

lcerrato · 2023-10-27T19:09:15Z

@npmccallum
I am going to move this file out of here as it is just going to cause confusion. Perhaps you can either close and resubmit minus anything related to diod.hist09-10_gk.xml, or I can close this out and simply work on the global entity fix in another pass.
(Our testing regime seems broken at the moment so no changes as yet.)

Happy to continue the conversation and answer any questions or discuss other contributions.

npmccallum · 2023-10-27T19:56:35Z

@lcerrato I have removed data/tlg0060/tlg001/diod.hist09-10_gk.xml from the commit. Please review.

lcerrato

Several of these files have too many changes to comprehensively review in this format.

I do see errant spaces around things like em dashes, which would typically be fixed at the time of conversion. I also see ligatures that do not require preservation. Middots all appear errant (but I could not check all works to confirm).

So a global change missed some of this refinement. It is good to have a "spot check" as it typically points to refinements we would not otherwise spot absent a careful read-through.
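
A spot check for the three kinds of residue noted above (spaces around em dashes, ligatures, middots) could be sketched like this. It is only an assumption about what is worth flagging: the helper name `spot_check` is hypothetical, the ligature set (ae and oe ligatures plus the U+FB00 to U+FB06 presentation forms) is a guess, and every hit still needs a human decision, since some of these characters may be intentional in a given work.

```python
import re

# Patterns are illustrative assumptions, not an agreed project policy.
CHECKS = {
    "space around em dash": re.compile(r"\s\u2014|\u2014\s"),
    "ligature": re.compile(r"[\u00e6\u00c6\u0153\u0152\ufb00-\ufb06]"),
    "middot": re.compile(r"\u00b7"),
}


def spot_check(text: str) -> list[tuple[int, str]]:
    """Return (line number, label) pairs for lines that need a look."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

Running this per file and listing only the flagged lines would make a large diff easier to review than reading every changed line.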

data/tlg0526/tlg003/tlg0526.tlg003.perseus-eng1.xml

data/tlg0526/tlg001/tlg0526.tlg001.perseus-eng1.xml

data/tlg0526/tlg002/tlg0526.tlg002.perseus-eng1.xml

lcerrato · 2023-10-30T18:36:47Z