HTML tags in titles #197

Adafede · 2022-10-26T11:47:12Z

Hi,

The Wikidata community has asked me several times to sanitize the titles I import through WikidataIntegrator (see https://www.wikidata.org/wiki/User_talk:AdrianoRutz). Before I try to fix this downstream, would there be a way to format them better directly here?

andrawaag · 2022-10-26T18:18:07Z

Are the HTML tags in the titles introduced by WDI, or do they come from a primary source? Do you have an example and a pointer to the source, so I can try to reproduce the issue? I am still on the fence between simply adding a strip function and leaving it to the source to fix.

Adafede · 2022-10-26T18:26:01Z

An example would be https://doi.org/10.1002/ejoc.201402609 (https://www.wikidata.org/wiki/Q114865259). It is in the source with the HTML tags: https://api.crossref.org/v1/works/10.1002/ejoc.201402609

The problem is that Unicode replacements exist for some cases, for example subscript digits in chemistry (₁₂₃), so the ideal solution would be to use them, as is done for molecular formulas (https://www.wikidata.org/wiki/Property:P274). But sub/sup tags are not limited to that case, and the current example also has <i> tags, which then lead to a missing space once they are stripped... a nightmare, I know.
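To make the sub/sup idea concrete, here is a minimal sketch using only the Python standard library. It is not WikidataIntegrator code: `TitleFlattener` and `flatten_title` are hypothetical names, and the mapping only covers digits and the plus/minus signs, so any other subscripted text (letters, for instance) is kept as plain characters.

```python
from html.parser import HTMLParser

# Characters that have dedicated Unicode subscript/superscript forms.
# Assumption: only digits and +/- are mapped; everything else is left as-is.
SUB = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")
SUP = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")


class TitleFlattener(HTMLParser):
    """Hypothetical helper: drops tags, converts <sub>/<sup> content to Unicode."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.stack = []  # currently open sub/sup tags

    def handle_starttag(self, tag, attrs):
        if tag in ("sub", "sup"):
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in ("sub", "sup") and self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            table = SUB if self.stack[-1] == "sub" else SUP
            data = data.translate(table)
        self.parts.append(data)


def flatten_title(title):
    parser = TitleFlattener()
    parser.feed(title)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(flatten_title("CeCl<sub>3</sub>·7H<sub>2</sub>O"))
```

Other tags such as <i> are simply dropped here, so the missing-space problem is not addressed by this sketch.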

I just thought that, given the number of WikidataIntegrator users, it was better to report this upstream than to do a

cleantext = BeautifulSoup("Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents", "lxml").text

on my side.
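For illustration, here is what plain tag stripping does to that title. This is a rough standard-library stand-in for the BeautifulSoup call, not the same implementation, and `strip_tags` is a hypothetical name. It shows both problems: the subscripts are flattened to ordinary digits, and the word before the <i> tag is glued to its content because the source has no space there.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(title):
    # Hypothetical helper, roughly comparable to taking the text of a parsed title.
    parser = _TextOnly()
    parser.feed(title)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


title = (
    "Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones "
    "Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents"
)
cleaned = strip_tags(title)
print(cleaned)
# The result contains "ofN-Acylhydrazones" and "CeCl3·7H2O".
```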

I understand (and probably share) your point of view, but we should then make it understandable to other wiki members...

andrawaag · 2023-01-04T21:18:50Z

Revisiting this after starting the discussion in the Telegram channel. Following your suggestion, I wonder whether changing https://github.com/SuLab/WikidataIntegrator/blob/main/wikidataintegrator/wdi_helpers/publication.py#L106 to

self.title = BeautifulSoup(title, "lxml").text

would not fix this issue. I am currently travelling and I want to give it a bit more attention, but will dive in upon returning to the office by the end of this week.

Adafede · 2023-01-05T07:01:13Z

As discussed, I was also trying to strip HTML tags from Crossref titles, as requested by WD. Here are some challenging tests:

https://github.com/lotusnprod/lotus-wikidata-interact/blob/cfffc1e7c8f4210f9dd7fd506b14110da1ba1c1c/wdkt/src/test/kotlin/wd/WDArticleTest.kt#L32-L43

For now, I could not get all of these tests to pass using simple JSoup cleaning.

andrawaag self-assigned this Jan 4, 2023
