HTML tags in titles #197

Open
Adafede opened this issue Oct 26, 2022 · 4 comments

@Adafede

Adafede commented Oct 26, 2022

Hi,

The Wikidata community has told me several times that I need to sanitize the titles I import through WikidataIntegrator (see https://www.wikidata.org/wiki/User_talk:AdrianoRutz). Before I try to fix this downstream, would it be possible to implement a solution directly here so the titles are formatted better?

@andrawaag
Collaborator

Are the HTML titles introduced by the wdi, or are they sourced from a primary source? Do you have an example and a pointer to the source to see if I can reproduce the issue? I am still on the fence about simply adding a strip function or leaving it to the source to fix.

@Adafede
Author

Adafede commented Oct 26, 2022

An example would be https://doi.org/10.1002/ejoc.201402609 (https://www.wikidata.org/wiki/Q114865259). The HTML tags are already present in the source: https://api.crossref.org/v1/works/10.1002/ejoc.201402609

The problem is that Unicode replacements exist for some cases, for example subscript digits in chemistry (₁₂₃), so the ideal solution would be to use them, as is done for molecular formulas (https://www.wikidata.org/wiki/Property:P274). However, sub/sup tags are not limited to such cases, and the current example also has <i> tags, which lead to a missing space once they are stripped... a nightmare, I know.

I just thought that, given the number of WikidataIntegrator users, it was better to report this upstream than to do a

cleantext = BeautifulSoup("Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents", "lxml").text

on my side.

I understand (and probably share) your point of view, but we should then make it understandable to other wiki members...
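To make the sub/sup idea above concrete, here is a minimal sketch using only the Python standard library. The helper name `clean_title` is hypothetical, and the regex approach is an assumption for illustration, not what WikidataIntegrator does. It converts `<sub>`/`<sup>` content to Unicode only when every character is a digit or sign, and strips all other tags.

```python
import re
from html import unescape

# Characters with a Unicode subscript/superscript equivalent (digits and signs only).
_CONVERTIBLE = "0123456789+-"
_SUB = str.maketrans(_CONVERTIBLE, "₀₁₂₃₄₅₆₇₈₉₊₋")
_SUP = str.maketrans(_CONVERTIBLE, "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")


def clean_title(title: str) -> str:
    """Hypothetical helper: Unicode-ify simple sub/sup, drop other tags."""

    def _convert(match: re.Match) -> str:
        tag, body = match.group(1).lower(), match.group(2)
        # Convert only when every character has a Unicode equivalent;
        # otherwise keep the plain text of the element.
        if all(ch in _CONVERTIBLE for ch in body):
            return body.translate(_SUB if tag == "sub" else _SUP)
        return body

    title = re.sub(r"<(sub|sup)>(.*?)</\1>", _convert, title, flags=re.IGNORECASE)
    # Drop remaining tags; requiring a letter after "<" leaves "a < b" alone.
    title = re.sub(r"</?[A-Za-z][^>]*>", "", title)
    title = unescape(title)
    # Collapse whitespace runs that tag removal may leave behind.
    return re.sub(r"\s+", " ", title).strip()


title = (
    "Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones "
    "Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents"
)
print(clean_title(title))
```

With the example title, the subscripts come out as CeCl₃·7H₂O, but the result still reads "ofN-Acylhydrazones": the space is missing in the source itself, so tag stripping alone cannot restore it.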

@andrawaag
Collaborator

andrawaag commented Jan 4, 2023

Revisiting this after starting the discussion in the Telegram channel. I wonder whether, following your suggestion, changing https://github.com/SuLab/WikidataIntegrator/blob/main/wikidataintegrator/wdi_helpers/publication.py#L106 to

self.title = BeautifulSoup(title, "lxml").text

would not fix this issue. I am currently travelling and I want to give it a bit more attention, but will dive in upon returning to the office by the end of this week.
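For reference, a rough sketch of what such an assignment would do, written with the standard library's `html.parser` instead of BeautifulSoup and lxml so that it runs without extra dependencies. `strip_tags` and `PublicationSketch` are hypothetical names for illustration; the real class in publication.py may look quite different.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only, dropping every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(title):
    """Return the title without markup; None is passed through unchanged."""
    if title is None:
        return None
    parser = _TextExtractor()
    parser.feed(title)
    parser.close()  # flush any buffered trailing text
    return parser.text()


class PublicationSketch:
    """Hypothetical stand-in, not the real WikidataIntegrator class."""

    def __init__(self, title=None):
        # Sanitize once, at assignment time, so downstream code sees clean text.
        self.title = strip_tags(title)


pub = PublicationSketch("Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O")
print(pub.title)
```

Like the BeautifulSoup one-liner, this simply concatenates the text nodes, so subscripts become plain digits and a space that is missing around a tag in the source stays missing.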

andrawaag self-assigned this Jan 4, 2023
@Adafede
Author

Adafede commented Jan 5, 2023

As discussed, I was also trying to strip HTML tags from Crossref titles, as requested by the Wikidata community. Here are some challenging tests:

https://github.com/lotusnprod/lotus-wikidata-interact/blob/cfffc1e7c8f4210f9dd7fd506b14110da1ba1c1c/wdkt/src/test/kotlin/wd/WDArticleTest.kt#L32-L43

For now, I have not been able to get all of these tests to pass using simple JSoup cleaning.
