HTML tags in titles #197
Comments
Are the HTML tags in the titles introduced by WDI, or do they come from a primary source? Do you have an example and a pointer to the source, so I can try to reproduce the issue? I am still on the fence between simply adding a strip function and leaving it to the source to fix.
An example would be https://doi.org/10.1002/ejoc.201402609 (https://www.wikidata.org/wiki/Q114865259). The HTML tags are already in the source: https://api.crossref.org/v1/works/10.1002/ejoc.201402609 The problem is that Unicode replacements exist in some cases, for example in chemistry (₁₂₃), so the ideal solution would be to add them as is done for molecular formulas (https://www.wikidata.org/wiki/Property:P274). However, the sub/sup tags are not limited to formulas, and the current example also has
I just thought that, given the number of WikidataIntegrator users, it was better to report this upstream than to do a
on my side. I understand (and probably share) your point of view, but we should then make it understandable to other wiki members.
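To make the sub/sup point above concrete, here is a minimal sketch of one way to map numeric `<sub>`/`<sup>` content to Unicode characters before stripping whatever tags remain. It is not WikidataIntegrator code: the function name `clean_title` and the choice to map only digits and plus/minus signs are assumptions made for illustration.

```python
import re

# Translation tables for digits and +/- only (an assumption; letters are not mapped).
_SUB = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")
_SUP = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")

_SUB_RE = re.compile(r"<sub>([0-9+-]+)</sub>", re.IGNORECASE)
_SUP_RE = re.compile(r"<sup>([0-9+-]+)</sup>", re.IGNORECASE)
# A tag must start with a letter, so text such as "p < 0.05" is left alone.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def clean_title(title: str) -> str:
    """Hypothetical helper: Unicode sub/superscripts, then strip other tags."""
    title = _SUB_RE.sub(lambda m: m.group(1).translate(_SUB), title)
    title = _SUP_RE.sub(lambda m: m.group(1).translate(_SUP), title)
    title = _TAG_RE.sub("", title)
    # Collapse whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", title).strip()
```

Non-numeric sub/sup content (for example `<sub>n</sub>`) simply loses its tags in this sketch, which flattens the text rather than preserving the typography.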
Revisiting this after starting the discussion in the Telegram channel. I wonder whether, following your suggestion, changing https://github.com/SuLab/WikidataIntegrator/blob/main/wikidataintegrator/wdi_helpers/publication.py#L106 to
would not fix this issue. I am currently travelling and want to give it a bit more attention, but I will dive in once I am back in the office at the end of this week.
As discussed, I was also trying to strip HTML tags from Crossref titles, as requested by WD. Here are some challenging tests: So far, I could not successfully clean all of the tests using simple JSoup cleaning.
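For comparison with the JSoup attempt, a parser-based variant in Python might look like the sketch below. It uses only the standard library's `html.parser`; the class and function names are hypothetical, and it has not been run against the challenging tests mentioned above, so it should be read as a starting point rather than a solution.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(title: str) -> str:
    """Return the title with markup removed and whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(title)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Unlike a regular expression, this also decodes character references, but it drops sub/sup information entirely, so it trades typographic fidelity for robustness.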
Hi,
The WD community has told me several times that I need to sanitize the titles I am importing through WikidataIntegrator (see https://www.wikidata.org/wiki/User_talk:AdrianoRutz). Before I try to fix this downstream, would there be a solution that could be implemented here directly to format them better?