Code for Demo Files #38

Open
tarinidash opened this issue Sep 1, 2021 · 2 comments

@tarinidash

I have been working with this library to extract chemical information from HTML pages.
I followed http://chemdataextractor.org/demo and saved https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A as an HTML file (input3.html).

Below is my code.

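from chemdataextractor import Document

# Open the saved page in binary mode and parse it into a Document.
with open('input/input3.html', 'rb') as f:
    doc = Document.from_file(f)

# Serialize the extracted records to a list of plain dicts.
records = doc.records.serialize()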

The result does not match the records in the JSON output published for https://pubs.rsc.org/en/content/articlelanding/2015/TC/C5TC02626A.
A lot of information is missing, including smiles, fluorescence_lifetimes, etc.

@mcs07, I was wondering if you could publish the code that was used for the demo.

P.S. Is there a method that creates the entire JSON, including abbreviations, biblio, and records, or are they extracted separately and stitched together to create the final JSON output?

@mcs07
Owner

mcs07 commented Sep 1, 2021

Unfortunately it looks like the RSC "articlelanding" page is assembled dynamically using JavaScript, as you scroll down the page. So the HTML that you save may not include the full article, even though it appears to in the browser. It might work better if you click the "Article HTML" button on the right and save that page instead: https://pubs.rsc.org/en/content/articlehtml/2015/tc/c5tc02626a
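
As a rough sketch of that suggestion, the landing-page URL can be rewritten to the "Article HTML" form before saving the page, and the saved file can then be parsed the same way as in the original post. The helper names `to_article_html_url` and `extract_records` are hypothetical and not part of ChemDataExtractor; the URL pattern is assumed from the two links in this thread and may not hold for every RSC article.

```python
from urllib.parse import urlsplit, urlunsplit


def to_article_html_url(landing_url):
    """Rewrite an RSC 'articlelanding' URL to its 'articlehtml' counterpart.

    Hypothetical helper: the pattern is inferred from the two URLs quoted
    in this thread, not from any documented RSC scheme.
    """
    parts = urlsplit(landing_url)
    marker = "/articlelanding/"
    if marker not in parts.path:
        raise ValueError("not an RSC articlelanding URL: %s" % landing_url)
    prefix, tail = parts.path.split(marker, 1)
    # The "Article HTML" link quoted in this thread uses a lowercase
    # journal code and article ID, so the tail is lowercased to match it.
    path = prefix + "/articlehtml/" + tail.lower()
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def extract_records(html_path):
    """Parse a saved HTML file and return the serialized records."""
    # Imported here so the URL helper works without ChemDataExtractor installed.
    from chemdataextractor import Document

    with open(html_path, "rb") as f:
        doc = Document.from_file(f)
    return doc.records.serialize()
```

Calling `to_article_html_url` on the landing-page link from the original post gives the address to save from the browser; `extract_records` can then be pointed at the saved file. Whether the output matches the demo still depends on the article HTML not having changed.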

The demo results on the web site were generated quite a few years ago, so unfortunately the article HTML may also have changed since then. The full web site code is available at: https://github.com/mcs07/cdeweb

The web site code only does a couple of extra things to extend the output, all in the get_biblio and add_structures functions.
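
To illustrate the "extracted separately and stitched together" idea from the question, here is a minimal sketch of how such a result could be assembled. It is not the cdeweb code: `attach_smiles`, `build_result`, the `resolve_smiles` callable, and the top-level key names are all assumptions made for illustration, and the actual get_biblio and add_structures functions in the repository should be consulted for what the demo really does.

```python
import json


def attach_smiles(records, resolve_smiles):
    """Return copies of the records with a 'smiles' key where a name resolves.

    `resolve_smiles` is a hypothetical callable mapping a chemical name to a
    SMILES string (or None); any name-to-structure resolver could be used.
    """
    enriched = []
    for record in records:
        record = dict(record)  # shallow copy so the input list is left untouched
        for name in record.get("names", []):
            smiles = resolve_smiles(name)
            if smiles:
                record["smiles"] = smiles
                break  # the first name that resolves is enough
        enriched.append(record)
    return enriched


def build_result(records, abbreviations, biblio):
    """Combine the separately extracted parts into one JSON-ready dict.

    The key names are assumptions, chosen to mirror the terms in the question.
    """
    return {"biblio": biblio, "abbreviations": abbreviations, "records": records}
```

With ChemDataExtractor, `records` would typically come from `doc.records.serialize()`, and the combined dict can be written out with `json.dumps(build_result(...))`.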

@tarinidash
Author

tarinidash commented Sep 1, 2021 via email
