Inconsistencies in Isomeric SMILES Data Retrieval Using PubChemPy Compared to Previous Data #87

Open
kyokim-mpu opened this issue Jun 3, 2024 · 0 comments

Comments

@kyokim-mpu

Dear Developer,

Hello.

We use PubChemPy to obtain isomeric SMILES for our daily research. It has been very helpful, and I greatly appreciate it.

To make sure the SMILES we obtain are accurate, I would like to ask about a discrepancy I noticed when comparing our results with those from the previous year.

■Events
When I obtained isomeric SMILES from compound names using PubChemPy, the results differed between FY2022 and FY2023. Specifically, 76 compounds whose SMILES could be retrieved in FY2022 could not be retrieved in FY2023.

When we searched and compared the compound names and SMILES obtained in FY2022 with the current data from PubChem, we observed the following two patterns:

  1. Search results appeared, but some SMILES were not retrieved.
     Example: POLIDOCANOL and CCCCCCCCCCCCOCCOCCOCCOCCOCCOCCOCCOCCOCCO
     Example: METHYLENEDIOXYMETHAMPHETAMINE and CC(CC1=CC2=C(C=C1)OCO2)NC
     Example: SITOSTEROL and CCC@HC(C)C

  2. Search results appeared, and SMILES were retrieved, but the returned structure was not the best match or was unrelated to the name.
     Example: CHROME ALUM and OS(=O)(=O)O.OS(=O)(=O)[O-].[K+].[Cr]
     Example: EGG YOLK and CCCCCCCCC/C=C/CCCCCCCC(=O)OCC(COP(=O)(O)OCC[N+](C)(C)C)OC(=O)CCCCCCC/C=C/CCCCCCCCC
     Example: BROMELAINS and CCCC(C)C1(C(=O)NC(=O)N=C1[O-])CC.[Na+]
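
A year-over-year comparison of this kind can be sketched as follows. This is only an illustration, not the script actually used: the function name `compare_snapshots`, the dictionary layout (compound name mapped to SMILES or `None`), and the `NAME_A`/`NAME_B`/`NAME_C` entries are all placeholders.

```python
from typing import Dict, List, Optional


def compare_snapshots(
    old: Dict[str, Optional[str]],
    new: Dict[str, Optional[str]],
) -> Dict[str, List[str]]:
    """Group the names in `old` by how their SMILES differ in `new`."""
    report: Dict[str, List[str]] = {"missing": [], "changed": [], "unchanged": []}
    for name, old_smiles in old.items():
        new_smiles = new.get(name)
        if not new_smiles:
            # Name absent from the new run, or present with no SMILES.
            report["missing"].append(name)
        elif new_smiles != old_smiles:
            # A SMILES came back, but the string is not the one seen before.
            report["changed"].append(name)
        else:
            report["unchanged"].append(name)
    return report


# Placeholder data, not real results from either year.
fy2022 = {"NAME_A": "CCO", "NAME_B": "CC(=O)O", "NAME_C": "c1ccccc1"}
fy2023 = {"NAME_A": "CCO", "NAME_B": None, "NAME_C": "C1CCCCC1"}

report = compare_snapshots(fy2022, fy2023)
print(report)
```

The comparison is a plain string match, so two different SMILES strings for the same structure would be reported as "changed"; canonicalizing both sides first would avoid that.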

■Question
I am asking because I cannot trace how the data changed between FY2022 and FY2023.

  1. Does searching by compound name on the PubChem website use a different retrieval process from PubChemPy's API, and can the results differ? If so, we believe the discrepancies above could be due to this difference.

  2. Could it simply be that the data was updated between 2022 and 2023, and thus, certain compounds could not be retrieved? If the acquisition process is the same, we believe it is simply due to data updates.
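
One way to look into question 1 might be to send the same name query straight to PubChem's PUG REST service, which PubChemPy wraps, and compare the response with what the website shows. The sketch below only builds the request URL and sends nothing. The helper name `name_to_smiles_url` is hypothetical, and the `IsomericSMILES` property name is assumed to still be accepted by the service.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_smiles_url(name: str) -> str:
    """Build a PUG REST URL asking for the isomeric SMILES of a compound name."""
    # Percent-encode the name so spaces and slashes cannot break the URL path.
    encoded = quote(name, safe="")
    return f"{PUG_REST_BASE}/compound/name/{encoded}/property/IsomericSMILES/JSON"


url = name_to_smiles_url("EGG YOLK")
print(url)
```

Opening the printed URL in a browser shows the raw records the service matches to the name, independent of PubChemPy.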

Are there any other possible causes or ways to confirm this?
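
As one possible way to confirm, each name could be logged with the kind of result it produces, so that "no record", "record without SMILES", and "several candidate records" are kept apart. The sketch below is a minimal illustration under assumptions: `classify` and its status labels are hypothetical, and `pubchempy_lookup` assumes PubChemPy's `get_compounds(name, "name")` call and the `cid` and `isomeric_smiles` attributes of `Compound`, which may differ between PubChemPy versions.

```python
from typing import Callable, List, Optional, Tuple

Hit = Tuple[int, Optional[str]]  # (CID, isomeric SMILES or None)


def pubchempy_lookup(name: str) -> List[Hit]:
    """Query PubChem by name through PubChemPy (needs network access)."""
    import pubchempy as pcp  # imported lazily so classify() works without it

    return [(c.cid, c.isomeric_smiles) for c in pcp.get_compounds(name, "name")]


def classify(name: str, lookup: Callable[[str], List[Hit]]) -> str:
    """Label the outcome of a name lookup with a hypothetical status string."""
    hits = lookup(name)
    if not hits:
        return "no_hit"  # the name resolved to no compound record
    if not any(smiles for _, smiles in hits):
        return "hit_without_smiles"  # records exist, but none carries a SMILES
    if len(hits) > 1:
        return "multiple_hits"  # the first record may not be the best match
    return "single_hit"
```

Calling `classify("POLIDOCANOL", pubchempy_lookup)` would run a live query; storing the returned CIDs next to the status on every run would make later year-over-year comparisons easier to interpret.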

I really appreciate any help you can provide.
