Disambiguating chemistry and fixing typos #76
Comments
== create sample disambiguation of chemistry ==
For each lookup go to the site and look up the name. Record the ID if found, else leave empty. If there are special comments record them. This may be automatable through Egon's tools. |
Chemical disambiguation
The tools to use are:
By using InChIs we have a correspondence between the systems. INPUTS OUTPUTS Typical example for |
Scheduling chemical work
Vinita should supervise the processing, which will be largely carried out by Ambarish and later Shruthi. It is particularly important to check correctness of results.
Method:
0/ There should be a single communal table (as described). There may need to be more columns than specified there.
3/ Search Wikidata with (a) CID, (b) InChI, (c) the original name if those fail. This should be done automatically. The correctness of the search will be shown by matching InChIs for numerous compounds.
We will report early results in the poster. |
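The Wikidata search order in step 3/ could be sketched as below. This is only an illustration, not the project's actual script: the function name `wikidata_queries` and the example CID are hypothetical, and it only builds the SPARQL strings in fallback order (it does not contact Wikidata). It relies on P662 being Wikidata's "PubChem CID" property and P234 being its "InChI" property.

```python
from typing import List, Optional, Tuple


def _quote(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def wikidata_queries(cid: Optional[str], inchi: Optional[str], name: str) -> List[Tuple[str, str]]:
    """Return (label, SPARQL) pairs in the order they should be tried."""
    queries = []
    if cid:
        # (a) P662 is the Wikidata property "PubChem CID".
        queries.append(("CID", f'SELECT ?item WHERE {{ ?item wdt:P662 "{_quote(cid)}" . }}'))
    if inchi:
        # (b) P234 is the Wikidata property "InChI".
        queries.append(("InChI", f'SELECT ?item WHERE {{ ?item wdt:P234 "{_quote(inchi)}" . }}'))
    # (c) Last resort: exact English label match on the original name.
    queries.append(("name", f'SELECT ?item WHERE {{ ?item rdfs:label "{_quote(name)}"@en . }}'))
    return queries


# Hypothetical input: a made-up CID, no InChI available.
for label, query in wikidata_queries("12345", None, "1,8-cineole"):
    print(label, query)
```

A caller would run the queries in order and stop at the first one that returns an item, then compare the InChI on that item with the InChI in the communal table as the correctness check.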
Sir, the Wikidata entry is still outstanding. |
Sir, the PubChem lookup returns isomers. They are present in the file as the output was generated (the order of the PubChem lookup entries is also the same as in the generated output). |
Files are meaningless unless they have documentation.
Please briefly record (on Github) how these files were created.
Also I will probably move these files in the directory structure
…On Mon, Jul 15, 2019 at 12:27 PM Ambarish Kumar ***@***.***> wrote:
Sir,
Please go through the files.
100cnamePubchemAndOPSIN.csv
https://github.com/gilienv/EssOilDB/blob/master/100cnamePubchemAndOPSIN.csv
100cnamePubchem.csv
https://github.com/gilienv/EssOilDB/blob/master/100cnamePubchem.csv
100cnameOPSIN.csv
[https://github.com/gilienv/EssOilDB/blob/master/100cnameOPSIN.csv]
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
I had better luck fixing chemical names with this: Not so much luck with this: |
This one is pretty good too: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:10447 |
Thank you Manny! Ambarish - |
As PMR had first pointed out - we need to document the KINDS of errors we have in the Chemistry. At present, the most comprehensive assessment of types of errors has been conducted by Manny, and we have had a few meetings to discuss various issues. More on my dropbox, but happy to add here if Ambarish initiates a list of Error types, along with V.1 entries for each kind |
We are clearly going to have to do manual correction of chemical names.
Common problems include:
* misspelling
* spaces included "alpha - pinene"
* spaces omitted "ethylacetate"
* hyphens omitted/included
* quotes (strange, unbalanced...)
* multiple locants
* missing locants
To be correct we should have at least 2 columns (raw data, curated data)
…On Wed, Jul 17, 2019 at 8:59 AM Gitanjali Yadav ***@***.***> wrote:
As PMR had first pointed out - we need to document the KINDS of errors we
have in the Chemistry.
At present, the most comprehensive assessment of types of errors has been
conducted by Manny, and we have had a few meetings to discuss various
issues.
More on my dropbox, but happy to add here if Ambarish initiates a list of
Error types, along with V.1 entries for each kind
|
We generated 1000 records of compounds using OPSIN and PubChem. To get the WIKIDATA lookup column, we will have to restart the run. We are preparing a run to get the WIKIDATA and WIKIPEDIA lookups. |
Ambarish has made good progress on disambiguation - see @ambarishK and I had a good discussion today. The result in OPSINPubChem is:
An OPSIN-parsable name is ACTION we will need (at least) three columns
The benefit is that @mannyrules and other volunteers (@petermr ) can edit this on a day-by-day basis without affecting the rest of the submission. Both |
Created a new table |
I have been and will continue to relatively quickly replace errors in punctuation as well as “foreign” characters (e.g., Ã, ã), etc. I have also created a little table for myself where I am storing other, stranger anomalies such as things that look like spaces, but are actually some indescribable character.
Each time I find one, I save it so I can go through all of them “one last time” after the last person has touched the data.
I don’t know the cause of this strange data. It could be that we are each using different keyboard language settings, operating systems, or different dictionaries as default in our spreadsheet programs.
No matter though. I’m confident I can clean that stuff up.
My biggest limitation is not knowing what’s actually correct or incorrect. But on the other hand, my layman’s eyes see things others may miss, so together we’ll ferret out the weirdness. |
A very quick eyeball of InChIs: of the 1000 names, approximately 700 were translated by PubChem and 400 by OPSIN (though there is still a punctuation problem and this number should increase). |
Brilliant,
The main thing is to record everything and try to systematize the errors.
For example:
EXTRA_SPACE
MISSING_SPACE
INVISIBLE_CHAR
Then we can analyze what is most frequent.
I agree with you that there may be an invisible character problem. This
might come from non-Unicode characters that cannot be rendered. Believe me,
I know most of the "tricks".
We should only use ASCII characters (32-126). No clever spaces
(non-breaking space, zero-width space, etc.). No Greek characters (=> beta,
etc.). No em-dashes (only hyphen-minus), no umlauts and other diacritics.
Quoting is a real problem and in general No Quotes or apostrophes.
I don't think we can "correct" any of this algorithmically and if we do I
suggest that I do it.
P.
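A minimal sketch of how the ASCII (32-126) rule above could be checked. It only reports offending characters so they can be logged (for example as INVISIBLE_CHAR); it deliberately does not try to correct them. The function name `non_ascii_report` is hypothetical.

```python
import unicodedata
from typing import List, Tuple


def non_ascii_report(name: str) -> List[Tuple[int, str, str]]:
    """List (position, code point, Unicode name) for characters outside ASCII 32-126."""
    report = []
    for pos, ch in enumerate(name):
        if not 32 <= ord(ch) <= 126:
            # unicodedata.name() raises for unnamed characters unless a default is given.
            report.append((pos, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return report


# A name containing a non-breaking space that looks like an ordinary space.
print(non_ascii_report("alpha\u00a0pinene"))
```

Printing the code point and Unicode name makes "things that look like spaces" identifiable, so the same anomaly can be recognised when it turns up again.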
…On Thu, Jul 18, 2019 at 4:46 PM Manny ***@***.***> wrote:
I have been and will continue to relatively quickly replace errors in
punctuation as well as “foreign” characters (eg, Ã, ã) etc.. I have also
created a little table for myself where I am storing other, stranger
anomalies such as things that *look* like spaces, but are actually some
indescribable character.
Each time I find one, I save it so I can go through all of them “one last
time” after the last person has touched the data.
I don’t know the cause of this strange data. It could be that we are each
using different keyboard language settings, operating systems, or different
dictionaries as default in our spreadsheet programs.
No matter though. I’m confident I can clean that stuff up.
My biggest limitation is not knowing what’s actually correct or incorrect.
But on the other hand, my layman’s eyes see things others may miss, so
together we’ll ferret out the weirdness.
Sent with GitHawk <http://githawk.com>
|
I will start adding records after the meeting today. I will also draft a list of all the kinds of name inconsistency, with examples. |
Sir, I prepared a fresh sheet for name cleaning. It contains exactly 7162 unique compound records. The sheet we discussed today is unchanged: https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/EssOilDBOPSINPubChemInChIs_A.csv It has 7169 unique compound entries. It is better to continue with the sheet discussed today. I looked into the difference of 7 records; it may be due to 7 repeated compound names. Documentation for generating the sheet is at |
On Fri, Jul 19, 2019 at 10:04 AM Ambarish Kumar ***@***.***> wrote:
Table for name correction
<https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChemInChIs_A.csv>
We cannot create additional tables until we have agreed the identifiers.
I will start adding records after meeting today. Also, I will draft all
… possibilities of name inconsistencies with example.
|
Sir, please check the sheet EssOilDBOPSINPubChemInChIsANewFinal.csv. It contains exactly the same identifiers as the first (finalised) sheet, EssOilDBOPSINPubChemInChIs_A.csv. I am removing the sheets EssOilDBOPSINPubChemInChIsANew.csv and EssOilDBOPSINPubChemInChIsANew.tsv |
EssOilDB entry is "bergamotol acetate" but PubChem search shows - Trans-.alpha.-Bergamatol Acetate OR (Z)-.Alpha.-Bergamotol Acetate OR Cis-alpha-Bergamotol Acetate.
e.g - borneole
e.g -
e.g
e.g EssOilDBEntry -
e.g - EssOilDBEntry is To be correct we should have at least 2 columns (raw data, curated data) |
Thanks,
Yes,
This is a difficult area and we are going to have to treat it carefully and
systematically.
It is essential to preserve the original spelling regardless of whether it
is "wrong" or "right". So we must have a column for raw name.
There are names which are very similar but represent different compounds. If
we "correct" these we will corrupt the database. Thus:
decanol
decanal
decenol
decenal
are all valid names and are all distinct.
(if the original abstracter made a copying error it may be difficult to
detect)
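To illustrate why automatic "correction" is dangerous here, the sketch below checks which of the four names above differ from each other by a single letter. The helper `differ_by_one_letter` is hypothetical and is only a crude stand-in for a spelling-similarity test.

```python
from itertools import combinations


def differ_by_one_letter(a: str, b: str) -> bool:
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


names = ["decanol", "decanal", "decenol", "decenal"]
close_pairs = [(a, b) for a, b in combinations(names, 2) if differ_by_one_letter(a, b)]

for a, b in close_pairs:
    print(f"{a} / {b}: one letter apart, both valid, different compounds")
```

Any spell-checker that merges names one letter apart would collapse distinct compounds, so a similarity hit should at most flag a name for manual review while the raw name column stays untouched.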
On Mon, Jul 22, 2019 at 8:56 AM Ambarish Kumar ***@***.***> wrote:
We are clearly going to have to do manual correction of chemical names.
Common problems include:
- misspelling
e.g - 1,8-cineol
This is not a misspelling, it's a synonym.
See https://pubchem.ncbi.nlm.nih.gov/compound/Eucalyptol
which lists
2.4 Synonyms
2.4.1 MeSH Entry Terms
1,8 Cineol
1,8 Cineole
1,8 Epoxy p menthane
1,8-cineol
1,8-cineole
1,8-Epoxy-p-menthane
cineole
eucalyptol
Soledum
- spaces included "alpha - pinene"
e.g - 1,2,3,4-Tetrahydro-1,5,7-trimethyl naphthalene
Yes!
- spaces omitted "ethylacetate"
e.g - (e)-sesquilavandulylacetate
Yes
- hyphens omitted/included
e.g - 1,8 cineole
Yes
- quotes (strange, unbalanced...)
e.g - (2,4)-nonadienal
Yes
- multiple locants
- missing locants
We should create short unique codes for this:
examples
SYNONYM
ADDED_SPACE
MISSING_SPACE
MISSING_HYPHEN
ADDED_HYPHEN
QUOTE_ERROR
MULTIPLE_LOCANT
MISSING_LOCANT
By using codes like this (always uppercase) we can normalize the reporting
of errors.
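One possible way to hold these codes alongside the raw and curated names, so that the most frequent error types can be counted. This is a sketch only: the class names `NameIssue` and `NameRecord` are hypothetical, and the curated values shown are illustrative guesses, not agreed corrections.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class NameIssue(Enum):
    """Uppercase codes as proposed in the thread."""
    SYNONYM = "SYNONYM"
    ADDED_SPACE = "ADDED_SPACE"
    MISSING_SPACE = "MISSING_SPACE"
    MISSING_HYPHEN = "MISSING_HYPHEN"
    ADDED_HYPHEN = "ADDED_HYPHEN"
    QUOTE_ERROR = "QUOTE_ERROR"
    MULTIPLE_LOCANT = "MULTIPLE_LOCANT"
    MISSING_LOCANT = "MISSING_LOCANT"


@dataclass
class NameRecord:
    raw_name: str        # never edited
    curated_name: str    # manually corrected form
    issues: List[NameIssue] = field(default_factory=list)


# Raw names are examples from the thread; curated names are illustrative only.
records = [
    NameRecord("1,8-cineol", "1,8-cineol", [NameIssue.SYNONYM]),
    NameRecord("alpha - pinene", "alpha-pinene", [NameIssue.ADDED_SPACE]),
    NameRecord("ethylacetate", "ethyl acetate", [NameIssue.MISSING_SPACE]),
    NameRecord("1,8 cineole", "1,8-cineole", [NameIssue.MISSING_HYPHEN]),
]

tally = Counter(issue.name for rec in records for issue in rec.issues)
print(tally.most_common())
```

Using an enumeration rather than free text means a mistyped code fails immediately instead of silently creating a new category.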
…
To be correct we should have at least 2 columns (raw data, curated data)
|
Please go through the name cleaning sheet I updated: Copy of EssOilDBOPSINPubChemInChIs_A.csv. There is an additional column for the IUPAC name. I have added the first 50 records to it. I added a short description of today's file. |
Sir, compound identifiers now take the form C1, C2, C3, ..., corresponding to the previous identifiers 1C, 2C, 3C, ... respectively. |
I will revisit this after I have created the poster. We have to start
again and document exactly what we start with and what operations we carry
out. The confusing thing was the IUPAC names which were not in the original
V1.0 (as far as I know). In fact there is only one compound name and
possibly a CAS number.
But I have to talk with Gita first.
…On Tue, Jul 23, 2019 at 1:08 PM Ambarish Kumar ***@***.***> wrote:
Sir,
Please go through the documentation page
<https://github.com/gilienv/EssOilDB/blob/master/chemistry/Disambiguating_chemistry_and_fixing_typos.md>.
I added a short description of today file.
Compound_identifiers are now as C1,C2,C3 ......which corresponds to 1C,
2C, 3C ...... respectively.
Updated sheet with compound_identifier
<https://github.com/gilienv/EssOilDB/blob/master/chemistry/EssOilDBOPSINPubChemInChIs_A.csv>
.
|
Sir, |
Thank you.
I will look
…On Wed, Jul 24, 2019 at 1:11 PM Ambarish Kumar ***@***.***> wrote:
Sir,
I have listed WIKIDATA 'Q' ID for all compounds onto the poster. Please go
through the page
<https://github.com/gilienv/EssOilDB/blob/master/EssOilDBPosterWIKIDATA-QID.md>
|
Dear Sir, I found a picture in my mobile camera roll. It was taken at harvest time (this March, when I visited my home). The picture has Lantana camara shrubs spread along the bottom. If convenient, it can be included in the poster. |
Ambarish,
Please can you create an HTML table or CSV that we can browse and see the
images. In your directory we would have something like
<tr>
<td>170</td>
<td><img src="./3D_images/170.png"/></td>
etc.
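A minimal sketch of generating such a static table from a list of compound IDs. The function name `image_table` is hypothetical, and it assumes one PNG per ID, named `<id>.png`, in a directory next to the HTML file.

```python
from html import escape
from typing import Iterable


def image_table(ids: Iterable[str], image_dir: str = "3D_images") -> str:
    """Build a static HTML table with one row per ID and a relative image link."""
    rows = []
    for cid in ids:
        safe = escape(str(cid), quote=True)
        rows.append(
            "  <tr>\n"
            f"    <td>{safe}</td>\n"
            f'    <td><img src="./{image_dir}/{safe}.png"/></td>\n'
            "  </tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>\n"


print(image_table(["170", "171"]))
```

Because the image paths are relative, the file displays correctly wherever the directory tree is checked out, with no machine-specific paths.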
…On Tue, Aug 27, 2019 at 2:00 PM Ambarish Kumar ***@***.***> wrote:
Sir,
Please go through the added chemical structure diagrams. 2D images
<https://github.com/gilienv/EssOilDB/tree/master/tables/chemistry/2D_images>
and 3D images
<https://github.com/gilienv/EssOilDB/tree/master/tables/chemistry/3D_images>
.
Count of 2D chemical structure diagrams is 2114.
Count of 3D chemical structure diagram is 2004.
|
Yes sir. |
Thanks
It doesn't have to be the same document as the spreadsheet - a static
HTML will do.
…On Tue, Aug 27, 2019 at 8:15 PM Ambarish Kumar ***@***.***> wrote:
Yes sir.
|
I have manually edited the TSV file into HTML to include images and links to Wikidata. It's static. resolveCompTable20190828.html |
I am VERY pleased with the resolution of compounds in the resolveCompTable20190828.html table. Well done. |
Sir, I have also prepared a table and added it to the repository, compoundStructureDiagram.html, but I could not work out how to embed images from the repository folder. The current file uses image paths on my laptop, i.e. "E:/2D_images/*.png".
|
Sir, the table header for WIKIDATA has been placed in the last column of the table. It should have been before |
On Wed, Aug 28, 2019 at 11:39 AM Ambarish Kumar ***@***.***> wrote:
Sir, even I prepared one table and added to repository now -
compoundStructureDiagram.html
<https://raw.githubusercontent.com/gilienv/EssOilDB/master/tables/chemistry/compoundStructureDiagram.html>
but could not get an idea to embed images from repository folder. Current
file has image path of my laptop i.e "E:/2D_images/*.png".
If the images are in the same directory tree as the HTML you just need
relative links. So if compoundStructureDiagram.html
<https://raw.githubusercontent.com/gilienv/EssOilDB/master/tables/chemistry/compoundStructureDiagram.html>
is in /tables/chemistry/ and the images are in tables/chemistry/2D_images
then the link from compoundStructureDiagram.html
<https://raw.githubusercontent.com/gilienv/EssOilDB/master/tables/chemistry/compoundStructureDiagram.html>
is simply:
<img src="2D_images/17.png"/>
see my files.
Or is there a separate problem?
|
Thanks,
That's a typical error in manual editing.
|
Sir, synonym table for compounds as well as plants are below. column description
column description
Plant synonyms are from details.csv table. |
Sir, all compound synonyms are from PubChem and plant synonyms are from details.csv. Plant synonyms were extracted using the R package taxize from the database "col". |
The plant table looks good,
BUT the column "pid" refers to the plant ID in EssoilDB and that should
have a letter.
We should have three columns,
```
id pid name
EPS1 EP99 Foo baricus
EPS2 EP99 Foo xyzzy
EPS3 EP23 Bar fooicus
```
which says that
Foo baricus has a synonym ID of EPS1 and is a synonym of EP99
Foo xyzzy has a synonym ID of EPS2 and is a synonym of EP99
EP99 in the plant table represents the plant and has an approved name (say
Foo plughiensis)
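The three-column layout could be produced along these lines. This is a sketch assuming the synonyms are already grouped by plant ID; the function name `build_synonym_rows` is hypothetical and the names are the placeholder ones from the example above.

```python
from typing import Dict, List, Tuple


def build_synonym_rows(synonyms_by_plant: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Assign sequential EPS ids, giving (synonym id, plant id, synonym name) rows."""
    rows: List[Tuple[str, str, str]] = []
    for pid, names in synonyms_by_plant.items():
        for name in names:
            rows.append((f"EPS{len(rows) + 1}", pid, name))
    return rows


rows = build_synonym_rows({"EP99": ["Foo baricus", "Foo xyzzy"], "EP23": ["Bar fooicus"]})
for row in rows:
    print("\t".join(row))
```

Each synonym gets its own ID, and the plant ID column carries the letter prefix so it cannot be confused with IDs from other tables.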
…On Fri, Sep 6, 2019 at 11:01 AM Ambarish Kumar ***@***.***> wrote:
Sir, synonym table for compounds as well as plants are below.
compoundSynonymTable
<https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/compoundSynonymTable.tsv>
column description
- EID - EssoilDB unique ID for compound name.
- SYNONYMS - compound synonym.
plantSynonymTable
<https://github.com/gilienv/EssOilDB/blob/master/tables/plant/plantSynonymTable.csv>
column description
- pid - EssoilDB unique ID for plant name.
- synonyms - plant synonym.
|
OK sir. |
PubChem is too verbose.
We must try Wikidata or ChEBI.
These synonyms are going to be used for searching.
…On Fri, Sep 6, 2019 at 12:23 PM Ambarish Kumar ***@***.***> wrote:
Sir, all compound synonyms are from PubChem and plant synonyms are from
details.csv
<https://github.com/gilienv/EssOilDB/blob/master/tables/plant/details.csv>.
Plant synonyms are extracted using R-package-taxize from database - "col".
|
Please create a flow diagram (with numbers) for where the compounds and the
synonyms came from.
…On Fri, Sep 6, 2019 at 12:24 PM Ambarish Kumar ***@***.***> wrote:
OK sir.
|
sir, what is pid (property id) for compound synonyms? |
On Fri, Sep 6, 2019 at 12:27 PM Ambarish Kumar ***@***.***> wrote:
sir, what is pid (property id) for compound synonyms?
Do you mean in our synonym table?
If a Compound ID has the form C123
I suggest we have CS789 for Compound synonym IDs
Do we have a syntax for Plant IDs? It can't be Pddd as that clashes with
Wikidata
Did we decide on something like EPddd?
If so
then EPS9876 for synonym IDs
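A small sketch of checking identifiers against the prefixes suggested here. The EP and EPS forms are only proposals at this point in the thread, so the patterns are assumptions, and the names `ID_PATTERNS` and `id_kind` are hypothetical.

```python
import re
from typing import Optional

# Prefix followed by one or more digits; patterns assumed from the suggestions above.
ID_PATTERNS = {
    "compound": re.compile(r"C\d+"),
    "compound_synonym": re.compile(r"CS\d+"),
    "plant": re.compile(r"EP\d+"),
    "plant_synonym": re.compile(r"EPS\d+"),
}


def id_kind(identifier: str) -> Optional[str]:
    """Return which kind of ID this is, or None if it matches no agreed syntax."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(identifier):
            return kind
    return None


for example in ["C123", "CS789", "EPS9876", "P31"]:
    print(example, id_kind(example))
```

A bare "P" followed by digits is rejected on purpose, since that form is used for Wikidata property IDs.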
|
Yes sir, I assign synonym IDs to plant synonyms as well as compound synonyms. For example -
Similarly, I will have to assign plant IDs as per the mentioned syntax.
Previously I meant to ask how to get compound synonyms from Wikidata. In the SPARQL query I will have to pass on |
Sir, please go through synonym tables with added synonym IDs. Column description of plant synonym table is as follows.
For example.
Column description of compound synonym table is as follows.
|
Thanks,
Something seems to have gone wrong here: A group of names are mapped to
both EP1186 and EP1188
EPS12565 EP1186 Ocimum sanctum
EPS12566 EP1186 Ocimum sanctum angustifolium
EPS12567 EP1186 Ocimum sanctum cubensis
EPS12568 EP1186 Ocimum sanctum hirsutum
EPS12569 EP1186 Ocimum scutellarioides
EPS12570 EP1186 Ocimum subserratum
EPS12571 EP1186 Ocimum tenuiflorum anisodorum
EPS12572 EP1186 Ocimum tenuiflorum villicaulis
EPS12573 EP1186 Ocimum tomentosum
EPS12574 EP1186 Ocimum villosum
EPS12575 EP1186 Plectranthus monachorum
EPS12576 EP1187 Lumnitzera carnosa
EPS12577 EP1187 Ocimum atrovirens
EPS12578 EP1187 Ocimum ebracteatum
EPS12579 EP1187 Ocimum graveolens
EPS12580 EP1187 Ocimum montevideanum
EPS12581 EP1187 Ocimum selloi
EPS12582 EP1187 Ocimum selloi angustifolium
EPS12583 EP1187 Ocimum selloi carnosum
EPS12584 EP1187 Ocimum selloi genuinum
EPS12585 EP1187 Ocimum selloi serratum
EPS12586 EP1187 Ocimum selloi subintegrifolium
EPS12587 EP1187 Ocimum selloi tweedieanum
EPS12588 EP1187 Ocimum tweedieanum
****
EPS12589 EP1188 Geniosporum tenuiflorum
EPS12590 EP1188 Lumnitzera tenuiflora
EPS12591 EP1188 Moschosma tenuiflorum
EPS12592 EP1188 Ocimum anisodorum
EPS12593 EP1188 Ocimum caryophyllinum
EPS12594 EP1188 Ocimum hirsutum
EPS12595 EP1188 Ocimum inodorum
EPS12596 EP1188 Ocimum monachorum
EPS12597 EP1188 Ocimum sanctum
EPS12598 EP1188 Ocimum sanctum angustifolium
EPS12599 EP1188 Ocimum sanctum cubensis
EPS12600 EP1188 Ocimum sanctum hirsutum
EPS12601 EP1188 Ocimum scutellarioides
EPS12602 EP1188 Ocimum subserratum
EPS12603 EP1188 Ocimum tenuiflorum anisodorum
EPS12604 EP1188 Ocimum tenuiflorum villicaulis
EPS12605 EP1188 Ocimum tomentosum
EPS12606 EP1188 Ocimum villosum
***
In the details.csv file we have:
1186,Origanum adanense,Origanum adanense K.H.C.Baéªer &
H.Duman,3895399,Q15349571,Origanum adanense,NOT A SYNONYM,"Geniosporum
tenuiflorum,Lumnitzera tenuiflora,Moschosma tenuiflorum,Ocimum
anisodorum,Ocimum caryophyllinum,Ocimum hirsutum,Ocimum inodorum,Ocimum
monachorum,Ocimum sanctum,Ocimum sanctum angustifolium,Ocimum sanctum
cubensis,Ocimum sanctum hirsutum,Ocimum scutellarioides,Ocimum
subserratum,Ocimum tenuiflorum anisodorum,Ocimum tenuiflorum
villicaulis,Ocimum tomentosum,Ocimum villosum,Plectranthus monachorum",NA,
1187,Origanum bargyli,Origanum bargyli Mouterde,3895268,Q12242088,Origanum
bargyli,NOT A SYNONYM,"Lumnitzera carnosa,Ocimum atrovirens,Ocimum
ebracteatum,Ocimum graveolens,Ocimum montevideanum,Ocimum selloi,Ocimum
selloi angustifolium,Ocimum selloi carnosum,Ocimum selloi genuinum,Ocimum
selloi serratum,Ocimum selloi subintegrifolium,Ocimum selloi
tweedieanum,Ocimum tweedieanum",NA,
1188,Ocimum basilicum,Origanum Tourn. ex L.,2926611,Q38859,Origanum
basilicum,NOT A SYNONYM,"Geniosporum tenuiflorum,Lumnitzera
tenuiflora,Moschosma tenuiflorum,Ocimum anisodorum,Ocimum
caryophyllinum,Ocimum hirsutum,Ocimum inodorum,Ocimum monachorum,Ocimum
sanctum,Ocimum sanctum angustifolium,Ocimum sanctum cubensis,Ocimum sanctum
hirsutum,Ocimum scutellarioides,Ocimum subserratum,Ocimum tenuiflorum
anisodorum,Ocimum tenuiflorum villicaulis,Ocimum tomentosum,Ocimum
villosum,Plectranthus monachorum,Geniosporum tenuiflorum,Lumnitzera
tenuiflora,Moschosma tenuiflorum,Ocimum anisodorum,Ocimum
caryophyllinum,Ocimum hirsutum,Ocimum inodorum,Ocimum monachorum,Ocimum
sanctum,Ocimum sanctum angustifolium,Ocimum sanctum cubensis,Ocimum sanctum
hirsutum,Ocimum scutellarioides,Ocimum subserratum,Ocimum tenuiflorum
anisodorum,Ocimum tenuiflorum villicaulis,Ocimum tomentosum,Ocimum
villosum,Plectranthus monachorum",sweet basil,Entry not found
Note the NOT A SYNONYM so I would exclude all these.
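The kind of check that exposes this problem can be sketched as follows: group the synonym rows by name and report any name attached to more than one plant ID. The function name is hypothetical and the sample rows are a few lines copied from the table above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def names_with_multiple_plants(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Map each synonym name to its plant IDs, keeping only names with more than one."""
    plants_by_name: Dict[str, Set[str]] = defaultdict(set)
    for _synonym_id, plant_id, name in rows:
        plants_by_name[name].add(plant_id)
    return {name: sorted(pids) for name, pids in plants_by_name.items() if len(pids) > 1}


rows = [
    ("EPS12565", "EP1186", "Ocimum sanctum"),
    ("EPS12574", "EP1186", "Ocimum villosum"),
    ("EPS12576", "EP1187", "Lumnitzera carnosa"),
    ("EPS12597", "EP1188", "Ocimum sanctum"),
    ("EPS12606", "EP1188", "Ocimum villosum"),
]
conflicts = names_with_multiple_plants(rows)
for name, pids in conflicts.items():
    print(name, pids)
```

A hit is not automatically an error, but every reported name should be checked by hand against details.csv before the table is used for searching.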
…On Sun, Sep 8, 2019 at 4:16 AM Ambarish Kumar ***@***.***> wrote:
Sir, please go through synonym tables with added synonym IDs.
plant synonym table
<https://github.com/gilienv/EssOilDB/blob/master/tables/plant/plantSynonymTableWithNewID.tsv>
Column description of plant synonym table is as follows.
- EPSID - synonym ID assigned to plant synonym name.
- EPID - Unique ID assigned to plant name (EssoilDB ID).
- synonyms - plant synonym name.
For example.
EPSID EPID synonyms
EPS1 EP1 Abies alba apennina
EPS2 EP1 Abies alba pardei
EPS3 EP1 Abies alba podolica
EPS4 EP1 Abies argentea
EPS26 EP2 Abies apollinis
EPS27 EP2 Abies borisii-regis pungenti-pilosa
EPS28 EP2 Abies cilicica borisii-regis
EPS29 EP3 Abies alba cephalonica
EPS30 EP3 Abies apollinis
EPS31 EP3 Abies apollinis
EPS32 EP3 Abies apollinis panachaica
[compound synonym table]
https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/compoundSynonymTableWithNewID.tsv
Column description of compound synonym table is as follows.
- CSID - compound synonym ID.
- EID - unique ID assigned to compounds (EssoilDB ID).
- SYNONYM - compound synonym name.
CSID EID SYNONYM
CS1 C214 acetate
CS2 C214 Acetate Ion
CS3 C214 Acetic acid, ion(1-)
CS4 C214 Acetate ions
CS5 C214 71-50-1
CS431 C215 acetic aicd
CS432 C215 acetic-acid
CS433 C215 Glacial acetate
CS1160 C2776 ethanal
CS1161 C2776 acetic aldehyde
CS1162 C2776 ethyl aldehyde
CS1163 C2776 75-07-0
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#76?email_source=notifications&email_token=AAFTCS7ZUCQB4L3DOTU3QM3QIRVAFA5CNFSM4ICLYMF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6FGZ4Q#issuecomment-529165554>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAFTCS34JJRV6N5253JGC6DQIRVAFANCNFSM4ICLYMFQ>
.
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
On Sun, Sep 8, 2019 at 4:16 AM Ambarish Kumar ***@***.***> wrote:
[compound synonym table]
https://github.com/gilienv/EssOilDB/blob/master/tables/chemistry/compoundSynonymTableWithNewID.tsv
Column description of compound synonym table is as follows.
- CSID - compound synonym ID.
- EID - unique ID assigned to compounds (EssoilDB ID).
- SYNONYM - compound synonym name.
CSID EID SYNONYM
CS1 C214 acetate
CS2 C214 Acetate Ion
CS3 C214 Acetic acid, ion(1-)
CS4 C214 Acetate ions
CS5 C214 71-50-1
CS431 C215 acetic aicd
CS432 C215 acetic-acid
CS433 C215 Glacial acetate
CS1160 C2776 ethanal
CS1161 C2776 acetic aldehyde
CS1162 C2776 ethyl aldehyde
CS1163 C2776 75-07-0
This is looking better. We don't want more synonyms than this if we can
help it.
|
Yes sir. I am going through details.csv.
I have extracted plant synonym names. Column description of the table is as follows.
All synonyms have been extracted using R. R code snippet:
|
Suggest we discuss plants and compounds in different threads.
On Mon, Sep 9, 2019 at 8:38 AM Ambarish Kumar ***@***.***> wrote:
Yes sir.
Should we go for applying regex over compound name synonyms?
Probably not. Regexes don't work well on chemical names.
I suspect the best approach to chemicals is simply to collect the names
used in phytochemical articles. We already have these in E1.0. So collect
the synonyms in that (about 300/2100 compound names I think). Maybe collect
no more than 5 for a compound anyway.
P.
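Limiting each compound to a handful of synonyms could look like the sketch below. The limit of 5 comes from the suggestion above; the function name `cap_synonyms` is hypothetical, and keeping the first five in input order is an assumption (in practice the order should reflect which names actually occur in the phytochemical literature).

```python
from typing import Dict, Iterable, List, Tuple

MAX_SYNONYMS = 5


def cap_synonyms(rows: Iterable[Tuple[str, str]], limit: int = MAX_SYNONYMS) -> Dict[str, List[str]]:
    """Keep at most `limit` distinct synonyms per compound ID, in input order."""
    kept: Dict[str, List[str]] = {}
    for compound_id, synonym in rows:
        names = kept.setdefault(compound_id, [])
        if synonym not in names and len(names) < limit:
            names.append(synonym)
    return kept


# Made-up rows: eight placeholder synonyms for one compound, a repeated one for another.
rows = [("C214", f"name{i}") for i in range(8)] + [("C215", "acetic-acid"), ("C215", "acetic-acid")]
capped = cap_synonyms(rows)
print(capped)
```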
|
Yes sir. |
Many of the synonyms in PubChem are arbitrary and unlikely to occur in
phytochemistry. We only need a small list.
Can we talk on Hangout?
…On Mon, Sep 9, 2019 at 8:52 AM Ambarish Kumar ***@***.***> wrote:
Yes sir.
|
Yes sir. Sir, There is no
There is |
Sir, please check the compound synonym names picked from EssoilDB1.0: uniqueCompoundSynonym20190910.tsv. Each synonym is reported as one record per row. The column description is as follows.
Total number of records Example -
|
Ggg |
Error CS0433 |
Chemical nomenclature is complex and ambiguous. Any attempt to disambiguate MUST record ambiguity. Thus acetyl-furan could be 2-acetyl-furan or 3-acetyl-furan,
OPSIN (https://opsin.ch.cam.ac.uk) gives:
and this must be recorded
Always test with OPSIN.