Disambiguation of plant names using GBIF #82

Shruthi-M · 2019-07-25T14:13:26Z

I submitted the entire set of plant names (before clean-up) to the GBIF species lookup tool (https://www.gbif.org/en/tools/species-lookup).
This allows the user to perform multiple searches at once. I have uploaded the results to the repository as gbif_result.csv.
The default headings of the columns are as follows:

occurrenceId
verbatimScientificName (user-submitted name)
scientificName (name existing in the database)
key (unique number assigned to the particular species on GBIF)
matchType (3 levels of result - EXACT, FUZZY, HIGHERRANK)

EXACT means the name exactly matches with the entry in the database
FUZZY indicates entries that may be mis-spelt
HIGHERRANK implies that the specific epithet of the entry is not being recognized (in other words, only genus is recognized)

confidence (expressed in terms of percentage)
status (can be ACCEPTED, SYNONYM or DOUBTFUL)

DOUBTFUL Treated as accepted, but doubtful whether this is correct.
SYNONYM A general synonym, the exact type is unknown.

rank (the highest rank recognized)
kingdom
phylum
class
order
family
genus
species
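
A minimal sketch of how the columns above could be read and summarised, assuming a tab-separated export without a header row. The sample rows are copied from entries quoted elsewhere in this thread; the names `COLUMNS`, `SAMPLE_ROWS` and `tally_match_types` are illustrative and not part of GBIF or the repository.

```python
import csv
import io
from collections import Counter

# Column order as listed in the post above.
COLUMNS = [
    "occurrenceId", "verbatimScientificName", "scientificName", "key",
    "matchType", "confidence", "status", "rank", "kingdom", "phylum",
    "class", "order", "family", "genus", "species",
]

# Four rows quoted in this thread (not the full gbif_result file).
SAMPLE_ROWS = [
    ["", "Abies alba", "Abies alba Mill.", "2685484", "EXACT", "99",
     "ACCEPTED", "SPECIES", "Plantae", "Tracheophyta", "Pinopsida",
     "Pinales", "Pinaceae", "Abies", "Abies alba"],
    ["", "Acacia caven", "Acacia caven (Molina) Molina", "2979244", "EXACT",
     "99", "SYNONYM", "SPECIES", "Plantae", "Tracheophyta", "Magnoliopsida",
     "Fabales", "Fabaceae", "Vachellia", "Vachellia caven"],
    # HIGHERRANK row: the species column is absent.
    ["", "Achillea depressa", "Achillea L.", "3119995", "HIGHERRANK", "96",
     "ACCEPTED", "GENUS", "Plantae", "Tracheophyta", "Magnoliopsida",
     "Asterales", "Asteraceae", "Achillea"],
    ["", "Aframomum hanburyl", "Aframomum hanburyi K.Schum.", "2758831",
     "FUZZY", "96", "SYNONYM", "SPECIES", "Plantae", "Tracheophyta",
     "Liliopsida", "Zingiberales", "Zingiberaceae", "Aframomum",
     "Aframomum angustifolium"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS)


def tally_match_types(text, delimiter="\t"):
    """Count rows per (matchType, status) pair."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS,
                            delimiter=delimiter)
    return Counter((row["matchType"], row["status"]) for row in reader)


counts = tally_match_types(SAMPLE_TSV)
for (match_type, status), n in sorted(counts.items()):
    print(match_type, status, n)
```

For a comma-separated file, pass `delimiter=","`; if the file has a header row, drop `fieldnames=COLUMNS` so that `DictReader` reads the names from the first line.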

petermr · 2019-07-25T15:34:11Z

Looks great. Well done for organising columns. Will need something like this for chemistry. Will look in detail when on my laptop


petermr · 2019-07-25T20:29:19Z

Shruthi,
This is excellent. **We should preserve this table.** Then we will normalize. We should classify the results into the major groups. Initial comments:

```
occurrenceId verbatimScientificName scientificName key matchType confidence status rank kingdom phylum class order family genus species
```

occurrenceId // these were all blank, so we can drop this
verbatimScientificName // this is our initial raw data and must be preserved. Let's use GBIF terminology where possible, so keep this column name
scientificName // the preferred name. Can include synonyms. We should not use this if there is a species
key // this is the most important column and gives us all the normalized information we need
matchType // Yes, we should keep this because it helps understand non-normalized species
confidence // what's the lowest? I think we can drop this later
status // useful for non-normalized names
rank // useful for non-normalized names
kingdom phylum class order family
genus // probably keep. GBIF seems to map unknown species to genus.
species // the key normalization

```
Abies alba Abies alba Mill. 2685484 EXACT 99 ACCEPTED SPECIES Plantae Tracheophyta Pinopsida Pinales Pinaceae Abies Abies alba
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia nuperrima
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia nuperrima
```

^^ duplicates. Why? We can get rid of these immediately.

```
Achillea albicaulis Achillea albicaulis C.A.Mey. 3120384 EXACT 99 SYNONYM SPECIES Plantae Tracheophyta Magnoliopsida Asterales Asteraceae Achillea Achillea tenuifolia
```

This is a single synonym but we should use the species name "Achillea tenuifolia" for future matching, not the scientificName "Achillea albicaulis". As always the key is the critical column.

```
,"Ocimum sanctum","Ocimum sanctum L.","2927101","EXACT","99","SYNONYM","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
,"Ocimum tenuiflorum","Ocimum tenuiflorum L.","2927100","EXACT","99","ACCEPTED","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
```

These are synonyms but have different keys. So in our normalized table there should only be the ACCEPTED. Normalization should be on "species".

Let's summarize:

make a list of ACCEPTED species
SYNONYMS can be removed if there is an ACCEPTED species
SYNONYM without ACCEPTED equivalent should be normalized on the species
everything else should be separated out as we will have to discuss it

Well done.
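
The summary rules in this comment could be sketched roughly as follows. This is only an illustration under assumptions: rows are plain dictionaries keyed by the GBIF column names, exact duplicates are dropped first, and the function name `normalize` and the four output groups are hypothetical.

```python
def normalize(rows):
    """Split GBIF result rows into the groups proposed in the comment above.

    Returns (accepted, redundant_synonyms, orphan_synonyms, to_discuss).
    """
    # Drop exact duplicate rows, keeping the first occurrence.
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)

    # ACCEPTED rows that resolve to a species, indexed by species name.
    accepted = {
        r["species"]: r
        for r in unique
        if r["status"] == "ACCEPTED" and r.get("species")
    }

    redundant_synonyms, orphan_synonyms, to_discuss = [], [], []
    for r in unique:
        if r["status"] == "ACCEPTED" and r.get("species"):
            continue
        if r["status"] == "SYNONYM" and r.get("species"):
            if r["species"] in accepted:
                redundant_synonyms.append(r)   # an ACCEPTED row already exists
            else:
                orphan_synonyms.append(r)      # normalize on the species column
        else:
            to_discuss.append(r)               # DOUBTFUL, genus-only, etc.
    return accepted, redundant_synonyms, orphan_synonyms, to_discuss


def _row(verbatim, key, status, species):
    return {"verbatimScientificName": verbatim, "key": key,
            "status": status, "species": species}


example = [
    _row("Acacia nuperrima", "2980107", "ACCEPTED", "Acacia nuperrima"),
    _row("Acacia nuperrima", "2980107", "ACCEPTED", "Acacia nuperrima"),
    _row("Achillea albicaulis", "3120384", "SYNONYM", "Achillea tenuifolia"),
    _row("Ocimum sanctum", "2927101", "SYNONYM", "Ocimum tenuiflorum"),
    _row("Ocimum tenuiflorum", "2927100", "ACCEPTED", "Ocimum tenuiflorum"),
]
accepted, redundant, orphans, discuss = normalize(example)
print(sorted(accepted))
print([r["verbatimScientificName"] for r in redundant])
print([r["verbatimScientificName"] for r in orphans])
```

The original table is left untouched; the function only produces derived lists, in line with the request to preserve the raw GBIF output.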

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr · 2019-07-25T20:30:56Z

We are going to need unique identifiers for all these accepted species, e.g. PL123. We need a column for the EssoilDB plant key.

petermr · 2019-07-26T17:17:52Z

@Shruthi-M - this is so important for all of us! Everyone should read this thread. I'll annotate @Shruthi-M's table and add actions. We should extract the discussion here onto .md pages as well.
There are fundamental issues which apply to compounds as well, @ambarishK. I hadn't seen this clearly until I started on the poster.
**We are only dealing at present with converting existing EssoilDB 1.0 (E1.0) to E2.0 (i.e. not worrying about ingesting new data into either).**

= Origin of data =
It's critical to review exactly where the data comes from. After talking with @gilienv yesterday I believe that:

all the independent data are in "infopdata" and "infocdata"
there is probably not much original documentation
there may be extra information in legacy *.xls files

== ACTION ==
We have to agree and then document what is in EssoilDB 1.0

Shruthi-M · 2019-07-27T07:54:47Z

Sir, I am working on your previous guidelines. I will try to separate the synonyms and the accepted names using the clues you have mentioned. A final list of accepted species will be prepared soon (in the coming week).
ISSUES BEING FACED:

Callistemon sp. [pid: 299] - The literature reports 7 varieties of this species and our database has data about only one variety - “Blackdown tableland”.
Kunzea ambigua [pid: 879] - The literature from which this is taken was analyzed. The data corresponding to this entry relates to “prostate form, B”, and the article reports three more varieties which are not included in the database.
Eryngium sp. nov. [pid: 560] - The literature reports 2 varieties - “1” and “2” under this. According to the EssoilDB 1.0, these 2 varieties are not separated.
Astartea sp. nov. [pid: 179] and Mikania sp. nov. [pid: 1074] - These could not be resolved further.
There are 11 binomial names that are shown to be DOUBTFUL.
Binomials without the author's name are not accepted by GBIF and other open source databases. More than half of the binomials have more than one author. I have referred to the journal and chosen the right author wherever I could find a discrepancy. Do the binomials have to be separated from, or retained along with, their authors (as this plays a crucial role in a bibliography database)? This was raised earlier and not resolved completely. It would be really kind of you if you could give me more clarity about this.
A final list of accepted species (i.e. not synonyms) has to be prepared.
This list of accepted names has to be rechecked against the respective journal articles to ensure that each has the right assigned author. NOTE: The assignment of author was done by the GBIF web program. Hence, this step is necessary.

petermr · 2019-07-28T09:40:46Z

Thanks so much Shruthi. One feature of data is that there is always a "long tail" (https://en.wikipedia.org/wiki/Long_tail): a few items that can't be easily processed. The most important thing at present is to resolve the largest chunks of names as efficiently as possible. I'll try to highlight a strategy today based on the very useful output from GBIF you created. If one species occurs only once and we can't resolve it, and another occurs 10 times and we can, we prioritise the latter. P.

petermr · 2019-07-28T10:59:41Z

I shall add comments on your very useful output (which can be viewed directly in table form on Github):
https://github.com/gilienv/EssOilDB/blob/master/tables/plant/gbif_result.tsv

Note that there is no EssoilDB ID for each row so I shall refer to this table in unsorted fashion. This is why we need an ID!

It has 1839 rows (1 header and 1838 data). ACTION does this number agree with other info_plant tables?
GBIF has returned an identifier (column 4) for each row. As far as I can see every row has an identifier, so there is nothing that GBIF cannot interpret in some way. ACTION are there any names elsewhere in V1.0 that GBIF cannot interpret?
There are exactly duplicated rows (e.g. 7 and 8 and more elsewhere). "Remove duplicates" in Excel gives:
129 duplicates: ACTION after adding UniqueIds, remove all duplicates from table.
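
The action above (add unique IDs first, then remove duplicates) might look something like this sketch. The `ROW` prefix and the function name are placeholders, not the agreed EssoilDB identifier scheme; rows are assumed to be sequences of strings in file order.

```python
def assign_ids_and_dedupe(rows, prefix="ROW"):
    """Give every row an ID, then drop exact duplicates.

    Returns (kept, dropped) where kept is a list of (row_id, row) and
    dropped maps each removed ID to the ID of the identical row retained.
    """
    kept = []
    dropped = {}
    first_seen = {}  # row content -> ID of its first occurrence
    for number, row in enumerate(rows, start=1):
        row_id = f"{prefix}{number}"
        fingerprint = tuple(row)
        if fingerprint in first_seen:
            dropped[row_id] = first_seen[fingerprint]
        else:
            first_seen[fingerprint] = row_id
            kept.append((row_id, list(row)))
    return kept, dropped


rows = [
    ["Abies alba", "2685484", "ACCEPTED"],
    ["Acacia nuperrima", "2980107", "ACCEPTED"],
    ["Acacia nuperrima", "2980107", "ACCEPTED"],
]
kept, dropped = assign_ids_and_dedupe(rows)
print(len(kept), "rows kept;", len(dropped), "duplicates removed")
```

Because IDs are assigned before deduplication, the `dropped` mapping records which original rows were merged, so the removal can be audited later.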

petermr · 2019-07-28T11:17:24Z

After deduplication here are my comments.

row 2

 	Abies alba 	Abies alba Mill. 	2685484 	EXACT 	99 	ACCEPTED 	SPECIES 	Plantae 	Tracheophyta 	Pinopsida 	Pinales 	Pinaceae 	Abies 	Abies alba

Input is Abies alba. GBIF found key=2685484 as an EXACT match with 99% confidence as a SPECIES, [taxonomy omitted] and the normative species name as Abies alba. It added the authority as Abies alba Mill. but this is probably more detail than we need. So our result is

Abies alba => Abies alba (GBIF 2685484) [SPECIES CONFIRMED]

Most of the results are (happily) of this form.

row 6

 	Acacia caven 	Acacia caven (Molina) Molina 	2979244 	EXACT 	99 	SYNONYM 	SPECIES 	Plantae 	Tracheophyta 	Magnoliopsida 	Fabales 	Fabaceae 	Vachellia 	Vachellia caven

Input is Acacia caven, EXACTly identified, but this is a synonym for the preferred name Vachellia caven.

Vachellia caven is not mentioned in V1.0, so we look up Vachellia caven in GBIF to give
https://www.gbif.org/species/3795588.
We then use Vachellia caven (GBIF 3795588) as the accepted name with Acacia caven (GBIF 2979244) as a synonym. ACTION agree this strategy.

row 16/17

 Achillea beibersteinii 	Achillea beibersteinii Afan. 	7400456 	EXACT 	98 	DOUBTFUL 	SPECIES 	Plantae 	Tracheophyta 	Magnoliopsida 	Asterales 	Asteraceae 	Achillea 	Achillea beibersteinii
Achillea biebersteinii 	Achillea biebersteinii C.Afan. 	3120276 	EXACT 	98 	SYNONYM 	SPECIES 	Plantae 	Tracheophyta 	Magnoliopsida 	Asterales 	Asteraceae 	Achillea 	Achillea arabica

I am guessing what has happened here is that beibersteinii is a misprint in the general plant literature (hence category DOUBTFUL), but it has got into the official books. So this should be referred to our curator. I would expect that we'd normalize it to Achillea biebersteinii which is a SYNONYM for Achillea arabica (which should be our agreed normative species).

row 25

 	Achillea depressa 	Achillea L. 	3119995 	HIGHERRANK 	96 	ACCEPTED 	GENUS 	Plantae 	Tracheophyta 	Magnoliopsida 	Asterales 	Asteraceae 	Achillea

Here GBIF cannot find an exact match, so reverts to the Genus. This is a loss of information, so maybe we should search elsewhere. "Plants of the world" (Kew) gives:
http://plantsoftheworldonline.org/taxon/urn:lsid:ipni.org:names:173942-1
"

Achillea depressa Janka

    This is a synonym of Achillea pseudopectinata Janka

"
So we can probably add the relatively few examples by hand.
There are 63 GENUS and 9 KINGDOM rows, of which about 20 are of the form Foobar spp. and so not reconcilable. The other ~40 can be searched by hand and added by the curator.

row 55

 	Aframomum hanburyl 	Aframomum hanburyi K.Schum. 	2758831 	FUZZY 	96 	SYNONYM 	SPECIES 	Plantae 	Tracheophyta 	Liliopsida 	Zingiberales 	Zingiberaceae 	Aframomum 	Aframomum angustifolium

FUZZY means that there is probably a misprint (here hanburyl for hanburyi). In this case the matched name is also a SYNONYM, so there is a further step to the accepted name Aframomum angustifolium.

recommendation

create new columns:

original    GBIFAcceptedName  GBIFIdentifier  GBIFSynonyms  curationDetails

The original is preserved
The best accepted name is always given
The identifier for that name is always given
If there are one or more accepted synonyms in V1.0, list them
Log curation details (date, curator, action). Action can be: TYPO, SYNONYM, GENUS
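
One possible shape for such a record, as a sketch only. The field names follow the proposed columns; the `curate` function, its parameters and the `accepted_keys` mapping (accepted name to GBIF key, filled by a second lookup as described for row 6) are assumptions. When no accepted key is known yet, the identifier is left empty for the curator.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CurationRecord:
    original: str
    GBIFAcceptedName: str
    GBIFIdentifier: str
    GBIFSynonyms: list = field(default_factory=list)
    curationDetails: list = field(default_factory=list)


def curate(row, curator, when, accepted_keys=None):
    """Build a CurationRecord from one GBIF result row (a dict)."""
    accepted_keys = accepted_keys or {}
    actions = []
    if row["matchType"] == "FUZZY":
        actions.append("TYPO")
    if row["status"] == "SYNONYM":
        actions.append("SYNONYM")
    if row["matchType"] == "HIGHERRANK":
        actions.append("GENUS")

    # Fall back to the genus when GBIF resolved only a higher rank.
    accepted_name = row.get("species") or row.get("genus") or ""
    if row["status"] == "SYNONYM":
        # The row's key belongs to the synonym, not to the accepted name.
        identifier = accepted_keys.get(accepted_name, "")
        synonyms = [f"{row['scientificName']} (GBIF {row['key']})"]
    else:
        identifier = row["key"]
        synonyms = []
    details = [(when.isoformat(), curator, action) for action in actions]
    return CurationRecord(row["verbatimScientificName"], accepted_name,
                          identifier, synonyms, details)


row6 = {
    "verbatimScientificName": "Acacia caven",
    "scientificName": "Acacia caven (Molina) Molina",
    "key": "2979244", "matchType": "EXACT", "status": "SYNONYM",
    "genus": "Vachellia", "species": "Vachellia caven",
}
record = curate(row6, "curator", date(2019, 7, 28),
                accepted_keys={"Vachellia caven": "3795588"})
print(record)
```

Keeping the curation log as a list of (date, curator, action) tuples lets a single row carry more than one action, as in the FUZZY plus SYNONYM case of row 55.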

petermr · 2019-07-28T14:12:35Z

list of problem species

Shruthi has created a report with a number of problems of names. She has actually gone back to the original papers. I suggest she copy the data here.

I have also found some problems which seem to be different, and add some suggestions.

species with unusual synonyms or mapping onto more than one species.

Requires hand editing

Achillea depressa
Achillea stricta
Achillea tanacetifolia
Aloysia triphylla
Anthemis altissima
Artemisia coerulescens
Artemisia fragrans
Artemisia gallica
Artemisia herba-alba
Athrotaxis taxifolia
Cedrus liobani
Chenopodium ambrosioides
Cinnamomum fragrans
Cinnamomum zeylanicum
Coleus Aromaticus
Dracocephalum speciosum
Echinophora chysantha
Eclipta indica
Eryngium caeruleum
Eucalyptus viridiflora
Eugenia nitida
Eugenia ovalifolia
Eugenia rotundifolia
Lavandula hybrida
Lindera strychnifolia
Lippia gracillis
Mentha gracilis
Micromeria dalmatica
Nepeta fissa
Ocimum adscendens
Oenanthe divaricata
Origanum basilicum
Origanum micranthum
Pinus laricio
Pluchea purpurascens
Polymnia sonchifolia
Satureja viminea
Senecio farfarifolius
Stachys lanata
Tanacetum elburensis
Thymus capitatus
Thymus caucasicus
Thymus ciliates
Thymus hirtus

hybrids

Probably best represented at genus level

Citrus reticulata x Citrus sinensis
Citrus latifolia Tanaka x Citrus aurantifolia Swingle
Citrus paradisi x Citrus. reticulata
Citrus unshiu x Citrus nobilis
Eucalyptus citriodora x E.torelliana
Lavandula luisieri x Lavandula stoechas

and these are probably hybrids (assume the non-Unicode char is the 'times' symbol).

Mentha •À_ piperita
Mentha•À_longifolia•À_L.
Peperomia•À_pellucida•À_L.

genus

These entries are only interpretable at genus level.

Astartea sp. nov.
Calamintha var.darensis
Callistemon sp.
Eryngium sp nov.
Eryngium spp.
Eugenia sp.
Hypericum 'Hidcote'
Kunzea sp.
Mentha spp.
Mikania sp.nov.
Origanum spp.
Persea
Xanthostemon spp.
Renealmia spp.

typos

Species require lowercase specific name.

Stachys Corsica
Tordylium Ketenoglui

unknown species

Lomatopodium khorassanicum
Serotinocarpum insignis
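
Some of the categories above can be flagged automatically before hand editing. The following is a rough sketch using regular expressions that were written against the names listed in this comment only; the patterns and the `classify` function are assumptions and will need tuning on the full table. Names with mis-encoded characters fall through to "other".

```python
import re

# " x " between names, or the multiplication sign, marks a hybrid.
HYBRID = re.compile(r"\sx\s|×")
# Genus alone, or genus followed by sp./spp./sp. nov., var.<name>, or a
# quoted cultivar name.
GENUS_ONLY = re.compile(
    r"^[A-Z][a-z]+(\s+(spp?\.?(\s*nov\.?)?|var\..*|'.*'))?$"
)
# Two words where the specific epithet is wrongly capitalised.
CAPITAL_EPITHET = re.compile(r"^[A-Z][a-z]+\s+[A-Z][a-z]+$")


def classify(name):
    """Return 'hybrid', 'genus', 'typo' or 'other' for a plant name."""
    name = name.strip()
    if HYBRID.search(name):
        return "hybrid"
    if GENUS_ONLY.match(name):
        return "genus"
    if CAPITAL_EPITHET.match(name):
        return "typo"
    return "other"


for example in ["Citrus unshiu x Citrus nobilis", "Eryngium sp nov.",
                "Hypericum 'Hidcote'", "Stachys Corsica",
                "Lomatopodium khorassanicum"]:
    print(example, "->", classify(example))
```

Anything classified as "other" still needs the GBIF result or a curator to decide whether it is a clean binomial, an unusual synonym or an unknown species.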

petermr · 2019-07-29T08:11:28Z

Shruthi,
Are you able to create a table of frequencies of plants? Then we could start the disambiguation with the most frequent problems.
You would have to find the unique ids for each profile, extract the plant by joining the tables and then sort.

It would be useful statistics as well.
P.
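
A sketch of the join-and-count step, assuming a profile table that carries a plant id per profile and a plant table mapping pid to name. The table layouts and the profile ids shown are hypothetical; only the three pids and names are taken from this thread.

```python
from collections import Counter


def plant_frequencies(profiles, plants):
    """Count profiles per plant and sort by descending frequency.

    profiles: iterable of (profile_id, pid) pairs
    plants:   dict mapping pid -> plant name
    Returns a list of (name, pid, count) tuples.
    """
    counts = Counter(pid for _profile_id, pid in profiles)
    table = [(plants.get(pid, "UNKNOWN"), pid, n) for pid, n in counts.items()]
    # Most frequent first; ties broken alphabetically by name.
    return sorted(table, key=lambda entry: (-entry[2], entry[0]))


plants = {299: "Callistemon sp.", 879: "Kunzea ambigua",
          560: "Eryngium sp. nov."}
profiles = [(1, 879), (2, 299), (3, 879), (4, 560), (5, 879), (6, 299)]

for name, pid, count in plant_frequencies(profiles, plants):
    print(f"{count:3d}  {pid:5d}  {name}")
```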

petermr · 2019-07-29T09:41:43Z

Shruthi,
When you have disambiguated (most of) the plant species, can you look up their IDs in Wikidata?
I wrote a simple tool in Feb for the workshop, but it was a bit slow - had to lookup one-by-one. There may be better tools now - I can ask...
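
One way to avoid one-by-one lookups is a single SPARQL query against the Wikidata Query Service, matching on the GBIF taxon ID property (P846). The sketch below is an assumption about how this could be done, not the tool mentioned above; the function names and the User-Agent string are placeholders, and `lookup` needs network access, so it is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(gbif_keys):
    """Build one SPARQL query for many GBIF keys (P846 = GBIF taxon ID)."""
    # int() rejects anything that is not a plain number, e.g. "NA".
    values = " ".join(f'"{int(key)}"' for key in gbif_keys)
    return (
        "SELECT ?gbif ?item WHERE { "
        f"VALUES ?gbif {{ {values} }} "
        "?item wdt:P846 ?gbif . }"
    )


def parse_bindings(result):
    """Map GBIF key -> Wikidata Q-id from a SPARQL JSON result."""
    mapping = {}
    for binding in result["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        mapping[binding["gbif"]["value"]] = qid
    return mapping


def lookup(gbif_keys, user_agent="EssoilDB-example/0.1"):
    """Query Wikidata for all keys in one request (requires network)."""
    params = urllib.parse.urlencode(
        {"query": build_query(gbif_keys), "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))


print(build_query(["2685484", "2927100"]))
```

Keys missing from the returned mapping have no Wikidata item with that GBIF ID; for large lists the keys would probably need to be sent in batches.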

Shruthi-M · 2019-07-29T10:23:03Z

Sir
Presently, I am adding the authors to the variations columns. This is very time-consuming as there are a lot of entries.
I had a small discussion with Gitanjali ma'am today and we decided to add common names, synonyms, GBIF key and the scientific name (with author) - all under one separate column titled "SYNONYM".
I am currently working on this.
As I have only 10 days of my training left and I have to start writing my final report, I will not be able to give more inputs apart from working on the new column.

Thank you for your guidance.

petermr · 2019-07-29T10:59:22Z

> Presently, I am adding the authors to the variations columns. This is very time-consuming as there are a lot of entries.

I can understand there is a lot to do.

> I had a small discussion with Gitanjali ma'am today and we decided to add common names, synonyms, GBIF key and the scientific name (with author) - all under one separate column titled "SYNONYM".

What is the purpose of SYNONYM? Is it for searching? In which case it can be automatically generated from the GBIF identifier when needed.

> I am currently working on this. As I have only 10 days of my training left and I have to start writing my final report, I will not be able to give more inputs apart from working on the new column.

Understood. I will mail Gita.

> Thank you for your guidance.

It is a pleasure to work with you. P.

Shruthi-M · 2019-07-30T06:34:13Z

Greetings!
I have uploaded a file named essoildb.plantdata (2) on the repository. This has the following columns: [Please note: This is not the plant table. This is being used only for modifications.]

pid - as per EssoilDB 1.0
pname - as existing in EssoilDB 1.0
scientificName (gbif) - results obtained from GBIF
Normalized name
Details - about the author, subspecies, variety, etc.
pfid
phid
Error
kingdom
phylum
class
order
family
genus
species
Synonym - this column just gives the name of the synonymous species along with the GBIF key of the name - existing in our database. I will be adding the synonyms, common names and scientific names of all the plants to this column. Each of these will be separated by a comma.

The entries that are modified/ need modification are in red.

The hybrids are yet to be resolved

petermr · 2019-07-30T11:06:27Z

Thanks. Good to see this is a separate table. Will look later today.

Shruthi-M · 2019-08-01T09:47:46Z

I have uploaded the file containing the wiki-id as wiki_id.xlsx onto the repository. "NA" implies that the name does not exist in wikidata.

petermr · 2019-08-01T16:09:57Z

Many thanks!

petermr · 2019-08-01T19:49:45Z

@Shruthi-M This is wonderful! You have done a good job.
Could you please add:

How you created the data in each column. I imagine some has been added from GBIF or another authority; please name the authority explicitly, and the service you used (if you did it automatically).
Please explain the coloured entries.
How did you perform the Wikidata lookup? Did you use a service? (I thought you said you only had 300, but you've got about 85% I think?)

[Although you may write this in the report the people using the plant data may not have access, so make sure the doc is in the directory].

I have renamed the major table to ingestion and created a TSV version.

petermr · 2019-08-01T20:21:59Z

UNIQUE IDENTIFIERS for plants.
Now is the time to freeze the number of entries being imported from V1.0. There are 1838 plant entries and you have generated a unique ID for each record. This ID must always be associated with the same record. If records are deleted we NEVER reuse that identifier.
**I think the identifiers should have a leading letter or more.**
This has several advantages:

it protects against pre-truncation by mistake
it protects against using them as data (e.g. addition or subtraction)
it makes it clear they are identifiers
it may make them easier to find in Google, etc.
it identifies them to the world as EssoilDB

So I suggest:

EPdddd for plants
ECdddd for compounds
ELdddd for locations
etc.

The question is whether we create identifiers of fixed length, e.g.
EP0001234
Since Wikidata and others don't, I suggest we DON'T worry about length.
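
The scheme above could be sketched as follows. This is only an illustration, not an agreed implementation: the class name `IdRegistry`, the `PREFIXES` table and the `parse_id` helper are hypothetical, and the starting count of 1838 is the number of plant entries mentioned above.

```python
import re

# Prefixes taken from the suggestion above; the mapping itself is hypothetical.
PREFIXES = {"plant": "EP", "compound": "EC", "location": "EL"}
ID_PATTERN = re.compile(r"^E[A-Z](\d+)$")


def parse_id(identifier):
    """Split 'EP1234' into ('EP', 1234); a bare number is rejected."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not an EssoilDB identifier: {identifier!r}")
    return identifier[:2], int(match.group(1))


class IdRegistry:
    """Issues identifiers such as EP1839 and never reissues a number."""

    def __init__(self, kind, last_issued=0):
        self.prefix = PREFIXES[kind]
        self.last_issued = last_issued  # highest number ever handed out
        self.retired = set()            # numbers of deleted records, kept for the record

    def new_id(self):
        # The counter only moves forward, so a deleted record's number
        # can never be handed out again.
        self.last_issued += 1
        return f"{self.prefix}{self.last_issued}"

    def retire(self, identifier):
        self.retired.add(parse_id(identifier)[1])


registry = IdRegistry("plant", last_issued=1838)
first_new = registry.new_id()
registry.retire(first_new)
second_new = registry.new_id()
```

Because the identifier is a string with a letter prefix, it cannot be added or subtracted by accident, and no fixed width is assumed.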

EmanuelFaria · 2019-08-01T20:24:40Z

Manny: Before re-importing into the database, I’d like to get a shot at eliminating any invisible characters and other anomalies please.

PMR: Absolutely!! The characters should ONLY be Unicode 32-126. We will test for that. All other characters must be mapped onto these. Thus:

* any beta-character => `beta-`
* all quotes => `"` or `'`
* all dashes => `-`
* all typography and style is discarded

(BTW when replying to Github issues, try to eliminate all copies of previous posts, signatures, routing etc.)
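
A minimal sketch of such a mapping, assuming Python and its standard `unicodedata` module. The `REPLACEMENTS` table is illustrative only (the project would need to agree the full list), and `normalize` and `offending` are hypothetical helper names.

```python
import unicodedata

# Illustrative mapping only; not a complete or agreed table.
REPLACEMENTS = {
    "\u03b1": "alpha", "\u03b2": "beta", "\u03b3": "gamma",  # Greek letters
    "\u2018": "'", "\u2019": "'",                              # single smart quotes
    "\u201c": '"', "\u201d": '"',                              # double smart quotes
    "\u2010": "-", "\u2011": "-", "\u2012": "-",               # hyphen variants
    "\u2013": "-", "\u2014": "-", "\u2212": "-",               # dashes, minus sign
    "\u00a0": " ", "\u2009": " ",                              # non-breaking and thin space
    "\u200b": "",                                              # zero-width space
}


def normalize(text):
    """Map a single field value onto characters 32-126 only.

    Intended for field contents, not whole TSV lines: a tab (char 9)
    would be dropped as well.
    """
    out = []
    for ch in text:
        ch = REPLACEMENTS.get(ch, ch)
        # NFKD splits accented letters into base letter + combining mark;
        # the combining mark falls outside 32-126 and is discarded.
        for c in unicodedata.normalize("NFKD", ch):
            if 32 <= ord(c) <= 126:
                out.append(c)
    return "".join(out)


def offending(text):
    """List the distinct characters outside 32-126, for a pre-import test."""
    return sorted({ch for ch in text if not 32 <= ord(ch) <= 126})
```

Running `offending` over every field before import gives the test mentioned above; anything it reports that `REPLACEMENTS` does not cover is silently dropped by `normalize`, so the table should be extended rather than relying on that fallback.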

petermr · 2019-08-02T09:44:23Z

I have renamed @Shruthi-M's tables to tables/plant/import1.0.* Sorry if this inconveniences anyone.

Shruthi-M · 2019-08-05T06:26:52Z

Greetings!
I have uploaded a file - details.xlsx. This contains the following data:

1. pid
2. Normalized name
3. scientificName
4. GBIF key
5. wiki_id
6. IF_ACCEPTED _NAMES
7. IF_SYNONYMS
8. Common_names
9. synonyms

For columns 6 and 7, I have also uploaded another document called Documentation (details), which contains the code used to obtain them.
ANALYSIS:
The following cases need a review:
a) if a taxon is neither accepted nor a synonym, it implies that the name needs review
b) if the scientificName column contains the entry as "Plantae"
c) if the entries in the column "Normalized name" are marked in red
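
Rules (a) and (b) could be checked automatically along these lines. This is a sketch under assumptions: the column spellings and the `TRUE`/`FALSE` encoding of columns 6 and 7 are guesses, and `needs_review` is a hypothetical helper. Rule (c) depends on cell colour, which does not survive in a plain-text export, so it is not covered.

```python
# Assumed column names and value encoding; adjust to the real file.
ACCEPTED_COL = "IF_ACCEPTED_NAMES"
SYNONYM_COL = "IF_SYNONYMS"
NAME_COL = "scientificName"


def needs_review(row):
    """Return the reasons (possibly none) why a row needs manual review."""
    reasons = []
    accepted = row.get(ACCEPTED_COL, "").strip().upper() == "TRUE"
    synonym = row.get(SYNONYM_COL, "").strip().upper() == "TRUE"
    if not accepted and not synonym:
        # Rule (a): the taxon is neither accepted nor a synonym.
        reasons.append("neither accepted nor synonym")
    if row.get(NAME_COL, "").strip() == "Plantae":
        # Rule (b): the name resolved only to the kingdom.
        reasons.append("scientificName is Plantae")
    return reasons
```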

Shruthi-M · 2019-08-05T06:30:29Z

The above post is in tables/plant.

petermr · 2019-08-05T23:26:54Z

The details.xlsx table looks well designed and created. I need to check details - this will take a little time. The wiki_id table is presumably not required as the Wikidata column is already in "details", correct? P.


Shruthi-M · 2019-08-06T06:58:35Z

> The details.xlsx table looks well designed and created. I need to check details - this will take a little time.

Thank you Sir

> The wiki_id table is presumably not required as the Wikidata column is already in "details", correct?

Yes, a separate table is not required.


petermr · 2019-08-06T07:18:28Z

@Shruthi-M - can you put a file with brief descriptions of the column headings and the colours in the plant/ directory? Also, are there non-ASCII characters? I suspect not, as plant names use ASCII and I don't think there are other requirements. Normalize dashes to hyphen-minus. There should be no quotes or apostrophes, but if there are, normalize them to " or '. Do not use smart quotes. Use TSV by default because you may need commas elsewhere. Spaces should be normal single spaces (char 32). Use a text editor, not Word. I'll have a look but I'm not too concerned. (The chemistry and bibliography are harder.)
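
To illustrate why TSV is the safer default, here is a small sketch using Python's standard `csv` module. The row is illustrative only and `write_tsv` is a hypothetical helper; the point is that a comma-separated list inside one cell needs no quoting when the delimiter is a tab.

```python
import csv
import io


def write_tsv(rows, fieldnames, handle):
    """Write dict rows as tab-separated values with plain '\\n' line ends."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Illustrative row: the synonyms cell contains commas.
rows = [
    {
        "pid": "1",
        "scientificName": "Abies alba Mill.",
        "synonyms": "silver fir, European silver fir",
    }
]
fieldnames = ["pid", "scientificName", "synonyms"]

buffer = io.StringIO()
write_tsv(rows, fieldnames, buffer)

# Reading it back returns the commas untouched.
read_back = list(csv.DictReader(io.StringIO(buffer.getvalue()), delimiter="\t"))
```

When writing to a real file rather than a `StringIO`, opening it with `newline=""` (as the `csv` documentation recommends) avoids the spurious line ends mentioned above.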


EmanuelFaria · 2019-08-06T07:31:06Z

These are all good points Peter, I’ll be taking care of this as the final step before Gita and you get a last look before import. With what little time Shruthi has on this project, getting the data to be true and correct should be her main focus. If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But it is ultimately unnecessary because Gita has made me responsible for that. Meanwhile… GO! Shruthi GO! We’re cheering you on to the finish line!! :D

Manny


petermr · 2019-08-06T08:30:57Z

> These are all good points Peter, I’ll be taking care of this as the final step before Gita and you get a last look before import.

Thanks so much!

> With what little time Shruthi has on this project, getting the data to be true and correct should be her main focus.

Absolutely agreed.

> If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But ultimately unnecessary because Gita has made me responsible for that.

It is MUCH easier now. By resolving against GBIF and Wikipedia/Wikidata we don't have to worry about spelling because *they* take care of it. So

GBIF=2685484 Wikidata=Q146992 species=Abies alba

is ALL we have to know for the first entry. Everything else can be looked up.

"GBIF, what is the preferred taxonomic authority for Abies alba?" "Abies alba Mill."
"Wikidata, what is the common name for Q146992 in Portuguese?" "abeto-prateado"

In particular those two authorities work closely together. They will *automatically* update when:

* a species is reclassified (genus, family)
* a new synonym is found
* a new authority is added

Also you can automatically ask: "What is the IUCN status of Q146992?" "Least concern"

In this way the things that EssoilDB has to maintain are:

* a register of imported articles (bibliography) - Ambarish is doing this
* a register of plant species (Shruthi has done this!)
* a register of compounds (Ambarish is doing this)
* locations (not well advanced)

then:

* an import mechanism (PMR)
* import checking - yet to be developed but uses core tables
* data => core plant/compound/parts/location/ tables
* a search engine (separate from core) using:
  - core tables
  - plant synonyms from Wikidata, GBIF
  - chemical structure search (from CDK - they will be happy to advise)

This design is implicit in the poster, which should be an initial guide.

I think this is a great time to design in the features that you would find useful. It's a relatively small knowledgebase so systems such as NoSQL or Tidyverse should be considered. Also I want to store a LOT more of the original papers if that would be useful. Exciting!

I'd very much like to talk again over Skype. I think just you and me if Gita is busy.
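
The two lookups described above might be sketched like this, using only the Python standard library. It assumes the public GBIF species-match endpoint and Wikidata's `Special:EntityData` JSON export; the helper names are hypothetical. One caveat: the sketch reads the Wikidata *label* in a given language, which is only a proxy for a common name (Wikidata records taxon common names separately under property P1843, and IUCN status under P141).

```python
import json
import urllib.parse
import urllib.request

GBIF_MATCH = "https://api.gbif.org/v1/species/match"
WIKIDATA_ENTITY = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def gbif_match_url(name):
    """Build the GBIF name-matching URL for one scientific name."""
    return GBIF_MATCH + "?" + urllib.parse.urlencode({"name": name})


def wikidata_url(qid):
    """Build the JSON export URL for one Wikidata item."""
    return WIKIDATA_ENTITY.format(qid=qid)


def fetch_json(url):
    """Fetch and decode a JSON document (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def summarize_gbif(match):
    """Keep only the fields EssoilDB would need from a GBIF match result."""
    wanted = ("usageKey", "scientificName", "status", "matchType", "confidence")
    return {key: match.get(key) for key in wanted}


def wikidata_label(entity_data, qid, language):
    """Return the item's label in the given language, or None if absent."""
    labels = entity_data["entities"][qid].get("labels", {})
    return labels.get(language, {}).get("value")
```

For example, `summarize_gbif(fetch_json(gbif_match_url("Abies alba")))` and `wikidata_label(fetch_json(wikidata_url("Q146992")), "Q146992", "pt")` would issue the two questions quoted above; the answers come from the live services, so they can change as the authorities update.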


petermr · 2019-08-06T09:53:26Z

@Shruthi-M I have found your *.docx file and this looks very good. Am reading it.

petermr · 2019-08-06T10:14:14Z

Using Word documents for docs on Github is not normally a good idea for several reasons.

Word can introduce spurious characters, especially line ends, smart quotes etc.
Github is designed for code, Word is not.

In particular, displaying screenshots of code can be very frustrating for people who want to use them. They have to retype them and will make mistakes. People want to cut and paste and run.
(Same goes for the species/output.)
[Screenshots can be useful for tutorials and web pages but the original should always be available.]

Can you put the code in an R format (notebook-like)? That's the best way.

Shruthi-M · 2019-08-06T10:54:50Z

Sure, I will look into this.

EmanuelFaria · 2019-08-06T15:22:03Z

Thanks Peter,

Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)

Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town, Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but had no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop each in the right field, that would be a time-saving miracle, not to mention taking away the fear of getting something wrong.

I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents). All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.

Looking forward to getting the cleanups done for the team in a precise yet quick manner. I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies. Whenever you’re ready, send me your list, and I’ll go through them all, methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.

I’d love to chat with you too! I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday? Keep in mind I’m in Brazil, so we can work out a good time for both of us. If you use an iPhone, I use this app to quickly find good times to meet in two or more timezones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812

Talk soon on Skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)

Good day or night to you, wherever you are.

Manny


petermr · 2019-08-06T18:22:24Z

> Thanks Peter, Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)

EssoilDB1.0 is a finite task. I suspect we don't need to tidy up the whole of the long tail. EssoilDB2.0 will be wonderfully different.

> Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town, Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop it in the right field, that would be a time-saving miracle, not to mention taking the fear of getting something wrong.

The Open community - Wikipedia and others - have some solutions here. I'll tweet it.
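
The comparison against a "World table" described in the quoted question could look roughly like this. Everything here is an assumption for illustration: the miniature `GAZETTEER`, the field names, and the idea that the original text field is comma-separated. A real gazetteer would be loaded from an open source and would need rules for ambiguous names.

```python
# Hypothetical miniature gazetteer: lower-cased place name -> field it belongs in.
GAZETTEER = {
    "brazil": "country",
    "india": "country",
    "minas gerais": "state_province",
    "kerala": "state_province",
    "belo horizonte": "city",
}


def split_location(text, gazetteer=GAZETTEER):
    """Assign each comma-separated part of a location string to a field.

    Parts that are unknown, or whose field is already filled, are kept in
    'unmatched' for manual review instead of being guessed.
    """
    fields = {"country": None, "state_province": None, "city": None, "unmatched": []}
    for part in text.split(","):
        name = " ".join(part.split())  # collapse repeated or odd whitespace
        if not name:
            continue
        kind = gazetteer.get(name.lower())
        if kind is None or fields[kind] is not None:
            fields["unmatched"].append(name)
        else:
            fields[kind] = name
    return fields
```

Keeping an explicit `unmatched` list means nothing is silently dropped into the wrong field, which addresses the fear of getting something wrong.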

> I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents).

Let's talk about this. I was under the impression that the activities in E1.0 were inserted from external sources and not from the paper. But I may be wrong. If it is extracting them from the paper we need to talk.

> All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.

We really need Gita's view on this.

> I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies.

The database is small enough that it fits in Github easily. EssOilDB/v1.0/info_c.tsv is only 38 Mbyte.

> Whenever you’re ready, send me your list, and I’ll go through them all, methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.

Ambarish is/has_been working on this. In any case all the data is on Github so we don't need to send it.

> I’d love to chat with you too! I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday? Keep in mind I’m in Brazil, so we can work out a good time for both of us.

I have some ideas about Open Science in LatAm which I'll explain later.

> If you use an iphone, I use this app to quickly find good times to meet and two or more timezones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812 Talk soon on skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)

I shall be in Edinburgh Thu and Friday. I am happy to try times in the UK in the afternoon and evening. What I'd like for V2.0 is some use cases. I can't guarantee that they would all be supported. However I would be optimistic about experimental methodology for extraction. Activities will be harder.


Shruthi-M · 2019-08-11T05:33:34Z

Greetings!
I have added a new file at EssOilDB/tables/plant called details.txt. I have added it in the .txt format (not .xlsx) as requested. I can separately send the .xlsx file as well (if needed). I noticed that some changes needed to be made in the synonyms column, and I have made them.
I will send the text version of the R code soon.

Thank you.

Shruthi-M self-assigned this Jul 25, 2019

Repository owner deleted a comment from Shruthi-M Nov 23, 2019
