Disambiguation of plant names using GBIF #82
Comments
Looks great. Well done for organising columns. Will need something like
this for chemistry.
Will look in detail when on my laptop
On Thu, 25 Jul 2019, 15:13 Shruthi-M, ***@***.***> wrote:
I submitted the entire set of plant names (before clean-up) to the GBIF
species lookup tool (https://www.gbif.org/en/tools/species-lookup).
This allows the user to perform multiple searches at once. I have uploaded
the results to the repository as gbif_result.csv.
The default headings of the columns are as follows:
1. occurrenceId
2. verbatimScientificName (user-submitted name)
3. scientificName (name existing in the database)
4. key (unique number assigned to the particular species on GBIF)
5. matchType (3 levels of result - EXACT, FUZZY, HIGHERRANK)
- EXACT means the name exactly matches the entry in the database
- FUZZY indicates entries that may be misspelt
- HIGHERRANK implies that the specific epithet of the entry is not
being recognized (in other words, only the genus is recognized)
6. confidence (expressed as a percentage)
7. status (can be ACCEPTED, SYNONYM or DOUBTFUL)
- DOUBTFUL: treated as accepted, but doubtful whether this is correct.
- SYNONYM: a general synonym, the exact type is unknown.
8. rank (the highest rank recognized)
9. kingdom
10. phylum
11. class
12. order
13. family
14. genus
15. species
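A minimal sketch of how the lookup output could be read and summarised by matchType and status. It assumes gbif_result.csv is comma-separated with a header row matching the column headings listed above; the two data rows are the Ocimum entries quoted later in this thread, and the helper names `read_rows` and `tally` are hypothetical.

```python
import csv
import io
from collections import Counter

# Stand-in for gbif_result.csv. The header row is an assumption based on
# the default column headings; the data rows are copied from this thread.
SAMPLE = '''occurrenceId,verbatimScientificName,scientificName,key,matchType,confidence,status,rank,kingdom,phylum,class,order,family,genus,species
,"Ocimum sanctum","Ocimum sanctum L.","2927101","EXACT","99","SYNONYM","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
,"Ocimum tenuiflorum","Ocimum tenuiflorum L.","2927100","EXACT","99","ACCEPTED","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum tenuiflorum"
'''


def read_rows(handle):
    """Return the lookup result as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def tally(rows, column):
    """Count how often each value of `column` occurs."""
    return Counter(row[column] for row in rows)


rows = read_rows(io.StringIO(SAMPLE))
match_counts = tally(rows, "matchType")
status_counts = tally(rows, "status")
print(match_counts)
print(status_counts)
```

For the real file, `open("gbif_result.csv", newline="", encoding="utf-8")` would replace the `io.StringIO` stand-in, assuming the file is UTF-8 encoded.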
|
Shruthi ,
This is excellent.
**We should preserve this table**
Then we will normalize.
We should classify the results into the major groups.
Initial comments:
```
occurrenceId verbatimScientificName scientificName key matchType confidence
status rank kingdom phylum class order family genus species
```
occurrenceId // these were all blank, so we can drop this
verbatimScientificName // this is our initial raw data and must be preserved. Let's use GBIF terminology where possible, so keep this column name
scientificName // the preferred name. Can include synonyms. We should not use this if there is a species
key // this is the most important column and gives us all the normalized information we need
matchType // yes, we should keep this because it helps us understand non-normalized species
confidence // what's the lowest? I think we can drop this later
status // useful for non-normalized names
rank // useful for non-normalized names
kingdom phylum class order family genus // probably keep. GBIF seems to map unknown species to genus.
species // the key normalization
```
Abies alba Abies alba Mill. 2685484 EXACT 99 ACCEPTED SPECIES Plantae
Tracheophyta Pinopsida Pinales Pinaceae Abies Abies alba
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED
SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia
nuperrima
Acacia nuperrima Acacia nuperrima Baker f. 2980107 EXACT 100 ACCEPTED
SPECIES Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Acacia Acacia
nuperrima
```
^^ duplicates. Why? we can get rid of these immediately
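A small sketch of removing exact duplicate rows while keeping the first occurrence and the input order. The rows are the first six columns of the entries shown above; `drop_duplicates` is a hypothetical helper name.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(row)  # whole row must match to count as a duplicate
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


rows = [
    ("Abies alba", "Abies alba Mill.", "2685484", "EXACT", "99", "ACCEPTED"),
    ("Acacia nuperrima", "Acacia nuperrima Baker f.", "2980107", "EXACT", "100", "ACCEPTED"),
    ("Acacia nuperrima", "Acacia nuperrima Baker f.", "2980107", "EXACT", "100", "ACCEPTED"),
]
unique_rows = drop_duplicates(rows)
print(len(rows), "->", len(unique_rows))
```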
```
Achillea albicaulis Achillea albicaulis C.A.Mey. 3120384 EXACT 99 SYNONYM
SPECIES Plantae Tracheophyta Magnoliopsida Asterales Asteraceae Achillea
Achillea tenuifolia
```
This is a single synonym but we should use the species name "Achillea
tenuifolia" for future matching, not the scientificName "Achillea
albicaulis". As always the key is the critical column.
```
,"Ocimum sanctum","Ocimum sanctum
L.","2927101","EXACT","99","SYNONYM","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum
tenuiflorum"
,"Ocimum tenuiflorum","Ocimum tenuiflorum
L.","2927100","EXACT","99","ACCEPTED","SPECIES","Plantae","Tracheophyta","Magnoliopsida","Lamiales","Lamiaceae","Ocimum","Ocimum
tenuiflorum"
```
These are synonyms but have different keys. So in our normalized table
there should only be the ACCEPTED entry. Normalization should be on "species".
Let's summarize and make a list of:
ACCEPTED species
SYNONYMs, which can be removed if there is an ACCEPTED species
SYNONYMs without an ACCEPTED equivalent, which should be normalized on the species
Everything else should be separated out as we will have to discuss it.
Well done.
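The grouping proposed above could be sketched roughly as follows. This is an illustration only: the group names are hypothetical, the rule that an ACCEPTED row must also have rank SPECIES is an assumption, and the fourth sample row is invented to show the "everything else" case.

```python
from collections import defaultdict


def classify(rows):
    """Split GBIF result rows into the four groups described above.

    Each row is a dict with the GBIF columns 'status', 'rank' and 'species'.
    """
    accepted_species = {
        r["species"] for r in rows
        if r["status"] == "ACCEPTED" and r["rank"] == "SPECIES"
    }
    groups = defaultdict(list)
    for r in rows:
        if r["status"] == "ACCEPTED" and r["rank"] == "SPECIES":
            groups["accepted"].append(r)
        elif r["status"] == "SYNONYM" and r["species"] in accepted_species:
            # The accepted name is already in our data, so this row is redundant.
            groups["synonym_with_accepted"].append(r)
        elif r["status"] == "SYNONYM" and r["species"]:
            # No accepted row in our data: normalize on the 'species' column.
            groups["synonym_only"].append(r)
        else:
            groups["to_discuss"].append(r)
    return groups


rows = [
    {"verbatimScientificName": "Ocimum sanctum", "status": "SYNONYM",
     "rank": "SPECIES", "species": "Ocimum tenuiflorum"},
    {"verbatimScientificName": "Ocimum tenuiflorum", "status": "ACCEPTED",
     "rank": "SPECIES", "species": "Ocimum tenuiflorum"},
    {"verbatimScientificName": "Achillea albicaulis", "status": "SYNONYM",
     "rank": "SPECIES", "species": "Achillea tenuifolia"},
    # Invented row: a name resolved only to genus level.
    {"verbatimScientificName": "Callistemon sp.", "status": "ACCEPTED",
     "rank": "GENUS", "species": ""},
]
groups = classify(rows)
normalized = sorted(
    {r["species"] for r in groups["accepted"] + groups["synonym_only"]}
)
print(normalized)
```

The `normalized` list is the set of species names that would go into the normalized table; rows in `to_discuss` are kept aside for hand review.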
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
We are going to need unique identifiers for all these accepted species,
e.g. PL123.
We need a column for the EssoilDB plant key.
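One possible way to allocate such identifiers, as a sketch. The `PL` prefix comes from the example above; the zero-padded width of 4 and the function name `assign_plant_ids` are assumptions, not decisions made in this thread.

```python
def assign_plant_ids(species_names, prefix="PL", width=4, start=1):
    """Map each distinct species name to an identifier such as PL0001."""
    ids = {}
    for offset, name in enumerate(sorted(set(species_names))):
        ids[name] = f"{prefix}{start + offset:0{width}d}"
    return ids


plant_ids = assign_plant_ids(["Ocimum tenuiflorum", "Abies alba", "Abies alba"])
print(plant_ids)
```

Identifiers allocated this way depend on the sorted order of the input, so in practice they would be generated once and then stored in the plant table; new species would be appended with `start` set to the next unused number.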
|
@Shruthi-M - this is so important for all of us! Everyone should read this thread. I'll annotate @Shruthi-M's table and add actions. We should extract the discussion here onto .md pages as well.
= Origin of data =
== ACTION == |
Sir, I am working on your previous guidelines. I will try to separate the synonyms and the accepted names using the clues - you have mentioned. A final list of accepted species will be prepared soon (in the coming week).
|
Thanks so much Shruthi,
One feature of data is that there is always a "long tail"
(https://en.wikipedia.org/wiki/Long_tail): a few items that can't be easily
processed. The most important thing at present is to resolve the largest
chunks of names as efficiently as possible. I'll try to highlight a
strategy today based on the very useful output from GBIF you created. If
there is a species that occurs only once and we can't resolve it, compared
with one that occurs 10 times and we can, we prioritise the latter.
P.
On Sat, Jul 27, 2019 at 8:54 AM Shruthi-M ***@***.***> wrote:
Sir, I am working on your previous guidelines. I will try to separate the
synonyms and the accepted names using the clues you have mentioned. A
final list of accepted species will be prepared soon (in the coming week).
*ISSUES BEING FACED:*
1. Callistemon sp. [pid: 299] - The literature reports 7 varieties of
this species and our database has data about only one variety, “Blackdown
tableland”.
2. Kunzea ambigua [pid: 879] - The literature from which this is taken
was analyzed. It was found that the data corresponding to this entry
relates to the “prostrate form, B”, and the article reports three more
varieties which are not included in the database.
3. Eryngium sp. nov. [pid: 560] - The literature reports 2 varieties,
“1” and “2”, under this. In EssoilDB 1.0 these 2 varieties
are not separated.
4. Astartea sp. nov. [pid: 179] and Mikania sp. nov. [pid: 1074] -
These could not be resolved further.
5. There are 11 binomial names that are shown to be DOUBTFUL.
6. Binomials without the authors' names are not accepted by GBIF
and other open source databases. More than half of the binomials
have more than one author. I have referred to the journal and chosen
the right author wherever I could find a discrepancy. Do the binomials
have to be separated from, or retained along with, their authors (as this
plays a crucial role in a bibliography database)? This was raised earlier
and not resolved completely. It would be really kind of you if you could
give me more clarity about this.
7. A final list of accepted species (i.e. not synonyms) has to be
prepared.
8. This list of accepted names has to be rechecked against the
respective journal articles to ensure that each has the right assigned
author. NOTE: The assignment of author was done by the GBIF web program.
Hence, this step is necessary.
|
I shall add comments on your very useful output (which can be viewed directly in table form on GitHub). Note that there is no EssoilDB ID for each row, so I shall refer to this table in unsorted fashion. This is why we need an ID!
|
After deduplication here are my comments.
Input is
Most of the results are (happily) of this form.
Input is
I am guessing what has happened here is that
Here GBIF cannot find an exact match, so reverts to the Genus. This is a loss of information, so maybe we should search elsewhere. "Plants of the world" (Kew) gives:
"
FUZZY means that there is probably a misprint (here recommendation
|
list of problem species
Shruthi has created a report with a number of problems of names. She has actually gone back to the original papers. I suggest she copy the data here. I have also found some problems which seem to be different, and add some suggestions.
species with unusual synonyms or mapping onto more than one species
Requires hand editing
hybrids
Probably best represented at genus level
and these are probably hybrids (assume the non-ASCII char is the 'times' symbol).
genus
These entries are only interpretable at genus level.
typos
Species require lowercase specific name.
unknown species
|
On Mon, Jul 29, 2019 at 11:23 AM Shruthi-M ***@***.***> wrote:
>> Shruthi,
>> Are you able to create a table of frequencies of plants? Then we could
>> start the disambiguation with the most frequent problems.
>> You would have to find the unique ids for each profile, extract the plant
>> by joining the tables and then sort.
>> It would be useful statistics as well.
>> P.
> Sir
> Presently, I am adding the authors to the variations columns. This is very
> time-consuming as there are a lot of entries.
I can understand there is a lot to do.
> I had a small discussion with Gitanjali ma'am today and we decided to add
> common names, synonyms, GBIF key and the scientific name (with author) -
> all under one separate column titled "SYNONYM".
What is the purpose of SYNONYM? Is it for searching? In which case it can
be automatically generated from the GBIF identifier when needed.
> I am currently working on this.
> As I have only 10 days of my training left and I have to start writing my
> final report, I will not be able to give more inputs apart from working on
> the new column.
Understood. I will mail Gita.
> Thank you for your guidance.
It is a pleasure to work with you.
P.
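The frequency table requested in this exchange (unique profile ids, joined to the plant table, then sorted) might be produced along these lines. The layout of the profile table is an assumption: `profile_id` is a hypothetical column, `pid` and `pname` are the EssoilDB 1.0 names used elsewhere in this thread, and the profile rows are invented for illustration.

```python
from collections import Counter


def plant_frequencies(profiles, plants):
    """Count distinct profiles per plant name, most frequent first.

    profiles: iterable of (profile_id, pid) pairs
    plants:   dict mapping pid -> pname
    """
    unique_profiles = set(profiles)  # drop repeated profile rows
    counts = Counter(plants.get(pid, "UNKNOWN") for _, pid in unique_profiles)
    return counts.most_common()


# Invented profile rows; the pids and names are taken from this thread.
profiles = [(1, 299), (2, 879), (3, 879), (3, 879)]
plants = {299: "Callistemon sp.", 879: "Kunzea ambigua"}
frequency_table = plant_frequencies(profiles, plants)
print(frequency_table)
```

Sorting by frequency puts the names that affect the most profiles at the top, which matches the stated priority of resolving the largest chunks first.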
|
Thanks
Good to see this is a separate table.
Will look later today
On Tue, 30 Jul 2019, 07:34 Shruthi-M, ***@***.***> wrote:
Greetings!
I have uploaded a file named essoildb.plantdata (2) on the repository.
This has the following columns: [Please note: This is not the plant table.
This is being used only for modifications.]
1. pid - as per EssoilDB 1.0
2. pname - as existing in EssoilDB 1.0
3. scientificName (gbif) - results obtained from GBIF
4. Normalized name
5. Details - about the author, subspecies, variety, etc.
6. pfid
7. phid
8. Error
9. kingdom
10. phylum
11. class
12. order
13. family
14. genus
15. species
16. Synonym - this column just gives the name of the synonymous
species along with the GBIF key of the name - existing in our database. I
will be adding the synonyms, common names and scientific names of all the
plants to this column. Each of these will be separated by a comma.
*The entries that are modified/ need modification are in red.*
- The hybrids are yet to be resolved
|
I have uploaded the file containing the wiki-id as wiki_id.xlsx onto the repository. "NA" implies that the name does not exist in wikidata. |
Many thanks!
|
@Shruthi-M This is wonderful! You have done a good job.
[Although you may write this in the report the people using the plant data may not have access, so make sure the doc is in the directory]. I have renamed the major table to |
UNIQUE IDENTIFIERS for plants.
So I suggest:
The question is whether we create identifiers of fixed length, e.g. |
>Manny
>Before re-importing into the database, I’d like to get a shot at eliminating any invisible characters and other anomalies please.
---- On Thu, 01 Aug 2019 16:22:00 -0400
>PMR>>
Absolutely!!
The characters should ONLY be Unicode 32-126.
We will test for that.
All other characters must be mapped onto these.
* Thus any beta-character => `beta-`
* all quotes => `"` or `'`
* all dashes => `-`
* all typography and style is discarded
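A sketch of the mapping rules listed above, with a check that flags anything still outside character codes 32-126. The replacement table is deliberately small and is an assumption about which characters actually occur; stripping accents via NFKD decomposition is also an assumption about what "typography and style" should cover.

```python
import unicodedata

# Partial, illustrative table; extend it as new characters are found.
REPLACEMENTS = {
    "\u03b2": "beta-",              # Greek small letter beta
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2010": "-", "\u2011": "-", "\u2012": "-",
    "\u2013": "-", "\u2014": "-", "\u2212": "-",  # hyphens, dashes, minus sign
    "\u00a0": " ",                  # no-break space
}


def to_ascii(text):
    """Apply the replacement table, then strip combining accents."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def outside_allowed_range(text):
    """Return the characters whose code is not in 32..126."""
    return [ch for ch in text if not 32 <= ord(ch) <= 126]


cleaned = to_ascii("\u201cBlackdown tableland\u201d \u2013 variety")
print(cleaned)
print(outside_allowed_range(to_ascii("Mentha \u00d7 piperita")))
```

Characters with no rule, such as the multiplication sign used for hybrids, are left in place and reported by `outside_allowed_range` rather than silently dropped, so they can be mapped by hand.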
(BTW when replying to Github issues, try to eliminate all copy of previous posts, signatures, routing etc.)
|
I have renamed @Shruthi-M tables to |
Greetings!
|
The above post is in tables/plant. |
The details.xlsx table looks well designed and created. I need to check
details - this will take a little time.
The wiki_id table is presumably not required as the Wikidata column is
already in "details", correct?
P.
|
On Tue, 6 Aug 2019 at 04:56, petermr ***@***.***> wrote:
> The details.xlsx table looks well designed and created. I need to check
> details - this will take a little time.
Thank you Sir
> The wiki_id table is presumably not required as the Wikidata column is
> already in "details", correct?
Yes, a separate table is not required.
|
@shruthi Mohan <[email protected]> - can you put a file with brief
descriptions of the column headings and the colours in the plant/
directory?
Also, are there any non-ASCII characters? I suspect not, as plant names use
ASCII and I don't think there are other requirements. Normalize dashes to
hyphen-minus. There should be no quotes or apostrophes, but if there are,
normalize them to " or ' . Do not use smart quotes. Use TSV by default because
you may need commas elsewhere. Spaces should be normal single spaces (char 32).
Use a text editor, not Word. I'll have a look but I'm not too concerned. (The
chemistry and bibliography are harder).
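A short sketch of writing a table as TSV with whitespace collapsed to single spaces, using Python's standard `csv` module. The helper names and the sample row are illustrative only.

```python
import csv
import io
import re


def clean_cell(value):
    """Collapse any run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def write_tsv(rows, handle):
    """Write rows as tab-separated values, one record per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])


buffer = io.StringIO()
write_tsv(
    [
        ["pid", "pname", "Details"],
        ["560", "Eryngium  sp. nov.", "varieties 1, 2"],  # comma stays unquoted
    ],
    buffer,
)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Because the delimiter is a tab, commas inside a cell need no quoting, which is the reason given above for preferring TSV.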
|
These are all good points Peter,
I’ll be taking care of this as the final step before Gita and you get a last look before import.
With what little time Shruthi has left on this project, getting the data to be true and correct should be her main focus.
If she can do this without endangering the chances of having true, correctly spelled data, that’s great. But it is ultimately unnecessary because Gita has made me responsible for that.
Meanwhile…
GO! Shruthi GO! We’re cheering you on to the finish line!!
:D
Manny
Emanuel Faria
Founder | Formulator | President
[email protected]
VERRICLEAR NATURAL SKIN ESSENTIALS LTD.
Nature + Science = Success!™
North America: www.verriclear.com <http://www.verriclear.com/>
South America: www.verriclear.com.br <http://www.verriclear.com.br/>
|
On Tue, Aug 6, 2019 at 8:31 AM Manny ***@***.***> wrote:
> These are all good points Peter,
> I’ll be taking care of this as the final step before Gita and you get a
> last look before import.
Thanks so much!
> With what little time Shruthi has left on this project, getting the data
> to be true and correct should be her main focus.
Absolutely agreed.
> If she can do this without endangering the chances of having true,
> correctly spelled data, that’s great. But ultimately unnecessary because
> Gita has made me responsible for that.
It is MUCH easier now. By resolving against GBIF and Wikipedia/Wikidata we
don't have to worry about spelling because *they* take care of it. So
GBIF=2685484 Wikidata=Q146992 species=Abies alba
is ALL we have to know for the first entry. Everything else can be
looked up.
"GBIF, what is the preferred taxonomic authority for Abies alba?"
"Abies alba Mill."
"Wikidata, what is the common name for Q146992 in Portuguese?"
"*abeto-prateado*"
In particular those two authorities work closely together. They will
*automatically* update when:
* a species is reclassified (genus, family)
* a new synonym is found
* a new authority is added
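The two lookups quoted above could be scripted roughly as follows. This is a sketch under assumptions: it uses the public GBIF species match endpoint and Wikidata's Special:EntityData JSON export as I understand them, the helper names and User-Agent string are hypothetical, and the network call is defined but not executed here; the label extraction runs on a hand-written stand-in whose value is the answer quoted in this thread.

```python
import json
import urllib.parse
import urllib.request

GBIF_MATCH = "https://api.gbif.org/v1/species/match"
WIKIDATA_ENTITY = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def gbif_match_url(name):
    """Build the GBIF name-matching URL for one scientific name."""
    return GBIF_MATCH + "?" + urllib.parse.urlencode({"name": name})


def wikidata_url(qid):
    """Build the JSON export URL for one Wikidata item."""
    return WIKIDATA_ENTITY.format(qid=qid)


def fetch_json(url):
    """Fetch and decode a JSON document (network access required)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "essoildb-example/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def label_for(entity_doc, qid, language):
    """Return the label of `qid` in `language`, or None if absent."""
    label = entity_doc["entities"][qid]["labels"].get(language)
    return label["value"] if label else None


# Offline stand-in shaped like a Special:EntityData response.
sample_doc = {
    "entities": {
        "Q146992": {"labels": {"pt": {"language": "pt", "value": "abeto-prateado"}}}
    }
}
match_url = gbif_match_url("Abies alba")
portuguese_name = label_for(sample_doc, "Q146992", "pt")
print(match_url)
print(portuguese_name)
```

With network access, `fetch_json(match_url)` would return the GBIF match record and `fetch_json(wikidata_url("Q146992"))` the full Wikidata item, so only the two identifiers need to be stored locally.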
Also you can automatically ask:
"What is the IUCN status of Q146992?"
"Least concern"
In this way the things that EssoilDB has to maintain are:
* a register of imported articles (bibliography) - Ambarish is doing this
* a register of plant species (Shruthi has done this!)
* a register of compounds (Ambarish is doing this)
* locations (not well advanced)
then:
* an import mechanism (PMR)
* import checking - yet to be developed but uses core tables
* data=> core plant/compound/parts/location/ tables
* a search engine (separate from core) using:
- core tables
- plant synonyms from Wikidata, GBIF
- chemical structure search (from CDK - they will be happy to advise)
This design is implicit in the poster, which should be an initial guide.
I think this is a great time to design in the features that you would find
useful. It's a relatively small knowledgebase so systems such as NoSQL or
Tidyverse should be considered. Also I want to store a LOT more of the
original papers if that would be useful.
Exciting!
I'd very much like to talk again over Skype. I think just you and me if
Gita is busy.
… Meanwhile…
GO! Shruthi GO! We’re cheering you on to the finish line!!
:D
Manny
Emanuel Faria
On Aug 6, 2019, at 4:18 AM, petermr ***@***.***> wrote:
@shruthi Mohan ***@***.***> - can you put a file with brief
descriptions of the column headings and the colours in the plant/
directory?
Also, are there any non-ASCII characters? I suspect not, as plant names use
ASCII and I don't think there are other requirements. Normalize dashes to
hyphen-minus. There should be no quotes or apostrophes, but if there are,
normalize them to " or ' . Do not use smart quotes. Use TSV by default because
you may need commas elsewhere. Spaces should be normal single spaces (char 32).
Use a text editor, not Word. I'll have a look but I'm not too concerned. (The
chemistry and bibliography are harder.)
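The clean-up rules above could be sketched in Python roughly as follows. This is only an illustration, not code from the repository: the function names (`normalize_name`, `non_ascii_chars`, `write_tsv`) and the exact set of dash and quote characters are assumptions.

```python
import csv
import io
import re
import unicodedata

# Dash-like characters folded into hyphen-minus (assumed set, extend as needed).
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
# Smart quotes mapped to their plain ASCII equivalents.
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
TRANSLATION = str.maketrans({**{d: "-" for d in DASHES}, **QUOTES})


def normalize_name(text):
    """Apply the dash, quote and space rules to one field."""
    text = unicodedata.normalize("NFC", text).translate(TRANSLATION)
    # \s also matches no-break spaces and tabs, so every run of
    # whitespace collapses to a single ordinary space (char 32).
    return re.sub(r"\s+", " ", text).strip()


def non_ascii_chars(text):
    """Return the distinct non-ASCII characters left in a field, for review."""
    return sorted({ch for ch in text if ord(ch) > 127})


def write_tsv(rows):
    """Serialize rows as tab-separated text, so commas in fields are harmless."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Running `non_ascii_chars` over each normalized name gives a quick list of anything that still needs a human decision.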
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
@Shruthi-M I have found your *.docx file and this looks very good. Am reading it. |
Using Word documents for docs on GitHub is not normally a good idea, for several reasons.
In particular, displaying screenshots of code can be very frustrating for people who want to use it. They have to retype it and will make mistakes. People want to cut, paste and run. Can you put the code in an R format (notebook-like)? That's the best way. |
Sure, I will look into this. |
Thanks Peter,
Everything below is great news. If I were the one responsible for deciding final spelling among all the versions and accepted typos on Google, I’d pull my hair out. (What’s left of it.)
Regarding locations, I’ve started this but ran into some trouble trying to parse State/Prov, City, Town, Region names from the original single text field into separate fields for each. I’ve emailed the owner of a world-wide database for help, but no response. If I (or preferably Manish) can figure out how we could use such a database to automatically compare words in our Locations table against the World table, and drop it in the right field, that would be a time-saving miracle — not to mention taking the fear of getting something wrong.
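One way such a comparison could work is sketched below in Python. It is a minimal illustration under stated assumptions: the three-entry `GAZETTEER` stands in for a real world-wide place database, and the field names and the `parse_location` function are hypothetical, not part of EssOilDB.

```python
import re

# Hypothetical miniature gazetteer: lower-cased place name -> target field.
# A real one would be loaded from a world-wide place database.
GAZETTEER = {
    "brazil": "country",
    "minas gerais": "state_province",
    "belo horizonte": "city",
}


def parse_location(raw):
    """Split a free-text location and drop each part into the matching field."""
    fields = {"country": None, "state_province": None, "city": None, "unmatched": []}
    for part in re.split(r"[,;/]", raw):
        name = " ".join(part.split())  # trim and collapse inner whitespace
        if not name:
            continue
        field = GAZETTEER.get(name.casefold())
        if field is None or fields[field] is not None:
            # Unknown or conflicting parts are kept for manual review.
            fields["unmatched"].append(name)
        else:
            fields[field] = name
    return fields
```

Anything the lookup cannot place ends up in `unmatched` rather than being guessed, which keeps the risk of putting a name in the wrong field low.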
I have a bit more to do on the updated Compound (AND PLANT) activities table (because lots of journal articles talk about plant oils having activities, without naming specific constituents).
All the current IDs will be preserved, and I have a plan to make it easy to connect current entries that list more than one activity in the same record field.
Looking forward to getting the cleanups done for the team in a precise yet quick manner.
I have a list of things I’ve found and stored in a “clean up” database, so I can copy and paste the non-space spaces and other anomalies.
Whenever you’re ready, send me your list, and I’ll go through them all, methodically (checklist-style), and turn to you if anything strange happens before uploading for your final once-over.
I’d love to chat with you too!
I’m in the middle of some extremely tedious, but very important work for the next couple of days, but perhaps Thursday or Friday?
Keep in mind I’m in Brazil, so we can work out a good time for both of us.
If you use an iPhone, I use this app to quickly find good times to meet across two or more time zones at once: https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812
Talk soon on Skype! (Skype name: mannyrules … I was feeling good about myself that day, and didn’t know my account name would end up being my public user name haha)
Good day or night to you, wherever you are.
Manny
On Aug 6, 2019, at 5:30 AM, petermr ***@***.***> wrote:
On Tue, Aug 6, 2019 at 8:31 AM Manny ***@***.***> wrote:
These are all good points Peter,
I’ll be taking care of this as the final step before Gita and you get a
last look before import.
Thanks so much!
With what little time Shruthi has on this project, getting the data
to be true and correct should be her main focus.
Absolutely agreed.
If she can do this without endangering the chances of having true,
correctly spelled data, that’s great. But ultimately unnecessary because
Gita has made me responsible for that.
It is MUCH easier now. By resolving against GBIF and Wikipedia/Wikidata we
don't have to worry about spelling because *they* take care of it. So
GBIF=2685484 Wikidata=Q146992 species=Abies alba
is ALL we have to know for the first entry. Everything else can be
looked up.
"GBIF, what is the preferred taxonomic authority for Abies alba?"
"Abies alba Mill."
"Wikidata, what is the common name for Q146992 in Portuguese?"
"*abeto-prateado*"
In particular those two authorities work closely together. They will
*automatically* update when:
* a species is reclassified (genus, family)
* a new synonym is found
* a new authority is added
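As a rough sketch of what storing only the two identifiers might look like in Python: the `PlantRef` record and the helper names are assumptions for illustration. The URLs follow the public GBIF species API and the Wikidata `Special:EntityData` endpoint; no network call is made here.

```python
from typing import NamedTuple
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1"
WIKIDATA_ENTITY = "https://www.wikidata.org/wiki/Special:EntityData"


class PlantRef(NamedTuple):
    """The minimum we keep per plant; everything else is looked up."""
    gbif_key: int
    wikidata_qid: str
    species: str


def gbif_species_url(ref):
    """URL of the GBIF record for a stored key."""
    return f"{GBIF_API}/species/{ref.gbif_key}"


def gbif_match_url(name):
    """URL that asks GBIF to match a (possibly misspelt) name."""
    return f"{GBIF_API}/species/match?" + urlencode({"name": name})


def wikidata_entity_url(ref):
    """URL of the Wikidata entity as JSON."""
    return f"{WIKIDATA_ENTITY}/{ref.wikidata_qid}.json"


def label_for(entity_json, qid, lang):
    """Pull a label in one language out of a parsed Wikidata entity document."""
    label = entity_json.get("entities", {}).get(qid, {}).get("labels", {}).get(lang)
    return label["value"] if label else None


abies = PlantRef(gbif_key=2685484, wikidata_qid="Q146992", species="Abies alba")
```

Fetching `wikidata_entity_url(abies)` (for example with `urllib.request.urlopen`) and passing the parsed JSON to `label_for(data, "Q146992", "pt")` would return the Portuguese label, if one exists.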
Also you can automatically ask:
"What is the IUCN status of Q146992?"
"Least concern"
|
On Tue, Aug 6, 2019 at 4:22 PM Manny ***@***.***> wrote:
Thanks Peter,
Everything below is great news. If I were the one responsible for deciding
final spelling among all the versions and accepted typos on Google, I’d
pull my hair out. (What’s left of it.)
EssoilDB1.0 is a finite task. I suspect we don't need to tidy up the whole
of the long tail.
EssoilDB2.0 will be wonderfully different.
Regarding locations, I’ve started this but ran into some trouble trying to
parse State/Prov, City, Town, Region names from the original single text
field into separate fields for each. I’ve emailed the owner of a world-wide
database for help, but no response. If I (or preferably Manish) can figure
out how we could use such a database to automatically compare words in our
Locations table against the World table, and drop it in the right field,
that would be a time-saving miracle — not to mention taking the fear of
getting something wrong.
The Open community - Wikipedia and others - have some solutions here. I'll
tweet it.
I have a bit more to do on the updated Compound (AND PLANT) activities
table (because lots of journal articles talk about plant oils having
activities, without naming specific constituents).
Let's talk about this.
I was under the impression that the activities in E1.0 were inserted from
external sources and not from the paper. But I may be wrong. If it is
extracting them from the paper we need to talk.
All the current IDs will be preserved, and I have a plan to make it easy
to connect current entries that list more than one activity in the same
record field.
We really need Gita's view on this.
I have a list of things I’ve found and stored in a “clean up” database, so
I can copy and paste the non-space spaces and other anomalies.
The database is small enough that it fits in GitHub easily.
EssOilDB/v1.0/info_c.tsv is only 38 Mbyte.
Whenever you’re ready, send me your list, and I’ll go through them all,
methodically (checklist-style), and turn to you if anything strange happens
before uploading for your final once-over.
Ambarish is/has_been working on this. In any case all the data is on GitHub
so we don't need to send it.
I’d love to chat with you too!
I’m in the middle of some extremely tedious, but very important work for
the next couple of days, but perhaps Thursday or Friday?
Keep in mind I’m in Brazil, so we can work out a good time for both of us.
I have some ideas about Open Science in LatAm which I'll explain later.
If you use an iPhone, I use this app to quickly find good times to meet
across two or more time zones at once:
https://apps.apple.com/ca/app/timescroller-time-zone-utility/id288013812
Talk soon on Skype! (Skype name: mannyrules … I was feeling good about
myself that day, and didn’t know my account name would end up being my
public user name haha)
I shall be in Edinburgh Thu and Friday. I am happy to try times in the UK
in the afternoon and evening.
What I'd like for V2.0 is some use cases. I can't guarantee that they would
all be supported. However I would be optimistic about experimental
methodology for extraction. Activities will be harder.
|
Greetings! Thank you. |
I submitted the entire set of plant names (before clean-up) to the GBIF species lookup tool (https://www.gbif.org/en/tools/species-lookup).
This tool lets the user perform multiple searches at once. I have uploaded the results to the repository as gbif_result.csv.
The default headings of the columns are as follows: