Disambiguation of plant names using GBIF #82

Open
Shruthi-M opened this issue Jul 25, 2019 · 34 comments

@Shruthi-M

I submitted the entire set of plant names (before clean-up) to the GBIF species lookup tool (https://www.gbif.org/en/tools/species-lookup), which allows the user to perform multiple searches at once. I have uploaded the results to the repository as gbif_result.csv.
The default column headings are as follows:

  1. occurrenceId
  2. verbatimScientificName (user-submitted name)
  3. scientificName (name existing in the database)
  4. key (unique number assigned to the particular species on GBIF)
  5. matchType (3 levels of result - EXACT, FUZZY, HIGHERRANK)
     • EXACT means the name exactly matches the entry in the database
     • FUZZY indicates entries that may be misspelt
     • HIGHERRANK implies that the specific epithet of the entry is not recognized (in other words, only the genus is recognized)
  6. confidence (expressed as a percentage)
  7. status (can be ACCEPTED, SYNONYM or DOUBTFUL)
     • DOUBTFUL: treated as accepted, but doubtful whether this is correct.
     • SYNONYM: a general synonym, the exact type is unknown.
  8. rank (the highest rank recognized)
  9. kingdom
  10. phylum
  11. class
  12. order
  13. family
  14. genus
  15. species
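
As a quick check on an export with these columns, the rows can be tallied by matchType and status. This is only a minimal sketch: it assumes the export is comma-separated with exactly the headings listed above, and the two sample rows are copied from rows quoted later in this thread.

```python
import csv
import io
from collections import Counter

# Column names as listed above (assumed to be the header row of the export).
COLUMNS = ["occurrenceId", "verbatimScientificName", "scientificName", "key",
           "matchType", "confidence", "status", "rank", "kingdom", "phylum",
           "class", "order", "family", "genus", "species"]

def tally(csv_text):
    """Count rows per matchType and per status in a GBIF lookup export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    match_types, statuses = Counter(), Counter()
    for row in reader:
        match_types[row["matchType"]] += 1
        statuses[row["status"]] += 1
    return match_types, statuses

# Two rows quoted elsewhere in this thread, used here only as sample input.
sample = "\n".join([
    ",".join(COLUMNS),
    ",Abies alba,Abies alba Mill.,2685484,EXACT,99,ACCEPTED,SPECIES,"
    "Plantae,Tracheophyta,Pinopsida,Pinales,Pinaceae,Abies,Abies alba",
    ",Achillea depressa,Achillea L.,3119995,HIGHERRANK,96,ACCEPTED,GENUS,"
    "Plantae,Tracheophyta,Magnoliopsida,Asterales,Asteraceae,Achillea,",
])
match_types, statuses = tally(sample)
```

The HIGHERRANK and DOUBTFUL counts give a rough size for the manual curation work.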
Shruthi-M self-assigned this Jul 25, 2019
@petermr

petermr commented Jul 25, 2019 via email

@petermr

petermr commented Jul 25, 2019 via email

@petermr

petermr commented Jul 25, 2019 via email

@petermr

petermr commented Jul 26, 2019

@Shruthi-M - this is so important for all of us! Everyone should read this thread. I'll annotate @Shruthi-M 's table and add actions. We should extract the discussion here onto .md pages as well.
There are fundamental issues which apply to compounds as well, @ambarishK . I hadn't seen this clearly until I started on the poster.
**We are only dealing at present with converting existing EssoilDB 1.0 (E1.0) to E2.0 (i.e. not worrying about ingesting new data into either).**

= Origin of data =
It's critical to review exactly where the data comes from. After talking with @gilienv yesterday I believe that:

  • all the independent data are in "infopdata" and "infocdata"
  • there is probably not much original documentation
  • there may be extra information in legacy *.xls files.

== ACTION ==
We have to agree and then document what is in EssoilDB 1.0

@Shruthi-M

Sir, I am working on your previous guidelines. I will try to separate the synonyms from the accepted names using the clues you have mentioned. A final list of accepted species will be prepared in the coming week.
ISSUES BEING FACED:

  1. Callistemon sp. [pid: 299] - The literature reports 7 varieties of this species, and our database has data about only one variety, “Blackdown tableland”.
  2. Kunzea ambigua [pid: 879] - I analyzed the literature from which this entry was taken. The data corresponding to this entry relate to “prostrate form, B”, and the article reports three more varieties which are not included in the database.
  3. Eryngium sp. nov. [pid: 560] - The literature reports 2 varieties, “1” and “2”, under this. In EssoilDB 1.0 these 2 varieties are not separated.
  4. Astartea sp. nov. [pid: 179] and Mikania sp. nov. [pid: 1074] - These could not be resolved further.
  5. There are 11 binomial names that are shown to be DOUBTFUL.
  6. Binomials without the authors' names are not accepted by GBIF and other open-source databases. More than half of the binomials have more than one author. Wherever I found a discrepancy, I referred to the journal and chose the right author. Do the binomials have to be separated from their authors or retained along with them (as this plays a crucial role in a bibliography database)? This was raised earlier and not resolved completely. I would be grateful for more clarity about this.
  7. A final list of accepted species (i.e. not synonyms) has to be prepared.
  8. This list of accepted names has to be rechecked against the respective journal articles to ensure that each has the right author assigned. NOTE: The authors were assigned by the GBIF web program; hence this step is necessary.

@petermr

petermr commented Jul 28, 2019 via email

@petermr

petermr commented Jul 28, 2019

I shall add comments on your very useful output (which can be viewed directly in table form on GitHub):
https://github.com/gilienv/EssOilDB/blob/master/tables/plant/gbif_result.tsv

Note that there is no EssoilDB ID for each row so I shall refer to this table in unsorted fashion. This is why we need an ID!

  • It has 1839 rows (1 header and 1838 data). ACTION does this number agree with other info_plant tables?
  • GBIF has returned an identifier (column 4) for each row. As far as I can see every row has an identifier, so there is nothing that GBIF cannot interpret in some way. ACTION are there any names elsewhere in V1.0 that GBIF cannot interpret?
  • There are exactly duplicated rows (e.g. 7 and 8 and more elsewhere). "Remove duplicates" in Excel gives:
    129 duplicates: ACTION after adding UniqueIds, remove all duplicates from table.
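
The deduplication and the "does every row have a key" check above can also be done outside Excel. The following is a minimal sketch, assuming each row is a list of cell strings with the GBIF key in the fourth column; `skip_columns` is a hypothetical parameter for ignoring a newly added unique-ID column when comparing rows.

```python
def dedupe_rows(rows, skip_columns=()):
    """Drop exact duplicate rows, keeping first occurrences in original order.

    Columns listed in skip_columns (0-based) are ignored when comparing rows,
    e.g. a unique-ID column that was added before deduplication.
    """
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(cell for i, cell in enumerate(row)
                            if i not in skip_columns)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique

def rows_without_key(rows, key_column=3):
    """Return rows whose GBIF key cell (0-based column 3 by default) is empty."""
    return [row for row in rows if not row[key_column].strip()]
```

Comparing `len(rows) - len(dedupe_rows(rows))` with the Excel figure is a simple cross-check on the duplicate count.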

@petermr

petermr commented Jul 28, 2019

After deduplication here are my comments.

  • row 2
 	Abies alba 	Abies alba Mill. 	2685484 	EXACT 	99 	ACCEPTED 	SPECIES 	Plantae 	Tracheophyta 	Pinopsida 	Pinales 	Pinaceae 	Abies 	Abies alba

Input is Abies alba. GBIF found key=2685484 as an EXACT match with 99% confidence, as a SPECIES, [taxonomy omitted], with the normative species name Abies alba. It added the authority as Abies alba Mill., but this is probably more detail than we need. So our result is

Abies alba => Abies alba (GBIF 2685484) [SPECIES CONFIRMED]

Most of the results are (happily) of this form.

  • row 6
 	Acacia caven 	Acacia caven (Molina) Molina 	2979244 	EXACT 	99 	SYNONYM 	SPECIES 	Plantae 	Tracheophyta 	Magnoliopsida 	Fabales 	Fabaceae 	Vachellia 	Vachellia caven

Input is Acacia caven, EXACTly identified, but this is a synonym for the preferred name Vachellia caven.

Vachellia caven is not mentioned in V1.0, so we look up Vachellia caven in GBIF to give
https://www.gbif.org/species/3795588.
We then use Vachellia caven (GBIF 3795588) as the accepted name with Acacia caven (GBIF 2979244) as a synonym. ACTION agree this strategy.

  • row 16/17
 Achillea beibersteinii 	Achillea beibersteinii Afan. 	7400456 	EXACT 	98 	DOUBTFUL 	SPECIES 	Plantae 	Tracheophyta 	Magnoliopsida 	Asterales 	Asteraceae 	Achillea 	Achillea beibersteinii
Achillea biebersteinii 	Achillea biebersteinii C.Afan. 	3120276 	EXACT 	98 	SYNONYM 	SPECIES 	Plantae 	Tracheophyta 	Magnoliopsida 	Asterales 	Asteraceae 	Achillea 	Achillea arabica

I am guessing what has happened here is that beibersteinii is a misprint in the general plant literature (hence category DOUBTFUL), but it has got into the official books. So this should be referred to our curator. I would expect that we'd normalize it to Achillea biebersteinii which is a SYNONYM for Achillea arabica (which should be our agreed normative species).

  • row 25
 	Achillea depressa 	Achillea L. 	3119995 	HIGHERRANK 	96 	ACCEPTED 	GENUS 	Plantae 	Tracheophyta 	Magnoliopsida 	Asterales 	Asteraceae 	Achillea 	

Here GBIF cannot find an exact match, so reverts to the Genus. This is a loss of information, so maybe we should search elsewhere. "Plants of the world" (Kew) gives:
http://plantsoftheworldonline.org/taxon/urn:lsid:ipni.org:names:173942-1
"

Achillea depressa Janka

    This is a synonym of Achillea pseudopectinata Janka

"
So we can probably add the relatively few examples by hand.
There are 63 GENUS and 9 KINGDOM rows, of which about 20 are entries such as Foobar spp. and so not reconcilable. The other ~40 can be searched by hand and added by the curator.

  • row 55
 	Aframomum hanburyl 	Aframomum hanburyi K.Schum. 	2758831 	FUZZY 	96 	SYNONYM 	SPECIES 	Plantae 	Tracheophyta 	Liliopsida 	Zingiberales 	Zingiberaceae 	Aframomum 	Aframomum angustifolium

FUZZY means that there is probably a misprint (here hanburyl for hanburyi). In this case the corrected name is itself a SYNONYM, so there is a further step to Aframomum angustifolium.

recommendation

  • create new columns:
original    GBIFAcceptedName  GBIFIdentifier  GBIFSynonyms  curationDetails  
  • The original is preserved
  • The best accepted name is always given
  • the identifier for that name is always given
  • If there is one or more accepted synonyms in V1.0 list them
  • log curation details (date, curator, action). Action can be: TYPO, SYNONYM, GENUS
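
The recommendation can be sketched as a function that turns one GBIF result row into the proposed columns (with the identifier column spelled GBIFIdentifier). This is an illustration under assumptions, not an agreed implementation: the row is assumed to be a dict keyed by the GBIF column names, the synonym's canonical name is taken naively as the first two words of scientificName, and the curator name in the usage lines is hypothetical.

```python
from datetime import date

def curation_record(row, curator, today=None):
    """Map one GBIF lookup row (a dict) onto the proposed curation columns."""
    actions = []
    if row["matchType"] == "FUZZY":
        actions.append("TYPO")      # probable misprint in the original name
    if row["status"] == "SYNONYM":
        actions.append("SYNONYM")   # original name maps to another accepted name
    if row["matchType"] == "HIGHERRANK":
        actions.append("GENUS")     # only the genus was recognized

    is_synonym = row["status"] == "SYNONYM"
    # Naive canonical name: genus + epithet, dropping the authority.
    canonical = " ".join(row["scientificName"].split()[:2])
    return {
        "original": row["verbatimScientificName"],
        "GBIFAcceptedName": row["species"] or row["genus"],
        # For a SYNONYM the returned key belongs to the synonym itself, so the
        # accepted name's key needs a second lookup and is left empty here.
        "GBIFIdentifier": None if is_synonym else row["key"],
        "GBIFSynonyms": [f"{canonical} (GBIF {row['key']})"] if is_synonym else [],
        "curationDetails": {
            "date": (today or date.today()).isoformat(),
            "curator": curator,
            "actions": actions,
        },
    }

# Row 55 from the comments above; the curator name is a placeholder.
row_55 = {
    "verbatimScientificName": "Aframomum hanburyl",
    "scientificName": "Aframomum hanburyi K.Schum.",
    "key": "2758831", "matchType": "FUZZY", "status": "SYNONYM",
    "genus": "Aframomum", "species": "Aframomum angustifolium",
}
record_55 = curation_record(row_55, curator="hypothetical-curator",
                            today=date(2019, 7, 28))
```

DOUBTFUL rows are deliberately not given an action here, since those are to be referred to the curator.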

@petermr

petermr commented Jul 28, 2019

list of problem species

Shruthi has created a report listing a number of problems with names. She has actually gone back to the original papers. I suggest she copies the data here.

I have also found some problems which seem to be different, and add some suggestions.

species with unusual synonyms or mapping onto more than one species.

Requires hand editing

Achillea depressa
Achillea stricta
Achillea tanacetifolia
Aloysia triphylla
Anthemis altissima
Artemisia coerulescens
Artemisia fragrans
Artemisia gallica
Artemisia herba-alba
Athrotaxis taxifolia
Cedrus liobani
Chenopodium ambrosioides
Cinnamomum fragrans
Cinnamomum zeylanicum
Coleus Aromaticus
Dracocephalum speciosum
Echinophora chysantha
Eclipta indica
Eryngium caeruleum
Eucalyptus viridiflora
Eugenia nitida
Eugenia ovalifolia
Eugenia rotundifolia
Lavandula hybrida
Lindera strychnifolia
Lippia gracillis
Mentha gracilis
Micromeria dalmatica
Nepeta fissa
Ocimum adscendens
Oenanthe divaricata
Origanum basilicum
Origanum micranthum
Pinus laricio
Pluchea purpurascens
Polymnia sonchifolia
Satureja viminea
Senecio farfarifolius
Stachys lanata
Tanacetum elburensis
Thymus capitatus
Thymus caucasicus
Thymus ciliates
Thymus hirtus

hybrids

Probably best represented at genus level

Citrus reticulata x Citrus sinensis
Citrus latifolia Tanaka x Citrus aurantifolia Swingle
Citrus paradisi x Citrus. reticulata
Citrus unshiu x Citrus nobilis
Eucalyptus citriodora x E.torelliana
Lavandula luisieri x Lavandula stoechas

and these are probably hybrids (assuming the non-Unicode characters stand for the 'times' symbol):

Mentha •À_ piperita
Mentha•À_longifolia•À_L.
Peperomia•À_pellucida•À_L.
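
For the hybrid formulas written with "x", reducing an entry to genus level can be sketched as below. It is a rough illustration only: it assumes parents are separated by a free-standing "x" or "×", treats a one-letter abbreviation such as "E." as the first parent's genus, and does not attempt to interpret the garbled entries.

```python
import re

def hybrid_genus(name):
    """Return the shared genus of a hybrid formula 'A b x A c', else None."""
    parents = re.split(r"\s+[x×]\s+", name.strip())
    if len(parents) < 2:
        return None  # not written as a hybrid formula
    first_genus = parents[0].split()[0].rstrip(".")
    for parent in parents[1:]:
        # The genus is the text before the first dot or whitespace.
        genus = re.split(r"[.\s]", parent, maxsplit=1)[0]
        abbreviated = len(genus) == 1 and genus == first_genus[0]
        if genus != first_genus and not abbreviated:
            return None  # parents come from different genera
    return first_genus
```

For example, `hybrid_genus("Eucalyptus citriodora x E.torelliana")` gives `"Eucalyptus"`, while a plain binomial gives `None`.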

genus

These entries are only interpretable at genus level.

Astartea sp. nov.
Calamintha var.darensis
Callistemon sp.
Eryngium sp nov.
Eryngium spp.
Eugenia sp.
Hypericum 'Hidcote'
Kunzea sp.
Mentha spp.
Mikania sp.nov.
Origanum spp.
Persea
Xanthostemon spp.
Renealmia spp.

typos

Species require lowercase specific name.

Stachys Corsica
Tordylium Ketenoglui
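
A capitalization fix of this kind is easy to automate for plain two-word names. The sketch below only touches entries that look like a simple "Genus epithet" pair and leaves everything else (cultivars in quotes, "spp.", names with authors) unchanged; the function name is hypothetical.

```python
def fix_epithet_case(name):
    """Lowercase the specific epithet of a plain two-word binomial."""
    parts = name.split()
    if len(parts) != 2:
        return name  # authors, varieties, hybrids: leave for the curator
    genus, epithet = parts
    # Only purely alphabetic epithets (hyphens allowed) are treated as epithets.
    if not epithet.replace("-", "").isalpha():
        return name
    return f"{genus.capitalize()} {epithet.lower()}"
```

Running it over the whole name column and diffing against the original would list every entry it changed, so each change can still be logged as a TYPO curation action.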

unknown species

Lomatopodium khorassanicum
Serotinocarpum insignis

@petermr

petermr commented Jul 29, 2019

Shruthi,
Are you able to create a table of frequencies of plants? Then we could start the disambiguation with the most frequent problems.
You would have to find the unique ids for each profile, extract the plant by joining the tables and then sort.

It would be useful statistics as well.
P.
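
A rough sketch of the frequency table described above, for illustration. The table layouts are assumptions: profiles are taken to be (profile_id, pid) pairs and plants a mapping from pid to pname; the real EssoilDB column names may differ.

```python
from collections import Counter

def plant_frequencies(profiles, plants):
    """Join profiles to plants on pid and count unique profiles per plant name.

    profiles: iterable of (profile_id, pid) pairs (hypothetical layout)
    plants:   dict mapping pid -> pname
    """
    # Keep one pid per unique profile id, so repeated profile rows count once.
    unique_profiles = {profile_id: pid for profile_id, pid in profiles}
    counts = Counter(plants.get(pid, f"<unknown pid {pid}>")
                     for pid in unique_profiles.values())
    return counts.most_common()  # most frequent plants first
```

The head of the returned list is where disambiguation effort would pay off first.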

@petermr

petermr commented Jul 29, 2019

Shruthi,
When you have disambiguated (most of) the plant species, can you look up their IDs in Wikidata?
I wrote a simple tool in Feb for the workshop, but it was a bit slow - it had to look up names one by one. There may be better tools now - I can ask...
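
One way to avoid one-by-one lookups, sketched here as an assumption rather than a description of the workshop tool, is to send a single SPARQL query containing many GBIF keys. The sketch relies on Wikidata property P846 (GBIF taxon ID) and P225 (taxon name); the function only builds the query text, which would then be submitted to the Wikidata Query Service.

```python
def wikidata_query_for_gbif_keys(keys):
    """Build one SPARQL query that looks up Wikidata items for many GBIF keys."""
    # int() guards against anything other than numeric keys entering the query.
    values = " ".join(f'"{int(k)}"' for k in keys)
    return (
        "SELECT ?item ?gbifKey ?taxonName WHERE {\n"
        f"  VALUES ?gbifKey {{ {values} }}\n"
        "  ?item wdt:P846 ?gbifKey .\n"
        "  OPTIONAL { ?item wdt:P225 ?taxonName . }\n"
        "}"
    )

# Keys taken from the rows discussed earlier in this thread.
query = wikidata_query_for_gbif_keys(["2685484", "3795588"])
```

Keys that return no item would correspond to names missing from Wikidata.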

@Shruthi-M

> Shruthi,
> Are you able to create a table of frequencies of plants? Then we could start the disambiguation with the most frequent problems.
> You would have to find the unique ids for each profile, extract the plant by joining the tables and then sort.
>
> It would be useful statistics as well.
> P.

Sir
Presently, I am adding the authors to the variations columns. This is very time-consuming as there are a lot of entries.
I had a small discussion with Gitanjali ma'am today and we decided to add common names, synonyms, GBIF key and the scientific name (with author) - all under one separate column titled "SYNONYM".
I am currently working on this.
As I have only 10 days of my training left and I have to start writing my final report, I will not be able to give more inputs apart from working on the new column.

Thank you for your guidance.

@petermr

petermr commented Jul 29, 2019 via email

@Shruthi-M

Greetings!
I have uploaded a file named essoildb.plantdata (2) on the repository. This has the following columns: [Please note: This is not the plant table. This is being used only for modifications.]

  1. pid - as per EssoilDB 1.0
  2. pname - as existing in EssoilDB 1.0
  3. scientificName (gbif) - results obtained from GBIF
  4. Normalized name
  5. Details - about the author, subspecies, variety, etc.
  6. pfid
  7. phid
  8. Error
  9. kingdom
  10. phylum
  11. class
  12. order
  13. family
  14. genus
  15. species
  16. Synonym - this column just gives the name of the synonymous species along with the GBIF key of the name - existing in our database. I will be adding the synonyms, common names and scientific names of all the plants to this column. Each of these will be separated by a comma.

The entries that are modified/ need modification are in red.

  • The hybrids are yet to be resolved

@petermr

petermr commented Jul 30, 2019 via email

@Shruthi-M

I have uploaded the file containing the Wikidata IDs as wiki_id.xlsx onto the repository. "NA" implies that the name does not exist in Wikidata.

@petermr

petermr commented Aug 1, 2019 via email

@petermr

petermr commented Aug 1, 2019

@Shruthi-M This is wonderful! You have done a good job.
Could you please add:

  • how you created the data in each column - I imagine some has been added from GBIF or other authority - please name the authority explicitly and the service you used (if you did it automatically).
  • please explain the coloured entries.
  • how did you perform the Wikidata lookup? Did you use a service? (I thought you said you only had 300, but you've got about 85%, I think?)

[Although you may write this in the report the people using the plant data may not have access, so make sure the doc is in the directory].

I have renamed the major table to ingestion and created a TSV version.

@petermr

petermr commented Aug 1, 2019

UNIQUE IDENTIFIERS for plants.
Now is the time to freeze the number of entries being imported from V1.0. There are 1838 plant entries and you have generated a unique ID for each record. This ID must always be associated with the same record. If records are deleted we NEVER reuse that identifier.
**I think the identifiers should have a leading letter or more.**
This has several advantages:

  • it protects against pre-truncation by mistake
  • it protects against using them as data (e.g. addition or subtraction)
  • it makes it clear they are identifiers
  • it may make them easier to find in Google, etc.
  • it identifies them to the world as EssoilDB

So I suggest:

  • EPdddd for plants
  • ECdddd for compounds
  • ELdddd for locations
    etc.

The question is whether we create identifiers of fixed length, e.g.
EP0001234
Since Wikidata and others don't, I suggest we DON'T worry about length.
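
The "never reuse" rule and the prefixed, variable-length format can be sketched as a small allocator. This is illustrative only: the class name is hypothetical, and in practice the highest number ever issued would have to be stored persistently so that deleted records do not free up their identifiers.

```python
class IdAllocator:
    """Issue prefixed identifiers (EP1, EP2, ...) that are never reused."""

    def __init__(self, prefix, last_issued=0):
        self.prefix = prefix
        # Highest number ever handed out, including records deleted since.
        self.last_issued = last_issued

    def next_id(self):
        """Return the next unused identifier, with no zero padding."""
        self.last_issued += 1
        return f"{self.prefix}{self.last_issued}"

# Continue after the 1838 plant entries imported from V1.0.
plant_ids = IdAllocator("EP", last_issued=1838)
first_new_id = plant_ids.next_id()
```

Because the result is a string with a leading letter, accidental arithmetic on it fails immediately instead of silently producing another valid-looking number.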

@EmanuelFaria

EmanuelFaria commented Aug 1, 2019 via email

@petermr

petermr commented Aug 2, 2019

I have renamed @Shruthi-M's tables to tables/plant/import1.0.* Sorry if this inconveniences anyone.

@Shruthi-M

Greetings!
I have uploaded a file - details.xlsx. This contains the following data:

  1. pid
  2. Normalized name
  3. scientificName
  4. GBIF key
  5. wiki_id
  6. IF_ACCEPTED _NAMES
  7. IF_SYNONYMS
  8. Common_names
  9. synonyms

Regarding columns 6 and 7: I have also uploaded another document called Documentation (details), which contains the code used to obtain them.

ANALYSIS:
The following cases need review:
a) if a taxon is neither accepted nor a synonym, the name needs review
b) if the scientificName column contains the entry "Plantae"
c) if the entry in the "Normalized name" column is marked in red
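
The three review rules can be expressed as a small check. This is a sketch under assumptions: columns 6 and 7 are treated as true/false flags, and since cell colour does not survive a text export, rule (c) is passed in as an explicit flag (a hypothetical `marked_red` parameter).

```python
def review_reasons(scientific_name, is_accepted, is_synonym, marked_red=False):
    """Return the reasons (rules a, b, c above) why an entry needs review."""
    reasons = []
    if not is_accepted and not is_synonym:
        reasons.append("neither accepted nor synonym")       # rule a
    if scientific_name.strip() == "Plantae":
        reasons.append("matched only at kingdom level")      # rule b
    if marked_red:
        reasons.append("normalized name marked in red")      # rule c
    return reasons
```

An empty list means none of the three rules applies to the entry.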

@Shruthi-M
Copy link
Collaborator Author

The above post is in tables/plant.

@petermr

petermr commented Aug 5, 2019 via email

@Shruthi-M

Shruthi-M commented Aug 6, 2019 via email

@petermr

petermr commented Aug 6, 2019 via email

@EmanuelFaria

EmanuelFaria commented Aug 6, 2019 via email

@petermr

petermr commented Aug 6, 2019 via email

@petermr

petermr commented Aug 6, 2019

@Shruthi-M I have found your *.docx file and this looks very good. Am reading it.

@petermr

petermr commented Aug 6, 2019

Using Word documents for docs on GitHub is not normally a good idea, for several reasons.

  • Word can introduce spurious characters, especially line ends, smart quotes etc.
  • GitHub is designed for code, Word is not.

In particular, displaying screenshots of code can be very frustrating for people who want to use them. They have to retype them and will make mistakes. People want to cut and paste and run.
(Same goes for the species/output.)
[Screenshots can be useful for tutorials and web pages, but the original should always be available.]

Can you put the code in an R format (notebook-like)? That's the best way.

@Shruthi-M

> Using Word documents for docs on GitHub is not normally a good idea, for several reasons.
>
>   • Word can introduce spurious characters, especially line ends, smart quotes etc.
>   • GitHub is designed for code, Word is not.
>
> In particular, displaying screenshots of code can be very frustrating for people who want to use them. They have to retype them and will make mistakes. People want to cut and paste and run.
> (Same goes for the species/output.)
> [Screenshots can be useful for tutorials and web pages, but the original should always be available.]
>
> Can you put the code in an R format (notebook-like)? That's the best way.

Sure, I will look into this.

@EmanuelFaria

EmanuelFaria commented Aug 6, 2019 via email

@petermr

petermr commented Aug 6, 2019 via email

@Shruthi-M

Greetings!
I have added a new file at EssOilDB/tables/plant called details.txt. I have added it in .txt format (not .xlsx) as requested. I can send the .xlsx file separately as well (if needed). I noticed that some changes needed to be made in the synonyms column, and I have made them.
I will send the text version of the R codes soon.

Thank you.

Repository owner deleted a comment from Shruthi-M Nov 23, 2019