Agents duplication on ontology parsing #644

Open
galviset opened this issue Dec 17, 2024 · 9 comments

@galviset
Collaborator

Describe the bug
When parsing a new submission of an ontology, the parser sometimes creates duplicate Agent objects with the same name.
The conditions under which this happens are not always clear, but people described in the ontology file by a string that concatenates more than just a name (e.g. "Guillaume Alviset https://orcid.org/0009-0004-4295-6593") will trigger this behavior.

Screenshots
(two screenshots, not reproduced here)

@syphax-bouazzouni
Contributor

Here are two possible solutions:

  1. Disable the extraction of Agents
  2. Disable the extraction of Agents if the previous submission already has Agents set.

@jonquet
Contributor

jonquet commented Dec 19, 2024

I will rephrase this as a single "proposed" solution:
Enable agent extraction when parsing the first submission of an ontology (if submissionId=1) and disable it for subsequent submissions (submissionId >1). Make this setting (enabled/not enabled) accessible to ontology admins in their admin panel.
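
As a rough illustration of the rule above, here is a minimal sketch. The method name `extract_agents?` and the `admin_override` parameter are hypothetical and not taken from the codebase; the override only stands in for the per-ontology setting mentioned above.

```ruby
# Hypothetical predicate: extract agents only for the first submission,
# unless an ontology admin has explicitly set the option (assumption).
def extract_agents?(submission_id, admin_override: nil)
  # An explicit admin choice (true or false) takes precedence.
  return admin_override unless admin_override.nil?

  # Default rule from the proposal: only submissionId = 1 triggers extraction.
  submission_id == 1
end

puts extract_agents?(1)                       # first submission
puts extract_agents?(2)                       # later submission
puts extract_agents?(2, admin_override: true) # admin re-enabled extraction
```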

@syphax-bouazzouni
Contributor

syphax-bouazzouni commented Dec 19, 2024

Enable agent extraction when parsing the first submission of an ontology (if submissionId=1) and disable it for subsequent submissions (submissionId >1).

OK

Make this setting (enabled/not enabled) accessible to ontology admins in their admin panel.

Not really possible for now, as we don't have a configuration workflow in the UI (see ontoportal-lirmm/bioportal_web_ui#836).

@jonquet
Contributor

jonquet commented Dec 19, 2024

In fact, I did not mean to have this in a "general" admin panel, but in an ontology-specific panel: the one we have been talking about creating by splitting the "Edit submission" page into two main parts, (i) one related to metadata and (ii) one related to how AgroPortal deals with the ontology.

So typically, this would go in the second "part".

For the moment, this plan to separate "Edit submission" into two parts is a UI-only contribution, which means all of these settings would still be based on properties of a submission.

In other words, we only have to create a boolean property extractAgentsFromSourceFile and then use it in the processing workflow to decide whether or not to skip agent extraction.

@syphax-bouazzouni
Contributor

In other words, we only have to create a boolean property extractAgentsFromSourceFile and then use it in the processing workflow to decide whether or not to skip agent extraction.
In summary, to do that we need to:

  • Add the attribute in the submission model
  • Add the metadata of that attribute to the .yml file to explain what it is.
  • Update the metadata extraction step to read it and implement the logic
  • Update the UI to add the property
  • Test all of this
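
The steps above could look roughly like the following sketch. A plain Hash stands in for the submission model, and `extract_agents` and `process_metadata` are hypothetical helpers, not the actual methods of the metadata extraction operation; only the property name `extractAgentsFromSourceFile` comes from this thread. Defaulting the flag to `true` is an assumption made so that existing submissions keep the current behaviour.

```ruby
# Hypothetical stand-in for the real extraction: wrap each raw string in a Hash.
def extract_agents(raw_agent_strings)
  raw_agent_strings.map { |s| { name: s } }
end

# Sketch of the processing step gated by the boolean property.
def process_metadata(submission, raw_agent_strings)
  # Assumption: a missing flag means "enabled".
  enabled = submission.fetch(:extractAgentsFromSourceFile, true)
  return [] unless enabled

  extract_agents(raw_agent_strings)
end

p process_metadata({ extractAgentsFromSourceFile: false }, ['Guillaume Alviset'])
p process_metadata({}, ['Guillaume Alviset'])
```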

@Bilelkihal
Member

In other words, we only have to create a boolean property extractAgentsFromSourceFile and then use it in the processing workflow to decide whether or not to skip agent extraction.
In summary, to do that we need to:

  • Add the attribute in the submission model
  • Add the metadata of that attribute to the .yml file to explain what it is.
  • Update the metadata extraction step to read it and implement the logic
  • Update the UI to add the property
  • Test all of this

I don't see the need to overcomplicate things for such a small feature, but if we plan to add more options for controlling how AgroPortal handles each ontology separately, then why not (to be discussed in the next meeting).

I also don't favor the solution of extracting only from the first submission.
Why not simply add a heuristic to detect if the agent already exists, and if so, avoid creating it again?
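
One possible shape for such a heuristic, as a sketch only: normalise the raw string (drop URLs, collapse whitespace, ignore case) and look the result up before creating anything. The helpers `clean_name`, `agent_key` and `find_or_create_agent` are hypothetical, and an in-memory Hash stands in for the real Agent store.

```ruby
# Hypothetical: remove any URL (e.g. an ORCID link) and collapse whitespace.
def clean_name(raw)
  raw.gsub(%r{https?://\S+}, ' ').split.join(' ')
end

# Hypothetical lookup key: the cleaned name, case-insensitive.
def agent_key(raw)
  clean_name(raw).downcase
end

# Return the existing agent when the key is known, otherwise register a new one.
def find_or_create_agent(agents_by_key, raw)
  agents_by_key[agent_key(raw)] ||= { name: clean_name(raw) }
end

agents = {}
a = find_or_create_agent(agents, 'Guillaume Alviset https://orcid.org/0009-0004-4295-6593')
b = find_or_create_agent(agents, 'guillaume  alviset')
puts agents.size   # both strings map to one agent
puts a.equal?(b)   # same object, no duplicate created
```

A name-only key like this would also merge two different people who share a name, so any real implementation would still need curation or a stronger identifier.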

@jonquet
Contributor

jonquet commented Jan 3, 2025

The feature was enabled in ontoportal-lirmm/ontologies_linked_data#154

The current code is here: https://github.com/ontoportal-lirmm/ontologies_linked_data/blob/master/lib/ontologies_linked_data/services/submission_process/operations/submission_extract_metadata.rb#L276

Discussed today:
We shall reformulate Syphax's proposal, "disable the extraction of Agents if the previous submission already has Agents set",
to:
Disable the extraction of any "person and organization" properties if the ontology already has some values set in the previous submission.

We accept the consequence that extracting an agent from ontology2 could recreate an agent that already exists for ontology1. In other words: any parsing that extracts agents requires curation of those agents.

When implementing the new ontology parsing report, we shall list the extracted agents.

This approach lets us implement a solution that is independent of the ontology and does not rely on a parameter (global or ontology-specific).

Note: the behaviour proposed for the "person and organization" category is the opposite of the default behaviour, which is to always give priority to what is in the file over what we have in the metadata record.
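
A minimal sketch of this rule, under stated assumptions: the property list in `agent_properties` is illustrative rather than the real "person and organization" category, `resolve_metadata` is a hypothetical helper, and the check is applied per property (the discussion does not say whether it should be per property or for the whole category).

```ruby
# Illustrative subset of "person and organization" properties (assumption).
def agent_properties
  %i[hasCreator hasContributor publisher]
end

# Merge values extracted from the file into the previous submission's metadata.
def resolve_metadata(previous, extracted)
  extracted.each_with_object(previous.dup) do |(prop, value), result|
    already_set = agent_properties.include?(prop) && !Array(previous[prop]).empty?
    # Agent properties with existing values are kept; otherwise the file wins.
    result[prop] = value unless already_set
  end
end

previous  = { hasCreator: ['Guillaume Alviset'], description: 'old text' }
extracted = {
  hasCreator: ['Guillaume Alviset https://orcid.org/0009-0004-4295-6593'],
  description: 'new text',
  publisher: ['LIRMM']
}
p resolve_metadata(previous, extracted)
```

Here `hasCreator` keeps its curated value, `description` follows the default "file has priority" behaviour, and `publisher` is extracted because nothing was set before.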

@jonquet
Contributor

jonquet commented Jan 3, 2025

This approach lets us implement a solution that is independent of the ontology and does not rely on a parameter (global or ontology-specific).

Note: the behaviour proposed for the "person and organization" category is the opposite of the default behaviour, which is to always give priority to what is in the file over what we have in the metadata record.

@jonquet
Contributor

jonquet commented Jan 3, 2025

Another solution would be to remember that an "agent string" has already been extracted, for instance by:

  • keeping a record / temp file of all the extracted "agent strings"
  • creating an ID based on a hash generated from the "agent string"

This solution is not preferred, as it would require re-curating things that have already been curated if the "agent string" changed in any way.
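
For illustration, the hash-based variant could be sketched as follows. The helpers `agent_id` and `new_agent_strings` are hypothetical, and an in-memory `Set` stands in for the record or temp file. The last call shows the drawback noted above: any change to the string yields a new ID.

```ruby
require 'digest'
require 'set'

# Hypothetical ID: a truncated SHA-256 digest of the raw "agent string".
def agent_id(agent_string)
  Digest::SHA256.hexdigest(agent_string)[0, 12]
end

# Keep only strings whose ID has not been recorded yet, recording them as we go.
# Set#add? returns nil when the element is already present.
def new_agent_strings(agent_strings, seen_ids)
  agent_strings.select { |s| seen_ids.add?(agent_id(s)) }
end

seen = Set.new
p new_agent_strings(['Guillaume Alviset'], seen) # extracted on first parse
p new_agent_strings(['Guillaume Alviset'], seen) # skipped on re-parse
# A modified string gets a different hash and is treated as a new agent:
p new_agent_strings(['Guillaume Alviset https://orcid.org/0009-0004-4295-6593'], seen)
```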
