Record the fact that a statement is somehow auto-generated? #172
Comments
It would be great to have more metadata here. Another use case is axioms added by a reasoner. The robot reason command adds an is_inferred annotation to inferred axioms when asked to annotate them.
We also previously discussed having complete PROV graphs with very clear provenance. This may seem like overkill, and there are the usual objections about individuals cluttering ontologies, but I think it is worth doing this right, with a full data model. There is sometimes a blurry line between annotations (in the bio sense) and ontology axioms. I personally follow Rector et al. and believe there needs to be a firm dividing line here and that we should not capture annotations in OWL. But it can be convenient, and the horse may have left the barn.
There are of course existing data models for annotations, such as the GO evidence model, and Biolink. It's really important to be precise about the sources of axioms, whether auto-generated or not, and this will only become more important. But as a stopgap measure until we have full PROV graphs, what's wrong with having hasDbXref axiom annotations, together with standard conventions for the object (e.g. identifying the LLM pipeline used)? Using dbxref axiom annotations is already a standard that has been in use for 20 years and is well understood by tools.
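For concreteness, a minimal Turtle sketch of such a dbxref axiom annotation; the annotated class, the definition text and the dbxref value are invented for illustration, while oboInOwl:hasDbXref and the OWL axiom-annotation mechanism are standard:

```turtle
@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix obo:      <http://purl.obolibrary.org/obo/> .
@prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> .

# Axiom annotation on a (made-up) class's text definition (IAO:0000115),
# recording the generating pipeline via a dbxref with a conventional value
[] a owl:Axiom ;
    owl:annotatedSource   obo:EX_0000001 ;
    owl:annotatedProperty obo:IAO_0000115 ;
    owl:annotatedTarget   "Any neuron that has its soma located in the brain." ;
    oboInOwl:hasDbXref    "OBO-LLM:gpt-4-definition-pipeline" .
```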
I realize
Very nice issue, I love it and I think this is very important. I would love the PROV solution in the ticket you shared.
I know this sounds impractical, but it is also beautiful. The main downside is that our ontologies get cluttered with a lot of provenance information, but we could perhaps agree on a scheme that does not include the serialised PROV graph in the ontologies and instead has the activity PURLs resolve to it. Dreaming. I really believe in this because I think that provenance will be the main selling point for declarative forms of knowledge in the age of AI.
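A rough sketch of that scheme, assuming PROV-O terms and invented w3id PURLs for the activity and agent:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix obo:  <http://purl.obolibrary.org/obo/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Inside the ontology: only a pointer from the annotated axiom to the activity
[] a owl:Axiom ;
    owl:annotatedSource   obo:EX_0000001 ;
    owl:annotatedProperty obo:IAO_0000115 ;
    owl:annotatedTarget   "Any neuron that has its soma located in the brain." ;
    prov:wasGeneratedBy   <https://w3id.org/example/activity/defgen-42> .

# Outside the ontology: the activity PURL resolves to the full PROV graph
<https://w3id.org/example/activity/defgen-42> a prov:Activity ;
    prov:wasAssociatedWith <https://w3id.org/example/agent/llm-definition-generator> ;
    prov:startedAtTime "2024-05-01T10:00:00Z"^^xsd:dateTime .
```

(Note that prov:wasGeneratedBy is declared as an object property in PROV-O, so using it in an axiom annotation runs into the AP/OP punning issue discussed further down in this thread.)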
I am not sure I understand what you mean by “annotations (in the bio sense)”. And likely because of that, I don’t understand what this has to do with the issue at hand.
Precisely the fact that we do not have standard conventions for the object. The
I agree about the beautiful part. I’d be happy with such a solution, except for one bit: it does not provide a simple, direct way to get the information that a statement has been automatically generated without human input (which is what I want to do here). Unless I missed something (I’ll admit I only briefly skimmed the PROV spec for now), if all axioms in the ontology are annotated with a pointer to the activity that produced them, finding out whether a given axiom was auto-generated would require one to either:
*) have out-of-band knowledge of which activities correspond to automatic processes (e.g. a list of all such activities), or
*) explore the provenance graph behind each activity.
Ultimately it should be doable, but I can’t help thinking we are envisioning a complex solution that will maybe provide a lot of potentially useful information but will fail to easily provide the one bit of information that, for now, we know we want to provide.
I like the idea of something like the ROBOPROV ontology outlined in the document linked in ontodev/robot#6, and I think such an ontology would be absolutely necessary if we want to be able to describe our “activities” in enough detail (PROV on its own seems largely insufficient), but I think it should not be focused specifically on ROBOT. If we have to design such an ontology, we should make it cover not only ROBOT but also OAK, and possibly the ODK as well.
Wait,
I can only say this: #90. The semantic web was designed for instance-level assertions, not class-level assertions.
But doesn’t that imply that any ontology in which we would use a
https://incatools.github.io/ontology-access-kit/glossary.html#term-Annotation |
I'd say the semantic web / RDF world is completely fine with class-level assertions; classes are instances of classes in RDFS. It was OWL 1 that insisted that classes aren't in the domain of discourse, creating the awful hack of "annotation properties" to get around this. They backtracked a bit with OWL 2, but punning is fundamentally strange and confusing to 99% of people, same with OWL Full.
The decision that AP/OP punning is illegal is really one of the most annoying design decisions. In Protégé you can't even select an OP when annotating an axiom.
@gouttegd yes, this is the super annoying downside. They can be merged at the RDF level, but not at the OWL level (e.g. when imported using OWL API-based tooling, or processed using tools sensitive to the, as Chris says, often annoying limitations of OWL Full). The alternative is to never re-use any already existing properties, which I believe is worse. The "typing" is just a design flaw, and since RDF-level integration is totally fine either way, I would vote for re-typing.
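To make the re-typing option concrete, a minimal sketch, assuming prov:wasGeneratedBy as the property being re-used (PROV-O declares it as an object property):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

# PROV-O types this IRI as owl:ObjectProperty; re-typing it as an
# annotation property lets it be used in axiom annotations, at the cost
# of an AP/OP pun that is not allowed in OWL 2 DL.
prov:wasGeneratedBy a owl:AnnotationProperty .
```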
Crazy idea (not sure I would vote for it myself, but just thinking out loud): instead of re-typing (that is, using the IRI of a standard OP as if it were an AP), how about defining new IRIs (in a dedicated namespace) for annotation properties that “mirror” the object properties we would like to use, with explicit mappings between each new AP and its original OP counterpart? That is, if we’d like to use, for example,
That’s akin to “never re-use any already existing (object) properties”, yes. But at least the new properties would not come out of the blue and would not be reinventing the wheel – they would follow the existing properties.
Again, not sure I believe myself this is a good idea, but WDYT?
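A sketch of what the mirroring could look like, again assuming prov:wasGeneratedBy as the OP to mirror, a placeholder OMO IRI for the new AP, and skos:exactMatch standing in for whatever mapping predicate would be agreed on:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix obo:  <http://purl.obolibrary.org/obo/> .

# Placeholder annotation property mirroring the PROV object property
obo:OMO_9999001 a owl:AnnotationProperty ;
    rdfs:label "was generated by (annotation mirror)" ;
    skos:exactMatch prov:wasGeneratedBy .
```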
This is not a crazy idea, it's neat, but the main problem remains: creating a burden for users to integrate at the RDF level. Previously we (in this case I will take responsibility) decided the tradeoff between (1) re-typing and OWL violations on the one side, and (2) using different IRIs, requiring mappings and churn on the other (data integration) side, in favour of (1) as the lesser of two evils. Check this:
in here: https://www.w3.org/TR/skos-reference/#namespace-documents
It would be massive churn to create parallel hierarchies in the way you propose for all vocabularies we re-use where the originators, for some reason, thought it was a good idea to model the metadata properties in OWL. I personally think we should not create different IRIs. I value the fact that we can integrate at the RDF level more highly than maintaining conceptual separation.
Several ontologies contain annotations that have not been manually curated/edited, but are instead the result of some kind of automatic generation process.
For example, FlyBase’s Drosophila anatomy ontology (FBbt) contains classes whose text definition has been automatically generated from the logical definition of the class (by “translating” the class expression the class is equivalent to into plain English).
We can also expect to see more annotations that are the result of some LLM-assisted process.
I think it would be useful if this kind of auto-generated content could be explicitly flagged as such, for at least two reasons:
Basic honesty. There is an implicit assumption that an ontology is the result of the work of human curators who know what they are doing. Users have the right to know when a part of an ontology is instead the result of an automated process involving no actual (human) curation.
Provide a way for LLM folks to avoid using auto-generated content when they collect training data, to avoid a situation where the next generation of LLMs is trained on the output of the previous generation (it could be that this horse has already left the barn; still, that doesn’t mean we shouldn’t try to avoid making things worse).
In the aforementioned FBbt ontology, automatically generated definitions are annotated with an oboInOwl:hasDbXref annotation carrying the special value FBC:Autogenerated (where FBC stands for “FlyBase Curator”). It’s better than nothing, but it’s obviously a local, ad hoc solution. A standard, uniform way to flag auto-generated statements would be better.
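In Turtle, that convention looks roughly like this; the class and its definition are invented here, while oboInOwl:hasDbXref and the FBC:Autogenerated value are the ones described above:

```turtle
@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix obo:      <http://purl.obolibrary.org/obo/> .
@prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> .

# Invented FBbt-style class with an auto-generated definition (IAO:0000115)
obo:FBbt_99999999 a owl:Class ;
    obo:IAO_0000115 "Any neuron that has its soma located in the brain." .

# The axiom annotation that flags the definition as auto-generated
[] a owl:Axiom ;
    owl:annotatedSource   obo:FBbt_99999999 ;
    owl:annotatedProperty obo:IAO_0000115 ;
    owl:annotatedTarget   "Any neuron that has its soma located in the brain." ;
    oboInOwl:hasDbXref    "FBC:Autogenerated" .
```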
Several possibilities:

a) A simple annotation with a new property that takes a boolean value (something like OMO:is_autogenerated=true) and merely indicates whether the statement the annotation is applied to is, well, auto-generated.

b) An annotation with a new property that takes either a string or (preferably) an IRI, and that indicates both 1) the fact that the statement is auto-generated and 2) some information about how the content was generated, for example with an IRI that identifies the generating process (something like OMO:generator=http://example.org/my/text/definition/generator).

c) Defining some special values to use with existing properties such as dc:contributor or dc:source (@matentzn’s idea; something like dc:contributor=openai:gpt4). This does not involve any new property, but implies that one must look at the value of the annotation to possibly know that the content is auto-generated.

(A rough sketch of what each option could look like in Turtle is shown below.)

Thoughts?
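A sketch of the three options on the same invented definition axiom; the OMO property IRIs are placeholders, and the dc: prefix is assumed to mean Dublin Core terms:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix dc:  <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Reification of the annotated definition axiom, shared by all three options
_:ax a owl:Axiom ;
    owl:annotatedSource   obo:EX_0000001 ;
    owl:annotatedProperty obo:IAO_0000115 ;
    owl:annotatedTarget   "Any neuron that has its soma located in the brain." .

# Option a) boolean flag (placeholder property IRI)
_:ax obo:OMO_9999002 "true"^^xsd:boolean .

# Option b) IRI identifying the generating process (placeholder property IRI)
_:ax obo:OMO_9999003 <http://example.org/my/text/definition/generator> .

# Option c) special value used with an existing property
_:ax dc:contributor "openai:gpt4" .
```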