Add support for Coreference Resolution #15

Piyush13y · 2022-04-12T03:11:46Z

Is your feature request related to a problem? Please describe.
Issues with coreference resolution are one of the most frequently mentioned challenges for information extraction from the biomedical literature. We plan to add support for coreference resolution to our pipeline through a CoreferenceProcessor, and this issue will help you get the implementation kickstarted.

Describe the solution you'd like
We will be developing a wrapper around Huggingface's NeuralCoref library to suit our use case and leverage their pretrained model for coreference resolution. It uses spaCy with neural networks in the backend. The following is the link to the GitHub repo for the NeuralCoref project:
https://github.com/huggingface/neuralcoref

This is the blogpost by Huggingface to better describe their coreference resolution:
https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30

Please give the GitHub repository README file and the blogpost a read, as they will help with implementing the wrapper around NeuralCoref.
Their model is trained on a general English (non-biomedical) corpus. The ontologies pertaining to this issue have already been defined, i.e. CoreferenceGroup, EntityMention, MedicalEntityMention. CoreferenceGroup currently works with EntityMention members, and we might have to translate/merge those as MedicalEntityMention for our medical pipeline.

doc._.coref_clusters <=> CoreferenceGroup
doc._.coref_clusters[1].mentions <=> EntityMentions

(Building and Generating Ontologies documentation)

As is clear from the GitHub repository, if doc._.has_coref is True, doc._.coref_clusters returns a list of all coref clusters, each of which corresponds to one CoreferenceGroup. NeuralCoref mentions are all Span objects, which means it is straightforward to define EntityMentions/MedicalEntityMentions from them. These in turn can then be used to define a CoreferenceGroup.
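The mapping described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `clusters_to_char_spans` is hypothetical, and the clusters are stand-in objects that mimic the `mentions`, `start_char` and `end_char` attributes of NeuralCoref clusters and spaCy spans, so the snippet runs without either library installed. In the real processor, each (begin, end) pair would presumably become an EntityMention or MedicalEntityMention, and each inner list a CoreferenceGroup.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple

CharSpan = Tuple[int, int]


def clusters_to_char_spans(clusters: Iterable) -> List[List[CharSpan]]:
    """Return one list of (begin, end) character spans per coref cluster.

    Hypothetical helper: `clusters` is expected to look like
    doc._.coref_clusters, i.e. each item has a `mentions` list whose
    elements expose `start_char` and `end_char` like spaCy Span objects.
    """
    groups = []
    for cluster in clusters:
        groups.append([(m.start_char, m.end_char) for m in cluster.mentions])
    return groups


# Stand-in data (not real NeuralCoref output) for a two-cluster document.
text = "My sister has a dog. She loves him."
sister = SimpleNamespace(start_char=0, end_char=9)    # "My sister"
she = SimpleNamespace(start_char=21, end_char=24)     # "She"
dog = SimpleNamespace(start_char=14, end_char=19)     # "a dog"
him = SimpleNamespace(start_char=31, end_char=34)     # "him"

clusters = [
    SimpleNamespace(mentions=[sister, she]),
    SimpleNamespace(mentions=[dog, him]),
]

groups = clusters_to_char_spans(clusters)
for group in groups:
    # Each inner list would back one CoreferenceGroup in the processor.
    print([text[begin:end] for begin, end in group])
```

Working with character offsets rather than token indices keeps the result independent of how the text was tokenized.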

Regarding the config for the processor, the user can provide values for greedyness, max_dist, max_dist_match, blacklist, etc. These parameters are described in the GitHub repository README, which can be referred to for more details.

Example call:

pl.add(
    CoreferenceProcessor(),
    {
        "lang": "en_core_web_sm",
        "greedyness": 0.75,
        "max_dist": 50,
        "max_dist_match": 500,
    },
)
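One way the processor could handle such a config is sketched below. It is an assumption about the design, not the final implementation: the function name `split_config` is hypothetical, and the default values are the ones listed in the NeuralCoref README. The idea is to separate the spaCy model name from the NeuralCoref options, fill in defaults, and reject obviously invalid values before anything is loaded.

```python
from typing import Any, Dict, Tuple

# Defaults as listed in the NeuralCoref README (assumed to still be current).
NEURALCOREF_DEFAULTS: Dict[str, Any] = {
    "greedyness": 0.5,
    "max_dist": 50,
    "max_dist_match": 500,
    "blacklist": True,
}


def split_config(user_config: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Split a processor config into (spaCy model name, NeuralCoref kwargs).

    Hypothetical helper; the key names follow the example call above.
    """
    config = dict(user_config)  # copy so the caller's dict is not modified
    lang = config.pop("lang", "en_core_web_sm")

    unknown = set(config) - set(NEURALCOREF_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown coreference options: {sorted(unknown)}")

    kwargs = {**NEURALCOREF_DEFAULTS, **config}
    if not 0.0 <= kwargs["greedyness"] <= 1.0:
        raise ValueError("greedyness must be between 0 and 1")
    for key in ("max_dist", "max_dist_match"):
        if kwargs[key] < 0:
            raise ValueError(f"{key} must be non-negative")
    return lang, kwargs


lang, kwargs = split_config(
    {"lang": "en_core_web_sm", "greedyness": 0.75, "max_dist": 50}
)
print(lang, kwargs)
```

The resulting keyword arguments could then be forwarded as `neuralcoref.add_to_pipe(nlp, **kwargs)`.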

We also have to make sure that neuralcoref is installed along with forte-medical. Hence, it will have to be added to the setup.py and requirements files.

Also, make sure you add unit test cases for the processor. You can refer to any of the test files in https://github.com/asyml/ForteHealth/tree/master/tests/forte_medical/processors for reference.

P.S. You can follow the NegationContextAnalyzer processor for the structure and code design. It can be used as a template when implementing a new processor.

Describe alternatives you've considered
Several papers and a couple of GitHub repositories were consulted. E2E was another alternative for coreference resolution, but Huggingface's NeuralCoref seems to be straightforward to implement, and since we already have spaCy-based processors in our code base, it should be easier to write this wrapper.

Leolty · 2022-06-08T06:35:36Z

@KiaLAN maybe we can work on this together.
@Piyush13y could you please add me to the collaborator list? Zhiting said I should get my hands dirty with this issue.

KiaLAN · 2022-06-08T10:27:16Z

@Leolty Sounds good.

KiaLAN · 2022-06-16T08:58:42Z

To create a NeuralCoref object, I need to create a spaCy pipeline first, then add NeuralCoref to that pipeline:

import spacy
import neuralcoref
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

Currently my test code adds a SpacyProcessor right before the CoreferenceProcessor.

Here is the problem:

If CoreferenceProcessor contains its own spaCy pipeline, it may differ from the pipeline in SpacyProcessor. The tokenization of SpacyProcessor and CoreferenceProcessor may then differ as well. I am not sure if this is the behavior we want.
Currently my implementation lets CoreferenceProcessor borrow the pipeline from SpacyProcessor, which ensures that the tokenization is the same. But I find that the intermediate result cannot be retrieved from SpacyProcessor, so I have to run the pipeline inside the CoreferenceProcessor again, which is not very elegant.
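One possible way to guarantee that both processors see the same tokenization is to hand out a single pipeline object per model name. The sketch below is only an illustration of that idea: `PipelineRegistry` is a hypothetical class, not something that exists in Forte, and the loader is a stand-in so the snippet runs without spaCy (in real use it would be `spacy.load`). It does not remove the need to run the pipeline a second time; it only ensures the two runs use identical components.

```python
from typing import Any, Callable, Dict, List


class PipelineRegistry:
    """Load each model at most once and return the same object afterwards."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._cache: Dict[str, Any] = {}

    def get(self, model_name: str) -> Any:
        # Only call the (potentially slow) loader on the first request.
        if model_name not in self._cache:
            self._cache[model_name] = self._loader(model_name)
        return self._cache[model_name]


# Stand-in loader that records how often it is called.
load_calls: List[str] = []


def fake_loader(name: str) -> object:
    load_calls.append(name)
    return object()


registry = PipelineRegistry(fake_loader)
nlp_for_spacy_processor = registry.get("en_core_web_sm")
nlp_for_coref_processor = registry.get("en_core_web_sm")

# Both processors would now hold the very same pipeline object.
print(nlp_for_spacy_processor is nlp_for_coref_processor, load_calls)
```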

KiaLAN · 2022-06-16T09:00:53Z

Another behavior I found:

Since NeuralCoref is trained on everyday language, it does not do well at resolving medical coreference.

KiaLAN · 2022-06-17T04:29:40Z

Since NeuralCoref is trained on everyday language, it does not do well at resolving medical coreference.

hunterhector · 2022-06-17T04:37:09Z

It can help to identify coref groups related to a person (like the patient in a discharge note). But it would be nice if we could find other models that do better on medical text.

KiaLAN · 2022-06-17T11:16:57Z

CoreferenceGroup currently works with EntityMention members, and we might have to translate/merge those as MedicalEntityMention for our medical pipeline

When I read this, I took @Piyush13y to mean that we need to do coref resolution for medical entities.

hunterhector · 2022-06-17T15:28:52Z

The rationale behind this (using a new ontology name instead of EntityMention) is to allow this tool to use its own set of ontologies, so it doesn't necessarily conflict with the output from other tools, like the ones here: https://github.com/asyml/forte-wrappers.

We would certainly like to do better coref on domain-related entities. If you can find existing models, we can set those up too. But if we don't have good alternatives right now, we can use this to resolve some coref chains for the time being.

Piyush13y added the new component label Apr 12, 2022

Piyush13y changed the title ~~Add support for Coreferencing~~ Add support for Coreference Resolution Apr 12, 2022

KiaLAN self-assigned this Jun 8, 2022

Piyush13y assigned Leolty Jun 8, 2022

Piyush13y unassigned Leolty Jun 9, 2022

KiaLAN assigned Leolty and unassigned Leolty Jun 15, 2022

KiaLAN linked a pull request Jun 17, 2022 that will close this issue

15 Adding CoreferenceProcessor #41

Open
