Add support for Coreference Resolution #15

Open
Piyush13y opened this issue Apr 12, 2022 · 8 comments · May be fixed by #41

@Piyush13y
Collaborator

Piyush13y commented Apr 12, 2022

Is your feature request related to a problem? Please describe.
Issues with coreference resolution are one of the most frequently mentioned challenges for information extraction from the biomedical literature. We plan to add support for coreferencing into our pipeline through a CoreferenceProcessor and this issue will help you get the implementation kickstarted.

Describe the solution you'd like
We will develop a wrapper around Huggingface's NeuralCoref library to suit our use case and leverage their pretrained model for coreference resolution. NeuralCoref is built on spaCy and uses neural networks in the backend. The following is the link to the GitHub repo for the NeuralCoref project:
https://github.com/huggingface/neuralcoref

This is the blogpost by Huggingface to better describe their coreference resolution:
https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30

Please read the GitHub repository README and the blog post, as they will help with implementing the wrapper around NeuralCoref.
Their model is trained on general English text, not on a biomedical corpus. The ontologies pertaining to this issue have already been defined, i.e. CoreferenceGroup, EntityMention, MedicalEntityMention. CoreferenceGroup currently works with EntityMention members, and we might have to translate/merge those as MedicalEntityMention for our medical pipeline.

doc._.coref_clusters <=> CoreferenceGroup
doc._.coref_clusters[1].mentions <=> EntityMentions

(Building and Generating Ontologies documentation)

As the GitHub repository shows, if doc._.has_coref is True, doc._.coref_clusters returns a list of all coref clusters, each of which maps to one CoreferenceGroup. NeuralCoref mentions are all Span objects, so it is straightforward to define EntityMentions/MedicalEntityMentions from them. These can in turn be used to define a CoreferenceGroup.
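The cluster-to-group mapping described above can be sketched without any dependencies. This is only an illustration: Mention and Group are hypothetical stand-ins for EntityMention and CoreferenceGroup, and each cluster is represented as a list of (begin, end) character offsets, which is the information a spaCy Span exposes through start_char and end_char.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for EntityMention: a character span over the text."""
    begin: int
    end: int
    text: str


@dataclass
class Group:
    """Hypothetical stand-in for CoreferenceGroup: mentions of the same entity."""
    members: List[Mention] = field(default_factory=list)


def clusters_to_groups(
    text: str, clusters: List[List[Tuple[int, int]]]
) -> List[Group]:
    """Build one Group per cluster and one Mention per span in that cluster."""
    groups = []
    for cluster in clusters:
        group = Group()
        for begin, end in cluster:
            group.members.append(Mention(begin, end, text[begin:end]))
        groups.append(group)
    return groups


# Hand-written offsets standing in for the spans of doc._.coref_clusters.
text = "My sister has a dog. She loves him."
clusters = [[(0, 9), (21, 24)], [(14, 19), (31, 34)]]
groups = clusters_to_groups(text, clusters)
for group in groups:
    print([m.text for m in group.members])
```

In the real processor the spans would come from NeuralCoref and the records would be added to the data pack instead of being kept in plain lists.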

Regarding config for the processor, the user can provide values for greedyness, max_dist, max_dist_match, blacklist, etc. These parameters are described in the GitHub repository README, which can be consulted for more details.

Example call:

pl.add(
    CoreferenceProcessor(),
    {
        "lang": "en_core_web_sm",
        "greedyness": 0.75,
        "max_dist": 50,
        "max_dist_match": 500,
    },
)
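As a rough sketch of how the processor could merge a user config like the one above with defaults, the snippet below validates keys before merging. The default values for greedyness, max_dist, max_dist_match and blacklist are the ones I recall from the NeuralCoref README and should be checked against the installed version; the default for lang and the resolve_config helper itself are assumptions made for illustration.

```python
from typing import Any, Dict

# Assumed defaults: the NeuralCoref values are recalled from its README,
# and "lang" is a hypothetical default chosen for this sketch.
DEFAULT_CONFIG: Dict[str, Any] = {
    "lang": "en_core_web_sm",
    "greedyness": 0.5,
    "max_dist": 50,
    "max_dist_match": 500,
    "blacklist": True,
}


def resolve_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user values over the defaults, rejecting unknown keys."""
    unknown = set(user_config) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    merged = {**DEFAULT_CONFIG, **user_config}
    # greedyness is documented as a number between 0 and 1.
    if not 0.0 <= merged["greedyness"] <= 1.0:
        raise ValueError("greedyness must be between 0 and 1")
    return merged


config = resolve_config({"greedyness": 0.75})
print(config)
```

Rejecting unknown keys means a misspelling such as "greediness" fails loudly instead of silently falling back to the default.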

We also need to ensure that neuralcoref is installed along with forte-medical, so it will have to be added to setup.py and the requirements files.

Also, make sure you add unit test cases for the processor. You can refer to any of the test files in https://github.com/asyml/ForteHealth/tree/master/tests/forte_medical/processors for reference.

P.S. You can follow the NegationContextAnalyzer processor for the structure and code design. It can serve as the template when implementing a new processor.

Describe alternatives you've considered
We reviewed several papers and a couple of GitHub repositories. E2E was another alternative for coreference resolution, but Huggingface's NeuralCoref seems straightforward to integrate, and since we already have spaCy-based processors in our code base, writing this wrapper should be easier.

@Piyush13y Piyush13y changed the title from "Add support for Coreferencing" to "Add support for Coreference Resolution" Apr 12, 2022
@KiaLAN KiaLAN self-assigned this Jun 8, 2022
@Leolty
Collaborator

Leolty commented Jun 8, 2022

@KiaLAN maybe we can work on this together.
@Piyush13y could you please add me to the collaborator list? Zhiting said I should get my hands dirty with this issue.

@KiaLAN
Collaborator

KiaLAN commented Jun 8, 2022

@Leolty Sounds good.

@KiaLAN KiaLAN assigned Leolty and unassigned Leolty Jun 15, 2022
@KiaLAN
Collaborator

KiaLAN commented Jun 16, 2022

To create a NeuralCoref object, I need to create a spaCy pipeline first, then add it to the pipeline:

import spacy
import neuralcoref

nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

Currently my test code adds a SpacyProcessor right before the CoreferenceProcessor.

This is where the problem arises:

  1. If CoreferenceProcessor contains its own spaCy pipeline, that pipeline may differ from the one in SpacyProcessor, so the two processors may tokenize the text differently. I am not sure whether this is the behavior we want.
  2. Currently my implementation lets CoreferenceProcessor borrow the pipeline from SpacyProcessor, which ensures the tokenization is the same. However, the intermediate result cannot be retrieved from SpacyProcessor, so I have to run the pipeline again inside CoreferenceProcessor, which is not very elegant.
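One possible way to look at the tokenization concern in point 1, offered only as a sketch: if mentions are recorded by character offsets rather than token indices, a mention found under one tokenization can still be located under another. The two tokenizations below are hand-written and hypothetical, not actual spaCy output, and covering_tokens is a helper invented for this example. It does not address the cost of running the pipeline twice.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def covering_tokens(tokens: List[Span], mention: Span) -> List[int]:
    """Return the indices of tokens that overlap the mention's character span."""
    begin, end = mention
    return [i for i, (tb, te) in enumerate(tokens) if tb < end and te > begin]


text = "The patient's fever returned."

# Hypothetical tokenization A splits the possessive: The | patient | 's | ...
tokens_a = [(0, 3), (4, 11), (11, 13), (14, 19), (20, 28), (28, 29)]
# Hypothetical tokenization B keeps it whole: The | patient's | ...
tokens_b = [(0, 3), (4, 13), (14, 19), (20, 28), (28, 29)]

# A mention stored as character offsets covers "The patient's" either way.
mention = (0, 13)
print(text[mention[0]:mention[1]])
print(covering_tokens(tokens_a, mention))
print(covering_tokens(tokens_b, mention))
```

The token indices differ between the two tokenizations, but the character span, and therefore the mention text, is the same in both.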

@KiaLAN
Collaborator

KiaLAN commented Jun 16, 2022

Another behavior I found:

Since NeuralCoref is trained on everyday language, it does not do well at resolving medical coreference.

@KiaLAN
Collaborator

KiaLAN commented Jun 17, 2022

Since NeuralCoref is trained on everyday language, it does not do well at resolving medical coreference.
[image: example output]

@hunterhector
Member

It can help identify coref groups related to a person (like the patient in a discharge note). But it would be nice if we could find other models that do better on medical text.

@KiaLAN
Collaborator

KiaLAN commented Jun 17, 2022

CoreferenceGroup currently works with EntityMention members, and we might have to translate/merge those as MedicalEntityMention for our medical pipeline

When I read this, I think @Piyush13y means we need to do coref resolution for medical entities.

@KiaLAN KiaLAN linked a pull request Jun 17, 2022 that will close this issue
@hunterhector
Member

The rationale behind this (using a new ontology name instead of EntityMention) is to allow this tool to use its own set of ontologies, so that its output does not conflict with the output from other tools, like the ones here: https://github.com/asyml/forte-wrappers.

We would certainly like to do better coref on domain-related entities; if you can find existing models, we can set those up too. If we don't have good alternatives right now, we can use this to resolve some coref chains for the time being.
