Some entities do not have the text field on the output file from inference #3

Open
T-Almeida opened this issue Nov 7, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@T-Almeida
Collaborator

I believe merge #2 introduced a bug where some entities are missing their associated text in the output file. Here's a comparison:

New output file:

es-S2254-28842014000200009-1    1346    1355    SYMPTOM -       focalidad
es-S2254-28842014000200009-1    1382    1401    SYMPTOM -       síntoma neurológico
es-S2254-28842014000200009-1    2271    2279    SYMPTOM -
es-S2254-28842014000200009-1    2740    2758    SYMPTOM -
es-S2254-28842014000200009-1    2777    2798    SYMPTOM -
es-S2254-28842014000200009-1    2893    2900    SYMPTOM -
es-S2254-28842014000200009-1    2922    2933    SYMPTOM -
es-S2254-28842014000200009-1    2936    2948    SYMPTOM -
es-S2340-98942015000100005-1    259     271     CHEMICAL        -       carboplatino
es-S2340-98942015000100005-1    274     284     CHEMICAL        -       paclitaxel

Old output file (from commit 11da870):

es-S2254-28842014000200009-1    1346    1355    SYMPTOM -       focalidad
es-S2254-28842014000200009-1    1382    1401    SYMPTOM -       síntoma neurológico
es-S2254-28842014000200009-1    2271    2279    SYMPTOM -       sangrado
es-S2254-28842014000200009-1    2740    2758    SYMPTOM -       mal estado general
es-S2254-28842014000200009-1    2777    2798    SYMPTOM -       constantes mantenidas
es-S2254-28842014000200009-1    2893    2900    SYMPTOM -       agitada
es-S2254-28842014000200009-1    2922    2933    SYMPTOM -       hipotensión
es-S2254-28842014000200009-1    2936    2948    SYMPTOM -       convulsiones
es-S2340-98942015000100005-1    259     271     CHEMICAL        -       carboplatino
es-S2340-98942015000100005-1    274     284     CHEMICAL        -       paclitaxel
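A quick way to list the affected rows is to scan the output file for entries whose last column is empty. The following is only a sketch: it assumes the file is tab-separated with six columns (document id, start, end, label, a "-" placeholder, text), as the listings above suggest, and `rows_missing_text` is a hypothetical helper, not part of the repository.

```python
from typing import Iterable, List, Tuple


def rows_missing_text(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (doc_id, start, end) for every row whose text column is empty.

    Assumes tab-separated rows: doc_id, start, end, label, "-", text.
    """
    missing = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # The text column may be absent entirely or present but empty.
        text = fields[5] if len(fields) > 5 else ""
        if not text.strip():
            missing.append((fields[0], int(fields[1]), int(fields[2])))
    return missing


sample = [
    "es-S2254-28842014000200009-1\t1346\t1355\tSYMPTOM\t-\tfocalidad\n",
    "es-S2254-28842014000200009-1\t2271\t2279\tSYMPTOM\t-\t\n",
    "es-S2254-28842014000200009-1\t2740\t2758\tSYMPTOM\t-\n",
]
flagged = rows_missing_text(sample)
```

For a real file, the same function can be applied to an open file handle, for example `rows_missing_text(open(path, encoding="utf-8"))`.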
@richardjonker2000
Collaborator

richardjonker2000 commented Nov 8, 2024

I believe the bug is related to document splitting, specifically line 222 of data.py:
"text": doc['text'][low_offset: high_offset],

I changed the code so that each chunk's text field contains only the text within its own offsets. I have not verified this, but when the results text is constructed, it takes the text field from the first chunk only.

This corresponds to line 42 of inference.py:
text = documents[doc][0]["text"]
which takes only the first chunk as the document text. If entity text is then sliced from this string using document-level offsets, any entity located beyond the first chunk would come out empty, which is consistent with the output above.

We can fix this either in inference.py or in data.py.

I think it's better to fix it in inference.
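One possible shape for the inference-side fix is to look up each entity in the chunk that covers its span, instead of always slicing the first chunk. This is a minimal sketch under stated assumptions, not the repository's code: it assumes each chunk dict carries its starting document offset under a hypothetical `low_offset` key alongside `text`, and `entity_text` is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def entity_text(chunks: List[Dict], start: int, end: int) -> Optional[str]:
    """Return the document text for [start, end) using document-level offsets.

    Each chunk is assumed to hold only its own slice of the document, plus
    the offset at which that slice begins ("low_offset", hypothetical name).
    """
    for chunk in chunks:
        low = chunk["low_offset"]
        high = low + len(chunk["text"])
        if low <= start and end <= high:
            # Translate document offsets into chunk-local offsets.
            return chunk["text"][start - low:end - low]
    return None  # no single chunk fully covers the span


# Toy document "abcdefghij" split into two overlapping chunks.
chunks = [
    {"low_offset": 0, "text": "abcdef"},
    {"low_offset": 4, "text": "efghij"},
]

# Slicing only the first chunk silently yields an empty string.
first_only = chunks[0]["text"][7:10]

# Looking up the covering chunk recovers the entity text.
fixed = entity_text(chunks, 7, 10)
```

The lookup relies on chunks overlapping enough that every entity lies fully inside at least one chunk. If that does not hold, an alternative would be to keep the full document text next to the chunks and slice that instead.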

@T-Almeida mentioned this issue Nov 29, 2024