Some entities do not have the text field on the output file from inference #3

T-Almeida · 2024-11-07T21:24:23Z

I believe merge #2 introduced a bug where some entities are missing their associated text in the output file. Here's a comparison:

New output file:

es-S2254-28842014000200009-1    1346    1355    SYMPTOM -       focalidad
es-S2254-28842014000200009-1    1382    1401    SYMPTOM -       síntoma neurológico
es-S2254-28842014000200009-1    2271    2279    SYMPTOM -
es-S2254-28842014000200009-1    2740    2758    SYMPTOM -
es-S2254-28842014000200009-1    2777    2798    SYMPTOM -
es-S2254-28842014000200009-1    2893    2900    SYMPTOM -
es-S2254-28842014000200009-1    2922    2933    SYMPTOM -
es-S2254-28842014000200009-1    2936    2948    SYMPTOM -
es-S2340-98942015000100005-1    259     271     CHEMICAL        -       carboplatino
es-S2340-98942015000100005-1    274     284     CHEMICAL        -       paclitaxel

Old output file (from commit 11da870):

es-S2254-28842014000200009-1    1346    1355    SYMPTOM -       focalidad
es-S2254-28842014000200009-1    1382    1401    SYMPTOM -       síntoma neurológico
es-S2254-28842014000200009-1    2271    2279    SYMPTOM -       sangrado
es-S2254-28842014000200009-1    2740    2758    SYMPTOM -       mal estado general
es-S2254-28842014000200009-1    2777    2798    SYMPTOM -       constantes mantenidas
es-S2254-28842014000200009-1    2893    2900    SYMPTOM -       agitada
es-S2254-28842014000200009-1    2922    2933    SYMPTOM -       hipotensión
es-S2254-28842014000200009-1    2936    2948    SYMPTOM -       convulsiones
es-S2340-98942015000100005-1    259     271     CHEMICAL        -       carboplatino
es-S2340-98942015000100005-1    274     284     CHEMICAL        -       paclitaxel


richardjonker2000 · 2024-11-08T09:41:54Z

I believe the bug is related to document splitting, specifically data.py line 222:
"text": doc['text'][low_offset: high_offset],

I changed the code so that each chunk's text field contains only the text within its respective offsets. I have not verified this, but when the results text is constructed, it takes the text field from the first chunk.

This corresponds to line 42 in inference.py:
text = documents[doc][0]["text"]
which takes only the first chunk's text as the document text.
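A minimal sketch of why this would produce empty text fields, assuming entity offsets are relative to the whole document. The chunk layout and the `low_offset`/`high_offset` keys are hypothetical, not taken from the repository; only the `"text"` key appears in the code quoted above. Python returns an empty string when a slice starts past the end of a string, which would match the output above, where only the later entities lose their text.

```python
# Hypothetical two-chunk document; key names other than "text" are assumptions.
full_text = "focalidad y sangrado"
chunks = [
    {"text": full_text[0:10], "low_offset": 0, "high_offset": 10},
    {"text": full_text[10:20], "low_offset": 10, "high_offset": 20},
]

# Equivalent of documents[doc][0]["text"]: only the first chunk's text.
first_chunk_text = chunks[0]["text"]

# Entity spans use document-level offsets.
in_first = first_chunk_text[0:9]     # span lies inside the first chunk
in_second = first_chunk_text[12:20]  # span lies past the end of the first chunk

print(repr(in_first))   # 'focalidad'
print(repr(in_second))  # ''
```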

We can either fix the code in inference or in data.

I think it's better to fix it in inference.
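One possible shape for the inference-side fix, as a sketch only: look up the chunk that covers each entity span and slice relative to that chunk's start. The function name `entity_text` and the `low_offset`/`high_offset` keys are assumptions for illustration; the actual structures in inference.py may differ.

```python
def entity_text(chunks, start, end):
    """Return the text for a document-level span [start, end).

    Assumes each chunk dict stores its own document-level offsets
    under the hypothetical keys "low_offset" and "high_offset".
    """
    for chunk in chunks:
        low, high = chunk["low_offset"], chunk["high_offset"]
        if low <= start and end <= high:
            # Convert document-level offsets to chunk-local offsets.
            return chunk["text"][start - low:end - low]
    return ""  # span is not fully covered by any single chunk


# Hypothetical example data.
full_text = "focalidad y sangrado"
chunks = [
    {"text": full_text[0:10], "low_offset": 0, "high_offset": 10},
    {"text": full_text[10:20], "low_offset": 10, "high_offset": 20},
]

print(entity_text(chunks, 0, 9))    # focalidad
print(entity_text(chunks, 12, 20))  # sangrado
```

This version returns an empty string for a span that straddles two chunks, so that case would still need handling if chunks do not overlap.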

T-Almeida added the bug label Nov 7, 2024

T-Almeida assigned richardjonker2000 Nov 7, 2024

T-Almeida mentioned this issue Nov 7, 2024

Customize dataset and classes/label for training #1

Open

T-Almeida mentioned this issue Nov 29, 2024

Inference refactor #5

Open

7 tasks
