This repository contains a set of tools for analyzing named entity recognition (NER) for Emergency Medical Services (EMS) data. The tools are designed to extract and analyze entities related to medical concepts from unstructured text data commonly found in EMS transcriptions.
The `SciSpacyEntityExtractor` tool uses the SciSpacy library to extract entities with any currently available SciSpacy model. It can analyze entities and save them to an HTML file, and it can extract entities from a list of notes and add the results as a new column to a DataFrame.
The `MetaMapEntityExtractor` tool integrates with MetaMap, a program developed by the National Library of Medicine (NLM), to extract concepts from clinical text. It can start the MetaMap servers, extract concepts to a file, and add extracted concepts as a new column to a DataFrame.
The `DataArranger` class is a utility for loading and arranging data from Excel files and writing results back to them. It provides methods to return ground truth information, replace abbreviations, return sentences, and add ground truth columns to the DataFrame. For `DataArranger` to work, the EMS transcription column must be named 'Transcription' and the ground truth column must be named ' NER Ground Truth'. Each ground truth cell should follow the format CATEGORY, BOOL, NAME; CATEGORY, BOOL, NAME; and so on.
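As an illustration of the ground truth format described above, the following minimal sketch splits one cell into its entries. It assumes that entries are separated by semicolons and fields by commas, as the format string suggests. `GroundTruthEntry`, `parse_ground_truth`, and the sample values are hypothetical and are not part of `DataArranger`. The BOOL field is kept as raw text because its encoding is not specified here.

```python
from typing import List, NamedTuple


class GroundTruthEntry(NamedTuple):
    """One hypothetical ground truth entry: CATEGORY, BOOL, NAME."""
    category: str
    flag: str  # the BOOL field, kept as raw text (encoding is an assumption)
    name: str


def parse_ground_truth(cell: str) -> List[GroundTruthEntry]:
    """Split 'CATEGORY, BOOL, NAME; CATEGORY, BOOL, NAME' into entries."""
    entries = []
    for chunk in cell.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon or empty cell
        # Split on the first two commas only, so a NAME may contain commas.
        parts = [part.strip() for part in chunk.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 'CATEGORY, BOOL, NAME', got {chunk!r}")
        entries.append(GroundTruthEntry(*parts))
    return entries


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    example = "SYMPTOM, True, chest pain; MEDICATION, False, aspirin"
    for entry in parse_ground_truth(example):
        print(entry)
```

Raising on a malformed entry, rather than skipping it silently, makes annotation mistakes in the spreadsheet visible early.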
The `Validator` class compares and evaluates the performance of different NLP tools. It calculates precision, recall, and F1 score at both the phrase and token levels from the extracted entities, using the formulas given in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995677/.
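To make the two evaluation levels concrete, here is a minimal sketch of precision, recall, and F1 computed from predicted and gold entity strings. It uses the standard definitions of the three metrics. The matching rules (case-insensitive exact match for phrases, whitespace tokenization for tokens) and the function names are assumptions for illustration, so the actual `Validator` and the cited article may count matches differently.

```python
from collections import Counter
from typing import Iterable, Tuple


def prf(true_positives: int, n_predicted: int, n_gold: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def phrase_level(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Score whole phrases: a prediction counts only if it matches a gold phrase exactly."""
    pred_set = {phrase.strip().lower() for phrase in predicted}
    gold_set = {phrase.strip().lower() for phrase in gold}
    return prf(len(pred_set & gold_set), len(pred_set), len(gold_set))


def token_level(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Score individual tokens, so partially correct phrases earn partial credit."""
    pred_tokens = Counter(tok for phrase in predicted for tok in phrase.lower().split())
    gold_tokens = Counter(tok for phrase in gold for tok in phrase.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_tokens & gold_tokens).values())
    return prf(overlap, sum(pred_tokens.values()), sum(gold_tokens.values()))


if __name__ == "__main__":
    # Invented example entities.
    predicted = ["chest pain", "aspirin", "left arm"]
    gold = ["chest pain", "aspirin 325 mg"]
    print("phrase:", phrase_level(predicted, gold))
    print("token: ", token_level(predicted, gold))
```

In this example the phrase-level scores are lower than the token-level ones because "aspirin" only partially matches "aspirin 325 mg", which is the kind of difference the two levels are meant to expose.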
- Configure MetaMap:
  - If using the `MetaMapEntityExtractor`, provide the correct paths to the MetaMap binaries and servers in the `MetaMapEntityExtractor` constructor.
- Start MetaMap Servers:
  - If using MetaMap, start the part-of-speech tagger and word sense disambiguation servers with the `start_servers` method of `MetaMapEntityExtractor`.
- Run the Tools:
  - Use the provided scripts or integrate the tools into your workflow as needed.
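For readers who want to see what starting the two MetaMap servers involves, the sketch below builds and runs the control-script commands from a MetaMap installation directory. It assumes a standard MetaMap layout in which `base_dir` points at the `public_mm` directory containing `bin/skrmedpostctl` (part-of-speech tagger) and `bin/wsdserverctl` (word sense disambiguation). The function names are hypothetical, and this is not necessarily how `start_servers` is implemented in this repository.

```python
import subprocess
from pathlib import Path
from typing import List


def server_commands(base_dir: str) -> List[List[str]]:
    """Build the start commands for the two MetaMap servers.

    Assumes base_dir is the MetaMap 'public_mm' directory.
    """
    bin_dir = Path(base_dir) / "bin"
    return [
        [str(bin_dir / "skrmedpostctl"), "start"],  # part-of-speech tagger
        [str(bin_dir / "wsdserverctl"), "start"],   # word sense disambiguation
    ]


def start_metamap_servers(base_dir: str) -> None:
    """Run each start command and raise if a control script fails."""
    for command in server_commands(base_dir):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call start_metamap_servers() on a
    # machine where MetaMap is actually installed.
    for command in server_commands("your_metamap_directory"):
        print(" ".join(command))
```

The servers can take some time to become ready after the scripts return, so it may be worth waiting briefly before sending the first note to MetaMap.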
- Ensure you have a dataset in an Excel file.
- Instantiate the `DataArranger` class in `dataloader.py`:

      data_loader = DataArranger(file_path_to_xlsx="your_dataset.xlsx", names=True)

- Add a ground truth column to your dataset:

      data_loader.add_ground_truth_column(column_name="GroundTruthNames")

- Add a new column to your dataset using a DataFrame:

      import pandas as pd

      # Example DataFrame creation
      df = pd.DataFrame(data={'NewColumnName': ['value1', 'value2', ...]})
      data_loader.add_column(df=df, col_name="NewColumnName")
- Install MetaMap and its dependencies.
- Instantiate the `MetaMapEntityExtractor` class in `concept_extractor_metamap.py`:

      metamap_extractor = MetaMapEntityExtractor(base_dir="your_metamap_directory")

- Start MetaMap servers:

      metamap_extractor.start_servers()

- Extract concepts and save them to a DataFrame:

      your_note_list = [...]  # List of notes to extract concepts from
      df_concepts = metamap_extractor.extract_concepts_as_df(note_list=your_note_list, col_name="Concepts")
      df_concepts.to_excel("metamap_concepts_output.xlsx")
- Instantiate the `SciSpacyEntityExtractor` class in `concept_extractor_scispacy.py`:

      scispacy_extractor = SciSpacyEntityExtractor(model_name="en_core_sci_sm")

- Analyze entities and save them to a DataFrame:

      your_note_list = [...]  # List of notes to extract entities from
      df_entities = scispacy_extractor.extract_entities_as_df(note_list=your_note_list, col_name="Entities")
      df_entities.to_excel("scispacy_entities_output.xlsx")
- Instantiate the `Validator` class in `validator.py` to evaluate model predictions against the NER ground truth:

      checker = Validator(col_name="your_column_name", gold_label="your_ground_truth_column", file_path_to_xlsx_of_data="your_dataset.xlsx")

  - `col_name`: The column containing model predictions.
  - `gold_label`: The column containing the ground truth.
  - `file_path_to_xlsx_of_data`: Path to the dataset Excel file.
- Print debug information (optional):

      checker.print_debug()

- Print the calculated scores and add them as new columns to the existing Excel file:

      checker.print_scores()
      checker.add_columns_to_excel(new_file_path="output_with_scores.xlsx")
Make sure to replace placeholders such as `"your_column_name"`, `"your_ground_truth_column"`, `"your_dataset.xlsx"`, etc., with the actual values for your project, and adjust the example usage accordingly.