Description: This program extracts and analyzes symptoms from clinical text documents using two tools: cTAKES and SciSpaCy. It includes scripts for data loading, processing, extraction, and validation, and currently supports datasets from three websites: MayoClinic, ODEMSA, and Wikipedia.
A shell script to run the cTAKES clinical pipeline on a specified input directory of text files. It requires cTAKES_HOME, INPUT_DIR, TARGET_DIR, and an API_KEY as command line arguments.
A Python script containing a `DataLoader` class with methods for converting data formats such as JSON and XMI. It also includes functionality for organizing data into specific directories.
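As an illustration of the JSON-to-text conversion step, here is a minimal sketch. It is not the project's `DataLoader`; the function name `json_to_text_files` and the field names `id` and `text` are hypothetical, and the input is assumed to be a JSON list of objects.

```python
import json
from pathlib import Path


def json_to_text_files(json_path, out_dir, id_key="id", text_key="text"):
    """Write one .txt file per JSON record and return the paths written.

    Assumes the JSON file holds a list of objects. id_key and text_key are
    hypothetical field names, not the project's actual schema.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target directory if missing

    with open(json_path, encoding="utf-8") as fh:
        records = json.load(fh)

    written = []
    for record in records:
        target = out_dir / f"{record[id_key]}.txt"
        target.write_text(record[text_key], encoding="utf-8")
        written.append(target)
    return written
```

Writing one plain-text file per record matches the directory-of-text-files input that the cTAKES script expects.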
A Python script attempting to use the cTAKES pipeline for symptom extraction. Currently disabled due to an issue with the shell command.
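The shell command issue is not described here, so the following is only a sketch of one common way to call a shell script from Python. It assumes the script is `run_ctakes_cpe.sh` and that it takes its four arguments positionally in the order listed above; the helper names are hypothetical.

```python
import subprocess


def build_ctakes_command(ctakes_home, input_dir, target_dir, api_key,
                         script="run_ctakes_cpe.sh"):
    """Return the argument list for the cTAKES shell script.

    The positional order is an assumption taken from the order in which
    the arguments are listed in this README.
    """
    return ["bash", script, ctakes_home, input_dir, target_dir, api_key]


def run_ctakes(ctakes_home, input_dir, target_dir, api_key):
    """Run the script and raise CalledProcessError on a non-zero exit code."""
    cmd = build_ctakes_command(ctakes_home, input_dir, target_dir, api_key)
    # Passing a list (no shell=True) avoids quoting problems with spaces in paths.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the command as a list rather than a single string means each argument reaches the script unchanged, which is often where shell invocations from Python go wrong.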
A Python script using SciSpaCy to extract symptoms. It converts JSON data to text and then extracts symptoms using a specified SciSpaCy model.
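A rough sketch of entity extraction with a SciSpaCy model follows. It is an illustration rather than the script's actual code: the helper names are hypothetical, `en_core_sci_sm` is just one of the published SciSpaCy models, and a general biomedical model returns all detected entities, so any filtering down to symptoms would need an extra step that is not shown.

```python
def load_scispacy_model(name="en_core_sci_sm"):
    """Load a SciSpaCy model; requires scispacy and the model package to be installed."""
    import spacy  # imported lazily so extract_entities can be used without spaCy
    return spacy.load(name)


def extract_entities(nlp, text):
    """Return the unique, lowercased entity strings found in text, in order of appearance."""
    doc = nlp(text)
    seen = set()
    entities = []
    for ent in doc.ents:
        phrase = ent.text.strip().lower()
        if phrase and phrase not in seen:
            seen.add(phrase)
            entities.append(phrase)
    return entities
```

With the model installed, usage would look like `extract_entities(load_scispacy_model(), "Patient reports fever and cough.")`.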
A Python script containing a `Validator` class to compare the output of the symptom extraction tools (cTAKES and SciSpaCy) against gold standard labels. It calculates precision, recall, and F1 scores at both the phrase and token levels.
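The two scoring levels can be sketched as follows. This is an assumption about how the metrics might be computed, not the `Validator` implementation: phrases are compared as case-insensitive exact matches, tokens are whitespace-separated, and both levels use set overlap. All function names are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 between two collections treated as sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def _normalize(phrases):
    """Lowercase and strip each phrase."""
    return {p.strip().lower() for p in phrases}


def _tokens(phrases):
    """Collect the whitespace-separated tokens of all phrases."""
    return {tok for p in phrases for tok in p.lower().split()}


def phrase_level(predicted_phrases, gold_phrases):
    """Score exact matches of whole phrases."""
    return precision_recall_f1(_normalize(predicted_phrases), _normalize(gold_phrases))


def token_level(predicted_phrases, gold_phrases):
    """Score overlap of the individual tokens inside the phrases."""
    return precision_recall_f1(_tokens(predicted_phrases), _tokens(gold_phrases))
```

For example, a predicted "chest pain" against a gold "severe chest pain" counts as a miss at the phrase level but still earns credit for "chest" and "pain" at the token level, which is why the two scores can differ substantially.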
- Ensure that cTAKES is properly installed and configured.
- Install required Python dependencies using `pip install -r requirements.txt`.
- Run the `run_ctakes_cpe.sh` script to process the input clinical text files using cTAKES. Provide the required command line arguments: cTAKES_HOME, INPUT_DIR, TARGET_DIR, and API_KEY.
- Run the `symptom_extractor_scispacy.py` script to extract symptoms using SciSpaCy. Adjust the input file path and output file path as needed.
- Run the `validator.py` script to compare the output of cTAKES and SciSpaCy against gold standard labels. Provide the paths to the gold label JSON, test JSON, and the tool being validated (either "cTAKES" or "SciSpaCy").
- cTAKES (clinical Text Analysis and Knowledge Extraction System)
- Python 3.x
- SciSpaCy
- Ensure cTAKES is correctly installed and configured.
- Install Python dependencies using `pip install -r requirements.txt`.
- Execute the `run_ctakes_cpe.sh` script to process clinical text files with cTAKES.
- Run the `symptom_extractor_scispacy.py` script to extract symptoms using SciSpaCy.
- Validate the results using the `validator.py` script, providing the necessary input paths.
Feel free to contribute to the project by submitting bug reports, feature requests, or pull requests.
This project is licensed under the MIT License.