Integration of spaCy's components with Large Language Models (LLMs) to boost text processing, entity extraction, NER, and summarization. Includes unit and integration tests, fixtures, and samples. Enables NLP pipelines with Large Language Models (LLMs), combining spaCy's supervised learning or rule-based components with LLM-powered features.
The installation steps suit a specific configuration:
- macOS/OSX
- ARM/M1, -
conda
- CPU
- Virtual environment
- English
- Efficiency
Refer to spaCy Quickstart β©© for other configurations.
Activate a virtual environment and install spaCy:
Terminal:
conda create -n venv
conda activate venv
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
python -m spacy validate
en_core_web_sm
: A small English model trained on web text.en_core_web_trf
: For accuracy, use a transformer-based model.
To use the transformer-based model:
python -m spacy download en_core_web_trf
See spaCy download method β©© and spaCy models β©© for more details.
pytest src/test.py
python src/main.py
python src/get_top_ranked_phrases.py
load_model()
: Loads the spaCy model. Returns the model. Example:spacy.load("en_core_web_sm")
process_text_returns_expected_tuples(nlp, text)
: Loads the spaCy model, processes text, and returns expected tuples. Example:[(token, POS, dependency)]
extract_entities_returns_expected_entity_tuples(nlp, text)
: Identifies named entities in text and returns expected entity tuples. Example:[(entity, label)]
summarize_text_returns_expected_summary(nlp, text)
: Generates a summary of text by extracting important phrases. Example: 'summary'get_top_ranked_phrases(text)
: Extracts top-ranked phrases from text and returns expected phrases. Example:[(phrase, rank)]
@pytest.fixture
textrank
pytextrank
pytest
butyrate_text = """Trivia: The bacterium Faecalibacterium prausnitzii in the human gut microbiome is responsible for producing butyrate, a short-chain fatty acid.
Explanation: Faecalibacterium prausnitzii utilizes complex carbohydrates, such as dietary fiber, as its primary energy source. Through a fermentation process, it breaks down these carbohydrates into smaller molecules, including butyrate. Butyrate has beneficial effects on gut health, serving as an energy source for colon cells, promoting their growth, maintaining the gut barrier integrity, and reducing inflammation. Faecalibacterium prausnitzii's ability to produce butyrate highlights its importance in maintaining a healthy gut microbiome."""
geosynchronization_text = """Trivia: The concept of geosynchronization was first postulated by Arthur C. Clarke.
Explanation: Geosynchronous orbits are orbits around Earth that have an orbital period matching Earth's rotation period. This results in the satellite appearing stationary with respect to a point on Earth's surface. This concept is crucial in space physics and geodesy, as it is used in various applications like communication satellites. Arthur C. Clarke, a British science fiction writer, was the first to postulate this concept, which is why geosynchronous orbits are sometimes referred to as Clarke orbits."""
- Optimize LLM integration
- Extend models
- API development
- Testing
- Dockerization
To contribute, fork the repository, implement changes, run tests, and submit a pull request. We appreciate and support collaborations.
- Forgetfulness
- Momentum
- Extraction
- Dependency parsing
- spaCy evaluate
- NER
- π€ Huggingface transformers
- π¦ spaCy-LLM
- Memory
- Redis
- System stability