
Comparing changes

Choose two branches to see what's changed or to start a new pull request. If you need to, you can also compare across forks or learn more about diff comparisons.

base repository: severinsimmler/chaine
base: v0.2.2
head repository: severinsimmler/chaine
compare: master
Showing with 9,379 additions and 4,072 deletions.
  1. +32 −0 .github/workflows/main.yml
  2. +2 −4 .gitignore
  3. +1 −0 .python-version
  4. +307 −15 README.md
  5. +46 −37 build.py
  6. +1 −1 chaine/__init__.py
  7. +271 −0 chaine/_core/crf.pyx
  8. 0 chaine/{ → _core}/crfsuite/COPYING
  9. 0 chaine/{ → _core}/crfsuite/README
  10. +1,077 −0 chaine/_core/crfsuite/include/crfsuite.h
  11. +1 −1 chaine/{ → _core}/crfsuite/include/crfsuite.hpp
  12. +406 −0 chaine/_core/crfsuite/include/crfsuite_api.hpp
  13. 0 chaine/{ → _core}/crfsuite/include/os.h
  14. 0 chaine/{ → _core}/crfsuite/lib/cqdb/COPYING
  15. +10 −10 chaine/{ → _core}/crfsuite/lib/cqdb/include/cqdb.h
  16. +1 −1 chaine/{ → _core}/crfsuite/lib/cqdb/src/cqdb.c
  17. +13 −13 chaine/{ → _core}/crfsuite/lib/cqdb/src/lookup3.c
  18. 0 chaine/{ → _core}/crfsuite/lib/cqdb/src/main.c
  19. +2 −1 chaine/{ → _core}/crfsuite/lib/crf/src/crf1d.h
  20. 0 chaine/{ → _core}/crfsuite/lib/crf/src/crf1d_context.c
  21. +2 −19 chaine/{ → _core}/crfsuite/lib/crf/src/crf1d_encode.c
  22. +0 −2 chaine/{ → _core}/crfsuite/lib/crf/src/crf1d_feature.c
  23. +46 −75 chaine/{ → _core}/crfsuite/lib/crf/src/crf1d_model.c
  24. +11 −3 chaine/{ → _core}/crfsuite/lib/crf/src/crf1d_tag.c
  25. +0 −19 chaine/{ → _core}/crfsuite/lib/crf/src/crfsuite.c
  26. +1 −1 chaine/{ → _core}/crfsuite/lib/crf/src/crfsuite_internal.h
  27. +0 −2 chaine/{ → _core}/crfsuite/lib/crf/src/crfsuite_train.c
  28. 0 chaine/{ → _core}/crfsuite/lib/crf/src/dataset.c
  29. 0 chaine/{ → _core}/crfsuite/lib/crf/src/dictionary.c
  30. 0 chaine/{ → _core}/crfsuite/lib/crf/src/holdout.c
  31. +1,497 −0 chaine/_core/crfsuite/lib/crf/src/json.c
  32. +120 −0 chaine/_core/crfsuite/lib/crf/src/json.h
  33. +1 −12 chaine/{ → _core}/crfsuite/lib/crf/src/logging.c
  34. 0 chaine/{ → _core}/crfsuite/lib/crf/src/logging.h
  35. 0 chaine/{ → _core}/crfsuite/lib/crf/src/params.c
  36. 0 chaine/{ → _core}/crfsuite/lib/crf/src/params.h
  37. 0 chaine/{ → _core}/crfsuite/lib/crf/src/quark.c
  38. 0 chaine/{ → _core}/crfsuite/lib/crf/src/quark.h
  39. +71 −71 chaine/{ → _core}/crfsuite/lib/crf/src/rumavl.c
  40. +21 −21 chaine/{ → _core}/crfsuite/lib/crf/src/rumavl.h
  41. +3 −19 chaine/{ → _core}/crfsuite/lib/crf/src/train_arow.c
  42. +3 −15 chaine/{ → _core}/crfsuite/lib/crf/src/train_averaged_perceptron.c
  43. +12 −52 chaine/{ → _core}/crfsuite/lib/crf/src/train_l2sgd.c
  44. +5 −32 chaine/{ → _core}/crfsuite/lib/crf/src/train_lbfgs.c
  45. +3 −19 chaine/{ → _core}/crfsuite/lib/crf/src/train_passive_aggressive.c
  46. 0 chaine/{ → _core}/crfsuite/lib/crf/src/vecmath.h
  47. 0 chaine/{ → _core}/crfsuite/swig/crfsuite.cpp
  48. +2 −2 chaine/{ → _core}/crfsuite_api.pxd
  49. 0 chaine/{ → _core}/liblbfgs/COPYING
  50. 0 chaine/{ → _core}/liblbfgs/README
  51. +109 −109 chaine/{ → _core}/liblbfgs/include/lbfgs.h
  52. 0 chaine/{ → _core}/liblbfgs/lib/arithmetic_ansi.h
  53. 0 chaine/{ → _core}/liblbfgs/lib/arithmetic_sse_double.h
  54. 0 chaine/{ → _core}/liblbfgs/lib/arithmetic_sse_float.h
  55. +2 −2 chaine/{ → _core}/liblbfgs/lib/lbfgs.c
  56. +24 −2 chaine/{ → _core}/tagger_wrapper.hpp
  57. 0 chaine/{ → _core}/trainer_wrapper.cpp
  58. 0 chaine/{ → _core}/trainer_wrapper.hpp
  59. +505 −0 chaine/crf.py
  60. +0 −544 chaine/crf.pyx
  61. +0 −1,069 chaine/crfsuite/include/crfsuite.h
  62. +0 −406 chaine/crfsuite/include/crfsuite_api.hpp
  63. +0 −68 chaine/data.py
  64. +127 −83 chaine/logging.py
  65. +10 −0 chaine/optimization/__init__.py
  66. +129 −0 chaine/optimization/metrics.py
  67. +394 −0 chaine/optimization/spaces.py
  68. +103 −0 chaine/optimization/trial.py
  69. +119 −0 chaine/optimization/utils.py
  70. +169 −17 chaine/training.py
  71. +10 −8 chaine/typing.py
  72. +43 −0 chaine/validation.py
  73. +1,226 −0 examples/notebooks/ner.ipynb
  74. +112 −0 examples/scripts/README.md
  75. +28 −0 examples/scripts/optimization.py
  76. +47 −0 examples/scripts/training.py
  77. +114 −0 examples/scripts/utils.py
  78. +0 −115 notebooks/tutorial.ipynb
  79. +1,828 −1,000 poetry.lock
  80. +18 −17 pyproject.toml
  81. +82 −129 tests/test_crf.py
  82. +0 −13 tests/test_data.py
  83. +5 −40 tests/test_logging.py
  84. +68 −0 tests/test_optimization_metrics.py
  85. +43 −0 tests/test_optimization_trial.py
  86. +56 −0 tests/test_optimization_utils.py
  87. +20 −3 tests/test_training.py
  88. +20 −0 tests/test_validation.py
32 changes: 32 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,32 @@
name: CI & CD

on:
  push:
    tags: ["*"]

jobs:
  build:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        name: Install Python
        with:
          python-version: "3.13"

      - name: Install Python tools
        run: pip install poetry cibuildwheel

      - name: Build wheels
        run: cibuildwheel --output-dir dist

      - name: Deploy wheels
        run: |
          poetry config pypi-token.pypi ${{ secrets.PYPI_TOKEN }}
          poetry publish
6 changes: 2 additions & 4 deletions .gitignore
@@ -80,9 +80,6 @@ target/
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
@@ -124,5 +121,6 @@ venv.bak/

.vscode

chaine/crf.cpp
*.crf
*.chaine
hyperparameter-optimization.json
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.13.1
322 changes: 307 additions & 15 deletions README.md
100644 → 100755
@@ -1,39 +1,331 @@
# Chaine

Linear-chain conditional random fields for natural language processing.
[![downloads](https://static.pepy.tech/personalized-badge/chaine?period=total&units=international_system&left_color=black&right_color=black&left_text=downloads)](https://pepy.tech/project/chaine)
[![downloads/month](https://static.pepy.tech/personalized-badge/chaine?period=month&units=abbreviation&left_color=black&right_color=black&left_text=downloads/month)](https://pepy.tech/project/chaine)
[![downloads/week](https://static.pepy.tech/personalized-badge/chaine?period=week&units=abbreviation&left_color=black&right_color=black&left_text=downloads/week)](https://pepy.tech/project/chaine)

Chaine is a modern, fast and lightweight Python library implementing **linear-chain conditional random fields**. Use it for sequence labeling tasks like [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) or [part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging).

The main goals of this project are:

- **Usability**: Designed with special focus on usability and a beautiful high-level API.
- **Efficiency**: Performance critical parts are written in C and thus [blazingly fast](http://www.chokkan.org/software/crfsuite/benchmark.html). Loading a model from disk and retrieving feature weights for inference is optimized for both [speed and memory](http://www.chokkan.org/software/cqdb/).
- **Persistency**: No `pickle` or `joblib` is used for serialization. A trained model will be compatible with all versions for eternity, because the underlying C library will not change. I promise.
- **Compatibility**: There are wheels for Linux, macOS and Windows. No compiler needed.
- **Minimalism**: No code bloat, no external dependencies.

Install the latest stable version from [PyPI](https://pypi.org/project/chaine):

```
pip install chaine
```

### Table of contents

- [Algorithms](#algorithms)
- [Usage](#usage)
- [Features](#features)
- [Training](#training)
- [Hyperparameters](#hyperparameters)
- [Inference](#inference)
- [Weights](#weights)
- [Credits](#credits)

## Algorithms

You can train models using the following methods:

- Limited-Memory BFGS ([Nocedal 1980](https://www.jstor.org/stable/2006193))
- Orthant-Wise Limited-Memory Quasi-Newton ([Andrew et al. 2007](https://www.microsoft.com/en-us/research/publication/scalable-training-of-l1-regularized-log-linear-models/))
- Stochastic Gradient Descent ([Shalev et al. 2007](https://www.cs.huji.ac.il/~shais/papers/ShalevSiSr07.pdf))
- Averaged Perceptron ([Collins 2002](https://aclanthology.org/W02-1001.pdf))
- Passive Aggressive ([Crammer et al. 2006](https://jmlr.csail.mit.edu/papers/v7/crammer06a.html))
- Adaptive Regularization of Weight Vectors ([Mejer et al. 2010](https://aclanthology.org/D10-1095.pdf))

Please refer to the paper by [Lafferty et al.](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers) for a general introduction to **conditional random fields** or the respective chapter in [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/8.pdf).
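
For illustration, here is a minimal sketch of selecting one of these methods, assuming `chaine.train()` accepts an `algorithm` keyword (the same key that appears in the hyperparameter reports further below):

```python
>>> import chaine
>>> # Assumed keyword: "algorithm" mirrors the key in the optimization reports.
>>> model = chaine.train(tokens, labels, algorithm="l2sgd")
```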

## Usage

Training and using a **conditional random field** for inference is as easy as:

```python
>>> import chaine
>>> tokens = [[{"index": 0, "text": "John"}, {"index": 1, "text": "Lennon"}]]
>>> labels = [["B-PER", "I-PER"]]
>>> model = chaine.train(tokens, labels)
>>> model.predict(tokens)
[['B-PER', 'I-PER']]
```

> You can control verbosity with the argument `verbose`, where `0` will set the log level to `ERROR`, `1` to `INFO` (which is the default) and `2` to `DEBUG`.
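
For example, to get the most detailed log output during training:

```python
>>> model = chaine.train(tokens, labels, verbose=2)
```
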
### Features

One token in a sequence is represented as a dictionary with feature names as keys and feature values of type string, integer, float or boolean:

```python
{
    "text": "John",
    "num_characters": 4,
    "relative_index": 0.0,
    "is_number": False,
}
```

One sequence is represented as a list of feature dictionaries:

```python
[
    {"text": "John", "num_characters": 4},
    {"text": "Lennon", "num_characters": 6}
]
```

One data set is represented as an iterable of lists of feature dictionaries:

```python
[
    [
        {"text": "John", "num_characters": 4},
        {"text": "Lennon", "num_characters": 6}
    ],
    [
        {"text": "Paul", "num_characters": 4},
        {"text": "McCartney", "num_characters": 9}
    ],
    ...
]
```

This is the expected input format for training. For inference, you can also process a single sequence rather than a batch of multiple sequences.

#### Generators

Depending on the size of your data set, it probably makes sense to use generators. Something like this would be totally fine for both training and inference:

```python
([extract_features(token) for token in tokens] for tokens in dataset)
```

Assuming `dataset` is a generator as well, only one sequence is loaded into memory at a time.
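
The `extract_features` helper above is not part of the library. A minimal sketch of what such a function could look like (names and features are purely illustrative):

```python
def extract_features(token: str) -> dict:
    # Illustrative only: map one raw token to a feature dictionary
    # in the format chaine expects (string, int, float or bool values).
    return {
        "text": token.lower(),
        "num_characters": len(token),
        "is_capitalized": token.istitle(),
        "is_number": token.isdigit(),
    }
```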

### Training

You can either use the high-level function to train a model (which also loads and returns it):

```python
>>> import chaine
>>> chaine.train(tokens, labels)
```

or the lower-level `Trainer` class:

```python
>>> from chaine import Trainer
>>> trainer = Trainer()
```

A `Trainer` object has a method `train()` to learn states and transitions from the given data set. You have to provide a filepath to serialize the model to:

```python
>>> trainer.train(tokens, labels, model_filepath="model.chaine")
```

### Hyperparameters

Before training a model, you might want to find out the ideal hyperparameters first. You can just set the respective argument to `True`:

```python
>>> import chaine
>>> tokens = [["John", "Lennon", "was", "born", "in" "Liverpool"]]
>>> labels = [["B-PER", "I-PER", "O", "O", "O", "B-LOC"]]
>>> model = chaine.train(tokens, labels, max_iterations=5)
>>> model = chaine.train(tokens, labels, optimize_hyperparameters=True)
```

> This might be very memory and time consuming, because 5-fold cross validation is performed for each of the 10 trials for each algorithm.
or use the `HyperparameterOptimizer` class and have more control over the optimization process:

```python
>>> from chaine import HyperparameterOptimizer
>>> from chaine.optimization import L2SGDSearchSpace
>>> optimizer = HyperparameterOptimizer(trials=50, folds=3, spaces=[L2SGDSearchSpace()])
>>> optimizer.optimize_hyperparameters(tokens, labels, sample_size=1000)
```

This will run 50 trials with 3-fold cross validation for the Stochastic Gradient Descent algorithm and return a list of hyperparameter sets sorted by their evaluation stats. The given data set is downsampled to 1,000 instances.
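
Since the optimizer returns a sorted list, you can persist it for later inspection; a small sketch, assuming the returned list is JSON-serializable:

```python
>>> import json
>>> results = optimizer.optimize_hyperparameters(tokens, labels, sample_size=1000)
>>> with open("hyperparameter-optimization.json", "w") as f:
...     json.dump(results, f, indent=2)
```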

<details>
<summary>Example of a hyperparameter optimization report</summary>

```json
[
  {
    "hyperparameters": {
      "algorithm": "lbfgs",
      "min_freq": 0,
      "all_possible_states": true,
      "all_possible_transitions": true,
      "num_memories": 8,
      "c1": 0.9,
      "c2": 0.31,
      "epsilon": 0.00011,
      "period": 17,
      "delta": 0.00051,
      "linesearch": "Backtracking",
      "max_linesearch": 31
    },
    "stats": {
      "mean_precision": 0.4490952380952381,
      "stdev_precision": 0.16497993418839532,
      "mean_recall": 0.4554858934169279,
      "stdev_recall": 0.20082402876210334,
      "mean_f1": 0.45041435392087253,
      "stdev_f1": 0.17914435056760908,
      "mean_time": 0.3920876979827881,
      "stdev_time": 0.0390961164333519
    }
  },
  {
    "hyperparameters": {
      "algorithm": "lbfgs",
      "min_freq": 5,
      "all_possible_states": true,
      "all_possible_transitions": false,
      "num_memories": 9,
      "c1": 1.74,
      "c2": 0.09,
      "epsilon": 0.0008600000000000001,
      "period": 1,
      "delta": 0.00045000000000000004,
      "linesearch": "StrongBacktracking",
      "max_linesearch": 34
    },
    "stats": {
      "mean_precision": 0.4344436335328176,
      "stdev_precision": 0.15542689556199216,
      "mean_recall": 0.4385174258109041,
      "stdev_recall": 0.19873733310765845,
      "mean_f1": 0.43386496201052716,
      "stdev_f1": 0.17225578421967264,
      "mean_time": 0.12209572792053222,
      "stdev_time": 0.0236177196325414
    }
  },
  {
    "hyperparameters": {
      "algorithm": "lbfgs",
      "min_freq": 2,
      "all_possible_states": true,
      "all_possible_transitions": true,
      "num_memories": 1,
      "c1": 0.91,
      "c2": 0.4,
      "epsilon": 0.0008400000000000001,
      "period": 13,
      "delta": 0.00018,
      "linesearch": "MoreThuente",
      "max_linesearch": 43
    },
    "stats": {
      "mean_precision": 0.41963433149859447,
      "stdev_precision": 0.16363544501259455,
      "mean_recall": 0.4331173486012196,
      "stdev_recall": 0.21344965207006913,
      "mean_f1": 0.422038027332145,
      "stdev_f1": 0.18245844823319127,
      "mean_time": 0.2586916446685791,
      "stdev_time": 0.04341208573100539
    }
  },
  {
    "hyperparameters": {
      "algorithm": "l2sgd",
      "min_freq": 5,
      "all_possible_states": true,
      "all_possible_transitions": true,
      "c2": 1.68,
      "period": 2,
      "delta": 0.00047000000000000004,
      "calibration_eta": 0.0006900000000000001,
      "calibration_rate": 2.9000000000000004,
      "calibration_samples": 1400,
      "calibration_candidates": 25,
      "calibration_max_trials": 23
    },
    "stats": {
      "mean_precision": 0.2571428571428571,
      "stdev_precision": 0.43330716823151716,
      "mean_recall": 0.01,
      "stdev_recall": 0.022360679774997897,
      "mean_f1": 0.01702127659574468,
      "stdev_f1": 0.038060731531911314,
      "mean_time": 0.15442829132080077,
      "stdev_time": 0.051750737506044905
    }
  }
]
```
</details>

### Inference

The high-level function `chaine.train()` returns a `Model` object. You can load an already trained model from disk by initializing a `Model` object with the model's filepath:

```python
>>> from chaine import Model
>>> model = Model("model.chaine")
```

You can predict labels for a batch of sequences:

```python
>>> tokens = [
...     [{"index": 0, "text": "John"}, {"index": 1, "text": "Lennon"}],
...     [{"index": 0, "text": "Paul"}, {"index": 1, "text": "McCartney"}],
...     [{"index": 0, "text": "George"}, {"index": 1, "text": "Harrison"}],
...     [{"index": 0, "text": "Ringo"}, {"index": 1, "text": "Starr"}]
... ]
>>> model.predict(tokens)
[['B-PER', 'I-PER'], ['B-PER', 'I-PER'], ['B-PER', 'I-PER'], ['B-PER', 'I-PER']]
```

or only for a single sequence:

```python
>>> model.predict_single(tokens[0])
['B-PER', 'I-PER']
```

If you are interested in the model's probability distribution for a given sequence, you can get per-token probabilities:

```python
>>> model.predict_proba_single(tokens[0])
[[{'B-PER': 0.99, 'I-PER': 0.01}, {'B-PER': 0.01, 'I-PER': 0.99}]]
```

> Use the `model.predict_proba()` method for a batch of sequences.
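
If you need hard labels from these distributions, taking the most probable label per token is a straightforward post-processing step; a sketch based on the output format shown above:

```python
>>> probabilities = model.predict_proba_single(tokens[0])
>>> [max(distribution, key=distribution.get) for distribution in probabilities[0]]
['B-PER', 'I-PER']
```
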
### Weights

After loading a trained model, you can inspect the learned transition and state weights:

```python
>>> model = Model("model.chaine")
>>> model.transitions
[{'from': 'B-PER', 'to': 'I-PER', 'weight': 1.430506540616852e-06}]
>>> model.states
[{'feature': 'text:John', 'label': 'B-PER', 'weight': 9.536710877105517e-07}, ...]
```

You can also dump both transition and state weights as JSON:

```python
>>> model.dump_states("states.json")
>>> model.dump_transitions("transitions.json")
```
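
The dumped files are plain JSON, so you can inspect them with the standard library; for example (assuming the on-disk format matches the lists shown above):

```python
>>> import json
>>> with open("states.json") as f:
...     states = json.load(f)
```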

## Credits

This project makes use of and is partially based on:

- [CRFsuite](https://github.com/chokkan/crfsuite)
- [libLBFGS](https://github.com/chokkan/liblbfgs)