IntPiece is an efficient integer-sequence compression algorithm. It accepts an integer sequence as input and outputs another, shorter sequence of integers: it reduces the sequence length at the cost of a larger vocabulary. As its name suggests, it is derived from Su's BytePiece (more details can be found here). The core algorithm of IntPiece is almost identical to BytePiece's; I only changed a few insignificant lines so that the core code accepts and emits integer sequences instead of strings or bytes. Thanks to the strong performance of BytePiece itself, IntPiece inherits its good properties: lossless reconstruction, a high compression rate, and training-friendliness.
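To make the idea concrete, here is a toy sketch (not IntPiece's actual code) of byte-pair-style compression on an integer sequence: repeatedly merge the most frequent adjacent pair into a new token, which grows the vocabulary while shortening the sequence, and record the merges so the original can be reconstructed losslessly. All names here are illustrative only.

```python
from collections import Counter

def compress(seq, num_merges, first_new_id):
    # Greedily merge the most frequent adjacent pair, num_merges times.
    merges = {}  # new token id -> (left, right) pair it replaces
    next_id = first_new_id
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, nothing left to compress
        merges[next_id] = (a, b)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return seq, merges

def decompress(seq, merges):
    # Expand merged tokens until only original ids remain (lossless).
    out, stack = [], list(reversed(seq))
    while stack:
        t = stack.pop()
        if t in merges:
            a, b = merges[t]
            stack.append(b)
            stack.append(a)  # left half is expanded first
        else:
            out.append(t)
    return out

original = [1, 2, 3, 1, 2, 3, 1, 2, 4]
packed, merges = compress(original, num_merges=2, first_new_id=100)
# packed is shorter than original, and decompress(packed, merges)
# recovers the original sequence exactly.
```

The real algorithm is more sophisticated (and trained over a whole corpus rather than one sequence), but the length-versus-vocabulary trade-off is the same.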
For most autoregressive language models, the computational cost grows linearly with the vocabulary size but quadratically with the sequence length, so shortening sequences is critical for large language models. We therefore prefer a larger vocabulary and shorter sequences. At the same time, our sequences do not consist of explicit words but of implicit tokens. For example, Amazon's BASE TTS system maps audio into integer tokens and then uses an LLM to predict those tokens one by one; using a similar compression algorithm (byte-pair encoding), they obtain about 40% length compression (see their paper).
You can also read the BytePiece documentation, but in practice you only need to run
pip uninstall pyahocorasick
AHOCORASICK_BYTES=1 pip install git+https://github.com/WojciechMula/pyahocorasick.git
to get the bytes-mode build of pyahocorasick. Optionally, run
python setup.py build_ext --inplace
to build the C++ core functions. The C++ version is faster, though not by much; the project also contains a pure-Python version.
As with BytePiece, all of the source code is in a single file, containing two classes, Trainer and Tokenizer, corresponding to training and tokenization (inference) respectively.
You can run
python intpiece --data_path xxx.json --model_path xx.model --train
python intpiece --data_path xxx.json --model_path xx.model
to train or to run inference (evaluation), respectively.
I also exposed some important parameters as command-line arguments, such as order and min_count; use -h to see more details.
For the dataset format, see the Corpus class. It is very easy to use.
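As an illustration only (the authoritative definition is the Corpus class in the source), a corpus of integer sequences might be stored as one JSON list per line and read lazily. The file layout and this reader are assumptions for demonstration, not IntPiece's documented format:

```python
import json

class Corpus:
    """Hypothetical reader: assumes each non-empty line of the data
    file is a JSON list of integers, e.g. [3, 1, 4, 1, 5]."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

# Write a tiny two-sequence corpus and iterate over it.
with open("demo_corpus.json", "w") as f:
    f.write("[3, 1, 4, 1, 5]\n[9, 2, 6]\n")

sequences = list(Corpus("demo_corpus.json"))
```

An iterator-based corpus like this keeps memory usage flat even for large training files, since sequences are parsed one line at a time.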
Finally, I would like to express my gratitude to Su and his project again, especially for his perseverance in open source.