A fast and lightweight Rust library for splitting text into semantically meaningful chunks.

dominictarro/semchunk-rs

semchunk-rs

A port of umarbutler/semchunk into Rust for splitting text into semantically meaningful chunks.

Usage

use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
use semchunk::Chunker;

// Maximum number of tokens allowed in each chunk.
const CHUNK_SIZE: usize = 4;

fn main() {
    // Load the RoBERTa vocabulary and merges files
    // (lower_case = false, add_prefix_space = false).
    let tokenizer = RobertaTokenizer::from_file(
        "data/roberta-base-vocab.json",
        "data/roberta-base-merges.txt",
        false,
        false,
    )
    .unwrap();

    // The chunker only needs a function that counts the tokens in a string.
    let token_counter = Box::new(move |s: &str| tokenizer.tokenize(s).len());
    let chunker = Chunker::new(CHUNK_SIZE, token_counter);

    let text = "The quick brown fox jumps over the lazy dog.";
    let chunks = chunker.chunk(text);

    // Print each chunk with a 1-based index.
    for (i, chunk) in chunks.iter().enumerate() {
        println!("{}) {}", i + 1, chunk);
    }
}

Output:

1) The quick brown fox
2) jumps over the
3) lazy dog.
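
To try the token-counter idea without downloading the RoBERTa vocabulary files, the sketch below swaps in a whitespace word counter and a deliberately naive greedy packer. Both `whitespace_token_count` and `greedy_chunk` are hypothetical names introduced for illustration; this is not the semchunk-rs algorithm or API, only a rough picture of how a counting closure can enforce a token budget per chunk.

```rust
// Hypothetical sketch: NOT the semchunk-rs algorithm or API.
// It only illustrates how a token-counting function can drive a chunk budget.

/// Stand-in token counter: one token per whitespace-separated word.
/// A real setup would count tokens with an actual tokenizer instead.
fn whitespace_token_count(s: &str) -> usize {
    s.split_whitespace().count()
}

/// Greedily pack words into chunks of at most `chunk_size` tokens,
/// as measured by `token_counter`.
fn greedy_chunk<F>(text: &str, chunk_size: usize, token_counter: F) -> Vec<String>
where
    F: Fn(&str) -> usize,
{
    let mut chunks = Vec::new();
    let mut current = String::new();

    for word in text.split_whitespace() {
        // Build the chunk that would result from appending this word.
        let candidate = if current.is_empty() {
            word.to_string()
        } else {
            format!("{current} {word}")
        };

        if current.is_empty() || token_counter(&candidate) <= chunk_size {
            // The word fits, or it starts a new chunk on its own.
            current = candidate;
        } else {
            // Budget exceeded: close the current chunk and start a new one.
            chunks.push(std::mem::take(&mut current));
            current = word.to_string();
        }
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let text = "The quick brown fox jumps over the lazy dog.";
    let chunks = greedy_chunk(text, 4, whitespace_token_count);
    for (i, chunk) in chunks.iter().enumerate() {
        println!("{}) {}", i + 1, chunk);
    }
}
```

Because this stand-in counts words rather than RoBERTa tokens, its chunk boundaries differ from the output shown above: a subword tokenizer can assign more than one token to a word or to trailing punctuation, which changes where the budget runs out.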

Benchmarks 📊

Environment:

| Component | Version |
| --- | --- |
| Rust Version | 1.80.1 |
| Computer | Apple 2022 MacBook Pro |
| Processor | Apple M2 |
| Memory | 24 GB |
| Operating System | macOS Sequoia 15.1.1 |

gutenberg

Benchmarked against the 18 texts of the Gutenberg corpus, which together contain 3,001,260 tokens. Code here.

| Parameter | Value |
| --- | --- |
| Iterations | 100 |
| Chunk Size | 512 |
| Tokenizer | RoBERTa (base) |
| Tokenizer Library | rust_tokenizers (8.1.1) |

| Metric | semchunk-rs |
| --- | --- |
| Mean | 6.2223 s |
| Min | 6.2040 s |
| Max | 6.2431 s |
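
For readers who want to collect comparable mean, min, and max timings on their own machine, here is a minimal timing harness built only on the standard library. It is not the repository's benchmark code: `bench`, `Summary`, and `tokens_per_second` are hypothetical names, and the workload in `main` is a placeholder for a real chunking pass. The throughput calculation assumes that each timed iteration covers the whole 3,001,260-token corpus, which the figures above suggest but do not state outright.

```rust
use std::time::{Duration, Instant};

/// Summary statistics over a set of timed runs.
struct Summary {
    mean: Duration,
    min: Duration,
    max: Duration,
}

/// Run `work` `iterations` times and record mean/min/max wall-clock time.
fn bench<F: FnMut()>(iterations: u32, mut work: F) -> Summary {
    assert!(iterations > 0, "need at least one iteration");
    let mut total = Duration::ZERO;
    let mut min = Duration::MAX;
    let mut max = Duration::ZERO;

    for _ in 0..iterations {
        let start = Instant::now();
        work();
        let elapsed = start.elapsed();
        total += elapsed;
        min = min.min(elapsed);
        max = max.max(elapsed);
    }

    Summary { mean: total / iterations, min, max }
}

/// Tokens processed per second, given a token count and a duration in seconds.
fn tokens_per_second(tokens: u64, seconds: f64) -> f64 {
    tokens as f64 / seconds
}

fn main() {
    // Placeholder workload; a real run would chunk every text in the corpus here.
    let summary = bench(100, || {
        std::hint::black_box((0..10_000u64).sum::<u64>());
    });
    println!(
        "mean {:?}, min {:?}, max {:?}",
        summary.mean, summary.min, summary.max
    );

    // Figures reported above: 3,001,260 tokens and a mean of 6.2223 s.
    // Assumption: one iteration processes the entire corpus.
    println!("{:.0} tokens/s", tokens_per_second(3_001_260, 6.2223));
}
```

Under that assumption, the reported mean works out to a throughput on the order of 480,000 tokens per second, tokenization included.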
