LIBXSL is a machine learning package that leverages large language models (LLMs) for text classification tasks. It provides a flexible and scalable framework for training and evaluating text classification models using state-of-the-art LLMs.
- Easy integration with Hugging Face Transformers
- Support for distributed training with PyTorch
- Customizable loss functions for various classification tasks
- Comprehensive logging and evaluation metrics
To install the package, run:

```bash
pip install libxsl
```
LIBXSL also depends on `pyxclib`, which is distributed via a GitHub repository rather than PyPI. Install it with:

```bash
pip install git+https://github.com/ryaninhust/pyxclib.git
```
To train a model, you need a configuration file (in YAML format) specifying the training parameters, dataset paths, and model configurations. Here's an example configuration file:
```yaml
model_name: "bert-base-uncased"
train_data_file: "path/to/train/data"
test_data_file: "path/to/test/data"
max_length: 128
batch_size: 32
num_epochs: 10
pretrained_lr: 2.0e-5
label_embedding_lr: 1.0e-3
pretrained_weight_decay: 0.01
label_embedding_weight_decay: 0.01
positive_weight: 1.0
loss_fn: "LRLR"
omega: 1.0
kernel_approx: true
log_file_path: "training.log"
model_save_path: "models/model.pth"
prediction_save_path: "outputs/predictions.npy"
```

Note that the learning rates are written as `2.0e-5` and `1.0e-3`: YAML 1.1 parsers such as PyYAML only resolve scientific notation as a float when the mantissa contains a decimal point, so `2e-5` would be loaded as a string.
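Before training, it can be useful to validate that the configuration file parses correctly and contains the fields the scripts expect. The helper below is a sketch, not part of LIBXSL; the `load_config` name and the set of required keys are assumptions for illustration:

```python
import yaml  # PyYAML

# Hypothetical minimal set of keys the training script reads
REQUIRED_KEYS = {"model_name", "train_data_file", "test_data_file",
                 "batch_size", "num_epochs"}

def load_config(path):
    """Load a YAML config and fail fast if required keys are missing."""
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config
```

Failing at load time with a clear message is cheaper than discovering a missing key mid-training.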
The training script loads the configuration, builds the datasets, and launches distributed training and prediction with one worker process per GPU:

```python
import yaml
import torch
import torch.multiprocessing as mp
from transformers import AutoTokenizer

with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

tokenizer = AutoTokenizer.from_pretrained(config['model_name'])
world_size = torch.cuda.device_count()  # one process per available GPU

train_dataset = TextClassificationDataset(config['train_data_file'], tokenizer, config['max_length'])
test_dataset = TextClassificationDataset(config['test_data_file'], tokenizer, config['max_length'],
                                         num_classes=train_dataset.num_classes)

# Spawn distributed workers: first for training, then for prediction
mp.spawn(train, args=(world_size, train_dataset, test_dataset, config), nprocs=world_size, join=True)
mp.spawn(predict, args=(world_size, test_dataset, config), nprocs=world_size, join=True)
```
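After prediction finishes, the scores land in the file named by `prediction_save_path`. Assuming they are stored as a NumPy array of shape `(num_examples, num_labels)` (the `.npy` extension in the example config suggests this, but the exact layout is not documented here), a small helper can extract the top-k predicted labels:

```python
import numpy as np

def top_k_labels(prediction_path, k=5):
    """Load saved scores and return indices of the k highest-scoring labels per example."""
    scores = np.load(prediction_path)          # assumed shape: (num_examples, num_labels)
    return np.argsort(-scores, axis=1)[:, :k]  # negate to sort in descending order
```

For example, `top_k_labels("outputs/predictions.npy", k=5)` would yield one row of label indices per test example.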
You can customize the model and loss functions by editing the corresponding files in the package. For example, to add a new loss function, update `loss_fn.py` and register the new function in the loss function dictionary.
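As a sketch of what such an addition might look like, here is a weighted squared-hinge loss. The exact dictionary layout in `loss_fn.py` is not shown in this document, so the `LOSS_FUNCTIONS` name, the `"SQH"` key, and the function signature are all assumptions for illustration:

```python
import torch

def squared_hinge_loss(logits, labels, positive_weight=1.0):
    """Squared hinge loss for {0, 1} targets, with extra weight on positives
    (mirroring the positive_weight option in the example config)."""
    targets = 2.0 * labels - 1.0                       # map {0, 1} -> {-1, +1}
    margins = torch.clamp(1.0 - targets * logits, min=0.0)
    weights = torch.where(labels > 0,
                          torch.full_like(labels, positive_weight),
                          torch.ones_like(labels))
    return (weights * margins ** 2).mean()

# Hypothetical registry mirroring the dictionary described above
LOSS_FUNCTIONS = {"SQH": squared_hinge_loss}
```

With the function registered, selecting it would be a matter of setting `loss_fn: "SQH"` in the configuration file, assuming the training script looks the name up in this dictionary.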
We welcome contributions to LIBXSL! If you have any ideas, bug reports, or improvements, please submit an issue or a pull request on our GitHub repository.
This project is licensed under the MIT License. See the LICENSE file for more details.