`torchtitan` is designed to work seamlessly with most HuggingFace datasets. While we provide the C4 dataset for numerics and convergence testing, you can easily add support for your own datasets. Here's how to do it, using Wikipedia as an example.
Locate the dataset configuration file: `torchtitan/datasets/hf_datasets/hf_datasets.py`
You'll need to add three components:
- A dataset loader function
- A sample processor function
- A dataset configuration entry
Create a function that specifies how to load your dataset:
```python
def load_wikipedia_dataset(dataset_path: str, **kwargs):
    """Load Wikipedia dataset with specific configuration."""
    logger.info("Loading Wikipedia dataset...")
    return load_dataset(
        dataset_path,
        name="20220301.en",
        split="train",
        streaming=True,
        trust_remote_code=True,
    )
```
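As a quick sanity check (run outside the file you're editing), you can peek at the first streamed sample. This is only a sketch and assumes you have the `datasets` library installed and network access to the HuggingFace Hub:

```python
from datasets import load_dataset

# Same call the loader above makes; streaming avoids downloading the full dump.
ds = load_dataset(
    "wikipedia",
    name="20220301.en",
    split="train",
    streaming=True,
    trust_remote_code=True,
)
first = next(iter(ds))  # streamed samples are dicts with at least "title" and "text"
print(first["title"])
```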
Create a function that processes individual samples from your dataset:
```python
def process_wikipedia_text(sample: Dict[str, Any]) -> str:
    """Process Wikipedia dataset sample text."""
    return f"{sample['title']}\n\n{sample['text']}"
```
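For illustration, given a hypothetical sample, the processor simply joins the title and body with a blank line:

```python
sample = {"title": "PyTorch", "text": "PyTorch is an open source machine learning library."}
print(process_wikipedia_text(sample))
# PyTorch
#
# PyTorch is an open source machine learning library.
```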
Add your dataset configuration to the `DATASETS` dictionary:
```python
DATASETS = {
    # ... existing datasets ...
    "wikipedia": DatasetConfig(
        path="wikipedia",  # default HuggingFace dataset path
        loader=load_wikipedia_dataset,
        text_processor=process_wikipedia_text,
    ),
}
```
In your training configuration file (`.toml`), set your dataset:

```toml
dataset = "wikipedia"
```
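For context, in the bundled training configs this option sits under the `[training]` table; if your config is laid out differently, treat the section name below as an assumption rather than a requirement:

```toml
[training]
dataset = "wikipedia"  # assumed to live under [training], as in the sample configs
```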
That's it! Your custom dataset is now ready to use with `torchtitan`.
- The `DatasetConfig` contains all necessary components for a dataset (see the sketch after this list):
  - `path`: the default path to the dataset (can be overridden during training)
  - `loader`: function to load the dataset
  - `text_processor`: function to process individual samples
- The loader function should return a HuggingFace dataset object
- The processor function should return a string that combines the relevant fields from your dataset
- Use `streaming=True` for large datasets to manage memory efficiently
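For orientation, here is a minimal sketch of how the three pieces fit together. The exact `DatasetConfig` definition and the way torchtitan's dataloader invokes it may differ, so treat this as an illustration rather than the library's actual code:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class DatasetConfig:
    path: str                                          # default HuggingFace dataset path
    loader: Callable[..., Any]                         # returns a HuggingFace dataset object
    text_processor: Callable[[Dict[str, Any]], str]    # turns one sample into a string

# Roughly what happens at training time: the registered loader is called,
# then each sample is flattened to text before tokenization.
config = DATASETS["wikipedia"]
ds = config.loader(config.path)
for sample in ds:
    text = config.text_processor(sample)
    # ... tokenize `text` and feed it to the model ...
    break
```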
Now you can start training with your custom dataset!