Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Optimized the distributed mode performance and usability with more automatic features.
DJ-Operators
- extract_support_text_mapper, relation_identity_mapper, python_file_mapper, #500
- naive_grouper, key_value_grouper, #500
- nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, #500
- video_extract_frames_mapper, #507
Performance
- Optimize Ray mode performance, #442
- Patch for Performance Benchmark in CI/CD workflows, #506
- DJ Ray mode supports streaming loading of jsonl files, #515
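
For readers unfamiliar with the term, streaming loading means records are read and handed on in small batches instead of loading the whole file into memory first. The sketch below illustrates only that general idea with the Python standard library. It is not Data-Juicer's or Ray's implementation, and the function name `stream_jsonl` and the `batch_size` parameter are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def stream_jsonl(handle: TextIO, batch_size: int = 2) -> Iterator[list[dict]]:
    """Yield batches of parsed JSONL records, reading one line at a time.

    Hypothetical helper for illustration; not part of Data-Juicer.
    """
    batch = []
    for line in handle:
        line = line.strip()
        if not line:
            # Skip blank lines between records.
            continue
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


# An in-memory stand-in for a jsonl file, so the example is self-contained.
sample = io.StringIO('{"text": "a"}\n{"text": "b"}\n\n{"text": "c"}\n')
batches = list(stream_jsonl(sample))
```

Because the function is a generator, a consumer can start processing the first batch before the rest of the file has been read, which keeps memory use bounded by the batch size.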
Usability and Analysis
- Support dj-install at the recipe level, #508
- Support dj-analyze with --auto mode, #512
- Support automatic OP-wise insight mining, #516
Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!