# DataCopilot Tutorial
## I. Introduction
**DataCopilot** is a multimodal data processing toolbox provided by **PaddleMIX**, designed to help developers preprocess, augment, and convert data efficiently. With **DataCopilot**, you can carry out basic data operations with very little code, speeding up model training and inference.

## II. Positioning
DataCopilot is the multimodal data processing toolbox newly introduced in PaddleMIX 2.0. Its guiding idea is to treat data as part of the multimodal algorithm, iterating alongside it through the full workflow, so that developers can implement basic data operations for a specific task with minimal code.

## III. Installation and Import
First, make sure **PaddleMIX** is installed; if not, follow the official **PaddleMIX** documentation to install it.
Once installation is complete, you can import **DataCopilot** as follows:
```python
from paddlemix.datacopilot.core import MMDataset, SCHEMA
import paddlemix.datacopilot.ops as ops
```

## IV. Core Concepts
The toolbox is built around two core concepts: Schema and Dataset. A Schema defines the organization and field names of multimodal data. MMDataset is the core class for data operations and serves as the basic object for storing, viewing, converting, and generating data.

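As a rough illustration (the exact field names are defined by the SCHEMA in use; the ones below follow a common LLaVA-style conversation layout and are assumptions, not the canonical definition), a single item might look like:

```python
# A hypothetical item in a conversation-style multimodal dataset.
# Field names are illustrative assumptions, not the exact SCHEMA fields.
item = {
    'image': 'path/to/image.jpg',
    'conversations': [
        {'from': 'human', 'value': 'What is shown in the picture?'},
        {'from': 'gpt', 'value': 'A cat sitting on a mat.'},
    ],
}
```
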
### SCHEMA

```python
...

def info(dataset: MMDataset) -> None: ...
```

## V. Basic Operations
### 1. Loading Data
Use the `MMDataset.from_json` method to load a dataset from a JSON file:

```python
dataset = MMDataset.from_json('path/to/your/dataset.json')
```

Use the `MMDataset.load_jsonl` method to load a dataset from a JSONL file:
```python
dataset = MMDataset.load_jsonl('path/to/your/dataset.jsonl')
```

Use the `MMDataset.from_h5` method to load a dataset from an H5 file:
```python
dataset = MMDataset.from_h5('path/to/your/dataset.h5')
```

### 2. Viewing Data
Use the `info` and `head` methods to view basic information about the dataset and its first few samples:
```python
dataset.info()
dataset.head()
```

### 3. Slicing Data
Slicing the dataset is supported and returns a new MMDataset object:
```python
subset = dataset[:100]  # take the first 100 samples
```

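Since a slice is itself an MMDataset, the methods documented here apply to it directly:

```python
# inspect and export a slice just like a full dataset
subset = dataset[:100]
subset.info()
subset.export_json('path/to/your/subset.json')
```
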
### 4. Data Augmentation
Use the `map` method to apply an augmentation function to every sample in the dataset:
```python
def augment_data(item):
    # define your augmentation logic here and return the modified item
    return item

augmented_dataset = dataset.map(augment_data, max_workers=8, progress=True)
```

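For example, a minimal sketch of an augmentation that normalizes whitespace in each conversation turn (the `conversations` field layout follows the hypothetical item sketched in the Core Concepts section):

```python
def normalize_text(item):
    # collapse repeated whitespace in every conversation turn
    for turn in item.get('conversations', []):
        turn['value'] = ' '.join(turn['value'].split())
    return item

cleaned_dataset = dataset.map(normalize_text, max_workers=8, progress=True)
```
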
### 5. Data Filtering
Use the `filter` method to keep only the samples that satisfy a condition:
```python
def is_valid_sample(item):
    # define your filtering condition; return True to keep the sample
    return True

filtered_dataset = dataset.filter(is_valid_sample).nonempty()  # the filtered, non-empty dataset
```

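For instance, a sketch of a predicate that keeps only samples referencing an image (the `image` field name is an assumption for illustration):

```python
def has_image(item):
    # keep only samples that carry a non-empty image path
    return bool(item.get('image'))

with_images = dataset.filter(has_image).nonempty()
```
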
### 6. Exporting Data
Use the `export_json` method to export the processed dataset to a JSON file:

```python
augmented_dataset.export_json('path/to/your/output_dataset.json')
```

Use the `export_jsonl` method to export the processed dataset to a JSONL file:
```python
augmented_dataset.export_jsonl('path/to/your/output_dataset.jsonl')
```

Use the `export_h5` method to export the processed dataset to an H5 file:
```python
augmented_dataset.export_h5('path/to/your/output_dataset.h5')
```

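Since loading and exporting compose, converting a dataset between the supported formats is a one-liner:

```python
# load a JSON dataset and re-export it as JSONL
MMDataset.from_json('path/to/your/dataset.json').export_jsonl('path/to/your/dataset.jsonl')
```
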
## VI. Advanced Operations
### 1. Custom Schema
Define a SCHEMA to specify the fields and types of a dataset:
```python
schema = SCHEMA(
    image={'type': 'image', 'required': True},
    text={'type': 'str', 'required': True},
    label={'type': 'int', 'required': False}
)
```
Load data with the custom schema:
```python
custom_dataset = MMDataset.from_json('path/to/your/dataset.json', schema=schema)
```

### 2. Batch Processing
At the time of writing, MMDataset does not ship a dedicated `batch` method; the same effect is achieved with slicing, which is useful when samples need to be processed in groups:
```python
batch_size = 32
for i in range(0, len(dataset), batch_size):  # assumes len() returns the sample count
    batch = dataset[i: i + batch_size]  # each slice is a new MMDataset
    # process the batch here
    ...
```

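As a concrete application, a sketch that exports the dataset in fixed-size shards (the `shards/` output directory and the `len()` support are assumptions):

```python
import os

os.makedirs('shards', exist_ok=True)
batch_size = 32
for i in range(0, len(dataset), batch_size):
    shard = dataset[i: i + batch_size]
    shard.export_json(f'shards/part_{i // batch_size:05d}.json')  # one JSON file per shard
```
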
### 3. Data Sampling
Use the `shuffle` method to shuffle the dataset, or the `sample` method to draw random samples:
```python
shuffled_dataset = dataset.shuffle()
sampled_dataset = dataset.sample(10)  # draw 10 random samples
```

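Combined with slicing, `shuffle` also gives a simple train/validation split (a sketch; `len()` support is an assumption):

```python
shuffled = dataset.shuffle()
n_val = max(1, len(shuffled) // 10)  # hold out roughly 10% for validation
val_set = shuffled[:n_val]
train_set = shuffled[n_val:]
```
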
### 4. Chained Calls
Because each operation returns a dataset, operations can be chained into a single pipeline:
```python
(
    MMDataset.from_json(orig_path)
    .sanitize(schema=SCHEMA.MM, max_workers=8, progress=True)
    .map(functools.partial(ops.convert_schema, out_schema=SCHEMA.MIX), max_workers=8)
    .filter(filter_text_token, max_workers=8, progress=True, order=True)
    .nonempty()
    .export_json(new_path)
)
```
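
Here `filter_text_token` stands for a user-supplied predicate; a hypothetical version that enforces a rough length budget might look like:

```python
def filter_text_token(item):
    # hypothetical predicate: keep samples whose combined text stays
    # under a rough word-count budget ('conversations' layout assumed)
    text = ' '.join(turn['value'] for turn in item.get('conversations', []))
    return len(text.split()) <= 2048
```
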
## VII. Use Cases
1. Import and export
```python
import functools

...

dataset = dataset.filter(is_wanted).nonempty()
```

3. LLaVA-SFT training
For data preparation and the training workflow, see the project [pp_cap_instruct](https://aistudio.baidu.com/projectdetail/7917712).

## VIII. Summary
**DataCopilot** is a powerful and flexible multimodal data processing toolbox provided by **PaddleMIX**.
Once you have mastered its basic operations and advanced features, you can efficiently process, augment, and convert multimodal data, giving solid support to subsequent model training and inference.