# [SCU][PPMix No.27] #823

**Open** · wants to merge 7 commits into `develop`
#### build_paddle_env.sh (18 changes: 9 additions & 9 deletions)
```diff
@@ -57,23 +57,23 @@ if command -v nvcc > /dev/null 2>&1; then
     case $cuda_version in
         "11.2")
             echo "Installing paddlepaddle for CUDA 11.2..."
-            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu112/
+            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu112/
             ;;
         "11.6")
             echo "Installing paddlepaddle for CUDA 11.6..."
-            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu116/
+            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu116/
             ;;
         "11.7")
             echo "Installing paddlepaddle for CUDA 11.7..."
-            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu117/
+            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu117/
             ;;
         "11.8")
             echo "Installing paddlepaddle for CUDA 11.8..."
-            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
+            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
             ;;
         "12.3")
             echo "Installing paddlepaddle for CUDA 12.3..."
-            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+            $PYTHON_CMD -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
             ;;
         *)
             echo "Warning: unsupported CUDA version ($cuda_version)"
@@ -83,14 +83,14 @@ if command -v nvcc > /dev/null 2>&1; then
     esac
 else
     echo "CUDA not detected. Installing the CPU version of paddlepaddle..."
-    $PYTHON_CMD -m pip install paddlepaddle==3.0.0b2
+    $PYTHON_CMD -m pip install paddlepaddle==3.0.0b1
 fi

 # Verify the installation
-echo "Verifying the PaddlePaddle 3.0.0b2 installation..."
+echo "Verifying the PaddlePaddle 3.0.0b1 installation..."
 if $PYTHON_CMD -c "import paddle; paddle.utils.run_check()"; then
-    echo "PaddlePaddle 3.0.0b2 installed successfully!"
+    echo "PaddlePaddle 3.0.0b1 installed successfully!"
 else
-    echo "PaddlePaddle 3.0.0b2 installation verification failed; please check the install logs"
+    echo "PaddlePaddle 3.0.0b1 installation verification failed; please check the install logs"
     exit 1
 fi
```
#### paddlemix/datacopilot/readme.md (146 changes: 143 additions & 3 deletions)
</div>

</details>
# DataCopilot Tutorial
> **Review comment (Collaborator):** The preview of this line doesn't seem to render as expected. [screenshot]

## 1. Introduction
**DataCopilot** is the multimodal data-processing toolbox that ships with **PaddleMIX**. It helps developers preprocess, augment, and convert data efficiently; with **DataCopilot** you can carry out basic data operations in very little code, speeding up model training and inference.

## 2. Positioning
DataCopilot is the multimodal data-processing toolbox introduced in PaddleMIX 2.0. Its guiding idea is to treat data as part of the multimodal algorithm and involve it in the full iteration loop, so that developers can perform the basic data operations for a specific task with minimal code.

## 3. Installation and Import
First, make sure **PaddleMIX** is installed; if it is not, follow the official **PaddleMIX** documentation. Once it is installed, import **DataCopilot** as follows:
```python
from paddlemix.datacopilot.core import MMDataset, SCHEMA
import paddlemix.datacopilot.ops as ops
```

## 4. Core Concepts
The toolbox is built around two core concepts: Schema and Dataset. A Schema defines how multimodal data is organized and names its fields. MMDataset is the core class for data operations and the basic object for storing, inspecting, converting, and generating data.

### SCHEMA

*(section collapsed in the diff view; the visible tail of its code block is shown below)*

```python
def info(dataset: MMDataset) -> None: ...
```

## 5. Basic Operations
### 1. Loading data
Use the `MMDataset.from_json` method to load data from a JSON file:
> **@lyuwenyu** (Collaborator, Nov 21, 2024): For the loading section, please also document the other supported formats, jsonl and h5.
```python
dataset = MMDataset.from_json('path/to/your/dataset.json')
```

Use the `MMDataset.load_jsonl` method to load data from a JSONL file:
```python
dataset = MMDataset.load_jsonl('path/to/your/dataset.jsonl')
```

Use the `MMDataset.from_h5` method to load data from an H5 file:
```python
dataset = MMDataset.from_h5('path/to/your/dataset.h5')
```

### 2. Inspecting data
Use the `info` and `head` methods to view basic information about the dataset and its first few samples:
```python
dataset.info()
dataset.head()
```

### 3. Slicing
Slicing a dataset is supported and returns a new MMDataset object:
```python
subset = dataset[:100]  # take the first 100 samples
```

### 4. Data augmentation
Use the `map` method to apply an augmentation to every sample in the dataset:
```python
def augment_data(item):
    # Put your augmentation logic here and return the (modified) sample.
    # Hypothetical example, assuming a 'text' field: item['text'] = item['text'].lower()
    return item

augmented_dataset = dataset.map(augment_data, max_workers=8, progress=True)
```

### 5. Data filtering
Use the `filter` method to keep only the samples that satisfy a condition:
```python
def is_valid_sample(item):
    # Put your filter condition here; return True to keep the sample.
    # Hypothetical example, assuming a 'text' field: return bool(item.get('text'))
    return True

filtered_dataset = dataset.filter(is_valid_sample).nonempty()  # filtered dataset with empty entries dropped
```

### 6. Exporting data
Use the `export_json` method to export the processed dataset to a JSON file:

> **Review comment (Collaborator):** Same as the loading section: also document the other supported formats.

```python
augmented_dataset.export_json('path/to/your/output_dataset.json')
```

Use the `export_jsonl` method to export the processed dataset to a JSONL file:
```python
augmented_dataset.export_jsonl('path/to/your/output_dataset.jsonl')
```

Use the `export_h5` method to export the processed dataset to an H5 file:
```python
augmented_dataset.export_h5('path/to/your/output_dataset.h5')
```
## 6. Advanced Operations
### 1. Custom schema
Define a SCHEMA to specify the dataset's fields and types:
```python
schema = SCHEMA(
image={'type': 'image', 'required': True},
text={'type': 'str', 'required': True},
label={'type': 'int', 'required': False}
)
```
Load data using the custom schema:
```python
custom_dataset = MMDataset.from_json('path/to/your/dataset.json', schema=schema)
```
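
A loaded dataset can also be validated against its schema. A minimal sketch, assuming `sanitize` accepts a custom schema object in the same way the chaining example below passes `SCHEMA.MM`:

```python
# Drop samples that fail schema validation (assumed sanitize behavior),
# then remove the resulting empty entries.
cleaned_dataset = custom_dataset.sanitize(schema=schema, max_workers=8, progress=True).nonempty()
```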

### 2. Batch processing
Process the dataset in batches by slicing out `batch_size` samples at a time (a full iteration loop is sketched below):
```python
batch_size = 32
batch = dataset[0: batch_size]  # the first batch, taken via slicing
```

> **@lyuwenyu** (Collaborator, Nov 21, 2024): There is currently no `batch` method; it could be supported in the code. (The same functionality can also be implemented with a slice, `dataset[i: i+batch_size]`.)
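
A minimal iteration sketch, assuming `MMDataset` supports `len()` in addition to the slicing shown above:

```python
# Walk the whole dataset in fixed-size chunks using plain slicing.
for i in range(0, len(dataset), batch_size):
    batch = dataset[i: i + batch_size]
    # process the batch here
```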

### 3. Sampling
Use the `shuffle` method to shuffle the dataset, or the `sample` method to draw random samples:
```python
shuffled_dataset = dataset.shuffle()
sampled_dataset = dataset.sample(10)  # draw 10 random samples
```
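
As a usage sketch, shuffling and slicing can be combined into a quick train/eval split (the sizes here are arbitrary, and this assumes the shuffled result supports slicing as in the slicing section above):

```python
# Shuffle once, then split by slicing; both calls return MMDataset objects.
shuffled = dataset.shuffle()
eval_subset = shuffled[:100]   # first 100 shuffled samples for evaluation
train_subset = shuffled[100:]  # the remainder for training
```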

> **@lyuwenyu** (Collaborator, Nov 21, 2024): Usage-wise, it would also be worth documenting the chaining style, for example:
>
> ```python
> (
>     MMDataset.from_json(orig_path)
>     .sanitize(schema=SCHEMA.MM, max_workers=8, progress=True)
>     .map(functools.partial(ops.convert_schema, out_schema=SCHEMA.MIX), max_workers=8)
>     .filter(filter_text_token, max_workers=8, progress=True, order=True)
>     .nonempty()
>     .export_json(new_path)
> )
> ```

### 4. Chaining
Because every operation returns an `MMDataset`, operations can be composed into a single pipeline (wrap the chain in parentheses so Python parses it as one expression):
```python
# Assumes `import functools`; `filter_text_token` is a user-defined predicate.
(
    MMDataset.from_json(orig_path)
    .sanitize(schema=SCHEMA.MM, max_workers=8, progress=True)
    .map(functools.partial(ops.convert_schema, out_schema=SCHEMA.MIX), max_workers=8)
    .filter(filter_text_token, max_workers=8, progress=True, order=True)
    .nonempty()
    .export_json(new_path)
)
```

## 7. Usage Examples
1. Import and export
```python
import functools
# ... (collapsed in the diff view)
dataset = dataset.filter(is_wanted).nonempty()
```

*(example 2 is collapsed in the diff view)*
3. LLaVA-SFT training
For data preparation and the training workflow, see the project [pp_cap_instruct](https://aistudio.baidu.com/projectdetail/7917712).

## 8. Summary
**DataCopilot** is a powerful and flexible multimodal data-processing toolbox provided by **PaddleMIX**.
Once you are familiar with its basic operations and advanced features, you can process, augment, and convert multimodal data efficiently, giving solid support to downstream model training and inference.
#### paddlemix/examples/llava/pretrain.py (8 changes: 4 additions & 4 deletions)

```diff
@@ -156,11 +156,11 @@ def main():
     if training_args.benchmark:
         total_effective_samples = total_samples * training_args.num_train_epochs
         effective_samples_per_second = total_effective_samples / train_result.metrics["train_runtime"]
-        # mem_gpu = (
-        #     train_result.metrics["train_mem_gpu_peaked_delta"] + train_result.metrics["train_mem_gpu_alloc_delta"]
-        # )
+        mem_gpu = (
+            train_result.metrics["train_mem_gpu_peaked_delta"] + train_result.metrics["train_mem_gpu_alloc_delta"]
+        )
         logger.info(f"Effective_samples_per_second: {effective_samples_per_second} ")
-        # logger.info(f"train_mem_gpu_peaked: {int(mem_gpu/ (2**20))} MB")
+        logger.info(f"train_mem_gpu_peaked: {int(mem_gpu/ (2**20))} MB")
         logger.info("Benchmark done.")
     else:
         trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1)
```
#### paddlemix/tools/supervised_finetune.py (8 changes: 4 additions & 4 deletions)

```diff
@@ -182,11 +182,11 @@ def main():
     if training_args.benchmark:
         total_effective_samples = total_samples * training_args.num_train_epochs
         effective_samples_per_second = total_effective_samples / train_result.metrics["train_runtime"]
-        # mem_gpu = (
-        #     train_result.metrics["train_mem_gpu_peaked_delta"] + train_result.metrics["train_mem_gpu_alloc_delta"]
-        # )
+        mem_gpu = (
+            train_result.metrics["train_mem_gpu_peaked_delta"] + train_result.metrics["train_mem_gpu_alloc_delta"]
+        )
         logger.info(f"Effective_samples_per_second: {effective_samples_per_second} ")
-        # logger.info(f"train_mem_gpu_peaked: {int(mem_gpu/ (2**20))} MB")
+        logger.info(f"train_mem_gpu_peaked: {int(mem_gpu/ (2**20))} MB")
         logger.info("Benchmark done.")
     else:
         trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1)
```