[SCU][PPMix No.27] #823

Open
wants to merge 7 commits into
base: develop

Conversation

yangrongxinuser

Adding a DataCopilot usage tutorial:
@lyuwenyu

paddle-bot bot commented Nov 20, 2024

Thanks for your contribution!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@luotao1 luotao1 assigned luotao1 and lyuwenyu and unassigned luyao-cv Nov 20, 2024
@luotao1 luotao1 added the HappyOpenSource Pro (Happy Open Source issues and PRs, more challenging tasks) label Nov 20, 2024
@lyuwenyu
Collaborator

You can update the readme.md file in the datacopilot directory directly. Take a look at how to merge your document with the original one.

@yangrongxinuser
Author

Merged the DataCopilot documents:
@lyuwenyu

@@ -9,11 +9,23 @@
</div>

</details>
# DataCopilot Usage Tutorial
Collaborator

The preview of this line does not seem to render as expected.

image

## Use Cases
## 5. Basic Operations
### 1. Loading Data
Use the `MMDataset.from_json` method to load data from a JSON file:
Collaborator

@lyuwenyu lyuwenyu Nov 21, 2024

In the data loading section, you could also list the other supported formats: jsonl and h5.
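
To make the format difference concrete, here is a minimal standard-library sketch of how a `.json` file (one top-level list) differs from a `.jsonl` file (one object per line). It deliberately does not call `MMDataset`, because this thread does not show the loader names for jsonl or h5. The helper `load_records` and the sample fields are hypothetical, and h5 is left out since reading it needs a third-party package.

```python
import json
import os
import tempfile


def load_records(path):
    """Load a list of dict records from a .json or .jsonl file (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "r", encoding="utf-8") as f:
        if ext == ".json":
            # A .json file holds the whole dataset as one top-level list.
            return json.load(f)
        if ext == ".jsonl":
            # A .jsonl file holds one JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
    raise ValueError(f"unsupported extension: {ext}")


# Illustrative records; the field names are made up for this sketch.
records = [{"id": 1, "text": "a cat"}, {"id": 2, "text": "a dog"}]

tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, "data.json")
jsonl_path = os.path.join(tmpdir, "data.jsonl")

# Write the same records in both layouts.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Both layouts round-trip to the same list of records.
from_json = load_records(json_path)
from_jsonl = load_records(jsonl_path)
```

The jsonl layout can be read line by line, which is why it tends to suit large datasets better than a single JSON array.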

```

### 6. Exporting Data
Use the `export_json` method to export the processed dataset to a JSON file:
Collaborator

Same comment as for the data loading section.

Use the `batch` method to process the samples in the dataset in batches, which suits cases that need batch operations:
```python
batch_size = 32
batched_dataset = dataset.batch(batch_size)
Collaborator

@lyuwenyu lyuwenyu Nov 21, 2024

This method does not exist at the moment, does it? You could add support for it in the code. (This can actually also be done with a slice: dataset[i: i+batch_size].)

shuffled_dataset = dataset.shuffle()
sampled_dataset = dataset.sample(10) # randomly draw 10 samples
```
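
The slice-based alternative to `batch` raised in the review comment on this snippet can be sketched roughly as follows. It assumes only that the dataset supports `len()` and slicing, as the reviewer's `dataset[i: i+batch_size]` implies. `iter_batches` is a hypothetical helper name, and a plain list stands in for the dataset.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        # The final slice may be shorter than batch_size.
        yield dataset[start:start + batch_size]


# A plain list stands in for the dataset in this sketch.
samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
```

A generator keeps only one batch in memory at a time, so nothing is copied up front beyond what each slice itself creates.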

Collaborator

@lyuwenyu lyuwenyu Nov 21, 2024

On the usage side, you could also add a section on the supported chained-call style, for example:

    (
        MMDataset.from_json(orig_path)
        .sanitize(schema=SCHEMA.MM, max_workers=8, progress=True, )
        .map(functools.partial(ops.convert_schema, out_schema=SCHEMA.MIX), max_workers=8)
        .filter(filter_text_token, max_workers=8, progress=True, order=True)
        .nonempty()
        .export_json(new_path)
    )
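
For readers unfamiliar with the pattern, the chained style above works when each operation returns a new dataset object. The sketch below shows that idea with a toy class. `TinyDataset` is a hypothetical stand-in, not the `MMDataset` implementation, and its methods take none of the `max_workers`, `progress`, or `schema` arguments used in the snippet above.

```python
class TinyDataset:
    """Hypothetical stand-in for a chainable dataset; not MMDataset."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Return a new dataset so the next call can be chained.
        return TinyDataset(fn(item) for item in self.items)

    def filter(self, predicate):
        # Keep only the items for which the predicate is true.
        return TinyDataset(item for item in self.items if predicate(item))

    def nonempty(self):
        # Drop None and empty values.
        return TinyDataset(item for item in self.items if item)

    def __len__(self):
        return len(self.items)


cleaned = (
    TinyDataset(["  a cat ", "", "  a very long caption here ", None, "dog"])
    .nonempty()                                    # drop "" and None
    .map(str.strip)                                # trim whitespace
    .filter(lambda text: len(text.split()) <= 3)   # keep short captions
)
```

Because every step returns a fresh object, the original dataset is left unchanged and the pipeline reads top to bottom in execution order.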

@yangrongxinuser
Author

@lyuwenyu Could you please check whether there are any remaining issues?

@lyuwenyu
Collaborator

lyuwenyu commented Nov 22, 2024

Take a look at the last commit in your log: it pulled in unrelated changes. You should merge or rebase develop.

@yangrongxinuser

@yangrongxinuser
Author

When I pushed, it reported commit conflicts for some reason, so I committed again as it asked.

@lyuwenyu
Collaborator

362f152

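You should first go back to this commit ID and drop everything after it. Then check whether anything else still needs changing. Finally, merge or rebase onto develop.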
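
The recovery steps described above could be scripted roughly as follows, taking the rebase option. This is only a sketch: it assumes the main repository is configured as a remote named `upstream` (a guess, not stated in the thread), and it runs in dry-run mode by default because `git reset --hard` discards later commits and any uncommitted work. `recovery_commands` and `run_all` are hypothetical helper names.

```python
import subprocess


def recovery_commands(good_commit, base="develop", remote="upstream"):
    """Build the git commands for the reset-then-rebase recovery (a sketch)."""
    return [
        # Move the current branch back to the last good commit, dropping later ones.
        ["git", "reset", "--hard", good_commit],
        # Refresh the local view of the base branch (remote name is an assumption).
        ["git", "fetch", remote, base],
        # Replay the remaining commits on top of the updated base.
        ["git", "rebase", f"{remote}/{base}"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


commands = recovery_commands("362f152")
run_all(commands)  # dry run: prints the commands without touching the repository
```

Since the reset rewrites the branch history, updating the pull request branch afterwards generally requires a force push, for example with `git push --force-with-lease`.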

Labels
contributor, HappyOpenSource Pro (Happy Open Source issues and PRs, more challenging tasks)
5 participants