[SCU][PPMix No.27】 #823
base: develop
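Thanks for your contribution!
You can update the readme.md file under the datacopilot directory directly. (Compare your document with the original one and work out how to merge them.)
datacopilot documentation merge: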
@@ -9,11 +9,23 @@
</div>

</details>
# DataCopilot Tutorial
## Use Cases
## 5. Basic Operations
### 1. Loading Data
Use the `MMDataset.from_json` method to load data from a JSON file:
The data-loading section could also list the other supported formats: jsonl and h5.
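To make the difference between the formats concrete, here is a minimal standard-library sketch of reading the same records from a JSON file and from a JSONL file. `load_records` is a hypothetical helper written for illustration, not part of DataCopilot, and it assumes the JSON file holds a list of records. The h5 case is left out because it needs a third-party package.

```python
import json
import tempfile
from pathlib import Path


def load_records(path):
    """Load a list of records from a .json or .jsonl file (hypothetical helper)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if path.suffix == ".json":
        # JSON: the whole file is one document, assumed here to be a list.
        return json.loads(text)
    raise ValueError(f"unsupported format: {path.suffix}")


# Write the same two records in both formats, then read them back.
records = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    jsonl_path = Path(tmp) / "data.jsonl"
    json_path.write_text(json.dumps(records), encoding="utf-8")
    jsonl_path.write_text(
        "\n".join(json.dumps(r) for r in records), encoding="utf-8"
    )
    from_json = load_records(json_path)
    from_jsonl = load_records(jsonl_path)
```

Both calls should return the same list, which is why a tutorial can present the formats side by side with one example per loader.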
### 6. Exporting Data
Use the `export_json` method to export the processed dataset to a JSON file:
Same as the comment on the data-loading section: list the other supported formats here as well.
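As a rough illustration of what the export side looks like for the two text formats, the sketch below writes a list of records as JSON and as JSONL with the standard library only. `export_records` is a hypothetical helper, not DataCopilot's `export_json`, and the output file names are placeholders.

```python
import json
import tempfile
from pathlib import Path


def export_records(records, path):
    """Write records to a .json or .jsonl file (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".jsonl":
        # JSONL: one record per line, each line a complete JSON object.
        body = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
    elif path.suffix == ".json":
        # JSON: a single list containing every record.
        body = json.dumps(list(records), ensure_ascii=False, indent=2)
    else:
        raise ValueError(f"unsupported format: {path.suffix}")
    path.write_text(body, encoding="utf-8")


export_data = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
with tempfile.TemporaryDirectory() as tmp:
    out_json = Path(tmp) / "out.json"
    out_jsonl = Path(tmp) / "out.jsonl"
    export_records(export_data, out_json)
    export_records(export_data, out_jsonl)
    # Read both files back to confirm nothing was lost.
    json_roundtrip = json.loads(out_json.read_text(encoding="utf-8"))
    jsonl_lines = out_jsonl.read_text(encoding="utf-8").splitlines()
```

JSONL is usually the easier choice for large datasets because records can be appended and read one line at a time.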
Use the `batch` method to process the samples in the dataset in batches, for cases that need batch operations:
```python
batch_size = 32
batched_dataset = dataset.batch(batch_size)
```
I don't think this method exists yet; you could add support for it in the code. (In fact, the same thing can also be done with a slice: `dataset[i: i+batch_size]`.)
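The slicing approach can be sketched as a small generator. This assumes only that the dataset supports `len()` and slice indexing, as the `dataset[i: i+batch_size]` suggestion implies; `iter_batches` is a hypothetical name, and a plain list stands in for the dataset.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(dataset), batch_size):
        # The last slice is shorter when len(dataset) is not a multiple of batch_size.
        yield dataset[i : i + batch_size]


samples = list(range(10))  # stand-in for a dataset object
batches = list(iter_batches(samples, batch_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A generator keeps only one batch in memory at a time, which matters more for real datasets than for this ten-item list.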
```python
shuffled_dataset = dataset.shuffle()
sampled_dataset = dataset.sample(10)  # randomly draw 10 samples
```
The usage section could also include a part on the supported chained-call style, for example:
(
MMDataset.from_json(orig_path)
.sanitize(schema=SCHEMA.MM, max_workers=8, progress=True, )
.map(functools.partial(ops.convert_schema, out_schema=SCHEMA.MIX), max_workers=8)
.filter(filter_text_token, max_workers=8, progress=True, order=True)
.nonempty()
.export_json(new_path)
)
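For readers unfamiliar with why such a chain works, the toy class below shows the underlying pattern: each method returns a new dataset object, so calls can be strung together. `TinyDataset` is a hypothetical stand-in, not DataCopilot's `MMDataset`; it ignores options such as `max_workers` and `progress`, and its reading of `nonempty` (drop falsy items) is an assumption.

```python
class TinyDataset:
    """Hypothetical chainable dataset; not DataCopilot's MMDataset."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Return a new dataset so the next call can be chained onto it.
        return TinyDataset(fn(x) for x in self.items)

    def filter(self, pred):
        return TinyDataset(x for x in self.items if pred(x))

    def nonempty(self):
        # Assumed meaning: drop None and empty values.
        return TinyDataset(x for x in self.items if x)


result = (
    TinyDataset(["  hello ", "", "  a very long sentence  ", None])
    .filter(lambda x: x is not None)
    .map(str.strip)
    .filter(lambda x: len(x) < 10)
    .nonempty()
)
# result.items == ["hello"]
```

Returning a fresh object from every step leaves the original dataset untouched, so an intermediate result can be reused in another chain.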
f8c9f55 to 64648a1
64648a1 to 679b487
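@lyuwenyu Could you please check whether there are any remaining issues?
The last commit in your log pulled in unrelated changes; you should merge or rebase develop.
The thing is, when I submitted, for some reason it said I had commit conflicts, so I made another commit as it asked.
You should first go back to this commit-id and discard everything after it; then check whether anything else still needs changing; finally, merge or rebase develop.
datacopilot tutorial added:
@lyuwenyu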