[SCU][PPMix No.27] #823

Open

yangrongxinuser wants to merge 7 commits into PaddlePaddle:develop from yangrongxinuser:Datacopilot

yangrongxinuser commented Nov 20, 2024

Added a DataCopilot usage tutorial:
@lyuwenyu


          Create DataCopilot.md

ead7a8f

paddle-bot bot commented Nov 20, 2024

Thanks for your contribution!

paddle-bot bot added the contributor label

CLAassistant commented Nov 20, 2024

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

paddle-bot bot assigned luyao-cv

luotao1 mentioned this pull request

PaddleMIX Happy Open Source Event (2024 Q4) #787

Open

luotao1 assigned luotao1 and lyuwenyu and unassigned luyao-cv

luotao1 added the HappyOpenSource Pro label

Collaborator

lyuwenyu commented Nov 21, 2024

You can update the readme.md file under the datacopilot directory directly. Take a look at how best to merge your document with the original one.


          Merge documentation

f8c9f55

Author

yangrongxinuser commented Nov 21, 2024

Merged the DataCopilot documentation:
@lyuwenyu

lyuwenyu reviewed

paddlemix/datacopilot/readme.md

@@ -9,11 +9,23 @@
               </div>
               </details>
+              # DataCopilot Usage Tutorial

Collaborator

lyuwenyu Nov 21, 2024

The rendered preview of this line doesn't seem to match what was intended.

lyuwenyu reviewed

paddlemix/datacopilot/readme.md

-              ## Usage Examples
+              ## 5. Basic Operations
+              ### 1. Loading Data
+              Use the `MMDataset.from_json` method to load data from a JSON file:

Collaborator

lyuwenyu Nov 21, 2024

The data loading section could also list the other supported formats: jsonl and h5.
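
For readers unfamiliar with how the JSON and JSONL layouts differ, here is a minimal sketch using only the Python standard library. It does not use DataCopilot's loaders. The record fields (`id`, `text`) and file names are made up for illustration, and h5 is left out because it needs a third-party package.

```python
import json
import os
import tempfile

# Hypothetical records, used only to illustrate the two file layouts.
records = [{"id": 1, "text": "a cat"}, {"id": 2, "text": "a dog"}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "data.json")
    jsonl_path = os.path.join(tmp, "data.jsonl")

    # JSON: the whole file is a single document (here, a list of records).
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    with open(json_path, encoding="utf-8") as f:
        from_json = json.load(f)

    # JSONL: one JSON object per line, parsed line by line.
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    with open(jsonl_path, encoding="utf-8") as f:
        from_jsonl = [json.loads(line) for line in f if line.strip()]

# Both layouts carry the same records.
print(from_json == from_jsonl)
```

Because JSONL is parsed one line at a time, it can be streamed without reading the whole file into memory, which is the usual reason to offer it alongside plain JSON.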

lyuwenyu reviewed

paddlemix/datacopilot/readme.md

+              ```
+              ### 6. Exporting Data
+              Use the export_json method to export the processed dataset to a JSON file:

Collaborator

lyuwenyu Nov 21, 2024

Same comment as for the data loading section.

paddlemix/datacopilot/readme.md

+              Use the batch method to process the dataset's samples in batches, for cases that need batch operations:
+              ```python
+              batch_size = 32
+              batched_dataset = dataset.batch(batch_size)

Collaborator

lyuwenyu Nov 21, 2024

I don't think this method exists at the moment; you could add support for it in the code. (This can actually also be done with slicing: dataset[i: i+batch_size].)
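
The slicing approach can be sketched as follows. This is a minimal illustration rather than DataCopilot code: `iter_batches` is a hypothetical helper, and a plain Python list stands in for the dataset, on the assumption that the dataset supports `len()` and slice indexing as the comment implies.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive slices of `dataset`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield dataset[start:start + batch_size]


# A plain list stands in for the dataset here (assumption for illustration).
samples = list(range(10))
batches = list(iter_batches(samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A generator keeps only one batch in memory at a time, which matters if a slice of the real dataset returns a copy rather than a view.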

paddlemix/datacopilot/readme.md

+              shuffled_dataset = dataset.shuffle()
+              sampled_dataset = dataset.sample(10)  # randomly sample 10 items
+              ```

Collaborator

lyuwenyu Nov 21, 2024

For usage, you could also add a section showing that chained calls are supported, for example:

    (
        MMDataset.from_json(orig_path)
        .sanitize(schema=SCHEMA.MM, max_workers=8, progress=True)
        .map(functools.partial(ops.convert_schema, out_schema=SCHEMA.MIX), max_workers=8)
        .filter(filter_text_token, max_workers=8, progress=True, order=True)
        .nonempty()
        .export_json(new_path)
    )
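
To show why such a chain works, here is a minimal sketch in which every method returns a new dataset object, so the next call can follow directly. `ToyDataset` is a hypothetical stand-in written for this illustration. It is not the `MMDataset` implementation, and it leaves out parameters such as `max_workers`, `progress`, and `order`.

```python
class ToyDataset:
    """Hypothetical stand-in for a chainable dataset; NOT the MMDataset implementation."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, func):
        # Return a new dataset so another method can be chained onto the result.
        return ToyDataset(func(item) for item in self.items)

    def filter(self, predicate):
        # Keep only the items for which the predicate is true.
        return ToyDataset(item for item in self.items if predicate(item))

    def nonempty(self):
        # Drop falsy items such as None or empty strings.
        return ToyDataset(item for item in self.items if item)


result = (
    ToyDataset(["  hello ", "", "  a much longer caption  ", None])
    .map(lambda text: text.strip() if text else text)
    .filter(lambda text: text is None or len(text) < 10)
    .nonempty()
)
print(result.items)  # ['hello']
```

Returning a fresh object from each step leaves the original dataset untouched, so intermediate results can be reused or inspected without side effects.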

lyuwenyu force-pushed the Datacopilot branch from f8c9f55 to 64648a1

November 21, 2024 06:05

yangrongxinuser added 2 commits

November 21, 2024 16:14


          Create DataCopilot.md

cebf2d3


          Merge documentation

679b487

lyuwenyu force-pushed the Datacopilot branch from 64648a1 to 679b487

November 21, 2024 08:14

yangrongxinuser added 2 commits

November 21, 2024 17:40


          modify

362f152


          modify

c4c6170

Author

yangrongxinuser commented Nov 22, 2024

@lyuwenyu Could you please check whether there are any remaining issues? Thank you.

Collaborator

lyuwenyu commented Nov 22, 2024

Look at the last commit in your log: it pulled in unrelated changes. You should merge or rebase develop.

@yangrongxinuser


          Merge branch 'develop' into Datacopilot

6c161de

Author

yangrongxinuser commented Nov 22, 2024

When I pushed, it reported commit conflicts for some reason I don't understand, so I made another commit as it asked.

Collaborator

lyuwenyu commented Nov 23, 2024

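You should first go back to this commit-id and discard the commits after it; then check whether anything else needs changing; finally, merge or rebase onto develop.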

Labels

contributor HappyOpenSource Pro