Dev #4

Merged 38 commits on Jan 2, 2024
10 changes: 4 additions & 6 deletions README.md
@@ -15,20 +15,18 @@ data is one of the basic elements in the development of artificial intelligence.

FlagData supports the following features:

* Ready to use with simple configuration after installation; custom features can be implemented with little code.
* High-quality content extraction from data in various raw formats, greatly reducing processing cost.

* High-quality structured data can be quickly cleaned from raw html/text/pdf/epub, and sensitive information can be filtered out to avoid the risk of privacy disclosure.
* Provides a data perspective feature for large-model fine-tuning data.

* Supports de-duplication of massive text data and provides detailed deployment documentation for a multi-machine distributed data processing system.

* Supports data quality assessment and common data analysis.
* One-stop, efficient distributed data processing.

The complete pipeline process and features are shown in the figure below:
![pipeline](pipeline.png)

## News

- [Dec 15th, 2023] FlagData has been upgraded to v1.1.0
- [Dec 31st, 2023] FlagData has been upgraded to v2.0.0
- [Jan 31st, 2023] FlagData v1.0.0 is online!

--------------------------------------------------------------------------------
10 changes: 4 additions & 6 deletions README_zh.md
@@ -15,20 +15,18 @@

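FlagData supports the following features:

* Ready to use with simple configuration after installation; custom features can be implemented with little code.
* High-quality content extraction from data in various raw formats, greatly reducing processing cost.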

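* Quickly cleans raw html/text/pdf/epub into high-quality structured data, with a focus on filtering out sensitive information to avoid the risk of privacy disclosure.
* Provides a data perspective feature for large-model fine-tuning data.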

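* Supports de-duplication of massive text data and provides detailed deployment documentation for a multi-machine distributed data processing system.

* Supports data quality assessment and common data analysis.
* One-stop, efficient distributed data processing.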

The complete pipeline process and features are shown in the figure below:
![pipeline](pipeline_zh.png)

## News

- [Dec 15th, 2023] FlagData has been upgraded to v1.1.0
- [Dec 31st, 2023] FlagData has been upgraded to v2.0.0
- [Jan 31st, 2023] FlagData v1.0.0 is online!

--------------------------------------------------------------------------------