Update wheel2.json
Franri3008 committed Dec 19, 2024
1 parent 1358d25 commit 5e86081
Showing 1 changed file with 5 additions and 71 deletions.
76 changes: 5 additions & 71 deletions pages/Wheels/wheel2.json
@@ -10,21 +10,10 @@
"icon_transform": "None"
},
"items": [
- {
- "name": "FineWeb",
- "subname": "HuggingFaceFW",
- "bullets": ["Cleaned and deduplicated english web data from CommonCrawl", "93.4 TBs"],
- "description": "Contains 15T-tokens of cleaned and deduplicated english web data from CommonCrawl. Curated for large-scale LLM training. Models trained on this data show superiority over models trained on other datasets like C4, Dolma, and RedPajama.\nEstimated number of rows: 45,995,362,478.\nSize of auto-converted Parquet files: 93.4 TB. Key feature: As of today, the largest publicly available, high-quality web dataset.",
- "icon": "https://huggingface.co/front/assets/huggingface_logo-noborder.svg",
- "url": "https://huggingface.co/datasets/HuggingFaceFW/fineweb",
- "color": "#FFD21E",
- "x": 0.0,
- "y": -0.255
- },
{
"name": "orca-agentinstruct-1M-v1",
"subname": "Microsoft",
- "bullets": ["Designed to train models for instruction-following tasks", "Prompts and responses are synthetically generated by AgentInstruct"],
+ "bullets": ["Designed to train models for instruction-following tasks", "Prompts and responses are synthetically generated"],
"description": "Designed to train models for instruction-following tasks like creative text writing, coding, or reading comprehension. Both the prompts and the responses of this dataset are synthetically generated by AgentInstruct, using only raw text content publicly available on the Web as seeds.\nNumber of rows: 1,046,410\nSize of auto-converted Parquet files: 2.21 GB",
"icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/1583646260758-5e64858c87403103f9f1055d.png",
"url": "https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1",
@@ -35,7 +24,7 @@
{
"name": "arXiver",
"subname": "Neuralwork",
- "bullets": ["Curated for question-answering tasks", "Data is converted to highly readable (.mmd) format"],
+ "bullets": ["Curated for question-answering tasks", "Data is converted to the highly readable (.mmd) format"],
"description": "The largest open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). Contains a diverse set of sources such as books, newspapers, scientific articles, government and legal documents, code, and more.\nEstimated number of rows: 396,953,971\nSize of auto-converted Parquet files (First 5GB): 2.96 GB\nKey feature: Data is permissively licensed, meaning it can be used, modified, and redistributed without legal ambiguity or risk of infringement.",
"icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/6329b0cabdb6242b42b8cd63/7T7rS_-BL7wLWMDiCZgs7.png",
"url": "https://huggingface.co/datasets/neuralwork/arxiver",
@@ -57,7 +46,7 @@
{
"name": "SmolTalkDataset",
"subname": "HuggingFaceTB",
- "bullets": ["Designed for supervised finetuning (SFT) of LLMs", "Curated to strengthen model capabilities such as mathematics and coding"],
+ "bullets": ["Designed for supervised finetuning (SFT) of LLMs", "Curated to strengthen model capabilities like mathematics and coding"],
"description": "Synthetic dataset designed for supervised finetuning (SFT) of LLMs. It focuses on bridging the performance gap between models trained on SFT datasets and those trained on proprietary instruction datasets.\nNumber of rows: 2,197,730\nSize of the auto-converted Parquet files: 4.15 GB\nKey feature: While curated for SFT, the dataset also aims at improving on instruction following tasks.",
"icon": "https://huggingface.co/front/assets/huggingface_logo-noborder.svg",
"url": "https://huggingface.co/datasets/HuggingFaceTB/smoltalk",
@@ -66,7 +55,7 @@
"y": -0.05
},
{
- "name": "Multilingual Massive Multitask Language Understanding (MMMLU)",
+ "name": "MMMLU",
"subname": "OpenAI",
"bullets": ["Covers 57 topics from physics to history", "Test set translated into 14 languages using professional human translators"],
"description": "Benchmark dataset for assessing the general knowledge and reasoning skills of AI models. It covers 57 topics from physics to history, and its test set is translated into 14 languages by professional translators to improve multilingual AI performance and accuracy.\nNumber of rows: 393,176\nSize of the auto-converted Parquet files: 124 MB\nKey feature: Aiming to improve the multilingual capabilities of AI models, ensuring they perform accurately across languages, particularly for underrepresented communities.",
@@ -79,7 +68,7 @@
{
"name": "FinePersonas",
"subname": "Argilla",
- "bullets": ["Provides unique persona traits for synthetic outputs", "Enhances the diversity and specificity of synthetic outputs"],
+ "bullets": ["Designed to train models in tasks like reasoning or creative writing", "Enhances the diversity and specificity of synthetic outputs by using diverse personas"],
"description": "Allows AI researchers and engineers to easily integrate unique persona traits into text generation systems, thereby enhancing the diversity and specificity of synthetic outputs without the complexity of crafting detailed attributes from scratch.\nNumber of rows: 42,142,456\nSize of the auto-converted Parquet files: 143 GB\nKey feature: Focus on providing personas with specific expertise, career paths, or personal interests, which allow for more nuanced and targeted content.",
"icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/1664307416166-60420dccc15e823a685f2b03.png",
"url": "https://huggingface.co/datasets/argilla/FinePersonas-v0.1",
@@ -240,61 +229,6 @@
"color": "#ffd11e",
"x": 0.0,
"y": -0.05
},
- {
- "name": "Hermes Function-Calling V1 ",
- "subname": "NousResearch",
- "bullets": ["Trains LLMs to return structured output based on natural language instructions", "Training areas include API usage, automated workflows, and complex system integration"],
- "description": "A dataset of structured outputs and function-calling conversations, enabling LLMs to perform and return structured outputs from natural language instructions.",
- "icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/6317aade83d8d2fd903192d9/tPLjYEeP6q1w0j_G2TJG_.png",
- "url": "https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1",
- "color": "#0221b9",
- "x": 0.0,
- "y": -0.05
- },
- {
- "name": "reasoning-0.01",
- "subname": "SkunkworksAI",
- "bullets": ["Designed to train reasoning and problem-solving capabilities in LLMs", "Contains step-by-step solutions in CoT format"],
- "description": "Synthetic dataset of reasoning chains for a wide variety of tasks.",
- "icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/64b7e345f92b20f7a38bf47a/wz-kRIjZ02vPTaBrrjQAd.png",
- "url": "https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01",
- "color": "#48ab73",
- "x": 0.0,
- "y": -0.05
- },
- {
- "name": "SmolLM-Corpus ",
- "subname": "HuggingFaceTB",
- "bullets": ["Collection of high-quality educational and synthetic data", "Designed for fine-tuning and training small language models"],
- "description": "A curated collection of high-quality educational and synthetic data designed for training small language models.",
- "icon": "https://huggingface.co/front/assets/huggingface_logo-noborder.svg",
- "url": "https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus",
- "color": "#ffd11e",
- "x": 0.0,
- "y": -0.05
- },
- {
- "name": "🐋 Orca-AgentInstruct-1M-v1-cleaned",
- "subname": "mlabonne",
- "bullets": ["Refined version of Microsoft’s synthetic dataset for instruction tuning", "Optimized for ease of use and instruction-following fine-tuning"],
- "description": "A refined version of Microsoft’s synthetic dataset for instruction tuning, optimized for framework compatibility and improved model performance on benchmarks.",
- "icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/JtUGAwVh_4cDEsjNcfpye.png",
- "url": "https://huggingface.co/datasets/mlabonne/orca-agentinstruct-1M-v1-cleaned",
- "color": "#eb321b",
- "x": 0.0,
- "y": -0.05
- },
- {
- "name": "synthetic_text_to_sql",
- "subname": "gretelai",
- "bullets": ["Designed to train and evaluate models for Text-to-SQL tasks", "Covers 100 domains including finance, energy or marine biology"],
- "description": "A comprehensive dataset of 105k synthetic Text-to-SQL samples across 100 domains, featuring diverse SQL tasks and complexities. Designed for advanced model training.",
- "icon": "https://cdn-avatars.huggingface.co/v1/production/uploads/620df72e917be1e25d20008d/RnsdKEkmNVZnAEn8XfQOw.png",
- "url": "https://huggingface.co/datasets/gretelai/synthetic_text_to_sql",
- "color": "#70a1d1",
- "x": 0.0,
- "y": -0.05
- }
]
}
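The hunks above all edit entries of a wheel-layout JSON whose items share one schema (`name`, `subname`, `bullets`, `description`, `icon`, `url`, `color`, `x`, `y`). As a quick sanity check of that schema, here is a minimal sketch of parsing such a file; the enclosing `config` key is a guess (only `icon_transform` is visible in the first hunk), and the sample values are abridged for illustration:

```python
import json

# Minimal sample mirroring the item schema visible in the wheel2.json diff.
# The "config" key name is an assumption; only "icon_transform" appears in
# the hunk. Values are abridged for illustration.
sample = """
{
  "config": {"icon_transform": "None"},
  "items": [
    {
      "name": "arXiver",
      "subname": "Neuralwork",
      "bullets": ["Curated for question-answering tasks"],
      "url": "https://huggingface.co/datasets/neuralwork/arxiver",
      "color": "#ffd11e",
      "x": 0.0,
      "y": -0.05
    }
  ]
}
"""

wheel = json.loads(sample)
for item in wheel["items"]:
    print(f'{item["name"]} ({item["subname"]}): {item["url"]}')
# → arXiver (Neuralwork): https://huggingface.co/datasets/neuralwork/arxiver
```

Note that the `y` values in the diff are plain JSON numbers, so a renderer consuming this file gets floats back without extra conversion.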