Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev/manage meta #518

Merged
merged 180 commits into from
Jan 3, 2025
Merged

Dev/manage meta #518

merged 180 commits into from
Jan 3, 2025

Conversation

BeachWang
Copy link
Collaborator

@BeachWang BeachWang commented Dec 24, 2024

Have organized and consolidated the generated metadata uniformly.

  1. Meta data produced by mapper OPs is stored in Fields.meta field. Corresponding keys are list in MetaKeys.
  2. For convenience of split a batch sample, meta data produced by aggregator OPs is stored in Fields.agg field. Corresponding keys are list in AggKeys. Here, we add a default Fields.agg field to all samples in all aggregator operators, without adding a decorator marker TAGGING_OPS like we do with mappers.
  3. Unify the interface for generating meta operators, allowing modification of meta keys. If a meta key already exists, it should simply return without making any changes.
  4. Add corresponding unit test.
  5. Adjust the role playing demo according to the changes.
  6. Remove nested_set function.

@BeachWang BeachWang self-assigned this Dec 24, 2024
@BeachWang BeachWang marked this pull request as draft December 24, 2024 02:24
@BeachWang BeachWang marked this pull request as ready for review December 31, 2024 06:38
@BeachWang BeachWang requested review from HYLcool and yxdyc December 31, 2024 08:37
@BeachWang BeachWang added the enhancement New feature or request label Dec 31, 2024
@yxdyc yxdyc added dj:dataset issues/PRs about the dj-dataset dj:core issues/PRs about the core functions of Data-Juicer labels Jan 2, 2025
Copy link
Collaborator

@yxdyc yxdyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing Fields.agg and AggKeys? Instead, we can unify them into Fields.meta and MetaKeys. This makes sense because they are implemented in a very similar way, and conceptually, agg is still a type of meta.

Additionally, currently, we only have two types of AggKeys, and keeping both agg and meta may increase the cognitive load for users and introduce inconsistencies. Our 2.0 paper mentions that we have three main parts of the fields: 1. data payload, 2. stats, 3. meta. Unifying agg into meta would align better with this structure.

Others LGTM.

@BeachWang BeachWang merged commit fb98c56 into main Jan 3, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dj:core issues/PRs about the core functions of Data-Juicer dj:dataset issues/PRs about the dj-dataset enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants