
Text generation #647

Open · wants to merge 40 commits into main
Conversation

@sjohn4 (Collaborator) commented on Jan 13, 2025

I took the previous feedback into account and corrected the code.

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 67.97386% with 49 lines in your changes missing coverage. Please review.

Project coverage is 85.34%. Comparing base (1a40faf) to head (9e3a9e8).

Files with missing lines Patch % Lines
.../trainer_server/internal/dataset/online_dataset.py 64.10% 28 Missing ⚠️
modyn/models/gpt2/gpt2.py 57.14% 9 Missing ⚠️
...trainer_server/internal/trainer/pytorch_trainer.py 69.23% 8 Missing ⚠️
modyn/models/tokenizers/gpt2_tokenizer.py 50.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #647      +/-   ##
==========================================
- Coverage   85.58%   85.34%   -0.25%     
==========================================
  Files         258      261       +3     
  Lines       11378    11464      +86     
==========================================
+ Hits         9738     9784      +46     
- Misses       1640     1680      +40     

☔ View full report in Codecov by Sentry.


@MaxiBoether (Contributor) left a comment


Hey,

Thanks again for the changes. I did a first pass (without a detailed look at the intricacies of the OnlineDataset) and realized the GetNL call is still there. Let's make this a bit more readable by removing this as mentioned in the detailed comments, and only having the relevant diff here. If you could add some tests for the stuff you change/add, that would also be great. Thank you!

Comment on lines 19 to 20
no_labels: bool = Field(
False,

Instead of having a no_labels flag, which is conceptually not part of the pipeline, let's call it has_labels and make it part of the dataset configuration.
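
The suggested rename could look like the following minimal sketch. It uses a plain dataclass in place of Modyn's real pydantic config classes, and all names (DatasetConfig, include_labels) are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass


# Hypothetical sketch only: the flag is phrased positively as `has_labels`
# and lives on the dataset configuration rather than on the pipeline.
@dataclass
class DatasetConfig:
    dataset_id: str
    has_labels: bool = True  # most datasets are labeled; text generation would set False


def include_labels(config: DatasetConfig) -> bool:
    # Downstream components read the dataset config instead of a
    # pipeline-level no_labels flag.
    return config.has_labels


print(include_labels(DatasetConfig(dataset_id="wikitext", has_labels=False)))  # → False
```

Phrasing the flag positively avoids double negatives like `if not no_labels` in consuming code.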

modyn/config/examples/modyn_config.yaml (resolved)
modyn/models/gpt2/gpt2.py (outdated, resolved)
modyn/models/tokenizers/hf_tokenizer.py (outdated, resolved)
modyn/models/tokenizers/hf_tokenizer.py (resolved)
@@ -30,7 +30,6 @@ def __init__(self, supervisor: Supervisor, modyn_config: dict) -> None:
def start_pipeline(self, request: StartPipelineRequest, context: grpc.ServicerContext) -> PipelineResponse:
tid = threading.get_native_id()
pid = os.getpid()


unnecessary changes

@@ -295,33 +295,34 @@ TEST_F(BinaryFileWrapperTest, TestGetSamplesFromIndices) {


Can you also add some unit tests please to test the include labels false case? (for all file wrappers)
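
A test of the requested shape could be sketched as follows. This uses a stand-in wrapper rather than any of Modyn's real file wrapper classes; FakeFileWrapper and its methods are illustrative assumptions only:

```python
# Hypothetical sketch of a unit test for the include-labels-False case.
class FakeFileWrapper:
    """Minimal stand-in file wrapper that can be configured to skip labels."""

    def __init__(self, samples, labels, include_labels=True):
        self._samples = samples
        self._labels = labels
        self._include_labels = include_labels

    def get_sample(self, index):
        return self._samples[index]

    def get_label(self, index):
        # With labels disabled, the wrapper reports no label at all.
        if not self._include_labels:
            return None
        return self._labels[index]


def test_include_labels_false():
    wrapper = FakeFileWrapper([b"a", b"b"], [0, 1], include_labels=False)
    assert wrapper.get_sample(1) == b"b"  # samples are still served
    assert wrapper.get_label(1) is None   # but labels are suppressed


test_include_labels_false()
```

The same pattern would be repeated once per file wrapper implementation, as the reviewer asks.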

else: # If we have failed, we need to filter out yielded samples
# Note that the returned order by storage is non-deterministic
if not self._include_labels:
yield keys, list(response.samples), None, response_time

As mentioned by codecov, we are missing a test for the include labels False case for the online dataset. Can you extend the test suite?
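
Such a test could follow the sketch below: a fake storage response is pushed through logic shaped like the snippet above, and the test asserts that the label slot of the yielded tuple is None. FakeResponse and yield_batch are illustrative names, not the OnlineDataset's real API:

```python
# Hypothetical sketch for the include-labels-False path of the test suite.
class FakeResponse:
    def __init__(self, samples):
        self.samples = samples


def yield_batch(keys, response, response_time, include_labels):
    # Mirrors the branch under review: without labels, yield None in the
    # label position; with labels, yield one label per sample.
    if not include_labels:
        yield keys, list(response.samples), None, response_time
    else:
        yield keys, list(response.samples), [0] * len(response.samples), response_time


keys, samples, labels, _ = next(
    yield_batch([1, 2, 3], FakeResponse([b"x", b"y", b"z"]), 0.01, include_labels=False)
)
assert labels is None                   # no labels yielded in the unlabeled case
assert samples == [b"x", b"y", b"z"]    # samples pass through unchanged
```

Asserting on the full yielded tuple also pins down the tuple layout, which guards against accidental reordering of the yielded fields.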

for batch in self._train_dataloader:
stopw.stop("FetchBatch")
batch_timings.append(stopw.stop("IndivFetchBatch"))
retrieve_weights_from_dataloader, weighted_optimization = self.weights_handling(len(batch))


Can you limit the diff to the relevant changes and not commit the changed newlines, please :)? Unless it's an intended cleanup, it's better to only commit the actual changes.

Comment on lines -920 to +937
# compute the scores and accumulate them
model_output = self._model.model(data) if self._downsampler.forward_required else torch.Tensor()
embeddings = self.get_embeddings_if_recorded()
# Inform the downsampler

Please clean up the diff of this file to make it easier to see what is actually changing that is relevant.
