
Text generation #647

Open · wants to merge 40 commits into main
Conversation

@sjohn4 (Collaborator) commented on Jan 13, 2025

I took the previous feedback into account and corrected the code.

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 67.97386% with 49 lines in your changes missing coverage. Please review.

Project coverage is 85.34%. Comparing base (1a40faf) to head (9e3a9e8).

Files with missing lines Patch % Lines
.../trainer_server/internal/dataset/online_dataset.py 64.10% 28 Missing ⚠️
modyn/models/gpt2/gpt2.py 57.14% 9 Missing ⚠️
...trainer_server/internal/trainer/pytorch_trainer.py 69.23% 8 Missing ⚠️
modyn/models/tokenizers/gpt2_tokenizer.py 50.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #647      +/-   ##
==========================================
- Coverage   85.58%   85.34%   -0.25%     
==========================================
  Files         258      261       +3     
  Lines       11378    11464      +86     
==========================================
+ Hits         9738     9784      +46     
- Misses       1640     1680      +40     

☔ View full report in Codecov by Sentry.


@MaxiBoether (Contributor) left a comment


Hey,

Thanks again for the changes. I did a first pass (without a detailed look at the intricacies of the OnlineDataset) and realized the GetNL call is still there. Let's make this a bit more readable by removing this as mentioned in the detailed comments, and only having the relevant diff here. If you could add some tests for the stuff you change/add, that would also be great. Thank you!

Comment on lines 19 to 20
no_labels: bool = Field(
False,

Instead of having a no_labels flag, which is conceptually not part of the pipeline, let's call it has_labels and make it part of the dataset configuration.
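
The suggested rename could look like the following minimal sketch. It uses a plain dataclass in place of Modyn's real pydantic config classes, and all names (DatasetConfig, include_labels) are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass


# Hypothetical sketch only: the flag is phrased positively as `has_labels`
# and lives on the dataset configuration rather than on the pipeline.
@dataclass
class DatasetConfig:
    dataset_id: str
    has_labels: bool = True  # most datasets are labeled; text generation would set False


def include_labels(config: DatasetConfig) -> bool:
    # Downstream components read the dataset config instead of a
    # pipeline-level no_labels flag.
    return config.has_labels


print(include_labels(DatasetConfig(dataset_id="wikitext", has_labels=False)))  # → False
```

Phrasing the flag positively avoids double negatives like `if not no_labels` in consuming code.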

modyn/config/examples/modyn_config.yaml (resolved)
modyn/models/gpt2/gpt2.py (outdated, resolved)
modyn/models/tokenizers/hf_tokenizer.py (outdated, resolved)
modyn/models/tokenizers/hf_tokenizer.py (resolved)
@@ -30,7 +30,6 @@ def __init__(self, supervisor: Supervisor, modyn_config: dict) -> None:
def start_pipeline(self, request: StartPipelineRequest, context: grpc.ServicerContext) -> PipelineResponse:
tid = threading.get_native_id()
pid = os.getpid()


unnecessary changes

@@ -295,33 +295,34 @@ TEST_F(BinaryFileWrapperTest, TestGetSamplesFromIndices) {


Can you also add some unit tests please to test the include labels false case? (for all file wrappers)
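
A test of the requested shape could be sketched as follows. This uses a stand-in wrapper rather than any of Modyn's real file wrapper classes; FakeFileWrapper and its methods are illustrative assumptions only:

```python
# Hypothetical sketch of a unit test for the include-labels-False case.
class FakeFileWrapper:
    """Minimal stand-in file wrapper that can be configured to skip labels."""

    def __init__(self, samples, labels, include_labels=True):
        self._samples = samples
        self._labels = labels
        self._include_labels = include_labels

    def get_sample(self, index):
        return self._samples[index]

    def get_label(self, index):
        # With labels disabled, the wrapper reports no label at all.
        if not self._include_labels:
            return None
        return self._labels[index]


def test_include_labels_false():
    wrapper = FakeFileWrapper([b"a", b"b"], [0, 1], include_labels=False)
    assert wrapper.get_sample(1) == b"b"  # samples are still served
    assert wrapper.get_label(1) is None   # but labels are suppressed


test_include_labels_false()
```

The same pattern would be repeated once per file wrapper implementation, as the reviewer asks.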

else: # If we have failed, we need to filter out yielded samples
# Note that the returned order by storage is non-deterministic
if not self._include_labels:
yield keys, list(response.samples), None, response_time

As mentioned by codecov, we are missing a test for the include labels False case for the online dataset. Can you extend the test suite?
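
Such a test could follow the sketch below: a fake storage response is pushed through logic shaped like the snippet above, and the test asserts that the label slot of the yielded tuple is None. FakeResponse and yield_batch are illustrative names, not the OnlineDataset's real API:

```python
# Hypothetical sketch for the include-labels-False path of the test suite.
class FakeResponse:
    def __init__(self, samples):
        self.samples = samples


def yield_batch(keys, response, response_time, include_labels):
    # Mirrors the branch under review: without labels, yield None in the
    # label position; with labels, yield one label per sample.
    if not include_labels:
        yield keys, list(response.samples), None, response_time
    else:
        yield keys, list(response.samples), [0] * len(response.samples), response_time


keys, samples, labels, _ = next(
    yield_batch([1, 2, 3], FakeResponse([b"x", b"y", b"z"]), 0.01, include_labels=False)
)
assert labels is None                   # no labels yielded in the unlabeled case
assert samples == [b"x", b"y", b"z"]    # samples pass through unchanged
```

Asserting on the full yielded tuple also pins down the tuple layout, which guards against accidental reordering of the yielded fields.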

for batch in self._train_dataloader:
stopw.stop("FetchBatch")
batch_timings.append(stopw.stop("IndivFetchBatch"))
retrieve_weights_from_dataloader, weighted_optimization = self.weights_handling(len(batch))


Can you limit the diff to the relevant changes and not commit the changed newlines, please :)? Unless it's an intended cleanup, it's better to only commit the actual changes.

Comment on lines -920 to +937
# compute the scores and accumulate them
model_output = self._model.model(data) if self._downsampler.forward_required else torch.Tensor()
embeddings = self.get_embeddings_if_recorded()
# Inform the downsampler

Please clean up the diff of this file to make it easier to see what is actually changing that is relevant.
