Ported token mapping function #22

menamerai · 2024-02-02T16:54:02Z

Ported mapping function over from previous repo. Made modifications so it works with the HF Dataset classes.

For issue #8.

jettjaniak

so for this to be a script it has to do something, right? Like take a dataset name as an argument and save/upload the output. I think I'm satisfied with this being simple enough to not require tests, but we need to run it and confirm it works

jettjaniak · 2024-02-02T17:45:29Z

scripts/map_tokens.py

+
+
+def token_map(
+    tokenized_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset,


why so many?

The load_dataset function returns one of those 4 types. Pylance got mad at me when I annotated with just one of the four, so I included all 4 possible return types here. I think our tokenized dataset is only ever a Dataset.

You can try using https://docs.python.org/3/library/typing.html#typing.cast to let pyright / pylance know we will only be using Dataset, before calling this function
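A minimal sketch of the cast suggestion, assuming nothing about the real code beyond what is quoted above. The classes and `load_dataset_stub` are hypothetical stand-ins for the HF types and `load_dataset`, so the snippet runs without the `datasets` package. The point is only that `typing.cast` narrows the static type at the call site and does nothing at runtime.

```python
from typing import Union, cast


class Dataset:
    """Hypothetical stand-in for datasets.Dataset (not the real class)."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class DatasetDict(dict):
    """Hypothetical stand-in for datasets.DatasetDict."""


def load_dataset_stub(name: str) -> Union[Dataset, DatasetDict]:
    # Stand-in for load_dataset: the declared return type is a union,
    # which is what makes the type checker complain downstream.
    return Dataset([{"tokens": [0, 1, 2]}])


def count_tokens(tokenized_dataset: Dataset) -> int:
    # The function itself only has to accept the one type it really uses.
    return sum(len(row["tokens"]) for row in tokenized_dataset)


loaded = load_dataset_stub("some/dataset")
# cast() returns its argument unchanged; it only informs the type checker.
dataset = cast(Dataset, loaded)
total = count_tokens(dataset)
```

With the cast done by the caller, the function signature can take a plain `Dataset` instead of the four-way union.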

src/delphi/dataset/token_map.py

jettjaniak · 2024-02-10T03:00:19Z

src/delphi/dataset/token_map.py

+    mapping = {}
+    tokenized_dataset = cast(Dataset, tokenized_dataset)
+    for prompt_idx, prompt in enumerate(tokenized_dataset):
+        prompt = cast(dict, prompt)


what's the error here, pyright doesn't realize prompt has a get_item?

Yes. this is the full message:

Argument of type "Literal['tokens']" cannot be assigned to parameter "__s" of type "slice" in function "__getitem__"
  "Literal['tokens']" is incompatible with "slice"

I tested the dataset live, and the prompt is a dict when I access it, so I cast it to dict. Should I add a # type: ignore comment instead, or do something else?
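For context, here is a rough sketch of what a mapping loop like the one in the diff could look like. Only the `mapping = {}` line, the `enumerate` loop, the `cast(dict, prompt)` call, and the "tokens" column come from this thread. The output shape (token to list of `(prompt_idx, position)` pairs) is an assumption based on the linked issue title, and a plain list of dicts stands in for the HF `Dataset`.

```python
from typing import cast


def token_map(tokenized_dataset):
    """Map each token to the (prompt index, position) pairs where it occurs.

    Assumed output shape; the merged implementation may differ.
    """
    mapping: dict[int, list[tuple[int, int]]] = {}
    for prompt_idx, prompt in enumerate(tokenized_dataset):
        # Tell the type checker each row is a dict; no runtime effect.
        prompt = cast(dict, prompt)
        for position, token in enumerate(prompt["tokens"]):
            mapping.setdefault(token, []).append((prompt_idx, position))
    return mapping


# A list of dicts stands in for a Dataset with a "tokens" column.
example = [{"tokens": [0, 1, 0]}, {"tokens": [1, 2]}]
result = token_map(example)
```

The cast keeps the type narrowing explicit and local, whereas a `# type: ignore` would silence every error on that line.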

jettjaniak · 2024-02-10T03:01:57Z

tests/test_dataset.py

+    tokenized_dataset = Dataset.from_dict(
+        {
+            "tokens": [
+                [
+                    0,
+                    1,
+                    2,
+                    3,
+                    4,
+                    5,
+                    0,
+                    6,
+                    7,
+                    0,
+                    1,
+                    2,
+                    3,
+                    4,
+                    5,
+                    0,
+                    6,
+                    7,
+                    0,
+                    1,
+                    2,
+                    3,
+                    4,
+                    5,
+                    0,
+                    6,
+                    7,
+                ],
+            ]
+        }
+    )


this is bad xD
you can add a # fmt: off / # fmt: on pair for black (it doesn't read noqa) and keep it at a reasonable number of lines

Hahaha yeah thank you I wasn't sure how to deal with that
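Two ways the fixture could be kept short, as a sketch. Black is steered with `# fmt: off` / `# fmt: on` (a `# noqa` comment only affects linters such as flake8), and because the token list is one 9-token pattern repeated three times, list repetition also works. The variable names are illustrative.

```python
# Option 1: ask black to leave the hand-formatted literal alone.
# fmt: off
tokens = [
    0, 1, 2, 3, 4, 5, 0, 6, 7,
    0, 1, 2, 3, 4, 5, 0, 6, 7,
    0, 1, 2, 3, 4, 5, 0, 6, 7,
]
# fmt: on

# Option 2: build the same list by repeating the 9-token pattern.
same_tokens = [0, 1, 2, 3, 4, 5, 0, 6, 7] * 3

# The dict that would be passed to Dataset.from_dict in the test.
fixture = {"tokens": [tokens]}
```

Either form can replace the one-value-per-line literal inside `Dataset.from_dict(...)` without changing the test data.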

jettjaniak · 2024-02-10T03:05:55Z

test looks good!
left some minor comments
please move token_map.py to eval folder and the test to tests/eval/test_token_map.py
then we need a script in scripts/ that takes HF dataset name and produces a pickle
click "ready for review" when that's done and ping me please

menamerai · 2024-02-10T05:17:34Z

@jettjaniak I've updated most things, but I'm still a little confused about a couple of points:

What should I do about the typing on the prompt? Should I ignore the type with # type: ignore, keep the cast, or do something else?
I only added a basic script that takes any HF dataset name and writes a pickle to data/pickle_name.pkl. Should it be more specific, or do something else?
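One possible shape for such a script, sketched under several assumptions: the `--output` flag, its default path, the `split="train"` choice, and the `delphi.eval.token_map` import path (inferred from the request to move the module into the eval folder) are all hypothetical, not taken from the merged code. The heavy imports are deferred into `main` so argument parsing and pickling can be exercised on their own.

```python
import argparse
import pickle
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Build a token map for a tokenized HF dataset."
    )
    parser.add_argument("dataset_name", help="HF dataset name to load")
    # Hypothetical flag and default; the thread only mentions data/<name>.pkl.
    parser.add_argument("--output", default="data/token_map.pkl")
    return parser.parse_args(argv)


def save_pickle(obj, path):
    """Pickle obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def main(argv=None):
    args = parse_args(argv)
    # Deferred imports: both are assumptions about the environment.
    from datasets import load_dataset
    from delphi.eval.token_map import token_map  # assumed module path

    dataset = load_dataset(args.dataset_name, split="train")
    save_pickle(token_map(dataset), args.output)


# A real script would end with the usual
# `if __name__ == "__main__": main()` guard.
```

It would then be invoked as something like `python scripts/map_tokens.py <dataset_name> --output data/<name>.pkl`.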

jettjaniak · 2024-02-10T05:34:18Z

let's discuss at check-in

menamerai requested review from jaidhyani, jettjaniak and joshuawe February 2, 2024 16:54

menamerai self-assigned this Feb 2, 2024

menamerai linked an issue Feb 2, 2024 that may be closed by this pull request

map tokens to prompts and positions #8

Closed

menamerai marked this pull request as draft February 2, 2024 16:54

jettjaniak requested changes Feb 2, 2024

jettjaniak reviewed Feb 10, 2024

src/delphi/dataset/token_map.py (outdated, resolved)

jettjaniak reviewed Feb 10, 2024

src/delphi/dataset/token_map.py (outdated, resolved)

jettjaniak reviewed Feb 10, 2024

menamerai marked this pull request as ready for review February 10, 2024 05:15

menamerai added 8 commits February 14, 2024 13:47

48e56c5 add token mapping
210a28a use cast for typing
ebd0314 moved token_map function to library
293f2db added test case for token_map
9ba18dd added test cases for token_map
fa83336 review changes
399c9e3 add load_hf_dataset function
3dedb4c shorten docstring

menamerai force-pushed the 8-map-tokens-to-prompts-and-positions branch from 94fa5f1 to 3dedb4c Compare February 14, 2024 18:47

99b4d88 revert pickling changes

jettjaniak approved these changes Feb 14, 2024

40f83d2 added token mapping script

menamerai merged commit b8d0d8c into main Feb 14, 2024
1 check passed

jettjaniak deleted the 8-map-tokens-to-prompts-and-positions branch May 22, 2024 10:05
