Ported token mapping function #22

Merged 10 commits on Feb 14, 2024

Conversation

menamerai
Collaborator

Ported mapping function over from previous repo. Made modifications so it works with the HF Dataset classes.

For issue #8.

@menamerai menamerai self-assigned this Feb 2, 2024
@menamerai menamerai linked an issue Feb 2, 2024 that may be closed by this pull request
@menamerai menamerai marked this pull request as draft February 2, 2024 16:54
Contributor

@jettjaniak jettjaniak left a comment

so for this to be a script it has to do something, right? Like take a dataset name as an argument and save/upload the output. I think I'm satisfied with this being simple enough to not require tests, but we need to run it and confirm it works



def token_map(
    tokenized_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset,
Contributor

why so many?

Collaborator Author

The load_dataset function returns one of those 4 types. Pylance complained when I annotated the parameter with just one of the four, so I included all 4 possible return types here. I think our tokenized dataset is only ever a Dataset.

Contributor

You can try using https://docs.python.org/3/library/typing.html#typing.cast before calling this function, to let pyright / pylance know we will only be using Dataset
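
To illustrate the typing.cast suggestion, here is a minimal sketch that uses stand-in classes instead of the real `datasets` types, so it runs without the library installed. `Dataset`, `IterableDataset`, `load_dataset_stub`, and `count_rows` are hypothetical names invented for this example; only `typing.cast` itself comes from the comment above.

```python
from typing import Union, cast


class Dataset:
    """Hypothetical stand-in for datasets.Dataset (map-style, has a length)."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows

    def __len__(self) -> int:
        return len(self.rows)


class IterableDataset:
    """Hypothetical stand-in for datasets.IterableDataset (streaming, no length)."""


def load_dataset_stub(streaming: bool = False) -> Union[Dataset, IterableDataset]:
    """Mimics a loader whose declared return type is a union."""
    if streaming:
        return IterableDataset()
    return Dataset([{"tokens": [0, 1, 2]}])


def count_rows(dataset: Dataset) -> int:
    # The signature only mentions the one type we actually support.
    return len(dataset)


# cast() does nothing at runtime. It only tells the type checker which member
# of the union we expect, so count_rows can keep its narrow signature.
loaded = cast(Dataset, load_dataset_stub())
n_rows = count_rows(loaded)
```

Because the cast is unchecked, it is only safe when the caller really does pass the narrower type; an `isinstance` assertion would be the checked alternative.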

mapping = {}
tokenized_dataset = cast(Dataset, tokenized_dataset)
for prompt_idx, prompt in enumerate(tokenized_dataset):
    prompt = cast(dict, prompt)
Contributor

what's the error here, pyright doesn't realize prompt has a __getitem__ that takes a string key?

Collaborator Author

Yes, this is the full message:

Argument of type "Literal['tokens']" cannot be assigned to parameter "__s" of type "slice" in function "__getitem__"
  "Literal['tokens']" is incompatible with "slice"

I tested the dataset live, and the prompt is a dict when I access it, so I cast it to dict. Should I add a # type: ignore comment instead, or do something else?
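
For context on where the cast sits, here is a rough sketch of a mapping loop of this shape. The merged implementation is not shown in this thread, so the function name `token_map_sketch`, the `(prompt_idx, position)` output format, and the use of a plain list of dicts in place of a `Dataset` are all assumptions made for illustration.

```python
from typing import Iterable, cast


def token_map_sketch(rows: Iterable[object]) -> dict[int, list[tuple[int, int]]]:
    """Map each token id to every (prompt_idx, position) where it occurs."""
    mapping: dict[int, list[tuple[int, int]]] = {}
    for prompt_idx, prompt in enumerate(rows):
        # Runtime no-op: only tells the type checker that each row is a dict,
        # so indexing with the string key "tokens" type-checks.
        prompt = cast(dict, prompt)
        for position, token in enumerate(prompt["tokens"]):
            mapping.setdefault(token, []).append((prompt_idx, position))
    return mapping


# Plain dict rows stand in for the rows a tokenized dataset would yield.
rows = [{"tokens": [0, 1, 0]}, {"tokens": [1, 2]}]
result = token_map_sketch(rows)
# result == {0: [(0, 0), (0, 2)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
```

Compared with `# type: ignore`, the cast silences only this one inference problem and documents the expected row type, while a blanket ignore would also hide any future error on the same line.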

Comment on lines 28 to 62
tokenized_dataset = Dataset.from_dict(
    {
        "tokens": [
            [
                0,
                1,
                2,
                3,
                4,
                5,
                0,
                6,
                7,
                0,
                1,
                2,
                3,
                4,
                5,
                0,
                6,
                7,
                0,
                1,
                2,
                3,
                4,
                5,
                0,
                6,
                7,
            ],
        ]
    }
)
Contributor

this is bad xD
you can wrap it in # fmt: off / # fmt: on comments so black leaves it alone, and keep it at a reasonable number of lines

Collaborator Author

Hahaha yeah thank you I wasn't sure how to deal with that
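
One way the compacted test data could look, as a sketch: Black skips any region between `# fmt: off` and `# fmt: on`, so the list can stay at nine tokens per row. The variable names `tokens` and `test_rows` are made up here, and the dict is left as plain Python so the snippet runs without `datasets`; in the test it would be passed to `Dataset.from_dict` as in the excerpt above.

```python
# fmt: off
tokens = [
    0, 1, 2, 3, 4, 5, 0, 6, 7,
    0, 1, 2, 3, 4, 5, 0, 6, 7,
    0, 1, 2, 3, 4, 5, 0, 6, 7,
]
# fmt: on

# Same single-prompt structure as the original 35-line literal.
test_rows = {"tokens": [tokens]}
```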

@jettjaniak
Contributor

test looks good!
left some minor comments
please move token_map.py to eval folder and the test to tests/eval/test_token_map.py
then we need a script in scripts/ that takes HF dataset name and produces a pickle
click "ready for review" when that's done and ping me please

@menamerai menamerai marked this pull request as ready for review February 10, 2024 05:15
@menamerai
Collaborator Author

@jettjaniak I've updated most things, but still a little confused about some things:

  • What should I do about the typing on the prompt? Should I ignore the type with # type: ignore, should I keep the cast, or should I do something else?
  • I only added a basic script that can take any hf dataset string and make a pickle in data/pickle_name.pkl, but should it be more specific? Should it do something else?

@jettjaniak
Contributor

let's discuss at check-in

@menamerai menamerai force-pushed the 8-map-tokens-to-prompts-and-positions branch from 94fa5f1 to 3dedb4c Compare February 14, 2024 18:47
@menamerai menamerai merged commit b8d0d8c into main Feb 14, 2024
1 check passed
@jettjaniak jettjaniak deleted the 8-map-tokens-to-prompts-and-positions branch May 22, 2024 10:05
Development

Successfully merging this pull request may close these issues.

map tokens to prompts and positions
2 participants