Ported token mapping function #22
Conversation
So for this to be a script it has to do something, right? Like take a dataset name as an argument and save/upload the output. I think I'm satisfied with this being simple enough not to require tests, but we need to run it and confirm it works.
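A minimal sketch of the kind of entry point being asked for (the flag names, the default output path, and the JSON output format are all assumptions for illustration, not the interface the PR ended up with):

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: positional dataset name plus an output path.
    parser = argparse.ArgumentParser(
        description="Build a token -> positions map from a tokenized dataset"
    )
    parser.add_argument("dataset_name", help="name of the tokenized HF dataset")
    parser.add_argument(
        "--output",
        default="token_map.json",
        help="path to save the resulting mapping as JSON",
    )
    return parser


def save_mapping(mapping: dict, path: str) -> None:
    # JSON object keys must be strings, so stringify the token ids.
    with open(path, "w") as f:
        json.dump({str(tok): pos for tok, pos in mapping.items()}, f)
```

A `main()` would then parse the args, call `load_dataset` / `token_map`, and hand the result to `save_mapping`.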
scripts/map_tokens.py (outdated)

    def token_map(
        tokenized_dataset: DatasetDict | Dataset | IterableDatasetDict | IterableDataset,
why so many?
The load_dataset function returns one of those four types. Pylance got mad at me when I used just one of the four, so I included all four possible return types here. I think our tokenized dataset only uses the Dataset class.
You can try using https://docs.python.org/3/library/typing.html#typing.cast to let pyright / pylance know we will only be using Dataset before calling this function.
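For reference, `typing.cast` is purely a static-checker hint: it performs no runtime conversion or check, it just tells pyright/pylance which member of a union we promise to be holding. A minimal self-contained sketch (the `Dataset` / `DatasetDict` classes below are stand-ins so the example runs without the `datasets` package installed):

```python
from typing import Union, cast


# Stand-ins for two of load_dataset's possible return types.
class Dataset: ...
class DatasetDict: ...


def load_something() -> Union[Dataset, DatasetDict]:
    # Mimics load_dataset returning a union type.
    return Dataset()


# cast() returns its second argument unchanged; only the checker narrows
# the type from the union down to Dataset.
ds = cast(Dataset, load_something())
```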
src/delphi/dataset/token_map.py (outdated)

    mapping = {}
    tokenized_dataset = cast(Dataset, tokenized_dataset)
    for prompt_idx, prompt in enumerate(tokenized_dataset):
        prompt = cast(dict, prompt)
what's the error here? pyright doesn't realize prompt has a __getitem__?
Yes, this is the full message:

    Argument of type "Literal['tokens']" cannot be assigned to parameter "__s" of type "slice" in function "__getitem__"
    "Literal['tokens']" is incompatible with "slice"

I tested the dataset live, and the prompt seems to be of type dict when I access it, so I cast it to dict. Should I add a type: ignore comment instead, or do something else?
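Putting the snippets together, the mapping logic under discussion might look roughly like this (a sketch only: the return structure, token → (prompt index, position) pairs, is inferred from the function's name, and plain dicts stand in for HF Dataset rows so the example is self-contained):

```python
from typing import cast


def token_map(tokenized_dataset) -> dict[int, list[tuple[int, int]]]:
    # Maps each token id to every (prompt index, position) where it occurs.
    mapping: dict[int, list[tuple[int, int]]] = {}
    for prompt_idx, prompt in enumerate(tokenized_dataset):
        # cast() silences pyright's __getitem__ complaint; no runtime effect.
        prompt = cast(dict, prompt)
        for token_idx, token in enumerate(prompt["tokens"]):
            mapping.setdefault(token, []).append((prompt_idx, token_idx))
    return mapping
```

Because `cast` is a no-op at runtime, it is usually preferred over `# type: ignore`, which suppresses every error on its line rather than the one you meant.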
tests/test_dataset.py (outdated)

    tokenized_dataset = Dataset.from_dict(
        {
            "tokens": [
                [
                    0,
                    1,
                    2,
                    3,
                    4,
                    5,
                    0,
                    6,
                    7,
                    0,
                    1,
                    2,
                    3,
                    4,
                    5,
                    0,
                    6,
                    7,
                    0,
                    1,
                    2,
                    3,
                    4,
                    5,
                    0,
                    6,
                    7,
                ],
            ]
        }
    )
this is bad xD
you can add a noqa comment for black and keep it at a reasonable number of lines
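One note on the suggestion: `# noqa` is a flake8 directive; black itself is silenced with `# fmt: off` / `# fmt: on` (or `# fmt: skip` on a single statement). A sketch of keeping the test literal compact that way:

```python
# fmt: off
# black leaves the region between these markers untouched, so the literal
# can stay at three lines instead of one element per line.
TOKENS = [0, 1, 2, 3, 4, 5, 0, 6, 7,
          0, 1, 2, 3, 4, 5, 0, 6, 7,
          0, 1, 2, 3, 4, 5, 0, 6, 7]
# fmt: on
```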
Hahaha yeah, thank you, I wasn't sure how to deal with that
test looks good!
@jettjaniak I've updated most things, but I'm still a little confused about a few points:
let's discuss at check-in
94fa5f1 to 3dedb4c
Ported the mapping function over from the previous repo. Made modifications so it works with the HF Dataset classes.
For issue #8.