
KeyError: 'tokenizer_config' #66

Closed
guustfranssensEY opened this issue Jan 20, 2022 · 2 comments

Comments

@guustfranssensEY

I am working on integrating my custom model vinai/bertweet-base with Ecco; however, I ran into the following issue:

Traceback (most recent call last):
  File "experiment_ecco.py", line 44, in <module>
    nmf_1.explore()
  File "C:\Users\XXXX\anaconda3\envs\disaster_tweets\lib\site-packages\ecco\output.py", line 827, in explore
    }})"""
KeyError: 'tokenizer_config'

I created the lm in the following way:

# loading in the tokenizer and the fine-tuned model
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, use_fast=False)
model = torch.load("bertmodel.pth")
# this model was obtained by fine-tuning
# AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base",
#     output_hidden_states=True, output_attentions=True, num_labels=2)

model_config = {
    'embedding': 'roberta.embeddings.word_embeddings',
    'type': 'mlm',
    'activations': [r'intermediate\.dense'],
    'token_prefix': '',
    'partial_token_prefix': ''
}

from ecco.lm import LM

lm = LM(model=model, tokenizer=tokenizer, model_name="vinai/bertweet-base",
        config=model_config, collect_activations_flag=True, verbose=True)

tweet = "So running down the stairs was a bad idea full on collided... With the floor ??"
inputs = lm.tokenizer([tweet], return_tensors="pt")
output = lm(inputs)

nmf_1 = output.run_nmf(n_components=8)
nmf_1.explore()

Upon further inspection, I believe the error comes from the following lines in ecco/output.py:

js = f"""
         requirejs(['basic', 'ecco'], function(basic, ecco){{
            const viz_id = basic.init()
            
            ecco.interactiveTokensAndFactorSparklines(viz_id, {data},
            {{
            'hltrCFG': {{'tokenization_config': {json.dumps(self.config['tokenizer_config'])}
                }}
            }})
         }}, function (err) {{
            console.log(err);
        }})"""

I could not trace back the origin of tokenizer_config, so I assume it also has to be passed in the model_config for a custom model. If so, this should be specified in the docs.
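For reference, the failing lookup could be made defensive with dict.get. This is a hypothetical patch sketch, not Ecco's actual code; the helper name is made up:

```python
import json

# Hypothetical defensive rewrite of the lookup in ecco/output.py:
# fall back to the prefix keys the user already supplied instead of
# raising KeyError when 'tokenizer_config' is missing.
def tokenizer_config_json(config):
    fallback = {
        'token_prefix': config.get('token_prefix', ''),
        'partial_token_prefix': config.get('partial_token_prefix', ''),
    }
    return json.dumps(config.get('tokenizer_config', fallback))
```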

Or could this issue somehow be related to #65?

@guustfranssensEY
Author

guustfranssensEY commented Jan 21, 2022

After checking the config of a supported model, e.g. bert-base-uncased:

lm = ecco.from_pretrained('bert-base-uncased', activations=True)
lm.model_config

{'activations': ['\\d+\\.output\\.dense'],
 'embedding': 'embeddings.word_embeddings',
 'partial_token_prefix': '##',
 'token_prefix': '',
 'tokenizer_config': {'partial_token_prefix': '##', 'token_prefix': ''},
 'type': 'mlm'}

I found that I had to add the following as the tokenizer config:

'tokenizer_config': {'partial_token_prefix': '', 'token_prefix': ''}

Therefore my full config for the custom model is now:

model_config = {
    'embedding': 'roberta.embeddings.word_embeddings',
    'type': 'mlm',
    'activations': [r'intermediate\.dense'],
    'token_prefix': '',
    'partial_token_prefix': '',
    'tokenizer_config': {'partial_token_prefix': '', 'token_prefix': ''},
}

After fixing this, my code produces the beautiful visuals @jalammar has made :)

P.S. Could the tokenizer_config be added to the documentation?

@jalammar
Owner

Awesome! Thanks for working through this, @guustfranssensEY. The intent was for 'tokenizer_config' to be created automatically by the library (so the user doesn't have to repeat themselves needlessly). Nice catch finding out that it doesn't kick in when users supply their own config object.

I think the next step is to remove tokenizer_config altogether. I've opened issue #67 to track this.
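Until then, the auto-derivation could look something like this. A minimal sketch under assumed names (the helper is hypothetical, not Ecco's API): build 'tokenizer_config' from the prefixes already present in a user-supplied config so the user never has to spell it out twice:

```python
# Hypothetical sketch: derive 'tokenizer_config' from the prefix keys
# in a user-supplied model config, leaving an explicit value untouched.
def with_tokenizer_config(user_config):
    merged = dict(user_config)
    merged.setdefault('tokenizer_config', {
        'token_prefix': user_config.get('token_prefix', ''),
        'partial_token_prefix': user_config.get('partial_token_prefix', ''),
    })
    return merged
```

With this in place, the reporter's original config (without the 'tokenizer_config' key) would work unchanged.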
