NER with Llama 3 #426

d-kleine · 2024-11-02T14:40:20Z

d-kleine
Nov 2, 2024

I have worked on a private project on how to adapt LLaMA 3.2, a decoder-only (autoregressive) transformer, for Named Entity Recognition (NER) with HuggingFace (so not implemented "from scratch"). Traditionally, encoder-only models like BERT have dominated NER tasks due to their ability to process input text bidirectionally, capturing rich contextual information. However, by removing the causal mask in LLaMA, we enable it to leverage bidirectional context while maintaining its strengths in generative tasks, making it a versatile solution for NER.

What do you think about this implementation?

Project: https://github.com/d-kleine/NER_decoder

rasbt · 2024-11-03T21:31:12Z

rasbt
Nov 3, 2024
Maintainer

That's pretty cool! I think that removing the causal mask is reasonable here. I've seen something similar in the recent classification finetuning papers where they used Llama models.

Based on the loss and qualitative eval at the end, it looks like it definitely works!

Btw. how long did it take to finetune? If it's not too long, I'd be curious to know how it would perform if you left the causal mask.

3 replies

d-kleine Nov 4, 2024
Author

I needed to fix the code as model.config.is_decoder = False does not seem to disable the attention masking, at least for Llama models. I have now implemented the bi-directionality based on this discussion, which works now.

I have tested running both models for a few epochs: The one with masking is faster than the one without (as expected), but after some training performs fairly well. The bidirectional model takes way longer to train (again, as expected) and seems to perform slightly - but not significantly - better than the one with the mask. But these are just first impressions from this toy experiment, nothing solid.

For the same notebook, I have tried out BERT as an encoder-only model some weeks ago: When using a decoder-only model like LLama 3.2, the training in both cases takes way longer (again, as expected, due to the more complex architecture). At least you can see that both approaches (attention masking, bi-directional) for decoder-only models see to work and could be considered as an alternative to encoder-only models for NER.

rasbt Nov 5, 2024
Maintainer

Thanks for sharing. Yeah, good pretrained decoder-only models are usually pretty capable, even if they don't see future tokens, so I can see the bidirectionality maybe not doing that much (but it should definitely help a little like you observed).

I think it makes intuitive sense too. For the example from your notebook

Example 1: Steve Jobs, the co-founder of Apple Inc., was born in San Francisco, California.

One could infer everything from the left context. Of course there will be exceptions, but in most cases you probably don't need the right context (although it can help)

d-kleine Nov 7, 2024
Author

Yeah, I agree. For me, it seems like disabling the attention mask doesn't really pay-off in terms of performance to execution time/costs. But maybe it can be beneficial for specific use cases (or languages) where inferring from both sides is helpful.

I just finalized the notebook and fixed an issue to display the performance metrics (the model overfits highly, but this is just a showcase anyways).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NER with Llama 3 #426

{{title}}

Replies: 1 comment 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

NER with Llama 3 #426

d-kleine Nov 2, 2024

Replies: 1 comment · 3 replies

rasbt Nov 3, 2024 Maintainer

d-kleine Nov 4, 2024 Author

rasbt Nov 5, 2024 Maintainer

d-kleine Nov 7, 2024 Author

d-kleine
Nov 2, 2024

Replies: 1 comment 3 replies

rasbt
Nov 3, 2024
Maintainer

d-kleine Nov 4, 2024
Author

rasbt Nov 5, 2024
Maintainer

d-kleine Nov 7, 2024
Author