In addition to standard WordEmbeddings and CharacterEmbeddings, we also provide classes for BERT, ELMo and Flair embeddings. These embeddings enable you to train truly state-of-the-art NLP models.
This tutorial explains how to use these embeddings. We assume that you're familiar with the base types of this library as well as standard word embeddings, in particular the StackedEmbeddings class.
All word embedding classes inherit from the TokenEmbeddings class and implement the embed() method, which you call to embed your text. This means that for most users of Flair, the complexity of different embeddings remains hidden behind this interface: simply instantiate the embedding class you require and call embed() to embed your text.
All embeddings produced with our methods are PyTorch vectors, so they can be immediately used for training and fine-tuning.
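For example, here is a minimal sketch using the classic WordEmbeddings class mentioned above (with GloVe vectors), showing that each embedded token carries a plain PyTorch tensor:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings
# init a standard GloVe word embedding
glove_embedding = WordEmbeddings('glove')
# create and embed a sentence
sentence = Sentence('The grass is green .')
glove_embedding.embed(sentence)
# each token now holds a PyTorch tensor that you can feed into a model
for token in sentence:
    print(token.text, token.embedding.shape)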
Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. The key differences are: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word has different embeddings depending on its contextual use.
With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, just as you would standard word embeddings:
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings
# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
flair_embedding_forward.embed(sentence)
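To see the contextual effect described above, you can embed the same word in two different sentences and compare the resulting vectors. A small self-contained sketch (the example sentences are of course arbitrary):
import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings
# init the forward Flair embedding
flair_embedding_forward = FlairEmbeddings('news-forward')
# the word 'green' appears in two different contexts
sentence_1 = Sentence('The grass is green .')
sentence_2 = Sentence('The new recruit is still green .')
flair_embedding_forward.embed(sentence_1)
flair_embedding_forward.embed(sentence_2)
# retrieve the embedding of 'green' in each sentence
green_1 = [token.embedding for token in sentence_1 if token.text == 'green'][0]
green_2 = [token.embedding for token in sentence_2 if token.text == 'green'][0]
# the two vectors differ because the embeddings depend on the surrounding text
print(torch.cosine_similarity(green_1, green_2, dim=0))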
You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class.
Currently, the following contextual string embeddings are provided (more coming):
ID | Language | Embedding |
---|---|---|
'multi-forward' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
'multi-backward' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
'multi-forward-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
'multi-backward-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
'news-forward' | English | Forward LM embeddings over 1 billion word corpus |
'news-backward' | English | Backward LM embeddings over 1 billion word corpus |
'news-forward-fast' | English | Smaller, CPU-friendly forward LM embeddings over 1 billion word corpus |
'news-backward-fast' | English | Smaller, CPU-friendly backward LM embeddings over 1 billion word corpus |
'mix-forward' | English | Forward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
'mix-backward' | English | Backward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
'german-forward' | German | Forward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
'german-backward' | German | Backward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
'polish-forward' | Polish | Added by @borchmann: Forward LM embeddings over web crawls (Polish part of CommonCrawl) |
'polish-backward' | Polish | Added by @borchmann: Backward LM embeddings over web crawls (Polish part of CommonCrawl) |
'slovenian-forward' | Slovenian | Added by @stefan-it: Forward LM embeddings over various sources (Europarl, Wikipedia and OpenSubtitles2018) |
'slovenian-backward' | Slovenian | Added by @stefan-it: Backward LM embeddings over various sources (Europarl, Wikipedia and OpenSubtitles2018) |
'bulgarian-forward' | Bulgarian | Added by @stefan-it: Forward LM embeddings over various sources (Europarl, Wikipedia or SETimes) |
'bulgarian-backward' | Bulgarian | Added by @stefan-it: Backward LM embeddings over various sources (Europarl, Wikipedia or SETimes) |
'dutch-forward' | Dutch | Added by @stefan-it: Forward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
'dutch-backward' | Dutch | Added by @stefan-it: Backward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
'swedish-forward' | Swedish | Added by @stefan-it: Forward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
'swedish-backward' | Swedish | Added by @stefan-it: Backward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
'french-forward' | French | Added by @mhham: Forward LM embeddings over French Wikipedia |
'french-backward' | French | Added by @mhham: Backward LM embeddings over French Wikipedia |
'czech-forward' | Czech | Added by @stefan-it: Forward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
'czech-backward' | Czech | Added by @stefan-it: Backward LM embeddings over various sources (Europarl, Wikipedia or OpenSubtitles2018) |
'portuguese-forward' | Portuguese | Added by @ericlief: Forward LM embeddings |
'portuguese-backward' | Portuguese | Added by @ericlief: Backward LM embeddings |
'basque-forward' | Basque | Added by @stefan-it: Forward LM embeddings |
'basque-backward' | Basque | Added by @stefan-it: Backward LM embeddings |
'spanish-forward' | Spanish | Added by @iamyihwa: Forward LM embeddings over Wikipedia |
'spanish-backward' | Spanish | Added by @iamyihwa: Backward LM embeddings over Wikipedia |
'spanish-forward-fast' | Spanish | Added by @iamyihwa: CPU-friendly forward LM embeddings over Wikipedia |
'spanish-backward-fast' | Spanish | Added by @iamyihwa: CPU-friendly backward LM embeddings over Wikipedia |
So, if you want to load embeddings from the English news backward LM model, instantiate the class as follows:
flair_backward = FlairEmbeddings('news-backward')
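If you work on a machine without a GPU, the '-fast' variants listed in the table are smaller, CPU-friendly alternatives, for example:
# smaller, CPU-friendly variants of the English news models
flair_forward_fast = FlairEmbeddings('news-forward-fast')
flair_backward_fast = FlairEmbeddings('news-backward-fast')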
We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard word embeddings into the mix. So, our recommended StackedEmbeddings for most English tasks is:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
WordEmbeddings('glove'),
FlairEmbeddings('news-forward'),
FlairEmbeddings('news-backward'),
])
That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.
sentence = Sentence('The grass is green .')
# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)
# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
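If you want to verify the concatenation, you can inspect the length of the stacked embedding and of the individual token vectors (the exact numbers depend on the models you load):
# the length of the stacked embedding is the sum of its parts
print(stacked_embeddings.embedding_length)
# each token vector has exactly this length
for token in sentence:
    print(token.embedding.size())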
BERT embeddings were developed by Devlin et al. (2018) and are a different kind of powerful word embedding based on a bidirectional transformer architecture. Flair uses the Hugging Face implementation. The embeddings themselves are wrapped into our simple embedding interface so that they can be used like any other embedding.
from flair.data import Sentence
from flair.embeddings import BertEmbeddings
# init embedding
embedding = BertEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
You can load any of the pre-trained BERT models by providing the model string during initialization, as shown in the example after the table:
ID | Language | Embedding |
---|---|---|
'bert-base-uncased' | English | 12-layer, 768-hidden, 12-heads, 110M parameters |
'bert-large-uncased' | English | 24-layer, 1024-hidden, 16-heads, 340M parameters |
'bert-base-cased' | English | 12-layer, 768-hidden, 12-heads, 110M parameters |
'bert-large-cased' | English | 24-layer, 1024-hidden, 16-heads, 340M parameters |
'bert-base-multilingual-cased' | 104 languages | 12-layer, 768-hidden, 12-heads, 110M parameters |
'bert-base-chinese' | Chinese Simplified and Traditional | 12-layer, 768-hidden, 12-heads, 110M parameters |
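For instance, to load the large cased English model from the table, you could initialize the embeddings like this:
from flair.embeddings import BertEmbeddings
# init the large cased English BERT model
embedding = BertEmbeddings('bert-large-cased')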
ELMo embeddings were presented by Peters et al. in 2018. They use a bidirectional recurrent neural network to predict the next word in a text. Flair uses the AllenNLP implementation. As this implementation comes with a lot of sub-dependencies, which we don't want to include in Flair, you first need to install the library via
pip install allennlp
before you can use it in Flair.
Using the embeddings is as simple as using any other embedding type:
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings
# init embedding
embedding = ELMoEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
AllenNLP provides the following pre-trained models. To use any of these models inside Flair, simply specify the embedding ID when initializing the ELMoEmbeddings class, as shown in the example after the table.
ID | Language | Embedding |
---|---|---|
'small' | English | 1024-hidden, 1 layer, 14.6M parameters |
'medium' | English | 2048-hidden, 1 layer, 28.0M parameters |
'original' | English | 4096-hidden, 2 layers, 93.6M parameters |
'pt' | Portuguese | |
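For instance, to load the small English model from the table:
from flair.embeddings import ELMoEmbeddings
# init the small English ELMo model
embedding = ELMoEmbeddings('small')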
You can very easily mix and match Flair, ELMo, BERT and classic word embeddings. All you need to do is instantiate each embedding you wish to combine and pass them to a StackedEmbeddings.
For instance, let's say we want to combine the multilingual Flair and BERT embeddings to train a hyper-powerful multilingual downstream task model.
First, instantiate the embeddings you wish to combine:
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, BertEmbeddings
# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')
# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')
Now instantiate the StackedEmbeddings class and pass it a list containing these three embeddings.
from flair.embeddings import StackedEmbeddings
# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])
That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.
sentence = Sentence('The grass is green .')
# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)
# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
Words are now embedded using a concatenation of three different embeddings. This means that the resulting embedding vector is still a single PyTorch vector.
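Because both the multilingual Flair models and multilingual BERT cover several languages, the same stack can embed sentences in different languages. A brief sketch with illustrative German and French sentences:
# the multilingual stack embeds sentences in different languages with the same models
sentence_de = Sentence('Das Gras ist grün .')
sentence_fr = Sentence("L' herbe est verte .")
stacked_embeddings.embed(sentence_de)
stacked_embeddings.embed(sentence_fr)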
You can now either look into document embeddings to embed entire text passages with one vector for tasks such as text classification, or go directly to the tutorial about loading your corpus, which is a prerequisite for training your own models.