How to load pretrained word embeddings (e.g., ELMo, BERT, XLNet) in PyTorch.
- python 3.6.x
- pytorch 1.3.1
- pip install gpustat [if gpu is used]
- ELMo in allennlp: pip install allennlp
- BERT/XLNET in transformers: pip install transformers
Usage: python elmo_bert_xlnet_layer.py
We usually want word-level embeddings from BERT/XLNet models, but their tokenizers may split a single word into several tokens. In that case, word embeddings can be recovered by using the alignment from BERT/XLNet tokens back to the original words.
For example, the sentence
"i dont care wether it provides free wifi or not"
can be tokenized as
['i', 'dont', 'care', 'wet', '##her', 'it', 'provides', 'free', 'wi', '##fi', 'or', 'not']
Here "wether" and "wifi" are each split into two tokens, so the 10 words yield 12 tokens.
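The token-to-word alignment for this example can be sketched as follows. This is a minimal illustration, not the code in elmo_bert_xlnet_layer.py: the helper name wordpiece_alignment is hypothetical, and it assumes BERT-style WordPiece output where a '##' prefix marks a continuation of the previous word. XLNet's SentencePiece tokens mark word boundaries differently and would need a different rule.

```python
def wordpiece_alignment(tokens):
    """Group token indices by the word they belong to.

    Hypothetical helper: assumes a '##' prefix means "continues the
    previous word", as in BERT's WordPiece tokenizer.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)   # continuation piece: attach to the last word
        else:
            groups.append([i])     # a new word starts here
    return groups


tokens = ['i', 'dont', 'care', 'wet', '##her', 'it', 'provides',
          'free', 'wi', '##fi', 'or', 'not']
groups = wordpiece_alignment(tokens)
print(groups)
# [[0], [1], [2], [3, 4], [5], [6], [7], [8, 9], [10], [11]]
```

Each inner list holds the token positions of one original word, so 'wet' and '##her' (positions 3 and 4) map back to "wether".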
We provide three types of alignment:
- 'ori': use the output embeddings of BERT/XLNet as they are (one per token) to represent each input sentence, dropping only the output embeddings of special tokens such as '[CLS]' and '[SEP]'.
- 'first': use the embedding of the first token of each word as the word embedding.
- 'avg': average the embeddings of all the tokens of each word to get the word embedding.
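The three modes can be illustrated with a small sketch. It is not the implementation in elmo_bert_xlnet_layer.py: the function pool_word_embeddings and its arguments are hypothetical, plain Python lists stand in for tensors so the sketch runs without PyTorch, and special tokens are assumed to have been removed already.

```python
def pool_word_embeddings(token_vectors, groups, mode="avg"):
    """Turn per-token vectors into per-word vectors (illustrative only).

    token_vectors: one vector (list of floats) per token, special tokens removed.
    groups: token indices for each word, e.g. [[0, 1], [2]].
    mode: 'ori', 'first', or 'avg', mirroring the list above.
    """
    if mode == "ori":
        # Keep one vector per token; no word-level pooling.
        return [list(v) for v in token_vectors]
    if mode == "first":
        # Take the vector of the first token of each word.
        return [list(token_vectors[g[0]]) for g in groups]
    if mode == "avg":
        # Average the vectors of all tokens of each word, dimension by dimension.
        pooled = []
        for g in groups:
            vecs = [token_vectors[i] for i in g]
            pooled.append([sum(dim) / len(vecs) for dim in zip(*vecs)])
        return pooled
    raise ValueError("unknown mode: " + repr(mode))


# Toy input: tokens ['wi', '##fi', 'or'] with made-up 2-dimensional vectors.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
groups = [[0, 1], [2]]

first = pool_word_embeddings(token_vectors, groups, mode="first")
avg = pool_word_embeddings(token_vectors, groups, mode="avg")
print(first)  # [[1.0, 2.0], [5.0, 6.0]]
print(avg)    # [[2.0, 3.0], [5.0, 6.0]]
```

With 'ori' the output length equals the number of tokens, while 'first' and 'avg' return one vector per original word, which is what is needed when labels or features are defined at the word level.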