Our goal is to fine-tune a language model and build from it a text-based emotion classifier. In the end we will be able to generate an unlimited number of sentences, with labels assigned by the emotion classifier.
We use two datasets, both generated from open data:
- The first was built from an OpenSubtitles dataset, using the Italian and English languages.
- The second was built from TED Talks, again in Italian and English, by web scraping the TED Talk transcripts.
We then used an English emotion classifier, based on BERT, to classify the English sentences; the predicted labels were assigned to the corresponding Italian sentences to build the Italian dataset.
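The label-transfer step can be sketched as follows. The column names and the `classify_en` function are hypothetical stand-ins (the real classifier is a BERT model returning an emotion label per English sentence); only the alignment logic reflects what we did:

```python
import pandas as pd

# Parallel corpus: English sentences aligned with their Italian translations
parallel = pd.DataFrame({
    'en': ['I am so happy today', 'I miss my son'],
    'it': ['Sono così felice oggi', 'Mi manca mio figlio'],
})

def classify_en(sentence):
    # Stand-in for the English BERT emotion classifier
    return 'joy' if 'happy' in sentence else 'sadness'

# Assign each English prediction to the aligned Italian sentence
parallel['label'] = parallel['en'].apply(classify_en)
italian_dataset = parallel[['it', 'label']]
print(italian_dataset)
```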
We also created a third dataset by merging the Sub and Ted datasets.
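The merge itself amounts to a simple concatenation of the two labelled datasets. A minimal sketch, where the `text`/`label` column names are an assumption:

```python
import pandas as pd

sub = pd.DataFrame({'text': ['frase uno', 'frase due'], 'label': ['joy', 'anger']})
ted = pd.DataFrame({'text': ['frase tre'], 'label': ['neutral']})

# Stack the two datasets and rebuild a contiguous index
merged = pd.concat([sub, ted], ignore_index=True)
print(len(merged))  # 3
```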
Here we load the classifiers trained with ULMFiT in a previous work.
# Load the models
from fastai.text import *
import pandas as pd

learn_sub = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Sub/', 'export.pkl')
learn_ted = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Ted2/', 'export-ted2.pkl')
learn_merged = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Merged/', 'export.pkl')
We use this function to preprocess the data, removing special characters that don't affect the sentiment analysis (e.g. dots, commas, ...).
import re

def preprocess(s):
    s = s.lower()
    # Keep only letters (including accented ones), a few emphatic symbols, whitespace and emoji
    s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", ' ', s)
    # Collapse runs of whitespace into a single space
    s = re.sub(r"\s+", ' ', s)
    # Reduce characters repeated 3+ times to two (e.g. "bellooo" -> "belloo")
    s = re.sub(r'(\w)\1{2,}', r'\1\1', s)
    # Strip leading and trailing whitespace
    s = re.sub(r'^\s', '', s)
    s = re.sub(r'\s$', '', s)
    return s
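For example, applied to a noisy sentence the function lowercases the text, strips punctuation, and collapses character repetitions (the function body is repeated here so the snippet runs on its own):

```python
import re

def preprocess(s):
    s = s.lower()
    s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", ' ', s)
    s = re.sub(r"\s+", ' ', s)
    s = re.sub(r'(\w)\1{2,}', r'\1\1', s)
    s = re.sub(r'^\s', '', s)
    s = re.sub(r'\s$', '', s)
    return s

# Dots, colon and parenthesis are dropped; "bellooooo" shrinks to "belloo"
print(preprocess("Che bellooooo!!! Andiamo al mare... :)"))
# -> che belloo!!! andiamo al mare
```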
# Sentences to test
test_text = pd.DataFrame([
    'Questo è il giorno più felice della mia vita',
    'Penso di potermela cavare questa volta',
    'L\'altro giorno sono andato a mare',
    'Domani andro al mare, che bello',
    'Credo di amarti',
    'Sei una persona orribile',
    'Vorrei viaggiare, sarebbe bellissimo',
    'Non voglio più uscire di casa',
    'La giornata oggi è pessima, non mi va di uscire',
    'Oggi sono malinconico, non mi va di uscire',
    'Mi sento triste, mi manca mio figlio',
    'Mi manca mio figlio',
])
print('Labels:\t', learn_merged.data.classes, '\n')
for s in test_text[0]:
    print('\n\nFrase: ', s)
    res = learn_sub.predict(preprocess(s))
    print('\n\tSub model: \n\t\tPredicted label: ', res[0], '\tProbabilities: ', res[2])
    res = learn_ted.predict(preprocess(s))
    print('\n\tTed model: \n\t\tPredicted label: ', res[0], '\tProbabilities: ', res[2])
    res = learn_merged.predict(preprocess(s))
    print('\n\tMerged model: \n\t\tPredicted label: ', res[0], '\tProbabilities: ', res[2])
Labels: ['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness']
Frase: Questo è il giorno più felice della mia vita
Sub model:
Predicted label: optimism Probabilities: tensor([0.0189, 0.2521, 0.0747, 0.4743, 0.0511, 0.1289])
Ted model:
Predicted label: optimism Probabilities: tensor([0.0087, 0.1474, 0.1952, 0.4646, 0.0576, 0.1264])
Merged model:
Predicted label: joy Probabilities: tensor([0.0037, 0.4122, 0.0781, 0.3479, 0.0423, 0.1158])
Frase: Penso di potermela cavare questa volta
Sub model:
Predicted label: optimism Probabilities: tensor([0.0919, 0.1144, 0.3329, 0.3894, 0.0640, 0.0073])
Ted model:
Predicted label: neutral Probabilities: tensor([0.0156, 0.0770, 0.6636, 0.1331, 0.0602, 0.0506])
Merged model:
Predicted label: optimism Probabilities: tensor([0.0298, 0.1212, 0.2287, 0.5359, 0.0721, 0.0123])
Frase: L'altro giorno sono andato a mare
Sub model:
Predicted label: neutral Probabilities: tensor([0.0525, 0.1982, 0.3469, 0.1801, 0.1552, 0.0670])
Ted model:
Predicted label: neutral Probabilities: tensor([0.0101, 0.0346, 0.4815, 0.3753, 0.0555, 0.0431])
Merged model:
Predicted label: optimism Probabilities: tensor([0.0227, 0.1902, 0.2247, 0.4212, 0.0987, 0.0424])
Frase: Domani andro al mare, che bello
Sub model:
Predicted label: joy Probabilities: tensor([0.0323, 0.5608, 0.0793, 0.2537, 0.0263, 0.0477])
Ted model:
Predicted label: neutral Probabilities: tensor([0.0082, 0.0242, 0.5839, 0.3462, 0.0221, 0.0154])
Merged model:
Predicted label: optimism Probabilities: tensor([0.0670, 0.2664, 0.0882, 0.4957, 0.0360, 0.0468])
Frase: Credo di amarti
Sub model:
Predicted label: optimism Probabilities: tensor([0.0212, 0.2248, 0.2798, 0.3959, 0.0726, 0.0056])
Ted model:
Predicted label: neutral Probabilities: tensor([0.0043, 0.0394, 0.5940, 0.3241, 0.0286, 0.0096])
Merged model:
Predicted label: optimism Probabilities: tensor([0.0368, 0.2224, 0.1299, 0.5295, 0.0596, 0.0218])
Frase: Sei una persona orribile
Sub model:
Predicted label: optimism Probabilities: tensor([0.0307, 0.2498, 0.1810, 0.5204, 0.0094, 0.0087])
Ted model:
Predicted label: neutral Probabilities: tensor([0.0080, 0.0261, 0.5722, 0.3457, 0.0396, 0.0084])
Merged model:
Predicted label: optimism Probabilities: tensor([0.0708, 0.2134, 0.2204, 0.4588, 0.0166, 0.0200])
Frase: Vorrei viaggiare, sarebbe bellissimo
Sub model:
Predicted label: optimism Probabilities: tensor([0.0091, 0.1212, 0.0607, 0.7423, 0.0539, 0.0127])
Ted model:
Predicted label: optimism Probabilities: tensor([0.0031, 0.0408, 0.1390, 0.5835, 0.2169, 0.0168])
Merged model:
Predicted label: optimism Probabilities: tensor([0.0037, 0.0731, 0.0312, 0.5713, 0.2942, 0.0265])
Frase: Non voglio più uscire di casa
Sub model:
Predicted label: pessimism Probabilities: tensor([0.1073, 0.0567, 0.0913, 0.1468, 0.5388, 0.0590])
Ted model:
Predicted label: optimism Probabilities: tensor([0.0039, 0.0251, 0.3116, 0.3779, 0.2718, 0.0098])
Merged model:
Predicted label: pessimism Probabilities: tensor([0.1089, 0.0430, 0.0988, 0.2247, 0.4779, 0.0468])
Frase: La giornata oggi è pessima, non mi va di uscire
Sub model:
Predicted label: pessimism Probabilities: tensor([0.1464, 0.1910, 0.1061, 0.1059, 0.2537, 0.1969])
Ted model:
Predicted label: pessimism Probabilities: tensor([0.0127, 0.1391, 0.1608, 0.0785, 0.4944, 0.1144])
Merged model:
Predicted label: joy Probabilities: tensor([0.1178, 0.2558, 0.2044, 0.2261, 0.1100, 0.0858])
Frase: Oggi sono malinconico, non mi va di uscire
Sub model:
Predicted label: joy Probabilities: tensor([0.2673, 0.3016, 0.1067, 0.0770, 0.1379, 0.1095])
Ted model:
Predicted label: pessimism Probabilities: tensor([0.0246, 0.1843, 0.1614, 0.1001, 0.3683, 0.1613])
Merged model:
Predicted label: joy Probabilities: tensor([0.1887, 0.2757, 0.1958, 0.1567, 0.1086, 0.0745])
Frase: Mi sento triste, mi manca mio figlio
Sub model:
Predicted label: sadness Probabilities: tensor([0.0336, 0.1779, 0.0621, 0.1184, 0.0442, 0.5639])
Ted model:
Predicted label: joy Probabilities: tensor([0.0064, 0.4620, 0.0547, 0.1508, 0.1481, 0.1780])
Merged model:
Predicted label: sadness Probabilities: tensor([0.0180, 0.3835, 0.0439, 0.0921, 0.0692, 0.3933])
Frase: Mi manca mio figlio
Sub model:
Predicted label: joy Probabilities: tensor([0.0810, 0.2809, 0.1750, 0.1954, 0.0569, 0.2109])
Ted model:
Predicted label: joy Probabilities: tensor([0.0134, 0.4548, 0.0661, 0.2426, 0.0582, 0.1649])
Merged model:
Predicted label: sadness Probabilities: tensor([0.0365, 0.2914, 0.0654, 0.1081, 0.0856, 0.4130])
Here we show the prediction probabilities of the Open Subtitles model for each of the previous test sentences, together with its confusion matrix.
print(learn_sub.data.classes, '\n')
for s in test_text[0]:
    res = learn_sub.predict(preprocess(s))
    print(s, res[2])
['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness']
Questo è il giorno più felice della mia vita tensor([0.0189, 0.2521, 0.0747, 0.4743, 0.0511, 0.1289])
Penso di potermela cavare questa volta tensor([0.0919, 0.1144, 0.3329, 0.3894, 0.0640, 0.0073])
L'altro giorno sono andato a mare tensor([0.0525, 0.1982, 0.3469, 0.1801, 0.1552, 0.0670])
Domani andro al mare, che bello tensor([0.0323, 0.5608, 0.0793, 0.2537, 0.0263, 0.0477])
Credo di amarti tensor([0.0212, 0.2248, 0.2798, 0.3959, 0.0726, 0.0056])
Sei una persona orribile tensor([0.0307, 0.2498, 0.1810, 0.5204, 0.0094, 0.0087])
Vorrei viaggiare, sarebbe bellissimo tensor([0.0091, 0.1212, 0.0607, 0.7423, 0.0539, 0.0127])
Non voglio più uscire di casa tensor([0.1073, 0.0567, 0.0913, 0.1468, 0.5388, 0.0590])
La giornata oggi è pessima, non mi va di uscire tensor([0.1464, 0.1910, 0.1061, 0.1059, 0.2537, 0.1969])
Oggi sono malinconico, non mi va di uscire tensor([0.2673, 0.3016, 0.1067, 0.0770, 0.1379, 0.1095])
Mi sento triste, mi manca mio figlio tensor([0.0336, 0.1779, 0.0621, 0.1184, 0.0442, 0.5639])
Here we show the prediction probabilities of the TED Talks model for each of the previous test sentences, together with its confusion matrix.
print(learn_ted.data.classes, '\n')
for s in test_text[0]:
    res = learn_ted.predict(preprocess(s))
    print(s, res[2])
['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness']
Questo è il giorno più felice della mia vita tensor([0.0087, 0.1474, 0.1952, 0.4646, 0.0576, 0.1264])
Penso di potermela cavare questa volta tensor([0.0156, 0.0770, 0.6636, 0.1331, 0.0602, 0.0506])
L'altro giorno sono andato a mare tensor([0.0101, 0.0346, 0.4815, 0.3753, 0.0555, 0.0431])
Domani andro al mare, che bello tensor([0.0082, 0.0242, 0.5839, 0.3462, 0.0221, 0.0154])
Credo di amarti tensor([0.0043, 0.0394, 0.5940, 0.3241, 0.0286, 0.0096])
Sei una persona orribile tensor([0.0080, 0.0261, 0.5722, 0.3457, 0.0396, 0.0084])
Vorrei viaggiare, sarebbe bellissimo tensor([0.0031, 0.0408, 0.1390, 0.5835, 0.2169, 0.0168])
Non voglio più uscire di casa tensor([0.0039, 0.0251, 0.3116, 0.3779, 0.2718, 0.0098])
La giornata oggi è pessima, non mi va di uscire tensor([0.0127, 0.1391, 0.1608, 0.0785, 0.4944, 0.1144])
Oggi sono malinconico, non mi va di uscire tensor([0.0246, 0.1843, 0.1614, 0.1001, 0.3683, 0.1613])
Mi sento triste, mi manca mio figlio tensor([0.0064, 0.4620, 0.0547, 0.1508, 0.1481, 0.1780])
Here we show the prediction probabilities of the merged-dataset model for each of the previous test sentences, together with its confusion matrix.
print(learn_merged.data.classes, '\n')
for s in test_text[0]:
    res = learn_merged.predict(preprocess(s))
    print(res[2])
['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness']
tensor([0.0037, 0.4122, 0.0781, 0.3479, 0.0423, 0.1158])
tensor([0.0298, 0.1212, 0.2287, 0.5359, 0.0721, 0.0123])
tensor([0.0227, 0.1902, 0.2247, 0.4212, 0.0987, 0.0424])
tensor([0.0670, 0.2664, 0.0882, 0.4957, 0.0360, 0.0468])
tensor([0.0368, 0.2224, 0.1299, 0.5295, 0.0596, 0.0218])
tensor([0.0708, 0.2134, 0.2204, 0.4588, 0.0166, 0.0200])
tensor([0.0037, 0.0731, 0.0312, 0.5713, 0.2942, 0.0265])
tensor([0.1089, 0.0430, 0.0988, 0.2247, 0.4779, 0.0468])
tensor([0.1178, 0.2558, 0.2044, 0.2261, 0.1100, 0.0858])
tensor([0.1887, 0.2757, 0.1958, 0.1567, 0.1086, 0.0745])
tensor([0.0180, 0.3835, 0.0439, 0.0921, 0.0692, 0.3933])
Here we use the language models to generate Italian text. We started from a first language model pretrained on an Italian Wikipedia dataset and fine-tuned it on each of our datasets, obtaining the final language models.
The accuracy of our models is around 20%. To obtain a good model a very large corpus is needed; we have one, but more time is required to process the Italian datasets. This will be our next work.
lm_ted = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Ted2/', 'Language_model.pkl')
# Prefix for the generated sentences
TEXT = 'Come'
# Number of words to generate per sentence
N_WORDS = 5
# Number of sentences to generate
N_SENTENCES = 7
print('\n\n'.join(lm_ted.predict(TEXT,N_WORDS) for _ in range(N_SENTENCES)))
Come terapie anni si minore di
Come mette anno detto zona ]
Come avanzate il ministro quattro ha
Come della loro 35 da vaccino
Come troppo alternative è un intendo
Come quello un nel molte l'
Come futuro l' fanteria . xxbos
lm_sub = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Sub/', 'Language-model-sub.pkl')
# Prefix for the generated sentences
TEXT = 'come'
# Number of words to generate per sentence
N_WORDS = 5
# Number of sentences to generate
N_SENTENCES = 7
print('\n\n'.join(lm_sub.predict(TEXT,N_WORDS) for _ in range(N_SENTENCES)))
come la fondo di meta con
come regolamento che la sapevi siete
come e da portarla parecchio per
come via un sto kit xxbos
come sono dolore che dopo mio
come per e settimana a via
come molto ? xxbos contratto xxbos
lm_merged = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Merged/', 'Language-model-merged.pkl')
# Prefix for the generated sentences
TEXT = 'come'
# Number of words to generate per sentence
N_WORDS = 5
# Number of sentences to generate
N_SENTENCES = 7
print('\n\n'.join(lm_merged.predict(TEXT,N_WORDS) for _ in range(N_SENTENCES)))
come margaret ancora spiegare ? xxbos
come ho voglio tutti tutti xxbos
come vai xxbos imparano aveva xxbos
come è questo po nella vita
come non ci noi resto guardie
come su hai organizzare andrebbe ad
come mi togli del costo ?
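Putting the pieces together, the end goal stated at the top (generate sentences and label them) is a simple loop over the two models. Below is a sketch of that control flow using stub classes in place of the fastai learners; in the notebook the real `lm_ted.predict` and `learn_merged.predict` would replace the stubs:

```python
class StubLanguageModel:
    # Stand-in for a fine-tuned language model learner
    def predict(self, text, n_words):
        # A real learner samples n_words tokens after the prefix
        return text + ' sono felice oggi'

class StubClassifier:
    # Stand-in for an emotion classifier learner
    def predict(self, text):
        # A real learner returns (label, class index, probabilities)
        return ('joy' if 'felice' in text else 'neutral',)

lm, clf = StubLanguageModel(), StubClassifier()

labelled = []
for _ in range(3):
    sentence = lm.predict('come', 5)   # generate a sentence
    label = clf.predict(sentence)[0]   # classify it (the notebook also applies preprocess first)
    labelled.append((sentence, label))

for sentence, label in labelled:
    print(label, '->', sentence)
```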