multi-label classification #4

jesuiskelly · 2021-02-18T18:13:15Z

Hi there, thanks for the tutorial here: https://blogs.rstudio.com/ai/posts/2019-09-30-bert-r/ It's very useful! I wonder if you have one on multi-label classification? (For essentially the same dataset?) Or some code that helps me do that? Thank you very much in advance!

turgut090 · 2021-02-19T17:10:16Z

Hi @jesuiskelly. For the same dataset, try setting random labels from 1 to 3 in the output column. Then change this part:

input_1 = get_layer(model,name = 'Input-Token')$input
input_2 = get_layer(model,name = 'Input-Segment')$input
inputs = list(input_1,input_2)

dense = get_layer(model,name = 'NSP-Dense')$output

outputs = dense %>% layer_dense(units=3L, activation='softmax', # 3 labels, so 3 units and activation to softmax
                         kernel_initializer=initializer_truncated_normal(stddev = 0.02),
                         name = 'output')

model = keras_model(inputs = inputs,outputs = outputs)

And change the loss to categorical cross-entropy:

model %>% compile(
  k_bert$AdamWarmup(decay_steps=decay_steps, 
                    warmup_steps=warmup_steps, lr=learning_rate),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)

jesuiskelly · 2021-02-19T20:08:42Z

Thanks so much for coming back to me. This is very helpful! I tried it on my code and made a couple of changes:

* The loss function is sparse_categorical_crossentropy. I tried categorical_crossentropy first but got an error; the message suggested this one instead, and it worked.
* I generated six labels, 0 to 5. They start at 0 because when I specify units=6L, Python expects labels 0 to 5. (I got an error for using 6 at first.)
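The error with categorical_crossentropy is consistent with how the two losses read their targets: categorical_crossentropy expects one one-hot row per example, while sparse_categorical_crossentropy expects a 0-based integer class id. A minimal base-R sketch of the conversion, assuming six classes; `one_hot` is a hypothetical helper written for illustration (the keras R package offers `to_categorical()` for the same job):

```r
# Hypothetical helper: turn 0-based integer class ids into one-hot rows.
one_hot <- function(labels, num_classes) {
  stopifnot(all(labels >= 0), all(labels < num_classes))
  out <- matrix(0L, nrow = length(labels), ncol = num_classes)
  # Row i gets a 1 in column labels[i] + 1 (R is 1-based, class ids are 0-based).
  out[cbind(seq_along(labels), labels + 1L)] <- 1L
  out
}

labels <- c(0L, 5L, 2L)             # valid ids for units = 6L are 0..5
y_one_hot <- one_hot(labels, 6L)
print(y_one_hot)
```

Passing `y_one_hot` would suit categorical_crossentropy, whereas the raw `labels` vector suits the sparse variant. A label of 6 fails the range check here for the same reason it failed in the model: six units only cover ids 0 to 5.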

The model was fitted; however, I'm concerned that I might have made a mistake, as the accuracy is very low: 0.2042.

The labels I used are not randomly generated, but rather taken from here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/, which is a virtually identical dataset, except that each of the label columns holds 0 or 1. I created the target column using the following code:

```r
train_data_raw <- read_csv(paste0(path_proj, path_data, "/train.csv")) # downloaded from link above
tmp_data = data.table(train_data_raw[1:42000,])

# names of the label columns (everything from the third column on)
tmp_col = names(train_data_raw[3:length(names(tmp_data))])

# per row, take the label column with the largest value
# (max.col() breaks ties at random by default)
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols=tmp_col]

tmp_data[is.na(target)]

unique(tmp_data[, target])

# recode label names as class ids 0 to 5
tmp_data[target=="obscene", target := 0L] %>%
  .[target=="toxic", target := 1L] %>%
  .[target=="insult", target := 2L] %>%
  .[target=="severe_toxic", target := 3L] %>%
  .[target=="identity_hate", target := 4L] %>%
  .[target=="threat", target := 5L]

unique(tmp_data[, target])

train_data = tmp_data[1:40000,]
test_data = tmp_data[40001:42000,]
```

In other words, given that my labels are not random, I'd expect BERT to deliver much higher accuracy. Do you think I have done something wrong?

Separately, it would be good to know if there's a way to get a predicted probability for each of the labels. I've copied my code below in case it helps.

Thank you so much for your help and advice again!

```r
library(tidyverse)
library(keras)
library(reticulate)
library(data.table)
Sys.timezone()
Sys.setenv(TZ = "UTC")
options(scipen = 999)

Sys.setenv(TF_KERAS=1)

reticulate::py_config() # 3.6
reticulate::py_module_available('keras_bert') # TRUE
tensorflow::tf_version()
k_bert = import('keras_bert')

# path_proj and path_data are set elsewhere (not shown)
path_b_pret = paste0(path_proj, path_data, "/uncased_L-12_H-768_A-12")
path_b_conf = file.path(path_b_pret, "bert_config.json")
path_b_chkp = file.path(path_b_pret, "bert_model.ckpt")
path_b_vcab = paste0(path_b_pret, "/vocab.txt")

token_dict = k_bert$load_vocabulary(path_b_vcab)
tokenizer = k_bert$Tokenizer(token_dict)

seq_length = 50L
bch_size = 70
epochs = 1
learning_rate = 1e-4

DATA_COLUMN = 'comment_text'
LABEL_COLUMN = 'target'

model = k_bert$load_trained_model_from_checkpoint(
  path_b_conf,
  path_b_chkp,
  training=TRUE,
  trainable=TRUE,
  seq_len=seq_length)

summary(model)

# tokenize every row; returns lists of token ids, segment ids and targets
tokenize_fun = function(dataset) {
  c(indices, target, segments) %<-% list(list(), list(), list())
  for (i in 1:nrow(dataset)) {
    c(indices_tok, segments_tok) %<-% tokenizer$encode(dataset[[DATA_COLUMN]][i],
                                                       max_len=seq_length) # encode with padding
    indices = indices %>% append(list(as.matrix(indices_tok)))
    target = target %>% append(dataset[[LABEL_COLUMN]][i])
    segments = segments %>% append(list(as.matrix(segments_tok)))
  }
  return(list(indices, segments, target))
}

# same file as in the snippet above
train_data_raw <- read_csv(paste0(path_proj, path_data, "/train.csv"))
tmp_data = data.table(train_data_raw[1:42000,])
tmp_col = names(train_data_raw[3:length(names(tmp_data))])
# max.col() breaks ties at random by default
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols=tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])

tmp_data[target=="obscene", target := 0L] %>%
  .[target=="toxic", target := 1L] %>%
  .[target=="insult", target := 2L] %>%
  .[target=="severe_toxic", target := 3L] %>%
  .[target=="identity_hate", target := 4L] %>%
  .[target=="threat", target := 5L]

unique(tmp_data[, target]) # check

train_data = tmp_data[1:40000,]
test_data = tmp_data[40001:42000,]

c(x_train, x_segment, y_train) %<-% tokenize_fun(train_data)

train = do.call(cbind,x_train) %>% t()
segments = do.call(cbind,x_segment) %>% t()
targets = do.call(cbind,y_train) %>% t()

concat = c(list(train),list(segments))

c(decay_steps, warmup_steps) %<-% k_bert$calc_train_steps(
  targets %>% length(),
  batch_size=bch_size,
  epochs=epochs
)

input_1 = get_layer(model,name = 'Input-Token')$input
input_2 = get_layer(model,name = 'Input-Segment')$input
inputs = list(input_1,input_2)

dense = get_layer(model,name = 'NSP-Dense')$output

outputs = dense %>% layer_dense(units=6L, activation='softmax',
                                kernel_initializer=initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
freeze_weights(model, from = "NSP-Dense") # not sure if this should be kept
summary(model)

model %>% compile(
  k_bert$AdamWarmup(decay_steps=decay_steps,
                    warmup_steps=warmup_steps, lr=learning_rate),
  loss = 'sparse_categorical_crossentropy', # integer class ids, one class per row
  metrics = 'accuracy'
)

history = model %>% fit(concat,
                        targets,
                        epochs=epochs,
                        batch_size=bch_size,
                        validation_split=0.2)

c(x_test, x_t_segment, y_test) %<-% tokenize_fun(test_data)

x_test = do.call(cbind,x_test) %>% t()
x_t_segment = do.call(cbind,x_t_segment) %>% t()

concat2 = c(list(x_test),list(x_t_segment))

res = model %>% predict(concat2)
```
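A note on the `target` column built above: `max.col()` resolves ties at random unless told otherwise, so any row whose six label columns are all 0 (most rows in the Jigsaw training file carry no label at all) receives an arbitrary class, and rows with several labels set to 1 are also decided by chance. That may explain an accuracy near chance level. The sketch below reproduces the effect on made-up rows; the `"none"` class is one possible workaround and an assumption of this example, not something from the tutorial:

```r
# Toy stand-in for the six 0/1 label columns (made-up rows, not real data).
labels <- matrix(
  c(0, 0, 0, 0, 0, 0,   # clean comment: no label set
    1, 0, 1, 0, 1, 0,   # toxic + obscene + insult
    0, 0, 0, 1, 0, 0),  # threat only
  ncol = 6, byrow = TRUE,
  dimnames = list(NULL, c("toxic", "severe_toxic", "obscene",
                          "threat", "insult", "identity_hate"))
)

# Default ties.method = "random": the all-zero row lands on an arbitrary column.
set.seed(1)
random_pick <- replicate(200, max.col(labels)[1])
print(table(colnames(labels)[random_pick]))

# Possible alternative (assumption): deterministic ties plus an explicit
# "none" class for rows without any label.
target <- colnames(labels)[max.col(labels, ties.method = "first")]
target[rowSums(labels) == 0] <- "none"
print(target)  # "none" "toxic" "threat"
```

Even with deterministic ties, collapsing several labels into one class discards information, which is where a multi-label setup becomes the better fit.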

turgut090 · 2021-02-21T09:12:27Z

Please take a look at this kernel: https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-
From that kernel:

Important Note: In general,
* For binary classification, we can have 1 output unit, use sigmoid activation in the output layer, and use binary cross-entropy loss

* For multi-class classification, we can have N output units, use softmax activation in the output layer, and use categorical cross-entropy loss

* For multi-label classification, we can have N output units, use sigmoid activation in the output layer, and use binary cross-entropy loss

Maybe this could help you get a better score? And please use AUC as the metric: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
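To make the difference between the last two cases concrete, here is a small base-R illustration on made-up logits for a single comment. Softmax forces the six outputs to compete and sum to 1, while sigmoid gives each label its own independent probability, which is what lets several labels be "on" at once. The helper functions are written out by hand for illustration and are not the Keras implementations:

```r
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }
sigmoid <- function(z) 1 / (1 + exp(-z))

# Made-up logits for one comment over the six labels.
logits <- c(toxic = 3.0, severe_toxic = -2.0, obscene = 2.5,
            threat = -4.0, insult = 2.0, identity_hate = -3.0)

p_softmax <- softmax(logits)   # competes across labels, sums to 1
p_sigmoid <- sigmoid(logits)   # one independent probability per label

# Binary cross-entropy averaged over labels, for a multi-hot target.
binary_crossentropy <- function(y_true, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip to avoid log(0)
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

y_true <- c(1, 0, 1, 0, 1, 0)        # toxic, obscene and insult all apply
print(round(p_softmax, 3))
print(round(p_sigmoid, 3))
print(binary_crossentropy(y_true, p_sigmoid))
```

With softmax, the three applicable labels have to share a total probability of 1; with sigmoid, all three can sit well above 0.5 at the same time.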

jesuiskelly · 2021-02-22T13:53:33Z

Hi there, thanks for this. It does help. I've now:

* updated the targets to be a matrix of N rows x 6 columns (6 == number of labels)
* applied the sigmoid activation and the binary_crossentropy loss function
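For reference, a target matrix of that shape can be sketched in base R as follows. The data frame holds made-up rows laid out like the Jigsaw train.csv, and the column names are assumed to match that file:

```r
# Toy rows mimicking the Jigsaw train.csv layout (made-up data).
train_data <- data.frame(
  id = c("a", "b", "c"),
  comment_text = c("text one", "text two", "text three"),
  toxic = c(0, 1, 0), severe_toxic = c(0, 0, 0), obscene = c(0, 1, 0),
  threat = c(0, 0, 1), insult = c(0, 1, 0), identity_hate = c(0, 0, 0)
)

label_cols <- c("toxic", "severe_toxic", "obscene",
                "threat", "insult", "identity_hate")

# One row per comment, one 0/1 column per label; a row may hold several 1s or none.
targets <- as.matrix(train_data[, label_cols])
storage.mode(targets) <- "numeric"

print(dim(targets))       # 3 rows, 6 columns
print(rowSums(targets))   # number of labels set per comment
```

Unlike a single class id, this keeps the all-zero rows and the multiply-labeled rows exactly as they appear in the data.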

Now I've got an AUC of ~0.6 to ~0.7 (depending on the learning rate, etc.), which is much better.
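A single AUC figure can hide large differences between labels, so checking it per label may be worthwhile. The sketch below computes the ROC AUC per column with the rank-based (Mann-Whitney) formula on made-up targets and scores; `auc_one` is a hypothetical helper, and its result will not match the thresholded approximation of tf.keras.metrics.AUC exactly:

```r
# Rank-based (Mann-Whitney) ROC AUC for one label; hypothetical helper.
auc_one <- function(y_true, score) {
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  if (n_pos == 0 || n_neg == 0) return(NA_real_)  # AUC is undefined here
  r <- rank(score)                                 # average ranks for ties
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up targets and predicted probabilities for two labels.
y_true <- cbind(toxic  = c(1, 0, 1, 0, 0),
                threat = c(0, 0, 0, 1, 0))
y_prob <- cbind(toxic  = c(0.9, 0.2, 0.6, 0.4, 0.1),
                threat = c(0.3, 0.1, 0.2, 0.05, 0.4))

per_label_auc <- sapply(colnames(y_true),
                        function(k) auc_one(y_true[, k], y_prob[, k]))
print(per_label_auc)  # toxic = 1 (perfect ranking), threat = 0 (fully inverted)
```

Rare labels such as threat have very few positives in a 2,000-row test split, so their per-label AUC will be noisy.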

The only place where I'm stuck now is getting meaningful prediction results. The code `model %>% predict(concat2)` produces the following, which makes no sense given how little the numbers vary between rows. I tried predict_proba(), but it's not a method available for this model. Any ideas on how I can get the probability of each label?

Thank you so much for your help again!

turgut090 · 2021-02-22T18:36:27Z

I think you are doing it right.
This is how it is done from the python side:
https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-

y_preds = rnn_model.predict(test_data)
# Assign the model's predictions to the label columns of the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = y_preds

Just use more data: at least 200k rows, and see if it helps.
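On the R side the same step might look like the sketch below. With a sigmoid output layer, the matrix returned by `predict()` already holds one probability per label, so no separate predict_proba() call is needed. Here `res` is a made-up stand-in for the real prediction matrix, and both the column order and the 0.5 cut-off are assumptions:

```r
label_cols <- c("toxic", "severe_toxic", "obscene",
                "threat", "insult", "identity_hate")

# Stand-in for `res = model %>% predict(concat2)`: made-up probabilities,
# one row per test comment and one column per label.
res <- matrix(c(0.91, 0.10, 0.75, 0.02, 0.66, 0.05,
                0.03, 0.01, 0.02, 0.01, 0.04, 0.01),
              ncol = 6, byrow = TRUE)

# Attach label names to the probability columns.
pred <- as.data.frame(res)
names(pred) <- label_cols

# Assumed 0.5 cut-off; each label is thresholded independently.
pred_labels <- apply(res >= 0.5, 1,
                     function(hit) paste(label_cols[hit], collapse = ","))
print(pred)
print(pred_labels)  # "toxic,obscene,insult" and "" (no label for the second row)
```

The column order of `res` follows the column order of the target matrix used in training, so `label_cols` has to match it.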

jesuiskelly · 2021-02-23T12:40:03Z

Thanks so much. Good to know that at least it's not my mistake :) Great, will experiment some more. Thanks again for your help and for making the tutorial available. Much appreciated! Kelly
