
multi-label classification #4

Open
jesuiskelly opened this issue Feb 18, 2021 · 7 comments

@jesuiskelly
Hi there, thanks for the tutorial here: https://blogs.rstudio.com/ai/posts/2019-09-30-bert-r/ It's very useful! I wonder if you have one on multi-label classification (for essentially the same dataset), or some code that could help me do that? Thank you very much in advance!

@turgut090
Owner

Hi @jesuiskelly. For the same dataset, try setting random labels from 1 to 3 for the output column. Then change this part:

```r
input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs = list(input_1, input_2)

dense = get_layer(model, name = 'NSP-Dense')$output

outputs = dense %>% layer_dense(units = 3L, activation = 'softmax', # 3 labels, so 3 units and softmax activation
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
```

And change the loss to categorical cross-entropy:

```r
model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps,
                    warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)
```
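One detail worth flagging (my addition, not from the original comment): `categorical_crossentropy` expects one-hot target rows, not raw integer labels. A minimal base-R sketch of the conversion (`keras::to_categorical()` does the same thing for 0-based labels):

```r
# Toy integer labels in 1..3, as suggested above
labels <- c(1L, 3L, 2L, 1L)
n_classes <- 3L

# Index the identity matrix by label to get an N x 3 one-hot matrix
one_hot <- diag(n_classes)[labels, , drop = FALSE]
one_hot[1, ]  # label 1 -> row (1, 0, 0)
```

The sparse variant, `sparse_categorical_crossentropy`, skips this step and takes the integer labels directly (0-based).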

@jesuiskelly
Author

jesuiskelly commented Feb 19, 2021

Thanks so much for coming back to me. This is very helpful! I tried it on my code and made a couple of changes:

  • the loss function is sparse_categorical_crossentropy. I tried categorical_crossentropy first but got an error; the message suggested this instead, and it worked.
  • I generated six labels, 0 to 5. They start at 0 because when I specify units=6L, Python expects 0 to 5. (I got an error when I used 6 at first.)

So the model was fitted; however, I'm concerned that I might have made some mistakes, as the accuracy is very low: 0.2042.

The labels I used are not randomly generated but taken from here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/, which is a virtually identical dataset, except that each of the label columns is 0 or 1. I created the target column using the following code:

```r
train_data_raw <- read_csv(paste0(path_proj, path_data, "/train.csv")) # downloaded from link above
tmp_data = data.table(train_data_raw[1:42000,])

tmp_col = names(train_data_raw[3:length(names(tmp_data))])

tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols = tmp_col]

tmp_data[is.na(target)]

unique(tmp_data[, target])

tmp_data[target == "obscene", target := 0L] %>%
  .[target == "toxic", target := 1L] %>%
  .[target == "insult", target := 2L] %>%
  .[target == "severe_toxic", target := 3L] %>%
  .[target == "identity_hate", target := 4L] %>%
  .[target == "threat", target := 5L]

unique(tmp_data[, target])

train_data = tmp_data[1:40000,]
test_data = tmp_data[40001:42000,]
```
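One possible culprit for the low accuracy (a guess on my part, not something I have verified on the full dataset): `max.col()` forces every row into exactly one class. Comments with several flags keep only one of them, and comments with no flags at all still get a column picked (at random by default, since all-zero rows are one big tie). A tiny sketch:

```r
# Two pathological rows for the max.col() labelling step
m <- rbind(c(1, 1, 0),   # flagged both toxic and severe_toxic
           c(0, 0, 0))   # clean comment, no flags at all

# With ties.method = "first" both rows collapse to column 1;
# the default ties.method = "random" labels clean comments randomly
max.col(m, ties.method = "first")  # 1 1
```

Since clean comments dominate the Jigsaw data, that alone could inject a lot of label noise.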

In other words, given that my labels are not random, I'd expect BERT to deliver much higher accuracy. Do you think I have done something wrong?

Separately, it would be good to know if there's a way to get a probability of prediction for each of the labels. I copy my code below in case it is of help.

Thank you so much for your help and advice again!

```r
library(tidyverse)
library(keras)
library(reticulate)
library(data.table)
Sys.timezone()
Sys.setenv(TZ = "UTC")
options(scipen = 999)

Sys.setenv(TF_KERAS = 1)

reticulate::py_config() # 3.6
reticulate::py_module_available('keras_bert') # TRUE
tensorflow::tf_version()
k_bert = import('keras_bert')

path_b_pret = paste0(path_proj, path_data, "/uncased_L-12_H-768_A-12")
path_b_conf = file.path(path_b_pret, "bert_config.json")
path_b_chkp = file.path(path_b_pret, "bert_model.ckpt")
path_b_vcab = paste0(path_b_pret, "/vocab.txt")

token_dict = k_bert$load_vocabulary(path_b_vcab)
tokenizer = k_bert$Tokenizer(token_dict)

seq_length = 50L
bch_size = 70
epochs = 1
learning_rate = 1e-4

DATA_COLUMN = 'comment_text'
LABEL_COLUMN = 'target'

model = k_bert$load_trained_model_from_checkpoint(
  path_b_conf,
  path_b_chkp,
  training = T,
  trainable = T,
  seq_len = seq_length)

summary(model)

tokenize_fun = function(dataset) {
  c(indices, target, segments) %<-% list(list(), list(), list())
  for (i in 1:nrow(dataset)) {
    c(indices_tok, segments_tok) %<-% tokenizer$encode(dataset[[DATA_COLUMN]][i],
                                                       max_len = seq_length) # encode with padding
    indices = indices %>% append(list(as.matrix(indices_tok)))
    target = target %>% append(dataset[[LABEL_COLUMN]][i])
    segments = segments %>% append(list(as.matrix(segments_tok)))
  }
  return(list(indices, segments, target))
}

tmp_data = data.table(train_data_raw[1:42000,])
tmp_col = names(train_data_raw[3:length(names(tmp_data))])
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols = tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])

tmp_data[target == "obscene", target := 0L] %>%
  .[target == "toxic", target := 1L] %>%
  .[target == "insult", target := 2L] %>%
  .[target == "severe_toxic", target := 3L] %>%
  .[target == "identity_hate", target := 4L] %>%
  .[target == "threat", target := 5L]

unique(tmp_data[, target]) # check

train_data = tmp_data[1:40000,]
test_data = tmp_data[40001:42000,]

c(x_train, x_segment, y_train) %<-% tokenize_fun(train_data)

train = do.call(cbind, x_train) %>% t()
segments = do.call(cbind, x_segment) %>% t()
targets = do.call(cbind, y_train) %>% t()

concat = c(list(train), list(segments))

c(decay_steps, warmup_steps) %<-% k_bert$calc_train_steps(
  targets %>% length(),
  batch_size = bch_size,
  epochs = epochs
)

input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs = list(input_1, input_2)

dense = get_layer(model, name = 'NSP-Dense')$output

outputs = dense %>% layer_dense(units = 6L, activation = 'softmax',
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
freeze_weights(model, from = "NSP-Dense") # not sure if this should be kept
summary(model)

model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps,
                    warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'sparse_categorical_crossentropy', # for multi-label
  metrics = 'accuracy'
)

history = model %>% fit(concat,
                        targets,
                        epochs = epochs,
                        batch_size = bch_size,
                        validation_split = 0.2)

c(x_test, x_t_segment, y_test) %<-% tokenize_fun(test_data)

x_test = do.call(cbind, x_test) %>% t()
x_t_segment = do.call(cbind, x_t_segment) %>% t()

concat2 = c(list(x_test), list(x_t_segment))

res = model %>% predict(concat2)
```

@turgut090
Owner

Please, take a look at this kernel. https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-
From that kernel:

Important Note: In general,

  • For binary classification, we can have 1 output unit, use sigmoid activation in the output layer, and use binary cross-entropy loss.
  • For multi-class classification, we can have N output units, use softmax activation in the output layer, and use categorical cross-entropy loss.
  • For multi-label classification, we can have N output units, use sigmoid activation in the output layer, and use binary cross-entropy loss.

Maybe this could help you get a better score? And please use AUC as the metric: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
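For reference (my addition, not from the linked docs), AUC can also be sanity-checked outside Keras with a few lines of base R using the Mann-Whitney rank formulation:

```r
# AUC via the Mann-Whitney statistic: the probability that a randomly chosen
# positive example scores higher than a randomly chosen negative one.
auc <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # mid-ranks handle tied scores
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(c(0.9, 0.8, 0.3, 0.1), c(1, 1, 0, 0))  # perfectly separated -> 1
```

Handy for checking a held-out label column against one column of the sigmoid outputs.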

@jesuiskelly
Author

Hi there, thanks for this. It does help. I've now:

  • updated the targets to be a matrix of N rows x 6 (= number of labels)
  • applied the sigmoid activation and the binary_crossentropy loss function
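The first bullet amounts to taking the six 0/1 columns directly as a multi-hot matrix, with no collapsing to a single class. A sketch with a toy stand-in for the Jigsaw rows (column names assumed from the competition data):

```r
label_cols <- c("toxic", "severe_toxic", "obscene",
                "threat", "insult", "identity_hate")

# Toy stand-in: one comment with three flags, one clean comment
raw <- data.frame(toxic = c(1, 0), severe_toxic = c(1, 0), obscene = c(0, 0),
                  threat = c(0, 0), insult = c(1, 0), identity_hate = c(0, 0))

targets <- as.matrix(raw[, label_cols])  # N x 6 multi-hot target matrix
dim(targets)       # 2 6
rowSums(targets)   # 3 0: first comment keeps all three labels
```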

Now I've got an AUC of ~0.6 to ~0.7 (depending on the learning rate, etc.), which is much better.

The only place where I'm stuck now is getting meaningful prediction results. The code `model %>% predict(concat2)` produces the following, which makes no sense given the severe lack of variation between rows. I tried predict_proba(), but it's not a method available for this model. Any ideas on how I can get the probability of each label out?

Thank you so much for your help again!

[screenshot: prediction results]

@turgut090
Owner

I think you are doing it right.
This is how it is done on the Python side:
https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-

```python
y_preds = rnn_model.predict(test_data)
# Assign the predictions by the model in the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = y_preds
```
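The same idea in R might look like this (a sketch with a stand-in matrix; the real `res` would come from `model %>% predict(concat2)`). With six sigmoid units, `predict()` already returns one independent probability per label, so attaching the label names is all that's left:

```r
label_cols <- c("toxic", "severe_toxic", "obscene",
                "threat", "insult", "identity_hate")

# Stand-in for the N x 6 matrix returned by predict(concat2)
res <- matrix(runif(3 * 6), nrow = 3)
colnames(res) <- label_cols

res[, "toxic"]  # per-comment probability of the "toxic" label
```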

Just take more data. At least 200k rows and see if it helps.

@jesuiskelly
Author

jesuiskelly commented Feb 23, 2021 via email
