multi-label classification #4
Comments
Hi @jesuiskelly. For the same dataset, try setting random labels from 1 to 3 in the output column. Then change this part:
And switch the loss to categorical cross-entropy:
Thanks so much for coming back to me. This is very helpful! I tried it on my code and made a couple of changes:
So the model was fitted; however, I'm concerned that I might have made some mistakes, as the accuracy is very low: 0.2042. The labels I used are not randomly generated but taken from here (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/), which is a virtually identical dataset, except that each of the label columns contains 0 or 1. I created the target column using the following code:

`tmp_col = names(train_data_raw[3:length(names(tmp_data))])
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols=tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])
tmp_data[target=="obscene", target := 0L] %>% unique(tmp_data[, target])
train_data = tmp_data[1:40000,]`

In other words, given that my labels are not random, I'd expect BERT to deliver much higher accuracy. Do you think I have done something wrong? Separately, it would be good to know if there's a way to get a predicted probability for each of the labels. I copy my code below in case it is of help. Thank you so much for your help and advice again!
`Sys.setenv(TF_KERAS=1)
reticulate::py_config() #3.6
path_b_pret = paste0(path_proj, path_data, "/uncased_L-12_H-768_A-12")
token_dict = k_bert$load_vocabulary(path_b_vcab)
seq_length = 50L
DATA_COLUMN = 'comment_text'
model = k_bert$load_trained_model_from_checkpoint(
summary(model)
tokenize_fun = function(dataset) {
tmp_data = data.table(train_data_raw[1:42000,])
tmp_data[target=="obscene", target := 0L] %>% unique(tmp_data[, target]) # check
train_data = tmp_data[1:40000,]
c(x_train, x_segment, y_train) %<-% tokenize_fun(train_data)
train = do.call(cbind,x_train) %>% t()
concat = c(list(train),list(segments))
c(decay_steps, warmup_steps) %<-% k_bert$calc_train_steps(
input_1 = get_layer(model,name = 'Input-Token')$input
dense = get_layer(model,name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units=6L, activation='softmax',
model = keras_model(inputs = inputs, outputs = outputs)
model %>% compile(
history = model %>% fit(concat,
c(x_test, x_t_segment, y_test) %<-% tokenize_fun(test_data)
x_test = do.call(cbind,x_test) %>% t()
concat2 = c(list(x_test),list(x_t_segment))
res = model %>% predict(concat2)`
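A note on the target construction described in this comment: `names(.SD)[max.col(.SD)]` keeps one label per row, so a comment flagged with several labels loses all but one of them, and a comment with no flags at all still receives a class (R's `max.col` breaks ties at random by default). That may account for part of the low accuracy. The sketch below illustrates the effect in Python on three made-up rows. The rows and both helper functions are hypothetical, and ties go to the first column here rather than to a random one.

```python
# Label columns in the order used by the Jigsaw toxic comment data.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical rows: one 0/1 flag per label column.
rows = [
    [1, 0, 1, 0, 1, 0],  # toxic + obscene + insult
    [0, 0, 0, 0, 0, 0],  # clean comment, no label set
    [0, 0, 0, 1, 0, 0],  # threat only
]


def collapse_to_single_class(row):
    """Pick the column holding the row maximum (first column wins ties)."""
    return LABELS[row.index(max(row))]


def active_labels(row):
    """Return every label whose flag is set, which is the multi-label target."""
    return [name for name, flag in zip(LABELS, row) if flag == 1]


for row in rows:
    print("single class:", collapse_to_single_class(row),
          "| all labels:", active_labels(row))
```

Keeping the six 0/1 columns as the target, instead of collapsing them, preserves the information that the first row carries three labels and the second carries none.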
Please take a look at this kernel: https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-
It might help you get a better score. Also, please use AUC as the metric: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
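To illustrate why AUC is suggested over accuracy here: when most comments carry no label, a model that always predicts "no label" scores high accuracy while ranking nothing correctly. The sketch below computes ROC AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. It is a plain pairwise computation on made-up numbers, not the TensorFlow metric linked above, which approximates the area from a set of thresholds.

```python
def roc_auc(y_true, y_score):
    """Exact ROC AUC for one label: P(score of a positive > score of a negative)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced label: 9 clean comments, 1 flagged.
y_true = [0] * 9 + [1]
constant_scores = [0.0] * 10   # a model that never flags anything
constant_preds = [0] * 10

print("accuracy:", accuracy(y_true, constant_preds))
print("AUC:", roc_auc(y_true, constant_scores))
```

In a multi-label setting this would be computed once per label column and then averaged, if a single number is wanted.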
Hi there, thanks for this. It does help. I've now:
Now I've got an AUC of ~0.6 to ~0.7 (depending on the learning rate, etc.), which is much better. The only place where I'm stuck now is getting meaningful prediction results. The code `model %>% predict(concat2)` produces the following, which makes no sense given the severe lack of variation in the numbers between rows. I tried predict_proba(), but it's not a method available for this model. Any ideas on how I can get the probability of each label out? Thank you so much for your help again!
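On the probability question: in Keras, `predict` already returns the activations of the output layer, so no separate `predict_proba` is needed. What those numbers mean depends on the activation. The code posted earlier in this thread uses a softmax output, which makes the six values in a row compete and sum to 1. One common choice for multi-label problems is a sigmoid output (usually paired with binary cross-entropy), which gives each label its own independent probability. The sketch below contrasts the two on one row of made-up logits; the logits and the 0.5 threshold are assumptions for illustration only.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def softmax(logits):
    """Probabilities that compete across labels and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent per-label probabilities, each in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical pre-activation outputs for a single comment.
logits = [2.0, -1.0, 1.5, -3.0, 0.5, -2.0]

soft = softmax(logits)
sig = sigmoid(logits)

# With sigmoid outputs, each label is thresholded on its own.
THRESHOLD = 0.5  # assumed cut-off; tune per label in practice
predicted = [name for name, p in zip(LABELS, sig) if p >= THRESHOLD]

for name, p_soft, p_sig in zip(LABELS, soft, sig):
    print(f"{name:14s} softmax={p_soft:.3f} sigmoid={p_sig:.3f}")
print("predicted labels:", predicted)
```

With sigmoid outputs, several labels can exceed the threshold for the same comment, or none can, which matches the 0/1 layout of the label columns.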
I think you are doing it right.
Just take more data: at least 200k rows, and see if it helps.
Thanks so much. Good to know that at least it's not my mistake :)
Great, will experiment some more. Thanks again for your help and for making
the tutorial available.
Much appreciated!
Kelly
On Mon, Feb 22, 2021 at 6:36 PM Turgut ***@***.***> wrote:
I think you are doing it right.
This is how it is done from the python side:
https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-
y_preds = rnn_model.predict(test_data)
# Assign the model's predictions to the label columns of the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = y_preds
Just take more data. At least 200k rows and see if it helps.
Hi there, thanks for the tutorial here: https://blogs.rstudio.com/ai/posts/2019-09-30-bert-r/ It's very useful! I wonder if you have one on multi-label classification (for essentially the same dataset), or some code that would help me do that? Thank you very much in advance!