How can I add a new embedding layer and feed it into the BERT model for training? #10

Open
leo88359 opened this issue Oct 11, 2021 · 8 comments

@leo88359

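Hello, and thank you for releasing the Taipei QA code. I have a question.
Besides the word embedding, position embedding, and segment embedding, if I have an embedding built from other features, how can I add it to those three and feed the result into the BERT model for training?
Thank you =D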

@p208p2002
Owner

p208p2002 commented Oct 11, 2021 via email

@leo88359
Author

Thank you for your reply. I made the changes you suggested, but I now get this error: TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

Here is my code:
https://github.com/leo88359/BERT_modeling/blob/main/modeling_bert.py

It adds a new embedding called clinical_feature_embedding:
https://github.com/leo88359/BERT_modeling/blob/889af9200348adf73888a908d0838a78ecb818b4/modeling_bert.py#L173
https://github.com/leo88359/BERT_modeling/blob/889af9200348adf73888a908d0838a78ecb818b4/modeling_bert.py#L203

It is returned into embeddings:
https://github.com/leo88359/BERT_modeling/blob/889af9200348adf73888a908d0838a78ecb818b4/modeling_bert.py#L208

Finally, it is added in:
https://github.com/leo88359/BERT_modeling/blob/527ea345b888defceab30477004945e709c5152a/modeling_bert.py#L973

I would be very grateful for your help in sorting this out.

@p208p2002
Owner

p208p2002 commented Oct 13, 2021 via email

@p208p2002
Owner

You modified BertEmbeddings directly, which is fine too.
Change L965 to:

embedding_output = self.embeddings(
    input_ids=input_ids,
    position_ids=position_ids,
    token_type_ids=token_type_ids,
    inputs_embeds=inputs_embeds,
    past_key_values_length=past_key_values_length,
    clinical_feature_ids=clinical_feature_ids,
)

Delete L973 and L974:

# clinical_feature_embeddings_output = self.embeddings.clinical_feature_embeddings(clinical_feature_ids)
# embedding_output = embedding_output + clinical_feature_embeddings_output

@leo88359
Author

leo88359 commented Oct 22, 2021

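Hello, and thank you for the suggestion. After changing modeling_bert.py as described above and running it, I still get the following error:
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not NoneType

I have spent about a week debugging and still do not know why the input becomes NoneType at run time, so no data reaches embedding().
I am attaching my code here and would like to ask what is going wrong.
Many thanks!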

core.py https://github.com/leo88359/BERT-model/blob/main/core.py
train.py https://github.com/leo88359/BERT-model/blob/main/train.py
predict.py https://github.com/leo88359/BERT-model/blob/main/predict.py
modeling_bert.py https://github.com/leo88359/BERT-model/blob/main/modeling_bert.py


Below is a brief summary of what I changed to add the new embedding layer.

[core.py]
Define the new feature in make_dataset https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L41
Build the matrix for the new feature https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L94
https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L98
https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L101
make DataDic https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L109
make tokens https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L130
make data_feature https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/core.py#L148

[train.py]
Extract the new feature data https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/train.py#L68
Dict for the new feature https://github.com/leo88359/BERT-model/blob/094f97262f0873b7b4dfad0fadf2b25eb71394b0/train.py#L94

@p208p2002
Owner

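The problem is in your BertEmbeddings changes and in the arguments you pass in. You can refer to my modified version:
https://drive.google.com/file/d/1Lt2R8sas80GaK5DHheIBh4H2KRPqCrqe/view?usp=sharing
I added a test.py to it as a test; just run it directly.

You may need to install loguru first: pip install loguru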
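As a rough illustration of how this kind of TypeError can arise (a sketch, not taken from the linked files): if any forward() in the call chain does not receive or forward clinical_feature_ids, the keyword keeps its default of None and nn.Embedding is called with None. The function names embed_features and embed_features_checked below are hypothetical.

```python
import torch
from torch import nn

feature_embeddings = nn.Embedding(5, 16)


def embed_features(clinical_feature_ids=None):
    # Mirrors a forward() whose caller never passed the feature ids.
    return feature_embeddings(clinical_feature_ids)


try:
    embed_features()  # argument not forwarded, so it stays None
    error = None
except TypeError as exc:
    error = exc
    print(type(exc).__name__, exc)


def embed_features_checked(clinical_feature_ids=None):
    # An explicit check fails earlier and names the missing argument.
    if clinical_feature_ids is None:
        raise ValueError("clinical_feature_ids was not passed down to the embeddings")
    return feature_embeddings(clinical_feature_ids)


vectors = embed_features_checked(torch.tensor([[0, 1]]))
print(vectors.shape)  # torch.Size([1, 2, 16])
```

A check like this at each level (dataset batch, model forward, embeddings forward) can help locate where the ids get dropped.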

@p208p2002
Owner

p208p2002 commented Oct 22, 2021

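You may also need to check the input size set at src/transformers/models/bert/modeling_bert.py#L173. It currently follows the vocab size, but it should be set to the number of categories your feature has. The current line is: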

self.clinical_feature_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
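A minimal sketch of sizing the table by the number of feature categories instead. The value num_clinical_features = 5 and the choice to reserve id 0 for padding are assumptions for illustration; in the real model the count would more likely come from a config field you add yourself.

```python
import torch
from torch import nn

hidden_size = 16
num_clinical_features = 5  # assumed: ids 0..4, with 0 reserved for padding

# First argument is the number of rows, i.e. the number of distinct feature ids.
clinical_feature_embeddings = nn.Embedding(
    num_clinical_features, hidden_size, padding_idx=0
)

ids = torch.tensor([[0, 1, 4]])
vectors = clinical_feature_embeddings(ids)
print(vectors.shape)                             # torch.Size([1, 3, 16])
print(clinical_feature_embeddings.weight.shape)  # torch.Size([5, 16])
```

Every id fed to this layer must be smaller than num_clinical_features, and padding_idx must also be a valid row index in the table.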
