boostcampaitech5/level2_klue-nlp-08

🏆 Level 2 Project #1 :: KLUE Relation Extraction Between Entities in a Sentence

📜 Abstract

Build an AI model that predicts the attributes of, and the relations between, words (entities) in a sentence.


🎖️ Project Leader Board

public_7th private_8th

  • Public Leader Board
public_leader_board
  • Private Leader Board
private_leader_board

🧑🏻‍💻 Team Introduction & Members

Team name : 윤슬 (Yunseul) [ NLP Team 08 ]

👨🏼‍💻 Members

강민재 김주원 김태민 신혁준 윤상원

🧑🏻‍🔧 Members' Role

  • In addition to a moderator, we appointed a GitHub manager to keep version control of the baseline code running smoothly, and we divided the work so that members in the same area could still take on different tasks.

| Name | Role |
| --- | --- |
| 강민재 | EDA (length, label, token, and entity bias checks); error analysis; data augmentation (simple duplication, class-inverse relation swap augmentation); data preprocessing (special-token handling for subject and object entities) |
| 김태민 | Model experiments (added attention layer, added Linear/LSTM layers, loss and optimizer experiments); data preprocessing (Entity Representation: ENTITY, Typed Entity) |
| 김주원 | Model tuning; project manager (Notion management, running meetings); EDA; model ensembling (Hard Voting, Soft Voting, Weighted Voting); error analysis (developed an error analysis library) |
| 윤상원 | GitHub baseline code management (refactoring, bug fixes, version control); model experiments (applying TAPT); data preprocessing (Entity Representation: ENTITY, Typed Entity); EDA (UNK-related EDA); model tuning |
| 신혁준 | EDA (heuristic data checks, per-label relation bias survey); data augmentation (swapping start_idx and end_idx of identical entities, Easy Data Augmentation with SR-based augmentation, class down sampling) |

🖥️ Project Introduction

| Item | Details |
| --- | --- |
| Project topic | Relation extraction between entities in a sentence (KLUE RE): an NLP task that predicts the attributes of, and relations between, words (entities) in a sentence |
| Implementation | 1. Build an AI model that uses Hugging Face pretrained models and the KLUE RE dataset to predict one of 30 relations between a given subject entity and object entity<br>2. Carry out data preprocessing (entity representation), data augmentation, modeling, and hyperparameter tuning to reach high scores on the leaderboard metrics, micro F1-score and AUPRC |
| Development environment | • GPU: 4 Tesla V100 servers (RAM 32G) / Tesla V4 (RAM 52G) / local GeForce RTX 4090 (RAM 24GB)<br>• Development tools: PyCharm, Jupyter notebook, VS Code [SSH connection to the server], Colab Pro+, wandb |
| Collaboration environment | • GitHub repository: sharing and version control of the baseline code<br>• Notion: role assignment through the KLUE project page, ground rules for collaborating on the competition, idea brainstorming, and meeting notes<br>• Slack, Zoom: real-time in-person and remote meetings |
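
The micro F1 metric named above can be sketched in a few lines. For KLUE RE, micro F1 is computed over the relation classes while leaving out the `no_relation` class. The function name `micro_f1_excluding` and the sample labels are illustrative assumptions; this is not the repository's own `micro_f1` from `modules/utils.py`.

```python
def micro_f1_excluding(y_true, y_pred, ignore_label="no_relation"):
    """Micro-averaged F1 over every label except `ignore_label`."""
    tp = fp = fn = 0
    for gold, pred in zip(y_true, y_pred):
        if pred != ignore_label:
            if pred == gold:
                tp += 1  # correct prediction of a real relation
            else:
                fp += 1  # predicted a relation that is not the gold one
        if gold != ignore_label and pred != gold:
            fn += 1      # a gold relation that the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
gold = ["org:top_members/employees", "no_relation", "per:employee_of", "no_relation"]
pred = ["org:top_members/employees", "per:employee_of", "no_relation", "no_relation"]
score = micro_f1_excluding(gold, pred)
```

Correctly predicted `no_relation` rows do not raise the score, so a model that predicts `no_relation` everywhere scores 0.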

🗓️ Project Procedure

*์•„๋ž˜๋Š” ์ €ํฌ ํ”„๋กœ์ ํŠธ ์ง„ํ–‰๊ณผ์ •์„ ๋‹ด์€ Gantt์ฐจํŠธ ์ž…๋‹ˆ๋‹ค.

Screenshot 2023-05-24 at 3 31 48 PM

📁 Project Structure

📄 Directory and Code Structure

If you use augmented data, run the augmentation step before training.
If you apply TAPT, run the TAPT code first and use the resulting model for training.

  • augmentation : data augmentation directory
    • augment_train.py : trains the data augmentation model
    • augment_dataloader.py : dataloader used when training the data augmentation model
    • augment.py : data augmentation code
  • dataset : training/test data directory
    • train/train.csv : training data
    • test/test_data.csv : test data
  • config : configuration files for model training, inference, and augmentation
    • augment.yaml : augmentation settings
    • default.yaml : training and inference settings, covering the model choices and hyperparameters
    • ensemble.yaml : ensemble settings (Hard Voting, Soft Voting, F1-score Weighted Voting)
    • tapt.yaml : TAPT settings
  • model_ensemble : ensemble scripts
    • ensemble.py : runs the ensemble
    • ensemble_model.py : defines the ensemble methods (Hard Voting, Soft Voting, F1-score Weighted Voting)
    • utils.py : argmax and softmax functions needed for ensembling
  • models :
    • RBERT.py: R-BERT model definition
    • TAEMIN_CUSTOM_RBERT.py: simplified R-BERT model definition
    • TAEMIN_R_RoBERTa.py: R-RoBERTa model file
    • TAEMIN_RoBERTa_LSTM.py: RoBERTa-LSTM model definition
    • TAEMIN_TOKEN_ATTENTION_BERT.py: BERT + CLS token attention model definition
    • TAEMIN_TOKEN_ATTENTION_RoBERTa.py: RoBERTa + CLS token attention model definition
    • model_base.py: base model definitions (FC Layer, RobertaClassificationHead, RobertaPooler)
    • utils.py:
  • modules : directory for the detailed modules the models use, such as datasets and losses
    • datasets.py : builds the dataset for each model
    • losses.py : focal loss code
    • optimizers.py : optimizer definitions (AdamW, Adam, SGD, AdaBelief, and others)
    • preprocess.py : data parsing and preprocessing code
    • schedulers.py : StepLR and cosine LR scheduler definitions
    • tokenize.py : tokenization and entity representation code
    • utils.py : micro_f1, config_parser, and confusion_matrix code
  • pickle : pickle files for converting between numeric and string labels
    • dict_label_to_num.pkl : pickle file that maps string labels to numeric labels
    • dict_num_to_label.pkl : pickle file that maps numeric labels to string labels
  • prediction : directory where model predictions are saved
  • tapt :
    • dataset.py : TAPT data loader code
    • tapt.py : TAPT training code
  • .gitignore : gitignore
  • dataloader.py : model data loader code
  • inference.py : model inference code
  • model.py : base model definition using pytorch-lightning
  • requirements.txt : text file listing the environment requirements
  • train.py : model training code
  • wandb_tuning.py : hyperparameter tuning with wandb over multiple hyperparameter values
📦level2_klue-nlp-08
 ┣ augmentation
 ┃ ┣ augment.py
 ┃ ┣ augment_dataloader.py
 ┃ ┣ augment_train.py
 ┃ ┗ utils.py
 ┣ config
 ┃ ┣ augment.yaml
 ┃ ┣ default.yaml
 ┃ ┣ ensemble.yaml
 ┃ ┗ tapt.yaml
 ┣ dataset
 ┃ ┣ test
 ┃ ┃ ┗ test_data.csv
 ┃ ┣ train
 ┃ ┃ ┗ train.csv
 ┣ model_ensemble
 ┃ ┣ ensemble.py
 ┃ ┣ ensemble_model.py
 ┃ ┗ utils.py
 ┣ models
 ┃ ┣ RBERT.py
 ┃ ┣ TAEMIN_CUSTOM_RBERT.py
 ┃ ┣ TAEMIN_R_RoBERTa.py
 ┃ ┣ TAEMIN_RoBERTa_LSTM.py
 ┃ ┣ TAEMIN_TOKEN_ATTENTION_BERT.py
 ┃ ┣ TAEMIN_TOKEN_ATTENTION_RoBERTa.py
 ┃ ┣ model_base.py
 ┃ ┗ utils.py
 ┣ modules
 ┃ ┣ datasets.py
 ┃ ┣ losses.py
 ┃ ┣ optimizers.py
 ┃ ┣ preprocess.py
 ┃ ┣ schedulers.py
 ┃ ┣ tokenize.py
 ┃ ┗ utils.py
 ┣ pickle
 ┃ ┣ dict_label_to_num.pkl
 ┃ ┗ dict_num_to_label.pkl
 ┣ prediction
 ┣ tapt
 ┃ ┣ dataset.py
 ┃ ┗ tapt.py
 ┣ wandb
 ┣ .gitignore
 ┣ README.md
 ┣ dataloader.py
 ┣ inference.py
 ┣ model.py
 ┣ requirements.txt
 ┣ train.py
 ┗ wandb_tuning.py
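
Since `modules/losses.py` is described as holding a focal loss, a minimal sketch of the idea may help. Focal loss scales cross-entropy by `(1 - p_t) ** gamma`, so examples the model already classifies confidently contribute less. This is plain Python for a single example, not the repository's PyTorch code; the function names and the default `gamma=2.0` are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    """Cross-entropy is the gamma = 0 special case of focal loss."""
    return focal_loss(logits, target, gamma=0.0)


easy_example = [2.0, 0.0, 0.0]  # the model is fairly confident in class 0
easy_focal = focal_loss(easy_example, target=0)
easy_ce = cross_entropy(easy_example, target=0)
```

For the confident example, `easy_focal` is much smaller than `easy_ce`, which is why this loss is often tried on imbalanced label sets.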

⚙️ Architecture

| Category | Details |
| --- | --- |
| Model | klue/bert-base, klue/roberta-large, ainize/klue-bert-base-re: HuggingFace Transformers models used with PyTorch Lightning, plus an added attention layer or FC layer |
| Preprocessing | • Entity representation: tried several schemes (entity marker, typed entity marker, SUB/OBJ marker, punct (Korean), and others) and adopted the best-performing one |
| Data | • Raw data: 32,470 original training examples<br>• Augmented data: generated with the klue/roberta-large MLM and balanced toward a uniform label distribution, 53,873 examples in total |
| Validation strategy | • Compared micro F1-score and AUPRC from inference on the validation data for each model we built<br>• Verified final model performance by submitting to the leaderboard |
| Ensemble method | • Selected the three best-performing models across entity representations, modeling techniques, and augmented data, then combined them with soft voting |
| Model evaluation and improvement | Data augmentation with an MLM addresses the label imbalance problem. In addition, the data is preprocessed with entity representations, and techniques such as adding an attention layer or FC layer on top of the HuggingFace models are used to improve model performance. |
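
To make the entity representation row concrete, here is a hedged sketch of wrapping the subject and object spans with marker tokens. The marker strings (`<S:ORG>`, `<O>` and so on), the function name `mark_entities`, and the inclusive `end_idx` convention are assumptions for illustration; the actual formats used in `modules/tokenize.py` may differ.

```python
def mark_entities(sentence, subj, obj, typed=True):
    """Wrap the subject/object spans of `sentence` with marker tokens.

    `subj` and `obj` are dicts with `start_idx`, `end_idx` (assumed
    inclusive) and `type`. Spans are assumed not to overlap.
    """
    def tags(role, entity):
        name = f"{role}:{entity['type']}" if typed else role
        return f"<{name}>", f"</{name}>"

    spans = [("S", subj), ("O", obj)]
    # Insert markers from the rightmost span first so that the character
    # offsets of the remaining span stay valid.
    spans.sort(key=lambda item: item[1]["start_idx"], reverse=True)
    for role, entity in spans:
        open_tag, close_tag = tags(role, entity)
        start, end = entity["start_idx"], entity["end_idx"] + 1
        sentence = (
            sentence[:start] + open_tag + " " + sentence[start:end]
            + " " + close_tag + sentence[end:]
        )
    return sentence


text = "Steve Jobs founded Apple."
subject = {"start_idx": 19, "end_idx": 23, "type": "ORG"}  # "Apple"
object_ = {"start_idx": 0, "end_idx": 9, "type": "PER"}    # "Steve Jobs"
typed_marked = mark_entities(text, subject, object_, typed=True)
plain_marked = mark_entities(text, subject, object_, typed=False)
```

With `typed=True` the entity type is carried inside the marker, which gives the classifier a type hint; with `typed=False` only the subject/object role is marked.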

💻 Getting Started

⚠️ How To Install Requirements

# Install the required libraries
pip install -r requirements.txt

⌨️ How To Train

# Data augmentation [optional]
python3 augmentation/augment.py --config=config/augment.yaml
python3 augmentation/augment_train.py --config=config/augment.yaml
# Build the TAPT-trained model [optional]
python3 tapt/tapt.py --config=config/tapt.yaml
# Run train.py: train the model
python3 train.py --config=config/default.yaml

⌨️ How To Infer output.csv

# Run model prediction
python3 inference.py --config=config/default.yaml
# Run the ensemble [select the option via the config]
python3 model_ensemble/ensemble.py --config=config/ensemble.yaml
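
The three ensemble options (hard voting, soft voting, F1-score weighted voting) can be sketched as follows for a single sample. This is an illustration in plain Python, not the code in `model_ensemble/ensemble_model.py`; the function names, the probabilities, and the weights are made up.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_voting(model_probs):
    """Majority vote over each model's predicted class."""
    votes = Counter(argmax(p) for p in model_probs)
    return votes.most_common(1)[0][0]


def soft_voting(model_probs, weights=None):
    """Average class probabilities (optionally weighted), then argmax."""
    weights = weights or [1.0] * len(model_probs)
    total = sum(weights)
    n_classes = len(model_probs[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]
    return argmax(averaged), averaged


# Class probabilities from three hypothetical models for one sample.
probs = [
    [0.90, 0.10, 0.00],
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
]
hard_label = hard_voting(probs)                                 # two of three models pick class 1
soft_label, soft_probs = soft_voting(probs)                     # one very confident model wins
weighted_label, _ = soft_voting(probs, weights=[0.2, 0.9, 0.9]) # e.g. validation F1 as weights
```

Hard and soft voting can disagree, as they do here: hard voting counts only each model's top class, while soft voting also uses how confident each model is.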
