Low Accuracy on 80 Tasks After Fine-Tuning Meta-Llama-3-8B-Instruct (19/400 = 4.75%) #3

bingwork · 2024-11-29T09:52:25Z

I used 80 tasks from the file task_info_selected.csv in this repository, and fine-tuned Meta-Llama-3-8B-Instruct using the train.sh(train.sh.txt below) script generated from this repository. Then, I generated the predict.sh(predict.sh.txt below) script for inference following the instructions in the same repository here. However, the final result I got is a Competition Accuracy of 19 / 400 = 0.0475, meaning it only got 19 tasks correct. Can you help me identify what might be wrong?

/workspace/wubing/marc/test_time_train.py, for using the selected 80 tasks, I updated the code blow.
if args.num_tasks is not None: if args.num_tasks_selected: import pandas as pd df = pd.read_csv('/workspace/wubing/marc/task_info_selected.csv') selected_tasks = df['task_id'].to_list() arc_test_tasks = [task for task in arc_test_tasks if task.name.replace("-0", "") in selected_tasks] print("Use selected tasks as ttt paper") else: arc_test_tasks = arc_test_tasks[: args.num_tasks]

predict.log
train.log
predict.sh.txt
train.sh.txt

The text was updated successfully, but these errors were encountered:

ekinakyurek · 2024-11-30T16:40:13Z

Which checkpoints do you use in Meta-Llama-3-8B-Instruct folder?

ekinakyurek · 2024-11-30T17:12:13Z

The expected value should be around 36 tasks for this model. I attached the logs of the inference run and the resulting predictions for what you're trying to get per my understanding. I also attached one of the tasks' TTT loss logs. The checkpoint used in this run should be same as: https://huggingface.co/ekinakyurek/marc-8B-finetuned-llama3/tree/main (I can additionally verify if necessary)

0a1d4ef5_tt.txt
8B_grids_no_lm_generated_model_tti.zip

==========

We also have some verification notebooks now on kaggle. These use modal branch and BARC checkpoints. This should be fully replicable but requires modal account and some credits. However, you can just take a look and see the logs settings etc. You can also see the induction part (from the BARC team) details as well.

For BARC checkpoints make sure the torchtune tokenizer is in BARC mode --- requires editing to tokenizer file under your torchtune installation: https://github.com/ekinakyurek/torchtune/blob/efd85e000e83dcf6803c623cf83943e4a817377a/torchtune/models/llama3/_tokenizer.py#L51-L55

Here are the notebooks:
Problem 0-99: (5033.9s)
https://www.kaggle.com/code/xu3cpn/dev-ensemble-induction-and-transduction?scriptVersionId=209665817
Problem 100-199: (4017.6s)
https://www.kaggle.com/code/xu3cpn/dev-ensemble-induction-and-transduction?scriptVersionId=209687715
Problem 200-299: (5571.9s)
https://www.kaggle.com/code/xu3cpn/dev-ensemble-induction-and-transduction?scriptVersionId=209747304
Problem 300-399: (4822.4s)
https://www.kaggle.com/code/xu3cpn/dev-ensemble-induction-and-transduction?scriptVersionId=209779088

Score: 251.5/400 = 62.875

bingwork · 2024-12-01T09:28:21Z

Which checkpoints do you use in Meta-Llama-3-8B-Instruct folder?

I am using the downloaded safetensors from Meta-Llama-3-8B-Instruct.

bingwork · 2024-12-01T09:49:54Z

For task 0a1d4ef5, my log is in 0a1d4ef5_log_1732848905.txt.txt
I noticed that my loss values are quite different from yours. Here are mine:

Step 1: loss: 1.4690, lr: 7.14e-06, tokens_per_second_per_gpu: 6844.40
...
Step 144: loss: 0.0259, lr: 0.0, tokens_per_second_per_gpu: 6964.37
Below are yours:

Step 1: loss: 0.3976, lr: 7.14e-06, tokens_per_second_per_gpu: 4387.33
...
Step 144: loss: 7.03e-05, lr: 0.0, tokens_per_second_per_gpu: 7082.98

ekinakyurek · 2024-12-01T12:05:49Z

Which checkpoints do you use in Meta-Llama-3-8B-Instruct folder?

I am using the downloaded safetensors from Meta-Llama-3-8B-Instruct.

Okay that’s the problem I guess. Can you use our finetuned checkpoints as in the paper.

https://huggingface.co/ekinakyurek/marc-8B-finetuned-llama3/tree/main

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Low Accuracy on 80 Tasks After Fine-Tuning Meta-Llama-3-8B-Instruct (19/400 = 4.75%) #3

Low Accuracy on 80 Tasks After Fine-Tuning Meta-Llama-3-8B-Instruct (19/400 = 4.75%) #3

bingwork commented Nov 29, 2024

ekinakyurek commented Nov 30, 2024

ekinakyurek commented Nov 30, 2024 •

edited

Loading

bingwork commented Dec 1, 2024

bingwork commented Dec 1, 2024

ekinakyurek commented Dec 1, 2024

Low Accuracy on 80 Tasks After Fine-Tuning Meta-Llama-3-8B-Instruct (19/400 = 4.75%) #3

Low Accuracy on 80 Tasks After Fine-Tuning Meta-Llama-3-8B-Instruct (19/400 = 4.75%) #3

Comments

bingwork commented Nov 29, 2024

ekinakyurek commented Nov 30, 2024

ekinakyurek commented Nov 30, 2024 • edited Loading

bingwork commented Dec 1, 2024

bingwork commented Dec 1, 2024

ekinakyurek commented Dec 1, 2024

ekinakyurek commented Nov 30, 2024 •

edited

Loading