
Inconsistent results of MEZO for RoBERTa-large on SST-2 #24

Open
han678 opened this issue Nov 3, 2023 · 0 comments
han678 commented Nov 3, 2023

Greetings,

I ran a grid search for MeZO over a search space consisting of the five seeds (13, 21, 42, 87, 100) provided by your code and configurations based on the hyperparameters from Table 4, because I had not noticed that you already present a more precise search range in Table 13. Since Table 4 is a superset of Table 13, however, the search should not miss the best hyperparameters used in the paper. Below I list all the promising combinations (i.e., those reaching test_acc > 0.85) that I found during the grid search. Among the results obtained, only configurations with seed 87 perform well; I did not find a configuration that performs well across all five seeds, which are used for both data splitting and MeZO training.

2023-11-02 21:49:27,370 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 16, 'WD': 0, 'EPS': 0.001, 'output': {'eval_loss': 0.4354574382305145, 'eval_acc': 0.84375, 'test_loss': 0.27337443828582764, 'test_acc': 0.9128440366972477}}
2023-11-02 21:49:27,605 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 16, 'WD': 0.1, 'EPS': 1e-05, 'output': {'eval_loss': 0.5714977979660034, 'eval_acc': 0.84375, 'test_loss': 0.29330992698669434, 'test_acc': 0.9048165137614679}}
2023-11-02 21:49:27,644 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 64, 'WD': 0, 'EPS': 0.001, 'output': {'eval_loss': 0.6124476194381714, 'eval_acc': 0.84375, 'test_loss': 0.3549887239933014, 'test_acc': 0.8841743119266054}}
2023-11-02 21:49:27,677 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 64, 'WD': 0, 'EPS': 1e-05, 'output': {'eval_loss': 0.6217464804649353, 'eval_acc': 0.8125, 'test_loss': 0.3517194986343384, 'test_acc': 0.875}}
2023-11-02 21:49:27,703 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 64, 'WD': 0.1, 'EPS': 0.001, 'output': {'eval_loss': 0.6124476194381714, 'eval_acc': 0.84375, 'test_loss': 0.3549887239933014, 'test_acc': 0.8841743119266054}}
2023-11-02 21:49:27,719 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 64, 'WD': 0.1, 'EPS': 1e-05, 'output': {'eval_loss': 0.6217464804649353, 'eval_acc': 0.8125, 'test_loss': 0.3517194986343384, 'test_acc': 0.875}}
2023-11-02 21:49:27,840 - main - INFO - {'seed': 87, 'lr': 1e-06, 'BS': 64, 'WD': 0, 'EPS': 0.001, 'output': {'eval_loss': 0.5273604393005371, 'eval_acc': 0.84375, 'test_loss': 0.34729522466659546, 'test_acc': 0.8646788990825688}}
2023-11-02 21:49:27,876 - main - INFO - {'seed': 87, 'lr': 1e-06, 'BS': 64, 'WD': 0.1, 'EPS': 0.001, 'output': {'eval_loss': 0.5273604393005371, 'eval_acc': 0.84375, 'test_loss': 0.34729522466659546, 'test_acc': 0.8646788990825688}}
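
For reference, the loop that produced these logs was a plain Cartesian product over the search axes; roughly the following sketch, where `run_mezo` is a hypothetical wrapper around your training script (the grid values shown are the ones appearing in the logs above; my actual run covered the full Table 4 ranges):

```python
import itertools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("main")

# Axes shown here are the values appearing in the logs above; the actual
# run used the full Table 4 ranges. Seeds come from the released code.
seeds = [13, 21, 42, 87, 100]
lrs = [1e-5, 1e-6]
batch_sizes = [16, 64]
weight_decays = [0, 0.1]
eps_values = [1e-3, 1e-5]

for seed, lr, bs, wd, eps in itertools.product(
        seeds, lrs, batch_sizes, weight_decays, eps_values):
    # run_mezo is a hypothetical helper wrapping the repository's training
    # entry point; it returns eval/test loss and accuracy as a dict.
    output = run_mezo(task="SST-2", model="roberta-large",
                      seed=seed, lr=lr, bs=bs, wd=wd, eps=eps)
    logger.info({"seed": seed, "lr": lr, "BS": bs, "WD": wd,
                 "EPS": eps, "output": output})
```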

Since my results differ significantly from those in your paper, I am curious whether you used different seeds in your experiments, and whether you also observed that MeZO can be quite sensitive to the selected seeds. For the best combination I found (see below), I also evaluated performance with 10 different training seeds while keeping the data seed fixed at 87, and obtained the following results (see Figures below).

2023-11-02 21:49:27,370 - main - INFO - {'seed': 87, 'lr': 1e-05, 'BS': 16, 'WD': 0, 'EPS': 0.001, 'output': {'eval_loss': 0.4354574382305145, 'eval_acc': 0.84375, 'test_loss': 0.27337443828582764, 'test_acc': 0.9128440366972477}}
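
Concretely, this follow-up experiment separated the two sources of randomness as in the sketch below (`load_fewshot_split` and `run_mezo` are hypothetical stand-ins for your data-sampling and training entry points):

```python
# Fix the data seed (controls the few-shot split) and vary only the
# training seed (controls MeZO's perturbation sampling).
data_seed = 87
best_cfg = dict(lr=1e-5, bs=16, wd=0, eps=1e-3)

test_accs = []
for train_seed in range(10):
    split = load_fewshot_split(task="SST-2", seed=data_seed)
    output = run_mezo(split=split, train_seed=train_seed, **best_cfg)
    test_accs.append(output["test_acc"])
```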

However, it seems that even with the optimal configuration, performance is not consistently stable and depends on both the training seed and the data seed. This variability strikes me as odd, and it is much higher than that of the backpropagation-based methods. Below I give my grid results for LoRA over a search space that includes the configurations in Table 13 and the same seeds. Compared with MeZO, LoRA appears much more robust to both the hyperparameters and the seeds.

I'm curious whether the high variability in performance could be attributed to intrinsic characteristics of MeZO. Could the authors offer an explanation for this? I would be very grateful. I also wonder whether you observed similar results in your experiments with other models, e.g., the OPT models on the other tasks discussed in your paper.
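
For context on why I suspect the training seed matters so much: as far as I understand, MeZO estimates the gradient from a single random perturbation direction per step, so the entire optimization trajectory depends on the perturbation RNG. Below is a simplified sketch of one step, along the lines of Algorithm 1 in the paper (not your exact implementation; `model`, `loss_fn`, and `batch` are placeholders):

```python
import torch

def mezo_step(model, loss_fn, batch, lr, eps, step_seed):
    """One simplified MeZO (SPSA) step.

    The perturbation z is resampled from step_seed instead of being stored,
    mirroring the memory-saving trick described in the paper. This is a
    sketch, not the repository's exact code.
    """
    def perturb(scale):
        torch.manual_seed(step_seed)  # regenerate the same z each call
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    perturb(+1)                       # theta + eps * z
    with torch.no_grad():
        loss_plus = loss_fn(model, batch)
    perturb(-2)                       # theta - eps * z
    with torch.no_grad():
        loss_minus = loss_fn(model, batch)
    perturb(+1)                       # restore theta

    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    torch.manual_seed(step_seed)      # resample the same z for the update
    for p in model.parameters():
        z = torch.randn_like(p)
        p.data.add_(-lr * projected_grad * z)
```

Since every update direction is a single random `z`, two runs that differ only in the training seed follow entirely different trajectories, which seems consistent with the sensitivity I observed.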

Best regards,
