Greetings,

I ran a grid search for MeZO over a search space consisting of the five seeds (13, 21, 42, 87, 100) provided by your code and configurations built from the hyperparameters in Table 4; I had not noticed that Table 13 already gives a more precise search range, but since Table 4 is a superset of Table 13, the search should not have missed the best hyperparameters used in the paper. Below I list all the promising combinations (i.e., those reaching test_acc > 0.85) that I found during the grid search. In my results, only configurations with seed 87 are likely to perform well; I did not find a configuration that performs well across all five seeds when the same seeds are used for both data splitting and MeZO training.
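For concreteness, the grid I enumerated looks roughly like the sketch below. The learning-rate and epsilon values are illustrative placeholders standing in for the Table 4 ranges, not the exact values; only the seed list comes from your code.

```python
from itertools import product

SEEDS = [13, 21, 42, 87, 100]          # the five seeds shipped with the repo
LEARNING_RATES = [1e-7, 1e-6, 1e-5]    # illustrative, Table-4-style range
EPSILONS = [1e-3, 1e-2]                # illustrative, Table-4-style range

# Every (lr, eps, data_seed, train_seed) combination; the data split and
# the MeZO training loop each get their own seed, as in the runs above.
grid = list(product(LEARNING_RATES, EPSILONS, SEEDS, SEEDS))
print(len(grid))  # 3 * 2 * 5 * 5 = 150 runs
```

Each tuple in `grid` corresponds to one training run, and I kept the runs whose test accuracy exceeded 0.85 as "promising".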
Since my results differ significantly from those in your paper, I am curious whether you used different seeds in your experiments, and whether you also observed that MeZO can be quite sensitive to the chosen seeds. For the best combination I found (see below), I also evaluated its performance with 10 different training seeds while keeping the data seed fixed at 87, with the following results (see the figures below).
However, it seems that even with the optimal configuration, performance is still not consistently stable and depends on both the training seed and the data seed. This variability strikes me as unusual, and it is much higher than that of the backpropagation-based methods. Below I give my grid results for LoRA over a search space that includes both configurations from Table 13 and the same seeds. Compared with MeZO, LoRA appears much more robust to both the hyperparameters and the seeds.
I am curious whether the high variability of MeZO's performance can be attributed to intrinsic characteristics of the method. Would it be possible for the authors to offer an explanation? I would be very grateful. I also wonder whether similar behavior was observed in your experiments with other models, e.g., the OPT models on the other tasks discussed in your paper.
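For reference, my understanding of why the training seed could enter so directly: each MeZO step is driven by a single random perturbation direction drawn from the RNG. The sketch below is my own SPSA-style reconstruction, not your implementation; `mezo_step` and its signature are hypothetical, and `loss_fn` is assumed to return a scalar float.

```python
import torch

def mezo_step(params, loss_fn, lr=1e-6, eps=1e-3, seed=0):
    """One SPSA-style zeroth-order step (a sketch of the MeZO idea).

    The entire update is determined by the random direction z drawn from
    `seed`; a different training seed yields a completely different
    sequence of search directions, which may explain the seed sensitivity.
    """
    gen = torch.Generator().manual_seed(seed)
    zs = [torch.randn(p.shape, generator=gen) for p in params]

    # Evaluate f(theta + eps * z)
    for p, z in zip(params, zs):
        p.add_(z, alpha=eps)
    loss_plus = loss_fn(params)

    # Evaluate f(theta - eps * z): move back by 2 * eps * z
    for p, z in zip(params, zs):
        p.add_(z, alpha=-2 * eps)
    loss_minus = loss_fn(params)

    # Projected-gradient estimate along z, then restore theta and step
    g = (loss_plus - loss_minus) / (2 * eps)
    for p, z in zip(params, zs):
        p.add_(z, alpha=eps)        # back to the original theta
        p.add_(z, alpha=-lr * g)    # SGD-style step along z
    return g
```

Since both the finite-difference probe and the descent direction reuse the same z, any seed dependence in z propagates into every parameter update, unlike backpropagation, where the gradient direction is deterministic given the batch.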
Best regards,