Question about gradient checkpointing? #3

311dada · 2024-06-24T11:51:45Z

Congratulations on the excellent work!

When training large language models, we generally adopt the gradient checkpointing technique. Could you please help me turn on this technique in your code?

Thanks a lot!

why-in-Shanghaitech · 2024-06-25T01:57:08Z

Hi! Thank you for your question. I haven't used gradient checkpointing before, so I cannot ensure the correctness of the solution:

Disable huggingface's warning about use_cache since we need it for further iterations (comment it out):

LCKV/models/modeling_llama_opt.py

Lines 1338 to 1343 in 503c82f

    
           if self.gradient_checkpointing and self.training: 
        
               if use_cache: 
        
                   logger.warning_once( 
        
                       "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
        
                   ) 
        
                   use_cache = False

Add --gradient_checkpointing in the bash script (replace this line or add below):

LCKV/run_clm.sh

Line 31 in 503c82f

--gradient_accumulation_steps 1 \

I test the modified code on RTX3090 w/ config tinyllama_opt.json, batch size 16 w/ gradient checkpointing. The train loss of the first 20 steps is consistent with that w/o gradient checkpointing, batch size 4 and gradient accumulation 4.

I hope it could help. If it works, I'd appreciate it if you could add a simple PR so that more people could benefit from the gradient checkpointing feature.

311dada · 2024-06-27T05:16:13Z

Sorry for my late response. I will try it. Thanks for your suggestion.

why-in-Shanghaitech self-assigned this Jun 25, 2024

311dada closed this as completed Jun 27, 2024

why-in-Shanghaitech mentioned this issue Jun 28, 2024

Support gradient checkpointing for lckv #4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about gradient checkpointing? #3

Question about gradient checkpointing? #3

311dada commented Jun 24, 2024

why-in-Shanghaitech commented Jun 25, 2024

311dada commented Jun 27, 2024

Question about gradient checkpointing? #3

Question about gradient checkpointing? #3

Comments

311dada commented Jun 24, 2024

why-in-Shanghaitech commented Jun 25, 2024

311dada commented Jun 27, 2024