
Cross-Validation & Regularization Technique Implemented #10

Merged (5 commits into MLSAKIIT:main, Oct 18, 2024)

Conversation

@Sanjeev-Kumar78 (Contributor) commented Oct 17, 2024

Description

File: main.py

  • Added K-Fold Cross-Validation:

    • Implemented K-fold cross-validation to train the model on different data splits.
    • Split the dataset into K folds and trained the model on K-1 folds while validating on the remaining fold.

  • Added L2 Weight Decay to the Optimizer:

    • Updated the optimizer to include L2 weight decay:
    • optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4, weight_decay=1e-5)
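The K-fold scheme above can be sketched as follows. This is an illustrative helper in plain Python, not the repository's actual main.py code; the training loop itself is assumed.

```python
# Illustrative sketch of the K-fold split described above. Each fold serves
# once as the validation set while training runs on the remaining K-1 folds.

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for K-fold cross-validation."""
    base, extra = divmod(n_samples, k)
    indices = list(range(n_samples))
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# For each fold, the model and optimizer would be rebuilt and trained on
# train_idx; the optimizer with L2 weight decay is the one quoted above:
#   optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4, weight_decay=1e-5)
```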

File: lora.py

  • Added Dropout to LoRA Layers:

    • Modified the LoRALayer class to include a dropout layer.
    • Updated the forward method in LoRALayer to apply dropout to the input tensor x.
    • Modified the LoRALinear class to pass the dropout rate to the LoRALayer.
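The lora.py changes above can be sketched as follows. The class and parameter names mirror those mentioned in the description, but the initialization and scaling convention are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-rank update with dropout applied to the input tensor x."""
    def __init__(self, in_dim, out_dim, rank=4, alpha=1.0, dropout=0.1):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))  # zero init: no update at start
        self.scale = alpha / rank
        self.dropout = nn.Dropout(dropout)  # regularizes the LoRA path

    def forward(self, x):
        # Dropout on the input, then the low-rank projection A @ B
        return self.dropout(x) @ self.A @ self.B * self.scale

class LoRALinear(nn.Module):
    """Frozen base linear layer plus the LoRA update; dropout rate is passed through."""
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0, dropout=0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora = LoRALayer(base.in_features, base.out_features,
                              rank=rank, alpha=alpha, dropout=dropout)

    def forward(self, x):
        return self.base(x) + self.lora(x)
```

Because dropout is a no-op in eval mode and B starts at zero, a freshly wrapped layer behaves identically to its frozen base layer until training begins.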

File: train.py

  • Implemented Early Stopping:

    • Added early stopping based on validation loss to prevent overfitting and optimize training time.
    • Tracked the validation loss and stopped training if the loss did not improve after a set number of epochs.
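The validation-loss tracking described above can be sketched as a small helper; the class name, patience default, and integration point in train.py are illustrative assumptions.

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```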

Fixes #9, #8, #7

Type of Change


  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update
  • Other (please specify):

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

Additional Information

@sohambuilds

@rycerzes (Contributor):

@sohambuilds please review this

@sohambuilds (Collaborator) left a comment:

Please add 10-15 generation samples post fine-tuning within your directory. It is suggested, but not compulsory, to use another set of images for your data (this could be any character or object) and fine-tune on that. If possible, include CLIP scores or TIFA scores to evaluate performance.

Ensure that you adjust your learning rate based on the number of images in the dataset: 1e-5 is recommended for small datasets of 10-20 images; 1e-4 is not appropriate.

Early stopping should not be implemented for few-shot learning/finetuning.

@Sanjeev-Kumar78 (Contributor, Author):

Please add 10-15 generation samples post fine-tuning within your directory. It is suggested, but not compulsory, to use another set of images for your data (this could be any character or object) and fine-tune on that. If possible, include CLIP scores or TIFA scores to evaluate performance.

Ensure that you adjust your learning rate based on the number of images in the dataset: 1e-5 is recommended for small datasets of 10-20 images; 1e-4 is not appropriate.

Thank you for the feedback. I will attempt to make the suggested changes. However, my laptop does not have a powerful GPU, so I've been training on the CPU, which takes roughly 40 minutes per epoch for each k-fold. Unfortunately, I've also faced issues with both Colab and Kaggle: Colab had GPU memory overflow problems, and Kaggle produced a different error.

I’ll continue troubleshooting these issues, but due to these constraints, progress may be slower.

@sohambuilds (Collaborator):

Running on colab should be possible, as we have tried it. If there is a specific issue that you need to troubleshoot, you may join the WhatsApp group for the ML contributors: https://chat.whatsapp.com/Kx8okfEdirALcC8UeFtl5j

Also, do note that implementing early stopping for few-shot learning (very few samples) is not desirable.

@Sanjeev-Kumar78 (Contributor, Author):

Got it. 👍

@Sanjeev-Kumar78 (Contributor, Author) commented Oct 17, 2024:

I'm trying to run this notebook: https://colab.research.google.com/drive/1Zv6eLFRHovlJgxumTtPozstIqT8cBJ-I?usp=sharing
But having these errors:

  • In TPU v2-8: ERROR: Unknown command line flag 'xla_latency_hiding_scheduler_rerun'

  • In T4 GPU: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 3.06 MiB is free. Process 8429 has 14.74 GiB memory in use. Of the allocated memory 14.36 GiB is allocated by PyTorch, and 253.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

  • In CPU: system RAM usage climbs until the process is killed.
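For the T4 failure, the error message itself points at one mitigation: enabling expandable segments in the CUDA caching allocator to reduce fragmentation. This only takes effect if set before PyTorch makes its first CUDA allocation, so a minimal sketch would place it at the very top of the notebook:

```python
import os

# Must be set before torch allocates any CUDA memory, i.e. before the first
# model/tensor is moved to the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

Other common mitigations in this situation are reducing the batch size, using gradient accumulation, and training in mixed precision.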

@sohambuilds (Collaborator):

Please run the following in your Colab runtime, then restart it (Runtime > Restart runtime):

!pip uninstall -y tensorflow
!pip install tensorflow-cpu

@sohambuilds (Collaborator):

Please commit the updated code, and your dataset, if you're using a new one. You can add the samples later, I will merge it for now.

@Sanjeev-Kumar78 (Contributor, Author):

I managed to run the notebook https://colab.research.google.com/drive/1Zv6eLFRHovlJgxumTtPozstIqT8cBJ-I?usp=sharing on the T4 GPU by optimizing GPU memory management during model training in the train.py file. I haven't pushed these changes to this repository because I don't believe they're necessary here. Instead, I have updated the zip file, which is downloaded automatically when the notebook is executed.

Please review the changes and let me know if there's anything that needs improvement, @sohambuilds.

val_loss, val_clip_score = validate(val_loader, unet, text_encoder, vae, noise_scheduler, device, pipe)
print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {val_loss:.4f}, Validation CLIP Score: {val_clip_score:.4f}")

# Check for early stopping
@sohambuilds (Collaborator) commented Oct 18, 2024:

Please remove early stopping as requested earlier. Early stopping is used with large datasets to check for overfitting, not with only 10 images; the model may need more than a patience of 5 to learn in such cases.
Good to know that you were able to make it run on Colab.

@sohambuilds (Collaborator):

Merging now. Sample generation is not working properly; it needs checking.

@sohambuilds merged commit 188db05 into MLSAKIIT:main on Oct 18, 2024