This document outlines the technical implementation of fine-tuning Stable Diffusion 1.4 using Low-Rank Adaptation (LoRA). It provides a detailed guide for beginners to understand and contribute to the project.
```
sd-lora-finetuning/
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── requirements.txt
├── src/ (Example implementation)
│   ├── Dataset/
│   │   ├── ImageCaptions/
│   │   │   └── example1.txt
│   │   └── Images/
│   │       └── example1.png
│   ├── dataset.py
│   ├── generate.py
│   ├── lora.py
│   ├── main.py
│   ├── train.py
│   └── utils.py
└── CONTRIBUTIONS/
    └── Example1/
        ├── Dataset/
        │   ├── ImageCaptions/
        │   │   └── example1.txt
        │   └── Images/
        │       └── example1.png
        └── src/
            ├── dataset.py
            ├── generate.py
            ├── lora.py
            ├── main.py
            ├── train.py
            └── utils.py
```
- `src/`: Contains the example implementation (refer to this for your contribution)
- `CONTRIBUTIONS/`: Directory where participants should add their implementations
- `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`: Guidelines and help regarding contributing (MUST READ!)
- Other files in the root directory are for project documentation and setup
LoRA is implemented as follows:

a) `LoRALayer` class:

```python
import math
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1):
        super().__init__()
        # Low-rank factors: A projects down to `rank` dimensions, B projects back up
        self.lora_A = nn.Parameter(torch.zeros((rank, in_features)))
        self.lora_B = nn.Parameter(torch.zeros((out_features, rank)))
        self.scale = alpha / rank
        # A gets a random init, B starts at zero, so the LoRA update is zero before training
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        # Returns the low-rank update (alpha / rank) * B A x
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```
b) `apply_lora_to_model` function:

```python
def apply_lora_to_model(model, rank=4, alpha=1):
    # Attach a LoRALayer to every nn.Linear; the layer's forward pass must then
    # add module.lora(x) to its original output (the base weights stay frozen)
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            lora_layer = LoRALayer(module.in_features, module.out_features, rank, alpha)
            setattr(module, 'lora', lora_layer)
    return model
```
Key concept: LoRA adds trainable low-rank matrices to existing layers, allowing for efficient fine-tuning.
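A minimal usage sketch (assuming the `LoRALayer` and `apply_lora_to_model` definitions above; the toy model is hypothetical): freeze the base weights and pass only the LoRA parameters to the optimizer.

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the linear projections inside the U-Net
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
model = apply_lora_to_model(model, rank=4, alpha=1)

# Train only the injected LoRA matrices; everything else stays frozen
for name, param in model.named_parameters():
    param.requires_grad = 'lora' in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```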
The `CustomDataset` class:

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class CustomDataset(Dataset):
    def __init__(self, img_dir, caption_dir=None, transform=None):
        self.img_dir = img_dir
        self.caption_dir = caption_dir
        self.transform = transform or transforms.Compose([
            transforms.Resize((512, 512)),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])
        self.images = [f for f in os.listdir(img_dir) if f.endswith(('.png', '.jpg', '.jpeg', '.webp'))]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.images[idx])
        image = Image.open(img_path).convert('RGB')
        image = self.transform(image)
        # The caption file shares the image's base name, e.g. example1.png -> example1.txt
        if self.caption_dir:
            caption_path = os.path.join(self.caption_dir, self.images[idx].rsplit('.', 1)[0] + '.txt')
            with open(caption_path, 'r') as f:
                caption = f.read().strip()
        else:
            caption = ""
        return image, caption
```
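A minimal usage sketch, assuming the `src/Dataset` layout shown in the repository structure above:

```python
from torch.utils.data import DataLoader

dataset = CustomDataset(img_dir='src/Dataset/Images', caption_dir='src/Dataset/ImageCaptions')
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

images, captions = next(iter(dataloader))
print(images.shape)  # torch.Size([1, 3, 512, 512])
print(captions[0])   # caption text from example1.txt
```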
The `train_loop` function implements the core training logic:

```python
import torch
import torch.nn.functional as F

def train_loop(dataloader, unet, text_encoder, vae, noise_scheduler, optimizer, device, num_epochs):
    for epoch in range(num_epochs):
        for batch in dataloader:
            images, captions = batch
            # Encode images into the VAE latent space and apply the SD scaling factor (0.18215 for SD 1.x)
            latents = vae.encode(images.to(device)).latent_dist.sample().detach()
            latents = latents * vae.config.scaling_factor
            # Sample noise and a random timestep, then produce the noisy latents
            noise = torch.randn_like(latents)
            timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=device)
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            # Captions must already be tokenized input_ids (e.g. via the CLIP tokenizer) before this call
            text_embeddings = text_encoder(captions)[0]
            # Predict the added noise and regress it against the true noise
            noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
            loss = F.mse_loss(noise_pred, noise)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")
```
Key concept: We're training the model to denoise latent representations, conditioned on text embeddings.
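A sketch of how the pieces can be wired together before calling `train_loop`, assuming the Hugging Face `diffusers`/`transformers` checkpoints for Stable Diffusion 1.4 (the learning rate and LoRA hyperparameters here are illustrative):

```python
import torch
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda" if torch.cuda.is_available() else "cpu"

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Inject LoRA into the U-Net and make only the LoRA matrices trainable
unet = apply_lora_to_model(unet, rank=4, alpha=1)
for name, param in unet.named_parameters():
    param.requires_grad = 'lora' in name
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

optimizer = torch.optim.AdamW([p for p in unet.parameters() if p.requires_grad], lr=1e-4)

# Raw caption strings must be tokenized to input_ids before the text encoder sees them
def tokenize(captions):
    return tokenizer(captions, padding="max_length", max_length=tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids.to(device)
```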
Image generation wraps the diffusion pipeline in a single helper:

```python
import torch

def generate_image(prompt, pipeline, num_inference_steps=50):
    # Run the sampling loop without tracking gradients
    with torch.no_grad():
        image = pipeline(prompt, num_inference_steps=num_inference_steps).images[0]
    return image
```
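A usage sketch, assuming a `StableDiffusionPipeline` loaded from the SD 1.4 checkpoint (the prompt and output filename are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = generate_image("a watercolor painting of a lighthouse at dawn", pipeline)
image.save("output.png")  # example output path
```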
- Clone the repository:

```bash
git clone https://github.com/your-username/sd-lora-finetuning.git
cd sd-lora-finetuning
```

- Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
- Fork the repository and clone your fork.
- Create a new folder in the `CONTRIBUTIONS` directory with your username or project name.
- Implement your version of the LoRA fine-tuning following the structure in the `src` directory.
- Ensure you include a `Dataset` folder with example images and captions.
- Create a pull request with your contribution.
Refer to the `src` directory for an example of how to structure your contribution. If you're a beginner, refer to `CONTRIBUTING.md` for a detailed overview!
LoRA adapts the model by injecting trainable rank decomposition matrices into existing layers:
- For a layer with weight W ∈ R^(d×k), LoRA adds the product BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with r ≪ min(d, k)
- The output is computed as h = Wx + (α/r)·BAx, where α/r is the `scale` factor in the code
- Only A and B are trained; the original weights W stay frozen
This is implemented in the `LoRALayer` class:

```python
def forward(self, x):
    return (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```
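As a worked example of the savings (the 768×768 layer size is illustrative, typical of SD 1.x attention projections), a rank-4 LoRA replaces 589,824 trainable weights with just 6,144:

```python
import torch

layer = LoRALayer(in_features=768, out_features=768, rank=4, alpha=1)

x = torch.randn(2, 768)        # batch of 2 inputs
delta = layer(x)               # LoRA contribution to the layer output
assert delta.shape == (2, 768)

lora_params = sum(p.numel() for p in layer.parameters())
print(lora_params)             # 6144 trainable parameters, vs 589824 for the full 768x768 weight
```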
The model is trained to predict the noise added to the latent representation:
- Images are encoded to latent space: z = Encode(x)
- Noise is added according to the schedule at a random timestep t: z_t = √(ᾱ_t)·z + √(1−ᾱ_t)·ε, with ε ~ N(0, I)
- The model predicts the noise: ε_pred = Model(z_t, t, text_embedding)
- Loss is computed: L = MSE(ε_pred, ε)
This is implemented in the training loop:

```python
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
loss = F.mse_loss(noise_pred, noise)
```
By learning to denoise, the model implicitly learns to generate images conditioned on text.
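The scheduler's `add_noise` call in `train_loop` implements exactly this closed-form forward process; a small sanity-check sketch using diffusers' `DDPMScheduler` (the latent shape and timestep are illustrative):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
z = torch.randn(1, 4, 64, 64)   # SD 1.x latent shape for a 512x512 image
eps = torch.randn_like(z)
t = torch.tensor([250])

z_t = scheduler.add_noise(z, eps, t)

# Same result from the closed-form equation z_t = sqrt(abar_t)*z + sqrt(1 - abar_t)*eps
abar_t = scheduler.alphas_cumprod[t]
manual = abar_t.sqrt() * z + (1 - abar_t).sqrt() * eps
assert torch.allclose(z_t, manual, atol=1e-5)
```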
Customization and uniqueness are expected from each contributor.
- Feel free to modify `LoRALayer` in `lora.py` to experiment with different LoRA architectures
- Adjust the U-Net architecture in `main.py` by modifying which layers receive LoRA
- Implement additional training techniques in `train.py` (e.g., gradient clipping, learning rate scheduling); a sketch follows below
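For instance, a minimal sketch of the last point, reusing the `optimizer`, `dataloader`, and LoRA-adapted `unet` from the training setup above (the clipping norm and cosine schedule are illustrative choices):

```python
import torch

# Cosine learning-rate schedule over the whole run
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * len(dataloader))

# Inside the inner training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_([p for p in unet.parameters() if p.requires_grad], max_norm=1.0)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```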
- LoRA paper: LoRA: Low-Rank Adaptation of Large Language Models
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models
- Hugging Face Diffusers: Documentation