Skip to content

Commit

Permalink
Merge branch 'dev' of https://github.com/kohya-ss/sd-scripts into dev
Browse files Browse the repository at this point in the history
# Conflicts:
#	requirements.txt
  • Loading branch information
sdbds committed Sep 25, 2024
2 parents f392a30 + 29177d2 commit 5a96969
Show file tree
Hide file tree
Showing 13 changed files with 743 additions and 137 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/typos.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,4 @@ jobs:
- uses: actions/checkout@v4

- name: typos-action
uses: crate-ci/typos@v1.21.0
uses: crate-ci/typos@v1.24.3
53 changes: 53 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -139,7 +139,45 @@ The majority of scripts is licensed under ASL 2.0 (including codes from Diffuser

### Working in progress

- __important__ The dependent libraries are updated. Please see [Upgrade](#upgrade) and update the libraries.
- transformers, accelerate and huggingface_hub are updated.
- If you encounter any issues, please report them.

- Improvements in OFT (Orthogonal Finetuning) Implementation
1. Optimization of Calculation Order:
- Changed the calculation order in the forward method from (Wx)R to W(xR).
- This has improved computational efficiency and processing speed.
2. Correction of Bias Application:
- In the previous implementation, R was incorrectly applied to the bias.
- The new implementation now correctly handles bias by using F.conv2d and F.linear.
3. Efficiency Enhancement in Matrix Operations:
- Introduced einsum in both the forward and merge_to methods.
- This has optimized matrix operations, resulting in further speed improvements.
4. Proper Handling of Data Types:
- Improved to use torch.float32 during calculations and convert results back to the original data type.
- This maintains precision while ensuring compatibility with the original model.
5. Unified Processing for Conv2d and Linear Layers:
- Implemented a consistent method for applying OFT to both layer types.
- These changes have made the OFT implementation more efficient and accurate, potentially leading to improved model performance and training stability.

- Additional Information
* Recommended α value for OFT constraint: We recommend using α values between 1e-4 and 1e-2. This differs slightly from the original implementation of "(α\*out_dim\*out_dim)". Our implementation uses "(α\*out_dim)", hence we recommend higher values than the 1e-5 suggested in the original implementation.

* Performance Improvement: Training speed has been improved by approximately 30%.

* Inference Environment: This implementation is compatible with and operates within Stable Diffusion web UI (SD1/2 and SDXL).

- The INVERSE_SQRT, COSINE_WITH_MIN_LR, and WARMUP_STABLE_DECAY learning rate schedules are now available in the transformers library. See PR [#1393](https://github.com/kohya-ss/sd-scripts/pull/1393) for details. Thanks to sdbds!
- See the [transformers documentation](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/optimizer_schedules#schedules) for details on each scheduler.
- `--lr_warmup_steps` and `--lr_decay_steps` can now be specified as a ratio of the number of training steps, not just the step value. Example: `--lr_warmup_steps=0.1` or `--lr_warmup_steps=10%`, etc.

https://github.com/kohya-ss/sd-scripts/pull/1393
- When enlarging images in the script (when the size of the training image is small and bucket_no_upscale is not specified), it has been changed to use Pillow's resize and LANCZOS interpolation instead of OpenCV2's resize and Lanczos4 interpolation. The quality of the image enlargement may be slightly improved. PR [#1426](https://github.com/kohya-ss/sd-scripts/pull/1426) Thanks to sdbds!

- Sample image generation during training now works on non-CUDA devices. PR [#1433](https://github.com/kohya-ss/sd-scripts/pull/1433) Thanks to millie-v!

- `--v_parameterization` is available in `sdxl_train.py`. The results are unpredictable, so use with caution. PR [#1505](https://github.com/kohya-ss/sd-scripts/pull/1505) Thanks to liesened!

- Fused optimizer is available for SDXL training. PR [#1259](https://github.com/kohya-ss/sd-scripts/pull/1259) Thanks to 2kpr!
- The memory usage during training is significantly reduced by integrating the optimizer's backward pass with step. The training results are the same as before, but if you have plenty of memory, the speed will be slower.
- Specify the `--fused_backward_pass` option in `sdxl_train.py`. At this time, only AdaFactor is supported. Gradient accumulation is not available.
Expand Down Expand Up @@ -266,6 +304,21 @@ https://github.com/kohya-ss/sd-scripts/pull/1290) frodo821 氏に感謝します

- `gen_imgs.py` のプロンプトオプションに、保存時のファイル名を指定する `--f` オプションを追加しました。また同スクリプトで Diffusers ベースのキーを持つ LoRA の重みに対応しました。


### Sep 13, 2024 / 2024-09-13:

- `sdxl_merge_lora.py` now supports OFT. Thanks to Maru-mee for the PR [#1580](https://github.com/kohya-ss/sd-scripts/pull/1580).
- `svd_merge_lora.py` now supports LBW. Thanks to terracottahaniwa. See PR [#1575](https://github.com/kohya-ss/sd-scripts/pull/1575) for details.
- `sdxl_merge_lora.py` also supports LBW.
- See [LoRA Block Weight](https://github.com/hako-mikan/sd-webui-lora-block-weight) by hako-mikan for details on LBW.
- These will be included in the next release.

- `sdxl_merge_lora.py` が OFT をサポートされました。PR [#1580](https://github.com/kohya-ss/sd-scripts/pull/1580) Maru-mee 氏に感謝します。
- `svd_merge_lora.py` で LBW がサポートされました。PR [#1575](https://github.com/kohya-ss/sd-scripts/pull/1575) terracottahaniwa 氏に感謝します。
- `sdxl_merge_lora.py` でも LBW がサポートされました。
- LBW の詳細は hako-mikan 氏の [LoRA Block Weight](https://github.com/hako-mikan/sd-webui-lora-block-weight) をご覧ください。
- 以上は次回リリースに含まれます。

### Jun 23, 2024 / 2024-06-23:

- Fixed `cache_latents.py` and `cache_text_encoder_outputs.py` not working. (Will be included in the next release.)
Expand Down
8 changes: 5 additions & 3 deletions finetune/tag_images_by_wd14_tagger.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
from tqdm import tqdm

import library.train_util as train_util
from library.utils import setup_logging
from library.utils import setup_logging, pil_resize

setup_logging()
import logging
Expand Down Expand Up @@ -42,8 +42,10 @@ def preprocess_image(image):
pad_t = pad_y // 2
image = np.pad(image, ((pad_t, pad_y - pad_t), (pad_l, pad_x - pad_l), (0, 0)), mode="constant", constant_values=255)

interp = cv2.INTER_AREA if size > IMAGE_SIZE else cv2.INTER_LANCZOS4
image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), interpolation=interp)
if size > IMAGE_SIZE:
image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), cv2.INTER_AREA)
else:
image = pil_resize(image, (IMAGE_SIZE, IMAGE_SIZE))

image = image.astype(np.float32)
return image
Expand Down
3 changes: 2 additions & 1 deletion gen_img.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,8 @@
"""


def replace_unet_modules(unet: diffusers.models.unet_2d_condition.UNet2DConditionModel, mem_eff_attn, xformers, sdpa):
# def replace_unet_modules(unet: diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, mem_eff_attn, xformers, sdpa):
def replace_unet_modules(unet, mem_eff_attn, xformers, sdpa):
if mem_eff_attn:
logger.info("Enable memory efficient attention for U-Net")

Expand Down
129 changes: 108 additions & 21 deletions library/train_util.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,11 @@
from torchvision import transforms
from transformers import CLIPTokenizer, CLIPTextModel, CLIPTextModelWithProjection
import transformers
from diffusers.optimization import SchedulerType, TYPE_TO_SCHEDULER_FUNCTION
from diffusers.optimization import (
SchedulerType as DiffusersSchedulerType,
TYPE_TO_SCHEDULER_FUNCTION as DIFFUSERS_TYPE_TO_SCHEDULER_FUNCTION,
)
from transformers.optimization import SchedulerType, TYPE_TO_SCHEDULER_FUNCTION
from diffusers import (
StableDiffusionPipeline,
DDPMScheduler,
Expand Down Expand Up @@ -71,7 +75,7 @@
import library.huggingface_util as huggingface_util
import library.sai_model_spec as sai_model_spec
import library.deepspeed_utils as deepspeed_utils
from library.utils import setup_logging
from library.utils import setup_logging, pil_resize

setup_logging()
import logging
Expand Down Expand Up @@ -1554,7 +1558,7 @@ def read_caption(img_path, caption_extension, enable_wildcard):
def load_dreambooth_dir(subset: DreamBoothSubset):
if not os.path.isdir(subset.image_dir):
logger.warning(f"not directory: {subset.image_dir}")
return [], []
return [], [], []

info_cache_file = os.path.join(subset.image_dir, self.IMAGE_INFO_CACHE_FILE)
use_cached_info_for_subset = subset.cache_info
Expand Down Expand Up @@ -2094,9 +2098,7 @@ def __getitem__(self, index):
# ), f"image size is small / 画像サイズが小さいようです: {image_info.absolute_path}"
# resize to target
if cond_img.shape[0] != target_size_hw[0] or cond_img.shape[1] != target_size_hw[1]:
cond_img = cv2.resize(
cond_img, (int(target_size_hw[1]), int(target_size_hw[0])), interpolation=cv2.INTER_LANCZOS4
)
cond_img = pil_resize(cond_img, (int(target_size_hw[1]), int(target_size_hw[0])))

if flipped:
cond_img = cond_img[:, ::-1, :].copy() # copy to avoid negative stride
Expand Down Expand Up @@ -2205,7 +2207,7 @@ def is_disk_cached_latents_is_expected(reso, npz_path: str, flip_aug: bool, alph
if alpha_mask:
if "alpha_mask" not in npz:
return False
if npz["alpha_mask"].shape[0:2] != reso: # HxW
if (npz["alpha_mask"].shape[1], npz["alpha_mask"].shape[0]) != reso: # HxW => WxH != reso
return False
else:
if "alpha_mask" in npz:
Expand Down Expand Up @@ -2434,7 +2436,7 @@ def load_arbitrary_dataset(args, tokenizer) -> MinimalDataset:
return train_dataset_group


def load_image(image_path, alpha=False):
def load_image(image_path, alpha=False):
try:
with Image.open(image_path) as image:
if alpha:
Expand All @@ -2459,7 +2461,10 @@ def trim_and_resize_if_required(

if image_width != resized_size[0] or image_height != resized_size[1]:
# リサイズする
image = cv2.resize(image, resized_size, interpolation=cv2.INTER_AREA) # INTER_AREAでやりたいのでcv2でリサイズ
if image_width > resized_size[0] and image_height > resized_size[1]:
image = cv2.resize(image, resized_size, interpolation=cv2.INTER_AREA) # INTER_AREAでやりたいのでcv2でリサイズ
else:
image = pil_resize(image, resized_size)

image_height, image_width = image.shape[0:2]

Expand Down Expand Up @@ -2971,6 +2976,20 @@ def add_sd_models_arguments(parser: argparse.ArgumentParser):


def add_optimizer_arguments(parser: argparse.ArgumentParser):
def int_or_float(value):
if value.endswith("%"):
try:
return float(value[:-1]) / 100.0
except ValueError:
raise argparse.ArgumentTypeError(f"Value '{value}' is not a valid percentage")
try:
float_value = float(value)
if float_value >= 1:
return int(value)
return float(value)
except ValueError:
raise argparse.ArgumentTypeError(f"'{value}' is not an int or float")

parser.add_argument(
"--optimizer_type",
type=str,
Expand Down Expand Up @@ -3023,9 +3042,17 @@ def add_optimizer_arguments(parser: argparse.ArgumentParser):
)
parser.add_argument(
"--lr_warmup_steps",
type=int,
type=int_or_float,
default=0,
help="Int number of steps for the warmup in the lr scheduler (default is 0) or float with ratio of train steps"
" / 学習率のスケジューラをウォームアップするステップ数(デフォルト0)、または学習ステップの比率(1未満のfloat値の場合)",
)
parser.add_argument(
"--lr_decay_steps",
type=int_or_float,
default=0,
help="Number of steps for the warmup in the lr scheduler (default is 0) / 学習率のスケジューラをウォームアップするステップ数(デフォルト0)",
help="Int number of steps for the decay in the lr scheduler (default is 0) or float (<1) with ratio of train steps"
" / 学習率のスケジューラを減衰させるステップ数(デフォルト0)、または学習ステップの比率(1未満のfloat値の場合)",
)
parser.add_argument(
"--lr_scheduler_num_cycles",
Expand All @@ -3045,6 +3072,20 @@ def add_optimizer_arguments(parser: argparse.ArgumentParser):
help="Combines backward pass and optimizer step to reduce VRAM usage. Only available in SDXL"
+ " / バックワードパスとオプティマイザステップを組み合わせてVRAMの使用量を削減します。SDXLでのみ有効",
)
parser.add_argument(
"--lr_scheduler_timescale",
type=int,
default=None,
help="Inverse sqrt timescale for inverse sqrt scheduler,defaults to `num_warmup_steps`"
+ " / 逆平方根スケジューラのタイムスケール、デフォルトは`num_warmup_steps`",
)
parser.add_argument(
"--lr_scheduler_min_lr_ratio",
type=float,
default=None,
help="The minimum learning rate as a ratio of the initial learning rate for cosine with min lr scheduler and warmup decay scheduler"
+ " / 初期学習率の比率としての最小学習率を指定する、cosine with min lr と warmup decay スケジューラ で有効",
)


def add_training_arguments(parser: argparse.ArgumentParser, support_dreambooth: bool):
Expand Down Expand Up @@ -4292,10 +4333,18 @@ def get_scheduler_fix(args, optimizer: Optimizer, num_processes: int):
Unified API to get any scheduler from its name.
"""
name = args.lr_scheduler
num_warmup_steps: Optional[int] = args.lr_warmup_steps
num_training_steps = args.max_train_steps * num_processes # * args.gradient_accumulation_steps
num_warmup_steps: Optional[int] = (
int(args.lr_warmup_steps * num_training_steps) if isinstance(args.lr_warmup_steps, float) else args.lr_warmup_steps
)
num_decay_steps: Optional[int] = (
int(args.lr_decay_steps * num_training_steps) if isinstance(args.lr_decay_steps, float) else args.lr_decay_steps
)
num_stable_steps = num_training_steps - num_warmup_steps - num_decay_steps
num_cycles = args.lr_scheduler_num_cycles
power = args.lr_scheduler_power
timescale = args.lr_scheduler_timescale
min_lr_ratio = args.lr_scheduler_min_lr_ratio

lr_scheduler_kwargs = {} # get custom lr_scheduler kwargs
if args.lr_scheduler_args is not None and len(args.lr_scheduler_args) > 0:
Expand Down Expand Up @@ -4331,22 +4380,27 @@ def wrap_check_needless_num_warmup_steps(return_vals):
# logger.info(f"adafactor scheduler init lr {initial_lr}")
return wrap_check_needless_num_warmup_steps(transformers.optimization.AdafactorSchedule(optimizer, initial_lr))

if name == DiffusersSchedulerType.PIECEWISE_CONSTANT.value:
name = DiffusersSchedulerType(name)
schedule_func = DIFFUSERS_TYPE_TO_SCHEDULER_FUNCTION[name]
return schedule_func(optimizer, **lr_scheduler_kwargs) # step_rules and last_epoch are given as kwargs

name = SchedulerType(name)
schedule_func = TYPE_TO_SCHEDULER_FUNCTION[name]

if name == SchedulerType.CONSTANT:
return wrap_check_needless_num_warmup_steps(schedule_func(optimizer, **lr_scheduler_kwargs))

if name == SchedulerType.PIECEWISE_CONSTANT:
return schedule_func(optimizer, **lr_scheduler_kwargs) # step_rules and last_epoch are given as kwargs

# All other schedulers require `num_warmup_steps`
if num_warmup_steps is None:
raise ValueError(f"{name} requires `num_warmup_steps`, please provide that argument.")

if name == SchedulerType.CONSTANT_WITH_WARMUP:
return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, **lr_scheduler_kwargs)

if name == SchedulerType.INVERSE_SQRT:
return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, timescale=timescale, **lr_scheduler_kwargs)

# All other schedulers require `num_training_steps`
if num_training_steps is None:
raise ValueError(f"{name} requires `num_training_steps`, please provide that argument.")
Expand All @@ -4365,7 +4419,37 @@ def wrap_check_needless_num_warmup_steps(return_vals):
optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, power=power, **lr_scheduler_kwargs
)

return schedule_func(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps, **lr_scheduler_kwargs)
if name == SchedulerType.COSINE_WITH_MIN_LR:
return schedule_func(
optimizer,
num_warmup_steps=num_warmup_steps,
num_training_steps=num_training_steps,
num_cycles=num_cycles / 2,
min_lr_rate=min_lr_ratio,
**lr_scheduler_kwargs,
)

# All other schedulers require `num_decay_steps`
if num_decay_steps is None:
raise ValueError(f"{name} requires `num_decay_steps`, please provide that argument.")
if name == SchedulerType.WARMUP_STABLE_DECAY:
return schedule_func(
optimizer,
num_warmup_steps=num_warmup_steps,
num_stable_steps=num_stable_steps,
num_decay_steps=num_decay_steps,
num_cycles=num_cycles / 2,
min_lr_ratio=min_lr_ratio if min_lr_ratio is not None else 0.0,
**lr_scheduler_kwargs,
)

return schedule_func(
optimizer,
num_warmup_steps=num_warmup_steps,
num_training_steps=num_training_steps,
num_decay_steps=num_decay_steps,
**lr_scheduler_kwargs,
)


def prepare_dataset_args(args: argparse.Namespace, support_metadata: bool):
Expand Down Expand Up @@ -5404,7 +5488,7 @@ def sample_images_common(
clean_memory_on_device(accelerator.device)

torch.set_rng_state(rng_state)
if cuda_rng_state is not None:
if torch.cuda.is_available() and cuda_rng_state is not None:
torch.cuda.set_rng_state(cuda_rng_state)
vae.to(org_vae_device)

Expand Down Expand Up @@ -5438,11 +5522,13 @@ def sample_image_inference(

if seed is not None:
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
else:
# True random sample image generation
torch.seed()
torch.cuda.seed()
if torch.cuda.is_available():
torch.cuda.seed()

scheduler = get_my_scheduler(
sample_sampler=sampler_name,
Expand Down Expand Up @@ -5477,8 +5563,9 @@ def sample_image_inference(
controlnet_image=controlnet_image,
)

with torch.cuda.device(torch.cuda.current_device()):
torch.cuda.empty_cache()
if torch.cuda.is_available():
with torch.cuda.device(torch.cuda.current_device()):
torch.cuda.empty_cache()

image = pipeline.latents_to_image(latents)[0]

Expand Down
Loading

0 comments on commit 5a96969

Please sign in to comment.