Merge branch 'fudan-generative-vision-main'
hay86 committed Jul 15, 2024
2 parents 4ccf209 + fbc2152 commit 1323282
Showing 17 changed files with 3,108 additions and 24 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -23,4 +23,4 @@ pip install xformers==0.0.22.post7
```

## Download Checkpoints
All models will be downloaded automatically to ComfyUI's model folder, so no worries.
111 changes: 105 additions & 6 deletions README.md.bak
@@ -24,7 +24,7 @@
<a href='https://fudan-generative-vision.github.io/hallo/#/'><img src='https://img.shields.io/badge/Project-HomePage-Green'></a>
<a href='https://arxiv.org/pdf/2406.08801'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href='https://huggingface.co/fudan-generative-ai/hallo'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'></a>
<a href='https://huggingface.co/fudan-generative-ai/hallo'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Demo-yellow'></a>
<a href='https://huggingface.co/spaces/fffiloni/tts-hallo-talking-portrait'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Demo-yellow'></a>
<a href='https://www.modelscope.cn/models/fudan-generative-vision/Hallo/summary'><img src='https://img.shields.io/badge/Modelscope-Model-purple'></a>
<a href='assets/wechat.jpeg'><img src='https://badges.aleen42.com/src/wechat.svg'></a>
</div>
@@ -65,6 +65,7 @@ Explore [more examples](https://fudan-generative-vision.github.io/hallo).

## 📰 News

- **`2024/06/28`**: 🎉🎉🎉 We are proud to announce the release of our model training code. Try it with your own training data; see the [tutorial](#training).
- **`2024/06/21`**: 🚀🚀🚀 Cloned a Gradio demo on [🤗Huggingface space](https://huggingface.co/spaces/fudan-generative-ai/hallo).
- **`2024/06/20`**: 🌟🌟🌟 Received numerous contributions from the community, including a [Windows version](https://github.com/sdbds/hallo-for-windows), [ComfyUI](https://github.com/AIFSH/ComfyUI-Hallo), [WebUI](https://github.com/fudan-generative-vision/hallo/pull/51), and [Docker template](https://github.com/ashleykleynhans/hallo-docker).
- **`2024/06/15`**: ✨✨✨ Released some images and audios for inference testing on [🤗Huggingface](https://huggingface.co/datasets/fudan-generative-ai/hallo_inference_samples).
@@ -74,6 +75,7 @@ Explore [more examples](https://fudan-generative-vision.github.io/hallo).

Explore the resources developed by our community to enhance your experience with Hallo:

- [TTS x Hallo Talking Portrait Generator](https://huggingface.co/spaces/fffiloni/tts-hallo-talking-portrait) - Check out this awesome Gradio demo by [@Sylvain Filoni](https://huggingface.co/fffiloni)! With this tool, you can conveniently prepare a portrait image and audio for Hallo.
- [Demo on Huggingface](https://huggingface.co/spaces/multimodalart/hallo) - Check out this easy-to-use Gradio demo by [@multimodalart](https://huggingface.co/multimodalart).
- [hallo-webui](https://github.com/daswer123/hallo-webui) - Explore the WebUI created by [@daswer123](https://github.com/daswer123).
- [hallo-for-windows](https://github.com/sdbds/hallo-for-windows) - Utilize Hallo on Windows with the guide by [@sdbds](https://github.com/sdbds).
@@ -233,15 +235,113 @@ options:
face region
```

## Training

### Prepare Data for Training

The training data consists of talking-face videos, similar to the source images used for inference, and must meet the following requirements:

1. It should be cropped into squares.
2. The face should be the main focus, making up 50%-70% of the image.
3. The face should be facing forward, with a rotation angle of less than 30° (no side profiles).

Organize your raw videos into the following directory structure:


```text
dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   |-- 0003.mp4
|   `-- 0004.mp4
```

You can use any `dataset_name`, but ensure the `videos` directory is named as shown above.

Next, process the videos with the following commands:

```bash
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 1
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2
```

**Note:** Execute steps 1 and 2 sequentially as they perform different tasks. Step 1 converts videos into frames, extracts audio from each video, and generates the necessary masks. Step 2 generates face embeddings using InsightFace and audio embeddings using Wav2Vec, and requires a GPU. For parallel processing, use the `-p` and `-r` arguments. The `-p` argument specifies the total number of instances to launch, dividing the data into `p` parts. The `-r` argument specifies which part the current process should handle. You need to manually launch multiple instances with different values for `-r`.
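
For example, a minimal sketch of running step 2 in parallel on one machine, assuming four instances and 0-based part indices for `-r` (each instance still needs access to a GPU):

```bash
# Split the data into 4 parts (-p 4) and launch one instance per part (-r 0..3).
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2 -p 4 -r 0 &
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2 -p 4 -r 1 &
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2 -p 4 -r 2 &
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2 -p 4 -r 3 &
wait
```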

Generate the metadata JSON files with the following commands:

```bash
python scripts/extract_meta_info_stage1.py -r path/to/dataset -n dataset_name
python scripts/extract_meta_info_stage2.py -r path/to/dataset -n dataset_name
```

Replace `path/to/dataset` with the path to the parent directory of `videos`, such as `dataset_name` in the example above. This will generate `dataset_name_stage1.json` and `dataset_name_stage2.json` in the `./data` directory.
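
For the example layout above, the concrete commands would be (a sketch, assuming they are run from the repository root):

```bash
python scripts/extract_meta_info_stage1.py -r ./dataset_name -n dataset_name
python scripts/extract_meta_info_stage2.py -r ./dataset_name -n dataset_name
```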

### Training

Update the data meta path settings in the configuration YAML files, `configs/train/stage1.yaml` and `configs/train/stage2.yaml`:


```yaml
#stage1.yaml
data:
  meta_paths:
    - ./data/dataset_name_stage1.json

#stage2.yaml
data:
  meta_paths:
    - ./data/dataset_name_stage2.json
```

Start training with the following command:

```shell
accelerate launch -m \
  --config_file accelerate_config.yaml \
  --machine_rank 0 \
  --main_process_ip 0.0.0.0 \
  --main_process_port 20055 \
  --num_machines 1 \
  --num_processes 8 \
  scripts.train_stage1 --config ./configs/train/stage1.yaml
```
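
Stage 2 is launched the same way; the following is a sketch that assumes stage 1 has already written checkpoints to the `stage1_ckpt_dir` configured in `configs/train/stage2.yaml` (`./exp_output/stage1/` by default):

```shell
accelerate launch -m \
  --config_file accelerate_config.yaml \
  --machine_rank 0 \
  --main_process_ip 0.0.0.0 \
  --main_process_port 20055 \
  --num_machines 1 \
  --num_processes 8 \
  scripts.train_stage2 --config ./configs/train/stage2.yaml
```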

#### Accelerate Usage Explanation

The `accelerate launch` command is used to start the training process with distributed settings.

```shell
accelerate launch [arguments] {training_script} --{training_script-argument-1} --{training_script-argument-2} ...
```

**Arguments for Accelerate:**

- `-m, --module`: Interpret the launch script as a Python module.
- `--config_file`: Configuration file for Hugging Face Accelerate.
- `--machine_rank`: Rank of the current machine in a multi-node setup.
- `--main_process_ip`: IP address of the master node.
- `--main_process_port`: Port of the master node.
- `--num_machines`: Total number of nodes participating in the training.
- `--num_processes`: Total number of processes for training, matching the total number of GPUs across all machines.

**Arguments for Training:**

- `{training_script}`: The training script, such as `scripts.train_stage1` or `scripts.train_stage2`.
- `--{training_script-argument-1}`: Arguments specific to the training script. Our training scripts accept one argument, `--config`, to specify the training configuration file.

For multi-node training, run the command separately on each node, setting a different `machine_rank` for each node.
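
As a sketch, a two-node run with 8 GPUs per node could look like this; the master IP `192.168.1.10` is an example value, and `num_machines`/`num_processes` in `accelerate_config.yaml` may also need to be updated to match:

```shell
# On node 0 (the master node):
accelerate launch -m \
  --config_file accelerate_config.yaml \
  --machine_rank 0 \
  --main_process_ip 192.168.1.10 \
  --main_process_port 20055 \
  --num_machines 2 \
  --num_processes 16 \
  scripts.train_stage1 --config ./configs/train/stage1.yaml

# On node 1, run the same command with only --machine_rank changed:
accelerate launch -m \
  --config_file accelerate_config.yaml \
  --machine_rank 1 \
  --main_process_ip 192.168.1.10 \
  --main_process_port 20055 \
  --num_machines 2 \
  --num_processes 16 \
  scripts.train_stage1 --config ./configs/train/stage1.yaml
```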

For more settings, refer to the [Accelerate documentation](https://huggingface.co/docs/accelerate/en/index).

## 📅️ Roadmap

| Status | Milestone | ETA |
| :----: | :---------------------------------------------------------------------------------------------------- | :--------: |
| ✅ | **[Inference source code meet everyone on GitHub](https://github.com/fudan-generative-vision/hallo)** | 2024-06-15 |
| ✅ | **[Pretrained models on Huggingface](https://huggingface.co/fudan-generative-ai/hallo)** | 2024-06-15 |
| 🚧 | **[Optimizing Performance on images with a resolution of 256x256.]()** | 2024-06-23 |
| 🚀 | **[Improving the model's performance on Mandarin Chinese]()** | 2024-06-25 |
| 🚀 | **[Releasing data preparation and training scripts]()** | 2024-06-28 |
| ✅ | **[Releasing data preparation and training scripts](#training)** | 2024-06-28 |
| 🚀 | **[Improving the model's performance on Mandarin Chinese]()** | TBD |

<details>
<summary>Other Enhancements</summary>
Expand All @@ -250,7 +350,6 @@ options:
- [x] Bug: Output video may lose several frames. [#41](https://github.com/fudan-generative-vision/hallo/issues/41)
- [ ] Bug: Sound volume affecting inference results (audio normalization).
- [ ] ~~Enhancement: Inference code logic optimization~~. This solution doesn't show significant performance improvements. Trying other approaches.
- [ ] Enhancement: Enhancing performance at low resolutions (256x256) to support more efficient usage.

</details>

@@ -297,4 +396,4 @@ Thank you to all the contributors who have helped to make this project better!

<a href="https://github.com/fudan-generative-vision/hallo/graphs/contributors">
<img src="https://contrib.rocks/image?repo=fudan-generative-vision/hallo" />
</a>
21 changes: 21 additions & 0 deletions accelerate_config.yaml
@@ -0,0 +1,21 @@
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: "no"
main_training_function: main
mixed_precision: "fp16"
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
63 changes: 63 additions & 0 deletions configs/train/stage1.yaml
@@ -0,0 +1,63 @@
data:
  train_bs: 8
  train_width: 512
  train_height: 512
  meta_paths:
    - "./data/HDTF_meta.json"
  # Margin of frame indexes between ref and tgt images
  sample_margin: 30

solver:
  gradient_accumulation_steps: 1
  mixed_precision: "no"
  enable_xformers_memory_efficient_attention: True
  gradient_checkpointing: False
  max_train_steps: 30000
  max_grad_norm: 1.0
  # lr
  learning_rate: 1.0e-5
  scale_lr: False
  lr_warmup_steps: 1
  lr_scheduler: "constant"

  # optimizer
  use_8bit_adam: False
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_weight_decay: 1.0e-2
  adam_epsilon: 1.0e-8

val:
  validation_steps: 500

noise_scheduler_kwargs:
  num_train_timesteps: 1000
  beta_start: 0.00085
  beta_end: 0.012
  beta_schedule: "scaled_linear"
  steps_offset: 1
  clip_sample: false

base_model_path: "./pretrained_models/stable-diffusion-v1-5/"
vae_model_path: "./pretrained_models/sd-vae-ft-mse"
face_analysis_model_path: "./pretrained_models/face_analysis"

weight_dtype: "fp16" # [fp16, fp32]
uncond_ratio: 0.1
noise_offset: 0.05
snr_gamma: 5.0
enable_zero_snr: True
face_locator_pretrained: False

seed: 42
resume_from_checkpoint: "latest"
checkpointing_steps: 500
exp_name: "stage1"
output_dir: "./exp_output"

ref_image_paths:
  - "examples/reference_images/1.jpg"

mask_image_paths:
  - "examples/masks/1.png"

119 changes: 119 additions & 0 deletions configs/train/stage2.yaml
@@ -0,0 +1,119 @@
data:
  train_bs: 4
  val_bs: 1
  train_width: 512
  train_height: 512
  fps: 25
  sample_rate: 16000
  n_motion_frames: 2
  n_sample_frames: 14
  audio_margin: 2
  train_meta_paths:
    - "./data/hdtf_split_stage2.json"

wav2vec_config:
  audio_type: "vocals" # audio vocals
  model_scale: "base" # base large
  features: "all" # last avg all
  model_path: ./pretrained_models/wav2vec/wav2vec2-base-960h
audio_separator:
  model_path: ./pretrained_models/audio_separator/Kim_Vocal_2.onnx
face_expand_ratio: 1.2

solver:
  gradient_accumulation_steps: 1
  mixed_precision: "no"
  enable_xformers_memory_efficient_attention: True
  gradient_checkpointing: True
  max_train_steps: 30000
  max_grad_norm: 1.0
  # lr
  learning_rate: 1e-5
  scale_lr: False
  lr_warmup_steps: 1
  lr_scheduler: "constant"

  # optimizer
  use_8bit_adam: True
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_weight_decay: 1.0e-2
  adam_epsilon: 1.0e-8

val:
  validation_steps: 1000

noise_scheduler_kwargs:
  num_train_timesteps: 1000
  beta_start: 0.00085
  beta_end: 0.012
  beta_schedule: "linear"
  steps_offset: 1
  clip_sample: false

unet_additional_kwargs:
  use_inflated_groupnorm: true
  unet_use_cross_frame_attention: false
  unet_use_temporal_attention: false
  use_motion_module: true
  use_audio_module: true
  motion_module_resolutions:
    - 1
    - 2
    - 4
    - 8
  motion_module_mid_block: true
  motion_module_decoder_only: false
  motion_module_type: Vanilla
  motion_module_kwargs:
    num_attention_heads: 8
    num_transformer_block: 1
    attention_block_types:
      - Temporal_Self
      - Temporal_Self
    temporal_position_encoding: true
    temporal_position_encoding_max_len: 32
    temporal_attention_dim_div: 1
  audio_attention_dim: 768
  stack_enable_blocks_name:
    - "up"
    - "down"
    - "mid"
  stack_enable_blocks_depth: [0,1,2,3]

trainable_para:
  - audio_modules
  - motion_modules

base_model_path: "./pretrained_models/stable-diffusion-v1-5/"
vae_model_path: "./pretrained_models/sd-vae-ft-mse"
face_analysis_model_path: "./pretrained_models/face_analysis"
mm_path: "./pretrained_models/motion_module/mm_sd_v15_v2.ckpt"

weight_dtype: "fp16" # [fp16, fp32]
uncond_img_ratio: 0.05
uncond_audio_ratio: 0.05
uncond_ia_ratio: 0.05
start_ratio: 0.05
noise_offset: 0.05
snr_gamma: 5.0
enable_zero_snr: True
stage1_ckpt_dir: "./exp_output/stage1/"

single_inference_times: 10
inference_steps: 40
cfg_scale: 3.5

seed: 42
resume_from_checkpoint: "latest"
checkpointing_steps: 500
exp_name: "stage2"
output_dir: "./exp_output"

ref_img_path:
  - "examples/reference_images/1.jpg"

audio_path:
  - "examples/driving_audios/1.wav"


Binary file added examples/masks/1.png
5 changes: 3 additions & 2 deletions hallo/datasets/audio_processor.py
@@ -73,7 +73,7 @@ def __init__(
        self.wav2vec_feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(wav2vec_model_path, local_files_only=True)


    def preprocess(self, wav_file: str, clip_length: int):
    def preprocess(self, wav_file: str, clip_length: int=-1):
        """
        Preprocess a WAV audio file by separating the vocals from the background and resampling it to a 16 kHz sample rate.
        The separated vocal track is then converted into wav2vec2 for further processing or analysis.
@@ -109,7 +109,8 @@ def preprocess(self, wav_file: str, clip_length: int):
        audio_length = seq_len

        audio_feature = torch.from_numpy(audio_feature).float().to(device=self.device)
        if seq_len % clip_length != 0:

        if clip_length>0 and seq_len % clip_length != 0:
            audio_feature = torch.nn.functional.pad(audio_feature, (0, (clip_length - seq_len % clip_length) * (self.sample_rate // self.fps)), 'constant', 0.0)
            seq_len += clip_length - seq_len % clip_length
        audio_feature = audio_feature.unsqueeze(0)
