The model chosen is a vector quantized (VQ) diffusion model based on these two papers:
- Vector Quantized Diffusion Model for Text-to-Image Synthesis
- Improved Vector Quantized Diffusion Models
There are three ways to run the project:
- A .ipynb file that was built to work with Google Colab.
- A method through CLI.
- A method through a web app.
Method 1 only requires that the user upload the notebook to Colab. After installing the packages by simply running the first cell, everything will run smoothly.
Methods 2 and 3 require installing some packages, which are listed in the "Requirements.txt" file. You simply need to create an environment (preferably using Anaconda) and do the following:
- Choose Python 3.10.11
- After the environment is created, open a terminal with this environment activated.
- Copy each command in "Requirements.txt" to the terminal and run it.
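For example, with Anaconda, the first two steps correspond to the following commands (the environment name "vq-diffusion" is just a placeholder; any name works):
conda create -n vq-diffusion python=3.10.11
conda activate vq-diffusion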
There are two ways to train:
- Through Colab using the .ipynb file.
- Through the given source code files.
To train using method 1:
- Go to "configs/coco.yaml"
- You can control all the configurations for training in this file. Feel free to leave it as it is.
- Simply follow the steps in the notebook, which include:
- Installing the packages.
- Cloning the repo.
- Downloading the dataset.
- Running the training file.
To train using method 2:
- Create a folder called "datasets" in the root directory of the project.
- Create a folder called "MSCOCO_Caption" in "datasets".
- Follow the directory structure for the Microsoft COCO dataset described in the "Data Preparing" section of the "readme.md" file (a rough sketch is also shown after this list).
- Download the dataset, choose:
- 2014 Train images
- 2014 Val images
- 2014 Train/Val annotations
- "2014 Train images" is a compressed file containing a folder called "train 2014".
- "2014 Val images" is a compressed file containing a folder called "val 2014".
- "2014 Train/Val annotations" is a compressed file containing .JSON files. You only need two:
- "captions_train2014.json"
- "captions_val2014.json"
- Go to "configs/coco.yaml"
- You can control all the configurations for training in this file. Feel free to leave it as it is.
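The expected layout should roughly look like the sketch below. The authoritative structure is the one in the "Data Preparing" section of "readme.md"; the exact nesting (in particular the "annotations" folder) is an assumption here:

```
datasets/
└── MSCOCO_Caption/
    ├── annotations/
    │   ├── captions_train2014.json
    │   └── captions_val2014.json
    ├── train2014/
    └── val2014/
```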
**Note:** Training requires a powerful machine with lots of VRAM.
Since training requires a very powerful system, I could not train on the original COCO 2014 dataset; instead, I created a stripped-down version of it just to check that training works, and I trained for only one epoch. It goes without saying that the resulting model will not produce good results. Even the provided pretrained model was not trained for many epochs.
In this section, I'm going to compare the outputs of my trained model and the pretrained model using the same prompt: "A group of elephants walking in muddy water".
There are six different inference methods, which will also be shown.
Pretrained Inference Improved VQ-Diffusion with both learnable classifier-free sampling and fast inference:
Custom Inference Improved VQ-Diffusion with both learnable classifier-free sampling and fast inference:
**COCO 2014**
- Download the dataset, choose:
- 2014 Train images
- 2014 Val images
- 2014 Train/Val annotations
- "2014 Train images" is a compressed file containing a folder called "train 2014".
- "2014 Val images" is a compressed file containing a folder called "val 2014".
- "2014 Train/Val annotations" is a compressed file containing .JSON files. You only need two:
- "captions_train2014.json"
- "captions_val2014.json"
This project contains two main models:
- VQ-VAE
- VQ-Diffusion
I trained the VQ-Diffusion model, which consists of:
- content_codec: 65.8 million parameters
- condition_codec: 0
- transformer: 431.3 million parameters
These parameters add up to 497.1 million.
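Parameter counts like these can be read off the model directly. The snippet below is a minimal sketch, assuming the model is a PyTorch nn.Module whose submodules are named content_codec, condition_codec, and transformer as listed above; the `model` variable in the commented usage is a hypothetical stand-in for the loaded VQ-Diffusion model:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> float:
    """Return the total number of parameters in a module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# Hypothetical usage once the VQ-Diffusion model has been built/loaded as `model`:
# for name in ("content_codec", "condition_codec", "transformer"):
#     print(name, f"{count_params(getattr(model, name)):.1f}M parameters")
```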
A variational Bayes (VB) loss is used in this project; computing it involves calculating Kullback-Leibler (KL) divergences between discrete (categorical) distributions.
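As a minimal illustration (a sketch, not the project's exact loss code), the KL divergence between two categorical distributions can be computed like this:

```python
import torch
import torch.nn.functional as F

def categorical_kl(log_p, log_q):
    # KL(p || q) for categorical distributions, given as log-probabilities
    # over the last dimension (e.g. the codebook / class dimension).
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)

# Toy example: a batch of 2 tokens with a codebook of size 5.
log_p = F.log_softmax(torch.randn(2, 5), dim=-1)
log_q = F.log_softmax(torch.randn(2, 5), dim=-1)
print(categorical_kl(log_p, log_q))
```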
Streamlit was used to develop the web app for this project.
Once you start the web app (see the "Running" section below), it will cache the models so that they only need to be loaded once, not every time you run inference.
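A minimal sketch of how this kind of caching can be done in Streamlit (the actual web_app.py may differ; the loader body below is a placeholder):

```python
import streamlit as st

@st.cache_resource  # keep the returned object alive across reruns of the script
def load_models():
    # Placeholder: build the VQ-Diffusion model(s) and load the checkpoint here.
    models = ...
    return models

models = load_models()  # loaded once, then reused on every interaction
```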
Once the models are loaded and cached, you will be presented with this screen:
Once you have entered your text description and the number of images you want to generate, click the "Generate" button:
After the image(s) have been generated, they will be displayed to you as shown:
The output image(s) are 256 × 256; you can increase the resolution by clicking the "Increase Resolution" button:
The upscaled image(s) will be 512 × 512 and will be displayed as shown:
To run the web app, type:
streamlit run web_app.py
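Streamlit will print a local URL (by default http://localhost:8501) that you can open in your browser to use the app.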
To run through the CLI, type:
python infer.py "your text description" "number of images"
Example:
python infer.py "A group of elephants walking in muddy water" 4