Lecture

Generative Image

10 Sep 2025

Text-to-Image

News: An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.

Diffusion Models

Paper: High-Resolution Image Synthesis with Latent Diffusion Models

Blog: Introduction to Diffusion Models for Machine Learning

Diffusion Models are a method of creating data that is similar to a set of training data.
They train by destroying the training data through the addition of noise, and then learning to recover the data by reversing this noising process. Given an input image, the Diffusion Model will iteratively corrupt the image with Gaussian noise in a series of timesteps, ultimately leaving pure Gaussian noise, or “TV static”. The Diffusion Model will then work backwards, learning how to isolate and remove the noise at each timestep, undoing the destruction process that just occurred.
Once trained, the model can be “split in half”, keeping only the reverse (denoising) half: we start from randomly sampled Gaussian noise and use the Diffusion Model to gradually denoise it until an image emerges.
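The noising half of this process can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the linear noise schedule (`beta_start=1e-4`, `beta_end=0.02`, 1000 steps) and the four-value toy "image" are not taken from these notes, and the helper names `make_alpha_bars` and `add_noise` are hypothetical. It uses the standard closed form that jumps directly from the clean image to the noisy image at timestep t.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x_0, (1 - alpha_bar) * I)."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)
image = [0.8, -0.3, 0.5, 1.0]  # toy "image": four pixel values in [-1, 1]
alpha_bars = make_alpha_bars(1000)

# Early steps keep most of the signal; by the last step almost none is left.
for t in (0, 250, 999):
    noisy = add_noise(image, alpha_bars[t], rng)
    print(t, round(math.sqrt(alpha_bars[t]), 4), [round(v, 2) for v in noisy])
```

The second printed column is the fraction of the original signal that survives at that timestep; it shrinks toward zero, which is the "TV static" end state described above.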

Midjourney

Model Versions
Version 6.1 was released on July 30, 2024 as the new default model. It produces more coherent images with more precise details and textures, and generates images approximately 25% faster than Version 6.

Forward Diffusion Process: The diffusion model starts by taking an input image and gradually adding Gaussian noise.
Noise Accumulation: The model keeps adding noise over successive steps. Each step perturbs the pixels with fresh random noise until the original image is reduced to a grainy, noise-covered version. Adding more noise produces a generation that differs more from the original image, while adding less keeps the result closer to it.
Denoising Process: After adding the desired amount of noise, which Midjourney users can partially control with prompt weighting, the model learns to recover the original image by reversing the noising process.
Iterative Refinement: Denoising is performed iteratively, gradually reducing the noise level in the image. Each step removes a little more noise, so the image quality improves progressively.
Training and Predictive Learning: The steps above repeat for as many images in the training dataset as possible. The model eventually learns to predict the original image from the noisy image.
Generating New Data: Once the model is trained, it creates new images by starting from random noise samples and generating colors and shapes from the patterns it picked up during training. The results resemble the training data without copying it, and different noise samples lead to different outputs.
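The generation step in the list above can be sketched as a sampling loop. This is not Midjourney's implementation, which is not public. To keep the example runnable without training anything, the trained network is replaced by a closed-form noise predictor that is exact only when the "training data" is a one-dimensional Gaussian; the schedule values, the data mean and standard deviation, and all function names are assumptions made for illustration.

```python
import math
import random

def linear_schedule(num_steps, beta_start=1e-4, beta_end=0.1):
    """Assumed linear beta schedule and its cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def generate(betas, alpha_bars, data_mean, data_std, rng):
    """Denoise one sample, step by step, starting from pure Gaussian noise."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(len(betas))):
        beta, alpha_bar = betas[t], alpha_bars[t]
        # Stand-in for the trained model: for data ~ N(data_mean, data_std^2)
        # the expected noise contained in x has this closed form.
        marginal_var = alpha_bar * data_std ** 2 + (1.0 - alpha_bar)
        predicted_noise = (math.sqrt(1.0 - alpha_bar)
                           * (x - math.sqrt(alpha_bar) * data_mean) / marginal_var)
        # Remove the predicted noise for this step.
        x = (x - beta / math.sqrt(1.0 - alpha_bar) * predicted_noise) / math.sqrt(1.0 - beta)
        if t > 0:
            x += math.sqrt(beta) * rng.gauss(0.0, 1.0)  # re-inject a little fresh noise
    return x

rng = random.Random(0)
betas, alpha_bars = linear_schedule(200)
samples = [generate(betas, alpha_bars, data_mean=2.0, data_std=0.5, rng=rng)
           for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean={mean:.2f} std={std:.2f}")
```

Every run of `generate` starts from a different noise sample and ends at a different value, yet the collection of outputs follows the distribution of the (toy) training data. That is the sense in which generated images are similar to, but not copies of, the training set.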

DALL·E, DALL·E 2, and storyDALL-E

DALL·E

DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.
Paper: Zero-Shot Text-to-Image Generation
Code: openai/DALL-E
DALL·E contains two components. For images, a discrete variational autoencoder (dVAE) maps each 256x256 image to a 32x32 grid of image tokens, where each token takes one of 8192 possible values. These 1024 image tokens are then concatenated with up to 256 BPE-encoded text tokens, and the combined sequence is used to train an autoregressive transformer.
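The token layout described above can be made concrete with a small sketch. It only shows the bookkeeping, not the real tokenizers: the text vocabulary size (16384, as reported in the paper), the padding id, and the trick of offsetting image ids into one shared vocabulary are assumptions for illustration, and `build_sequence` is a hypothetical helper.

```python
TEXT_LEN = 256        # maximum number of BPE text tokens
GRID = 32             # image tokens form a 32 x 32 grid
IMAGE_VOCAB = 8192    # each image token takes one of 8192 values
TEXT_VOCAB = 16384    # assumption: text vocabulary size, not stated in these notes
PAD_ID = 0            # assumption: hypothetical padding id for short captions

def build_sequence(text_tokens, image_grid):
    """Concatenate padded text tokens with the flattened image token grid."""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("caption is longer than the text token budget")
    if len(image_grid) != GRID or any(len(row) != GRID for row in image_grid):
        raise ValueError("image grid must be 32 x 32")
    padded_text = list(text_tokens) + [PAD_ID] * (TEXT_LEN - len(text_tokens))
    flat_image = [tok for row in image_grid for tok in row]
    # Shift image ids past the text ids so both share one vocabulary.
    return padded_text + [TEXT_VOCAB + tok for tok in flat_image]

caption = [17, 402, 9001]  # made-up BPE ids
grid = [[(r * GRID + c) % IMAGE_VOCAB for c in range(GRID)] for r in range(GRID)]
sequence = build_sequence(caption, grid)
print(len(sequence))  # 256 text positions + 1024 image positions
```

The transformer is then trained to predict each token of this sequence from the tokens before it, so at generation time it can continue a caption with image tokens.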

DALL·E 2

Paper: Hierarchical Text-Conditional Image Generation with CLIP Latents

How Does DALL·E 2 Work?

Compared to DALL·E’s 12 billion parameters, DALL·E 2 uses a 3.5-billion-parameter model, plus a second 1.5-billion-parameter model that enhances the resolution of its images.

DALL·E 2 image generation process

CLIP Training

Diffusion Models
Diffusion models are generative models; their denoising network is typically a U-Net or a transformer. They take a piece of data, for example a photo, and gradually add noise over a series of timesteps until it is no longer recognizable. From that point, they try to reconstruct the image to its original form. In doing so, they learn how to generate images or any other kind of data.
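How "trying to reconstruct" turns into a training signal can be sketched as follows. A common formulation, assumed here rather than taken from these notes, trains the network to predict the noise that was added and scores it with a mean squared error. No network is trained below: two hypothetical predictors (one that always outputs zero, and an "oracle" that cheats by looking at the clean data) just show what the loss rewards. The schedule values and function names are illustrative.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    out, running = [], 1.0
    for t in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * t / (num_steps - 1))
        out.append(running)
    return out

def make_training_pair(x0, alpha_bars, rng):
    """Pick a random timestep, noise x0, and return (t, x_t, noise target)."""
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return t, x_t, eps

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
x0 = [0.8, -0.3, 0.5, 1.0]  # toy clean data
t, x_t, eps = make_training_pair(x0, alpha_bars, rng)

# Hypothetical predictors standing in for a neural network.
zero_pred = [0.0] * len(x_t)
a = alpha_bars[t]
oracle_pred = [(v - math.sqrt(a) * x) / math.sqrt(1.0 - a) for v, x in zip(x_t, x0)]

print("untrained-style loss:", mse(zero_pred, eps))
print("oracle loss:", mse(oracle_pred, eps))
```

A real model sees only `x_t` and `t`, so it cannot reach the oracle's loss exactly; training pushes it as close as the data allows.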

Huggingface: stable-diffusion-xl-base-1.0
SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
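The hand-off between the two experts can be pictured as a split of the denoising schedule. The sketch below is plain Python bookkeeping, not the SDXL code: `split_denoising_steps` is a hypothetical helper, and the 80/20 split and the 40-step schedule are assumed values for illustration. In the Hugging Face diffusers library the same idea is exposed through the `denoising_end` argument of the base pipeline and the `denoising_start` argument of the refiner pipeline.

```python
def split_denoising_steps(num_steps, handoff=0.8):
    """Split step indices into (base_steps, refiner_steps) at a hand-off fraction."""
    if not 0.0 < handoff <= 1.0:
        raise ValueError("handoff must be in (0, 1]")
    cut = round(num_steps * handoff)
    steps = list(range(num_steps))
    # The base model handles the early, high-noise steps and passes its
    # still-noisy latents to the refiner for the final low-noise steps.
    return steps[:cut], steps[cut:]

base_steps, refiner_steps = split_denoising_steps(40, handoff=0.8)
print(len(base_steps), "base steps, then", len(refiner_steps), "refiner steps")
```

Setting `handoff=1.0` leaves the refiner with no steps, which corresponds to using the base model as a standalone module.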

Kaggle: https://www.kaggle.com/code/rkuo2000/sdxl-base-1-0

Transfusion

Paper: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Code: https://github.com/lucidrains/transfusion-pytorch

FLUX1.1 [pro]

Superior Speed and Efficiency: Faster generation times and reduced latency, enabling more efficient workflows. FLUX1.1 [pro] is three times faster than the currently available FLUX.1 [pro].
Improved Performance: FLUX1.1 [pro] was entered into the Artificial Analysis image arena (https://artificialanalysis.ai/text-to-image), a popular benchmark for text-to-image models, and tested there under the codename “blueberry”. It surpasses all other models on the leaderboard, achieving the highest overall Elo score.

SD 3.5

Model: stabilityai/stable-diffusion-3-medium
Blog: ComfyUI Now Supports Stable Diffusion 3.5!

Deep Generative Models - Survey

Paper: Personalized Image Generation with Deep Generative Models: A Decade Survey

OpenAI 4o Image Generation

Paper: An Empirical Study of GPT-4o Image Generation Capabilities

DeepMind Imagen

Gemini NanoBanana

Uni-CoT

Paper: Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Code: https://github.com/Fr0zenCrane/UniCoT

ComfyUI

ComfyUI download

Krita

Installation and ComfyUI Workflow Import (Applications in Architecture, Landscape, and Interior Design)

Perfect Integration of the FLUX.1 [dev] Model in Krita

Text-to-3D

Shap-E

Paper: Shap-E: Generating Conditional 3D Implicit Functions
Code: https://github.com/openai/shap-e
Kaggle: https://www.kaggle.com/rkuo2000/shap-e

MVDiffusion

Paper: MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
Code: https://github.com/Tangshitao/MVDiffusion

MVDream

Paper: MVDream: Multi-view Diffusion for 3D Generation
Code: https://github.com/bytedance/MVDream
Kaggle: https://www.kaggle.com/rkuo2000/mvdream

3D-GPT

Paper: 3D-GPT: Procedural 3D Modeling with Large Language Models

Advances in 3D Generation: A Survey

Paper: Advances in 3D Generation: A Survey

AssetGen

Paper: Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials (https://scontent-tpe1-1.xx.fbcdn.net/v/t39.2365-6/449707112_509645168082163_2193712134508658234_n.pdf?_nc_cat=111&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=5bSbn3KaluAQ7kNvgFbjbd7&_nc_ht=scontent-tpe1-1.xx&oh=00_AYBM_JROjIFPbm8vwphinNrr4x1bUEFOeLV5iYsR6l_0rA&oe=668B3191)
Paper: Meta 3D Gen