Generative Image
Text-to-Image
News: An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.
Diffusion Models
Paper: High-Resolution Image Synthesis with Latent Diffusion Models
Blog: Introduction to Diffusion Models for Machine Learning
Diffusion Models are a method of creating data that is similar to a set of training data.
They train by destroying the training data through the addition of noise, and then learning to recover the data by reversing this noising process. Given an input image, the Diffusion Model will iteratively corrupt the image with Gaussian noise in a series of timesteps, ultimately leaving pure Gaussian noise, or “TV static”.
The Diffusion Model will then work backwards, learning how to isolate and remove the noise at each timestep, undoing the destruction process that just occurred.
Once trained, the model can then be “split in half”, and we can start from randomly sampled Gaussian noise which we use the Diffusion Model to gradually denoise in order to generate an image.
Midjourney
Model Versions
Version 6.1 was released on July 30, 2024 as the new default model. It produces more coherent images with more precise details and textures, and generates images approximately 25% faster than Version 6.
-
Forward Diffusion Process: The diffusion model starts by taking an input image and gradually adding Gaussian noise.
-
Noise Accumulation: The model continues to add more noise to the image. After each addition, randomized image sections are covered in noise until the original image is transformed into a noisy or grain-covered version. More noise will result in a more different generation, while less will produce a more similar generation to the original image.
-
Denoising Process: After adding the desired amount of noise, which Midjourney users can partially control with prompt weighting, the model learns to recover the original image by reversing the noising process.
-
Iterative Refinement: Denoising is performed iteratively, gradually reducing the noise level in the image. At each step, the diffusion model improves the image’s quality and ability to refine over time.
-
Training and Predictive Learning: The steps above repeat for as many images in the training dataset as possible. The model eventually learns to predict the original image from the noisy image.
-
Generating New Data: Once the model is trained, it creates new images by passing random noise samples and generating the colors and shapes from the patterns the model picked up during training. This creates unique images similar to the training data but slightly different, resulting in various possible outputs.
DALL.E, E2, and storyDALL-E
DALL.E
DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.
Paper: Zero-Shot Text-to-Image Generation
Code: openai/DALL-E
The overview of DALL-E could be illustrated as below. It contains two components: for image, VQGAN (vector quantized GAN) is used to map the 256x256 image to a 32x32 grid of image token and each token has 8192 possible values; then this token is combined with 256 BPE=encoded text token is fed into to train the autoregressive transformer. The text token is set to 256 by maximal.
DALL.E-2
Paper: Hierarchical Text-Conditional Image Generation with CLIP Latents
How Does DALL·E 2 Work?
Compared to DALL·E’s 12-billion parameters, DALL·E 2 works on a 3.5-billion parameter model and another 1.5-billion parameter model to enhance the resolution of its images.
DALL·E 2 image generation process
CLIP Training
Diffusion Models
Diffusion models are transformer-based generative models. They take a piece of data, for example, a photo, and gradually add noise over timesteps, until it is not recognizable. And from that point, they try to reconstruct the image to its original form. In doing so, they learn how to generate images or any other kind of data.
LAION-5B Dataset
5.85 billion CLIP-filtered image-text pairs
Paper: LAION-5B: An open large-scale dataset for training next generation image-text models
DALL.E-3
Paper: Improving Image Generation with Better Captions
Dataset Recaptioning
Stability.ai
Code: Stable Diffusion
Imagen
Paper: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Blog: How Imagen Actually Works
SDXL
Paper: SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Code: Generative Models by Stability AI
Huggingface: stable-diffusion-xl-base-1.0
SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
Kaggle: https://www.kaggle.com/code/rkuo2000/sdxl-base-1-0
Transfusion
Paper: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Code: https://github.com/lucidrains/transfusion-pytorch
FLUX1.1 pro
- Superior Speed and Efficiency: Faster generation times and reduced latency, enabling more efficient workflows. FLUX1.1 [pro] is three times faster than the currently available FLUX.1 [pro].
- Improved Performance: FLUX1.1 [pro] has been introduced and tested under the codename “blueberry” into the Artificial Analysis image arena (https://artificialanalysis.ai/text-to-image), a popular benchmark for text-to-image models. It surpasses all other models on the leaderboard, achieving the highest overall Elo score.
SD 3.5
model: stabilityai/stable-diffusion-3-medium
blog: ComfyUI Now Supports Stable Diffusion 3.5!
Deep Generative Models - Survey
Paper: Personalized Image Generation with Deep Generative Models: A Decade Survey
OpenAI 4o Image Generation
Paper: An Empirical Study of GPT-4o Image Generation Capabilities
DeepMind Imagen
Gemini NanoBanana
Uni-COT
Paper: Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Code: https://github.com/Fr0zenCrane/UniCoT
ComfyUI
Krita
安裝與 ComfyUI 工作流匯入(建築景觀與室內設計應用)
FLUX.1[dev]模型在Krita完美整合
Text-to-3D
Shap-E
Paper: Shap-E: Generating Conditional 3D Implicit Functions
Code: https://github.com/openai/shap-e
Kaggle: https://www.kaggle.com/rkuo2000/shap-e
MVdiffusion
Paper: MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
Code: https://github.com/Tangshitao/MVDiffusion
MVDream
Paper: MVDream: Multi-view Diffusion for 3D Generation
Code: https://github.com/bytedance/MVDream
Kaggle: https://www.kaggle.com/rkuo2000/mvdream
3D-GPT
Paper: 3D-GPT: Procedural 3D Modeling with Large Language Models
Advances in 3D Generation : A Survey
Paper: Advances in 3D Generation: A Survey
AssetGen
Paper: [Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials]
(https://scontent-tpe1-1.xx.fbcdn.net/v/t39.2365-6/449707112_509645168082163_2193712134508658234_n.pdf?_nc_cat=111&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=5bSbn3KaluAQ7kNvgFbjbd7&_nc_ht=scontent-tpe1-1.xx&oh=00_AYBM_JROjIFPbm8vwphinNrr4x1bUEFOeLV5iYsR6l_0rA&oe=668B3191)
Paper: Meta 3D Gen
AI-Render
Stable Diffusion in Blender
Code: https://github.com/benrugg/AI-Render
Image-to-3D
Zero123++
Blog: Stable Zero123
Paper: Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model
Code: https://github.com/SUDO-AI-3D/zero123plus
Kaggle: https://www.kaggle.com/code/rkuo2000/zero123plus
Kaggle: https://www.kaggle.com/code/rkuo2000/zero123-controlnet
TripoSR
Blog: Introducing TripoSR: Fast 3D Object Generation from Single Images
Paper: LRM: Large Reconstruction Model for Single Image to 3D
Github: https://github.com/VAST-AI-Research/TripoSR
python run.py examples/chair.png --output-dir output/
Depth Pro
Paper: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Code: https://github.com/apple/ml-depth-pro
This site was last updated October 16, 2025.