Lecture

AI Generated Content

12 Aug 2024 • Richard Kuo

This introduction covers Text-to-Image, Text-to-Video, Text-to-Motion, Text-to-3D, and Image-to-3D generation.

Text-to-Image

News: An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.

Diffusion Models

Paper: High-Resolution Image Synthesis with Latent Diffusion Models

Blog: Introduction to Diffusion Models for Machine Learning

Diffusion Models are a method of creating data that is similar to a set of training data.
They train by destroying the training data through the addition of noise, and then learning to recover the data by reversing this noising process. Given an input image, the Diffusion Model will iteratively corrupt the image with Gaussian noise in a series of timesteps, ultimately leaving pure Gaussian noise, or “TV static”. The Diffusion Model will then work backwards, learning how to isolate and remove the noise at each timestep, undoing the destruction process that just occurred.
Once trained, the model can be “split in half”: the noising half is no longer needed, and generation starts from randomly sampled Gaussian noise, which the Diffusion Model gradually denoises into an image.
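The noising half of this process can be sketched in a few lines. This is a minimal illustration, not code from the paper or the blog: the linear noise schedule, the 1000-step count, the helper names `make_alpha_bar` and `add_noise`, and the 64-value stand-in "image" are all assumptions chosen for the example. It uses the common closed form in which the noisy sample at timestep t is a weighted mix of the original data and Gaussian noise.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return the running product of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t given x_0: scaled signal plus scaled Gaussian noise."""
    signal = math.sqrt(alpha_bar[t])       # how much of the original survives
    sigma = math.sqrt(1.0 - alpha_bar[t])  # how much noise is mixed in
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
alpha_bar = make_alpha_bar()
x0 = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # stand-in for a flattened 8x8 image

# The signal weight shrinks toward zero, leaving almost pure "TV static".
for t in (0, 499, 999):
    x_t, _ = add_noise(x0, t, alpha_bar, rng)
    print(f"t={t:4d}  signal weight={math.sqrt(alpha_bar[t]):.4f}")
```

During training, a network would be asked to predict `noise` from `x_t` and `t`; that part is omitted here.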

Midjourney

Model Versions
Version 6.1 was released on July 30, 2024 as the new default model. It produces more coherent images with more precise details and textures, and generates images approximately 25% faster than Version 6.

Forward Diffusion Process: The diffusion model starts by taking an input image and gradually adding Gaussian noise.
Noise Accumulation: The model keeps adding noise to the image. With each addition, more randomly chosen parts of the image are covered, until the original is reduced to a grainy, noise-covered version. The more noise is added, the more the generated result differs from the original image; the less noise, the closer it stays.
Denoising Process: After the desired amount of noise has been added, which Midjourney users can partially control with prompt weighting, the model learns to recover the original image by reversing the noising process.
Iterative Refinement: Denoising is performed iteratively, gradually reducing the noise level in the image. At each step, the diffusion model removes a little more noise, so image quality improves step by step.
Training and Predictive Learning: The steps above repeat for as many images in the training dataset as possible. The model eventually learns to predict the original image from the noisy image.
Generating New Data: Once the model is trained, it creates new images by starting from random noise samples and generating colors and shapes from the patterns it picked up during training. The results are similar to the training data but not identical, so many different outputs are possible.
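The generation step above can be sketched as a DDPM-style sampling loop. This is a toy under stated assumptions, not Midjourney's implementation (which is not public): the "training set" is a single four-value vector `TARGET`, and the hypothetical function `oracle_noise` stands in for the trained network by returning the exact noise, which a real model only approximates. The schedule and step count are likewise illustrative.

```python
import math
import random

NUM_STEPS = 1000
BETAS = [1e-4 + (0.02 - 1e-4) * i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BAR = []
running = 1.0
for a in ALPHAS:
    running *= a
    ALPHA_BAR.append(running)

TARGET = [0.5, -0.25, 1.0, 0.0]  # the single "training image" in this toy setup

def oracle_noise(x_t, t):
    """Hypothetical stand-in for the trained network: the exact noise that
    would turn TARGET into x_t. A real model only approximates this."""
    signal = math.sqrt(ALPHA_BAR[t])
    sigma = math.sqrt(1.0 - ALPHA_BAR[t])
    return [(x - signal * x0) / sigma for x, x0 in zip(x_t, TARGET)]

def sample(rng):
    """Start from pure Gaussian noise and denoise one timestep at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for t in reversed(range(NUM_STEPS)):
        eps = oracle_noise(x, t)
        coef = BETAS[t] / math.sqrt(1.0 - ALPHA_BAR[t])
        mean = [(xi - coef * ei) / math.sqrt(ALPHAS[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(BETAS[t])  # fresh noise on every step except the last
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

generated = sample(random.Random(0))
print([round(v, 3) for v in generated])
```

Because the toy dataset holds one item and the noise predictor is exact, every sample lands on `TARGET`. With a real dataset and a learned predictor, different noise seeds lead to different images.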

DALL·E, DALL·E 2, and StoryDALL-E

DALL·E

DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.
Paper: Zero-Shot Text-to-Image Generation
Code: openai/DALL-E
DALL·E has two components. For images, a discrete variational autoencoder (dVAE) maps each 256x256 image to a 32x32 grid of image tokens, where each token takes one of 8192 possible values. These image tokens are then concatenated with up to 256 BPE-encoded text tokens, and the combined sequence is used to train an autoregressive transformer.
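The token counts in this description can be checked with a little arithmetic. The sketch below only restates the numbers given above; the helper name `sequence_layout` and the choice to pad short captions up to the full text budget are assumptions made for illustration.

```python
import math

IMAGE_SIZE = 256       # input image is IMAGE_SIZE x IMAGE_SIZE pixels
GRID_SIZE = 32         # the image tokenizer outputs a GRID_SIZE x GRID_SIZE grid
IMAGE_VOCAB = 8192     # number of possible values per image token
MAX_TEXT_TOKENS = 256  # upper limit on BPE text tokens per caption

def sequence_layout(num_text_tokens):
    """Return (text_slots, image_slots, total) for one text-image training pair."""
    if not 0 <= num_text_tokens <= MAX_TEXT_TOKENS:
        raise ValueError("caption exceeds the text-token budget")
    image_slots = GRID_SIZE * GRID_SIZE
    # Assumption: short captions are padded up to the full text budget.
    return MAX_TEXT_TOKENS, image_slots, MAX_TEXT_TOKENS + image_slots

downsample = IMAGE_SIZE // GRID_SIZE     # spatial reduction per side
bits_per_token = math.log2(IMAGE_VOCAB)  # information carried by one image token
text_slots, image_slots, total = sequence_layout(num_text_tokens=20)

print(downsample, bits_per_token, image_slots, total)  # prints: 8 13.0 1024 1280
```

So the transformer models sequences of at most 1280 tokens per pair, and each image is compressed by a factor of 8 along each side before it reaches the transformer.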

DALL·E 2

Paper: Hierarchical Text-Conditional Image Generation with CLIP Latents

How Does DALL·E 2 Work?

Compared to DALL·E’s 12 billion parameters, DALL·E 2 uses a 3.5-billion-parameter model, plus a separate 1.5-billion-parameter model that enhances the resolution of its images.

DALL·E 2 image generation process

CLIP Training

Diffusion Models
Diffusion models are generative models. They take a piece of data, for example a photo, and gradually add noise over a series of timesteps until it is no longer recognizable. From that point, they try to reconstruct the image to its original form. In doing so, they learn how to generate images or any other kind of data.

Huggingface: stable-diffusion-xl-base-1.0
SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
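A base-plus-refiner run along these lines might look like the sketch below, which follows the usage pattern shown on the Hugging Face model cards for the `diffusers` library. It assumes `torch` and `diffusers` are installed, a CUDA GPU is available, and the weights can be downloaded. The helper `split_steps`, the 40-step count, and the 0.8 hand-off fraction are illustrative choices, and the step split it reports is only an approximation of what the scheduler does internally.

```python
def split_steps(num_steps, high_noise_frac):
    """Approximate how many denoising steps go to the base vs. the refiner."""
    base_steps = round(num_steps * high_noise_frac)
    return base_steps, num_steps - base_steps

def generate(prompt, num_steps=40, high_noise_frac=0.8):
    """Run the base model, then hand its noisy latents to the refiner."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from diffusers import DiffusionPipeline

    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,  # share weights to save memory
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")

    # The base stops early and returns latents that still contain noise.
    latents = base(prompt=prompt, num_inference_steps=num_steps,
                   denoising_end=high_noise_frac, output_type="latent").images
    # The refiner picks up at the same point and finishes the denoising.
    return refiner(prompt=prompt, num_inference_steps=num_steps,
                   denoising_start=high_noise_frac, image=latents).images[0]

if __name__ == "__main__":
    print(split_steps(40, 0.8))
```

Calling `generate("an astronaut riding a horse")` would return a PIL image. To use the base model as a standalone module, drop the refiner and call `base(prompt=prompt).images[0]` instead.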

Kaggle: https://www.kaggle.com/code/rkuo2000/sdxl-base-1-0

Transfusion

Paper: Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Code: https://github.com/lucidrains/transfusion-pytorch

FLUX1.1 pro

Superior Speed and Efficiency: Faster generation times and reduced latency, enabling more efficient workflows. FLUX1.1 [pro] is three times faster than the currently available FLUX.1 [pro].
Improved Performance: FLUX1.1 [pro] was introduced and tested under the codename “blueberry” in the Artificial Analysis image arena (https://artificialanalysis.ai/text-to-image), a popular benchmark for text-to-image models. It surpasses all other models on the leaderboard, achieving the highest overall Elo score.

ComfyGen

Paper: ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation

SD 3.5

model: stabilityai/stable-diffusion-3-medium
blog: ComfyUI Now Supports Stable Diffusion 3.5!

Krita

Installation and ComfyUI workflow import (applications in architecture, landscape, and interior design)

FLUX.1 [dev] model fully integrated into Krita

Text-to-Video

Tune-A-Video

Paper: Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Code: https://github.com/showlab/Tune-A-Video

Given a video-text pair as input, Tune-A-Video fine-tunes a pre-trained text-to-image diffusion model for text-to-video generation.

Open-VCLIP

Paper: Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization
Paper: Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Code: https://github.com/wengzejia1/Open-VCLIP/

DyST

Paper: DyST: Towards Dynamic Neural Scene Representations on Real-World Videos

Text-to-Motion

TMR

Paper: TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
Code: https://github.com/Mathux/TMR

Text-to-Motion Retrieval

Paper: Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Code: https://github.com/mesnico/text-to-motion-retrieval
A person walks in a counterclockwise circle
A person is kneeling down on all four legs and begins to crawl

MotionDirector

Paper: MotionDirector: Motion Customization of Text-to-Video Diffusion Models

GPT4Motion

Paper: GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

Motion Editing

Paper: Iterative Motion Editing with Natural Language

Awesome Video Diffusion Models

StyleCrafter

Paper: StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
Code: https://github.com/GongyeLiu/StyleCrafter

Stable Video Diffusion

Paper: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Code: https://github.com/nateraw/stable-diffusion-videos

AnimateDiff

Paper: AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Paper: SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
Code: https://github.com/guoyww/AnimateDiff

Animate Anyone

Paper: Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Outfit Anyone

Code: https://github.com/HumanAIGC/OutfitAnyone

SignLLM

Paper: SignLLM: Sign Languages Production Large Language Models
Code: https://github.com/SignLLM/Prompt2Sign

Text-to-3D

Shap-E

Paper: Shap-E: Generating Conditional 3D Implicit Functions
Code: https://github.com/openai/shap-e
Kaggle: https://www.kaggle.com/rkuo2000/shap-e

MVDiffusion

Paper: MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
Code: https://github.com/Tangshitao/MVDiffusion

MVDream

Paper: MVDream: Multi-view Diffusion for 3D Generation
Code: https://github.com/bytedance/MVDream
Kaggle: https://www.kaggle.com/rkuo2000/mvdream

3D-GPT

Paper: 3D-GPT: Procedural 3D Modeling with Large Language Models

Advances in 3D Generation: A Survey

Paper: Advances in 3D Generation: A Survey

AssetGen

Paper: Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials (https://scontent-tpe1-1.xx.fbcdn.net/v/t39.2365-6/449707112_509645168082163_2193712134508658234_n.pdf?_nc_cat=111&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=5bSbn3KaluAQ7kNvgFbjbd7&_nc_ht=scontent-tpe1-1.xx&oh=00_AYBM_JROjIFPbm8vwphinNrr4x1bUEFOeLV5iYsR6l_0rA&oe=668B3191)
Paper: Meta 3D Gen