Deep Learning Lecture

Lecture

Vision Language Models

16 Aug 2024 • Richard Kuo

Introduction to VLMs/MLLMs

MLLM - Multimodal Large Language Model

Paper: A Survey on Multimodal Large Language Models

MLLM papers

VLM - Vision Language Model

Guide to Vision-Language Models (VLMs)

LLM in Vision papers

Contrastive Learning

CLIP architecture

PrefixLM

SimVLM architecture

VirTex architecture

Frozen architecture

Flamingo architecture

Multimodal Fusing with Cross-Attention

VisualGPT architecture

Masked-language Modeling (MLM) & Image-Text Matching (ITM)

VisualBERT architecture

MM-LLMs

Paper: MM-LLMs: Recent Advances in MultiModal Large Language Models

The general model architecture of MM-LLMs

Paper: Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis

Next-GPT

Paper: Any-to-Any Multimodal Large Language Model

Ferret

Paper: Ferret: Refer and Ground Anything Anywhere at Any Granularity
Code: https://github.com/apple/ml-ferret

MiniGPT-v2

Paper: MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
Code: https://github.com/Vision-CAIR/MiniGPT-4

GPT4-V

Paper: Assessing GPT4-V on Structured Reasoning Tasks

Gemini

Paper: Gemini: A Family of Highly Capable Multimodal Models

PaLM-E

Paper: PaLM-E: An Embodied Multimodal Language Model
Code: https://github.com/kyegomez/PALM-E

PaLI-X

Paper: PaLI-X: On Scaling up a Multilingual Vision and Language Model

Qwen-VL

model: Qwen/Qwen-VL-Chat
Paper: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Code: https://github.com/QwenLM/Qwen-VL

Yi-VL-34B

model: 01-ai/Yi-VL-34B

FuYu-8B

Blog: Fuyu-8B: A Multimodal Architecture for AI Agents

LLaVA

Paper: Visual Instruction Tuning
Paper: Improved Baselines with Visual Instruction Tuning
Code: https://github.com/haotian-liu/LLaVA

LLaVA-Med

Paper: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Code: https://github.com/microsoft/LLaVA-Med

LLaVA-Plus

Paper: LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Code: https://github.com/LLaVA-VL/LLaVA-Plus-Codebase

LLaVA-NeXT

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:

Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
Better visual conversation for more scenarios, covering different applications. Better world knowledge and logical reasoning.
Efficient deployment and inference with SGLang.

Florence-2

model: microsoft/Florence-2-large
Paper: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Blog: Florence-2: Advancing Multiple Vision Tasks with a Single VLM Model

VILA

Paper: VILA: On Pre-training for Visual Language Models
Code: https://github.com/Efficient-Large-Model/VILA

VILA on Jetson Orin

VLFeedback and Silkie

Paper: Silkie: Preference Distillation for Large Visual Language Models
Code: https://github.com/vlf-silkie/VLFeedback

MobileVLM

Paper: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Code: https://github.com/Meituan-AutoML/MobileVLM

MyVLM

Paper: MyVLM: Personalizing VLMs for User-Specific Queries
Code: https://github.com/snap-research/MyVLM

Reka Core

Paper: Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

InternLM-XComposer

Code: https://github.com/InternLM/InternLM-XComposer
InternLM-XComposer2-4KHD could further understand 4K Resolution images.

Phi-3

model: microsoft/Phi-3-vision-128k-instruct

Phi-3-vision is a 4.2B parameter multimodal model with language and vision capabilities.
Phi-3-mini is a 3.8B parameter language model, available in two context lengths (128K and 4K).
Phi-3-small is a 7B parameter language model, available in two context lengths (128K and 8K).
Phi-3-medium is a 14B parameter language model, available in two context lengths (128K and 4K).

MiniCPM-V

model: openbmb/MiniCPM-Llama3-V-2_5-int4

SoM

Paper: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Code: https://github.com/microsoft/SoM

SoM-LLaVA

Paper: List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Code: https://github.com/zzxslp/SoM-LLaVA

Gemini-1.5

Paligemma

mode: google/paligemma-3b-pt-224
Paper: PaliGemma: A versatile 3B VLM for transfer

LongLLaVA

Blog: LongLLaVA: Revolutionizing Multi-Modal AI with Hybrid Architecture
Paper: LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Github: https://github.com/FreedomIntelligence/LongLLaVA

CogVLM2

Paper: CogVLM2: Visual Language Models for Image and Video Understanding
Demo

Phi-3.5-vision

model: microsoft/Phi-3.5-vision-instruct

Pixtral

model: mistralai/Pixtral-12B-2409
Pixtral 12B - the first-ever multimodal Mistral model. Apache 2.0.

New 400M parameter vision encoder trained from scratch
12B parameter multimodal decoder based on Mistral Nemo
Supports variable image sizes and aspect ratios
Supports multiple images in the long context window of 128k tokens

NaVid

Paper: NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Audio LLM

Qwen-Audio

Paper: Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Code: https://github.com/QwenLM/Qwen-Audio

RELF

Octopus

Paper: Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Code: https://github.com/dongyh20/Octopus

VLM-R1

VLM-R1: A stable and generalizable R1-style Large Vision-Language Model

This site was last updated June 01, 2025.