LLM
History of LLMs
A Survey of Large Language Models
LLM Timeline

Growth of compute memory versus Transformer model size
Paper: AI and Memory Wall

Scaling Law
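A model's final capability can be predicted from its size, the dataset size, and the total training compute, usually through a relatively simple functional form (e.g., a linear relationship).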
GPT-4 Technical Report. OpenAI. 2023
Blog: 10 Key LLM Concepts, Part 1: Scaling Law (【LLM 10大觀念-1】Scaling Law)
Papers:
- Hestness et al. (2017) found that scaling laws appear in machine translation, language modeling, speech recognition, and image classification.
- Kaplan et al. (OpenAI, 2020) analyzed scaling laws separately in terms of compute, dataset size, and parameter count.
- Rosenfeld et al. (2021) published a survey paper on scaling laws, further verifying their generality across a variety of architectures.
Chinchilla Scaling Law
Paper: Training Compute-Optimal Large Language Models
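If we accept the original definition of a scaling law (model performance can be predicted from parameter count, dataset size, and compute), two important questions follow immediately:
Return: under a fixed training compute budget, what is the best performance we can achieve?
Allocation: how should we split that budget between model parameters and dataset size?
(Assuming compute = parameter count * dataset size, should we train a large model on little data, a medium model on a medium amount of data, or a small model on a lot of data?)
In 2022, DeepMind proposed the Chinchilla scaling law, which addresses both questions and was used to improve on how other large models were being trained at the time.
They used three approaches to find a scaling law for training LLMs:
- Fix the model size and vary the number of training tokens.
- Fix the compute budget (FLOPs) and vary the model size.
- Fit a parametric loss function directly to all experimental results.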
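Method 3 result from the Chinchilla scaling law: N is the number of model parameters, D is the amount of data, and the remaining symbols are fitted coefficients.
The final loss of an LLM (perplexity) falls as the model and the dataset grow, following a power law in each: the relationship is roughly linear once both axes are put on a logarithmic scale.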
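The parametric loss fitted in the third approach can be written as follows. The form and the fitted constants are quoted from the Chinchilla paper, not from this page, so treat the numbers as approximate.

```latex
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\;
\alpha \approx 0.34,\; \beta \approx 0.28
```

Here E is the irreducible loss. The other two terms shrink as a power of N and of D, which is why each one shows up as a straight line on log-log axes.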
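Chinchilla's biggest contribution is its answer to the allocation question. They found that:
- The amount of data (number of tokens) should be roughly 20 times the number of model parameters.
- Data and model parameters should be scaled up in equal proportion (e.g., if the model doubles in size, the data should double too).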
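The allocation rule above can be sketched numerically. This is a minimal sketch, assuming the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 * N * D), which is not stated on this page; the 20-tokens-per-parameter ratio is the rule of thumb from the list above. The function name `chinchilla_allocation` is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20      # rule of thumb from the list above
FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ~= 6 * N * D (not from this page)


def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget into (parameters N, tokens D) with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 4e21):
        n, d = chinchilla_allocation(budget)
        print(f"C={budget:.0e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both N and D, which matches the "scale in equal proportion" rule.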
Large Language Models
Machine Learning in the Era of Generative AI (生成式AI時代下的機器學習, 2025) by Hung-Yi Lee
Open LLM Leaderboard
Transformer
Paper: Attention Is All You Need

ChatGPT
ChatGPT: Optimizing Language Models for Dialogue
ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.
LLaMA
Paper: LLaMA: Open and Efficient Foundation Language Models

Blog: Building a Million-Parameter LLM from Scratch Using Python
Code: LLaMA from scratch
GPT-4
Paper: GPT-4 Technical Report

Paper: From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Blog: GPT-4 Code Interpreter: The Next Big Thing in AI
Falcon-40B
HuggingFace: tiiuae/falcon-40b
Paper: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Vicuna
HuggingFace: lmsys/vicuna-7b-v1.5
Paper: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Code: https://github.com/lm-sys/FastChat
LLaMA-2
HuggingFace: meta-llama/Llama-2-7b-chat-hf
Paper: Llama 2: Open Foundation and Fine-Tuned Chat Models
Code: https://github.com/facebookresearch/llama
Mistral
HuggingFace: mistralai/Mistral-7B-Instruct-v0.2
Paper: Mistral 7B
Code: https://github.com/mistralai/mistral-src
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-mistral-7b-instruct

Mixtral 8x7B
HuggingFace: mistralai/Mixtral-8x7B-v0.1
Paper: Mixtral of Experts

Orca 2
HuggingFace: microsoft/Orca-2-7b
Paper: https://arxiv.org/abs/2311.11045
Blog: Microsoft’s Orca 2 LLM Outperforms Models That Are 10x Larger

Taiwan-LLM (Ubitus + National Taiwan University)
HuggingFace: yentinglin/Taiwan-LLM-7B-v2.1-chat
Paper: TAIWAN-LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model
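Blog: Made for Taiwan: Ubitus and National Taiwan University build "Taiwan LLM"; why do we need localized AI? (專屬台灣!優必達攜手台大打造「Taiwan LLM」,為何我們需要本土化的AI?)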
Code: https://github.com/MiuLab/Taiwan-LLM
Phi-2
HuggingFace: microsoft/phi-2
Blog: Phi-2: The surprising power of small language models
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-phi-2
Mamba
HuggingFace: Q-bert/Mamba-130M
Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Qwen (通义千问)
HuggingFace: Qwen/Qwen1.5-7B-Chat
Blog: Introducing Qwen1.5
Code: https://github.com/QwenLM/Qwen1.5
Yi (零一万物)
HuggingFace: 01-ai/Yi-6B-Chat
Paper: CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Paper: Yi: Open Foundation Models by 01.AI
Orca-Math
Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math
HuggingFace: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k
BitNet
Paper: BitNet: Scaling 1-bit Transformers for Large Language Models
Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Code: bitnet.cpp

Gemma
Blog: Gemma: Introducing new state-of-the-art open models
Kaggle: https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora
Gemini-1.5
Claude 3

Breeze (達哥)
HuggingFace: MediaTek-Research/Breeze-7B-Instruct-v0_1
Paper: Breeze-7B Technical Report
Blog: Breeze-7B: an open-source Traditional Chinese model fine-tuned from Mistral-7B (Breeze-7B: 透過 Mistral-7B Fine-Tune 出來的繁中開源模型)
Bailong (白龍)
HuggingFace: INX-TEXT/Bailong-instruct-7B
Paper: Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding
TAIDE
HuggingFace: taide/TAIDE-LX-7B-Chat
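- TAIDE-LX-7B: based on LLaMA2-7b and continually pretrained on Traditional Chinese data only. It suits use cases where users will fine-tune the model further. Because the pretrained model has not been fine-tuned or preference-aligned, it may produce malicious or unsafe output, so use it with care.
- TAIDE-LX-7B-Chat: based on TAIDE-LX-7B, with instruction tuning to strengthen common office tasks and multi-turn question answering. It suits chat and task-assistance use cases. A 4-bit quantized version of TAIDE-LX-7B-Chat is also provided. The quantized model is offered mainly for convenience and may affect performance or cause other unexpected problems, so users should keep this in mind.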
Llama-3
HuggingFace: meta-llama/Meta-Llama-3-8B-Instruct
Code: https://github.com/meta-llama/llama3/

Phi-3
HuggingFace: microsoft/Phi-3-mini-4k-instruct
Blog: Introducing Phi-3: Redefining what’s possible with SLMs
Octopus v4
HuggingFace: NexaAIDev/Octopus-v4
Paper: Octopus v4: Graph of language models
Code: https://github.com/NexaAI/octopus-v4
ChatGLM
Paper: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Llama 3.1
HuggingFace: meta-llama/Meta-Llama-3.1-8B-Instruct

Grok-2
Grok-2 and Grok-2 mini achieve performance levels competitive with other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and in document-based question answering (DocVQA).
Phi-3.5
News: Microsoft Unveils Phi-3.5: Powerful AI Models Punch Above Their Weight

OpenAI o1
Blog: Introducing OpenAI o1-preview

Qwen2.5

NVLM 1.0
Paper: NVLM: Open Frontier-Class Multimodal LLMs

Llama 3.2
Blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

LFM Liquid-3B
Llama 3.3
Blog: Meta releases Llama 3.3, a lightweight multilingual model (Meta公布輕巧版多語言模型Llama 3.3)
OpenAI o3-mini

DeepSeek-R1
Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Code: https://github.com/deepseek-ai/DeepSeek-R1

Llama-Breeze2
Paper: The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities

Grok-3: The Age of Reasoning Agents

Phi-4-multimodal

Gemini-2.5
Llama-4
Blog: Implementing LLaMA 4 from Scratch

Kaggle: https://www.kaggle.com/code/rkuo2000/llama4-from-scratch
Grok-4
GPT-5
Gemini-2.5 Family
Qwen3-Next

Qwen3-Omni
Paper: Qwen3-Omni Technical Report

Olmo3
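Blog: Ai2 releases Olmo 3, a truly open-source thinking model with traceable reasoning and long context (Ai2釋出真開源思考模型Olmo 3,支援可回溯推理與長上下文)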
GLM 4.5
Paper: GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Code: https://github.com/zai-org/GLM-4.5
Gemini 3
Claude Opus 4.5
DeepSeek v3.2
Paper: DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
GPT-5.2
GLM-4.7

Kimi K2.5
Paper: Kimi K2.5: Visual Agentic Intelligence

Nemotron-3
Paper: NVIDIA Nemotron 3: Efficient and Open Intelligence

GPT-5.3 Codex
Claude Opus 4.6
MiniMax M2.5: Built for Real-World Productivity
GLM-5
Paper: GLM-5: from Vibe Coding to Agentic Engineering

Qwen3.5: Towards Native Multimodal Agents
JoyAI-LLM-Flash
Paper: JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

GPT-5.4
Nemotron-Cascade 2
Paper: Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Gemma 4

Blog: Accelerating Gemma 4: faster inference with multi-token prediction drafters
Blog: Google announces Gemma 4 12B, an AI model that can run on a laptop (Google公布可在筆電執行的AI模型Gemma 4 12B)
Qwen 3.6
DeepSeek-V4
- DeepSeek-V4 Preview: 1M context length.
- DeepSeek-V4-Pro: 1.6T total / 49B active params.
- DeepSeek-V4-Flash: 284B total / 13B active params.
GPT-5.5
Claude Opus 4.8

DiffusionGemma

GLM 5.2
LLM Frameworks
Comprehensive Feature Comparison
| Feature / Metric | llama.cpp | vLLM | Ollama | LM Studio |
|---|---|---|---|---|
| Primary Target | Power Users / Devs | Enterprise / Production | Developers / Prototypers | Casual Users / Testers |
| Interface Type | Command Line (CLI) | API-first / Python Server | CLI / Headless API | Graphical UI (GUI) |
| Supported Formats | GGUF | Safetensors, AWQ, GPTQ, FP8 | GGUF (via Modelfiles) | GGUF |
| Hardware Strengths | CPU, Apple Silicon, Mixed VRAM | High-end Data Center GPUs | Multi-platform consumer gear | Consumer GPUs and Apple Silicon |
| Multi-User Scaling | Poor (Queued/Sequential) | Exceptional (Continuous Batching) | Poor (High latency under load) | Poor (Single-user focus) |
| Setup Friction | Moderate to High | High (Complex dependencies) | Extremely Low (Single command) | Extremely Low (App installer) |
llama.cpp
vLLM
Code: https://github.com/vllm-project/vllm
pip install vllm
Ollama
Code: https://github.com/ollama/ollama
curl -fsSL https://ollama.com/install.sh | sh
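Once Ollama is installed and a model has been pulled, its local REST API can be called from Python. A minimal sketch using only the standard library, assuming the server is listening on its default address `http://localhost:11434` and that a model tagged `llama3` has already been pulled (the model name is a placeholder, not something this page specifies):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server with the model pulled):
#   print(generate("Explain the Chinchilla scaling law in one sentence."))
```

Setting `"stream": False` asks for a single JSON object instead of a stream of chunks, which keeps the parsing to one `json.loads` call.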
LM Studio


AI Engineering
Prompt Engineering

Thinking Claude
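How did a 17-year-old high school student write a "godlike prompt" that boosts Claude's reasoning to rival the o1 model? (17歲高中生寫出「神級Prompt」強化Claude推理能力媲美o1模型,如何實現?)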
Context Engineering

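Context Engineering explained: a key technique for building practical AI agents, and how it differs from Prompt Engineering (情境工程(Context Engineering)解析:打造實用 AI Agent 的關鍵技巧,與提示工程(Prompt Engineering)有什麼不同?)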
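| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Scope | A single instruction | The whole information ecosystem |
| Goal | Optimize the quality of a single output | Ensure consistency and reliability across tasks and sessions |
| Nature | Static, hand-written instructions | Dynamic, system-assembled information payload |
| Analogy | Asking a question | Preparing a complete briefing file |
| Core challenge | Wording and clarity | Retrieval, relevance, and state management |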
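To make the Context Engineering column concrete, here is a minimal sketch of a dynamic, system-assembled context: a function that ranks candidate documents by a toy relevance score, keeps the most recent conversation turns as state, and packs everything into a fixed budget. All names (`assemble_context`, `relevance`), the word-overlap scoring, and the word-count budget are illustrative assumptions; a real system would likely use embedding-based retrieval and a tokenizer.

```python
def relevance(query: str, doc: str) -> int:
    """Toy relevance score: number of distinct query words that appear in the doc."""
    doc_words = set(doc.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in doc_words)


def assemble_context(system: str, history: list[str], docs: list[str],
                     query: str, budget_words: int = 60, keep_turns: int = 2) -> str:
    """Build one prompt from instructions, retrieved docs, recent state, and the query."""
    parts = [system]
    recent = history[-keep_turns:] if keep_turns > 0 else []
    # Reserve room for the fixed parts, then fill what is left with ranked docs.
    used = sum(len(p.split()) for p in [system, query, *recent])
    ranked = sorted(docs, key=lambda d: relevance(query, d), reverse=True)
    for doc in ranked:
        cost = len(doc.split())
        if relevance(query, doc) == 0 or used + cost > budget_words:
            continue  # skip irrelevant docs and docs that would exceed the budget
        parts.append(f"[doc] {doc}")
        used += cost
    parts.extend(f"[history] {turn}" for turn in recent)
    parts.append(f"[user] {query}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(assemble_context(
        system="You are a helpful assistant.",
        history=["user asked about vLLM", "assistant explained batching"],
        docs=["Ollama serves GGUF models locally", "Bread recipes need yeast"],
        query="Which formats does Ollama support",
    ))
```

The same user question produces a different prompt whenever the document pool or the history changes, which is the contrast with a static hand-written prompt that the table draws.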
Harness Engineering
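Blog: Harness Engineering fully explained: when an AI agent's moat is no longer the model but the environment (Harness Engineering 完全解析:當 AI Agent 的護城河不再是模型,而是環境)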
- First generation: Prompt Engineering (2022-2024)
- Second generation: Context Engineering (2025)
- Third generation: Harness Engineering (2026)
Anthropic Claude Code: three-agent harness architecture
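| Agent | Responsibility |
|---|---|
| Planner | Break the product spec down into an executable task list |
| Generator | Implement one feature at a time, keeping development incremental |
| Evaluator | Verify the generated result and feed back correction instructions |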
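The division of labor in the table can be sketched as a simple control loop. This is an illustrative sketch only: the three stub functions are hypothetical stand-ins for model-backed agents, and it is not a description of how Claude Code is actually implemented.

```python
from typing import Callable


def run_harness(spec: str,
                planner: Callable[[str], list[str]],
                generator: Callable[[str, str], str],
                evaluator: Callable[[str, str], str],
                max_retries: int = 3) -> dict[str, str]:
    """Plan tasks, then implement them one at a time until each passes evaluation."""
    results: dict[str, str] = {}
    for task in planner(spec):                  # Planner: spec -> task list
        feedback = ""
        for _ in range(max_retries):
            output = generator(task, feedback)  # Generator: one feature at a time
            feedback = evaluator(task, output)  # Evaluator: "" means accepted
            if not feedback:
                results[task] = output
                break
        else:
            raise RuntimeError(f"task not accepted after {max_retries} tries: {task}")
    return results


# Hypothetical stub agents so the loop can run without any model.
def stub_planner(spec: str) -> list[str]:
    return [part.strip() for part in spec.split(";") if part.strip()]


def stub_generator(task: str, feedback: str) -> str:
    return f"{task} (revised)" if feedback else f"{task} draft"


def stub_evaluator(task: str, output: str) -> str:
    return "" if output.endswith("(revised)") else "please revise"


if __name__ == "__main__":
    print(run_harness("login page; signup page",
                      stub_planner, stub_generator, stub_evaluator))
```

Passing the agents in as plain callables keeps the loop itself independent of any particular model, which is the sense in which the harness, rather than the model, carries the structure.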
This site was last updated June 19, 2026.