Lecture

LLM

07 Jun 2026

History of LLMs

A Survey of Large Language Models

LLM Timeline

Growth of compute memory relative to Transformer size

Paper: AI and Memory Wall

Scaling Law

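We can predict a model's final capability from its model size, dataset size, and total compute, usually through a relatively simple functional form (e.g., a linear relationship, typically on log-log axes).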
GPT-4 Technical Report. OpenAI. 2023
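
To make the idea of predicting final capability from a simple function concrete, here is a minimal sketch that fits a power law to a few small runs and extrapolates it to a larger budget. The data is synthetic (generated from a hypothetical power law, not taken from any report), and the names `fit_a`, `fit_b`, and `predict_loss` are illustrative only.

```python
import math

# Synthetic measurements from small runs (assumption: generated from a
# hypothetical power law loss = a * compute**(-b), not real results).
TRUE_A, TRUE_B = 10.0, 0.05
compute = [10.0 ** e for e in range(18, 23)]       # FLOPs per run
loss = [TRUE_A * c ** (-TRUE_B) for c in compute]

# A power law is a straight line in log-log space:
#   log(loss) = log(a) - b * log(compute)
xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)

# Ordinary least-squares slope and intercept of the log-log line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
fit_b = -slope
fit_a = math.exp(mean_y - slope * mean_x)

def predict_loss(flops):
    """Extrapolate the fitted power law to a larger compute budget."""
    return fit_a * flops ** (-fit_b)

print(f"fitted exponent b = {fit_b:.4f}")
print(f"predicted loss at 1e24 FLOPs = {predict_loss(1e24):.4f}")
```

With real measurements the points would not lie exactly on a line, so the quality of the extrapolation depends on how well the power-law assumption holds over the range being predicted.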

Blog: 【LLM 10大觀念-1】Scaling Law (10 Key LLM Concepts, Part 1: Scaling Law)

Papers:

Hestness et al. found in 2017 that scaling laws appear in machine translation, language modeling, speech recognition, and image classification.
Kaplan et al. (OpenAI, 2020) examined scaling laws separately in terms of compute, dataset size, and parameter count.
Rosenfeld et al. published a survey paper on scaling laws in 2021, further verifying the generality of scaling laws across a range of architectures.

Chinchilla Scaling Law

Paper: Training Compute-Optimal Large Language Models

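If we accept the original definition of the scaling law (model performance can be predicted from parameter count, dataset size, and compute), two important questions follow immediately:

Return: Under a fixed training compute budget, what is the best performance we can achieve?
Allocation: How should we divide that budget between model parameter count and dataset size?
(Assuming compute = parameter count * dataset size, do we want a large model * little data, a medium model * medium data, or a small model * lots of data?)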

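In 2022 DeepMind proposed the Chinchilla Scaling Law, which addressed both questions and was used to improve how other large models of the time were trained. They used three approaches to find the scaling law for training LLMs:

Fix the model size and vary the amount of training data.
Fix the compute budget (floating-point operations) and vary the model size.
Fit a parametric loss function directly to the results of all experiments.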

Method 3 result from the Chinchilla Scaling Law: N is the number of model parameters, D is the amount of data, and the remaining symbols are fitted coefficients.

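An LLM's final loss (perplexity) decreases as the model gets larger and the amount of data grows, and the relationship is linear after an exponential mapping, that is, it follows a power law.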
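
As a rough illustration of this relationship, the sketch below evaluates a parametric loss of the form L(N, D) = E + A / N^alpha + B / D^beta, which is the functional form fitted in the third approach of the Chinchilla paper. The coefficient values are quoted from memory of that paper and should be checked against it before being relied on; the function name `chinchilla_loss` is hypothetical.

```python
# Coefficients as reported for the Chinchilla parametric fit (assumption:
# quoted from memory, verify against the paper before relying on them).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

small = chinchilla_loss(1e9, 20e9)     # 1B parameters, 20B tokens
large = chinchilla_loss(70e9, 1.4e12)  # 70B parameters, 1.4T tokens
print(f"1B params / 20B tokens:   {small:.3f}")
print(f"70B params / 1.4T tokens: {large:.3f}")
```

Each of the two reducible terms is a power law in N or D, so it plots as a straight line on log-log axes, while E acts as a floor that the loss approaches but does not cross.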

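Chinchilla's biggest contribution lies in solving the allocation problem. They found that:

The amount of data (number of tokens) should be roughly 20 times the number of model parameters.
Data and model parameters should be scaled up in equal proportion (e.g., if the model doubles in size, the data should double as well).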
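
The two rules above can be turned into a small calculator. This sketch assumes the common approximation that training compute is about 6 * N * D FLOPs, which is not stated in these notes, together with the 20-tokens-per-parameter rule; the function name `compute_optimal` is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb from the notes above
FLOPS_PER_PARAM_TOKEN = 6   # assumption: C ~= 6 * N * D

def compute_optimal(flops):
    """Return (parameters, tokens) for a FLOP budget with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

budget = 6 * 70e9 * 1.4e12             # budget of a 70B model on 1.4T tokens
n, d = compute_optimal(budget)
n4, d4 = compute_optimal(4 * budget)   # 4x the compute budget

print(f"N = {n:.3e} parameters, D = {d:.3e} tokens")
print(f"with 4x compute: N grows {n4 / n:.1f}x, D grows {d4 / d:.1f}x")
```

Under these assumptions, doubling the model and the data together requires roughly four times the compute, which is why both quantities grow with the square root of the budget.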

Large Language Models

Machine Learning in the Era of Generative AI (生成式AI時代下的機器學習) (2025) by Hung-Yi Lee

Transformer

Paper: Attention Is All You Need

ChatGPT

ChatGPT: Optimizing Language Models for Dialogue
ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.

GPT-4

Paper: From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Blog: GPT-4 Code Interpreter: The Next Big Thing in AI

Falcon-40B

HuggingFace: tiiuae/falcon-40b

Paper: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Vicuna

HuggingFace: lmsys/vicuna-7b-v1.5

Paper: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Code: https://github.com/lm-sys/FastChat

Mistral

HuggingFace: mistralai/Mistral-7B-Instruct-v0.2

Code: https://github.com/mistralai/mistral-src

Kaggle: https://www.kaggle.com/code/rkuo2000/llm-mistral-7b-instruct

Mixtral 8x7B

HuggingFace: mistralai/Mixtral-8x7B-v0.1

Orca 2

HuggingFace: microsoft/Orca-2-7b

Paper: https://arxiv.org/abs/2311.11045

Blog: Microsoft’s Orca 2 LLM Outperforms Models That Are 10x Larger

Taiwan-LLM (Ubitus + National Taiwan University)

HuggingFace: yentinglin/Taiwan-LLM-7B-v2.1-chat

Paper: TAIWAN-LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model

Blog: 專屬台灣！優必達攜手台大打造「Taiwan LLM」，為何我們需要本土化的AI？ (Made for Taiwan: Ubitus teams up with NTU to build "Taiwan LLM"; why do we need localized AI?)

Code: https://github.com/MiuLab/Taiwan-LLM

Phi-2

Blog: Phi-2: The surprising power of small language models

Kaggle: https://www.kaggle.com/code/rkuo2000/llm-phi-2

Qwen (通义千问)

HuggingFace: Qwen/Qwen1.5-7B-Chat

Code: https://github.com/QwenLM/Qwen1.5

Yi (零一万物)

Paper: CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Paper: Yi: Open Foundation Models by 01.AI

Orca-Math

Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

HuggingFace: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k

BitNet

Paper: BitNet: Scaling 1-bit Transformers for Large Language Models

Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Gemma

Blog: Gemma: Introducing new state-of-the-art open models

Kaggle: https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora

Breeze (達哥)

HuggingFace: MediaTek-Research/Breeze-7B-Instruct-v0_1

Paper: Breeze-7B Technical Report

Blog: Breeze-7B: 透過 Mistral-7B Fine-Tune 出來的繁中開源模型 (Breeze-7B: an open-source Traditional Chinese model fine-tuned from Mistral-7B)

Bailong (白龍)

HuggingFace: INX-TEXT/Bailong-instruct-7B

Paper: Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding

TAIDE

HuggingFace: taide/TAIDE-LX-7B-Chat

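TAIDE-LX-7B: Based on LLaMA2-7b and continuously pretrained using only Traditional Chinese data. It suits scenarios where users will fine-tune the model further. Because the pretrained model has not gone through fine-tuning or preference alignment, it may produce malicious or unsafe output, so use it with care.
TAIDE-LX-7B-Chat: Based on TAIDE-LX-7B, with instruction tuning that strengthens common office tasks and multi-turn question answering. It suits chat or task-assistance scenarios. A 4-bit quantized version of TAIDE-LX-7B-Chat is also provided, mainly for user convenience; quantization may affect performance and introduce other unexpected problems, which users should keep in mind.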

Phi-3

HuggingFace: microsoft/Phi-3-mini-4k-instruct

Blog: Introducing Phi-3: Redefining what’s possible with SLMs

Octopus v4

HuggingFace: NexaAIDev/Octopus-v4

Paper: Octopus v4: Graph of language models

Code: https://github.com/NexaAI/octopus-v4

design demo

ChatGLM

Paper: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Grok-2

Grok-2 and Grok-2 mini achieve performance levels competitive with other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and in document-based question answering (DocVQA).

Phi-3.5

News: Microsoft Unveils Phi-3.5: Powerful AI Models Punch Above Their Weight

OpenAI o1

Blog: Introducing OpenAI o1-preview

Qwen2.5

NVLM 1.0

Paper: NVLM: Open Frontier-Class Multimodal LLMs

Llama 3.2

Blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

LFM Liquid-3B

Try Liquid

Llama 3.3

Blog: Meta公布輕巧版多語言模型Llama 3.3 (Meta announces Llama 3.3, a lightweight multilingual model)

OpenAI o3-mini

DeepSeek-R1

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Code: https://github.com/deepseek-ai/DeepSeek-R1

Llama-Breeze2

Paper: The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities

Grok-3 The Age of Reasoning Agents

Phi-4-multimodal

Gemini-2.5

Llama-4

Blog: Implementing LLaMA 4 from Scratch

Kaggle: https://www.kaggle.com/code/rkuo2000/llama4-from-scratch

Grok-4

GPT-5

Gemini-2.5 Family

Qwen3-Next

Qwen3-Omni

Paper: Qwen3-Omni Technical Report

Olmo3

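Blog: Ai2釋出真開源思考模型Olmo 3，支援可回溯推理與長上下文 (Ai2 releases Olmo 3, a truly open-source thinking model with traceable reasoning and long context)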

GLM 4.5

Paper: GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Code: https://github.com/zai-org/GLM-4.5

Gemini 3

Claude Opus 4.5

DeepSeek v3.2

Paper: DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

GPT-5.2

GLM-4.7

Kimi K2.5

Paper: Kimi K2.5: Visual Agentic Intelligence

Nemotron-3

Paper: NVIDIA Nemotron 3: Efficient and Open Intelligence

GPT-5.3 Codex

Claude Opus 4.6

MiniMax M2.5: Built for Real-World Productivity

GLM-5

Paper: GLM-5: from Vibe Coding to Agentic Engineering

Qwen3.5: Towards Native Multimodal Agents

JoyAI-LLM-Flash

Paper: JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

GPT-5.4

Nemotron-Cascade 2

Paper: Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Gemma 4

Blog: Accelerating Gemma 4: faster inference with multi-token prediction drafters

Blog: Google公布可在筆電執行的AI模型Gemma 4 12B (Google announces Gemma 4 12B, an AI model that can run on a laptop)

Qwen 3.6

DeepSeek-V4

DeepSeek-V4 Preview: 1M context length.
DeepSeek-V4-Pro: 1.6T total / 49B active params.
DeepSeek-V4-Flash: 284B total / 13B active params.

GPT-5.5

Claude Opus 4.8

DiffusionGemma

GLM 5.2

LLM Frameworks

Comprehensive Feature Comparison

Code: https://github.com/ollama/ollama

curl -fsSL https://ollama.com/install.sh | sh
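
Once Ollama is installed and running, it can be called from code through its local REST API. The sketch below builds a request for the `/api/generate` endpoint on the default port 11434. The model name `llama3.2` is an assumption (it must have been pulled beforehand, e.g. with `ollama pull llama3.2`), and the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3.2"):
    """Build a non-streaming generate request (model name is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="llama3.2"):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model pulled):
# print(generate("Explain the Chinchilla scaling law in one sentence."))
```

Setting `stream` to `False` asks the server for a single JSON object instead of a stream of partial responses, which keeps the client code short.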

LM Studio

AI Engineering

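Dimension	Prompt Engineering	Context Engineering
Scope	A single instruction	The whole information ecosystem
Goal	Optimize the quality of a single output	Ensure consistency and reliability across multiple tasks and sessions
Nature	Static, hand-written instructions	Dynamic, system-assembled information payload
Analogy	Asking a question	Preparing a complete briefing file
Core challenge	Wording and clarity	Retrieval, relevance, and state management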
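
To illustrate the contrast in the table, here is a minimal sketch of context assembly: instead of a single hand-written prompt, the input is built at request time from retrieved documents and session state. The word-overlap retriever is a toy stand-in for a real retrieval system, and every name here (`retrieve`, `assemble_context`, the `[doc]` and `[user]` tags) is hypothetical.

```python
def retrieve(query, documents, k=2):
    """Rank documents by naive word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def assemble_context(system, query, documents, history, max_turns=4):
    """Combine instructions, retrieved documents, and recent turns."""
    parts = [system]
    parts += [f"[doc] {doc}" for doc in retrieve(query, documents)]
    # State management: keep only the most recent turns of the session.
    parts += [f"[{role}] {text}" for role, text in history[-max_turns:]]
    parts.append(f"[user] {query}")
    return "\n".join(parts)

docs = [
    "Ollama runs models locally",
    "Chinchilla suggests 20 tokens per parameter",
    "LM Studio is a desktop app",
]
history = [("user", "hello"), ("assistant", "hi, how can I help?")]
context = assemble_context(
    "You are a helpful tutor.", "how many tokens per parameter", docs, history
)
print(context)
```

The three concerns in the last table row map onto the code: `retrieve` handles retrieval and relevance, and the `history` slice is a very simple form of state management.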

Harness Engineering

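Blog: Harness Engineering 完全解析：當 AI Agent 的護城河不再是模型，而是環境 (Harness Engineering fully explained: when an AI agent's moat is no longer the model but the environment)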

First generation: Prompt Engineering (2022-2024)
Second generation: Context Engineering (2025)
Third generation: Harness Engineering (2026)

Anthropic Claude Code: three-agent harness architecture

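Agent	Role	Responsibility
Planner	Planning	Breaks the product spec down into an executable task list
Generator	Generation	Implements one feature at a time, keeping development incremental
Evaluator	Evaluation	Verifies the generated result and feeds back correction instructions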
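
The division of labour in the table can be sketched as a simple control loop. This is only an illustration of the planner / generator / evaluator pattern, not Anthropic's actual implementation: all three functions are hypothetical stubs standing in for model calls, and the "add tests" check is an invented acceptance rule.

```python
def planner(spec):
    """Split a product spec into one task per non-empty line (stub)."""
    return [line.strip() for line in spec.splitlines() if line.strip()]

def generator(task, feedback=None):
    """Produce an 'implementation' for one feature (stub for a model call)."""
    suffix = " with tests" if feedback else ""
    return f"implemented: {task}{suffix}"

def evaluator(result):
    """Return None if the result passes, otherwise a correction instruction."""
    return None if result.endswith("with tests") else "add tests"

def run_harness(spec, max_attempts=3):
    """Work through the plan one feature at a time, retrying on feedback."""
    done = []
    for task in planner(spec):
        feedback = None
        for _ in range(max_attempts):
            result = generator(task, feedback)
            feedback = evaluator(result)
            if feedback is None:       # evaluator accepted the result
                done.append(result)
                break
    return done

results = run_harness("login page\n\nsearch box")
print(results)
```

The cap on attempts keeps a task that never passes evaluation from blocking the rest of the plan; a real harness would also need to record and report such failures.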

This site was last updated July 09, 2026.
