Large Language Models

Introduction to LLMs


History of LLMs

A Survey of Large Language Models

LLM Timeline


Growth of compute and memory versus Transformer model size

Paper: AI and Memory Wall


Scaling Law

A model's final capability can be predicted from its parameter count, dataset size, and total training compute (usually via a relatively simple functional form, e.g., a power law, which is linear on log-log axes).
GPT-4 Technical Report. OpenAI. 2023

Blog: [Top 10 LLM Concepts #1] Scaling Law
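
For reference, the single-variable power laws reported in Kaplan et al. (2020) take the form below, where N is the parameter count, D the dataset size (in tokens), C the training compute, and N_c, D_c, C_c, α_N, α_D, α_C are fitted constants (each law holds when the other two factors are not bottlenecks):

```math
L(N)=\left(\frac{N_c}{N}\right)^{\alpha_N},\qquad
L(D)=\left(\frac{D_c}{D}\right)^{\alpha_D},\qquad
L(C)=\left(\frac{C_c}{C}\right)^{\alpha_C}
```

Taking logarithms of both sides turns each relation into a straight line, which is why scaling-law plots are usually drawn on log-log axes.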

Papers:

  • Hestness et al. (2017) found that scaling laws appear in machine translation, language modeling, speech recognition, and image classification.
  • Kaplan et al. (OpenAI, 2020) analyzed scaling laws separately with respect to compute, dataset size, and parameter count.
  • Rosenfeld et al. (2021) published a survey of scaling laws, further validating their universality across a variety of architectures.

Chinchilla Scaling Law

Paper: Training Compute-Optimal Large Language Models

If we accept the original scaling-law premise (that model performance can be predicted from parameter count, dataset size, and compute), two important questions immediately follow:

Return: for a fixed training compute budget, what is the best performance we can achieve?
Allocation: how should we split the budget between model parameters and dataset size?
(Assuming compute ≈ parameter count × dataset size, should we train a large model on a small amount of data, a medium model on a medium amount, or a small model on a large amount?)

In 2022, DeepMind proposed the Chinchilla scaling law, which answers both questions and was used to improve how other large models at the time were trained. They derived the scaling law for LLM training in three ways:

  1. Fix the model size and vary the amount of training data.
  2. Fix the compute budget (in FLOPs) and vary the model size.
  3. Fit a parametric loss function directly to all of the experimental results.

Result of Method 3 from the Chinchilla scaling law: N is the number of model parameters, D is the number of training tokens, and the remaining terms are fitted coefficients.
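
The parametric loss fitted in Method 3 has the following form; the coefficients reported in the Chinchilla paper are approximately E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28:

```math
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```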

An LLM's final loss (perplexity) decreases as the model is scaled up and as the amount of data grows, following a power-law relationship that becomes linear after a logarithmic transform.

Chinchilla's biggest contribution is its answer to the allocation question. They found that:

  • The amount of data (in tokens) should be roughly 20 times the number of model parameters.
  • Data and parameter count should be scaled up in the same proportion (e.g., double the model size, double the data), as in the sketch below.
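
A minimal sketch of this rule of thumb in Python, using the common C ≈ 6·N·D approximation for training FLOPs (both the 20-tokens-per-parameter ratio and the factor of 6 are rough approximations, not exact values from the paper's fit):

```python
def chinchilla_budget(n_params: float, tokens_per_param: float = 20.0):
    """Return (training tokens, approximate training FLOPs) for a
    compute-optimal run under the ~20 tokens-per-parameter rule of thumb."""
    d_tokens = tokens_per_param * n_params   # D ≈ 20 * N
    flops = 6.0 * n_params * d_tokens        # common approximation: C ≈ 6 * N * D
    return d_tokens, flops

# Example: a 70B-parameter model (Chinchilla's size) should see roughly 1.4T tokens.
tokens, flops = chinchilla_budget(70e9)
print(f"tokens ≈ {tokens:.2e}, compute ≈ {flops:.2e} FLOPs")
```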

Large Language Models

Open LLM Leaderboard

Transformer

Paper: Attention Is All You Need

ChatGPT

ChatGPT: Optimizing Language Models for Dialogue
ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.


LLaMA

Arxiv: LLaMA: Open and Efficient Foundation Language Models
Blog: Building a Million-Parameter LLM from Scratch Using Python
Kaggle: LLaMA from scratch


GPT4

Arxiv: GPT-4 Technical Report
Arxiv: From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Blog: GPT-4 Code Interpreter: The Next Big Thing in AI


Falcon-40B

HuggingFace: tiiuae/falcon-40b
Arxiv: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only


Vicuna

HuggingFace: lmsys/vicuna-7b-v1.5
Arxiv: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Github: https://github.com/lm-sys/FastChat


LLaMA-2

HuggingFace: meta-llama/Llama-2-7b-chat-hf
Paper: Llama 2: Open Foundation and Fine-Tuned Chat Models
Github: https://github.com/facebookresearch/llama


Mistral

HuggingFace: mistralai/Mistral-7B-Instruct-v0.2
Arxiv: Mistral 7B
Github: https://github.com/mistralai/mistral-src
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-mistral-7b-instruct


Mistral 8X7B

HuggingFace: mistralai/Mixtral-8x7B-v0.1
Arxiv: Mixtral of Experts


Orca 2

HuggingFace: microsoft/Orca-2-7b
Arxiv: https://arxiv.org/abs/2311.11045
Blog: Microsoft’s Orca 2 LLM Outperforms Models That Are 10x Larger


Taiwan-LLM (Ubitus + NTU)

HuggingFace: yentinglin/Taiwan-LLM-7B-v2.1-chat
Arxiv: TAIWAN-LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model
Blog: Made for Taiwan! Ubitus and NTU team up to build "Taiwan LLM": why do we need localized AI?
Github: https://github.com/MiuLab/Taiwan-LLM


Phi-2 (Transformer with 2.7B parameters)

HuggingFace: microsoft/phi-2
Blog: Phi-2: The surprising power of small language models
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-phi-2


Mamba

HuggingFace: Q-bert/Mamba-130M
Arxiv: Mamba: Linear-Time Sequence Modeling with Selective State Spaces


Qwen (通义千问)

HuggingFace: Qwen/Qwen1.5-7B-Chat
Blog: Introducing Qwen1.5
Github: https://github.com/QwenLM/Qwen1.5


Yi (零一万物)

HuggingFace: 01-ai/Yi-6B-Chat
Arxiv: CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Arxiv: Yi: Open Foundation Models by 01.AI


Orca-Math

Arxiv: Orca-Math: Unlocking the potential of SLMs in Grade School Math
HuggingFace: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k


BitNet

Arxiv: BitNet: Scaling 1-bit Transformers for Large Language Models
Arxiv: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58).


Gemma

HuggingFace: google/gemma-1.1-7b-it
Blog: Gemma: Introducing new state-of-the-art open models
Kaggle: https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora


Gemini-1.5


Claude 3


Breeze (達哥)

HuggingFace: MediaTek-Research/Breeze-7B-Instruct-v0_1
Arxiv: Breeze-7B Technical Report
Blog: Breeze-7B: an open-source Traditional Chinese model fine-tuned from Mistral-7B


Bailong (白龍)

HuggingFace: INX-TEXT/Bailong-instruct-7B
Arxiv: Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding


TAIDE

HuggingFace: taide/TAIDE-LX-7B-Chat

  • TAIDE-LX-7B: Based on LLaMA2-7b and continually pretrained using only Traditional Chinese data; suited to users who intend to further fine-tune the model. Because this pretrained model has not been fine-tuned or preference-aligned, it may produce malicious or unsafe output, so use it with care.
  • TAIDE-LX-7B-Chat: Based on TAIDE-LX-7B and strengthened via instruction tuning for common office tasks and multi-turn question answering; suited to chat or task-assistance scenarios. A 4-bit quantized version of TAIDE-LX-7B-Chat is also provided; the quantization is offered mainly for convenience and may reduce performance or introduce other unexpected issues, so please keep this in mind. (A loading sketch follows this list.)
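
A minimal loading sketch with Hugging Face transformers, assuming the standard chat-template workflow; the exact prompt format, hardware requirements, and any license/access steps are defined on the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "taide/TAIDE-LX-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build the prompt with the tokenizer's chat template (if the model card defines one).
messages = [{"role": "user", "content": "請用繁體中文簡介台灣的夜市文化。"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```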

Llama-3

HuggingFace: meta-llama/Meta-Llama-3-8B-Instruct
Github: https://github.com/meta-llama/llama3/


Phi-3

HuggingFace: microsoft/Phi-3-mini-4k-instruct
Blog: Introducing Phi-3: Redefining what’s possible with SLMs


Octopus v4

HuggingFace: NexaAIDev/Octopus-v4
Arxiv: Octopus v4: Graph of language models
Github: https://github.com/NexaAI/octopus-v4
design demo


Llama 3.1

HuggingFace: meta-llama/Meta-Llama-3.1-8B-Instruct


Grok-2

Grok-2 and Grok-2 mini achieve performance competitive with other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and document-based question answering (DocVQA).


Phi-3.5

HuggingFace: microsoft/Phi-3.5-mini-instruct
HuggingFace: microsoft/Phi-3.5-vision-instruct
HuggingFace: microsoft/Phi-3.5-MoE-instruct
News: Microsoft Unveils Phi-3.5: Powerful AI Models Punch Above Their Weight


OpenAI o1

Blog: Introducing OpenAI o1-preview


Qwen2.5

HuggingFace: Qwen/Qwen2.5-7B-Instruct

  • Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
  • Qwen2.5-Coder: 1.5B, 7B, with 32B coming soon
  • Qwen2.5-Math: 1.5B, 7B, 72B

Blog: Alibaba Cloud's big AI compute upgrade: 100 open-source Qwen 2.5 models and a video AI model released


NVLM 1.0

Arxiv: NVLM: Open Frontier-Class Multimodal LLMs


Llama 3.2

Blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
HuggingFace: meta-llama/Llama-3.2-1B-Instruct
HuggingFace: meta-llama/Llama-3.2-3B-Instruct
HuggingFace: meta-llama/Llama-3.2-11B-Vision-Instruct


LFM Liquid-3B

Try Liquid


Llama 3.3

HuggingFace: meta-llama/Llama-3.3-70B-Instruct
Blog: Meta announces Llama 3.3, a lightweight multilingual model


OpenAI o3-mini


DeepSeek-R1

Arxiv: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Github: https://github.com/deepseek-ai/DeepSeek-R1


Llama-Breeze2

HuggingFace: MediaTek-Research/Llama-Breeze2-8B-Instruct
HuggingFace: MediaTek-Research/Llama-Breeze2-3B-Instruct
Arxiv: The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities
Blog: MediaTek open-sources two Traditional Chinese multimodal small models and a speech-synthesis model with a Taiwanese accent
Blog: How can a model better understand Traditional Chinese knowledge? MediaTek Research reveals the technical keys


Grok-3 The Age of Reasoning Agents


Phi-4-multimodal

Phi-4-multimodal has 5.6 billion parameters, supports a 128K-token context length, and uses supervised fine-tuning, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF) to improve instruction following and safety. For languages, text processing covers more than 20 languages, including Chinese, Japanese, Korean, German, and French; speech processing covers major languages such as English, Chinese, Spanish, and Japanese; image processing is currently English-centric.
HuggingFace: microsoft/Phi-4-multimodal-instruct


Gemini-2.5


Llama-4

Blog: Implementing LLaMA 4 from Scratch

Kaggle: https://www.kaggle.com/code/rkuo2000/llama4-from-scratch


Grok-4


GPT-5


Gemini-2.5 Family


Qwen3-Next

HuggingFace:


Qwen3-Omni

Paper: Qwen3-Omni Technical Report


safe AI

Constitutional AI

Arxiv: Constitutional AI: Harmlessness from AI Feedback

Two key phases (sketched in code after the list):

  1. Supervised Learning Phase (SL Phase)
    • Step 1: Start from samples generated by the initial model.
    • Step 2: From these samples, the model generates self-critiques and revisions.
    • Step 3: Fine-tune the original model on these revisions.
  2. Reinforcement Learning Phase (RL Phase)
    • Step 1: Sample outputs from the fine-tuned model.
    • Step 2: Use a model to compare outputs sampled from the initial model and the fine-tuned model.
    • Step 3: Decide which sample is better (this mirrors the human-preference step in RLHF).
    • Step 4: Train a new “preference model” on this dataset of AI preferences; it is then used as the reward signal for RL training. This is RLAIF (Reinforcement Learning from AI Feedback).
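
A minimal, illustrative sketch of the supervised (SL) phase of this loop. The helpers `generate`, `critique`, `revise`, and `fine_tune` are hypothetical placeholders standing in for real LLM calls, not any actual library API:

```python
from dataclasses import dataclass

# Two example constitutional principles (placeholders, not Anthropic's actual wording).
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]

@dataclass
class Example:
    prompt: str
    response: str

# --- hypothetical stand-ins; a real pipeline would call an actual LLM here ---
def generate(model: str, prompt: str) -> str:
    return f"[{model}] draft answer to: {prompt}"

def critique(model: str, response: str, principle: str) -> str:
    return f"Critique of the draft under the principle: {principle}"

def revise(model: str, response: str, critique_text: str) -> str:
    return response + " [revised per critique]"

def fine_tune(model: str, data: list) -> str:
    return model + "-sft"   # pretend this returns a fine-tuned model

def sl_phase(initial_model: str, prompts: list) -> str:
    """SL phase: sample -> self-critique -> revise -> fine-tune on revisions."""
    revised = []
    for p in prompts:
        r = generate(initial_model, p)
        for principle in CONSTITUTION:      # critique/revise against each principle
            c = critique(initial_model, r, principle)
            r = revise(initial_model, r, c)
        revised.append(Example(prompt=p, response=r))
    return fine_tune(initial_model, revised)

sft_model = sl_phase("base-llm", ["How do I pick a strong password?"])
print(sft_model)
```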

Attack LLM

Blog: How to attack an LLM (ChatGPT)?

  • JailBreak
  • Prompt Injection
  • Data poisoning

LLM running locally

LM Studio


Ollama

ollama -v
ollama
ollama pull deepseek-r1
ollama run llama3.2
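
After pulling a model, the local Ollama server (listening on its default port 11434) can also be called from Python; a minimal sketch using the requests library and the /api/generate endpoint, assuming llama3.2 has already been pulled as above:

```python
import requests

# Assumes `ollama pull llama3.2` has been run and the Ollama server is running locally.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain the Chinchilla scaling law in one sentence.",
        "stream": False,   # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```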

Github: Github

Kaggle: Langchain RAG


Jan - Local AI Assistant

Github: https://github.com/menloresearch/jan


llama.cpp

LLM inference in C/C++


PrivateGPT

Github: https://github.com/zylon-ai/private-gpt/tree/primordial


RLM

Arxiv: Reasoning Language Models: A Blueprint

LLM Reasoning


Chain-of-Thought Prompting

Arxiv: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models


ReAct Prompting

Arxiv: ReAct: Synergizing Reasoning and Acting in Language Models
Github: https://github.com/ysymyth/ReAct


Tree-of-Thoughts

Arxiv: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Github: https://github.com/princeton-nlp/tree-of-thought-llm
Github: https://github.com/kyegomez/tree-of-thoughts


Reinforcement Pre-Training

Arxiv: Reinforcement Pre-Training
Blog: Microsoft and China AI Research Possible Reinforcement Pre-Training Breakthrough


Teaching LLMs to Plan

Paper: Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning


Alpaca-CoT

Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface for Instruction Collection, Parameter-efficient Methods, and Large Language Models


Prompt Engineering

Perfect Prompt Structure


Can't train the AI? You can train yourself


Thinking Claude

How did a 17-year-old high-school student write a "god-tier prompt" that boosts Claude's reasoning to rival the o1 model?

Thinking Gemini

https://github.com/lanesky/thinking-gemini


Context Engineering

What is Context Engineering?


A Survey of Context Engineering for Large Language Models


Context Engineering explained: key techniques for building practical AI agents, and how it differs from Prompt Engineering


