Large Language Models

Introduction to LLMs


History of LLMs

A Survey of Large Language Models

LLM Timeline


Growth of compute and memory versus Transformer model size

Paper: AI and Memory Wall


Scaling Law

A model's final capability can be predicted from its parameter count, dataset size, and total training compute (usually via a relatively simple functional form, e.g., a power law, which is linear on log-log axes).
GPT-4 Technical Report. OpenAI. 2023

Blog: [Top 10 LLM Concepts #1] Scaling Law
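
For reference, the single-variable power laws reported in Kaplan et al. (2020) take the form below, where N is the parameter count, D the dataset size (in tokens), C the training compute, and N_c, D_c, C_c, α_N, α_D, α_C are fitted constants (each law holds when the other two factors are not bottlenecks):

```math
L(N)=\left(\frac{N_c}{N}\right)^{\alpha_N},\qquad
L(D)=\left(\frac{D_c}{D}\right)^{\alpha_D},\qquad
L(C)=\left(\frac{C_c}{C}\right)^{\alpha_C}
```

Taking logarithms of both sides turns each relation into a straight line, which is why scaling-law plots are usually drawn on log-log axes.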

Papers:

  • Hestness et al. (2017) found that scaling laws appear in machine translation, language modeling, speech recognition, and image classification.
  • Kaplan et al. (OpenAI, 2020) analyzed scaling laws separately with respect to compute, dataset size, and parameter count.
  • Rosenfeld et al. (2021) published a survey of scaling laws, further validating their universality across a variety of architectures.

Chinchilla Scaling Law

Paper: Training Compute-Optimal Large Language Models

If we accept the original scaling-law premise (that model performance can be predicted from parameter count, dataset size, and compute), two important questions immediately follow:

Return: for a fixed training compute budget, what is the best performance we can achieve?
Allocation: how should we split the budget between model parameters and dataset size?
(Assuming compute ≈ parameter count × dataset size, should we train a large model on a small amount of data, a medium model on a medium amount, or a small model on a large amount?)

In 2022, DeepMind proposed the Chinchilla scaling law, which answers both questions and was used to improve how other large models at the time were trained. They derived the scaling law for LLM training in three ways:

  1. Fix the model size and vary the amount of training data.
  2. Fix the compute budget (in FLOPs) and vary the model size.
  3. Fit a parametric loss function directly to all of the experimental results.

Result of Method 3 from the Chinchilla scaling law: N is the number of model parameters, D is the number of training tokens, and the remaining terms are fitted coefficients.
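
The parametric loss fitted in Method 3 has the following form; the coefficients reported in the Chinchilla paper are approximately E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28:

```math
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```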

An LLM's final loss (perplexity) decreases as the model is scaled up and as the amount of data grows, following a power-law relationship that becomes linear after a logarithmic transform.

Chinchilla's biggest contribution is its answer to the allocation question. They found that:

  • The amount of data (in tokens) should be roughly 20 times the number of model parameters.
  • Data and parameter count should be scaled up in the same proportion (e.g., double the model size, double the data), as in the sketch below.
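
A minimal sketch of this rule of thumb in Python, using the common C ≈ 6·N·D approximation for training FLOPs (both the 20-tokens-per-parameter ratio and the factor of 6 are rough approximations, not exact values from the paper's fit):

```python
def chinchilla_budget(n_params: float, tokens_per_param: float = 20.0):
    """Return (training tokens, approximate training FLOPs) for a
    compute-optimal run under the ~20 tokens-per-parameter rule of thumb."""
    d_tokens = tokens_per_param * n_params   # D ≈ 20 * N
    flops = 6.0 * n_params * d_tokens        # common approximation: C ≈ 6 * N * D
    return d_tokens, flops

# Example: a 70B-parameter model (Chinchilla's size) should see roughly 1.4T tokens.
tokens, flops = chinchilla_budget(70e9)
print(f"tokens ≈ {tokens:.2e}, compute ≈ {flops:.2e} FLOPs")
```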

Large Language Models

Open LLM Leaderboard

Transformer

Paper: Attention Is All You Need

ChatGPT

ChatGPT: Optimizing Language Models for Dialogue
ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.


LLaMA

Arxiv: LLaMA: Open and Efficient Foundation Language Models
Blog: Building a Million-Parameter LLM from Scratch Using Python
Kaggle: LLaMA from scratch


GPT4

Arxiv: GPT-4 Technical Report
Arxiv: From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Blog: GPT-4 Code Interpreter: The Next Big Thing in AI


Falcon-40B

HuggingFace: tiiuae/falcon-40b
Arxiv: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only


Vicuna

HuggingFace: lmsys/vicuna-7b-v1.5
Arxiv: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Github: https://github.com/lm-sys/FastChat


LLaMA-2

HuggingFace: meta-llama/Llama-2-7b-chat-hf
Paper: Llama 2: Open Foundation and Fine-Tuned Chat Models
Github: https://github.com/facebookresearch/llama


Mistral

HuggingFace: mistralai/Mistral-7B-Instruct-v0.2
Arxiv: Mistral 7B
Github: https://github.com/mistralai/mistral-src
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-mistral-7b-instruct


Mistral 8X7B

HuggingFace: mistralai/Mixtral-8x7B-v0.1
Arxiv: Mixtral of Experts


Orca 2

HuggingFace: microsoft/Orca-2-7b
Arxiv: https://arxiv.org/abs/2311.11045
Blog: Microsoft’s Orca 2 LLM Outperforms Models That Are 10x Larger


Taiwan-LLM (Ubitus + NTU)

HuggingFace: yentinglin/Taiwan-LLM-7B-v2.1-chat
Arxiv: TAIWAN-LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model
Blog: Made for Taiwan! Ubitus and NTU team up to build "Taiwan LLM": why do we need localized AI?
Github: https://github.com/MiuLab/Taiwan-LLM


Phi-2 (Transformer with 2.7B parameters)

HuggingFace: microsoft/phi-2
Blog: Phi-2: The surprising power of small language models
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-phi-2


Mamba

HuggingFace: Q-bert/Mamba-130M
Arxiv: Mamba: Linear-Time Sequence Modeling with Selective State Spaces


Qwen (通义千问)

HuggingFace: Qwen/Qwen1.5-7B-Chat
Blog: Introducing Qwen1.5
Github: https://github.com/QwenLM/Qwen1.5


Yi (零一万物)

HuggingFace: 01-ai/Yi-6B-Chat
Arxiv: CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Arxiv: Yi: Open Foundation Models by 01.AI


Orca-Math

Arxiv: Orca-Math: Unlocking the potential of SLMs in Grade School Math
HuggingFace: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k


BitNet

Arxiv: BitNet: Scaling 1-bit Transformers for Large Language Models
Arxiv: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58).


Gemma

HuggingFace: google/gemma-1.1-7b-it
Blog: Gemma: Introducing new state-of-the-art open models
Kaggle: https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora


Gemini-1.5


Claude 3


Breeze (達哥)

HuggingFace: MediaTek-Research/Breeze-7B-Instruct-v0_1
Arxiv: Breeze-7B Technical Report
Blog: Breeze-7B: an open-source Traditional Chinese model fine-tuned from Mistral-7B


Bailong (白龍)

HuggingFace: INX-TEXT/Bailong-instruct-7B
Arxiv: Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding


TAIDE

HuggingFace: taide/TAIDE-LX-7B-Chat

  • TAIDE-LX-7B: Based on LLaMA2-7b and continually pretrained using only Traditional Chinese data; suited to users who intend to further fine-tune the model. Because this pretrained model has not been fine-tuned or preference-aligned, it may produce malicious or unsafe output, so use it with care.
  • TAIDE-LX-7B-Chat: Based on TAIDE-LX-7B and strengthened via instruction tuning for common office tasks and multi-turn question answering; suited to chat or task-assistance scenarios. A 4-bit quantized version of TAIDE-LX-7B-Chat is also provided; the quantization is offered mainly for convenience and may reduce performance or introduce other unexpected issues, so please keep this in mind. (A loading sketch follows this list.)
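
A minimal loading sketch with Hugging Face transformers, assuming the standard chat-template workflow; the exact prompt format, hardware requirements, and any license/access steps are defined on the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "taide/TAIDE-LX-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build the prompt with the tokenizer's chat template (if the model card defines one).
messages = [{"role": "user", "content": "請用繁體中文簡介台灣的夜市文化。"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```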

Llama-3

HuggingFace: meta-llama/Meta-Llama-3-8B-Instruct
Github: https://github.com/meta-llama/llama3/


Phi-3

HuggingFace: microsoft/Phi-3-mini-4k-instruct
Blog: Introducing Phi-3: Redefining what’s possible with SLMs


Octopus v4

HuggingFace: NexaAIDev/Octopus-v4
Arxiv: Octopus v4: Graph of language models
Github: https://github.com/NexaAI/octopus-v4
design demo


Llama 3.1

HuggingFace: meta-llama/Meta-Llama-3.1-8B-Instruct


Grok-2

Grok-2 and Grok-2 mini achieve performance competitive with other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and document-based question answering (DocVQA).


Phi-3.5

HuggingFace: microsoft/Phi-3.5-mini-instruct
HuggingFace: microsoft/Phi-3.5-vision-instruct
HuggingFace: microsoft/Phi-3.5-MoE-instruct
News: Microsoft Unveils Phi-3.5: Powerful AI Models Punch Above Their Weight


OpenAI o1

Blog: Introducing OpenAI o1-preview


Qwen2.5

HuggingFace: Qwen/Qwen2.5-7B-Instruct

  • Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
  • Qwen2.5-Coder: 1.5B, 7B, with 32B coming soon
  • Qwen2.5-Math: 1.5B, 7B, 72B

Blog: Alibaba Cloud's big AI compute upgrade: 100 open-source Qwen 2.5 models and a video AI model released


NVLM 1.0

Arxiv: NVLM: Open Frontier-Class Multimodal LLMs


Llama 3.2

Blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
HuggingFace: meta-llama/Llama-3.2-1B-Instruct
HuggingFace: meta-llama/Llama-3.2-3B-Instruct
HuggingFace: meta-llama/Llama-3.2-11B-Vision-Instruct


LFM Liquid-3B

Try Liquid


Llama 3.3

HuggingFace: meta-llama/Llama-3.3-70B-Instruct
Blog: Meta announces Llama 3.3, a lightweight multilingual model


OpenAI o3-mini


DeepSeek-R1

Arxiv: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Github: https://github.com/deepseek-ai/DeepSeek-R1


Llama-Breeze2

HuggingFace: MediaTek-Research/Llama-Breeze2-8B-Instruct
HuggingFace: MediaTek-Research/Llama-Breeze2-3B-Instruct
Arxiv: The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities
Blog: MediaTek open-sources two Traditional Chinese multimodal small models and a speech-synthesis model with a Taiwanese accent
Blog: How can a model better understand Traditional Chinese knowledge? MediaTek Research reveals the technical keys


Grok-3 The Age of Reasoning Agents


Phi-4-multimodal

Phi-4-multimodal has 5.6 billion parameters, supports a 128K-token context length, and uses supervised fine-tuning, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF) to improve instruction following and safety. For languages, text processing covers more than 20 languages, including Chinese, Japanese, Korean, German, and French; speech processing covers major languages such as English, Chinese, Spanish, and Japanese; image processing is currently English-centric.
HuggingFace: microsoft/Phi-4-multimodal-instruct


Gemini-2.5


Llama-4

Blog: Implementing LLaMA 4 from Scratch

Kaggle: https://www.kaggle.com/code/rkuo2000/llama4-from-scratch


Grok-4


GPT-5


Gemini-2.5 Family


Qwen3-Next

HuggingFace:


Qwen3-Omni

Paper: Qwen3-Omni Technical Report


safe AI

Constitutional AI

Arxiv: Constitutional AI: Harmlessness from AI Feedback

Two key phases (sketched in code after the list):

  1. Supervised Learning Phase (SL Phase)
    • Step 1: Start from samples generated by the initial model.
    • Step 2: From these samples, the model generates self-critiques and revisions.
    • Step 3: Fine-tune the original model on these revisions.
  2. Reinforcement Learning Phase (RL Phase)
    • Step 1: Sample outputs from the fine-tuned model.
    • Step 2: Use a model to compare outputs sampled from the initial model and the fine-tuned model.
    • Step 3: Decide which sample is better (this mirrors the human-preference step in RLHF).
    • Step 4: Train a new “preference model” on this dataset of AI preferences; it is then used as the reward signal for RL training. This is RLAIF (Reinforcement Learning from AI Feedback).
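
A minimal, illustrative sketch of the supervised (SL) phase of this loop. The helpers `generate`, `critique`, `revise`, and `fine_tune` are hypothetical placeholders standing in for real LLM calls, not any actual library API:

```python
from dataclasses import dataclass

# Two example constitutional principles (placeholders, not Anthropic's actual wording).
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and helpful.",
]

@dataclass
class Example:
    prompt: str
    response: str

# --- hypothetical stand-ins; a real pipeline would call an actual LLM here ---
def generate(model: str, prompt: str) -> str:
    return f"[{model}] draft answer to: {prompt}"

def critique(model: str, response: str, principle: str) -> str:
    return f"Critique of the draft under the principle: {principle}"

def revise(model: str, response: str, critique_text: str) -> str:
    return response + " [revised per critique]"

def fine_tune(model: str, data: list) -> str:
    return model + "-sft"   # pretend this returns a fine-tuned model

def sl_phase(initial_model: str, prompts: list) -> str:
    """SL phase: sample -> self-critique -> revise -> fine-tune on revisions."""
    revised = []
    for p in prompts:
        r = generate(initial_model, p)
        for principle in CONSTITUTION:      # critique/revise against each principle
            c = critique(initial_model, r, principle)
            r = revise(initial_model, r, c)
        revised.append(Example(prompt=p, response=r))
    return fine_tune(initial_model, revised)

sft_model = sl_phase("base-llm", ["How do I pick a strong password?"])
print(sft_model)
```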

Attack LLM

Blog: How to attack an LLM (ChatGPT)?

  • JailBreak
  • Prompt Injection
  • Data poisoning

LLM running locally

LM Studio


Ollama

ollama -v
ollama
ollama pull deepseek-r1
ollama run llama3.2
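
After pulling a model, the local Ollama server (listening on its default port 11434) can also be called from Python; a minimal sketch using the requests library and the /api/generate endpoint, assuming llama3.2 has already been pulled as above:

```python
import requests

# Assumes `ollama pull llama3.2` has been run and the Ollama server is running locally.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain the Chinchilla scaling law in one sentence.",
        "stream": False,   # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```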

Github: Github

Kaggle: Langchain RAG


Jan - Local AI Assistant

Github: https://github.com/menloresearch/jan


llama.cpp

LLM inference in C/C++


PrivateGPT

Github: https://github.com/zylon-ai/private-gpt/tree/primordial


RLM

Arxiv: Reasoning Language Models: A Blueprint

LLM Reasoning


Chain-of-Thought Prompting

Arxiv: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models


ReAct Prompting

Arxiv: ReAct: Synergizing Reasoning and Acting in Language Models
Github: https://github.com/ysymyth/ReAct


Tree-of-Thoughts

Arxiv: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Github: https://github.com/princeton-nlp/tree-of-thought-llm
Github: https://github.com/kyegomez/tree-of-thoughts


Reinforcement Pre-Training

Arxiv: Reinforcement Pre-Training
Blog: Microsoft and China AI Research Possible Reinforcement Pre-Training Breakthrough


Teaching LLMs to Plan

Paper: Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning


Alpaca-CoT

Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface for Instruction Collection, Parameter-efficient Methods, and Large Language Models


Prompt Engineering

Perfect Prompt Structure


Can't train the AI? You can train yourself


Thinking Claude

How did a 17-year-old high-school student write a "god-tier prompt" that boosts Claude's reasoning to rival the o1 model?

Thinking Gemini

https://github.com/lanesky/thinking-gemini


Context Engineering

What is Context Engineering?


A Survey of Context Engineering for Large Language Models


Context Engineering explained: key techniques for building practical AI agents, and how it differs from Prompt Engineering


