LLM Fine-Tuning

Introduction to LLM Fine-Tuning


LLM Fine-Tuning

Low-Rank Adaptation (LoRA)
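
The idea: freeze the pretrained weight W and learn a low-rank update ΔW = BA with rank r ≪ min(d, k), so the effective weight is W + (α/r)·BA. A minimal PyTorch sketch of that idea (illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # r x k
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # d x r, zero init
        self.scaling = alpha / r                       # so delta-W is 0 at the start

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
y = layer(torch.randn(2, 10, 768))   # only A and B receive gradients
```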


FlashAttention

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Blog: ELI5: FlashAttention
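
PyTorch's built-in scaled-dot-product attention can dispatch to a FlashAttention kernel. A minimal sketch (assumes PyTorch 2.3+ and a CUDA GPU; fp16 and causal masking keep the shapes kernel-compatible):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) in half precision on GPU
q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict dispatch to the FlashAttention backend (raises if unsupported).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```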


Few-Shot PEFT

Paper: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
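
The paper's (IA)^3 method learns elementwise rescaling vectors for keys, values, and FFN activations instead of full weight updates. A minimal sketch using Hugging Face PEFT's IA3 implementation (GPT-2 is a small stand-in; the module names follow PEFT's defaults for that architecture):

```python
from transformers import AutoModelForCausalLM
from peft import IA3Config, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model
config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["c_attn", "mlp.c_proj"],   # attention + FFN projections
    feedforward_modules=["mlp.c_proj"],        # rescale FFN on the input side
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # a tiny fraction of the full model
```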


PEFT (Parameter-Efficient Fine-Tuning)

Paper: Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment
Code: https://github.com/huggingface/peft
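
A minimal LoRA fine-tuning setup with the peft library (gpt2 is an illustrative stand-in for any causal LM):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(task_type=TaskType.CAUSAL_LM,
                    r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, config)   # wraps target linears with LoRA adapters
model.print_trainable_parameters()      # typically well under 1% of all params
```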


QLoRA

Paper: QLoRA: Efficient Finetuning of Quantized LLMs
Code: https://github.com/artidoro/qlora
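
QLoRA backpropagates through a frozen 4-bit NF4-quantized base model into LoRA adapters. A sketch with transformers + bitsandbytes + peft (the model name is an example; any causal LM works):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # example base model
    quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM,
                                         r=16, lora_alpha=32))
```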


AWQ

Paper: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Code: https://github.com/mit-han-lab/llm-awq
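
A sketch using the community AutoAWQ package, which implements the paper's method (the linked research repo ships its own scripts; model name and quant config here are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"   # example model
quant_config = {"zero_point": True, "q_group_size": 128,
                "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Activation-aware calibration picks per-channel scales before 4-bit packing.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```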


Soft-prompt Tuning

Paper: Soft-prompt Tuning for Large Language Models to Evaluate Bias
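
Soft-prompt tuning freezes the whole model and learns only a short sequence of virtual token embeddings prepended to every input. A sketch with PEFT's prompt-tuning support (the model and init text are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                      # length of the learned soft prompt
    prompt_tuning_init=PromptTuningInit.TEXT,   # warm-start from real token embeddings
    prompt_tuning_init_text="Classify the sentiment of this sentence:",
    tokenizer_name_or_path="gpt2",
)
model = get_peft_model(model, config)   # only the 20 prompt embeddings are trainable
model.print_trainable_parameters()
```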


Platypus

Paper: Platypus: Quick, Cheap, and Powerful Refinement of LLMs
Code: https://github.com/arielnlee/Platypus/


LLM Lingua

Paper: LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Code: https://github.com/microsoft/LLMLingua
Kaggle: https://www.kaggle.com/code/rkuo2000/llm-lingua
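
A minimal usage sketch following the repo's README (context and token budget are illustrative; the default compressor model is downloaded on first use):

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()   # uses a small LLaMA-based compressor by default
result = compressor.compress_prompt(
    ["LoRA freezes the pretrained weights and trains low-rank update matrices."],
    instruction="Answer the question based on the context.",
    question="What does LoRA train?",
    target_token=50,              # rough token budget for the compressed prompt
)
print(result["compressed_prompt"])
```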


LongLoRA

Code: https://github.com/dvlab-research/LongLoRA
[2023.11.19] We release a new version of LongAlpaca models: LongAlpaca-7B-16k, LongAlpaca-13B-16k, and LongAlpaca-70B-16k.


GPTCache
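
Code: https://github.com/zilliztech/GPTCache

GPTCache caches LLM responses so repeated (or semantically similar) queries can skip the API call. A sketch following the project's quickstart (its OpenAI adapter mirrors the pre-1.0 openai module):

```python
from gptcache import cache
from gptcache.adapter import openai   # drop-in replacement for the openai module

cache.init()            # default: exact-match cache backed by a local store
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# The first call hits the API; an identical second call is served from the cache.
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is LoRA?"}],
)
```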


localLLM

Code: https://github.com/ykhli/local-ai-stack
🦙 Inference: Ollama
💻 VectorDB: Supabase pgvector
🧠 LLM Orchestration: Langchain.js
🖼️ App logic: Next.js
🧮 Embeddings generation: Transformers.js and all-MiniLM-L6-v2


AirLLM

Blog: Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique
Code: https://github.com/lyogavin/Anima/tree/main/air_llm
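
A usage sketch roughly following the repo's README (check it for the current API; the 70B model name is the README's example):

```python
from airllm import AutoModel

# AirLLM streams one transformer layer at a time from disk, so the full 70B
# weights never have to fit in VRAM (generation is correspondingly slow).
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_tokens = model.tokenizer(["What is the capital of the United States?"],
                               return_tensors="pt", truncation=True, max_length=128)
output = model.generate(input_tokens["input_ids"].cuda(),
                        max_new_tokens=20, return_dict_in_generate=True)
print(model.tokenizer.decode(output.sequences[0]))
```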


PowerInfer

Paper: PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Code: https://github.com/SJTU-IPADS/PowerInfer
Blog: A 2080 Ti can run a 70B LLM: SJTU's new framework speeds up LLM inference by 11x

Demo: https://github.com/SJTU-IPADS/PowerInfer/assets/34213478/fe441a42-5fce-448b-a3e5-ea4abb43ba23
PowerInfer vs. llama.cpp on a single RTX 4090 (24 GB) running Falcon(ReLU)-40B-FP16, with an 11x speedup.
Both PowerInfer and llama.cpp ran on the same hardware and fully utilized the VRAM of the RTX 4090.


Eagle

Paper: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Code: https://github.com/SafeAILab/EAGLE
Kaggle: https://www.kaggle.com/code/rkuo2000/eagle-llm
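
EAGLE drafts several tokens with a lightweight head over the target model's features, then verifies them with the target model. The sketch below shows only the generic speculative-sampling acceptance rule such methods build on (not EAGLE's feature-level drafting itself):

```python
import torch

def speculative_accept(p_target, p_draft, draft_token):
    """Accept a drafted token with prob min(1, p_target/p_draft); on rejection,
    resample from the residual distribution max(0, p_target - p_draft)."""
    t, d = p_target[draft_token], p_draft[draft_token]
    if torch.rand(()) < torch.clamp(t / d, max=1.0):
        return draft_token, True
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual /= residual.sum()
    return torch.multinomial(residual, 1).item(), False

vocab = 10
p_t = torch.softmax(torch.randn(vocab), dim=0)   # target model's next-token dist.
p_d = torch.softmax(torch.randn(vocab), dim=0)   # draft model's next-token dist.
drafted = torch.multinomial(p_d, 1).item()
token, accepted = speculative_accept(p_t, p_d, drafted)
```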


Era of 1-bit LLMs

Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Blog: No more Floating Points, The Era of 1.58-bit Large Language Models
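
The paper quantizes every weight to {-1, 0, +1} (about 1.58 bits of information per weight) with an absmean scaling function. A worked sketch of that quantizer:

```python
import torch

def absmean_ternary(W: torch.Tensor, eps: float = 1e-5):
    """BitNet b1.58-style quantization: scale by the mean |w|, then round
    every weight to the nearest value in {-1, 0, +1}."""
    gamma = W.abs().mean()
    W_q = torch.clamp(torch.round(W / (gamma + eps)), -1, 1)
    return W_q, gamma            # dequantize as W_q * gamma

W = torch.randn(4, 4)
W_q, gamma = absmean_ternary(W)
print(W_q)                       # entries are only -1, 0, or 1
```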


GaLore

Paper: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Code: https://github.com/jiaweizzhao/galore
To train a 7B model on a single GPU such as an NVIDIA RTX 4090, all you need to do is specify --optimizer=galore_adamw8bit_per_layer, which enables GaLoreAdamW8bit with per-layer weight updates. With activation checkpointing, you can maintain a batch size of 16 (tested on an NVIDIA RTX 4090).
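
A minimal sketch using the galore-torch package's optimizer directly (the toy model and hyperparameters are illustrative; see the repo's scripts for tested settings):

```python
import torch
from galore_torch import GaLoreAdamW8bit

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 512))

# Project gradients of the 2-D weight matrices into a low-rank subspace;
# keep the remaining params (biases, norms) in a regular group.
galore_params = [p for p in model.parameters() if p.ndim == 2]
other_params  = [p for p in model.parameters() if p.ndim != 2]
optimizer = GaLoreAdamW8bit(
    [{"params": other_params},
     {"params": galore_params, "rank": 128, "update_proj_gap": 200,
      "scale": 0.25, "proj_type": "std"}],
    lr=1e-4,
)
```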


LlamaFactory

Paper: LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Code: https://github.com/hiyouga/LLaMA-Factory


PEFT for Vision Model

Paper: Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey


ReFT

pyreft is a representation fine-tuning (ReFT) library: a powerful, parameter-efficient, and interpretable method of fine-tuning.
Paper: ReFT: Representation Finetuning for Language Models
Code: https://github.com/stanfordnlp/pyreft
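
A sketch roughly following the pyreft README: attach a low-rank intervention to one layer's hidden representations and train only that (the layer choice and base model are illustrative):

```python
import torch, transformers, pyreft

model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",   # illustrative base model
    torch_dtype=torch.bfloat16, device_map="cuda")

# Intervene on the residual-stream output of one layer with a rank-4 LoReFT
# intervention; all base-model weights stay frozen.
reft_config = pyreft.ReftConfig(representations={
    "layer": 15, "component": "block_output", "low_rank_dimension": 4,
    "intervention": pyreft.LoreftIntervention(
        embed_dim=model.config.hidden_size, low_rank_dimension=4)})
reft_model = pyreft.get_reft_model(model, reft_config)
reft_model.print_trainable_parameters()   # a tiny number of parameters
```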


ORPO

model: kaist-ai/mistral-orpo-beta
Paper: ORPO: Monolithic Preference Optimization without Reference Model
Code: https://github.com/xfactlab/orpo
Blog: Fine-tune Llama 3 with ORPO
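
A training sketch in the spirit of the TRL docs example (the small model and dataset are stand-ins, and TRL's argument names have shifted across versions, so check the current docs):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"          # small stand-in model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ORPO trains on (prompt, chosen, rejected) preference pairs directly:
# no separate reference model and no prior SFT stage.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = ORPOConfig(output_dir="orpo-demo", beta=0.1)  # beta weights the odds-ratio term
trainer = ORPOTrainer(model=model, args=args,
                      train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```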


FlagEmbedding

model: namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA
Paper: Extending Llama-3’s Context Ten-Fold Overnight
Code: https://github.com/FlagOpen/FlagEmbedding


Memory-Efficient Inference: Smaller KV Cache with Cross-Layer Attention

Paper: Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
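
Cross-Layer Attention (CLA) lets groups of adjacent layers share one set of keys and values, so the KV cache shrinks by the group size. A toy PyTorch sketch of CLA2 (sharing K/V across pairs of layers); all dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLABlock(nn.Module):
    """Toy decoder block: every layer has its own queries, but layers in the
    same group reuse one shared K/V projection, shrinking the KV cache."""
    def __init__(self, dim: int, owns_kv: bool):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim) if owns_kv else None  # group leader only
        self.o = nn.Linear(dim, dim)

    def forward(self, x, shared_kv=None):
        if self.kv is not None:
            shared_kv = self.kv(x).chunk(2, dim=-1)   # this layer's K, V get cached
        k, v = shared_kv                              # follower layers reuse them
        attn = F.scaled_dot_product_attention(self.q(x), k, v, is_causal=True)
        return x + self.o(attn), shared_kv

dim = 64
layers = [CLABlock(dim, owns_kv=(i % 2 == 0)) for i in range(4)]  # pairs share K/V
x, kv = torch.randn(1, 16, dim), None
for layer in layers:
    x, kv = layer(x, kv)   # KV cache holds 2 layers' worth of K/V instead of 4
```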


