AI Hardware

AI chips, hardware, ML benchmarks, frameworks, and open platforms


AI chips


Etched AI


Neuromorphic Computing


Paper: An overview of brain-like computing: Architecture, applications, and future trends


Top 10 AI Chip Makers of 2023: In-depth Guide

Groq

The Groq LPU™ Inference Engine

Paper: A Software-defined Tensor Streaming Multiprocessor for Large-scale Machine Learning


Rain AI

Digital In-Memory Compute

Numerics


Cerebras


Tesla

Dojo

Enter Dojo: Tesla Reveals Design for Modular Supercomputer & D1 Chip

Teslas will be ‘more intelligent’ than HUMANS by 2033 as their microchips already have 36% the capacity of the brain, study reveals


AI Hardware

Google TPU Cloud

Google’s Cloud TPU v4 provides exaFLOPS-scale ML with industry-leading efficiency. Shown: one eighth of a TPU v4 pod from Google’s largest publicly available ML cluster, located in Oklahoma, which runs on ~90% carbon-free energy.

TPU v4 is the first supercomputer to deploy reconfigurable optical circuit switches (OCSes), which dynamically reconfigure the interconnect topology. Much cheaper, lower-power, and faster than InfiniBand, the OCSes and their underlying optical components account for <5% of TPU v4’s system cost and <5% of its system power.
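
For a sense of what targeting such a pod looks like from user code, here is a minimal sketch assuming a Cloud TPU VM with jax[tpu] installed; device counts depend on the slice you request, and the computation is a placeholder.

```python
# Minimal sketch, assuming a Cloud TPU VM with jax[tpu] installed.
import jax
import jax.numpy as jnp

# Each TPU chip shows up as one or more JAX devices.
print(jax.devices())

# pmap replicates a computation across all local TPU cores (SPMD style).
@jax.pmap
def scaled_sum(x):
    return jnp.sum(x) * 2.0

n = jax.local_device_count()
batch = jnp.ones((n, 1024))   # leading axis = number of devices
print(scaled_sum(batch))      # one partial result per core
```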


TAIDE cloud

Blog: [Key LLM infrastructure: compute] NCHC to massively expand its compute capacity next year to meet large-model training demand
NCHC's Taiwania 2 can handle pre-training a 7B model (on 140B tokens of training data) as well as a 13B model (on 240B tokens).
When Meta trained Llama 2 from scratch, it took thousands to tens of thousands of A100 GPUs roughly six months;
Taiwania 2 uses the comparatively low-end V100 GPU, at roughly 1:3 the performance. Pre-training a 70B model on Taiwania 2 could take nine months to a year.
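
The estimate above is simple proportional scaling; here is a back-of-the-envelope sketch (the six-month and 1:3 figures come from the blog, the equal-GPU-count assumption is a loud simplification):

```python
# Back-of-the-envelope training-time scaling from the blog's figures.
# Loud assumptions: equal GPU counts, perfect scaling, and wall-clock time
# inversely proportional to per-GPU throughput.
a100_months = 6            # Meta's reported Llama 2 wall-clock time
v100_relative_perf = 1 / 3 # blog's V100 : A100 performance ratio

v100_months = a100_months / v100_relative_perf
print(f"~{v100_months:.0f} months on V100s at equal GPU count")
# ~18 months; the blog's 9-12 month estimate implies a smaller effective
# workload or more GPUs than this naive scaling assumes.
```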


Nvidia

CUDA & CuDNN


AI Supercomputer

DGX B200

DGX GH200


AI Data Center

DGX SuperPOD with DGX GB200 Systems

HGX H200

AI Workstation/Server (for Enterprise)

DGX H100


AI HPC

HGX A100
ASUS announced its first server with SXM form-factor GPUs, built around the HGX A100 module.


GPU

GeForce RTX-4090


AMD Instinct GPUs

MI300

304 GPU CUs, 192GB HBM3 memory, 5.3 TB/s peak theoretical memory bandwidth


MI200

220 CUs, 128GB HBM2e memory, 3.2TB/s peak memory bandwidth, 400GB/s peak aggregate Infinity Fabric bandwidth


Intel

Gaudi3

Intel® Gaudi® 3 accelerator, with an L2 cache for every 2 MME and 16 TPC units


AI PC/Notebook

NPU: A first look at three AI PC notebooks! Intel gave its first public demo in Taiwan of Core Ultra notebooks with an integrated NPU, capable of running inference for the 7B-parameter Llama 2 model
Acer demonstrated a Core Ultra notebook running an image generation model on site: it can automatically generate a dynamic 3D astronaut wallpaper on the desktop, and it uses the notebook's front camera to track the user's face so the wallpaper shifts with the viewing angle. A separate tool converts 2D images into glasses-free 3D images.


Edge AI


A collection of MPUs for Edge AI applications

Dimensity 9300

  • Single-core performance improved by more than 15%
  • Multi-core performance improved by more than 40%
  • 4 Cortex-X4 CPU cores clocked up to 3.25GHz
  • 4 Cortex-A720 CPU cores clocked at 2.0GHz
  • 18MB of combined on-chip cache; L3 cache + system-level cache (SLC) capacity is 29% larger than the previous generation

Dimensity 8300

  • Octa-core CPU with 4 Cortex-A715 performance cores and 4 Cortex-A510 efficiency cores
  • Mali-G615 GPU
  • Supports LPDDR5X memory at 8533Mbps
  • Supports UFS 4.0 with Multi-Circular Queue (MCQ)
  • Power-efficient 4nm process

Kneron

  • KNEO300 EdgeGPT

  • KL530
    • Low-power, high-efficiency design based on an Arm Cortex-M4 CPU core.
    • Up to 1 TOPS of INT4 compute; up to 70% higher processing efficiency than INT8 on the same hardware.
    • Supports CNN, Transformer, and RNN-hybrid AI models.
    • A smart ISP uses AI to optimize image quality; a powerful codec delivers efficient multimedia compression.
    • Cold-start time under 500ms; average power consumption under 500mW.
  • KL720 (compute up to 0.9 TOPS/W)

Realtek AmebaPro2

AMB82-MINI

  • MCU
    • Part Number: RTL8735B
    • 32-bit Arm v8M, up to 500MHz
  • MEMORY
    • 768KB ROM
    • 512KB RAM
    • 16MB Flash
    • Supports MCM embedded DDR2/DDR3L memory up to 128MB
  • KEY FEATURES
    • Integrated 802.11 a/b/g/n Wi-Fi, 2.4GHz/5GHz
    • Bluetooth Low Energy (BLE) 5.1
    • Integrated Intelligent Engine @ 0.4 TOPS

mlplatform.org

The machine learning platform is part of the Linaro Artificial Intelligence Initiative and is the home for Arm NN and Compute Library – open-source software libraries that optimise the execution of machine learning (ML) workloads on Arm-based processors.

Project | Repository
Arm NN | https://github.com/ARM-software/armnn
Compute Library | https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary
Arm Android NN Driver | https://github.com/ARM-software/android-nn-driver

ARM NN SDK

The freely available Arm NN (neural network) SDK is a set of open-source Linux software tools that enables machine learning workloads on power-efficient devices. The inference engine acts as a bridge between existing neural network frameworks and power-efficient Arm Cortex-A CPUs, Arm Mali GPUs, and Ethos NPUs.

ARM NN
Arm NN is the most performant machine learning (ML) inference engine for Android and Linux, accelerating ML on Arm Cortex-A CPUs and Arm Mali GPUs.
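
For a concrete picture of how that bridge works, here is a minimal sketch of running a TFLite model through the Arm NN TF Lite delegate from Python. The delegate library path, backend list, and model file are placeholders that vary by Arm NN release and platform; treat the block as an assumption-laden illustration, not the canonical API.

```python
# Minimal sketch, assuming an Arm Linux device with tflite_runtime and the
# Arm NN TF Lite delegate (libarmnnDelegate.so) installed.
import numpy as np
import tflite_runtime.interpreter as tflite

# Backend order is a preference list: try the Mali GPU, then the Cortex-A
# accelerated path, then the reference CPU implementation.
armnn_delegate = tflite.load_delegate(
    "libarmnnDelegate.so",
    options={"backends": "GpuAcc,CpuAcc,CpuRef"},
)

interpreter = tflite.Interpreter(
    model_path="model.tflite",                 # placeholder model path
    experimental_delegates=[armnn_delegate],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out["index"]).shape)
```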


Benchmark

MLPerf

MLPerf™ Inference Benchmark Suite

MLPerf Inference v3.1 (submission 04/08/2023)

model | reference app | framework | dataset
resnet50-v1.5 | vision/classification_and_detection | tensorflow, pytorch, onnx | imagenet2012
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1
dlrm-v2 | recommendation/dlrm | pytorch | Multihot Criteo Terabyte
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus
gpt-j | language/gpt-j | pytorch | CNN-Daily Mail
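
All of the reference apps above are driven by the same LoadGen harness. The sketch below assumes the mlperf_loadgen Python bindings built from the mlcommons/inference repo and stubs out the model entirely; it shows how a system under test (SUT) and query sample library (QSL) plug into an Offline performance run, with arbitrary sample counts.

```python
# Minimal sketch of an MLPerf LoadGen Offline run with a stubbed-out model.
import mlperf_loadgen as lg

def issue_query(query_samples):
    # A real SUT would run inference here; we report empty responses.
    responses = [lg.QuerySampleResponse(s.id, 0, 0) for s in query_samples]
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

def load_samples(indices):    # a real QSL loads dataset samples into RAM
    pass

def unload_samples(indices):
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_query, flush_queries)
qsl = lg.ConstructQSL(1024, 128, load_samples, unload_samples)  # arbitrary counts
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```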

NVIDIA’s MLPerf Benchmark Results

NVIDIA H100 Tensor Core GPU

Benchmark | Per-Accelerator Record
Large Language Model (LLM) | 548 hours (23 days)
Natural Language Processing (BERT) | 0.71 hours
Recommendation (DLRM-dcnv2) | 0.56 hours
Speech Recognition (RNN-T) | 2.2 hours
Image Classification (ResNet-50 v1.5) | 1.8 hours
Object Detection, Heavyweight (Mask R-CNN) | 2.6 hours
Object Detection, Lightweight (RetinaNet) | 4.9 hours
Image Segmentation (3D U-Net) | 1.6 hours

Frameworks

PyTorch

TensorFlow

Keras 3.0
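
Keras 3.0 earns its own entry because it is multi-backend: the same model code runs on TensorFlow, JAX, or PyTorch, selected through the KERAS_BACKEND environment variable before import. A minimal sketch (the toy model and random data are placeholders):

```python
# Minimal sketch: Keras 3 picks its backend at import time via the
# KERAS_BACKEND environment variable ("tensorflow", "jax", or "torch").
import os
os.environ["KERAS_BACKEND"] = "jax"  # must be set before importing keras

import keras
import numpy as np

# A tiny model; the same code runs unchanged on any of the three backends.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=1, verbose=0)
```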


MLX

MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
MLX documentation
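
Below is a minimal sketch of the MLX basics described above: lazy evaluation over unified memory, plus a JAX-style grad transformation. The shapes and toy loss are illustrative only.

```python
# Minimal sketch of MLX: lazy arrays in unified memory on Apple silicon.
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# Operations build a graph lazily; mx.eval forces materialization.
c = a @ b + 1.0
mx.eval(c)
print(c.shape, c.dtype)

# Function transformations mirror JAX: mx.grad returns a gradient function.
grad_fn = mx.grad(lambda w: mx.mean((a @ w) ** 2))
g = grad_fn(mx.random.normal((1024,)))
mx.eval(g)
```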


TinyML

EloquentTinyML

TensorFlow.js

MediaPipe


