AI Hardware

AI chips, hardware, ML benchmarks, frameworks, and open platforms


AI chips


Etched AI


Neuromorphic Computing


Paper: An overview of brain-like computing: Architecture, applications, and future trends


Top 10 AI Chip Makers of 2023: In-depth Guide

Groq

The Groq LPU™ Inference Engine

Paper: A Software-defined Tensor Streaming Multiprocessor for Large-scale Machine Learning


Rain AI

Digital In-Memory Compute

Numerics


Cerebras


Tesla

Dojo

Enter Dojo: Tesla Reveals Design for Modular Supercomputer & D1 Chip

Teslas will be ‘more intelligent’ than HUMANS by 2033 as their microchips already have 36% the capacity of the brain, study reveals


AI Hardware

Google TPU Cloud

Google’s Cloud TPU v4 provides exaFLOPS-scale ML with industry-leading efficiency. Shown: one eighth of a TPU v4 pod from Google’s largest publicly available ML cluster, located in Oklahoma, which runs on ~90% carbon-free energy.

TPU v4 is the first supercomputer to deploy reconfigurable optical circuit switches (OCSes), which dynamically reconfigure the interconnect topology. Much cheaper, lower-power, and faster than InfiniBand, the OCSes and their underlying optical components account for <5% of TPU v4’s system cost and <5% of its system power.
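
For a sense of what targeting such a pod looks like from user code, here is a minimal sketch assuming a Cloud TPU VM with jax[tpu] installed; device counts depend on the slice you request, and the computation is a placeholder.

```python
# Minimal sketch, assuming a Cloud TPU VM with jax[tpu] installed.
import jax
import jax.numpy as jnp

# Each TPU chip shows up as one or more JAX devices.
print(jax.devices())

# pmap replicates a computation across all local TPU cores (SPMD style).
@jax.pmap
def scaled_sum(x):
    return jnp.sum(x) * 2.0

n = jax.local_device_count()
batch = jnp.ones((n, 1024))   # leading axis = number of devices
print(scaled_sum(batch))      # one partial result per core
```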


TAIDE cloud

Blog: [Key LLM infrastructure: compute] NCHC to massively expand its compute capacity next year to meet large-model training demand
NCHC's Taiwania 2 can handle pre-training a 7B model (on 140B tokens of training data) as well as a 13B model (on 240B tokens).
When Meta trained Llama 2 from scratch, it took thousands to tens of thousands of A100 GPUs roughly six months;
Taiwania 2 uses the comparatively low-end V100 GPU, at roughly 1:3 the performance. Pre-training a 70B model on Taiwania 2 could take nine months to a year.
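
The estimate above is simple proportional scaling; here is a back-of-the-envelope sketch (the six-month and 1:3 figures come from the blog, the equal-GPU-count assumption is a loud simplification):

```python
# Back-of-the-envelope training-time scaling from the blog's figures.
# Loud assumptions: equal GPU counts, perfect scaling, and wall-clock time
# inversely proportional to per-GPU throughput.
a100_months = 6            # Meta's reported Llama 2 wall-clock time
v100_relative_perf = 1 / 3 # blog's V100 : A100 performance ratio

v100_months = a100_months / v100_relative_perf
print(f"~{v100_months:.0f} months on V100s at equal GPU count")
# ~18 months; the blog's 9-12 month estimate implies a smaller effective
# workload or more GPUs than this naive scaling assumes.
```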


Nvidia

CUDA & CuDNN


AI Supercomputer

DGX B200

DGX GH200


AI Data Center

DGX SuperPOD with DGX GB200 Systems

HGX H200

AI Workstation/Server (for Enterprise)

DGX H100


AI HPC

HGX A100
ASUS announced its first server with SXM form-factor GPUs, built around the HGX A100 module.


GPU

GeForce RTX-4090


AMD Instinct GPUs

MI300

304 GPU CUs, 192GB HBM3 memory, 5.3 TB/s peak theoretical memory bandwidth


MI200

220 CUs, 128GB HBM2e memory, 3.2TB/s peak memory bandwidth, 400GB/s peak aggregate Infinity Fabric bandwidth


Intel

Gaudi3

Intel® Gaudi® 3 accelerator, with an L2 cache for every 2 MME and 16 TPC units


AI PC/Notebook

NPU: A first look at three AI PC notebooks! Intel gave its first public demo in Taiwan of Core Ultra notebooks with an integrated NPU, capable of running inference for the 7B-parameter Llama 2 model
Acer demonstrated a Core Ultra notebook running an image generation model on site: it can automatically generate a dynamic 3D astronaut wallpaper on the desktop, and it uses the notebook's front camera to track the user's face so the wallpaper shifts with the viewing angle. A separate tool converts 2D images into glasses-free 3D images.


Edge AI


A collection of MPUs for Edge AI applications

Dimensity 9300

  • Single-core performance improved by more than 15%
  • Multi-core performance improved by more than 40%
  • 4 Cortex-X4 CPU cores clocked up to 3.25GHz
  • 4 Cortex-A720 CPU cores clocked at 2.0GHz
  • 18MB of combined on-chip cache; L3 cache + system-level cache (SLC) capacity is 29% larger than the previous generation

Dimensity 8300

  • Octa-core CPU with 4 Cortex-A715 performance cores and 4 Cortex-A510 efficiency cores
  • Mali-G615 GPU
  • Supports LPDDR5X memory at 8533Mbps
  • Supports UFS 4.0 with Multi-Circular Queue (MCQ)
  • Power-efficient 4nm process

Kneron

  • KNEO300 EdgeGPT

  • KL530
    • Low-power, high-efficiency design based on an Arm Cortex-M4 CPU core.
    • Up to 1 TOPS of INT4 compute; up to 70% higher processing efficiency than INT8 on the same hardware.
    • Supports CNN, Transformer, and RNN-hybrid AI models.
    • A smart ISP uses AI to optimize image quality; a powerful codec delivers efficient multimedia compression.
    • Cold-start time under 500ms; average power consumption under 500mW.
  • KL720 (compute up to 0.9 TOPS/W)

Realtek AmebaPro2

AMB82-MINI

  • MCU
    • Part Number: RTL8735B
    • 32-bit Arm v8M, up to 500MHz
  • MEMORY
    • 768KB ROM
    • 512KB RAM
    • 16MB Flash
    • Supports MCM embedded DDR2/DDR3L memory up to 128MB
  • KEY FEATURES
    • Integrated 802.11 a/b/g/n Wi-Fi, 2.4GHz/5GHz
    • Bluetooth Low Energy (BLE) 5.1
    • Integrated Intelligent Engine @ 0.4 TOPS

mlplatform.org

The machine learning platform is part of the Linaro Artificial Intelligence Initiative and is the home for Arm NN and Compute Library – open-source software libraries that optimise the execution of machine learning (ML) workloads on Arm-based processors.

Project | Repository
Arm NN | https://github.com/ARM-software/armnn
Compute Library | https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary
Arm Android NN Driver | https://github.com/ARM-software/android-nn-driver

ARM NN SDK

The freely available Arm NN (neural network) SDK is a set of open-source Linux software tools that enables machine learning workloads on power-efficient devices. The inference engine acts as a bridge between existing neural network frameworks and power-efficient Arm Cortex-A CPUs, Arm Mali GPUs, and Ethos NPUs.

ARM NN
Arm NN is the most performant machine learning (ML) inference engine for Android and Linux, accelerating ML on Arm Cortex-A CPUs and Arm Mali GPUs.
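
For a concrete picture of how that bridge works, here is a minimal sketch of running a TFLite model through the Arm NN TF Lite delegate from Python. The delegate library path, backend list, and model file are placeholders that vary by Arm NN release and platform; treat the block as an assumption-laden illustration, not the canonical API.

```python
# Minimal sketch, assuming an Arm Linux device with tflite_runtime and the
# Arm NN TF Lite delegate (libarmnnDelegate.so) installed.
import numpy as np
import tflite_runtime.interpreter as tflite

# Backend order is a preference list: try the Mali GPU, then the Cortex-A
# accelerated path, then the reference CPU implementation.
armnn_delegate = tflite.load_delegate(
    "libarmnnDelegate.so",
    options={"backends": "GpuAcc,CpuAcc,CpuRef"},
)

interpreter = tflite.Interpreter(
    model_path="model.tflite",                 # placeholder model path
    experimental_delegates=[armnn_delegate],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
out = interpreter.get_output_details()[0]
print(interpreter.get_tensor(out["index"]).shape)
```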


Benchmark

MLPerf

MLPerf™ Inference Benchmark Suite

MLPerf Inference v3.1 (submission 04/08/2023)

model | reference app | framework | dataset
resnet50-v1.5 | vision/classification_and_detection | tensorflow, pytorch, onnx | imagenet2012
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1
dlrm-v2 | recommendation/dlrm | pytorch | Multihot Criteo Terabyte
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus
gpt-j | language/gpt-j | pytorch | CNN-Daily Mail
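
All of the reference apps above are driven by the same LoadGen harness. The sketch below assumes the mlperf_loadgen Python bindings built from the mlcommons/inference repo and stubs out the model entirely; it shows how a system under test (SUT) and query sample library (QSL) plug into an Offline performance run, with arbitrary sample counts.

```python
# Minimal sketch of an MLPerf LoadGen Offline run with a stubbed-out model.
import mlperf_loadgen as lg

def issue_query(query_samples):
    # A real SUT would run inference here; we report empty responses.
    responses = [lg.QuerySampleResponse(s.id, 0, 0) for s in query_samples]
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

def load_samples(indices):    # a real QSL loads dataset samples into RAM
    pass

def unload_samples(indices):
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_query, flush_queries)
qsl = lg.ConstructQSL(1024, 128, load_samples, unload_samples)  # arbitrary counts
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```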

NVIDIA’s MLPerf Benchmark Results

NVIDIA H100 Tensor Core GPU

Benchmark | Per-Accelerator Record
Large Language Model (LLM) | 548 hours (23 days)
Natural Language Processing (BERT) | 0.71 hours
Recommendation (DLRM-dcnv2) | 0.56 hours
Speech Recognition (RNN-T) | 2.2 hours
Image Classification (ResNet-50 v1.5) | 1.8 hours
Object Detection, Heavyweight (Mask R-CNN) | 2.6 hours
Object Detection, Lightweight (RetinaNet) | 4.9 hours
Image Segmentation (3D U-Net) | 1.6 hours

Frameworks

PyTorch

TensorFlow

Keras 3.0
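
Keras 3.0 earns its own entry because it is multi-backend: the same model code runs on TensorFlow, JAX, or PyTorch, selected through the KERAS_BACKEND environment variable before import. A minimal sketch (the toy model and random data are placeholders):

```python
# Minimal sketch: Keras 3 picks its backend at import time via the
# KERAS_BACKEND environment variable ("tensorflow", "jax", or "torch").
import os
os.environ["KERAS_BACKEND"] = "jax"  # must be set before importing keras

import keras
import numpy as np

# A tiny model; the same code runs unchanged on any of the three backends.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=1, verbose=0)
```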


MLX

MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
MLX documentation
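
Below is a minimal sketch of the MLX basics described above: lazy evaluation over unified memory, plus a JAX-style grad transformation. The shapes and toy loss are illustrative only.

```python
# Minimal sketch of MLX: lazy arrays in unified memory on Apple silicon.
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# Operations build a graph lazily; mx.eval forces materialization.
c = a @ b + 1.0
mx.eval(c)
print(c.shape, c.dtype)

# Function transformations mirror JAX: mx.grad returns a gradient function.
grad_fn = mx.grad(lambda w: mx.mean((a @ w) ** 2))
g = grad_fn(mx.random.normal((1024,)))
mx.eval(g)
```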


TinyML

EloquentTinyML

TensorFlow.js

MediaPipe


