Generative Speech

Text-To-Speech (using TTS)

WaveNet

Paper: WaveNet: A Generative Model for Raw Audio

Code: https://github.com/r9y9/wavenet_vocoder
Colab: Tacotron2_and_WaveNet_text_to_speech_demo.ipynb


Tacotron-2

Paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Code: https://github.com/Rayhane-mamah/Tacotron-2

Code: Tacotron 2 (without wavenet)


Few-shot Transformer TTS

Paper: Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis
Code: https://github.com/mutiann/few-shot-transformer-tts


MetaAudio

Paper: MetaAudio: A Few-Shot Audio Classification Benchmark
Code: MetaAudio-A-Few-Shot-Audio-Classification-Benchmark
Dataset: ESC-50, NSynth, FSDKaggle18, BirdClef2020, VoxCeleb1


SeamlessM4T

Paper: SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Code: https://github.com/facebookresearch/seamless_communication
Colab: seamless_m4t_colab


Audiobox Aesthetics

Paper: Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Code: https://github.com/facebookresearch/audiobox-aesthetics


SV2TTS

Paper: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Code: CorentinJ/Real-Time-Voice-Cloning


Spark-TTS

Paper: Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Code: https://github.com/SparkAudio/Spark-TTS

  • Inference Overview of Voice Cloning

  • Inference Overview of Controlled Generation

ComfyUI: https://github.com/billwuhao/ComfyUI_SparkTTS

Kaggle: rkuo2000/Spark-TTS


IndexTTS2

Paper: IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Code: https://github.com/index-tts/index-tts

ComfyUI: https://github.com/billwuhao/ComfyUI_IndexTTS

Kaggle: rkuo2000/Index-TTS2


VibeVoice

Paper: VibeVoice Technical Report
model: microsoft/VibeVoice-1.5B
Code: https://github.com/microsoft/VibeVoice

ComfyUI: https://github.com/Enemyx-net/VibeVoice-ComfyUI


FireRedTTS-2

Paper: FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Code: https://github.com/FireRedTeam/FireRedTTS2

ComfyUI: https://github.com/1038lab/ComfyUI-FireRedTTS


Voice Cloning

Paper: Voice Cloning: Comprehensive Survey


RVC-WebUI

Code: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI


AI 用你的聲音創建歌詞曲


GPT-SoVITS

Blog: GPT-SoVITS 用 AI 快速複製你的聲音,搭配 Colab 免費入門
Code: https://github.com/RVC-Boss/GPT-SoVITS/
Kaggle: rkuo2000/so-vits-svc-5-0


VITS SVC

Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
Code: https://github.com/PlayVoice/whisper-vits-svc


Speech Seperation

Looking to Listen

Paper: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Blog: Looking to Listen: Audio-Visual Speech Separation


VoiceFilter

Paper: VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Code: https://github.com/maum-ai/voicefilter
Training took about 20 hours on AWS p3.2xlarge(NVIDIA V100)


VoiceFilter-Lite

Paper: VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Blog:


Automatic Speech Recognition (ASR)

NeMO

Paper: NeMo: a toolkit for building AI applications using Neural Modules
Nemo ASR

Whisper

Paper: Robust Speech Recognition via Large-Scale Weak Supervision
Kaggle: rkuo2000/asr-whisper


Qwen-Audio

Paper: Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Code: https://github.com/QwenLM/Qwen-Audio


Faster-Whisper

faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
Kaggle: rkuo2000/faster-whisper


Open Whisper-style Speech Models (OWSM)

Paper: OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer


Canary

Paper: Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
model: nvidia/canary-1b-v2
Kaggle: rkuo2000/asr-canary-1b


Whisper Large-v3

model: openai/whisper-large-v3
model: openai/whisper-large-v3-turbo
Kaggle: rkuo2000/whisper-large-v3-turbo



This site was last updated October 26, 2025.