Fish Audio S2

The most expressive voice AI ever made, now open-source.

AI/ML - Text-to-Speech | Gen Z Team | $5M+ ARR | voice-ai | text-to-speech | open-source | Reviewed

Our Take

A team of four Gen Z builders said "we're going to make the best voice AI on the planet" and then just... did it. Shijia Liao is the CEO and Jiahua Liu (who's at Stanford, by the way) is the co-founder bringing that academic firepower. Fish Audio's S2 model just dropped and it's their most advanced text-to-speech system yet. They hit $5 million in ARR and 420,000 monthly active users with a four-person team. FOUR PEOPLE. That's not a startup. That's a cheat code. They came out of HF0, which is basically the incubator for people who are already cracked at AI before they even start a company.

Fish Audio does voice cloning, text-to-speech, and speech-to-speech that sounds so human it's kind of unsettling in the best way possible. Their API serves developers building everything from audiobook platforms to AI companions to game studios that need dynamic voice acting without hiring a thousand voice actors.

The voice AI space is getting crowded, but Fish Audio is swimming past everyone else because they built their own models from scratch instead of fine-tuning someone else's work. When a Gen Z team is outpacing companies with 10x their headcount and 100x their funding, you pay attention. Fish Audio is the real deal.

Text-to-speech model with ultra-low latency, fine-grained inline control of prosody and emotion, multi-speaker support, and full open-source availability

Key Features
Ultra-low latency under 150 ms, Open-domain control with natural text instructions, Multi-speaker conversations in a single generation, Fully open-source (inference code and model weights), Built with SGLang, Support for 80+ languages, Fine-grained inline control using [tag] syntax, Over 15,000 unique tags supported, Dual-Autoregressive architecture (4B Slow AR + 400M Fast AR), Real-Time Factor of 0.195, Time-to-first-audio of ~100 ms, Throughput of 3,000+ acoustic tokens per second
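To make the inline [tag] idea a bit more concrete, here is a minimal sketch of how a script with bracketed tags could be split into tags and plain text before being sent to a model. This is an illustration only: the tag names (`whisper`, `laughs`) and the `split_inline_tags` helper are hypothetical and are not taken from Fish Audio's documentation or code.

```python
import re
from typing import List, Tuple

# Matches a bracketed tag such as [whisper]. The tag names used in the demo
# below are hypothetical examples, not taken from Fish Audio's documentation.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into (kind, value) pieces, where kind is 'tag' or 'text'."""
    pieces: List[Tuple[str, str]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        # Plain text between the previous tag (or the start) and this tag.
        plain = text[cursor:match.start()].strip()
        if plain:
            pieces.append(("text", plain))
        pieces.append(("tag", match.group(1).strip()))
        cursor = match.end()
    # Whatever follows the last tag.
    tail = text[cursor:].strip()
    if tail:
        pieces.append(("text", tail))
    return pieces


if __name__ == "__main__":
    script = "[whisper] Don't wake them. [laughs] Too late."
    for kind, value in split_inline_tags(script):
        print(f"{kind:>4}: {value}")
```

A pre-pass like this is handy for linting scripts (for example, catching unclosed brackets or typos in tag names) before spending inference time on them.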
Problem It Solves
Creating realistic, expressive speech synthesis with real-time performance for conversational AI, live dubbing, and interactive voice applications
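For a sense of what "real-time performance" means here, the Real-Time Factor of 0.195 quoted under Key Features can be turned into rough numbers. The sketch below assumes the usual definition, RTF = processing time divided by audio duration; the helper names are made up for illustration, and real timings will depend on hardware and settings the listing does not specify.

```python
def synthesis_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Processing time implied by an RTF, where RTF = processing time / audio duration."""
    return audio_seconds * real_time_factor


def speedup_over_real_time(real_time_factor: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / real_time_factor


RTF = 0.195  # figure quoted in the listing

if __name__ == "__main__":
    for clip in (1.0, 10.0, 60.0):
        print(f"{clip:5.1f} s of audio -> ~{synthesis_seconds(clip, RTF):.2f} s to generate")
    print(f"roughly {speedup_over_real_time(RTF):.1f}x faster than real time")
```

Any RTF below 1.0 means audio is generated faster than it plays back, which is the basic requirement for streaming use cases like live dubbing.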
Target Customer
Developers, enterprises, researchers, content creators
Use Cases
Real-time conversational AI, Live dubbing, Interactive voice applications, Audiobooks, Voiceovers, Character voices, Conversational chatbots, Research, Voice cloning
Pricing Details
Research and non-commercial use permitted free of charge. Commercial use requires separate license from Fish Audio.
Free Tier
Research and non-commercial use is free
Differentiator
Most expressive open-source TTS model with Dual-Autoregressive architecture combining 4B-parameter Slow AR for semantic prediction and 400M-parameter Fast AR for acoustic detail
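One way to picture a two-level autoregressive setup like this is a big model that takes one step per audio frame and a small model that takes several quick steps inside each frame. The toy below shows only that control flow. It is not Fish Audio's implementation: the arithmetic stubs stand in for neural networks, and `CODES_PER_FRAME`, the function names, and the token ranges are all assumptions made for the example.

```python
from typing import List

CODES_PER_FRAME = 4  # hypothetical; the listing does not give this number


def slow_ar_step(semantic_history: List[int]) -> int:
    """Stand-in for the large model: next semantic token from prior semantic tokens."""
    return (sum(semantic_history) * 31 + len(semantic_history)) % 1000


def fast_ar_step(semantic_token: int, codes_so_far: List[int]) -> int:
    """Stand-in for the small model: next acoustic code within the current frame."""
    return (semantic_token * 7 + sum(codes_so_far) + len(codes_so_far)) % 256


def generate(num_frames: int) -> List[List[int]]:
    """Return one list of acoustic codes per frame."""
    semantic_history: List[int] = []
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # Outer loop: one slow step per frame picks the semantic token.
        semantic = slow_ar_step(semantic_history)
        semantic_history.append(semantic)
        # Inner loop: several fast steps fill in acoustic detail for that frame,
        # each conditioned on the semantic token and the codes emitted so far.
        codes: List[int] = []
        for _ in range(CODES_PER_FRAME):
            codes.append(fast_ar_step(semantic, codes))
        frames.append(codes)
    return frames


if __name__ == "__main__":
    for index, codes in enumerate(generate(3)):
        print(f"frame {index}: {codes}")
```

The appeal of this split is that the expensive model runs once per frame while the cheap model handles the high-rate acoustic tokens, which fits the listing's emphasis on low latency.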
Why Now
Now open-source - both inference code and model weights released
Traction
Notable Metrics: Trained on 10M+ hours of audio data across 80+ languages

Key Facts

Category
AI/ML - Text-to-Speech
Team
Gen Z, four people
Revenue
$5M+ ARR
Pricing
50% OFF YEARLY (Limited Time Offer)
Discovered via
Product Hunt

The people behind Fish Audio S2

Hang Huang

Co-Founder & CEO

Co-Founder & CEO of InsForge (YC P26), a Postgres-based backend platform for AI-native developers. Previously Product Manager at Amazon. Based in San Francisco Bay Area.

James Ding

Maker

Before co-founding Draftwise, James spent ten years at the forefront of AI, leading teams at Palantir to create big data AI solutions and patenting breakthroughs in data security and machine learning that now power Draftwise.


Fish Audio S2 — SLAYREPORT