Fish Audio S2

The most expressive voice AI ever made, now open-source.

AI/ML - Text-to-Speech | Gen Z Team | Raised $5M+ ARR | voice-ai | text-to-speech | open-source | Reviewed

Our Take

A team of four Gen Z builders said "we're going to make the best voice AI on the planet" and then just... did it. Shijia Liao is the CEO, and co-founder Jiahua Liu, who's at Stanford, by the way, brings the academic firepower. Fish Audio's S2 model just dropped, and it's their most advanced text-to-speech system yet.

They hit $5 million in ARR and 420,000 monthly active users with a four-person team. FOUR PEOPLE. That's not a startup. That's a cheat code. They came out of HF0, which is basically the incubator for people who are already cracked at AI before they even start a company.

Fish Audio does voice cloning, text-to-speech, and speech-to-speech that sounds so human it's kind of unsettling in the best way possible. Their API serves everyone from developers building audiobook platforms and AI companions to game studios that need dynamic voice acting without hiring a thousand voice actors. The voice AI space is getting crowded, but Fish Audio is swimming past everyone else because they built their own models from scratch instead of fine-tuning someone else's work. When a Gen Z team is outpacing companies with 10x their headcount and 100x their funding, you pay attention. Fish Audio is the real deal.

fish.audio | Product Hunt Launch | Shijia Liao LinkedIn | Jiahua Liu LinkedIn

Text-to-speech model with ultra-low latency, fine-grained inline control of prosody and emotion, multi-speaker support, and full open-source availability
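To make "inline control" a bit more concrete, here is a rough Python sketch of how text annotated with square-bracket tags could be split into tags and spoken text. It only illustrates the `[tag]` bracket convention listed under Key Features. The tag names (`whisper`, `laugh`) and the `split_tags` helper are made up for illustration, and this listing does not describe how S2 itself interprets tags.

```python
import re

# Matches a single [tag]; nested brackets are not handled in this sketch.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def split_tags(text):
    """Split text into ("tag", value) and ("text", value) segments, in order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments

# Hypothetical tags, chosen only to show the shape of the input.
example = "[whisper] I have a secret. [laugh] Just kidding!"
for kind, value in split_tags(example):
    print(kind, "->", value)
```

Keeping the tags in the same string as the script is what makes the control "inline": the cue sits exactly where the delivery should change.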

Key Features

Ultra-low latency under 150 ms, Open-domain control with natural text instructions, Multi-speaker conversations in a single generation, Fully open-source (inference code and model weights), Built with SGLang, Support for 80+ languages, Fine-grained inline control using [tag] syntax, Over 15,000 unique tags supported, Dual-Autoregressive architecture (4B Slow AR + 400M Fast AR), Real-Time Factor of 0.195, Time-to-first-audio of ~100 ms, 3,000+ acoustic tokens per second throughput
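For a feel of what a Real-Time Factor of 0.195 means in practice, here is a small Python sketch. It assumes the conventional definition (processing time divided by the duration of the audio produced). The listing does not say what hardware or conditions the figure was measured under, so treat the numbers as back-of-the-envelope only.

```python
def synthesis_time(audio_seconds: float, rtf: float = 0.195) -> float:
    """Compute seconds needed to generate `audio_seconds` of speech.

    Assumes RTF = processing time / audio duration.
    """
    return audio_seconds * rtf

def realtime_speedup(rtf: float = 0.195) -> float:
    """How many times faster than real time the model generates audio."""
    return 1.0 / rtf

print(f"{synthesis_time(60):.1f} s of compute for 60 s of audio")
print(f"{realtime_speedup():.1f}x faster than real time")
```

Under that assumption, a minute of speech takes roughly 11.7 seconds to generate, which is about five times faster than playback.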

Problem It Solves

Creating realistic, expressive speech synthesis with real-time performance for conversational AI, live dubbing, and interactive voice applications

Target Customer

Developers, enterprises, researchers, content creators

Use Cases

Real-time conversational AI, Live dubbing, Interactive voice applications, Audiobooks, Voiceovers, Character voices, Conversational chatbots, Research, Voice cloning

Pricing Details

Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio.

Free Tier

Research and non-commercial use is free

Differentiator

Most expressive open-source TTS model with Dual-Autoregressive architecture combining 4B-parameter Slow AR for semantic prediction and 400M-parameter Fast AR for acoustic detail

Why Now

Now open-source: both the inference code and the model weights have been released.

Traction

Notable Metrics: Trained on 10M+ hours of audio data across 80+ languages

Key Facts

The people behind Fish Audio S2

Hang Huang

Co-Founder & CEO

Co-Founder & CEO of InsForge (YC P26), a Postgres-based backend platform for AI-native developers. Previously Product Manager at Amazon. Based in San Francisco Bay Area.

James Ding

Maker

Before co-founding Draftwise, James spent ten years at the forefront of AI, leading teams at Palantir to create big data AI solutions and patenting breakthroughs in data security and machine learning that now power Draftwise.

Source: product-hunt