If you need to minimize latency in voice interactions, Elto could be a good fit. Elto provides live conversational AI that handles phone calls with low latency (under 700 ms in 99% of cases), enabling realistic conversations and downstream workflow automation. It offers extensive voice customization and fine-tuned language models, and it integrates through REST and GraphQL APIs.
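A figure like "under 700 ms in 99% of cases" is a 99th-percentile (p99) claim, and you can check it against your own measurements whichever provider you try. The sketch below is a minimal, provider-agnostic illustration that assumes you have already collected round-trip latencies in milliseconds. The function names `percentile` and `meets_latency_target` are hypothetical helpers written for this example and are not part of any vendor's API.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest-rank method: 1-based rank into the sorted samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(samples_ms, target_ms=700.0, pct=99.0):
    """Return True if the pct-th percentile latency is below target_ms."""
    return percentile(samples_ms, pct) < target_ms


if __name__ == "__main__":
    # Hypothetical measurements: 99 fast responses and one slow outlier.
    measured = [300.0] * 99 + [1200.0]
    print("p99 (ms):", percentile(measured, 99.0))
    print("meets target:", meets_latency_target(measured))
```

A single slow outlier in 100 calls still passes a p99 target, while two would fail it, so collect enough samples (ideally several hundred, under realistic network conditions) before drawing conclusions.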
Another option is AssemblyAI, which provides a range of AI models for speech-to-text transcription, including low-latency streaming speech-to-text. Its models are built on 12.5 million hours of multilingual audio data and support more than 99 languages, with features such as sentiment analysis and speaker diarization. It's geared toward companies building AI products that consume voice data, and it offers flexible integration tools and 24/7 customer support.
If you're more interested in voice synthesis, LMNT offers fast, realistic voice cloning. It supports low-latency audio streaming and can create studio-quality voice clones from short audio clips. LMNT is flexible enough for real-time conversations, content creation, and product marketing, with pricing tiers that scale with the size of your project.
Finally, Deepgram provides a variety of speech-to-text and text-to-speech APIs with low latency and high accuracy. It supports multiple languages and returns detailed transcription data, making it well suited to speech analytics, media transcription, and voicebots. Deepgram also offers a free API playground and transparent pricing, so you can experiment with it across different use cases.