Chatterbox TTS API Service

This is a high-performance Text-to-Speech (TTS) service based on Chatterbox-TTS. It offers an OpenAI TTS-compatible API, an enhanced interface supporting voice cloning, and a user-friendly web interface.

This project aims to provide developers and content creators with a privately deployable, powerful, and easily integrable TTS solution.

Project address: https://github.com/jianchang512/chatterbox-api


Using in pyVideoTrans

This project can serve as a powerful TTS backend, providing high-quality English dubbing for pyVideoTrans.

  1. Start the project: Ensure the Chatterbox TTS API service is running locally (http://127.0.0.1:5093).

  2. Update pyVideoTrans: Upgrade your pyVideoTrans installation to v3.73 or later.

  3. Configure pyVideoTrans:

    • In the pyVideoTrans menu, go to TTS Settings -> Chatterbox TTS.
    • API Address: Fill in the address of this service, which is http://127.0.0.1:5093 by default.
    • Reference Audio (Optional): If you want to use voice cloning, please fill in the filename of the reference audio here (e.g., my_voice.wav). Please ensure that the audio file is placed in the chatterbox folder under the pyVideoTrans root directory.
    • Adjust Parameters: Adjust cfg_weight and exaggeration as needed to achieve the best results.

    Parameter Adjustment Suggestions:

    • General Scenarios (TTS, Voice Assistant): The default settings (cfg_weight=0.5, exaggeration=0.5) are suitable for most situations.
    • Fast-Paced Reference Audio: If the reference audio has a fast pace, try reducing cfg_weight to around 0.3 to improve the rhythm of the generated speech.
    • Expressive/Dramatic Speech: Try a lower cfg_weight (e.g., 0.3) and a higher exaggeration (e.g., 0.7 or higher). Increasing exaggeration usually speeds up the speech, while lowering cfg_weight helps balance it, making the rhythm more relaxed and clearer. (A parameter-sweep sketch at the end of the API examples below shows one way to compare settings by ear.)

✨ Key Features

  • Two API Interfaces:
    1. OpenAI Compatible Interface: /v1/audio/speech, seamlessly integrates into any existing workflow that supports the OpenAI SDK.
    2. Voice Cloning Interface: /v2/audio/speech_with_prompt, which generates speech matching the voice of a short reference audio clip you upload.
  • Web User Interface: Provides an intuitive frontend page for quickly testing and using TTS functions without writing any code.
  • Flexible Output Formats: Supports generating audio in .mp3 and .wav formats.
  • Cross-Platform Support: Provides detailed installation guides for Windows, macOS, and Linux.
  • One-Click Windows Deployment: Provides a compressed package containing all dependencies and startup scripts for Windows users, making it ready to use out of the box.
  • GPU Acceleration: Supports NVIDIA GPU (CUDA) and provides a one-click upgrade script for Windows users.
  • Seamless Integration: Can be easily integrated with tools like pyVideoTrans as a backend service.

🚀 Quick Start

Method 1: Windows Users (Portable Package)

We have prepared a portable package, win.7z, containing all dependencies for Windows users, which greatly simplifies installation.

  1. Download and Extract: Download win.7z from the Releases page (https://github.com/jianchang512/chatterbox-api/releases) and extract it to any location (preferably a path without Chinese characters).

  2. Install C++ Build Tools (Strongly Recommended):

    • Go to the extracted tools folder and double-click vs_BuildTools.exe.
    • In the installation interface that pops up, check the "Desktop development with C++" option and click Install.
    • This pre-installs the build tools that many Python packages need during compilation, avoiding a large number of installation errors.
  3. Start the Service:

    • Double-click the 启动服务.bat ("Start Service") script in the root directory.
    • On the first run, the script will automatically create a Python virtual environment and install all necessary dependency packages. This may take a few minutes and will also download the TTS model; please be patient.
    • The service will start automatically after installation.

    When you see information similar to the following in the command line window, it indicates that the service has started successfully:

```
✅ Model loading complete.
Service started successfully, HTTP address is: http://127.0.0.1:5093
```
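
To confirm from your own code that the service is reachable, here is a minimal sketch; it assumes only that the server answers HTTP requests on port 5093, as shown in the startup log:

```python
import requests

# Probe the service root; the Web UI is served from here once startup completes.
try:
    resp = requests.get("http://127.0.0.1:5093", timeout=5)
    print("Service is up, HTTP status:", resp.status_code)
except requests.exceptions.ConnectionError:
    print("Service not reachable yet -- it may still be installing dependencies.")
```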

Method 2: macOS, Linux, and Manual Installation Users

For macOS, Linux users, or Windows users who want to set up the environment manually, please follow these steps.

1. Prerequisites

  • Python: Ensure that Python 3.9 or later is installed.
  • ffmpeg: This is a required audio and video processing tool.
    • macOS (using Homebrew): brew install ffmpeg
    • Debian/Ubuntu: sudo apt-get update && sudo apt-get install ffmpeg
    • Windows (Manual): Download ffmpeg and add it to the system environment variable PATH.
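
If you want to verify both prerequisites before continuing, a quick convenience check from Python (not part of the project itself) might look like this:

```python
import shutil
import sys

# The service needs Python 3.9+ and ffmpeg available on PATH.
assert sys.version_info >= (3, 9), "Python 3.9 or later is required"
assert shutil.which("ffmpeg") is not None, "ffmpeg was not found on PATH"
print("Prerequisites look good.")
```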

2. Installation Steps

```bash
# 1. Clone the project repository
git clone https://github.com/jianchang512/chatterbox-api.git
cd chatterbox-api

# 2. Create and activate a Python virtual environment (recommended)
python3 -m venv venv
# on Windows:
# venv\Scripts\activate
# on macOS/Linux:
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start the service
python app.py
```

Once the service starts successfully, you will see the service address http://127.0.0.1:5093 in the terminal.


⚡ Upgrade to GPU Version (Optional)

If your computer is equipped with a CUDA-enabled NVIDIA graphics card and has correctly installed the NVIDIA driver and CUDA Toolkit, you can upgrade to the GPU version for a significant performance improvement.

Windows Users (One-Click Upgrade)

  1. Please make sure you have successfully run 启动服务.bat once to complete the installation of the basic environment.
  2. Double-click the 安装N卡GPU支持.bat ("Install NVIDIA GPU Support") script.
  3. The script will automatically uninstall the CPU version of PyTorch and install the GPU version compatible with CUDA 12.6.

Linux Manual Upgrade

After activating the virtual environment, execute the following commands:

```bash
# 1. Uninstall the existing CPU version of PyTorch
pip uninstall -y torch torchaudio

# 2. Install PyTorch that matches your CUDA version
# The following command is for CUDA 12.6. Get the correct command from the
# PyTorch official website for your CUDA version.
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126
```

You can visit the PyTorch official website to get the installation commands suitable for your system.

After the upgrade, restart the service and you will see Using device: cuda in the startup log.
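
To check that PyTorch can actually see the GPU, you can run a quick sketch inside the activated virtual environment:

```python
import torch

# True means the service will use the GPU after you restart it.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```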


📖 Usage Guide

1. Web Interface

After the service starts, open http://127.0.0.1:5093 in your browser to access the Web UI.

  • Input Text: Enter the text you want to convert in the text box.
  • Adjust Parameters:
    • cfg_weight: (Range 0.0 - 1.0) Controls the pace of the speech. Lower values result in slower and more relaxed speech. For fast-paced reference audio, this value can be appropriately reduced (e.g., 0.3).
    • exaggeration: (Range 0.25 - 2.0) Controls the emotion and intonation exaggeration of the speech. Higher values result in richer emotions and potentially faster speech.
  • Voice Cloning: Click "Choose File" to upload a reference audio (e.g., .mp3, .wav). If a reference audio is provided, the service will use the cloning interface.
  • Generate Speech: Click the "Generate Speech" button and wait a moment to listen online and download the generated MP3 file.

2. API Call

Interface 1: OpenAI Compatible Interface (/v1/audio/speech)

This interface does not require reference audio and can be called directly using the OpenAI SDK.

Python Example (openai SDK):

```python
from openai import OpenAI

# Point the client at the local service instead of api.openai.com
client = OpenAI(
    base_url="http://127.0.0.1:5093/v1",
    api_key="not-needed"  # No API key is required, but the SDK insists on a value
)

response = client.audio.speech.create(
    model="chatterbox-tts",  # This parameter is ignored
    voice="en",              # Passes the language code; currently only 'en' is supported
    speed=0.5,               # Maps to the cfg_weight parameter
    input="Hello, this is a test from the OpenAI compatible API.",
    instructions="0.5",      # (Optional) Maps to the exaggeration parameter; must be a string
    response_format="mp3"    # 'mp3' or 'wav'
)

# Save the audio stream to a file
response.stream_to_file("output_api1.mp3")
print("Audio saved to output_api1.mp3")
```

Interface 2: Voice Cloning Interface (/v2/audio/speech_with_prompt)

This interface requires uploading both text and a reference audio file in multipart/form-data format.

Python Example (requests library):

```python
import requests

API_URL = "http://127.0.0.1:5093/v2/audio/speech_with_prompt"
REFERENCE_AUDIO = "path/to/your/reference.mp3"  # Replace with your reference audio path

form_data = {
    'input': 'This voice should sound like the reference audio.',
    'cfg_weight': '0.5',
    'exaggeration': '0.5',
    'response_format': 'mp3'  # 'mp3' or 'wav'
}

with open(REFERENCE_AUDIO, 'rb') as audio_file:
    files = {'audio_prompt': audio_file}
    response = requests.post(API_URL, data=form_data, files=files)

if response.ok:
    with open("output_api2.mp3", "wb") as f:
        f.write(response.content)
    print("Cloned audio saved to output_api2.mp3")
else:
    print("Request failed:", response.text)
```