A Fast TTS Engine

Auralis 🌌 (/auˈralis/)

Transform text into natural speech (with voice cloning) at warp speed. Process an entire novel in minutes, not hours.

What is Auralis? 🚀

Auralis is a text-to-speech engine that makes voice generation practical for real-world use:

  • Convert the entire first Harry Potter book to speech in 10 minutes (realtime factor of ≈ 0.02x!)
  • Automatically enhances the quality of reference audio, so you can record references even with a low-quality mic!
  • Can be configured for a small memory footprint (via scheduler_max_concurrency)
  • Process multiple requests simultaneously
  • Stream long texts piece by piece
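
To put the realtime factor in the first bullet into numbers, here is a small sketch. It assumes the usual definition (processing time divided by the duration of the generated audio); the function name is illustrative and not part of Auralis.

```python
def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# 10 minutes of processing at a realtime factor of 0.02 implies
# roughly 500 minutes (a bit over 8 hours) of generated audio.
processing_seconds = 10 * 60
implied_audio_seconds = processing_seconds / 0.02
print(f"Implied audio length: {implied_audio_seconds / 3600:.1f} hours")
print(f"Realtime factor: {realtime_factor(processing_seconds, implied_audio_seconds):.2f}")
```

A realtime factor below 1 means audio is produced faster than it takes to listen to it.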

Quick Start ⭐

  1. Create a new Conda environment:

    conda create -n auralis_env python=3.10 -y

  2. Activate the environment:

    conda activate auralis_env

  3. Install Auralis:

    pip install auralis

Then try it out in Python:

from auralis import TTS, TTSRequest

# Initialize
tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

# Generate speech
request = TTSRequest(
    text="Hello Earth! This is Auralis speaking.",
    speaker_files=['reference.wav']
)

output = tts.generate_speech(request)
output.save('hello.wav')

Or use the CLI to start the OpenAI-compatible server:

auralis.openai --host 127.0.0.1 --port 8000 --model AstraMindAI/xttsv2 --gpt_model AstraMindAI/xtts2-gpt --max_concurrency 8 --vllm_logging_level warn  

You can see here for a more in-depth explanation or try it out with this example

Key Features 🛸

Speed & Efficiency

  • Processes long texts rapidly using smart batching
  • Runs on consumer GPUs without memory issues
  • Handles multiple requests in parallel

Easy Integration

  • Simple Python API
  • Streaming support for long texts
  • Built-in audio enhancement
  • Automatic language detection

Audio Quality

  • Voice cloning from short samples
  • Background noise reduction
  • Speech clarity enhancement
  • Volume normalization

XTTSv2 Finetunes

You can use your own XTTSv2 finetunes by converting them from the standard Coqui checkpoint format to our safetensors format. Use this script:

python checkpoint_converter.py path/to/checkpoint.pth --output_dir path/to/output

This creates two folders: one with the core XTTSv2 checkpoint and one with the GPT-2 component. Then create a TTS instance with:

tts = TTS().from_pretrained("some/core-xttsv2_model", gpt_model='some/xttsv2-gpt_model')

Examples & Usage 🚀

Basic Examples ⭐

Simple Text Generation

from auralis import TTS, TTSRequest

# Initialize
tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')
# Basic generation
request = TTSRequest(
    text="Hello Earth! This is Auralis speaking.",
    speaker_files=["speaker.wav"]
)
output = tts.generate_speech(request)
output.save("hello.wav")

Working with TTSRequest 🎀

from auralis import TTSRequest, AudioPreprocessingConfig

# Basic request
request = TTSRequest(
    text="Hello world!",
    speaker_files=["speaker.wav"]
)

# Enhanced audio processing
request = TTSRequest(
    text="Pristine audio quality",
    speaker_files=["speaker.wav"],
    audio_config=AudioPreprocessingConfig(
        normalize=True,
        trim_silence=True,
        enhance_speech=True,
        enhance_amount=1.5
    )
)

# Language-specific request
request = TTSRequest(
    text="Bonjour le monde!",
    speaker_files=["speaker.wav"],
    language="fr"
)

# Streaming configuration
request = TTSRequest(
    text="Very long text...",
    speaker_files=["speaker.wav"],
    stream=True,
)

# Generation parameters
request = TTSRequest(
    text="Creative variations",
    speaker_files=["speaker.wav"],
    temperature=0.8,
    top_p=0.9,
    top_k=50
)
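
For readers unfamiliar with temperature, top_p and top_k, the sketch below shows how these three parameters typically shape token sampling in an autoregressive model. It is a generic, standard-library illustration, not Auralis's actual sampler; `sample_token` is a hypothetical name, and the defaults mirror the TTSRequest defaults listed in Core Classes.

```python
import math
import random


def sample_token(logits, temperature=0.75, top_k=50, top_p=0.85, rng=random):
    """Pick one token index using temperature, top-k and top-p (nucleus) filtering."""
    # Temperature rescales the logits: lower values sharpen the distribution.
    # (temperature must be > 0 in this sketch.)
    scaled = [logit / temperature for logit in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # top_k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Draw from the surviving tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_token([2.0, 1.0, 0.5, -1.0]))
```

Higher temperature and larger top_p or top_k let less likely tokens through, which tends to give more varied output; lower values make generation more conservative.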

Working with TTSOutput 🎧

from auralis import TTSOutput

# Load audio file
output = TTSOutput.from_file("input.wav")

# Format conversion
output.bit_depth = 32
output.channel = 2
tensor_audio = output.to_tensor()
audio_bytes = output.to_bytes()

# Audio processing
resampled = output.resample(target_sr=44100)
faster = output.change_speed(1.5)
num_samples, sample_rate, duration = output.get_info()

# Combine multiple outputs
combined = TTSOutput.combine_outputs([output1, output2, output3])

# Playback and saving
output.play()  # Play audio
output.preview()  # Smart playback (Jupyter/system)
output.save("processed.wav", sample_rate=44100)
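
The values returned by get_info, and the effect of resample and change_speed, follow from simple arithmetic. The sketch below only illustrates those relationships; the helper names are hypothetical and do not reproduce how TTSOutput implements them.

```python
def audio_info(num_samples: int, sample_rate: int) -> tuple[int, int, float]:
    """Return (num_samples, sample_rate, duration in seconds)."""
    return num_samples, sample_rate, num_samples / sample_rate


def resampled_length(num_samples: int, source_sr: int, target_sr: int) -> int:
    """Sample count after resampling; the duration itself is unchanged."""
    return round(num_samples * target_sr / source_sr)


def duration_after_speed_change(duration: float, speed: float) -> float:
    """Playing back at `speed` times the original rate divides the duration."""
    return duration / speed


samples, rate, seconds = audio_info(num_samples=66150, sample_rate=22050)
print(f"{seconds} s at {rate} Hz")
print(f"{resampled_length(samples, rate, 44100)} samples at 44100 Hz")
print(f"{duration_after_speed_change(seconds, 1.5)} s at 1.5x speed")
```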

Synchronous Advanced Examples 🌟

Batch Text Processing

# Process multiple texts with same voice
texts = ["First paragraph.", "Second paragraph.", "Third paragraph."]
requests = [
    TTSRequest(
        text=text,
        speaker_files=["speaker.wav"]
    ) for text in texts
]

# Sequential processing with progress
outputs = []
for i, req in enumerate(requests, 1):
    print(f"Processing text {i}/{len(requests)}")
    outputs.append(tts.generate_speech(req))

# Combine all outputs
combined = TTSOutput.combine_outputs(outputs)
combined.save("combined_output.wav")

Book Chapter Processing

def process_book(chapter_file: str, speaker_file: str):
    # Read the chapter text
    with open(chapter_file, 'r', encoding='utf-8') as f:
        chapter = f.read()

    # You can pass the whole book; Auralis takes care of splitting it
    request = TTSRequest(
        text=chapter,
        speaker_files=[speaker_file],
        audio_config=AudioPreprocessingConfig(
            enhance_speech=True,
            normalize=True
        )
    )

    # Uses the `tts` instance initialized earlier
    output = tts.generate_speech(request)

    output.play()
    output.save("chapter_output.wav")
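
To give an idea of what splitting a long text involves, here is a minimal sentence-based chunker. It is an assumption-laden sketch, not the splitter Auralis uses internally; `split_into_chunks` and `max_chars` are illustrative names.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks, aiming for at most max_chars each."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", max_chars=9))
```

A single sentence longer than max_chars is kept whole in this sketch rather than being cut mid-sentence.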

Asynchronous Examples 🛸

Basic Async Generation

import asyncio
from auralis import TTS, TTSRequest

async def generate_speech():
    tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')
    
    request = TTSRequest(
        text="Async generation example",
        speaker_files=["speaker.wav"]
    )
    
    output = await tts.generate_speech_async(request)
    output.save("async_output.wav")

asyncio.run(generate_speech())

Parallel Processing

from auralis import TTSOutput

async def generate_parallel():
    tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')
    
    # Create multiple requests
    requests = [
        TTSRequest(
            text=f"This is voice {i}",
            speaker_files=[f"speaker_{i}.wav"]
        ) for i in range(3)
    ]
    
    # Process in parallel
    coroutines = [tts.generate_speech_async(req) for req in requests]
    outputs = await asyncio.gather(*coroutines, return_exceptions=True)
    
    # Handle results
    valid_outputs = [
        out for out in outputs 
        if not isinstance(out, Exception)
    ]
    
    combined = TTSOutput.combine_outputs(valid_outputs)
    combined.save("parallel_output.wav")

asyncio.run(generate_parallel())
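
The example above silently drops failed requests. The following sketch shows one way to keep track of which inputs failed and to cap the number of requests in flight. It uses `fake_generate` as a hypothetical stand-in for tts.generate_speech_async so that it runs without a model; `max_in_flight` is likewise an illustrative parameter.

```python
import asyncio


async def fake_generate(text: str) -> str:
    """Hypothetical stand-in for tts.generate_speech_async."""
    await asyncio.sleep(0.01)
    if not text:
        raise ValueError("empty text")
    return f"audio<{text}>"


async def generate_all(texts, max_in_flight=2):
    # The semaphore caps how many requests run at the same time.
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(text):
        async with semaphore:
            return await fake_generate(text)

    # return_exceptions=True returns exceptions as results instead of raising.
    results = await asyncio.gather(*(guarded(t) for t in texts), return_exceptions=True)

    outputs, failures = [], []
    for text, result in zip(texts, results):
        if isinstance(result, Exception):
            failures.append((text, result))
        else:
            outputs.append(result)
    return outputs, failures


outputs, failures = asyncio.run(generate_all(["one", "", "three"]))
print(outputs)
print(failures)
```

Because asyncio.gather returns results in input order, zipping them with the inputs is enough to tell which request failed.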

Async Streaming with Multiple Requests

async def stream_multiple_texts():
    tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')
    
    # Prepare streaming requests
    texts = [
        "First long text...",
        "Second long text...",
        "Third long text..."
    ]
    
    requests = [
        TTSRequest(
            text=text,
            speaker_files=["speaker.wav"],
            stream=True,
        ) for text in texts
    ]
    
    # Process streams in parallel
    coroutines = [tts.generate_speech_async(req) for req in requests]
    streams = await asyncio.gather(*coroutines)
    
    # Collect outputs
    output_container = {i: [] for i in range(len(requests))}
    
    async def process_stream(idx, stream):
        async for chunk in stream:
            output_container[idx].append(chunk)
            print(f"Processed chunk for text {idx+1}")
            
    # Process all streams
    await asyncio.gather(
        *(process_stream(i, stream) 
          for i, stream in enumerate(streams))
    )
    
    # Save results
    for idx, chunks in output_container.items():
        TTSOutput.combine_outputs(chunks).save(
            f"text_{idx}_output.wav"
        )

asyncio.run(stream_multiple_texts())
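
The per-index container in the example above matters because chunks from different streams can arrive interleaved, while chunks within one stream stay in order. The sketch below demonstrates this with `fake_stream`, a hypothetical async generator standing in for a streaming response.

```python
import asyncio


async def fake_stream(label: str, n_chunks: int = 3):
    """Hypothetical stand-in for a streaming response: yields chunks one at a time."""
    for i in range(n_chunks):
        await asyncio.sleep(0.01)
        yield f"{label}-chunk{i}"


async def consume(label, stream, events):
    # Record every chunk in arrival order, tagged with its stream label.
    async for chunk in stream:
        events.append((label, chunk))


async def main():
    events = []
    await asyncio.gather(
        consume("a", fake_stream("a"), events),
        consume("b", fake_stream("b"), events),
    )
    return events


events = asyncio.run(main())
for label, chunk in events:
    print(label, chunk)
```

Grouping the recorded events by label recovers each stream's chunks in their original order, which is what makes combining them per request safe.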

Core Classes 🌟

TTSRequest - Unified request container with audio enhancement 🎀

@dataclass
class TTSRequest:
    """Container for TTS inference request data"""
    # Request metadata
    text: Union[AsyncGenerator[str, None], str, List[str]]

    speaker_files: Union[List[str], bytes]  # Paths to the speaker audio files, or raw audio bytes

    enhance_speech: bool = True
    audio_config: AudioPreprocessingConfig = field(default_factory=AudioPreprocessingConfig)
    language: SupportedLanguages = "auto"
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    load_sample_rate: int = 22050
    sound_norm_refs: bool = False

    # Voice conditioning parameters
    max_ref_length: int = 60
    gpt_cond_len: int = 30
    gpt_cond_chunk_len: int = 4

    # Generation parameters
    stream: bool = False
    temperature: float = 0.75
    top_p: float = 0.85
    top_k: int = 50
    repetition_penalty: float = 5.0
    length_penalty: float = 1.0
    do_sample: bool = True

Examples

# Basic usage
request = TTSRequest(
    text="Hello world!",
    speaker_files=["reference.wav"]
)

# With custom audio enhancement
request = TTSRequest(
    text="Hello world!",
    speaker_files=["reference.wav"],
    audio_config=AudioPreprocessingConfig(
        normalize=True,
        trim_silence=True,
        enhance_speech=True,
        enhance_amount=1.5
    )
)

# Streaming long text
request = TTSRequest(
    text="Very long text...",
    speaker_files=["reference.wav"],
    stream=True,
)

Features

  • Automatic language detection
  • Audio preprocessing & enhancement
  • Flexible input handling (strings, lists, generators)
  • Configurable generation parameters
  • Caching for efficient processing

TTSOutput - Unified output container for audio processing 🎧

@dataclass
class TTSOutput:
    array: np.ndarray
    sample_rate: int

Methods

Format Conversion

output.to_tensor()      # → torch.Tensor
output.to_bytes()       # → bytes (wav/raw)
output.from_tensor()    # → TTSOutput
output.from_file()      # → TTSOutput

Audio Processing

output.combine_outputs()  # Combine multiple outputs
output.resample()        # Change sample rate
output.get_info()        # Get audio properties
output.change_speed()    # Modify playback speed

File & Playback

output.save()           # Save to file
output.play()          # Play audio
output.display()       # Show in Jupyter
output.preview()       # Smart playback

Examples

# Load and process
output = TTSOutput.from_file("input.wav")
output = output.resample(target_sr=44100)
output.save("output.wav")

# Combine multiple outputs
combined = TTSOutput.combine_outputs([output1, output2, output3])

# Change playback speed
faster = output.change_speed(1.5)

Languages 🌍

XTTSv2 supports: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Simplified), Hungarian, Korean, Japanese, Hindi
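
When setting the language field explicitly, a name-to-code lookup can be handy. The codes below are the ones published for the Coqui XTTSv2 model; that Auralis accepts exactly these strings is an assumption, since only "fr" and the default "auto" appear in the examples above. `XTTS_LANGUAGE_CODES` and `language_code` are illustrative names.

```python
# Language codes as published for the Coqui XTTSv2 model.
# Assumption: Auralis accepts the same codes for TTSRequest.language.
XTTS_LANGUAGE_CODES = {
    "English": "en", "Spanish": "es", "French": "fr", "German": "de",
    "Italian": "it", "Portuguese": "pt", "Polish": "pl", "Turkish": "tr",
    "Russian": "ru", "Dutch": "nl", "Czech": "cs", "Arabic": "ar",
    "Chinese (Simplified)": "zh-cn", "Hungarian": "hu", "Korean": "ko",
    "Japanese": "ja", "Hindi": "hi",
}


def language_code(name: str, default: str = "auto") -> str:
    """Look up a code by language name, falling back to automatic detection."""
    return XTTS_LANGUAGE_CODES.get(name, default)


print(language_code("French"))
print(language_code("Klingon"))
```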

Performance Details 📊

Processing speeds on an NVIDIA RTX 3090:

  • Short phrases (< 100 chars): ~1 second
  • Medium texts (< 1000 chars): ~5-10 seconds
  • Full books (~500K chars @ concurrency 36): ~10 minutes

Memory usage:

  • Base: ~2.5 GB VRAM at concurrency = 1
  • ~5.3 GB VRAM at concurrency = 20
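
As a rough planning aid, the figures above can be turned into a throughput number and a VRAM estimate for other concurrency levels. The sketch assumes VRAM grows linearly between the two reported points, which the measurements do not confirm; treat the result as a ballpark, not a guarantee. Function names are illustrative.

```python
def chars_per_second(num_chars: int, seconds: float) -> float:
    """Average throughput in characters per second."""
    return num_chars / seconds


def estimate_vram_gb(concurrency: int, low=(1, 2.5), high=(20, 5.3)) -> float:
    """Linear estimate through the two reported (concurrency, GB) points.

    Assumption: memory grows linearly with concurrency.
    """
    (c0, v0), (c1, v1) = low, high
    slope = (v1 - v0) / (c1 - c0)
    return v0 + slope * (concurrency - c0)


# ~500K characters in ~10 minutes, as reported for a full book.
print(f"{chars_per_second(500_000, 10 * 60):.0f} chars/s")
print(f"{estimate_vram_gb(8):.1f} GB estimated at concurrency = 8")
```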

Gradio

Gradio code

Contributions

Join Our Community!

We welcome and appreciate any contributions to our project! To ensure a smooth and efficient process, please take a moment to review our Contribution Guideline. By following these guidelines, you'll help us review and accept your contribution quickly. Thank you for your support!

Learn More 🔭

License

The codebase is released under Apache 2.0; feel free to use it in your projects.

The XTTSv2 model (and the files under auralis/models/xttsv2/components/tts) are licensed under the Coqui AI License.