Live Translator - Documentation

⚙

Installation

Set up the project locally with Docker, Redis, and LibreTranslate in minutes.

▦

Architecture

Understand the STT → Translation → TTS pipeline and real-time Socket.io communication.

▶

Live Translation

Stream from YouTube or microphone with automatic EN/RU/UK language detection and voice output.

📜

Biblical Simulator

Test the full pipeline with AI-generated biblical passages in King James, Church Slavonic, or Ukrainian style.

🎤

Voice Training

Clone custom voices from microphone recordings or YouTube videos using ElevenLabs IVC.

💡

Looking for examples?

Check the Quickstart guide for a complete walkthrough, or jump to the API Reference for endpoint documentation.

Prerequisites

●

Node.js 20+ Runtime for backend and build tools

●

Docker + Docker Compose For Redis and LibreTranslate services

●

yt-dlp + ffmpeg Required for YouTube audio extraction

●

ElevenLabs API Key For speech-to-text and text-to-speech

Clone & Configure

Terminal

git clone https://github.com/Pzharyuk/live-translator-node.git && cd live-translator-node
cp .env.example .env

Edit .env and set your API key:

.env

ELEVENLABS_API_KEY=sk-your-key-here
ADMIN_PASSWORD=your-secure-password

Start Infrastructure

Terminal

# Start Redis + LibreTranslate
docker compose -f docker-compose.local.yml up -d

# Wait for LibreTranslate to download language models (~500 MB)
docker logs -f $(docker ps -qf "name=libretranslate") 2>&1 | grep -i "running"

Start Backend

Terminal

cd backend
npm install
npm run dev  # nodemon watches for changes

Start Frontend

Terminal

cd frontend
npm install
npm run dev  # Vite hot-reload on localhost:5173

✓

You're all set!

Open http://localhost:5173 — log in with user / changeme and you will be redirected to /translate. Admin panel: http://localhost:5173/admin (admin password: admin123).

1

Start the services

Follow the Installation guide to get Docker services, backend, and frontend running.

2

Open the Admin Panel

Navigate to http://localhost:5173/admin and enter the admin password.

3

Select a Voice

Choose a TTS voice from the dropdown. The voice list is fetched from your ElevenLabs account.

4

Test with Text

Use the free-text area in the admin panel to type a phrase. Click translate to hear the TTS output instantly.

5

Go Live

Open the user view at http://localhost:5173/translate. Select "Mic" as input, pick a voice, and click Start. Speak into your microphone and watch real-time translation appear with audio playback.

💡

Try the Biblical Simulator

For a hands-free demo, enable the biblical_simulator feature flag in admin, enter an Anthropic API key, select a language, and click "Generate". The system will produce biblical passages through the full STT → Translation → TTS pipeline.

System Overview

Frontend

React 19 + Vite

Socket.io Client

Web Audio API

↔

Backend

Express + Socket.io

TypeScript

Service Layer

ElevenLabs

Scribe v2 (STT)

TTS Streaming

Voice Cloning

Translation

Google Translate (Cloud API)

LibreTranslate (self-hosted)

DeepL (premium API)

Claude / Anthropic (AI)

Redis

Feature Flags

Settings Store

Google Gemini

Biblical Simulator

Sermon Generation

Voice Training Text

DeepL

Free & Pro tiers

Auto endpoint detection

Data Flow

1 Audio Input (Mic / YouTube / Simulator)

↓

2 PCM 16-bit LE @ 16kHz via Socket.io chunks

↓

3 ElevenLabs Scribe v2 WebSocket STT

↓

4 Commit Merge Buffer 2.5s VAD aggregation

↓

5 Translation Provider Google / LibreTranslate / DeepL / Claude

↓

6 ElevenLabs TTS Voice synthesis streaming

↓

7 Audio Playback Queued with 600ms pause

Key Architecture Decisions

Two-layer Language Detection

LibreTranslate's /detect endpoint returns 0-confidence for short Cyrillic phrases. The app uses script-based pre-detection (Unicode 0x0400–0x04FF = Cyrillic) combined with ElevenLabs Scribe's language_code output for reliable EN/RU/UK auto-detection.

VAD Commit Merging

Voice Activity Detection can fire aggressively on speaker breathing. Commits are buffered for 2.5 seconds before translation to merge fragments into meaningful phrases.

Feature Flag Merging

YAML config defaults are merged with Redis runtime overrides. Redis values take priority, falling back to YAML if Redis is unavailable.

API Key Hierarchy

Keys resolve in order: Runtime Cache → Redis → Config File → Empty. This allows hot-swapping keys without restarts.

Connection Lifecycle

Client sends start_session with source type (mic or youtube) and optional voiceId
Backend opens a WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime
For YouTube: spawns yt-dlp | ffmpeg child processes to extract PCM audio
For Microphone: awaits audio_chunk events from the frontend

Audio Streaming

Audio chunks are sent to Scribe as JSON messages:

WebSocket Message

{
  "message_type": "input_audio_chunk",
  "audio_base_64": "UklGR..."  // PCM 16-bit LE, 16kHz, mono
}

Scribe Responses

Response Type	Meaning	Action
`partial_transcript`	Live partial text (speculative)	Emitted as non-final `transcript` event
`committed_transcript`	VAD fired — complete phrase	Buffered for commit merge window

Commit Merge Buffer

After receiving a committed_transcript, the backend waits 2.5 seconds (COMMIT_MERGE_MS) to collect additional commits before translating. This prevents fragmented translations from aggressive VAD.

Stability Timeout

If VAD stalls (no new commits), a 3.5 second fallback timer (STABILITY_TIMEOUT_MS) fires to translate whatever new text has accumulated, preventing indefinite silence.

Text Validation

Before translation, text is validated against EN/RU/UK character regex patterns. This filters out hallucinated text from the STT model (common with silence or background noise).

Provider Chain

The system supports three translation providers with automatic fallback:

Default

LibreTranslate

Self-hosted, no API key required. Runs in Docker alongside the app. Best for privacy and cost.

Premium

DeepL

High-quality translations. Supports both free and paid API tiers. Auto-detects endpoint.

AI

Claude

Anthropic's Claude for context-aware translations. Uses claude-haiku-4-5 for speed.

Fallback Logic

Provider Resolution

1. Try primary provider (admin-selected)
2. If primary fails → try configured fallback
3. If fallback fails → try LibreTranslate (last resort)
4. If all fail → emit error event

Language Detection

The app uses a two-layer auto-detection approach:

Layer 1: Script-based Pre-detection

Before calling any translation API, the backend checks Unicode character scripts:

Cyrillic characters (Unicode 0x0400–0x04FF) → if >50% of matched letters are Cyrillic, detected as Russian
Latin characters → detected as English
This avoids low-confidence results from LibreTranslate's /detect endpoint on short text

Layer 2: STT Language Code

When the auto_language_detect flag is enabled, ElevenLabs Scribe returns a language_code with each transcript commit. The backend uses this to correctly route EN/RU/UK without relying solely on script detection.

Note: For LibreTranslate, both Russian and Ukrainian Cyrillic text is passed with source ru since LibreTranslate handles Ukrainian text acceptably via the Russian model. DeepL and Claude providers distinguish Ukrainian natively and handle uk as a proper source language.

Language Gating

Detected languages are checked against the admin-approved pool. If a detected language isn't in the allowed set, the translation is rejected to prevent hallucinated language outputs.

TTS Pipeline

After translation, the text is sent to ElevenLabs TTS:

TypeScript

const stream = await client.textToSpeech.stream(voiceId, {
  text: translatedText,
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.75,
    style: 0.0,
    speed: 1.0,
    use_speaker_boost: true
  }
});

Audio Delivery

TTS audio is streamed to a Buffer, then emitted as a base64-encoded MP3 via the tts_audio Socket.io event.

Frontend Playback Queue

The frontend maintains an audio queue to prevent overlapping playback:

Received tts_audio events are queued
Each segment plays to completion before the next starts
A configurable pause (600ms default) is inserted between segments
The pause duration is controlled by tts_segment_pause_ms (adjustable in admin)

Microphone Input

User selects "Mic" tab and chooses a TTS voice
Browser captures audio via Web Audio API's ScriptProcessor
PCM 16-bit LE at 16kHz sample rate sent to backend via Socket.io
Backend pipes audio to ElevenLabs Scribe v2 Realtime WebSocket
Language auto-detected (EN/RU/UK), text translated and synthesized
TTS audio returned and played back with inter-segment pauses

YouTube Input

User pastes a YouTube URL (live stream or video)
Backend spawns yt-dlp | ffmpeg child processes
Audio extracted as PCM stream (16kHz, 16-bit LE, mono)
Piped to Scribe v2, same pipeline as microphone
Stream ends when YouTube content ends or user stops

User Interface

The user view features a dark cavern theme with:

Waveform visualizer — Canvas-based bar chart with orange gradient and cyan tips
Transcript display — White translated text scrolls upward with fade masks
Partial transcript — Shown in italic orange while STT is processing
Source tabs — Toggle between Mic and YouTube (controlled by feature flags)

How It Works

The backend uses yt-dlp and ffmpeg as child processes to extract audio from YouTube URLs:

Pipeline

yt-dlp (best audio) → ffmpeg (PCM 16kHz 16-bit LE mono) → Scribe v2

Supported Sources

Live streams — Translates in real-time as the stream progresses
Regular videos — Processes the full audio track
Any URL supported by yt-dlp (YouTube, etc.)

Requirements

Both yt-dlp and ffmpeg must be installed and available in the system PATH. On macOS:

Terminal

brew install yt-dlp ffmpeg

⚠

Feature Flag Required

YouTube input is controlled by the youtube_input feature flag. Enable it in the admin panel to show the YouTube tab in the user view.

Overview

The Biblical Transcript Simulator is an admin-only feature that generates biblical text passages using Google's Gemini API (gemini-2.5-flash), then routes them through the full translation pipeline. This provides a hands-free way to test STT → Translation → TTS without a live audio source.

Language Styles

Language	Style	Example
`en`	King James English	"In the beginning was the Word..."
`ru`	Church Slavonic Russian	"В начале было Слово..."
`uk`	Traditional Ukrainian	"На початку було Слово..."

Flow

Admin selects language (EN/RU/UK)
Backend calls Gemini 2.5 Flash with streaming
Gemini generates 6-8 biblical passages, 3-5 sentences each
Stream is buffered until 140+ characters AND complete sentences
Chunks emitted with 1800ms smooth pacing between them
Each chunk flows through the standard pipeline:
- Emitted as transcript (isFinal: true)
- Auto-translated via configured provider
- TTS synthesized and audio returned
Frontend plays audio with standard inter-segment pause

💡

Feature Flag

Enable biblical_simulator in the admin feature flags panel. The Gemini API key is configured via the GEMINI_API_KEY environment variable or set at runtime in the admin API Keys panel.

Overview

Voice Training uses ElevenLabs' Instant Voice Cloning (IVC) API to create custom voices from audio samples. Once cloned, the voice appears in the voice selector immediately.

From Microphone

Open the Voice Training section in the admin panel
Click Generate Text to get an AI-generated reading passage (via Gemini) — gives the speaker natural, phonetically diverse text to read aloud
Record multiple audio clips using your browser microphone while reading the generated text
Provide a name for the voice
Clips are uploaded to ElevenLabs IVC API
Cloned voice is available for TTS immediately
Click Preview Voice to hear the cloned voice speak a sample sentence via TTS

From YouTube

Paste a YouTube URL in the Voice Training section
Backend extracts N × 30-second clips via yt-dlp + ffmpeg
Clips are uploaded to ElevenLabs IVC API
Resulting voice is stored in your ElevenLabs account

⚠

ElevenLabs Account

Cloned voices are stored in your ElevenLabs account, not locally. Ensure your plan supports voice cloning.

Concepts

Concept	Description
Active Language Pair	The current pair used for translation (e.g., EN ↔ RU, EN ↔ UK, or RU ↔ UK). Set by admin.
Available Languages	The pool of languages viewers can select from (if `user_language_selector` is enabled).

Admin Controls

Change the active language pair via the admin panel
Changes broadcast to all connected clients in real-time
Manage the available languages pool for viewer selection

Viewer Selection

When the user_language_selector feature flag is enabled, viewers can override the admin-set language pair by selecting their own preferred languages from the available pool.

Overview

Two people can video call each other through the app, each speaking their own language. The app transcribes, translates, and synthesizes speech in real-time so each participant hears the other in their language.

Feature flag: Video call is gated behind the video_translation flag. Enable it in the admin panel or set video_translation: true in your YAML config.

How It Works

Create a room — Person A selects their language, picks a TTS voice, and clicks "Create Room". A 6-character room code is generated.
Share the code — Person A shares the room code with Person B (copy button provided).
Join the room — Person B enters the code, selects their language and TTS voice, and clicks "Join".
WebRTC connection — The app establishes a peer-to-peer video connection via WebRTC (signaled through Socket.io). Video flows directly between browsers.
Audio translation — Each participant's microphone audio is simultaneously:
- Sent to the peer via WebRTC (but muted on their end)
- Captured as PCM chunks and sent to the backend via Socket.io for STT
Translation pipeline — Each participant has their own independent Scribe STT session. Transcribed text is translated to the other participant's language, then synthesized via ElevenLabs TTS and sent back to the peer.
Playback — The peer hears the TTS translation instead of the raw audio. Translated transcript is displayed below the video.

Architecture

Person A (Browser)         Server              Person B (Browser)
├─ getUserMedia            ├─ Socket.io         ├─ getUserMedia
├─ WebRTC P2P ═══video═══►│  (signaling)  ◄═══ ├─ WebRTC P2P
│                          │                    │
├─ PCM chunks ──Socket.io─►├─ ScribeA(STT)      │
│                          │  ↓ translate       │
│                          │  ↓ TTS ───────────►├─ Plays TTS
│                          │                    │
│  Plays TTS ◄─────────────├─ ScribeB(STT) ◄───├─ PCM chunks
│  (remote video muted)    │  ↓ translate       │  (remote video muted)
└──────────────────────────┴────────────────────┘

Socket Events

Event	Direction	Purpose
`video_create_room`	C→S	Create a new room with language + voice
`video_room_created`	S→C	Returns the 6-char room code
`video_join_room`	C→S	Join an existing room
`video_room_joined`	S→C	Sent to both participants, triggers WebRTC
`video_signal_offer/answer/ice`	C↔S	WebRTC signaling relay
`video_audio_chunk`	C→S	PCM audio for STT processing
`video_transcript`	S→C	Transcript sent to the speaker
`video_translation`	S→C	Translation sent to the listener
`video_tts_audio`	S→C	TTS audio sent to the listener
`video_leave_room`	C→S	Leave the room
`video_room_closed`	S→C	Notify peer when other leaves

Room Lifecycle

Rooms are stored in Redis with key video_room:{code} and a 4-hour TTL
Maximum 2 participants per room
When one participant disconnects, the other is notified and the call ends
Scribe sessions are automatically cleaned up on disconnect

The Mac Audio Agent has moved to its own public repository:

github.com/Pzharyuk/live-translator-agent

It is a lightweight Node.js daemon that runs as a macOS LaunchAgent and streams microphone audio to the live-translator backend via Socket.io — eliminating the need to open a browser for the Remote Audio Source role.

Feature Flags

Feature flags are stored in config/application.yaml and can be overridden at runtime via Redis. The admin dashboard merges YAML defaults with Redis overrides, allowing dynamic feature control without restarting the server. All connected clients receive merged flags via the feature_flags socket event on connection and whenever a flag is toggled.

Flag	Default	Description
`youtube_input`	true	Enable YouTube URL input for live streams and personal sessions.
`mic_input`	true	Enable microphone audio capture for transcription.
`auto_language_detect`	true	Automatically detect source language before translation.
`user_language_selector`	false	Allow viewers to select their language pair from the available pool.
`audio_device_selector`	true	Show audio device selection in the broadcast setup UI.
`video_translation`	true	Enable the `/video` route for peer-to-peer video call translation.
`video_voice_cloning`	false	Show the Clone Voice button in the video call lobby (premium feature).
`remote_audio_source`	false	Enable the `/audio-source` route for headless remote audio relay agents.
`agent_audio_source`	false	Show connected agent audio sources section in the admin broadcast panel.
`broadcast`	false	Enable the `/broadcast` route for public broadcast receiver.
`translate`	false	Enable the `/translate` route for live translator sessions.

Runtime Storage & API

Feature flags are persisted in Redis under the key prefix flag:{flagName}. When the admin changes a flag via the dashboard, it is stored in Redis and immediately broadcast to all connected clients via the feature_flags socket event. YAML defaults are merged server-side; if a flag is not set in Redis, the YAML value is used. This allows gradual rollout, A⁄B testing, and real-time feature toggles across all clients without downtime.

Admin API Endpoints

GET /admin/flags
  → { "flags": { "youtube_input": true, "broadcast": false, ... } }

POST /admin/flags/:flag
  Body: { "value": boolean }
  → { "flag": "broadcast", "value": true }
  (broadcasts updated flags to all connected clients)

File Structure

File	Purpose
`config/application.yaml`	Base defaults for all environments
`config/application-local.yaml`	Local development overrides (localhost URLs)
`config/application-prod.yaml`	Production overrides (Docker service names)

The APP_ENV environment variable (local or prod) determines which overlay file is loaded on top of the base config.

Full Configuration Reference

config/application.yaml

server:
  port: 3001
  cors_origin: "http://localhost:5173"

elevenlabs:
  api_key: "${ELEVENLABS_API_KEY}"
  default_voice_id: "kxj9qk6u5PfI0ITgJwO0"
  tts_model: "eleven_multilingual_v2"
  tts_settings:
    stability: 0.5
    similarity_boost: 0.75
    style: 0.0
    speed: 1.0
    use_speaker_boost: true
  stt_model: "scribe_v2"

anthropic:
  api_key: "${ANTHROPIC_API_KEY}"

deepl:
  api_key: "${DEEPL_API_KEY}"

libretranslate:
  url: "http://libretranslate:5000"
  api_key: ""

redis:
  host: "redis"
  port: 6379
  password: ""

feature_flags:
  youtube_input: true
  mic_input: true
  auto_language_detect: true
  user_language_selector: false
  audio_device_selector: true
  video_translation: false
  video_voice_cloning: false
  broadcast: false

audio:
  sample_rate: 16000
  channels: 1
  chunk_duration_ms: 250

translation:
  source_lang: "auto"
  target_lang_en: "en"
  target_lang_ru: "ru"
  provider: "libretranslate"
  fallback: "libretranslate"

Environment Variable Interpolation

YAML values using ${VAR_NAME} syntax are automatically replaced with the corresponding environment variable at startup.

Variable	Required	Default	Description
`ELEVENLABS_API_KEY`	Yes	—	ElevenLabs API key for text-to-speech & speech-to-text services.
`ELEVENLABS_VOICE_ID`	No	`JBFqnCBsd6RMkjVDRZzb`	Default voice ID for TTS (overridden by `elevenlabs.default_voice_id` in application.yaml).
`ANTHROPIC_API_KEY`	No	—	Anthropic API key for sermon generation & Claude translation provider; can be set in Admin UI.
`GEMINI_API_KEY`	No	—	Google Gemini API key for biblical simulator & sermon generation (cheaper than Anthropic).
`GOOGLE_TRANSLATE_API_KEY`	No	—	Google Translate API key; required if using Google as translation provider.
`DEEPL_API_KEY`	No	—	DeepL API key; required if using DeepL as translation provider.
`YOUTUBE_API_KEY`	No	—	YouTube Data API v3 key for live stream lookup; can be set in Admin UI.
`YOUTUBE_CHANNEL_ID`	No	—	Default YouTube channel ID to search for live streams; can be overridden in Admin UI.
`TRANSLATION_PROVIDER`	No	`google`	Translation provider boot default: `google` \| `deepl` \| `claude` \| `libretranslate`; can be changed in Admin UI.
`APP_ENV`	No	`local`	Application environment: `local` (development) or `prod` (Docker).
`FRONTEND_URL`	No	`http://localhost`	Frontend URL for CORS origin; set to your actual domain in production (e.g., `https://translate.example.com`).
`LISTEN_PORT`	No	`80`	Host port the frontend listens on.
`REDIS_PASSWORD`	No	—	Redis authentication password; leave empty if Redis requires no auth.
`LIBRETRANSLATE_API_KEY`	No	—	LibreTranslate API key if your self-hosted instance requires authentication.
`ADMIN_PASSWORD`	No	`admin123`	Legacy socket auth password for admin panel; change in production.
`APP_ADMIN_USERNAME`	No	`admin`	Admin user seeded into database on first boot.
`APP_ADMIN_PASSWORD`	No	`admin123`	Admin user password seeded on first boot; user can change on first login.
`APP_USERNAME`	No	`user`	User-facing login username; change in production.
`APP_PASSWORD`	No	`changeme`	User-facing login password; change in production.
`JWT_SECRET`	No	—	JWT secret for session cookies; generate a strong random string in production (e.g., `openssl rand -hex 32`).
`COOKIE_SECURE`	No	`true`	Enable secure cookies; set to `true` when serving over HTTPS.
`DB_PASSWORD`	No	—	PostgreSQL database password (referenced in application.yaml as `${DB_PASSWORD}`).

Configuration	Required	Default	Description
`server.port`	No	`3001`	Backend server port.
`server.cors_origin`	No	`http://localhost:5183`	CORS origin for frontend requests.
`elevenlabs.default_voice_id`	No	`kxj9qk6u5PfI0ITgJwO0`	Default ElevenLabs voice ID for TTS.
`elevenlabs.tts_model`	No	`eleven_multilingual_v2`	ElevenLabs TTS model identifier.
`elevenlabs.tts_settings.stability`	No	`0.5`	TTS voice stability (0.0–1.0); higher = more consistent.
`elevenlabs.tts_settings.similarity_boost`	No	`0.75`	TTS voice similarity boost (0.0–1.0); higher = closer to original voice.
`elevenlabs.tts_settings.style`	No	`0.0`	TTS style exaggeration (0.0–1.0).
`elevenlabs.tts_settings.speed`	No	`1.0`	TTS playback speed multiplier.
`elevenlabs.tts_settings.use_speaker_boost`	No	`true`	Enable ElevenLabs speaker boost for clearer audio.
`elevenlabs.stt_model`	No	`scribe_v2_realtime`	ElevenLabs speech-to-text model identifier.
`database.host`	No	`postgres`	PostgreSQL database hostname.
`database.port`	No	`5432`	PostgreSQL database port.
`database.username`	No	`translator`	PostgreSQL database user.
`database.database`	No	`translator_db`	PostgreSQL database name.
`database.pool_size`	No	`10`	PostgreSQL connection pool size.
`redis.host`	No	`redis`	Redis hostname.
`redis.port`	No	`6379`	Redis port.
`auth.admin_username`	No	`admin`	Legacy admin username (socket auth).
`auth.admin_password`	No	`admin123`	Legacy admin password (socket auth); change in production.
`auth.session_days`	No	`30`	JWT session expiration in days.
`feature_flags.youtube_input`	No	`true`	Enable YouTube stream input for /translate & /broadcast.
`feature_flags.mic_input`	No	`true`	Enable microphone input for /translate & /broadcast.
`feature_flags.auto_language_detect`	No	`true`	Enable automatic source language detection.
`feature_flags.user_language_selector`	No	`false`	Allow viewers to select language pair from available pool.
`feature_flags.audio_device_selector`	No	`true`	Show audio input device selector on /translate & /broadcast.
`feature_flags.video_translation`	No	`true`	Enable /video route for real-time video call translation.
`feature_flags.video_voice_cloning`	No	`false`	Show Clone Voice button in /video lobby (premium feature).
`feature_flags.remote_audio_source`	No	`false`	Enable /audio-source route for headless remote audio relay.
`feature_flags.agent_audio_source`	No	`false`	Show connected agent audio sources section in admin panel.
`feature_flags.broadcast`	No	`false`	Enable /broadcast route for public live receiver page.
`feature_flags.translate`	No	`false`	Enable /translate route for live translator page.
`audio.sample_rate`	No	`16000`	Audio sample rate in Hz.
`audio.channels`	No	`1`	Number of audio channels (mono = 1, stereo = 2).
`audio.chunk_duration_ms`	No	`250`	Audio chunk duration in milliseconds.
`translation.source_lang`	No	`auto`	Source language for translation (“auto” = auto-detect).
`translation.target_lang_en`	No	`en`	Target language code for English environment.
`translation.target_lang_ru`	No	`ru`	Target language code for Russian environment.
`translation.provider`	No	`google`	Primary translation provider: `google` \| `deepl` \| `claude` \| `libretranslate`.
`translation.fallback`	No	`libretranslate`	Fallback provider when primary fails: `google` \| `deepl` \| `claude` \| `libretranslate` \| `none`.
`translation.translate_workers`	No	`2`	Number of parallel translation workers in Stage 1 of TTS pipeline.
`tts_pipeline.initial_buffer_segments`	No	`2`	Number of translated segments to buffer before starting TTS playback.
`tts_pipeline.low_water_hold_ms`	No	`1500`	Low-water-mark hold time in milliseconds; waits for next segment before emitting audio to eliminate catch-up stalls (set to 0 to disable).

TTS Settings

API Endpoints

GET /admin/tts-settings
Returns current TTS voice & synthesis parameters.

Response:
{
  "settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "speed": 1.0,
    "use_speaker_boost": true
  }
}

POST /admin/tts-settings
Update one or more TTS settings (partial update supported).

Request body:
{
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0.0,
  "speed": 1.0,
  "use_speaker_boost": true
}

Response: Same as GET (returns updated settings)

Settings Table

Setting	Range	Default	Description
`stability`	0.0 – 1.0	0.5	Voice consistency; lower = more variable, higher = more stable.
`similarity_boost`	0.0 – 1.0	0.75	How closely synthesized voice matches the original voice sample.
`style`	0.0 – 1.0	0.0	Exaggeration of voice character; 0 = neutral, higher = more dramatic.
`speed`	0.5 – 2.0	1.0	Playback speed multiplier; <1 slower, >1 faster.
`use_speaker_boost`	`true` \| `false`	`true`	Improve audio clarity by boosting speaker prominence (ElevenLabs feature).

STT Timing Settings

GET /admin/stt-timing
Returns speech-to-text timing & VAD parameters.

Response:
{
  "settings": {
    "commit_merge_ms": 1500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 8000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.0,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}

POST /admin/stt-timing
Update one or more STT timing settings.

Request body: (same structure as response above)

Response: Updated settings

Setting	Range	Default	Description
`commit_merge_ms`	100 – 5000	1500	Buffer VAD commits for this duration before merging & translating (ms).
`stability_timeout_ms`	500 – 5000	2000	Wait this long for partial text to stabilize before translating (ms).
`tts_segment_pause_ms`	0 – 2000	0	Pause between TTS audio segments sent to frontend (ms); frontend uses this value.
`max_accumulation_ms`	2000 – 20000	8000	Force-dispatch new words for translation every this many ms during continuous speech (ms).
`vad_threshold`	0.0 – 1.0	0.5	Voice Activity Detection sensitivity; higher = stricter noise filter.
`vad_silence_threshold_secs`	0.5 – 3.0	1.0	Seconds of silence required before VAD triggers a commit (s).
`min_speech_duration_ms`	50 – 500	100	Ignore speech shorter than this duration (ms).
`min_silence_duration_ms`	50 – 500	100	Minimum silence gap required to split speech segments (ms).
`flush_on_sentence_boundary`	`true` \| `false`	`true`	Split & dispatch at sentence boundaries (.?!) instead of all at once.
`min_chars_before_dispatch`	10 – 200	40	Minimum characters accumulated before a chunk is sent for translation (prevents tiny fragments).

Video Call Settings

GET /admin/video-settings
Returns video call STT/TTS parameters (separate from broadcast settings).

Response:
{
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}

POST /admin/video-settings
Update video call settings.

Request body:
{
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}

Response: Updated settings

Setting	Range	Default	Description
`stability_ms`	100 – 2000	500	Wait this long for stable partial text before translating in video calls (ms).
`commit_merge_ms`	10 – 500	50	Buffer VAD commits for this duration in video calls (ms).
`translation_provider`	`google` \| `deepl` \| `claude` \| `libretranslate`	`claude`	Which translation provider to use for video call real-time translation.

Configuration File Defaults

These settings are defined in config/application.yaml and can be overridden at runtime via the admin API:

Config Key	Value	Description
`elevenlabs.tts_model`	`eleven_multilingual_v2`	ElevenLabs TTS model for broadcast & private sessions.
`elevenlabs.default_voice_id`	`kxj9qk6u5PfI0ITgJwO0`	Default voice when no voice is explicitly selected.
`elevenlabs.stt_model`	`scribe_v2_realtime`	ElevenLabs speech-to-text model (realtime WebSocket endpoint).
`audio.sample_rate`	`16000`	PCM sample rate for Scribe STT input (Hz).
`audio.channels`	`1`	PCM audio channels (1 = mono).
`audio.chunk_duration_ms`	`250`	Duration of each audio chunk sent to STT (ms).
`tts_pipeline.initial_buffer_segments`	`2`	Number of translated segments to buffer before starting TTS playback.
`tts_pipeline.low_water_hold_ms`	`1500`	Hold before emitting audio to ensure next segment is queued (ms); set 0 to disable.

Notes

All settings except tts_model, stt_model, and default_voice_id can be changed at runtime via the admin API.
STT timing settings are sent to ElevenLabs Scribe as WebSocket query parameters on connection.
Video call settings are independent from broadcast settings to allow tuning latency separately.
The tts_segment_pause_ms setting is sent to the frontend so it knows when to pause playback between segments.
Sentence boundary flushing requires flush_on_sentence_boundary: true — disabled by setting to false to dispatch all accumulated text at once.
All runtime settings are persisted to Redis and restored on server restart.

STT Timing Settings

Configure speech-to-text (STT) timing behavior, VAD parameters, and dispatch thresholds for the real-time translation pipeline.

Settings Reference

Setting	Default	Description
`commit_merge_ms`	1500	Buffer VAD commits for this duration (ms) before translating, merging short speech fragments into coherent chunks.
`stability_timeout_ms`	2000	Maximum time (ms) to wait for stable partial transcript before dispatching for translation when text hasn’t changed.
`tts_segment_pause_ms`	0	Pause duration (ms) between consecutive TTS audio segments sent to frontend; 0 = no pause.
`max_accumulation_ms`	8000	Force-dispatch accumulated words for translation after this duration (ms) during continuous speech, even if VAD or stability timers haven’t fired.
`vad_threshold`	0.5	Voice Activity Detection threshold (0–1); higher = stricter noise filter, fewer false positives but may miss quiet speech.
`vad_silence_threshold_secs`	1.0	Seconds of silence required before VAD triggers a commit; lower = snappier response, higher = fewer fragmented commits.
`min_speech_duration_ms`	100	Ignore speech segments shorter than this (ms); filters out brief clicks, pops, and background noise.
`min_silence_duration_ms`	100	Minimum gap (ms) between detected speech segments; prevents false splits caused by momentary dips in audio level.
`flush_on_sentence_boundary`	true	When true, dispatch complete sentences (.?!;) immediately instead of waiting for VAD or stability timeout, enabling eager translation during sermons.
`min_chars_before_dispatch`	40	Minimum characters required before a chunk is dispatched for translation; prevents translation of tiny fragments like “Um” or “Uh”.

Tuning Guide

Snappier response (lower latency): Decrease commit_merge_ms (e.g., 500–800ms), lower vad_silence_threshold_secs (e.g., 0.5s), and reduce min_chars_before_dispatch.
Fewer fragmented chunks: Increase commit_merge_ms (e.g., 2000–3000ms) to buffer multiple VAD commits together, and raise vad_silence_threshold_secs (e.g., 1.5–2.0s).
Continuous speech (sermons): Enable flush_on_sentence_boundary=true to detect and dispatch complete sentences eagerly, preventing long accumulation waits.
Reduce noise: Increase vad_threshold (e.g., 0.6–0.8) to filter background chatter, and raise min_speech_duration_ms (e.g., 150–200ms).
Lower latency during silence: Decrease stability_timeout_ms (e.g., 1000–1500ms) so partial text triggers translation faster when the speaker pauses momentarily.
Prevent tiny translations: Increase min_chars_before_dispatch (e.g., 60–100) to skip single words and focus on meaningful phrases.
Audio gaps during playback: Increase max_accumulation_ms (e.g., 10000–15000ms) to allow more time for translation and TTS to keep ahead of playback.

API Endpoints

GET /admin/stt-timing

Retrieve current STT timing settings.

curl -X GET http://localhost:3001/admin/stt-timing \
  -H "Cookie: session=<jwt_token>" \
  -H "Content-Type: application/json"

// Response
{
  "settings": {
    "commit_merge_ms": 1500,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 8000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 1.0,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}

POST /admin/stt-timing

Update one or more STT timing settings. Only specified fields are modified; omitted fields retain their current values.

curl -X POST http://localhost:3001/admin/stt-timing \
  -H "Cookie: session=<jwt_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "commit_merge_ms": 800,
    "vad_silence_threshold_secs": 0.8,
    "max_accumulation_ms": 10000,
    "flush_on_sentence_boundary": true
  }'

// Response
{
  "settings": {
    "commit_merge_ms": 800,
    "stability_timeout_ms": 2000,
    "tts_segment_pause_ms": 0,
    "max_accumulation_ms": 10000,
    "vad_threshold": 0.5,
    "vad_silence_threshold_secs": 0.8,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "flush_on_sentence_boundary": true,
    "min_chars_before_dispatch": 40
  }
}

Socket Events

When STT timing settings are retrieved or updated, the server emits the new configuration to all connected clients:

socket.on('stt_timing', (data) => {
  console.log('STT timing updated:', data.tts_segment_pause_ms);
  // data.tts_segment_pause_ms — frontend uses this to space out audio playback
});

Runtime Behavior

Settings are read dynamically: Admin changes to STT timing apply immediately to ongoing broadcasts without requiring reconnection.
Three-stage dispatch pipeline:
1. Sentence boundary detection: If flush_on_sentence_boundary=true, complete sentences (.?!;) trigger immediate translation, even during continuous speech.
2. Stability timeout: If partial text remains unchanged for stability_timeout_ms, dispatch it for translation (handles pauses between thoughts).
3. Accumulation timer: If neither sentence boundaries nor stability fire after max_accumulation_ms, force-dispatch accumulated words (handles sermons and long utterances).
VAD commit buffering: Voice Activity Detection commits are buffered for commit_merge_ms before being dispatched, reducing fragmentation from breathing pauses.
Minimum thresholds: Chunks smaller than min_chars_before_dispatch characters are held back and merged with subsequent speech.

Notes

Settings are persisted to Redis and restored on server restart.
All timing values are in milliseconds except vad_silence_threshold_secs and vad_threshold.
The tts_segment_pause_ms setting is sent to the frontend via the stt_timing socket event so viewers can pace audio playback accordingly.
For live broadcasts with human speakers, flush_on_sentence_boundary=true and moderate max_accumulation_ms (8–10s) provide the best balance of responsiveness and chunk coherence.

Authentication: All endpoints require a valid JWT cookie (auth_token) obtained from POST /api/login. Admin or role-based permissions required.

API Keys Management

GET /admin/api-keys

Retrieve status of all configured API keys (names, whether set, last update).

POST /admin/api-keys

Update one or more API keys (elevenlabs, anthropic, deepl, libretranslate, google, youtube).

Body: {
  "elevenlabs": "sk-...",
  "anthropic": "sk-ant-...",
  "deepl": "...",
  "libretranslate": "...",
  "google": "...",
  "youtube": "..."
}

Voice Management

GET /admin/voices

Scan ElevenLabs API for all available voices and log new discoveries.

GET /admin/available-voices

Get the admin-allowed voice IDs pool (null → all voices allowed, array → filtered list).

POST /admin/available-voices

Set the admin-allowed voice IDs pool. Broadcasts to all clients for real-time UI update.

Body: {
  "voiceIds": ["voice_id_1", "voice_id_2", ...]
}

Feature Flags

GET /admin/flags

Retrieve all feature flags (merged from YAML config defaults & Redis overrides).

GET /admin/flags/:flag

Get a single feature flag value by name.

POST /admin/flags/:flag

Set a feature flag and broadcast updated flags to all connected clients via Socket.io.

Body: {
  "value": true
}

TTS & STT Settings

GET /admin/tts-settings

Get current TTS settings (stability, similarity_boost, style, speed, use_speaker_boost).

POST /admin/tts-settings

Update TTS settings partially (persisted to Redis, applied immediately).

Body: {
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0.0,
  "speed": 1.0,
  "use_speaker_boost": true
}

GET /admin/stt-timing

Get STT timing settings (VAD parameters, commit merge delay, stability timeout, accumulation window).

POST /admin/stt-timing

Update STT timing settings (affects speech recognition responsiveness & buffering behavior).

Body: {
  "commit_merge_ms": 1500,
  "stability_timeout_ms": 2000,
  "tts_segment_pause_ms": 0,
  "max_accumulation_ms": 8000,
  "vad_threshold": 0.5,
  "vad_silence_threshold_secs": 1.0,
  "min_speech_duration_ms": 100,
  "min_silence_duration_ms": 100,
  "flush_on_sentence_boundary": true,
  "min_chars_before_dispatch": 40
}

GET /admin/video-settings

Get video call STT/TTS settings (separate from broadcast pipeline).

POST /admin/video-settings

Update video call settings (stability_ms, commit_merge_ms, translation_provider).

Body: {
  "stability_ms": 500,
  "commit_merge_ms": 50,
  "translation_provider": "claude"
}

Languages

GET /admin/languages

Get the current active language pair (e.g., ["en", "ru"]).

POST /admin/languages

Set the active language pair. Must be exactly 2 distinct codes. Broadcasts to all clients.

Body: {
  "languages": ["en", "ru"]
}

GET /admin/available-languages

Get the language pool that viewers can choose from (admin-curated).

POST /admin/available-languages

Set the language pool and broadcast to all clients for real-time UI update.

Body: {
  "languages": ["en", "ru", "uk", "es", "fr"]
}

Translation Provider

GET /admin/translation-provider

Get the active translation provider (google, deepl, claude, or libretranslate) & list of available options.

POST /admin/translation-provider

Switch the translation provider at runtime (persisted to Redis).

Body: {
  "provider": "google"
}

GET /admin/claude-model

Get the active Claude translation model & list of available Claude models.

POST /admin/claude-model

Switch the Claude model for translation (persisted to Redis).

Body: {
  "model": "claude-3-5-sonnet-20241022"
}

Audio Device

GET /admin/audio-device

Get the admin-selected audio input device (overrides viewer's local selection).

POST /admin/audio-device

Set the admin-selected audio input device & broadcast to all clients. Viewers will use this device instead of their own selection.

Body: {
  "deviceId": "device_id",
  "label": "Built-in Microphone"
}

YouTube

GET /admin/youtube/channel-id

Get the configured YouTube channel ID & whether it came from environment variables.

PUT /admin/youtube/channel-id

Update the YouTube channel ID at runtime.

Body: {
  "channelId": "UCxxxxxxxxxxxxxxxxxxxxxx"
}

GET /admin/youtube/live-streams

Find live streams for the configured channel using YouTube API or yt-dlp fallback.

Sermon Generation

POST /admin/generate-sermon

Generate a biblical sermon snippet via Gemini Flash 2.5 (used by biblical simulator).

Body: {
  "apiKey": "your-gemini-key (optional)",
  "language": "en",
  "sentences": 3
}

Broadcast Schedule

GET /admin/broadcast-schedule

Get scheduled broadcast events (array of event objects with id, title, datetime, description).

POST /admin/broadcast-schedule

Set broadcast schedule. Past events are expired automatically when a broadcast starts.

Body: {
  "events": [
    {
      "id": "evt1",
      "title": "Sunday Service",
      "datetime": "2024-01-14T10:00:00Z",
      "description": "Weekly live translation"
    }
  ]
}

TTS Preview

POST /admin/tts-preview

Generate TTS audio for a text snippet. Returns MP3 audio buffer (Content-Type: audio/mpeg).

Body: {
  "text": "Hello, this is a test.",
  "voiceId": "kxj9qk6u5PfI0ITgJwO0"
}

Voice Training & Cloning

POST /admin/voice-training/from-recording

Clone a voice from browser mic recordings. Base64-encoded audio blobs are uploaded to ElevenLabs.

Body: {
  "name": "My Custom Voice",
  "clips": ["base64_encoded_audio_blob_1", "base64_encoded_audio_blob_2"],
  "mimeType": "audio/webm"
}

POST /admin/voice-training/from-youtube

Clone a voice from a YouTube URL. Server extracts 30-second clips using yt-dlp & ffmpeg, then uploads to ElevenLabs.

Body: {
  "name": "YouTube Voice",
  "youtubeUrl": "https://www.youtube.com/watch?v=...",
  "clipCount": 3,
  "startOffset": 0
}

Monitoring & Logs

GET /admin/hallucinations

Get hallucination detection statistics & log entries (invalid scripts, repeated noise, etc.).

DELETE /admin/hallucinations

Clear the hallucination log.

GET /admin/translation-log

Get historical translation entries (original, translated, language, provider, timing).

DELETE /admin/translation-log

Clear the translation log.

GET /admin/queue-depth

Get real-time broadcast queue depth & stream statistics (pending, translated, consumer lag).

Session History

GET /admin/sessions

Get all broadcast session records from PostgreSQL (started_at, duration, transcript count).

GET /admin/sessions/:id

Get detailed session with full transcript entries (seq, original, translated, timing, language).

GET /admin/sessions/:id/export

Export session as JSON, CSV, or plain text. Query param: ?format=json|csv|txt (default: json).

User Management

GET /admin/users

List all users (password hashes & avatar data stripped). Requires: user_management permission.

POST /admin/users/:id/role

Update user admin status &/or assign roles. Requires: user_management permission.

Body: {
  "isAdmin": true,
  "roleIds": ["role_id_1", "role_id_2"]
}

POST /admin/users/:id/reset-password

Force-reset a user's password. Password must be at least 6 characters. Requires: user_management permission.

Body: {
  "password": "newpassword123"
}

DELETE /admin/users/:id

Delete a user. Cannot delete your own account. Requires: user_management permission.

Role Management

GET /admin/permissions

List all available permissions that can be assigned to roles. Requires: user_management permission.

GET /admin/roles

Get all custom roles. Requires: user_management permission.

POST /admin/roles

Create a new role with a set of permissions. Role name must be unique. Requires: user_management permission.

Body: {
  "name": "Translator",
  "permissions": ["broadcast_control", "settings_read"]
}

PUT /admin/roles/:id

Update an existing role's name & permissions. Requires: user_management permission.

Body: {
  "name": "Senior Translator",
  "permissions": ["broadcast_control", "settings_read", "settings_write"]
}

DELETE /admin/roles/:id

Delete a role. Requires: user_management permission.

Public Endpoints

GET /admin/anthropic-key

Get the configured Anthropic API key status (internal use only — do not expose to frontend).

Socket.IO Events

Server → Client Events

Event	Payload	Description
`feature_flags`	`{ [flag: string]: boolean }`	Merged feature flags from YAML defaults & Redis overrides.
`languages`	`{ languages: [string, string] }`	Current active language pair [source, target].
`available_languages`	`{ languages: string[] }`	Pool of languages viewers can select from.
`stt_timing`	`{ tts_segment_pause_ms: number }`	STT timing settings for frontend pause between TTS segments.
`broadcast_status`	`{ active: boolean; source?: 'mic'\|'youtube'\|'remote'\|'biblical'; pauseReason?: 'prayer'\|'song'\|null; skipSourceLang?: string\|null; voiceId?: string; orphaned?: boolean }`	Broadcast state: active flag, source type, pause reason, skip language, voice ID, orphan status.
`broadcast_viewer_count`	`{ count: number }`	Current number of viewers watching the broadcast.
`remote_audio_sources`	`{ sources: Array<{ socketId: string; agentId?: string; label: string; devices: Array<{ id: string; name: string }>; selectedDevice?: string }> }`	List of registered remote audio sources with their device lists.
`admin_audio_device`	`{ deviceId: string; label: string }`	Admin-selected audio input device that overrides viewer's local selection.
`transcript`	`{ text: string; isFinal: boolean }`	STT transcript output for private session.
`translation`	`{ original: string; translated: string; detectedLanguage?: string }`	Translated text for private session.
`tts_audio`	`{ audio: string }`	Base64-encoded MP3 audio chunk for private session.
`session_started`	`{ source: 'mic'\|'youtube' }`	Private session has started.
`session_stopped`	`{}`	Private session has stopped.
`stream_ended`	`{}`	YouTube or biblical stream has ended.
`audio_level`	`{ data: number[] }`	Downsampled waveform data (64 samples) for audio level visualization.
`error`	`{ message: string }`	Error message from STT, TTS, or streaming.
`tts_clear_queue`	`{}`	Clear any queued TTS audio (used during pause/resume or session stop).
`broadcast_transcript`	`{ text: string; isFinal: boolean; skipped?: boolean }`	STT transcript output broadcast to all viewers.
`broadcast_translation`	`{ original: string; translated: string; detectedLanguage?: string }`	Translated text broadcast to all viewers.
`broadcast_tts_audio`	`{ audio: string }`	Base64-encoded MP3 audio chunk broadcast to all viewers.
`broadcast_voice_changed`	`{ voiceId: string }`	TTS voice changed during active broadcast.
`admin_translate_result`	`{ original: string; translated: string; detectedLanguage?: string; audio: string }`	Result of admin test translate + TTS (private to admin socket).
`select_device`	`{ id: string }`	Admin requests remote audio source to switch to a specific device.
`device_select_error`	`{ socketId: string; message: string }`	Error response when device selection fails.
`refresh_devices`	`{}`	Admin requests remote audio source to refresh its device list.
`remote_audio_error`	`{ socketId: string; deviceId: string; message: string }`	Remote audio source reports a stream or device error.

Client → Server Events

Event	Payload	Description
`join_broadcast`	`{}`	Viewer joins the broadcast room to receive translations & audio.
`leave_broadcast`	`{}`	Viewer leaves the broadcast room.
`set_languages`	`{ languages: [string, string] }`	Viewer selects a language pair from the available pool.
`start_session`	`{ source: 'mic'\|'youtube'; voiceId?: string; youtubeUrl?: string }`	User starts a private translation session from mic or YouTube URL.
`stop_session`	`{}`	User stops their private session.
`change_voice`	`{ voiceId: string }`	Change TTS voice for active session or broadcast without restarting.
`audio_chunk`	`{ audio: string }`	Base64-encoded PCM audio chunk from mic — routes to broadcast (if admin) or private session.
`test_audio_chunk`	`{ audio: string }`	Base64-encoded PCM audio chunk for private session testing only (never broadcast).
`admin_start_broadcast`	`{ voiceId?: string; source: 'mic'\|'youtube'\|'remote'; youtubeUrl?: string }`	Admin starts a broadcast session from mic, YouTube, or remote audio sources.
`admin_stop_broadcast`	`{}`	Admin stops the active broadcast.
`reclaim_broadcast`	`{}`	Admin reclaims an orphaned broadcast after reconnecting.
`broadcast_pause`	`{ reason: 'prayer'\|'song' }`	Admin pauses broadcast and clears TTS queue (e.g., during prayer or song).
`broadcast_resume`	`{}`	Admin resumes broadcast after pause.
`broadcast_skip_lang`	`{ lang: string\|null }`	Admin sets source language to skip (e.g., skip English when human translator is active).
`admin_translate_test`	`{ text: string; voiceId?: string; sourceLang?: string; targetLang?: string }`	Admin requests instant translate & TTS test (private to admin socket).
`start_biblical_sim`	`{ anthropicApiKey?: string; geminiApiKey?: string; language: 'en'\|'ru'\|'uk'; voiceId?: string }`	Admin starts a biblical text broadcast simulation.
`stop_biblical_sim`	`{}`	Admin stops the biblical simulation.
`register_audio_source`	`{ agentId?: string; label: string; deviceId: string; devices?: Array<{ id: string; name: string }>; selectedDevice?: string\|null }`	Remote audio source registers itself with label, device list, & selection.
`unregister_audio_source`	`{}`	Remote audio source unregisters (disconnects).
`select_agent_device`	`{ socketId: string; deviceId: string }`	Admin selects a specific device on a remote audio source.
`refresh_devices`	`{ socketId: string }`	Admin requests a remote audio source to refresh its device list.
`audio_stream_error`	`{ deviceId: string; message: string }`	Remote audio source reports a stream or device error.

SDK

Uses the official @elevenlabs/elevenlabs-js SDK (v2). The client is lazy-loaded on first use.

Speech-to-Text (Scribe v2 Realtime)

Connects via native WebSocket to wss://api.elevenlabs.io/v1/speech-to-text/realtime. Handles:

VAD-based commit buffering with configurable merge window
Stability timeout fallback for stalled VAD
Text validation (EN/RU/UK character regex filtering)
Partial and final transcript emission

Text-to-Speech

Uses client.textToSpeech.stream() with the eleven_multilingual_v2 model. Audio is collected into a Buffer and emitted as base64 MP3.

Voice Management

client.voices.getAll() — fetches all voices from account
Admin can filter which voices are available to viewers
Voice cloning via IVC API (from recordings or YouTube)

Key File

backend/src/services/elevenlabs.service.ts

Provider Details

Google Translate

Google Cloud Translation API v2. Fast (~200ms), deterministic, and reliable. Requires GOOGLE_TRANSLATE_API_KEY with the Cloud Translation API enabled in Google Cloud Console. Ensure the API key has no HTTP referrer restrictions (server-side requests have no referrer).

File: backend/src/services/google-translate.service.ts

LibreTranslate

Self-hosted in Docker. No API key required by default. Provides language detection and translation via REST API.

File: backend/src/services/libretranslate.service.ts

DeepL

Premium translation API. Auto-detects free vs. paid endpoint based on the API key format.

File: backend/src/services/deepl.service.ts

Claude (Anthropic)

AI-powered translation using claude-haiku-4-5 for speed. Includes language detection and auto-flip logic.

File: backend/src/services/claude-translate.service.ts

Routing

Provider routing is handled by backend/src/services/translation.provider.ts:

Try admin-selected primary provider
On failure, try configured fallback provider
LibreTranslate is always the last-resort fallback

Connection

Uses ioredis with automatic retry strategy. Falls back to in-memory/YAML defaults if Redis is unavailable.

Key Patterns

Pattern	Example	Purpose
`flag:<name>`	`flag:youtube_input`	Feature flag boolean values
`setting:<name>`	`setting:tts_settings`	JSON settings objects

Key File

backend/src/services/redis.service.ts

Local Development

Use docker-compose.local.yml for Redis and LibreTranslate only (backend/frontend run natively):

Terminal

docker compose -f docker-compose.local.yml up -d

Production

Use docker-compose.yml for all services:

Terminal

docker compose up -d --build

Services

Service	Image	Port	Notes
frontend	node:24-alpine + Nginx	80 (exposed)	Serves React build, proxies API/WS to backend
backend	node:24-alpine	3001 (internal)	Express + Socket.io server
redis	redis:7-alpine	6379 (internal)	Feature flags and settings store
libretranslate	libretranslate/libretranslate	5000 (internal)	Self-hosted translation engine

Configuration

.env (production)

ELEVENLABS_API_KEY=sk-your-production-key
ADMIN_PASSWORD=strong-secure-password
FRONTEND_URL=https://translate.example.com
APP_ENV=prod
REDIS_PASSWORD=redis-secret

Deploy

Terminal

docker compose up -d --build

Reverse Proxy

When running behind Nginx or another reverse proxy:

Set LISTEN_PORT in .env (e.g., 8080)
Proxy pass to localhost:8080
Important: Ensure WebSocket upgrades are forwarded for the /socket.io/ path

Nginx Config (example)

server {
    listen 443 ssl;
    server_name translate.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}

Monitoring

Terminal

# Check all services
docker compose ps

# View backend logs
docker compose logs -f backend

# Health check
curl http://localhost:3001/api/health

Shipped

v0.1 – v0.2 — Core Translation Engine

Real-time STT via ElevenLabs Scribe v2 Realtime
Multi-provider translation (LibreTranslate, DeepL, Claude)
TTS voice synthesis with ElevenLabs
Microphone and YouTube live input
Admin panel with feature flags, voice management, TTS tuning
Biblical Transcript Simulator for pipeline testing
Instant Voice Cloning from recordings and YouTube

Shipped

v0.3 — Audio Mixer & Device Selection

Browser-side audio device scanning with support for professional mixing consoles, virtual audio devices, and audio interfaces.

Browser-side device enumeration with permission flow
Virtual device detection (Loopback, BlackHole, VB-Audio, Voicemeeter, OBS)
Categorized device picker (Microphones vs Mixers / Virtual Devices)
Admin device override broadcast to all viewers via Socket.io
Real-time feature flag broadcasting

Shipped

v0.7 — Broadcast Service

The /translate route is now a true broadcast service. Admins start one global translation session from the admin panel and all connected viewers receive the live output simultaneously.

Single global broadcast session (one-to-many)
Admin "Broadcast Control" panel — Start/Stop with source + voice selection
Microphone and YouTube source both supported for broadcast
All translation output (transcript, translated text, TTS audio) io.emit’d to every viewer
Viewer shows Waiting for broadcast to start… status when off air
"On Air" / "Off Air" status pill visible to viewers in real-time
Broadcast ownership tracked by admin socket ID; auto-stops on admin disconnect
Biblical Transcript Simulator also broadcasts to all viewers

Shipped

v0.8 — Navigation, Broadcast FF & Transcript UX

Global persistent bottom navigation, feature-flag-gated route visibility, and a refined transcript reading experience.

Persistent bottom navigation bar on all pages (/translate, /broadcast, /video, /admin)
FF-gated nav links — Broadcast and Video Call entries only appear when their flags are enabled
No extra socket connection — nav reads flags from the page’s existing useSocket call via props
Nav renders a frosted dark background gradient so it never overlaps content
/broadcast route is now public (no login required); gated inside the page by the broadcast feature flag
broadcast feature flag added to YAML, backend config, and frontend FeatureFlags interface
Transcript panel: newest translation is always at the top; older lines scroll down and fade out at the bottom
Each new transcript entry animates in from above (transcriptIn keyframe)
Removed duplicate “Video Call” button from /translate and /broadcast header bars

Shipped

v0.9 — Translation Pipeline Overhaul & Google Integration

Major improvements to translation chunking, provider support, and admin tooling.

Google Translate as primary translation provider with automatic fallback chain
Google Gemini 2.5 Flash for biblical simulator and sermon generation (replaces deprecated Gemini 2.0 Flash)
Overhauled STT chunking: disabled aggressive sentence-boundary splitting, stability timer defers to accumulation during continuous speech, commit buffer defers when speaker has resumed
Configurable sermon length (1–20 sentences) in admin UI
Voice training: AI-generated reading text (Gemini) for mic recording sessions
Voice training: preview playback of cloned voice after training via TTS
Broadcast mute/unmute toggle (muted by default, replaces “Tap to enable audio” banner)
Audio device auto-scan on page load with spinning refresh indicator
Fixed admin Raw Server Logs auto-scroll toggle re-enabling on new messages
Updated Claude model list: removed deprecated models, default is claude-haiku-4-5
Docker images upgraded to Node.js 24 (Alpine)

Shipped

v0.4 — Mac Audio Agent

Lightweight Node.js daemon that captures Mac microphone audio and streams it to the backend via Socket.io — no browser required on the audio source machine.

Runs as a macOS LaunchAgent (auto-start on login, auto-restart on crash)
Captures 16 kHz 16-bit mono PCM via sox
Identical chunk format and encoding to the browser client
Registers as a named remote audio source visible in the Admin UI
Starts/stops streaming automatically based on broadcast_status events
One-command install script (see standalone repo)

Up Next

v0.4.1 — Direct Audio Interface Feed

Accept audio directly from professional mixing consoles and audio interfaces — extend the Mac agent to support Core Audio device selection for broadcast-quality input.

Direct audio interface input (Core Audio / ASIO / ALSA)
Multi-channel mixer feed support
Low-latency audio routing (sub-100ms)
Hardware device auto-discovery and selection
Professional broadcast integration (NDI, Dante)

Shipped

v0.5 — Video Call Translation

WebRTC peer-to-peer video calls with real-time bidirectional translation. Two people speak different languages and hear each other translated via TTS.

Built-in WebRTC video call with room codes
Full-duplex translation (each person hears the other translated)
Per-participant STT pipeline with independent Scribe sessions
Video grid UI with local PiP and remote full-screen
Mic/video mute controls, hang up, auto-cleanup on disconnect
Feature-flagged behind video_translation

Shipped

v0.6 — Auth, Mobile & Voice Cloning in /video

User-facing login page (/) with JWT cookie sessions (30-day sticky, HttpOnly)
All app routes protected — redirect to login if unauthenticated
Live translator moved to /translate
Mobile-responsive UI across Translator, Admin, and Video Call views
FaceTime-style full-screen in-call layout on mobile with safe-area insets
“Clone Voice” button in /video lobby, gated by video_voice_cloning feature flag
Voice cloning modal with mic recording or YouTube URL, admin-password gated

Planned

Future

Additional language pairs beyond EN/RU/UK
Speaker diarization (multi-speaker detection)
Translation memory and glossary support
Webhooks and API for third-party integrations
Multi-tenant deployment with user accounts

Introduction

Ship live translationswith confidence

Installation

Architecture

Live Translation

Biblical Simulator

Voice Training

Installation

Prerequisites

Clone & Configure

Start Infrastructure

Start Backend

Start Frontend

Quickstart

Start the services

Open the Admin Panel

Select a Voice

Test with Text

Go Live

Architecture

System Overview

Data Flow

Key Architecture Decisions

Two-layer Language Detection

VAD Commit Merging

Feature Flag Merging

API Key Hierarchy

Speech-to-Text Flow

Connection Lifecycle

Audio Streaming

Scribe Responses

Commit Merge Buffer

Stability Timeout

Text Validation

Translation Pipeline

Provider Chain

LibreTranslate

DeepL

Claude

Fallback Logic

Language Detection

Layer 1: Script-based Pre-detection

Layer 2: STT Language Code

Language Gating

TTS & Playback

TTS Pipeline

Audio Delivery

Frontend Playback Queue

Live Translation

Microphone Input

YouTube Input

User Interface

YouTube Input

How It Works

Supported Sources

Requirements

Biblical Transcript Simulator

Overview

Language Styles

Flow

Voice Training

Overview

From Microphone

From YouTube

Language Management

Concepts

Admin Controls

Viewer Selection

Video Call Translation

Overview

How It Works

Architecture

Socket Events

Room Lifecycle

Mac Audio Agent

Feature Flags

Feature Flags

Runtime Storage & API

Admin API Endpoints

YAML Configuration

Ship live translations
with confidence