Munsit STT for LiveKit Agents
The livekit-plugins-munsit package adds Munsit speech-to-text to LiveKit Agents. It is optimized for Arabic speech recognition and supports Arabic/English code-switching through the munsit-en-ar model.
Prerequisites
- A Munsit AI account
- Python 3.10 or higher
- Basic familiarity with the LiveKit Agents framework
- A LiveKit Cloud account or self-hosted LiveKit server
API Key
Go to Munsit - API Keys, generate an API key, then save it securely. Set the key in your environment as MUNSIT_API_KEY.
Installation
Install the Munsit STT plugin for LiveKit Agents:
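A typical install might look like the following. The livekit-plugins-munsit name comes from the Support section below; livekit-agents and livekit-plugins-silero are the standard companion packages for the VAD-based setup used throughout this guide:

```shell
pip install livekit-agents livekit-plugins-munsit livekit-plugins-silero
```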
munsit.STT() uses batch mode by default. In batch mode, LiveKit’s VAD signals end-of-speech through flush(), then the plugin sends the buffered utterance to Munsit and emits a final transcript with word-level timestamps.
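A minimal batch-mode setup might look like this sketch. The munsit import path and the AgentSession wiring follow the standard LiveKit Agents plugin pattern; treat exact names as assumptions:

```python
from livekit.agents import AgentSession
from livekit.plugins import munsit, silero  # munsit import path assumed

session = AgentSession(
    # Batch mode is the default: the VAD signals end-of-speech, then the
    # buffered utterance is sent to Munsit for a final transcript.
    stt=munsit.STT(),
    # Thresholds tuned for real-world microphones; see the note that follows.
    vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
)
```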
The Silero VAD thresholds above (activation_threshold=0.6, min_speech_duration=0.3) are tuned for real-world microphones, where post-echo-cancellation audio from the agent’s own speaker can otherwise be misinterpreted as user speech. See Best Practices for details.
Modes
| Mode | When to use | Endpoint | Latency | Word timestamps | Interim events |
|---|---|---|---|---|---|
| batch (default) | Production with AgentSession + VAD; transcribing recorded audio. | POST /api/v1/audio/transcribe | VAD detection + upload + server processing (~1-2 s for short utterances) | Yes, populated on SpeechData.words | No |
| streaming | Live captions or on-the-fly UI updates while the user is still speaking. | WS /api/v1/websocket/speech-to-text | ~700 ms idle threshold by default | No | Yes, through INTERIM_TRANSCRIPT events |
STT.recognize(audio_buffer) always uses the batch HTTP endpoint, even when mode="streaming" is configured.
Full Agent Example
Create a file called arabic_stt_agent.py:
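A sketch of what arabic_stt_agent.py might contain, following the standard LiveKit Agents worker layout. The munsit import path, the LLM/TTS choices, and the instructions text are illustrative assumptions; the guide recommends Faseeh TTS or cartesia.TTS() as alternatives:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, munsit, openai, silero  # paths assumed


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # Pin the model explicitly rather than relying on the default.
        stt=munsit.STT(model="munsit"),
        # Thresholds from the Best Practices section below.
        vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
        llm=openai.LLM(),       # placeholder LLM choice
        tts=cartesia.TTS(),     # or Faseeh TTS for a fully Arabic-native loop
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful Arabic-speaking assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```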
Pairing Munsit STT with Faseeh TTS gives you a fully Arabic-native voice loop. If you'd rather use a non-Munsit TTS, cartesia.TTS() is a strong choice; openai.TTS() works but produces less natural Arabic prosody.
Model Selection
Choose the model based on the input language:
| Model | Use case |
|---|---|
| munsit | Arabic speech recognition. This is the default model. |
| munsit-en-ar | Mixed Arabic-English speech with code-switching. |
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| mode | batch | batch for HTTP transcription or streaming for WebSocket transcription. |
| model | munsit | munsit for Arabic or munsit-en-ar for Arabic/English code-switching. |
| api_key | env MUNSIT_API_KEY | Munsit API key. |
| base_url | Munsit production WebSocket URL | Override the streaming WebSocket URL. |
| batch_base_url | Munsit production HTTPS URL | Override the batch HTTP URL. |
| auth_method | header | Authentication style: header, bearer, or query. |
| sample_rate | 16000 | Sample rate used for the generated WAV header. |
| num_channels | 1 | Number of audio channels. |
| interim_results | True | Emits interim transcripts in streaming mode. |
| endpointing | server_diff | Streaming endpointing strategy: server_diff or client_vad. |
| finalize_after_silence_ms | 700 | Silence threshold before finalizing in server_diff mode. |
| energy_filter | False | Enables energy filtering for client_vad mode. |
| vad_silence_ms | 1500 | Silence duration used by client_vad mode. |
| language | None | Label attached to SpeechData.language; defaults to ar. |
| http_session | None | Custom aiohttp.ClientSession. |
| extra_query_params | None | Extra query params for the streaming WebSocket endpoint. |
Authentication Methods
Munsit accepts the API key in three different places: a custom header, a bearer token, or a query parameter. Choose whichever fits your deployment via the auth_method parameter.
Streaming Endpointing
When mode="streaming" is enabled, the plugin supports two endpointing strategies:
- server_diff: Keeps one long-lived WebSocket open, emits interim transcripts as cumulative text arrives, then emits a final transcript and end-of-speech after finalize_after_silence_ms of server silence.
- client_vad: Opens a WebSocket when local audio energy starts and closes it after silence. This gives stronger utterance boundaries with slightly more connection overhead.
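Putting the two strategies together, a streaming configuration might look like this sketch. The parameter names are taken from the Configuration Reference above; the import path is an assumption:

```python
from livekit.plugins import munsit  # import path assumed

# Long-lived socket with server-side silence detection (default strategy).
stt_server = munsit.STT(
    mode="streaming",
    endpointing="server_diff",
    finalize_after_silence_ms=700,
    interim_results=True,
)

# Or: open a socket per utterance based on local audio energy.
stt_client = munsit.STT(
    mode="streaming",
    endpointing="client_vad",
    energy_filter=True,
    vad_silence_ms=1500,
)
```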
Synchronous Batch Recognition
When you have a recorded audio buffer, such as a voicemail or uploaded file, and do not need a live stream, call recognize directly. This always uses the batch HTTP endpoint regardless of the mode setting and returns a transcript with word-level timestamps.
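A sketch of direct batch recognition, assuming recognize follows the usual LiveKit STT interface of returning a SpeechEvent whose first alternative holds the final SpeechData (the words attribute comes from the Modes table above):

```python
from livekit.plugins import munsit  # import path assumed


async def transcribe_recording(frames) -> str:
    """Transcribe a recorded audio buffer, e.g. a voicemail."""
    stt = munsit.STT(model="munsit")
    event = await stt.recognize(buffer=frames)  # always the batch HTTP endpoint
    data = event.alternatives[0]
    for word in data.words:  # word-level timestamps, per the Modes table
        print(word)
    return data.text
```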
Tracking Turn Metrics
Each conversation turn carries timing data on its ChatMessage. Subscribe to conversation_item_added to read transcription delay, end-of-turn delay, and downstream LLM/TTS metrics:
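A sketch of reading per-turn metrics, assuming an existing AgentSession named session, that the event payload exposes the added item, and that ChatMessage.metrics carries the timing fields described here (the attribute names below are illustrative assumptions):

```python
@session.on("conversation_item_added")
def on_item_added(ev):
    item = ev.item  # the ChatMessage for this turn
    metrics = getattr(item, "metrics", None)
    if metrics is not None:
        # Field names are assumptions; inspect `metrics` in your version.
        print("transcription delay:", getattr(metrics, "transcription_delay", None))
        print("end-of-turn delay:", getattr(metrics, "end_of_turn_delay", None))
```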
The previous metrics_collected event is deprecated. New integrations should use conversation_item_added and read metrics from ChatMessage.metrics.
Best Practices
- Use batch mode in production. It returns word-level timestamps and is more accurate than streaming on Arabic. Switch to streaming only when your UI needs to update before the speaker finishes.
- Pin the model explicitly. Pass model="munsit" or model="munsit-en-ar" instead of relying on the default, so a future plugin default change does not affect your agent.
- Tighten the VAD for real microphones. silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3) filters background noise and post-echo-cancellation residual without losing real speech under normal conditions. The default Silero thresholds are too permissive for full-duplex voice agents.
- Use console mode while developing. It runs locally with your microphone and prints transcripts to stdout, giving the fastest debugging loop. No LiveKit server required.
- Keep the API key in the environment. Set MUNSIT_API_KEY in your shell or .env file rather than passing it in code. The plugin reads it automatically.
- Pair Munsit STT with Faseeh TTS for an Arabic-native loop. Both plugins speak the same dialect register, which produces a more cohesive user experience than mixing in a non-Arabic-native TTS.
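For the console-mode tip, the LiveKit Agents CLI accepts a console subcommand; using the example file name from the Full Agent Example section:

```shell
python arabic_stt_agent.py console
```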
Troubleshooting
Missing or Invalid API Key
Verify that MUNSIT_API_KEY is available to the process running your agent:
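A quick shell check; this prints the key if it is set and fails with a message otherwise:

```shell
echo "${MUNSIT_API_KEY:?MUNSIT_API_KEY is not set}"
```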
No Final Transcript
Make sure your AgentSession includes VAD when using batch mode. Batch mode finalizes after LiveKit signals end-of-speech.
Microphone Feedback Loop
If the agent transcribes its own TTS playback and that spurious transcript triggers a new conversational turn, the cause is residual audio leaking through the microphone after acoustic echo cancellation. The bilingual munsit-en-ar model is more sensitive to low-energy input than munsit is and can return text on those residuals.
Tighten the Silero VAD so quieter post-AEC audio does not reach STT:
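For example, using the thresholds recommended in Best Practices:

```python
from livekit.plugins import silero

# Raise the activation threshold and require a longer minimum speech
# duration so low-energy post-AEC residual does not reach Munsit STT.
vad = silero.VAD.load(
    activation_threshold=0.6,   # default is lower and more permissive
    min_speech_duration=0.3,    # seconds of sustained speech required
)
```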
To isolate the problem, switch to model="munsit" to confirm the issue is model-specific, or run the agent with headphones so the speaker output cannot reach the microphone.
Need Live Captions
Use streaming mode when your UI needs transcript updates before the speaker finishes; see Streaming Endpointing above for configuration.
Support
- Package: livekit-plugins-munsit on PyPI
- Plugin Issues: GitHub Issues
- Munsit Support: Schedule a Meeting
- LiveKit Support: LiveKit Community
