LiveKit STT Plugin

Munsit STT for LiveKit Agents

The livekit-plugins-munsit package adds Munsit speech-to-text to LiveKit Agents. It is optimized for Arabic speech recognition and supports Arabic/English code-switching through the munsit-en-ar model.

Prerequisites

A Munsit AI account
Python 3.10 or higher
Basic familiarity with the LiveKit Agents framework
A LiveKit Cloud account or self-hosted LiveKit server

API Key

Go to Munsit - API Keys and generate an API key.

Your API key is shown only once, so save it securely.

Set the key in your environment:

MUNSIT_API_KEY=your_MUNSIT_API_KEY_here

Installation

Install the Munsit STT plugin for LiveKit Agents:

pip install livekit-plugins-munsit

Quick Start

from livekit.agents import AgentSession
from livekit.plugins import munsit, silero

session = AgentSession(
    stt=munsit.STT(),
    vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
    # ... llm, tts ...
)

munsit.STT() uses batch mode by default. In batch mode, LiveKit’s VAD signals end-of-speech through flush(), then the plugin sends the buffered utterance to Munsit and emits a final transcript with word-level timestamps.

The Silero VAD thresholds above (activation_threshold=0.6, min_speech_duration=0.3) are tuned for real-world microphones, where post-echo-cancellation audio from the agent’s own speaker can otherwise be misinterpreted as user speech. See Best Practices for details.

Modes

| Mode | When to use | Endpoint | Latency | Word timestamps | Interim events |
| --- | --- | --- | --- | --- | --- |
| `batch` (default) | Production with `AgentSession` + VAD; transcribing recorded audio. | `POST /api/v1/audio/transcribe` | VAD detection + upload + server processing (~1-2 s for short utterances) | Yes, populated on `SpeechData.words` | No |
| `streaming` | Live captions or on-the-fly UI updates while the user is still speaking. | `WS /api/v1/websocket/speech-to-text` | ~700 ms idle threshold by default | No | Yes, through `INTERIM_TRANSCRIPT` events |

from livekit.plugins import munsit

batch_stt = munsit.STT()
streaming_stt = munsit.STT(mode="streaming")

STT.recognize(audio_buffer) always uses the batch HTTP endpoint, even when mode="streaming" is configured.

Full Agent Example

Create a file called arabic_stt_agent.py:

from dotenv import load_dotenv
from livekit import agents
from livekit.agents import Agent, AgentSession, AgentServer
from livekit.plugins import faseeh, munsit, openai, silero

load_dotenv(".env.local")


class ArabicAssistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""أنت مساعد صوتي ذكي يتحدث العربية بطلاقة.
            أجب على المستخدم بطريقة واضحة ومختصرة."""
        )


server = AgentServer()


@server.rtc_session()
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt=munsit.STT(
            model="munsit",
            mode="batch",
        ),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=faseeh.TTS(),
        vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
    )

    await session.start(
        room=ctx.room,
        agent=ArabicAssistant(),
    )


if __name__ == "__main__":
    agents.cli.run_app(server)

Run the agent locally:

python arabic_stt_agent.py dev

Pairing Munsit STT with Faseeh TTS gives you a fully Arabic-native voice loop. If you’d rather use a non-Munsit TTS, cartesia.TTS() is a strong choice; openai.TTS() works but produces less natural Arabic prosody.

Model Selection

Choose the model based on the input language:

| Model | Use case |
| --- | --- |
| `munsit` | Arabic speech recognition. This is the default model. |
| `munsit-en-ar` | Mixed Arabic-English speech with code-switching. |

stt = munsit.STT(model="munsit-en-ar")

Configuration Reference

| Parameter | Default | Description |
| --- | --- | --- |
| `mode` | `batch` | `batch` for HTTP transcription or `streaming` for WebSocket transcription. |
| `model` | `munsit` | `munsit` for Arabic or `munsit-en-ar` for Arabic/English code-switching. |
| `api_key` | env `MUNSIT_API_KEY` | Munsit API key. |
| `base_url` | Munsit production WebSocket URL | Override the streaming WebSocket URL. |
| `batch_base_url` | Munsit production HTTPS URL | Override the batch HTTP URL. |
| `auth_method` | `header` | Authentication style: `header`, `bearer`, or `query`. |
| `sample_rate` | `16000` | Sample rate used for the generated WAV header. |
| `num_channels` | `1` | Number of audio channels. |
| `interim_results` | `True` | Emits interim transcripts in streaming mode. |
| `endpointing` | `server_diff` | Streaming endpointing strategy: `server_diff` or `client_vad`. |
| `finalize_after_silence_ms` | `700` | Silence threshold before finalizing in `server_diff` mode. |
| `energy_filter` | `False` | Enables energy filtering for `client_vad` mode. |
| `vad_silence_ms` | `1500` | Silence duration used by `client_vad` mode. |
| `language` | `None` | Label attached to `SpeechData.language`. When left as `None`, the label falls back to `ar`. |
| `http_session` | `None` | Custom `aiohttp.ClientSession`. |
| `extra_query_params` | `None` | Extra query params for the streaming WebSocket endpoint. |
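
The sample_rate and num_channels settings describe the audio that ends up behind the generated WAV header. This guide does not show how the plugin builds that header, so the following is only a minimal sketch of the general idea using Python's standard wave module. It assumes 16-bit PCM samples, and pcm_to_wav is a hypothetical helper, not part of livekit-plugins-munsit.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, num_channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a WAV container (illustrative only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(num_channels)
        wav.setsampwidth(2)  # assumption: 2 bytes per sample (16-bit PCM)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# One second of mono silence at 16 kHz.
one_second_of_silence = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav(one_second_of_silence)
print(len(wav_bytes), wav_bytes[:4])
```

If the sample rate in the header does not match the real audio, a decoder will play it back too fast or too slow, which is why these values should match the frames you feed in.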

Authentication Methods

Munsit accepts the API key in three different places. Choose whichever fits your deployment:

# Default: sends the key as the x-api-key header.
munsit.STT(auth_method="header")

# Authorization: Bearer <key>
munsit.STT(auth_method="bearer")

# Query parameter (?token=<key>): useful when an upstream proxy strips headers.
munsit.STT(auth_method="query")

All three methods work on both the batch HTTP endpoint and the streaming WebSocket endpoint.
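
To make the three styles concrete, here is a rough sketch of where the key lands for each auth_method value. The header name, bearer format, and token query parameter come from the comments above; build_auth and the example URL are hypothetical and are not part of the plugin.

```python
from urllib.parse import urlencode


def build_auth(auth_method: str, api_key: str) -> tuple[dict[str, str], dict[str, str]]:
    """Return (headers, query_params) for one auth style (illustrative only)."""
    if auth_method == "header":
        return {"x-api-key": api_key}, {}
    if auth_method == "bearer":
        return {"Authorization": f"Bearer {api_key}"}, {}
    if auth_method == "query":
        return {}, {"token": api_key}
    raise ValueError(f"unknown auth_method: {auth_method!r}")


# Placeholder URL for demonstration; not a real Munsit endpoint.
headers, params = build_auth("query", "demo-key")
url = "wss://example.invalid/stt"
if params:
    url = f"{url}?{urlencode(params)}"
print(headers, url)
```

Keep in mind that a key sent as a query parameter can show up in proxy and server access logs, so the header-based styles are usually preferable when headers survive the path to Munsit.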

Streaming Endpointing

When mode="streaming" is enabled, the plugin supports two endpointing strategies:

server_diff: Keeps one long-lived WebSocket open, emits interim transcripts as cumulative text arrives, then emits a final transcript and end-of-speech after finalize_after_silence_ms of server silence.
client_vad: Opens a WebSocket when local audio energy starts and closes it after silence. This gives stronger utterance boundaries with slightly more connection overhead.

stt = munsit.STT(
    mode="streaming",
    endpointing="server_diff",
    finalize_after_silence_ms=700,
)
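
The server_diff behavior described above can be pictured with a toy model. This sketch is not the plugin's implementation: ServerDiffEndpointer and its methods are hypothetical, and it assumes that "server silence" means the cumulative text has stopped changing. Timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerDiffEndpointer:
    """Toy model of silence-based finalization over cumulative transcripts."""

    finalize_after_silence_ms: int = 700
    _text: str = ""
    _last_change_ms: Optional[float] = None

    def on_server_text(self, cumulative_text: str, now_ms: float) -> Optional[str]:
        """Record cumulative text; return it as an interim result if it changed."""
        if cumulative_text != self._text:
            self._text = cumulative_text
            self._last_change_ms = now_ms
            return cumulative_text
        return None

    def poll(self, now_ms: float) -> Optional[str]:
        """Return the final transcript once the text has been stable long enough."""
        if not self._text or self._last_change_ms is None:
            return None
        if now_ms - self._last_change_ms >= self.finalize_after_silence_ms:
            final, self._text, self._last_change_ms = self._text, "", None
            return final
        return None


ep = ServerDiffEndpointer(finalize_after_silence_ms=700)
print(ep.on_server_text("hello", now_ms=0))          # interim
print(ep.on_server_text("hello world", now_ms=300))  # interim
print(ep.poll(now_ms=900))   # None: only 600 ms since the last change
print(ep.poll(now_ms=1000))  # final: 700 ms with no change
```

A lower finalize_after_silence_ms finalizes sooner but risks cutting a speaker off mid-pause; a higher value does the opposite.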

Synchronous Batch Recognition

When you have a recorded audio buffer, such as a voicemail or uploaded file, and do not need a live stream, call recognize directly. This always uses the batch HTTP endpoint regardless of the mode setting and returns a transcript with word-level timestamps.

from livekit import rtc
from livekit.plugins import munsit

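async def transcribe(frames: list[rtc.AudioFrame]) -> None:
    stt = munsit.STT()
    # Merge the frames into one buffer before sending it to the batch endpoint.
    combined = rtc.combine_audio_frames(frames)

    # recognize is a coroutine, so it must be awaited inside an async function.
    result = await stt.recognize(combined)
    best = result.alternatives[0]
    print(best.text)
    # Each word carries its own start and end time.
    for word in best.words:
        print(f"{word.start_time:.2f}s -> {word.end_time:.2f}s  {word}")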

Tracking Turn Metrics

Each conversation turn carries timing data on its ChatMessage. Subscribe to conversation_item_added to read transcription delay, end-of-turn delay, and downstream LLM/TTS metrics:

from livekit.agents import ChatMessage


@session.on("conversation_item_added")
def on_item(event):
    msg = event.item
    if not isinstance(msg, ChatMessage):
        return

    metrics = msg.metrics or {}
    if msg.role == "user":
        transcription_delay = metrics.get("transcription_delay")
        end_of_turn_delay = metrics.get("end_of_turn_delay")
        if transcription_delay is not None:
            print(f"STT delay: {transcription_delay * 1000:.0f} ms")
        if end_of_turn_delay is not None:
            print(f"EOU delay: {end_of_turn_delay * 1000:.0f} ms")
    elif msg.role == "assistant":
        llm_ttft = metrics.get("llm_node_ttft")
        tts_ttfb = metrics.get("tts_node_ttfb")
        if llm_ttft is not None:
            print(f"LLM TTFT: {llm_ttft * 1000:.0f} ms")
        if tts_ttfb is not None:
            print(f"TTS TTFB: {tts_ttfb * 1000:.0f} ms")

All values are reported in seconds.

The previous metrics_collected event is deprecated. New integrations should use conversation_item_added and read metrics from ChatMessage.metrics.
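
A common way to summarize these numbers is to add the end-of-turn delay to the LLM time-to-first-token and the TTS time-to-first-byte, which approximates how long the user waits before hearing the agent. The sketch below assumes the metric keys used in the handler above and treats the sum as an estimate only; estimate_response_latency is a hypothetical helper.

```python
from typing import Mapping, Optional


def estimate_response_latency(
    user_metrics: Mapping[str, float],
    assistant_metrics: Mapping[str, float],
) -> Optional[float]:
    """Approximate seconds from end of user speech to first agent audio."""
    parts = (
        user_metrics.get("end_of_turn_delay"),
        assistant_metrics.get("llm_node_ttft"),
        assistant_metrics.get("tts_node_ttfb"),
    )
    # Without all three components the estimate would be misleading.
    if any(p is None for p in parts):
        return None
    return sum(parts)


latency = estimate_response_latency(
    {"end_of_turn_delay": 0.5},
    {"llm_node_ttft": 0.3, "tts_node_ttfb": 0.2},
)
if latency is not None:
    print(f"Estimated response latency: {latency * 1000:.0f} ms")
```

Because the user and assistant metrics arrive on separate ChatMessage items, you would need to hold on to the most recent user metrics until the matching assistant message arrives.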

Best Practices

Use batch mode in production. It returns word-level timestamps and is more accurate than streaming on Arabic. Switch to streaming only when your UI needs to update before the speaker finishes.
Pin the model explicitly. Pass model="munsit" or model="munsit-en-ar" instead of relying on the default, so a future plugin default change does not affect your agent.
Tighten the VAD for real microphones. silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3) filters background noise and post-echo-cancellation residual without losing real speech under normal conditions. The default Silero thresholds are too permissive for full-duplex voice agents.
Use console mode while developing. It runs locally with your microphone and prints transcripts to stdout, giving the fastest debugging loop. No LiveKit server required.
Keep the API key in the environment. Set MUNSIT_API_KEY in your shell or .env file rather than passing it in code. The plugin reads it automatically.
Pair Munsit STT with Faseeh TTS for an Arabic-native loop. Both plugins speak the same dialect register, which produces a more cohesive user experience than mixing in a non-Arabic-native TTS.

Troubleshooting

Missing or Invalid API Key

Verify that MUNSIT_API_KEY is available to the process running your agent:

echo $MUNSIT_API_KEY
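
The shell check only covers your current terminal. If the agent runs under a process manager or loads a .env file, it can help to check from inside Python as well. The following is a small sketch with a hypothetical describe_api_key helper; it reports only the key length so the secret never reaches your logs.

```python
import os
from typing import Mapping


def describe_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Report whether MUNSIT_API_KEY is visible, without printing the key itself."""
    key = env.get("MUNSIT_API_KEY", "").strip()
    if not key:
        return "MUNSIT_API_KEY is not set"
    return f"MUNSIT_API_KEY is set ({len(key)} characters)"


print(describe_api_key())
```

Call it after load_dotenv(...) in your agent file so the result reflects what the plugin will actually see.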

No Final Transcript

Make sure your AgentSession includes VAD when using batch mode. Batch mode finalizes after LiveKit signals end-of-speech.

session = AgentSession(
    stt=munsit.STT(),
    vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
    # ... llm, tts ...
)

Microphone Feedback Loop

If the agent transcribes its own TTS playback and that fake transcript triggers a new conversational turn, the cause is residual audio leaking through the microphone after acoustic echo cancellation. The bilingual munsit-en-ar model is more sensitive to low-energy input than munsit is and can return text on those residuals. Tighten the Silero VAD so quieter post-AEC audio does not reach STT:

vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),

If the loop persists, switch temporarily to model="munsit" to confirm the issue is model-specific, or run the agent with headphones so the speaker output cannot reach the microphone.
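
To see why the two VAD settings help, consider a simplified gate over per-frame speech probabilities. This is not how Silero works internally; it is only an illustration, under the assumption that speech counts when the probability stays at or above activation_threshold for at least min_speech_duration. detect_speech_runs is a hypothetical helper.

```python
def detect_speech_runs(
    probabilities: list[float],
    frame_duration: float,
    activation_threshold: float = 0.6,
    min_speech_duration: float = 0.3,
) -> list[tuple[int, int]]:
    """Return (start, end) frame index pairs for runs that count as speech."""
    runs: list[tuple[int, int]] = []
    start = None

    def close_run(end: int) -> None:
        # Keep the run only if it lasted long enough to count as speech.
        if (end - start) * frame_duration >= min_speech_duration:
            runs.append((start, end))

    for i, p in enumerate(probabilities):
        if p >= activation_threshold:
            if start is None:
                start = i
        elif start is not None:
            close_run(i)
            start = None
    if start is not None:
        close_run(len(probabilities))
    return runs


# 100 ms frames: a short blip, a quiet echo-like stretch, then real speech.
probs = [0.7, 0.7, 0.1, 0.55, 0.55, 0.55, 0.55, 0.1, 0.9, 0.9, 0.9, 0.9, 0.2]
print(detect_speech_runs(probs, frame_duration=0.1))
print(detect_speech_runs(probs, frame_duration=0.1, activation_threshold=0.5))
```

With the stricter threshold, the quieter stretch in the middle is ignored; with a threshold of 0.5 it is treated as speech, which is the kind of false trigger that feeds a feedback loop.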

Need Live Captions

Use streaming mode when your UI needs transcript updates before the speaker finishes:

stt = munsit.STT(mode="streaming", interim_results=True)

Support

Package: livekit-plugins-munsit on PyPI
Plugin Issues: GitHub Issues
Munsit Support: Schedule a Meeting
LiveKit Support: LiveKit Community

License

This plugin is licensed under Apache License 2.0. See LICENSE for details.
