Munsit STT for LiveKit Agents

The livekit-plugins-munsit package adds Munsit speech-to-text to LiveKit Agents. It is optimized for Arabic speech recognition and supports Arabic/English code-switching through the munsit-en-ar model.

Prerequisites

  • A Munsit AI account
  • Python 3.10 or higher
  • Basic familiarity with the LiveKit Agents framework
  • A LiveKit Cloud account or self-hosted LiveKit server

API Key

Go to Munsit - API Keys and generate an API key. The key is shown only once, so store it securely.
Set the key in your environment:
MUNSIT_API_KEY=your_MUNSIT_API_KEY_here
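The plugin reads MUNSIT_API_KEY automatically. The api_key parameter (see the Configuration Reference below) also accepts the key directly if you load it yourself:
import os

from livekit.plugins import munsit

# Equivalent to relying on the MUNSIT_API_KEY environment variable.
stt = munsit.STT(api_key=os.environ["MUNSIT_API_KEY"])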

Installation

Install the Munsit STT plugin for LiveKit Agents:
pip install livekit-plugins-munsit
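The examples below also use the Silero VAD and OpenAI plugins plus python-dotenv. Assuming the standard LiveKit Agents extras naming, they install with:
pip install "livekit-agents[silero,openai]" python-dotenv
The full agent example additionally uses the Faseeh TTS plugin, which ships as its own package.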

Quick Start

from livekit.agents import AgentSession
from livekit.plugins import munsit, silero

session = AgentSession(
    stt=munsit.STT(),
    vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
    # ... llm, tts ...
)
munsit.STT() uses batch mode by default. In batch mode, LiveKit’s VAD signals end-of-speech through flush(), then the plugin sends the buffered utterance to Munsit and emits a final transcript with word-level timestamps. The Silero VAD thresholds above (activation_threshold=0.6, min_speech_duration=0.3) are tuned for real-world microphones, where post-echo-cancellation audio from the agent’s own speaker can otherwise be misinterpreted as user speech. See Best Practices for details.
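To consume those final transcripts outside the agent's normal LLM loop, you can subscribe to the session's transcription event. A minimal sketch, assuming the user_input_transcribed event available in recent LiveKit Agents releases:
@session.on("user_input_transcribed")
def on_user_transcript(event):
    # Batch mode emits no interim events, so every event here is final.
    if event.is_final:
        print("user:", event.transcript)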

Modes

| Mode | When to use | Endpoint | Latency | Word timestamps | Interim events |
| --- | --- | --- | --- | --- | --- |
| batch (default) | Production with AgentSession + VAD; transcribing recorded audio. | POST /api/v1/audio/transcribe | VAD detection + upload + server processing (~1-2 s for short utterances) | Yes, populated on SpeechData.words | No |
| streaming | Live captions or on-the-fly UI updates while the user is still speaking. | WS /api/v1/websocket/speech-to-text | ~700 ms idle threshold by default | No | Yes, through INTERIM_TRANSCRIPT events |
from livekit.plugins import munsit

batch_stt = munsit.STT()
streaming_stt = munsit.STT(mode="streaming")
STT.recognize(audio_buffer) always uses the batch HTTP endpoint, even when mode="streaming" is configured.

Full Agent Example

Create a file called arabic_stt_agent.py:
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import Agent, AgentSession, AgentServer
from livekit.plugins import faseeh, munsit, openai, silero

load_dotenv(".env.local")


class ArabicAssistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""أنت مساعد صوتي ذكي يتحدث العربية بطلاقة.
            أجب على المستخدم بطريقة واضحة ومختصرة."""
        )


server = AgentServer()


@server.rtc_session()
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt=munsit.STT(
            model="munsit",
            mode="batch",
        ),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=faseeh.TTS(),
        vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
    )

    await session.start(
        room=ctx.room,
        agent=ArabicAssistant(),
    )


if __name__ == "__main__":
    agents.cli.run_app(server)
Run the agent locally:
python arabic_stt_agent.py dev
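The example reads .env.local. A minimal file looks like the sketch below; the LiveKit variables are the standard Agents credentials, the OpenAI key follows that plugin's convention, and any Faseeh credentials depend on that plugin:
MUNSIT_API_KEY=your_MUNSIT_API_KEY_here
OPENAI_API_KEY=your_OPENAI_API_KEY_here
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_LIVEKIT_API_KEY_here
LIVEKIT_API_SECRET=your_LIVEKIT_API_SECRET_here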
Pairing Munsit STT with Faseeh TTS gives you a fully Arabic-native voice loop. If you’d rather use a non-Munsit TTS, cartesia.TTS() is a strong choice; openai.TTS() works but produces less natural Arabic prosody.

Model Selection

Choose the model based on the input language:
| Model | Use case |
| --- | --- |
| munsit | Arabic speech recognition. This is the default model. |
| munsit-en-ar | Mixed Arabic-English speech with code-switching. |
stt = munsit.STT(model="munsit-en-ar")

Configuration Reference

| Parameter | Default | Description |
| --- | --- | --- |
| mode | batch | batch for HTTP transcription or streaming for WebSocket transcription. |
| model | munsit | munsit for Arabic or munsit-en-ar for Arabic/English code-switching. |
| api_key | env MUNSIT_API_KEY | Munsit API key. |
| base_url | Munsit production WebSocket URL | Override the streaming WebSocket URL. |
| batch_base_url | Munsit production HTTPS URL | Override the batch HTTP URL. |
| auth_method | header | Authentication style: header, bearer, or query. |
| sample_rate | 16000 | Sample rate used for the generated WAV header. |
| num_channels | 1 | Number of audio channels. |
| interim_results | True | Emits interim transcripts in streaming mode. |
| endpointing | server_diff | Streaming endpointing strategy: server_diff or client_vad. |
| finalize_after_silence_ms | 700 | Silence threshold before finalizing in server_diff mode. |
| energy_filter | False | Enables energy filtering for client_vad mode. |
| vad_silence_ms | 1500 | Silence duration used by client_vad mode. |
| language | None | Label attached to SpeechData.language; defaults to ar. |
| http_session | None | Custom aiohttp.ClientSession. |
| extra_query_params | None | Extra query params for the streaming WebSocket endpoint. |
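As an illustration of the URL and authentication overrides, here is a sketch pointing the plugin at a hypothetical self-hosted deployment (the example.internal URLs are placeholders, and the exact path shape batch_base_url expects may differ):
from livekit.plugins import munsit

# Placeholder endpoints for a self-hosted deployment; not real Munsit URLs.
stt = munsit.STT(
    mode="streaming",
    base_url="wss://stt.example.internal/api/v1/websocket/speech-to-text",
    batch_base_url="https://stt.example.internal",
    auth_method="bearer",
)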

Authentication Methods

Munsit accepts the API key in three different places. Choose whichever fits your deployment:
# Default: sends the key as the x-api-key header.
munsit.STT(auth_method="header")

# Authorization: Bearer <key>
munsit.STT(auth_method="bearer")

# Query parameter (?token=<key>): useful when an upstream proxy strips headers.
munsit.STT(auth_method="query")
All three methods work on both the batch HTTP endpoint and the streaming WebSocket endpoint.

Streaming Endpointing

When mode="streaming" is enabled, the plugin supports two endpointing strategies:
  • server_diff: Keeps one long-lived WebSocket open, emits interim transcripts as cumulative text arrives, then emits a final transcript and end-of-speech after finalize_after_silence_ms of server silence.
  • client_vad: Opens a WebSocket when local audio energy starts and closes it after silence. This gives stronger utterance boundaries with slightly more connection overhead.
stt = munsit.STT(
    mode="streaming",
    endpointing="server_diff",
    finalize_after_silence_ms=700,
)
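The client_vad alternative combines the energy_filter and vad_silence_ms parameters from the configuration reference:
stt = munsit.STT(
    mode="streaming",
    endpointing="client_vad",
    energy_filter=True,
    vad_silence_ms=1500,
)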

Synchronous Batch Recognition

When you have a recorded audio buffer, such as a voicemail or uploaded file, and do not need a live stream, call recognize directly. This always uses the batch HTTP endpoint regardless of the mode setting and returns a transcript with word-level timestamps.
from livekit import rtc
from livekit.plugins import munsit

stt = munsit.STT()
frames = [...]  # list of rtc.AudioFrame
combined = rtc.combine_audio_frames(frames)

result = await stt.recognize(combined)
print(result.alternatives[0].text)
for word in result.alternatives[0].words:
    print(f"{word.start_time:.2f}s -> {word.end_time:.2f}s  {word}")

Tracking Turn Metrics

Each conversation turn carries timing data on its ChatMessage. Subscribe to conversation_item_added to read transcription delay, end-of-turn delay, and downstream LLM/TTS metrics:
from livekit.agents import ChatMessage


@session.on("conversation_item_added")
def on_item(event):
    msg = event.item
    if not isinstance(msg, ChatMessage):
        return

    metrics = msg.metrics or {}
    if msg.role == "user":
        transcription_delay = metrics.get("transcription_delay")
        end_of_turn_delay = metrics.get("end_of_turn_delay")
        if transcription_delay is not None:
            print(f"STT delay: {transcription_delay * 1000:.0f} ms")
        if end_of_turn_delay is not None:
            print(f"EOU delay: {end_of_turn_delay * 1000:.0f} ms")
    elif msg.role == "assistant":
        llm_ttft = metrics.get("llm_node_ttft")
        tts_ttfb = metrics.get("tts_node_ttfb")
        if llm_ttft:
            print(f"LLM TTFT: {llm_ttft * 1000:.0f} ms")
        if tts_ttfb:
            print(f"TTS TTFB: {tts_ttfb * 1000:.0f} ms")
All values are reported in seconds.
The previous metrics_collected event is deprecated. New integrations should use conversation_item_added and read metrics from ChatMessage.metrics.

Best Practices

  • Use batch mode in production. It returns word-level timestamps and is more accurate than streaming on Arabic. Switch to streaming only when your UI needs to update before the speaker finishes.
  • Pin the model explicitly. Pass model="munsit" or model="munsit-en-ar" instead of relying on the default, so a future plugin default change does not affect your agent.
  • Tighten the VAD for real microphones. silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3) filters background noise and post-echo-cancellation residual without losing real speech under normal conditions. The default Silero thresholds are too permissive for full-duplex voice agents.
  • Use console mode while developing. It runs locally with your microphone and prints transcripts to stdout, giving the fastest debugging loop, with no LiveKit server required; see the command after this list.
  • Keep the API key in the environment. Set MUNSIT_API_KEY in your shell or .env file rather than passing it in code. The plugin reads it automatically.
  • Pair Munsit STT with Faseeh TTS for an Arabic-native loop. Both plugins speak the same dialect register, which produces a more cohesive user experience than mixing in a non-Arabic-native TTS.
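A console run, using the agent file from the full example above:
python arabic_stt_agent.py console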

Troubleshooting

Missing or Invalid API Key

Verify that MUNSIT_API_KEY is available to the process running your agent:
echo $MUNSIT_API_KEY

No Final Transcript

Make sure your AgentSession includes VAD when using batch mode. Batch mode finalizes after LiveKit signals end-of-speech.
session = AgentSession(
    stt=munsit.STT(),
    vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
    # ... llm, tts ...
)

Microphone Feedback Loop

If the agent transcribes its own TTS playback and that fake transcript triggers a new conversational turn, the cause is residual audio leaking through the microphone after acoustic echo cancellation. The bilingual munsit-en-ar model is more sensitive to low-energy input than munsit is and can return text on those residuals. Tighten the Silero VAD so quieter post-AEC audio does not reach STT:
vad=silero.VAD.load(activation_threshold=0.6, min_speech_duration=0.3),
If the loop persists, switch temporarily to model="munsit" to confirm the issue is model-specific, or run the agent with headphones so the speaker output cannot reach the microphone.

Need Live Captions

Use streaming mode when your UI needs transcript updates before the speaker finishes:
stt = munsit.STT(mode="streaming", interim_results=True)

License

This plugin is licensed under Apache License 2.0. See LICENSE for details.