Section 1: LLM Proxy — Streaming Chat API#

In this tutorial, you'll build a streaming chat application from scratch. We'll start with the simplest possible implementation and progressively add capabilities until we have a fully functional chat app with LLM integration and conversation history.

By the end, you'll understand how Ragbits handles the infrastructure so you can focus on your application logic.

What You'll Build#

A chat application that:

  • Streams responses from any LLM provider in real-time
  • Maintains conversation history across messages
  • Provides a web UI out of the box
  • Exposes a REST API for programmatic access

Prerequisites#

Before starting, make sure you have:

  • Python 3.10 or higher installed
  • An OpenAI API key (or another LLM provider key)

Install Ragbits:

# With pip
pip install ragbits

# Or, if you use uv
uv add ragbits

Set your API key:

export OPENAI_API_KEY="your-api-key"

Other Providers

Ragbits uses LiteLLM under the hood, supporting 100+ providers:

  • Anthropic: export ANTHROPIC_API_KEY="your-key"
  • Azure OpenAI: export AZURE_API_KEY="your-key" and export AZURE_API_BASE="your-endpoint"
  • Ollama (local): No API key needed, just have Ollama running
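Whichever provider you choose, a missing key usually surfaces only later as an opaque authentication error on the first request. A small stdlib check at startup fails faster; this helper is our own sketch, not part of Ragbits:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, raising a clear error if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f'{name} is not set; run: export {name}="your-key"')
    return value


# Call before constructing the LLM, e.g.:
# require_env("OPENAI_API_KEY")
```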

Step 1: Create a Minimal Chat Interface#

Create a new file called main.py with the following code:

from collections.abc import AsyncGenerator

from ragbits.chat.api import RagbitsAPI
from ragbits.chat.interface import ChatInterface
from ragbits.chat.interface.types import ChatContext, ChatResponse
from ragbits.core.prompt import ChatFormat


class SimpleStreamingChat(ChatInterface):
    """A minimal chat interface that echoes user messages."""

    async def chat(
        self,
        message: str,
        history: ChatFormat,
        context: ChatContext,
    ) -> AsyncGenerator[ChatResponse, None]:
        """Process a chat message and return an echo response."""
        yield self.create_text_response("Hello! You said: " + message)

This is the core contract in Ragbits. The ChatInterface class requires you to implement one method: chat(). This method:

  • Receives the user's message
  • Receives the conversation history (we'll use this later)
  • Receives additional context (user info, settings, etc.)
  • Yields ChatResponse objects

The create_text_response() helper creates a properly formatted response. Since chat() is an async generator (note the yield), Ragbits can stream responses to the client as they're generated.
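The streaming behavior comes from plain Python async generators; no Ragbits machinery is needed to see the pattern. Here is a minimal stdlib sketch of the same yield-as-you-go contract (`chat_stream` and its chunks are invented for illustration):

```python
import asyncio
from collections.abc import AsyncGenerator


async def chat_stream(message: str) -> AsyncGenerator[str, None]:
    # Yield the reply in pieces, the way chat() yields ChatResponse objects.
    for chunk in ["Hello! ", "You said: ", message]:
        yield chunk


async def main() -> str:
    # The consumer receives each piece as soon as it is yielded,
    # which is what lets Ragbits stream to the browser incrementally.
    received = []
    async for chunk in chat_stream("hi"):
        received.append(chunk)
    return "".join(received)


print(asyncio.run(main()))  # → Hello! You said: hi
```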

Launch the Application#

To run your chat interface, wrap it with RagbitsAPI. Add this to the bottom of your file:

if __name__ == "__main__":
    api = RagbitsAPI(SimpleStreamingChat)
    api.run()

Run the application:

python main.py

Open http://127.0.0.1:8000 in your browser. You'll see a chat interface. Type a message and you'll get back "Hello! You said: [your message]".

This works, but it's not very useful yet. Let's add an actual LLM.

Step 2: Add an LLM#

Ragbits uses LiteLLM to provide a unified interface to 100+ LLM providers. Add the import and an __init__ method to your class:

from collections.abc import AsyncGenerator

from ragbits.chat.api import RagbitsAPI
from ragbits.chat.interface import ChatInterface
from ragbits.chat.interface.types import ChatContext, ChatResponse
from ragbits.core.llms import LiteLLM
from ragbits.core.prompt import ChatFormat


class SimpleStreamingChat(ChatInterface):
    """A chat interface with LLM initialized but not yet connected."""

    def __init__(self) -> None:
        self.llm = LiteLLM(model_name="gpt-4o-mini")

    async def chat(
        self,
        message: str,
        history: ChatFormat,
        context: ChatContext,
    ) -> AsyncGenerator[ChatResponse, None]:
        """Process a chat message and return an echo response."""
        yield self.create_text_response("Hello! You said: " + message)

You can change "gpt-4o-mini" to any model supported by LiteLLM:

| Provider  | Model Name |
|-----------|------------|
| OpenAI    | gpt-4o-mini, gpt-4o, o1 |
| Anthropic | claude-sonnet-4-20250514, claude-3-5-haiku-20241022 |
| Azure     | azure/gpt-4o |
| Ollama    | ollama/llama3.2 |

The LLM is ready, but we're not using it yet. Let's connect it to our chat method.

Step 3: Connect the LLM to Chat#

Now let's make the LLM actually respond to messages. Update the chat() method:

    async def chat(
        self,
        message: str,
        history: ChatFormat,
        context: ChatContext,
    ) -> AsyncGenerator[ChatResponse, None]:
        """Process a chat message and stream the LLM response."""
        conversation = [{"role": "user", "content": message}]
        stream = self.llm.generate_streaming(conversation)

        async for chunk in stream:
            yield self.create_text_response(chunk)

Here's what changed:

  1. We create a conversation list with the user's message in OpenAI's chat format
  2. We call generate_streaming() which returns an async generator
  3. We iterate over the stream, yielding each chunk as it arrives

Run the app again and try it. You'll see the LLM's response stream in real-time, token by token.

But there's a problem: the LLM doesn't remember previous messages. Each message is treated as a new conversation. Let's fix that.

Step 4: Add Conversation History#

The history parameter contains all previous messages in the conversation. Update the chat() method to include history:

    async def chat(
        self,
        message: str,
        history: ChatFormat,
        context: ChatContext,
    ) -> AsyncGenerator[ChatResponse, None]:
        """
        Process a chat message and stream the response.

        Args:
            message: The current user message
            history: Previous messages in the conversation
            context: Additional context (user info, settings, etc.)

        Yields:
            ChatResponse objects containing streamed text chunks
        """
        stream = self.llm.generate_streaming([*history, {"role": "user", "content": message}])

        async for chunk in stream:
            yield self.create_text_response(chunk)

The key change is spreading the history list before the current message:

[*history, {"role": "user", "content": message}]

The history parameter uses the OpenAI chat format:

[
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"},
    {"role": "user", "content": "What's the weather like?"},
    # ... and so on
]

Ragbits automatically manages this history for you. Each time the user sends a message, the previous messages are passed to your chat() method.
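Because the full history is replayed on every turn, very long conversations can eventually exceed the model's context window. One common mitigation is to keep only the most recent messages before calling the LLM; `trim_history` below is our own helper sketch, not a Ragbits API:

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages, preserving their order."""
    if len(history) <= max_messages:
        return history
    return history[-max_messages:]


# Inside chat(), you would pass the trimmed list instead of the full history:
# stream = self.llm.generate_streaming(
#     [*trim_history(history), {"role": "user", "content": message}]
# )
```

A fixed message cap is the simplest policy; token-based trimming or summarizing older turns are common refinements once you know your model's context limits.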

Run the app and have a multi-turn conversation. The LLM now remembers what you discussed earlier.

Running with the CLI#

So far we've been running the app with python main.py. Ragbits also provides a CLI that offers more options:

ragbits api run main:SimpleStreamingChat

The format is module.path:ClassName. The CLI supports additional flags:

# Custom host and port
ragbits api run main:SimpleStreamingChat --host 0.0.0.0 --port 9000

# Auto-reload when code changes (useful for development)
ragbits api run main:SimpleStreamingChat --reload

# Enable debug mode
ragbits api run main:SimpleStreamingChat --debug

The Complete Application#

Here's the final code that includes everything we built:

View full source on GitHub

main.py
"""
Section 1: LLM Proxy — Streaming Chat API

A minimal streaming chat application using Ragbits.
This establishes the foundational pattern that all subsequent sections build upon.

Run with CLI:
    ragbits api run ragbits_example.main:SimpleStreamingChat

Or programmatically:
    python -m ragbits_example.main
"""

from collections.abc import AsyncGenerator

from ragbits.chat.api import RagbitsAPI
from ragbits.chat.interface import ChatInterface
from ragbits.chat.interface.types import ChatContext, ChatResponse
from ragbits.core.llms import LiteLLM
from ragbits.core.prompt import ChatFormat


class SimpleStreamingChat(ChatInterface):
    """A minimal streaming chat interface that proxies requests to any LLM provider."""

    def __init__(self) -> None:
        self.llm = LiteLLM(model_name="gpt-4o-mini")

    async def chat(
        self,
        message: str,
        history: ChatFormat,
        context: ChatContext,
    ) -> AsyncGenerator[ChatResponse, None]:
        """
        Process a chat message and stream the response.

        Args:
            message: The current user message
            history: Previous messages in the conversation
            context: Additional context (user info, settings, etc.)

        Yields:
            ChatResponse objects containing streamed text chunks
        """
        stream = self.llm.generate_streaming([*history, {"role": "user", "content": message}])

        async for chunk in stream:
            yield self.create_text_response(chunk)


if __name__ == "__main__":
    api = RagbitsAPI(SimpleStreamingChat)
    api.run()

What You've Learned#

In this tutorial, you:

  1. Created a minimal ChatInterface implementation
  2. Launched it with RagbitsAPI to get a web UI and REST API
  3. Integrated an LLM using LiteLLM
  4. Added streaming responses for real-time output
  5. Enabled conversation history for multi-turn chats

The key insight: Ragbits handles the infrastructure (API server, streaming, UI, history management) so you can focus on your application logic. Your chat() method is where your business logic lives.

What's Next#

You now have a working chat application with streaming responses and conversation history. In the next section, you'll add structured output to get type-safe, predictable responses from your LLM.

Reference#

| Component | Package | Purpose |
|-----------|---------|---------|
| ChatInterface | ragbits-chat | Abstract base class defining the chat() contract |
| RagbitsAPI | ragbits-chat | FastAPI application with built-in UI and streaming |
| LiteLLM | ragbits-core | Unified interface to 100+ LLM providers |
| ChatFormat | ragbits-core | OpenAI-compatible message format |
| ChatResponse | ragbits-chat | Response envelope for streaming content |