Back to Wiki
Development
Last updated: 2024-12-266 min read

Deployment

Deploying AI agents in production

Deployment

Moving an AI agent from a local Jupyter notebook to a scalable production environment brings a unique set of challenges.

Deployment Architectures

1. Stateless Microservices

The most common pattern. The agent logic runs in a container (Docker) and exposes an HTTP endpoint (REST or GraphQL).

  • Pros: Easy to scale horizontally behind a load balancer.
  • Cons: Managing state (conversation history) requires an external database (Redis/Postgres).

2. Stateful WebSocket Servers

For real-time, streaming interactions, WebSockets are preferred over HTTP.

  • Pros: Lower latency, supports streaming tokens effectively.
  • Cons: Harder to scale (sticky sessions needed), connection management is complex.

3. Edge Deployment

Running smaller models directly on the user's device (browser or mobile).

  • Model: TensorFlow.js, ONNX Runtime, or specialized mobile models (e.g., Gemma 2B).
  • Pros: Zero latency, works offline, better privacy.
  • Cons: Limited model capability, drains battery.

Key Considerations

Streaming

Users hate waiting 5 seconds for a full paragraph to appear.

  • Token Streaming: Send each chunk of text as it is generated. This creates a perception of instant response.
  • Protocol: Server-Sent Events (SSE) is often easier than WebSockets for simple one-way streaming.

Rate Limiting & Cost Control

LLM APIs are expensive.

  • User Quotas: Limit requests per user/day.
  • Caching: Semantic caching (e.g., GPTCache) to store responses for similar queries and save costs.

Security

  • Prompt Injection: Ensure specific firewalls (like Reask or Lakera) are in place to detect malicious inputs.
  • Data Leakage: Ensure the agent doesn't output sensitive data (PII) from its training or context.

CI/CD for Agents

  1. Code Change: Developer commits new prompt or logic.
  2. Unit Tests: Run code assertions.
  3. Eval Run: Run a subset of "Golden Dataset" queries through the agent.
  4. Gate: If generic evaluation score drops > 5%, block deployment.
  5. Deploy: Push to staging, then production.