This is a beta feature according to Algolia’s Terms of Service (“Beta Services”).
Guardrails classify user messages and agent responses against categories you define, blocking content that violates your policies. When content is blocked, a fallback response is returned instead.

How guardrails work

When a user sends a message to an agent with guardrails enabled:
  1. Input check: the user’s message is classified against your input-scoped categories. If it matches a category, the agent returns the fallback response without calling the LLM.
  2. Agent response: if the input passes, the agent generates a response normally.
  3. Output check: the agent’s response is classified against your output-scoped categories. If it matches, the response is replaced with the fallback.
Classification uses a separate LLM call with a dedicated model and provider. This keeps guardrail logic independent from the agent’s main model.
Guardrails use a fail-open design. If the classification LLM is unavailable (timeout, API error, rate limit), content is allowed through rather than blocked. This prevents guardrail outages from disrupting your agent.
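The flow above can be sketched in TypeScript. This is an illustrative model only, not Algolia’s implementation: `Classifier`, `Generate`, `classifyFailOpen`, and `runWithGuardrails` are hypothetical names, and the classifier is passed in as a plain function so the fail-open behavior is easy to see.

```typescript
// Hypothetical sketch of the three-step flow; none of these names come from the Algolia API.
type Direction = "input" | "output";

// Resolves to the fallback response of the matched category, or null when nothing matches.
type Classifier = (text: string, direction: Direction) => Promise<string | null>;
type Generate = (userMessage: string) => Promise<string>;

// Fail-open wrapper: a classifier failure (timeout, API error, rate limit)
// is treated as "no violation", so the content is allowed through.
async function classifyFailOpen(
  classify: Classifier,
  text: string,
  direction: Direction,
): Promise<string | null> {
  try {
    return await classify(text, direction);
  } catch {
    return null;
  }
}

async function runWithGuardrails(
  userMessage: string,
  classify: Classifier,
  generate: Generate,
): Promise<string> {
  // 1. Input check: on a match, return the fallback without calling the agent's LLM.
  const inputFallback = await classifyFailOpen(classify, userMessage, "input");
  if (inputFallback !== null) return inputFallback;

  // 2. Agent response: the input passed, so generate normally.
  const response = await generate(userMessage);

  // 3. Output check: on a match, replace the response with the fallback.
  const outputFallback = await classifyFailOpen(classify, response, "output");
  return outputFallback ?? response;
}
```

In this sketch, a classifier that throws behaves exactly like one that finds no violation, which is the trade-off of a fail-open design: availability is preserved, but content is unchecked while the classifier is down.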

Set up guardrails

1. Configure a provider

Guardrails need an LLM provider for classification. You can use the same provider as your agent or a different one.
  1. Go to your agent in the Agent Studio dashboard.
  2. Open the Safety controls tab.
  3. Enable Guardrails.
  4. Select a provider and model.
Use a fast, low-latency model for classification. Larger models may improve accuracy but add latency to every request.
2. Define your agent’s scope

Describe your agent’s domain to give the classifier context. For example: “Customer support agent for an electronics store.” This helps the classifier distinguish legitimate queries from off-topic content.
3. Add violation categories

Each category defines a type of content to block:
  • Name: identifier for the category (for example, competitor_mentions)
  • Scope: input (user messages), output (agent responses), or both
  • Description: what content this category catches
  • Fallback response: message returned when this category triggers
Add categories that match your use case. For example, an e-commerce agent might block competitor mentions, off-topic questions, and inappropriate content.
Consider consolidating categories if you have more than eight. Too many categories can reduce classification accuracy.
4. Test in the playground

Send messages that should be blocked and messages that should pass through. The playground shows guardrail violations and provider errors so you can verify your configuration before deploying.

Category scope

Each category is scoped to control when it’s checked:
| Scope | Checks user input | Checks agent output |
| --- | --- | --- |
| input | Yes | No |
| output | No | Yes |
| both | Yes | Yes |
Use input scope for categories that filter what users can ask (off-topic questions, prompt injection attempts). Use output scope for categories that filter what the agent can say (confidential information, competitor mentions). Use both when the same policy applies to both directions.
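As a rough illustration of the scope table, the following TypeScript sketch selects the categories that apply to one direction. The `Category` interface and the `categoriesFor` function are hypothetical names for this example; the default of `both` for an omitted scope follows the category reference in this document.

```typescript
// Hypothetical helper; it mirrors the scope table and is not part of any Algolia SDK.
type Scope = "input" | "output" | "both";
type Direction = "input" | "output";

interface Category {
  name: string;
  scope?: Scope; // treated as "both" when omitted
}

// Returns the categories that should be checked for the given direction.
function categoriesFor(direction: Direction, categories: Category[]): Category[] {
  return categories.filter((category) => {
    const scope = category.scope ?? "both";
    return scope === "both" || scope === direction;
  });
}

const example: Category[] = [
  { name: "off_topic", scope: "input" },
  { name: "confidential_information", scope: "output" },
  { name: "inappropriate" }, // no scope: checked in both directions
];

const inputNames = categoriesFor("input", example).map((c) => c.name);
const outputNames = categoriesFor("output", example).map((c) => c.name);
```

Here `inputNames` holds `off_topic` and `inappropriate`, and `outputNames` holds `confidential_information` and `inappropriate`.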

Fallback responses

When content is blocked, the user sees the category’s fallback response. If no fallback is configured, a default message is used:
  • Input violations: “I cannot process this request.”
  • Output violations: “I cannot provide this response.”
Configure specific fallback responses to guide users toward acceptable queries. For example: “I can only help with questions about our products and services.”
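The fallback rule can be expressed as a small lookup. This is a minimal sketch, assuming an unset fallback is represented as `undefined`; `resolveFallback` and `DEFAULT_FALLBACKS` are hypothetical names, and only the two default messages are taken from this page.

```typescript
// Hypothetical helper illustrating the fallback rule; not an Algolia API.
type GuardrailType = "input" | "output";

// Default messages used when a category has no fallback response configured.
const DEFAULT_FALLBACKS: Record<GuardrailType, string> = {
  input: "I cannot process this request.",
  output: "I cannot provide this response.",
};

// Prefer the category's own fallback; otherwise use the default for the direction.
function resolveFallback(guardrailType: GuardrailType, categoryFallback?: string): string {
  return categoryFallback ?? DEFAULT_FALLBACKS[guardrailType];
}

const custom = resolveFallback("input", "I can only help with questions about our products and services.");
const defaultInput = resolveFallback("input");
const defaultOutput = resolveFallback("output");
```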

Streaming behavior

For streaming responses, guardrails work differently for input and output:
  • Input guardrails run concurrently with the LLM stream. If a violation is detected mid-stream, a violation event is emitted and the client discards any already-streamed content.
  • Output guardrails classify the full response after streaming completes. If a violation is detected, a violation event is emitted as the final chunk. The client replaces the streamed content with the fallback response.
In both cases, the Vercel AI SDK handles the replacement automatically. If you’re building a custom integration, handle the guardrail violation event (data-guardrail-violation in AI SDK v5, guardrailViolation in AI SDK v4) in your streaming parser.

API configuration

Configure guardrails in the agent’s config.guardrail object:
JSON
{
  "config": {
    "guardrail": {
      "enabled": true,
      "providerId": "PROVIDER_UUID",
      "model": "gpt-4.1-mini",
      "scope": "Customer support agent for an electronics store.",
      "categories": [
        {
          "name": "off_topic",
          "scope": "input",
          "description": "Questions unrelated to electronics or the store.",
          "fallbackResponse": "I can only help with electronics questions."
        },
        {
          "name": "inappropriate",
          "scope": "both",
          "description": "Offensive, hateful, or sexually explicit content."
        }
      ]
    }
  }
}

Configuration reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Turn guardrails on or off |
| required | boolean | false | Return 503 if the guardrail provider can’t be initialized |
| providerId | string | | UUID of the provider authentication for classification |
| model | string | | Model name for the classification LLM |
| scope | string | | Description of the agent’s domain (max 1,024 characters) |
| categories | array | [] | List of violation categories |

Category reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Category identifier (1-64 characters) |
| scope | string | both | input, output, or both |
| description | string | | What content this category catches (max 1,024 characters) |
| fallbackResponse | string | | Message returned when this category triggers |
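The two reference tables above translate naturally into types plus a pre-flight check. The sketch below is an assumption-laden illustration: `GuardrailConfig`, `GuardrailCategory`, and `validateGuardrailConfig` are hypothetical names, fields without a listed default are assumed to be optional, and lengths are counted with JavaScript’s `String.length`, which may differ from how the API counts characters.

```typescript
// Hypothetical types and validator derived from the reference tables; not an official SDK.
const SCOPES: readonly string[] = ["input", "output", "both"];

interface GuardrailCategory {
  name: string; // required, 1-64 characters
  scope?: string; // "input", "output", or "both" (default: "both")
  description?: string; // max 1,024 characters
  fallbackResponse?: string;
}

interface GuardrailConfig {
  enabled?: boolean; // default: false
  required?: boolean; // default: false
  providerId?: string;
  model?: string;
  scope?: string; // max 1,024 characters
  categories?: GuardrailCategory[]; // default: []
}

// Returns a list of human-readable problems; an empty list means the documented limits are met.
function validateGuardrailConfig(config: GuardrailConfig): string[] {
  const errors: string[] = [];
  if ((config.scope ?? "").length > 1024) {
    errors.push("scope exceeds 1,024 characters");
  }
  (config.categories ?? []).forEach((category, index) => {
    const label = `categories[${index}]`;
    if (category.name.length < 1 || category.name.length > 64) {
      errors.push(`${label}.name must be 1-64 characters`);
    }
    if (category.scope !== undefined && SCOPES.indexOf(category.scope) === -1) {
      errors.push(`${label}.scope must be input, output, or both`);
    }
    if ((category.description ?? "").length > 1024) {
      errors.push(`${label}.description exceeds 1,024 characters`);
    }
  });
  return errors;
}
```

Running a check like this before saving a configuration catches limit violations locally. It doesn’t replace server-side validation, whose exact rules may differ.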

Handling guardrail events

If you’re using the Vercel AI SDK, guardrail violations are handled automatically. The SDK replaces blocked content with the fallback response. For custom integrations, handle these streaming events:
| Event (AI SDK v5) | Event (AI SDK v4) | Description |
| --- | --- | --- |
| data-guardrail-violation | guardrailViolation data chunk | Content was blocked. Contains category, guardrailType, and fallbackResponse. |
When you receive a violation event:
  1. Discard any content already streamed for this message.
  2. Display the fallbackResponse from the event data.
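For a custom integration, the two steps above might look like the following reducer over stream parts. This is a sketch under stated assumptions: only the `data-guardrail-violation` event type and its `category`, `guardrailType`, and `fallbackResponse` fields come from this page. The `text-delta` part shape, the `data` wrapper, `MessageState`, and `applyPart` are hypothetical, so adapt them to the chunks your parser actually receives.

```typescript
// Fields documented for the violation event.
interface GuardrailViolationData {
  category: string;
  guardrailType: string;
  fallbackResponse: string;
}

// Assumed part shapes for illustration; check them against your actual stream.
type StreamPart =
  | { type: "text-delta"; delta: string }
  | { type: "data-guardrail-violation"; data: GuardrailViolationData };

interface MessageState {
  text: string; // what the UI should display for this message
  blockedBy: string | null; // category name once a violation has arrived
}

function applyPart(state: MessageState, part: StreamPart): MessageState {
  // Assumption: once a message is blocked, any later text for it is ignored.
  if (state.blockedBy !== null) return state;

  if (part.type === "data-guardrail-violation") {
    // 1. Discard content already streamed. 2. Display the fallback response.
    return { text: part.data.fallbackResponse, blockedBy: part.data.category };
  }
  return { ...state, text: state.text + part.delta };
}

const initialState: MessageState = { text: "", blockedBy: null };
const parts: StreamPart[] = [
  { type: "text-delta", delta: "Our competitor " },
  { type: "text-delta", delta: "sells this for less." },
  {
    type: "data-guardrail-violation",
    data: {
      category: "competitor_mentions",
      guardrailType: "output",
      fallbackResponse: "I cannot provide this response.",
    },
  },
];
const finalState = parts.reduce(applyPart, initialState);
```

After the reduction, `finalState.text` is the fallback response and the two streamed text chunks are gone, which matches the discard-then-display behavior described above.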

Troubleshooting

Guardrail not blocking content

  • Verify the category’s scope matches the direction you’re testing (input vs output).
  • Check the category description is specific enough for the classifier to detect violations.
  • Test with clear violations first, then refine descriptions for edge cases.

“Review your guardrail model configuration” error

This error appears in the playground when the guardrail provider can’t classify content. Common causes:
  • Invalid or expired API key on the provider
  • Wrong model name
  • Provider rate limit exceeded
Check your provider settings and try again. This error doesn’t appear in production. In production, guardrails fail open silently.

Slow response times

Guardrail classification adds latency to each request. To minimize this:
  • Use a fast, low-latency model for classification
  • Keep the number of categories at eight or fewer
  • Keep category descriptions concise

Last modified on April 24, 2026