
General

What is Omium?
Omium is an observability and reliability platform for AI agents. It captures execution traces, creates state checkpoints, detects failures, and enables one-click recovery — so your LangGraph, CrewAI, or custom agents stay debuggable in production.
Which frameworks does Omium support?
Omium auto-instruments LangGraph and CrewAI. For any other Python framework (or custom agents), use the @omium.trace and @omium.checkpoint decorators. The REST API works from any language.
Is there a free tier?
Yes. The free tier includes 500 agent executions per month, core tracing, checkpoints, and 7-day data retention. No credit card required.
Get started →
How long does setup take?
Under 5 minutes. Install the SDK (pip install omium), run omium init to authenticate, and add two lines to your agent code.
See the quickstart →
Do I need to replace my existing monitoring tools?
No. Omium complements your existing stack. It adds agent-specific observability — execution traces, state checkpoints, failure detection, and replay — that generic logging tools like Datadog or Sentry don’t provide for multi-step AI workflows.

Debugging agent failures

How do I find out why my agent run failed?
Omium captures every step of your agent’s execution as a trace. Open the failed run in the dashboard, expand the execution timeline, and click into the failing step. You’ll see the exact input, output, tool calls, and error message at the point of failure — no manual logging required.
Related: Execution tracing →
Do I have to re-run the whole pipeline after a failure?
Omium saves checkpoints — full state snapshots at critical points during execution. When a run fails, you can replay from the last valid checkpoint instead of re-running the entire pipeline. This saves time and avoids duplicate API calls.
# Replay from the last checkpoint
omium replay <execution_id>
Related: Checkpoints API →, CLI replay →
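Conceptually, checkpoint-and-replay works like the following pure-Python sketch. It is illustrative only — not Omium's internals — and the step functions and state shape are made up:

```python
# Illustrative sketch of checkpoint-and-replay (not Omium's actual internals).
# A snapshot of state is saved after each successful step; a retry resumes
# from the last snapshot instead of restarting from scratch.
import copy

def run_with_checkpoints(steps, state, checkpoints):
    """Run `steps` in order, snapshotting state after each successful step.

    `checkpoints` maps step index -> saved state, so a replay skips the
    steps that already completed.
    """
    start = max(checkpoints) + 1 if checkpoints else 0
    if start > 0:
        # Resume from the last valid snapshot rather than the caller's state.
        state = copy.deepcopy(checkpoints[start - 1])
    for i in range(start, len(steps)):
        state = steps[i](state)
        checkpoints[i] = copy.deepcopy(state)  # snapshot after each step
    return state
```

If step 2 of 3 raises on the first run, the second call finds checkpoints for steps 0 and 1 already saved and re-executes only step 2 — the same effect the CLI replay command describes, without duplicate work for the completed steps.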
What if my agent finishes without errors but produces bad output?
These are silent failures — the hardest kind to catch. Omium’s failure detection monitors for output drift, hallucinations, and quality degradation even when no exception is thrown. Enable omium.instrument_langgraph() and the dashboard will flag anomalies automatically.
Related: LangGraph integration →
What happens when my agent gets stuck in a loop?
Omium detects infinite loops and circular tool-call patterns in real time. When one is detected, you’ll see a failure alert in the dashboard with the exact loop pattern. You can then:
  1. View the trace to see where the loop starts
  2. Roll back to the last checkpoint before the loop
  3. Apply a fix and replay
Related: CrewAI integration →, Failures API →
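The detection idea can be sketched in pure Python: watch the tail of the tool-call history and flag when a short pattern repeats several times in a row. This is an illustrative toy, not Omium's actual detector, and the thresholds are arbitrary:

```python
# Toy loop detector (not Omium's actual algorithm): report the repeating
# pattern when the call history ends in `min_repeats` copies of the same
# short tool-call sequence.
def detect_loop(tool_calls, max_pattern=3, min_repeats=3):
    """Return the repeating pattern if the history ends in a loop, else None."""
    for size in range(1, max_pattern + 1):
        pattern = tool_calls[-size:]
        window = tool_calls[-size * min_repeats:]
        # The tail must be exactly `min_repeats` back-to-back copies of `pattern`.
        if len(window) == size * min_repeats and window == pattern * min_repeats:
            return pattern
    return None
```

A history like search → fetch → search → fetch → search → fetch trips the detector with pattern ["search", "fetch"], while a non-repeating history returns None.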
How do I debug handoffs between multiple agents?
Omium traces the full execution graph across agents — including handoffs, shared state, and tool calls between agents. The dashboard visualizes these as a connected timeline so you can follow the flow from one agent to another and pinpoint where communication breaks down.
Related: Platform capabilities →
Can I find slow or timed-out tool calls?
Yes. Omium traces every external tool call your agent makes, including its duration. You can filter executions by status (failed, timeout) and sort by latency to find the slow calls. Checkpoints taken before the timeout let you retry just the failing step.
Related: Executions API →
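The filter-then-sort step amounts to something like the sketch below. The record fields (status, duration_ms, tool) are illustrative placeholders, not Omium's actual API schema:

```python
# Sketch: surface the slowest failing or timed-out calls from a batch of
# execution records. Field names are illustrative, not Omium's schema.
def slowest_failures(executions, statuses=("failed", "timeout"), top=5):
    """Return the `top` matching executions, slowest first."""
    hits = [e for e in executions if e["status"] in statuses]
    return sorted(hits, key=lambda e: e["duration_ms"], reverse=True)[:top]
```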

Monitoring and observability

How do I monitor my agents in production?
Once instrumented, Omium automatically captures every execution. The dashboard shows real-time metrics: success rate, failure rate, latency, cost per run, and active runs. Set up Slack notifications for failures and daily digests.
Related: Automations →
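The dashboard metrics are simple aggregates over run records. A hedged sketch of the arithmetic, with made-up field names rather than Omium's real schema:

```python
# Sketch: dashboard-style aggregates computed from raw run records.
# Field names (status, duration_ms, cost_usd) are illustrative only.
def summarize(runs):
    total = len(runs)
    ok = sum(1 for r in runs if r["status"] == "success")
    return {
        "success_rate": ok / total if total else 0.0,
        "failure_rate": (total - ok) / total if total else 0.0,
        "avg_latency_ms": sum(r["duration_ms"] for r in runs) / total if total else 0.0,
        "total_cost_usd": sum(r.get("cost_usd", 0.0) for r in runs),
    }
```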
Does Omium track token usage and cost?
Yes. Omium tracks token usage and estimated cost per execution, broken down by workflow, model, and time period. The Cost page in the dashboard shows trends and lets you set budget alerts.
Related: Billing API →, API keys & billing →
How do I get alerted when something breaks?
Connect Slack in your dashboard settings. Omium sends real-time failure alerts to your configured channel. You can also set up daily/weekly digest reports that summarize wins, issues, and key metrics.
Related: Platform →
What’s the difference between a trace and a checkpoint?
A trace is a read-only record of what happened — every step, tool call, and LLM response during an execution. A checkpoint is a writable state snapshot that you can roll back to and replay from. Traces help you understand; checkpoints help you recover.
Related: Checkpoints →, Platform capabilities →

Integration and setup

Does Omium work with any LLM provider?
Yes. Omium is LLM-agnostic. It instruments at the agent framework level (LangGraph, CrewAI) or at the function level (@omium.trace), so it works regardless of which LLM provider your agents call.
Can I use Omium without LangGraph or CrewAI?
Absolutely. Use the @omium.trace and @omium.checkpoint decorators on any Python function. The REST API also works from non-Python services.
@omium.trace("my_step")
def my_custom_step(data):
    return process(data)
Related: Python SDK →, REST API →
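Under the hood, a decorator of this shape records each call's inputs, output or error, and duration. The following is an illustrative pure-Python approximation — not Omium's actual implementation, and the in-memory TRACES list stands in for what the real SDK would ship to a backend:

```python
# Illustrative sketch of a @trace("name") decorator (not Omium's code):
# wrap a function, record inputs/outputs/errors and timing, then append
# the record to an in-memory list standing in for the SDK's backend.
import functools
import time

TRACES = []  # stand-in for the SDK's trace sink

def trace(step_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {"step": step_name, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                record["output"] = fn(*args, **kwargs)
                record["status"] = "success"
                return record["output"]
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise  # tracing must never swallow the original failure
            finally:
                record["duration_s"] = time.perf_counter() - start
                TRACES.append(record)
        return wrapper
    return decorator
```

Note the finally block: the record is captured whether the step succeeds or raises, which is what makes failed steps inspectable after the fact.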
How much code do I need to change to instrument a LangGraph or CrewAI agent?
Two lines of code. No refactoring needed.
import omium
omium.init()
omium.instrument_langgraph()  # or instrument_crewai()
Your existing agent code runs unchanged. Omium wraps framework internals to capture traces and checkpoints automatically.
Related: Installation →, Quickstart →
Is my data secure?
Yes. All data is encrypted in transit (TLS) and at rest. Omium does not store your LLM prompts or responses unless you explicitly enable full-content tracing. API keys are scoped per project and can be rotated at any time.
Related: API keys & billing →
Can I self-host Omium?
Enterprise plans include self-hosted deployment options. Contact us to discuss your requirements.

Pricing and billing

What plans does Omium offer?
Omium has four tiers: Free (500 runs/mo), Developer ($49/mo, 2,500 runs), Pro ($299/mo, 25,000 runs), and Enterprise (custom). All tiers include core tracing and checkpoints. Higher tiers unlock failure analytics, fix suggestions, and priority support.
See full pricing →
What counts as one execution?
One execution = one top-level agent run (e.g., one app.invoke() in LangGraph or one crew.kickoff() in CrewAI). Steps within that run (tool calls, LLM calls, checkpoints) are included and don’t count separately.
Can I change plans at any time?
Yes. Changes take effect immediately. When upgrading, you’re charged the prorated difference. When downgrading, the new rate applies at the next billing cycle.
Related: API keys & billing →

Still have questions? Join our Discord community or email us at founders@omium.ai.