Observability Runbook

Trace failed, stuck, expensive, or noisy Mogplex work from source surface to run events, tool calls, sandboxes, and API state.

Use this runbook after something has tried to run.

If no work was emitted at all, start with the owning route surface instead: GitHub, Triggers, Assignments, Automations, Slack, CLI, or the API Quickstart.

First Question

Ask one question before opening any page:

Did Mogplex create a run, call, or sandbox record?

Answer	Start here
No	Source surface: GitHub coverage, Slack channel link, Trigger, Assignment, Automation, CLI, or API request.
Yes	Observability, then the linked run, call, or sandbox row.
Unsure	Check Observability by time window, then search by repo or source surface.

Observability is strongest after work exists. It is not the best proof that a GitHub webhook, Slack mention, or API request was routed correctly before a run was created.

Gather The Minimum Facts

Before changing configuration, capture:

source surface: CLI, API, Slack, GitHub mention, Trigger, Assignment, or Automation
repo owner/name and Mogplex repo ID
run ID or call ID if one exists
sandbox ID if a preview/runtime exists
model and provider
first error message
whether the status is pending, streaming, success, failed, or cancelled
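One way to keep these facts together is a small record that can be pasted into a ticket. The sketch below is illustrative only: the class name, field names, and validation rules are assumptions, not a Mogplex schema. Only the surface and status values come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values taken from the list above.
SURFACES = {"CLI", "API", "Slack", "GitHub mention", "Trigger", "Assignment", "Automation"}
STATUSES = {"pending", "streaming", "success", "failed", "cancelled"}


@dataclass
class IncidentFacts:
    """Hypothetical container for the minimum facts; not a Mogplex type."""
    source_surface: str
    repo: str                       # "owner/name"
    repo_id: str                    # Mogplex repo ID
    model: str
    provider: str
    status: str
    first_error: Optional[str] = None
    run_id: Optional[str] = None     # only if a run exists
    call_id: Optional[str] = None    # only if a call exists
    sandbox_id: Optional[str] = None # only if a preview/runtime exists

    def problems(self) -> List[str]:
        """Return a list of facts that look missing or malformed."""
        issues = []
        if self.source_surface not in SURFACES:
            issues.append(f"unknown source surface: {self.source_surface!r}")
        if self.status not in STATUSES:
            issues.append(f"unknown status: {self.status!r}")
        if self.repo.count("/") != 1:
            issues.append("repo should be written as owner/name")
        return issues
```

Calling `problems()` before changing configuration is a cheap check that the record names a known surface and status.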

Those facts usually identify the owning layer.

Read The Summary Cards First

Open Observability and read the cards before opening row details.

Use them to decide whether the problem is:

broad pressure, such as stale pending work or start failures
isolated runtime failure
high token or cost usage
sandbox-related
local CLI activity rather than hosted automation

Then open the exact Activity row.

Expand The Activity Row

The expanded row is the source of truth for runtime debugging.

Check in this order:

surface badge
repo and source metadata
model and call type
error string
event timeline
tool calls
sandbox metadata and preview URL
raw metadata only if the higher-level fields are not enough

Do not edit the agent prompt until you know whether the failure came from the model, a tool, connection state, sandbox state, or routing setup.

No Run Appears

If no row appears in Observability:

GitHub event: check Installations and the GitHub Routing Cookbook.
Slack mention: check Slack channel links, user mapping, repo-agent enabled state, and monthly/user limits.
Trigger or Assignment: check enabled state, repo scope, and selected agent.
Automation: check published version, active state, start event, and entry agent role.
API request: check token scope, idempotency key, response code, and repo ID.
CLI run: check CLI auth, local config, and whether the failure is local-only.

The fastest fix is usually on the source surface, not in Observability.

Pending Or Stuck Work

When a run exists but stays pending:

Check summary cards for stale pending work or start failures.
Open the run row and inspect the first event.
Check sandbox allocation state if the run needs a sandbox.
Check managed model access if the run never reaches a model call.
Check whether the source surface is repeatedly re-enqueuing the same work.

If the run came from the public API, use GET /api/v1/mogplex/runs/{runId} and GET /api/v1/mogplex/runs/{runId}/events to compare API state with the UI.
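The comparison can be scripted. The sketch below builds requests for the two endpoints named above and compares a status value with what the UI shows. The host, the Bearer authorization header, and the `status` response field are assumptions made for illustration; check the API reference for the real values.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://mogplex.example"  # placeholder host (assumption)


def run_requests(run_id: str, token: str, base_url: str = BASE_URL):
    """Build GET requests for the run state and the run events endpoints."""
    safe_id = urllib.parse.quote(run_id, safe="")
    headers = {
        "Authorization": f"Bearer {token}",  # auth scheme is an assumption
        "Accept": "application/json",
    }
    paths = (
        f"/api/v1/mogplex/runs/{safe_id}",
        f"/api/v1/mogplex/runs/{safe_id}/events",
    )
    return [
        urllib.request.Request(base_url + path, headers=headers, method="GET")
        for path in paths
    ]


def fetch_json(request: urllib.request.Request):
    """Send one request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def status_mismatch(api_run: dict, ui_status: str) -> bool:
    """True when the API and the UI disagree; 'status' is an assumed field name."""
    return api_run.get("status") != ui_status
```

A mismatch is worth recording in the escalation notes, because it shows whether the UI or the API holds the stale view of the run.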

Model Or Access Failure

For model access errors:

Confirm the model exists in Available Models.
Confirm the user or team plan includes hosted model usage.
Check whether the agent, repo, or route excludes the model.
Compare token and cost fields to see whether the call reached execution.

Use Model Selection and Cost when the model is available but the default, route, or cost policy is unclear.

Tool Or Connection Failure

For external tool failures:

Open Settings and test the connection.
Confirm the connection is enabled.
Confirm the repo did not exclude a global connection.
Confirm project-scoped connections are attached to the repo that ran.
Open the Activity row and inspect the tool-call payload and error.

If the failure involves MCP sync into the CLI, check the External MCP server catalog API and local CLI auth separately.

Sandbox Or Preview Failure

For sandbox-backed failures:

Open the linked sandbox from the Activity row when available.
Check root directory, install command, dev command, dev port, and env source.
Check managed sandbox access and optional Vercel project metadata.
Check branch and preview URL.
Stop, restart, or delete only after capturing the first failure event.

Use Sandbox Setup Checklist for launch configuration, and Projects and Sandboxes for the platform model.

Cost Or Token Spike

For unexpectedly high usage:

Filter Activity by surface.
Sort or scan for high-token calls.
Compare model, run source, tool calls, and sandbox linkage.
Check whether an automation loop, Slack route, or API retry generated extra starts.
Move the policy decision back to the source: model choice, routing volume, repo exclusion, or monthly limit.
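If Activity rows are available outside the UI, the first four steps can be approximated with a short script. The row shape below (`surface`, `source_key`, `total_tokens`) is hypothetical and stands in for whatever fields the rows actually carry.

```python
from collections import Counter, defaultdict


def tokens_by_surface(rows):
    """Sum tokens per surface and return (surface, total) pairs, highest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["surface"]] += row.get("total_tokens", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def repeated_starts(rows, threshold=3):
    """Return (surface, source_key) pairs that started at least `threshold` runs.

    A high count for one key hints at an automation loop, a noisy Slack
    route, or an API client retrying the same request.
    """
    counts = Counter((row["surface"], row.get("source_key")) for row in rows)
    return {key: count for key, count in counts.items() if count >= threshold}
```

The threshold of three is arbitrary; pick a value that matches how often the same source is legitimately expected to start work.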

Use Model Selection and Cost when the fix is policy, not a one-off failed run.

Escalation Packet

When handing the issue to another person, include:

run ID or call ID
source surface and exact trigger text or API response
repo owner/name and repo ID
first event and first error
model ID
sandbox ID or preview URL if applicable
what changed immediately before the failure

Leave out raw credentials, mog_... tokens, Slack link URLs, decrypted MCP config, and full .env files. See Security and Data Handling for the safe-sharing checklist.
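A redaction pass before pasting logs can catch the most common leaks. The sketch below is a minimal example, assuming that tokens are the `mog_` prefix followed by letters, digits, underscores, or hyphens (the real token format may differ) and that `.env` content appears as uppercase `NAME=value` lines. It does not handle Slack link URLs or MCP config, which still need a manual check.

```python
import re

# Assumed token shape: "mog_" followed by URL-safe characters.
TOKEN_PATTERN = re.compile(r"mog_[A-Za-z0-9_\-]+")
# Assumed .env shape: an uppercase name, "=", then the value to end of line.
ENV_LINE_PATTERN = re.compile(r"^([A-Z][A-Z0-9_]*)=.*$", re.MULTILINE)


def redact(text: str) -> str:
    """Mask mog_ tokens and the values of .env-style assignment lines."""
    text = TOKEN_PATTERN.sub("mog_[REDACTED]", text)
    text = ENV_LINE_PATTERN.sub(r"\1=[REDACTED]", text)
    return text
```

The env-line rule deliberately over-redacts: any uppercase `NAME=value` line loses its value, which is safer than trying to guess which variables are secret.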


Read Next

Observability

Troubleshooting

Slack

Security and Data Handling
