# Observability
ark-operator emits OpenTelemetry traces and metrics from both the operator and every agent pod. Because it uses the standard OTLP exporter, you can route telemetry to any compatible backend — Jaeger, Tempo, Prometheus, Datadog, Grafana, Honeycomb — without changing any operator code.
## What is instrumented

### Traces (per agent task)
| Span | Emitted by | Key attributes |
|---|---|---|
| `arkonis.task` | Agent runtime | `agent.name`, `task.id`, `task.prompt_len` |
| `arkonis.llm.call` | LLM provider | `llm.model`, `llm.provider`, `llm.input_tokens`, `llm.output_tokens` |
| `arkonis.tool.call` | Runner (per tool use) | `tool.name` |
Trace context is propagated across the Redis queue boundary via task metadata, so a single trace spans the full operator → queue → agent pod path.
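In practice the OpenTelemetry SDK's propagators handle this injection and extraction; as a rough illustration of the idea only (the function names and metadata shape below are hypothetical, not ark-operator's actual code), a W3C `traceparent` header can ride along inside the serialized task payload:

```python
import json
import secrets

def inject_trace_context(metadata: dict, trace_id: str, span_id: str) -> dict:
    """Attach a W3C traceparent header to task metadata before enqueueing."""
    metadata["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return metadata

def extract_trace_context(metadata: dict):
    """Recover (trace_id, parent_span_id) on the agent-pod side of the queue."""
    _version, trace_id, span_id, _flags = metadata["traceparent"].split("-")
    return trace_id, span_id

# Operator side: enqueue a task with trace context attached.
trace_id = secrets.token_hex(16)   # 128-bit trace id
span_id = secrets.token_hex(8)     # 64-bit span id
task = {"id": "task-42", "prompt": "summarize", "metadata": {}}
inject_trace_context(task["metadata"], trace_id, span_id)
payload = json.dumps(task)         # what would be pushed onto the Redis queue

# Agent side: dequeue and resume the same trace.
received = json.loads(payload)
resumed_trace, parent_span = extract_trace_context(received["metadata"])
assert resumed_trace == trace_id   # both sides now share one trace
```

Because the context survives the JSON round trip, the span started in the agent pod can name the operator's span as its parent, which is what stitches the operator → queue → agent path into a single trace.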
### Kubernetes Events (audit log)
Every agent action emits a structured Kubernetes Event:
| Reason | Emitted when |
|---|---|
| `TaskStarted` | Agent pod picks up a task |
| `TaskCompleted` | Task finishes successfully |
| `TaskFailed` | Task fails or exceeds its timeout |
| `TaskDelegated` | Agent delegates a sub-task to a team role |
```shell
# All events for ArkAgent resources
kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent

# Only delegation events
kubectl get events -n ai-team --field-selector reason=TaskDelegated
```
## Connecting to a backend

### Jaeger
Deploy Jaeger all-in-one:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 4317  # OTLP gRPC
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-http
      port: 4318
```
Point the operator at Jaeger via Helm:
```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4318
```
Open the UI:
```shell
kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686
```
Select the `ark-agent` service, click **Find Traces**, then open any trace to see the full waterfall: queue wait, LLM call duration, and token counts.
### Prometheus

Metrics are exported via OTLP. To scrape them with Prometheus, deploy an OpenTelemetry Collector with a `prometheusexporter` pipeline, then point the operator at the collector:
```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4318
```
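A minimal Collector configuration for that pipeline might look like the following sketch (the listen ports are illustrative defaults; Prometheus then scrapes the exporter port):

```yaml
# OTel Collector: receive OTLP from ark pods, expose /metrics for Prometheus
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889  # Prometheus scrapes this port
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```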
### `ark run --trace` (local, no backend)
When running flows locally, --trace collects spans in-memory and prints a tree after the flow completes — no Jaeger required:
```shell
$ ark run quickstart.yaml --trace
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
├─ arkonis.llm.call [4.2s] research  in=1,204 out=312
└─ arkonis.llm.call [2.1s] summarize in=312 out=88
```
### `ark trace` (remote lookup)
Look up a specific task by ID in a running Jaeger or Tempo instance:
```shell
# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100

# Use an env var instead of the flag
export JAEGER_ENDPOINT=http://localhost:16686
ark trace <task-id>
```
Find task IDs from the ArkTeam status:
```shell
kubectl get arkteam <team-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'
```
## Environment variables
| Variable | Description |
|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTel collector endpoint (e.g. `http://jaeger:4318`). Enables tracing and metrics when set. |
| `OTEL_SERVICE_NAME` | Service name reported in traces. Defaults to `ark-agent` for agent pods and `ark-operator` for the operator. |
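If you manage pod specs directly rather than through the Helm chart, these are ordinary container environment variables; a sketch (the endpoint value is illustrative):

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://jaeger.monitoring.svc.cluster.local:4318
  - name: OTEL_SERVICE_NAME
    value: ark-agent
```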
## See also

- CLI: `ark trace` — look up traces from the terminal
- Environment Variables reference — all OTel-related variables