# Observability
ark-operator emits OpenTelemetry traces and metrics from both the operator and every agent pod. Because it uses the standard OTLP exporter, you can route telemetry to any compatible backend — Jaeger, Tempo, Prometheus, Datadog, Grafana, Honeycomb — without changing any operator code.
## What is instrumented
### Traces (per agent task)
| Span | Emitted by | Key attributes |
|---|---|---|
| arkonis.task | Agent runtime | agent.name, task.id, task.prompt_len |
| arkonis.llm.call | LLM provider | llm.model, llm.provider, llm.input_tokens, llm.output_tokens |
| arkonis.tool.call | Runner (per tool use) | tool.name |
Trace context is propagated across the Redis queue boundary via task metadata, so a single trace spans the full operator → queue → agent pod path.
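Conceptually, this works like carrying a W3C `traceparent` header inside the task payload itself. The sketch below illustrates the idea only; the `prompt` and `otel` field names are invented for illustration and are not ark-operator's actual task schema:

```python
import json

# Minimal sketch of trace-context propagation across a queue boundary.
# The producer injects the current trace context into the task metadata;
# the consumer extracts it and continues the same trace.
def enqueue(prompt: str, traceparent: str) -> str:
    task = {"prompt": prompt, "otel": {"traceparent": traceparent}}
    return json.dumps(task)  # payload that would be XADD'ed to the Redis stream

def dequeue(raw: str):
    task = json.loads(raw)
    # The agent pod starts its spans as children of the propagated context,
    # so operator, queue wait, and agent work all share one trace.
    return task["prompt"], task["otel"]["traceparent"]

msg = enqueue("summarize the report",
              "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
prompt, ctx = dequeue(msg)
```

In practice the injection and extraction are done with an OpenTelemetry propagator rather than by hand, but the metadata round-trip is the same.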
### Kubernetes Events (audit log)
Every agent action emits a structured Kubernetes Event:
| Reason | Emitted when |
|---|---|
| TaskStarted | Agent pod picks up a task |
| TaskCompleted | Task finishes successfully |
| TaskFailed | Task fails or exceeds timeout |
| TaskDelegated | Agent delegates a sub-task to a team role |
```shell
kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent
```
## Connect to Jaeger
### Deploy Jaeger all-in-one
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 4317  # OTLP gRPC
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-http
      port: 4318
```
### Point the operator at Jaeger
Set `otel.endpoint` in your Helm values, or pass it as an environment variable in the secret injected into agent pods:
```shell
# Via Helm
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4318

# Or add to your existing API keys secret
kubectl patch secret ark-api-keys -n ai-team \
  -p '{"stringData":{"OTEL_EXPORTER_OTLP_ENDPOINT":"http://jaeger.monitoring.svc.cluster.local:4318"}}'
```
### Open the Jaeger UI
```shell
kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686
```
Select the `ark-agent` service, click Find Traces, and open any trace to see the full waterfall, including queue wait, LLM call duration, and token counts.
### Query traces from the terminal
```shell
# All recent traces with token counts and durations
curl -s "http://localhost:16686/api/traces?service=ark-agent&limit=20" | python3 -c "
import sys, json, re

data = json.load(sys.stdin)
print(f\"{'AGENT':<20} {'OPERATION':<22} {'DURATION':>10} {'IN':>6} {'OUT':>6} MODEL\")
print('-' * 80)
for t in data['data']:
    procs = t.get('processes', {})
    for s in t['spans']:
        tags = {tag['key']: tag['value'] for tag in s.get('tags', [])}
        ptags = {tag['key']: tag['value'] for tag in procs.get(s['processID'], {}).get('tags', [])}
        host = ptags.get('host.name', '')
        m = re.search(r'(\w[\w-]+)-agent-[a-z0-9]+-[a-z0-9]+', host)
        agent = m.group(1) if m else 'ark-agent'
        dur = s['duration'] / 1000  # Jaeger reports span durations in microseconds
        print(f\"{agent:<20} {s['operationName']:<22} {dur:>8.0f}ms {str(tags.get('llm.input_tokens','-')):>6} {str(tags.get('llm.output_tokens','-')):>6} {tags.get('llm.model','-')}\")
"
```
## `ark run --trace` (local, no backend)
When running flows locally, `--trace` collects spans in memory and prints a tree after the flow completes, so no Jaeger is required:
```shell
ark run quickstart.yaml --trace
```

```text
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
├─ arkonis.llm.call [4.2s] research in=1,204 out=312
└─ arkonis.llm.call [2.1s] summarize in=312 out=88
```
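A tree like this can be rendered from a flat span list by following parent/child links. The sketch below is purely illustrative of that rendering step; the span dictionaries and field names are invented for the example and are not ark's internal types:

```python
# Flat spans roughly as an in-memory collector might hold them
# (id/parent/name/dur_s are illustrative field names).
spans = [
    {"id": "a", "parent": None, "name": "arkonis.task",     "dur_s": 9.1},
    {"id": "b", "parent": "a",  "name": "arkonis.llm.call", "dur_s": 4.2},
    {"id": "c", "parent": "a",  "name": "arkonis.llm.call", "dur_s": 2.1},
]

def render(spans, parent=None, depth=0):
    """Walk parent/child links depth-first and emit one tree line per span."""
    lines = []
    children = [s for s in spans if s["parent"] == parent]
    for i, s in enumerate(children):
        connector = "" if depth == 0 else ("└─ " if i == len(children) - 1 else "├─ ")
        lines.append("   " * max(depth - 1, 0) + connector + f"{s['name']} [{s['dur_s']}s]")
        lines += render(spans, s["id"], depth + 1)
    return lines

print("\n".join(render(spans)))
```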
## `ark trace` (remote lookup)
Look up a specific task by ID in a running Jaeger or Tempo instance:
```shell
# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100

# Use an environment variable instead of the flag
export JAEGER_ENDPOINT=http://localhost:16686
ark trace <task-id>
```
The task ID is the Redis stream message ID, stored in `.status.steps[].taskID` on the `ArkFlow` object:
```shell
kubectl get arkflow <flow-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'
```
## Connect to Prometheus
Metrics are exported via OTLP. To scrape them with Prometheus, deploy an OpenTelemetry Collector with a Prometheus exporter in its metrics pipeline, or use the OTLP-to-Prometheus bridge.
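As a starting point, a minimal Collector configuration that receives OTLP over HTTP and exposes a Prometheus scrape endpoint might look like this (the ports shown are the conventional defaults; adjust them to your deployment):

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

Prometheus then scrapes the Collector on port 8889 like any other target.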
Point the operator at your collector:
```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4318
```