Observability

ark-operator emits OpenTelemetry traces and metrics from both the operator and every agent pod. Because it uses the standard OTLP exporter, you can route telemetry to any compatible backend — Jaeger, Tempo, Prometheus, Datadog, Grafana, Honeycomb — without changing any operator code.
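Export is driven by the standard OTLP environment variables that every OpenTelemetry SDK honors, so switching backends is configuration-only. A sketch of the relevant variables (the endpoint value is illustrative):

```
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.monitoring.svc.cluster.local:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=ark-agent
```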


What is instrumented

Traces (per agent task)

Span                 Emitted by              Key attributes
arkonis.task         Agent runtime           agent.name, task.id, task.prompt_len
arkonis.llm.call     LLM provider            llm.model, llm.provider, llm.input_tokens, llm.output_tokens
arkonis.tool.call    Runner (per tool use)   tool.name

Trace context is propagated across the Redis queue boundary via task metadata, so a single trace spans the full operator → queue → agent pod path.
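Conceptually, propagation works by serializing the active span context into the task payload before it is pushed to the queue and restoring it on the consumer side. A minimal hand-rolled sketch of the idea using the W3C traceparent format (the real operator would use an OpenTelemetry propagator; the metadata field name here is an assumption):

```python
# Sketch of trace-context propagation across a queue boundary.
# "metadata" / "traceparent" field names are illustrative.

def inject_trace_context(task: dict, trace_id: str, span_id: str) -> dict:
    """Operator side: attach a W3C traceparent before enqueueing the task."""
    task.setdefault("metadata", {})["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return task

def extract_trace_context(task: dict):
    """Agent-pod side: recover (trace_id, parent_span_id) after dequeue."""
    header = task.get("metadata", {}).get("traceparent")
    if header is None:
        return None
    version, trace_id, span_id, flags = header.split("-")
    return trace_id, span_id

# Producer injects before XADD to the Redis stream ...
task = inject_trace_context({"prompt": "summarize"}, "a" * 32, "b" * 16)
# ... consumer extracts and continues the same trace.
assert extract_trace_context(task) == ("a" * 32, "b" * 16)
```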

Kubernetes Events (audit log)

Every agent action emits a structured Kubernetes Event:

Reason          Emitted when
TaskStarted     Agent pod picks up a task
TaskCompleted   Task finishes successfully
TaskFailed      Task fails or exceeds timeout
TaskDelegated   Agent delegates a sub-task to a team role

List recent agent events with:

kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent
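Because the events are structured, they are easy to aggregate programmatically, for example to tally outcomes per reason from kubectl get events -o json output. A small sketch (the sample payload below is illustrative; the item shape follows the core/v1 Events API):

```python
import json
from collections import Counter

def count_event_reasons(events_json: str) -> Counter:
    """Tally Kubernetes Event reasons from `kubectl get events -o json` output."""
    items = json.loads(events_json)["items"]
    return Counter(event["reason"] for event in items)

# Illustrative payload standing in for real kubectl output.
sample = json.dumps({"items": [
    {"reason": "TaskStarted"},
    {"reason": "TaskCompleted"},
    {"reason": "TaskStarted"},
    {"reason": "TaskFailed"},
]})
counts = count_event_reasons(sample)
assert counts["TaskStarted"] == 2
assert counts["TaskFailed"] == 1
```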

Connect to Jaeger

Deploy Jaeger all-in-one

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4318 # OTLP HTTP
            - containerPort: 4317 # OTLP gRPC
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-http
      port: 4318

Point the operator at Jaeger

Set otel.endpoint in your Helm values or pass it as an environment variable in the secret injected into agent pods:

# Via Helm
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4318

# Or add to your existing API keys secret
kubectl patch secret ark-api-keys -n ai-team \
  -p '{"stringData":{"OTEL_EXPORTER_OTLP_ENDPOINT":"http://jaeger.monitoring.svc.cluster.local:4318"}}'

Open the Jaeger UI

kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686

Select service ark-agent, click Find Traces, and click any trace to see the full waterfall including queue wait, LLM call duration, and token counts.


Query traces from the terminal

# All recent traces with token counts and durations
curl -s "http://localhost:16686/api/traces?service=ark-agent&limit=20" | python3 -c "
import sys, json, re
data = json.load(sys.stdin)
print(f\"{'AGENT':<20} {'OPERATION':<22} {'DURATION':>10}  {'IN':>6}  {'OUT':>6}  MODEL\")
print('-' * 80)
for t in data['data']:
    procs = t.get('processes', {})
    for s in t['spans']:
        tags  = {tag['key']: tag['value'] for tag in s.get('tags', [])}
        ptags = {tag['key']: tag['value'] for tag in procs.get(s['processID'], {}).get('tags', [])}
        host  = ptags.get('host.name', '')
        m     = re.search(r'(\w[\w-]+)-agent-[a-z0-9]+-[a-z0-9]+', host)
        agent = m.group(1) if m else 'ark-agent'
        dur   = s['duration'] / 1000
        print(f\"{agent:<20} {s['operationName']:<22} {dur:>8.0f}ms  {str(tags.get('llm.input_tokens','-')):>6}  {str(tags.get('llm.output_tokens','-')):>6}  {tags.get('llm.model','-')}\")
"

ark run --trace (local, no backend)

When running flows locally, --trace collects spans in-memory and prints a tree after the flow completes — no Jaeger required:

ark run quickstart.yaml --trace
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
  ├─ arkonis.llm.call [4.2s]  research  in=1,204 out=312
  └─ arkonis.llm.call [2.1s]  summarize  in=312 out=88

ark trace (remote lookup)

Look up a specific task by ID in a running Jaeger or Tempo instance:

# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100

# Use env var instead of flag
export JAEGER_ENDPOINT=http://localhost:16686
ark trace <task-id>

The task ID is the Redis stream message ID stored in .status.steps[].taskID on the ArkFlow object:

kubectl get arkflow <flow-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'

Connect to Prometheus

Metrics are exported via OTLP. To scrape them with Prometheus, deploy an OpenTelemetry Collector with a Prometheus exporter in its metrics pipeline, or use an OTLP-to-Prometheus bridge.
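A minimal collector configuration for this setup might look like the following sketch (the listen ports and scrape endpoint are illustrative, not ark-operator defaults):

```
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

Prometheus then scrapes the collector's :8889 endpoint like any other target.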

Point the operator at your collector:

helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4318

Apache 2.0 · ARKONIS