Observability

ark-operator emits OpenTelemetry traces and metrics from both the operator and every agent pod. Because it uses the standard OTLP exporter, you can route telemetry to any compatible backend — Jaeger, Tempo, Prometheus, Datadog, Grafana, Honeycomb — without changing any operator code.


What is instrumented

Traces (per agent task)

Span               Emitted by             Key attributes
arkonis.task       Agent runtime          agent.name, task.id, task.prompt_len
arkonis.llm.call   LLM provider           llm.model, llm.provider, llm.input_tokens, llm.output_tokens
arkonis.tool.call  Runner (per tool use)  tool.name

Trace context is propagated across the Redis queue boundary via task metadata, so a single trace spans the full operator → queue → agent pod path.

Kubernetes Events (audit log)

Every agent action emits a structured Kubernetes Event:

Reason         Emitted when
TaskStarted    Agent pod picks up a task
TaskCompleted  Task finishes successfully
TaskFailed     Task fails or exceeds its timeout
TaskDelegated  Agent delegates a sub-task to a team role

Query the audit trail with kubectl:

kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent
kubectl get events -n ai-team --field-selector reason=TaskDelegated

Connecting to a backend

Jaeger

Deploy Jaeger all-in-one:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686   # UI
            - containerPort: 4318    # OTLP HTTP
            - containerPort: 4317    # OTLP gRPC
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-http
      port: 4318

Point the operator at Jaeger via Helm:

helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4318

Open the UI:

kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686

Select service ark-agent, click Find Traces, then click any trace to see the full waterfall: queue wait, LLM call duration, and token counts.

Prometheus

Metrics are exported via OTLP, so Prometheus cannot scrape the operator directly. Instead, deploy an OpenTelemetry Collector that receives OTLP and re-exposes the metrics through its prometheus exporter, then point the operator at the collector:

helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4318
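A minimal collector configuration for that pipeline might look like the following sketch; the scrape port 8889 and the pipeline layout are assumptions, not defaults shipped with ark-operator:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # matches the otel.endpoint set above
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # Prometheus scrapes this port
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```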

ark run --trace (local, no backend)

When running flows locally, --trace collects spans in-memory and prints a tree after the flow completes — no Jaeger required:

ark run quickstart.yaml --trace
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
  ├─ arkonis.llm.call [4.2s]  research  in=1,204 out=312
  └─ arkonis.llm.call [2.1s]  summarize  in=312 out=88

ark trace (remote lookup)

Look up a specific task by ID in a running Jaeger or Tempo instance:

# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100

# Use env var instead of flag
export JAEGER_ENDPOINT=http://localhost:16686
ark trace <task-id>
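For Jaeger, a lookup like this can be served by Jaeger's HTTP query API: GET /api/traces accepts a service name plus a JSON-encoded tags filter, which matches the task.id span attribute listed earlier. A sketch of building that query URL (the function name is illustrative; the endpoint and parameters are Jaeger's real query API):

```python
import json
from urllib.parse import urlencode

def jaeger_trace_url(endpoint: str, task_id: str, service: str = "ark-agent") -> str:
    """Build a Jaeger query URL filtering spans by the task.id tag."""
    params = urlencode({
        "service": service,
        "tags": json.dumps({"task.id": task_id}),  # Jaeger expects JSON-encoded tags
    })
    return f"{endpoint.rstrip('/')}/api/traces?{params}"
```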

Find task IDs from the ArkTeam status:

kubectl get arkteam <team-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'

Environment variables

Variable                     Description
OTEL_EXPORTER_OTLP_ENDPOINT  OTel collector endpoint (e.g. http://jaeger:4318). Tracing and metrics are enabled when set.
OTEL_SERVICE_NAME            Service name reported in traces. Defaults to ark-agent for agent pods and ark-operator for the operator.

Apache 2.0 · ARKONIS