How It Works

A mental model for understanding ark-operator before diving into the details.


The core idea

Kubernetes manages Deployments of containers. ark-operator manages ArkAgents: declarative deployments of LLM-based agents.

The analogy is direct:

Kubernetes            ark-operator
----------            ------------
Container image       Model + system prompt + MCP tools
Deployment            ArkAgent
Service               ArkService
ConfigMap             ArkSettings
CronJob / Ingress     ArkEvent

When you define an ArkAgent, you declare the desired state: which model to use, what to tell it, what tools to give it, and how many replicas to run. The operator’s job is to make reality match that declaration — exactly like a Deployment controller reconciles pod replicas.
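As a sketch, an ArkAgent manifest might look like the following. The field names here are assumptions inferred from the environment variables the operator injects (MODEL, SYSTEM_PROMPT, MCP_SERVERS); consult the API reference for the actual schema.

```yaml
# Illustrative only: apiVersion and spec field names are assumptions.
apiVersion: ark.example.io/v1alpha1
kind: ArkAgent
metadata:
  name: researcher
spec:
  replicas: 2
  model: gpt-4o
  systemPrompt: |
    You are a research assistant. Cite your sources.
  mcpServers:
    - name: web-search
      url: http://mcp-web-search:8080
```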


The operator reconcile loop

The operator watches your ArkAgent resources. When something changes — or on a periodic resync — it runs a reconcile:

Watch ArkAgent CR
    │
    ▼
Reconcile()
    ├── How many agent pods are currently running?
    ├── Compare with spec.replicas
    ├── Scale up → create new agent pods
    ├── Scale down → delete excess pods
    ├── Inject env vars: MODEL, SYSTEM_PROMPT, MCP_SERVERS
    ├── Run semantic liveness checks
    └── Update .status.readyReplicas

The operator never modifies your YAML. It only manages the backing Kubernetes resources (Deployments, Pods) that bring the desired state to life.
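The replica-scaling part of the reconcile above can be sketched as a pure function: compare observed pods with `spec.replicas` and compute the action that converges the two. This is a hypothetical sketch, not ark-operator's actual code.

```python
# Hypothetical sketch of the reconcile decision: given the pods observed in
# the cluster and the desired replica count from the spec, decide what to do.

def reconcile(current_pods: list[str], desired_replicas: int) -> dict:
    """Compare observed state with desired state and return a scale action."""
    diff = desired_replicas - len(current_pods)
    if diff > 0:
        # Fewer pods than desired: create the difference.
        return {"action": "scale_up", "create": diff}
    if diff < 0:
        # More pods than desired: delete the excess (which pods to pick
        # is an implementation detail).
        return {"action": "scale_down", "delete": current_pods[diff:]}
    return {"action": "none"}
```

Because the function is driven only by observed and desired state, re-running it after a partial failure is safe: the next reconcile sees the new reality and computes a smaller diff.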


What runs inside an agent pod

Each agent pod runs the ark-runtime binary. Its job is narrow: poll a task queue, call the LLM, and return results.

agent pod startup
    │
    ├── Read config from env vars (MODEL, SYSTEM_PROMPT, MCP_SERVERS, ...)
    ├── Connect to MCP tool servers
    └── Poll task queue (Redis Streams)
            │
            ▼
        Task arrives
            │
            ├── Build prompt from system prompt + task input
            ├── Call LLM provider (tool-use loop until model stops calling tools)
            └── Return result to queue

The operator injects all configuration as environment variables. The agent runtime has no knowledge of Kubernetes — it just reads env vars and processes tasks. This means you can run the same runtime code locally with ark run.
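Because configuration arrives purely through the environment, the runtime's startup can be modeled as a function from env vars to a config object. The variable names below follow the ones mentioned above (MODEL, SYSTEM_PROMPT, MCP_SERVERS); the parsing details (e.g. a comma-separated server list) are assumptions.

```python
# Sketch of Kubernetes-agnostic config loading: the runtime reads a plain
# mapping of env vars, so the same code works under the operator or locally.

def load_config(env: dict[str, str]) -> dict:
    """Build the runtime configuration purely from environment variables."""
    return {
        "model": env["MODEL"],
        "system_prompt": env["SYSTEM_PROMPT"],
        # Assumed encoding: a comma-separated list of MCP server URLs.
        "mcp_servers": [s for s in env.get("MCP_SERVERS", "").split(",") if s],
    }
```

In a pod you would pass `dict(os.environ)`; in a local test, any plain dict. That substitutability is what lets the same runtime code run under `ark run`.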


How tasks flow

External trigger (webhook / cron / ark trigger)
    │
    ▼
ArkEvent → dispatches ArkTeam run
    │
    ▼
ArkTeam controller → submits step tasks to Redis Streams
    │
    ▼
Agent pods poll their queue → call LLM → write result back
    │
    ▼
ArkTeam controller reads results → advances DAG → submits next steps
    │
    ▼
All steps done → ArkTeam phase = Succeeded → output available

The Redis Streams queue is the boundary between the operator and the agent pods. Trace context is propagated across this boundary, so a single OpenTelemetry trace spans the full path from trigger to final output.
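Crossing that boundary without losing the trace means carrying the trace context inside the task payload itself. The sketch below uses the W3C `traceparent` field name; whether ark-operator uses exactly this encoding is an assumption, and a plain list stands in for the Redis Stream.

```python
# Illustrative sketch of trace-context propagation across the queue boundary.

def submit_task(stream: list, payload: dict, traceparent: str) -> None:
    """Producer side: attach the trace context to the task before enqueueing."""
    stream.append({**payload, "traceparent": traceparent})

def poll_task(stream: list) -> tuple[dict, str]:
    """Consumer side: pop a task and extract the trace context, so the
    agent's LLM-call spans join the trace that started at the trigger."""
    msg = dict(stream.pop(0))
    traceparent = msg.pop("traceparent")
    return msg, traceparent
```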


ArkTeam: the execution primitive

An ArkTeam is the unit of work. It defines roles (agents) and either:

  • A pipeline — a DAG of steps executed in declared order (like a CI workflow)
  • Dynamic delegation — agents decide at runtime what to delegate to whom (like an org chart)

In pipeline mode, template expressions connect step outputs to the next step’s inputs (the expression syntax shown here is illustrative):

pipeline:
  - role: research
    inputs:
      prompt: "{{ .task.input }}"
  - role: summarize
    dependsOn: [research]
    inputs:
      content: "{{ .steps.research.output }}"

The operator tracks each step’s phase (Pending → Running → Succeeded/Failed) and orchestrates the DAG. You never write scheduling logic.
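The DAG-advancement rule is simple to state: a step becomes ready when every step in its `dependsOn` list has succeeded. A hypothetical sketch of that check (not the operator's actual code):

```python
# Sketch of DAG advancement: find Pending steps whose dependencies have
# all reached Succeeded.

def ready_steps(deps: dict[str, list[str]], phases: dict[str, str]) -> list[str]:
    """deps maps role -> its dependsOn list; phases maps role -> phase.
    Roles absent from phases are treated as Pending."""
    return [
        role
        for role, needs in deps.items()
        if phases.get(role, "Pending") == "Pending"
        and all(phases.get(d) == "Succeeded" for d in needs)
    ]
```

Each time a result comes back from the queue, the controller updates that step's phase and re-runs this check to submit the newly ready steps.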


What the operator does not do

  • It does not make LLM API calls — agent pods do
  • It does not parse or validate LLM outputs (unless a step has outputSchema)
  • It does not route external traffic — use ArkService for that
  • It does not store agent memory — use ArkMemory for that

Apache 2.0 · ARKONIS