June 30, 2026 20 min to read

Kagent: Bringing AI Agents to Kubernetes!

An autonomous AI agent framework that runs inside your Kubernetes cluster to diagnose, plan, and execute solutions

Overview

When a failure occurs in your Kubernetes cluster, how do you typically respond?

Check Slack alert
Check pod status with kubectl
Review logs
Check Prometheus metrics
Copy/paste error message to ChatGPT
Try ChatGPT's suggestion
Another error occurs
Back to ChatGPT...
Repeat... 😫

Does this process sound familiar? We receive help from AI, but we constantly have to act as the bridge between AI and our infrastructure. AI cannot see our cluster, and we have to manually execute what AI suggests.

Kagent addresses this problem. An AI agent runs directly inside your Kubernetes cluster, diagnosing problems, planning solutions, and carrying out the fixes itself, with human approval where you require it. We no longer have to be the middleman.

This guide covers what Kagent is, how it works, and practical steps to install and use it.

flowchart LR
    subgraph Traditional["Traditional Approach"]
        direction TB
        T1[Alert] --> T2[Human checks kubectl]
        T2 --> T3[Human checks logs]
        T3 --> T4[Human asks ChatGPT]
        T4 --> T5[Human executes fix]
        T5 --> T6[Human verifies]
    end
    subgraph Kagent["Kagent Approach"]
        direction TB
        K1[Alert] --> K2[Agent diagnoses]
        K2 --> K3[Agent plans solution]
        K3 --> K4[Agent executes fix]
        K4 --> K5[Agent verifies]
        K5 --> K6[Human reviews report]
    end
    Traditional -.->|"30+ minutes"| Result1[Issue Resolved]
    Kagent -.->|"2-5 minutes"| Result2[Issue Resolved]
    style Traditional fill:#fee2e2,stroke:#dc2626
    style Kagent fill:#dcfce7,stroke:#16a34a

What is Kagent?

One-Line Summary

“An autonomous AI Agent framework that runs inside your Kubernetes cluster”

Basic Information

Attribute	Details
Developer	Solo.io
Release Date	March 17, 2025
License	Apache 2.0 (Open Source)
Status	CNCF Sandbox Project
Foundation	Microsoft AutoGen Framework
GitHub	https://github.com/kagent-dev/kagent

Why Was It Created?

Kagent grew out of Solo.io's work solving customer problems. In complex Kubernetes environments, troubleshooting, configuration management, and deployment automation required too much manual work, and Kagent was developed to automate these tasks with AI.

Kagent Core Architecture

Kagent consists of three layers:

flowchart TB
    subgraph Framework["Framework Layer"]
        direction LR
        CLI[CLI Interface]
        WebUI[Web UI Dashboard]
        YAML[Declarative YAML]
    end
    subgraph Agents["Agent Layer"]
        direction LR
        K8sAgent[K8s Expert Agent]
        HelmAgent[Helm Agent]
        PrometheusAgent[Prometheus Agent]
        CustomAgent[Custom Agents]
    end
    subgraph Tools["Tools Layer"]
        direction LR
        KubectlTools[Kubernetes Tools]
        HelmTools[Helm Tools]
        PrometheusTools[Prometheus Tools]
        ArgoTools[Argo Tools]
        IstioTools[Istio Tools]
    end
    subgraph LLM["LLM Providers"]
        direction LR
        OpenAI[OpenAI]
        Anthropic[Anthropic]
        Ollama[Ollama - Free]
    end
    Framework --> Agents
    Agents --> Tools
    Agents --> LLM
    style Ollama fill:#22c55e,stroke:#16a34a,color:#fff

Layer 1: Tools

Tools are MCP (Model Context Protocol) style functions that AI Agents can call to inspect and change the cluster.

Built-in Tools

flowchart LR
    subgraph K8sTools["Kubernetes Tools"]
        GetResources[GetResources]
        DescribeResource[DescribeResource]
        GetPodLogs[GetPodLogs]
        GetEvents[GetEvents]
        ApplyManifest[ApplyManifest]
    end
    subgraph HelmTools["Helm Tools"]
        GetRepositories[GetRepositories]
        GetCharts[GetCharts]
        InstallChart[InstallChart]
        UpgradeRelease[UpgradeRelease]
    end
    subgraph MonitoringTools["Monitoring Tools"]
        QueryMetrics[QueryMetrics]
        GetAlerts[GetAlerts]
    end
    subgraph GitOpsTools["GitOps Tools"]
        GetApplications[GetApplications]
        SyncApplication[SyncApplication]
        GetRollouts[GetRollouts]
    end

Tool Categories Summary

Category	Tools	Purpose
Kubernetes	GetResources, DescribeResource, GetPodLogs, GetEvents, ApplyManifest, CreateResource	Core K8s operations
Helm	GetRepositories, GetCharts, InstallChart, UpgradeRelease	Package management
Prometheus	QueryMetrics, GetAlerts	Monitoring & alerting
Argo	GetApplications, SyncApplication, GetRollouts	GitOps & deployments
Istio	GetVirtualServices, GetDestinationRules, GetGateways	Service mesh
Custom	HTTP API, Database, Slack, External Systems	Extensibility
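To make the idea of named, callable tools concrete, here is a minimal Python sketch of a tool registry. It is an illustration only, not Kagent's implementation: `Tool`, `ToolRegistry`, and `fake_get_pod_logs` are hypothetical names, and the log text is canned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """A named function plus the description an LLM would be shown."""
    name: str
    description: str
    func: Callable[..., str]


class ToolRegistry:
    """Maps tool names to callables so an agent can invoke them by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs: str) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].func(**kwargs)

    def names(self) -> List[str]:
        return sorted(self._tools)


def fake_get_pod_logs(namespace: str, pod: str) -> str:
    # Hypothetical stand-in for a real log lookup; returns canned text.
    return f"[{namespace}/{pod}] Error: Cannot connect to database"


registry = ToolRegistry()
registry.register(Tool("GetPodLogs", "Fetch logs for one pod", fake_get_pod_logs))
logs = registry.call("GetPodLogs", namespace="production", pod="payment-api")
```

The point of the registry is that the model only ever chooses a tool name and arguments; the code that touches the cluster stays in ordinary, reviewable functions.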

Layer 2: Agents

Agents are autonomous AI systems that operate independently. They are not just chatbots: they plan, execute, analyze results, and decide on next actions.

Agent Configuration Example

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: kubernetes-expert
spec:
  # System prompt (defines Agent's role)
  systemPrompt: |
    You are a Kubernetes expert specializing in troubleshooting.
    You help diagnose issues, analyze logs, and suggest fixes.
  
  # LLM configuration
  modelConfig:
    name: gpt-4
    provider: openai
  
  # Available tools
  tools:
    - name: kubectl-tools
    - name: prometheus-tools
  
  # Can collaborate with other Agents
  agents:
    - name: helm-agent

Agent Characteristics

flowchart TD
    subgraph AgentCapabilities["Agent Capabilities"]
        NL[Natural Language Understanding]
        Plan[Multi-step Task Planning]
        Analyze[Result Analysis & Adaptation]
        A2A[Agent-to-Agent Collaboration]
    end
    subgraph Workflow["Agent Workflow"]
        W1[Receive Task] --> W2[Plan Steps]
        W2 --> W3[Execute Tools]
        W3 --> W4[Analyze Results]
        W4 --> W5{Success?}
        W5 -->|No| W6[Adapt Strategy]
        W6 --> W2
        W5 -->|Yes| W7[Report Results]
    end
    AgentCapabilities --> Workflow
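The workflow in the diagram above (plan, execute, analyze, adapt, report) can be sketched as a small loop. This is a simplified illustration under stated assumptions, not how Kagent is implemented: `run_agent` and `example_plan` are hypothetical, and the "tools" are plain functions returning a success flag and an observation.

```python
from typing import Callable, List, Tuple

# A step is any function that returns (success, observation).
Step = Callable[[], Tuple[bool, str]]


def run_agent(plan: Callable[[List[str]], Step], max_attempts: int = 3) -> dict:
    """Plan -> execute -> analyze -> adapt, until success or attempts run out."""
    history: List[str] = []
    for attempt in range(1, max_attempts + 1):
        step = plan(history)            # plan the next step from what is known
        success, observation = step()   # execute the chosen tool
        history.append(observation)     # analyze: record the result
        if success:
            return {"success": True, "attempts": attempt, "history": history}
        # adapt: the next call to plan() sees the failed observation
    return {"success": False, "attempts": max_attempts, "history": history}


def example_plan(history: List[str]) -> Step:
    # Hypothetical planner: try a restart first, then fix the policy.
    if not history:
        return lambda: (False, "restart did not help")
    return lambda: (True, "NetworkPolicy fixed, pod running")


report = run_agent(example_plan)
```

Capping the number of attempts matters in practice: without it, an agent that keeps adapting to the same failure would loop and burn tokens indefinitely.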

Layer 3: Framework

The interface for managing and executing Agents.

Three Management Methods

Method	Description	Best For
CLI	Command line interface for quick interactions	Developers, scripting
Web UI	Browser-based dashboard for visual management	Teams, monitoring
Declarative YAML	GitOps-friendly configuration files	Production, automation

# CLI example
kagent list agents
kagent run kubernetes-expert "Check if all pods are running"
kagent logs kubernetes-expert

What Kagent Can Do

1. Automated Troubleshooting

Scenario: Pod in CrashLoopBackOff

flowchart LR
    subgraph Traditional["Traditional (30+ min)"]
        direction TB
        T1[kubectl get pods] --> T2[kubectl describe]
        T2 --> T3[kubectl logs]
        T3 --> T4[Check Prometheus]
        T4 --> T5[Analyze cause]
        T5 --> T6[Google solution]
        T6 --> T7[Apply fix]
        T7 --> T8[Verify]
    end
    subgraph Kagent["Kagent (2 min)"]
        direction TB
        K1[User: Check production] --> K2[Auto: Check pods]
        K2 --> K3[Auto: Found CrashLoop]
        K3 --> K4[Auto: Analyze logs]
        K4 --> K5[Auto: Check services]
        K5 --> K6[Auto: Found issue]
        K6 --> K7[Suggest: Fix NetworkPolicy]
        K7 --> K8[Auto: Verify]
    end
    style Traditional fill:#fee2e2,stroke:#dc2626
    style Kagent fill:#dcfce7,stroke:#16a34a

Example Conversation

User: "A failure occurred in the production namespace. Please check."

Kagent Agent:
1. [Auto] Check pod status in production namespace
2. [Auto] CrashLoopBackOff pod found: payment-api
3. [Auto] Analyze pod logs
   → "Error: Cannot connect to database: connection refused"
4. [Auto] Check Service status
   → database-service is normal
5. [Auto] Check NetworkPolicy
   → Traffic blocked between payment-api → database!
6. [Suggest] Propose NetworkPolicy modification
7. [After approval] Apply NetworkPolicy
8. [Auto] Confirm pod is running normally ✓

Total time: 2 minutes (Previous 30 min → 2 min)

2. Canary Deployment Automation

User: "Please deploy payment-api v2 with Canary. Start with 10% traffic,
      and gradually increase if error rate is below 1%."

Kagent Agent:
1. [Plan] Establish Canary deployment strategy
   - 10% → 25% → 50% → 100%
   - Wait 5 minutes at each stage
   - Verify error rate < 1%

2. [Execute] Create Argo Rollout
   - Deploy payment-api v2 (10% traffic)

3. [Monitor] Check Prometheus metrics
   - Error rate: 0.3% ✓
   - Response time: avg 120ms ✓

4. [Progress] Increase to 25%

5. [Monitor] Continue...

6. [Complete] 100% deployment finished
   - Total time: 20 minutes
   - Safely deployed without errors
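The staged rollout in this conversation boils down to a simple gate: advance to the next traffic weight only while the error rate stays under 1%. The Python sketch below illustrates that logic. `run_canary` is hypothetical, the metric lookup is a stub, and rolling back on a bad reading is an assumption, since the conversation only shows the healthy path.

```python
from typing import Callable, List

STAGES = [10, 25, 50, 100]      # traffic percentages from the plan above
MAX_ERROR_RATE = 1.0            # percent


def run_canary(get_error_rate: Callable[[int], float],
               stages: List[int] = STAGES,
               threshold: float = MAX_ERROR_RATE) -> dict:
    """Advance through the stages; stop and roll back on the first bad reading."""
    completed: List[int] = []
    for weight in stages:
        rate = get_error_rate(weight)   # in practice: wait, then query metrics
        if rate >= threshold:
            return {"status": "rolled_back", "failed_at": weight, "completed": completed}
        completed.append(weight)
    return {"status": "promoted", "failed_at": None, "completed": completed}


# Stubbed metrics: a steady 0.3% error rate, and one that degrades at 50%.
healthy = run_canary(lambda weight: 0.3)
unhealthy = run_canary(lambda weight: 0.3 if weight < 50 else 2.5)
```

In a real rollout the 5-minute wait at each stage would sit inside `get_error_rate`, so the gate always evaluates a full observation window.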

3. Zero Trust Security Policy Application

flowchart TD
    subgraph Analysis["Step 1: Analysis"]
        A1[Analyze current service communication patterns]
        A2[payment-api → database]
        A3[payment-api → redis]
        A4[api-gateway → payment-api]
    end
    subgraph Creation["Step 2: Policy Creation"]
        C1[Auto-generate NetworkPolicy]
        C2[Allow only necessary communication]
        C3[Default deny all]
        C4[Create Istio AuthorizationPolicy]
    end
    subgraph Verification["Step 3: Verification"]
        V1[Test connectivity]
        V2[All services normal]
        V3[Deploy policies]
        V4[Generate report]
    end
    Analysis --> Creation
    Creation --> Verification
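As a rough illustration of the policy-creation step, the Python sketch below turns the observed communication pairs into a default-deny NetworkPolicy plus one allow policy per destination. It assumes every workload carries an `app` label matching its name, which the source does not state; the function names and the `production` namespace are also illustrative.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def default_deny(namespace: str) -> dict:
    """A policy that selects every pod and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_policies(edges: List[Tuple[str, str]], namespace: str) -> List[dict]:
    """One ingress policy per destination, listing its observed callers."""
    callers: Dict[str, List[str]] = defaultdict(list)
    for source, destination in edges:
        callers[destination].append(source)
    policies = []
    for destination in sorted(callers):
        policies.append({
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": {"name": f"allow-to-{destination}", "namespace": namespace},
            "spec": {
                # Assumption: pods are labeled app=<service name>.
                "podSelector": {"matchLabels": {"app": destination}},
                "policyTypes": ["Ingress"],
                "ingress": [{
                    "from": [{"podSelector": {"matchLabels": {"app": s}}}
                             for s in sorted(callers[destination])]
                }],
            },
        })
    return policies


observed = [("payment-api", "database"), ("payment-api", "redis"),
            ("api-gateway", "payment-api")]
policies = [default_deny("production")] + allow_policies(observed, "production")
manifest = json.dumps(policies, indent=2)  # serialized for human review
```

Generating the policies as data before applying them leaves room for the verification step: a person or a test can review the output before anything reaches the cluster.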

Hands-On: Installing and Using Kagent

Prerequisites

Requirement	Details
Kubernetes Cluster	Minikube, Kind, EKS, GKE, etc.
Helm	Version 3.x
kubectl	Configured for your cluster
LLM API Key	OpenAI, Anthropic, or Ollama (free)

Step 1: Start Minikube Cluster

# Start Minikube
minikube start --cpus=4 --memory=8192

# Verify cluster status
kubectl get nodes

Step 2: Install Kagent

Install with Helm

Verify Installation

# Check Kagent pods
kubectl get pods -n kagent-system

# Expected output:
# NAME                                READY   STATUS    RESTARTS   AGE
# kagent-controller-7d8f9c5d4-xk2m9   1/1     Running   0          2m
# kagent-ui-5f9b8d7c6-p4n8k          1/1     Running   0          2m
# kagent-engine-6c8d7b5f4-q3m7n      1/1     Running   0          2m

Step 3: Access Web UI

# Port-forward to access UI
kubectl port-forward -n kagent-system svc/kagent-ui 8080:80

# Access in browser: http://localhost:8080

Step 4: Create Your First Agent

Agent YAML Configuration

# k8s-expert-agent.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: k8s-expert
  namespace: kagent-system
spec:
  description: "Kubernetes troubleshooting expert"
  
  systemPrompt: |
    You are KubeExpert, a Kubernetes specialist.
    Your role is to help diagnose and resolve Kubernetes issues.
    
    When analyzing problems:
    1. Check pod status first
    2. Review logs for errors
    3. Verify service connectivity
    4. Check resource constraints
    5. Suggest actionable fixes
    
    Always explain your reasoning and provide clear solutions.
  
  modelConfig:
    name: gpt-4
    provider: openai
    temperature: 0.3  # Lower temperature = more consistent answers
  
  tools:
    - name: kubectl
      type: kubernetes
      permissions:
        - get
        - list
        - describe
      resources:
        - pods
        - services
        - deployments
        - events
    
    - name: logs
      type: kubernetes
      permissions:
        - logs

Deploy Agent

kubectl apply -f k8s-expert-agent.yaml

# Verify Agent
kubectl get agents -n kagent-system

Step 5: Chat with Your Agent

# Install Kagent CLI
curl -sL https://kagent.dev/install.sh | bash

# Start conversation
kagent chat k8s-expert

# Enter your question:
> Show me all pods in default namespace

Agent response:
[Thinking] I'll check the pods in the default namespace...
[Action] Running: kubectl get pods -n default
[Result] 
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7d8f5c8b9-2xk4m   1/1     Running   0          5h
redis-6c7f8d9b8-9pm3n   1/1     Running   1          3h

[Summary] There are 2 pods in the default namespace:
- nginx: Running normally
- redis: Running with 1 restart (check logs if concerned)

Running Kagent for Free with Ollama

Cost Comparison

Method	Approx. Cost (varies by model)	Performance	Setup Time
OpenAI API	$0.03/1K tokens	Excellent	5 min
Anthropic API	$0.025/1K tokens	Excellent	5 min
Ollama (Local)	No API fees (runs on your own hardware)	Medium	10 min
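To see what these per-token prices mean over a month, here is a small worked calculation in Python. The prices are the illustrative figures from the table above, and the workload (4,000 tokens per query, 50 queries a day) is an assumed example, not a measurement.

```python
# Per-1K-token prices taken from the table above; treat them as illustrative.
PRICE_PER_1K = {"openai": 0.03, "anthropic": 0.025, "ollama": 0.0}


def monthly_cost(provider: str, tokens_per_query: int, queries_per_day: int,
                 days: int = 30) -> float:
    """Estimated spend in USD for a steady daily query volume."""
    total_tokens = tokens_per_query * queries_per_day * days
    return round(total_tokens / 1000 * PRICE_PER_1K[provider], 2)


# Assumed workload: 4,000 tokens per query, 50 queries per day, 30 days
# = 6,000,000 tokens per month.
openai_cost = monthly_cost("openai", tokens_per_query=4000, queries_per_day=50)
anthropic_cost = monthly_cost("anthropic", tokens_per_query=4000, queries_per_day=50)
ollama_cost = monthly_cost("ollama", tokens_per_query=4000, queries_per_day=50)
```

At that volume the table's prices work out to $180 and $150 per month for the hosted APIs, while the local option costs nothing in API fees but still consumes cluster CPU and memory.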

Step 1: Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server
ollama serve

# Download model (in another terminal)
ollama pull llama3.2:3b  # 3B parameter model (lightweight)

Step 2: Deploy Ollama to Kubernetes

# ollama-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      initContainers:
      - name: model-puller
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
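          ollama serve &
          SERVER_PID=$!
          # Wait until the server answers instead of relying on a fixed sleep
          until ollama list >/dev/null 2>&1; do sleep 1; done
          ollama pull llama3.2:3b
          # Stop the background server by PID (pkill may not exist in the image)
          kill "$SERVER_PID"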
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
      
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - name: http
          containerPort: 11434
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
      
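      volumes:
      - name: ollama-data
        # emptyDir is wiped when the pod is deleted or rescheduled, so the model
        # is pulled again; use a PersistentVolumeClaim to keep it across restarts
        emptyDir: {}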

---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
  - port: 80
    targetPort: http

kubectl apply -f ollama-deployment.yaml

# Wait for model download (2-5 minutes)
kubectl logs -n ollama -f deployment/ollama -c model-puller

Step 3: Connect Kagent to Ollama

Create Ollama ModelConfig

# ollama-modelconfig.yaml
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: llama3-local
  namespace: kagent
spec:
  model: llama3.2:3b
  provider: Ollama
  ollama:
    host: http://ollama.ollama.svc.cluster.local:80

kubectl apply -f ollama-modelconfig.yaml
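The `host` value in the ModelConfig follows the standard in-cluster DNS pattern for a Service, `<service>.<namespace>.svc.cluster.local`, where `cluster.local` is the default cluster domain. A tiny Python helper (hypothetical, for illustration) shows how the URL is assembled from the Service defined earlier:

```python
def service_url(service: str, namespace: str, port: int, scheme: str = "http") -> str:
    """Build the in-cluster URL for a Kubernetes Service (default cluster domain)."""
    return f"{scheme}://{service}.{namespace}.svc.cluster.local:{port}"


# Service "ollama" in namespace "ollama", exposed on port 80.
ollama_host = service_url("ollama", "ollama", 80)
```

If you rename the Service, move it to another namespace, or change its port, the `host` field has to change with it.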

Recommended Ollama Models

RAM Requirement	Model	Size	Notes
≤4GB RAM	llama3.2:1b	1.3GB	Lightest option
≤4GB RAM	phi3:mini	2.3GB	Microsoft model
≤4GB RAM	gemma2:2b	1.6GB	Google model
8GB RAM	llama3.2:3b	2GB	Recommended!
8GB RAM	mistral:7b	4.1GB	Good performance
16GB+ RAM	llama3.1:8b	4.7GB	Best local option
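The table can be read as a simple lookup: pick the largest model whose RAM tier you meet. The Python sketch below encodes that rule with one model per tier chosen from the table for brevity; `pick_model` is a hypothetical helper, not part of Ollama or Kagent.

```python
# (minimum RAM in GB, model tag) pairs, ordered from largest to smallest tier.
# One model per tier is taken from the table above for brevity.
RECOMMENDED = [(16, "llama3.1:8b"), (8, "llama3.2:3b"), (0, "llama3.2:1b")]


def pick_model(ram_gb: float) -> str:
    """Return the largest recommended model that fits the available RAM."""
    for min_ram, model in RECOMMENDED:
        if ram_gb >= min_ram:
            return model
    raise ValueError("RAM must be non-negative")
```

For example, `pick_model(8)` selects `llama3.2:3b`, matching the table's recommended choice for an 8GB machine.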

Multi-Agent Collaboration

Scenario: Helm Deployment + Monitoring Setup

flowchart TD
    subgraph Coordinator["Deployment Coordinator"]
        C1[Receive Task]
        C2[Break Down Steps]
        C3[Coordinate Agents]
        C4[Final Verification]
    end
    subgraph HelmAgent["Helm Agent"]
        H1[Install Chart]
        H2[Verify Deployment]
    end
    subgraph MetricsAgent["Metrics Analyzer"]
        M1[Create ServiceMonitor]
        M2[Setup Alerts]
    end
    C1 --> C2
    C2 --> H1
    H1 --> H2
    H2 --> C3
    C3 --> M1
    M1 --> M2
    M2 --> C4
    style Coordinator fill:#dbeafe,stroke:#2563eb
    style HelmAgent fill:#fef3c7,stroke:#d97706
    style MetricsAgent fill:#dcfce7,stroke:#16a34a

Coordinator Agent Configuration

# deployment-coordinator.yaml
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: deployment-coordinator
  namespace: kagent-system
spec:
  description: "Coordinates deployment and monitoring setup"
  
  systemPrompt: |
    You coordinate between helm-agent and metrics-analyzer to:
    1. Deploy applications via Helm
    2. Verify deployment success
    3. Setup monitoring and alerts
    4. Run smoke tests
  
  # Can call other Agents
  agents:
    - name: helm-agent
      role: deployment
    - name: metrics-analyzer  
      role: monitoring
  
  tools:
    - name: kubectl

Best Practices

1. Security

API Key Management

RBAC Configuration

# Grant minimum permissions to Agent
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kagent-readonly
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list"]  # No write permissions
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
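To reason about what this Role permits, it helps to treat the rules as data and ask whether a given verb and resource match any of them. The Python sketch below does that for the Role above. It is a simplified model (no wildcards, no resourceNames), and `is_allowed` is a hypothetical helper.

```python
from typing import Dict, List

# The rules from the Role above, as plain data.
RULES: List[Dict[str, List[str]]] = [
    {"apiGroups": [""], "resources": ["pods", "services"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]


def is_allowed(verb: str, resource: str, api_group: str = "") -> bool:
    """True if any rule grants the verb on the resource (no wildcard support)."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in RULES
    )
```

Here `is_allowed("get", "pods")` is true while `is_allowed("delete", "pods")` is false. On a real cluster, `kubectl auth can-i` answers the same question against the live RBAC configuration.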

2. Cost Management

spec:
  modelConfig:
    maxTokens: 4000  # Limit response tokens
    budget:
      daily: 1000000  # Daily token limit
      perQuery: 10000  # Per-query token limit
  
  cache:
    enabled: true
    ttl: 3600  # Cache same queries for 1 hour
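The two ideas in this configuration, a token budget and a time-limited cache, can be sketched together in a few lines of Python. This is an illustrative model only: `BudgetedCache` is hypothetical, token counts are supplied by the caller, and the daily reset is left out for brevity.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class BudgetedCache:
    """Caches answers for `ttl` seconds and refuses queries past the token limits."""

    def __init__(self, daily_budget: int, per_query: int, ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.daily_budget = daily_budget
        self.per_query = per_query
        self.ttl = ttl
        self.clock = clock
        self.used = 0
        self._cache: Dict[str, Tuple[float, str]] = {}

    def ask(self, query: str, tokens: int,
            answer_fn: Callable[[str], str]) -> Optional[str]:
        cached = self._cache.get(query)
        if cached and self.clock() - cached[0] < self.ttl:
            return cached[1]                      # cache hit costs no tokens
        if tokens > self.per_query or self.used + tokens > self.daily_budget:
            return None                           # over a limit: refuse
        self.used += tokens
        answer = answer_fn(query)
        self._cache[query] = (self.clock(), answer)
        return answer


limiter = BudgetedCache(daily_budget=1_000_000, per_query=10_000, ttl=3600)
first = limiter.ask("pod status?", 4000, lambda q: "all pods running")
second = limiter.ask("pod status?", 4000, lambda q: "should not be called")
too_big = limiter.ask("summarize everything", 20_000, lambda q: "n/a")
```

The second identical query is served from the cache without spending tokens, and the oversized query is refused before any model call is made.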

3. Observability

# Check Agent logs
kubectl logs -n kagent-system deployment/kagent-engine -f

# View specific Agent history
kagent history k8s-expert

# Check metrics
kubectl port-forward -n kagent-system svc/kagent-metrics 9090:9090

Troubleshooting

Problem 1: Agent Not Responding

# 1. Check Engine pod status
kubectl get pods -n kagent-system -l app=kagent-engine

# 2. Check logs
kubectl logs -n kagent-system deployment/kagent-engine

# 3. Verify the API key secret exists (describe lists key names and sizes without printing values)
kubectl describe secret openai-secret -n kagent-system

# 4. Check network
kubectl exec -it -n kagent-system deployment/kagent-engine -- curl -I https://api.openai.com

Problem 2: Permission Error

Cleanup Resources

#!/bin/bash
# cleanup-kagent.sh

echo "🧹 Starting Kagent resource cleanup..."

# Remove Helm releases
helm uninstall kagent -n kagent-system 2>/dev/null
helm uninstall kagent-crds -n kagent 2>/dev/null

# Remove Ollama (a normal delete lets Kubernetes clean up resources in order)
kubectl delete namespace ollama --ignore-not-found

# Remove Kagent namespaces
kubectl delete namespace kagent-system --ignore-not-found
kubectl delete namespace kagent --ignore-not-found

# Remove RBAC
kubectl delete clusterrole kagent-reader --ignore-not-found
kubectl delete clusterrolebinding kagent-reader-binding --ignore-not-found

# Stop Minikube
minikube stop

echo "✅ Cleanup complete!"

Real-World Use Cases

Case 1: Automated Nighttime Incident Response

Situation:

Production failure at 3 AM
On-call engineer is sleeping

Kagent Configuration:

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: incident-responder
spec:
  triggers:
    - type: prometheus-alert
      severity: critical
      
  automation:
    enabled: true
    requireApproval: false  # No approval needed for emergencies
    
  actions:
    - diagnose: true
    - attempt-fix: true
    - notify-on-call: true
    - create-incident-report: true
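The effect of the trigger and `requireApproval` settings can be expressed as a small decision function. The Python sketch below is a hypothetical model of that configuration, not Kagent code. In particular, the `await-approval` step is an assumption about what happens when approval is required.

```python
from typing import Dict, List


def plan_response(alert: Dict[str, str], trigger_severity: str = "critical",
                  require_approval: bool = False) -> List[str]:
    """Return the ordered actions for an alert, or nothing if it does not match."""
    if alert.get("severity") != trigger_severity:
        return []
    actions = ["diagnose"]
    # With approval required, the fix waits for a human instead of running.
    actions.append("await-approval" if require_approval else "attempt-fix")
    actions += ["notify-on-call", "create-incident-report"]
    return actions
```

With the configuration above, a critical alert yields diagnose, attempt-fix, notify-on-call, and create-incident-report in that order, while a warning-level alert is ignored.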

Result:

Agent automatically diagnoses problem
Restarts memory-exhausted pod
Recovery in 5 minutes
Engineer reviews report in the morning

Case 2: Developer Onboarding Acceleration

Problem: New developer doesn’t know Kubernetes

Solution:

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: newbie-helper
spec:
  systemPrompt: |
    You are a friendly Kubernetes tutor for new developers.
    Explain concepts simply and provide step-by-step guidance.
    Always include educational context with your answers.

Future Roadmap

timeline
    title Kagent Development Roadmap
    section Current (Dec 2025)
        CNCF Sandbox : 800+ GitHub Stars
        Core Integrations : Argo, Helm, Istio, K8s, Prometheus
    section Planned
        OpenTelemetry : Full observability integration
        Multi-agent Workflows : Complex orchestration
        Visual Designer : GUI-based Agent design
        Agent Marketplace : Community-shared Agents

Conclusion

The paradigm of Kubernetes operations is changing.

flowchart LR
    subgraph Past["Past"]
        P1[Problem] --> P2[Human diagnoses]
        P2 --> P3[Human fixes]
        P3 --> P4[Human verifies]
    end
    subgraph Present["Present (Kagent)"]
        PR1[Problem] --> PR2[AI diagnoses]
        PR2 --> PR3[AI fixes]
        PR3 --> PR4[AI verifies]
        PR4 --> PR5[Human approves]
    end
    subgraph Future["Future"]
        F1[Problem] --> F2[AI handles everything]
        F2 --> F3[Human focuses on strategy]
    end
    Past --> Present
    Present --> Future

Key Benefits by Role

Role	Benefits
DevOps Engineers	Freedom from repetitive troubleshooting, complex deployment automation, reduced 24/7 on-call burden
Platform Teams	Enhanced developer self-service, standardized operations, democratized knowledge
Organizations	Faster incident response, reduced operational costs, engineers focus on higher-value work

Kagent is still in its early stages, but it already covers diagnosis, deployment, and policy work. Whether you're looking to reduce operational toil, accelerate incident response, or spread Kubernetes expertise across your organization, Kagent offers a practical path toward AI-augmented infrastructure operations.