AI-Powered K8s Operations: K8sGPT Brings Automatic Diagnosis and Instant Solutions to Cluster Failures
Still struggling with Kubernetes troubleshooting? Try K8sGPT, an AI-based diagnostic tool that automatically scans a cluster for anomalies and generates step-by-step solutions through models such as OpenAI and DeepSeek. This article walks you through deploying it in either CLI or Operator mode, from installation to hands-on verification, to boost your K8s operations efficiency.
1. What is K8sGPT
K8sGPT is an AI-based Kubernetes diagnostic tool that automatically scans a cluster for anomalies and generates solutions via AI. It supports multiple AI backends (OpenAI, local models, etc.) and offers two core usage modes:
CLI mode: install the CLI tool and connect to a cluster via kubeconfig for on-demand diagnosis
Operator mode: install an Operator in the cluster; this suits continuous monitoring and integrates with existing monitoring stacks such as Prometheus and Alertmanager
For in-cluster deployment, the Operator mode is recommended.
2. CLI Installation
2.1 Installation
K8sGPT's functionality is provided by a CLI tool, which can be installed directly with brew:
brew install k8sgpt
Or download a binary from the Releases page:
version=v0.4.20
wget https://github.com/k8sgpt-ai/k8sgpt/releases/download/${version}/k8sgpt_Linux_x86_64.tar.gz
tar -zxvf k8sgpt_Linux_x86_64.tar.gz
mv k8sgpt /usr/local/bin/
# check the version
k8sgpt version
2.2 Configure an AI Provider
k8sgpt supports a variety of AI providers:
[root@kc-master ~]# k8sgpt auth list
Default:
> openai
Active:
Unused:
> localai
> openai
> ollama
> azureopenai
> cohere
> amazonbedrock
> amazonsagemaker
> google
> noopai
> huggingface
> googlevertexai
> oci
> customrest
> ibmwatsonxai
Besides the common openai and ollama backends, the localai provider can connect to any external model that exposes an OpenAI-compatible API.
Here we use the localai provider to connect to DeepSeek:
baseurl=https://api.deepseek.com/v1
model=deepseek-reasoner
key=sk-xxx
k8sgpt auth add -b localai -u $baseurl -m $model -p $key
Then set it as the default provider:
$ k8sgpt auth default -p localai
Default provider set to localai
2.3 Scan the Cluster
Run `k8sgpt analyze` to scan the cluster for issues:
# k8sgpt analyze
AI Provider: AI not used; --explain not set
0: Deployment default/broken-image()
- Error: Deployment default/broken-image has 1 replicas but 0 are available with status running
1: Pod default/broken-image-8896f7cf4-vf47k(Deployment/broken-image)
- Error: Back-off pulling image "nginx:invalid-tag"
2: ConfigMap calico-apiserver/kube-root-ca.crt()
- Error: ConfigMap kube-root-ca.crt is not used by any pods in the namespace
...
You can also point it at a remote cluster with a kubeconfig:
k8sgpt analyze --kubeconfig mykubeconfig
By default no AI is used; the command simply scans the cluster for issues. Add the --explain flag to have it query the AI for a solution to each issue:
# add -b <provider> to select the provider configured in the previous step (defaults to the default provider)
k8sgpt analyze --explain
The output looks like this:
[root@kc-master ~]# k8sgpt analyze --explain
AI Provider: localai
0: Deployment default/broken-image()
- Error: Deployment default/broken-image has 1 replicas but 0 are available with status running
Error: The deployment "broken-image" has 1 pod defined, but 0 pods are running successfully.
Solution:
1. Check pod status: `kubectl get pods -n default`
2. View pod logs: `kubectl logs <pod-name> -n default`
3. Inspect errors: `kubectl describe pod <pod-name> -n default`
4. Fix image/configuration issue (e.g., correct image name in deployment YAML)
5. Apply changes: `kubectl apply -f deployment.yaml`
(275 characters)
1: Pod default/broken-image-8896f7cf4-vf47k(Deployment/broken-image)
- Error: Back-off pulling image "nginx:invalid-tag"
Error: Kubernetes cannot pull the specified container image because the tag "invalid-tag" doesn't exist in the Docker registry for "nginx".
Solution:
1. Verify valid nginx tags on Docker Hub or via `docker search nginx --limit 5`.
2. Edit your deployment: `kubectl edit deployment <deployment-name>`.
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `image: nginx:latest` or `image: nginx:alpine`).
4. Save changes to restart pods automatically.
5. Confirm fix: `kubectl get pods` should show running status.
3. Operator Installation
3.1 Deploy the Operator
Deploy it directly with Helm:
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm upgrade --install -n k8sgpt-operator-system k8sgpt k8sgpt/k8sgpt-operator --create-namespace
Check that the Pod is running:
# kubectl -n k8sgpt-operator-system get po
NAME READY STATUS RESTARTS AGE
release-k8sgpt-operator-controller-manager-69b6fd9696-zc9t5 2/2 Running 0 33m
A k8sgpt-operator-controller-manager Pod is started; once it is running, the deployment is complete.
3.2 Create a K8sGPT Object
First, create a K8sGPT object:
kubectl apply -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-local-ai
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: deepseek-reasoner
    backend: localai
    baseUrl: https://api.deepseek.com/v1
    secret:
      name: k8sgpt-sample-secret
      key: openai-api-key
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.4.1
EOF
The spec contains roughly two parts of information:
1) The k8sgpt image repository and version, used to create the corresponding Pod:
repository: ghcr.io/k8sgpt-ai/k8sgpt
version: v0.4.1
2) The LLM connection details, used to talk to the model; here it is DeepSeek:
backend: localai
baseUrl: https://api.deepseek.com/v1
model: deepseek-reasoner
secret: the Secret holding the API key; omit this section if the API does not require authentication
…
You need to provide an LLM service configuration; it does not have to be ChatGPT, since any service compatible with the OpenAI API format will do.
For example, DeepSeek (https://platform.deepseek.com/usage) or a locally hosted service.
If the LLM service requires authentication, create a Secret in advance to store the API key. The Secret name and key must match the configuration in the K8sGPT object's spec; a mismatch will cause errors.
OPENAI_TOKEN=sk-xxx
kubectl create secret generic k8sgpt-sample-secret --from-literal=openai-api-key=$OPENAI_TOKEN -n k8sgpt-operator-system
Once the K8sGPT object is created, the Operator deployed in the previous step starts a Pod based on it:
(base) [root@label-studio-k8s ~]# k -n k8sgpt-operator-system get pod -w
NAME READY STATUS RESTARTS AGE
k8sgpt-local-ai-5ff75b9b8f-mfv8t 0/1 ContainerCreating 0 19s
This Pod is the actual k8sgpt worker: it continuously scans the cluster, asks the AI for solutions, and stores them in Result objects.
3.3 Simulate a Failure
Now we can verify the setup.
k8sgpt automatically collects cluster information and generates diagnostic results, which can be viewed with:
(base) [root@label-studio-k8s ~]# k -n k8sgpt-operator-system get result
NAME KIND BACKEND AGE
defaultyoloservice Service localai 60s
kubesystemhamidcuvgpudevicepluginktlg7 Pod localai 60s
kubesystemhamidevicepluginmonitor Service localai 60s
kubesystemkccsicontroller785cc89b7bbmdkh Pod localai 60s
Each Result object corresponds to one issue; if the cluster has no issues, no Result objects are generated.
We can therefore create an issue ourselves, for example a failed image pull.
Create a deployment with the following command; since the tag does not exist, the image pull is guaranteed to fail:
kubectl create deployment broken-image --image=nginx:invalid-tag
As expected, the Pod fails to start because the image cannot be pulled:
# kubectl get po
NAME READY STATUS RESTARTS AGE
broken-image-8896f7cf4-qst86 0/1 ImagePullBackOff 0 4m33s
Check whether k8sgpt detects the issue:
# kubectl -n k8sgpt-operator-system get result
NAME KIND BACKEND AGE
defaultbrokenimage8896f7cf4qst86 Pod localai 12s
It usually takes 30 seconds to 2 minutes (depending on cluster size) for k8sgpt to detect a failure and generate a Result.
A new Result has appeared, as expected; the Result object's name is composed of the namespace and the object name (here, the Pod name).
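As an illustrative sketch of this naming scheme (my own approximation, not the operator's actual code), the Result name can be derived by concatenating the namespace and resource name while keeping only lowercase letters and digits:

```go
package main

import "fmt"

// resultName mimics how a Result object name appears to be derived:
// concatenate "namespace/name" and drop every character that is not a
// lowercase letter or digit. Illustrative sketch only.
func resultName(namespacedName string) string {
	out := make([]rune, 0, len(namespacedName))
	for _, r := range namespacedName {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			out = append(out, r)
		}
	}
	return string(out)
}

func main() {
	fmt.Println(resultName("default/broken-image-8896f7cf4-qst86"))
	// defaultbrokenimage8896f7cf4qst86
}
```

This matches the `defaultbrokenimage...` names seen in the `get result` output above.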
k8sgpt automatically sends the error message to the AI and writes the response into the details field of the Result:
# kubectl -n k8sgpt-operator-system get result defaultbrokenimage8896f7cf4vf47k -oyaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2025-06-26T04:23:06Z"
  generation: 1
  labels:
    k8sgpts.k8sgpt.ai/backend: localai
    k8sgpts.k8sgpt.ai/name: k8sgpt-local-ai
    k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
  name: defaultbrokenimage8896f7cf4vf47k
  namespace: k8sgpt-operator-system
  resourceVersion: "90019"
  uid: 1ab8b2ad-7398-4d6d-bd8b-e2d5beba70ae
spec:
  backend: localai
  details: "Error: Kubernetes cannot pull the container image because the tag \"invalid-tag\"
    doesn't exist in the Docker registry for nginx. \nSolution: \n1. Verify valid
    nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx) \n2. Edit
    your deployment: \n```bash \nkubectl edit deployment <deployment-name> \n```
    \ \n3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest`
    or `nginx:1.25`) \n4. Save and exit. Kubernetes will automatically retry pulling
    the new image. \n\n*(273 characters)*"
  error:
  - text: Back-off pulling image "nginx:invalid-tag"
  kind: Pod
  name: default/broken-image-8896f7cf4-vf47k
  parentObject: ""
status: {}
Formatted, the details read:
Error: Kubernetes cannot pull the container image because the tag "invalid-tag" doesn't exist in the Docker registry for nginx.
Solution:
1. Verify valid nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)
2. Edit your deployment:
```bash
kubectl edit deployment <deployment-name>
```
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest` or `nginx:1.25`)
4. Save and exit. Kubernetes will automatically retry pulling the new image.
The details field contains the specific problem along with a solution.
Here the AI tells us to edit the deployment and change the image tag to a valid value.
3.4 Fix the Failure
Apply the fix suggested by the AI:
kubectl set image deployment/broken-image nginx=nginx:1.25
After the image is corrected, the Pod starts normally:
# kubectl get po
NAME READY STATUS RESTARTS AGE
broken-image-564ff59bd-b4fdm 1/1 Running 0 20s
Once fixed, the failure (i.e. the Result object) disappears:
# kubectl -n k8sgpt-operator-system get result
No resources found in k8sgpt-operator-system namespace.
4. How It Works
So how does k8sgpt actually work?
4.1 Overall Flow
The process consists of the following steps:
1) Initialize the Analysis: configure the Kubernetes client, the AI backend, and so on
2) Run the Analysis:
run custom analyzers (if configured)
run the built-in analyzers
3) Ask the AI to generate a solution (if explain is specified)
4) Return a structured diagnostic report
The full code is as follows:
func (h *Handler) Analyze(ctx context.Context, i *schemav1.AnalyzeRequest) (
	*schemav1.AnalyzeResponse,
	error,
) {
	if i.Output == "" {
		i.Output = "json"
	}
	if int(i.MaxConcurrency) == 0 {
		i.MaxConcurrency = 10
	}
	// Initialize the Analysis
	config, err := analysis.NewAnalysis(
		i.Backend,
		i.Language,
		i.Filters,
		i.Namespace,
		i.LabelSelector,
		i.Nocache,
		i.Explain,
		int(i.MaxConcurrency),
		false,      // Kubernetes Doc disabled in server mode
		false,      // Interactive mode disabled in server mode
		[]string{}, // TODO: add custom http headers in server mode
		false,      // with stats disable
	)
	if err != nil {
		return &schemav1.AnalyzeResponse{}, err
	}
	config.Context = ctx // Replace context for correct timeouts.
	defer config.Close()
	// Run custom analyzers
	if config.CustomAnalyzersAreAvailable() {
		config.RunCustomAnalysis()
	}
	// Run the built-in analyzers
	config.RunAnalysis()
	// Ask the AI for a solution (if explain is specified)
	if i.Explain {
		err := config.GetAIResults(i.Output, i.Anonymize)
		if err != nil {
			return &schemav1.AnalyzeResponse{}, err
		}
	}
	// Return a structured diagnostic report
	out, err := config.PrintOutput(i.Output)
	if err != nil {
		return &schemav1.AnalyzeResponse{}, err
	}
	var obj schemav1.AnalyzeResponse
	err = json.Unmarshal(out, &obj)
	if err != nil {
		return &schemav1.AnalyzeResponse{}, err
	}
	return &obj, nil
}
4.2 Analyzer Logic
k8sgpt ships with multiple built-in analyzers.
Taking the Pod analyzer as an example, the flow is straightforward:
1) List Pods using the Kubernetes client
2) Iterate over each Pod, check its status, and collect error messages
3) Structure the error messages into Result objects and return them
The full code is as follows:
func (PodAnalyzer) Analyze(a common.Analyzer) ([]common.Result, error) {
	kind := "Pod"
	AnalyzerErrorsMetric.DeletePartialMatch(map[string]string{
		"analyzer_name": kind,
	})
	// search all namespaces for pods that are not running
	list, err := a.Client.GetClient().CoreV1().Pods(a.Namespace).List(a.Context, metav1.ListOptions{
		LabelSelector: a.LabelSelector,
	})
	if err != nil {
		return nil, err
	}
	var preAnalysis = map[string]common.PreAnalysis{}
	for _, pod := range list.Items {
		var failures []common.Failure
		// Check for pending pods
		if pod.Status.Phase == "Pending" {
			// Check through container status to check for crashes
			for _, containerStatus := range pod.Status.Conditions {
				if containerStatus.Type == v1.PodScheduled && containerStatus.Reason == "Unschedulable" {
					if containerStatus.Message != "" {
						failures = append(failures, common.Failure{
							Text:      containerStatus.Message,
							Sensitive: []common.Sensitive{},
						})
					}
				}
			}
		}
		// Check for errors in the init containers.
		failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.InitContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)
		// Check for errors in containers.
		failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.ContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)
		if len(failures) > 0 {
			preAnalysis[fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)] = common.PreAnalysis{
				Pod:            pod,
				FailureDetails: failures,
			}
			AnalyzerErrorsMetric.WithLabelValues(kind, pod.Name, pod.Namespace).Set(float64(len(failures)))
		}
	}
	for key, value := range preAnalysis {
		var currentAnalysis = common.Result{
			Kind:  kind,
			Name:  key,
			Error: value.FailureDetails,
		}
		parent, found := util.GetParent(a.Client, value.Pod.ObjectMeta)
		if found {
			currentAnalysis.ParentObject = parent
		}
		a.Results = append(a.Results, currentAnalysis)
	}
	return a.Results, nil
}
4.3 Prompt Template and AI Interaction
In the previous step, k8sgpt collected the anomaly information from the cluster, for example:
Back-off pulling image "nginx:invalid-tag"
It then hands this error message to the AI to generate a solution, using the following prompt template for Kubernetes errors:
default_prompt = `Simplify the following Kubernetes error message delimited by triple dashes written in --- %s --- language; --- %s ---.
Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
Error: {Explain error here}
Solution: {Step by step solution here}
`
This is why the details field in the Result object we saw earlier takes exactly this form:
Error: Kubernetes cannot pull the container image because the tag "invalid-tag" doesn't exist in the Docker registry for nginx.
Solution:
1. Verify valid nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)
2. Edit your deployment:
```bash
kubectl edit deployment <deployment-name>
```
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest` or `nginx:1.25`)
4. Save and exit. Kubernetes will automatically retry pulling the new image.
The Error line explains the problem and the Solution section gives step-by-step instructions, matching the format requested by the prompt.
5. Summary
K8sGPT's core value as an AI-based Kubernetes diagnostic tool lies in automated failure detection combined with AI-generated solutions, which greatly lowers the bar for K8s troubleshooting.
The two deployment modes fit different scenarios: CLI mode suits ad-hoc diagnosis (manually triggered cluster scans), while Operator mode suits continuous monitoring (combined with tools such as Prometheus for real-time alerting and automatic remediation suggestions).
The workflow is clear and efficient: built-in analyzers scan cluster resources (Pods, Deployments, etc.) for abnormal states, and an LLM (such as DeepSeek or OpenAI) generates structured solutions in a uniform format (error description plus step-by-step fix), making them easy to act on.
It is also flexible: multiple AI backends are supported (local models, cloud-hosted models, etc.).
Note: for complex failures (such as network policy conflicts or persistent storage issues), AI-generated solutions may need human verification; do not rely on the automated results blindly.