AI-Powered K8s Operations: K8sGPT Brings Automatic Fault Diagnosis and Instant Solutions to Your Cluster

https://img.lixueduan.com/kubernetes/cover/ai-analyzer-k8sgpt.png

Still struggling to troubleshoot Kubernetes cluster failures? Try K8sGPT, an AI-based diagnostic tool that automatically scans the cluster for anomalies and generates step-by-step solutions through models such as OpenAI and DeepSeek. This article walks you through deploying it in CLI or Operator mode, from installation to hands-on verification, to boost your K8s operations efficiency!

1. What is K8sGPT

K8sGPT is an AI-based Kubernetes diagnostic tool that automatically scans the cluster for anomalies and generates solutions through AI. It supports multiple AI backends (OpenAI, local models, and more) and offers two core usage modes:

  • CLI installation: install the CLI tool and use it directly; it connects to the cluster through a kubeconfig for on-demand diagnosis

  • Operator installation: install an Operator in the cluster; this mode is well suited to continuous monitoring and integrates with existing monitoring stacks such as Prometheus and Alertmanager

For in-cluster deployments, the Operator installation is recommended.

2. CLI Installation

2.1 Install

The k8sgpt functionality is provided by the CLI tool, which can be installed directly with brew:

brew install k8sgpt

Or download it from the Releases page:

version=v0.4.20

wget https://github.com/k8sgpt-ai/k8sgpt/releases/download/${version}/k8sgpt_Linux_x86_64.tar.gz

tar -zxvf k8sgpt_Linux_x86_64.tar.gz
mv k8sgpt /usr/local/bin/

# check the version
k8sgpt version

2.2 Configure an AI Provider

k8sgpt supports a number of AI providers:

[root@kc-master ~]# k8sgpt auth list
Default:
> openai
Active:
Unused:
> localai
> openai
> ollama
> azureopenai
> cohere
> amazonbedrock
> amazonsagemaker
> google
> noopai
> huggingface
> googlevertexai
> oci
> customrest
> ibmwatsonxai

Besides the common ones such as openai and ollama, the localai provider can connect to any external model that exposes an OpenAI-compatible API.

Here we use the localai provider to connect to DeepSeek:

baseurl=https://api.deepseek.com/v1
model=deepseek-reasoner
key=sk-xxx

k8sgpt auth add -b localai -u $baseurl -m $model -p $key

Then set it as the default provider:

$ k8sgpt auth default -p localai
Default provider set to localai

2.3 Scan the Cluster

Run k8sgpt analyze to scan the cluster for problems:

# k8sgpt analyze
AI Provider: AI not used; --explain not set

0: Deployment default/broken-image()
- Error: Deployment default/broken-image has 1 replicas but 0 are available with status running

1: Pod default/broken-image-8896f7cf4-vf47k(Deployment/broken-image)
- Error: Back-off pulling image "nginx:invalid-tag"

2: ConfigMap calico-apiserver/kube-root-ca.crt()
- Error: ConfigMap kube-root-ca.crt is not used by any pods in the namespace

...

You can also access a remote cluster by specifying a kubeconfig:

k8sgpt analyze --kubeconfig mykubeconfig

By default no AI is used; the command only performs a plain scan of cluster problems. Add the --explain flag to have the AI produce a solution for each issue:

# uses the default provider configured in the previous step (override with -b)
k8sgpt analyze --explain

The output looks like this:

[root@kc-master ~]# k8sgpt analyze --explain
AI Provider: localai

0: Deployment default/broken-image()
- Error: Deployment default/broken-image has 1 replicas but 0 are available with status running
Error: The deployment "broken-image" has 1 pod defined, but 0 pods are running successfully.
Solution:
1. Check pod status: `kubectl get pods -n default`
2. View pod logs: `kubectl logs <pod-name> -n default`
3. Inspect errors: `kubectl describe pod <pod-name> -n default`
4. Fix image/configuration issue (e.g., correct image name in deployment YAML)
5. Apply changes: `kubectl apply -f deployment.yaml`

(275 characters)
1: Pod default/broken-image-8896f7cf4-vf47k(Deployment/broken-image)
- Error: Back-off pulling image "nginx:invalid-tag"
Error: Kubernetes cannot pull the specified container image because the tag "invalid-tag" doesn't exist in the Docker registry for "nginx".

Solution:
1. Verify valid nginx tags on Docker Hub or via `docker search nginx --limit 5`.
2. Edit your deployment: `kubectl edit deployment <deployment-name>`.
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `image: nginx:latest` or `image: nginx:alpine`).
4. Save changes to restart pods automatically.
5. Confirm fix: `kubectl get pods` should show running status.

3. Operator Installation

3.1 Deploy the Operator

Deploy directly with helm:

helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update

helm upgrade --install -n k8sgpt-operator-system k8sgpt k8sgpt/k8sgpt-operator --create-namespace

Check that the Pod is running:

# kubectl -n k8sgpt-operator-system get po
NAME                                                          READY   STATUS    RESTARTS   AGE
release-k8sgpt-operator-controller-manager-69b6fd9696-zc9t5   2/2     Running   0          33m

A k8sgpt-operator-controller-manager Pod starts; once it is running, the deployment is complete.

3.2 Create a K8sGPT Object

First, create a K8sGPT object:

kubectl apply -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-local-ai
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: deepseek-reasoner
    backend: localai
    baseUrl: https://api.deepseek.com/v1
    secret:
      name: k8sgpt-sample-secret
      key: openai-api-key
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.4.1
EOF

The spec contains two groups of information:

  • 1) The k8sgpt image repository and version, used to create the corresponding Pod

    • repository: ghcr.io/k8sgpt-ai/k8sgpt

    • version: v0.4.1

  • 2) The LLM connection details, used to talk to the LLM (DeepSeek here)

    • backend: localai

    • baseUrl: https://api.deepseek.com/v1

    • model: deepseek-reasoner

    • secret: the Secret holding the API key; omit this section if the API has no authentication

You need to provide an LLM service; it does not have to be ChatGPT, any service compatible with the OpenAI API format works.

For example DeepSeek (https://platform.deepseek.com/usage), or a locally hosted service.
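As a sketch of that flexibility, the same K8sGPT object shape can point the Operator at a locally hosted Ollama instance instead of DeepSeek. The service URL and model name below are illustrative assumptions (Ollama exposes an OpenAI-compatible API under /v1); the secret section is omitted because a local service typically requires no API key:

```yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-ollama
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    backend: localai
    # Assumed in-cluster Ollama service; adjust host/port to your setup
    baseUrl: http://ollama.ollama.svc:11434/v1
    # Any model pulled into Ollama, e.g. llama3 (assumption)
    model: llama3
    # no secret: the local service requires no API key
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.4.1
```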

If the LLM service requires authentication, create the Secret that stores the API key in advance; the Secret name and key must match the values in the K8sGPT object's spec, otherwise the mismatch will cause errors.

OPENAI_TOKEN=sk-921726c2856a4a198a6e262fec0e9e5a
kubectl create secret generic k8sgpt-sample-secret --from-literal=openai-api-key=$OPENAI_TOKEN -n k8sgpt-operator-system

After the K8sGPT object is created, the Operator deployed in the previous step launches a Pod based on its spec:

(base) [root@label-studio-k8s ~]# k -n k8sgpt-operator-system get pod -w
NAME                                                          READY   STATUS              RESTARTS   AGE
k8sgpt-local-ai-5ff75b9b8f-mfv8t                              0/1     ContainerCreating   0          19s

This Pod is the actual k8sgpt worker: it continuously scans the cluster, queries the AI for solutions, and stores them in Result objects.

3.3 Simulate a Failure

Now we can start verifying the setup.

k8sgpt automatically collects cluster information and generates diagnostic results, which you can inspect with:

(base) [root@label-studio-k8s ~]# k -n k8sgpt-operator-system get result
NAME                                       KIND      BACKEND   AGE
defaultyoloservice                         Service   localai   60s
kubesystemhamidcuvgpudevicepluginktlg7     Pod       localai   60s
kubesystemhamidevicepluginmonitor          Service   localai   60s
kubesystemkccsicontroller785cc89b7bbmdkh   Pod       localai   60s

Each Result object corresponds to one problem; if the cluster has no problems, no Result objects are generated.

So we can create a problem ourselves, for example by simulating an image pull failure.

Create a deployment with the following command; since the tag does not exist, the image pull is guaranteed to fail:

kubectl create deployment broken-image --image=nginx:invalid-tag

As expected, the Pod fails to start because the image cannot be pulled:

# kubectl get po
NAME                           READY   STATUS             RESTARTS   AGE
broken-image-8896f7cf4-qst86   0/1     ImagePullBackOff   0          4m33s

Let's see whether k8sgpt detects the problem:

# kubectl -n k8sgpt-operator-system get result
NAME                               KIND   BACKEND   AGE
defaultbrokenimage8896f7cf4qst86   Pod    localai   12s

It usually takes 30 seconds to 2 minutes (depending on cluster size) for k8sgpt to detect a fault and generate a Result.

A new Result has appeared, which looks right. The Result object's name is composed of the namespace and the object name (here, the Pod name).
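The naming suggests the operator simply strips the separators out of `<namespace>/<name>` to build a valid object name. A minimal Go sketch of that observed pattern (the function name and exact sanitization rule are our assumptions, not k8sgpt's actual code):

```go
package main

import (
	"fmt"
	"regexp"
)

// resultName mimics the observed Result naming: take the
// "namespace/name" string and drop every non-alphanumeric rune.
// This illustrates the observed pattern; it is not k8sgpt source.
func resultName(namespacedName string) string {
	return regexp.MustCompile(`[^a-z0-9]`).ReplaceAllString(namespacedName, "")
}

func main() {
	fmt.Println(resultName("default/broken-image-8896f7cf4-qst86"))
	// prints: defaultbrokenimage8896f7cf4qst86
}
```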

k8sgpt automatically sends the error to the AI and writes the response into the Result's details field:

# kubectl -n k8sgpt-operator-system get result defaultbrokenimage8896f7cf4vf47k -oyaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2025-06-26T04:23:06Z"
  generation: 1
  labels:
    k8sgpts.k8sgpt.ai/backend: localai
    k8sgpts.k8sgpt.ai/name: k8sgpt-local-ai
    k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
  name: defaultbrokenimage8896f7cf4vf47k
  namespace: k8sgpt-operator-system
  resourceVersion: "90019"
  uid: 1ab8b2ad-7398-4d6d-bd8b-e2d5beba70ae
spec:
  backend: localai
  details: "Error: Kubernetes cannot pull the container image because the tag \"invalid-tag\"
    doesn't exist in the Docker registry for nginx.  \nSolution:  \n1. Verify valid
    nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)  \n2. Edit
    your deployment:  \n```bash  \nkubectl edit deployment <deployment-name>  \n```
    \ \n3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest`
    or `nginx:1.25`)  \n4. Save and exit. Kubernetes will automatically retry pulling
    the new image.  \n\n*(273 characters)*"
  error:
  - text: Back-off pulling image "nginx:invalid-tag"
  kind: Pod
  name: default/broken-image-8896f7cf4-vf47k
  parentObject: ""
status: {}

Formatted, the content reads:

Error: Kubernetes cannot pull the container image because the tag "invalid-tag" doesn't exist in the Docker registry for nginx.  

Solution:  
1. Verify valid nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)  
2. Edit your deployment:  
```bash  
kubectl edit deployment <deployment-name>  
```  
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest` or `nginx:1.25`)  
4. Save and exit. Kubernetes will automatically retry pulling the new image.  

The details field contains the concrete problem together with its solution.

Here the AI tells us to edit the deployment and change the image tag to a valid value.

3.4 Fix the Failure

Apply the fix suggested by the AI:

kubectl set image deployment/broken-image nginx=nginx:1.25

After updating the image, the Pod starts normally:

# kubectl get po
NAME                           READY   STATUS    RESTARTS   AGE
broken-image-564ff59bd-b4fdm   1/1     Running   0          20s

Once fixed, the fault (the Result object) disappears:

# kubectl -n k8sgpt-operator-system get result
No resources found in k8sgpt-operator-system namespace.

4. How It Works

How does k8sgpt work under the hood?

4.1 Full Flow

The flow consists of these steps:

  • 1) Initialize the Analysis: configure the Kubernetes client, the AI backend, and so on

  • 2) Run the Analysis

    • Run custom analyzers (if configured)

    • Run the built-in analyzers

  • 3) Ask the AI for a solution (when explain is set)

  • 4) Return the diagnostic report in structured form

The full code is as follows:

analyze.go#L11-L67

func (h *Handler) Analyze(ctx context.Context, i *schemav1.AnalyzeRequest) (
    *schemav1.AnalyzeResponse,
    error,
) {
    if i.Output == "" {
       i.Output = "json"
    }

    if int(i.MaxConcurrency) == 0 {
       i.MaxConcurrency = 10
    }
   
    // Initialize the Analysis
    config, err := analysis.NewAnalysis(
       i.Backend,
       i.Language,
       i.Filters,
       i.Namespace,
       i.LabelSelector,
       i.Nocache,
       i.Explain,
       int(i.MaxConcurrency),
       false,      // Kubernetes Doc disabled in server mode
       false,      // Interactive mode disabled in server mode
       []string{}, //TODO: add custom http headers in server mode
       false,      // with stats disable
    )
    if err != nil {
       return &schemav1.AnalyzeResponse{}, err
    }
    config.Context = ctx // Replace context for correct timeouts.
    defer config.Close()

    // Run custom analyzers (if configured)
    if config.CustomAnalyzersAreAvailable() {
       config.RunCustomAnalysis()
    }
    // Run the built-in analyzers
    config.RunAnalysis()

    // Ask the AI for solutions (when explain is set)
    if i.Explain {
       err := config.GetAIResults(i.Output, i.Anonymize)
       if err != nil {
          return &schemav1.AnalyzeResponse{}, err
       }
    }

    // Return the diagnostic report in structured form
    out, err := config.PrintOutput(i.Output)
    if err != nil {
       return &schemav1.AnalyzeResponse{}, err
    }
    var obj schemav1.AnalyzeResponse

    err = json.Unmarshal(out, &obj)
    if err != nil {
       return &schemav1.AnalyzeResponse{}, err
    }

    return &obj, nil
}

4.2 How an Analyzer Works

k8sgpt ships with multiple built-in Analyzers.

Taking the Pod Analyzer as an example, the flow is straightforward:

  • 1) List Pods with the k8s client

  • 2) Walk through each Pod's status and collect the error information

  • 3) Structure the errors into the Result format and return them

The full code is as follows:

func (PodAnalyzer) Analyze(a common.Analyzer) ([]common.Result, error) {

    kind := "Pod"
   
    AnalyzerErrorsMetric.DeletePartialMatch(map[string]string{
       "analyzer_name": kind,
    })

    // search all namespaces for pods that are not running
    list, err := a.Client.GetClient().CoreV1().Pods(a.Namespace).List(a.Context, metav1.ListOptions{
       LabelSelector: a.LabelSelector,
    })
    if err != nil {
       return nil, err
    }
    var preAnalysis = map[string]common.PreAnalysis{}

    for _, pod := range list.Items {
       var failures []common.Failure

       // Check for pending pods
       if pod.Status.Phase == "Pending" {
          // Check through container status to check for crashes
          for _, containerStatus := range pod.Status.Conditions {
             if containerStatus.Type == v1.PodScheduled && containerStatus.Reason == "Unschedulable" {
                if containerStatus.Message != "" {
                   failures = append(failures, common.Failure{
                      Text:      containerStatus.Message,
                      Sensitive: []common.Sensitive{},
                   })
                }
             }
          }
       }

       // Check for errors in the init containers.
       failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.InitContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)

       // Check for errors in containers.
       failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.ContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)

       if len(failures) > 0 {
          preAnalysis[fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)] = common.PreAnalysis{
             Pod:            pod,
             FailureDetails: failures,
          }
          AnalyzerErrorsMetric.WithLabelValues(kind, pod.Name, pod.Namespace).Set(float64(len(failures)))
       }
    }

    for key, value := range preAnalysis {
       var currentAnalysis = common.Result{
          Kind:  kind,
          Name:  key,
          Error: value.FailureDetails,
       }

       parent, found := util.GetParent(a.Client, value.Pod.ObjectMeta)
       if found {
          currentAnalysis.ParentObject = parent
       }
       a.Results = append(a.Results, currentAnalysis)
    }

    return a.Results, nil
}

4.3 Prompt Template and AI Interaction

In the previous step, k8sgpt collected the cluster's error information, for example:

Back-off pulling image "nginx:invalid-tag"

Next, this error message is sent to the AI to generate a solution. The prompt template k8sgpt uses is shown below:

prompts.go#L4-L16

The prompt template for handling Kubernetes errors:

default_prompt = `Simplify the following Kubernetes error message delimited by triple dashes written in --- %s --- language; --- %s ---.
Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
Error: {Explain error here}
Solution: {Step by step solution here}
`
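Filling that template is a plain fmt.Sprintf: the first %s receives the output language and the second the raw error text. A minimal sketch (the variable and function names are ours; see prompts.go for the real constant):

```go
package main

import "fmt"

// defaultPrompt mirrors the template shown above (see prompts.go#L4-L16).
const defaultPrompt = `Simplify the following Kubernetes error message delimited by triple dashes written in --- %s --- language; --- %s ---.
Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
Error: {Explain error here}
Solution: {Step by step solution here}
`

// buildPrompt fills the template with the target language and error text.
func buildPrompt(language, errText string) string {
	return fmt.Sprintf(defaultPrompt, language, errText)
}

func main() {
	fmt.Print(buildPrompt("English", `Back-off pulling image "nginx:invalid-tag"`))
}
```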

This is exactly why the details field of the Result object we saw earlier comes out in this format:

Error: Kubernetes cannot pull the container image because the tag "invalid-tag" doesn't exist in the Docker registry for nginx.  

Solution:  
1. Verify valid nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)  
2. Edit your deployment:  
```bash  
kubectl edit deployment <deployment-name>  
```  
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest` or `nginx:1.25`)  
4. Save and exit. Kubernetes will automatically retry pulling the new image.  

5. Summary

As an AI-based Kubernetes diagnostic tool, K8sGPT's core value lies in automated cluster fault detection and AI-driven solution generation, which significantly lowers the bar for K8s troubleshooting.

  • Two deployment modes for different scenarios: CLI mode suits ad hoc diagnosis (manually triggered cluster scans), while Operator mode suits continuous monitoring (combining with tools like Prometheus for real-time alerting and automated fix suggestions).

  • A clear and efficient workflow: built-in Analyzers scan cluster resources (Pods, Deployments, etc.) for abnormal states, and an LLM (such as DeepSeek or OpenAI) produces structured solutions in a uniform format (error description plus step-by-step fix), making them easy to act on.

  • High flexibility: multiple AI backends are supported (local models, cloud-hosted models, and more)

Note: for complex failures (such as conflicting network policies or persistent storage anomalies), the AI-generated solutions may need manual verification; do not rely entirely on the automated results.