$ k8sgpt auth default -p localai
Default provider set to localai
2.3 Scanning the cluster
Run k8sgpt analyze to scan the cluster for problems:
# k8sgpt analyze
AI Provider: AI not used; --explain not set

0: Deployment default/broken-image()
- Error: Deployment default/broken-image has 1 replicas but 0 are available with status running

1: Pod default/broken-image-8896f7cf4-vf47k(Deployment/broken-image)
- Error: Back-off pulling image "nginx:invalid-tag"

2: ConfigMap calico-apiserver/kube-root-ca.crt()
- Error: ConfigMap kube-root-ca.crt is not used by any pods in the namespace
...
You can also scan a remote cluster by specifying a kubeconfig:
k8sgpt analyze --kubeconfig mykubeconfig
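By default no AI is used and the command only scans the cluster for problems. Add the --explain flag to have k8sgpt send the findings to the AI and return a suggested solution for each one: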
[root@kc-master ~]# k8sgpt analyze --explain
AI Provider: localai

0: Deployment default/broken-image()
- Error: Deployment default/broken-image has 1 replicas but 0 are available with status running
Error: The deployment "broken-image" has 1 pod defined, but 0 pods are running successfully.
Solution:
1. Check pod status: `kubectl get pods -n default`
2. View pod logs: `kubectl logs <pod-name> -n default`
3. Inspect errors: `kubectl describe pod <pod-name> -n default`
4. Fix image/configuration issue (e.g., correct image name in deployment YAML)
5. Apply changes: `kubectl apply -f deployment.yaml`
(275 characters)

1: Pod default/broken-image-8896f7cf4-vf47k(Deployment/broken-image)
- Error: Back-off pulling image "nginx:invalid-tag"
Error: Kubernetes cannot pull the specified container image because the tag "invalid-tag" doesn't exist in the Docker registry for "nginx".
Solution:
1. Verify valid nginx tags on Docker Hub or via `docker search nginx --limit 5`.
2. Edit your deployment: `kubectl edit deployment <deployment-name>`.
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `image: nginx:latest` or `image: nginx:alpine`).
4. Save changes to restart pods automatically.
5. Confirm fix: `kubectl get pods` should show running status.
(base) [root@label-studio-k8s ~]# k -n k8sgpt-operator-system get pod -w
NAME                               READY   STATUS              RESTARTS   AGE
k8sgpt-local-ai-5ff75b9b8f-mfv8t   0/1     ContainerCreating   0          19s
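This Pod is the one that actually talks to the AI. It runs an HTTP service for the Controller: after the Controller scans the cluster for faults, it sends them to this service over HTTP, the service asks the AI for a solution and returns it to the Controller, and the Controller finally stores it in a Result object.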
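The flow described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the stand-in service, the plain-text request body, and the names `explain` and `runDemo` are hypothetical and do not reflect the operator's real wire format or code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// explain posts one finding to the AI-facing service and returns its answer.
func explain(serviceURL, finding string) (string, error) {
	resp, err := http.Post(serviceURL, "text/plain", strings.NewReader(finding))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// runDemo plays both roles: a stand-in AI service and a controller loop
// that stores each answer under the name of the failing resource.
func runDemo() (map[string]string, error) {
	// Stand-in for the k8sgpt-local-ai Pod; a real one would query the model.
	svc := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		finding, _ := io.ReadAll(r.Body)
		fmt.Fprintf(w, "Error: %s\nSolution: (model output would go here)", finding)
	}))
	defer svc.Close()

	// Findings the controller would have collected by scanning the cluster.
	findings := map[string]string{
		"default/broken-image-8896f7cf4-vf47k": `Back-off pulling image "nginx:invalid-tag"`,
	}

	results := make(map[string]string) // stand-in for Result objects
	for name, finding := range findings {
		details, err := explain(svc.URL, finding)
		if err != nil {
			return nil, err
		}
		results[name] = details
	}
	return results, nil
}

func main() {
	results, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	for name, details := range results {
		fmt.Printf("%s\n%s\n", name, details)
	}
}
```

The point of the sketch is the division of labour: the scanning side never talks to the model directly, it only hands findings to the service and persists whatever comes back.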
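3.3 Simulating a fault
Now we can start verifying the setup.
k8sgpt automatically collects cluster information and generates diagnostic results, which can be listed with the following command: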
(base) [root@label-studio-k8s ~]# k -n k8sgpt-operator-system get result
NAME                                       KIND      BACKEND   AGE
defaultyoloservice                         Service   localai   60s
kubesystemhamidcuvgpudevicepluginktlg7     Pod       localai   60s
kubesystemhamidevicepluginmonitor          Service   localai   60s
kubesystemkccsicontroller785cc89b7bbmdkh   Pod       localai   60s
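The Result names appear to be the resource's namespace/name with the separators removed: the Result inspected next is named defaultbrokenimage8896f7cf4vf47k and records the Pod default/broken-image-8896f7cf4-vf47k. The sketch below reproduces that mapping. It is inferred from the output in this article only; `resultName` is a hypothetical helper, not a function taken from the operator.

```go
package main

import (
	"fmt"
	"strings"
)

// resultName is a hypothetical helper: it drops "/" and "-" from a
// "namespace/name" identifier, which matches the names seen in the listing.
func resultName(resource string) string {
	return strings.NewReplacer("/", "", "-", "").Replace(resource)
}

func main() {
	fmt.Println(resultName("default/broken-image-8896f7cf4-vf47k"))
	// prints: defaultbrokenimage8896f7cf4vf47k
}
```

Knowing the pattern makes it easier to find the Result that belongs to a particular failing resource.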
# kubectl -n k8sgpt-operator-system get result defaultbrokenimage8896f7cf4vf47k -oyaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2025-06-26T04:23:06Z"
  generation: 1
  labels:
    k8sgpts.k8sgpt.ai/backend: localai
    k8sgpts.k8sgpt.ai/name: k8sgpt-local-ai
    k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
  name: defaultbrokenimage8896f7cf4vf47k
  namespace: k8sgpt-operator-system
  resourceVersion: "90019"
  uid: 1ab8b2ad-7398-4d6d-bd8b-e2d5beba70ae
spec:
  backend: localai
  details: "Error: Kubernetes cannot pull the container image because the tag \"invalid-tag\"
    doesn't exist in the Docker registry for nginx. \nSolution: \n1. Verify valid
    nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx) \n2. Edit
    your deployment: \n```bash \nkubectl edit deployment <deployment-name> \n```
    \ \n3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest`
    or `nginx:1.25`) \n4. Save and exit. Kubernetes will automatically retry pulling
    the new image. \n\n*(273 characters)*"
  error:
  - text: Back-off pulling image "nginx:invalid-tag"
  kind: Pod
  name: default/broken-image-8896f7cf4-vf47k
  parentObject: ""
status: {}
After formatting, the details field reads as follows:
Error: Kubernetes cannot pull the container image because the tag "invalid-tag" doesn't exist in the Docker registry for nginx.
Solution:
1. Verify valid nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)
2. Edit your deployment:
```bash
kubectl edit deployment <deployment-name>
```
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest` or `nginx:1.25`)
4. Save and exit. Kubernetes will automatically retry pulling the new image.
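The details field contains both the specific problem and the proposed solution.
Here the AI tells us to edit the deployment and change the image tag to a valid value.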
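Because details is a single string, it can help to split it into the problem and the fix when post-processing Results. The following is a minimal sketch that assumes the text keeps the "Error:" and "Solution:" markers seen above. `splitDetails` is a hypothetical helper, and the sample string is a shortened stand-in for real model output.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDetails separates a details string into its error explanation and
// its solution text, using the "Error:" and "Solution:" markers.
func splitDetails(details string) (explanation, solution string) {
	before, after, found := strings.Cut(details, "Solution:")
	explanation = strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(before), "Error:"))
	if found {
		solution = strings.TrimSpace(after)
	}
	return explanation, solution
}

func main() {
	// Shortened sample in the same shape as the details field.
	details := "Error: Kubernetes cannot pull the container image. \nSolution: \n1. Verify valid nginx tags \n2. Edit your deployment"
	explanation, solution := splitDetails(details)
	fmt.Printf("problem: %s\nfix:\n%s\n", explanation, solution)
}
```

Model output is free text, so a consumer should tolerate a missing "Solution:" marker; in that case the sketch returns an empty solution.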
3.4 Fixing the fault
Apply the fix suggested by the AI:
kubectl set image deployment/broken-image nginx=nginx:1.25
After the image is changed, the Pod starts normally:
# kubectl get po
NAME                           READY   STATUS    RESTARTS   AGE
broken-image-564ff59bd-b4fdm   1/1     Running   0          20s
Once the fault is fixed, the corresponding Result object disappears:
# kubectl -n k8sgpt-operator-system get result
No resources found in k8sgpt-operator-system namespace.
func (h *Handler) Analyze(ctx context.Context, i *schemav1.AnalyzeRequest) (
	*schemav1.AnalyzeResponse,
	error,
) {
	if i.Output == "" {
		i.Output = "json"
	}

	if int(i.MaxConcurrency) == 0 {
		i.MaxConcurrency = 10
	}

	// Initialize the Analysis
	config, err := analysis.NewAnalysis(
		i.Backend,
		i.Language,
		i.Filters,
		i.Namespace,
		i.LabelSelector,
		i.Nocache,
		i.Explain,
		int(i.MaxConcurrency),
		false,      // Kubernetes Doc disabled in server mode
		false,      // Interactive mode disabled in server mode
		[]string{}, //TODO: add custom http headers in server mode
		false,      // with stats disable
	)
	if err != nil {
		return &schemav1.AnalyzeResponse{}, err
	}
	config.Context = ctx // Replace context for correct timeouts.
	defer config.Close()

	// Run the custom analyzers
	if config.CustomAnalyzersAreAvailable() {
		config.RunCustomAnalysis()
	}
	// Run the built-in analyzers
	config.RunAnalysis()

	// Ask the AI to generate a remediation plan (only if explain is set)
	if i.Explain {
		err := config.GetAIResults(i.Output, i.Anonymize)
		if err != nil {
			return &schemav1.AnalyzeResponse{}, err
		}
	}

	// Return the diagnostic report in structured form
	out, err := config.PrintOutput(i.Output)
	if err != nil {
		return &schemav1.AnalyzeResponse{}, err
	}
	var obj schemav1.AnalyzeResponse

	err = json.Unmarshal(out, &obj)
	if err != nil {
		return &schemav1.AnalyzeResponse{}, err
	}

	return &obj, nil
}
func (PodAnalyzer) Analyze(a common.Analyzer) ([]common.Result, error) {

	kind := "Pod"

	AnalyzerErrorsMetric.DeletePartialMatch(map[string]string{
		"analyzer_name": kind,
	})

	// search all namespaces for pods that are not running
	list, err := a.Client.GetClient().CoreV1().Pods(a.Namespace).List(a.Context, metav1.ListOptions{
		LabelSelector: a.LabelSelector,
	})
	if err != nil {
		return nil, err
	}
	var preAnalysis = map[string]common.PreAnalysis{}

	for _, pod := range list.Items {
		var failures []common.Failure

		// Check for pending pods
		if pod.Status.Phase == "Pending" {
			// Check through container status to check for crashes
			for _, containerStatus := range pod.Status.Conditions {
				if containerStatus.Type == v1.PodScheduled && containerStatus.Reason == "Unschedulable" {
					if containerStatus.Message != "" {
						failures = append(failures, common.Failure{
							Text:      containerStatus.Message,
							Sensitive: []common.Sensitive{},
						})
					}
				}
			}
		}

		// Check for errors in the init containers.
		failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.InitContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)

		// Check for errors in containers.
		failures = append(failures, analyzeContainerStatusFailures(a, pod.Status.ContainerStatuses, pod.Name, pod.Namespace, string(pod.Status.Phase))...)

		if len(failures) > 0 {
			preAnalysis[fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)] = common.PreAnalysis{
				Pod:            pod,
				FailureDetails: failures,
			}
			AnalyzerErrorsMetric.WithLabelValues(kind, pod.Name, pod.Namespace).Set(float64(len(failures)))
		}
	}

	for key, value := range preAnalysis {
		var currentAnalysis = common.Result{
			Kind:  kind,
			Name:  key,
			Error: value.FailureDetails,
		}

		parent, found := util.GetParent(a.Client, value.Pod.ObjectMeta)
		if found {
			currentAnalysis.ParentObject = parent
		}
		a.Results = append(a.Results, currentAnalysis)
	}

	return a.Results, nil
}
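To try the Pending branch of this analyzer without a cluster, the check can be reduced to plain Go types. This is a simplified sketch rather than k8sgpt code: the `pod` and `condition` structs and `schedulingFailures` are hypothetical stand-ins for the Kubernetes API types and the analyzer logic, and the container status checks are left out.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes types the analyzer reads.
type condition struct {
	Type    string
	Reason  string
	Message string
}

type pod struct {
	Name       string
	Phase      string
	Conditions []condition
}

// schedulingFailures mirrors the Pending branch: it collects the message of
// every PodScheduled condition whose reason is Unschedulable.
func schedulingFailures(p pod) []string {
	var failures []string
	if p.Phase != "Pending" {
		return failures
	}
	for _, c := range p.Conditions {
		if c.Type == "PodScheduled" && c.Reason == "Unschedulable" && c.Message != "" {
			failures = append(failures, c.Message)
		}
	}
	return failures
}

func main() {
	p := pod{
		Name:  "demo",
		Phase: "Pending",
		Conditions: []condition{
			{Type: "PodScheduled", Reason: "Unschedulable", Message: "0/3 nodes are available"},
		},
	}
	fmt.Println(schedulingFailures(p))
}
```

Note that the analyzer only gathers raw failure messages here; no AI is involved until those messages are later passed to the backend for an explanation.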
default_prompt = `Simplify the following Kubernetes error message delimited by triple dashes written in --- %s --- language; --- %s ---.
Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
Error: {Explain error here}
Solution: {Step by step solution here}
`
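The template has two %s placeholders: the first for the output language and the second for the error text. A minimal sketch of how it could be filled in with fmt.Sprintf follows; `buildPrompt` is a hypothetical helper, and "english" is just an example value for the language.

```go
package main

import "fmt"

// defaultPrompt reproduces the template quoted above.
const defaultPrompt = `Simplify the following Kubernetes error message delimited by triple dashes written in --- %s --- language; --- %s ---.
Provide the most possible solution in a step by step style in no more than 280 characters. Write the output in the following format:
Error: {Explain error here}
Solution: {Step by step solution here}
`

// buildPrompt fills the two %s placeholders: language first, error text second.
func buildPrompt(language, errText string) string {
	return fmt.Sprintf(defaultPrompt, language, errText)
}

func main() {
	fmt.Print(buildPrompt("english", `Back-off pulling image "nginx:invalid-tag"`))
}
```

The 280-character limit in the prompt is presumably why the model's answers end with notes such as "(273 characters)", and the required "Error:" / "Solution:" layout is what shows up in the details field of each Result.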
[root@kc-master ~]# kubectl -n k8sgpt-operator-system get result defaultbrokenimage8896f7cf4vf47k -oyaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2025-06-26T04:23:06Z"
  generation: 1
  labels:
    k8sgpts.k8sgpt.ai/backend: localai
    k8sgpts.k8sgpt.ai/name: k8sgpt-local-ai
    k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
  name: defaultbrokenimage8896f7cf4vf47k
  namespace: k8sgpt-operator-system
  resourceVersion: "90019"
  uid: 1ab8b2ad-7398-4d6d-bd8b-e2d5beba70ae
spec:
  backend: localai
  details: "Error: Kubernetes cannot pull the container image because the tag \"invalid-tag\"
    doesn't exist in the Docker registry for nginx. \nSolution: \n1. Verify valid
    nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx) \n2. Edit
    your deployment: \n```bash \nkubectl edit deployment <deployment-name> \n```
    \ \n3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest`
    or `nginx:1.25`) \n4. Save and exit. Kubernetes will automatically retry pulling
    the new image. \n\n*(273 characters)*"
  error:
  - text: Back-off pulling image "nginx:invalid-tag"
  kind: Pod
  name: default/broken-image-8896f7cf4-vf47k
  parentObject: ""
status: {}
After formatting:
Error: Kubernetes cannot pull the container image because the tag "invalid-tag" doesn't exist in the Docker registry for nginx.
Solution:
1. Verify valid nginx tags at [hub.docker.com/_/nginx](https://hub.docker.com/_/nginx)
2. Edit your deployment:
```bash
kubectl edit deployment <deployment-name>
```
3. Replace `image: nginx:invalid-tag` with a valid tag (e.g., `nginx:latest` or `nginx:1.25`)
4. Save and exit. Kubernetes will automatically retry pulling the new image.
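5. Summary
K8sGPT is an AI-based diagnostic tool for Kubernetes. Its core value lies in automated detection of cluster faults and AI-generated remediation steps, which substantially lowers the technical barrier to troubleshooting K8s problems.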