// pkg/kubelet/kubelet.go#L1397
//
// StartGarbageCollection launches the kubelet's two background GC loops:
// one for containers and (unless disabled) one for images. Both run until
// process exit (wait.NeverStop).
func (kl *Kubelet) StartGarbageCollection() {
	// Tracks whether the previous container-GC pass failed, so the next
	// success is logged at a higher verbosity (V(1) instead of V(4)).
	loggedContainerGCFailure := false
	go wait.Until(func() {
		ctx := context.Background()
		if err := kl.containerGC.GarbageCollect(ctx); err != nil {
			klog.ErrorS(err, "Container garbage collection failed")
			kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ContainerGCFailed, err.Error())
			loggedContainerGCFailure = true
		} else {
			var vLevel klog.Level = 4
			if loggedContainerGCFailure {
				// Surface the recovery more visibly after a failure.
				vLevel = 1
				loggedContainerGCFailure = false
			}
			klog.V(vLevel).InfoS("Container garbage collection succeeded")
		}
	}, ContainerGCPeriod, wait.NeverStop) // container GC runs once every 1 minute

	// Special case: a high threshold of 100 disables image GC entirely.
	// when the high threshold is set to 100, stub the image GC manager
	if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
		klog.V(2).InfoS("ImageGCHighThresholdPercent is set 100, Disable image GC")
		return
	}

	// Same failure/recovery logging scheme as above, for image GC.
	prevImageGCFailed := false
	go wait.Until(func() {
		ctx := context.Background()
		if err := kl.imageManager.GarbageCollect(ctx); err != nil {
			if prevImageGCFailed {
				klog.ErrorS(err, "Image garbage collection failed multiple times in a row")
				// Only create an event for repeated failures
				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
			} else {
				klog.ErrorS(err, "Image garbage collection failed once. Stats initialization may not have completed yet")
			}
			prevImageGCFailed = true
		} else {
			var vLevel klog.Level = 4
			if prevImageGCFailed {
				vLevel = 1
				prevImageGCFailed = false
			}
			klog.V(vLevel).InfoS("Image garbage collection succeeded")
		}
	}, ImageGCPeriod, wait.NeverStop) // image GC runs once every 5 minutes
}
这两个 goroutine 的执行间隔就是前面提到的 1 分钟和 5 分钟,对应的常量定义如下:
// ContainerGCPeriod is the period for performing container garbage collection.
ContainerGCPeriod = time.Minute
// ImageGCPeriod is the period for performing image garbage collection.
ImageGCPeriod = 5 * time.Minute
驱逐逻辑中强制触发
实际上除了前面的两个 goroutine 之外还有驱逐逻辑里也会直接调用相关垃圾回收功能,在满足驱逐 Pod 的条件之后,Kubelet 也会直接调用垃圾回收功能,尝试清理资源,以减少对 Pod 的驱逐,从而提升稳定性。
// Start wires up optional kernel memcg notifiers and then runs the eviction
// manager's monitoring loop in a goroutine for the lifetime of the process.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	// Invoked by a memory threshold notifier: triggers an immediate
	// synchronize pass instead of waiting for the next polling interval.
	thresholdHandler := func(message string) {
		klog.InfoS(message)
		m.synchronize(diskInfoProvider, podFunc)
	}
	if m.config.KernelMemcgNotification {
		// Register a notifier for each memory-related threshold; creation
		// failures are logged and skipped (best effort).
		for _, threshold := range m.config.Thresholds {
			if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
				notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
				if err != nil {
					klog.InfoS("Eviction manager: failed to create memory threshold notifier", "err", err)
				} else {
					go notifier.Start()
					m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
				}
			}
		}
	}
	// start the eviction manager monitoring
	go func() {
		for {
			// If pods were evicted, wait for their cleanup before the next
			// pass; otherwise sleep for the monitoring interval.
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.InfoS("Eviction manager: pods evicted, waiting for pod to be cleaned up", "pods", klog.KObjSlice(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()
}
其中启动了一个 goroutine 一直在调用 synchronize 方法,判断是否有 Pod 需要驱逐,synchronize 方法如下:
// synchronize (excerpt) decides whether node-level reclaim is sufficient, and
// if not, evicts at most one pod per pass. Identifiers such as ctx,
// thresholdToReclaim, activePods, statsFunc, thresholds and observations are
// set up in the omitted portion of the function.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	// ... other logic omitted ...
	// Run garbage collection for the signal in thresholdToReclaim.Signal; if
	// the remaining usage drops below the threshold afterwards, return nil —
	// no pod needs to be evicted in this pass.
	if m.reclaimNodeLevelResources(ctx, thresholdToReclaim.Signal, resourceToReclaim) {
		klog.InfoS("Eviction manager: able to reduce resource pressure without evicting pods.", "resourceName", resourceToReclaim)
		return nil
	}
	// Otherwise pick one pod to evict; for stability only a single pod is
	// evicted per pass.
	for i := range activePods {
		pod := activePods[i]
		gracePeriodOverride := int64(0)
		// Soft thresholds allow the configured max grace period; hard
		// thresholds force immediate termination.
		if !isHardEvictionThreshold(thresholdToReclaim) {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc, thresholds, observations)
		var condition *v1.PodCondition
		if utilfeature.DefaultFeatureGate.Enabled(features.PodDisruptionConditions) {
			condition = &v1.PodCondition{
				Type:    v1.DisruptionTarget,
				Status:  v1.ConditionTrue,
				Reason:  v1.PodReasonTerminationByKubelet,
				Message: message,
			}
		}
		if m.evictPod(pod, gracePeriodOverride, message, annotations, condition) {
			metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
			return []*v1.Pod{pod}
		}
	}
}
在eviction功能的主循环里就会根据对应的驱逐信号,执行对应方法,具体对应关系如下:// buildSignalToNodeReclaimFuncs returns reclaim functions associated with resources.funcbuildSignalToNodeReclaimFuncs(imageGCImageGC,containerGCContainerGC,withImageFsbool)map[evictionapi.Signal]nodeReclaimFuncs{signalToReclaimFunc:=map[evictionapi.Signal]nodeReclaimFuncs{}// usage of an imagefs is optionalifwithImageFs{// with an imagefs, nodefs pressure should just delete logssignalToReclaimFunc[evictionapi.SignalNodeFsAvailable]=nodeReclaimFuncs{}signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree]=nodeReclaimFuncs{}// with an imagefs, imagefs pressure should delete unused imagessignalToReclaimFunc[evictionapi.SignalImageFsAvailable]=nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers,imageGC.DeleteUnusedImages}signalToReclaimFunc[evictionapi.SignalImageFsInodesFree]=nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers,imageGC.DeleteUnusedImages}}else{// without an imagefs, nodefs pressure should delete logs, and unused images// since imagefs and nodefs share a common device, they share common reclaim functionssignalToReclaimFunc[evictionapi.SignalNodeFsAvailable]=nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers,imageGC.DeleteUnusedImages}signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree]=nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers,imageGC.DeleteUnusedImages}signalToReclaimFunc[evictionapi.SignalImageFsAvailable]=nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers,imageGC.DeleteUnusedImages}signalToReclaimFunc[evictionapi.SignalImageFsInodesFree]=nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers,imageGC.DeleteUnusedImages}}returnsignalToReclaimFunc}
可以看到,对于不同信号,做的操作无非是执行容器垃圾回收或者镜像垃圾回收两个动作。
3. 容器垃圾回收
整个回收流程分为三个部分:
1)清理可以驱逐的容器
2)清理可以驱逐的 sandbox
3)清理所有 pod 的日志目录
K8s 中一个 Pod 可以有多个 container,同时除了业务 container 之外还会存在一个 sandbox container,因此这里清理的时候也是按照这个顺序在处理。
先清理业务 container,如果 Pod 里都没有业务 container 了,那么这个 sandbox container 也可以清理了,等 sandbox container 都被清理了,那这个 Pod 就算是被清理干净了,最后就是清理 Pod 对应的日志目录。
func(cgc*containerGC)GarbageCollect(ctxcontext.Context,gcPolicykubecontainer.GCPolicy,allSourcesReadybool,evictNonDeletedPodsbool)error{ctx,otelSpan:=cgc.tracer.Start(ctx,"Containers/GarbageCollect")deferotelSpan.End()errors:=[]error{}// Remove evictable containersiferr:=cgc.evictContainers(ctx,gcPolicy,allSourcesReady,evictNonDeletedPods);err!=nil{errors=append(errors,err)}// Remove sandboxes with zero containersiferr:=cgc.evictSandboxes(ctx,evictNonDeletedPods);err!=nil{errors=append(errors,err)}// Remove pod sandbox log directoryiferr:=cgc.evictPodLogsDirectories(ctx,allSourcesReady);err!=nil{errors=append(errors,err)}returnutilerrors.NewAggregate(errors)}
// pkg/kubelet/images/image_gc_manager.go#290
//
// GarbageCollect checks image-filesystem usage against the configured high
// threshold and, when exceeded, frees images until usage drops to the low
// threshold.
func (im *realImageGCManager) GarbageCollect(ctx context.Context) error {
	ctx, otelSpan := im.tracer.Start(ctx, "Images/GarbageCollect")
	defer otelSpan.End()
	// Fetch image-filesystem stats.
	fsStats, err := im.statsProvider.ImageFsStats(ctx)
	if err != nil {
		return err
	}
	// Extract the disk's capacity and available bytes.
	var capacity, available int64
	if fsStats.CapacityBytes != nil {
		capacity = int64(*fsStats.CapacityBytes)
	}
	if fsStats.AvailableBytes != nil {
		available = int64(*fsStats.AvailableBytes)
	}
	// Clamp inconsistent stats (available can never truly exceed capacity).
	if available > capacity {
		klog.InfoS("Availability is larger than capacity", "available", available, "capacity", capacity)
		available = capacity
	}
	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
		return err
	}
	// Compute disk usage percentage; GC images only above the high threshold.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		// Amount to free so that usage falls back to the low threshold.
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		klog.InfoS("Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold", "usage", usagePercent, "highThreshold", im.policy.HighThresholdPercent, "amountToFree", amountToFree, "lowThreshold", im.policy.LowThresholdPercent)
		// The actual image garbage collection happens here.
		freed, err := im.freeSpace(ctx, amountToFree, time.Now())
		if err != nil {
			return err
		}
		// If less space than requested could be freed, record an event and
		// return an error.
		if freed < amountToFree {
			err := fmt.Errorf("Failed to garbage collect required amount of images. Attempted to free %d bytes, but only found %d bytes eligible to free.", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}
	return nil
}
// Start (excerpt, truncated) launches the periodic image-detection loop that
// records when each image was first seen; detection timestamps feed the
// MinAge eligibility check during GC.
func (im *realImageGCManager) Start() {
	ctx := context.Background()
	go wait.Until(func() {
		// Initial detection make detected time "unknown" in the past.
		var ts time.Time
		if im.initialized {
			ts = time.Now()
		}
		// in first time, ts is zero.
		_, err := im.detectImages(ctx, ts)
		if err != nil {
			klog.InfoS("Failed to monitor images", "err", err)
		} else {
			im.initialized = true
		}
	}, 5*time.Minute, wait.NeverStop)
	//... (remainder of Start omitted in this excerpt)
// Excerpt from the freeSpace loop: images younger than the configured MinAge
// are skipped and kept.
if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
	klog.V(5).InfoS("Image ID's age is less than the policy's minAge, not eligible for garbage collection", "imageID", image.id, "age", freeTime.Sub(image.firstDetected), "minAge", im.policy.MinAge)
	continue
}