LabelStudio+YOLO in Practice: A Complete Guide from Data Annotation to Model Training

https://img.lixueduan.com/ai/cover/labelstudio-yolo.png

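The previous post covered LabelStudio's smart labeling (pre-annotation), which connects an ML backend so that annotation happens automatically. This post records how to annotate data in LabelStudio, export the dataset, and use it to train a model. By the end of this tutorial you will know how to convert data annotated in LabelStudio into YOLO format and train a model on it.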

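The process breaks down into the following steps:

  • 1) Annotate the data in LabelStudio

  • 2) Export the results once annotation is complete

  • 3) Reformat the exported results

  • 4) Train a model on the resulting dataset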

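1. Annotating Data in LabelStudio

For this part, simply follow the two earlier posts, "LabelStudio: A First Look at the Open-Source Multimodal Data Annotation Tool" and "Hands-Free! LabelStudio Smart Labeling in Practice", to get the data annotated.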

2. Exporting Data from LabelStudio

2.1 Exporting from the Label Studio UI

Once annotation is complete, export the results.

https://img.lixueduan.com/ai/labelstudio/quickstart/ls-export-yolo.png

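Choose the YOLO format here; the export produces a zip archive.

Note that the formats offered at export time vary with the Labeling Interface configured for the project.

Officially supported formats currently include JSON, CSV, TSV, COCO, and YOLO. In my tests, however, exporting YOLO directly from the UI did not work well: the archive does not contain the original image files, so it cannot be used as is, and a script is still needed to fetch the images and match them to the annotations.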
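For orientation, here is a minimal sketch of the coordinate conversion that sits behind a YOLO export. It assumes an unrotated Label Studio rectangle, whose x, y, width and height are stored as percentages of the image size with (x, y) at the top-left corner, and turns it into a YOLO label line (class index, then box centre and size normalized to 0..1). The function name ls_rect_to_yolo is my own, not part of any library, and the real converter handles more cases than this.

```python
def ls_rect_to_yolo(class_id, x, y, width, height):
    """Convert a Label Studio rectangle to one YOLO label line.

    Inputs are percentages (0..100) with (x, y) at the top-left corner;
    the output uses the box centre and size, normalized to 0..1.
    Rotation is ignored in this sketch.
    """
    cx = (x + width / 2) / 100.0   # centre x, normalized
    cy = (y + height / 2) / 100.0  # centre y, normalized
    w = width / 100.0
    h = height / 100.0
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    # A box starting at (10%, 20%) that is 30% wide and 40% tall
    print(ls_rect_to_yolo(0, x=10.0, y=20.0, width=30.0, height=40.0))
    # -> 0 0.250000 0.400000 0.300000 0.400000
```

Each line of a file under labels/ has this shape, one line per annotated box.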

2.2 Exporting with the LabelStudio SDK

converter.py

The full contents of converter.py:

import os
import subprocess
import time
from label_studio_sdk import Client
from label_studio_tools.core.utils.io import get_local_path


def clean_filename(name):
    return name.split("__", 1)[-1]


# Initialize the Label Studio SDK client
LABEL_STUDIO_URL = 'http://1.1.1.1:8080/'
API_KEY = 'your-api-key-here'
PROJECT_ID = 5  # Replace with your actual project ID

client = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
project = client.get_project(PROJECT_ID)

# 1. Export JSON snapshot
snapshot = project.export_snapshot_create('my_snapshot')
export_id = snapshot['id']

# Wait until the snapshot is ready
while not project.export_snapshot_status(export_id).is_completed():
    time.sleep(1)  # Sleep to avoid excessive requests

# Download the snapshot
# Will get a file like:project-5-at-2025-04-07-08-00-01f7252d.json
status, json_file_path = project.export_snapshot_download(export_id, export_type='JSON')

# 2. Convert JSON to YOLO dataset using label-studio-converter
# Will convert file like project-5-at-2025-04-07-08-00-01f7252d.json to yolo format
label_config_xml = project.params['label_config']
xml_file_path = 'label_config.xml'
with open(xml_file_path, 'w') as xml_file:
    xml_file.write(label_config_xml)

# Run label-studio-converter CLI
# Need env LS_UPLOAD_DIR
subprocess.run([
    'label-studio-converter', 'export',
    '-i', json_file_path,
    '-o', 'output_yolo',
    '-c', xml_file_path,
    '-f', 'YOLO'
])

# 3. Download all images and copy to YOLO images folder
# Will download all image into target path
yolo_images_dir = os.path.join('output_yolo', 'images')
os.makedirs(yolo_images_dir, exist_ok=True)

# Assuming the JSON structure contains a list of tasks with image URLs
# for task in project.get_tasks().all():
for task in project.get_tasks():
    image_url = task['data'].get('image')
    if image_url:
        local_image_path = get_local_path(
            url=image_url,
            hostname=LABEL_STUDIO_URL,
            access_token=API_KEY,
            download_resources=True,
            task_id=task['id']
        )
        # Rename image
        originBasename = os.path.basename(local_image_path)
        basename = clean_filename(originBasename)
        print(f'originBasename: {originBasename} basename:{basename}')
        target_path = os.path.join(yolo_images_dir, basename)
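        # Move (not copy) the downloaded image into the YOLO images directory.
        # os.rename fails across filesystems; shutil.move is the safer choice there.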
        print(f'local_image_path:{local_image_path} target_path: {target_path}')
        os.rename(local_image_path, target_path)

print("Conversion and image preparation complete.")

Before exporting, only the following settings at the top of the script need to be changed:

LABEL_STUDIO_URL = 'http://1.1.1.1:8080/'
API_KEY = 'your-api-key-here'
PROJECT_ID = 5  # Replace with your actual project ID

Install the dependencies

pip install label-studio-converter
pip install label-studio-tools
pip install label_studio_sdk

Export the data

python converter.py

The following error may appear during the run; it can be ignored for now.

FileNotFoundError: Can't find upload dir: either LS_UPLOAD_DIR or project should be passed to converter

The export has succeeded once the final output is:

Conversion and image preparation complete.

The exported content is in the output_yolo folder under the current directory. Inspect it:

❯ tree output_yolo
output_yolo
├── classes.txt
├── images
│   ├── 113d41bf-dog2.jpg
│   ├── 31dd6dd8-dog3.jpg
│   ├── 670e6ac3-cat2.jpg
│   ├── e3f55527-cat1.jpg
│   ├── f1ff4841-cat3.jpg
│   └── f43803b2-dog1.jpg
├── labels
│   ├── 113d41bf-dog2.txt
│   ├── 31dd6dd8-dog3.txt
│   ├── 670e6ac3-cat2.txt
│   ├── e3f55527-cat1.txt
│   ├── f1ff4841-cat3.txt
│   └── f43803b2-dog1.txt
└── notes.json

2 directories, 14 files
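Because the images and the label files arrive through two different steps (the converter CLI and the download loop), it can be worth checking that they pair up before going further. The sketch below is my own helper, not part of the Label Studio tooling; it assumes the output_yolo layout shown above and compares file stems between images/ and labels/.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unpaired(dataset_dir):
    """Return (images without a label, labels without an image) as sorted stem lists."""
    root = Path(dataset_dir)
    image_stems = {
        p.stem for p in (root / "images").iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


if __name__ == "__main__":
    dataset = Path("output_yolo")
    # Only run the check when the exported layout is actually present
    if (dataset / "images").is_dir() and (dataset / "labels").is_dir():
        no_label, no_image = find_unpaired(dataset)
        print("images without labels:", no_label)
        print("labels without images:", no_image)
```

If both lists come back empty, every image has a matching label file and vice versa.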

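At this point we have a rough YOLO dataset. A few more adjustments are needed to turn it into a standard YOLO dataset.

3. Building a Standard YOLO Dataset

3.1 Splitting the dataset

Split the pictures under the dataset's images directory into train, val, and test directories at a 7:2:1 ratio, and apply the same split to the labels directory.

split.py

The script below handles this quickly. The full contents of split.py: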

import os
import shutil
import random
from pathlib import Path


def split_yolo_dataset(src_dir="output_yolo", ratios=(0.7, 0.2, 0.1)):
    """
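    Parameters:
    src_dir: source dataset directory (must contain images and labels subdirectories)
    ratios: train/val/test ratios (should sum to 1)
    """

    # Output directory; remove the result of any previous run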
    dst_dir = f"{src_dir}_split"
    if os.path.exists(dst_dir):
        shutil.rmtree(dst_dir)

    # Create the standard directory layout
    base_path = Path(dst_dir)
    (base_path / "images").mkdir(parents=True)
    (base_path / "labels").mkdir(parents=True)

    # Copy the source files
    shutil.copytree(Path(src_dir) / "images", base_path / "images" / "original")
    shutil.copytree(Path(src_dir) / "labels", base_path / "labels" / "original")
    shutil.copy(Path(src_dir) / "classes.txt", base_path)
    if (Path(src_dir) / "notes.json").exists():
        shutil.copy(Path(src_dir) / "notes.json", base_path)

    # Collect all image file names (without extension)
    all_images = [f.stem for f in (base_path / "images/original").glob("*.*")
                  if f.suffix.lower() in ['.jpg', '.png', '.jpeg']]
    random.shuffle(all_images)  # shuffle randomly

    # Compute the split points
    total = len(all_images)
    train_end = int(ratios[0] * total)
    val_end = train_end + int(ratios[1] * total)

    # Partition the dataset
    splits = {
        "train": all_images[:train_end],
        "val": all_images[train_end:val_end],
        "test": all_images[val_end:]
    }

    # Create the per-split directories
    for split in splits:
        (base_path / "images" / split).mkdir()
        (base_path / "labels" / split).mkdir()

    # Move files into their split directories
    for split, files in splits.items():
        for fname in files:
            # Handle the image file
            src_img = next((base_path / "images/original").glob(f"{fname}.*"))
            dst_img = base_path / "images" / split / src_img.name
            shutil.move(str(src_img), str(dst_img))

            # Handle the label file
            src_label = base_path / "labels/original" / f"{fname}.txt"
            dst_label = base_path / "labels" / split / src_label.name
            if src_label.exists():
                shutil.move(str(src_label), str(dst_label))
            else:
                print(f"Warning: missing label file {src_label}")

    # Remove the temporary "original" directories
    shutil.rmtree(base_path / "images/original")
    shutil.rmtree(base_path / "labels/original")

    print(f"Dataset split into {dst_dir}")
    print("Final directory structure:")
    print("images/")
    print(f"├── train/ : {len(splits['train'])} images")
    print(f"├── val/   : {len(splits['val'])} images")
    print(f"└── test/  : {len(splits['test'])} images")
    print("labels/")
    print(f"├── train/ : {len(splits['train'])} labels")
    print(f"├── val/   : {len(splits['val'])} labels")
    print(f"└── test/  : {len(splits['test'])} labels")


if __name__ == "__main__":
    split_yolo_dataset()

Demo

$ python split.py
Final directory structure:
images/
├── train/ : 4 images
├── val/   : 1 images
└── test/  : 1 images
labels/
├── train/ : 4 labels
├── val/   : 1 labels
└── test/  : 1 labels
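The 4/1/1 result follows from the truncating arithmetic in the script: the train and val sizes are rounded down with int(), and test receives whatever is left. The sketch below reproduces only that arithmetic (split_sizes is a name I made up for illustration) so you can see what a given dataset size would yield before running the real split.

```python
def split_sizes(total, ratios=(0.7, 0.2, 0.1)):
    """Mirror the split-point arithmetic of split.py for a dataset of `total` images."""
    train = int(ratios[0] * total)  # rounded down
    val = int(ratios[1] * total)    # rounded down
    test = total - train - val      # the remainder goes to test
    return train, val, test


if __name__ == "__main__":
    print(split_sizes(6))  # -> (4, 1, 1), matching the demo above
    print(split_sizes(3))  # -> (2, 0, 1), val ends up empty
```

With very small datasets the val split can end up empty, as the second call shows, so it helps to check the sizes before training.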

After processing, the dataset directory looks like this:

$ tree output_yolo_split
├── classes.txt
├── images
│   ├── test
│   │   └── f43803b2-dog1.jpg
│   ├── train
│   │   ├── 113d41bf-dog2.jpg
│   │   ├── 31dd6dd8-dog3.jpg
│   │   ├── 670e6ac3-cat2.jpg
│   │   └── f1ff4841-cat3.jpg
│   └── val
│       └── e3f55527-cat1.jpg
├── labels
│   ├── test
│   │   └── f43803b2-dog1.txt
│   ├── train
│   │   ├── 113d41bf-dog2.txt
│   │   ├── 31dd6dd8-dog3.txt
│   │   ├── 670e6ac3-cat2.txt
│   │   └── f1ff4841-cat3.txt
│   └── val
│       └── e3f55527-cat1.txt
└── notes.json

8 directories, 14 files

At this point the layout is already fairly close to the YOLO dataset format.
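Since the split is random, a small dataset can easily end up with a class missing from val or test. As a rough check, the sketch below counts label lines per class index in each split. It assumes the labels/train, labels/val, labels/test layout shown above and that each label line starts with the class index; the function names are my own.

```python
from collections import Counter
from pathlib import Path


def class_counts(labels_dir):
    """Count annotated boxes per class index across all .txt files in a directory."""
    counts = Counter()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1
    return counts


def counts_per_split(dataset_dir, splits=("train", "val", "test")):
    """Return {split: Counter} for every split directory that exists."""
    root = Path(dataset_dir) / "labels"
    return {s: class_counts(root / s) for s in splits if (root / s).is_dir()}


if __name__ == "__main__":
    for split, counts in counts_per_split("output_yolo_split").items():
        print(split, dict(sorted(counts.items())))
```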

3.2 Creating the dataset description file

Create a data.yaml configuration file in the dataset root directory with the following contents:

# Reference: https://docs.ultralytics.com/datasets/detect/#ultralytics-yolo-format
# dataset path
# path: ''  # dataset root dir (relative to this yaml file)
train: images/train # train images (relative to 'path')
val: images/val # val images (relative to 'path')
test: images/test # test images (optional)

# number of classes
nc: 3

# class names
names:
  0: Dog
  1: Cat
  2: Other

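The settings mean the following:

  • The dataset path section specifies where each part of the data lives

    • path is the dataset root directory, relative to this data.yaml. Because data.yaml sits directly in the dataset root, path is simply left unset.

    • train/val/test: all relative to path; with the current configuration the train directory is images/train

  • nc is the number of classes

  • names maps each class index to its class name; the indices must match the ones used in the label files

Fill these in to match your actual data.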
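To avoid typing the class list by hand, data.yaml can also be generated from the exported classes.txt. The sketch below is my own helper and rests on one assumption: classes.txt lists one class name per line, in the same order as the class indices used in the label files. It writes plain text rather than going through a YAML library, so class names containing characters such as ":" or "#" would need quoting that this sketch does not do.

```python
from pathlib import Path


def build_data_yaml(dataset_dir):
    """Write data.yaml into dataset_dir based on its classes.txt; return the text."""
    root = Path(dataset_dir)
    raw = (root / "classes.txt").read_text(encoding="utf-8")
    names = [line.strip() for line in raw.splitlines() if line.strip()]

    lines = [
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "",
        f"nc: {len(names)}",
        "",
        "names:",
    ]
    # Class index = position of the name in classes.txt (assumption, see above)
    lines += [f"  {index}: {name}" for index, name in enumerate(names)]

    text = "\n".join(lines) + "\n"
    (root / "data.yaml").write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    dataset = Path("output_yolo_split")
    if (dataset / "classes.txt").exists():
        print(build_data_yaml(dataset))
```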

With that, a dataset suitable for YOLO is ready.

❯ tree output_yolo_split
├── classes.txt
├── data.yaml
├── images
│   ├── test
│   │   └── f43803b2-dog1.jpg
│   ├── train
│   │   ├── 113d41bf-dog2.jpg
│   │   ├── 31dd6dd8-dog3.jpg
│   │   ├── 670e6ac3-cat2.jpg
│   │   └── f1ff4841-cat3.jpg
│   └── val
│       └── e3f55527-cat1.jpg
├── labels
│   ├── test
│   │   └── f43803b2-dog1.txt
│   ├── train
│   │   ├── 113d41bf-dog2.txt
│   │   ├── 31dd6dd8-dog3.txt
│   │   ├── 670e6ac3-cat2.txt
│   │   └── f1ff4841-cat3.txt
│   └── val
│       └── e3f55527-cat1.txt
└── notes.json

8 directories, 15 files

This dataset can now be used for training.

4. Training YOLO11

Official docs: yolo#quickstart

4.1 Installing dependencies

Install the torch and torchvision builds that match your hardware.

# 1. Install or update ultralytics
pip install -U ultralytics

# 2. Install the torch and torchvision builds that match your hardware
# https://pytorch.org/get-started/locally/
pip3 install torch torchvision torchaudio

4.2 Example

from ultralytics import YOLO

# Load a model
#model = YOLO("yolo11n.yaml")  # build a new model from YAML
model = YOLO("yolo11n.pt")  # load a pretrained model (recommended for training)
#model = YOLO("yolo11n.yaml").load("yolo11n.pt")  # build from YAML and transfer weights

# Train the model
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)
# Train the model with 2 GPUs
#results = model.train(data="coco8.yaml", epochs=100, imgsz=640, device=[0, 1])

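There are three ways to load a model:

  • 1) Build a new model from YAML

  • 2) Load a model directly from pretrained weights (recommended)

  • 3) Build a new model from YAML and transfer the pretrained weights into it, typically used when the model structure has been modified

Training then runs on the coco8 dataset. To train on several GPUs at once, pass their indices through the device parameter, as in the commented-out line.

Running it produces the following output: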

$ python demo.py
100 epochs completed in 0.009 hours.
Optimizer stripped from runs/detect/train/weights/last.pt, 5.5MB
Optimizer stripped from runs/detect/train/weights/best.pt, 5.5MB

Validating runs/detect/train/weights/best.pt...
Ultralytics 8.3.104 🚀 Python-3.10.16 torch-2.6.0+cu124 CUDA:0 (NVIDIA L40S, 48650MiB)
YOLO11n summary (fused): 100 layers, 2,616,248 parameters, 0 gradients, 6.5 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 59.11
                   all          4         17       0.65      0.783      0.913      0.652
                person          3         10      0.621        0.7      0.667      0.326
                   dog          1          1      0.524          1      0.995      0.796
                 horse          1          2      0.633          1      0.995      0.676
              elephant          1          2      0.567          1      0.828      0.323
              umbrella          1          1      0.555          1      0.995      0.895
          potted plant          1          1          1          0      0.995      0.895
Speed: 0.1ms preprocess, 1.4ms inference, 0.0ms loss, 0.5ms postprocess per image

The pretrained model and the dataset are downloaded automatically before training starts.

4.3 Training on a custom dataset

from ultralytics import YOLO

# Load a model
model = YOLO("/mnt/e015a2b7cb4b49f18419022d3fb045ec/iyolo/yolo11n.pt")  # load a pretrained model (recommended for training)


# Train the model
results = model.train(data="/mnt/e015a2b7cb4b49f18419022d3fb045ec/iyolo/mydata/data.yaml", epochs=100, imgsz=640)

Simply point data at the data.yaml created earlier.

The output is as follows:

$ python train.py
100 epochs completed in 0.011 hours.
Optimizer stripped from runs/detect/train2/weights/last.pt, 5.5MB
Optimizer stripped from runs/detect/train2/weights/best.pt, 5.5MB

Validating runs/detect/train2/weights/best.pt...
Ultralytics 8.3.104 🚀 Python-3.10.16 torch-2.6.0+cu124 CUDA:0 (NVIDIA L40S, 48650MiB)
YOLO11n summary (fused): 100 layers, 2,582,737 parameters, 0 gradients, 6.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00, 93.01
                   all          1          1      0.011          1      0.995      0.697
                   Dog          1          1      0.011          1      0.995      0.697
Speed: 0.2ms preprocess, 5.9ms inference, 0.0ms loss, 1.3ms postprocess per image
Results saved to runs/detect/train2
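With a single validation image these numbers say very little, but on a larger dataset it is useful to see which epoch scored best. The sketch below assumes the run directory contains a results.csv with an epoch column and a metrics/mAP50-95(B) column, which is what I would expect from Ultralytics detection runs; treat both the file path and the column names as assumptions and adjust them to what your version actually writes. Header names are stripped because some versions pad them with spaces.

```python
import csv
from pathlib import Path


def best_epoch(results_csv, metric="metrics/mAP50-95(B)"):
    """Return (epoch, value) for the row with the highest value of `metric`."""
    with open(results_csv, newline="") as f:
        rows = [
            {key.strip(): value.strip() for key, value in row.items()}
            for row in csv.DictReader(f)
        ]
    best = max(rows, key=lambda row: float(row[metric]))
    return int(float(best["epoch"])), float(best[metric])


if __name__ == "__main__":
    # Assumed location, based on the "Results saved to" line in the log above
    csv_path = Path("runs/detect/train2/results.csv")
    if csv_path.exists():
        epoch, score = best_epoch(csv_path)
        print(f"best epoch: {epoch}, mAP50-95: {score:.3f}")
```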

4.4 Inference

Use the best.pt weights produced by training to run inference, with the test part of the dataset as input.

from ultralytics import YOLO


if __name__ == '__main__':

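    # Load the trained weights. Note: the custom-dataset run in 4.3 saved to runs/detect/train2,
    # so point this path at the run you actually want to test.
    model = YOLO('/mnt/e015a2b7cb4b49f18419022d3fb045ec/iyolo/runs/detect/train/weights/best.pt')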
    model.predict(source='/mnt/e015a2b7cb4b49f18419022d3fb045ec/iyolo/mydata/images/test', conf = 0.4, save = True)

Running it produces:

$ python predict.py
image 1/1 /mnt/e015a2b7cb4b49f18419022d3fb045ec/iyolo/mydata/images/test/f43803b2-dog1.jpg: 448x640 1 dog, 1 bed, 51.5ms
Speed: 6.4ms preprocess, 51.5ms inference, 507.8ms postprocess per image at shape (1, 3, 448, 640)
Results saved to runs/detect/predict

Check the result

https://img.lixueduan.com/ai/labelstudio/yolo/ls-yolo-infer-result.png

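With that, training and inference with YOLO11 on data labeled in Label Studio is complete.

5. Summary

This post walked through the full workflow with LabelStudio: data annotation, result export, dataset formatting, and model training.

Combined with LabelStudio's smart labeling from "Hands-Free! LabelStudio Smart Labeling in Practice", this forms a positive feedback loop:

  • 1) Use the model for smart labeling
  • 2) After labeling, review the results manually and correct the errors
  • 3) Export the results and use them to train the model
  • 4) Use the new model for smart labeling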