title: Detailed explanation of YOLO series models: A complete guide from YOLOv1 to YOLOv8 | Daoman PythonAI
description: A complete YOLO object detection tutorial, with an in-depth analysis of the development history, architectural features, PyTorch implementations and practical application scenarios from YOLOv1 to YOLOv8, including detailed code examples and performance comparisons.
keywords: [YOLO, object detection, computer vision, PyTorch, deep learning, YOLOv8, YOLOv5, real-time detection, image recognition]
Detailed explanation of YOLO series models: A complete guide from YOLOv1 to YOLOv8
The real-time AR makeup try-on on your phone, the automatic license-plate reading in the parking lot, the drone patrolling the sky for sources of fire - behind these applications there is very likely YOLO (You Only Look Once). Its bold idea was to replace the careful, two-stage pipeline of object detection with a single-stage regression that sizes up the whole scene "in one glance". Since the first version appeared in 2015, every YOLO iteration has pushed the accuracy-speed-usability trade-off forward.
This article walks you through the core evolution of the YOLO family, uses key code to explain the architectural changes, helps you pick the version that suits you best through performance comparison, and finishes with a practical quick-start.
1. Basics of getting started with YOLO
1.1 Core Definition: Single-stage regression
The workflow of traditional two-stage algorithms (such as Faster R-CNN) resembles an old-fashioned camera: first "frame the shot" (generate candidate regions), then "focus" (classify + refine the boxes). YOLO's idea is completely different - get it right in one step:
- Divide the input image into an S×S grid
- Each grid cell is responsible only for objects whose center falls inside it
- In a single pass, output B predicted boxes with confidence scores plus C class probabilities for every cell
In this way, the model "looks" at the image once and can directly say what is in it and where, which greatly simplifies the detection pipeline.
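As a quick sanity check of that output layout, here is a minimal sketch using YOLOv1's paper values (S=7, B=2, C=20); the variable names are illustrative:

```python
# YOLOv1 output layout (paper values: S=7 grid, B=2 boxes per cell, C=20 classes).
# Each cell predicts B * (x, y, w, h, confidence) values plus C class probabilities.
S, B, C = 7, 2, 20

values_per_cell = B * 5 + C            # 2*5 + 20 = 30
output_shape = (S, S, values_per_cell)

print(output_shape)                    # (7, 7, 30)
print(S * S * values_per_cell)         # 1470 numbers describe the whole image
```

A single 7×7×30 tensor is the entire detection result, which is exactly why inference is one forward pass.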
1.2 Intuitive comparison with mainstream algorithms
The table below contrasts YOLO with two-stage algorithms and other single-stage algorithms, helping you quickly build a global picture.
As the comparison shows, YOLO's strengths are extreme speed and steadily improving accuracy, which make it the first choice for real-time scenarios.
2. Key evolution of YOLO (core code highlighting)
From v1 to v8, each generation of YOLO stands on the shoulders of the previous one and focuses on the remaining pain points. Following this evolutionary route, we will use code to tie the key improvements together.
2.1 YOLOv1 (2015): Groundbreaking "One Glance"
Core Breakthrough
- Proposed single-stage, end-to-end detection for the first time, making object detection far simpler
- Strong global context understanding, greatly reducing cases where background is mistaken for an object
Pain points
- Poor localization accuracy: box width and height are regressed directly, with no prior (anchor) boxes as a reference
- Small objects are often missed: with only a single 7×7 grid, small objects struggle to "claim" a cell center of their own
- One class per cell: each cell predicts only a single class, so overlapping objects (such as two people standing close together) cannot both be detected
2.2 YOLOv2 (2016): Anchor boxes and accuracy improvement
Core improvements
- Anchor boxes from k-means clustering: give the width/height regression a reliable starting point, fixing v1's imprecise localization
- Batch Normalization: faster training convergence and better model stability
- Darknet-19: a lightweight, efficient feature-extraction backbone - faster and stronger
- Multi-scale training: every 10 batches, randomly switch the input resolution among multiples of 32 (320-608) so the model adapts to objects of different sizes
The following code shows how v2 uses anchor boxes to decode raw predictions into true bounding box coordinates:
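A minimal NumPy sketch of that decoding, following the equations in the YOLOv2 paper (function and argument names here are illustrative, not from any particular implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolov2(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into box center/size (in grid-cell units).

    (cx, cy) is the top-left corner of the responsible grid cell and
    (pw, ph) are the anchor's prior width/height, as in the YOLOv2 paper:
        bx = sigmoid(tx) + cx      # sigmoid keeps the center inside the cell
        by = sigmoid(ty) + cy
        bw = pw * exp(tw)          # anchor size scaled by a learned factor
        bh = ph * exp(th)
    """
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# A raw prediction of all zeros sits at the cell center with the anchor's size.
print(decode_yolov2(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=1.5, ph=2.0))
# → (3.5, 4.5, 1.5, 2.0)
```

This is why anchors help: the network only has to learn small corrections around a sensible prior instead of regressing raw widths and heights from scratch.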
2.3 YOLOv3 (2018): Small goal savior
This generation focuses on small-object and overlapping-object detection, and the changes are very practical.
Core Breakthrough
- Darknet-53: a deeper backbone with residual connections, greatly strengthening feature extraction
- Three-scale detection heads: predictions on three feature maps of different sizes - 13×13 for large objects, 26×26 for medium ones, and 52×52 specializing in small ones - largely fixing single-scale missed detections
- Binary cross-entropy classification: per-class sigmoids replace softmax, so a single box can carry multiple labels and overlapping categories no longer conflict
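Both ideas above are easy to verify in a few lines; this sketch assumes a 416×416 input and made-up class logits:

```python
import numpy as np

# Three detection scales for a 416x416 input (strides 32, 16, 8).
input_size = 416
grids = [input_size // stride for stride in (32, 16, 8)]
print(grids)                          # [13, 26, 52]

# Per-class sigmoids instead of softmax: probabilities are independent and
# need not sum to 1, so one box can score high on several labels at once.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([4.0, 3.5, -6.0])   # made-up scores for three classes
probs = sigmoid(logits)
print((probs > 0.5).sum())            # 2 classes pass the 0.5 threshold
```

With softmax, boosting one class necessarily suppresses the others; with independent sigmoids, "person" and "rider" can both be confident on the same box.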
2.4 YOLOv5/v8 (2020-2023): Ease of use and ceiling improved simultaneously
These two versions (v8 can be seen as a comprehensive evolution of v5), led by the Ultralytics team, turned YOLO from an "academic black box" into an "industrial toolkit", and the developer experience improved dramatically.
v5 key features
- PyTorch-native implementation: a very friendly API; a few lines of code train on public datasets such as COCO
- Focus structure: a clever lossless downsampling that rearranges spatial information into the channel dimension, minimizing information loss
- CSPDarknet/C3 modules: efficient feature extraction and fusion, both powerful and fast
- Mosaic data augmentation: randomly stitches 4 images into one, greatly enriching training backgrounds and small-object context, and markedly improving generalization
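The Focus slicing is simple enough to sketch in NumPy (a real implementation follows it with a convolution; this toy version only shows the lossless rearrangement):

```python
import numpy as np

def focus(x):
    """YOLOv5-style Focus: space-to-depth slicing.

    Takes every other pixel at four phase offsets and stacks the slices on
    the channel axis: (C, H, W) -> (4C, H/2, W/2). No pixels are discarded,
    so the 2x downsampling loses no information.
    """
    return np.concatenate([
        x[:, 0::2, 0::2],   # even rows, even cols
        x[:, 1::2, 0::2],   # odd rows,  even cols
        x[:, 0::2, 1::2],   # even rows, odd cols
        x[:, 1::2, 1::2],   # odd rows,  odd cols
    ], axis=0)

img = np.random.rand(3, 640, 640)     # dummy 3-channel input
out = focus(img)
print(out.shape)                      # (12, 320, 320)
```

Note that the element count is unchanged - spatial resolution is traded for channels, unlike strided convolution or pooling which discard information.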
v8 key features
- Anchor-free design: drops the anchor-box hyperparameters entirely, making tuning and deployment much simpler
- Multi-task unification: one model skeleton handles detection, segmentation, classification, and pose estimation out of the box
- DFL + CIoU loss: more accurate bounding-box regression and more stable localization
- Extremely lightweight: the smallest v8n model has only 3.2M parameters and can run directly on a mobile phone
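The DFL decoding step can be sketched in a few lines: the head predicts a distribution over integer bins, and the box offset is the distribution's expectation (bin count and logits below are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dfl_decode(logits):
    """Distribution Focal Loss style decoding (illustrative sketch).

    Instead of regressing a single value, the head predicts a discrete
    distribution over integer bins; the box offset is its expectation,
    which allows sub-bin precision and expresses uncertainty.
    """
    bins = np.arange(len(logits))
    probs = softmax(logits)
    return float((bins * probs).sum())

# Mass concentrated between bins 3 and 4 -> an offset of about 3.5.
logits = np.array([-9.0, -9.0, -9.0, 5.0, 5.0, -9.0, -9.0, -9.0])
print(round(dfl_decode(logits), 2))   # ≈ 3.5
```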
3. YOLO practical introduction (Ultralytics v8)
With the theory covered, let's actually run it with a few snippets. We use Ultralytics YOLOv8, which currently has the most complete ecosystem.
3.1 Installation and basic inference
3.2 Performance and scene selection
Different applications have very different speed and accuracy requirements; the comparison table below helps you quickly identify a suitable model.
As a rule of thumb, start with v8s or v8m. If speed falls short, move down toward n; if accuracy falls short, move up toward l/x.
4. Learning and optimization suggestions
4.1 Learning Path
It is recommended to follow the following order, which is both easy and solid:
- First understand YOLOv1's single-stage regression idea and why "one look" is enough
- Then look at v2's anchor boxes and BN, and v3's multi-scale design, connecting the historical pain points with their solutions
- Finally get hands-on with v8 for a real project, focusing on data-augmentation strategies and loss-function tuning, starting directly from industrial-grade practice
4.2 Common optimization directions
- Data augmentation: if small objects dominate, try Copy-Paste; if the model overfits, reduce the Mosaic intensity
- Training tips: enable multi-scale training (imgsz 640-1280) and cosine-annealing learning-rate decay; this is usually worth 1-2 mAP points
- Deployment optimization: export with `model.export(format='onnx')`, then accelerate inference on the target device with TensorRT or OpenVINO
Summary
YOLO's success is, at its core, the result of continually turning cutting-edge research into easy-to-use industrial tools. From v1's daring first step to v8's all-rounder, it covers almost every compute scenario from mobile phones to high-end servers.
If you are choosing a version to learn or to ship today, YOLOv8 is strongly recommended: complete documentation, a friendly API, and a high performance ceiling make it a de facto standard in object detection.