IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS
YOLOP: You Only Look Once for Panoptic Driving
Perception
Dong Wu, Manwen Liao, Weitian Zhang, and Xinggang Wang, Member, IEEE
Abstract—A panoptic driving perception system is an essential
part of autonomous driving. A high-precision and real-time
perception system can assist the vehicle in making the reasonable
decision while driving. We present a panoptic driving perception
network (YOLOP) to perform traffic object detection, drivable
area segmentation and lane detection simultaneously. It is com-
posed of one encoder for feature extraction and three decoders
to handle the specific tasks. Our model performs extremely well
on the challenging BDD100K dataset, achieving state-of-the-art
on all three tasks in terms of accuracy and speed. Besides,
we verify the effectiveness of our multi-task learning model for
joint training via ablative studies. To the best of our knowledge, this
is the first work that can process these three visual perception
tasks simultaneously in real time on an embedded device, Jetson
TX2 (23 FPS), while maintaining excellent accuracy. To facilitate
further research, the source code and pre-trained models will
be released at https://github.com/hustvl/YOLOP.
Index Terms—Deep learning, multitask learning, traffic object
detection, drivable area segmentation, lane detection.
I. INTRODUCTION
RECENTLY, extensive research on autonomous driving
has revealed the importance of the panoptic driving
perception system. It plays a significant role in autonomous
driving as it can extract visual information from the images
taken by the camera and assist the decision system to control
the actions of the vehicle. In order to restrict the maneuver
of vehicles, the visual perception system should be able to
understand the scene and then provide the decision system
with information including: locations of the obstacles, judge-
ments of whether the road is drivable, the position of the
lanes etc. Object detection is usually involved in the panoptic
driving perception system to help the vehicles avoid obstacles
and follow traffic rules. Drivable area segmentation and lane
detection are also needed as they are crucial for planning the
driving route of the vehicle.
Many methods handle these tasks separately. For instance,
Faster R-CNN [1] and YOLOv4 [2] deal with object de-
tection; UNet [3] and PSPNet [4] are proposed to perform
semantic segmentation. SCNN [5] and ENet-SAD [6] are used
for detecting lanes. Despite the excellent performance these
methods achieve, processing these tasks one after another takes
more time than tackling them all at once. When deploying
the panoptic driving perception system on embedded devices
commonly used in self-driving cars, limited computational
resources and latency should be taken into consideration.
D. Wu, M. Liao, W. Zhang and X. Wang are with the School of Electronic
Information and Communications, Huazhong University of Science
and Technology, Wuhan 430074, China (e-mail: {riserwu, mwliao, wtzhang,
xgwang}@hust.edu.cn).
(a) Input    (b) Output
Fig. 1. The input and output of our model. The purpose of our model is to process traffic object detection, drivable area segmentation and lane detection simultaneously on one input image. In (b), the brown bounding boxes indicate traffic objects, the green areas are the drivable areas, and the blue lines represent the lane lines.
In addition, different tasks in traffic scene understanding
often have a lot of related information, such as the three tasks
mentioned above. As shown in Figure 1, the lanes are often
the boundary of the drivable area, and the drivable area usually
closely surrounds the traffic objects. A multi-task network is
more suitable in this situation as (1) it can accelerate the image
analysis process by handling multiple tasks at once instead of
one by one, and (2) it can share information among multiple tasks,
which may improve the performance of each task since multi-task
networks often share the same feature extraction backbone.
Therefore, it is essential to explore multi-task approaches in
autonomous driving.
MultiNet [7] uses the encoder-decoder structure which has
one shared encoder and three separate decoders for classifica-
tion, object detection and semantic segmentation. It performs
well on these tasks and achieves state-of-the-art on KITTI
drivable area segmentation task. Classification tasks, however,
are not as crucial as lane detection in controlling the vehicle.
DLT-Net [8] combines traffic object detection, drivable area
segmentation and lane detection all together and proposes
context tensor to fuse feature maps between decoders in
order to share mutual information. Although it achieves competitive
performance, it does not reach real-time speed. Thus, we construct
an efficient multi-task network for the panoptic driving percep-
tion system which includes the object detection, drivable area
segmentation and lane detection tasks and can reach real-time speed
on the embedded device Jetson TX2 with TensorRT deployment.
By processing these three key tasks in autonomous driving
all at once, we reduce the inference time of the panoptic
driving perception system, constrain the computational cost
to a reasonable range and enhance the performance of each
task.
In order to obtain high precision and fast speed, we design a
simple and efficient network architecture. We use a lightweight
CNN [9] as the encoder to extract features from the image.
Then these feature maps are fed to three decoders to complete
their respective tasks. Our detection decoder is based on the
current best-performing single-stage detection network [2] for
two main reasons: (1) The single-stage detection network is
faster than the two-stage detection network. (2) The grid-based
prediction mechanism of the single-stage detector is more
related to the other two semantic segmentation tasks, while
instance segmentation is usually combined with the region-
based detector [10]. The feature map output by the encoder
incorporates semantic features of different levels and scales,
and our segmentation branch can use these feature maps to
complete pixel-wise semantic prediction excellently.
In addition to the end-to-end training strategy, we attempt
some alternating optimization paradigms which train our
model step-by-step. On the one hand, we can put unrelated
tasks in different training steps to prevent inter-limitation. On
the other hand, the task trained first can guide the other tasks. So
this kind of paradigm sometimes works well, though it is cumber-
some. However, experiments show that it is unnecessary for
our model, as the one trained end to end can perform well
enough. As a result, our panoptic driving perception system
reaches 41 FPS on a single NVIDIA TITAN XP and 23 FPS
on Jetson TX2; meanwhile, it achieves state-of-the-art on the
three tasks of the BDD100K dataset [11].
In summary, our main contributions are: (1) We put forward
an efficient multi-task network that can jointly handle three
crucial tasks in autonomous driving: object detection, drivable
area segmentation and lane detection to save computational
costs, reduce inference time as well as improve the perfor-
mance of each task. Our work is the first to reach real-time
on embedded devices while maintaining state-of-the-art level
performance on the BDD100K dataset. (2) We design
ablative experiments to verify the effectiveness of our multi-
tasking scheme. The results prove that the three tasks can be learned
jointly without tedious alternating optimization.
II. RELATED WORK
In this section, we review solutions to the above three
tasks respectively, and then introduce some related multi-task
learning work. We only concentrate on solutions based on deep
learning.
A. Traffic Object Detection
In recent years, with the rapid development of deep learning,
many prominent object detection algorithms have emerged.
Current mainstream object detection algorithms can be divided
into two-stage methods and one-stage methods.
Two-stage methods complete the detection task in two steps.
First, regional proposals are obtained, and then features in
the regional proposals are used to locate and classify the
objects. The generation of regional proposals has gone through
several stages of development. R-CNN [12] creatively tries
to use selective search instead of sliding windows to extract
regional proposals on the original image, while Fast R-CNN
[13] performs this operation directly on the feature map. The
RPN network proposed in Faster-RCNN [1] greatly reduces
the time consumption and obtains higher accuracy. Based on
the former, R-FCN [14] proposes a fully convolutional network
that replaces the fully connected layer with the convolutional
layer to further speed up detection.
The SSD-series [15] and YOLO-series algorithms are mile-
stones among one-stage methods. This kind of algorithm
performs bounding box regression and object classification
simultaneously. YOLO [16] divides the picture into S×S
grids instead of extracting regional proposals with the RPN
network, which significantly accelerates the detection speed.
YOLO9000 [17] introduces the anchor mechanism to improve
the recall of detection. YOLOv3 [18] uses the feature pyramid
network structure to achieve multi-scale detection. YOLOv4
[2] further improves the detection performance by refining
the network structure, activation function, loss function and
applying abundant data augmentation.
B. Drivable Area Segmentation
Due to the great success of deep learning, CNN-based
methods are used widely in semantic segmentation recently.
FCN [19] firstly introduces fully convolutional network to
semantic segmentation. It preserves the backbone of the CNN-
classifier and replaces the final fully connected layer with a
1 × 1 convolutional layer and an upsampling layer. Despite the
skip-connection refinement, its performance is still limited
by low-resolution output. In order to obtain higher-resolution
output, UNet [3] constructs the encoder-decoder architecture.
DeepLab [20] uses a CRF (conditional random field) to improve
the quality of the output as well as proposes the atrous
algorithm to expand the receptive field while maintaining
similar computational costs. PSPNet [4] comes up with the
pyramid pooling module to extract features in various scales
to enhance its performance.
C. Lane Detection
In lane detection, there is a lot of innovative research
based on deep learning. [21] constructs a dual-branch network
to perform semantic segmentation and pixel embedding on
images. It further clusters the dual-branch features to achieve
lane instance segmentation. SCNN [5] proposes slice-by-slice
convolution, which enables messages to pass between pixels
across rows and columns within a layer, but this convolution is very
time-consuming. ENet-SAD [6] uses a self-attention distillation
method, which enables low-level feature maps to learn from
high-level feature maps. This method improves the perfor-
mance of the model while keeping the model lightweight. [22]
defines lane detection as the task of finding the locations of lane
lines in certain rows of the image, and this row-based
classification uses global features.
D. Multi-task Approaches
The goal of multi-task learning is to learn better repre-
sentations through shared information among multiple tasks.
Especially, a CNN-based multitask learning method can also
achieve convolutional sharing of the network structure. Mask
R-CNN [10] extends Faster R-CNN by adding a branch for
predicting object mask, which combines instance segmentation
and object detection tasks effectively, and these two tasks can
promote each other’s performance. With a shared encoder and
three independent decoders, MultiNet [7] completes the three
scene perception tasks of scene classification, object detection
and segmentation of the driving area simultaneously. DLT-Net
[8] inherits the encoder-decoder structure, and constructs
context tensors between sub-task decoders to share designated
information among tasks. [23] puts forward mutually
interlinked sub-structures between lane area segmentation and
lane boundary detection. Meanwhile, it proposes a novel loss
function to constrain the lane line to the outer contour of
the lane area so that they overlap geometrically.
However, this prior assumption also limits its application, as
it only works well in scenarios where the lane line tightly
wraps the lane area. What is more, the training paradigm of a
multi-task model is also worth considering. [24] states that
joint training is appropriate and beneficial only when all
the tasks are indeed related; otherwise, it is necessary to
adopt alternating optimization. So Faster R-CNN [1] adopts a
pragmatic 4-step training algorithm to learn shared features.
This paradigm may sometimes be helpful, but it is tedious.
III. METHODOLOGY
We put forward a simple and efficient feed-forward network
that can accomplish traffic object detection, drivable area
segmentation and lane detection tasks altogether. As shown in
Figure 2, our panoptic driving perception single-shot network,
termed as YOLOP, contains one shared encoder and three
subsequent decoders to solve specific tasks. There are no
complex and redundant shared blocks between different de-
coders, which reduces computational consumption and allows
our network to be easily trained end-to-end.
A. Encoder
Our network shares one encoder, which is composed of a
backbone network and a neck network.
1) Backbone:
The backbone network is used to extract
the features of the input image. Usually, some classic image
classification networks serve as the backbone. Due to the
excellent performance of YOLOv4 [2] on object detection,
we choose CSPDarknet [9] as the backbone, which solves the
problem of gradient duplication during optimization [25]. It
supports feature propagation and feature reuse, which reduces
the number of parameters and calculations. Therefore, it is
conducive to ensuring the real-time performance of the net-
work.
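As an illustration of the cross-stage-partial idea behind CSPDarknet, the following is a minimal PyTorch sketch, not the exact YOLOP backbone: the block names, channel splits, depth and SiLU activation are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class ConvBnAct(nn.Module):
    """Basic Conv-BN-activation block (an assumption; CSPDarknet variants differ in activation)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck used inside the CSP block."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBnAct(c, c, 1)
        self.cv2 = ConvBnAct(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class CSPBlock(nn.Module):
    """Cross Stage Partial block: split the channels, run bottlenecks on one half,
    concatenate with the untouched half, then fuse. This duplicates less gradient
    information than stacking bottlenecks on the full feature map."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_mid = c_out // 2
        self.split1 = ConvBnAct(c_in, c_mid, 1)
        self.split2 = ConvBnAct(c_in, c_mid, 1)
        self.blocks = nn.Sequential(*(Bottleneck(c_mid) for _ in range(n)))
        self.fuse = ConvBnAct(2 * c_mid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat((self.blocks(self.split1(x)), self.split2(x)), dim=1))

# Example: one downsampling stage followed by a CSP block.
stage = nn.Sequential(ConvBnAct(64, 128, k=3, s=2), CSPBlock(128, 128, n=3))
print(stage(torch.randn(1, 64, 96, 160)).shape)  # torch.Size([1, 128, 48, 80])
```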
2) Neck:
The neck is used to fuse the features generated
by the backbone. Our neck is mainly composed of Spatial
Pyramid Pooling (SPP) module [26] and Feature Pyramid
Network (FPN) module [27]. SPP generates and fuses features
of different scales, and FPN fuses features at different semantic
levels, making the generated features contain multiple scales
and multiple semantic level information. We adopt the method
of concatenation to fuse features in our work.
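To make this fusion concrete, here is a minimal sketch of an SPP module that pools the same feature map at several kernel sizes and fuses the branches by concatenation, the fusion method adopted in this work; the kernel sizes (5, 9, 13) and the channel reduction are common YOLO-style choices assumed for illustration rather than values stated here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling: max-pool the input at several receptive-field sizes
    (stride 1, so spatial size is preserved) and concatenate the pooled maps with
    the original features before a 1x1 fusion convolution."""
    def __init__(self, c_in, c_out, kernel_sizes=(5, 9, 13)):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, 1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(c_mid * (len(kernel_sizes) + 1), c_out, 1)

    def forward(self, x):
        x = self.reduce(x)
        # Concatenate the identity branch with the pooled branches along channels.
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

spp = SPP(512, 512)
feat = torch.randn(1, 512, 12, 20)   # e.g. a 1/32-scale feature map
print(spp(feat).shape)               # torch.Size([1, 512, 12, 20])
```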
B. Decoders
The three heads in our network are specific decoders for the
three tasks.
1) Detect Head:
Similar to YOLOv4, we adopt an anchor-
based multi-scale detection scheme. Firstly, we use a structure
called Path Aggregation Network (PAN), a bottom-up feature
pyramid network [28]. FPN transfers semantic features top-
down, and PAN transfers positioning features bottom-up. We
combine them to obtain a better feature fusion effect, and
then directly use the multi-scale fusion feature maps in the
PAN for detection. Then, each grid of the multi-scale feature
map will be assigned three prior anchors with different aspect
ratios, and the detection head will predict the offset of position
and the scaling of the height and width, as well as the
corresponding probability of each category and the confidence
of the prediction.
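The sketch below illustrates how such a grid-cell prediction is typically decoded: sigmoid offsets for the box center, exponential scaling of the prior anchor for width and height, and sigmoids for the objectness confidence and class probabilities. This is the classic YOLO-style decoding given only as an illustration; the exact formulas in YOLOP's head may differ.

```python
import torch

def decode_cell(raw, grid_x, grid_y, anchor_w, anchor_h, stride, num_classes):
    """Decode the raw prediction vector of one anchor in one grid cell.

    raw: tensor of shape (5 + num_classes,) =
         (tx, ty, tw, th, objectness, class logits...).
    Classic YOLO-style decoding (an illustrative assumption, not necessarily the
    exact formula used in YOLOP):
        bx = (sigmoid(tx) + grid_x) * stride   # center offset within the cell
        by = (sigmoid(ty) + grid_y) * stride
        bw = anchor_w * exp(tw)                # scale the prior anchor
        bh = anchor_h * exp(th)
    """
    tx, ty, tw, th, tobj = raw[:5]
    bx = (torch.sigmoid(tx) + grid_x) * stride
    by = (torch.sigmoid(ty) + grid_y) * stride
    bw = anchor_w * torch.exp(tw)
    bh = anchor_h * torch.exp(th)
    conf = torch.sigmoid(tobj)                       # confidence of the prediction
    cls_prob = torch.sigmoid(raw[5:5 + num_classes])  # per-class probabilities
    return torch.stack([bx, by, bw, bh]), conf, cls_prob

# Example: one anchor of one cell on the stride-16 feature map, single "vehicle" class.
raw = torch.randn(5 + 1)
box, conf, cls_prob = decode_cell(raw, grid_x=7, grid_y=3, anchor_w=42.0, anchor_h=28.0,
                                  stride=16, num_classes=1)
print(box, conf.item(), cls_prob)
```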
2) Drivable Area Segment Head & Lane Line Segment
Head:
The drivable area segment head and the lane line segment
head adopt the same network structure. We feed the bottom
layer of the FPN to the segmentation branch, with the size of
(W/8, H/8, 256). Our segmentation branch is very simple.
After three upsampling processes, we restore the output feature
map to the size of (W, H, 2), which represents the probability
of each pixel in the input image for the drivable area/lane
line and the background. Because of the shared SPP in
the neck network, we do not add an extra SPP module to
segment branches like others usually do [4], which brings no
improvement to the performance of our network. Additionally,
we use the Nearest Interpolation method in our upsampling
layer to reduce computation cost instead of deconvolution. As
a result, not only do our segmentation decoders produce high-precision
output, but they are also very fast during inference.
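A minimal sketch of such a segmentation branch is given below: starting from the (W/8, H/8, 256) feature map, three nearest-neighbor ×2 upsamplings interleaved with light convolutions restore full resolution and output a two-channel (foreground/background) map. The intermediate channel widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SegHead(nn.Module):
    """Drivable-area / lane-line segmentation branch: three x2 nearest-neighbor
    upsamplings take a (256, H/8, W/8) feature map to a (2, H, W) output giving,
    per pixel, the foreground (drivable area or lane line) and background scores.
    Channel widths 128/64/32 are illustrative assumptions."""
    def __init__(self, c_in=256, num_classes=2):
        super().__init__()
        def block(ci, co):
            return nn.Sequential(
                nn.Conv2d(ci, co, 3, padding=1, bias=False),
                nn.BatchNorm2d(co),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="nearest"),  # cheaper than deconvolution
            )
        self.up1 = block(c_in, 128)
        self.up2 = block(128, 64)
        self.up3 = block(64, 32)
        self.head = nn.Conv2d(32, num_classes, 3, padding=1)

    def forward(self, x):
        return self.head(self.up3(self.up2(self.up1(x))))

# Example for a 640x384 input: the bottom FPN level is 80x48.
feat = torch.randn(1, 256, 48, 80)          # (N, C, H/8, W/8)
print(SegHead()(feat).shape)                 # torch.Size([1, 2, 384, 640])
```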
C. Loss Function
Since there are three decoders in our network, our multi-task
loss contains three parts. As for the detection loss $\mathcal{L}_{det}$, it is a
weighted sum of classification loss, object loss and bounding
box loss, as in Equation (1):

$$\mathcal{L}_{det} = \alpha_1 \mathcal{L}_{class} + \alpha_2 \mathcal{L}_{obj} + \alpha_3 \mathcal{L}_{box}, \quad (1)$$

where $\mathcal{L}_{class}$ and $\mathcal{L}_{obj}$ are focal loss [29], which is utilized
to reduce the loss of well-classified examples, thus forcing the
network to focus on the hard ones. $\mathcal{L}_{class}$ is used for penalizing
classification and $\mathcal{L}_{obj}$ for the confidence of one prediction.
$\mathcal{L}_{box}$ is $\mathcal{L}_{CIoU}$ [30], which takes the distance, overlap rate,
similarity of scale and aspect ratio between the predicted box
and the ground truth into consideration.

Fig. 2. The architecture of YOLOP. YOLOP shares one encoder and combines three decoders to solve different tasks. The encoder consists of a backbone and a neck.
Both the loss of drivable area segmentation $\mathcal{L}_{da-seg}$ and
the loss of lane line segmentation $\mathcal{L}_{ll-seg}$ contain Cross Entropy Loss
with Logits $\mathcal{L}_{ce}$, which aims to minimize the classification
errors between pixels of the network outputs and the targets. It
is worth mentioning that an IoU loss, $\mathcal{L}_{IoU} = \frac{TP}{TP+FP+FN}$, is
added to $\mathcal{L}_{ll-seg}$ as it is especially efficient for the prediction
of the sparse category of lane lines. $\mathcal{L}_{da-seg}$ and $\mathcal{L}_{ll-seg}$ are
defined in Equations (2) and (3) respectively:

$$\mathcal{L}_{da-seg} = \mathcal{L}_{ce}, \quad (2)$$

$$\mathcal{L}_{ll-seg} = \mathcal{L}_{ce} + \mathcal{L}_{IoU}. \quad (3)$$
In conclusion, our final loss is a weighted sum of the three
parts all together, as in Equation (4):

$$\mathcal{L}_{all} = \gamma_1 \mathcal{L}_{det} + \gamma_2 \mathcal{L}_{da-seg} + \gamma_3 \mathcal{L}_{ll-seg}, \quad (4)$$

where $\alpha_1$, $\alpha_2$, $\alpha_3$, $\gamma_1$, $\gamma_2$, $\gamma_3$ can be tuned to balance all parts
of the total loss.
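A condensed sketch of how these terms can be combined is given below. The detection losses are passed in as precomputed values (focal and CIoU losses are assumed to be computed elsewhere), the IoU term is written in the common soft form 1 − TP/(TP+FP+FN) so that minimizing it maximizes IoU, and the α/γ weights are left as arguments since their values are not reported here.

```python
import torch
import torch.nn.functional as F

def iou_loss(logits, target, eps=1e-6):
    """Soft IoU loss on the foreground channel: 1 - TP / (TP + FP + FN),
    which is particularly helpful for the sparse lane-line class."""
    prob = torch.softmax(logits, dim=1)[:, 1]          # foreground probability, (N, H, W)
    fg = (target == 1).float()
    tp = (prob * fg).sum(dim=(1, 2))
    fp = (prob * (1.0 - fg)).sum(dim=(1, 2))
    fn = ((1.0 - prob) * fg).sum(dim=(1, 2))
    return (1.0 - tp / (tp + fp + fn + eps)).mean()

def total_loss(det_losses, da_logits, da_target, ll_logits, ll_target,
               alphas=(1.0, 1.0, 1.0), gammas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (Equations 1-4).
    det_losses = (L_class, L_obj, L_box) are assumed to be computed elsewhere
    with focal loss and CIoU loss; the alpha/gamma weights are hyperparameters
    whose values are not specified here."""
    l_class, l_obj, l_box = det_losses
    l_det = alphas[0] * l_class + alphas[1] * l_obj + alphas[2] * l_box
    l_da = F.cross_entropy(da_logits, da_target)                                    # Eq. (2)
    l_ll = F.cross_entropy(ll_logits, ll_target) + iou_loss(ll_logits, ll_target)   # Eq. (3)
    return gammas[0] * l_det + gammas[1] * l_da + gammas[2] * l_ll                   # Eq. (4)

# Example with dummy tensors.
da_logits = torch.randn(2, 2, 384, 640, requires_grad=True)
ll_logits = torch.randn(2, 2, 384, 640, requires_grad=True)
da_target = torch.randint(0, 2, (2, 384, 640))
ll_target = torch.randint(0, 2, (2, 384, 640))
loss = total_loss((torch.tensor(0.5), torch.tensor(0.3), torch.tensor(0.2)),
                  da_logits, da_target, ll_logits, ll_target)
loss.backward()
print(loss.item())
```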
D. Training Paradigm
We attempt different paradigms to train our model. The
simplest one is training end to end, and then three tasks can be
learned jointly. This training paradigm is useful when all tasks
are indeed related. In addition, some alternating optimization
algorithms have also been tried, which train our model step
by step. In each step, the model can focus on one or multiple
related tasks regardless of those that are unrelated. Even if not all tasks
are related, our model can still learn adequately on each task
with this paradigm. Algorithm 1 illustrates the process of
one step-by-step training method.
IV. EXPERIMENTS
A. Setting
1) Dataset Setting:
The BDD100K dataset [11] supports
the research of multi-task learning in the field of autonomous
driving. With 100k frames of pictures and annotations of 10
tasks, it is the largest driving video dataset. As the dataset
is diverse in geography, environment, and weather, an
algorithm trained on the BDD100K dataset is robust enough
to migrate to a new environment. Therefore, we choose the
BDD100K dataset to train and evaluate our network. The
BDD100K dataset has three parts: a training set with 70K
images, a validation set with 10K images, and a test set with 20K
images. Since the labels of the test set are not public, we evaluate
our network on the validation set.
2) Implementation Details:
In order to enhance the per-
formance of our model, we empirically adopt some practical
techniques and methods of data augmentation.
Algorithm 1 One step-by-step Training Method. First, we only
train the Encoder and Detect head. Then we freeze the Encoder
and Detect head and train the two Segmentation heads.
Finally, the entire network is trained jointly for all three tasks.

Input: Target neural network F with parameter group:
    Θ = {θ_enc, θ_det, θ_seg};
    Training set: T;
    Threshold for convergence: thr;
    Loss function: L_all
Output: Well-trained network: F(x; Θ)

 1: procedure TRAIN(F, T)
 2:     repeat
 3:         Sample a mini-batch (x_s, y_s) from training set T.
 4:         ℓ ← L_all(F(x_s; Θ), y_s)
 5:         Θ ← arg min_Θ ℓ
 6:     until ℓ < thr
 7: end procedure
 8: Θ ← Θ \ {θ_seg}    // Freeze parameters of the two Segmentation heads.
 9: TRAIN(F, T)
10: Θ ← Θ ∪ {θ_seg} \ {θ_det, θ_enc}    // Freeze parameters of the Encoder and Detect head and activate parameters of the two Segmentation heads.
11: TRAIN(F, T)
12: Θ ← Θ ∪ {θ_det, θ_enc}    // Activate all parameters of the neural network.
13: TRAIN(F, T)
14: return Trained network F(x; Θ)
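A possible PyTorch realization of this freezing schedule is sketched below; the sub-module names (encoder, detect_head, da_head, ll_head), the optimizer settings inside the helper and the simplified convergence test are illustrative assumptions, and data loading and the multi-task loss are omitted.

```python
import torch

def set_trainable(modules, flag):
    """Freeze (flag=False) or activate (flag=True) the parameters of a list of sub-modules."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def train_until(model, loader, loss_fn, thr, max_epochs=100):
    """Optimize only the currently trainable parameters until the loss drops below thr."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(max_epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if loss.item() < thr:           # simplified convergence test on the last batch
            break

def step_by_step_train(model, loader, loss_fn, thr):
    """ED-S-W schedule (Algorithm 1): encoder + detect head first, then the two
    segmentation heads with the rest frozen, finally the whole network jointly.
    model is assumed to expose .encoder, .detect_head, .da_head and .ll_head."""
    seg_heads = [model.da_head, model.ll_head]
    set_trainable(seg_heads, False)                               # step 1: E + D
    train_until(model, loader, loss_fn, thr)
    set_trainable([model.encoder, model.detect_head], False)      # step 2: S only
    set_trainable(seg_heads, True)
    train_until(model, loader, loss_fn, thr)
    set_trainable([model.encoder, model.detect_head], True)       # step 3: whole network
    train_until(model, loader, loss_fn, thr)
    return model
```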
With the purpose of enabling our detector to get more prior
knowledge of the objects in the traffic scene, we use the k-
means clustering algorithm to obtain prior anchors from all
detection frames of the dataset. We use Adam as the optimizer
to train our model; the initial learning rate, $\beta_1$, and $\beta_2$ are
set to 0.001, 0.937, and 0.999 respectively. Warm-up and
cosine annealing are used to adjust the learning rate during
training, which aim at leading the model to converge faster and
better [31].
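A learning-rate schedule of this shape can be written with a single LambdaLR: a short linear warm-up followed by cosine decay, as sketched below. The warm-up length, the total number of epochs and the final decay factor are illustrative assumptions rather than values reported here.

```python
import math
import torch

model = torch.nn.Linear(10, 1)                     # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.937, 0.999))

total_epochs, warmup_epochs, final_factor = 240, 3, 0.2   # illustrative values

def lr_factor(epoch):
    """Linear warm-up followed by cosine annealing of the learning-rate multiplier."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_factor + 0.5 * (1.0 - final_factor) * (1.0 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # ... run one training epoch here ...
    scheduler.step()
```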
We use data augmentation to increase the variability of
images so as to make our model robust in different environ-
ments. Photometric distortions and geometric distortions are
taken into consideration in our training scheme. For photo-
metric distortions, we adjust the hue, saturation and value of
images. To handle geometric distortions, we apply random rotation,
scaling, translation, shearing, and left-right flipping to the
images.
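A minimal version of such a pipeline with OpenCV is sketched below: random HSV jitter for the photometric part and a random affine warp plus horizontal flipping for the geometric part. The jitter ranges and distortion limits are assumptions, and in practice the same geometric transform must also be applied to the boxes and masks, which is omitted here.

```python
import math
import cv2
import numpy as np

def augment_hsv(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Photometric distortion: randomly scale hue, saturation and value."""
    r = 1.0 + np.random.uniform(-1.0, 1.0, 3) * (h_gain, s_gain, v_gain)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * r[0]) % 180
    hsv[..., 1:] = np.clip(hsv[..., 1:] * r[1:], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

def random_affine(img, degrees=10, scale=0.25, translate=0.1, shear=5, flip_prob=0.5):
    """Geometric distortion: random rotation, scaling, translation, shearing and
    left-right flipping; the same matrix must also be applied to the labels."""
    h, w = img.shape[:2]
    a = np.random.uniform(-degrees, degrees)
    s = np.random.uniform(1 - scale, 1 + scale)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), a, s)            # rotation + scale
    M[0, 2] += np.random.uniform(-translate, translate) * w      # translation
    M[1, 2] += np.random.uniform(-translate, translate) * h
    M[0, 1] += math.tan(np.random.uniform(-shear, shear) * math.pi / 180)  # approximate shear
    out = cv2.warpAffine(img, M, (w, h), borderValue=(114, 114, 114))
    if np.random.rand() < flip_prob:
        out = cv2.flip(out, 1)                                   # left-right flip
    return out

img = (np.random.rand(384, 640, 3) * 255).astype(np.uint8)
print(random_affine(augment_hsv(img)).shape)                     # (384, 640, 3)
```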
3) Experimental Setting:
We select some excellent multi-
task networks and networks that focus on a single task
to compare with our network. Both MultiNet and DLT-Net
handle multiple panoptic driving perception tasks, and they
have achieved great performance in object detection and
drivable area segmentation tasks on the BDD100k dataset.
Faster-RCNN is an outstanding representative of the two-
stage object detection network. YOLOv5 is the single-stage
network that achieves state-of-the-art performance on the
COCO dataset. PSPNet achieves splendid performance on se-
mantic segmentation task with its superior ability to aggregate
TABLE I
TRAFFIC OBJECT DETECTION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART DETECTORS.

Network         Recall(%)   mAP50(%)   Speed(fps)
MultiNet        81.3        60.2       8.6
DLT-Net         89.4        68.4       9.3
Faster R-CNN    77.2        55.6       5.3
YOLOv5s         86.8        77.2       82
YOLOP (ours)    89.2        76.5       41
global information. We retrain the above networks on the
BDD100k dataset and compare them with our network on
object detection and drivable area segmentation tasks. Since
there is no suitable existing multi-task network that handles the
lane detection task on the BDD100K dataset, we compare
our network with ENet [32], SCNN and ENet-SAD, three
advanced lane detection networks. Besides, the performance
of the joint training paradigm is compared with alternating
training paradigms of many kinds. Moreover, we compare the
accuracy and speed of our multi-task model trained to handle
multiple tasks with the one trained to perform a specific task.
Following [6], we resize images in the BDD100K dataset from
1280×720×3 to 640×384×3. All control experiments follow
the same experimental settings and evaluation metrics, and all
experiments are run on NVIDIA GTX TITAN XP.
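One aspect-ratio-preserving way to obtain 640×384 from a 1280×720 frame is to scale to 640×360 and pad 12 pixels at the top and bottom; this letterbox-style interpretation is our assumption, since only the final resolution is stated above.

```python
import cv2
import numpy as np

def letterbox(img, new_w=640, new_h=384, pad_value=114):
    """Resize keeping the aspect ratio, then pad symmetrically to the target size.
    For a 1280x720 BDD100K frame this yields 640x360 plus 12 pixels of padding
    on the top and bottom (an assumed implementation detail)."""
    h, w = img.shape[:2]
    r = min(new_w / w, new_h / h)
    rw, rh = int(round(w * r)), int(round(h * r))
    resized = cv2.resize(img, (rw, rh), interpolation=cv2.INTER_LINEAR)
    top = (new_h - rh) // 2
    bottom = new_h - rh - top
    left = (new_w - rw) // 2
    right = new_w - rw - left
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(letterbox(frame).shape)   # (384, 640, 3)
```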
B. Result
In this section, we simply train our model end to end
and then compare it with other representative models on all
three tasks.
1) Traffic Object Detection Result:
Since MultiNet and
DLT-Net can only detect vehicles, we only consider the vehicle
detection results of the five models on the BDD100K dataset. As
shown in Table I, we use Recall and mAP50 as the evaluation
metrics of detection accuracy. Our model exceeds Faster R-
CNN, MultiNet, and DLT-Net in detection accuracy, and is
comparable to YOLOv5s, which actually uses more tricks than
ours. Moreover, our model can infer in real time. YOLOv5s is
faster than ours because it does not have the lane line segment
head and drivable area segment head. Visualizations of the
traffic object detection results are shown in Figure 3.
2) Drivable Area Segmentation Result:
In this paper, both
the “area/drivable” and “area/alternative” classes in the BDD100K
dataset are categorized as “drivable area” without distinction.
Our model only needs to distinguish the drivable area from
the background in the image. mIoU is used to evaluate the
segmentation performance of different models. The results are
shown in Table II. It can be seen that our model outperforms
MultiNet, DLT-Net and PSPNet by 19.9%, 20.2%, and 1.9%,
respectively. Furthermore, our inference speed is 4 to 5 times
faster than theirs. Visualization results of the drivable area
segmentation can be seen in Figure 4.
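For completeness, the sketch below computes mIoU over the two classes (drivable area and background) from a confusion matrix accumulated over the validation set; this is the standard definition and is assumed to match the metric used above.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes=2):
    """Accumulate a num_classes x num_classes confusion matrix from label maps."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """mIoU = mean over classes of TP / (TP + FP + FN)."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return np.mean(tp / np.maximum(tp + fp + fn, 1))

# Example with random predictions on two 384x640 masks.
pred = np.random.randint(0, 2, (2, 384, 640))
target = np.random.randint(0, 2, (2, 384, 640))
conf = sum(confusion_matrix(p, t) for p, t in zip(pred, target))
print(mean_iou(conf))
```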
3) Lane Detection Result:
The lane lines in the BDD100K
dataset are labeled with two lines, so it is tricky to use
the annotations directly. The experimental settings follow [6]
for a fair comparison. First of all, we calculate the
center lines based on the two-line annotations. Then we draw
(a) Day-time result
(b) Night-time result
Fig. 3. Visualization of the traffic object detection results of YOLOP. Top row: traffic object detection results in day-time scenes. Bottom row: traffic
object detection results in night scenes.
(a) Day-time result
(b) Night-time result
Fig. 4. Visualization of the drivable area segmentation results of YOLOP. Top Row: Drivable area segmentation results in day-time scenes. Bottom row:
Drivable area segmentation results in night scenes.
TABLE II
DRIVABLE AREA SEGMENTATION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART DRIVABLE AREA SEGMENTATION OR SEMANTIC SEGMENTATION METHODS.

Network         mIoU(%)   Speed(fps)
MultiNet        71.6      8.6
DLT-Net         71.3      9.3
PSPNet          89.6      11.1
YOLOP (ours)    91.5      41
the lane lines of the training set with the width set to 8 pixels,
while keeping the lane line width of the test set at 2 pixels. We
use pixel accuracy and IoU of lanes as evaluation metrics.
As shown in Table III, the performance of our model
dramatically exceeds that of the other three models. The visualization
results of lane detection can be seen in Figure 5.
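The label preparation described above can be reproduced with OpenCV: given the center polyline of each lane (averaged from its two annotation lines), it is rasterized with an 8-pixel width for training masks and a 2-pixel width for test masks. The helper below assumes the center lines are already available as lists of (x, y) points.

```python
import cv2
import numpy as np

def rasterize_lanes(center_lines, height, width, thickness):
    """Draw lane center polylines into a binary mask with the given line width
    (8 px for training labels, 2 px for test labels in this setup)."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for line in center_lines:
        pts = np.asarray(line, dtype=np.int32).reshape(-1, 1, 2)
        cv2.polylines(mask, [pts], isClosed=False, color=1, thickness=thickness)
    return mask

# Example: one lane, assumed to have been averaged from its two annotation lines.
center = [(320, 383), (330, 300), (345, 220), (360, 140)]
train_mask = rasterize_lanes([center], height=384, width=640, thickness=8)
test_mask = rasterize_lanes([center], height=384, width=640, thickness=2)
print(train_mask.sum(), test_mask.sum())
```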
C. Ablation Studies
We design the following two ablation experiments to
further illustrate the effectiveness of our scheme. All the
evaluation metrics in this section are consistent with those above.
(a) Day-time result
(b) Night-time result
Fig. 5. Visualization of the lane detection results of YOLOP. Top Row: Lane detection results in day-time scenes. Bottom row: Lane detection results in
night scenes.
TABLE III
LANE DETECTION RESULTS: COMPARING THE PROPOSED YOLOP WITH STATE-OF-THE-ART LANE DETECTION METHODS.

Network         Accuracy(%)   IoU(%)
ENet            34.12         14.64
SCNN            35.79         15.84
ENet-SAD        36.56         16.02
YOLOP (ours)    70.50         26.20
1) End-to-end v.s. Step-by-step:
In Table IV, we compare
the performance of the joint training paradigm with alternating
training paradigms of many kinds¹. Obviously, our model
performs well enough through end-to-end training, so
there is no need to perform alternating optimization. However,
it is interesting that the paradigms that train the detection task first
seem to perform better. We think this is mainly because our
model is closer to a complete detection model and it is
harder for the model to converge when performing the detection task. What is
more, the paradigms consisting of three steps slightly outperform
those with two steps. Similar alternating training can be run for
more steps, but we have observed negligible improvements.
¹E, D, S and W refer to the Encoder, Detect head, two Segment heads and the
whole network. So Algorithm 1 can be denoted as ED-S-W, and similarly
for the others.
2) Multi-task v.s. Single task:
To verify the effectiveness of
our multi-task learning scheme, we compare the performance
of the multi-task scheme and single task scheme. On the one
hand, we train our model to perform 3 tasks simultaneously.
On the other hand, we train our model to perform traffic
object detection, drivable area segmentation, and lane line
segmentation tasks separately. Table V shows the comparison
of the performance of these two schemes on each specific task.
It can be seen that the performance our model achieves with the
multi-task scheme is close to that of the models focusing on a single
task. More importantly, the multi-task model can save a lot of
time compared to executing each task individually.
V. CONCLUSION
In this paper, we put forward a simple and efficient network,
which can simultaneously handle three driving perception
tasks of object detection, drivable area segmentation and
lane detection and can be trained end-to-end. Our model
performs exceptionally well on the challenging BDD100k
dataset, achieving or greatly exceeding state-of-the-art level
on all three tasks. It can also perform real-time inference on
the embedded device Jetson TX2, which ensures that our network
can be used in real-world scenarios.
TABLE IV
PANOPTIC DRIVING PERCEPTION RESULTS: THE END-TO-END SCHEME V.S. DIFFERENT STEP-BY-STEP SCHEMES.

Training method   Recall(%)   AP(%)   mIoU(%)   Accuracy(%)   IoU(%)
ES-W              87.0        75.3    90.4      66.8          26.2
ED-W              87.3        76.0    91.6      71.2          26.1
ES-D-W            87.0        75.1    91.7      68.6          27.0
ED-S-W            87.5        76.1    91.6      68.0          26.8
End-to-end        89.2        76.5    91.5      70.5          26.2
TABLE V
PANOPTIC DRIVING PERCEPTION RESULTS: MULTI-TASK LEARNING V.S. SINGLE TASK LEARNING.

Training method   Recall(%)   AP(%)   mIoU(%)   Accuracy(%)   IoU(%)   Speed(ms/frame)
Det(only)         88.2        76.9    -         -             -        15.7
Da-Seg(only)      -           -       92.0      -             -        14.8
Ll-Seg(only)      -           -       -         79.6          27.9     14.8
Multitask         89.2        76.5    91.5      70.5          26.2     24.4
REFERENCES
[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
time object detection with region proposal networks,” arXiv preprint
arXiv:1506.01497, 2015.
[2] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Op-
timal speed and accuracy of object detection,” arXiv preprint
arXiv:2004.10934, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in Medical Image Computing and
Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Horneg-
ger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International
Publishing, 2015, pp. 234–241.
[4] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), July 2017.
[5] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, “Spatial as deep:
Spatial cnn for traffic scene understanding,” in Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[6] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning lightweight lane
detection cnns by self attention distillation,” in Proceedings of the
IEEE/CVF International Conference on Computer Vision, 2019, pp.
1013–1021.
[7] M. Teichmann, M. Weber, M. Zöllner, R. Cipolla, and R. Urtasun, “Multinet: Real-time
joint semantic reasoning for autonomous driving,” in 2018 IEEE Intelligent Vehicles
Symposium (IV). IEEE, 2018.
[8] Y. Qian, J. M. Dolan, and M. Yang, “Dlt-net: Joint detection of drivable
areas, lane lines, and traffic objects,” IEEE Transactions on Intelligent
Transportation Systems, vol. 21, no. 11, pp. 4670–4679, 2019.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Scaled-yolov4:
Scaling cross stage partial network,” arXiv preprint arXiv:2011.08036,
2020.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2961–2969.
[11] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell,
“Bdd100k: A diverse driving video database with scalable annotation
tooling,” arXiv preprint arXiv:1805.04687, vol. 2, no. 5, p. 6, 2018.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2014, pp. 580–587.
[13] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 1440–1448.
[14] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-
based fully convolutional networks,” arXiv preprint arXiv:1605.06409,
2016.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European Conference on
Computer Vision.
Springer, 2016, pp. 21–37.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–
788.
[17] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 7263–7271.
[18] ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[19] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[20] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“Deeplab: Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected crfs,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848,
2017.
[21] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and
L. Van Gool, “Towards end-to-end lane detection: an instance segmen-
tation approach,” in 2018 IEEE Intelligent Vehicles Symposium (IV).
IEEE, 2018, pp. 286–291.
[22] Z. Qin, H. Wang, and X. Li, “Ultra fast structure-aware deep lane
detection,” arXiv preprint arXiv:2004.11757, 2020.
[23] J. Zhang, Y. Xu, B. Ni, and Z. Duan, “Geometric constrained joint
lane segmentation and lane boundary detection,” in Proceedings of the
European Conference on Computer Vision (ECCV), 2018, pp. 486–502.
[24] Z. Kang, K. Grauman, and F. Sha, “Learning with whom to share in
multi-task feature learning,” in ICML, 2011.
[25] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H.
Yeh, “Cspnet: A new backbone that can enhance learning capability of
cnn,” in Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops, 2020, pp. 390–391.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep
convolutional networks for visual recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–
1916, 2015.
[27] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2117–2125.
[28] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network
for instance segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
for dense object detection,” in Proceedings of the IEEE International
Conference on Computer Vision, 2017, pp. 2980–2988.
[30] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou loss:
Faster and better learning for bounding box regression,” in Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020,
pp. 12993–13000.
[31] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with
warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
[32] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep
neural network architecture for real-time semantic segmentation,” arXiv
preprint arXiv:1606.02147, 2016.
Dong Wu is an undergraduate senior student in the
School of Electronic Information and Communica-
tions, Huazhong University of Science and Technol-
ogy (HUST), Wuhan, China. His research interests
include computer vision, machine learning and au-
tonomous driving.
Manwen Liao is a senior undergraduate student
from the School of Electronic Information and Com-
munications, Huazhong University of Science and
Technology (HUST), Wuhan, China. He majors in
Electronic Information Engineering. His research
interests mainly include computer vision, machine
learning, robotics and autonomous driving.
Weitian Zhang is an undergraduate senior student
from Huazhong University of Science and Technol-
ogy, Wuhan, Hubei, China, majoring in Electronic
Information Engineering.
Her main research interests include computer vi-
sion and machine learning.
Xinggang Wang (M'17) received the B.S. and Ph.D.
degrees in Electronics and Information Engineering
from Huazhong University of Science and Tech-
nology (HUST), Wuhan, China, in 2009 and 2014,
respectively. He is currently an Associate Professor
with the School of Electronic Information and Com-
munications, HUST. His research interests include
computer vision and machine learning. He serves
as an associate editor for the Pattern Recognition and Im-
age and Vision Computing journals and as an editorial
board member of the Electronics journal.