FASSD-Net: Fast and Accurate Real-Time Semantic Segmentation for Embedded Systems

Leonel Rosas-Arias; Gibran Benitez-Garcia; Jose Portillo-Portillo; Jesus Olivares-Mercado; Gabriel Sanchez-Perez; Keiji Yanai

doi:10.1109/TITS.2021.3127553

Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Culhuacán

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Recent works on real-time semantic segmentation remove the decoders of dense deep neural networks, or replace them with light ones, to achieve fast inference speed. This strategy helps to achieve real-time performance; however, accuracy is significantly compromised in comparison to non-real-time methods. In this paper, we introduce two key modules for designing a high-performance decoder for real-time semantic segmentation that also reduces the accuracy gap between real-time and non-real-time networks. The first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to increase the receptive field on top of the last stage of the encoder, obtaining richer contextual features. The second, the Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules are designed to keep computational complexity low by using asymmetric convolutions. With these modules, we propose a network named "FASSD-Net", which is based on a lightweight CNN backbone. Running on a single Nvidia GTX 1080Ti, our model reaches 77.5% and 69.3% mIoU at 41 and 80 FPS on the Cityscapes and CamVid datasets, respectively. We present an extensive analysis of the accuracy-speed tradeoffs of three FASSD-Net variations on different embedded systems, demonstrating that a light version of our network can run on the low-power Jetson Xavier NX at 32 FPS, reaching 74% mIoU at full resolution (1024 × 2048). The source code and pre-trained models are available at github.com/GibranBenitez/FASSD-Net.
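The abstract attributes the modules' low computational cost to asymmetric convolutions and their wider receptive field to dilation. The sketch below is only back-of-the-envelope arithmetic for those two ideas, not the authors' implementation: it assumes a k × k convolution is factorized into a k × 1 convolution followed by a 1 × k convolution with the intermediate width equal to the output width, ignores biases, and uses hypothetical function names and channel counts chosen for illustration.

```python
def conv_weights(k_h, k_w, c_in, c_out):
    """Number of weights in one 2-D convolution (bias terms ignored)."""
    return k_h * k_w * c_in * c_out


def asymmetric_pair_weights(k, c_in, c_out):
    """Weights of a k x 1 convolution followed by a 1 x k convolution.

    Assumption: the intermediate feature map has c_out channels.
    """
    return conv_weights(k, 1, c_in, c_out) + conv_weights(1, k, c_out, c_out)


def effective_kernel_size(k, dilation):
    """Span (in pixels) covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


if __name__ == "__main__":
    k, channels = 3, 64  # illustrative values, not taken from the paper
    full = conv_weights(k, k, channels, channels)
    asym = asymmetric_pair_weights(k, channels, channels)
    print(f"standard {k}x{k} conv weights: {full}")
    print(f"asymmetric {k}x1 + 1x{k} weights: {asym}")
    for d in (1, 2, 4):
        print(f"dilation {d}: effective kernel size {effective_kernel_size(k, d)}")
```

Under these assumptions the factorized pair needs 2k/k² of the weights of the full kernel when input and output widths match, and dilation enlarges the covered span without adding any weights.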

Original language	English
Journal	IEEE Transactions on Intelligent Transportation Systems
DOIs	https://doi.org/10.1109/TITS.2021.3127553
State	Accepted/In press - 2021

Keywords

Convolutional codes
Decoding
Embedded systems
HarDNet
Image segmentation
Jetson Xavier NX
Real-time systems
Semantic segmentation
Semantics
Task analysis
fully convolutional networks
spatial pyramid pooling

Access to Document

10.1109/TITS.2021.3127553

Cite this

@article{f786c1184205423c959f22ff795c9ce2,

title = "FASSD-Net: Fast and Accurate Real-Time Semantic Segmentation for Embedded Systems",

abstract = "Recent works on real-time semantic segmentation remove the decoders of dense deep neural networks, or replace them with light ones, to achieve fast inference speed. This strategy helps to achieve real-time performance; however, accuracy is significantly compromised in comparison to non-real-time methods. In this paper, we introduce two key modules for designing a high-performance decoder for real-time semantic segmentation that also reduces the accuracy gap between real-time and non-real-time networks. The first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to increase the receptive field on top of the last stage of the encoder, obtaining richer contextual features. The second, the Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules are designed to keep computational complexity low by using asymmetric convolutions. With these modules, we propose a network named ``FASSD-Net'', which is based on a lightweight CNN backbone. Running on a single Nvidia GTX 1080Ti, our model reaches 77.5\% and 69.3\% mIoU at 41 and 80 FPS on the Cityscapes and CamVid datasets, respectively. We present an extensive analysis of the accuracy-speed tradeoffs of three FASSD-Net variations on different embedded systems, demonstrating that a light version of our network can run on the low-power Jetson Xavier NX at 32 FPS, reaching 74\% mIoU at full resolution (1024 $\times$ 2048). The source code and pre-trained models are available at github.com/GibranBenitez/FASSD-Net.",

keywords = "Convolutional codes, Decoding, Embedded systems, HarDNet, Image segmentation, Jetson Xavier NX, Real-time systems, Semantic segmentation, Semantics, Task analysis, fully convolutional networks, spatial pyramid pooling",

author = "Leonel Rosas-Arias and Gibran Benitez-Garcia and Jose Portillo-Portillo and Jesus Olivares-Mercado and Gabriel Sanchez-Perez and Keiji Yanai",

note = "Publisher Copyright: IEEE",

year = "2021",

doi = "10.1109/TITS.2021.3127553",

language = "English",

journal = "IEEE Transactions on Intelligent Transportation Systems",

issn = "1524-9050",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - FASSD-Net: Fast and Accurate Real-Time Semantic Segmentation for Embedded Systems

AU - Rosas-Arias, Leonel

AU - Benitez-Garcia, Gibran

AU - Portillo-Portillo, Jose

AU - Olivares-Mercado, Jesus

AU - Sanchez-Perez, Gabriel

AU - Yanai, Keiji

N1 - Publisher Copyright: IEEE

PY - 2021

Y1 - 2021

AB - Recent works on real-time semantic segmentation remove the decoders of dense deep neural networks, or replace them with light ones, to achieve fast inference speed. This strategy helps to achieve real-time performance; however, accuracy is significantly compromised in comparison to non-real-time methods. In this paper, we introduce two key modules for designing a high-performance decoder for real-time semantic segmentation that also reduces the accuracy gap between real-time and non-real-time networks. The first module, Dilated Asymmetric Pyramidal Fusion (DAPF), is designed to increase the receptive field on top of the last stage of the encoder, obtaining richer contextual features. The second, the Multi-resolution Dilated Asymmetric (MDA) module, fuses and refines detail and contextual information from multi-scale feature maps coming from early and deeper stages of the network. Both modules are designed to keep computational complexity low by using asymmetric convolutions. With these modules, we propose a network named "FASSD-Net", which is based on a lightweight CNN backbone. Running on a single Nvidia GTX 1080Ti, our model reaches 77.5% and 69.3% mIoU at 41 and 80 FPS on the Cityscapes and CamVid datasets, respectively. We present an extensive analysis of the accuracy-speed tradeoffs of three FASSD-Net variations on different embedded systems, demonstrating that a light version of our network can run on the low-power Jetson Xavier NX at 32 FPS, reaching 74% mIoU at full resolution (1024 × 2048). The source code and pre-trained models are available at github.com/GibranBenitez/FASSD-Net.

KW - Convolutional codes

KW - Decoding

KW - Embedded systems

KW - HarDNet

KW - Image segmentation

KW - Jetson Xavier NX

KW - Real-time systems

KW - Semantic segmentation

KW - Semantics

KW - Task analysis

KW - fully convolutional networks

KW - spatial pyramid pooling

UR - http://www.scopus.com/inward/record.url?scp=85107475654&partnerID=8YFLogxK

U2 - 10.1109/TITS.2021.3127553

DO - 10.1109/TITS.2021.3127553

M3 - Article

AN - SCOPUS:85107475654

SN - 1524-9050

JO - IEEE Transactions on Intelligent Transportation Systems

JF - IEEE Transactions on Intelligent Transportation Systems

ER -
