A convolutional vision transformer for semantic segmentation of side-scan sonar data

Rajani, Hayat; Grácias, Nuno Ricardo Estrela; García Campos, Rafael

2023-10-15

Full text

1-s2.0-S0029801823020310-main.pdf 3.100 Mb | PDF

Distinguishing among different marine benthic habitat characteristics is of key importance in a wide range of seabed operations, from installing oil rigs and laying cable networks to monitoring the human impact on marine ecosystems. The Side-Scan Sonar (SSS) is a widely used imaging sensor for this purpose. It produces high-resolution seafloor maps by logging the intensities of sound waves reflected back from the seafloor. In this work, we leverage these acoustic intensity maps to produce a pixel-wise categorization of different seafloor types. We propose a novel architecture adapted from the Vision Transformer (ViT) in an encoder–decoder framework. In doing so, we also evaluate the applicability of ViTs to smaller datasets. To overcome the lack of CNN-like inductive biases, and thereby make ViTs more conducive to applications in low-data regimes, we propose a novel feature extraction module that replaces the Multi-layer Perceptron (MLP) block within the transformer layers, and a novel module that extracts multiscale patch embeddings. A lightweight decoder is also proposed to complement this design and further enhance multiscale feature extraction. With the modified architecture, we achieve state-of-the-art results and also meet real-time computational requirements.
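The abstract does not describe the internals of the multiscale patch embedding module, so the following is only a minimal sketch of the general idea behind it: cutting one acoustic intensity map into patch tokens at more than one scale. It covers the tokenization step alone, with no learned projection, convolution, or attention. The function names (`extract_patches`, `multiscale_tokens`), the patch sizes, and the toy 4x4 map are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Sequence

Image = List[List[float]]  # 2D grid of acoustic intensities (hypothetical type alias)


def extract_patches(image: Image, patch: int) -> List[List[float]]:
    """Split a 2D map into non-overlapping patch x patch tiles.

    Each tile is flattened in row-major order, giving one token per tile.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def multiscale_tokens(image: Image,
                      patch_sizes: Sequence[int] = (2, 4)) -> Dict[int, List[List[float]]]:
    """Tokenize the same map at several patch sizes (assumed values, not the paper's)."""
    return {p: extract_patches(image, p) for p in patch_sizes}


# Toy 4x4 intensity map whose entries are simply 0.0 .. 15.0.
sonar = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = multiscale_tokens(sonar)
```

In this sketch the smaller patch size yields more, finer tokens and the larger one yields fewer, coarser tokens. A real embedding module would additionally map each token to a feature vector with learned weights.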

This document is subject to a Creative Commons license: Attribution (by)