Deep learning model for identifying construction machinery activities in underground caverns based on real-time multimodal data
TONG Dawei, FENG Kaiyue, YU Jia, WANG Xiaoling
(State Key Laboratory of Hydraulic Engineering Intelligent Construction and Operation, Tianjin University, Tianjin 300072, China)
Abstract: Identification of construction machinery activities is an effective approach to analyzing production efficiency and ensuring operational safety. Current methods focus primarily on the characteristics of individual modalities, such as kinematics, vision, and acoustics, without adequately considering the intrinsic correlations between multimodal data. This limitation reduces their effectiveness in environments such as dimly lit, confined, and noisy underground caverns. To address this, this study proposes a deep learning model based on the Transformer architecture for recognizing construction machinery activities in underground caverns from real-time multimodal data. Leveraging the attention mechanism's capability to capture long-term dependencies across different modalities, the proposed model integrates multimodal data to improve recognition performance. First, real-time video, audio, and kinematic data are collected during the construction process, and the preliminary features of the three modalities are extracted using S3D, VGGish, and Conformer models, respectively. Cross-modal attention and self-attention mechanisms are then applied to integrate these preliminary features into multimodal fused features. Subsequently, a multi-head attention mechanism combines the preliminary and fused features, enabling robust activity classification based on the enriched feature set. Case studies demonstrate that the proposed model achieves an identification accuracy of 98.14% and an F1 score of 96.47%, improvements of 6.38% and 9.13%, respectively, over the best-performing single-modality models. This study provides a novel approach for recognizing construction machinery activities in underground cavern environments.
Keywords: underground caverns; construction machinery activity recognition; multimodal data; attention mechanism; feature fusion
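
The fusion stage described in the abstract can be illustrated compactly. The following PyTorch sketch is a hypothetical, minimal rendering of that pipeline, not the authors' implementation: three streams of preliminary features (S3D video, VGGish audio, Conformer kinematics) exchange information through pairwise cross-modal attention, are refined with self-attention, and are pooled for classification. The feature dimension, head count, class count, and the simplification of the final multi-head recombination to mean pooling are all assumptions.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        """Minimal sketch of cross-modal attention fusion over three modalities."""
        def __init__(self, dim=256, heads=4, num_classes=6):
            super().__init__()
            # One attention block per (query modality -> key/value modality) pair.
            pairs = ("v2a", "v2k", "a2v", "a2k", "k2v", "k2a")
            self.cross = nn.ModuleDict(
                {p: nn.MultiheadAttention(dim, heads, batch_first=True) for p in pairs})
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, video, audio, kin):
            # Inputs: (batch, seq_len, dim) preliminary features; seq_len may differ per modality.
            v = self.cross["v2a"](video, audio, audio)[0] + self.cross["v2k"](video, kin, kin)[0]
            a = self.cross["a2v"](audio, video, video)[0] + self.cross["a2k"](audio, kin, kin)[0]
            k = self.cross["k2v"](kin, video, video)[0] + self.cross["k2a"](kin, audio, audio)[0]
            fused = torch.cat([v, a, k], dim=1)              # stack cross-attended streams in time
            fused = self.self_attn(fused, fused, fused)[0]   # self-attention over the fused sequence
            return self.head(fused.mean(dim=1))              # temporal mean pooling, then classify

    # Usage with dummy features (batch of 2; per-modality sequence lengths differ):
    model = CrossModalFusion()
    logits = model(torch.randn(2, 16, 256), torch.randn(2, 10, 256), torch.randn(2, 20, 256))

Because the query stream sets the output length of each cross-attention call, the three modalities can keep their native sampling rates until the concatenation step.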
(Editor in charge: LI Na)

