
[19] SONG L, WENG L, WANG L, et al. Two-stream designed 2d/3d residual networks with LSTMs for action recognition in videos[C]//2018 25th IEEE International Conference on Image Processing (ICIP). New York: IEEE, 2018.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 5998-6008.
[21] TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019.
[22] BALFAS M, AHAMED S I, TAMMA C, et al. A study and estimation a lost person behavior in crowded areas using accelerometer data from smartphones[C]//IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC). IEEE, 2018.
[23] LAU H Y, TONG K Y, ZHU H. Support vector machine for classification of walking conditions using miniature kinematic sensors[J]. Medical & Biological Engineering & Computing, 2008, 46: 563-573.
[24] FUJIKI Y. iPhone as a physical activity measurement platform[C]//CHI'10 Extended Abstracts on Human Factors in Computing Systems. Atlanta, Georgia, USA: ACM, 2010.
[25] XIE S, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[26] HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, LA: IEEE, 2017.
[27] SONG Y, ZHENG Q, LIU B, et al. EEG Conformer: convolutional transformer for EEG decoding and visualization[J]. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2022, 31: 710-719.


                           Deep learning model for identifying construction machinery activities in
                                 underground caverns based on real-time multimodal data

                                     TONG Dawei,FENG Kaiyue,YU Jia,WANG Xiaoling






(State Key Laboratory of Hydraulic Engineering Intelligent Construction and Operation, Tianjin University, Tianjin 300072, China)

Abstract: Identification of construction machinery activities is an effective approach to analyzing production efficiency and ensuring operational safety. Current methods focus primarily on the characteristics of individual modalities such as kinematics, vision, and acoustics, without adequately considering the intrinsic correlations between multimodal data. This limitation reduces their effectiveness in dimly lit, confined, and noisy environments such as underground caverns. To address this limitation, this study proposes a deep learning model based on the Transformer architecture for recognizing construction machinery activities in underground caverns from real-time multimodal data. Leveraging the attention mechanism's capability to capture long-term dependencies across different modalities, the proposed model integrates multimodal data to improve recognition performance. First, real-time video, audio, and kinematic data are collected during the construction process, and the preliminary features of these three modalities are extracted using S3D, VGGish, and Conformer models, respectively. Cross-modal attention and self-attention mechanisms are then applied to integrate these preliminary features, generating multimodal fused features. Finally, a multi-head attention mechanism combines the preliminary and fused features, enabling robust activity classification based on the enriched feature set. Case studies demonstrate that the proposed model achieves an identification accuracy of 98.14% and an F1 score of 96.47%, improvements of 6.38% and 9.13%, respectively, over the best-performing single-modality models. This study provides a novel approach to recognizing construction machinery activities in underground cavern environments.



Keywords: underground caverns; construction machinery activity recognition; multimodal data; attention mechanism; feature fusion
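
To make the fusion pipeline described in the abstract more concrete, the following minimal PyTorch sketch shows one way the three stages could fit together: cross-modal attention exchanges information between modality feature sequences, self-attention refines the fused sequence, and a multi-head attention layer combines the preliminary and fused features before classification. This is an illustrative reconstruction, not the authors' implementation; the class names (MultimodalFusion, CrossModalBlock), the common feature dimension (256), the choice of modality pairings, and the six-class head are all assumptions.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Lets a target modality attend to a source modality (query = target)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + fused)  # residual connection

class MultimodalFusion(nn.Module):
    """Fuses video/audio/kinematic features already projected to a common dim.

    Inputs are assumed to be preliminary feature sequences from the S3D,
    VGGish, and Conformer backbones, each of shape (batch, time, dim).
    """
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 6):
        super().__init__()
        # Cross-modal attention: each modality attends to another (a subset
        # of all possible pairings, chosen here for brevity).
        self.v_from_a = CrossModalBlock(dim, heads)
        self.v_from_k = CrossModalBlock(dim, heads)
        self.a_from_v = CrossModalBlock(dim, heads)
        self.k_from_v = CrossModalBlock(dim, heads)
        # Self-attention refinement over the concatenated fused sequence.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Multi-head attention combining preliminary and fused features.
        self.combine = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, video, audio, kin):
        # Stage 1: cross-modal exchange.
        v = self.v_from_a(self.v_from_k(video, kin), audio)
        a = self.a_from_v(audio, video)
        k = self.k_from_v(kin, video)
        fused = torch.cat([v, a, k], dim=1)             # (B, Tv+Ta+Tk, dim)
        # Stage 2: self-attention over the fused sequence.
        fused, _ = self.self_attn(fused, fused, fused)
        # Stage 3: combine preliminary and fused features, then classify.
        prelim = torch.cat([video, audio, kin], dim=1)  # preliminary features
        out, _ = self.combine(query=fused, key=prelim, value=prelim)
        return self.classifier(out.mean(dim=1))         # pooled logits per clip

# Usage with dummy feature sequences standing in for backbone outputs:
model = MultimodalFusion()
logits = model(torch.randn(2, 16, 256),   # video features
               torch.randn(2, 10, 256),   # audio features
               torch.randn(2, 32, 256))   # kinematic features
print(logits.shape)  # torch.Size([2, 6])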
(Executive Editor: LI Na)