{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,8]],"date-time":"2026-03-08T20:35:05Z","timestamp":1773002105786,"version":"3.50.1"},"reference-count":64,"publisher":"Institution of Engineering and Technology (IET)","issue":"1","license":[{"start":{"date-parts":[[2025,11,29]],"date-time":"2025-11-29T00:00:00Z","timestamp":1764374400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"},{"start":{"date-parts":[[2025,11,29]],"date-time":"2025-11-29T00:00:00Z","timestamp":1764374400000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/doi.wiley.com\/10.1002\/tdm_license_1.1"}],"funder":[{"DOI":"10.13039\/501100019491","name":"National Natural Science Foundation of China - State Grid Corporation Joint Fund for Smart Grid","doi-asserted-by":"publisher","award":["62403345"],"award-info":[{"award-number":["62403345"]}],"id":[{"id":"10.13039\/501100019491","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["CAAI Trans on Intel Tech"],"published-print":{"date-parts":[[2026,2]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>Audio\u2010visual speaker tracking aims to determine the locations of multiple speakers in the scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross\u2010modal feature discrepancy, weak sound source localisation ambiguity and frequent identity switch errors remain unresolved, which severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. 
To this end, this paper proposes a multimodal multispeaker tracking network using audio\u2010visual contrastive learning (AVCLNet). By integrating heterogeneous modal representations into a unified space through audio\u2010visual contrastive learning, AVCLNet facilitates cross\u2010modal feature alignment, mitigates cross\u2010modal feature bias and enhances identity\u2010consistent representations. In the audio\u2010visual measurement stage, we design a vision\u2010guided weak sound source weighted enhancement method, which leverages visual cues to establish cross\u2010modal mappings and employs a spatiotemporal dynamic weighted mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy is introduced by combining the 2D and 3D spatial geometric information, reducing frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state\u2010of\u2010the\u2010art methods, demonstrating superior robustness in multispeaker scenarios.<\/jats:p>","DOI":"10.1049\/cit2.70092","type":"journal-article","created":{"date-parts":[[2025,11,29]],"date-time":"2025-11-29T17:00:46Z","timestamp":1764435646000},"page":"238-255","update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["AVCLNet: Multimodal Multispeaker Tracking Network Using Audio\u2010Visual Contrastive Learning"],"prefix":"10.1049","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-9924-5529","authenticated-orcid":false,"given":"Yihan","family":"Li","sequence":"first","affiliation":[{"name":"College of Computer Science and Technology Taiyuan University of Technology  Taiyuan 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5236-7010","authenticated-orcid":false,"given":"Yidi","family":"Li","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology Taiyuan University of Technology  Taiyuan China"},{"name":"Graduate School of Engineering Science The University of Osaka  Osaka Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5720-7051","authenticated-orcid":false,"given":"Zhenhuan","family":"Xu","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology Taiyuan University of Technology  Taiyuan China"}]},{"given":"Hao","family":"Guo","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology Taiyuan University of Technology  Taiyuan China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6332-8316","authenticated-orcid":false,"given":"Mengyuan","family":"Liu","sequence":"additional","affiliation":[{"name":"Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology  Shenzhen China"},{"name":"Peking University, Shenzhen Graduate School  Shenzhen China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0058-2819","authenticated-orcid":false,"given":"Weiwei","family":"Wan","sequence":"additional","affiliation":[{"name":"Graduate School of Engineering Science The University of Osaka  Osaka Japan"}]}],"member":"265","published-online":{"date-parts":[[2025,11,29]]},"reference":[{"key":"e_1_2_9_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/tasl.2011.2125954"},{"key":"e_1_2_9_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.110"},{"key":"e_1_2_9_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00803"},{"key":"e_1_2_9_5_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Yoon J. 
S.","year":"2019"},{"key":"e_1_2_9_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/tbiom.2021.3120412"},{"key":"e_1_2_9_7_1","first-page":"736","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Chen Y.","year":"2022"},{"key":"e_1_2_9_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/taslp.2020.3040031"},{"key":"e_1_2_9_9_1","first-page":"4280","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Qian X.","year":"2021"},{"key":"e_1_2_9_10_1","doi-asserted-by":"publisher","DOI":"10.1049\/cp.2012.0410"},{"key":"e_1_2_9_11_1","first-page":"2896","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Qian X.","year":"2017"},{"key":"e_1_2_9_12_1","first-page":"1456","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Li Y.","year":"2022"},{"key":"e_1_2_9_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3182151"},{"key":"e_1_2_9_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/taslp.2022.3226330"},{"key":"e_1_2_9_15_1","volume-title":"International Conference on Learning Representations","author":"Shi B.","year":"2022"},{"key":"e_1_2_9_16_1","first-page":"1791","volume-title":"Proceedings of Interspeech","author":"Berg A.","year":"2022"},{"key":"e_1_2_9_17_1","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Shmuel D. 
H.","year":"2023"},{"key":"e_1_2_9_18_1","doi-asserted-by":"publisher","DOI":"10.3390\/axioms12090862"},{"key":"e_1_2_9_19_1","doi-asserted-by":"publisher","DOI":"10.3390\/rs16081423"},{"key":"e_1_2_9_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2023.3301975"},{"key":"e_1_2_9_21_1","doi-asserted-by":"publisher","DOI":"10.2514\/1.i011301"},{"key":"e_1_2_9_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ojsp.2024.3451167"},{"key":"e_1_2_9_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/tmm.2016.2599150"},{"key":"e_1_2_9_24_1","first-page":"974","volume-title":"European Signal Processing Conference","author":"Brutti A.","year":"2010"},{"key":"e_1_2_9_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/jstsp.2008.2001429"},{"key":"e_1_2_9_26_1","first-page":"446","volume-title":"Proceedings of IEEE International Conference on Computer Vision Workshops","author":"Ban Y.","year":"2017"},{"key":"e_1_2_9_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/tsmcb.2008.922063"},{"key":"e_1_2_9_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2019.2953020"},{"key":"e_1_2_9_29_1","first-page":"55","volume-title":"International Evaluation Workshop on Classification of Events, Activities and Relationships","author":"Brunelli R.","year":"2006"},{"key":"e_1_2_9_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9197528"},{"key":"e_1_2_9_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2017.2648793"},{"key":"e_1_2_9_32_1","doi-asserted-by":"publisher","DOI":"10.1155\/s1110865702206058"},{"key":"e_1_2_9_33_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-10190"},{"key":"e_1_2_9_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2010.06.005"},{"key":"e_1_2_9_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-22482-4_17"},{"key":"e_1_2_9_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/tmm.2014.2301977"},{"key":"e_1_2_9_37_1","first-page":"1259","volume-title":"Proceedings of International 
Conference on Image Processing","author":"Gatica\u2010Perez D.","year":"2003"},{"key":"e_1_2_9_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2004.1327252"},{"key":"e_1_2_9_39_1","first-page":"1955","volume-title":"IEEE International Conference on Image Processing","author":"Liu H.","year":"2019"},{"key":"e_1_2_9_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/tmm.2014.2377515"},{"key":"e_1_2_9_41_1","doi-asserted-by":"publisher","DOI":"10.23919\/Eusipco47968.2020.9287677"},{"key":"e_1_2_9_42_1","first-page":"7343","volume-title":"Proceedings of IEEE International Conference on Pattern Recognition","author":"Liu H.","year":"2021"},{"key":"e_1_2_9_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/tmm.2019.2937185"},{"key":"e_1_2_9_44_1","first-page":"5068","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Zhao J.","year":"2022"},{"key":"e_1_2_9_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/taslp.2021.3057230"},{"key":"e_1_2_9_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME51207.2021.9428373"},{"key":"e_1_2_9_47_1","first-page":"247","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Tian Y.","year":"2018"},{"key":"e_1_2_9_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2022.3199662"},{"key":"e_1_2_9_49_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.122028"},{"key":"e_1_2_9_50_1","first-page":"1","volume-title":"European Signal Processing Conference","author":"Grumiaux P.\u2010A.","year":"2021"},{"key":"e_1_2_9_51_1","first-page":"5998","article-title":"Attention Is all You Need","author":"Vaswani A.","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_9_52_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2106.03903"},{"key":"e_1_2_9_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/jstsp.2019.2900164"},{"key":"e_1_2_9_54_1","first-page":"385","article-title":"Sound Event 
Localization and Detection Model Based on Multi\u2010View Attention","volume":"40","author":"Yang J.","year":"2024","journal-title":"Journal of Signal Processing"},{"key":"e_1_2_9_55_1","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford A.","year":"2021"},{"key":"e_1_2_9_56_1","first-page":"4904","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Jia C.","year":"2021"},{"key":"e_1_2_9_57_1","volume-title":"Advances in Neural Information Processing Systems","author":"Bao H.","year":"2022"},{"key":"e_1_2_9_58_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2024.102382"},{"key":"e_1_2_9_59_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Tsiamas I.","year":"2025"},{"key":"e_1_2_9_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49660.2025.10890263"},{"key":"e_1_2_9_61_1","first-page":"182","volume-title":"Proceedings of the International Workshop on Machine Learning for Multimodal Interaction","author":"Lathoud G.","year":"2004"},{"key":"e_1_2_9_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/tmm.2019.2902489"},{"key":"e_1_2_9_63_1","volume-title":"Proceedings of the Conference on Neural Information Processing Systems","author":"Ao W.","year":"2024"},{"key":"e_1_2_9_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_9_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/tmm.2021.3061800"}],"container-title":["CAAI Transactions on Intelligence 
Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/cit2.70092","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/full-xml\/10.1049\/cit2.70092","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/cit2.70092","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,8]],"date-time":"2026-03-08T18:11:39Z","timestamp":1772993499000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/cit2.70092"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,29]]},"references-count":64,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,2]]}},"alternative-id":["10.1049\/cit2.70092"],"URL":"https:\/\/doi.org\/10.1049\/cit2.70092","archive":["Portico"],"relation":{},"ISSN":["2468-6557","2468-2322"],"issn-type":[{"value":"2468-6557","type":"print"},{"value":"2468-2322","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,29]]},"assertion":[{"value":"2025-04-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}