The First Workshop on Benchmarking and Expanding AI Multimodal Approaches at CVPR 2025

BEAM2025

Workshop Time: June 11, 8:30am - 12:30pm CDT

Workshop Location: Music City Center and Online (Hybrid) | Room 202 C

The event will feature talks on the evolution and future of multimodal AI benchmarks, exploring methodologies and their real-world applications. It will conclude with a dynamic panel discussion, offering insights into the challenges of creating robust multimodal benchmarks and fostering interdisciplinary collaboration and innovation in AI benchmarking standards.

As a continuation of the 1st Workshop on Evaluation for Generative Foundation Models at CVPR 2024, the 1st Workshop on Benchmarking and Expanding AI Multimodal Approaches at CVPR 2025 aims to build a forum to discuss ongoing efforts in industry and academia, share best practices, and engage the community in working towards a more comprehensive AI evaluation framework that incorporates audio, visual, and textual inputs, addressing the limitations of current unimodal benchmarks.

Workshop Schedule

Opening Remarks
(Laszlo A. Jeni)

8:30 - 8:40 CDT

Keynote, Katerina Fragkiadaki
(Laszlo A. Jeni)

RobotArena: Towards Unlimited Robot Evaluation via Demonstration-to-Simulation Translation
Abstract 8:40 - 9:15 CDT

Keynote, Sherry Tongshuang Wu
(Mosam Dabhi)

Practical Evaluation of General-Purpose AI
Abstract 9:15 - 9:50 CDT

Oral Presentation
(Yang Zou)

  • TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark; Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad Morariu, Chitta Baral, Yezhou Yang
  • Choosing 'Right' from Wrong: A Closer Look at Selection Bias in Spatial Multiple-Choice Questions in Large Multimodal Models; Giselle Zeno, Nour Jedidi, Steven Gomez

9:50 - 10:05 CDT

Poster Session, Coffee Break
(Ananya Bal)

10:05 - 10:45 CDT

Keynote, Andre Araujo
(Xu Zhang)

What's missing in multimodal AI? Towards spatial awareness, effective tool use and fine-grained understanding
Abstract 10:45 - 11:20 CDT

Invited Paper Presentation from the Main Conference
(Liuyue Xie)

  • Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
    • Poster Session and Location: Exhibit Hall D, Poster #348 @ Sun June 15th 10:30 - 12:30 CDT (Poster Session 5)
    • Paper
  • VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
    • Poster Session and Location: Exhibit Hall D, Poster #396 @ Fri June 13th 16:00 - 18:00 CDT (Poster Session 2)
    • Paper
    • Poster
  • Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
    • Poster Session and Location: Exhibit Hall D, Poster #238 @ Sat June 14th 17:00 - 19:00 CDT (Poster Session 4)
    • Paper
    • Poster
    • Slides

11:20 - 11:40 CDT

Benchmark Promotion
(Zhaowei Cai)

11:40 - 11:45 CDT

Keynote, Huaxiu Yao
(Ce Zheng)

From Generation to Feedback: Benchmark-Guided Reward Modeling and Closed-Loop Alignment for Multimodal Systems
Abstract 11:45 - 12:20 CDT

Closing Remarks
(Davide Modolo)

12:20 - 12:30 CDT


Invited Speakers

Huaxiu Yao

Assistant Professor at University of North Carolina at Chapel Hill

Andre Araujo

Senior Staff Research Scientist / Tech Lead Manager at Google DeepMind

Sherry Tongshuang Wu

Assistant Professor at Carnegie Mellon University

Katerina Fragkiadaki

JPMorgan Chase Associate Professor at Carnegie Mellon University


Accepted Papers

  • Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models; Andrés Villa, Juan León, Alvaro Soto, Bernard Ghanem.

  • Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model; Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen.

  • Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models; Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang.

  • TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark; Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad Morariu, Chitta Baral, Yezhou Yang.

  • Choosing 'Right' from Wrong: A Closer Look at Selection Bias in Spatial Multiple-Choice Questions in Large Multimodal Models; Giselle Zeno, Nour Jedidi, Steven Gomez.

  • Quantum Federated Learning for Multimodal Data: A Modality-Agnostic Approach; Atit Pokharel, Ratun Rahman, Thomas Morris, Dinh C. Nguyen.

  • Revisiting Multi-Modal LLM Evaluation; Jian Lu, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan.

  • MerCulture: A Comprehensive Benchmark to Evaluate Vision-Language Models on Cultural Understanding in Singapore; Pranav Tushar, Eshan Pandey, Lyka Diane Bala Austria, Loo Yin Yin, Jing Hao Lim, Indriyati Atmosukarto, Donny Soh Cheng Lock.

  • KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language; Yoonshik Kim, Jaeyoon Jung.


Program Committee

George Z. Wei

Website Chair & Dataset Chair

CMU

Avik Kuthiala

Dataset Chair

Plus


Organizers

Laszlo A. Jeni

CMU (Primary Contact)

Morteza Ziyadi

Amazon

Hao Yang

Amazon

Xu Zhang

Amazon (Primary Contact)

Yang Zou

Amazon

Zhaowei Cai

Amazon

Maria Zontak

Amazon

Davide Modolo

Amazon

Ashwin Swaminathan

Amazon

Liuyue Xie

CMU

Mosam Dabhi

CMU

Xiang Yue

CMU

Ce Zheng

CMU

Rohan Choudhury

CMU

Ananya Bal

CMU

© 2025 BEAM2025

We thank Jalpc for the Jekyll template.