2nd GenAI Media Generation Challenge Workshop @ CVPR2025

Jun 11th (9:00 - 12:00)
Workshop Meeting Location: 110A


Workshop Overview

This year, we are excited to host the 2nd GenAI Media Generation Challenge Workshop at CVPR 2025. Building on the success of last year's event, which focused on text-to-image and image editing tasks, we are expanding the challenge to include video generation.

We are proud to announce the launch of the 2nd GenAI Media Generation Challenge (MAGIC), featuring a media generation track and an auto-evaluation track:

  • Media Generation Festival: For the first time, we are organizing a media generation festival with no restrictions on prompts. We define a set of topics in which submitted media compete, and participants can submit their best generated videos or images for those topics. For each topic, we run a crowd-sourced voting mechanism to determine the winners.
  • Auto Evaluation Challenge: We are introducing an auto evaluation challenge for both text-to-image and text-to-video tasks. Participants develop and submit auto evaluation scores for a preselected set of images and videos that we provide, which are also entered into the media generation festival track. Auto evaluation submissions should predict the outcomes of the crowd-sourced voting mechanism in the media generation festival; the auto evaluation method that achieves the best correlation with the final results wins this challenge (see the sketch below).
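
As an informal illustration of what "correlation with the final results" could look like, the sketch below computes a Spearman rank correlation between hypothetical auto-eval scores and crowd-vote results. The item names, scores, and the choice of correlation measure are illustrative assumptions, not the official scoring procedure.

    from scipy.stats import spearmanr  # assumed dependency: pip install scipy

    # Hypothetical per-item scores: higher means better in both cases.
    auto_eval_scores = {"item_a": 0.91, "item_b": 0.40, "item_c": 0.73}
    crowd_vote_scores = {"item_a": 1520, "item_b": 1180, "item_c": 1445}  # e.g. final Elo ratings

    items = sorted(auto_eval_scores)
    rho, _ = spearmanr([auto_eval_scores[i] for i in items],
                       [crowd_vote_scores[i] for i in items])
    print(f"Spearman correlation with crowd votes: {rho:.3f}")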


Crowdsourced Voting Platform is Open!

Click the link below to vote for your favorite AI-generated media.

Vote Now


Submission for "Challenge A - Media Generation Festival" is Open!

Click the link below to share your generated images or videos with us. See Submission Instruction.

Submit Now


Submission for "Challenge B - Auto Evaluation" is Open!

Click the link below to share your auto eval results with us. See Submission Instruction.

Submit Now


Challenges Overview

Challenge A - Media Generation Festival

Track A.1 - Video (Short)

For this track, there are no restrictions on prompts or models; submissions are only constrained to the topics listed below. Participants are free to use any models, including third-party video generation tools, and are encouraged to design their own creative prompts to produce engaging videos. Teams may submit to a single topic or to multiple topics. However, the video length is limited to a maximum of 10 seconds.

For this track, we will have the following set of topics:

  1. People
  2. Animals
  3. Landscape
Track A.2 - Video (Long)

Similar to Track A.1, there are no restrictions on prompts or models; submissions are only constrained to the topics listed below. For this track, however, the video length is limited to a maximum of 5 minutes.

For this track, we will have the following set of topics:

  1. Action
  2. History
  3. Sci-Fi
  4. Fantasy
  5. Comedy
Track A.3 - Images

For this track, there are no restrictions on prompts or models; submissions are only constrained to the topics listed below. Participants are free to use any models, including third-party image generation tools, and are encouraged to design their own creative prompts to produce engaging images.

For this track, we will have the following set of topics:

  1. People
  2. Animals
  3. Landscape
Evaluation Protocol

All submitted videos/images will be uploaded to our crowdsourcing platform for public voting. The top three submissions with the highest votes for each topic will be declared winners. Beyond the per-topic winners, we will also name an overall winner for the best-performing submission across all topics. We use the Elo rating system to compute the final ranking across submissions.

For the voting setup, we use pairwise comparisons: voters are shown two videos or images along with the topic and asked "Which one would you prefer?". Responses are given on a 3-point win/tie/lose scale.
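
As an informal illustration of how pairwise win/tie/lose votes can be turned into Elo ratings, here is a minimal Python sketch. The K-factor, initial rating, and vote data are illustrative assumptions and not the official scoring code.

    from collections import defaultdict

    K = 32               # assumed K-factor; the official value may differ
    INIT_RATING = 1000   # assumed starting rating for every submission

    def expected(r_a, r_b):
        # Expected score of A against B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(ratings, a, b, outcome):
        # outcome: 1.0 = A preferred, 0.5 = tie, 0.0 = B preferred.
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (outcome - e_a)
        ratings[b] += K * ((1.0 - outcome) - (1.0 - e_a))

    # Hypothetical pairwise votes collected from the voting platform.
    votes = [("team1_video", "team2_video", 1.0),
             ("team2_video", "team3_video", 0.5),
             ("team3_video", "team1_video", 0.0)]

    ratings = defaultdict(lambda: INIT_RATING)
    for a, b, outcome in votes:
        update(ratings, a, b, outcome)

    # Final ranking: highest rating first.
    for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")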

Submission Instruction
For this challenge, the only restrictions are the topics (see above) and the video length (at most 10 seconds for Track A.1 and 5 minutes for Track A.2). Please upload your videos or images, indicating the topic and track. We do not limit the number of submissions per team.

Challenge B - Results Prediction with Auto Evaluation (Artifacts/Flaws)

In this challenge, we provide (text, image) pairs and (text, video) pairs on which participants run their own auto evaluation. Participants submit a binary classification for each item, indicating whether the media (image or video) has flaws/artifacts.
Track B.1 - Video Generation Auto Eval

For this track, we use Movie Gen Video Bench for benchmarking. Each participant will be asked to download the 1003 videos and prompts from Movie Gen Video Bench and run their auto eval models to classify whether each of the 1003 videos has artifacts or flaws.

Track B.2 - Image Generation Auto Eval

For this track, we use Emu_1k for benchmarking. Each participant can download the 1000 Emu-generated images and prompts from the benchmark and run their auto eval models to indicate whether there are flaws or artifacts in each image.

Evaluation Protocol

All submitted results will be evaluated against our internal human annotations.
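
The exact comparison metric is not specified here; as an informal sketch, one simple way to score a submission against reference labels is per-item accuracy and F1. The file names "submission.txt" and "reference.txt" below are hypothetical, with one 0/1 label per line in benchmark order.

    def load_labels(path):
        # Read one 0/1 label per non-empty line.
        with open(path) as f:
            return [int(line.strip()) for line in f if line.strip()]

    preds = load_labels("submission.txt")   # hypothetical participant file
    refs = load_labels("reference.txt")     # hypothetical human annotations
    assert len(preds) == len(refs), "line counts must match"

    accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
    tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))
    fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))
    fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"accuracy={accuracy:.3f}  f1={f1:.3f}")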

Submission Instruction

To participate in Track B.1, please follow these steps:

  1. Download the 1003 videos, indexed from 0 to 1002, from the Movie Gen Video Bench.
  2. Run your auto-evaluation model to classify these videos. You can also utilize the prompts and meta information provided in the Movie Gen Video Bench to enhance your evaluation.
  3. To submit your results, prepare a text file with 1003 lines. Each line indicates whether the corresponding video has flaws or artifacts in the generation.
Example of a submitted txt file.
1
0
...
1
      
In this submission,
  • The first video (0.mp4) has flaws or artifacts, so the first line is 1.
  • The second video (1.mp4) has no flaws, so the second line is 0.
  • The last video (1002.mp4) has artifacts, so the last line is 1.
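
A minimal sketch of producing such a file is shown below. The local folder name and the has_artifacts placeholder are hypothetical; the placeholder should be replaced by your own auto-eval model, which may also take the prompt and meta information as input.

    from pathlib import Path

    VIDEO_DIR = Path("movie_gen_video_bench")  # assumed local folder holding 0.mp4 ... 1002.mp4
    NUM_VIDEOS = 1003

    def has_artifacts(video_path):
        # Placeholder: replace with your auto-eval model's prediction for this video.
        return False

    with open("track_b1_submission.txt", "w") as f:
        for i in range(NUM_VIDEOS):
            label = 1 if has_artifacts(VIDEO_DIR / f"{i}.mp4") else 0
            f.write(f"{label}\n")  # line i+1 corresponds to i.mp4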


For Track B.2, please follow these instructions:

  1. Download the "images.zip" file from the Emu 1k benchmark.
  2. Inside "images.zip", you will find 1000 images along with their corresponding prompts. For example, "000000.jpg" has its prompt in "000000.txt".
  3. To submit your results, create a text file with 1000 lines. Each line should indicate whether this image has flaws or artifacts.
Example of a submitted txt file.
1
0
...
1
      
In this submission,
  • The first image (000000.jpg) has flaws, so the first line is 1.
  • The second image (000001.jpg) has no flaws, so the second line is 0.
  • The last image (000999.jpg) has flaws, so the last line is 1.
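
A minimal sketch of iterating over the archive and writing the 1000-line file is shown below. The has_flaws placeholder is hypothetical and should be replaced by your own auto-eval model, and the member paths assume the images and prompt files sit at the root of images.zip (adjust if the archive contains a top-level folder).

    import zipfile

    def has_flaws(image_bytes, prompt):
        # Placeholder: replace with your auto-eval model's prediction for this image.
        return False

    with zipfile.ZipFile("images.zip") as zf, open("track_b2_submission.txt", "w") as out:
        for i in range(1000):
            stem = f"{i:06d}"                                        # 000000 ... 000999
            image_bytes = zf.read(f"{stem}.jpg")                     # generated image
            prompt = zf.read(f"{stem}.txt").decode("utf-8").strip()  # matching prompt
            label = 1 if has_flaws(image_bytes, prompt) else 0
            out.write(f"{label}\n")                                  # line i+1 corresponds to {stem}.jpg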


Leaderboard

TBA


Important Dates

Description Date
Submission opens for all challenges 3/3/2025
Submission closes for Challenge A 4/14/2025
Crowd-sourced polling opens 4/21/2025
Submission closes for Challenge B 6/2/2025
Crowd-sourced polling ends 6/2/2025
Workshop date 6/11/2025


Workshop Schedule

Session Time Agenda Speech Title Speaker(s)
Opening Session 09:15 - 09:30 Opening Remarks Ji Hou
Session I 09:30 - 10:00 Keynote Speech TBA Björn Ommer
Session I 10:00 - 10:30 Keynote Speech TBA Haoqi Fan
Break 10:30 - 10:50 Coffee Break
Session II 10:50 - 11:20 Keynote Speech TBA Jun-Yan Zhu
Session II 11:20 - 11:50 Keynote Speech TBA Saining Xie
Closing Session 11:50 - 12:00 Closing Remarks Yaqiao Luo


Invited Speakers

Björn Ommer Dr. Björn Ommer is a full professor at LMU, where he heads the Computer Vision & Learning Group (previously the Computer Vision Group Heidelberg). Previously, he was a full professor in the Department of Mathematics and Computer Science at Heidelberg University, where he also served as one of the directors of the Interdisciplinary Center for Scientific Computing (IWR) and of the Heidelberg Collaboratory for Image Processing (HCI). He has served as program chair for GCPR, as Senior Area Chair and Area Chair for multiple CVPR, ICCV, ECCV, and NeurIPS conferences, and as a workshop and tutorial organizer at these venues.


Jun-Yan Zhu Dr. Jun-Yan Zhu is the Michael B. Donohue Assistant Professor of Computer Science and Robotics at CMU’s School of Computer Science. Prior to joining CMU, he was a Research Scientist at Adobe Research and a postdoc at MIT CSAIL. He obtained his Ph.D. from UC Berkeley and B.E. from Tsinghua University. He studies computer vision, computer graphics, and computational photography. His current research focuses on generative models for visual storytelling. He is the recipient of the Samsung AI Researcher of the Year Award, the Packard Fellowship for Science and Engineering, the NSF CAREER Award, the ACM SIGGRAPH Outstanding Doctoral Dissertation Award, and the UC Berkeley EECS David J. Sakrison Memorial Prize for outstanding doctoral research, among other awards.


Haoqi Fan Haoqi Fan is a Research Scientist at Seed Edge, where he leads efforts to build world foundational models. He spent seven years at Facebook AI Research (FAIR), focusing on self-supervised learning and backbone design for image and video understanding. His works won the ActivityNet Challenge at ICCV 2019 and were nominated for Best Paper at CVPR 2020. He has also co-organized several tutorials at CVPR, ICCV, and ECCV.


Saining Xie Dr. Saining Xie is an Assistant Professor of Computer Science at NYU Courant and part of the CILVR group. He is also affiliated with the NYU Center for Data Science. Before that, he was a research scientist at Facebook AI Research (FAIR), Menlo Park. He received his Ph.D. and M.S. degrees from the CSE Department at UC San Diego, advised by Zhuowen Tu. During his Ph.D. study, he also interned at NEC Labs, Adobe, Facebook, Google, and DeepMind. Prior to that, he obtained his bachelor's degree from Shanghai Jiao Tong University. His primary research interests are computer vision and machine learning.


Organizers



Senior Advisors


Contact

To contact the organizers, please use [email protected]



Acknowledgments

Thanks to languagefor3dscenes for the webpage format.