
Jian Liu
Ant Group
EgoLink (Egocentric Language-Vision Interactive Network Knowledge Challenge) is designed to advance embodied AI in complex, real-world egocentric scenarios. Moving beyond traditional navigation or static object perception, the challenge holistically evaluates a model's capacity for social reasoning and interactive task execution. It requires intelligent agents not only to perceive emotional cues, understand causal relationships, and predict behavioral intent in human interactions, but also to actively solve daily-life tasks through multimodal dialogue, dynamic tool use, and autonomous planning. EgoLink aims to foster integrated intelligence in which perception, reasoning, and decision-making are tightly coupled in unstructured social environments.
Track 1: Social Reasoning in Egocentric Video. Rather than testing only navigation or object perception, this track evaluates emotional perception, causal understanding, behavioral intent prediction, and semantic summarization in real-world human-interaction scenes captured in egocentric video.
The challenge is built on E3 (Exploring Embodied Emotion) and introduces a unified MCQ-based protocol designed to be objective, reproducible, and accessible to both multimodal learning and embodied AI communities.
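To make the MCQ-based protocol concrete, the sketch below shows how accuracy could be computed over predicted multiple-choice answers keyed by question ID. The answer format (single letters per question) and the function name are illustrative assumptions, not the official EgoLink evaluation code.

```python
# Hypothetical sketch of a unified MCQ accuracy metric; the answer-file
# format (question ID -> option letter) is an assumption, not the
# official EgoLink specification.
def score_mcq(predictions: dict, ground_truth: dict) -> float:
    """Accuracy over multiple-choice answers keyed by question ID.

    Unanswered questions count as incorrect, so the metric stays
    comparable across submissions with missing predictions.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, answer in ground_truth.items()
        if predictions.get(qid, "").strip().upper() == answer.strip().upper()
    )
    return correct / len(ground_truth)

# Toy example: 2 of 3 answers match.
gt = {"q1": "A", "q2": "C", "q3": "B"}
pred = {"q1": "a", "q2": "B", "q3": "B"}
print(round(score_mcq(pred, gt), 3))  # → 0.667
```

Normalizing case and whitespace before comparison keeps the protocol robust to minor formatting differences between submissions.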
Track 2: Interactive Agent Challenge: Multimodal Interaction Task Execution in Social Life Scenarios. This track evaluates how well an intelligent agent can solve real-world tasks in complex and dynamic social-life environments through tool use.
Unlike traditional static QA or single-modality recognition tasks, this track requires the model to act as an interactive agent. The agent receives first-person visual streams (e.g., shopping, ordering food, and other high-frequency daily scenarios), combines them with the user's natural-language instructions and the available external tools, and completes the user's goals through multi-turn dialogue, accurate tool invocation, autonomous planning, and closed-loop execution.
Beyond visual perception accuracy, this track primarily assesses the model's integrated intelligence in unstructured environments, where perception, reasoning, and action must be tightly coupled.
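The closed-loop behavior described above can be sketched as a minimal perceive–plan–act turn. Everything here is hypothetical: the `Agent` interface, the tool names, and the rule-based planner (which stands in for a multimodal model policy) are illustrative assumptions, not the challenge's official API.

```python
# Illustrative perceive-plan-act loop; the Agent interface and tool
# names are hypothetical, not part of the official Track 2 API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    tools: dict = field(default_factory=dict)   # name -> callable
    history: list = field(default_factory=list) # dialogue + tool log

    def step(self, frame: str, instruction: str) -> str:
        """One closed-loop turn: perceive input, plan a tool call,
        execute it, and record the result for later turns."""
        self.history.append(("user", instruction))
        tool_name, arg = self.plan(frame, instruction)  # autonomous planning
        result = self.tools[tool_name](arg)             # tool invocation
        self.history.append(("tool", result))
        return result

    def plan(self, frame: str, instruction: str):
        # Toy rule-based planner standing in for a multimodal LLM policy.
        if "price" in instruction.lower():
            return "lookup_price", frame
        return "search", instruction

agent = Agent(tools={
    "lookup_price": lambda item: f"{item}: $3.50",
    "search": lambda query: f"results for '{query}'",
})
print(agent.step("apple", "What is the price?"))  # → apple: $3.50
```

Keeping a shared history across turns is what makes the loop "closed": each tool result feeds back into the context used for the next planning step.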
Track 1
The Track 1 data builds upon E3 (Exploring Embodied Emotion), a pioneering large-scale egocentric video benchmark for embodied emotion understanding. While E3 provides foundational egocentric video data with emotion annotations, EgoLink transforms this resource into a comprehensive social reasoning benchmark through systematic re-annotation and task reformulation. The MCQ construction scheme follows the methodology of Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models.
We will provide the constructed MCQ training set, the MCQ validation set, and the original E3 annotation labels to participants for model training.
Track 2
This track provides benchmark data only and does not release a training set.
ACM Multimedia 2026 is an on-site event only. This means that all papers and contributions must be presented by a physical person on-site; remote presentations will not be hosted or allowed. Papers and contributions not presented on-site will be considered a no-show and removed from the proceedings of the conference. More details will be provided to handle unfortunate situations in which none of the authors would be able to attend the conference physically.
Organising committee.

Ant Group

Ant Group

Zhejiang University

Zhejiang University
Challenge chairs and core team.

Zhejiang University

Ant Group

Ant Group

Ant Group

Ant Group

Ant Group

The Chinese University of Hong Kong (CUHK)

Ant Group
License: CC BY-NC-SA 4.0, for non-commercial research and educational use only.