
Jian Liu
Ant Group
EgoLink (Egocentric Language-Vision Interactive Network Knowledge Challenge) is designed to advance embodied AI in complex, real-world egocentric scenarios. Moving beyond traditional navigation or static object perception, the challenge holistically evaluates a model's capacity for social reasoning and interactive task execution. It requires intelligent agents not only to perceive emotional cues, understand causal relationships, and predict behavioral intent in human interactions, but also to actively solve daily-life tasks through multimodal dialogue, dynamic tool use, and autonomous planning. EgoLink aims to foster integrated intelligence in which perception, reasoning, and decision-making are tightly coupled in unstructured social environments.
Track 1: Social Reasoning in Egocentric Video. Rather than testing only navigation or object perception, this track evaluates emotional perception, causal understanding, behavioral intent prediction, and semantic summarization in real-world human-interaction scenes captured in egocentric video.
The challenge is built on E3 (Exploring Embodied Emotion) and introduces a unified MCQ-based protocol designed to be objective, reproducible, and accessible to both multimodal learning and embodied AI communities.
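To make the MCQ-based protocol concrete, the sketch below shows how accuracy could be computed over predicted multiple-choice answers keyed by question ID. The answer format (single letters per question) and the function name are illustrative assumptions, not the official EgoLink evaluation code.

```python
# Hypothetical sketch of a unified MCQ accuracy metric; the answer-file
# format (question ID -> option letter) is an assumption, not the
# official EgoLink specification.
def score_mcq(predictions: dict, ground_truth: dict) -> float:
    """Accuracy over multiple-choice answers keyed by question ID.

    Unanswered questions count as incorrect, so the metric stays
    comparable across submissions with missing predictions.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, answer in ground_truth.items()
        if predictions.get(qid, "").strip().upper() == answer.strip().upper()
    )
    return correct / len(ground_truth)

# Toy example: 2 of 3 answers match.
gt = {"q1": "A", "q2": "C", "q3": "B"}
pred = {"q1": "a", "q2": "B", "q3": "B"}
print(round(score_mcq(pred, gt), 3))  # → 0.667
```

Normalizing case and whitespace before comparison keeps the protocol robust to minor formatting differences between submissions.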
Track 2: Interactive Agent Challenge: Multimodal Interaction Task Execution in Social Life Scenarios. This track evaluates how well an intelligent agent can solve real-world tasks in complex and dynamic social-life environments through tool use.
Unlike traditional static QA or single-modality recognition tasks, this track requires the model to act as an interactive agent. The agent receives first-person visual streams (e.g., shopping, ordering food, and other high-frequency daily scenarios), combines them with the user's natural-language instructions and the available external tools, and completes the user's goals through multi-turn dialogue, accurate tool invocation, autonomous planning, and closed-loop execution.
Beyond visual perception accuracy, this track primarily assesses the model's integrated intelligence in unstructured environments, where perception, reasoning, and action must be tightly coupled.
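The closed-loop behavior described above can be sketched as a minimal perceive–plan–act turn. Everything here is hypothetical: the `Agent` interface, the tool names, and the rule-based planner (which stands in for a multimodal model policy) are illustrative assumptions, not the challenge's official API.

```python
# Illustrative perceive-plan-act loop; the Agent interface and tool
# names are hypothetical, not part of the official Track 2 API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    tools: dict = field(default_factory=dict)   # name -> callable
    history: list = field(default_factory=list) # dialogue + tool log

    def step(self, frame: str, instruction: str) -> str:
        """One closed-loop turn: perceive input, plan a tool call,
        execute it, and record the result for later turns."""
        self.history.append(("user", instruction))
        tool_name, arg = self.plan(frame, instruction)  # autonomous planning
        result = self.tools[tool_name](arg)             # tool invocation
        self.history.append(("tool", result))
        return result

    def plan(self, frame: str, instruction: str):
        # Toy rule-based planner standing in for a multimodal LLM policy.
        if "price" in instruction.lower():
            return "lookup_price", frame
        return "search", instruction

agent = Agent(tools={
    "lookup_price": lambda item: f"{item}: $3.50",
    "search": lambda query: f"results for '{query}'",
})
print(agent.step("apple", "What is the price?"))  # → apple: $3.50
```

Keeping a shared history across turns is what makes the loop "closed": each tool result feeds back into the context used for the next planning step.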
Track 1
The Track 1 data builds upon E3 (Exploring Embodied Emotion), a pioneering large-scale egocentric video benchmark for embodied emotion understanding. While E3 provides foundational egocentric video data with emotion annotations, EgoLink transforms this resource into a comprehensive social reasoning benchmark through systematic re-annotation and task reformulation. The MCQ construction scheme follows the methodology of Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models.
We will provide the constructed MCQ training set, the MCQ validation set, and the original E3 annotation labels to participants for model training.
Track 2
This track provides benchmark data only and does not release a training set.
ACM Multimedia 2026 is an on-site event only. This means that all papers and contributions must be presented by a physical person on-site; remote presentations will not be hosted or allowed. Papers and contributions not presented on-site will be considered a no-show and removed from the proceedings of the conference. More details will be provided to handle unfortunate situations in which none of the authors would be able to attend the conference physically.
Organising committee.

Ant Group

Ant Group

Zhejiang University

Zhejiang University
Challenge chairs and core team.

Zhejiang University

Ant Group

Ant Group

Ant Group

Ant Group

Ant Group

The Chinese University of Hong Kong (CUHK)

Ant Group
License: CC BY-NC-SA 4.0, for non-commercial research and educational use only.