REAL-TSE Challenge

Real-world Target Speaker Extraction Challenge

Moving target speaker extraction research from simulated benchmarks
toward real-world conversational applications

Registration Opens

TBA

Challenge Starts

TBA

Submission Deadline

TBA

Introduction of the Challenge

Bridging the gap between simulation and reality

Target Speaker Extraction (TSE) has become a cornerstone of modern speech technology. Unlike blind speech separation that attempts to recover all sources, TSE focuses on extracting the voice of a specific target speaker from multi-speaker mixtures by leveraging auxiliary cues such as enrollment utterances and spatial information.

Despite rapid advances in neural TSE models, most prior works rely on simulated mixtures, limiting real-world applicability. These simulations combine utterances recorded in different acoustic conditions, thereby distorting loudness relationships, and they lack authentic room reverberation and ambient noise, producing an unnatural acoustic signature. Crucially, widely used datasets are dominated by read speech from fixed prompts and therefore miss the spontaneous turn-taking and reactive overlaps that characterize real conversational settings—a well-known challenge in the cocktail-party problem.

To close this gap, we propose the REAL-TSE Challenge (Real-world Target Speaker Extraction Challenge). The challenge is grounded in real conversational recordings spanning Mandarin and English and covering diverse scenarios (e.g., meetings and dinner-party interactions). It also provides multiple enrollment utterances per speaker, enabling robust, enrollment-aware conditioning. The goal is to evaluate TSE systems under natural, spontaneous conversational dynamics. Participants are required to build a TSE system that takes a real-world speech mixture and an enrollment utterance of the target speaker as input and outputs an estimate of the target speaker's signal; a minimal sketch of this interface is given after the track descriptions below. The recordings are captured with distant microphones in real-life scenarios such as meetings and dinner parties. The challenge consists of two complementary tracks:

1) Online (Causal) Track – focusing on streaming and low-latency TSE, where the model must operate under causality constraints and within a limited latency budget.

2) Offline (Global) Track – exploring context-aware extraction using full utterances, allowing for global optimization and non-causal modeling.

Together, these tracks form a comprehensive benchmark for practical TSE systems. By rooting evaluation in real acoustics and natural interaction, the REAL-TSE Challenge aims to move the field beyond synthetic benchmarks toward deployable, speaker-aware auditory intelligence.
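
To make the task contract concrete, here is a minimal sketch of the input/output interface a submission is expected to implement. The toy model, layer sizes, and 16 kHz sampling rate are illustrative assumptions, not part of the challenge specification or of the baseline.

```python
# Minimal sketch of the task interface (illustrative only): a TSE system
# receives a multi-speaker mixture and an enrollment utterance of the target
# speaker, and returns an estimate of that speaker's signal.
import torch
import torch.nn as nn


class ToyTSE(nn.Module):
    """Placeholder model that conditions the mixture on a speaker embedding."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mix_encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.spk_encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor, enrollment: torch.Tensor) -> torch.Tensor:
        # mixture:    (batch, 1, samples) distant-microphone mixture
        # enrollment: (batch, 1, samples) enrollment utterance of the target speaker
        mix_feat = self.mix_encoder(mixture)
        spk_emb = self.spk_encoder(enrollment).mean(dim=-1, keepdim=True)
        masked = mix_feat * torch.sigmoid(spk_emb)  # toy speaker conditioning
        return self.decoder(masked)                 # estimated target signal


if __name__ == "__main__":
    model = ToyTSE()
    mixture = torch.randn(1, 1, 16000)     # 1 s of 16 kHz audio (assumed rate)
    enrollment = torch.randn(1, 1, 32000)  # 2 s enrollment utterance
    estimate = model(mixture, enrollment)
    print(estimate.shape)                  # torch.Size([1, 1, 16000])
```

The official baseline framework (Wesep, linked in the Significance section) defines the reference implementation; the sketch above only illustrates which signals go in and which signal comes out.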

Significance of the Challenge

Advancing the frontier of speaker-aware auditory intelligence

The REAL-TSE Challenge addresses a fundamental bottleneck in the evolution of speech understanding systems—how to enable machines to focus on one speaker in a complex acoustic scene. This goal resonates strongly with the theme of Interspeech 2026, "Speaking Together," as it reflects the cognitive process of selective auditory attention in human communication.

The significance of this challenge can be summarized in three major aspects:
1. Bridging Research and Real-World Deployment
  • Most existing TSE datasets (e.g., LibriMix, WSJ0-2mix) rely on simulated mixtures that fail to capture true spatial, temporal, and conversational dynamics.
  • REAL-TSE provides the first real conversational benchmark for both online and offline TSE evaluation, serving as a critical stepping stone from laboratory research to real applications such as smart assistants, teleconferencing, and hearing devices.
2. Advancing Dual-Track Research Directions
  • The Online Track targets causal, low-latency, and computationally efficient TSE for real-time processing. Participants are encouraged to optimize latency and robustness under strict causal constraints, reflecting the real requirements of interactive systems.
  • The Offline Track promotes exploration of global modeling paradigms, including self-supervised learning, diffusion-based generation, and speaker-conditioned large models. This track focuses on pushing the frontier of speech extraction fidelity and perceptual naturalness.
  • Together, these tracks foster synergy between system efficiency and modeling quality, encouraging innovation across both algorithmic and architectural dimensions.
3. Driving Community Collaboration and Benchmark Standardization
  • By providing an open, reproducible, and ethically approved dataset and baseline framework (Wesep: https://github.com/wenet-e2e/wesep), this challenge aims to unify evaluation practices for real-world TSE.
  • The joint analysis of intelligibility (WER), speaker preservation (speaker similarity), and perceptual quality (MOS) establishes a multi-faceted metric standard for future studies (an illustrative scoring sketch follows this list).
  • We anticipate that the REAL-TSE Challenge will not only stimulate new research ideas but also serve as a long-term benchmark and platform for cross-community collaboration among ASR, separation, and speaker modeling researchers.
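
As a rough illustration of the three evaluation axes listed above, the sketch below scores a single example for intelligibility and speaker preservation. The tools shown (jiwer for WER, a placeholder speaker encoder for similarity) and the test signals are assumptions for illustration only, not the official scoring recipe.

```python
# Illustrative scoring of one extracted utterance on the three axes named
# above; library choices and the placeholder encoder are assumptions.
import jiwer
import torch
import torch.nn.functional as F

# 1) Intelligibility: WER between the reference transcript and the ASR
#    transcript obtained on the extracted (estimated) signal.
reference = "let's move the meeting to friday"
hypothesis = "let's move the meeting friday"   # ASR output on the TSE estimate
wer = jiwer.wer(reference, hypothesis)

# 2) Speaker preservation: cosine similarity between speaker embeddings of
#    the enrollment utterance and of the extracted signal.
def speaker_embedding(waveform: torch.Tensor) -> torch.Tensor:
    # Placeholder for a pretrained speaker encoder (e.g., an ECAPA-style
    # model); here we just pool magnitude-spectrum statistics.
    spec = torch.stft(waveform, n_fft=512, return_complex=True).abs()
    return spec.mean(dim=-1)  # (batch, freq_bins)

enrollment = torch.randn(1, 32000)  # enrollment utterance (assumed 16 kHz)
estimate = torch.randn(1, 16000)    # extracted target signal
similarity = F.cosine_similarity(
    speaker_embedding(enrollment), speaker_embedding(estimate), dim=-1
).item()

# 3) Perceptual quality: MOS comes from listening tests or a neural MOS
#    predictor; there is no single standard library call, so it is omitted.
print(f"WER = {wer:.3f}, speaker similarity = {similarity:.3f}")
```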

Challenge Tracks

Choose your track and compete!

Track 1: Online (Causal)

Real-Time

Focusing on streaming and low-latency TSE for real-time processing.

Track 2: Offline (Global)

High Quality

Exploring context-aware extraction using full utterances and global optimization.

Rules for Participation

Guidelines for submission and evaluation

Online Track Requirements
  1. Causality Constraint: Systems must be strictly causal and must not access any future frame information.
  2. Real-Time Requirement: The end-to-end algorithmic latency must not exceed 100 ms.
  3. Streaming Output: Models should support frame-wise or chunk-wise input with continuous output generation, without relying on full-utterance buffering (see the sketch after this list).
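
A minimal sketch of a chunk-wise causal processing loop that satisfies these three constraints appears below. The 20 ms chunk size, the toy per-chunk model step, and the 16 kHz sampling rate are illustrative assumptions, not values mandated by the challenge.

```python
# Chunk-wise causal streaming sketch: each output chunk depends only on the
# current chunk and cached past state, and the algorithmic latency equals
# one chunk (20 ms here), well within the 100 ms budget.
import torch

SAMPLE_RATE = 16000               # assumed sampling rate
CHUNK = int(0.02 * SAMPLE_RATE)   # 20 ms chunks -> 20 ms algorithmic latency


class CausalState:
    """Whatever recurrent or cached state the model carries between chunks."""
    def __init__(self):
        self.prev = None


def process_chunk(chunk: torch.Tensor, spk_emb: torch.Tensor,
                  state: CausalState) -> torch.Tensor:
    # Placeholder for one causal model step: it may use the current chunk,
    # the speaker embedding, and cached past context, but never future samples.
    if state.prev is None:
        state.prev = torch.zeros_like(chunk)
    out = 0.5 * chunk + 0.5 * state.prev  # toy causal smoothing
    state.prev = chunk
    return out


def stream(mixture: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
    # Frame-/chunk-wise input with continuous output: no full-utterance buffering.
    state, outputs = CausalState(), []
    for start in range(0, mixture.shape[-1] - CHUNK + 1, CHUNK):
        outputs.append(process_chunk(mixture[..., start:start + CHUNK], spk_emb, state))
    return torch.cat(outputs, dim=-1)


if __name__ == "__main__":
    est = stream(torch.randn(1, SAMPLE_RATE), spk_emb=torch.randn(1, 192))
    print(est.shape)  # torch.Size([1, 16000])
```

Any design whose per-chunk computation uses only current and past input, and whose total algorithmic latency (chunk length plus any lookahead) stays within 100 ms, is consistent with these requirements.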
Offline Track Requirements
  1. Full Context Access: Systems may utilize the entire utterance and global context information.
  2. No Restriction on Architecture or Complexity: There is no limitation on model type, inference duration, or computational cost.
  3. Encouraged Exploration: Participants are encouraged to explore high-performance architectures, long-context modeling, and generation-based methods.
General Submission Rules
  • All submissions must include detailed system descriptions for reproducibility.
  • Each team may submit up to three valid systems per track.
  • Teams are encouraged to make their models or scripts publicly available after the challenge.

Important Dates

Mark your calendar

Registration Opens

TBA

Register your team and get access to the challenge data

Development Phase

TBA

Train and develop your TSE systems using the provided datasets

Evaluation Phase

TBA

Submit your system outputs for evaluation

System Description Deadline

TBA

Submit a detailed technical report describing your system

Results Announcement

TBA

Winners announced and leaderboard finalized

Registration & Resources

Get started with the challenge

Registration Coming Soon

Registration details and the challenge platform will be announced shortly. Stay tuned for updates!

Organizers

Meet the challenge organizers

Information will be announced soon.