Moving target speaker extraction research from simulated benchmarks toward real-world conversational applications
TBA
TBA
TBA
Target Speaker Extraction (TSE) has become a cornerstone of modern speech technology. Unlike blind speech separation that attempts to recover all sources, TSE focuses on extracting the voice of a specific target speaker from multi-speaker mixtures by leveraging auxiliary cues such as enrollment utterances and spatial information.
Despite rapid advances in neural TSE models, most prior work relies on simulated mixtures, which limits real-world applicability. These simulations combine utterances recorded under different acoustic conditions, which distorts the loudness relationships between speakers, and they lack authentic room reverberation and ambient noise, so the resulting mixtures have an unnatural acoustic signature. Crucially, widely used datasets are dominated by read speech from fixed prompts and therefore miss the spontaneous turn-taking and reactive overlaps that characterize real conversations, a well-known difficulty in the cocktail-party problem.
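To make the limitation concrete, the following is a minimal sketch of how a simulated two-speaker mixture is commonly built: two independently recorded signals are rescaled to a chosen signal-to-interference ratio (SIR) and summed. The function names, the `sir_db` parameter, and the RMS-based scaling are illustrative assumptions, not the procedure of any particular dataset. The point is that the loudness relationship is imposed by the script rather than produced by speakers sharing a room.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of a 1-D signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def simulate_mixture(target, interferer, sir_db):
    """Mix two independently recorded signals at a chosen SIR (hypothetical helper).

    Returns the mixture, the trimmed target, and the rescaled interferer.
    Assumes both inputs are non-silent 1-D float arrays at the same sample rate.
    """
    # Trim both signals to the shorter length so they can be summed.
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]

    # Choose a gain so that 20*log10(rms(target) / rms(scaled)) equals sir_db.
    # This gain is picked by the simulation, not by the acoustics of a scene.
    gain = rms(target) / (rms(interferer) * 10 ** (sir_db / 20))
    scaled = gain * interferer

    # Plain sample-wise addition: no shared room response or ambient noise.
    return target + scaled, target, scaled
```

In a real recording there is no such `gain` to choose: relative levels follow from where people sit and how they speak, which is the condition the challenge data is meant to capture.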
To close this gap, we propose the REAL-TSE Challenge (Real-world Target Speaker Extraction Challenge). The challenge is grounded in real conversational recordings in Mandarin and English, captured with distant microphones in real-life scenarios such as meetings and dinner parties. It also provides multiple enrollment utterances per speaker, enabling robust, enrollment-aware conditioning. The goal is to evaluate TSE systems under natural, spontaneous conversational dynamics.
Participants are required to design a TSE system that takes a real-world speech mixture and an enrollment speech sample of the target speaker as input, and outputs an estimate of the target speaker's signal extracted from the mixture. The challenge consists of two complementary tracks:
1) Online (Causal) Track – focusing on streaming and low-latency TSE, where the model must operate under causality and limited latency.
2) Offline (Global) Track – exploring context-aware extraction using full utterances, allowing for global optimization and non-causal modeling.
Together, these tracks form a comprehensive benchmark for practical TSE systems. By rooting evaluation in real acoustics and natural interaction, the REAL-TSE Challenge aims to move the field beyond synthetic benchmarks toward deployable, speaker-aware auditory intelligence.
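The practical difference between the two tracks is the causality constraint, which can be sketched as follows. This is a toy illustration under stated assumptions, not the challenge's baseline or official interface: the function names, the `chunk` and `history` values, and the moving-average filters standing in for a neural extractor are all hypothetical, and the placeholder ignores the enrollment signal that a real system would condition on.

```python
import numpy as np


def extract_offline(mixture, enrollment, half_width=2):
    """Offline-style placeholder: each output sample may use past and future input."""
    # A real system would condition on `enrollment`; this placeholder does not.
    kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    # Centered moving average: output k depends on inputs k-half_width..k+half_width.
    return np.convolve(mixture, kernel, mode="same")


def extract_online(mixture, enrollment, chunk=160, history=4):
    """Online-style placeholder: consumes fixed-size chunks and only looks backward."""
    if len(mixture) == 0:
        return np.zeros(0)
    kernel = np.ones(history + 1) / (history + 1)
    past = np.zeros(history)  # state carried from one chunk to the next
    out = []
    for start in range(0, len(mixture), chunk):
        block = mixture[start:start + chunk]
        padded = np.concatenate([past, block])
        # "valid" mode yields one output per block sample; output k depends
        # only on inputs k-history..k, so no future samples are used.
        out.append(np.convolve(padded, kernel, mode="valid"))
        past = padded[len(padded) - history:]
    return np.concatenate(out)


# Usage: changing the future of the mixture must not change past online outputs.
rng = np.random.default_rng(0)
mixture = rng.standard_normal(1000)
enrollment = rng.standard_normal(500)
altered = mixture.copy()
altered[600:] = 0.0
same_past = np.allclose(extract_online(mixture, enrollment)[:600],
                        extract_online(altered, enrollment)[:600])
```

Here `same_past` is true for the online placeholder, while the same check on `extract_offline` fails near the edit point because the centered filter looks ahead. A causal model must pass this kind of check, whereas an offline model is free to use the whole recording.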
The REAL-TSE Challenge addresses a fundamental bottleneck in the evolution of speech understanding systems—how to enable machines to focus on one speaker in a complex acoustic scene. This goal resonates strongly with the theme of Interspeech 2026, "Speaking Together," as it reflects the cognitive process of selective auditory attention in human communication.
In summary, the REAL-TSE Challenge represents a timely and necessary effort to move target speaker extraction research from simulated benchmarks toward real-world and conversational applications—an essential step in enabling machines to truly "listen and speak together" with us.
Online (Causal) Track: streaming, low-latency TSE for real-time processing.
Offline (Global) Track: context-aware extraction using full utterances and global optimization.
Participants may submit results for one or both tracks of the Challenge.
TBA: Register your team and get access to the challenge data
TBA: Train and develop your TSE systems using the provided datasets
TBA: Submit your system outputs for evaluation
TBA: Submit a detailed technical report describing your system
TBA: Winners announced and leaderboard finalized
Registration details and the challenge platform will be announced shortly. Stay tuned for updates!
Information will be announced soon.