Moving target speaker extraction research from simulated benchmarks toward real-world conversational applications
A satellite challenge of IEEE SLT 2026
The REAL-TSE Challenge, organized as a satellite challenge of IEEE SLT 2026, focuses on Target Speaker Extraction (TSE) in real conversational environments. Given a multi-speaker mixture recorded in real-life settings such as meetings and dinner-party interactions, together with enrollment utterance(s) from a target speaker, the task is to recover the target speaker’s speech from the mixture. In contrast to conventional benchmarks built on simulated or read-speech data, REAL-TSE emphasizes naturally occurring overlap, authentic reverberation, ambient noise, and conversational dynamics in Mandarin and English, providing a more practical testbed for real-world TSE research.
The REAL-TSE Challenge provides an official development set and evaluation set, while model training is restricted to eligible external open-source data under the rules summarized below.
The challenge does not define an official training set. Participants may use any open-source datasets for training, and pre-trained models are allowed, provided that all data sources and checkpoints are clearly documented.
Usage: Every dataset and pre-trained model must be reported in the system description with citations or links. The official training splits of AliMeeting, AISHELL-4, AMI, DipCo, and CHiME6 are permitted, but their development and test sets must not be used at any stage, including pre-training, training, or data augmentation. Note that some corpora (e.g., AMI) have multiple partition schemes; please refer to the FAQ entry "What data can I use for training?" for details on which sessions are excluded.
The development set is derived from REAL-T, a real-world conversational dataset for target speaker extraction (to be released soon). The data originates from five speaker diarization corpora—AISHELL-4, AliMeeting, AMI, DipCo, and CHiME6—covering Mandarin and English in scenarios such as meetings and dinner parties. Each sample includes a multi-speaker mixture, enrollment utterance(s) for the target speaker, and the clean target reference.
The data is constructed through an automated pipeline that extracts naturally overlapping segments as mixtures and selects non-overlapping speech segments of at least 5 seconds as enrollment utterances. Unlike synthetic mixtures, these recordings capture realistic overlap patterns, reverberation, ambient noise, and conversational turn-taking. The development set is intended for validation, model comparison, and hyper-parameter tuning only.
Usage: The development set may be used for hyper-parameter selection, model comparison, and validation, but must not be used for model training or fine-tuning.
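The segment-selection logic described above can be sketched as follows. This is a hypothetical illustration, not the official REAL-T pipeline: the function names, interval representation, and threshold handling are all assumptions.

```python
def overlap_regions(segs_a, segs_b):
    """Intervals (in seconds) where two speakers' (start, end) segments overlap."""
    out = []
    for a0, a1 in segs_a:
        for b0, b1 in segs_b:
            lo, hi = max(a0, b0), min(a1, b1)
            if hi > lo:                 # non-empty intersection -> overlapped speech
                out.append((lo, hi))
    return sorted(out)

def enrollment_candidates(target_segs, other_segs, min_dur=5.0):
    """Target-speaker segments that overlap no other speaker and last >= min_dur s."""
    out = []
    for t0, t1 in target_segs:
        clean = all(min(t1, o1) - max(t0, o0) <= 0 for o0, o1 in other_segs)
        if clean and (t1 - t0) >= min_dur:
            out.append((t0, t1))
    return out

spk1 = [(0.0, 6.2), (10.0, 14.0)]      # toy diarization output: (start, end) in seconds
spk2 = [(5.0, 8.0), (20.0, 26.0)]
print(overlap_regions(spk1, spk2))           # [(5.0, 6.2)]
print(enrollment_candidates(spk2, spk1))     # [(20.0, 26.0)]
```

The overlapped interval becomes a mixture candidate, while spk2's 6-second solo stretch qualifies as an enrollment utterance; spk2's first segment is rejected because it partially overlaps spk1.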
The evaluation set (to be released soon) consists of two subsets that together assess both in-domain robustness and real-world generalization:
EVAL-1: Constructed from the same source corpora as the development set using a held-out partition, sharing similar acoustic conditions and conversation styles. This subset measures system performance under familiar, in-domain settings.
EVAL-2: Newly collected recordings that cover a broader range of real-world scenarios beyond the development set. This subset provides a comprehensive assessment of model generalization to previously unseen acoustic environments and interaction patterns.
Usage: The evaluation set must not be used for training, fine-tuning, hyper-parameter tuning, or any form of model optimization. It is reserved exclusively for final evaluation.
Scenario: Target speaker extraction in latency-sensitive applications such as assistive hearing, real-time communication, and interactive voice systems, where the system must produce temporally responsive outputs under overlapping speech and background noise, enabling continuous and natural listening.
Task: Evaluate target speaker extraction under strict end-to-end latency constraints. Effective latency is measured via the temporal response of outputs to localized input perturbations. Models must ensure that input changes are reflected in the output within a bounded delay (≤ 100 ms), balancing extraction quality and responsiveness.
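A minimal sketch of this perturbation-based measurement, demonstrated on a toy look-ahead system rather than a real TSE model. The probe design, threshold, and sign convention are assumptions for illustration, not the official measurement code.

```python
import numpy as np

def effective_latency_ms(process, sr=16000, perturb_at=8000, eps=1e-6):
    """Perturb one input sample and find the earliest output sample that changes.
    With time-aligned outputs, a model that uses future context diverges *before*
    the perturbation; that lead equals the delay it would need when streaming."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(sr).astype(np.float32)
    x_p = x.copy()
    x_p[perturb_at] += 1.0                          # localized perturbation
    changed = np.abs(process(x) - process(x_p)) > eps
    first = int(np.argmax(changed))                 # earliest affected output sample
    return (perturb_at - first) / sr * 1000.0

def lookahead_avg(x, ahead=64):
    """Toy non-causal system: average over the next 64 samples (4 ms at 16 kHz)."""
    return np.array([x[n:n + ahead + 1].mean() for n in range(len(x))])

print(effective_latency_ms(lookahead_avg))          # 4.0
```

A strictly causal, memoryless system would score 0 ms here; the 4 ms look-ahead of the toy system is exposed because its output starts changing 64 samples before the perturbation.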
Scenario: Target speaker extraction in offline scenarios such as speech transcription, meeting analysis, and audio post-processing, where the full utterance is available prior to inference.
Task: Evaluate target speaker extraction with full-utterance access and unconstrained inference. Systems may exploit global temporal dependencies and full-context modeling to maximize separation quality. This track targets the upper performance bounds of TSE systems, focusing on speech quality, interference suppression, and robustness without latency constraints.
We provide four pretrained baseline models, all based on the BSRNN architecture [1] and trained on the Libri2Mix-100 dataset [2]. These models differ in the type of speaker representation and whether low-latency constraints are considered.
Two models use speaker embeddings extracted from a pretrained ECAPA-TDNN model as speaker conditioning [3][4]. The other two adopt a combination of TF-Map and contextual embeddings [5]. For each type of speaker representation, both an offline and an online (low-latency) variant are provided. All models are trained on 16 kHz audio for 150 epochs with an exponential decay learning rate schedule from 0.001 to 2.5e-5.
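As an illustration of the schedule above, a per-epoch exponential decay from 0.001 to 2.5e-5 over 150 epochs can be computed as follows. The per-epoch granularity and endpoint handling are assumptions about the baseline recipe, not its exact implementation.

```python
# Exponential decay: lr(0) = 1e-3, lr(149) = 2.5e-5 over 150 epochs.
lr_start, lr_end, epochs = 1e-3, 2.5e-5, 150
gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))   # constant per-epoch multiplier

def lr_at(epoch):
    """Learning rate for a given epoch in [0, epochs)."""
    return lr_start * gamma ** epoch

print(f"{lr_at(0):.1e} -> {lr_at(epochs - 1):.1e}")   # 1.0e-03 -> 2.5e-05
```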
Provided Checkpoints
ECAPA-TDNN speaker embeddings as conditioning:
- spk_emb_100 — Offline
- spk_emb_causal_100 — Online

TF-Map and contextual embeddings as multi-level speaker representation:
- tfmap_context_100 — Offline
- tfmap_context_causal_100 — Online

The baseline systems can be run using wesep-real-tse, and have been integrated into the REAL-TSE Challenge repo for automated inference and evaluation.
[1] Y. Luo and J. Yu, “Music source separation with band-split RNN,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1893–1901, 2023.
[2] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
[3] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, pp. 3830–3834.
[4] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “WeSpeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP 2023.
[5] K. Zhang, J. Li, S. Wang, Y. Wei, Y. Wang, Y. Wang, and H. Li, “Multi-level speaker representation for target speaker extraction,” in ICASSP 2025.
Systems are ranked by a composite score that combines the four metrics described below. The detailed formulation of each sub-metric and of the composite score will be specified in the challenge description paper and released with the official scoring toolkit.
Word Error Rate (WER): measures how accurately the extracted speech can be transcribed; a lower WER indicates better intelligibility.
Speaker similarity: cosine similarity between speaker embeddings of the extracted and reference speech; higher similarity indicates better speaker fidelity.
DNSMOS: a neural Mean Opinion Score estimator evaluating the overall perceptual quality of the extracted speech.
Target-speaker activity F1: measures the accuracy of extracting speech only from target-speaker-active regions, via temporal precision, recall, and their harmonic mean (F1).
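Two of the metrics above can be sketched in a few lines. The embedding vectors and frame-level activity masks are placeholders here; the official formulations (including how embeddings and activity regions are obtained) will ship with the scoring toolkit.

```python
import numpy as np

def speaker_similarity(e1, e2):
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def activity_f1(pred_mask, ref_mask):
    """Frame-level precision, recall, and F1 of target-speaker activity."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    tp = int(np.sum(pred & ref))
    precision = tp / max(int(pred.sum()), 1)
    recall = tp / max(int(ref.sum()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

print(round(speaker_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])), 3))  # 1.0
p, r, f = activity_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 1, 0])
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```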
Key dates for the REAL-TSE Challenge at IEEE SLT 2026. All times follow the official challenge announcement unless stated otherwise.
7 April 2026 (AoE)
Team registration for the challenge opens.
10 April 2026 (AoE)
The development dataset and baseline systems are released. Dataset access is restricted to registered teams only; download links and passwords are sent to registered teams by email (newly registered teams receive them within 24 hours of registration).
31 May 2026 (AoE, tentative)
The evaluation dataset is released, the leaderboard opens, and registration closes. Only registered teams will receive dataset access via email.
20 June 2026 (AoE, tentative)
The leaderboard is frozen and result submissions close.
1 July 2026 (AoE, tentative)
Deadline for submitting the system report.
8 July 2026 (AoE, tentative)
Deadline for submitting challenge-related papers to the SLT Challenge Paper Track.
1 September 2026 (AoE, tentative)
Acceptance decisions for papers submitted to the SLT Challenge Paper Track are announced.
Please register here. A registration confirmation email will be sent within three days of registration; if you have not received it, please contact us.
Participants may submit results for one or both tracks of the Challenge.
Key Requirements
Latency
Algorithmic latency: the offset introduced by the whole processing chain (STFT, iSTFT, overlap-add, additional look-ahead frames, etc.) compared to simply passing the signal through unmodified. This does not include buffering latency.
Buffering latency: the latency introduced by block-wise processing, often referred to as the hop size, frame shift, or temporal stride.
The algorithmic and buffering latency definitions above follow the ICASSP 2023 Deep Noise Suppression Challenge (Microsoft Research).
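As a worked example of these definitions, one common accounting for an STFT/overlap-add system counts the full analysis window (plus any look-ahead) as algorithmic latency and the hop as buffering latency. The window and hop values below are illustrative, not prescribed by the challenge.

```python
def total_latency_ms(window_ms, hop_ms, lookahead_frames=0):
    """Total end-to-end latency = algorithmic + buffering, for an STFT/OLA chain."""
    algorithmic = window_ms + lookahead_frames * hop_ms   # processing-chain offset
    buffering = hop_ms                                    # block-wise hop/stride
    return algorithmic + buffering

# Illustrative configuration: 32 ms window, 8 ms hop, one look-ahead frame.
print(total_latency_ms(32, 8, 1))   # 48 -- well under a 100 ms budget
```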
Key Requirements
Detailed data usage constraints—including allowed training sources, prohibited corpora, and the usage scope of the development and evaluation sets—are specified in the Data section above. All participants must comply with those requirements.
Teams may use only openly available (open-source) datasets for model training. The official training splits of AliMeeting, AISHELL-4, AMI, DipCo, and CHiME6 are permitted, but their development and test sets must not be used at any stage, including pre-training, training, or data augmentation. Pre-trained models are allowed but must be clearly documented.
Note on AMI corpus splits: The AMI corpus has multiple partition schemes (see the AMI dataset page). We follow the Full-corpus-ASR partition of meetings; under this partition, the sessions designated as development or test data must not be used.
Organizers

Nanjing University
Chinese University of Hong Kong, Shenzhen
Nanjing University
Northwestern Polytechnical University
Nanjing University
NTT, Inc.
Northwestern Polytechnical University
Brno University of Technology
Shanghai Jiao Tong University
Northwestern Polytechnical University
Chinese University of Hong Kong, Shenzhen
Chinese University of Hong Kong, Shenzhen
For any questions or inquiries, please feel free to reach out to us at:
realtse.challenge@gmail.com