REAL-TSE Challenge

News

Apr 10, 2026 (AoE) The development set and baseline checkpoints have been sent to all registered teams via email. Teams that register afterwards will also receive them (within 24 hours of registration). If you have not received them after registration, please contact us.
Apr 7, 2026 (AoE) Challenge registration is now open. Teams can register via the Google Form.

Introduction

The REAL-TSE Challenge, organized as a satellite challenge of IEEE SLT 2026, focuses on Target Speaker Extraction (TSE) in real conversational environments. Given a multi-speaker mixture recorded in real-life settings such as meetings and dinner-party interactions, together with enrollment utterance(s) from a target speaker, the task is to recover the target speaker’s speech from the mixture. In contrast to conventional benchmarks built on simulated or read-speech data, REAL-TSE emphasizes naturally occurring overlap, authentic reverberation, ambient noise, and conversational dynamics in Mandarin and English, providing a more practical testbed for real-world TSE research.

Data

The REAL-TSE Challenge provides an official development set and evaluation set, while model training is restricted to eligible external open-source data under the rules summarized below.

Dataset Distribution: The challenge datasets are available exclusively to registered teams. Only teams that have completed registration via the Google Form will receive access. The dataset download link and password have been sent to registered teams, and will continue to be sent to newly registered teams. If you have not received the email after registration, please contact us.

Training Set

The challenge does not define an official training set. Participants may use any open-source datasets for training, and pre-trained models are allowed, provided that all data sources and checkpoints are clearly documented.

Usage: Every dataset and pre-trained model must be reported in the system description with citations or links. The development and test sets of AliMeeting, AISHELL-4, AMI, DipCo, and CHiME6 must not be used at any stage, including pre-training, training, or data augmentation. However, the official training splits of these corpora are permitted. Note that some corpora (e.g., AMI) have multiple partition schemes; see the FAQ entry "What data can I use for training?" for the sessions that are excluded.

Development Set

The development set is derived from REAL-T, a real-world conversational dataset for target speaker extraction (to be released soon). The data originates from five speaker diarization corpora—AISHELL-4, AliMeeting, AMI, DipCo, and CHiME6—covering Mandarin and English in scenarios such as meetings and dinner parties. Each sample includes a multi-speaker mixture, enrollment utterance(s) for the target speaker, and the clean target reference.

The data is constructed through an automated pipeline that extracts naturally overlapping segments as mixtures and selects non-overlapping speech segments of at least 5 seconds as enrollment utterances. Unlike synthetic mixtures, it captures realistic overlap patterns, reverberation, ambient noise, and conversational turn-taking. The development set is intended for validation, model comparison, and hyper-parameter tuning only, and must not be used for training or fine-tuning.
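As a rough illustration of the enrollment selection step, the sketch below finds stretches of at least 5 seconds in which only the target speaker is active, given diarization-style segments. It is not the organizers' pipeline: the function name `enrollment_candidates` and the `(speaker, start, end)` tuple format are assumptions made for this example.

```python
def enrollment_candidates(segments, target, min_dur=5.0):
    """Return (start, end) spans where only `target` speaks for >= min_dur seconds.

    `segments` is a list of (speaker, start_sec, end_sec) tuples (hypothetical format).
    """
    # Intervals of every other speaker, sorted by start time.
    others = sorted((s, e) for spk, s, e in segments if spk != target)
    result = []
    for spk, start, end in segments:
        if spk != target:
            continue
        cursor = start  # left edge of the current overlap-free stretch
        for o_start, o_end in others:
            if o_end <= cursor or o_start >= end:
                continue  # this interval does not touch the remaining span
            if o_start - cursor >= min_dur:
                result.append((cursor, o_start))
            cursor = max(cursor, o_end)
        if end - cursor >= min_dur:
            result.append((cursor, end))
    return result


segments = [("A", 0.0, 20.0), ("B", 6.0, 8.0), ("B", 15.0, 16.0)]
print(enrollment_candidates(segments, "A"))  # [(0.0, 6.0), (8.0, 15.0)]
```

The final 4 seconds of speaker A's turn are dropped because they fall short of the 5-second minimum.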

Usage: The development set may be used for hyper-parameter selection, model comparison, and validation, but must not be used for model training or fine-tuning.

Evaluation Set

The evaluation set (to be released soon) consists of two subsets that together assess both in-domain robustness and real-world generalization:

EVAL-1: Seen Set

Constructed from the same source corpora as the development set using a held-out partition, sharing similar acoustic conditions and conversation styles. EVAL-1 measures system performance under familiar, in-domain settings.

EVAL-2: Unseen Set

Newly collected recordings that cover a broader range of real-world scenarios beyond the development set. EVAL-2 provides a comprehensive assessment of model generalization to previously unseen acoustic environments and interaction patterns.

Usage: The evaluation set must not be used for training, fine-tuning, hyper-parameter tuning, or any form of model optimization. It is reserved exclusively for final evaluation.

Tracks

Track 1: Online Target Speaker Extraction

Low-Latency TSE

Scenario: Target speaker extraction in latency-sensitive applications such as assistive hearing, real-time communication, and interactive voice systems, where the system must produce temporally responsive outputs under overlapping speech and background noise, enabling continuous and natural listening.

Task: Evaluate target speaker extraction under strict end-to-end latency constraints. Effective latency is measured via the temporal response of outputs to localized input perturbations. Models must ensure input changes are reflected within a bounded delay (≤ 100 ms), balancing extraction quality and responsiveness.

Track 2: Offline Target Speaker Extraction

Full-Context TSE

Scenario: Target speaker extraction in offline scenarios such as speech transcription, meeting analysis, and audio post-processing, where the full utterance is available prior to inference.

Task: Evaluate target speaker extraction with full-utterance access and unconstrained inference. Systems may exploit global temporal dependencies and full-context modeling to maximize separation quality. This track targets the upper performance bounds of TSE systems, focusing on speech quality, interference suppression, and robustness without latency constraints.

Baselines

We provide four pretrained baseline models, all based on the BSRNN architecture [1] and trained on the Libri2Mix-100 dataset [2]. These models differ in the type of speaker representation and whether low-latency constraints are considered.

Two models use speaker embeddings extracted from a pretrained ECAPA-TDNN model as speaker conditioning [3][4]. The other two adopt a combination of TF-Map and contextual embeddings [5]. For each type of speaker representation, both an offline and an online (low-latency) variant are provided. All models are trained on 16 kHz audio for 150 epochs with an exponential decay learning rate schedule from 0.001 to 2.5e-5.
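For teams reproducing a comparable schedule, the sketch below derives the learning rate at each epoch from the two stated endpoints. It assumes the rate is multiplied by a constant factor once per epoch, starting at 0.001 in the first epoch and reaching 2.5e-5 in the 150th; the baseline's actual scheduler may step differently, and the function name is hypothetical.

```python
def exp_decay_lr(epoch, lr_start=1e-3, lr_end=2.5e-5, num_epochs=150):
    """Learning rate at a 0-indexed epoch under per-epoch exponential decay."""
    if not 0 <= epoch < num_epochs:
        raise ValueError("epoch must be in [0, num_epochs)")
    # Assumption: lr_start applies at epoch 0 and lr_end at epoch num_epochs - 1,
    # so the constant per-epoch factor is the (num_epochs - 1)-th root of the ratio.
    gamma = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return lr_start * gamma ** epoch


for epoch in (0, 75, 149):
    print(epoch, exp_decay_lr(epoch))
```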

Provided Checkpoints

Speaker Embedding

ECAPA-TDNN speaker embeddings as conditioning.

spk_emb_100 — Offline
spk_emb_causal_100 — Online

TF-Map + Context

TF-Map and contextual embeddings as multi-level speaker representation.

tfmap_context_100 — Offline
tfmap_context_causal_100 — Online

The baseline systems can be run using wesep-real-tse, and have been integrated into the REAL-TSE Challenge repo for automated inference and evaluation.

Pretrained baseline checkpoints have been sent via email to registered teams, and will continue to be sent to newly registered teams.

A note on baseline limitations: These baselines are trained on Libri2Mix, which consists of synthetic mixtures with full speaker overlap and fixed 3-second segments and therefore differs substantially from REAL-TSE's real conversational data. Baseline performance should be interpreted as a lower bound rather than a representative ceiling. We encourage participants to explore training strategies, data simulation pipelines, and model architectures better suited to real-world conversational TSE.

References

[1] Y. Luo and J. Yu, “Music source separation with band-split RNN,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1893–1901, 2023.

[2] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” 2020.

[3] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech 2020, pp. 3830–3834.

[4] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “WeSpeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP 2023.

[5] K. Zhang, J. Li, S. Wang, Y. Wei, Y. Wang, Y. Wang, and H. Li, “Multi-level speaker representation for target speaker extraction,” in ICASSP 2025.

Evaluation Metrics

Composite Score

Systems are ranked by a composite score that comprehensively integrates the four metrics described below. The detailed formulations for each sub-metric and the composite score will be specified in the challenge description paper and released with the official scoring toolkit.

Intelligibility — WER

Word Error Rate (WER) measures how accurately the extracted speech can be transcribed. A lower WER indicates better intelligibility.
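As a reminder of the textbook definition, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; the ASR system, normalization, and exact formulation used for ranking are those of the official scoring toolkit, not this example.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat", "the cat"))  # one deletion over three reference words
```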

Speaker Consistency — SpkSim

Cosine similarity between speaker embeddings of the extracted and reference speech. Higher similarity indicates better speaker fidelity.
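The similarity itself is simple once the two embeddings are available. The sketch below takes embeddings as plain lists of floats; which speaker embedding model produces them for scoring is not stated here, so the inputs are placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for extracted-speech and reference embeddings.
print(cosine_similarity([0.2, 0.9, -0.1], [0.25, 0.8, 0.0]))
```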

Perceptual Quality — DNSMOS

Neural Mean Opinion Score estimator (DNSMOS) evaluating the overall perceptual quality of the extracted speech.

Target Speaker Presence Rate — F1

Measures how accurately the system outputs speech only in regions where the target speaker is active, using temporal precision, recall, and their harmonic mean (F1).
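To make the three quantities concrete, the sketch below computes them from two binary activity sequences, one for the reference and one for the extracted output. Treating activity as per-frame 0/1 flags is an assumption of this example; how activity is detected and the exact formulation are left to the official scoring toolkit.

```python
def presence_f1(ref_active, est_active):
    """Precision, recall, and F1 over per-frame binary activity flags."""
    if len(ref_active) != len(est_active):
        raise ValueError("activity sequences must have the same length")
    pairs = list(zip(ref_active, est_active))
    tp = sum(1 for r, e in pairs if r and e)      # speech output while target is active
    fp = sum(1 for r, e in pairs if e and not r)  # speech output while target is silent
    fn = sum(1 for r, e in pairs if r and not e)  # target active but output is silent
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(presence_f1([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5)
```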

Timeline

Key dates for the REAL-TSE Challenge at IEEE SLT 2026. All times follow the official challenge announcement unless stated otherwise.

7 April 2026 (AoE)

Registration Opens

Team registration for the challenge opens.

10 April 2026 (AoE)

Development Set & Baselines Released

The development dataset and baseline systems are released. Dataset access is restricted to registered teams only; download links and passwords have been sent to registered teams, and will continue to be sent to newly registered teams (within 24 hours of registration).

31 May 2026 (AoE, tentative)

Evaluation Set Released / Leaderboard Opens / Registration Closes

The evaluation dataset is released, the leaderboard opens, and registration closes. Only registered teams will receive dataset access via email.

20 June 2026 (AoE, tentative)

Leaderboard Freeze / Submission Deadline

The leaderboard is frozen and result submissions close.

1 July 2026 (AoE, tentative)

System Report Submission Deadline

Deadline for submitting the system report.

8 July 2026 (AoE, tentative)

SLT Challenge Paper Track: Paper Submission Deadline

Deadline for submitting challenge-related papers to the SLT Challenge Paper Track.

1 September 2026 (AoE, tentative)

SLT Challenge Paper Track: Notification of Acceptance

Acceptance decisions for papers submitted to the SLT Challenge Paper Track are announced.

Registration

Please register here. We will send a registration confirmation email within three days of your registration. If you have not received it, please contact us.

Registration Form

Rules for Participation

Tracks

Participants may submit results for one or both tracks of the Challenge.

Online Track — Low-Latency TSE

Key Requirements

Latency constraint: The end-to-end algorithmic latency must not exceed 100 ms. Systems may use limited future context (lookahead) as long as this latency budget is satisfied.
Streaming output: Models should support frame-wise or chunk-wise input with continuous output generation, without relying on full-utterance buffering.

Latency

Algorithmic Latency

The offset introduced by the whole processing chain including STFT, iSTFT, overlap-add, additional lookahead frames, etc., compared to just passing the signal through without modification. This does not include buffering latency.

Ex.1: STFT-based processing with window length = 20 ms and hop length = 10 ms introduces an algorithmic delay of window length − hop length = 10 ms.
Ex.2: STFT-based processing with window length = 32 ms and hop length = 8 ms introduces an algorithmic delay of window length − hop length = 24 ms. For reference, the online BSRNN baselines we provide (tfmap_context_causal_100 and spk_emb_causal_100) use this STFT configuration.
Ex.3: An overlap-save based processing algorithm introduces no additional algorithmic latency.
Ex.4: A time-domain convolution with a filter kernel size of 16 samples introduces an algorithmic latency of kernel size − 1 = 15 samples. With one-sided (left) padding of kernel size − 1 samples, the operation introduces no additional algorithmic latency.
Ex.5: STFT-based processing with window length = 20 ms and hop length = 10 ms that uses 2 future frames introduces an algorithmic latency of (window length − hop length) + 2 × hop length = 30 ms.
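The arithmetic in these examples can be checked with a few lines of code. The helpers below are illustrative only (their names are not part of any challenge toolkit) and simply encode the formulas from Ex.1, Ex.2, Ex.4, and Ex.5, plus a check against the 100 ms budget.

```python
def stft_algorithmic_latency_ms(window_ms, hop_ms, lookahead_frames=0):
    """(window - hop) plus one hop per future frame, in milliseconds."""
    if hop_ms <= 0 or window_ms < hop_ms or lookahead_frames < 0:
        raise ValueError("expected 0 < hop_ms <= window_ms and lookahead_frames >= 0")
    return (window_ms - hop_ms) + lookahead_frames * hop_ms


def conv_algorithmic_latency_samples(kernel_size, left_padded=False):
    """kernel_size - 1 samples, or 0 when padded on the left only (causal)."""
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    return 0 if left_padded else kernel_size - 1


def within_budget(latency_ms, budget_ms=100):
    """True when the latency satisfies the online-track limit."""
    return latency_ms <= budget_ms


print(stft_algorithmic_latency_ms(20, 10))      # 10 (Ex.1)
print(stft_algorithmic_latency_ms(32, 8))       # 24 (Ex.2, online baselines)
print(stft_algorithmic_latency_ms(20, 10, 2))   # 30 (Ex.5)
print(conv_algorithmic_latency_samples(16))     # 15 (Ex.4)
```

With the baseline configuration of Ex.2, the remaining budget of 76 ms corresponds to at most 9 lookahead frames at an 8 ms hop (24 + 9 × 8 = 96 ms).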

Buffering Latency

The latency introduced by block-wise processing, often referred to as hop-size, frame-shift, or temporal stride.

Ex.1: STFT-based processing has a buffering latency corresponding to the hop size.
Ex.2: Overlap-save processing has a buffering latency corresponding to the frame size.
Ex.3: A time-domain convolution with stride 1 introduces a buffering latency of 1 sample.

Algorithmic and buffering latency definitions and examples above follow the ICASSP 2023 Deep Noise Suppression Challenge (Microsoft Research).

In addition to the reported algorithmic latency, the effective system latency will be evaluated by measuring the output response delay to a controlled input perturbation. This evaluation is designed to reflect real streaming behavior without requiring participants to explicitly implement a streaming inference pipeline.
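The official measurement procedure is not detailed here. As one plausible reading of a perturbation-based test, the sketch below changes a single input sample and finds the earliest output sample that changes; if the output changes before the perturbed position, the system is using that much future context. All names, and the toy filter standing in for a TSE model, are hypothetical.

```python
def earliest_affected_index(process, signal, perturb_index, delta=1.0, tol=1e-9):
    """Index of the first output sample that changes when one input sample is perturbed."""
    perturbed = list(signal)
    perturbed[perturb_index] += delta
    baseline, changed = process(list(signal)), process(perturbed)
    for i, (a, b) in enumerate(zip(baseline, changed)):
        if abs(a - b) > tol:
            return i
    return None


def lookahead_samples(process, signal, perturb_index):
    """How many samples before the perturbation the output already reacts."""
    first = earliest_affected_index(process, signal, perturb_index)
    if first is None:
        raise ValueError("perturbation had no effect on the output")
    return max(0, perturb_index - first)


def samples_to_ms(num_samples, sample_rate=16000):
    """Convert a sample count to milliseconds (16 kHz matches the baselines)."""
    return 1000.0 * num_samples / sample_rate


def toy_lookahead_filter(signal, lookahead=3):
    """Toy system: averages each sample with up to `lookahead` future samples."""
    out = []
    for t in range(len(signal)):
        window = signal[t:t + lookahead + 1]
        out.append(sum(window) / len(window))
    return out


silence = [0.0] * 32
print(lookahead_samples(toy_lookahead_filter, silence, 10))  # 3
```

A real test would perturb actual audio and would need a tolerance suited to floating-point model outputs, since a single-sample change may produce only a very small difference.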

Offline Track — Full-Context TSE

Key Requirements

Full context access: Systems may utilize the entire utterance and global context information.
No restriction on architecture: There is no limitation on model type, inference duration, or computational cost.
Encouraged exploration: Participants are encouraged to explore high-performance architectures, long-context modeling, and generation-based methods.

Training Data & Pre-trained Models

Detailed data usage constraints—including allowed training sources, prohibited corpora, and the usage scope of the development and evaluation sets—are specified in the Data section above. All participants must comply with those requirements.

Submission Rules

System description: All submissions must include detailed system descriptions for reproducibility.
Number of submissions: Each team may submit up to three valid systems per track.
Encouraged open source: Teams are encouraged to make their models or scripts publicly available after the challenge.

FAQ

Can I participate in multiple tracks?

Yes, participants may submit results for one or both tracks of the Challenge. Each team may submit up to three valid systems per track.

What data can I use for training?

Teams may use only openly available (open-source) datasets for model training. The official training splits of AliMeeting, AISHELL-4, AMI, DipCo, and CHiME6 are permitted, but their development and test sets must not be used at any stage, including pre-training, training, or data augmentation. Pre-trained models are allowed but must be clearly documented.

Note on AMI corpus splits: The AMI corpus has multiple partition schemes (see AMI dataset page). We follow the Full-corpus-ASR partition of meetings. Under this partition, the following sessions are considered development or test data and must not be used:

Dev (SB): ES2011, IS1008, TS3004, IB4001, IB4002, IB4003, IB4004, IB4010, IB4011
Test (SC / Unseen Eval): ES2004, IS1009, TS3003, EN2002
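When assembling AMI training data, the excluded sessions listed above can be filtered out programmatically. The sketch below assumes meeting IDs consist of the six-character session ID optionally followed by a letter (e.g. "ES2011a"); verify this against your local copy of the corpus. The function name is illustrative.

```python
# Session IDs copied from the Dev and Test lists above.
AMI_EXCLUDED_SESSIONS = {
    "ES2011", "IS1008", "TS3004", "IB4001", "IB4002",
    "IB4003", "IB4004", "IB4010", "IB4011",   # Dev (SB)
    "ES2004", "IS1009", "TS3003", "EN2002",   # Test (SC / Unseen Eval)
}


def is_allowed_ami_meeting(meeting_id):
    """True if the meeting does not belong to an excluded dev/test session."""
    # Assumption: the first six characters of a meeting ID are its session ID.
    session = meeting_id.strip().upper()[:6]
    return session not in AMI_EXCLUDED_SESSIONS


meetings = ["ES2002a", "ES2011a", "IB4001", "EN2002d", "TS3005b"]
print([m for m in meetings if is_allowed_ami_meeting(m)])  # ['ES2002a', 'TS3005b']
```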

Is the test set available for tuning?

No. The official test set must not be used for training, fine-tuning, hyper-parameter tuning, or any form of model optimization. It is reserved exclusively for final evaluation. The development set may be used for hyper-parameter tuning and validation.

Organizers

Shuai Wang

Nanjing University

Ke Zhang

Chinese University of Hong Kong, Shenzhen

Zihan Qian

Nanjing University

Zikai Liu

Northwestern Polytechnical University

Haoyu Li

Nanjing University

Marc Delcroix

NTT, Inc.

Zhaokai Sun

Northwestern Polytechnical University

Jiangyu Han

Brno University of Technology

Kai Yu

Shanghai Jiao Tong University

Lei Xie

Northwestern Polytechnical University

Ming Li

Chinese University of Hong Kong, Shenzhen

Haizhou Li

Chinese University of Hong Kong, Shenzhen

Contact

For any questions or inquiries, please feel free to reach out to us at:

realtse.challenge@gmail.com