REAL-T: Real Conversational Mixtures for Target Speaker Extraction


A real-world, conversation-centric benchmark for Target Speaker Extraction (TSE)

Multilingual

Covers both Mandarin and English conversations from diverse real-world sources.

Multi-genre Scenarios

Covers diverse settings such as meeting rooms and dinner parties, drawn from datasets like AISHELL-4, AliMeeting, AMI, CHiME6, and DipCo.

Multi-Enrollment

Each target speaker in REAL-T is provided with multiple enrollment utterances drawn from different parts of the conversation.

PRIMARY

Moderate difficulty
150+ minutes total audio
70+ minutes overlapping speech

BASE

More challenging
360+ minutes total audio
140+ minutes overlapping speech

Key Insight

TSE models trained on synthetic data exhibit degraded performance when evaluated on realistic conversational scenarios.

Features

An overview of the REAL-T data, broken down by source.

[Interactive per-source table with columns: Source, PRIMARY, BASE, Ratio, Enrollment, Example audio.]

Publications

Please cite the following if you make use of the dataset.

Shaole Li, Shuai Wang, Jiangyu Han, Ke Zhang, Wupeng Wang, Haizhou Li

REAL-T: Real Conversational Mixtures for Target Speaker Extraction, Interspeech 2025


Guidance

Evaluate your TSE model on REAL-T!

The dataset is available on Hugging Face.

The REAL-T evaluation tool is open-sourced on GitHub.
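
If you prefer to fetch the data manually rather than via the pre.sh script below, a minimal sketch using huggingface-cli is shown here; the dataset ID is an assumption, so use the exact ID shown on the Hugging Face page:

# Hypothetical dataset ID -- replace with the one listed on Hugging Face
huggingface-cli download REAL-TSE/REAL-T --repo-type dataset --local-dir data/REAL-T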

Step 0: Preparation

① Clone the repository and enter the project directory

git clone https://github.com/REAL-TSE/REAL-T.git
cd REAL-T

② Create a Conda environment and install dependencies

conda create -n REAL-T python=3.9
conda activate REAL-T
pip install -r requirements.txt

# install wesep
git submodule init
git submodule update

③ Set up environment variables

Run the following exports from the project root; otherwise, replace $PWD below with the absolute path to this project.

export PATH=$PWD/FireRedASR/fireredasr/:$PWD/FireRedASR/fireredasr/utils/:$PATH
export PYTHONPATH=$PWD/FireRedASR/:$PYTHONPATH
export PYTHONPATH=$PWD/wesep:$PYTHONPATH
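
After exporting the variables, a quick sanity check (assuming the fireredasr and wesep packages resolve from the paths above) confirms the environment is wired correctly:

# Both imports should succeed if PYTHONPATH is set correctly
python -c "import fireredasr; import wesep; print('environment OK')"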

④ Automatically prepare dataset and checkpoints

bash -i ./pre.sh

Step 1: Run TSE Inference

Run the inference script (you can adapt its input/output structure to suit your own TSE model; see the sketch after this step):

bash -i ./run_tse.sh
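
To plug in your own TSE model, one possible pattern is to keep REAL-T's directory layout and swap in your own inference command. The paths and script name below are hypothetical placeholders, not the repository's actual structure:

# Hypothetical layout: iterate over mixture/enrollment pairs and write
# extracted audio where the evaluation step expects to find it
for mix in data/PRIMARY/mixtures/*.wav; do
    utt=$(basename "$mix" .wav)
    enroll="data/PRIMARY/enrollments/${utt}.wav"      # hypothetical path
    python my_tse_infer.py --mix "$mix" --enroll "$enroll" \
        --out "output/PRIMARY/${utt}.wav"             # your own model here
done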

Step 2: ASR-based Evaluation

Run both transcription and evaluation

bash -i ./transcribe_and_evaluation.sh 1 2
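
The evaluation is intelligibility-based: the extracted audio is transcribed with an ASR model and scored against the reference transcripts. A toy illustration of the underlying metric, assuming pip install jiwer (this is not the project's scoring script):

# Conceptual word-error-rate computation on a toy hypothesis/reference pair
python - <<'EOF'
import jiwer  # assumed installed separately: pip install jiwer
ref = "the quick brown fox jumps"
hyp = "the quick brown box jumps"
print(f"WER: {jiwer.wer(ref, hyp):.2%}")  # 1 substitution / 5 words = 20.00%
EOF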

Performance

The table below compares several recently proposed TSE models on the simulated Libri2Mix test set (SI-SDR in dB, higher is better) and on the PRIMARY test set (ASR error rates in %, lower is better).

Model            Training Data   Libri2Mix SI-SDR (dB)   PRIMARY zh (%)   PRIMARY en (%)
TSELM-L          Libri2Mix-360   /                       331.73           192.39
USEF-TFGridnet   Libri2Mix-100   18.05                   67.98            87.27
BSRNN            Libri2Mix-100   12.95                   81.74            91.20
BSRNN            Libri2Mix-360   16.57                   69.80            73.61
BSRNN            VoxCeleb1       16.50                   57.61            69.63
BSRNN_HR         Libri2Mix-100   15.91                   70.03            78.96
BSRNN_HR         Libri2Mix-360   17.99                   63.38            74.64
BSRNN_HR         VoxCeleb1       16.38                   58.77            66.46

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 62401377), the Shenzhen Science and Technology Program (Shenzhen Key Laboratory, Grant No. ZDSYS20230626091302006), the Shenzhen Science and Technology Research Fund (Fundamental Research Key Project, Grant No. JCYJ20220818103001002), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2023ZT10X044), and the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (Grant No. 2024CSJGG01100).