
CFAD: A Chinese Dataset for Fake Audio Detection

Name: CFAD: A Chinese Dataset for Fake Audio Detection
Keywords: Deepfake, Fake Audio, Fake Audio Detection, Dataset

Type: Research data, Dataset
Published: 09 Jun 2022
Language: Chinese
Publisher: Zenodo

Authors: Haoxin Ma; Jiangyan Yi

DOI: 10.5281/zenodo.6623226, 10.5281/zenodo.8122764

Abstract

Fake audio detection is a growing concern, and several relevant datasets have been designed for research. However, there is no standard public Chinese dataset under complex conditions. In this paper, we aim to fill this gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for adding noise at five different signal-to-noise ratios, and six codecs are considered for audio transcoding. The CFAD dataset can be used not only for fake audio detection but also for identifying the algorithms behind fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection with generalization remains challenging. The CFAD dataset is publicly available on GitHub: https://github.com/ADDchallenge/CFAD

The CFAD dataset considers 12 types of fake audio: 11 are generated by different speech synthesis techniques, and the remaining one is a partially fake type. Partially fake audio is completely different from synthesized speech and can therefore better evaluate how well a detection model generalizes to unknown types. The real audio is collected from 6 different corpora to increase the diversity of the real category's distribution, which makes a model less prone to artifacts from a single database. For robustness evaluation, we additionally simulate background noise and media codecs that might occur in real life and provide detailed labels, including fake type, real source, noise type, signal-to-noise ratio (SNR), and media codec.

Overall, the CFAD dataset consists of three versions: clean, noisy, and codec. Each version is divided into disjoint training, development, and test sets in the same way. There is no speaker overlap across these three subsets.
Each test set is further divided into seen and unseen test sets. The unseen test sets evaluate how well methods generalize to unknown types. It is worth mentioning that both the real audio and the fake audio in the unseen test set are unknown to the model. For the noisy version, we select three noise databases for simulation. Additive noise is added to each audio file in the clean dataset at 5 different SNRs. The additive noise for the unseen test set and for the remaining subsets comes from different noise databases. For the codec version, we select six different codecs, two of which are used for the unseen test set. In each version (clean, noisy, and codec) of the CFAD dataset, there are 138400 utterances in the training set, 14400 in the development set, 42000 in the seen test set, and 21000 in the unseen test set.

Clean Real Audios Collection

To eliminate the interference of irrelevant factors, we collect clean real audio from two sources: 5 open resources from the OpenSLR platform (http://www.openslr.org/12/) and one self-recorded dataset.

Clean Fake Audios Generation

We select 11 representative speech synthesis methods to generate the fake audio, plus one partially fake type.

Noisy Audios Simulation

The noisy version aims to quantify the robustness of methods under noisy conditions. To simulate real-life scenarios, we artificially sample noise signals and add them to the clean audio at 5 different SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Additive noise is selected from three noise databases: PNL 100 Nonspeech Sounds, NOISEX-92, and TAU Urban Acoustic Scenes.

Audio Transcoding

The codec version aims to quantify the robustness of methods under different format conversions. We select a total of six codecs. For the training, development, and seen test sets of the codec version, mp3, flac, ogg, and m4a are used. For the unseen test set, aac and wma are used.
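As a rough illustration of the noise simulation described above, the following Python sketch mixes a randomly sampled noise segment into a clean signal at a target SNR. It is not the authors' code: the function names, the synthetic sine and Gaussian signals, and the 16000-sample length are assumptions made for the example, and only the five SNR levels come from the description.

```python
import math
import random

# The five SNR levels listed in the dataset description.
SNR_LEVELS_DB = [0, 5, 10, 15, 20]


def mean_power(samples):
    """Average power of a list of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db, rng):
    """Add a random segment of `noise` to `clean` at the requested SNR in dB.

    Hypothetical helper for illustration; assumes `noise` is non-empty
    and not silent.
    """
    # Repeat the noise if it is shorter than the clean utterance.
    while len(noise) < len(clean):
        noise = noise + noise
    # Sample a random noise segment of the same length as the utterance.
    start = rng.randrange(len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    # Choose the gain so that 10*log10(P_clean / P_scaled_noise) == snr_db.
    gain = math.sqrt(
        mean_power(clean) / (mean_power(segment) * 10 ** (snr_db / 10))
    )
    return [c + gain * n for c, n in zip(clean, segment)]


# Synthetic stand-ins for a clean utterance and a noise recording.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
noisy = {snr: mix_at_snr(clean, noise, snr, rng) for snr in SNR_LEVELS_DB}
```

Scaling the noise rather than the speech keeps the clean signal's level unchanged across SNR conditions; whether CFAD was built this way is not stated in the description.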
Transcoding is applied to the audio in the clean version. Each clean audio file is randomly transcoded with one of the candidate codecs and then converted back to the original WAV format using the ffmpeg toolkit.

This dataset is licensed under a CC BY-NC-ND 4.0 license. You can cite the data using the following BibTeX entry.
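Returning to the transcoding step for the codec version, the round trip (encode a clean WAV file with one randomly chosen candidate codec, then convert it back to WAV with ffmpeg) could be sketched as below. This is a hedged illustration only: the helper names, file names, and subset labels are hypothetical, and the sketch assumes the output file extension is enough for ffmpeg to choose its default encoder. The authors' actual encoder options and bitrates are not given in the description. The sketch only builds the command lines and does not run ffmpeg.

```python
import random
from pathlib import Path

# Codec pools as listed in the dataset description.
SEEN_CODECS = ["mp3", "flac", "ogg", "m4a"]  # training, development, seen test
UNSEEN_CODECS = ["aac", "wma"]               # unseen test only


def pick_codec(subset, rng):
    """Randomly choose one candidate codec for the given subset.

    The subset label "unseen_test" is an assumption for this example.
    """
    pool = UNSEEN_CODECS if subset == "unseen_test" else SEEN_CODECS
    return rng.choice(pool)


def roundtrip_commands(wav_path, codec, workdir):
    """Build two ffmpeg argument lists: WAV -> codec, then codec -> WAV.

    Assumption: the file extension selects ffmpeg's default encoder.
    """
    wav_path = Path(wav_path)
    workdir = Path(workdir)
    encoded = workdir / f"{wav_path.stem}.{codec}"
    decoded = workdir / f"{wav_path.stem}_{codec}.wav"
    # -y overwrites existing outputs, -i names the input file.
    encode = ["ffmpeg", "-y", "-i", str(wav_path), str(encoded)]
    decode = ["ffmpeg", "-y", "-i", str(encoded), str(decoded)]
    return encode, decode


rng = random.Random(0)
codec = pick_codec("train", rng)
encode_cmd, decode_cmd = roundtrip_commands("utt_0001.wav", codec, "tmp")
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.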

References

- Jiangyan Yi, Ye Bai, Jianhua Tao, Zhengkun Tian, Chenglong Wang, Tao Wang, and Ruibo Fu. Half-truth: A partially fake audio detection dataset. Interspeech 2021.
- Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5. IEEE, 2017.
- Xin Xu Shaoji Zhang Ming Li Yao Shi, Hui Bu. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. 2015.
- Zhiyong Zhang Dong Wang, Xuewei Zhang. Thchs-30: A free chinese speech corpus, 2015.
- Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, et al. Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset. arXiv preprint arXiv:2203.16844, 2022.
- Guoning Hu and DeLiang Wang. A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2067–2079, 2010.
- Andrew Varga and Herman JM Steeneken. Assessment for automatic speech recognition: Ii. noisex-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech communication, 12(3):247–251, 1993.
- Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pages 9–13, November 2018.

Related Organizations

Chinese Academy of Sciences
China (People's Republic of)


Related research

- FAD: A Chinese Dataset for Fake Audio Detection (2022), relation: IsAmongTopNSimilarDocuments
- FAD: A Chinese Dataset for Fake Audio Detection (2022), relation: HasVersion

Impact by BIP!

- Citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Usage by UsageCounts

- Views: 120
- Downloads: 130
