CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset

National Taiwan University, Johns Hopkins University, Academia Sinica
* Indicates Equal Contribution, † Indicates Corresponding Author

News

The "CodecFake+" project page is currently under development. Thank you for your kind patience and continued interest in our work.

Abstract

With the rise of neural audio codecs, codec-based speech generation (CoSG) systems can produce highly realistic speech, posing new deepfake risks, referred to as CodecFake. While detection is increasingly critical, existing detection systems are mainly designed for traditional synthesis methods and often fail against CodecFake. To address this gap, we introduce CodecFake+, the largest and most diverse dataset for codec-based deepfake detection. It includes training data re-synthesized from 31 open-source neural codecs and evaluation data collected from 17 advanced CoSG models. Codec re-synthesized speech is not itself deepfake speech, but it serves as an effective proxy for detecting CoSG-generated speech. We also propose a comprehensive taxonomy that categorizes codecs based on three key components: vector quantizer, auxiliary objectives, and decoder type. This taxonomy enables multi-level analysis to uncover the key factors behind successful detection. At the codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) for large-scale training. At the taxonomy level, we find that disentanglement-based objectives and frequency-domain decoders improve detection. Finally, we show that balancing training data across decoder types further enhances performance, even surpassing models trained solely on CoSG data. Overall, CodecFake+ provides a valuable foundation for advancing both general and fine-grained detection of codec-based deepfakes.
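To make the "vector quantizer" component of the taxonomy concrete, the sketch below shows residual vector quantization (RVQ), the quantizer used by many neural codecs: each stage quantizes the residual left by the previous stage, so the decoder reconstructs a latent frame by summing one codeword per stage. This is an illustrative NumPy toy with random codebooks and made-up shapes, not code from any of the 31 codecs in the dataset.

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual vector quantization: each stage picks the codeword
    nearest (Euclidean) to the residual left by the previous stage.
    Returns one index array per stage."""
    residual = frames.copy()
    codes = []
    for cb in codebooks:
        # Distance from every frame to every codeword in this stage.
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]   # pass the residual to the next stage
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct latents by summing the selected codeword of every stage."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))                         # 100 frames, 8-dim latents
codebooks = [rng.normal(size=(64, 8)) for _ in range(4)]   # 4 stages, 64 entries each
codes = rvq_encode(frames, codebooks)
recon = rvq_decode(codes, codebooks)
```

In a real codec the codebooks are learned jointly with the encoder and decoder, and the per-stage index streams are what CoSG language models predict as discrete "audio tokens".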

Audio Samples

We present audio samples generated by our dataset creation pipeline. Ground-truth (GT) samples are randomly drawn from the VCTK corpus.

Codec Model Used | GT File ID | GT Speech | Generated Speech
SpeechTokenizer | p225_273 | (audio) | (audio)
academiccodec_hifi_16k_320d | p295_144 | (audio) | (audio)
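The pipeline behind these samples can be sketched as a simple encode/decode round trip: pass each GT waveform through a codec and keep the re-synthesized output as a CoRS training sample, paired with the original as the real class. The sketch below uses 8-bit µ-law companding as a stand-in codec so it runs anywhere; a real pipeline would call a neural codec's encode/decode (e.g. SpeechTokenizer) instead. The function names, the label convention (0 = real, 1 = re-synthesized proxy), and the toy sine "utterance" are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mulaw_codec(wave, mu=255):
    """Stand-in 'codec': mu-law companding + 8-bit quantization + expansion.
    Only mimics the encode -> discrete codes -> decode round trip that a
    neural codec would perform; real CoRS data uses neural codecs."""
    compressed = np.sign(wave) * np.log1p(mu * np.abs(wave)) / np.log1p(mu)
    codes = np.round((compressed + 1) / 2 * mu).astype(np.uint8)  # discrete codes
    y = (codes.astype(np.float64) / mu) * 2 - 1                   # back to [-1, 1]
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu   # expand

def build_cors_pair(gt_wave):
    """One training pair: (GT speech, label 0) and
    (codec re-synthesized speech, label 1) as the deepfake proxy."""
    resynth = mulaw_codec(gt_wave)
    return [(gt_wave, 0), (resynth, 1)]

t = np.linspace(0, 1, 16000, endpoint=False)
gt = 0.5 * np.sin(2 * np.pi * 220 * t)   # toy 220 Hz tone standing in for GT speech
pairs = build_cors_pair(gt)
```

Repeating this loop over a corpus and over many codecs yields the large-scale CoRS training set described in the abstract.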

BibTeX


@article{chen2025codecfake+,
  title   = {CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset},
  author  = {Chen, Xuanjun and Du, Jiawei and Wu, Haibin and Zhang, Lin and Lin, I and Chiu, I and Ren, Wenze and others},
  journal = {arXiv preprint arXiv:2501.08238},
  year    = {2025}
}

@inproceedings{wu24p_interspeech,
  title     = {{CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems}},
  author    = {Haibin Wu and Yuan Tseng and Hung-yi Lee},
  year      = {2024},
  booktitle = {{Interspeech 2024}},
  pages     = {1770--1774},
  doi       = {10.21437/Interspeech.2024-2093},
  issn      = {2958-1796},
}