Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Geng, Xuelong; Xu, Tianyi; Wei, Kun; Mu, Bingshen; Xue, Hongfei; Wang, He; Li, Yangze; Guo, Pengcheng; Dai, Yuhang; Li, Longhao; Shao, Mingchen; Xie, Lei

arXiv:2405.02132 (cs)

[Submitted on 3 May 2024 (v1), last revised 6 May 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building on this momentum, we examine this paradigm in depth on a large open-source Chinese dataset. Specifically, we evaluate the impact of various configurations of speech encoders, LLMs, and projector modules within the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach designed to improve the model's ability to align auditory and textual information. This approach, together with the strategic integration of ASR components, enabled us to achieve state-of-the-art (SOTA) performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis provides an empirical foundation for future research on LLM-based ASR systems and offers insights into optimizing performance on Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
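
The abstract names three components (speech encoder, projector, LLM) without describing how they connect. The following is a minimal sketch of one common projector design, frame stacking followed by a linear layer, which shortens the encoder output and maps it to the LLM embedding width. It is an illustration only, not the configuration chosen in the paper: the function names, the toy dimensions, the stacking factor, and the random weights are all assumptions.

```python
import random

def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping any leftover tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]

def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def project(frames, k, weight, bias):
    """Hypothetical projector: downsample by stacking, then map to LLM width."""
    return [linear(x, weight, bias) for x in stack_frames(frames, k)]

# Toy sizes (assumed, far smaller than any real encoder or LLM).
ENC_DIM, LLM_DIM, K = 4, 6, 2
rng = random.Random(0)

# Stand-in for speech encoder output: 7 frames of ENC_DIM features each.
encoder_out = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(7)]

# Randomly initialised projector parameters (would be learned in practice).
weight = [[rng.uniform(-0.1, 0.1) for _ in range(ENC_DIM * K)]
          for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

speech_embeds = project(encoder_out, K, weight, bias)

# Stand-in for the embedded text prompt; the LLM would consume the
# concatenation of speech and prompt embeddings as its input sequence.
prompt_embeds = [[0.0] * LLM_DIM for _ in range(3)]
llm_input = speech_embeds + prompt_embeds

print(len(speech_embeds), len(llm_input), len(llm_input[0]))
```

In this sketch the projector is the only piece that must reconcile the two models: it reduces the number of speech frames by the stacking factor and matches the LLM embedding dimension, so the encoder and LLM can be swapped independently.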

Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.02132 [cs.SD]
	(or arXiv:2405.02132v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2405.02132

Submission history

From: Xuelong Geng
[v1] Fri, 3 May 2024 14:35:58 UTC (1,206 KB)
[v2] Mon, 6 May 2024 08:56:50 UTC (85 KB)
