Audio and Speech Processing
- [1] arXiv:2405.15093 [pdf, ps, html, other]
Title: Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis
Comments: 5 pages, 4 figures
Subjects: Audio and Speech Processing (eess.AS)
Singing voice conversion converts a source singing voice into a target singing voice while preserving the content. Flow-based models can already perform voice conversion, but they struggle to extract latent variables effectively in the more rhythmically rich and emotionally expressive task of singing voice conversion, and they also suffer from low processing efficiency. In this paper, we propose a high-fidelity flow-based model based on multi-decoupling feature constraints, which enhances the capture of vocal details by integrating multiple encoders. We also use iSTFT to speed up processing by replacing some layers of the vocoder. We compare the synthesized singing voice with other models along multiple dimensions, and our proposed model is highly consistent with the current state of the art. A demo is available at this https URL.
New submissions for Monday, 27 May 2024 (showing 1 of 1 entries)
- [2] arXiv:2405.15085 (cross-list from eess.SP) [pdf, ps, html, other]
Title: Acoustical Features as Knee Health Biomarkers: A Critical Analysis
Authors: Christodoulos Kechris, Jerome Thevenot, Tomas Teijeiro, Vincent A. Stadelmann, Nicola A. Maffiuletti, David Atienza
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Acoustical knee health assessment has long promised an alternative to clinically available medical imaging tools, but this modality has yet to be adopted in medical practice. The field is currently led by machine learning models processing acoustical features, which have presented promising diagnostic performances. However, these methods overlook the intricate multi-source nature of audio signals and the underlying mechanisms at play. By addressing this critical gap, the present paper introduces a novel causal framework for validating knee acoustical features. We argue that current machine learning methodologies for acoustical knee diagnosis lack the required assurances and thus cannot be used to classify acoustic features as biomarkers. Our framework establishes a set of essential theoretical guarantees necessary to validate this claim. We apply our methodology to three real-world experiments investigating the effect of researchers' expectations, the experimental protocol, and the wearable sensor employed. This investigation reveals latent issues such as underlying shortcut learning and performance inflation. This study is the first independent result reproduction study in the field of acoustical knee health evaluation. We conclude with actionable insights from our findings, offering valuable guidance to navigate these crucial limitations in future research.
- [3] arXiv:2405.15096 (cross-list from cs.SD) [pdf, ps, html, other]
Title: Music Genre Classification: Training an AI model
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Music genre classification is an area that utilizes machine learning models and techniques for the processing of audio signals, with applications ranging from content recommendation systems to music recommendation systems. In this research I explore various machine learning algorithms for the purpose of music genre classification, using features extracted from audio signals. The systems are a Multilayer Perceptron (built from scratch), a k-Nearest Neighbours classifier (also built from scratch), a Convolutional Neural Network, and lastly a Random Forest wide model. To process the audio signals, feature extraction methods such as the Short-Time Fourier Transform and the extraction of Mel-Frequency Cepstral Coefficients (MFCCs) are used. Through this extensive research, I aim to assess the robustness of machine learning models for genre classification, and to compare their results.
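As an illustration of what a k-Nearest Neighbours classifier "built from scratch" can look like, here is a minimal sketch. It assumes Euclidean distance and a simple majority vote, neither of which the abstract specifies. The function name `knn_predict`, the 2-D feature vectors, and the genre labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training label with its Euclidean distance to the query.
    dists = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Sort by distance only, so ties never fall back to comparing labels.
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors (stand-ins for e.g. averaged MFCCs).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
train_y = ["classical", "classical", "classical", "rock", "rock", "rock"]

prediction = knn_predict(train_x, train_y, (0.12, 0.18), k=3)
print(prediction)
```

In practice the feature vectors would come from the STFT or MFCC extraction step, and `k` would be chosen by validation.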
- [4] arXiv:2405.15103 (cross-list from cs.SD) [pdf, ps, html, other]
Title: The Rarity of Musical Audio Signals Within the Space of Possible Audio Generation
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
A white noise signal can access any possible configuration of values, though statistically over many samples it tends to a uniform spectral distribution and is highly unlikely to produce intelligible sound. But how unlikely? The probability that white noise generates a music-like signal over different durations is analyzed, based on some necessary features observed in real music audio signals, such as mostly proximate movement and zero crossing rate. Given the mathematical results, the rarity of music as a signal is considered overall. The applicability of this study is not just to show that music has a precious rarity value, but that examination of the size of music relative to the overall size of audio signal space provides information to inform new generations of algorithmic music systems (which are now often founded on audio signal generation directly, and may relate to white noise via such machine learning processes as diffusion). Estimated upper bounds on the rarity of music to the size of various physical and musical spaces are compared, to better understand the magnitude of the results (pun intended). Underlying the research are the questions 'how much music is still out there?' and 'how much music could a machine learning process actually reach?'.
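To make the zero crossing rate feature concrete, the sketch below compares uniform white noise with a pure tone. It is only an illustration and not the paper's analysis. The sample rate (44100 Hz), tone frequency (440 Hz), signal length, and random seed are all assumptions chosen for the example.

```python
import math
import random


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(signal) - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 20000

# Uniform white noise: independent samples, symmetric about zero.
noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

# A 440 Hz sine sampled at 44100 Hz (both values assumed).
sine = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(n_samples)]

zcr_noise = zero_crossing_rate(noise)
zcr_sine = zero_crossing_rate(sine)
print(zcr_noise, zcr_sine)
```

Because consecutive noise samples are independent and equally likely to be positive or negative, the noise rate sits near 0.5, whereas the tone crosses zero only twice per period and so has a far lower rate.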
- [5] arXiv:2405.15216 (cross-list from cs.LG) [pdf, ps, html, other]
Title: Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
Comments: under review
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
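For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The sketch below implements that standard definition. It is not the paper's evaluation code, and the function name and example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference and ASR hypothesis: one word was dropped.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.3f}")
```

A reported WER of 1.5% therefore means roughly 1.5 word-level errors per 100 reference words.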
- [6] arXiv:2405.15338 (cross-list from cs.SD) [pdf, ps, html, other]
Title: SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: this https URL.
- [7] arXiv:2405.15655 (cross-list from cs.SD) [pdf, ps, html, other]
Title: HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
In recent years, the remarkable advancements in deep neural networks have brought tremendous convenience. However, training a highly effective model requires a substantial quantity of samples, which brings huge potential threats, such as unauthorized exploitation with privacy leakage. In response, we propose a framework named HiddenSpeaker, embedding imperceptible perturbations within the training speech samples and rendering them unlearnable for deep-learning-based speaker verification systems that employ large-scale speakers for efficient training. HiddenSpeaker utilizes a simplified error-minimizing method named Single-Level Error-Minimizing (SLEM) to generate specific and effective perturbations. Additionally, a hybrid objective function is employed for human perceptual optimization, ensuring the perturbation is imperceptible to human listeners. We conduct extensive experiments on multiple state-of-the-art (SOTA) models in the speaker verification domain to evaluate HiddenSpeaker. Our results demonstrate that HiddenSpeaker not only deceives the model with unlearnable samples but also enhances the imperceptibility of the perturbations, showcasing strong transferability across different models.
Cross submissions for Monday, 27 May 2024 (showing 6 of 6 entries)
- [8] arXiv:2405.12609 (replaced) [pdf, ps, html, other]
Title: Mamba in Speech: Towards an Alternative to Self-Attention
Authors: Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba has proven effective in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results show the superiority of bidirectional Mamba (BiMamba) over vanilla Mamba for speech processing. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in Transformer and its derivatives, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section to offer insights for future research.
- [9] arXiv:2403.11732 (replaced) [pdf, ps, html, other]
Title: Hallucination in Perceptual Metric-Driven Speech Enhancement Networks
Comments: Accepted for EUSIPCO 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Within the area of speech enhancement, there is an ongoing interest in the creation of neural systems which explicitly aim to improve the perceptual quality of the processed audio. In concert with this is the topic of non-intrusive (i.e., without a clean reference) speech quality prediction, for which neural networks are trained to predict human-assigned quality labels directly from distorted audio. When combined, these areas allow for the creation of powerful new speech enhancement systems which can leverage large real-world datasets of distorted audio, by taking inference of a pre-trained speech quality predictor as the sole loss function of the speech enhancement system. This paper aims to identify a potential pitfall with this approach, namely hallucinations which are introduced by the enhancement system 'tricking' the speech quality predictor.
- [10] arXiv:2404.09466 (replaced) [pdf, ps, html, other]
Title: Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription
Comments: Fixed Typos
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. In this paper, we introduce a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.
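The idea of scoring every candidate interval with a scaled inner product can be sketched as follows. This is only an attention-style illustration: it assumes one query vector per start frame and one key vector per end frame, which the abstract does not spell out. The function name and the toy vectors are hypothetical.

```python
import math


def interval_scores(queries, keys):
    """Score each closed interval [i, j] with i <= j as q_i . k_j / sqrt(d)."""
    n_frames = len(queries)
    dim = len(queries[0])
    # None marks invalid intervals, where the end would precede the start.
    scores = [[None] * n_frames for _ in range(n_frames)]
    for i in range(n_frames):
        for j in range(i, n_frames):
            dot = sum(q * k for q, k in zip(queries[i], keys[j]))
            scores[i][j] = dot / math.sqrt(dim)
    return scores


# Hypothetical per-frame embeddings for a two-frame feature map.
queries = [[1.0, 0.0], [0.0, 2.0]]  # one vector per candidate start frame
keys = [[1.0, 1.0], [2.0, 2.0]]     # one vector per candidate end frame

scores = interval_scores(queries, keys)
print(scores)
```

The resulting upper-triangular matrix plays the role of the interval scoring matrix that a semi-CRF decoder would consume. A real system would compute it with batched matrix multiplication rather than Python loops.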
- [11] arXiv:2405.14598 (replaced) [pdf, ps, html, other]
Title: Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In recent years, with their realistic generation results and wide range of personalized applications, diffusion-based generative models have gained huge attention in both visual and audio generation. Compared to the considerable advancements in text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in a mask denoising manner. After training, classifier-free guidance can be deployed off the shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at this https URL.
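Classifier-free guidance is commonly written as an extrapolation from the unconditional prediction toward the conditional one. The sketch below shows that common form applied to token logits. The abstract does not give the paper's exact formulation, so the function name, the guidance scale, and the toy logits are assumptions.

```python
def classifier_free_guidance(cond_logits, uncond_logits, scale):
    """Combine logits as uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical logits over two codebook tokens.
cond = [2.0, 0.0]    # prediction given the conditioning image
uncond = [1.0, 1.0]  # prediction with the condition dropped

guided = classifier_free_guidance(cond, uncond, scale=2.0)
print(guided)
```

A scale of 1 recovers the conditional prediction, a scale of 0 recovers the unconditional one, and larger scales push the output further toward what the condition favours. Both predictions come from the same trained model, which is why no extra training is needed.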