Abstract:
The main goal of Voice Activity Detection (VAD) techniques is to distinguish between active speech regions and silent intervals in an audio communication. This step is crucial in most speech processing systems, such as mobile communications, Voice over IP, speech recognition and hearing aid systems. The task is relatively easy in a weakly disturbed environment; however, in the presence of strong noise, it becomes difficult to provide accurate information about the presence of active voice. In general, a VAD achieves the compression
of silence intervals in modern communication systems by reducing the average bit rate via the discontinuous transmission (DTX) mode. In this thesis, we describe the main standardized VAD methods reported in the literature, namely the G.729-B VAD approved by the ITU-T in 1996, the AMR (Adaptive Multi-Rate) VAD, the AFE (Advanced Front-End) VAD and SILK (developed by Skype). The VAD of the G.729-B standard generates a binary decision for each frame as a function of four relevant parameters extracted from the audio signal. In simplified terms, these parameters are directly related to the energy and spectral content of the frame.
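For illustration, these four differential parameters can be sketched as below; the names and comments are ours and are not taken from the standard's reference code.

/* Illustrative sketch of the four differential parameters used by the
 * G.729-B VAD; each one is the difference between the current frame
 * measure and a running average of the corresponding background-noise
 * characteristic (field names are placeholders, not those of the standard). */
typedef struct {
    float dS;   /* spectral distortion, computed from the line spectral frequencies */
    float dEf;  /* full-band energy difference                                      */
    float dEl;  /* low-band energy difference                                       */
    float dZC;  /* zero-crossing rate difference                                    */
} g729b_features_t;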
The G.729-B VAD is used in the majority of audio transmission applications, which has made it the most popular VAD technique. The G.729-B is therefore, today, the reference standard for comparative studies in most scientific articles dealing with VAD. In the first contribution, we propose a VAD scheme based on an adaptive threshold while
maintaining the False Acceptance Rate at a nominal value. As is well known in binary decision theory, this error rate, called the "False Acceptance Rate", corresponds to the probability of misclassifying a silence frame as active voice. The basic idea is to perform sequential tests, based on the full-band energy, in order to accept or reject the frame under investigation as an active voice region. The most interesting feature of the proposed algorithm is its ability to dynamically update the noise level estimator according to the current environment. Taking into account the long-term stationarity of speech, we also developed a
smoothing procedure to discard discontinuities that may appear in the processed signal.
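As a rough illustration of this scheme (with placeholder constants and frame length, not the exact values retained in the thesis), the per-frame decision can be sketched as follows:

#define FRAME_LEN 80               /* 10 ms frame at 8 kHz (illustrative)         */

static float noise_level = 1e-4f;  /* running estimate of the background energy   */
static int   hangover    = 0;      /* frames kept active after speech ends        */

/* Energy test against an adaptive threshold, with noise tracking during
 * silence and a hangover counter that smooths isolated discontinuities.     */
int vad_decision(const float *frame)
{
    float energy = 0.0f;
    for (int i = 0; i < FRAME_LEN; ++i)
        energy += frame[i] * frame[i];
    energy /= FRAME_LEN;

    /* Adaptive threshold: a scaled noise estimate; the scale factor would be
     * chosen so that the False Acceptance Rate stays at its nominal value.  */
    float threshold = 3.0f * noise_level;             /* placeholder factor   */

    int active = (energy > threshold);

    if (!active) {
        /* Update the noise estimator only on frames classified as silence.  */
        noise_level = 0.95f * noise_level + 0.05f * energy;
        if (hangover > 0) { --hangover; active = 1; }  /* hangover smoothing  */
    } else {
        hangover = 6;                                  /* placeholder duration */
    }
    return active;   /* 1 = active voice, 0 = silence */
}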
The performance of the proposed approach has been evaluated and compared to the G.729-B VAD in several situations, including various environmental acoustic noises with different SNRs. The analysis of the results has been performed using the NOIZEUS experimental database as well as real recorded signals. The second contribution consists of implementing the proposed approach on a microcontroller-based system in order to:
• Ensure the robustness of the algorithm,
• Evaluate its implementation complexity,
• Validate the real-time operation mode.
In this context, various tests were conducted in real-time mode via the development tools available on the microcontroller system (STM32F7). These tools allowed real-time monitoring of several signal parameters in realistic situations. In this way, we were able to accurately determine the processing time (latency) required to generate a final decision for each frame. The real-time analysis yielded a global latency of 4 μs per frame, which is sufficient to guarantee real-time operation with respect to the common sampling frequencies of speech processing systems (8 kHz to 16 kHz).
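The thesis relies on the platform's development tools for these measurements; purely as an illustration of how such per-frame latency can be obtained on a Cortex-M7, one common approach is to read the DWT cycle counter around the VAD routine, as sketched below (an assumed measurement path, not necessarily the one used in the thesis):

#include "stm32f7xx.h"   /* CMSIS device header, assumed available in the project */

/* Enable the Cortex-M7 DWT cycle counter (on some parts the DWT lock-access
 * register must also be unlocked first).                                    */
static void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace block  */
    DWT->CYCCNT = 0;                                   /* reset the cycle counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;              /* start counting          */
}

/* Time one call of the (sketched) VAD routine and convert cycles to microseconds. */
static float frame_latency_us(const float *frame)
{
    uint32_t start  = DWT->CYCCNT;
    (void)vad_decision(frame);                         /* per-frame VAD decision  */
    uint32_t cycles = DWT->CYCCNT - start;
    return 1e6f * (float)cycles / (float)SystemCoreClock;
}

Averaging this figure over many frames and comparing it to the 10 ms frame period at 8 kHz gives the real-time margin discussed above.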