2013, Acoustical Science and Technology
https://doi.org/10.1250/AST.34.311…
In this study, we propose a method of classifying speech under stress using parameters extracted from a physical model that characterizes the behavior of the vocal folds. In many conventional methods, feature parameters are extracted directly from the waveforms or spectra of input speech. Parameters derived from a physical model can characterize stressed speech more precisely because they represent physical characteristics of the vocal folds. We therefore propose a method that fits a two-mass model to real speech in order to estimate the physical parameters representing muscle tension in the vocal folds, vocal fold viscosity loss, and subglottal pressure from the lungs. Furthermore, combinations of these physical parameters are proposed as effective features for classifying speech as either neutral or stressed. Experimental results show that the proposed features achieve better classification performance than conventional methods.
EURASIP Journal on Audio, Speech, and Music Processing, 2013
In this study, we focus on the classification of neutral and stressed speech based on a physical model. In order to represent the characteristics of the vocal folds and vocal tract during the process of speech production and to explore the physical parameters involved, we propose a method using the two-mass model. As feature parameters, we focus on stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. The stiffness parameters and the area of the entrance to the vocal tract are extracted from the two-mass model after we fit the model to real data using our proposed algorithm. These parameters are related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract and can precisely represent features of speech under stress because they are affected by the speaker's psychological state during speech production. In our experiments, the physical features generated using the proposed approach are compared with traditionally used features, and the results demonstrate a clear improvement of up to 10% to 15% in average stress classification performance, which shows that our proposed method is more effective than conventional methods.
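The two abstracts above fit a two-mass vocal-fold model to speech. As a rough illustration of what such a model looks like dynamically, here is a heavily simplified, symmetric two-coupled-oscillator sketch in Python: two damped masses connected by a coupling spring, with a constant subglottal-pressure force on the lower mass. All constants are illustrative, and collision forces and aerodynamic coupling (which the real model requires for self-sustained oscillation) are omitted.

```python
import numpy as np

# Highly simplified two-mass vocal-fold sketch: two coupled damped
# oscillators, forward-Euler integrated. Constants are illustrative only.
def two_mass_displacements(m1=1e-4, m2=1e-5, k1=80.0, k2=8.0, kc=25.0,
                           r1=0.02, r2=0.02, ps=800.0, area=1e-5,
                           dt=1e-5, steps=20000):
    x1 = x2 = v1 = v2 = 0.0
    out = np.empty((steps, 2))
    f_drive = ps * area                      # subglottal pressure force
    for n in range(steps):
        a1 = (f_drive - k1 * x1 - r1 * v1 - kc * (x1 - x2)) / m1
        a2 = (-k2 * x2 - r2 * v2 - kc * (x2 - x1)) / m2
        v1 += a1 * dt; v2 += a2 * dt         # forward Euler step
        x1 += v1 * dt; x2 += v2 * dt
        out[n] = (x1, x2)
    return out

traj = two_mass_displacements()
```

Model-fitting methods like those in the abstracts search over parameters such as the stiffnesses `k1`, `k2` and the pressure `ps` so that the model's output matches a measured glottal or speech waveform.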
IEEE Transactions on Speech and Audio Processing, 2001
Studies have shown that variability introduced by stress or emotion can severely reduce speech recognition accuracy. Techniques for detecting or assessing the presence of stress could help improve the robustness of speech recognition systems. Although some acoustic variables derived from linear speech production theory have been investigated as indicators of stress, they are not always consistent. In this paper, three new features derived from the nonlinear Teager energy operator (TEO) are investigated for stress classification. It is believed that the TEO based features are better able to reflect the nonlinear airflow structure of speech production under adverse stressful conditions. The features proposed include TEO-decomposed FM variation (TEO-FM-Var), normalized TEO autocorrelation envelope area (TEO-Auto-Env), and critical band based TEO autocorrelation envelope area (TEO-CB-Auto-Env). The proposed features are evaluated for the task of stress classification using simulated and actual stressed speech and it is shown that the TEO-CB-Auto-Env feature outperforms traditional pitch and mel-frequency cepstrum coefficients (MFCC) substantially. Performance for TEO based features are maintained in both text-dependent and text-independent models, while performance of traditional features degrades in text-independent models. Overall neutral versus stress classification rates are also shown to be more consistent across different stress styles.
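The discrete Teager energy operator underlying the TEO features above is simple to state: Ψ[x](n) = x(n)² − x(n−1)·x(n+1). A minimal numpy sketch:

```python
import numpy as np

# Discrete Teager energy operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1).
# The output is two samples shorter than the input.
def teager_energy(x):
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure sinusoid sin(w*n) the operator returns the constant
# sin(w)^2, tracking amplitude and frequency jointly.
n = np.arange(1000)
psi = teager_energy(np.sin(0.2 * n))
```

The constant output for a pure tone is the property the stress features exploit: deviations from a smooth TEO profile reflect the nonlinear, multi-component airflow structure that the abstract associates with stressed speech.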
International Journal of Speech …, 2007
The variations in speech production due to stress have an adverse effect on the performance of speech and speaker recognition algorithms. In this work, different speech features, such as Sinusoidal Frequency Features (SFF), Sinusoidal Amplitude Features ( ...
1998
There are many stressful environments which deteriorate the performance of speech recognition systems. Examples include aircraft cockpits, 911 emergency telephone response, high-workload task stress, or emotional situations. To address this, we investigate a number of linear and nonlinear features and processing methods for stressed speech classification. The linear features include properties of pitch, duration, intensity, glottal source, and the vocal tract spectrum. Nonlinear processing is based on our newly proposed Teager Energy Operator (TEO) speech feature, which incorporates frequency-domain critical band filters and properties of the resulting TEO autocorrelation envelope. In this study, we employ a Bayesian hypothesis testing approach and a hidden Markov model (HMM) processor as classification methods. Evaluations focused on speech under loud, angry, and Lombard effect conditions from the SUSAS database. Results using receiver operating characteristic (ROC) curves and equal error rate (EER) based detection show that pitch is the best of the five linear features for stress classification, while the new nonlinear TEO-based feature outperforms the best linear feature by +5.2, with a reduction in classification rate variability from 8.66 to 3.90.
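The TEO autocorrelation envelope feature named above can be sketched as follows: compute the TEO profile of a frame, autocorrelate it, normalize, and take the area under the envelope as a scalar feature. This is only an illustration of the idea; the published feature additionally applies critical-band filtering before the operator, which is omitted here.

```python
import numpy as np

def teo_autocorr_env_area(frame):
    """Area under the normalized autocorrelation of the Teager energy
    profile of one frame -- a sketch of the TEO-Auto-Env idea (the
    published feature also applies critical-band filtering first)."""
    psi = frame[1:-1] ** 2 - frame[:-2] * frame[2:]       # TEO profile
    psi = psi - psi.mean()
    r = np.correlate(psi, psi, mode="full")[len(psi) - 1:]
    r = r / (r[0] + 1e-12)                                # normalize, r[0] ~ 1
    return np.abs(r).mean()                               # area per lag

# Amplitude-modulated tone as a stand-in for one voiced speech frame.
frame = (1 + 0.5 * np.sin(0.01 * np.arange(512))) * np.sin(0.3 * np.arange(512))
area = teo_autocorr_env_area(frame)
```

A slowly decaying envelope (large area) indicates a regular TEO profile, while stressed, irregular excitation tends to produce a quickly decaying envelope and a smaller area.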
Speech Communication, 1996
Speech production variations due to perceptually induced stress contribute significantly to reduced speech processing performance. One approach for assessment of production variations due to stress is to formulate an objective classification of speaker stress based upon the acoustic speech signal. This study proposes an algorithm for estimation of the probability of perceptually induced stress. It is suggested that the resulting stress score could be integrated into robust speech processing algorithms to improve robustness in adverse conditions. First, results from a previous stress classification study are employed to motivate selection of a targeted set of speech features on a per-phoneme and stress group level. Analysis of articulatory, excitation and cepstral based features is conducted using a previously established stressed speech database (Speech Under Simulated and Actual Stress (SUSAS)). Stress sensitive targeted feature sets are then selected across ten stress conditions (including Apache helicopter cockpit, Angry, Clear, Lombard effect, Loud, etc.) and incorporated into a new targeted neural network stress classifier. Second, the targeted feature stress classification system is then evaluated and shown to achieve closed speaker, open token classification rates of 91.0%. Finally, the proposed stress classification algorithm is incorporated into a stress directed speech recognition system, where separate hidden Markov model recognizers are trained for each stress condition. An improvement of +10.1% and +15.4% over conventionally trained neutral and multi-style trained recognizers is demonstrated using the new stress directed recognition approach.
Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.
A major challenge of automatic speech recognition systems found in many areas of today's society is the ability to overcome natural phoneme conditions that potentially degrade performance. In this study, we discuss the effects of two critical phoneme characteristics, decreased vowel duration and mismatched vowel type, on the performance of automatic stress detection in speech using Teager Energy Operator features. We determine the scope and magnitude of these effects on stress detection performance and propose an algorithm to compensate for vowel type and duration shortening on stress detection performance using a composite phoneme decision scheme, which results in relative error reductions of 24% and 39% in the non-stress and stress conditions, respectively.
Acta Physica Polonica A, 2012
This paper presents how voice stress is manifested in the acoustic and phonetic structure of the speech signal. Out of 60 000 authentic Police 997 emergency phone calls, 22 000 were automatically selected, a few hundred of which were chosen for acoustic evaluation, the basis for selection being a perceptual assessment. In highly stressful conditions (e.g. panic) a systematic dynamic over-one-octave shift in pitch and significant increase in speech tempo was observed. In states of depression a systematic downward shift in pitch and significant decrease in speech tempo was observed. Basic statistical measurements for stressed and neutral speech run over the database showed the relevance of the arousal and potency dimensions in stress processing. In speech produced under fear an upward shift in pitch register was significant (in comparison to neutral speech), while speech recorded during experiencing anger was characterized by an increase in F0 range.
Asilomar Conference on Signals, Systems & Computers, 2009
Stress in human speech can be detected by various methods known as Voice Stress Analysis (VSA). Detection is accomplished by measuring the frequency shift of a microtremor that normally resides in the 8 to 12 Hz range when the speaker is not stressed. Conventional detection methods include the Fast Fourier Transform (FFT) or the McQuiston-Ford algorithm. This paper presents a new method called
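The FFT-based microtremor measurement mentioned above amounts to finding the dominant spectral peak of a slowly varying envelope signal inside the 8-12 Hz band. A minimal sketch, using a synthetic 10 Hz tremor in place of a real amplitude envelope:

```python
import numpy as np

# Sketch of the classic VSA idea: locate the dominant "microtremor"
# frequency in the 8-12 Hz band of a slowly varying amplitude envelope.
# A synthetic 10 Hz tremor sampled at 100 Hz for 10 s stands in for the
# envelope extracted from real speech.
fs = 100.0
t = np.arange(0, 10, 1 / fs)
envelope = np.sin(2 * np.pi * 10.0 * t)

spectrum = np.abs(np.fft.rfft(envelope))
freqs = np.fft.rfftfreq(envelope.size, 1 / fs)
band = (freqs >= 8.0) & (freqs <= 12.0)
tremor_hz = freqs[band][np.argmax(spectrum[band])]
```

Under stress the tremor frequency is claimed to shift; tracking `tremor_hz` over time is the core measurement such systems make.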
In this research, we model and analyze the vocal tract under normal and stressful talking conditions. This research addresses the degradation in the recognition performance of text-dependent speaker identification under stressful talking conditions, and can be used (in future research) to improve recognition performance under such conditions.
We have developed software based on the Stevens landmark theory to extract features in utterances in and adjacent to voiced regions. We then apply two statistical methods, closest-match (CM) and principal components analysis (PCA), to these features to classify utterances according to their emotional content. Using a subset of samples from the Actual Stress portion of the SUSAS database as a reference set, we automatically classify the emotional state of other samples with 75% accuracy, using CM either alone or with PCA and CM together. The accuracy apparently does not depend strongly on measurement errors or other small details of the present data, giving confidence that the results will be applicable to other data.
Human speech reflects the speaker's state of mind. Proper classification of speech signals into stress types is necessary in order to assess whether the person is in a healthy state of mind. In this work we propose an SVM-based speech stress classification algorithm, with feature extraction techniques such as Mel-Frequency Cepstral Coefficients (MFCC). The SVM algorithm enables the system to learn speech patterns in real time and retrain itself in order to improve the classification accuracy of the overall system. The proposed system is suitable for real-time speech and is language- and word-independent.
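MFCCs, the front-end features named in the abstract above, can be computed compactly in numpy: frame the signal, window it, take the power spectrum, pool it with a triangular mel filterbank, and apply a log and DCT-II. This is a minimal illustrative sketch; production front-ends also add pre-emphasis, liftering, and energy/delta terms.

```python
import numpy as np

def mfcc(signal, fs=8000, n_fft=256, hop=128, n_mels=20, n_ceps=12):
    """Minimal MFCC sketch: Hamming window, mel filterbank, log, DCT-II."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular mel filterbank over the rFFT bins.
    edges = mel2hz(np.linspace(0.0, hz2mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Frame, window, power spectrum, log-mel energies, then DCT-II.
    win = np.hamming(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    logmel = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1)) / (2 * n_mels))
    return logmel @ dct.T                     # (n_frames, n_ceps)

feats = mfcc(np.sin(2 * np.pi * 440.0 * np.arange(4000) / 8000.0))
```

An SVM (e.g. scikit-learn's `SVC`) would then be trained on per-frame or per-utterance statistics of these coefficients.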
Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
This study proposes a new set of feature parameters based on subband analysis of the speech signal for classification of speech under stress. The new speech features are Scale Energy (SE), Autocorrelation-Scale-Energy (ACSE), subband-based cepstral parameters (SC), and Autocorrelation-SC (ACSC). The parameters' ability to capture different stress types is compared to widely used Mel-scale cepstrum based representations: Mel-frequency cepstral coefficients (MFCC) and Autocorrelation-Mel-scale (AC-Mel). Next, a feedforward neural network is formulated for speaker-dependent stress classification of 10 stress conditions: Angry, Clear, Cond50/70, Fast, Loud, Lombard, Neutral, Question, Slow, and Soft. The classification algorithm is evaluated using a previously established stressed speech database (SUSAS). Subband-based features are shown to achieve +7.3 and +9.1 increases in the classification rates over the MFCC-based parameters for ungrouped and grouped stress closed-vocabulary test scenarios, respectively. Moreover, the average scores across the simulations of the new features are +8.6 and +13.6 higher than MFCC-based features for the ungrouped and grouped stress test scenarios, respectively.
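Subband energy features of the general kind described above reduce each frame to a short vector of per-band log energies. As a simplified illustration (the paper's SE/SC features use a wavelet-style scale decomposition; equal-width FFT bands are used here only to show the idea):

```python
import numpy as np

# Sketch of subband log-energy features: partition the power spectrum of
# one windowed frame into equal-width subbands and log the band energies.
def subband_log_energies(frame, n_bands=8):
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bands = np.array_split(power, n_bands)
    return np.log(np.array([b.sum() for b in bands]) + 1e-10)

# Low-frequency tone: energy should concentrate in the first band.
e = subband_log_energies(np.sin(2 * np.pi * 0.05 * np.arange(256)))
```

Cepstral variants (the SC features) would additionally decorrelate these band energies with a DCT, analogous to the MFCC pipeline.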
International Journal of Bioinformatics Research and Applications, 2015
When a person is emotionally charged, stress can be discerned in his voice. This paper presents a simplified and non-invasive approach to detect psycho-physiological stress by monitoring the acoustic modifications during a stressful conversation. The voice database consists of audio clips from eight different popular FM broadcasts wherein the host of the show vexes the subjects, who are otherwise unaware of the charade. The audio clips are obtained from real-life stressful conversations (no simulated emotions). Analysis is done using the PRAAT software to evaluate mean fundamental frequency (F0) and formant frequencies (F1, F2, F3, F4) in both neutral and stressed states. Results suggest that F0 increases with stress, whereas formant frequencies decrease with stress. Comparison of Fourier and chirp spectra of a short vowel segment shows that for relaxed speech the two spectra are similar, while for stressed speech they differ in the high-frequency range due to increased pitch modulation.
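The F0 measurement central to the study above can be sketched with a basic autocorrelation pitch estimator: pick the lag with maximum autocorrelation inside a plausible F0 search range. This is a toy version of what tools like PRAAT do far more robustly.

```python
import numpy as np

# Autocorrelation pitch sketch: the best lag in a 75-500 Hz search range
# gives the fundamental period; F0 = fs / lag.
def estimate_f0(frame, fs=8000, f0_min=75.0, f0_max=500.0):
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + np.argmax(r[lo:hi + 1])
    return fs / lag

# A 200 Hz tone at fs = 8000 Hz has an exact 40-sample period.
f0 = estimate_f0(np.sin(2 * np.pi * 200.0 * np.arange(800) / 8000.0))
```

Comparing such F0 estimates between neutral and stressed segments is the basic measurement behind the paper's "F0 increases with stress" finding.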
9th International Conference on Affective Computing and Intelligent Interaction (ACII), 2021
This paper investigates a robust and effective automatic stress detection model based on human vocal features. Our experimental dataset contains the voices of 58 Greek-speaking participants (24 male, 34 female, 26.9±4.8 years old), in both neutral and stressed conditions. We extracted a total of 76 speech-derived features after an extensive study of the relevant literature. We investigated and selected the most robust features using automatic feature selection methods, comparing multiple feature ranking methods (such as RFE, mRMR, and stepwise fit) to assess their pattern across gender and experimental phase factors. Classification was then performed both for the entire dataset and for each experimental task, for both genders combined and separately. Performance was evaluated using 10-fold cross-validation on the speakers. Our analysis achieved a best classification accuracy of 84.8% using a linear SVM for the social exposure phase and 74.5% for the mental tasks phase using a Gaussian SVM classifier. Ordinal modelling significantly improved our results, yielding a best per-subject 10-fold cross-validation classification accuracy of 95.0% for social exposure and 85.9% for mental tasks, both using the Gaussian SVM. From our analysis, specific vocal features were identified as being robust and relevant to stress, along with parameters to construct the stress model. However, we also observed the susceptibility of speech to bias and masking, and thus the need for universal speech markers for stress detection.
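Feature-ranking steps like those described above score each candidate feature against the stress label and keep the top-ranked subset. As a minimal stand-in for the richer rankers the paper compares (RFE, mRMR), here is a univariate ranking by absolute correlation with a binary label, on synthetic data where one feature is deliberately made informative:

```python
import numpy as np

# Minimal univariate feature ranking: absolute correlation of each
# feature column with the binary stress label, best feature first.
# (Illustrative only -- the paper compares richer rankers such as
# RFE and mRMR.)
def rank_features(X, y):
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    scores = np.abs(X_c.T @ y_c) / (
        np.linalg.norm(X_c, axis=0) * np.linalg.norm(y_c) + 1e-12)
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 50)                # 50 neutral, 50 stressed
X = rng.normal(size=(100, 5))                # 5 candidate features
X[:, 2] += 3.0 * y                           # feature 2 made informative
order = rank_features(X, y)
```

Univariate ranking ignores redundancy between features, which is exactly what methods like mRMR are designed to handle; this sketch only shows the scoring-and-sorting skeleton shared by all of them.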
Lecture Notes in Computer Science
In this chapter, we consider a range of issues associated with analysis, modeling, and recognition of speech under stress. We start by defining stress, what could be perceived as stress, and how it affects the speech production system. In the discussion that follows, we explore how individuals differ in their perception of stress, and hence understand the cues associated with perceiving stress. Having considered the domains of stress, areas for speech analysis under stress, we shift to the development of algorithms to estimate, classify or distinguish different stress conditions. We will then conclude with revealing what might be in store for understanding stress, and the development of techniques to overcome the effects of stress for speech recognition and human-computer interactive systems.
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
Presently, automatic stress detection methods for speech employ a binary decision approach, deciding whether the speaker is or is not under stress. Since the amount of stress a speaker is under varies and can change gradually, a reliable stress level detection scheme becomes necessary to accurately assess the condition of the speaker. Such a capability is pertinent to a number of applications, such as those involving personnel in law enforcement positions. Using speech and biometric data collected from a real-world, variable-stress-level law enforcement training scenario, this study illustrates two methods for automatically assessing stress levels in speech using a hybrid multi-dimensional feature space comprised of frequency-based and Teager Energy Operator-based features. The first approach uses a nearest neighbor-type clustering scheme at the vowel token level to classify speech data into one of three levels of stress, yielding an overall error rate of 50.5%. The second approach employs accumulated Euclidean distance metric weighting at the sentence level to yield a relative improvement of 12.1% in performance.
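The first approach above assigns each vowel token to one of three stress levels by proximity in feature space. A minimal nearest-centroid sketch of that assignment step (the centroids and the 2-D feature space are hypothetical; the paper clusters real multi-dimensional TEO/frequency features):

```python
import numpy as np

# Nearest-centroid sketch of multi-level stress scoring: assign a feature
# vector to the closest of three per-level centroids.
def assign_level(x, centroids):
    d = np.linalg.norm(centroids - x, axis=1)   # Euclidean distances
    return int(np.argmin(d))                    # 0 = low, 1 = mid, 2 = high

# Hypothetical centroids for low/mid/high stress in a toy 2-D space.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
level = assign_level(np.array([1.9, 2.2]), centroids)
```

The paper's second approach refines this by accumulating such distances over all tokens in a sentence before deciding, which smooths out unreliable per-vowel assignments.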
Interspeech, 2008
International Journal of Engineering & Technology, 2018
Detection of stress from the speech signal has gained large attention recently. The emergence of new methods and techniques for feature extraction and classification has paved the way to different solutions for detecting different stress conditions from human speech, and has led to an increase in the accuracy of stress recognition. A large number of parameters have been proposed for the characterization of stress in speech. Similarly, numerous classifiers and machine learning algorithms have been investigated for stress classification and regression. In this article, we review the commonly used databases, stress conditions, feature extraction methods, and classifiers, along with some of the statistical measures and compensation techniques for stress detection. After a thorough illustration of the existing methodology for the task, future prospects for the work are elaborated.
IEEE Transactions on Speech and Audio Processing, 2000
It is well known that the performance of speech recognition algorithms degrades in the presence of adverse environments where a speaker is under stress, emotion, or Lombard effect. This study evaluates the effectiveness of traditional features in recognition of speech under stress and formulates new features which are shown to improve stressed speech recognition. The focus is on formulating robust features which are less dependent on the speaking conditions, rather than applying compensation or adaptation techniques. The stressed speaking styles considered are simulated angry and loud speech, Lombard effect speech, and noisy actual stressed speech from the SUSAS database, which is available on CD-ROM through the NATO IST/TG-01 research group and the LDC. In addition, this study investigates the immunity of the linear prediction power spectrum and fast Fourier transform power spectrum to the presence of stress. Our results show that, unlike the fast Fourier transform's (FFT) immunity to noise, the linear prediction power spectrum is more immune than the FFT to stress as well as to a combination of a noisy and stressful environment. Finally, the effects of various parameter processing choices, such as fixed versus variable preemphasis, liftering, and fixed versus cepstral mean normalization, are studied. Two alternative frequency partitioning methods are proposed and compared with traditional mel-frequency cepstral coefficient (MFCC) features for stressed speech recognition. It is shown that the alternate filterbank frequency partitions are more effective for recognition of speech under both simulated and actual stressed conditions.