Performance Evaluation of Hybrid Acoustic Feature Integration for Tonal Language Speech Processing Systems

Authors

  • Liang Chen, Department of Educational Sciences, Beijing Normal University, Beijing, China

Keywords:

Tonal language processing, acoustic feature integration, cepstral coefficients

Abstract

Automatic speech recognition (ASR) for tonal languages faces a distinct challenge: pitch variation distinguishes one lexical item from another, so pitch must be modeled alongside spectral content. This study evaluates hybrid acoustic feature integration frameworks that combine spectro-temporal, cepstral, and prosodic representations to improve recognition in tonal language processing systems. It systematically investigates how multi-stream feature architectures affect phonetic discrimination, robustness under acoustic variability, and tonal modeling accuracy.

The study employs a structured experimental framework integrating multiple feature extraction techniques, including Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), spectro-temporal modulation features, and pitch-based descriptors. These features are combined using hierarchical and parallel architectures to capture complementary acoustic information. The evaluation is conducted across controlled and noisy environments to assess system robustness and generalization capabilities.
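As a rough illustration of how a pitch-based descriptor can sit alongside cepstral features in a parallel arrangement, the sketch below estimates F0 for one frame by autocorrelation and appends log-F0 to a cepstral vector. It is a minimal sketch under stated assumptions, not the study's pipeline: the function names, the 60 to 400 Hz search range, the 8 kHz sample rate, and the 13-dimensional placeholder cepstral vector are all hypothetical.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz from the autocorrelation peak of a single frame.

    The search range (f0_min, f0_max) is an assumed setting, not taken
    from the study.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def append_pitch(cepstral_frame, f0_hz):
    """Concatenate a cepstral vector with log-F0 (simple parallel combination)."""
    return list(cepstral_frame) + [math.log(f0_hz)]

# Synthetic 50 ms frame: a 200 Hz sine sampled at 8 kHz (hypothetical test signal).
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(400)]

f0 = estimate_f0(frame, SAMPLE_RATE)
cepstral_frame = [0.0] * 13  # placeholder standing in for real MFCC or PLP values
combined = append_pitch(cepstral_frame, f0)
```

A practical system would also need a voicing decision and pitch smoothing across frames, both of which this sketch leaves out.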

The results indicate that hybrid feature integration significantly improves recognition accuracy over single-feature systems, particularly under tonal ambiguity and environmental distortion. The three feature types play complementary roles: spectro-temporal features capture temporal dynamics, cepstral features provide a stable description of the spectral envelope, and pitch information preserves tonal distinctions. Hierarchical fusion strategies also outperform simple concatenation because they allow context-sensitive feature weighting.
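The contrast between fixed concatenation and context-sensitive weighting can be made concrete with a small example. The abstract does not specify the weighting scheme, so the sketch below substitutes one common option from the multi-stream ASR literature, inverse-entropy weighting of per-stream class posteriors, in which a stream that is more confident on a given frame receives more weight. The function names and the example posteriors are hypothetical.

```python
import math

def entropy(posterior):
    """Shannon entropy (in nats) of a discrete posterior distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def inverse_entropy_fusion(posteriors, eps=1e-6):
    """Fuse per-stream posteriors for one frame.

    Each stream's weight is proportional to 1 / entropy, so the weights
    change from frame to frame with each stream's confidence. This is an
    assumed stand-in for the study's fusion strategy.
    """
    inverse = [1.0 / (entropy(p) + eps) for p in posteriors]
    total = sum(inverse)
    weights = [v / total for v in inverse]
    num_classes = len(posteriors[0])
    fused = [
        sum(w * p[k] for w, p in zip(weights, posteriors))
        for k in range(num_classes)
    ]
    return fused, weights

# Hypothetical frame: the cepstral stream is unsure between three tone
# classes, while the pitch stream is confident about the second one.
cepstral_posterior = [0.40, 0.35, 0.25]
pitch_posterior = [0.05, 0.90, 0.05]

fused, weights = inverse_entropy_fusion([cepstral_posterior, pitch_posterior])
best_class = max(range(len(fused)), key=fused.__getitem__)
```

On this frame the pitch stream receives the larger weight and the fused decision follows it, whereas plain concatenation would pass both streams to the classifier with no per-frame reweighting.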

The study contributes to the theoretical understanding of feature complementarity in ASR systems and provides practical insights for designing robust speech recognition frameworks for tonal languages. Limitations related to computational complexity and scalability are also discussed, along with future directions involving deep learning-based feature fusion and adaptive modeling techniques.

Published

2026-04-01

How to Cite

Liang Chen. (2026). Performance Evaluation of Hybrid Acoustic Feature Integration for Tonal Language Speech Processing Systems. European International Journal of Pedagogics, 6(04), 1–8. Retrieved from https://eipublication.com/index.php/eijp/article/view/4281