The source-filter theory (Fant 1960) hypothesizes that an acoustic speech signal can be seen as a source signal filtered by the resonances of the vocal tract cavities downstream from the glottis or the constriction. This simple model for speech synthesis is based on the assumption that the dynamics of the system are linear and separable into three independent blocks: the glottal energy (source), the vocal tract (filter), and the sound radiation (figure 1).
The glottal source roughly corresponds to the subglottal system, while the vocal tract (VT) corresponds to the supraglottal system. The radiation block can be considered a converter that turns volume velocity into acoustic pressure. In general, the radiation characteristic R(f) and the spectrum envelope of the source function S(f) for the glottal source are smooth and monotonic functions of frequency. The transfer function T(f), however, is usually characterized by several peaks corresponding to resonances of the acoustic cavities that form the vocal tract. Manipulating the shape of these cavities changes the positions and amplitudes of the peaks. Figure () qualitatively shows the configuration of the vocal tract corresponding to a vowel. The forms of the source spectrum S(f), the transfer function T(f), the radiation characteristic R(f), and the sound pressure pr(f) are shown in each case.
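Because the three blocks are assumed linear and independent, the radiated spectrum can be written as the product pr(f) = S(f) T(f) R(f), so the contributions add when expressed in decibels. The following is a minimal Python sketch of that relation. The specific shapes used here (a source envelope falling about 12 dB per octave, a radiation characteristic rising about 6 dB per octave, and a single resonance near 500 Hz with an 80 Hz bandwidth) are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def db(x):
    """Magnitude in decibels."""
    return 20.0 * np.log10(np.abs(x))

f = np.linspace(100.0, 5000.0, 50)   # frequency grid in Hz, 100 Hz spacing

# Assumed smooth, monotonic shapes (illustrative only):
S = (100.0 / f) ** 2                 # source envelope, falls about 12 dB per octave
R = f / 100.0                        # radiation, rises about 6 dB per octave

# Assumed single-resonance filter: one conjugate pole pair near 500 Hz.
s = 2j * np.pi * f
s1 = -np.pi * 80.0 + 2j * np.pi * 500.0
T = (s1 * np.conj(s1)) / ((s - s1) * (s - np.conj(s1)))

# Radiated sound pressure spectrum as the product of the three blocks.
P = S * T * R

# In decibels the three contributions simply add.
P_db = db(S) + db(T) + db(R)
```

Only T(f) contributes a peak in this sketch; S(f) and R(f) just tilt the overall spectrum.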
The transfer function T(f) is determined by applying the theory of sound propagation in tubes of arbitrary shape. For frequencies up to 5000 Hz, the cross-sectional dimensions of the vocal tract are less than a wavelength of the sound. Therefore, the sound can be treated as plane waves propagating parallel to the axis of the tube, and the vocal tract can be viewed as an acoustic tube of varying diameter.
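The plane-wave argument can be checked with a short calculation. The Python sketch below computes the wavelength at 5000 Hz and, as an illustration of the tube view, the quarter-wavelength resonances of a uniform tube closed at one end and open at the other. The speed of sound (350 m/s) and the tube length (17.5 cm) are assumed values chosen for illustration; neither is given in the text, and the function names are hypothetical.

```python
C = 350.0  # assumed speed of sound in m/s (not given in the text)

def wavelength(freq_hz, c=C):
    """Wavelength in metres of a sound wave at freq_hz."""
    return c / freq_hz

def uniform_tube_resonances(length_m, count=3, c=C):
    """Quarter-wavelength resonances (Hz) of a uniform tube that is
    closed at one end (glottis) and open at the other (lips)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

# About 0.07 m (7 cm) at 5000 Hz, the upper limit quoted for the
# plane-wave approximation.
lam = wavelength(5000.0)

# Roughly 500, 1500 and 2500 Hz for an assumed 17.5 cm tube.
resonances = uniform_tube_resonances(0.175)
```

A real vocal tract is not uniform, so its resonances shift away from these evenly spaced values as the cross-sectional area changes along the tube.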
Vocal tract transfer function
The vocal tract is approximated as an acoustic tube of a given length composed of a number of sections with different cross-sectional areas. This is equivalent to modeling the sampled vocal tract transfer function H(z) as a superposition of a given number of spectral poles and zeros, which in the spectral domain can be represented by

$$T(s) = K \frac{(s - s_a)(s - s_b)\cdots}{(s - s_1)(s - s_2)\cdots}$$
where K is a constant, s_a, s_b, ... are the zeros of T(s), and s_1, s_2, ... are the poles. In this equation, the poles and zeros mostly occur in complex conjugate pairs, and the real parts of these complex frequencies are much smaller than the imaginary parts, which means that the energy lost in one cycle is much less than the energy stored in one cycle of the oscillation. Therefore, grouping the poles into conjugate pairs, the pole part of T(s) can be expressed as:

$$T(s) = \frac{K_p}{(s - s_1)(s - s_1^*)(s - s_2)(s - s_2^*)\cdots}$$
where K_p is a constant and the stars indicate complex conjugates. The poles represent the natural frequencies of the vocal tract, and their imaginary parts give the formant frequencies, that is, the frequencies at which oscillations occur in the absence of excitation[reference]. In other words, depending on the shape of the acoustic tube (mainly influenced by tongue position), a sound wave traveling through it is reflected in such a way that interference generates resonances at certain frequencies. These resonances are called formants. Their location largely determines the speech sound that is heard.
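The link between conjugate pole pairs and formant peaks can be illustrated numerically. The Python sketch below builds an all-pole response from a set of formant frequencies and bandwidths and locates the peaks of its magnitude. It assumes the common convention that a formant at F Hz with bandwidth B Hz corresponds to the pole s = -πB + j2πF, and it chooses the constant so that T(0) = 1. The formant and bandwidth values, and the function names, are hypothetical and chosen only for illustration.

```python
import numpy as np

def poles_from_formants(formants_hz, bandwidths_hz):
    """One pole per conjugate pair: s_n = -pi*B_n + j*2*pi*F_n (assumed convention)."""
    F = np.asarray(formants_hz, dtype=float)
    B = np.asarray(bandwidths_hz, dtype=float)
    return -np.pi * B + 2j * np.pi * F

def all_pole_response(freqs_hz, poles):
    """Evaluate T(s) on s = j*2*pi*f for conjugate pole pairs.
    Each pair contributes s_n * conj(s_n) to the constant, so T(0) = 1."""
    s = 2j * np.pi * np.asarray(freqs_hz, dtype=float)
    T = np.ones_like(s)
    for p in poles:
        T *= (p * np.conj(p)) / ((s - p) * (s - np.conj(p)))
    return T

formants = [500.0, 1500.0, 2500.0]    # hypothetical formant frequencies, Hz
bandwidths = [60.0, 90.0, 120.0]      # hypothetical bandwidths, Hz
poles = poles_from_formants(formants, bandwidths)

freqs = np.arange(0.0, 4000.0, 5.0)   # evaluation grid, 5 Hz spacing
mag = np.abs(all_pole_response(freqs, poles))

# Local maxima of |T|: samples larger than both neighbours.
is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
peak_freqs = freqs[1:-1][is_peak]
```

With these values the real parts of the poles are much smaller than the imaginary parts, so the peaks of |T| fall close to the chosen formant frequencies; wider bandwidths would flatten the peaks and shift them slightly.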