Engineering Acoustics/source-filter theory

The source-filter theory (Fant 1960) hypothesizes that an acoustic speech signal can be seen as a source signal filtered by the resonances of the vocal tract cavities downstream from the glottis or the constriction. This simple model for speech synthesis assumes that the dynamics of the system are linear and separable into three independent blocks: the glottal energy (source), the vocal tract (filter), and the radiation of sound (as shown in the figure on the right).

The glottal source roughly corresponds to the subglottal system, while the vocal tract (VT) corresponds to the supraglottal system. The radiation block can be considered a converter that turns volume velocity into acoustic pressure. In general, the radiation characteristic R(f) and the spectrum envelope of the source function S(f) for the glottal source are smooth and monotonic functions of frequency. The transfer function T(f), however, is usually characterized by several peaks corresponding to resonances of the acoustic cavities that form the vocal tract. Manipulating the shape of these cavities changes the positions and amplitudes of the peaks. The figure on the left qualitatively shows the configuration of the vocal tract corresponding to a vowel. The forms of the source spectrum S(f), the transfer function T(f), the radiation characteristic R(f), and the sound pressure pr(f) are shown in each case.

The transfer function T(f) is determined by applying the theory of sound propagation in tubes of arbitrary shape. For frequencies up to 5000 Hz, the cross-sectional dimensions of the vocal tract are less than a wavelength of the sound (about 6.8 cm at 5000 Hz for a sound speed of 34,000 cm/s). Therefore, sound propagation can be treated as plane waves travelling along the axis of the tube, and the vocal tract can be viewed as an acoustic tube of varying diameter.

Vocal tract transfer function

The vocal tract is approximated as an acoustic tube of a given length composed of a number of sections with different cross-sectional areas. This is equivalent to modelling the sampled vocal tract transfer function H(s) as a product of a given number of poles and zeros, which in the complex frequency domain can be written as

$H(s)=K\prod _{i=1}^{N}{\frac {s-s_{ai}}{s-s_{i}}}$

where K is a constant, s_a1, s_a2, ... are the zeros of H(s), and s₁, s₂, ... are the poles. The poles and zeros mostly occur in complex conjugate pairs, and the real parts of these complex frequencies are much smaller than the imaginary parts, which means that the energy lost in one cycle is much less than the energy stored in one cycle of the oscillation. Therefore, the contribution of the poles to H(s) can be expressed as:

${\frac {1}{K_{p}}}\prod _{i=1}^{N}{\frac {s_{i}s_{i}^{*}}{(s-s_{i})(s-s_{i}^{*})}}$

where K_p is a constant and the stars indicate complex conjugates. The poles represent the natural frequencies of the vocal tract. Their imaginary parts give the formant frequencies, that is, the frequencies at which oscillations occur in the absence of excitation, and their real parts give the rates of decay of these oscillations. In other words, depending on the shape of the acoustic tube (mainly influenced by tongue position), a sound wave travelling through it is reflected in such a way that interference generates resonances at certain frequencies. These resonances are called formants. Their location largely determines the speech sound that is heard.
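The all-pole expression above can be evaluated numerically along the frequency axis (s = j2πf). The following is only a minimal sketch: it assumes the usual relation between a pole and its formant, s_i = -πB_i + j2πF_i, where B_i is the formant bandwidth in Hz, and it uses hypothetical bandwidth values. The function names and the choice K_p = 1 are illustrative assumptions, not part of the original text.

```python
import math


def formant_poles(formants_hz, bandwidths_hz):
    """One pole per formant: real part -pi*B (decay rate), imaginary part 2*pi*F."""
    return [complex(-math.pi * b, 2.0 * math.pi * f)
            for f, b in zip(formants_hz, bandwidths_hz)]


def all_pole_response(freq_hz, poles, k_p=1.0):
    """Evaluate (1/K_p) * prod s_i s_i* / ((s - s_i)(s - s_i*)) at s = j*2*pi*f."""
    s = complex(0.0, 2.0 * math.pi * freq_hz)
    h = complex(1.0 / k_p, 0.0)
    for p in poles:
        pc = p.conjugate()
        h *= (p * pc) / ((s - p) * (s - pc))
    return h


def gain_db(freq_hz, poles):
    """Magnitude of the response in decibels."""
    return 20.0 * math.log10(abs(all_pole_response(freq_hz, poles)))


# Formant frequencies of a uniform 17 cm tube; the bandwidths are hypothetical.
formants = [500.0, 1500.0, 2500.0]
bandwidths = [60.0, 90.0, 120.0]
poles = formant_poles(formants, bandwidths)

for f in (250.0, 500.0, 1000.0, 1500.0, 2000.0, 2500.0):
    print(f"{f:6.0f} Hz  {gain_db(f, poles):6.1f} dB")
```

With K_p = 1 the response equals 1 at f = 0, and the printed gains peak near each formant frequency. Narrower assumed bandwidths (poles closer to the imaginary axis) give sharper, taller peaks.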

Acoustic interpretation of transfer function

Figure: Vocal tract as tubes with varying cross section

According to the acoustics of tubes, the pressure and volume velocity at the end of a tube (x=L) can be related to those at its beginning (x=0). The following transfer matrix expresses the acoustical relationship between the two ends of a tube in the frequency domain:

${\begin{bmatrix}P_{0}\\U_{0}\end{bmatrix}}=T(\omega ){\begin{bmatrix}P_{L}\\U_{L}\end{bmatrix}},\quad T(\omega )={\begin{bmatrix}a_{11}&a_{12}\\a_{21}&a_{22}\end{bmatrix}}$

$a_{11}=\cos(KL),\quad a_{12}=j{\frac {\rho c}{A}}\sin(KL),\quad a_{21}=j{\frac {A}{\rho c}}\sin(KL),\quad a_{22}=\cos(KL)$

where K is the wave number, L is the length of the tube, A is its cross-sectional area, ρ is the density of air, and c is the speed of sound. This relation can be used to calculate the state of the wave field at one location, given the state of the field at another location.

Since the vocal tract can be considered as n tubes with different cross sections (see figure on the right), the transfer matrices of the individual tubes can be multiplied to relate the state at the glottis to that of the radiated sound:

$T(\omega )=T_{1}(\omega )\,T_{2}(\omega )\,T_{3}(\omega )\cdots T_{n}(\omega )$

and the overall equation for vocal tract becomes:

${\begin{bmatrix}P_{g}\\U_{g}\end{bmatrix}}=T(\omega ){\begin{bmatrix}P_{r}\\U_{r}\end{bmatrix}},\quad {\text{where }}P_{r}=U_{r}Z_{rad}$

In this equation, Z_rad is the radiation impedance. Up to a frequency of about 6000 Hz, the acoustic radiation impedance can be written approximately as:

$Z_{rad}={\frac {\rho c}{A_{m}}}\left({\frac {\pi f^{2}A_{m}}{c^{2}}}\right)K_{s}(f)+j\,2\pi f\rho \,{\frac {0.8a}{A_{m}}}$

where A_m is the area of the mouth opening, a is its effective radius, and K_s(f) is a dimensionless, frequency-dependent factor that accounts for the baffling effect of the head.

The transfer function of the system can be calculated as follows:

$H(\omega )=U_{r}/U_{g}$

Therefore the equations result in:

$P_{r}(\omega )=U_{g}(\omega )\,H(\omega )\,Z_{rad}(\omega )$

As can be seen, the equation above expresses the pressure in front of the mouth as the product of a source, a filter, and the radiation characteristic of the mouth. This equation describes the source-filter theory introduced in the first section.
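The chain of tube matrices, the radiation impedance and the resulting transfer function can be put together in a few lines. The sketch below rests on several assumptions that are not in the text: each section is lossless with characteristic impedance ρc/A (appropriate when the state variables are pressure and volume velocity), the air density is 0.0012 g/cm³, K_s(f) is replaced by the constant 1, the mouth area is the area of the last section, and the effective radius is that of a circular opening. All function names are illustrative.

```python
import math

RHO = 0.0012   # assumed air density in g/cm^3 (not given in the text)
C = 34000.0    # speed of sound in cm/s, the value used in the text


def tube_matrix(freq_hz, length_cm, area_cm2):
    """Lossless transfer matrix ((a11, a12), (a21, a22)) of one uniform section."""
    kl = 2.0 * math.pi * freq_hz / C * length_cm
    z0 = RHO * C / area_cm2  # characteristic impedance for volume velocity
    return ((math.cos(kl), 1j * z0 * math.sin(kl)),
            (1j * math.sin(kl) / z0, math.cos(kl)))


def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def chain_matrix(freq_hz, sections):
    """T = T1 T2 ... Tn for sections listed from glottis to lips as (length, area)."""
    total = ((1.0, 0.0), (0.0, 1.0))
    for length_cm, area_cm2 in sections:
        total = matmul2(total, tube_matrix(freq_hz, length_cm, area_cm2))
    return total


def radiation_impedance(freq_hz, mouth_area_cm2, k_s=1.0):
    """Z_rad as in the text, with K_s(f) replaced by a constant (an assumption)."""
    radius = math.sqrt(mouth_area_cm2 / math.pi)  # assumed circular opening
    resistance = (RHO * C / mouth_area_cm2) * (
        math.pi * freq_hz ** 2 * mouth_area_cm2 / C ** 2) * k_s
    reactance = 2.0 * math.pi * freq_hz * RHO * 0.8 * radius / mouth_area_cm2
    return complex(resistance, reactance)


def transfer_function(freq_hz, sections):
    """H = U_r / U_g = 1 / (a21 * Z_rad + a22), from the second matrix row."""
    (_, _), (a21, a22) = chain_matrix(freq_hz, sections)
    z_rad = radiation_impedance(freq_hz, sections[-1][1])
    return 1.0 / (a21 * z_rad + a22)


def radiated_pressure(freq_hz, sections, u_g=1.0):
    """P_r = U_g * H * Z_rad for a given glottal volume velocity U_g."""
    z_rad = radiation_impedance(freq_hz, sections[-1][1])
    return u_g * transfer_function(freq_hz, sections) * z_rad


# Uniform 17 cm tube with a hypothetical area of 3 cm^2.
uniform = [(17.0, 3.0)]
freqs = [100.0 + 5.0 * i for i in range(181)]  # 100 Hz to 1000 Hz
peak = max(freqs, key=lambda f: abs(transfer_function(f, uniform)))
print("first peak of |H| near", peak, "Hz")
```

For the uniform tube the first peak falls somewhat below 500 Hz, because the reactive part of the radiation impedance acts like a small extension of the tube. Passing a longer list of (length, area) pairs models a tract with varying cross section in the same way.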

Effects of vocal tract wall and other losses

In the previous section the vocal tract is modeled as a system without losses, except for the termination impedance. However, there are second-order effects that are necessary for precise modelling, such as wall effects, heat conduction, viscosity, and the glottal opening. These losses can change the bandwidths of the resonances and can also shift the resonance frequencies.

Resonant frequencies of air in a tube

The relationship between vocal tract shape and transfer function is complex, so we will consider the simple case of a uniform tube. The vocal tract in a vowel can be approximated by a tube that is closed at one end (the glottis) and open at the other (the lips). For a relatively unconstricted vocal tract of length L, the resonances occur at the following frequencies:

f = n * c / (4 * L) for n = 1, 3, 5, ...

where f is the formant frequency in Hz, c is the speed of sound (34,000 cm/s), and L is the length of the vocal tract in cm.
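This quarter-wavelength formula can be recovered from the tube transfer matrix given earlier. The short derivation below assumes an idealized open end at the lips, P_L = 0, which means that the radiation impedance is neglected. That simplification is made here for illustration only.

$U_{0}=a_{21}P_{L}+a_{22}U_{L}=\cos(KL)\,U_{L}\quad \Rightarrow \quad H={\frac {U_{L}}{U_{0}}}={\frac {1}{\cos(KL)}}$

$\cos(KL)=0\quad \Rightarrow \quad KL={\frac {n\pi }{2}},\ n=1,3,5,\ldots \quad \Rightarrow \quad f={\frac {nc}{4L}}\quad {\text{since }}K={\frac {2\pi f}{c}}$

The poles of H therefore sit at odd multiples of c/(4L), which is the condition that the tube length equals an odd number of quarter wavelengths.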

So the lowest formant frequency in a 17 cm vocal tract is:

f = c / (4 * L) = 34,000 / (4 * 17) = 500 Hz

And the spacing between formants is: Δf = 2 * c / (4 * L) = c / (2 * L) = 1000 Hz (always twice the lowest formant frequency).

Therefore, the formant frequencies are: F₁ = 500 Hz, F₂ = 1500 Hz, F₃ = 2500 Hz, F₄ = 3500 Hz.
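The arithmetic above is easy to reproduce for other tract lengths. The following is a minimal sketch of the uniform-tube formula f = n * c / (4 * L). The function name and its default arguments are illustrative assumptions.

```python
def quarter_wave_formants(length_cm, count=4, c_cm_per_s=34000.0):
    """Resonances of a uniform tube closed at one end and open at the other.

    Uses f = n * c / (4 * L) with n = 1, 3, 5, ... (odd multiples only).
    """
    return [(2 * i - 1) * c_cm_per_s / (4.0 * length_cm)
            for i in range(1, count + 1)]


print(quarter_wave_formants(17.0))  # [500.0, 1500.0, 2500.0, 3500.0]
```

Because every resonance scales with 1/L, a shorter tract raises all formants by the same factor, which is the main reason formant frequencies differ between speakers with different vocal tract lengths.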

Two Tube Vocal Tract Models of Vowels

Figure: Two tube model for vowel /a/

Figure: Two tube model for vowel /i/

Two resonators, or uniform tubes of different cross-sectional areas, can be connected to approximate some vowels or consonants. In this case, the natural frequencies of the whole system are not simply the natural frequencies of each tube, because of acoustic coupling. The figures show configurations of tubes that simulate the vowels /a/ and /i/.

Typical values (for an adult male vocal tract for vowel /a/) are l1 = 8 cm, l2 = 9 cm with A1 = 5 cm², A2 = 0.5 cm². Acoustic theory predicts resonances at 944 Hz, 1063 Hz, and 2833 Hz. The narrow and wide tubes can be considered as separate tubes whose resonance frequencies obey the relation stated in the previous section for a single tube. However, the acoustic impedance at the boundary between the two tubes is not zero and thus affects the natural frequencies of the tubes. The natural frequencies of the combined system are the frequencies for which the sum of the reactances at the junction is zero, that is:

$-{\frac {\rho c}{A_{1}}}\cot(KL_{1})+{\frac {\rho c}{A_{2}}}\tan(KL_{2})=0$

It should be noted that when the natural frequencies of the tubes are remote from one another, the influence of coupling is small.

Typical values for vowel /i/ in the human vocal tract are l1 = 9 cm, l2 = 8 cm, A1 = 5 cm², A2 = 0.5 cm². Thus, in theory, F1 = 202 Hz, F2 = 1890 Hz, F3 = 2125 Hz.
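The junction condition can also be solved numerically. The sketch below is an illustration under stated assumptions, not the method used to obtain the figures quoted above: it assumes lossless tubes, a closed glottis, an ideal open end at the lips, and c = 34,000 cm/s. The equation is multiplied through by A1 * A2 * sin(K l1) * cos(K l2) / (ρc) to remove the singularities of cot and tan, and the roots are located with a simple scan followed by bisection. All function names are illustrative.

```python
import math

C = 34000.0  # speed of sound in cm/s, the value used in the text


def junction_residual(freq_hz, l1, l2, a1, a2):
    """A1 sin(K l1) sin(K l2) - A2 cos(K l1) cos(K l2).

    This is the junction condition cleared of its denominators, so it is
    zero at the natural frequencies of the coupled two-tube system.
    """
    k = 2.0 * math.pi * freq_hz / C
    return (a1 * math.sin(k * l1) * math.sin(k * l2)
            - a2 * math.cos(k * l1) * math.cos(k * l2))


def bisect(func, lo, hi, iterations=60):
    """Plain bisection; assumes func changes sign (or is zero) within [lo, hi]."""
    f_lo = func(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def two_tube_resonances(l1, l2, a1, a2, f_max=4000.0, step=1.0):
    """Scan up to f_max for sign changes of the residual and refine each root."""
    def g(f):
        return junction_residual(f, l1, l2, a1, a2)

    roots = []
    f_lo = 0.5 * step
    g_lo = g(f_lo)
    for i in range(1, int(f_max / step)):
        f_hi = (i + 0.5) * step
        g_hi = g(f_hi)
        if g_lo * g_hi < 0.0 or g_hi == 0.0:
            roots.append(bisect(g, f_lo, f_hi))
        f_lo, g_lo = f_hi, g_hi
    return roots


# Dimensions as quoted in the text: (l1, l2, A1, A2) in cm and cm^2.
vowels = {"/a/": (8.0, 9.0, 5.0, 0.5), "/i/": (9.0, 8.0, 5.0, 0.5)}
for label, (l1, l2, a1, a2) in vowels.items():
    roots = two_tube_resonances(l1, l2, a1, a2)
    print(label, [round(r) for r in roots[:3]])
```

When A1 = A2 the residual reduces to -A cos(K (l1 + l2)), so the solver returns the quarter-wave resonances of a single uniform tube, which is a convenient sanity check. Several of the figures quoted in the text are close to what the single-tube formulas give for each tube on its own (for example c / (4 * 9 cm) ≈ 944 Hz), so the coupled roots computed here need not coincide with them exactly.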

Four Tube Vocal Tract Models of Vowels

Figure: Four tube model

Four tube models of vowels provide a much better estimate of formant frequencies for a wider range of vowels than two tube models do, and so are a more popular method of modeling vowels. Such models consist of a lip tube (tube 1), a tongue constriction tube (tube 3), and unconstricted tubes on either side of the constriction tube. The model is controlled by three parameters: i) the position of the centre of tube 3, ii) the cross-sectional area of tube 3, and iii) the ratio of the length to the cross-sectional area of the lip section. For extreme back constrictions tube 4 disappears, whilst for extreme front constrictions tube 2 disappears.

Calculations of resonance frequencies using the four tube model are quite complex, so Fant (1960) supplied a (fairly complex) graphical representation of the relationship between the three parameters and the resultant formant frequencies. These graphical representations are called nomograms. The original nomograms give, for a continuous range of constriction positions x (i.e. the distance from the centre of the tongue constriction to the glottis), the resulting F1 to F5 values. They do this for 5 values of lip area (A1) and for two values of tongue constriction cross-sectional area (A3). For different vocal tract lengths, different nomograms need to be computed.

The four tube, three parameter, model provides a sufficiently accurate prediction of most vowel sounds, but cannot model nasalisation of vowels.

References

1- Stevens, K.N. (2000). Acoustic Phonetics, The MIT Press.

2- Kinsler et al. (2000). Fundamentals of Acoustics, John Wiley & Sons.

3- Titze, I.R. (1994). Principles of Voice Production, Prentice Hall (currently published by NCVS.org), ISBN 978-0137178933.

4- Flanagan, J.L. and Rabiner, L.R. (1973). Speech Synthesis.