Event Location by Triangular Interpolation for Temporal Decomposition of Speech

For low-bit-rate coding and synthesis, the evolution of spectral parameters is a source of redundancy to be considered. A triangular interpolation spectral measure (TRISM) is proposed as the basis for an open-loop event location criterion for low-delay temporal decomposition (TD). TRISM comes as an improvement in linear interpolation error measurement over the spectral transition measure (STM). While STM is heuristic and presupposes asymmetric event functions, TRISM is a minimum square interpolation error based on symmetric functions. Minimum TRISM (MINTRISM) TD interpolates up to 13 frames between adjacent events at a mean event rate of 15 Hz and an interpolation error level equivalent to that of standard low-bit-rate speech coders. The MINTRISM criterion is also a more stable solution to the location of events and determination of their number than previous global and local TD methods.


I. INTRODUCTION
THE representation of speech spectral features plays a central role in speech coding, synthesis and recognition. Each spectral vector represents the envelope of the average speech spectrum along a frame, which is a quasistationary segment of speech that typically lasts for some tens of milliseconds. The line spectral frequency (LSF) coefficients [1], [2], [3] are the representation of choice for the spectral vectors in speech coding since they are very robust parameters against quantization and interpolation errors. For a pth-order linear prediction (LP) analysis, the LSFs constitute the complete set of p resonant frequencies of the lossless vocal tract model under both alternative conditions of open and closed termination at the glottis. The LSF values range over the open interval (0, π) radians per sample, that is, from DC to the Nyquist frequency.
Variable-rate interpolation of target spectral vectors is implemented in various methods known as temporal decomposition [4]. The temporal decomposition of parameter tracks involves the location of event centers in the analysis phase, when event targets are sampled at event center locations and then refined. The set of frames that lie between an event center inclusively and the next one exclusively is called a superframe. In the synthesis or recognition phase, event targets are interpolated by means of event functions in order to reconstruct the parameter tracks.
Unconstrained TD incurs long delays. While such algorithms are useful for speech recognition [5], store-and-forward messaging applications [6] and for compressing speech synthesis corpora [7], two-way coding applications require low-delay TD algorithms. That is why the TD algorithm to be described locates event centers one at a time and constrains event functions to a finite support that spans two consecutive intertarget intervals.
In particular, a spectral measure for event location in TD is proposed. The triangular interpolation spectral measure (TRISM) is based on interpolation error minimization and local slope minimization under the condition of a triangular event function. It is compared to the spectral feature transition rate (SFTR) [8], which reduces to the spectral transition measure (STM) [9] when the location window length is fixed.
There are no well established guidelines for acceptable interpolation distortion as there are for LSF quantization distortion. As a matter of fact, speech decoders usually interpolate frame rate vectors for a subframe resolution of one-fourth frame length, without accounting for the interpolation error incurred. An exceptional work in this respect was done by Paliwal [10], which may be taken as a baseline reference for the performance of uniform linear interpolation.
Besides, for low-bit-rate speech coding, weighted distortion measures across frame time and frequency [11], [12] should be considered.

II. SPECTRAL MEASURES FOR EVENT LOCATION
The LSF evolution matrix $Y$ contains, for each frame in the range $n = 0, 1, \ldots, N-1$, $p$ LSFs as column vector $y(n)$. It is temporally decomposed, generating target matrix $A$ and event matrix $\Phi$, which may be used to estimate $Y$ as

$$\hat{Y} = A\,\Phi. \tag{1}$$

The columns $a_j$ in matrix $A$ for $j = 0, 1, \ldots, J-1$ are the target vectors, where $J$ is the number of events. Event functions $\phi_j(n)$ for $n = 0, 1, \ldots, N-1$ and $j = 0, 1, \ldots, J-1$ are represented as vectors $\phi_j$ whose transposes are the rows in matrix $\Phi$.
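A minimal NumPy sketch of this reconstruction may help fix the shapes involved. It assumes the estimate is the plain matrix product of a $p \times J$ target matrix and a $J \times N$ event matrix; the function name and the toy values are hypothetical and only serve as an illustration.

```python
import numpy as np

def reconstruct_tracks(A, Phi):
    """Estimate the p x N LSF evolution matrix from targets and event functions.

    A   : p x J matrix whose columns are the target vectors a_j
    Phi : J x N matrix whose rows are the event functions phi_j(n)
    """
    A = np.asarray(A, dtype=float)
    Phi = np.asarray(Phi, dtype=float)
    if A.shape[1] != Phi.shape[0]:
        raise ValueError("A needs one column per row of Phi (one per event)")
    return A @ Phi

# Toy case (assumed values): p = 2 LSFs, J = 2 events, N = 5 frames.
A = np.array([[0.5, 1.0],
              [1.5, 2.5]])
Phi = np.array([np.linspace(1.0, 0.0, 5),    # first event decays
                np.linspace(0.0, 1.0, 5)])   # second event rises
Y_hat = reconstruct_tracks(A, Phi)
```

With these complementary straight-line event functions, the first and last reconstructed frames coincide with the two targets and the frames in between are linear blends of them.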
In triangular TD, event functions for error measurement are assumed to be linear interpolation functions that are symmetric around their center locations, whereas STM-based TD implicitly uses asymmetric linear interpolation functions in STM evaluation.
In particular, local TD is performed between the current and the next event locations in a two-stage sequential process. The first stage involves the determination of the next event location. Then, in the second stage, the next event target vector is determined along with the functions for the current and the next events in an iterative fashion.
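In the first stage, event function $\phi_j(n)$ is assumed to reach its peak unity value at event center $C(j) = n_0$ and to be triangular and symmetric about it, so that

$$\phi_j(n) = \begin{cases} 1 - \alpha\,(n - n_0), & n_0 \le n \le n_0 + M, \\ 1 + \alpha\,(n - n_0), & n_0 - M \le n \le n_0, \end{cases} \tag{2}$$

where $\alpha$ is the attack slope. Along the $k$th LSF track, the interpolation error is

$$e_{kj}(n) = y_k(n) - a_{kj}\,\phi_j(n), \tag{3}$$

where $a_{kj}$ is the target LSF value. The square interpolation error along the $k$th LSF track for the window $n_0 - M \le n \le n_0 + M$ is

$$\varepsilon_{kj} = \sum_{m=-M}^{M} e_{kj}^2(n_0 + m) = \sum_{m=-M}^{M} \left[ y_k(n_0 + m) - a_{kj}\,(1 - \alpha |m|) \right]^2. \tag{4}$$

The joint interpolation error of all LSF tracks is $\varepsilon_j = \sum_{k=1}^{p} \varepsilon_{kj}$. Expanding Eq. (4), rearranging the result and casting it in vector notation yields

$$\varepsilon_j(\alpha) = \sum_{m=-M}^{M} \left\| y(n_0 + m) - a_j \right\|^2 + 2\alpha \sum_{m=-M}^{M} |m|\, a_j^T \left[ y(n_0 + m) - a_j \right] + \alpha^2 \left\| a_j \right\|^2 \sum_{m=-M}^{M} m^2. \tag{5}$$

Imposing $\partial \varepsilon_j / \partial \alpha = 0$ for minimum square interpolation error, the slope of the event function turns out to be

$$\hat{\alpha} = \frac{\sum_{m=-M}^{M} |m|\, a_j^T \left[ a_j - y(n_0 + m) \right]}{\left\| a_j \right\|^2 \sum_{m=-M}^{M} m^2}. \tag{6}$$

It is noticeable that $\alpha = \hat{\alpha}$ is really a unique minimum for $\varepsilon_j(\alpha)$ at given window length $2M + 1$ and location $n = n_0$ since, by differentiating (5) twice with respect to $\alpha$, we get

$$\frac{\partial^2 \varepsilon_j}{\partial \alpha^2} = 2 \left\| a_j \right\|^2 \sum_{m=-M}^{M} m^2 = \frac{2}{3}\, M (M + 1)(2M + 1) \left\| a_j \right\|^2, \tag{7}$$

which is strictly positive. For event center location, the target vector $a_j$ is identified with the central LSF vector $y(n_0)$. The criterion proposed for determining event centers is the minimization over frame time of the triangularly fit slope $\hat{\alpha}(n)$. For a given location window length $2M + 1$, this is equivalent to the determination of frame locations $n$ that locally minimize the triangular interpolation spectral measure (TRISM) defined in Eq. (8), which consists of a scaled version of the absolute value of the slope estimate (6).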
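The slope fit behind TRISM can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the event function is the symmetric triangle $1 - \alpha |n - n_0|$, the target is the central LSF vector $y(n_0)$, and the closed-form slope is the one obtained by setting the derivative of the summed square error to zero. The scale factor that turns the slope into TRISM is not assumed here, so the second function returns the unscaled $|\hat{\alpha}(n)|$. Both function names are hypothetical.

```python
import numpy as np

def triangular_slope(Y, n0, M):
    """Least-squares attack slope of a symmetric triangular event function.

    Y  : p x N matrix of LSF vectors (one column per frame)
    n0 : candidate event center; the target is taken as y(n0) (assumption)
    M  : half window length, M >= 1, with n0 - M .. n0 + M inside the track
    """
    Y = np.asarray(Y, dtype=float)
    if M < 1 or n0 - M < 0 or n0 + M >= Y.shape[1]:
        raise ValueError("window must have M >= 1 and lie inside the track")
    a = Y[:, n0]                               # target vector a_j = y(n0)
    m = np.arange(-M, M + 1)                   # frame offsets in the window
    window = Y[:, n0 - M:n0 + M + 1]           # p x (2M + 1)
    proj = a @ (a[:, None] - window)           # a^T [a - y(n0 + m)] for each m
    return float(np.sum(np.abs(m) * proj) / ((a @ a) * np.sum(m ** 2)))

def trism_unscaled(Y, M):
    """|slope| at every frame where the window fits; NaN elsewhere."""
    Y = np.asarray(Y, dtype=float)
    N = Y.shape[1]
    out = np.full(N, np.nan)
    for n in range(M, N - M):
        out[n] = abs(triangular_slope(Y, n, M))
    return out
```

A quick sanity check is a track that is exactly triangular around $n_0$: the fitted slope then equals the slope used to build the track, and a constant track gives a zero measure.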
In a previous local TD method, event functions are assumed to be linear and minimum event function slope is taken to be the manifestation of spectral stability, whose location is declared event center [8]. This led to the minimization of the spectral transition measure (STM) [9] given in Eq. (9), where $2M + 1$ is the location window length. By inspection of Eq. (8), TRISM is seen to be a normalized measure in relation to the spectral coefficients and to the location window length, whereas, by Eq. (9), STM is found to depend directly on the magnitude of the spectral coefficients. The weighting of the spectral coefficients is seen to be symmetric for TRISM and antisymmetric for STM, as shown additionally in Fig. 1. This can be interpreted to implicitly involve symmetric interpolation functions in the evaluation of TRISM and asymmetric ones in the evaluation of STM, as illustrated in Fig. 2.
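To make the antisymmetric weighting concrete, the sketch below evaluates an STM-like measure as the sum over LSF tracks of squared regression slopes, each slope being fitted with weights $m = -M, \ldots, M$. This regression-slope form is an assumption made for illustration; the exact normalization used in [9] may differ, and the function name is hypothetical.

```python
import numpy as np

def stm_regression(Y, M):
    """STM-like measure per frame: sum over tracks of squared regression slopes.

    Assumed form (not confirmed against [9]):
        c_k(n) = sum_m m * y_k(n + m) / sum_m m^2,   STM(n) = sum_k c_k(n)^2
    """
    Y = np.asarray(Y, dtype=float)
    N = Y.shape[1]
    m = np.arange(-M, M + 1)                    # antisymmetric weights
    denom = np.sum(m ** 2)
    out = np.full(N, np.nan)
    for n in range(M, N - M):
        slopes = Y[:, n - M:n + M + 1] @ m / denom   # one slope per LSF track
        out[n] = np.sum(slopes ** 2)
    return out
```

Under this assumed form, a track that is symmetric about the window center gives a zero measure because the antisymmetric weights cancel, whereas the symmetric weights $|m|$ of the triangular fit still register the slope on either side of the center.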
For computational cost evaluation, Eq. (9) can be rearranged as Eq. (10). The operational complexity involved in the evaluation of TRISM and STM for a frame, according to Eqs. (8) and (10), is displayed in Table I, where it can be verified that TRISM requires one addition, $p$ multiplications and one division per frame more than STM for the location of event centers, where $p$ is the LP order. However, the greater stability of TRISM evaluations allows for a reduction in the number of refinement iterations in comparison.

III. LOCAL TEMPORAL DECOMPOSITION
The measures presented in Section II locate the internal event centers $C_j$ for $j = 1, 2, \ldots, J-2$. Additionally, the endpoint event centers are set by Eq. (11), where $f_f$ is the frame rate for LP analysis. Each target vector is initially identified with the original LSF vector at the event center just located, that is,

$$a_j = y(C_j) \tag{12}$$

for $j = 0, 1, \ldots, J-1$.
The two event functions, $\phi_j(n)$ and $\phi_{j+1}(n)$, for the current superframe $n = C_j, C_j + 1, \ldots, C_{j+1} - 1$, are determined as a function of the right-hand target vector $a_j$ for the previous superframe and the running estimate $a_{j+1}$ for the right-hand target vector of the current superframe by the optimal procedure outlined in [13], which consists of the solutions to the sets of equations in Eq. (13). The estimated event function samples in Eq. (13) are modified, if necessary, to lie in the range from zero to unity, that is,

$$\hat{\phi}_i(n) \leftarrow \min\left\{ 1, \max\left\{ 0, \hat{\phi}_i(n) \right\} \right\}, \quad i \in \{ j, j+1 \}, \tag{14}$$

for $n = C_j, C_j + 1, \ldots, C_{j+1} - 1$.
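One way to picture this step is as a small least-squares problem per frame: each LSF vector in the superframe is approximated by a weighted sum of the two targets, and the two weights are the event function samples. The sketch below takes that view as an assumption; it is not confirmed to be the exact set of equations of the procedure in [13], and the function name is hypothetical. The clipping to the range from zero to unity follows the text.

```python
import numpy as np

def estimate_event_functions(Y_sf, a_left, a_right):
    """Per-frame least-squares event function samples for one superframe.

    Y_sf    : p x L matrix with the LSF vectors of the superframe
    a_left  : target vector a_j (right-hand target of the previous superframe)
    a_right : running estimate of the target vector a_{j+1}
    Returns (phi_left, phi_right), each of length L, clipped to [0, 1].
    """
    Y_sf = np.asarray(Y_sf, dtype=float)
    targets = np.column_stack([a_left, a_right]).astype(float)   # p x 2
    # Solve targets @ coef ~= Y_sf for all frames at once (assumed formulation).
    coef, *_ = np.linalg.lstsq(targets, Y_sf, rcond=None)        # 2 x L
    coef = np.clip(coef, 0.0, 1.0)                               # keep samples in [0, 1]
    return coef[0], coef[1]
```

When the superframe really is a blend of the two targets with weights already inside the unit interval, the fit returns those weights unchanged; clipping only acts on frames that the two targets cannot explain within that range.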
Next, the right-hand target vector $a_{j+1}$ for the current superframe is reestimated, given the left-hand target vector and the sample values of the event functions in the current superframe, by minimizing the square interpolation error

$$\varepsilon_j = \sum_{n=C_j}^{C_{j+1}-1} \left\| y(n) - a_j\,\phi_j(n) - a_{j+1}\,\phi_{j+1}(n) \right\|^2. \tag{15}$$

Setting the gradient $\partial \varepsilon_j / \partial a_{j+1} = 0$ and rearranging terms, the new estimate for the right-hand target vector is found to be

$$\hat{a}_{j+1} = \frac{\sum_{n=C_j}^{C_{j+1}-1} \phi_{j+1}(n) \left[ y(n) - a_j\,\phi_j(n) \right]}{\sum_{n=C_j}^{C_{j+1}-1} \phi_{j+1}^2(n)}. \tag{16}$$

The LSFs in refined target vectors are tested for stability and made stable by the procedure described in [9] if necessary. By defining initial event functions as straight-line segments, refinement can be carried out in either order, that is, event functions first or target vector first. Both orders are tested in the experiments described in Section IV.
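The target reestimation has a closed form that is easy to sketch. The code below assumes the square error is summed over the frames of the current superframe only and that each frame is modeled by the two targets weighted by their event function samples; the function name is hypothetical, and the LSF stability check of [9] is deliberately left out.

```python
import numpy as np

def reestimate_right_target(Y_sf, a_left, phi_left, phi_right):
    """Least-squares update of the right-hand target vector of a superframe.

    Y_sf      : p x L matrix with the LSF vectors of the superframe
    a_left    : left-hand target vector a_j (kept fixed)
    phi_left  : samples of phi_j over the superframe, length L
    phi_right : samples of phi_{j+1} over the superframe, length L
    """
    Y_sf = np.asarray(Y_sf, dtype=float)
    phi_left = np.asarray(phi_left, dtype=float)
    phi_right = np.asarray(phi_right, dtype=float)
    energy = phi_right @ phi_right
    if energy == 0.0:
        raise ValueError("phi_right is identically zero; target is undetermined")
    # Remove the contribution of the left-hand event, then project on phi_right.
    residual = Y_sf - np.outer(a_left, phi_left)
    return residual @ phi_right / energy
```

Note that the result is not guaranteed to be an ordered LSF vector, which is why a separate stability test on the refined target is still needed.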
Refinement is repeated until iteration $I$ such that the relative square interpolation error difference satisfies the inequality in Eq. (17). Also, lower complexity TD algorithms are used that constrain the two event functions in a superframe to be symmetric unity-complementary [9], [14], that is,

$$\phi_j(n) + \phi_{j+1}(n) = 1 \tag{18}$$

for $n = C_j, C_j + 1, \ldots, C_{j+1} - 1$.
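Under the unity-complementary constraint only one event function needs to be estimated per superframe, which is where the saving in complexity comes from. The sketch below assumes a least-squares fit with $\phi_{j+1}(n) = 1 - \phi_j(n)$ substituted into the two-target model; the procedures in [9] and [14] may differ in detail. It also shows one plausible reading of the stopping rule, in which the error change is normalized by the previous error; that normalization is an assumption. Both function names are hypothetical.

```python
import numpy as np

def symmetric_event_functions(Y_sf, a_left, a_right):
    """Event functions constrained to phi_left + phi_right = 1 (least squares)."""
    Y_sf = np.asarray(Y_sf, dtype=float)
    a_left = np.asarray(a_left, dtype=float)
    a_right = np.asarray(a_right, dtype=float)
    d = a_left - a_right
    energy = d @ d
    if energy == 0.0:
        raise ValueError("targets coincide; event functions are undetermined")
    # y(n) ~= a_right + phi_left(n) * (a_left - a_right), solved frame by frame.
    phi_left = np.clip(d @ (Y_sf - a_right[:, None]) / energy, 0.0, 1.0)
    return phi_left, 1.0 - phi_left

def should_stop(prev_error, curr_error, delta=1e-4):
    """Assumed stopping rule: relative change of the square error below delta."""
    if prev_error == 0.0:
        return True
    return abs(prev_error - curr_error) / prev_error < delta
```

A refinement loop would alternate this step with the target update and call `should_stop` on the square errors of consecutive iterations.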
The computational complexity of overall TD analysis is dominated by the second-stage iterative determination of next event target and current and next event functions. Further, the number of iterations in the second stage depends on the method used for event function determination, either the optimal or the symmetric procedures.

IV. EXPERIMENTS WITH TEMPORAL DECOMPOSITION AND LINEAR INTERPOLATION
Speech spectral envelopes were obtained at a frame rate of 200 Hz as the LSF vector representation that results from tenth-order LP analysis of a 25 ms segment of speech extracted through a Hamming window. The whole set of signals in the test partition of the TIMIT database [15], [16] was resampled at 8 kHz and LP-analyzed as just described, resulting in a total of 1.037 million frames of speech.
Since TRISM is a more stable measure for event location, event rate hardly varies with location window length $2M + 1$. However, by interposing a dead time of $M$ frames after each event detection, event rate can be controlled when using TRISM. By this procedure, event rates may be varied from 12 Hz up to 65 Hz when $M = 1, 2, \ldots, 12$. This same variation in $M$ causes event rates for STM to range from around 12 Hz up to 50 Hz. Different event rates may also be obtained with a fixed window length for all rates by varying the original frame rate [14].
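The dead-time mechanism can be sketched as a sequential scan for local minima of the location measure. The sketch assumes that "dead time of $M$ frames" means that the $M$ frames following a detected event are skipped before the next candidate is considered; the function name and the toy measure are hypothetical.

```python
def pick_events(measure, dead_time):
    """Sequentially pick local minima, skipping `dead_time` frames after each.

    measure   : sequence of per-frame values of the location measure
    dead_time : number of frames ignored after each detected event (assumed meaning)
    """
    events = []
    next_allowed = 0
    for n in range(1, len(measure) - 1):
        if n < next_allowed:
            continue                       # still inside the dead time
        if measure[n] <= measure[n - 1] and measure[n] < measure[n + 1]:
            events.append(n)
            next_allowed = n + dead_time + 1
    return events

# Toy measure with local minima at frames 1, 3, 5 and 7.
toy = [3.0, 1.0, 2.0, 0.5, 2.0, 1.0, 3.0, 0.0, 4.0]
all_events = pick_events(toy, dead_time=0)
sparse_events = pick_events(toy, dead_time=2)
```

Because the scan is causal and needs only one frame of lookahead beyond the location window, it fits the one-event-at-a-time operation required for low delay; a longer dead time simply thins out the detected events.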
Interpolation error is measured as log spectral distortion (SD) [17] between the original log spectral envelope $10 \log_{10} S_n(e^{j\omega})$ and the interpolated log spectral envelope $10 \log_{10} \hat{S}_n(e^{j\omega})$ associated with the original LSF vector $y(n)$ and the interpolated LSF vector $\hat{y}(n)$, respectively. The log SD is evaluated as the root mean square value $D(n)$ of the difference between these log spectral envelopes over a 1000-point uniform grid on the unit circle.
The minimum relative square interpolation error difference for stopping refinement, defined in Eq. (17), was set to $\delta = 1 \cdot 10^{-4}$, resulting in a mean number of refinement iterations per superframe ranging from 5 to 10.
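A hedged sketch of the distortion computation follows. It assumes that each LSF vector has already been converted back to the coefficients of the LP inverse filter $A(z) = 1 + a_1 z^{-1} + \cdots + a_p z^{-p}$ (the LSF-to-LP conversion is not shown) and that the envelope is $1 / |A(e^{j\omega})|^2$ with unit gain. The grid is taken as 1000 equally spaced points around the full unit circle. Function names are hypothetical.

```python
import numpy as np

def log_envelope_db(lpc, n_points=1000):
    """10 log10 of the LP envelope 1/|A(e^{jw})|^2 on a uniform unit-circle grid.

    lpc : coefficients [1, a_1, ..., a_p] of the inverse filter A(z) (assumed input)
    """
    # A zero-padded FFT evaluates A(z) at w = 2*pi*k/n_points, k = 0..n_points-1.
    A = np.fft.fft(np.asarray(lpc, dtype=float), n_points)
    return -20.0 * np.log10(np.abs(A))

def log_spectral_distortion(lpc_ref, lpc_test, n_points=1000):
    """RMS difference in dB between two log spectral envelopes."""
    diff = log_envelope_db(lpc_ref, n_points) - log_envelope_db(lpc_test, n_points)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Identical envelopes give zero distortion, and a pure gain mismatch shows up as a constant offset in dB, which is why gain is usually normalized before spectral envelopes are compared.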
Two factors may be compared and contrasted by observing Fig. 3, namely, the event location criterion and the order in which the event functions and the target vectors are reestimated in each refining iteration. For the MINSTM criterion, reestimating the target vector last causes a decrease of about 0.20 dB in distortion along most of the event rate range tested, while for MINTRISM the improvement is far from uniform, reaching a maximum of about 0.20 dB at around 30 Hz but giving virtually coincident results below an event rate of around 15 Hz. When target vectors are refined last, the MINTRISM criterion displays a consistent decrease of 0.20 dB over MINSTM for event rates below 30 Hz.
Next, a reference for goodness of fit was sought for the TD algorithms by comparing their overall performance to that of uniform linear interpolation. For the outline of the global domain of TD performance, just the lowermost and the uppermost curves in Fig. 3 were selected for overlay with the uniform linear interpolation curve in Fig. 4. It can be seen that the distortion for optimal MINTRISM TD with target refinement last is always lower than that of linear interpolation by at least 0.4 dB for mean event rates below 33 Hz. Even the upper distortion bound for optimal TD lies below the linear interpolation distortion curve for all mean event rates below about 36 Hz.

Fig. 3. MINTRISM versus MINSTM criteria for optimal TD with reestimation of event functions and target vector in both orders. Curves: optimized TD with MINSTM, functions last; MINSTM, target last; MINTRISM, functions last; MINTRISM, target last.
The symmetric low-complexity local TD algorithms are compared to the best optimal TD algorithm and to uniform linear interpolation in Fig. 5. For mean event rates below 30 Hz, the low-complexity MINTRISM TD performance is uniformly 0.2 dB higher in distortion than the optimal algorithm, whose distortion is lower than that of linear interpolation by 0.3 dB at the higher event rates and by more than 0.5 dB at the lower rates. On the other hand, the performance of the symmetric MINSTM TD algorithm traverses from the meeting point with linear interpolation at about 37 Hz to a virtual encounter with symmetric MINTRISM TD at about 12.5 Hz.
The distribution along the frames of the log SD for LSF interpolation may be better assessed through the analysis of Figs. 6 and 7, which display the percentage of outliers in the range above 2 dB and up to 4 dB and in the range above 4 dB, respectively. As a group, TD algorithms show from one-half to one-fifth of the percentage of outermost outliers of linear interpolation. Inside the TD group, the behavior of the best symmetric TD algorithm is confined between those of optimal MINTRISM TD and optimal MINSTM TD. The operation of linear interpolation at 33.33 Hz may be taken as acceptable since it is used in low-bit-rate coding [18], [10]. For the same outermost outlier percentage, optimal MINSTM TD operates at a mean event rate of 20 Hz and optimal MINTRISM TD operates at around 15 Hz, as shown in Fig. 7. This amounts to a compression ratio of more than two for MINTRISM TD over linear interpolation. These event rates include on average 10 and 13 frames per superframe, respectively. In addition, low-complexity symmetric MINTRISM TD operates at virtually the same event rate as optimal MINSTM TD.
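The superframe lengths and the compression ratio quoted above follow directly from the 200 Hz frame rate and the stated event rates. The short calculation below only restates that arithmetic; the function names are hypothetical.

```python
FRAME_RATE_HZ = 200.0   # LP analysis frame rate used in the experiments

def frames_per_superframe(event_rate_hz, frame_rate_hz=FRAME_RATE_HZ):
    """Mean number of frames between adjacent events."""
    return frame_rate_hz / event_rate_hz

def compression_ratio(baseline_rate_hz, event_rate_hz):
    """Ratio of the baseline vector rate to the mean event rate."""
    return baseline_rate_hz / event_rate_hz

minstm_frames = frames_per_superframe(20.0)      # optimal MINSTM TD
mintrism_frames = frames_per_superframe(15.0)    # optimal MINTRISM TD
ratio = compression_ratio(33.33, 15.0)           # versus linear interpolation
```

At 20 Hz this gives 10 frames per superframe, at 15 Hz about 13.3 frames, and the ratio of 33.33 Hz to 15 Hz is about 2.2, in line with the figures stated in the text.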

V. CONCLUSION
Variable-rate sampling and interpolation of LSF tracks for speech signals has been analyzed and tested, using uniform linear interpolation as a baseline for comparison. The proposed algorithm features low algorithmic delay due to sequential event location. Events are localized by the first stage of the algorithm using the proposed minimum triangular interpolation spectral measure (MINTRISM) criterion. The mean realized event rate under MINTRISM is the least sensitive to location window length among global and local TD criteria. Refining target vectors after event functions improves the spectral match, particularly at higher event rates, but the order of refinement is immaterial below a mean event rate of 20 Hz.
Over a mean event rate range from 12 Hz up to 35 Hz, TRISM performs better than STM by 0.2 dB in log SD. A lower complexity version of MINTRISM TD constrains the two event functions in a superframe to be symmetric unity-complementary and performs between MINSTM and MINTRISM TD. They can interpolate a maximum of 10, 11 and 13 frames between adjacent events, for a uniform frame rate of 200 Hz, within the interpolation distortion of standard low-bit-rate speech coders.

Fig. 1. Location windows for the evaluation of TRISM (left plot) and STM (right plot), illustrated for a five-frame long case.

Fig. 2. General shape of event functions involved in the evaluation of TRISM (left plot) and STM (right plot), illustrated for a five-frame long case.

Fig. 4. MINTRISM and MINSTM criteria for optimal TD are compared to uniform linear interpolation. Reestimation of event functions is followed by target vector reestimation for MINTRISM TD and is done the other way around for MINSTM TD.
Fig. 5. Symmetric MINTRISM and MINSTM TD are compared to MINTRISM optimal TD and to uniform linear interpolation. Reestimation of event functions is followed by target vector reestimation.
Fig. 6. Percentage of frames $n$ such that 2 dB < $D(n)$ ≤ 4 dB for symmetric MINTRISM TD, optimal MINTRISM and MINSTM TD and uniform linear interpolation with reestimation of event functions followed by target vector reestimation.

Fig. 7. Percentage of frames $n$ such that $D(n)$ > 4 dB for symmetric MINTRISM TD, optimal MINTRISM and MINSTM TD and uniform linear interpolation with reestimation of event functions followed by target vector reestimation.

TABLE I. OPERATIONAL COMPLEXITY PER FRAME FOR TRISM AND STM EVALUATION, WHERE p IS LP ORDER AND 2M + 1 IS THE LOCATION WINDOW LENGTH.