Review of Objective Speech Quality Measures for Voiceband Coders

The purpose of this paper is to review the most important published studies in the area of objective measures that are useful for predicting the quality of speech coders. Admittedly, "quality" is an extremely elusive concept that involves something more than intelligibility and takes into account the phenomenon of individual differences in taste.
Indeed, a primary merit of subjective measures is that they provide a score which incorporates all aspects of the human speech perception process. In fact, listeners do not use only the acoustic cues but actively exploit their knowledge of the language, the syntactic and semantic contexts, and even talker-related information to succeed in the perception task. However, either many subjects or trained crews must be used in the experiments, in order to remove the large component of variance due to the relative tolerance of each subject for distortions and noise. Unfortunately, subjective scores are not related to physical signal characteristics and give little insight into how a speech codec can be improved. Another drawback is that subjective measures are susceptible to errors of both subjects and administrators. Further, it is difficult to compare subjective results obtained at different times and places.
In recent years, new digital speech processing techniques have been developed and incorporated in voiceband codecs. Because of the large number of factors determining the output speech quality, it is practically impossible, as previously noted, to optimize the coding system by resorting to subjective judgments, which are generally costly and time consuming. Therefore, various objective measures have been tested and compared in order to establish their degree of correlation with subjective scores. Objective measures are inexpensive to administer and quite reliable. In addition, the coder performance can readily be improved by directly minimizing the distortion as defined by the objective measure itself.
So far, a clear-cut answer to the problem of evaluating the quality of very different coders or speech distortions through a single universal objective measure is not available. In fact, an optimal quality estimator should provide not only a good prediction of the performance of a given coder, but also a correct ranking among various coders. However, the results presented in the literature, and summed up in the following, show that a few objective measures look promising for certain classes of speech digitizers.
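As an illustration of the two requirements just mentioned (prediction and ranking), the sketch below computes a Pearson correlation and a Spearman rank correlation between objective and subjective scores. The choice of these two statistics, the function names and the numbers are assumptions made for illustration; they are not taken from the studies reviewed here.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """Rank positions (1 = smallest). Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = float(position)
    return result

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical scores for four coders (illustrative numbers only).
objective = [12.0, 18.5, 21.0, 30.2]   # some objective measure, e.g. in dB
subjective = [2.1, 3.0, 2.8, 4.2]      # e.g. mean opinion scores

print("prediction (Pearson):", pearson(objective, subjective))
print("ranking (Spearman):", spearman(objective, subjective))
```

In this toy data the second and third coders are ranked in the opposite order by the two scores, so the rank correlation falls below 1 even though the linear correlation remains high.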

OBJECTIVE MEASURES
This section describes a number of computable objective measures that are usually employed as tools in the evaluation task. The first three measures are defined in the time domain and represent a simple way to characterize, in a single number, the performance of a codec under test. The other measures are defined in the frequency domain and permit a more sophisticated approach to the issue of gauging speech quality. Moreover, they are insensitive to short delays between input and output signals or to phase distortion.

Long-term SNR
A widely used measure of performance, easy to compute and well understood, is the conventional signal-to-noise ratio (SNR), defined as
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} x^2(n)}{\sum_{n} [x(n) - y(n)]^2} \quad \text{(dB)} \tag{1}$$
where x(n) and y(n) are the input and output signals, respectively. Since the summations are taken over the entire speech utterance, Eq. (1) is called long-term SNR.
Several experiments have shown that this measure is poorly correlated with subjective quality; nevertheless, it is sometimes used during the design and "tuning" of waveform coders.
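A minimal sketch of the long-term SNR is given below, assuming time-aligned input and output sample lists of equal length. The function name and the handling of the zero-energy cases are illustrative choices.

```python
import math

def long_term_snr_db(x, y):
    """Long-term SNR in dB: input energy over error energy, whole utterance."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    signal_energy = sum(v * v for v in x)
    noise_energy = sum((a - b) ** 2 for a, b in zip(x, y))
    if noise_energy == 0.0:
        return math.inf    # output identical to input (illustrative convention)
    if signal_energy == 0.0:
        return -math.inf   # silent input (illustrative convention)
    return 10.0 * math.log10(signal_energy / noise_energy)

# Toy example: the output is the input scaled by 0.9.
x = [1.0, -1.0, 1.0, -1.0]
y = [0.9, -0.9, 0.9, -0.9]
print(long_term_snr_db(x, y))
```

For this toy input the error is one tenth of the signal amplitude, so the ratio of energies is 100 and the result is about 20 dB.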

Segmental SNR (SNRseg)
An improvement to the previous measure, suggested by P. Noll [1], averages SNR values over short (15-30 ms) segments and therefore assigns equal weight to loud and soft parts of the utterance. This measure can capture individual preferences for a given coder, but still fails to predict a correct ranking between different coders. The main shortcoming of this measure arises when the variation of individual SNR values around the average is large.
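The averaging over short segments can be sketched as follows. Non-overlapping frames, a frame length given in samples, and the skipping of silent or error-free frames are assumptions of this sketch, not details stated in the text.

```python
import math

def segmental_snr_db(x, y, frame_len):
    """Mean of per-frame SNR values (dB) over non-overlapping frames."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    frame_snrs = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        xs = x[start:start + frame_len]
        ys = y[start:start + frame_len]
        signal_energy = sum(v * v for v in xs)
        noise_energy = sum((a - b) ** 2 for a, b in zip(xs, ys))
        if signal_energy == 0.0 or noise_energy == 0.0:
            continue  # skip frames whose SNR is undefined (assumption)
        frame_snrs.append(10.0 * math.log10(signal_energy / noise_energy))
    if not frame_snrs:
        raise ValueError("no frame with a finite SNR")
    return sum(frame_snrs) / len(frame_snrs)

# One loud frame and one soft frame, both with the same error amplitude (0.1).
loud_in, loud_out = [1.0, -1.0, 1.0, -1.0], [1.1, -0.9, 1.1, -0.9]
soft_in, soft_out = [0.1, -0.1, 0.1, -0.1], [0.2, 0.0, 0.2, 0.0]
print(segmental_snr_db(loud_in + soft_in, loud_out + soft_out, frame_len=4))
```

The loud frame has an SNR of about 20 dB and the soft frame about 0 dB, so the segmental value is about 10 dB. A long-term SNR over the same samples would be dominated by the loud frame.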

Gain-compensated SNRseg
In this measure, variations of the output speech level with respect to the input signal are compensated before taking the SNR, segment by segment. This procedure is supported by the fact that small amplitude variations introduced by the coder can impair the SNR measurements, while having a negligible impact on the subjective quality.
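One possible way to perform the compensation is sketched below. It assumes that each output segment is rescaled by the least-squares gain g = sum(x*y) / sum(y*y) before the error is measured; the text does not specify how the level is compensated, so this choice is an assumption.

```python
import math

def gain_compensated_segmental_snr_db(x, y, frame_len):
    """Segmental SNR (dB) after a per-frame least-squares gain on the output."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    frame_snrs = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        xs = x[start:start + frame_len]
        ys = y[start:start + frame_len]
        sxx = sum(a * a for a in xs)
        syy = sum(b * b for b in ys)
        sxy = sum(a * b for a, b in zip(xs, ys))
        if sxx == 0.0 or syy == 0.0:
            continue  # silent input or output frame: SNR undefined (assumption)
        gain = sxy / syy  # least-squares gain (assumed compensation rule)
        noise = sum((a - gain * b) ** 2 for a, b in zip(xs, ys))
        if noise == 0.0:
            continue  # frame is an exact scaled copy: SNR undefined
        frame_snrs.append(10.0 * math.log10(sxx / noise))
    if not frame_snrs:
        raise ValueError("no frame with a finite SNR")
    return sum(frame_snrs) / len(frame_snrs)

# Output is roughly half the input level, plus a small error.
x = [1.0, -1.0, 1.0, -1.0]
y = [0.55, -0.45, 0.45, -0.55]
print(gain_compensated_segmental_snr_db(x, y, frame_len=4))
```

Without compensation the level change alone would limit the SNR of this frame to about 6 dB; after compensation only the residual error counts and the value is about 20 dB.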

Frequency weighted segmental SNR
Many experiments have shown that the perceptual quality of a coder depends, among other things, on the frequency distribution of the quantizing noise relative to the speech spectrum. In fact, it is well known that the auditory mechanism relies upon a short-term spectral analysis of the incoming signal, exploiting this spectral information as a frequency-warped "place spectrum" translated on the basilar membrane of the ear [7]. It turns out that a speech signal is judged to be of "good quality" when each location on the basilar membrane (or equivalently each "critical band") is excited by a signal with a sufficiently high SNR.
According to the classical articulation model, the speech band, ranging from 200 up to 6100 Hz, is divided into 20 nonuniform subbands experimentally derived. Each subband is assumed to contribute, independently of the others and under optimum conditions, an equal 5% to the so-called articulation index (AI)
$$\mathrm{AI} = \frac{1}{20} \sum_{j=1}^{20} \frac{\min\{\mathrm{SNR}(j), \mathrm{TH}\}}{\mathrm{TH}} \tag{2}$$
where the peak-signal-to-rms-noise ratio in band j, SNR(j), is clipped to a maximum value of TH dB (e.g. TH = 30 dB), so that AI cannot be greater than 1.
Eq. (2) can be transformed into an integral over frequency [7], Eq. (3), in which F(f) (the Jacobian of the transformation) is a frequency weighting that falls off with an approximate slope of 20 dB from 200 to 6100 Hz. Therefore, the AI measure can be considered as a frequency weighted SNR measure.
An important feature of this measure is that the width of each band increases with the center frequency, so that each band carries the same contribution. This is in tune with the fact that the short-term speech spectrum tends to be flatter (whiter) in high frequency subbands, which leads to a decreasing amount of information (entropy). Moreover, the weighting behavior is in agreement with the optimal SNR distribution, as a function of frequency, for subband coders.
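The articulation index described above (equal contributions from each band, with the band SNR clipped at TH dB) can be sketched as follows. The band SNR values are assumed to be already available in dB; clipping negative values to 0 is an added assumption so that the index stays between 0 and 1.

```python
def articulation_index(band_snrs_db, th_db=30.0):
    """Articulation index from per-band SNR values in dB (equal band weights)."""
    if not band_snrs_db:
        raise ValueError("at least one band is required")
    total = 0.0
    for snr_db in band_snrs_db:
        # Upper clip at TH follows the text; lower clip at 0 dB is an assumption.
        clipped = min(max(snr_db, 0.0), th_db)
        total += clipped / th_db
    return total / len(band_snrs_db)

# Hypothetical band SNRs for the 20 articulation bands.
print(articulation_index([30.0] * 10 + [15.0] * 10))
```

With ten bands at the clipping level and ten bands at half of it, the index is 0.75.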
Considering the time-varying nature of the short-time speech spectrum, a static weighting function does not perform adequately, while a dynamic one has the potential to yield a better measure of quality. A general form assumed by this refined measure is given by Eq. (4), where i refers to the speech segment index, f is frequency, S(f) is the short-time spectrum of the input speech, G[S(f)] is a dynamic frequency weighting which is related to the speech production mechanism, F(f) is a static frequency weighting derived from psychoacoustic properties of hearing, and SNR(f) is the short-time SNR at frequency f (i.e. the coder performance).
From Eq. (4), we can define a number of objective measures, which differ in the dynamic weighting function G and in the SNR computation [8].
The first form is based on the assumption that G[S(f)] = 1, and SNR(f) is computed in dB (log scale). This gives Eq. (5), where $\mu_{xj}$ is the power of the reference signal x(n) in the articulation band j, for a short segment (e.g. 30 ms), and $\mu_{nj}$ is the power of the corresponding noise signal y(n) - x(n). The notation < > denotes the average over all the segments in the speech utterance.
The second form is characterized by a weighting that approximates the subjective loudness, Eq. (6), where $S_y(f)$ is the spectrum of the coded speech, and $f_j$ is the center frequency of band j. The resulting measure is given by Eq. (7).
The third form exploits a normalized log spectral weighting P(j), Eq. (8), where $\mu_{yj}$ is the output signal power in band j, and $\mu_y$ is the total power in the output signal (for the i-th segment). Thus P(j) is 1 if the energy in band j is equal to the total energy, and is set to 0 if it is 40 dB (or more) below the total output energy. The resulting log spectral weighted measure is given by Eq. (9).

We will consider first the unweighted Euclidean distance based upon cepstral coefficients, which will be referred to as the cepstral distance measure CDM.
Consider two all-pole spectral models $G/A(z)$ and $G'/A'(z)$. The error or difference between these models on a log magnitude versus frequency scale is defined as [32]
$$V(\theta) = \ln\frac{G^2}{|A(e^{j\theta})|^2} - \ln\frac{G'^2}{|A'(e^{j\theta})|^2} \tag{10}$$
where $\theta$ is a normalized frequency or angle in the z plane, with $\pi$ representing the half-sampling frequency. A logical choice for a distance measure between spectral models is the set of $L_p$ norms defined as
$$(d_p)^p = \int_{-\pi}^{\pi} |V(\theta)|^p \, \frac{d\theta}{2\pi} \tag{11}$$
The rms log spectral measure is defined for p = 2. These $L_p$ measures can be related to decibel variations in the log spectral domain through the multiplicative factor 10/ln(10) = 4.34...
In order to reduce the computational load required to estimate $V(\theta)$ as a summation, we can resort to other efficient methods based on linear prediction analysis. To this effect, if A(z) is an Mth order polynomial in $z^{-1}$ with all of its roots within the unit circle, and $A(\infty) = 1$, then a Taylor series expansion gives
$$\ln A(z) = -\sum_{k=1}^{\infty} c_k z^{-k} \tag{12}$$
where $\{c_k\}$ are the cepstral coefficients. It follows [32] that the Fourier series expansion for the model log spectrum is
$$\ln\frac{G^2}{|A(e^{j\theta})|^2} = \sum_{k=-\infty}^{\infty} c_k e^{-jk\theta} \tag{13}$$
where $c_0 = \ln[G^2]$ and $c_{-k} = c_k$.
An application of Parseval's theorem to the $d_2$ distance measure gives
$$(d_2)^2 = \sum_{k=-\infty}^{\infty} (c_k - c'_k)^2 \tag{14}$$
All the spectral shape information lies in the coefficients $c_1, c_2, \ldots, c_M$, since they uniquely describe the filter coefficients of A(z). Thus, we can take a truncated series to define a cepstral measure u(L), for L greater than or equal to M, as
$$[u(L)]^2 = (c_0 - c'_0)^2 + 2\sum_{k=1}^{L} (c_k - c'_k)^2 \tag{15}$$
The cepstral distance measure in dB is defined as
$$\mathrm{CD} = \frac{10}{\ln 10}\, u(L) \tag{16}$$
In conclusion, Eq. (16) can be readily computed by means of linear prediction analysis, to evaluate the models A(z) and A'(z), and well-known transformations between model coefficients $a_j$ and cepstral coefficients $c_k$.
It is also possible to define the quefrency weighted cepstral distance or Root Power Sums [33] as
$$d_{\mathrm{RPS}}^2 = \sum_{k=1}^{L} k^2 (c_k - c'_k)^2 \tag{17}$$
The most important feature of the weighting is that it de-weights the lower order cepstral coefficients, rather than weighting the higher order ones, in a data-independent manner.
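The computation of a cepstral distance from linear prediction coefficients can be sketched as follows. The sketch assumes the convention A(z) = 1 + a_1 z^-1 + ... + a_M z^-M with ln A(z) = -sum_k c_k z^-k, uses the standard recursion between predictor and cepstral coefficients, and takes the distance as (10/ln 10) times the square root of the squared log-gain difference plus twice the sum of squared cepstral differences. Function names and default gains are illustrative.

```python
import math

def lpc_to_cepstrum(a, num_coeffs):
    """Cepstral coefficients c_1..c_L of the all-pole model 1/A(z).

    a holds a_1..a_M for A(z) = 1 + sum_k a_k z^-k (assumed sign convention).
    """
    order = len(a)
    c = []
    for n in range(1, num_coeffs + 1):
        a_n = a[n - 1] if n <= order else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += k * c[k - 1] * a[n - k - 1]
        # Recursion: c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}
        c.append(-a_n - acc / n)
    return c

def cepstral_distance_db(a_ref, a_test, num_coeffs, gain_ref=1.0, gain_test=1.0):
    """Truncated cepstral distance in dB between two all-pole models."""
    c_ref = lpc_to_cepstrum(a_ref, num_coeffs)
    c_test = lpc_to_cepstrum(a_test, num_coeffs)
    gain_term = (math.log(gain_ref ** 2) - math.log(gain_test ** 2)) ** 2
    shape_term = 2.0 * sum((p - q) ** 2 for p, q in zip(c_ref, c_test))
    return (10.0 / math.log(10.0)) * math.sqrt(gain_term + shape_term)

# First-order check: A(z) = 1 - 0.5 z^-1 has cepstrum c_k = 0.5**k / k.
print(lpc_to_cepstrum([-0.5], 3))
print(cepstral_distance_db([-0.5], [0.0], num_coeffs=8))
```

The first-order model is used as a check because its cepstrum is known in closed form.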
Finally, it is worth mentioning the use of transitional spectral variations in a specific cepstral measure. This dynamic feature of the spectral space has mainly been used in speech recognizers [34]. An application to speech compression algorithms is presented in [35].
The spectral variation in time is represented by the time derivative of the log-spectrum or, recalling Eq. (13), by the time derivative of the sampled time series $c_k(t)$, which usually does not have an analytic form. Since the 1st order finite difference is in general noisy, the derivative can be approximated by an orthogonal polynomial fit on each cepstral trajectory over a fixed number of frames (window).
The 1st order coefficient, or spectral slope in time, of the orthogonal polynomial has the form
$$\Delta c_k(t) = \frac{\sum_{n=-N}^{N} n\, h_n\, c_k(t+n)}{\sum_{n=-N}^{N} n^2\, h_n} \tag{18}$$
where $h_n$ is the window of length 2N + 1. A weighted Euclidean distance between two given transitional spectra is defined in Eq. (19).
Dynamic spectral features play an important role in speech perception, as demonstrated in a perceptual experiment by S. Furui [36].
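The spectral slope of one cepstral trajectory can be sketched as a windowed first-order regression: the sum of n * h_n * c(t+n) divided by the sum of n^2 * h_n, for n from -N to N. The rectangular default window and the function name are assumptions of this sketch.

```python
def spectral_slope(trajectory, t, half_width, window=None):
    """First-order regression coefficient of a cepstral trajectory at frame t.

    trajectory: values of one cepstral coefficient, one per frame.
    half_width: N, so that the window spans 2N + 1 frames.
    window: weights h_n for n = -N..N; rectangular if omitted (assumption).
    """
    offsets = range(-half_width, half_width + 1)
    if window is None:
        window = [1.0] * (2 * half_width + 1)
    if len(window) != 2 * half_width + 1:
        raise ValueError("window must have length 2N + 1")
    if t - half_width < 0 or t + half_width >= len(trajectory):
        raise ValueError("window does not fit inside the trajectory")
    numerator = sum(n * h * trajectory[t + n] for n, h in zip(offsets, window))
    denominator = sum(n * n * h for n, h in zip(offsets, window))
    return numerator / denominator

# A linear trajectory c(t) = 2t + 1 should give a slope of 2 at any frame.
track = [2.0 * frame + 1.0 for frame in range(10)]
print(spectral_slope(track, t=5, half_width=2))
```

For a symmetric window the regression returns the exact slope of a linear trajectory, which makes a convenient sanity check.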

Other parametric and spectral measures
In addition to the previous objective measures, other parametric and spectral measures have been suggested for the evaluation of speech coders [25]. Of particular interest are those based on linear prediction analysis, such as: log area ratio (LAR) measure, reflection coefficient (RFC) measure, feedback (or predictor) coefficient (FBC) measure, log likelihood ratio (LLR) measure, linear spectral distance (LSD) and frequency-variant spectral distance (FVSD).
The measures LAR, RFC and FBC are computed by performing the linear prediction analysis over input and output speech frames, and then evaluating the L1 norm between the corresponding input and output parameters.
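As an example of the L1 norm between corresponding parameters, the sketch below evaluates a log area ratio distance for one frame. It assumes that the reflection coefficients of the input and output frames are already available from a linear prediction analysis, and it uses the convention LAR = ln((1 + k) / (1 - k)); sign conventions differ between authors, so this is an assumption.

```python
import math

def log_area_ratios(reflection):
    """Log area ratios from reflection coefficients (assumed sign convention)."""
    for k in reflection:
        if not -1.0 < k < 1.0:
            raise ValueError("reflection coefficients must lie in (-1, 1)")
    return [math.log((1.0 + k) / (1.0 - k)) for k in reflection]

def lar_distance(reflection_in, reflection_out):
    """L1 norm between input and output log area ratios for one frame."""
    if len(reflection_in) != len(reflection_out):
        raise ValueError("both frames must use the same prediction order")
    lar_in = log_area_ratios(reflection_in)
    lar_out = log_area_ratios(reflection_out)
    return sum(abs(p - q) for p, q in zip(lar_in, lar_out))

# Hypothetical reflection coefficients for one input and one output frame.
print(lar_distance([0.5, -0.2], [0.4, -0.1]))
```

The RFC and FBC measures would follow the same pattern with the reflection or predictor coefficients compared directly.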
For LLR, the likelihood ratio between input and output LPC parameters is raised to the power 0.25.
For LSD, the LPC all-pole model spectra for input and output speech are normalized to have the same geometric mean, then the L2 norm is taken.
The measure FVSD is also computed using the input and output LPC model spectra. Each spectrum is divided into typically six bands, which are separately normalized so that the average spectral amplitude over each band is unity. A weighted norm is taken between the normalized subband spectra, the weighting function being the input LPC spectrum. Finally a linear combination of the six norms is formed. The constants required for the linear combination can be obtained via linear regression analysis with subjective scores.
The short-time banded SNR (STB-SNR) is a generalization of the segmental SNR. The input speech and the noise (difference between the input and the output signals) are filtered into six bands and for each band the SNRseg is computed. A linear combination of the six SNRseg values is then formed. The final objective score is obtained, as usual, by time averaging over frames.
Another interesting measure, proposed in [26], incorporates an explicit parametric model of speech perception, and is based on the perturbations exhibited by the spectral peaks of the output signal. Speech formants are computed on the original and distorted signals via the Line Spectrum Pair transformation of the LPC polynomial. Nine different features (e.g. energy, differences in location and in bandwidth, movement, etc.) are determined by comparing the spectral peaks, classified as lost, false and distorted. The objective measure is obtained as a combination of 13 individual features, each computed on at most four spectral peaks.

EXPERIMENTAL RESULTS
In this section, we briefly report the most important results obtained by various researchers. In particular, each subsection deals with the work performed in a specific laboratory.
Subjective measure: absolute category rating with Mean Opinion Score (MOS).
Results: A combined measure, based on a modified log likelihood and the percent articulatory bandwidth, predicts well the individual preferences in a given coder as well as the inter-relationships between coders.
Results: The subjective SNR, defined as the SNR of that reference signal which, on the average, is equally preferred to the test signal by a group of listeners, can be used only for waveform coders of good quality. The correlation obtained using the coherence function over PCM, ADPCM and APC is 0.96.
Results: The best prediction scores are achieved by a gain-compensated SNRseg, and also by a spectral signal-to-distortion ratio. A linear combination of these measures has been used to predict ratings of APC-NS.
An attempt to estimate the quality of ADPCM coders through a measuring system, based on artificial signals and identification procedures, is described in [17].

Results: There is a good correlation between preference scores and the overall spectral distortion caused by the frame period, cepstrum order and quantization noise.

CONCLUDING REMARKS
The large number of objective measures available so far in the literature represents an important result of the notable effort made by many researchers in this field. However, the major problem continues to be the capability of a given objective measure to perform adequately across a large sample of all distortions and all talkers. At the moment, based on experimental results and educated guesses, we can say that this issue is twofold, since it involves both the statistical reliability of the objective measure and the correlation with subjective ratings. As far as the former point is concerned, it seems that reliability is a very strong feature of many objective measures, while the latter issue is extremely subtle and deserves careful examination.
Basically, the potential of an objective measure can be improved by tailoring its parameters to specific classes of distortions, but this leads, unfortunately, to specialized quality estimators, losing generality and universality. On the other hand, a more general measure, devised to handle a wide range of speech distortions, will exhibit a comparatively lower performance, while requiring a huge subjective database to set the controlling parameters properly. In this light, it is clear that a substantial performance improvement could be provided by new measures designed according to an effective and advanced model of the speech perception process, rather than a signal fidelity criterion.
A few examples of measures conceived towards this goal have been reviewed in the preceding sections, but further enhancements are still needed to improve the quality prediction capability across a large set of different conditions and relevant distortions. This is a most basic step to be considered in future research directions. In conclusion, therefore, objective measures must be selected and used carefully, exploiting their usefulness for speech codec optimization and testing but, also, bearing in mind their current limitations.

Cepstral distance measures
While the preceding SNRF measures are based on the entire spectrum, computed through FFT, other distance measures are based on transformations that retain only the smoothed spectral behavior of the speech signal.