Prediction Transform GMM Vector Quantization for Wideband LSFs

This work is supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under Grant no. 307633/2011 and by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) under Grant no. 2012/24789-0.

Miguel Arjona Ramírez
Abstract: Split vector quantizers are specialized over clusters defined by a low-order Gaussian Mixture Model (GMM). A prediction-based lower-triangular transform is adapted for the enhancement of vector quantization (VQ) in each cluster. This transform is generalized to be used in generic vector spaces, where component shifts are used instead of time shifts. Optimal quantizer banks are designed in minimum noise structures whose codebooks are used for the proposed Cartesian split, which improves their coding gain. A novel minimum noise structure is proposed for split VQ. This kind of split VQ is tested for line spectral frequency (LSF) quantization of wideband speech spectra, revealing an average performance comparable to that of the Karhunen-Loève transform at lower rates, with reduced outlier generation and computational complexity.
Index Terms: prediction transform, vector quantization, Gaussian mixture models, line spectral frequencies, speech analysis, speech coding.

I. INTRODUCTION
Vector quantization (VQ) is more efficient than scalar quantization (SQ), but its search complexity is generally much higher and grows exponentially with dimension when full search is applied [1]. A successful approach factors the space into a Cartesian product of lower-dimensional subspaces in what is known as split VQ (SVQ). Another approach involves transform coding. Both approaches lead to lower computational complexity at a reduced performance penalty if properly applied. Indeed, for a broad range of applications, SVQ proper [2] or enhanced versions such as [3], [4] are good enough.
Linear transform coding of a vector source leads to a vector space where the components are less correlated. This makes quantization under weighted square distortion more efficient for jointly Gaussian sources. Ultimately, if the components of the source vectors can be rendered completely independent, scalar quantization of the components of the transformed vector is very efficient and flexible [5], even though vector quantization still holds the space-filling advantage [6]. Such an optimal transform is the source-specific Karhunen-Loève transform (KLT) for jointly Gaussian sources.
By modeling an arbitrary source as a Gaussian mixture, each cluster can be viewed as a jointly Gaussian source. Thus the KLT can be considered optimal as long as each cluster is assigned its own KLT and the clusters are sufficiently far apart.
The current state-of-the-art transform domain quantization is GMM-based classified SVQ in the KLT domain for the line spectral frequency (LSF) representation of speech spectral parameters [7], building on previous results for GMM-based classified SVQ [8], [9].
However, there are other decorrelating transforms that are less complex to compute than the KLT. One such transform is the prediction-based lower triangular transform (PLT) [10], which is investigated in this paper for SVQ at reduced complexity. By proper implementation, which involves the Cartesian SVQ proposed in Section V, its performance is better than scalar quantization while its gain matches that of the KLT.

II. PREDICTION TRANSFORM FOR ANY SPACE
The prediction-based lower triangular transform (PLT) [10] $\mathbf{B}$ transforms the $p \times 1$ zero-mean source vector $\mathbf{x}$ with covariance matrix $\mathbf{R}_{xx}$ into the vector

$$\mathbf{y} = \mathbf{B}\mathbf{x} \tag{1}$$

with covariance matrix $\mathbf{R}_{yy}$, which is diagonal, where the $p \times p$ matrix $\mathbf{B}$ is the lower triangular analysis matrix. Unlike the KLT, however, the diagonal entries of $\mathbf{R}_{yy}$ are not the eigenvalues of $\mathbf{R}_{xx}$, but the residues or backward prediction error variances $\beta_m$ of increasing order $m = 0, 1, \ldots, p-1$, where $p$ is the dimension of the source vector space. The PLT may be understood in a general linear prediction (LP) context where the vector space is arbitrary, so that the vectors need not be constrained to blocks of time-delayed samples as assumed in its original proposal [10], and the component shift operator is used instead, as outlined in the Appendix. In this case the only constraints on the covariance matrix are its positive definiteness and its symmetry, so that the proper LP method to be used is the covariance method [11], in contrast to the autocorrelation method suitable for stationary sample vectors.
The covariance matrix for the multivariate vector $\mathbf{x}$ is defined as

$$\mathbf{R}_{xx} = E\left[\mathbf{x}\mathbf{x}^T\right], \tag{2}$$

where $E[\cdot]$ stands for the expected value with respect to the joint probability density function (pdf) of the random vector $\mathbf{x}$. Since this pdf is not readily available, we use a training data matrix $\boldsymbol{\Xi}_c$ with $N_c$ columns $\boldsymbol{\xi}$ of source vectors from a given Gaussian cluster $c$ in order to estimate the $p \times p$ source covariance matrix $\mathbf{R}_{xx}$ with entries

$$r_{xx}(i, j) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} \xi_i(n)\, \xi_j(n) \tag{3}$$

for $i, j = 0, 1, \ldots, p-1$, where $\xi_i(n)$ denotes the $i$th component of the $n$th column of $\boldsymbol{\Xi}_c$.
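The covariance estimate over a cluster can be illustrated with a short sketch. It assumes that the training vectors have already been centered, and the function name `estimate_covariance` and the toy data are hypothetical choices made only for this illustration.

```python
def estimate_covariance(columns):
    """Sample covariance of centered training vectors.

    `columns` is a list of N_c vectors of dimension p, playing the role of
    the columns of the training data matrix for one Gaussian cluster.
    """
    n_c = len(columns)
    p = len(columns[0])
    r = [[0.0] * p for _ in range(p)]
    for xi in columns:
        for i in range(p):
            for j in range(p):
                r[i][j] += xi[i] * xi[j]  # accumulate products of components
    return [[r[i][j] / n_c for j in range(p)] for i in range(p)]


# Toy centered data set with p = 2 and N_c = 4 (illustrative values only).
training = [[1.0, 2.0], [-1.0, -2.0], [2.0, 0.0], [-2.0, 0.0]]
r_xx = estimate_covariance(training)
```

The result is symmetric by construction, and it is positive definite whenever the training vectors span the whole space, which is what the covariance method of LP requires.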
Analysis matrix $\mathbf{B}$ may be obtained by the upper-lower (UL) Cholesky factorization of $\mathbf{R}_{xx}^{-1}$ as

$$\mathbf{R}_{xx}^{-1} = \mathbf{B}^T \mathbf{R}_{yy}^{-1} \mathbf{B}, \tag{4}$$

where the diagonal matrix $\mathbf{R}_{yy}$ has main diagonal entries $\beta_m$, for $m = 0, 1, \ldots, p-1$, which may be interpreted as backward prediction error variances according to the derivation given in the Appendix. Alternatively, we may be interested in obtaining the inverse transform matrix $\mathbf{S} = \mathbf{B}^{-1}$ directly for the implementation of a minimum noise structure as outlined below. In this case, we carry out a lower-upper (LU) Cholesky factorization of $\mathbf{R}_{xx}$ as

$$\mathbf{R}_{xx} = \mathbf{S} \mathbf{R}_{yy} \mathbf{S}^T. \tag{5}$$

The PLT is not a unitary transform, so its inverse is not its transpose. However, it attains the same gain as the KLT as long as it is implemented in a minimum noise structure. Two such structures have been proposed, MINLAB(I) and MINLAB(II) [10]. We will use the former, which turns out to be less complex when the sequence of vectors is much longer than the dimension.
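As a hedged illustration of the factorization of the covariance matrix into a unit lower triangular matrix, a diagonal matrix and the transpose of the former, the sketch below computes the synthesis matrix and the prediction error variances directly. The function name `plt_synthesis` and the $3 \times 3$ example matrix are assumptions introduced here, and no pivoting or conditioning safeguards are included.

```python
def plt_synthesis(r_xx):
    """Factor R_xx = S * diag(beta) * S^T with S unit lower triangular.

    Returns the synthesis matrix S and the list beta of diagonal entries,
    which correspond to the backward prediction error variances.
    """
    p = len(r_xx)
    s = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    beta = [0.0] * p
    for j in range(p):
        # Residual variance of component j given components 0..j-1.
        beta[j] = r_xx[j][j] - sum(s[j][k] ** 2 * beta[k] for k in range(j))
        for i in range(j + 1, p):
            acc = r_xx[i][j] - sum(s[i][k] * s[j][k] * beta[k] for k in range(j))
            s[i][j] = acc / beta[j]
    return s, beta


# Illustrative symmetric positive definite matrix (not taken from the paper).
r_xx = [[4.0, 2.0, 2.0],
        [2.0, 5.0, 3.0],
        [2.0, 3.0, 6.0]]
s, beta = plt_synthesis(r_xx)
```

For this example the factorization yields prediction error variances of 4 for every order, which are smaller than or equal to the variances 4, 5 and 6 on the diagonal of the covariance matrix.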
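In order to implement MINLAB(I), the inverse transform matrix $\mathbf{S}$ must be derived and then factored as

$$\mathbf{S} = \mathbf{S}_0 \mathbf{S}_1 \cdots \mathbf{S}_{p-1}, \tag{6}$$

where $\mathbf{S}_m$ takes its $m$th row from $\mathbf{S}$ and the remaining rows from the identity matrix.
Next, the transform matrix may be recovered by inverting Eq. (6) as

$$\mathbf{B} = \mathbf{S}_{p-1}^{-1} \cdots \mathbf{S}_1^{-1} \mathbf{S}_0^{-1}, \tag{7}$$

where the matrix inverses are quite straightforward to obtain, since row $m$ in $\mathbf{S}_m^{-1}$ is obtained as

$$\begin{bmatrix} -s_{m0} & -s_{m1} & \cdots & -s_{m,m-1} & 1 & 0 & \cdots & 0 \end{bmatrix}$$

from row $m$ in $\mathbf{S}_m$, which is $\begin{bmatrix} s_{m0} & s_{m1} & \cdots & s_{m,m-1} & 1 & 0 & \cdots & 0 \end{bmatrix}$, whereas the remaining rows are just repeated. The diagonal entries $s_{mm}$ are unity and do not change sign upon inversion.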
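The ladder factorization and its inversion can be checked numerically with a small sketch. The helper names (`identity`, `matmul`, `ladder_factors`, `invert_factor`) and the example matrix are hypothetical, and the factors are multiplied in increasing index order from left to right, which is the order that reproduces a unit lower triangular matrix from single-row factors.

```python
def identity(p):
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def matmul(a, b):
    n = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(len(b[0]))]
            for i in range(len(a))]


def ladder_factors(s):
    """Build S_m: row m taken from S, remaining rows from the identity."""
    factors = []
    for m in range(len(s)):
        s_m = identity(len(s))
        s_m[m] = list(s[m])
        factors.append(s_m)
    return factors


def invert_factor(s_m, m):
    """Negate the off-diagonal entries of row m; the unit diagonal is kept."""
    inv = [list(row) for row in s_m]
    inv[m] = [v if j == m else -v for j, v in enumerate(s_m[m])]
    return inv


# Illustrative unit lower triangular synthesis matrix.
s = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.5, 0.5, 1.0]]
factors = ladder_factors(s)

# Product S_0 S_1 ... S_{p-1} should give back S.
product_s = identity(3)
for s_m in factors:
    product_s = matmul(product_s, s_m)

# Analysis matrix B as the product of inverted factors in reverse order.
b = identity(3)
for m, s_m in enumerate(factors):
    b = matmul(invert_factor(s_m, m), b)
```

Multiplying `b` by `s` returns the identity matrix, which confirms that each factor is inverted by a sign change alone and that no general matrix inversion is needed.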

III. GAUSSIAN MIXTURE MODEL CLUSTERING
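Split vector quantization is to be performed over subvectors so that each split quantizer becomes isolated from the other subvectors in the vector to be quantized, thereby causing a split loss [12]. In order to enhance the overall performance of the quantizer, a joint GMM-SVQ system is used.
The whole set of training source vectors is used for modeling their joint probability density function $f_X(\mathbf{x})$ by a Gaussian mixture model

$$f_X(\mathbf{x} \mid \boldsymbol{\Theta}) = \sum_{i=1}^{M} c_i\, f_i(\mathbf{x} \mid \boldsymbol{\theta}_i),$$

where $M$ is the number of Gaussian components or clusters in the mixture and

$$\boldsymbol{\Theta} = \{c_1, \ldots, c_M, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M\}$$

are the mixture parameters, with $c_i$ and $\boldsymbol{\theta}_i = \{\boldsymbol{\mu}_i, \mathbf{R}_i\}$ for $i = 1, 2, \ldots, M$ being the a priori cluster probabilities and the parameters for each Gaussian component, which are its mean vector $\boldsymbol{\mu}_i$ and its covariance matrix $\mathbf{R}_i$, so that the component pdfs are

$$f_i(\mathbf{x} \mid \boldsymbol{\theta}_i) = \frac{1}{(2\pi)^{p/2} |\mathbf{R}_i|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{R}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)$$

for $i = 1, 2, \ldots, M$. The number of clusters $M$ could be a model parameter [5], but we have chosen to fix it at $M = 8$ in order to keep the computational complexity manageable while the model is still verified to be efficient. The model parameters are estimated over the training database, described in Section VII, which consists of a $p \times N$ data matrix $\boldsymbol{\Xi}$ holding $N$ $p$-dimensional LSF vectors. At first, they are sequentially segmented into $M$ equiprobable clusters, that is, the initial a priori probabilities are $c_i = 1/M$ for $i = 1, 2, \ldots, M$, and the mean vectors and the covariance matrices are estimated over each cluster, thereby defining the initial model. Then the expectation-maximization (EM) algorithm [13] is run in iterations consisting of two steps:

1) Expectation: For each training vector $\boldsymbol{\xi}(n)$, the a posteriori probability that it was generated by component $m$ in the mixture is computed as

$$\Pr(m \mid \boldsymbol{\xi}(n)) = \frac{c_m\, f_m(\boldsymbol{\xi}(n) \mid \boldsymbol{\theta}_m)}{\sum_{i=1}^{M} c_i\, f_i(\boldsymbol{\xi}(n) \mid \boldsymbol{\theta}_i)}$$

for $n = 0, 1, \ldots, N-1$ and $m = 1, 2, \ldots, M$.

2) Maximization: The likelihood is maximized by reevaluating for each cluster

$$c_m = \frac{1}{N} \sum_{n=0}^{N-1} \Pr(m \mid \boldsymbol{\xi}(n)),$$

$$\boldsymbol{\mu}_m = \frac{\sum_{n=0}^{N-1} \Pr(m \mid \boldsymbol{\xi}(n))\, \boldsymbol{\xi}(n)}{\sum_{n=0}^{N-1} \Pr(m \mid \boldsymbol{\xi}(n))},$$

$$\mathbf{R}_m = \frac{\sum_{n=0}^{N-1} \Pr(m \mid \boldsymbol{\xi}(n))\, (\boldsymbol{\xi}(n) - \boldsymbol{\mu}_m)(\boldsymbol{\xi}(n) - \boldsymbol{\mu}_m)^T}{\sum_{n=0}^{N-1} \Pr(m \mid \boldsymbol{\xi}(n))}$$

for $m = 1, 2, \ldots, M$.

With the mixture model estimated, the training vectors are assigned to the cluster whose component pdf provides the maximum likelihood, that is,

$$m^*(n) = \arg\max_{1 \le m \le M} f_m(\boldsymbol{\xi}(n) \mid \boldsymbol{\theta}_m).$$

These clusters are referred to as Gaussian clusters.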
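The two EM steps and the final maximum likelihood assignment can be sketched for a scalar source, which keeps the example short. This is a simplification assumed only for illustration: the paper works with $p$-dimensional LSF vectors and full covariance matrices, and the names `gauss` and `em_step` as well as the data and initial values below are hypothetical.

```python
import math


def gauss(x, mu, var):
    """Scalar Gaussian pdf, standing in for the multivariate component pdf."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances):
    count = len(weights)
    # Expectation: a posteriori probability of each component per sample.
    post = []
    for x in data:
        joint = [weights[m] * gauss(x, means[m], variances[m]) for m in range(count)]
        total = sum(joint)
        post.append([value / total for value in joint])
    # Maximization: reevaluate prior, mean and variance for each cluster.
    new_w, new_mu, new_var = [], [], []
    for m in range(count):
        mass = sum(row[m] for row in post)
        mu = sum(row[m] * x for row, x in zip(post, data)) / mass
        var = sum(row[m] * (x - mu) ** 2 for row, x in zip(post, data)) / mass
        new_w.append(mass / len(data))
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var


# Toy data with two well separated groups (illustrative values only).
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    w, mu, var = em_step(data, w, mu, var)

# Hard assignment to the component pdf with maximum likelihood.
labels = [max(range(2), key=lambda m: gauss(x, mu[m], var[m])) for x in data]
```

A practical implementation would typically add a lower bound on the variances and work with log-likelihoods to avoid numerical underflow, neither of which is shown here.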

IV. QUANTIZATION
Ideally, the transform should remove the correlation in the vector to be coded and leave the complementary modeling of the probability density function (pdf) to the scalar quantizer, which is aided in this task by the GMM clustering described in Section III.
The scalar quantizers are just inserted in cascade between the analysis and synthesis filterbanks when a unitary transform is used. However, for the minimum noise PLT the scalar quantizer bank must be interleaved with the factored implementation of the analysis filterbank. This may be represented by means of diagonal functional operator matrices $\mathbf{Q}_m(\cdot)$ with diagonal

$$\operatorname{diag}\{1, \ldots, 1, q_m(\cdot), 1, \ldots, 1\}$$

for $m = 0, 1, \ldots, p-1$, with scalar quantizer $q_m(\cdot)$ at the $m$th column and the remaining components passed through unchanged. This allows us to represent the MINLAB(I) implementation of the encoder as

$$\tilde{\mathbf{y}} = \mathbf{Q}_{p-1}\left(\mathbf{S}_{p-1}^{-1}\, \mathbf{Q}_{p-2}\left(\mathbf{S}_{p-2}^{-1} \cdots \mathbf{Q}_0\left(\mathbf{S}_0^{-1} \mathbf{x}\right) \cdots \right)\right),$$

where $\tilde{\mathbf{y}}$ is the quantized transformed vector. Therefore, in this implementation, transforming and quantizing are interleaved.
Furthermore, it is interesting to remark that this algorithm implements subband noise feedback from lower-frequency bands.
Conversely, the inverse transform for decoding may be implemented by using either the ladder decomposition in Eq. (6) or, equivalently, by using matrix $\mathbf{S}$ directly.
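The interleaving of prediction and quantization may be sketched as follows. Uniform scalar quantizers with a fixed step size are assumed here purely for brevity, whereas the paper designs nonuniform quantizers; the names `encode_minlab1` and `decode` and all numerical values are hypothetical.

```python
def encode_minlab1(x, s, steps):
    """Predict each component from already quantized ones, then quantize."""
    p = len(x)
    y_q = [0.0] * p
    for m in range(p):
        # Row m of the inverted ladder factor: subtract the prediction
        # formed from the quantized components of lower index.
        prediction = sum(s[m][j] * y_q[j] for j in range(m))
        residual = x[m] - prediction
        # Uniform scalar quantizer (an assumption of this sketch).
        y_q[m] = steps[m] * round(residual / steps[m])
    return y_q


def decode(y_q, s):
    """Synthesis with matrix S applied directly to the quantized vector."""
    p = len(y_q)
    return [sum(s[m][j] * y_q[j] for j in range(m + 1)) for m in range(p)]


s = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.5, 0.5, 1.0]]
x = [1.3, 0.2, -0.7]
steps = [0.5, 0.5, 0.5]

y_q = encode_minlab1(x, s, steps)
x_hat = decode(y_q, s)
errors = [a - b for a, b in zip(x, x_hat)]
```

In this arrangement the reconstruction error in each component equals the quantization error of the corresponding residual, so it is bounded by half a step and is not amplified by the non-unitary synthesis matrix.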
Since the subband signals are uncorrelated after transforming, optimal rate allocation among scalar quantizers is determined by the prediction error variances $\beta_m$ as

$$R_m = \frac{R}{p} - \frac{\log_2 M}{p} + \frac{1}{2} \log_2 \frac{\beta_m}{\left(\prod_{j=0}^{p-1} \beta_j\right)^{1/p}}, \tag{18}$$

where $R$ is the bit rate per vector, $R/p$ is the average bit rate per sample, $\log_2 M$ is the bit rate per vector for GMM cluster selection and $R_m$ for $m = 0, 1, \ldots, p-1$ are the bit rates per sample for each subband.
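A minimal sketch of the rate allocation is given below, assuming the standard high-resolution rule in which each subband receives the average rate plus half the base-2 logarithm of the ratio between its variance and the geometric mean of all variances. The function name `allocate_bits` and the variance values are hypothetical, and the rounding and adjustment to integer rates mentioned in Section VII are not included.

```python
import math


def allocate_bits(beta, rate_per_vector, num_clusters):
    """Real-valued bit allocation over p subbands from the variances beta."""
    p = len(beta)
    # Base-2 logarithm of the geometric mean of the variances.
    log_geometric_mean = sum(math.log2(b) for b in beta) / p
    # Bits left per sample after signaling the selected GMM cluster.
    base = (rate_per_vector - math.log2(num_clusters)) / p
    return [base + 0.5 * (math.log2(b) - log_geometric_mean) for b in beta]


# Illustrative variances; 43 bits per vector with M = 8 clusters.
rates = allocate_bits([16.0, 4.0, 1.0, 1.0], rate_per_vector=43, num_clusters=8)
```

The allocated rates add up to the bits remaining after cluster selection, which is 40 bits in this example, with more bits assigned to subbands with larger prediction error variance.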

V. PREDICTION TRANSFORM AND VECTOR QUANTIZATION
In principle, scalar quantization is optimal when the transform generates vectors with independent components before quantization, as pointed out in Section IV for the KLT and the PLT. However, the dependence may not be completely removed due to estimation errors and nonlinear dependence [7]. The latter is significant in many practical situations, so that VQ increases the coding gain as shown in Section VII.
In resolution-constrained quantization, linear and nonlinear dependences are measured by the memory advantage of VQ over SQ [6], [7], which for square distortion and $p$-dimensional vectors is defined as

$$M(p, 2) = \frac{\left\| \prod_{m=0}^{p-1} f_{X_m}(x_m) \right\|_{p/(p+2)}}{\left\| f_X(\mathbf{x}) \right\|_{p/(p+2)}},$$

where $f_X(\mathbf{x})$ is the joint pdf, $f_{X_m}(x_m)$ for $m = 0, 1, \ldots, p-1$ are the marginal pdfs and $\|f\|_{\alpha} = \left(\int f^{\alpha}(\mathbf{x})\, d\mathbf{x}\right)^{1/\alpha}$. Assuming a jointly Gaussian vector, the memory advantage is

$$M(p, 2) = \frac{\left(\prod_{m=0}^{p-1} r_{xx}(m, m)\right)^{1/p}}{|\mathbf{R}_{xx}|^{1/p}}.$$

The PLT actually provides a measure of nonlinear dependence by $\beta_{p-1}$, the prediction error variance of highest order, which quantifies how much of the variance goes unexplained by linear prediction. When it equals zero, all dependence is linear; otherwise, $\beta_{p-1} > 0$ indicates remaining nonlinear dependence between vector coordinates. The absence of any significant independent interferer is assumed.
Split quantization is proposed to take partial advantage of any remaining nonlinear dependence. However, quantization noise is harder to compensate at the split level than by a scalar interleaved structure. Fortunately, an interesting association exists between the noise minimization provided by scalar quantization and the encoding benefits of vector quantization. It is achieved by Cartesian SVQ (CSVQ) as described below.
Scalar codebooks $C_m$ are designed for each dimension $m = m_{0i}, m_{0i} + 1, \ldots, m_i$ in split $i$ using a MINLAB(I) structure. For SVQ, the codebook for split $i$ is obtained by the Cartesian product

$$C^{(i)} = C_{m_{0i}} \times C_{m_{0i}+1} \times \cdots \times C_{m_i}.$$

In fact, the Cartesian codebook structure enables the nonlinear memory advantage of VQ over SQ to be used while enforcing the minimum noise condition, as shown by the results in Section VII. Its analysis is less complex due to the lower triangular structure of the analysis matrix, which can be easily factored.
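The construction of a split codebook as a Cartesian product of scalar codebooks can be sketched in a few lines. The function names and the codebook entries are hypothetical, and the full search with square distortion shown here only illustrates how the product codebook may be used as a VQ codebook for one split.

```python
from itertools import product


def cartesian_codebook(scalar_codebooks):
    """Codebook for one split: all combinations of the scalar codevalues."""
    return [list(codevector) for codevector in product(*scalar_codebooks)]


def nearest(codebook, target):
    """Full search under square distortion."""
    return min(codebook,
               key=lambda cv: sum((a - b) ** 2 for a, b in zip(cv, target)))


# Two scalar codebooks for a two-dimensional split (illustrative values).
c_0 = [-1.0, 0.0, 1.0]
c_1 = [-0.5, 0.5]
split_codebook = cartesian_codebook([c_0, c_1])
best = nearest(split_codebook, [0.9, 0.1])
```

The size of the split codebook is the product of the scalar codebook sizes, so the rate of the split equals the sum of the rates allocated to its dimensions.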

VI. COMPLEXITY
The operational complexity for a transform quantizer may be broken down into the following pieces: mean subtraction (MSUB), analysis filterbank (ANAFB), quantization (Q), distortion calculation (DIST), synthesis filterbank (SYNFB), mean addition (MADD) and final vector comparison (FVECC). The dependence of these complexity components upon the number of clusters and the vector dimension is given in Table I, except for the quantization component.
Quantization may be implemented as block scalar quantization (BSQ) or vector quantization (VQ). We have considered mainly nonuniform BSQ with binary search, whose computational complexity is expressed in terms of $R_{ij}$, the rate for component $j$ in cluster $i$, the total bit rate per vector $R$ and the bit rate per vector for GMM cluster selection, $\log_2 M$.
Using the partial complexities in Table I, the total computational complexities for PLT BSQ and for KLT BSQ are obtained, and the total complexity for scalar PLT turns out to be about half that of scalar KLT. Specifically, in the range of situations tested in Section VII, this ratio is around 54%.
For VQ ANAFB, complexity will have to be distributed over splits and clusters, leading to an expression in terms of $R_{ij}$, the rate for split $j$ in cluster $i$, the number of splits $\varsigma$ and the number of clusters $M$. Therefore, in order to evaluate this complexity component, the dimension split and the bit allocation per split are necessary. This is exemplified in Section VII.

VII. EXPERIMENTAL RESULTS
The transform quantization methods discussed and proposed have been applied to sequences of line spectral frequency (LSF) vectors extracted from wideband speech signals. The adaptive multirate wideband (AMR-WB) [14] coder has been used to compute LSF vectors at a rate of 50 Hz for the signals in the TIMIT database [15], whose training partition with 705,580 vectors has been used for training the quantizers while its test partition with 257,852 vectors has been assigned for testing. For the simulations, MATLAB has been used.
For the training set of LSF vectors, the mean vector is evaluated and then subtracted from each vector, thereby obtaining centered vectors. Spectral weighting coefficients are computed from the sensitivity matrix of each LSF vector under high-rate approximation [16] and the bit rate is optimally allocated to scalar quantizers according to prediction residue variances for the PLT and eigenvalues for the KLT with rounding and adjustment. For vector quantizers, the allocations are cumulated over each split. An exceptional allocation is made for the pure split vector quantizer in Table II, considered as a reference.
Performance is measured according to the criteria set forth by Paliwal and Atal for transparent quantization [2]:
• The average spectral distortion (SD) is about 1 dB.
• There is no outlier frame with SD above 4 dB.
• The ratio of outlier frames in the range from 2 dB to 4 dB is less than 2%.
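The three criteria above can be evaluated from a list of per-frame spectral distortion values with a sketch such as the following. The function name `transparency_report` and the SD values are hypothetical, and the computation of SD itself from the original and quantized spectra is not shown.

```python
def transparency_report(sd_values):
    """Average SD in dB and outlier percentages for the two ranges."""
    n = len(sd_values)
    mean_sd = sum(sd_values) / n
    # Frames with SD from 2 dB to 4 dB.
    pct_2_to_4 = 100.0 * sum(1 for v in sd_values if 2.0 <= v <= 4.0) / n
    # Frames with SD above 4 dB.
    pct_above_4 = 100.0 * sum(1 for v in sd_values if v > 4.0) / n
    return mean_sd, pct_2_to_4, pct_above_4


# Illustrative per-frame SD values in dB (not experimental data).
sd = [0.8, 1.0, 1.2, 2.5, 0.5]
mean_sd, pct_2_to_4, pct_above_4 = transparency_report(sd)
```

With this toy list the average SD is 1.2 dB and one frame out of five falls in the 2 dB to 4 dB range, so the third criterion would not be met.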
The best reference for scalar transform quantization is the KLT scalar quantizer (KLT SQ), implemented with bit allocation based on Eq. (18), whose performance is shown in Table III. It can be seen to outperform SVQ in mean spectral distortion by more than 0.15 dB and to produce fewer outliers in the 2 dB to 4 dB range, even though it is slightly inferior in outlier performance above 4 dB. Now the stage is set for evaluating the performance of PLT scalar quantization (PLT SQ), displayed in Table IV, which is found to outperform KLT SQ at 45 bit/fr and 46 bit/fr, to follow the performance of KLT SQ rather closely at lower rates and to exceed it consistently in outlier performance above 4 dB. Further, when the lower complexity of PLT SQ is taken into consideration, it appears to be the better option for transform SQ.
For training the transform vector quantizers, the training vectors are first clustered into eight classes through a Gaussian Mixture Model (GMM) and then a vector quantizer is designed for each cluster. The Linde-Buzo-Gray (LBG) algorithm [17] is used, initialized with a single codevector at the centroid and doubling the number of codevectors in centroid splitting steps. For testing, each test vector is quantized with the vector quantizer for each Gaussian cluster and the lowest distortion result is selected.
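The LBG design with centroid splitting may be sketched as follows for a single cluster. Plain square distortion is assumed instead of the spectrally weighted distortion used in the experiments, the number of codevectors is assumed to be a power of two, and the perturbation size, the iteration count and the names `sq_dist` and `lbg` are hypothetical choices.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def lbg(vectors, num_codevectors, iterations=20, eps=1e-3):
    """Start at the global centroid and double the codebook by splitting."""
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    codebook = [centroid]
    while len(codebook) < num_codevectors:
        # Centroid splitting: each codevector yields two perturbed copies.
        codebook = [[c[d] + sign * eps for d in range(dim)]
                    for c in codebook for sign in (1.0, -1.0)]
        for _ in range(iterations):
            # Nearest neighbor partition of the training vectors.
            cells = [[] for _ in codebook]
            for v in vectors:
                idx = min(range(len(codebook)),
                          key=lambda k: sq_dist(v, codebook[k]))
                cells[idx].append(v)
            # Centroid update; an empty cell keeps its old codevector.
            codebook = [
                [sum(v[d] for v in cell) / len(cell) for d in range(dim)]
                if cell else old
                for cell, old in zip(cells, codebook)
            ]
    return codebook


# Toy training set with two groups (illustrative values only).
training = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
codebook = lbg(training, num_codevectors=2)
```

For the toy data the two codevectors converge to the centroids of the two groups after the first splitting step.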
Using the procedure outlined above, the performance of KLT SVQ is found to improve significantly over the scalar quantization version, as shown in Table V, particularly in outlier performance in both ranges.
Finally, PLT Cartesian SVQ (PLT CSVQ) has a gain in performance over its scalar version, as can be seen from the results in Table VI. This is to be expected since speech spectral parameters are known to have significant nonlinear dependence [1], as discussed in Section V. It is most noticeable that outliers are greatly reduced with respect to both the scalar version and KLT SVQ. The average distortions are still somewhat higher for PLT CSVQ, but they may be traded off for the significantly lower operational complexity incurred by PLT CSVQ as compared with KLT SVQ, shown in Table VII to be around 3/4 as much.

VIII. CONCLUSION
The prediction transform has been proposed for a transform quantizer designed over clusters determined by a low-order Gaussian mixture model, which improves the performance of split VQ by reducing its split loss. The transform matrices have been derived by the covariance method of linear prediction for general vector spaces using component shifts, in contrast to the original proposal of the PLT for time shifts. A scalar quantizer has been proposed in a minimum noise structure with interleaved analysis and quantization, whose computational complexity is almost half that of KLT scalar quantization. The coding gain has been enhanced by using VQ, which in a novel PLT Cartesian SVQ comes close to KLT SVQ average performance at low rates with improved outlier performance and complexity of almost 3/4 that of KLT SVQ. This has been achieved because of the novel minimum noise structure for split VQ.

APPENDIX GRAM-SCHMIDT ORTHOGONALIZATION FOR COMPONENTWISE PREDICTION
Relations between component random variables in a multivariate vector may be expressed by means of the shift operator, which may be represented by the lower shift matrix and its powers or, alternatively, by the polynomial $z^{-1}$ and its powers. The latter allows an interpretation of the transforming operations in the context of linear prediction, establishing a correspondence between entries in rows of the analysis matrix and coefficients in the backward prediction error polynomial of the same order as the row index. The covariance matrix $\mathbf{R}_{xx}$ induces an inner product for a polynomial space associated with the vector space under analysis. Writing these polynomials in the variable $z^{-1}$, consider two such polynomials $P(z) = \sum_{i=1}^{K} c_i z^{-i}$ and $Q(z) = \sum_{j=1}^{L} d_j z^{-j}$. Due to the distributive property of inner products over vector addition, the inner product of these polynomials may be expanded as

$$\langle P(z), Q(z) \rangle = \sum_{i=1}^{K} \sum_{j=1}^{L} c_i d_j \left\langle z^{-i}, z^{-j} \right\rangle,$$

so that the inner product is completely defined by the products of monomials $\langle z^{-i}, z^{-j} \rangle = r_{xx}(i, j)$ for $i, j = 0, 1, \ldots, p-1$ as long as our interest is restricted to polynomials of degrees $K \le p$ and $L \le p$. These monomials are shift operators for the vector coordinates and should not be identified with time delays, although time delays are a special case. It should be observed that this is a valid definition for an inner product because matrix $\mathbf{R}_{xx}$ is positive definite and symmetric. Given a $p \times p$ covariance matrix $\mathbf{R}_{xx}$ defining the inner product in the vector space of polynomials with degree less than or equal to $p$ in the variable $z^{-1}$ and the canoni-