Multichannel Source Separation Using Time-Deconvolutive CNMF

This paper addresses the separation of audio sources from convolutive mixtures captured by a microphone array. We approach the problem using complex-valued non-negative matrix factorization (CNMF), and extend previous works by tailoring advanced (single-channel) NMF models, such as the deconvolutive NMF, to the multichannel factorization setup. Further, a sparsitypromoting scheme is proposed so that the underlying estimated parameters better fit the time-frequency properties inherent in some audio sources. The proposed parameter estimation framework is compatible with previous related works, and can be thought of as a step toward a more general method. We evaluate the resulting separation accuracy using a simulated acoustic scenario, and the tests confirm that the proposed algorithm provides superior separation quality when compared to a stateof-the-art benchmark. Finally, an analysis of the effects of the introduced regularization term shows that the solution is in fact steered toward a sparser representation.


I. INTRODUCTION
Blind source separation (BSS) is an extensively researched topic with a wide variety of applications [2]. A celebrated example is the use of independent component analysis (ICA) [3,4,5] to separate muscular activity interference from brain activity in encephalographic scans [6]. Another interesting example is the use of BSS in speech enhancement for hearing aid devices [7].
Among traditional techniques for source separation, nonnegative matrix factorization (NMF) [8], a single-channel method, has been successfully employed in the literature [9]. NMF factorizes an input matrix with non-negative entries into two lower-rank matrices with non-negative entries, and is able to extract the most significant components that explain the observed data, i.e., a model and a set of parameters that produce a satisfactory estimate of the data. In comparison with other rank-reducing methods such as singular-value decomposition (SVD), the fact of dealing only with non-negative quantities is a distinctive feature of NMF: this is suitable for parameters This work is an extended version of our previous work [1], presented at the XXXVII Brazilian Symposium on Telecommunications and Signal Processing that are non-negative by nature as is the case of magnitude or power spectra, and prevents mutual cancelling of components by destructive interference, since the model is strictly additive.
In the source separation scenario, the resulting NMF factors from a mixture spectrogram can be thought of as a set of spectral signatures and temporal activation patterns [10]. It is expected that subsets of the extracted signatures explain each source, and a typical challenge is how to assign which components correspond to each source. Usually, this assignment relies on some other prior information.
A more powerful way to perform source separation is to exploit spatial information, as in the case of multichannel processing methods. The complex-valued NMF (CNMF) [11] is a development in this direction, with the introduction of Hermitian positive semidefinite matrices as data points, derived from the measured signals' complex-valued spectrograms. Building on the CNMF model, an alternative factorization, with geometric constraints, is proposed in [12], providing spatially-coherent estimation, thus enhancing the separation quality of the method.
In [11,12,13], the factorization models were based on the standard NMF; by construction, the standard NMF model does not take into account the temporal sequence of samples, that is, a random shuffle in the time frames is irrelevant from the decomposition viewpoint. This approach, therefore, disregards useful continuity structures that can be observed in some data, such as musical samples, and often, the emission signatures cannot be efficiently represented by single NMF components. In order to address this limitation, we propose using a deconvolutive NMF (NMFD) scheme [14]. The extended NMFD signatures have a user-defined span of time frames, opening up the possibility for a more concise representation of some musical emissions, and eventually enhancing the separation quality.
In this paper we show that the deconvolutive NMF model can be tailored to the CNMF framework with good results, and that we may regularize the related cost function toward a sparse solution. We start with the problem statement in Section II, give a brief introduction to the NMF models in Section III, describe the transformations of the original data into the CNMF data points in Section IV, and the constrained construction of channel matrices as well as the application of the deconvolutive model in Section V. We derive the estimation framework for a Euclidean cost function in Section VI, and present the numerical results in Section VII.
A note on notation: In this paper, we denote scalars by regular lower-case letters, e.g. x, vector variables are lower-case bold letters, e.g. x, matrices are upper-case bold, e.g. X, and higher-order tensors are calligraphic upper-case bold, e.g. X . An indexed element from a higher-order tensor corresponds to a slice of the original tensor, e.g. [B k ] it = [B] ikt . Conversely, when some index is omitted and the variable gets promoted to a higher-order form, we refer to the entire collection of elements with the same name, that is, W = {W io ∀i, o}.

II. PROBLEM STATEMENT
We state our problem as that of separating meaningful acoustic sources captured by a microphone array with known geometry placed in a reverberant environment. Meaningful in this case means a source that is spatially concentrated, and emissions from different positions are considered as being from different sources. We consider that the propagation media is linear and time-invariant, and effects such as reverberation and multipath-propagation are then modeled by the unknown source-sensor channel impulse responses.
Under such considerations and assuming an array with M microphones, the acquired signals can then be described by the following model: where Q is the true number of sources, x m (t) is the m th sensor measurement, h qm (t) is the impulse response relative to the channel between the source-sensor pair (q, m), s q (t) is the true emission of source q, and * denotes the convolution operation. We would then like to be able to reconstruct estimates for the individual source imagesŷ qm (t) = (h qm * s q )(t), as captured by the array. We now describe the basic NMF formulations, and the steps toward integrating Eq. (1) into the non-negative framework.

III. NMF BASICS
Non-negative matrix factorization is, at its core, the factorization of a matrix of non-negative entries into two reducedrank matrices with non-negative entries.
In its most basic formulation [15], we have a data matrix X ∈ R I×L + , and want to find a K-rank approximation of the original matrix according to a suitable criterion, subject to a non-negative constraint on the components; the approximation is given by matrices B ∈ R I×K + and G ∈ R K×L + such that as illustrated in Fig. 1.
Usually, the approximation criterion is described by some cost function L(X, BG), and the factors are obtained by jointly minimizing the overall cost: subject to non-negative constraints. (3) A useful interpretation of the extracted factors is the following: if the rows of X are features, and each column of X represents the set of features from a given observation; then, after a successful factorization, the columns of B can be thought of as recurrent signatures, forming a basis for the observations, and the rows of G as coefficients denoting the activation of each corresponding signature [9]. The presence of a given signature could then be related to some hidden factor of the underlying data; for instance, in a spectrogram factorization, an extracted signature can be linked to a particular emission from a musical instrument.
Often, however, the problem is so underdetermined that simply relying on the fit yielded by the minimization of L is not satisfactory. A common solution is then to restrict the model, imposing extra constraints on the obtained factors, such as sparsity, smoothness, or orthogonality [16,17,18].
One particular limitation of the simple NMF model is the fact that the extracted patterns are static, that is, the shape of the estimate provided by each component b k is equal across all observations. Taking temporal continuity into consideration, such that the sequence of observations (columns of X) are in fact a time sequence, a more powerful model can be designed, allowing the patterns to be longer in length than a single observation. The deconvolutive-NMF is a generalization that provides such features by extending the basis vectors b k to matrices B k ∈ R I×T + , for some user-chosen T ; the corresponding activation vector g k is then time-shifted by t and applied to each subcomponent b kt : the dataflow for the deconvolutive NMF is illustrated in Fig. 2.
[·]  [·] is such that for a given vector for t = 0, the operator reduces to identity, and with t > 1, it is equivalent to t successive applications of the operator with unit shift. The left-shift operator is defined analogously, and the extension to matrices and tensors follows naturally, shifting over some specified dimension.
The main idea behind deconvolutive-NMF in musical analysis is that longer musical note profiles can be efficiently modeled by the framework, leading to an increase in separation accuracy.

IV. SIGNAL REPRESENTATION
We now describe the steps toward a phase-aware nonnegative multichannel representation, aimed at the decomposition in non-negative factors. The first thing to do is to extend the basic data matrix X, introduced in Section III, to the multichannel case. One way to do this is by associating each entry of this data matrix with an M × M matrix modeling relations between pairs of sensors; but then we also have to extend the notion of non-negativeness. When dealing with real numbers, being non-negative means belonging to the convex cone 1 R + . Positive-semidefiniteness is the matrix counterpart of the scalar non-negativeness property; indeed, positive-semidefiniteness is preserved under conical combination of positive-semidefinite (PSD) matrices. 2 Translating the relationship of Eq. (1) to the short-time Fourier transform (STFT) domain, each (complex) frequencytime point measurement is where i denotes the frequency bin, l is an index for the time frame, h iqm is the frequency response at bin i of the channel relative to the source-sensor pair (q, m), and s ilq is the STFT of the emission of source q at frequency-time point (i, l).
In order to represent the overall measurements as Hermitian PSD matrices, Sawada et al. [11] take the outer product of the vector x il -formed by the measurements across all sensors at a single frequency-time point-with itself, obtaining the matrices where v H is the Hermitian of v andz is the complex conjugate of z. Additionally, assuming that different sources q and q are orthogonal, and then transposing the correlation property to the 1 A subset C of a given vector space is a cone if αx ∈ C for any x ∈ C and any α ≥ 0. It is a convex cone when it is closed for conical combinations, i.e., when α 1 x 1 + α 2 x 2 ∈ C for any x 1 , x 2 ∈ C and any α 1 , α 2 ≥ 0. 2 The space of PSD matrices, S + , is itself a cone, that is, any element S ∈ S + can be written as a conical combination of some basis matrices.
STFT coefficients, a useful approximation can be obtained by neglecting the crossed terms: Lastly, in order to factor a magnitude instead of a power spectrum, the STFT coefficients are mapped through a magnitude square-root function applied elementwise, as proposed in [11]: φ(z) = z √ |z| . In this new context, equation (6) is replaced by the corresponding expression where η iq is a factor similar to h iq , allowing to define H iq = η iq η H iq as a matrix that encodes the phase properties of source q at bin i. The entries of H iq encode the phase difference between the responses of each channel pair. By the outer product construction, H iq preserves phase information without actually modeling the absolute phase of the measurements. Expression (8) then motivates a joint factorization of the sources' magnitude spectra and spatial-property matrices.

V. FACTORIZATION MODEL
We now describe the steps toward the factorization of the multichannel model in (8) into non-negative factors. The standard CNMF [11] simply attaches to each NMF component some phase information in the form of a set of PSD matrices, as will be described in Subsection V-A. A limitation of this approach is that no structure is imposed on the family of PSD matrices associated with each component; we intend to separate sources based on spatial cues, and one could expect some form of coherence in terms of how the spatial information is encoded within a single component, as shown in [12]. In Subsection V-B, prior knowledge about the sensor array geometry is incorporated into the process in the form of beamforming kernels, from which the spatial matrices associated with each component are derived from. Finally, in Subsection V-C, the complete factorization model is glued together, defining some useful additional constraints, and in Subsection V-D the process of recovering the source image spectrograms from the CNMF parameters is described.
The main idea of factorization model is to explore the compressibility of the magnitude representation in order to find K components that best explain the measurements. In the context of CNMF [11], we seek to explain the measured data points X il as non-negative combinations of positive semidefinite matrices. We assign to each NMF component a family of spatial-property matrices, and cluster components based on their spatial properties when reconstructing the sources. The phase-magnitude model for the measured data points can be written as where H ik encodes spatial properties for a component anď s ilk is a magnitude estimate that shall be computed through the NMF framework. In the following, we detail how the parameters are obtained.

A. Magnitude activation model
In the single channel case, the standard NMF finds a lowrank approximation to some data matrix X ∈ R I×L + as the product of two smaller non-negative matrices B ∈ R I×K + and G ∈ R K×L + , where, usually, K Rank(X). If the input data matrix consists of a magnitude spectrogram, with bins as rows and time frames as columns, a useful interpretation of the extracted matrices arises: the columns b k ∈ R I + of B are spectral signatures present in the measurements, and the rows g k ∈ R L + of G are activation patterns for such signatures across time frames. The application of the simple NMF to the CNMF model would lead to the magnitude estimatě s ilk = b ik g kl . We propose the estimation ofš ilk through a deconvolutive NMF model as presented in Section III. Using the deconvolutive model, the estimate of the instantaneous magnitudes due to the k th component iš In fact, the standard form of the CNMF with the deconvolutive model can be written aŝ and this model shares the single-channel deconvolutive NMF properties of being able to efficiently extract spectral patterns that vary with time. Considering that continuous emissions occur in many audio signals, this model is appropriate when moving toward a more powerful separation model.

B. Spatial covariance model
We apply the direction-of-arrival (DoA) based factorization method introduced by Nikunen and Virtanen [12] to the channel matrices H ik . An issue with the unconstrained estimation of matrices H ik is that there is no guarantee that the set of matrices H k (all matrices H ik with fixed k) actually encodes a single coherent single-input multiple-output channel between some component and the sensor array. Instead, the set H k is constructed as a non-negative linear combination of geometrically-defined beamforming kernel matrices W io .
Consider the scheme depicted in Fig. 3: for a sufficiently far emission source somewhere along the direction of k o (a unit-length vector), such that the wavefronts can be considered planar, the relative time delay between sensor acquisitions can be defined in terms of the sensor array geometry, wave propagation velocity, and incidence direction. In fact, the difference in propagation length can be calculated as the inner product p m − p m , k o so that the relative time-difference of arrival (TDoA) is simply where c denotes the wave propagation velocity. It is straightforward to find the frequency-dependent phase lag using Fourier transform properties, and from the nonnormalized STFT bin frequencies, the per-bin phase lag can be calculated as The idea for modeling H is to sample O directions from the unit sphere around the array and form the beamforming kernels W io for all STFT bins for each DoA sample. The beamforming kernels are M × M Hermitian matrices containing the relative phase shifts (for a set frequency f i and direction k o ) as complex factors: Finally, the channel matrices H ik can be described in terms of the DoA kernels as conic combinations where the factors z ko ∈ R + are shared across all frequencies for a given component, making this a spatially coherent factorization. Additionally, this model allows the spatial properties encoded by the vectors z k to be clustered, since it is expected that components with similar spatial signatures belong to the same source.

C. Complete model
The complete model for the measured covariance matrices in terms of deconvolutive CNMF parameters can be written aŝ This factorization has a possible scaling ambiguity, so additional constraints are introduced, namely o z 2 ko = 1, l g 2 kl = 1, and W io F = 1, where M F = tr(M H M ) denotes the Frobenius norm of a matrix M . In order to find the best estimates for the parameters, a statistical model can be employed, and an estimation framework can be built-see Section VI.

D. Source reconstruction
Given the CNMF parameter estimates, the per-source spectral images can be reconstructed through a Wiener filter where β qk are learned membership coefficients relating a given component k to source q. The membership coefficients can be obtained through a regular clustering algorithm, such as k-means, c-means, or even NMF (considering that the spatial factors z ko are also non-negative). The overall effect is to multiply the input spectrograms by a mask of ratios of estimated spectral magnitudes, preserving the original phase. Finally, the discrete-time version ofŷ qm (t) can be retrieved from the inverse-STFT of the coefficients.

VI. PARAMETER ESTIMATION
Following previous works [11,12], the generative model adopted for the entries of the data matrices X il is that of independent complex Gaussian variables with unit variance, which allows to recast the likelihood estimate as a squared error minimization problem. In fact, the likelihood function for the parameters, considering the overall measurements, can be written as (17) leading to the alternative cost function related to the colog-likelihood. It can be useful to embed some prior on the parameters. This knowledge can be directly related to a regularization factor, steering the algorithm toward a solution with some desirable properties. In this paper we consider the generative model for the spectral signatures b ikt as a one-sided exponential distribution with some scaling factor α B , leading to the regularized likelihood 3 and a regularized cost function where the tensor 1 -norm is defined as i,k,t |b ikt |. This is inspired by a LASSO [19] regression, aiming to enforce the selectivity of signatures B k and, consequently, produce a sparser representation, suitable to pitched audio signals. 3 We extract a factor 2 from α B for convenience.

A. Minimization procedure
To obtain the maximum a posteriori estimate, one must minimize the regularized cost (20). The convoluted interdependence among the parameters Z, W, B, G in this cost function hinders its direct (gradient-based) minimization, calling for alternative procedures. A possible alternative would be to minimize an auxiliary function, related to the actual cost function, that depends on an extra set of parameters while being, in some sense, smoother. The individual gradients w.r.t. the CNMF parameters of the auxiliary function can be straightforwardly derived, and the extra set of parameters can be chosen in a way that explicit computation is avoided. In this subsection, we shall describe a convenient auxiliary function, along with the specific choice of additional parameters as well as the update rules for the non-negative tensors. At last, the update for the kernel matrices, which requires additional projection and normalization steps, is described.
While (20) is non-convex relative to the CNMF parameters, it is individually convex over Z, W, B, and G. Thus, a block relaxation minimization scheme [20] may be employed with good results. We consider an auxiliary function to (20), namely: where r ilkot are any positive variables satisfying k,o,t r ilkot = 1, and S ilkot are Hermitian matrices satisfying k,o,t S ilkot = X il , for which we derive the following proposition: Furthermore, Proof. See Appendix A.
Through the majorizer conditions (22) and (23), the individual minimization of (21) across the CNMF variables with S set as the optimal S * is guaranteed to be non-increasing. With the auxiliary definition a useful way to define r ilkot arises, namely Although this definition is not strictly positive, it is safe to ignore the zeroed values, since and this definition allows for implicit computation of the S * ilkot factors.
Replacing S * and r ilkot with their definitions, the multiplicative rules for non-negative factors can be obtained (see Appendix B): The update process for the kernel matrices is slightly different, since only magnitude optimizations are allowed, and the positive semidefinite constraint must be accounted for. What follows is an optimization scheme similar to a projected gradient algorithm: the possibly unfeasible point that minimizes the cost function is calculated as (see Appendix C) this point is projected onto the positive semidefinite cone by rectification of its eigenvalues: finally, only the entries' magnitudes are updated, as the true update is obtained as where abs(·) and sign(·) both operate elementwise on their arguments, and denotes the matrix Hadamard product. Concerning the scaling factors, after each update, z k and g k are normalized to unity, while the reciprocal correction factor is applied to B k , that is: Similarly, W io is rescaled to unity Frobenius norm, but no rescaling of other parameters is needed:

VII. NUMERICAL RESULTS
The program developed to assess the accuracy of the proposals was coded in Python, using TensorFlow v1.14, and executed on an Intel Xeon Gold 5120. We attempt the separation of two sound sources positioned 90 degrees apart, from four mixtures synchronously captured by omnidirectional microphones placed approximately 8 cm apart from each other, in the form of a tetrahedron. The audio tracks are two musical samples about 20 s long, sampled at 22.05 kHz. A closed room of dimensions 5 m × 4 m × 3 m (length, width, height) with RT 60 ≈ 450 ms was simulated using CATT-Acoustic v9.0c [21], with the microphone array at the center and the sources on the horizontal plane 1.5 m away from it, as depicted in Fig. 4. The DoA vectors were randomly sampled from the unit sphere in a way to approximately maximize the cosine distance between the closest vectors, with O = 110 directions. An illustration of the DoA coordinates is shown in Fig. 5.  Square-rooted Hanning windows were used for STFT analysis and synthesis, with 50% overlap. Frame length was chosen as 1024 samples, corresponding to approximately 46 ms, I = 513 bins, and L = 978 frames. We considered three scenarios: a baseline run, with (K = 60, T = 1, α B = 0), equivalent to [12]; a deconvolutive run, with (K = 60, T = 5, α B = 0) testing the proposed extension; and a sparsity-promoting run, with (K = 60, T = 5, α B = 0.5). The algorithm was run for 500 iterations in all cases.
We performed weighted k-means on the vectors z k , with weights corresponding to the component energy B k 2 F . The source-component membership coefficients β qk were set to 1 or 0 based on the obtained clustering.
First, we analyze the directional sensitivity of the method. With k-means clustering, it is possible to retrieve the centroids corresponding to the spatial signatures for each source; based on the spherical coordinates of the DoA vectors and the obtained centroid coefficients, a bivariate spherical spline interpolation was used to create a visualization of the spatial properties for each source. The results are depicted in Figs. 6 and 7. The true positions are depicted with blue dots. The centroids' peak activations closely match the true directions for each source, such that a rough estimate of the sources directions can be obtained from the method, although the estimate is likely to deteriorate in heavily reverberant environments.
The separation quality [22,23] was measured using the mir_eval suite [24,25]. From the suite's provided image evaluation function, we obtain the source-to-distortion ratio (SDR) and source-to-interference ratio (SIR) for each source, and the results are depicted in Fig. 8. SDR encompasses several types of distortions into a single metric, being a robust form of evaluation for BSS techniques; SIR, on the other hand measures only the overall crosstalk between estimated sources, being another useful metric for separation evaluation. Due to the stochastic nature of the fit, an unbalance between sources can be noted in all scenarios, but a slight advantage can be noted in the deconvolutive case. The regularization slightly degraded the overall measures in comparison to the unregularized deconvolutive test, although they are still superior (on average) to those of the baseline case.

SDR SIR
We would like to investigate the sparsity-inducing properties on the extracted signatures B. While the factors obtained by fitting the data using the Euclidean cost are often sparse, they are not the sparsest possible representation. Due to the nature of the range of the values assumed by b ikt , we ploted the histogram of log 10 (b ikt ) using 50 bins as shown in Fig. 9. A considerable increase in frequency of the lowest bin can be observed: the count on the lowest bin for the unregularized run was 84991 hits, while the regularized test had 145982 hits, a 71% increase. It is then clear that the regularizing component of L R steered the factor search toward a solution with higher selectivity. In general terms, one desires sparse signatures if the data have a sparse nature; this property is usually manifested in tonal emissions, where the emitted energy is well localized in the frequency domain. In this sense, the imposed regularization is expected to enhance the precision of the algorithm for tonal emissions, while some degradation can occur in percussive or other atonal emissions. In our test scenarios, the regularization did result in some slight degradation in separation, suggesting that the tonal assumption is not strictly valid-which is not surprising in real-life composed signals. Nevertheless, 1 penalty parameters are ubiquitous in regular NMF algorithms [26,27], so that the added model flexibility is welcome.

VIII. CONCLUSIONS
We proposed an extended version [1] of the CNMF algorithm [11], leveraging the efficient representation from the deconvolutive NMF model. We also provided the user with control over the distribution of the extracted signatures through regularization, which steers the method toward a sparse solution. Our proposed method is a generalization of the baseline algorithm, with added flexibility, able to efficiently factorize signals with complex spectral signature.
Although testing was not exhaustive, due to the large number of user-chosen parameters and different possible configurations, our simulations corroborated the method's capabilities in the separation task.
Future works include the application of these ideas using different generative models for the matrices X il , as the Gaussian assumption and the associated Frobenius norm method are not entirely adequate for distance measures on the PSD cone; however, the fact that, by construction, X il is rank-1 and lives on the boundary of S + imposes challenges on the application of traditional measures, such as log det divergences [28], which are only defined in the interior of the cone. Also, application of other NMF frameworks, such as multi-layer NMF, could improve the accuracy and consistency of the method. Proper initialization is also a critical point in traditional NMF methods [29], and the design of a smart initialization routine is likely to have a considerable effect on convergence time and accuracy. APPENDIX A PROOF OF PROPOSITION 1 First, the Lagrangian function for Eq. (21) w.r.t. S ilkot restriction is computed: where the regularization term was omitted for brevity. The partial derivative w.r.t. S ilkot is given by Summing over k, o, and t after equating to 0 results in k,o,t Finally, backsolving Eq. (35) for S ilkot yields replacing S * ilkot in Eq. (21) leads directly to the equality in Eq. (23), and the fact that Eq. (21) is quadratic in S ilkot , with a single global minimum, is sufficient to satisfy inequation (22). (27), (28), AND (29) The partial derivatives of + R w.r.t. the non-negative parameters are:

APPENDIX B DERIVATION OF EQUATIONS
(note that the left-shift operator appears in (40)), where the fact that tr(W io W H io ) = 1 was used. Applying the definition of r ilkot in terms of z ko , b ikt , g kl , and those of type ; solving for the respective variables yields: Finally, replacing S with S * leads to the multiplicative updates.

APPENDIX C DERIVATION OF EQUATION (30)
The gradient of + R w.r.t. W io is Once more, applying the choice of r ilkot , and solving for Replacing S with S * leads to the multiplicative update.