Blind Source Separation: Fundamentals and Perspectives on Galois Fields and Sparse Signals

Abstract: The problem of blind source separation (BSS) has been intensively studied by the signal processing community. The first solutions to BSS were proposed in the 1980s and are founded on the concept of independent component analysis (ICA). More recently, in order to tackle some limitations of ICA-based methods, much attention has been paid to alternative BSS approaches. In this tutorial, in addition to providing a brief review of the classical BSS framework, we present two research trends in this area, namely source separation over Galois fields and sparse component analysis. For both subjects, we provide an overview of the main criteria, highlighting scenarios that can benefit from these more recent BSS paradigms.


I. INTRODUCTION
Blind Source Separation (BSS) is one of the most relevant subjects in unsupervised signal processing, with a myriad of aspects worthy of investigation and analysis, such as (i) the separation criteria and the implied hypotheses about the characteristics of the sources, (ii) the generative model that yields the mixed signals and its association with the separation system, and (iii) the algorithms used to determine the solution parameters.
The "canonical" approach to BSS is the application of Independent Component Analysis (ICA) [1], [2] in the context of real- or complex-valued signals. Such an approach presumes independence between the sources and, consequently, the separation strategy relies on recovering the sources from the set of dependent mixtures by searching for outputs that are again independent. Nevertheless, there are two alternative points of view that have been consistently treated in recent years and that deserve special attention: a) the case of linear scenarios with inherently discrete, finite-domain signals, which are formally described by the theory of finite (or Galois) fields, and b) the use of priors based on signal sparsity (instead of independence) in the time domain or in a domain engendered by an adequate transform.
This tutorial intends to introduce and describe these two modern research trends in unsupervised signal processing: BSS over Galois fields and BSS over sparse signals. We put emphasis on the analysis of the main separation criteria and the particularities of each domain when compared with the canonical framework. To do so, the work is organized as follows: Section II reviews the fundamental concepts underlying BSS and the use of ICA; Section III discusses the extension of BSS to the domain of Galois fields, describing the main theoretical developments and potential applications; Section IV studies the notion of sparsity in information signals and its relevance to solving BSS in scenarios of different complexity; and Section V presents the final remarks.

II. BASIC CONCEPTS
The BSS problem can be defined, in simple terms, as that of recovering a set of information signals (sources) from mixed versions of them (mixtures). In principle, there are no limitations regarding the mixing process, which can be nonlinear, with memory, time-variant, etc. However, for the sake of mathematical tractability, and in view of a vast number of applications, the linear and instantaneous mixing model can be assumed as being canonical. In this model, it is considered that N sources are detected by M sensors in the form of linear combinations, i.e., there is a superposition of signals with different gains, but not of delayed versions. Mathematically, if there is a source vector $s(n) = [s_1(n), s_2(n), \ldots, s_N(n)]^T$ and a mixture vector $x(n) = [x_1(n), x_2(n), \ldots, x_M(n)]^T$, for a given instant n, the model can be expressed as
$$x(n) = A\,s(n), \tag{1}$$
where A is an M × N mixing matrix. Note that, in this explanation, the model is built without reference to measurement noise, although its presence is relevant in both theoretical and practical terms [1].
When N > M, the model is underdetermined, a case that is difficult to deal with because the desired information is mapped from the original signal space onto a space of smaller dimension. On the other hand, when M > N, the model is overdetermined, which undoubtedly poses fewer complications. Finally, there is the most usual case in the literature, M = N, which will be the standard throughout this section.
In this last case, if the matrix A (which becomes square) is invertible, it is possible to formulate the problem of BSS as that of finding another square matrix W (called the separating matrix) so that there is a vector of estimated sources
$$y(n) = W\,x(n), \tag{2}$$
giving rise to a solution as follows:
$$W A = D P, \tag{3}$$
where D is a diagonal matrix and P is a permutation matrix. The meaning of (3) is that, in the solution of the BSS problem as formulated (and even in a more general sense), the sources can be recovered in any order and are subject to scaling factors, which means that these information-preserving ambiguities are tolerated.
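To make the scaling and permutation ambiguities concrete, the following minimal sketch mixes two toy sources and applies a separator that is deliberately not the plain inverse of the mixing matrix. All numerical values (the matrices `A`, `D`, `P` and the random sources) are illustrative assumptions, not values from the references.

```python
import random

random.seed(0)

def matmul(X, Y):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

# Two hypothetical sources, 5 samples each (rows = sources, columns = instants n).
S = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(2)]
A = [[1.0, 0.5],
     [0.3, 1.0]]              # assumed invertible mixing matrix
X = matmul(A, S)              # mixtures: x(n) = A s(n) for every column n

# Closed-form inverse of the 2x2 matrix A.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[ A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det,  A[0][0] / det]]

D = [[2.0, 0.0], [0.0, -0.5]]     # arbitrary non-zero scaling factors
P = [[0.0, 1.0], [1.0, 0.0]]      # permutation that swaps the two outputs
W = matmul(matmul(D, P), A_inv)   # one of infinitely many acceptable separators
Y = matmul(W, X)                  # estimated sources: y(n) = W x(n)

print(matmul(W, A))               # equals D P up to rounding
```

Here the first output is the second source multiplied by 2 and the second output is the first source multiplied by -0.5; both are regarded as valid solutions.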
Once those remarks are made, a question remains: how can W be obtained in an unsupervised (or blind) fashion? To answer it, it is necessary to make some sort of hypothesis about the sources. Although the canonical hypothesis undoubtedly considers the sources to be mutually independent stochastic signals, there is more than one possible path to follow here, as will be seen later on. The independence hypothesis, which is valid in many domains [1], is a very strong one under the aegis of the defined model. In fact, as shown in [2], if the components of the vector y(n) = Wx(n) are mutually independent, the sources will have been recovered up to the ambiguities expressed in (3). In other words, recovering the independence condition implies correct source estimation.
It is exactly because of this fact that there is a strong link between BSS and the methodology known as independent component analysis (ICA) [2]. In contrast with the more popular technique of principal component analysis (PCA) [1], ICA has the objective of finding projections that generate statistically independent factors (and not only uncorrelated ones, as is the case with PCA) underlying the data under analysis. By means of ICA, it is possible to build cost (or contrast) functions that allow the search for matrices W capable of providing efficient source recovery.

A. Criteria for Performing ICA-Based BSS
Several formulations can be used to perform ICA. Here, we will discuss three of them, based on the concepts of mutual information and non-Gaussianity (quantified in terms of kurtosis and negentropy) [3].
1) Mutual Information: A very natural criterion to quantify statistical dependence is the mutual information, which, for a general random vector a with K elements, is defined as [1]
$$I(a) = \sum_{i=1}^{K} h(a_i) - h(a), \tag{4}$$
where h(·) is Shannon's differential entropy, given, for a vector, by [4]
$$h(a) = -\int p(a) \log p(a)\, da. \tag{5}$$
Since entropy can be seen, in simple terms, as the degree of uncertainty associated with a random variable, we may interpret (4), intuitively, as the difference between the total uncertainty originated by a separate observation of the components and the total uncertainty originated by a joint observation. When this difference is null, "no component carries information about the other", so to say, which in more rigorous terms implies that they are independent. If there is statistical dependence, I(a) > 0 [4]. Hence, if the mutual information associated with y(n) is minimized with respect to W, it shall be possible to restore the independence condition and to recover the sources.
A major difficulty here is that entropy calculation requires knowledge of the involved probability densities or their estimation (something that can be quite complex in some cases). The joint entropy is not an issue, because it is possible, using the hypotheses regarding the model, to write (the time indices will be omitted for simplicity) [3]
$$h(y) = h(x) + \log |\det(W)|. \tag{6}$$
Note that h(x) is fixed and that W is known, being the solution under analysis. Nonetheless, one cannot avoid the need to estimate the marginal entropies that form the first term of the right-hand side of (4). This difficulty explains why the use of the mutual information in the linear and instantaneous case is not common, although it is very relevant, for instance, in the nonlinear context, which will not be dealt with here [5], and in the context of signals over finite fields, as the reader will see in Section III.
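The way the joint entropy changes under an invertible linear map, by the additive term log|det W|, can be checked numerically in the one case where differential entropy has a simple closed form, the Gaussian. The sketch below is only a sanity check under that assumption; the covariance `C_x` and the matrix `W` are arbitrary illustrative values, and Gaussian data are used purely for convenience.

```python
import math

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def gaussian_entropy(C):
    """Differential entropy (in nats) of a 2-D Gaussian with covariance C."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det2(C))

C_x = [[2.0, 0.6],
       [0.6, 1.0]]            # assumed covariance of the mixtures x
W = [[1.0, -0.5],
     [0.4,  2.0]]             # assumed candidate separating matrix

# y = W x is Gaussian with covariance W C_x W^T.
C_y = matmul(matmul(W, C_x), transpose(W))

lhs = gaussian_entropy(C_y)
rhs = gaussian_entropy(C_x) + math.log(abs(det2(W)))
print(lhs, rhs)               # the two values agree up to rounding
```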
2) Non-Gaussianity: Kurtosis and Negentropy: The central limit theorem [3] can be stated as follows. Let $a_i$, i = 1, ..., K, be a set of K continuous i.i.d. (independent and identically distributed) random variables with finite variance; their suitably normalized sum converges in distribution, in the limit K → ∞, to a variable with a Gaussian density.
Since ICA assumes mutual independence between the sources and, in the linear case, the mixtures are essentially sums, the central limit theorem suggests that to mix means "to Gaussianize". In other words, a mixture tends to be "more Gaussian" than the sources generating it.
A way to quantify Gaussianity is to employ the kurtosis, a fourth-order statistic that is null for Gaussian variables. Its definition, for a real-valued, zero-mean scalar variable a, is
$$\mathrm{kurt}(a) = E\{a^4\} - 3\,E^2\{a^2\}. \tag{7}$$
The use of kurtosis is more naturally explained in the domain of source extraction, i.e., when the goal is to recover a single source. This leads to
$$y_i(n) = w_i^T x(n), \tag{8}$$
where $w_i$ is interpreted as one of the rows of the separating matrix W. It can be shown, as is intuitive in view of the central limit theorem, that maximizing the absolute value of the kurtosis with respect to $w_i$, under a proper constraint, allows a source to be recovered¹. Consequently, there arises a criterion of the form
$$\max_{w_i} \left| \mathrm{kurt}(y_i) \right|. \tag{9}$$
To recover all sources, it is necessary to make use of a deflation process, i.e., to remove each extracted source from the remaining mixture(s), or to resort to constraints that prevent the extraction of the same signal [1]. Another way of using the idea of Gaussianization inherent to the mixing process is to bring into play a classical result from information theory. This result ensures that, among all random variables a with a fixed second-order statistical structure, the Gaussian one has maximum differential entropy [4].
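The claim that mixing "Gaussianizes" can be observed with a sample estimate of the kurtosis. The sketch below is a minimal illustration: the uniform sources, the equal-weight mixture and the sample size are assumptions chosen for convenience. For unit-variance uniform sources the theoretical kurtosis is -1.2, and for their normalized sum it is -0.6.

```python
import math
import random

random.seed(1)
T = 50000                      # number of samples (illustrative choice)

def kurtosis(a):
    """Sample version of kurt(a) = E{a^4} - 3 E^2{a^2}, after mean removal."""
    n = len(a)
    mean = sum(a) / n
    c = [v - mean for v in a]
    m2 = sum(v * v for v in c) / n
    m4 = sum(v ** 4 for v in c) / n
    return m4 - 3.0 * m2 ** 2

r = math.sqrt(3.0)             # uniform on [-sqrt(3), sqrt(3)] has unit variance
s1 = [random.uniform(-r, r) for _ in range(T)]
s2 = [random.uniform(-r, r) for _ in range(T)]
mix = [(a + b) / math.sqrt(2.0) for a, b in zip(s1, s2)]   # unit-variance mixture
gauss = [random.gauss(0.0, 1.0) for _ in range(T)]

k_s1, k_s2 = kurtosis(s1), kurtosis(s2)
k_mix, k_gauss = kurtosis(mix), kurtosis(gauss)
print(k_s1, k_s2, k_mix, k_gauss)
```

The mixture has a kurtosis of smaller magnitude than either source, which is what a criterion that maximizes the absolute kurtosis exploits.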
Having this in mind, let us imagine a random variable a with generic mean and covariance (we will consider the scalar case to simplify things). Let us now consider a Gaussian random variable with the same moments up to second order. It is possible to define the negentropy N(a) as follows:
$$N(a) = h_{\mathrm{gauss}}(a) - h(a), \tag{10}$$
where h(a) is the differential entropy of the variable in question and $h_{\mathrm{gauss}}(a)$ is the entropy of a Gaussian random variable with the same moment structure up to order two.
From what was discussed, it follows that N(a) ≥ 0. To obtain a proper extraction vector $w_i$, $N(y_i)$ must be maximized with respect to it, so that the non-Gaussianity of the output is maximized. Again, the retrieval of all sources demands deflation or special constraints.
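As a small worked example of a non-negative negentropy, consider a uniform variable, for which both entropies are available in closed form: a uniform density on an interval of width L has differential entropy log L, and a Gaussian of variance σ² has entropy ½ log(2πeσ²). The interval below is an assumption chosen to give unit variance.

```python
import math

half_width = math.sqrt(3.0)                  # uniform on [-sqrt(3), sqrt(3)]
variance = half_width ** 2 / 3.0             # variance of that uniform density (= 1)

h_uniform = math.log(2.0 * half_width)                           # entropy of the uniform
h_gauss = 0.5 * math.log(2.0 * math.pi * math.e * variance)      # same-variance Gaussian
negentropy = h_gauss - h_uniform             # gap between the two entropies, in nats

print(negentropy)                            # approximately 0.176 nats
```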
The use of negentropy requires, in principle, entropy estimation, which, as already mentioned, can be a complex task in some cases. Therefore, it is usual to adopt certain nonlinear functions as a means to approximate it. With these functions, it is more straightforward to derive gradient or fixed-point algorithms [1].

B. Other approaches to linear-instantaneous mixing models
Even if ICA is at the origin of BSS, there are now a number of alternative approaches to deal with separation problems in linear models. For instance, another classical paradigm in BSS considers that the sources can be modeled as stochastic processes, which allows one to exploit temporal information related to the sources. Such an approach is the basis of the SOBI algorithm [7] and, more generally, of the class of algorithms that exploit the auto-correlation structure of the sources [2]. A nice aspect of such methods is that they can be applied to separate non-white Gaussian sources, under the condition that these sources present different auto-correlation functions.
A different approach that has been adopted in BSS comes from the machine learning community: it is known as non-negative matrix factorization (NMF) [8], [9]. The goal in NMF is to search for an approximation of the observed non-negative matrix X as follows:
$$X \approx A S, \tag{11}$$
where A, S ≥ 0. In the context of BSS, A and S are related to the mixing matrix and the sources, respectively, which are thus assumed to be non-negative. Such an assumption is realistic in several applications, such as those related to chemical analysis [10] and audio processing in transformed domains [11]. Although NMF lacks general separability results (see, for instance, [12]), in the sense that it is an ill-posed problem, the combination of NMF with additional priors such as sparsity or smoothness may provide sound BSS algorithms [13].
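One widely used way of computing such a factorization is through multiplicative update rules that keep both factors non-negative at every iteration. The sketch below is an illustration under that assumption; it is not claimed to be the specific algorithm of [8] or [9], and the function name `nmf`, the matrix sizes, the iteration count and the random data are all hypothetical choices.

```python
import random

random.seed(0)

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frob_error(X, A, S):
    """Frobenius norm of the residual X - A S."""
    R = matmul(A, S)
    return sum((x - r) ** 2 for rx, rr in zip(X, R) for x, r in zip(rx, rr)) ** 0.5

def nmf(X, rank, iters=300, eps=1e-12):
    """Multiplicative updates for X ~ A S with A, S >= 0 (Euclidean cost)."""
    A = [[random.random() + 0.1 for _ in range(rank)] for _ in range(len(X))]
    S = [[random.random() + 0.1 for _ in range(len(X[0]))] for _ in range(rank)]
    errors = [frob_error(X, A, S)]
    for _ in range(iters):
        At = transpose(A)
        num, den = matmul(At, X), matmul(matmul(At, A), S)
        S = [[s * n / (d + eps) for s, n, d in zip(rs, rn, rd)]
             for rs, rn, rd in zip(S, num, den)]
        St = transpose(S)
        num, den = matmul(X, St), matmul(A, matmul(S, St))
        A = [[a * n / (d + eps) for a, n, d in zip(ra, rn, rd)]
             for ra, rn, rd in zip(A, num, den)]
        errors.append(frob_error(X, A, S))
    return A, S, errors

# Synthetic non-negative data: 3 mixtures of 2 non-negative sources, 8 samples.
A_true = [[random.random() for _ in range(2)] for _ in range(3)]
S_true = [[random.random() for _ in range(8)] for _ in range(2)]
X = matmul(A_true, S_true)

A_est, S_est, errors = nmf(X, rank=2)
print(errors[0], errors[-1])   # residual before and after the updates
```

Even when the residual becomes small, the recovered factors need not match `A_true` and `S_true`, which is the lack of uniqueness mentioned above.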
Another emblematic example of a BSS paradigm is based on a Bayesian formulation of the problem [2]. This approach is suitable when there is a set of prior information that can be modeled through probability distributions. For instance, non-negative priors can be modeled in a Bayesian framework by considering distributions of non-negative support [14], whereas sparsity can be represented by, for example, a Laplacian distribution [15].
Finally, an extension of the instantaneous model can be considered, namely the blind separation of convolutive mixtures. The main difference between the linear-instantaneous model and the convolutive one is that the latter is a superposition not only of the present values of the sources, but also of past values. In other words, the convolutive model includes the dimensions of space and time, with the multiplicity of sensors and instants engendering the mixing process. Mathematically, (1) is extended as
$$x(n) = \sum_{k=0}^{T} A(k)\, s(n-k), \tag{12}$$
where T corresponds to the maximum delay present in any of the mixtures. Notice that (12) yields (1) when T = 0. It is possible to solve this problem in the time domain (e.g., using predictive strategies [16]) or to use the property that a convolution in the time domain is a product in the frequency domain, which gives rise to a sort of instantaneous mixture [2]. In the latter case, special precautions must be taken with respect to the permutation and scale ambiguities, which may cause severe spectral distortions.
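The convolutive model can be sketched directly from its definition as a sum over delayed, matrix-weighted source vectors. In the following minimal example, the function name `convolutive_mix`, the tap matrices and the test signals are illustrative assumptions, and source samples before n = 0 are taken as zero.

```python
def convolutive_mix(A_taps, S):
    """Return x(n) = sum_k A(k) s(n - k).

    A_taps[k] is the M x N matrix for delay k; S holds N source sequences.
    Source samples before n = 0 are assumed to be zero.
    """
    M, N, L = len(A_taps[0]), len(S), len(S[0])
    X = [[0.0] * L for _ in range(M)]
    for k, Ak in enumerate(A_taps):
        for i in range(M):
            for j in range(N):
                for n in range(k, L):
                    X[i][n] += Ak[i][j] * S[j][n - k]
    return X

S = [[1.0, 0.0, 0.0, 0.0],      # source 1: impulse at n = 0
     [0.0, 2.0, 0.0, 0.0]]      # source 2: impulse at n = 1
A0 = [[1.0, 0.5], [0.3, 1.0]]   # direct-path gains (delay 0)
A1 = [[0.2, 0.0], [0.0, 0.1]]   # gains of the one-sample-delayed paths

X_conv = convolutive_mix([A0, A1], S)   # T = 1: present and past values are mixed
X_inst = convolutive_mix([A0], S)       # T = 0: reduces to the instantaneous model
print(X_conv)
print(X_inst)
```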

III. SEPARATION OVER FINITE FIELDS
After discussing the general aspects of BSS, we now focus on a more specific case, which necessarily deals with digital data. BSS can be studied, for example, when signals and mixing processes are binary. This perspective was first proposed in [17] and belongs to the generic framework of source separation over finite (or Galois) fields. This is the fundamental topic of this section, which is organized as follows: first, the signal representation over Galois fields is presented; then, the most important criteria to separate such signals from instantaneous mixtures are discussed in Section III-B; Section III-C extends the analysis to the convolutive mixing model; and, finally, Section III-D illustrates two potential applications of the techniques developed so far.

A. Signals over GF(q)
Fields are abstractions of familiar number systems and their essential properties [18]. A field F is defined as a set of elements associated with two operations, + and ·, such that the following axioms are valid: closure, commutativity, associativity, distributivity, existence of neutral elements and existence of inverse elements [19].
Real and complex numbers are well-known examples of fields, both with an infinite number of elements. However, this is not a mandatory requisite: there are also finite (or Galois) fields, e.g., the set {0, 1} with the logical operations exclusive-or (XOR) and AND as addition and product, respectively.
A finite field with q elements is named F = GF(q); it is possible to show that $q = P^n$, where P is a prime that is typically called the characteristic of the field. If n = 1, F is a prime field and its operations are easily defined as the product and sum modulo P over the elements {0, ..., P − 1}. Otherwise, fields with n > 1 are called extension fields and require a more complex definition of the operations [18].
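For prime fields, the operations just described reduce to ordinary integer arithmetic followed by a reduction modulo P. The sketch below illustrates this for P = 5 and also checks that GF(2) coincides with XOR and AND; the helper names `gf_add`, `gf_mul` and `gf_inv` are hypothetical, and the inverse is computed through Fermat's little theorem, which is one possible choice among others.

```python
P = 5   # a prime, so {0, ..., P-1} with arithmetic modulo P is the field GF(5)

def gf_add(a, b, p=P):
    """Addition in the prime field GF(p)."""
    return (a + b) % p

def gf_mul(a, b, p=P):
    """Product in the prime field GF(p)."""
    return (a * b) % p

def gf_inv(a, p=P):
    """Multiplicative inverse in GF(p), using a^(p-2) = a^(-1) for a != 0."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return pow(a, p - 2, p)

print(gf_add(4, 3))             # 7 mod 5
print(gf_mul(3, gf_inv(3)))     # an element times its inverse

# In GF(2), addition is XOR and the product is AND.
for a in (0, 1):
    for b in (0, 1):
        assert gf_add(a, b, 2) == a ^ b
        assert gf_mul(a, b, 2) == a & b
```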
Vector spaces over finite fields can also be constructed, with the remark that such spaces are not ordered and have no notion of orthogonality [20]. Linear mappings $A : F^N \to F^M$ are represented by M × N matrices with elements in F, and the usual restrictions for the existence of an inverse mapping apply: the matrix must be square and have a non-null determinant.
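Matrix inversion over a prime field follows the same Gauss-Jordan procedure used over the reals, with every division replaced by a multiplication by a modular inverse. The following sketch assumes a prime field GF(p); the function name `gf_matrix_inverse` and the example matrices are illustrative. The second example shows a matrix that is invertible over the reals (determinant -3) but singular over GF(3).

```python
def gf_matrix_inverse(A, p):
    """Invert a square matrix over the prime field GF(p) by Gauss-Jordan elimination."""
    n = len(A)
    # Augment A (reduced modulo p) with the identity matrix.
    aug = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(p)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], p - 2, p)          # inverse of the pivot in GF(p)
        aug[col] = [(v * inv) % p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [(vr - f * vc) % p for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

A = [[1, 2],
     [1, 1]]
A_inv = gf_matrix_inverse(A, 3)
print(A_inv)

try:
    gf_matrix_inverse([[1, 2], [2, 1]], 3)   # determinant is -3, i.e. 0 in GF(3)
    singular_detected = False
except ValueError:
    singular_detected = True
```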

B. Separation over GF(q) in instantaneous models
Consider the BSS formulation for the instantaneous and determined (M = N) case, as illustrated in (1), but with the difference that all entities and operations are defined over a field F = GF(q). Hence, the problem consists of finding, in the space GL(N, q) of all invertible N × N matrices over GF(q), the one that recovers s(n), in the sense of the definition given in (3).
The following theorem offers the possibility of achieving the solution through ICA [21]:
Theorem 1 (Identification via ICA). Let F = GF(q) be a finite field of order q. Assume that s is a vector of independent random variables in F, with probability distribution $p_s$ such that the marginal distributions are non-uniform and non-degenerate². If, for some invertible matrix G with entries in F, the components of the vector y = Gs are independent, then G = DP for a permutation matrix P and a diagonal matrix D.
For instance, consider GF(3) and two independent sources with marginal probability vectors given by $p_{s_1} = [1/2, 3/8, 1/8]$ and $p_{s_2} = [1/3, 1/6, 1/2]$. Hence, the joint distribution is the outer product of the marginals,
$$P_{s_1 s_2} = p_{s_1}^T\, p_{s_2} = \begin{bmatrix} 1/6 & 1/12 & 1/4 \\ 1/8 & 1/16 & 3/16 \\ 1/24 & 1/48 & 1/16 \end{bmatrix},$$
with rows indexed by the value of $s_1$ and columns by the value of $s_2$. If the sources are multiplied by an invertible matrix G, Theorem 1 states that the components of y = Gs can be independent only if G = DP. Consequently, this result indicates that one can employ ICA to perform blind separation of signals over GF(q), as long as the original signals are independent and non-uniformly distributed, leading to extracted signals that differ from the sources only by scale and permutation ambiguities. Since there is no definition of statistical moments for random variables over a finite field, in order to define a criterion similar to negentropy or kurtosis, it is necessary to employ the concepts that information theory offers.
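The statement of Theorem 1 can be checked exactly on the GF(3) example with the two marginal distributions given above. The sketch below enumerates all source values, builds the joint distribution of y = Gs with exact rational arithmetic, and tests whether it factorizes. The two matrices `G_trivial` (a scaled permutation) and `G_mix` (an invertible matrix that genuinely mixes) are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product

P = 3
p_s1 = [F(1, 2), F(3, 8), F(1, 8)]
p_s2 = [F(1, 3), F(1, 6), F(1, 2)]

def joint_of_mixture(G):
    """Exact joint distribution of y = G s over GF(3), as a dict {(y1, y2): prob}."""
    joint = {}
    for s1, s2 in product(range(P), repeat=2):
        y = tuple((G[i][0] * s1 + G[i][1] * s2) % P for i in range(2))
        joint[y] = joint.get(y, 0) + p_s1[s1] * p_s2[s2]
    return joint

def is_independent(joint):
    """True when the joint distribution equals the product of its marginals."""
    m1 = [sum(joint.get((a, b), 0) for b in range(P)) for a in range(P)]
    m2 = [sum(joint.get((a, b), 0) for a in range(P)) for b in range(P)]
    return all(joint.get((a, b), 0) == m1[a] * m2[b]
               for a in range(P) for b in range(P))

G_trivial = [[0, 2], [1, 0]]   # G = D P: y1 = 2*s2, y2 = s1
G_mix = [[1, 1], [0, 1]]       # invertible over GF(3), but mixes: y1 = s1 + s2

print(is_independent(joint_of_mixture(G_trivial)))
print(is_independent(joint_of_mixture(G_mix)))
```

The outputs of the scaled permutation remain independent, whereas the outputs of the mixing matrix do not, in agreement with the theorem.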
An important property states that a linear combination of independent random signals has an entropy greater than or equal to that of each of the original signals [22]. Based on this, a first separation strategy arises via source extraction [2], as already mentioned in Section II-A. The AMERICA algorithm [20] implements this technique for performing ICA over GF(q), through an exhaustive search with the criterion
$$\min_{w_i} H\left(w_i^T x(n)\right), \tag{13}$$
which is executed N times, with the restriction that each obtained extraction vector be linearly independent of the previous ones. Note also that H(·) is Shannon's entropy for discrete random variables,
$$H(y) = -\sum_{k \in GF(q)} p_Y(k) \log p_Y(k). \tag{14}$$
Figure 1 describes the AMERICA pseudocode. Despite the adoption of an exhaustive search approach, AMERICA assures convergence to the correct inverse solution (as long as the criterion is perfectly calculated). There are algorithms that trade convergence guarantees for a lower computational cost, through approximations of the criterion defined in (13), such as the techniques named MEXICO and CANADA [20]. The MEXICO algorithm, in particular, adopts the strategy of sequentially minimizing the entropy between pairs of mixtures, which does not assure convergence to the global optimum, but reduces the expected computational cost in comparison with AMERICA [23].
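The entropy-based extraction strategy can be sketched as an exhaustive search over all non-zero extraction vectors, followed by a greedy selection of the lowest-entropy vectors that are linearly independent. The code below is a simplified illustration of that idea for GF(3) with two sources; it is not the AMERICA algorithm as published in [20], and the source distributions, the mixing matrix `A`, the sample size and the helper names are assumptions.

```python
import random
from collections import Counter
from itertools import product
from math import log2

random.seed(0)
P, N, T = 3, 2, 20000          # field order, number of sources, number of samples

def entropy(samples):
    """Empirical Shannon entropy (bits) of a sequence of discrete symbols."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def rank_mod_p(rows, p):
    """Rank of a list of row vectors over the prime field GF(p)."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col] % p:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

# Independent, non-uniform sources (assumed distributions) and their mixtures.
S = [random.choices(range(P), weights=[0.8, 0.1, 0.1], k=T),
     random.choices(range(P), weights=[0.1, 0.75, 0.15], k=T)]
A = [[1, 2], [1, 1]]           # assumed mixing matrix, invertible over GF(3)
X = [[sum(A[i][j] * S[j][n] for j in range(N)) % P for n in range(T)]
     for i in range(N)]

def extract(w):
    """Output y(n) = w^T x(n) computed over GF(P)."""
    return [sum(w[i] * X[i][n] for i in range(N)) % P for n in range(T)]

# Exhaustive search: rank every non-zero extraction vector by output entropy.
candidates = [w for w in product(range(P), repeat=N) if any(w)]
ranked = sorted(candidates, key=lambda w: entropy(extract(w)))

# Keep the lowest-entropy vectors that are linearly independent of those already kept.
W = []
for w in ranked:
    if rank_mod_p(W + [w], P) == len(W) + 1:
        W.append(w)
    if len(W) == N:
        break

G = [[sum(W[i][k] * A[k][j] for k in range(N)) % P for j in range(N)]
     for i in range(N)]
print(W, G)                    # G = W A should be a scaled permutation
```

The number of candidates grows as q^N, which is what motivates the lower-cost alternatives discussed in the text.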
A different perspective consists of considering the same criterion with lower-cost metaheuristics that are appealing for combinatorial problems, e.g., Artificial Immune Systems (AIS) [24]. In this case, the algorithm optimizes (13), but, at the end of the procedure, the N best candidate solutions that are linearly independent represent the extraction vectors that, finally, compose the separating matrix. This modus operandi is possible due to the intrinsic capacity of AIS to promote diversity among the candidate solutions during the search [25], which allows the algorithm to obtain the multiple solutions that are required to build the separating matrix.
Beyond the idea of exploring entropy as a contrast function, a second independence criterion involves the direct minimization of the mutual information among the extracted signals. Mutual information is defined according to (4), with the remark that, instead of differential entropy, we consider the entropy for discrete variables. The calculation of I(·) among the components of the estimated source vector (for simplicity, we leave aside the temporal index) hence provides
$$I(y) = \sum_{i=1}^{N} H(y_i) - H(y). \tag{15}$$
Fortunately, the second term on the right-hand side of (15) can be ignored, because, when an invertible mapping y = Wx of signals defined over discrete sets is considered, the relationship $p_{Y_1,\ldots,Y_N}(y) = p_{X_1,\ldots,X_N}(W^{-1}y)$ holds, which implies H(y) = H(x). Then, we obtain the final expression for the criterion:
$$\min_{W \in GL(N,q)} \sum_{i=1}^{N} H(y_i).$$
Since the size of the search space is proportional to $q^{N^2}$ [26], there is a considerable increase in comparison with the search space of the first criterion, whose size is proportional to $q^N$, thus hindering the use of exhaustive search methods in this case. It is then possible to consider again the application of population-based metaheuristics such as AIS [27], [28], which offer signal separation with quality levels similar to those of exhaustive search, but at a reduced computational cost. For instance, Figure 2 illustrates the successful application of the AIS-based method described in [28] to the separation of black-and-white images.
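For very small problems, the mutual-information criterion can still be evaluated exhaustively, which makes it easy to observe both facts used above: the joint entropy is unchanged by an invertible matrix, and the sum of marginal entropies is smallest at a separating solution. The sketch below does this over GF(2) with two binary sources; the Bernoulli parameters, the mixing matrix and the helper names are illustrative assumptions, and the exhaustive loop stands in for the metaheuristics of [27], [28].

```python
import random
from collections import Counter
from itertools import product
from math import log2

random.seed(0)
T = 20000

def entropy(samples):
    """Empirical Shannon entropy (bits) of a sequence of hashable symbols."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

# Two independent, non-uniform binary sources (assumed Bernoulli parameters).
s1 = [int(random.random() < 0.1) for _ in range(T)]
s2 = [int(random.random() < 0.2) for _ in range(T)]
A = [[1, 1], [0, 1]]                              # invertible over GF(2)
x1 = [(A[0][0] * a + A[0][1] * b) % 2 for a, b in zip(s1, s2)]
x2 = [(A[1][0] * a + A[1][1] * b) % 2 for a, b in zip(s1, s2)]

def unmix(W):
    """Apply y = W x over GF(2) to the two mixtures."""
    y1 = [(W[0][0] * a + W[0][1] * b) % 2 for a, b in zip(x1, x2)]
    y2 = [(W[1][0] * a + W[1][1] * b) % 2 for a, b in zip(x1, x2)]
    return y1, y2

# All invertible 2x2 matrices over GF(2), i.e. the group GL(2, 2).
GL = [((a, b), (c, d)) for a, b, c, d in product(range(2), repeat=4)
      if (a * d - b * c) % 2 != 0]

def cost(W):
    """Sum of the marginal entropies of the outputs."""
    y1, y2 = unmix(W)
    return entropy(y1) + entropy(y2)

W_best = min(GL, key=cost)
G = [[sum(W_best[i][k] * A[k][j] for k in range(2)) % 2 for j in range(2)]
     for i in range(2)]

# The joint entropy does not depend on the invertible matrix applied.
joint_x = entropy(list(zip(x1, x2)))
joint_y = entropy(list(zip(*unmix(W_best))))
print(len(GL), G, joint_x, joint_y)
```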

C. Separation over GF(q) in convolutive models
Let us consider a new situation, in which signals defined over GF(q) are combined both in space and in time, which yields the convolutive mixture model, mathematically described in (12).
ICA can be used once again to recover the original signals, as the authors of [29] propose. Assume that the sources are non-uniform and mutually independent (in space and time). As a result, the mixing process (again) generates signals with greater entropy than the sources, in a similar fashion to the principle behind the AMERICA algorithm.
Hence, it is possible to use the extraction/deflation technique, previously mentioned in Section III-B, to revert the entropy-increasing effect. A source extraction problem takes place, which consists of determining the separation filters that produce the output
$$y(n) = \sum_{k=0}^{T_e} w^T(k)\, x(n-k),$$
where $T_e$ is the maximum delay present in one of the filters $w_j(n)$ and $w(n) = [w_1(n) \ldots w_N(n)]^T$. Figure 3 presents an example of a convolutive mixture for N = 2, in association with the extraction procedure of a source signal. As in the instantaneous case, the mixing matrix A(n) must be invertible, i.e., the determinant of A(n) must be non-null for all n. In the context of temporal filtering, this implies that, if the matrix is composed of finite impulse response (FIR) filters, the inversion is only possible if the extraction filter contains feedback loops [29], i.e., if it is a recursive filter with coefficients $b_j(k)$ and $c_j(l)$, where $T_b$ and $T_c$ are the numbers of coefficients $b_j(k)$ and $c_j(l)$ of the filter, respectively. Then, the values of these parameters are estimated by minimizing the loss function of the extraction process, namely the entropy H(y(n)) of the output y(n) obtained according to (20). When the extraction succeeds, one obtains $y(n) = c\,s_i(n-d)$, i ∈ {1, ..., N}, which means that a delayed and scaled version of a source is recovered.
After extracting a source, the next step is the deflation process, in order to remove the recovered source from the remaining mixtures. Figure 4 details this task: assuming that y(n) represents the extracted signal, it must be processed by a non-causal FIR deflation filter, which identifies the intersymbol interference signature of the source (the mixing filter $a_{ij}(n)$) with respect to each mixture; then, the signal can be properly subtracted from the mixtures. The deflation filter parameters are defined using (again) the entropy measure, via a criterion that is analogous to the one employed for the deflation of instantaneous mixtures [21]. When deflation ends, the extraction step must be repeated in order to obtain the second source, keeping in mind that the new mixtures are represented by the signals $r_i(n)$, i = 1, 2, ..., N. Therefore, both processes are alternated until all mixtures become null signals, which means that all sources have been recovered.

D. Applications
Although BSS over finite fields and the associated solution strategies via ICA were initially considered only from a theoretical perspective, there are already some potential applications being developed, especially when the mixtures follow the instantaneous paradigm.
A first application lies in eavesdropping on MIMO systems that employ PAM modulation and Tomlinson-Harashima pre-coding [20]. Consider a system with N transmitters and N receivers, which is designed to send N binary signals to each receiver through a pure attenuation channel $H \in [0, 1]^{N \times N}$. Since the transmitters know the channel characteristics, we could consider the strategy of each one sending the vector components given by $x(n) = H^{-1} s(n)$, such that the reception would result in $y(n) = Hx(n) = s(n)$.
However, if the system employs PAM modulation, hence transmitting data only in the interval [0, 1], this approach would lead to transmission sequences with invalid values. In this case, the Tomlinson-Harashima spatial coding can be employed to circumvent this limitation [30]: the channel matrix is quantized into P levels (P is prime), and the inverse (over GF(P)) of this new matrix $\hat{H}$ is applied to the transmission sequence. In the resulting expression, $\hat{H}^{-1} s(n)$ is a "conventional" product over the real field, and $[\cdot]_P$ denotes the modulo-P operation. This formulation results in a sequence with real values in [0, 1] that can be transmitted via a PAM scheme and, at the receiver, the original values are reconstructed via the expression given in [20]. In this context, this communication system can be eavesdropped, via ICA, as follows:
• A third party, with another set of N antennas, intercepts the signals that are being transmitted, $\hat{y}_e(n)$.
• The eavesdropper knows the value of P, but does not know the attenuation matrix between the transmitters and this set of antennas, $\hat{H}_e$, which is assumed to be quantized in P levels.
• When the same operations of the legitimate receivers are applied, the result [20], written in terms of $(P-1)^2\, \hat{y}_e(n)$, is the GF(P) product of the transmitted sequence by a matrix $\hat{A}$ given by the composition of $\hat{H}_e$ with $\hat{H}^{-1}$, the latter being employed in the pre-coding step.
• Equation (25) leads to the definition of the BSS problem over finite fields; hence, the application of an ICA algorithm can invert $\hat{A}$ and consequently provide estimates of the transmitted sequences.
Naturally, this sort of ICA application relies on hypotheses that restrict its viability; nevertheless, it gives interesting insights into other potential applications related to coding theory. This perspective is reinforced by the second example of application, which involves ICA for improving Network Coding algorithms.
In simple terms, Network Coding proposes that the intermediate nodes of a communication network can, instead of just forwarding data packets, process linear combinations of them, with randomly defined coefficients over a finite field. With this idea, it is possible to show that the transmission flow over the network is maximized and that the robustness against errors is increased, especially in the context of real-time applications [31].
However, in order to decode the packets at the destination nodes, the combination coefficients must be sent as a packet header, which is an overhead for the transmission rate in the case of small packets. This is the aspect to be reconsidered: if the coefficients are not inserted in the packet, decoding can still be done by casting the problem as BSS over GF(q). This is the proposal introduced in [31], which removes the coefficient header and replaces it with a non-linear hashing function of each packet, in order to assure that the data are non-uniformly distributed, a fundamental condition for decoding via ICA, as seen in Theorem 1. Since data traffic in multimedia networks (e.g., compressed audio and video) quite usually presents a distribution close to uniform, the hash mapping is necessary to increase the discriminative power of the algorithm's cost function.
It is important to emphasize that the hashing function implies, for each original packet, an overhead smaller than that of the conventional approach, while the failure probability of the separating algorithm is kept low. Experimental analyses in this context have shown that packets with sizes between 1 and 1.5 kilobytes present good decoding rates with the new technique, saving about 50% of the header size [31].

IV. SEPARATION OF SPARSE SIGNALS
In the present section, we shall discuss another emerging topic in BSS that has been extensively studied in recent years: the case in which the sources can be modeled as sparse signals. Besides being observed in several real applications [32], the hypothesis of sparsity allows one to develop novel methods that are able to deal with situations in which classical approaches, such as ICA, fail. The separation framework based on the sparsity hypothesis is usually referred to as sparse component analysis (SCA).
The brief overview of SCA provided in this section is organized as follows. Firstly, we discuss the notion of a sparse signal. Then, in Section IV-B, we discuss how the sparsity prior is exploited in the case of underdetermined mixtures. As will be seen in Section IV-C, sparsity is also useful prior information in the context of determined mixtures, especially when the hypothesis of independence does not hold.

A. Sparse signals
Although there is no formal definition of a sparse signal, the notion of sparsity is now ubiquitous in fields such as signal processing and machine learning, and is associated with a signal that can be represented by a number of elements that is much smaller than the observed dimension of the signal. In Figure 5, examples of sparse signals and images are provided. It is worth noticing the presence of a large number of temporal samples (in the case of signals) or pixels (in the case of images) that take values that are almost null. Examples of sparse signals and images arise in different domains, including biomedical signal processing [33], geophysics [34], and audio processing [35].
Before defining separation methods based on the notion of sparsity, it is paramount to define measures of sparsity. The most natural one is the ℓ0-pseudo-norm, which, for a discrete signal of T samples, represented by the vector s, is defined as follows³ [36]:
$$\|s\|_0 = \left| \{\, i : s_i \neq 0,\ i = 1, \ldots, T \,\} \right|. \tag{26}$$
The ℓ0-pseudo-norm is simply the number of non-null samples of s. Therefore, a sparse signal tends to present a small ℓ0-pseudo-norm. Moreover, a signal can be sparse in other domains, that is, when the ℓ0-pseudo-norm of a transformed version of s is low; a common example is a sine wave, which is sparse in the Fourier domain. There are other measures of sparsity. Among the most relevant ones is the ℓ1-norm, defined as follows:
$$\|s\|_1 = \sum_{i=1}^{T} |s_i|. \tag{27}$$
Since the ℓ1-norm engenders convex optimization problems, it is widely used in a vast number of signal processing tasks.
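Both sparsity measures are straightforward to compute, and the sine-wave example can be reproduced with a direct discrete Fourier transform. In the sketch below, the numerical tolerance used to decide that a sample is "null", the signal length and the helper names are assumptions made for the illustration.

```python
import cmath
import math

def l0_norm(v, tol=1e-9):
    """Number of entries whose magnitude exceeds a small tolerance."""
    return sum(1 for a in v if abs(a) > tol)

def l1_norm(v):
    """Sum of the magnitudes of the entries."""
    return sum(abs(a) for a in v)

def dft(v):
    """Direct (O(n^2)) discrete Fourier transform, adequate for short signals."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

T = 64
sine = [math.sin(2.0 * math.pi * 4 * t / T) for t in range(T)]   # 4 cycles in 64 samples
spectrum = dft(sine)

print(l0_norm([3, -4, 0]), l1_norm([3, -4, 0]))
print(l0_norm(sine), l0_norm(spectrum))   # dense in time, two active bins in frequency
```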

B. Separation of sparse signals in underdetermined models
The first works on sparse models for BSS addressed the case of underdetermined mixtures [37], [38], [39] and are based on two steps. In the first one, one seeks to estimate the mixing matrix A. This first process is illustrated in Figure 6, which represents a BSS problem in which there are N = 3 sources and M = 2 mixtures. Given that the sources are sparse, there is a high probability that only one source is active at a given instant. For instance, let us consider the instants in which the magnitude of source $s_1$ is much larger than those of $s_2$ and $s_3$. In these moments, the mixtures almost become functions of a single source, that is, $x_1 \approx a_{11} s_1$ and $x_2 \approx a_{21} s_1$, and, therefore, they carry information about the first column of A. Analogously, when the sources $s_2$ and $s_3$ are exclusively active, the mixtures bring information about the second and third columns of A, respectively.
The fact described in the last paragraph is illustrated in Figure 6, which provides the scatter plot of the mixtures. One can note that the information on the columns of $\mathbf{A}$ is related to the clusters that arise when the sources are almost isolated, that is, when there is a single active source dominating the others. Therefore, a natural idea to estimate $\mathbf{A}$ is to determine the directions along which there is a relevant concentration of points. Such a procedure can be carried out by clustering algorithms [38].
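The clustering idea above can be sketched as follows, under strong simplifying assumptions: the sources are built so that at most one of them is active at each instant, the columns of the mixing matrix have unit norm, and the clustering step is replaced by a plain histogram of the scatter-plot angles. The function `estimate_directions`, its parameters and the test data are hypothetical and are not taken from [38].

```python
import math

def estimate_directions(x1, x2, n_sources, n_bins=36, min_radius=1e-3):
    """Estimate the column directions of A (angles in [0, pi)) from the mixture scatter plot."""
    width = math.pi / n_bins
    bins = [[] for _ in range(n_bins)]
    for a, b in zip(x1, x2):
        if math.hypot(a, b) < min_radius:    # near-silent instants carry no direction
            continue
        angle = math.atan2(b, a) % math.pi   # opposite signs fold onto the same direction
        bins[min(int(angle / width), n_bins - 1)].append(angle)
    # Keep the n_sources most populated bins and average the angles inside each one.
    best = sorted(bins, key=len, reverse=True)[:n_sources]
    return sorted(sum(group) / len(group) for group in best if group)

# Assumed mixing matrix (M = 2, N = 3) with unit-norm columns at 22, 68 and 131 degrees.
true_angles = [math.radians(d) for d in (22.0, 68.0, 131.0)]
A = [[math.cos(t) for t in true_angles],
     [math.sin(t) for t in true_angles]]

# Idealised sparse sources: at each instant at most one source is active.
T = 300
sources = [[0.0] * T for _ in range(3)]
for n in range(T):
    j = n % 5                                         # j = 3 or 4: all sources silent
    if j < 3:
        sources[j][n] = (-1) ** n * (1.0 + (n % 7))   # varying amplitude and sign

x1 = [sum(A[0][j] * sources[j][n] for j in range(3)) for n in range(T)]
x2 = [sum(A[1][j] * sources[j][n] for j in range(3)) for n in range(T)]

estimated = estimate_directions(x1, x2, n_sources=3)
print([round(math.degrees(t), 2) for t in estimated])
```

Each estimated angle defines a column of the mixing matrix up to scale and sign, as $[\cos\theta, \sin\theta]^T$. With real data the clusters are spread around these directions, so a proper clustering or peak-picking step would replace the simple choice of the most populated bins.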
Having estimated the mixing matrix $\mathbf{A}$, the second step is to solve an underdetermined linear system for estimating the sources. A first idea in that respect would be to formulate a least squares problem. However, given that the number of unknowns is greater than the number of observations, the resulting problem becomes ill-posed and admits infinitely many solutions. As an alternative, one may consider, as prior information, the fact that the sources are sparse, which can be implemented according to the following optimization problem:

$$\min_{\mathbf{s}(n)} \|\mathbf{s}(n)\|_1 \quad \text{subject to} \quad \mathbf{x}(n) = \hat{\mathbf{A}}\,\mathbf{s}(n), \tag{28}$$

where $\hat{\mathbf{A}}$ corresponds to the estimated version of the mixing matrix (this estimation is obtained in the first step). It is worth noticing that problems such as (28) have been extensively studied over the last years, mainly due to their applicability in compressive sensing [40], which aims at sampling signals and images at a rate that is lower than the Shannon-Nyquist rate.
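To make the second step concrete, the sketch below computes a minimum $\ell_1$-norm solution of the underdetermined system $\mathbf{x} = \hat{\mathbf{A}}\mathbf{s}$ for $M = 2$ mixtures. It relies on the fact that such a problem is a linear program whose optimum is attained by a solution with at most $M$ non-null entries, so that, for very small problems, all pairs of columns can be enumerated. This exhaustive strategy, the function names and the example matrix are assumptions made for illustration; larger problems would call for a linear programming solver.

```python
import math
from itertools import combinations

def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] [u, v]^T = [e, f]^T by Cramer's rule; None if singular."""
    det = a * d - b * c
    if abs(det) < 1e-12:
        return None
    return ((e * d - b * f) / det, (a * f - e * c) / det)

def min_l1_solution(A, x):
    """Minimum l1-norm solution of A s = x for a 2 x N matrix A (enumeration of column pairs)."""
    n_sources = len(A[0])
    best, best_cost = None, math.inf
    for i, j in combinations(range(n_sources), 2):
        sol = solve_2x2(A[0][i], A[0][j], A[1][i], A[1][j], x[0], x[1])
        if sol is None:
            continue
        cost = abs(sol[0]) + abs(sol[1])
        if cost < best_cost:
            s = [0.0] * n_sources
            s[i], s[j] = sol
            best, best_cost = s, cost
    return best

# Assumed estimate of the mixing matrix: unit-norm columns at 22, 68 and 131 degrees.
angles = [math.radians(d) for d in (22.0, 68.0, 131.0)]
A_hat = [[math.cos(t) for t in angles],
         [math.sin(t) for t in angles]]

# One instant in which only the second source is active.
s_true = [0.0, 2.5, 0.0]
x = [sum(A_hat[r][j] * s_true[j] for j in range(3)) for r in range(2)]

s_hat = min_l1_solution(A_hat, x)
print([round(v, 6) for v in s_hat])
```

In this example a single source is active and the columns have unit norm, so the minimum $\ell_1$-norm solution coincides with the true source vector. When several sources are simultaneously active, exact recovery is not guaranteed in general.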
Besides the formulation expressed in (28), there are other approaches that deal with inverse problems by making use of prior information related to sparse signals. Two well-known examples are the least absolute shrinkage and selection operator (LASSO) and a formulation known as basis pursuit de-noising (BPDN) [41].
Finally, it is worth mentioning that a similar approach based on a two-step strategy can also be applied in the context of sparse source separation. For instance, the algorithm DUET [42] estimates the mixing matrix by considering the disjoint orthogonality assumption, which means that only a single source can be active at a given instant. In a similar fashion, the algorithms TIFROM and TiFCorr search for regions where the sources are isolated [43], either in time or in a transformed domain.

C. Separation of sparse signals in determined models
The assumption of sparse sources is also useful as a prior in the context of determined models. A first approach in this case is similar to the one described for the case of underdetermined models (estimation of the mixing matrix followed by sparse inversion). A second possibility is to set up a separation criterion that takes into account the sparsity prior. In this case, which will be discussed in the present section, source estimation is carried out through a single stage.
Let us consider the problem of source extraction, in which the goal is to retrieve a single source from the mixtures. As discussed in Section II-A, source extraction can be conducted by estimating a vector $\mathbf{w}_i$ so that $\mathbf{y}_i = \mathbf{w}_i^T \mathbf{X}$ provides a good estimate of a given source. In the case of sparse sources, due to the action of the mixing process, the signals $x_j$ are less sparse than the sources $s_i$. Therefore, analogously to ICA, a natural approach to retrieve a source would be to adjust the extraction vector so that $\mathbf{y}_i$ is as sparse as possible.
In [44], extraction of sparse sources is conducted by considering a criterion based on the $\ell_1$-norm, so the adjustment of $\mathbf{w}_i$ is carried out as follows:

$$\min_{\mathbf{w}_i} \|\mathbf{y}_i\|_1 = \|\mathbf{w}_i^T \mathbf{X}\|_1 \quad \text{subject to} \quad \|\mathbf{w}_i\|_2 = 1. \tag{29}$$

The restriction on the $\ell_2$-norm is necessary here to avoid trivial solutions and implicitly assumes that the data is submitted to a whitening pre-processing stage. In [44], the authors have shown that (29) is indeed a contrast function when the sources are disjoint orthogonal. Besides, even when this condition is not observed, numerical experiments pointed out that the minimization of the $\ell_1$-norm leads to source separation [44].

Alternatively, it is possible to retrieve sparse signals by means of a separation criterion underpinned by the $\ell_0$-pseudo-norm [45]. In this case, the resulting optimization problem can be expressed as follows:

$$\min_{\mathbf{w}_i} \|\mathbf{y}_i\|_0 = \|\mathbf{w}_i^T \mathbf{X}\|_0 \quad \text{subject to at least one element of } \mathbf{w}_i \text{ being non-null}. \tag{30}$$

In [45], the authors proved that a sufficient condition to ensure the contrast property of (30) is given by an upper bound on $\|\mathbf{s}_1\|_0$, which takes a simpler form in the particular case of $N = 2$ sources; this simplified condition [45] is referred to as (31). It is worth noticing that condition (31) allows a certain degree of overlapping between the sources, provided that they have different degrees of sparsity (in the sense of the $\ell_0$-pseudo-norm). Another fundamental aspect here is that the obtained conditions are not expressed in a probabilistic fashion and do not require statistical independence. In other words, it is possible to separate sparse signals even in cases in which ICA fails.
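The $\ell_1$-based extraction criterion with a unit $\ell_2$-norm constraint can be illustrated for $N = M = 2$ by parametrizing the extraction vector as $\mathbf{w} = [\cos\theta, \sin\theta]^T$ and searching over $\theta$. The sketch below is not the algorithm of [44]: the grid search, the function names and the data are assumptions. In particular, the sources are built to be disjoint with equal power and the mixing matrix is a rotation, so that the mixtures are already white and no whitening stage is needed.

```python
import math

def l1_of_output(theta, x1, x2):
    """l1-norm of y = w^T X for the unit-norm vector w = [cos(theta), sin(theta)]."""
    c, s = math.cos(theta), math.sin(theta)
    return sum(abs(c * a + s * b) for a, b in zip(x1, x2))

def extract_sparsest(x1, x2, n_grid=180):
    """Grid search over half of the unit circle (w and -w yield the same cost)."""
    thetas = [k * math.pi / n_grid for k in range(n_grid)]
    best = min(thetas, key=lambda t: l1_of_output(t, x1, x2))
    w = (math.cos(best), math.sin(best))
    y = [w[0] * a + w[1] * b for a, b in zip(x1, x2)]
    return w, y

# Two disjoint, zero-mean, unit-power sources; s1 is the sparser one.
T = 200
s1 = [0.0] * T
s2 = [0.0] * T
for n in range(T):
    r = n % 4
    if r == 0:
        s1[n] = 2.0 if (n // 4) % 2 == 0 else -2.0
    elif r == 1:
        s2[n] = math.sqrt(2.0)
    elif r == 2:
        s2[n] = -math.sqrt(2.0)

# Assumed mixing matrix: a rotation by 35 degrees, which keeps the data white.
phi = math.radians(35.0)
x1 = [math.cos(phi) * a - math.sin(phi) * b for a, b in zip(s1, s2)]
x2 = [math.sin(phi) * a + math.cos(phi) * b for a, b in zip(s1, s2)]

w, y = extract_sparsest(x1, x2)
print(round(math.degrees(math.atan2(w[1], w[0])), 1))
```

Since the sources are disjoint, the cost reduces to a weighted sum of the $\ell_1$-norms of the two sources, and its minimum over the unit circle is attained when the output contains only the source with the smallest $\ell_1$-norm, here $s_1$.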
Concerning the practical implementation of methods based on (30), an important issue is that real signals are rarely sparse in terms of the $\ell_0$-pseudo-norm. Indeed, actual signals that can be considered sparse often contain a few relevant coefficients and many coefficients that are close but not necessarily equal to zero. In order to overcome this problem, one can make use of a smooth approximation of the $\ell_0$-pseudo-norm as, for instance, the following one [46]:

$$\|\mathbf{s}\|_0 \approx T - \sum_{n=1}^{T} \exp\!\left(-\frac{s(n)^2}{2\sigma^2}\right),$$

where $\sigma$ controls the smoothness of the approximation. If $\sigma \to 0$, then this smooth approximation approaches the $\ell_0$-pseudo-norm.
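The role of $\sigma$ can be checked numerically with the short sketch below. It assumes a Gaussian-kernel surrogate, $T - \sum_n \exp(-s(n)^2 / 2\sigma^2)$, which is one common smooth approximation of the $\ell_0$-pseudo-norm; whether it matches the exact expression adopted in [46] is an assumption, as are the function name and the test signal.

```python
import math

def smoothed_l0(s, sigma):
    """Gaussian-kernel surrogate of the l0-pseudo-norm: T - sum_n exp(-s(n)^2 / (2 sigma^2))."""
    return len(s) - sum(math.exp(-(v * v) / (2.0 * sigma * sigma)) for v in s)

# Two relevant coefficients, three small but non-null ones, three exact zeros.
signal = [0.0, 3.0, 0.01, -0.02, 0.0, -1.5, 0.005, 0.0]
exact_l0 = sum(1 for v in signal if v != 0.0)

for sigma in (1.0, 0.1, 0.001):
    print(sigma, smoothed_l0(signal, sigma))
```

For a moderate value such as $\sigma = 0.1$, the surrogate is close to the number of relevant coefficients and largely ignores the small ones, whereas for very small $\sigma$ it approaches the exact count of non-null samples.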

V. CONCLUSION
This work is an introductory text about blind source separation and its most recent perspectives in domains beyond the real or complex sets, which is the case of separation over finite fields, and beyond the statistical independence assumption, which is the case of separation of sparse signals.
In the context of source separation over GF(q), ICA-based strategies were discussed, with emphasis on entropy-based cost functions that promote the separation of signals whether the mixing model is instantaneous or convolutive. Both models imply a combinatorial optimization problem, which can be solved via exhaustive search procedures or via bioinspired strategies, e.g., immune-inspired algorithms. Finally, two examples derived from coding theory show that BSS over Galois fields already offers preliminary contributions in terms of real applications.
In the case of separation of sparse signals, the two-step procedure usually employed for underdetermined models was first discussed. This approach considers sparsity both for estimating the mixing matrix and for solving the inverse problem associated with source estimation. In addition, the formulation of separation criteria based on sparsity for determined models was discussed. An interesting aspect, in this scenario, is that sparsity-based criteria can be applied even when the sources are statistically dependent.
Naturally, the subjects that were introduced in this work are not fully explored here and, furthermore, have very interesting future perspectives, in the context of new algorithms and criteria, theoretical analyses and, ultimately, the potential association of sparsity with signals defined over a finite field.

Figure 2. Application example of the ICA over GF algorithm with black-and-white images.

Figure 3. Model representing the convolutive mixture problem over GF(q) when N = 2, and the extraction system of a source.

Figure 4. Representation of the deflation step, considering $y_1(n)$ the signal to be removed.

Figure 6. Mixtures scatter plot. Note that the columns of $\mathbf{A}$ define the directions along which there is a high concentration of data.

$$\min_{\mathbf{w}_i} \|\mathbf{y}_i\|_0 = \|\mathbf{w}_i^T \mathbf{X}\|_0 \quad \text{subject to at least one element of } \mathbf{w}_i \text{ being non-null}. \tag{30}$$