An Introduction to Information Theoretic Learning, Part II: Applications

Abstract: This is the second part of the introductory tutorial on information theoretic learning (ITL). After the theoretical foundations presented in Part I, it discusses the concept of correntropy, a new similarity measure derived from the quadratic entropy, and presents example problems where the ITL framework can be successfully applied: dynamic modelling, equalization, independent component analysis and cluster analysis.

I. INTRODUCTION
In this second half of the two-part tutorial on information theoretic learning (ITL), we analyze a number of key ITL criteria and methods and discuss their use in some important applications. The discussion starts from the notion of correntropy, a higher-order statistical extension of the classical concept of correlation, and includes a brief survey of pertinent works and case studies. A representative selection of supervised and unsupervised applications is then presented and used as a background for the exposition of important methods, such as those based on error entropy, unsupervised kernel criteria and independence. It is our belief that this presentation will provide the reader with a complete view of canonical ITL strategies and of their potential.
The work is structured as follows: Section II presents the definition of correntropy and discusses several instances of its application; Section III analyzes the use of ITL methods in dynamic modeling, supervised and unsupervised equalization, independent component analysis (including recent formulations concerning finite fields) and clustering; finally, Section IV summarizes our conclusions and final remarks.

II. CORRENTROPY
The great majority of classical adaptive equalization methods, and also of those based on ITL, assume that the available data are independently distributed, which, in many cases, is not true. Thus, Santamaria et al. [1] proposed a new measure that is able to take into account both the statistical and temporal structures of the signals. The proposed generalized correlation function was termed correntropy, since it is directly related to Rényi's quadratic entropy estimated using the Parzen window (see Section IV.B of Part I). Mathematically, correntropy is defined as
$$v(t_1, t_2) = E\left[K(x_{t_1}, x_{t_2})\right], \tag{1}$$
where $x_t$ is a stochastic process and $K(\cdot)$ is a kernel function. As with other ITL measures (see Part I), the Gaussian function is usually employed as the kernel:
$$v(t_1, t_2) = E\left[G(x_{t_1} | x_{t_2}, \sigma^2)\right] = E\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_{t_1} - x_{t_2})^2}{2\sigma^2}\right)\right]. \tag{2}$$
In this case, for a non-zero lag, the value of correntropy asymptotically tends to the information potential [1], [2]. Correntropy can also be straightforwardly redefined for a pair of random variables, $X$ and $Y$:
$$v(X, Y) = E\left[G(X | Y, \sigma^2)\right], \tag{3}$$
which is formally denoted as the cross-correntropy between $X$ and $Y$. There are two main interpretations of correntropy. The first one associates it with a feature space that relates nonlinearly to the input space; hence, using correntropy as a Parzen kernel is equivalent to having a linear kernel in a high-dimensional space (a Hilbert space) with reproducing properties [3]. The second interpretation is that it is the integral, over the line $x_{t_1} = x_{t_2}$, of the joint PDF estimated with the Parzen window, which indicates that correntropy can be viewed as a measure of the probability that the random variables $x_{t_1}$ and $x_{t_2}$ are equal. This view supports the notion of correntropy as a generalized similarity measure.
Using a series expansion for the Gaussian kernel, equation (2) may be rewritten as
$$v(t_1, t_2) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{n=0}^{\infty}\frac{(-1)^n}{2^n \sigma^{2n}\, n!}\, E\left[(x_{t_1} - x_{t_2})^{2n}\right], \tag{4}$$
which involves all the even-order moments of the random variable $x_{t_1} - x_{t_2}$. Through (4), it is possible to see that this new measure includes the information provided by the conventional covariance function. Furthermore, the authors of [1] demonstrate that, in order to obtain the property $v(t, t - \tau) = v(\tau)$, the stochastic process must be strictly stationary. In this case, the definition presented in (2) can be estimated in terms of a sample mean:
$$\hat{v}(\tau) = \frac{1}{N - \tau}\sum_{n=\tau+1}^{N} G(x_n | x_{n-\tau}, \sigma^2), \tag{5}$$
where $N$ is the number of available signal samples.
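As an illustration of the sample-mean estimate mentioned above, the following minimal sketch computes the correntropy of a signal at a given lag with a Gaussian kernel. The function names, the normalization by the number of available sample pairs, and the AR(1)-style test signal are assumptions made for this example only; they are not taken from [1].

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel of standard deviation sigma evaluated at the difference u."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def correntropy(x, lag, sigma):
    """Sample mean of the kernel over all pairs (x[n], x[n - lag])."""
    x = np.asarray(x, dtype=float)
    if lag < 0 or lag >= x.size:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    diffs = x[lag:] - x[:x.size - lag]
    return gaussian_kernel(diffs, sigma).mean()

rng = np.random.default_rng(0)
white = rng.standard_normal(2000)

# Hypothetical temporally correlated signal: AR(1) recursion driven by `white`,
# rescaled to unit sample variance so both signals have the same power.
colored = np.empty_like(white)
colored[0] = white[0]
for n in range(1, white.size):
    colored[n] = 0.9 * colored[n - 1] + white[n]
colored /= colored.std()

for lag in (1, 2, 5):
    print(lag, correntropy(white, lag, 1.0), correntropy(colored, lag, 1.0))
```

Because neighbouring samples of the correlated signal are closer to each other than those of the white signal, its correntropy at small lags is larger, which is the temporal structure that the measure is meant to capture.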
A careful analysis of (4) was carried out in [4], where the authors show that the series may diverge depending on the distribution of the signal being considered. However, for shorter-tailed distributions such as the uniform, it is also possible to derive certain conditions under which the series converges [4]. Nevertheless, the series need not exist for correntropy to exist: as demonstrated in an equalization scenario, correntropy performs well even when the series diverges.
Moreover, the choice of the kernel size $\sigma$ is crucial. If its value is too large, correntropy will basically rely on second-order properties. On the other hand, if the value is too small, an undesirable behavior can be observed, in which correntropy is dominated by moments of extremely high order [4]. In that sense, $\sigma$ plays a different role in correntropy, being related to the weights on the statistical moments, while, in the information potential, $\sigma$ is closely related to the shape of the distributions. It is also worth mentioning that, although correntropy asymptotically converges to the information potential when Gaussian kernels are used, the computational complexity of correntropy is one order below that of the information potential, as it involves a single sum over the kernel argument.
Comparing correntropy to the conventional autocorrelation function, it is possible to observe that the mean value of the former changes for different source distributions, whereas the autocorrelation function remains basically the same. This characteristic may be useful in eliminating the bias in the estimation of entropy from finite data sets using the Parzen window method (recall Section IV.B of Part I).
If correntropy is used to design an optimal equalizer for a digital communication system, it is possible to show that the performance will be better than that of a Mean Square Error (MSE)-based receiver if the noise PDF has its global maximum at the origin [5]. Furthermore, it is highly robust to impulsive noise. On the other hand, MSE-based estimation is biased if the noise PDF has non-zero mean, which leads to performance degradation in the presence of impulsive noise. Thus, correntropy may be very useful in nonlinear and non-Gaussian signal processing [5]. This characteristic also comes from the fact that correntropy may be viewed as a localized similarity measure, related to the probability of how similar two variables are in a neighborhood of the joint space controlled by the kernel size.
In the last few years, correntropy has been used successfully in a large variety of applications, comparing favorably with classical techniques. In [6], correntropy is used in a supervised scenario with impulsive noise, outperforming LMS in system identification and noise cancellation. In [1], [7], correntropy is used to perform blind equalization (which is discussed in detail in Section III-C), outperforming classical methods like the CMA [8] in the case of correlated sources. Nonetheless, CMA may perform better if the sources are iid [7], [9]. In [10], correntropy is used as a unifying instantaneous blind source separation criterion, capable of separating iid sources, which requires higher-order statistics (HOS), and also of separating temporally correlated Gaussian sources with distinct spectra, which demands temporal information. In [11], it is shown that, since correntropy is capable of quantifying nonlinear statistical relationships, it is suitable as a measure for identifying nonlinear dynamic systems.

III. ITL APPLICATIONS
ITL was proposed with a practical emphasis, aiming at the solution of complex signal processing problems that require a significant amount of information about the available data. As a consequence, it is possible to assemble a representative set of applications of the algorithms to different tasks, employing different criteria, filtering structures and optimization procedures. In the following, some examples are provided in order to give a general idea of the potential of this paradigm in dealing with modern, data-driven engineering problems.

A. Dynamic modeling
The aim of dynamic modeling is to build a mathematical representation of the functional relationship between input and output variables. This is the case, for example, when one is interested in building a model that predicts the behavior of an unknown dynamical system. This is a problem often studied within neural network theory, for which a recent and successful approach has been the application of deep neural networks [12] or, alternatively, the popular and well-established multilayer perceptron (MLP) [13] in the role of predictor. The inputs are time-delayed measures of a state variable of the system and the model should provide an estimate of the current state value (see Figure 1).
While the most extensively used criterion in this context is the mean squared error, the typical ITL approach is based on the Minimum Error Entropy (MEE) criterion [14], which consists in the minimization of the error entropy with respect to the MLP synaptic weights. The ideal condition in this case is to have the error signal always at zero, i.e., $e(n)$ should have a distribution in the form of a delta function centered at zero.
Since the quadratic entropy, in association with the use of Gaussian kernels, yields a simple calculation of the required integral, as seen in Section IV.B of Part I, the authors argue in favor of its use in this scenario, arriving at an optimization problem whose cost function (6) is to be minimized via a gradient descent method (see footnote 1). Recall that $V(\cdot)$ is the information potential estimator; we can drop the $-\log(\cdot)$ operation to simplify the expression, converting (6) into a cost function to be maximized. Hence, the gradient vector with respect to the weights is calculated, and it is possible to apply the backpropagation algorithm [13]. Observe that the gradient expression has a computational cost proportional to $N^2$, instead of the $O(N)$ of MSE-based training, which is a direct consequence of adopting the information-theoretic criterion and its associated information potential estimator. Simulations in chaotic time series prediction and nonlinear identification [14] have shown that the MEE criterion gave rise to an error PDF more concentrated around zero, while the PDF of the output signal was closer to that of the desired signal, in comparison with the corresponding results of an MSE-based neural network. Moreover, a sequel to this pioneering work indicated that the dynamic adjustment of the kernel size during training, by means of an annealing process, increases the chance of escaping from locally optimal solutions [15].
Zupanc [16] gives additional support to the observations of previous works, through a comparative analysis of MSE and MEE in the prediction of a chaotic dynamic system and in the modeling of a polymer mixing process. The results indicated that the MEE criterion achieves a better generalization capability, is more robust to outliers and better approximates the PDF of the state variable under observation. However, the higher computational cost in comparison with an MSE training algorithm and the sensitivity to the kernel size adjustment are issues that the user must take into account.
Prediction of the power generated by a wind park is another domain where the effectiveness of ITL criteria has been verified. This scenario is interesting because the error distribution is non-Gaussian, and [17] showed that the MEE and the maximum correntropy criterion (MCC) are better than the MSE in providing accurate predictions for both the offline and online training modes.
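To make the MEE training idea of this subsection concrete, the sketch below maximizes the information potential of the error by plain gradient ascent. Two simplifications are assumed and should be read as such: a linear model replaces the MLP (so no backpropagation is needed), and the information potential is taken in its usual pairwise form, the average of a Gaussian kernel of variance $2\sigma^2$ over all pairs of error samples. The function names, step size and data are hypothetical.

```python
import numpy as np

def pairwise_kernel(e, sigma):
    """Pairwise error differences and Gaussian kernel values (variance 2*sigma^2)."""
    diff = e[:, None] - e[None, :]
    var = 2.0 * sigma**2
    kern = np.exp(-diff**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return diff, kern, var

def information_potential(e, sigma):
    """Average kernel value over all N^2 pairs of error samples."""
    _, kern, _ = pairwise_kernel(np.asarray(e, dtype=float), sigma)
    return kern.mean()

def ip_gradient(X, e, sigma):
    """Gradient of the information potential w.r.t. w for y = X @ w, e = d - y."""
    diff, kern, var = pairwise_kernel(e, sigma)
    # d(e_i - e_j)/dw = -(x_i - x_j), hence the positive sign below.
    dX = X[:, None, :] - X[None, :, :]
    return np.einsum("ij,ijk->k", kern * diff / var, dX) / e.size**2

def train_mee(X, d, sigma=1.0, step=2.0, iters=100):
    """Gradient ascent on the information potential (O(N^2) per iteration)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + step * ip_gradient(X, d - X @ w, sigma)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
w_true = np.array([1.0, -0.5])              # hypothetical system to be modeled
d = X @ w_true + 0.1 * rng.standard_normal(200)
w_mee = train_mee(X, d)
print(w_mee, information_potential(d - X @ w_mee, 1.0))
```

The double loop hidden in the pairwise arrays is what produces the quadratic cost in the number of samples. Note also that the criterion is insensitive to the mean of the error, so a model with a bias term would need that term to be fixed separately, for instance by zeroing the average error after training.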

B. Classification
Classification is an important machine learning problem with a vast range of models, criteria and approaches. It can be represented in several ways; one of the most useful involves the definition of a set of discriminant functions
$$g_i(\mathbf{x}), \quad i = 1, \ldots, c,$$
where $\mathbf{x}$ represents an $m$-dimensional feature vector of a given phenomenon to be classified into one of $c$ possible classes. The classifier is said to assign a feature vector $\mathbf{x}$ to class $\omega_i$ if
$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq i.$$
One of the key aspects is, hence, the definition of the functional mapping $g_i(\cdot)$ and its optimal parameters. Consider an indicator function $1_{\omega_i}(\mathbf{x})$, which is 1 if $\mathbf{x}$ belongs to class $\omega_i$ and 0 otherwise; $g_i(\cdot)$ should approximate the respective class indicator function and, hence, the error vector is defined as $\mathbf{e} = [g_1(\mathbf{x}) - 1_{\omega_1}(\mathbf{x}), \ldots, g_c(\mathbf{x}) - 1_{\omega_c}(\mathbf{x})]^T$. This formulation allows a straightforward application of MEE as the optimization criterion, according to either Shannon's or Rényi's definition of entropy.
From the training perspective, a linear discriminant function can be adapted via the Stochastic Information Gradient algorithm [18], and artificial neural networks such as the MLP can be adjusted via gradient-based search with the backpropagation technique (recall the discussion in Section III-A). Regardless of the training method, and since MSE is well known for its robustness and wide adoption as a criterion for classification, a fundamental question that arises when adopting MEE is: does it lead to a smaller classification error probability, for a given model? Interestingly, whether Shannon's or Rényi's definition is employed, [19] shows that, for a perceptron-based classifier, MEE may not lead to solutions close to the minimal misclassification error. Moreover, there are theoretical situations where entropy maximization leads to the ideal configuration. Nevertheless, when the Parzen window entropy estimators are considered, their smoothing property can overcome such limitations, as long as an appropriate kernel size is defined.
Footnote 1: For simplification reasons, we shall also consider lowercase letters to represent the arguments of probabilistic / information-theoretic operators.
This idea is reinforced in [20], where an extensive experimental simulation is performed comparing MSE, MEE (with both Shannon's and Rényi's definitions), Cross-Entropy [21] and the generalized exponential risk in the context of MLP training for 35 different public classification datasets. The results indicate that MSE generally underperforms the other criteria, including MEE. Cross-Entropy and Exponential Risk achieved most of the highest classification rates among the datasets and, remarkably, MEE with Rényi's quadratic entropy obtained the poorest generalization capability.
To summarize, there is strong empirical evidence that the MEE criterion is beneficial as a surrogate for MSE in classification tasks. However, its performance depends sensitively on the database and the problem domain, which calls for a careful analysis by the designer when choosing between MEE and a different criterion.
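The discriminant-function formulation used in this subsection can be sketched in a few lines. The example below is only illustrative: the score matrix stands for the outputs $g_i(\mathbf{x})$ of some hypothetical classifier, class indices start at zero, and the function names are not taken from the cited works. It shows the decision rule (pick the class with the largest discriminant value) and the error vectors to which an MEE or MSE criterion would then be applied.

```python
import numpy as np

def classify(scores):
    """Assign each row of `scores` to the class with the largest discriminant value."""
    return np.argmax(scores, axis=1)

def error_vectors(scores, labels):
    """Discriminant outputs minus the class indicator functions, one row per sample."""
    num_classes = scores.shape[1]
    indicators = np.eye(num_classes)[labels]   # row i is the one-hot indicator of labels[i]
    return scores - indicators

# Hypothetical discriminant outputs for two samples and three classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1]])
labels = np.array([1, 0])

print(classify(scores))
print(error_vectors(scores, labels))
```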
Another promising (and recent) classification criterion comes from the notion of correntropy, where the model can be adapted via the maximization of the cross-correntropy-based criterion (12) [22], i.e., the functional mapping is defined in order to maximize the correntropy between the classifier output and the class labels.
Recent works [23], [24] on image pattern recognition add to (12) a regularization term derived from sparsity analysis, to define a linear representation of a test image $\mathbf{y}$ such that the resulting regularized objective is maximized with respect to $\mathbf{w}$, where $\mathbf{X}$ represents the training dataset and $\|\cdot\|_{\ell_1}$, used in the regularization term, is the $\ell_1$-norm of a vector. The solution of this problem is obtained by half-quadratic optimization techniques, and it provides a vector basis for each class of objects to be recognized. This basis is subsequently used to classify a new image as belonging to the class whose basis reconstructs the most similar (in the correntropy sense) prototype of $\mathbf{y}$.
The experimental results, which considered severe distortions such as pixel occlusions and non-Gaussian noise, demonstrated the effectiveness of the method in such scenarios. Furthermore, the aforementioned correntropy-based criteria are also being employed in the context of deep learning and extreme learning classifiers [25], [26], with promising results as well.

C. Equalization
In bandlimited and high data rate digital communication systems, equalizers are important devices. Their function is to restore the transmitted information, i.e., the information at the channel input, mitigating or eliminating channel interference. In order to do so, a large variety of techniques have been developed in the last 70 years [27].
Equalization may be considered in two scenarios: supervised or unsupervised. Supervised methods are traditionally based on the MSE criterion, while unsupervised methods rely exclusively on the HOS of the involved signals. Under the classical assumptions of linearity and Gaussianity, the above-mentioned methods are known to provide reliable performance. However, in non-classical scenarios, e.g., for sparse or correlated signals or in the presence of non-Gaussian noise, the same assertion does not hold.
In light of this, as ITL has the potential of extracting the complete statistical information present in signals, a very interesting option emerged: to apply new criteria based on this field to the problem of channel equalization, especially in non-classical scenarios. Hence, let us start by presenting the problem formulation.
Consider a source signal $s(n)$ being transmitted through a linear time-invariant channel. The channel output can be expressed as
$$x(n) = \sum_{i} h_i\, s(n - i) + \eta(n), \tag{14}$$
where $h_i$ are the channel coefficients and $\eta$ is the additive noise.
The equalizer, designed to remove the intersymbol interference introduced by the channel, is generally modeled as a finite impulse response (FIR) filter. Its output may be written as
$$y(n) = \sum_{i=0}^{D-1} w_i\, x(n - i), \tag{15}$$
where $w_i$ are the values of the $D$ filter coefficients.
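The channel and equalizer models described above amount to two convolutions, which can be simulated as in the sketch below. The helper names are hypothetical; the channel $H(z) = 1 + 0.6z^{-1}$ is the one used later in this section for Figure 2, and the 8-tap equalizer is simply the truncated series of the channel inverse, chosen here by hand rather than obtained from any ITL criterion.

```python
import numpy as np

def channel_output(s, h, noise_std, rng):
    """Linear time-invariant channel with additive white Gaussian noise."""
    x = np.convolve(s, h)[:len(s)]            # x(n) = sum_i h_i s(n - i)
    return x + noise_std * rng.standard_normal(len(s))

def fir_equalizer(x, w):
    """FIR equalizer output y(n) = sum_i w_i x(n - i)."""
    return np.convolve(x, w)[:len(x)]

rng = np.random.default_rng(0)
s = rng.choice([-1.0, 1.0], size=1000)        # hypothetical binary source
h = np.array([1.0, 0.6])                      # channel 1 + 0.6 z^-1
w = (-0.6) ** np.arange(8)                    # truncated inverse: 1 - 0.6 z^-1 + 0.36 z^-2 - ...

x = channel_output(s, h, noise_std=0.0, rng=rng)
y = fir_equalizer(x, w)
print(np.max(np.abs(x - s)), np.max(np.abs(y - s)))
```

In the noiseless case the residual intersymbol interference after this equalizer is bounded by $0.6^8$, which is why a short FIR filter suffices for this particular minimum-phase channel.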
In the sequel, we will present several ITL-based equalization algorithms.
1) Supervised Equalization: The application of ITL to supervised equalization started with the analysis of the use of the minimum quadratic Rényi entropy of the error (MEE) between the desired signal and the equalizer output, instead of the classical MSE criterion [28], [18]. Recalling that entropy minimization is equivalent to information potential maximization, as already mentioned in Section III-A, and using the Parzen window method to estimate the error PDF, the associated criterion (16) is obtained, where $e_i = d(i - \tau) - y(i)$, with $d$ the desired signal and $\tau$ the equalizer delay. The kernel size $\sigma$ is a parameter to be adjusted according to the given scenario. Note that (16) considers the relationship between each pair of error samples. A gradient-based method can be employed in order to maximize (16); the resulting algorithm is called stochastic information gradient for MEE (MEE-SIG) [18].
In situations where the channel is linear and the additive noise is Gaussian, a linear equalizer that maximizes (16) will present a performance similar to that of an equalizer trained to minimize the MSE, since their solutions tend to be close to each other or even equivalent, under certain specific conditions [28]. However, this will not be the case for non-Gaussian noise, where the MEE-SIG algorithm tends to be more robust than that based on the MSE. Furthermore, the difference between the MEE and the MSE becomes more pronounced in nonlinear scenarios [14], [15]. As an example, when the channel is composed of a linear distortion followed by a nonlinear function and the equalizer is modeled as a multilayer perceptron neural network, it is possible to obtain an improved equalization performance by minimizing the entropy [28].
Another strong branch of supervised ITL criteria is that based on correntropy (Section II). Since this quantity can be seen as a nonlinear similarity measure between random variables, one can apply it to the equalization problem by maximizing the correntropy between the transmitted and the equalizer output signals, giving rise to the maximum correntropy criterion (MCC) [6], given in (17), where the kernel size $\sigma$, in this case, determines the length of the neighborhood of $d_i$ to be considered. Hence, a suitable choice of $\sigma$ can improve the robustness of the MCC against outliers and impulsive noise. With respect to the MEE, the MCC presents the advantages of requiring a lower computational cost (note that there is a single summation operator for MCC) and of being less sensitive to variations in $\sigma$. On the other hand, it can demand a larger number of samples $N$ to provide a good estimate. From (17), it is possible to derive a simple Least Mean Squares (LMS)-like algorithm, called MCC-SIG [18]. In the presence of impulsive noise, this method has shown a better performance than the original LMS in system identification [6]. An interesting aspect is that, for a very large kernel size, the solution will be very close to the one obtained through the MSE criterion. An alternative algorithm for the optimization of (17) was proposed in [29], based on a fixed-point solution, which converges quickly when compared to the well-known Recursive Least Squares (RLS) algorithm [27] and is also independent of the eigenvalue spread of the data.
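The LMS-like behaviour of a correntropy-based stochastic gradient can be illustrated with the sketch below, which compares it with plain LMS in a system identification task under impulsive noise. This is an assumption-laden illustration rather than the algorithm of [6] or [18]: the update simply multiplies the LMS correction by a Gaussian function of the instantaneous error (constant factors are absorbed into the step size), and the system, noise model, step size and kernel size are all hypothetical.

```python
import numpy as np

def adapt(x, d, num_taps, step, sigma=None):
    """LMS (sigma=None) or correntropy-weighted LMS-like adaptation; returns weight history."""
    w = np.zeros(num_taps)
    history = np.zeros((len(x), num_taps))
    for n in range(num_taps - 1, len(x)):
        u = x[n - num_taps + 1:n + 1][::-1]    # regressor [x(n), ..., x(n - D + 1)]
        e = d[n] - w @ u
        # The Gaussian factor shrinks the update when the error is an outlier.
        gain = 1.0 if sigma is None else np.exp(-e**2 / (2.0 * sigma**2))
        w = w + step * gain * e * u
        history[n] = w
    return history

rng = np.random.default_rng(0)
N = 5000
w_true = np.array([1.0, -0.5, 0.25, 0.1])      # hypothetical unknown system
x = rng.standard_normal(N)
# Impulsive noise: 5% of the samples are hit by an impulse of amplitude 20.
impulses = 20.0 * rng.choice([-1.0, 1.0], size=N) * (rng.random(N) < 0.05)
d = np.convolve(x, w_true)[:N] + 0.1 * rng.standard_normal(N) + impulses

def tail_error(history):
    """Average weight-error norm over the last 1000 iterations."""
    return np.linalg.norm(history[-1000:] - w_true, axis=1).mean()

lms_err = tail_error(adapt(x, d, 4, step=0.01))
mcc_err = tail_error(adapt(x, d, 4, step=0.01, sigma=2.0))
print(lms_err, mcc_err)
```

With a very large `sigma` the Gaussian factor tends to one and the update reduces to LMS, which matches the remark above that a large kernel size brings the solution close to the MSE one.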
2) Unsupervised Equalization: The use of the ITL framework in the task of unsupervised equalization is a very attractive possibility, in view of the natural availability of the higher-order statistical information required to solve the problem. In that sense, one of the first unsupervised ITL-based criteria [30] brings together Rényi's $\alpha$-entropy and the idea behind the well-known blind constant modulus (CM) criterion [31], which penalizes deviations of the equalizer output from a constant modulus, to form the criterion (18). The last equality in (18) comes from the fact that entropy does not depend on the mean of the signal. By assuming $\alpha = 2$ and using the IP estimator, the cost function to be maximized becomes (19). The gradient-based algorithm resulting from (19) was named stochastic fast algorithm (SFA) [30]. It should be noted that some kind of constraint on the equalizer taps has to be added in order to avoid the trivial solution, which can be done, for instance, by fixing one of the taps to unity or by imposing a unit norm constraint on the equalizer taps.
A parallel development based on the Benveniste-Goursat-Ruget theorem [32], one of the milestones in blind equalization, was also pursued for blind ITL criteria; it gravitates around the notion of matching the PDF of the equalizer output to that of the transmitted signal. As pointed out in [33], [34], [35], this idea can be translated into the cost function (20), where $f_{Y^2}$ and $f_{S^2}$ are the PDFs of the random variables $Y^2$ and $S^2$, which, in turn, are associated with the signals $|y(n)|^2$ and $|s(n)|^2$, respectively. In [35], all terms of (20) depending on the equalizer output are considered and, once again, the PDFs associated with the signals are estimated using the Parzen window method, resulting in the simplified cost function (21), where $L$ is the cardinality of the transmitted symbol alphabet and $s_i$ its $i$th symbol. The associated gradient-based algorithm was named stochastic quadratic distance (SQD). A slight modification was also proposed in [34], in which the PDF associated with the transmitted signal was evaluated at some specific target values. Still aiming at matching the PDFs, we also highlight the work of [33], which suggests using only the last term of (20), since the other terms simply work as a normalization factor between the PDFs and can be neglected. In this case, the estimation of the PDFs via the Parzen window results in the last term of (21), which we call $\hat{J}_{MQD}$. Very interestingly, as presented in [36], the estimates of these blind criteria can be interrelated as in (22). From this, we point out that these criteria differ mainly in their computational complexity and robustness. While $\hat{J}_{QD}$ is considered the most complex and robust, as it encompasses richer statistical information about both $y_n$ and $s_n$, $\hat{J}_{MQD}$ offers a good trade-off and is an attractive option, since $\hat{J}_{SFA}$, due to the necessity of imposing a constraint on the equalizer coefficients, tends to be more susceptible to local convergence.
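As a reading aid for the three terms discussed above, and assuming that (20) takes the usual form of an integrated squared difference between the two PDFs (this form is an assumption, inferred from the name "quadratic distance" and from the description of its terms), the expansion would read:

```latex
\begin{align}
J_{QD} &= \int \left( f_{Y^2}(z) - f_{S^2}(z) \right)^2 dz \\
       &= \int f_{Y^2}^2(z)\, dz \;+\; \int f_{S^2}^2(z)\, dz \;-\; 2 \int f_{Y^2}(z)\, f_{S^2}(z)\, dz .
\end{align}
```

Under this assumption, the second term does not depend on the equalizer, the first term is an information potential of the squared output modulus (the quantity that the IP estimator approximates), and the last (cross) term is the only one that compares the output PDF with the source PDF, which is consistent with the roles attributed to the individual terms in the text.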
It is also important to indicate the points of contact that these ITL criteria establish with the classical CM criterion. $J_{SFA}$ can be seen as a direct extension of the CM formulation to the ITL framework. $\hat{J}_{MQD}$ uses as kernel argument the deviations of the squared equalizer output from a fixed term, just like the CM criterion. Finally, since $\hat{J}_{QD}$ gathers contributions from both of these blind ITL criteria (22), it is expected that the quadratic distance criterion also preserves some elements of the CM approach. Indeed, by comparing the surface contours of $\hat{J}_{QD}$ and the CM cost, as illustrated in Figure 2 for the channel with impulse response $H(z) = 1 + 0.6z^{-1}$, one can see some similarities between the minima. Besides that, for linear equalizers and under the hypothesis of Gaussianity, the blind ITL criteria behave similarly to the CM criterion, but, for impulsive noise models, the latter loses performance. Similar ITL blind methods for deconvolution can be found in [37], [38] and [39].
Although the criteria discussed above start from distinct hypotheses, they all share a common assumption: the transmitted signals are composed of iid samples. However, in practical scenarios, the sources may exhibit temporal dependence, as a consequence, for instance, of the application of codes before signal transmission or of the handling of analog discrete-time signals (e.g., in audio-related scenarios). In that sense, as initially proposed in [1], the ITL measure of correntropy can be used to statistically evaluate the time structure of the signal in a blind context. The objective is to make the correntropy of the equalizer output as close as possible to the correntropy of the transmitted source, known a priori:
$$J = \sum_{m=1}^{P} \left( v_s(m) - \hat{v}_y(m) \right)^2, \tag{23}$$
where $v_s$ is the correntropy of the source, $\hat{v}_y$ is the estimated correntropy of the equalizer output and $P$ is the number of lags considered. Since correntropy takes into account the statistical and temporal structures of the signals, it has shown a good performance when treating correlated sources, a situation in which classical methods fail to equalize [7]. Another advantage of the correntropy-based method is its reduced computational complexity in comparison with the PDF matching-based criteria, although it can demand a large number of samples for estimation.
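The correntropy-matching idea above can be sketched as a cost evaluated over a small set of lags. In this hypothetical example, the "known" source correntropy is simply estimated from one realization of a correlated source (in practice it would be available a priori), and the cost is evaluated for the undistorted source and for the output of the channel $1 + 0.6z^{-1}$. The function names, source model and kernel size are assumptions made for illustration.

```python
import numpy as np

def correntropy(x, lag, sigma):
    """Sample-mean correntropy estimate at a positive lag (Gaussian kernel)."""
    diffs = x[lag:] - x[:x.size - lag]
    return np.mean(np.exp(-diffs**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma))

def correntropy_matching_cost(y, v_source, sigma):
    """Sum over lags 1..P of squared differences between source and output correntropy."""
    v_out = np.array([correntropy(y, m, sigma) for m in range(1, len(v_source) + 1)])
    return float(np.sum((np.asarray(v_source) - v_out) ** 2))

rng = np.random.default_rng(0)
sigma, P = 0.5, 5

# Hypothetical correlated source: binary symbols passed through a short precoder.
b = rng.choice([-1.0, 1.0], size=3000)
s = np.convolve(b, [1.0, 0.5])[:b.size]
v_source = [correntropy(s, m, sigma) for m in range(1, P + 1)]

x = np.convolve(s, [1.0, 0.6])[:s.size]        # channel output, no equalization

cost_ideal = correntropy_matching_cost(s, v_source, sigma)
cost_channel = correntropy_matching_cost(x, v_source, sigma)
print(cost_ideal, cost_channel)
```

A blind equalizer based on this criterion would adjust its taps to drive the cost of its output towards the value obtained for the undistorted source; each cost evaluation requires only a single sum per lag.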

D. Independent Component Analysis
Independent Component Analysis (ICA) originated as a natural extension of Principal Component Analysis (PCA), and both techniques are unsupervised signal processing paradigms. They are also very useful tools in the context of factor analysis [40].
One of the most representative problems to which ICA is applied is Blind Source Separation (BSS), a task in which information-theoretic optimization criteria have been used with success for three decades [41]. BSS, in its linear and instantaneous form, can be formulated as follows: at a given time instant, one observes the $m$-dimensional signal $\mathbf{x}$, which is the result of the linear combination of $k \leq m$ independent signals (sources), i.e.,
$$\mathbf{x} = \mathbf{A}\mathbf{s}, \tag{24}$$
where $\mathbf{s}$ is the $k$-dimensional source vector and $\mathbf{A}$ is the $m \times k$ mixing matrix.
Without a priori knowledge of $\mathbf{A}$ and $\mathbf{s}$, the problem consists in obtaining a demixing matrix $\mathbf{W}$ such that the output vector $\mathbf{y} = \mathbf{W}\mathbf{x}$ equals $\mathbf{s}$ up to scale and permutation factors. Figure 3 illustrates this formulation for $k = m$.
The connection between ICA and BSS comes from the fact that, if the solution $\mathbf{W}$ generates a set of components $\mathbf{y}$ that are independent, this ensures that the original $\mathbf{s}$ has been recovered. In this context, up to the previously mentioned ambiguities, there are several criteria to perform ICA, including (i) negentropy [40], a criterion to maximize non-Gaussianity; (ii) the Infomax principle [42], which is based on the idea of maximizing the information flow between the mixtures and the separating system outputs; (iii) the minimization of entropy rate [43], [44], which, similarly to correntropy, allows the exploration of both statistical and temporal structures of the signals; (iv) cumulants and (v) kurtosis [45].
From a general perspective, the ICA criteria can be related to the minimization of the mutual information (MI) rate between the separating system outputs, where two diversity aspects may be considered, jointly or independently: HOS and dependence among source samples [46]. Thus, it is convenient to directly measure the degree of independence of these signals via ITL methods. As an example, if only HOS diversity is considered, the mutual information rate reduces to the mutual information, which yields the following criterion to be minimized:
$$I(\mathbf{y}) = \sum_{i=1}^{k} H(y_i) - H(\mathbf{y}). \tag{25}$$
Furthermore, if the observations are previously whitened (PCA) and, as a consequence, the demixing matrix is a pure rotation matrix, the loss function in (25) reduces to just the first term, the sum of the marginal output entropies. The Minimum Rényi's Mutual Information (MRMI) algorithm [47] applies the gradient descent method to minimize this reduced cost function with respect to the elements of $\mathbf{W}$, replacing Shannon's differential entropy by Rényi's definition with $\alpha = 2$. The marginal entropies are estimated with the well-known non-parametric entropy estimator (recall the definition in Section IV.B of Part I), already employed in previous applications. Nonetheless, a deeper analysis of this method [48] has recently demonstrated that the maximal value of Rényi's entropy is associated with the Gaussian distribution only when $\alpha = 1$. As a result, a modification of the original MRMI criterion is necessary, depending on the estimated distribution, which can be classified as sub-Gaussian, a distribution flatter and shorter-tailed than the Gaussian (with kurtosis less than 3), or super-Gaussian, a distribution more peaked and longer-tailed than the Gaussian (with kurtosis greater than 3). The new implementation was compared with other traditional ICA algorithms in separating audio recordings, and the results indicated a superior performance of MRMI with the parameter $\alpha = 2$.
Nevertheless, the theoretical results of the modified MRMI are valid only for the generalized exponential family of distributions, and [49] demonstrated that the choice of the $\alpha$ value may lead to a cost function that does not satisfy the requirements of a contrast function [41]. The final conclusion is that the adoption of Rényi's entropy to perform ICA must be preceded by a careful analysis of the scenario at hand.
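The whitening-plus-rotation formulation discussed above can be illustrated for two sources with the sketch below. It is not the MRMI algorithm of [47]: the gradient descent is replaced by an exhaustive search over the rotation angle, the sources are hypothetical uniform (sub-Gaussian) signals, and the kernel size, mixing matrix and function names are assumptions. The cost is the sum of the marginal quadratic Rényi entropies, each estimated through the pairwise Gaussian-kernel information potential.

```python
import numpy as np

def renyi_quadratic_entropy(y, sigma):
    """-log of the information potential estimated with Gaussian kernels (variance 2*sigma^2)."""
    diff = y[:, None] - y[None, :]
    var = 2.0 * sigma**2
    ip = np.mean(np.exp(-diff**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var))
    return -np.log(ip)

def whiten(x):
    """Remove the mean and decorrelate the rows of x to unit variance (PCA whitening)."""
    xc = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(xc))
    return (eigvec / np.sqrt(eigval)).T @ xc

def rotation(theta):
    c, sn = np.cos(theta), np.sin(theta)
    return np.array([[c, -sn], [sn, c]])

def separate(x, sigma=0.1, num_angles=45):
    """Grid search for the rotation that minimizes the sum of marginal entropies."""
    z = whiten(x)
    angles = np.linspace(0.0, np.pi / 2, num_angles, endpoint=False)
    costs = [sum(renyi_quadratic_entropy(row, sigma) for row in rotation(t) @ z)
             for t in angles]
    return rotation(angles[int(np.argmin(costs))]) @ z

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 800))   # unit-variance uniform sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])                    # hypothetical mixing matrix
y = separate(A @ s)

# Absolute correlations between outputs (rows) and sources (columns).
cross = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(cross)
```

Searching only over a quarter turn is sufficient in the two-dimensional case because solutions that differ by a multiple of 90 degrees are equivalent up to the permutation and sign ambiguities mentioned earlier. The caveats of the text apply here as well: whether this cost behaves as a valid contrast depends on the source distributions and on the kernel size.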
An extension of the BSS problem that has been recently studied is related to the post-nonlinear (PNL) model [50], which adds nonlinear, memoryless and invertible functions to the BSS linear model. These functions may represent, for example, the effect of sensors in some measurement process. To perform the separation, it is necessary to apply nonlinear memoryless functions to each component of $\mathbf{x}$ prior to the demixing matrix $\mathbf{W}$.
One of the most robust approaches to separate PNL mixtures is also based on independence recovery with mutual information minimization. In this direction, [51] uses a direct MI estimator based on order statistics (recall Section IV.C of Part I) as the cost function to be minimized with an immune-inspired algorithm.
1) ICA over Finite Fields: Recently, ICA has been extended to the domain of finite and discrete-valued signals and systems. Yeredor [52] first explored this idea with the development of an ICA algorithm for Boolean signals mixed in accordance with XOR and classical product operations, i.e., in the context of a Galois field of order two. The algorithm iteratively extracts the sources by searching for the linear combination of the mixtures that minimizes the entropy, followed by a deflation process [53] to remove the extracted source from the mixtures.
Afterwards, [54] extended the algorithm to finite fields of any size. In the same vein, [55] improved the pioneering algorithm, known as AMERICA, and proposed a faster (but less accurate) algorithm, named MEXICO, based on the sequential reduction of the pairwise mutual information. A summary of all contributions up to that point was consolidated in [56].
All these techniques comprise parameter adaptation based on ITL cost functions (the histogram-based estimator of Shannon's entropy; see Section IV.F of Part I) and a sequential search for the elements of the separating matrix. Departing from this perspective, [57] proposed the application of an immune-inspired algorithm to search for the complete separating matrix. The problem was formulated as a combinatorial optimization task whose solution is the separating matrix that leads to the minimal mutual information (actually, the sum of marginal entropies; recall (25)) between the extracted components.
Another related proposal was subsequently developed in [58], where a more robust immune-inspired algorithm was applied in association with a Michigan-like approach [59] to model the individuals of the population. The criterion was to minimize the entropy of each extracted source, relying on the intrinsic diversity operators of the algorithm so that distinct independent signals may be obtained in the end.

E. Cluster Analysis
Clustering is a self-organizing process that plays an important role in a broad range of fields, from pattern recognition, signal compression [21], and knowledge discovery in databases [60] to communication channel estimation and/or equalization [61]. Roughly speaking, clustering is aimed at partitioning a set of objects into groups that share some kind of (predefined) similarity. Clearly, it is not a well-posed problem, just like the estimation of MI from finite datasets. To clarify this important point, let us consider a tiny dataset of 5 points, represented in Figure 4.
For someone looking for cluster formation in observed data, a naive clustering hypothesis may promptly be raised, as illustrated in Figure 5. Nevertheless, even though visual inspection may lead one to accept the plausibility of this first hypothesis, there is no consensual way to measure it unless we define or choose a numeric criterion, which is, in turn, an arbitrary choice itself. Indeed, it is well known that different clustering criteria, when applied to the very same dataset, may provide different clustering hypotheses, mainly when the number of available samples is small. This means that one cannot even infer the existence of clusters from finite datasets alone: any conclusion should be based on a priori information about the data source model. Note that this is implicitly true even when a simple distance measure is used in a clustering criterion.

This unavoidable need for a priori information engenders a striking analogy between cluster analysis and mutual information estimation from finite datasets. To show it properly, we first recall that the MI between two random variables, say X and Y, is the decrease in the uncertainty regarding one of them when the other is known (recall Section 2 of Part I). Strictly speaking, for finite datasets, unless there are coincident values (which have vanishing probability for continuous variables), there is no randomness to be removed. For instance, in Figure 4, by knowing that the event $X = x_3$ occurred, we conclude, deterministically, that $Y = y_3$ occurred too. This means that, for finite sets of samples generated by continuous variables, the mutual information that can be inferred without any a priori source model is always maximal (i.e., no randomness at all)!
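This argument can be made explicit with the plug-in (empirical) estimate, assuming N samples with no coincident values. The hatted symbols are notation introduced here for empirical quantities.

```latex
\hat{p}(x_n, y_n) = \hat{p}(x_n) = \hat{p}(y_n) = \frac{1}{N}
\;\Longrightarrow\;
\hat{I}(X;Y) = \sum_{n=1}^{N} \frac{1}{N}\,
\log\frac{1/N}{(1/N)(1/N)} = \log N = \hat{H}(X) = \hat{H}(Y)
```

For the five samples of Figure 4, this gives $\log_2 5 \approx 2.32$ bits, which equals the empirical entropy of either variable: the estimate saturates regardless of how the samples were actually generated.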
Evidently, any useful analysis must consider a source model and use the data to adjust this model, as in clustering. Not surprisingly, many works in the literature combine both analyses, mainly through the simplified use of MI for finding consensus among many clustering hypotheses [62], [63].
One simple and straightforward combination of MI and clustering ensembles concerns the estimation of the number of clusters formed by finite datasets in metric spaces. Again, this is an ill-posed problem in cluster analysis, and the use of many clustering hypotheses, along with an MI-based criterion, may ease the difficult choice of a specific metric and algorithm for this task. Indeed, in a clustering ensemble based approach, an arbitrarily large number, M, of clustering algorithms provide independent clustering hypotheses with K clusters each. Each hypothesis yields a vector of labels (one label per pattern), which are regarded as random outputs drawn from M sources of K symbols, thus creating M discrete random variables, $X_m$. Figure 6 illustrates this ensembling process. Therefore, for each $X_m$, an entropy measure can be obtained as follows:

$$H(X_m) = -\sum_{k=1}^{K} p_m(k) \log p_m(k)$$

where $p_m(k) = P[X_m = k]$ stands for the probability of randomly selecting the k-th label from the vector of labels $x_m$. Moreover, for each pair of the random variables so defined, it is also possible to measure their labeling agreement through MI, by computing:

$$I(X_i; X_j) = \sum_{k=1}^{K} \sum_{l=1}^{K} p_{ij}(k, l) \log \frac{p_{ij}(k, l)}{p_i(k)\, p_j(l)}$$

where $p_{ij}(k, l) = P[X_i = k, X_j = l]$ is the joint probability of the two labels. For a fixed K, one may average all quantities $I(X_i; X_j)$ as a measure of how much the imposition of K clusters corresponds to a "natural", stable configuration. However, to properly test this "natural" stability, it is also necessary to ensure a diversity of clustering hypotheses, which can be done (a) through the use of many different clustering methods, (b) through the subsampling of the available data or (c) through both strategies.
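A minimal sketch of how the label entropy and the pairwise MI can be computed from label counts is given below. The probabilities are taken as relative frequencies in the label vectors, logarithms are in base 2, and the three label vectors are made-up examples; none of these choices is prescribed by the text.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X_m) = -sum_k p_m(k) log2 p_m(k), with p_m(k) as label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(labels_i, labels_j):
    """I(X_i; X_j) in bits, from joint and marginal label frequencies."""
    n = len(labels_i)
    count_i, count_j = Counter(labels_i), Counter(labels_j)
    joint = Counter(zip(labels_i, labels_j))
    # p_ij / (p_i * p_j) simplifies to c * n / (count_i * count_j)
    return sum((c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
               for (a, b), c in joint.items())

x1 = [0, 0, 1, 1, 2, 2]
x2 = [2, 2, 0, 0, 1, 1]  # same partition as x1, with permuted labels
x3 = [0, 1, 0, 1, 0, 1]  # partition statistically unrelated to x1

print("H(x1)     =", entropy(x1))
print("I(x1; x2) =", mutual_information(x1, x2))
print("I(x1; x3) =", mutual_information(x1, x3))
```

Because MI depends only on the co-occurrence of labels, it is insensitive to how each algorithm happens to number its clusters, which is what makes it suitable for comparing hypotheses in an ensemble.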
Finally, because the ranges of values of the entropies and of the MI depend on K, one would prefer to normalize $I(X_i; X_j)$, yielding the Normalized Mutual Information (NMI). This normalization procedure is not unique. A usual choice is:

$$\mathrm{NMI}(X_i; X_j) = \frac{I(X_i; X_j)}{\max(H(X_i), H(X_j))} \qquad (26)$$

and the Averaged NMI (ANMI) for the ensemble of clustering hypotheses is given by:

$$\mathrm{ANMI} = \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \mathrm{NMI}(X_i; X_j)$$

The ANMI is likely to be higher for values of K corresponding to stable clustering hypotheses. Therefore, by varying K, it is possible to estimate the number of clusters in a dataset. As an illustration, Figures 7 and 8 show two bidimensional datasets whose visual inspection provides initial guesses concerning the number of clusters in each one. In both cases, we used ensembles of M = 20 clustering hypotheses provided by the standard K-Means algorithm. Diversity of hypotheses was induced by simple dataset subsampling. In Figure 7, the ANMI peaks at K = 3, whereas, in Figure 8, as the upper cluster spreads out, this peak moves to K = 4, but both values (i.e., 3 and 4) seem to be almost equally likely, which agrees with the opinion of most human observers.
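The whole procedure can be sketched as follows. This is a toy reconstruction under stated assumptions, not the setup behind Figures 7 and 8: the data are three synthetic, well-separated Gaussian blobs, each hypothesis is trained on a 70% subsample and then labels every pattern by its nearest centroid, and K-Means uses farthest-point seeding. All names and parameter values are illustrative.

```python
import math
import random
from collections import Counter
from itertools import combinations

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (ca[u] * cb[v]))
               for (u, v), c in joint.items())

def nmi(a, b):
    """NMI as in (26); set to 0 if both entropies vanish (a convention)."""
    denom = max(entropy(a), entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0

def anmi(label_vectors):
    """Average NMI over the M(M-1)/2 pairs of hypotheses."""
    pairs = list(combinations(label_vectors, 2))
    return sum(nmi(a, b) for a, b in pairs) / len(pairs)

def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))

def kmeans(points, k, rng, iterations=20):
    """Lloyd's algorithm with farthest-point seeding (illustrative choice)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a cluster becomes empty
        centroids = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

rng = random.Random(0)
centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.2)]  # assumed blob centers
data = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        for cx, cy in centers for _ in range(40)]

n_hypotheses = 20  # M
anmi_by_k = {}
for k in range(2, 6):
    ensemble = []
    for _ in range(n_hypotheses):
        subsample = rng.sample(data, int(0.7 * len(data)))
        centroids = kmeans(subsample, k, rng)
        ensemble.append([nearest(p, centroids) for p in data])
    anmi_by_k[k] = anmi(ensemble)

best_k = max(anmi_by_k, key=anmi_by_k.get)
print({k: round(v, 3) for k, v in anmi_by_k.items()}, "-> estimated K:", best_k)
```

With such well-separated blobs, every hypothesis with K = 3 should recover the same partition, so the ANMI is expected to reach its upper bound there; on less clear-cut data the curve is flatter and the choice of K correspondingly weaker.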
This approach relies on a committee of clustering algorithms, whose computational complexity depends on the designer's choices. As for the remaining structure, the computational complexity is dominated by the computation of the M(M − 1)/2 pairwise information terms, $I(X_i; X_j)$, for each tested value of K.

IV. CONCLUSION
This two-part work presented an introduction to information theoretic learning, an emerging discipline that employs information theory for developing new machine learning criteria and algorithms.
Part I of this tutorial was devoted to a description of fundamental concepts of information theory, from the seminal 1948 work of Claude E. Shannon, which is considered the 'birth' of this field, to the generalized measures proposed by Alfred Rényi, which allowed, approximately three decades later, the application of definitions such as entropy and mutual information in the context of new adaptive algorithms that can effectively explore the higher-order statistical content of data. Furthermore, the problem of estimating information-theoretic measures from the available data was discussed since, differently from classical applications of information theory, in ITL there is, as a rule, no prior knowledge about the probability distributions involved.
In Part II, a new concept that arises from Rényi's quadratic entropy and the idea of information potential is presented: correntropy, a nonlinear similarity measure with several possible applications, mainly in the domain of signals with temporal structure. The rest of the paper presents a set of representative problems for which ITL provides effective solutions: dynamic modeling, classification, equalization, independent component analysis and cluster analysis. In each case, we present the main criteria that have been developed, together with the pros and cons of each methodology.
Although this tutorial does not cover the whole spectrum of applications already found in this research field, we expect that it has provided the reader with a general understanding of the motivations and characteristics of ITL techniques. Moreover, the references cited in this work are recommended as a basis for further study.

Fig. 2. Surface contour of the QD and CM criteria.

Fig. 3. Linear and instantaneous formulation of the Blind Source Separation problem.

Fig. 4. Five 2D numeric samples drawn at random from an unknown source.

Fig. 6. Clustering ensemble of M algorithms, providing M clusterings with K clusters each, encoded as M vectors of labels.

Fig. 7. Estimation of the number of clusters through Averaged Normalized Mutual Information (ANMI): strong consensus in favor of 3 clusters.

Fig. 8. Estimation of the number of clusters through Averaged Normalized Mutual Information (ANMI): weak consensus in favor of 4 clusters.