Bias-corrected estimator for intrinsic dimension and differential entropy--a visual multiscale approach

Intrinsic dimension and differential entropy estimators are studied in this paper, including their systematic bias. A pragmatic approach for joint estimation and bias correction of these two fundamental measures is proposed. Steps shared by both estimators are highlighted, along with their useful consequences for data analysis. It is shown that both estimators can be complementary parts of a single approach, and that the simultaneous estimation of differential entropy and intrinsic dimension gives meaning to each of them, where estimates at different observation scales convey different perspectives of underlying manifolds. Experiments with synthetic and real datasets are presented to illustrate how to extract meaning from visual inspections, and how to compensate for biases.


I. INTRODUCTION
INTRINSIC dimension (ID) estimation is a useful tool whenever patterns presented in D-dimensional spaces are supposed to form structures (manifolds) in d-dimensional subspaces, with d < D. Examples of such lower dimensional structures are: projections of rigid objects whose pictures, with D pixels, are taken under d degrees of freedom [1], [2], or D-dimensional representations of vowel sounds, whereas the vocal tract that generates the sound has only d mechanical degrees of freedom [3].
In all those applications, if probabilistic models are used to represent the source of observations (i.e., the underlying d-dimensional structures) then entropy, differential entropy (DE) and entropy rate [4] can reveal relevant attributes of the corresponding structures. In pattern recognition, estimating both ID and DE is tantamount to analysing shape attributes of manifolds, as explained in Section II, thus suggesting tools for proper design and analysis of classifiers, in particular those based on autoencoders. Indeed, while the number of deep neural network applications increases at an astonishing pace, some attempts to explain this success seem to suggest that most answers come from the study of physical restrictions [5] and the consequent formation of data manifolds [6], [7], [8], [9].
Although ID does not impose a probabilistic model to be estimated, many published ID estimators are based on probabilistic reasoning [10], [11], [12], [13], [14], [15], [16]. Indeed, even the well known Grassberger-Procaccia (GP) estimator [14], whose aim is to characterize strange attractors in dissipative (deterministic) dynamical systems, also uses the information-theoretic framework to better explain the kind of dimension their method is able to estimate (also referred to as information dimension).
The formulation proposed in [14] includes the use of random variables (RV) as the source model for observations, and explicitly shows a link between intrinsic dimension and differential entropy. Some subsequent works followed this same path, such as [3] and the series of publications by Costa and Hero [12], [17], [18]. However, most published works deal either with DE, under the assumption that ID is known, or with ID estimation, regardless of the manifold's volume (thus its DE, as explained in Section II). Indeed, in [3] it is stated that "The existence of manifold structures in the data is often overlooked in entropy estimations, with the result that classical methods, assuming the wrong intrinsic dimension (manifold dimension) provide erroneous estimates of the entropy."
On the other hand, in [19], the problem of DE estimation in high-dimensional spaces was tackled through a simple but data-efficient approach, referred to as the Coincidence Method (CM), originally applied in Physics. In [20] this method was extended to differential entropy estimation in the pattern recognition context, which clearly shows that the correlation dimension in [14] uses the same empirical coincidence ratio as the entropy estimation method proposed in [19].
More specifically, the correlation integral defined in [14] is equivalent to the inverse of the number of coincidences defined in [19]. This equivalence is even more striking in nonredundant reformulations of the correlation integral, as in [21]. This suggests a link between works from different domains, developed in this paper to yield a visual method where ID and DE are regarded as complementary parameters of the same estimation problem.
Unfortunately, both methods [14], [19] yield biased estimates, a distortion with a shared source that is explained by their common theoretical ground. Concerning the bias in the GP method, a theoretical model was first proposed in [22], where it was shown that ID bias can be predicted on average if the actual ID is known. In this paper, the theoretical model proposed in [22] is developed to the point of predicting and compensating for both ID and DE biases, even if the actual ID is unknown.
This paper is organized as follows: In Section II, we briefly review ID and DE, and their complementary meanings, whereas in Section III the theoretical foundations of the joint estimator proposed in this paper are presented. Then, in Section IV, the method is presented, along with a bias compensation approach. Experiments with both real and artificial data are presented in Section V. We discuss the main contributions of this work in Section VI.

II. INTRINSIC DIMENSION AND DIFFERENTIAL ENTROPY IN A NUTSHELL
According to [23], the ID of a given set of observations is "the minimum number of free variables needed to represent the data without information loss", which agrees with the definition in [24], where "the intrinsic dimensionality of a collection of signals is defined to be equal to the number of free parameters required in a hypothetical signal generator capable of producing a close approximation to each signal in the collection".
DE, on the other hand, is defined as the entropy of a continuous random variable [25]. Besides, [26] presents entropy as an effective cardinality in logarithmic scale. Likewise, DE can be regarded as an effective volume (in logarithmic scale) [25], [20].
For a brief recall on ID and DE, consider the data sources labeled 'Sinusoid' and 'Circle', borrowed from [27]. Although the experiments there consider only ID, these datasets can also be used to address DE. These sources are defined in terms of U and V, two independent random variables uniformly distributed between 0 and 1. Figures 1 and 2 show 3000 instances of $X_{\mathrm{Cir}}$ and $X_{\mathrm{Sin}}$, respectively. Their IDs are 2 and 1, for the domain of $X_{\mathrm{Sin}}$ can be cut and straightened to a line segment (1D) of length slightly greater than 60, whereas the domain of $X_{\mathrm{Cir}}$ can also be cut and unbent to a rectangle (2D) of area 0.2π.
If the probability density function (pdf) of an RV is known, its Rényi α-entropy [25] can be obtained as

$$h_\alpha(X) = \frac{1}{1-\alpha} \log \int f_X(x)^{\alpha}\, dx$$

where dx is a differential hypervolume in $\mathbb{R}^D$, only taken where the pdf $f_X(x)$ is not null. Therefore, if the pdf is not null in d-dimensional manifolds (d < D) the integral must be restricted to them (therefore the local dimensions of the manifolds must be known). This definition encompasses both Shannon DE, for α → 1, and quadratic (or collision) entropy, for α = 2.
In both cases, $h_\alpha(X)$ can also be regarded as a proportion between volumes, suggesting that DE is a measure of effective volume for non-uniform distributions, as much as entropy is presented as an effective cardinality for discrete RVs [26], [20]. This intuitive perception of DE can be better explained with the notion of effective length, area or volume, as follows: if an RV defined as Z = λU (λ ∈ R and λ > 0) is uniformly distributed along a 1D domain of length λ, then its DE is given by this length λ measured in logarithmic scale, h(Z) = log(λ). In general, for a non-uniform RV, the effective hypervolume is given by the hypervolume associated to another uniformly distributed RV whose observation removes the same amount of uncertainty about the outcome [4].
Both formal and intuitive points of view reveal a tricky aspect of DE estimation: the DE is meaningless before the ID is known. Figures 1 and 2 can be used to further illustrate this point, because both RVs $X_{\mathrm{Cir}}$ and $X_{\mathrm{Sin}}$ are defined in a 3D space, but they have ID equal to 2 and 1, respectively. Therefore, the DE associated to $X_{\mathrm{Cir}}$ must take a unit square as area reference to yield $h(X_{\mathrm{Cir}}) = \log_2(0.2\pi)$ bits, whereas $X_{\mathrm{Sin}}$ must take a unit line segment as length reference to yield $h(X_{\mathrm{Sin}}) \approx \log_2(60)$ bits. In both cases, an observer unaware of these IDs would fail to estimate the DE, because both datasets are presented as 3D patterns, but their underlying structures have null volume.
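The two reference values quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `uniform_de_bits` is hypothetical, and it assumes a uniform density over a domain whose d-dimensional measure (length or area) is known, with 60 taken as a rounded length for the 'Sinusoid' domain.

```python
import math


def uniform_de_bits(measure: float) -> float:
    """DE (in bits) of a uniform density over a domain of the given
    d-dimensional measure (length for d = 1, area for d = 2, ...)."""
    return math.log2(measure)


# 'Circle' source: 2D domain of area 0.2*pi, referred to a unit square.
h_cir = uniform_de_bits(0.2 * math.pi)
# 'Sinusoid' source: 1D domain of length about 60, referred to a unit segment.
h_sin = uniform_de_bits(60.0)

print(f"h(X_Cir) ~ {h_cir:.2f} bits, h(X_Sin) ~ {h_sin:.2f} bits")
```

The same number (0.2π or 60) would be meaningless as a volume in the 3D embedding space, which is why the reference dimension has to be fixed before the logarithm is taken.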

III. JOINT ANALYSIS APPROACH
In this Section, we briefly recall two known approaches for ID and DE estimation that, when put side by side, reveal their striking equivalences. These equivalences are then articulated to yield a joint visual analysis for ID and DE.

A. Intrinsic dimension estimation
Given a set of N observations, {x(1), x(2), . . . , x(N)}, and a threshold r, the "information dimension" (also known as correlation dimension) is defined in [14], and can be obtained from the proportionality

$$C(r) \propto r^{d} \tag{1}$$

as r → 0, where the non-redundant [21], [22] definition of the correlation integral C(r) is

$$C(r) = \frac{2}{N(N-1)} \sum_{i<j} I\big(\lVert x(i) - x(j) \rVert \le r\big) \tag{2}$$

where I is an indicator function, i.e. I(λ) = 1 if λ is true, and I(λ) = 0 otherwise. Function I(·) is a coincidence detection function that allows for the use of any pattern matching measure, or even mean opinion scores, which can be particularly useful for ID estimations in psychometrics or econometrics, for instance. In [22] the supremum norm is used instead of the original Euclidean norm [14], thus easing theoretical calculations regarding correlation dimension limits.
For both definitions, since the volume where coincidence occurs in the manifold scales with $r^d$ (instead of $r^D$), the number of observation pairs coinciding in this volume should scale at a known rate, provided that the observation volume is such that the probability density of observations is almost constant inside it.
From Eq. 1, it follows that

$$\log C(r) = d \log r - h \tag{3}$$

where, as r → 0, h is set by the logarithm of the proportionality constant.
To estimate d from Eq. 3, a common approach is to use the angular coefficient of the line that best fits points (log r, log C(r)) in a given range for r. Therefore, a single best fit is expected. However, Figure 3 illustrates a case where this expectation is frustrated. This Figure was obtained with N = 3000 independent observations of $X_{\mathrm{Cir}}$, and r ranging from 0.01 to 1 (points were interpolated to improve visualization). It may be seen that there are two almost linear intervals with angular coefficients close to either 1 or 2, depending on the range of r.
Because the GP method is based on results for vanishing values of r, one should assume that the estimated ID is 2, corresponding to the lower part of the curved line in Fig. 3. Indeed, the detail presented in Fig. 1 clearly shows a 2D local structure. But the estimation for higher values of r is also meaningful, revealing that at a larger scale the 2D structure becomes negligible, whereas a 1D structure emerges.
That is to say, on the one hand this ambiguity is a drawback of this ID estimator, because bad choices for r may yield inconsistent estimates, whereas good choices remain an open problem [10]. On the other hand, this sensitivity to r can be carefully crafted as a tool for multiscale analysis, as discussed in Subsection III-B.
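The line-fitting procedure described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the non-redundant correlation integral is the fraction of distinct pairs whose supremum distance does not exceed r, and the function names, the synthetic data and the list of thresholds are all hypothetical choices.

```python
import math
import random
from itertools import combinations


def sup_norm(a, b):
    """Supremum (Chebyshev) distance between two equal-length vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


def pairwise_distances(points, dist=sup_norm):
    """Distances for the N(N-1)/2 distinct pairs (i < j)."""
    return [dist(a, b) for a, b in combinations(points, 2)]


def correlation_integral(distances, r):
    """Fraction of distinct pairs whose distance does not exceed r."""
    return sum(1 for s in distances if s <= r) / len(distances)


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


random.seed(0)
# Hypothetical data: 400 points on a straight segment embedded in 3D (ID = 1).
points = [(t, 2.0 * t, -t) for t in (random.random() for _ in range(400))]
distances = pairwise_distances(points)

radii = [0.02, 0.04, 0.08, 0.16]  # arbitrary small thresholds
log_r = [math.log2(r) for r in radii]
log_c = [math.log2(correlation_integral(distances, r)) for r in radii]
d_hat = fit_slope(log_r, log_c)  # slope of log C(r) versus log r
print(f"estimated ID: {d_hat:.2f}")
```

Fitting over a different range of `radii` is exactly where the scale dependence discussed above would show up for data such as the 'Circle' source.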

B. Differential entropy estimation
As for differential entropy, our starting point is the estimator proposed by S. Ma in the context of Statistical Mechanics [19]. This method was motivated by the huge number of reachable physical states in the original problem S. Ma addressed. By replacing states with multivariate random observations, or vectors in an abstract signal space [28] we obtain a DE estimator well suited for pattern recognition problems where the amount of observations is small, as compared to the effective size (effective in the sense of [26]) of the observation domain [29].
To estimate the differential entropy, h(X), of a random source modelled as X, we can summarize Ma's method in the following steps:
1. Arbitrarily set a small hypercube volume $r^d$. Note that no intrinsic dimension is considered in the original formulation. Here, however, we consider the possibility of data lying in a manifold of dimension d ≤ D, which yields an actual hypervolume $r^d$ rather than $r^D$.
2. Compare all $N_t = N(N-1)/2$ instance pairs x(i) and x(j), i < j, and compute $n_c(r)$ as the number of detected coincidences. A coincidence occurs when the distance between x(i) and x(j) does not exceed the threshold set by r, that is, when the indicator function in Eq. 2 equals 1 for the pair.
3. Compute the ratio between the number of comparisons and the number of coincidences: $Q(r) = N_t / n_c(r)$.
4. Estimate the effective volume [25] of an equivalent uniform pdf as $\hat{V}_{Ma} = r^d Q(r)$.
5. Estimate the differential entropy as the logarithm of the estimated volume:

$$\hat{h}_{Ma} = \log_2\big(r^d Q(r)\big) \tag{4}$$

Note that, according to the definition of C(r) in Eq. 2, it can be related to Q(r) as $C(r) = 1/Q(r)$, and Eq. 4 can be rewritten as

$$\hat{h}_{Ma} = d \log_2(r) - \log_2(C(r)) \tag{5}$$

Comparing Eq. 5 to Eq. 3 we conclude that the h in Eq. 3 is Ma's entropy estimate, $\hat{h}_{Ma}$. As a consequence, the line fitting procedure explained for the ID estimation can also be used for DE estimation, where slope and y-intercept parameters play the role of ID and DE estimates, respectively. On the other hand, the ambiguity problem mentioned in Section III-A is crafted into a tool that allows for multiscale analysis through a perspective similar to that proposed in [16], where almost linear segments with different angular and linear coefficients give clues regarding the structure of the underlying manifold.
As an illustration of this multiscale analysis, we consider again the results shown in Figure 3, with two almost linear intervals. The estimated line segments have angular coefficients close to 2 and 1, associated respectively to DE estimates $h_1 \approx -0.65$ bits, thus close to the theoretical DE of the source, −0.67 bits, and $h_2 \approx 2.7$ bits, which is close to the logarithm of the ring length in Fig. 1, $\log_2(2\pi) \approx 2.65$ bits.
In other words, the two almost linear segments suggest that (a) at small scales the dominant structure is 2D, with an effective area close to $2^{h_1}$, whereas (b) at larger scales the dominant structure becomes roughly 1D, with effective length close to $2^{h_2}$.
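The coincidence-counting steps of this subsection can be sketched in Python as below. It is an illustration under stated assumptions: a coincidence is taken to mean that one observation falls inside the hypercube of edge r centred on the other (supremum distance at most r/2), the intrinsic dimension d is supplied by the user, and the function name and test data are hypothetical.

```python
import math
import random
from itertools import combinations


def ma_entropy_bits(points, r, d):
    """Coincidence-based DE estimate in bits: log2(r**d * N_t / n_c)."""
    n_t = 0  # number of compared pairs, N(N-1)/2
    n_c = 0  # number of detected coincidences
    for a, b in combinations(points, 2):
        n_t += 1
        # Assumed coincidence rule: b lies in the hypercube of edge r around a.
        if max(abs(p - q) for p, q in zip(a, b)) <= r / 2.0:
            n_c += 1
    if n_c == 0:
        raise ValueError("no coincidences detected; increase r or add data")
    q = n_t / n_c                              # comparisons per coincidence
    return d * math.log2(r) + math.log2(q)     # log2 of the effective volume


random.seed(1)
lam = 8.0  # hypothetical segment length, so the expected DE is log2(8) = 3 bits
points = [(lam * random.random(),) for _ in range(500)]
h_hat = ma_entropy_bits(points, r=0.5, d=1)
print(f"estimated DE: {h_hat:.2f} bits")
```

Because the estimate needs d as an input, running it with the wrong dimension shifts the result by (d_wrong − d) log2(r), which is one way to see why DE is meaningless before the ID is known.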

IV. A METHOD FOR VISUAL ANALYSIS OF ID AND DE
The method proposed here is a straightforward recombination of the approaches explained in Sections III-A and III-B, chosen for their simplicity and data efficiency (for both methods consider all possible pairs of observations). In this recombination, it is assumed that:
• ID is constant over the variable domain.
• Probability density function is locally uniform.
The method is organized in 7 steps. The first 5 steps are presented below, whereas the remaining ones are presented in Subsection IV-A, where the bias problem is addressed.
(S1) Compute the supremum norm for each vector x(i) − x(j), i < j. Double each norm and store the results in an array r.
(S2) Sort r. Now r(k) is the edge size of the hypercube that yields k coincidences.
(S3) Plot $\log_2(k/L_r)$ versus $\log_2(r(k))$, where $L_r$ is the length of the array r and k ranges from 1 to $L_r$ (optionally, points can be resampled and interpolated for better visualization).
(S4) Plot ID hypotheses $\log_2(r(k))$ versus $d \log_2(r(k))$ for some arbitrary d < D.
Thus, r(4) = 66, for instance, means that a cube of edge 66 around each observation yields 4 coincidences. For this particular value we can compute C(66) as the number of coincidences (4) divided by the total number of pairs (10), yielding the pair $(\log_2(66), \log_2(4/10)) \approx (6.0, -1.3)$ to be plotted.
Proceeding likewise for all values in vector r, the plot in Fig. 4 is obtained. Through visual inspection, it is possible to infer that the observations roughly lie in a 1D structure, for the candidate with the most similar slope in Fig. 4 has slope one. In other words, although the observations are given in D = 3, we are able to infer that they lie in a manifold whose intrinsic dimension is d = 1.
Besides, once d is estimated, the DE can be estimated as the average value of the differences $d \log_2(r(k)) - \log_2 C(r(k))$. In this example, the differences for three arbitrarily chosen points are 7.4, 7.8 and 7.6, thus yielding an average DE estimate of $\hat{h} = 7.6$ bits.
These estimates for ID and DE suggest that the five observations in this example were sampled from a 1D structure of length $2^{\hat{h}} \approx 194$. Indeed, the N = 5 points were uniformly drawn from a noisy linear segment with length $\sqrt{100^2 + 50^2 + 150^2} \approx 187$. Therefore, the ID of the underlying 1D manifold was correctly inferred, while its length was roughly guessed through the estimated DE.
Fig. 4. Plot of ordered pairs $(\log_2(r), \log_2 C(r))$. The resulting plot is visually compared to 3 ID hypotheses. The best match is 1D (thus $\hat{d} = 1$), and the average vertical distance from plotted points to the corresponding line yields an estimated $\hat{h} \approx 7.6$ bits.
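Steps S1 to S3, together with the averaging rule used for the DE estimate, can be sketched as below. The plotting itself is left out, the function names are hypothetical, and the five points are made-up noiseless data on a segment (they are not the observations behind Fig. 4).

```python
import math
from itertools import combinations


def coincidence_curve(points):
    """Steps S1-S3: return the pairs (log2 r(k), log2(k / L_r)), k = 1..L_r."""
    # S1: supremum norm of each difference x(i) - x(j), i < j, doubled.
    r = [2.0 * max(abs(p - q) for p, q in zip(a, b))
         for a, b in combinations(points, 2)]
    # S2: after sorting, r[k-1] is the hypercube edge yielding k coincidences.
    r.sort()
    n_pairs = len(r)
    # S3: points to be plotted (zero distances are skipped to keep log2 defined).
    return [(math.log2(r_k), math.log2(k / n_pairs))
            for k, r_k in enumerate(r, start=1) if r_k > 0]


def de_estimate_bits(curve, d):
    """Average of d*log2 r(k) - log2 C(r(k)) over the supplied curve points."""
    return sum(d * x - y for x, y in curve) / len(curve)


# Hypothetical data: 5 points on a straight segment in 3D.
points = [(100 * t, 50 * t, 150 * t) for t in (0.05, 0.3, 0.45, 0.7, 0.95)]
curve = coincidence_curve(points)
h_hat = de_estimate_bits(curve, d=1)   # d = 1 plays the role of the chosen hypothesis
print(len(curve), f"{h_hat:.1f} bits")
```

In practice the average would be taken only over the points where the curve runs parallel to the chosen ID hypothesis, so passing a slice of `curve` to `de_estimate_bits` is the expected usage.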

A. Bias compensation
Both ID and DE estimators combined in this work are based on the exponentially growing fraction of patterns randomly coinciding, on average, inside small hypercubes of growing edge. Ideally, this edge should be vanishingly small, but in practice the number of observations is finite, which yields two antagonistic restrictions, namely: that the hypercube size should be as small as possible, thus containing a small fraction of observations, and that this fraction should be as large as possible, for statistical reasons.
In [22] it is shown that ID is always underestimated by Eq. 1 in the simple case of a hypercube inside which the probability density of a pattern being observed is uniform, even for an unlimited amount of data. The equations in [22] that explain this bias are rewritten here as Eq. 6 and 7 for the reader's convenience:

$$C_0(r) = \left[r(2 - r)\right]^{d} \tag{6}$$

$$d_0(r) = d\,\frac{2 - 2r}{2 - r} \tag{7}$$

where $C_0(r)$ and $d_0(r)$ stand for theoretical estimates of C and d for an RV uniformly distributed in a d-dimensional unit hypercube, observed at scale r. It is noteworthy that $d_0(r)$ is the derivative of $\log C_0(r)$ with respect to $\log r$.
In [22], under the following arbitrary restrictions:
• R1: $d_0(r) \ge 0.95d$, which imposes an estimate deviation tolerance, and
• R2: the minimum r is 1/4 of the maximum r, which allows the expected exponential proportionality of Eq. 6 to appear,
it is shown that the minimum number of observations, $N_{min}$, for a proper ID estimation depends upon the true ID, d, as

$$N_{min} \approx 42^{d} \tag{8}$$

This requirement is impractical for most real applications. For instance, even for d as low as 5 an experimenter would need more than 100 million independent samples in order to obtain a good ID estimate. In subsequent works this result was replaced with less restrictive ones such as in [30], where a much simpler analytic model is used, yielding

$$N_{min} \approx 10^{d/2} \tag{9}$$

In spite of their differences, both works agree that small datasets yield false ID estimates, biased toward lower values. For instance, with N = 1000 observations independently and uniformly sampled in a 10D hypercube, the visual approach used here yields the result presented in Fig. 5, suggesting a wrong ID estimate of about 8D, as well as a wrong DE estimate of about 4 bits (the actual DE is 0 bits).
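The order of magnitude of these sample-size requirements is easy to tabulate. The sketch below assumes the two bounds read N_min ≈ 42^d for [22] (the form quoted in Section VI) and N_min ≈ 10^(d/2) for [30] (the form used later in the illustration of Subsection IV-A); the function names are hypothetical.

```python
def n_min_ref22(d: int) -> int:
    """Assumed bound from [22]: about 42**d observations."""
    return 42 ** d


def n_min_ref30(d: int) -> float:
    """Assumed, less restrictive bound from [30]: about 10**(d/2) observations."""
    return 10 ** (d / 2)


for d in (2, 5, 8, 10):
    print(f"d = {d:2d}: [22] needs ~{n_min_ref22(d):.3g}, [30] needs ~{n_min_ref30(d):.3g}")
```

For d = 5 the first bound already exceeds 100 million observations, whereas the second stays in the hundreds, which illustrates how strongly the requirement depends on the underlying analytic model.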
To predict and compensate for both biases, we developed an approach built upon the analytical model proposed in [22]. In practical terms, it consists of completing a table of underestimated IDs, for a given N, then using this table to infer the unbiased ID, which in turn allows the estimation of a bias compensation for the DE too.
The above-mentioned table is based on Eqs. 6 and 7 and on a coarse estimation of the average supremum distance from an observation to its nearest neighbour, $\bar{r}$, where for N observations over a regular grid in a d-dimensional unit volume hypercube, one should expect

$$\bar{r}(N, d) = \frac{1}{1 + N^{1/d}} \tag{10}$$

To obtain this average supremum distance we first consider a line segment of unit length which is equally split into n + 1 intervals, thus allowing the placement of n equally spaced points separated from each other by r = 1/(1 + n). Likewise, in a unit area square, $N = n^2$ points can be regularly arranged by keeping the same r (as a result of the same $n = N^{1/2}$) as the supremum distance between neighboring points. Through the generalization of this simple reasoning for a unit volume hypercube of dimension d, where $N = n^d$ points can be regularly arranged in the vertices of a grid, r remains the supremum distance between any neighboring points of this grid. Therefore, given N and d, there is at least one arrangement of the N points separated from nearest neighbors by $r = 1/(1 + N^{1/d})$. On the other hand, for N points randomly placed inside that same d-dimensional hypercube, the supremum distance between neighboring points is a random variable, say R, but if its underlying probability density function is uniform, we can use Eq. 10 as a coarse approximation of the expected value for R.
This approximation experimentally proved to be useful for $N \ll 2^d$, which tends to be the case for high ID values, where bias correction is even more relevant. For instance, if d = 10 and N = 100, the prediction is $\bar{r}(100, 10) \approx 0.38$, which is the same value experimentally obtained up to two decimal places. Likewise, if d = 20 and N = 10000, the prediction is $\bar{r}(10000, 20) \approx 0.37$, whereas the experimental value is about 0.39. By contrast, for less sparse datasets, such as for d = 5 and N = 100, the prediction is $\bar{r}(100, 5) \approx 0.21$, whereas the experimental value is about 0.28.
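The regular-grid argument above can be turned into a one-line function. The sketch below assumes the approximation takes the form 1/(1 + N^(1/d)) stated in the text for the grid arrangement; the function name `r_bar` is hypothetical.

```python
def r_bar(n_obs: int, d: int) -> float:
    """Coarse nearest-neighbour supremum distance for n_obs points
    arranged on a regular grid inside a d-dimensional unit hypercube."""
    return 1.0 / (1.0 + n_obs ** (1.0 / d))


# Nine points arranged as a 3 x 3 grid in the unit square: n = 3, r = 1/(1 + 3).
print(r_bar(9, 2))
# A sparse, high-dimensional case (N much smaller than 2**d).
print(round(r_bar(100, 10), 3))
```

Since the expression depends on N only through N^(1/d), it changes slowly with N when d is large, so large increases in sample size barely shrink the smallest usable observation scale.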
Applying Eq. 10 to Eq. 7 we obtain

$$d_0(N, d) = \frac{2 d N^{1/d}}{1 + 2 N^{1/d}} \tag{11}$$

By definition [25], a random variable with uniform probability density inside a hypercube of unitary volume has null differential entropy. Therefore, given that Smith's bias is calculated precisely for this random variable, Eq. 4 should yield $\hat{h}_{Ma} = 0$ in this case, and any imbalance between $\log_2(C_0(r))$ and $d_0(r) \log_2(r)$ is to be taken as an entropy bias, ∆h. Therefore, for the estimated $d_0$ the expected DE bias is

$$\Delta h = \log_2(C_0(r)) - d_0(r) \log_2(r) \tag{12}$$

Applying Eq. 6 and 7 to Eq. 12 we obtain

$$\Delta h = d \log_2\big(r(2 - r)\big) - d\,\frac{2 - 2r}{2 - r} \log_2(r)$$

which can be simplified to

$$\Delta h = d \left[\frac{r}{2 - r} \log_2(r) + \log_2(2 - r)\right] \tag{13}$$

Using Eq. 10 into Eq. 13, we obtain

$$\Delta h(N, d) = d \left[\frac{\bar{r}}{2 - \bar{r}} \log_2(\bar{r}) + \log_2(2 - \bar{r})\right], \quad \bar{r} = \frac{1}{1 + N^{1/d}} \tag{14}$$

Finally, to compensate for biases, Steps S1 to S5, as proposed in Section IV, are followed by two more steps, namely:
(S6) Using Eq. 11, find the compensated ID estimate, $\tilde{d}$, that yields the closest $d_0(N, \tilde{d})$ to the visually estimated $\hat{d}$.
(S7) Obtain $\Delta h(N, \tilde{d})$ using Eq. 14 and compute a compensated DE estimate as $\tilde{h} = \hat{h} - \Delta h(N, \tilde{d})$.
Illustration: An experimenter gathered N = 1000 multivariate observations with D = 20 attributes, and this observer applies the visual method (steps S1 to S5), thus obtaining the solid curve in Fig. 5. A naive experimenter would believe that the ID of that data is 8, according to the angle of the dashed line (found after visual comparison between some competing slopes). Let us call it the apparent ID, $\hat{d} \approx 8$, associated to the apparent DE, $\hat{h} \approx 4$ bits. However, because N is too small as compared to $42^8$ [22], or even to $10^{8/2}$ [30], one should not accept the result of this first analysis. Proceeding with step S6, a range of possible IDs near $\hat{d}$ is considered and Eq. 11 is used to complete Table I, from which it is possible to infer that the apparent ID near 8 corresponds to a bias compensated ID of 10, which is the actual ID of the data source used in this illustration. On step S7, Eq. 14 further yields ∆h(1000, 10) ≈ 4.2 and a less biased DE estimate is finally obtained as $\hat{h} - \Delta h(1000, 10) \approx -0.2$ bits (the actual DE of the data source used in this illustration is zero).

V. EXPERIMENTS
Two sets of experimental results are presented. The first uses two artificial datasets whose intrinsic dimensions are known, and the corresponding results are presented as evidence in favor of the proposed approach. Those results mitigate the difficulty of providing statistical analysis for the method, since it depends upon visual (human) evaluation as part of the process. Two real datasets are analyzed afterwards, and despite the fact that their intrinsic dimensions were already analyzed in formerly published papers, our results raise some interesting questions regarding the consistency of estimates and the need for bias compensation.
The first artificial dataset source corresponds to a 12-dimensional manifold (d = 12) in 72 dimensions (D = 72) first proposed in [27], then reused in [1] and [31], which makes it a suitable dataset for comparison purposes. N = 1600 random data points were used, and two results are presented separately in Figures 6 and 7 for a better visualization of an interesting aspect of this dataset. For values of $\log_2(r)$ from −0.3 to 0.1 (fine observation scale), the apparent ID is about 9.4, whereas the apparent DE is about 15 bits, as can be better observed in Fig. 6. As for Fig. 7, we observe instead an apparent ID of about 12.2, whereas the apparent DE remains around 15 bits, for values of $\log_2(r)$ from 0.1 to 1.0 (coarse observation scale).
Fig. 6. ID estimation for a 12-dimensional manifold in 72 dimensions proposed in [27]. For values of $\log_2(r)$ from −0.3 to 0.1 (small observation scale), the apparent ID is about 9.4, whereas the apparent DE is about 15 bits, both biased.
Fig. 7. ID estimation for a 12-dimensional manifold in 72 dimensions proposed in [27]. For values of $\log_2(r)$ from 0.1 to 1.0 (coarse observation scale), the apparent ID is about 12.2, whereas the apparent DE remains around 15 bits.
Again, proceeding with step S6, a range of possible IDs is considered and Eq. 11 is used to complete Table II, from which one may conclude that for fine scales of observation, the actual ID of the corresponding manifold is about 12 (from 9.4, after bias compensation), whereas its biased DE of about 15 bits should be compensated (step S7) to $\hat{h} - \Delta h(1600, 12) \approx 15 - 4.8 = 10.2$ bits. It is noteworthy that 12 is indeed the ID artificially imposed on the manifold underlying this dataset. Moreover, in [27] it is highlighted that this manifold has a "high curvature and nontrivial probability measure effects on the manifold", and we believe that the second linear trend shown in Fig. 7 is a consequence of that high curvature, for the apparent ID of about 12.2 is compensated to 16, which is compatible with the idea that a 12D manifold can be curved to the point that, for a coarse observation scale, it forms a (hollow) structure of dimension higher than 12. The biased DE of such a structure is compensated to $\hat{h} - \Delta h(1600, 16) \approx 15 - 5.8 = 9.2$ bits.
The second artificial dataset is labeled "Data Set D" [32], also used in [1] under the label "Santa Fe dataset". As explained in [32], it corresponds to a "relatively long series of known high-dimensional dynamics (...) with weak nonstationarity" with 100,000 points obtained by numerical integration of the equations of motion for a damped, driven particle. We organized the simulated values in N = 2000 50D patterns, as in [1], which yielded the visual result presented in Fig. 8, where an apparent ID of about 7.4 is observed, along with a small apparent bias of about −0.5 bits.
The first real dataset used in this paper is labeled "Paris-14E Parc Montsouris" in [33], corresponding to a time series formed by daily average temperatures (in tenths of Celsius degrees) in Paris, from January 1, 1958 to December 31, 2001. We organized the 15,706 measurements in N = 785 patterns of D = 20 measurements each. In [33] three ID estimation algorithms were applied to this dataset, including GP, with which the authors of [33] estimated an ID of 4.91.
By contrast, Figure 9 presents our reproduction of the experiment with the Grassberger-Procaccia approach, where for values of $\log_2(r)$ from 5.7 to 6.3 the apparent ID is about 10.7, whereas the apparent DE is about 76.5 bits. This visual result, even before any bias compensation, suggests that an ID of about 5 is far from any ID value estimated for small values of $\log_2(r)$. We then conjecture that the authors of [33] estimated an average slope for a wide range of $\log_2(r)$, which indeed would yield an ID estimate near 5. Besides, in [1] twelve different ID estimators were applied to this same dataset, yielding inconsistent estimates ranging from 3.71 up to 13.52.
In this work, we take the apparent ID of 10.7 in Fig. 9 as our best guess for small values of r, whose bias compensation, according to Table IV, yields an ID of 14.
To check this result, we did an additional analysis similar to that shown in Fig. 5, this time with N = 785 random observations of a random variable uniformly distributed in a hypercube of 14 dimensions, thus with an actual ID of 14 and an actual DE of 0 bits. In this experiment, the apparent ID and DE were found to be $d_0 = 10.7$ and $h_0 \approx 5.1$ bits, as shown in Fig. 10, which seems to confirm that the results shown in Fig. 9 are compatible with a random source of 14D (apparent ID of about 10.7), to which a bias compensation of about 5.1 bits is necessary. In other words, Fig. 10 corroborates the idea that the "Paris-14E Parc Montsouris" dataset lies in a 14D (thus greater than 10.7) manifold whose DE is about 71.4 (instead of 76.5) bits.
Another experiment with real data was done with all N = 6990 available observations of digits labeled '2' in the MNIST dataset [34]; for practical purposes, we label this dataset as "MNIST 2". Digit '2' was chosen to allow a comparison of our result to similar experiments reported in [12], [13] and [27]. Figure 11 corresponds to the visual analysis from this experiment, where an apparent ID of 13 was observed, associated to an apparent DE of about 134 bits.
The visually estimated ID around 13 is in agreement with results presented in [12], [13] and [27], but it seems to be a misleading observation, for the corresponding bias-compensated ID is higher than 13. Indeed, after going through steps S6 and S7, Table V suggests that, for N = 6990 observations, an apparent ID of about 13 is expected when the actual ID is 17. The apparent DE of about 134 bits is accordingly compensated to $\hat{h} - \Delta h(6990, 17) \approx 134 - 6.4 \approx 128$ bits. This is less than the DE estimated by [12], of about 145 bits. Such a discrepancy may be partially accounted for by the fact that in [12], the estimated DE is the intrinsic Rényi α-entropy for α = 1/2, whereas we estimate the collision DE (α = 2).
As in the former experiment with real datasets, to check our results, we did an additional analysis similar to that shown in Fig. 5 with N = 6990 random observations of a random variable uniformly distributed in a unit-volume hypercube of 17 dimensions, thus with actual ID and DE equal to 17 and 0 bits, respectively. In this experiment, the apparent dimension and entropy were found to be $d_0 = 13$ and $h_0 \approx 6.8$ bits, with a visual aspect quite similar to those presented in Figures 5 and 10. This result seems to confirm our conclusion that "MNIST 2" samples lie in a 17D manifold. However, the bias compensation prediction of about 6.4 bits slightly deviated from the observed bias of about 6.8 bits, for the artificial data used in the test.
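The bias-compensation figures quoted in this section can be cross-checked numerically. The sketch below is not the authors' code: it assumes Smith's model gives an apparent dimension d(2 − 2r)/(2 − r) and a DE bias d[r/(2 − r) log2(r) + log2(2 − r)] (the simplified form labeled (13)), both evaluated at r = 1/(1 + N^(1/d)); the function names and the integer search range are hypothetical.

```python
import math


def r_bar(n_obs, d):
    """Coarse nearest-neighbour supremum distance in a unit hypercube."""
    return 1.0 / (1.0 + n_obs ** (1.0 / d))


def apparent_id(n_obs, d):
    """Biased ID expected for n_obs uniform observations of true ID d."""
    r = r_bar(n_obs, d)
    return d * (2.0 - 2.0 * r) / (2.0 - r)


def delta_h(n_obs, d):
    """Expected DE bias, in bits, for n_obs observations and true ID d."""
    r = r_bar(n_obs, d)
    return d * (r / (2.0 - r) * math.log2(r) + math.log2(2.0 - r))


def compensate_id(n_obs, d_apparent, d_max=100):
    """Step S6: integer ID whose predicted apparent ID is closest to d_apparent."""
    return min(range(1, d_max + 1),
               key=lambda d: abs(apparent_id(n_obs, d) - d_apparent))


# (N, apparent ID, apparent DE in bits) as reported in the experiments.
cases = [(1600, 9.4, 15.0), (1600, 12.2, 15.0), (785, 10.7, 76.5), (6990, 13.0, 134.0)]
for n_obs, d_app, h_app in cases:
    d_comp = compensate_id(n_obs, d_app)        # step S6
    h_comp = h_app - delta_h(n_obs, d_comp)     # step S7
    print(f"N={n_obs}: ID {d_app} -> {d_comp}, DE {h_app} -> {h_comp:.1f} bits")
```

Restricting the search to integer dimensions mirrors the table lookup described in the text; a continuous root finder could be substituted if fractional dimensions were of interest.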

VI. CONCLUSION
A new approach for bias-compensated estimation of intrinsic dimension and differential entropy was proposed in this paper. It corresponds to the natural combination of previously published estimation methods, one for collision (or quadratic) entropy, by Ma [19], and another for correlation dimension, by Grassberger and Procaccia [14]. In the first part of this work it was explained why these two approaches are connected in spite of their different goals, and how ID and DE should be regarded as two complementary aspects of the analysis of random observations, thus yielding a joint estimation approach.
An important aspect of this approach is its dependency on scale of analysis. Although it is frequently regarded as a practical obstacle for estimators, we propose that estimates at different scales convey different perspectives of underlying manifolds. Accordingly, we propose a pragmatic visual approach, followed by some illustrations.
On the other hand, the seminal work by Smith [22] is a clear warning regarding the always present bias in the Grassberger-Procaccia estimator. Then, we built upon the theoretical model used by Smith to introduce a systematic bias compensation for both ID and DE estimation, whose use is validated through experiments with real data and further illustrated through experiments with synthetic ones.
It is worth noticing that while this work is strongly based on Smith's analysis, which yields a quite severe restriction on the minimum number of observations for a reliable estimate of d, as pointed out in Eq. 8, that restriction does not apply to this work. Indeed, Smith's analysis imposes that the estimated dimension should not be less than 95% of the actual one, without any kind of bias compensation. By contrast, in this work, instead of imposing a bias threshold, we use Smith's formula to compensate for that bias, even if the number of observations is much less than $N_{min} \approx 42^d$.
The proposed approach is developed under the assumptions that the ID is constant over the variable domain and that the underlying probability density function is locally uniform. If these assumptions are not verified, the proposed approach should not be applied. Notwithstanding, thanks to the visual analysis that is an important part of this approach, and taking into account its potential for a geometrical analysis of manifolds as a whole, as proposed in [35], we believe that the study of visual patterns (of log(r) versus log(C)) even when these assumptions are violated can be a promising research subject for the future.
We also believe that the proposed tool for manifold analysis can be useful in the pattern recognition context, especially in this renewed era of artificial neural network applications. Indeed, many researchers concerned with this topic seem to converge to the conclusion that relevant insights should come from the study of manifolds. In this work, we try to provide a pragmatic tool for the bias-corrected estimation of manifold volume and intrinsic dimension. This can be regarded as a first step in understanding how layered processing structures disentangle data manifolds, and how to eventually improve this process.