Robust time delay estimation based on non-Gaussian impulsive acoustic channel

The aim of this letter is to propose a new robust method for time delay estimation in impulsive noise and to investigate its practical implications. The method applies a nonlinear data transformation before the generalized crosscorrelation technique. Simulations and experiments show better performance than traditional methods without increasing the computational cost. Our experiments also indicate that the impulsive noise can be correlated between channels, a condition in which the proposed method still provides accurate estimates.


I. INTRODUCTION
Time delay estimation (TDE) for speech and audio signals is crucial to many applications and emerging technologies, such as smart speakers, robots, integrated media, and assistive technologies. The design of such technologies must therefore consider realistic acoustic channel models. Notably, channels subject to impulsive noise are more accurately characterized as non-Gaussian processes [1], [2], although many TDE methods assume, implicitly or explicitly, that the signal observations are Gaussian distributed. This assumption underlies the generalized crosscorrelation (GCC) techniques and their frequency-domain weightings, such as the phase transform (GCC-PHAT) [3], Roth weighting (GCC-ROTH), and smoothed coherence transform weighting (GCC-SCOT). As a result, these classical methods suffer severe degradation in spatial resolution [4] when the Gaussian assumption does not hold.
A non-Gaussian channel can be modeled with a symmetric α-stable (SαS) distribution. The SαS distribution has heavier tails than the Gaussian distribution, giving a much better approximation to real-world audio signals [1], [2]. Although an approach based on fractional lower-order statistics (FLOS) for α-stable noise has been proposed [4], it relies on estimating the characteristic exponent of the model.
In this letter, we develop a straightforward and robust solution for TDE in acoustic channels subject to non-Gaussian impulsive noise. In particular, we propose the use of a nonlinear data-based transformation [5], which we call the nonlinear transform (NLT) method, to allow the use of classical TDE methods even in channels where the noise has unbounded variance. Our main contribution is a method based on the traditional GCC technique that enables its use in non-Gaussian impulsive noise. Moreover, the performance of the proposed architecture is evaluated with simulated and experimental data, comparing the proposed method to classical and robust techniques. Our proposed method significantly increases the estimation accuracy in the presence of non-Gaussian SαS noise. Finally, we also discuss the influence of the noise correlation between the received signals in our real measurements.
This letter is organized as follows. In Section II, we describe the signal model and the state-of-the-art solution, presenting its benefits and limitations. Our method is proposed in Section III. In Section IV, the main results are presented and discussed, comparing the performance of the proposed and reference TDE algorithms through simulations and experimental data. In Section V, we present our final remarks.

II. PROBLEM FORMULATION
Assuming two microphones in the acoustic field of a single speech source, the signals received at instant $n$ in the presence of additive noise can be expressed as
$$x_1(n) = s(n) + u_1(n), \qquad x_2(n) = a\,s(n-\tau) + u_2(n), \tag{1}$$
where $s(n)$ represents the speaker's signal, $\tau$ represents the time delay, $u_1(n)$ and $u_2(n)$ are background noise sources, and $a$ is a random complex-valued gain. We assume that $u_1(n)$ and $u_2(n)$ are uncorrelated SαS processes.
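As an illustration of this signal model, the following minimal sketch generates two noisy channels from a clean signal. It rests on several assumptions that are not part of the letter: the SαS samples are drawn with the Chambers-Mallows-Stuck formula (not the direct integration method of [8]), the gain is a fixed real constant, the delay is an integer number of samples applied as a circular shift, and the names `sas_noise` and `simulate_channels` are hypothetical.

```python
import numpy as np

def sas_noise(alpha, size, scale=1.0, rng=None):
    """Draw symmetric alpha-stable samples (beta = 0, alpha != 1).

    Assumption: Chambers-Mallows-Stuck generator, used here only as a
    convenient stand-in for the noise model of the letter.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform phase
    w = rng.exponential(1.0, size)                 # unit-mean exponential
    x = (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)) \
        * (np.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha)
    return scale * x

def simulate_channels(s, delay, alpha, noise_scale, gain=1.0, rng=None):
    """Return (x1, x2) with x1 = s + u1 and x2 = gain * s(n - delay) + u2.

    Assumptions: real constant gain, integer delay, circular shift.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s, dtype=float)
    u1 = sas_noise(alpha, s.shape, noise_scale, rng)  # independent noise, channel 1
    u2 = sas_noise(alpha, s.shape, noise_scale, rng)  # independent noise, channel 2
    x1 = s + u1
    x2 = gain * np.roll(s, delay) + u2
    return x1, x2
```

Smaller values of `alpha` produce more frequent large outliers, which is the regime where the Gaussian assumption of classical estimators is expected to fail.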
As previously mentioned, we take FLOS as our reference robust method for impulsive noise. The fractional lower-order covariance, which mitigates the effects of a heavy-tailed noise model, is defined in [4] for an α-stable process as
$$R^{\mathrm{FLOC}}_{x_1x_2}(\tau) = E\{x_1(n)^{\langle A\rangle}\, x_2(n+\tau)^{\langle B\rangle}\}, \tag{2}$$
for $0 \le A, B < \alpha/2$, where $z^{\langle p\rangle} = |z|^{p}\,\mathrm{sign}(z)$. Consequently, the method requires parametrization. Because the variance is not finite, we use the covariation [6] instead of the covariance:
$$[x_1, x_2]_{\alpha} = \frac{E\{x_1\, x_2^{\langle p-1\rangle}\}}{E\{|x_2|^{p}\}}\,\gamma_{x_2}, \tag{3}$$
for any $1 \le p < \alpha$, where $\gamma_{x_2}$ is the dispersion of $x_2$. For $1 < \alpha < 2$, α-stable distributions have finite first-order moments and finite fractional moments of any order $p < \alpha$. It is worth mentioning that the covariation is not linear in the second variable $x_2$ and is difficult to calculate analytically. Its estimate can be obtained from (4).
To ensure the existence of second-order statistics, the generalized FLOS (GFLOS) has been shown to have a smaller variance than the FLOS method [7]. It is defined as
$$R_{x_1x_2}(\tau) = E\{g(x_1(n))\, g(x_2(n+\tau))\}, \tag{5}$$
where $g(\cdot)$ belongs to a class of nonlinear transform functions. This function may be a fractional power, as in FLOS, or another function such as the logarithm, sign, or arctangent.
Although the maximum of $R_{x_1x_2}(\tau)$ gives the robust time delay estimate for a particular function, depending on the function chosen it may not be possible to use fast FFT-based convolution algorithms, or some parametrization may be required. The inability to use such optimized algorithms makes the method unsuitable for applications with speed constraints.
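The fractional lower-order covariance discussed in this section can be estimated by replacing the expectation with a sample average. The sketch below is one possible form under stated assumptions: it uses the signed power $|z|^{p}\,\mathrm{sign}(z)$, a lag convention in which the second channel is advanced by the lag, real-valued signals, and a direct loop over lags. The function names are hypothetical, and the default exponents follow the value A = B = 0.15 quoted in Section IV.

```python
import numpy as np

def signed_power(x, p):
    """Signed fractional power |x|^p * sign(x)."""
    return np.sign(x) * np.abs(x) ** p

def floc(x1, x2, max_lag, A=0.15, B=0.15):
    """Sample estimate of E{ x1(n)^<A> * x2(n + m)^<B> } for |m| <= max_lag.

    Assumption: real-valued, equal-length inputs; direct time-domain loop.
    """
    a = signed_power(np.asarray(x1, dtype=float), A)
    b = signed_power(np.asarray(x2, dtype=float), B)
    n = len(a)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(len(lags))
    for i, m in enumerate(lags):
        if m >= 0:
            r[i] = np.mean(a[:n - m] * b[m:])    # pairs (n, n + m), m >= 0
        else:
            r[i] = np.mean(a[-m:] * b[:n + m])   # pairs (n, n + m), m < 0
    return lags, r
```

The delay estimate is then the lag at which `r` is largest, for example `lags[np.argmax(r)]`. The exponents `A` and `B` are the parametrization that this family of methods requires.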

III. PROPOSED METHOD
In acoustic signals, external noise may originate from impulsive electrical noise, human-made audio noise, and uncalibrated sensors. In our proposed method, named nonlinear transform (NLT), the time delay is estimated by applying a nonlinear transformation followed by the GCC method.
Based on the GFLOS method [7], we compute the covariance using functions from the class of sigmoid functions. These functions have essential characteristics, such as being monotonically increasing and symmetric. We compared the modulus, algebraic, Gudermannian, and hyperbolic tangent functions; the hyperbolic tangent reached a lower RMSE than the others. Therefore, the nonlinear data transformation of the $i$th received signal is a generalized sigmoid function. In practice, we use the hyperbolic tangent, a shifted and scaled version of the sigmoid, described by
$$y_i(n) = \tanh(x_i(n)) = \frac{2}{1 + e^{-2x_i(n)}} - 1. \tag{6}$$
In the NLT, we use the following covariance:
$$R_{y_1y_2}(\tau) = \sum_{k=0}^{K-1} \phi(k)\, Y_1^{*}(k)\, Y_2(k)\, e^{j2\pi k\tau/K}, \tag{7}$$
where $\phi(\cdot)$ is a weighting function, $Y_i(k)$ is the frequency-domain representation ($K$-point DFT) of the transformed signal $y_i(n)$, and $^{*}$ denotes complex conjugation. Any frequency-domain weighting of the crosscorrelation may be applied according to the specific problem, producing variants such as NLT-PHAT, NLT-ROTH, and NLT-SCOT. Finally, the delay between the channels can be estimated by
$$\hat{\tau} = \arg\max_{\tau} R_{y_1y_2}(\tau). \tag{8}$$
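The processing chain described in this section (hyperbolic tangent, then a PHAT-weighted crosscorrelation computed with the FFT, then a peak search) can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the names `gcc_phat` and `nlt_phat`, the zero-padding length, the regularization constant `eps`, and the restriction of the search to `max_lag` samples are all assumptions, and the signals are taken to be real-valued.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag, eps=1e-12):
    """Integer delay of x2 relative to x1 from the PHAT-weighted crosscorrelation.

    Assumptions: real inputs, max_lag >= 1, zero-padded FFT to avoid wrap-around.
    """
    n = len(x1) + len(x2)                    # padded FFT length
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    G = np.conj(X1) * X2                     # cross-spectrum
    r = np.fft.irfft(G / (np.abs(G) + eps), n)
    # Negative lags sit at the end of the inverse FFT output.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(r)]), r

def nlt_phat(x1, x2, max_lag):
    """Apply tanh to each channel, then estimate the delay with GCC-PHAT."""
    return gcc_phat(np.tanh(x1), np.tanh(x2), max_lag)
```

Because the hyperbolic tangent is bounded between -1 and 1, a single large impulse contributes no more to the crosscorrelation than any other saturated sample, while the FFT-based computation of the classical GCC is kept unchanged.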

IV. RESULTS AND DISCUSSION
Proof-of-concept computer simulations were used to evaluate NLT-PHAT against the GCC-PHAT and FLOS-PHAT methods, as shown in Figures 1 and 2. We configured two simulation scenarios for the α-stable noise, modeled with the direct integration method for the SαS model [8]: α = 1.9 for low impulsiveness and α = 1.3 for high impulsiveness. Numerical results are obtained from 10000 Monte Carlo runs.
The angle estimation accuracy is assessed by the root-mean-square error (RMSE), in degrees, and the probability of resolution (PR) versus the signal-to-noise ratio (SNR). However, the infinite variance of non-Gaussian SαS processes prevents the SNR from being used as a measure of signal quality. In this letter, we use the geometric signal-to-noise ratio (GSNR) [9] instead. The GSNR is given by
$$\mathrm{GSNR} = \frac{1}{2C_g}\left(\frac{A}{S_0}\right)^{2}, \tag{9}$$
where the normalization constant $C_g = e^{C_e} \approx 1.78$ is the exponential of the Euler constant ($C_e$), used to ensure that the GSNR corresponds to the SNR when the channel is Gaussian (α = 2); $S_0$ is the geometric power of an SαS random variable; and $A$ is the root-mean-square value of the audio signal.
The PR is a performance measure, based on asymptotic analysis, that evaluates the resolution capability of a method and its statistical performance. Direction-of-arrival estimates are given by the positions of spectral peaks, and TDOA estimates by the positions of crosscorrelation peaks. The PR allows us to evaluate the accuracy of the direction of arrival (and of the time delay estimate) statistically with respect to the SNR. It is therefore useful when comparing two methods or when evaluating methods under critical conditions.
Because we compare against the most common GCC method, GCC-PHAT, we use the same phase transform (PHAT) weighting function in the robust methods FLOS-PHAT and NLT-PHAT, given by [2]
$$\phi_{\mathrm{PHAT}}(k) = \frac{1}{|G_{12}(k)|}, \tag{10}$$
where $G_{12}(k)$ is the cross-spectrum of the two signals being correlated. We evaluated FLOS-PHAT with different values of the parameter p in scenarios with different impulsiveness levels (α = 1.9 and α = 1.3). We then adopted A = B = 0.15 in the FLOS-PHAT method, since it achieves its best performance with this value [2]. Our results corroborate the better performance of FLOS-PHAT over the classic GCC-PHAT. However, FLOS-PHAT requires prior knowledge of the noise.
On the other hand, our proposed method (NLT-PHAT) does not require any parameter and outperforms the other methods in impulsive noise, even at low GSNR values.
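The GSNR used as the horizontal axis of these comparisons can be computed from samples as sketched below. The sketch assumes the form $\mathrm{GSNR} = (A/S_0)^2 / (2C_g)$, with the geometric power estimated as $S_0 = \exp(\mathrm{mean}(\log|u(n)|))$ over the noise samples; the function names are hypothetical, and noise samples that are exactly zero are not handled.

```python
import numpy as np

# Exponential of the Euler constant, approximately 1.78.
C_G = np.exp(np.euler_gamma)

def geometric_power(u):
    """Sample geometric power exp(mean(log|u|)); assumes no sample is exactly zero."""
    return float(np.exp(np.mean(np.log(np.abs(u)))))

def gsnr_db(signal, noise):
    """Geometric SNR in dB from the signal RMS value A and the noise geometric power S0."""
    A = np.sqrt(np.mean(np.square(signal)))
    S0 = geometric_power(noise)
    return float(10.0 * np.log10((A / S0) ** 2 / (2.0 * C_G)))
```

For Gaussian noise with standard deviation σ, the geometric power is σ/√(2C_g), so the expression reduces to A²/σ², the usual SNR; this is the role of the normalization constant.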
In our previous work [1], the noise in outdoor and indoor scenarios was characterized as impulsive by fitting SαS distributions. The outdoor environment exhibits severely impulsive noise (α = 1.3), and the indoor environment less impulsive noise (α = 1.9). Thus, we also present a performance evaluation of the proposed NLT-PHAT with measured signals.
Our experimental analysis is based on a setup with a circular array of four microphones. We collected 240000 signal samples in each of the indoor and outdoor scenarios. The spacing between the microphones is 5.75 cm and the sampling frequency is 48 kHz. The audio source is a speech signal located 1.5 meters away and at 20 degrees from the first microphone. All measurements were taken by hand and are subject to limited accuracy, on the order of centimeters. Table I shows the RMSE and PR for our experimental data. We separated the evaluation into two signal windows: first, all 240000 samples (scenarios Indoor 1 and Outdoor 1); second, a window of 100000 samples in which the speech signal occurs (scenarios Indoor 2 and Outdoor 2). The second window was selected manually, and the same window was used for all methods. Although the sampling frequency limits the resolution and thus raises the lowest achievable RMSE ($\mathrm{RMSE}_{\min} = 4.7$), we use the PR with $\xi = 6^{\circ}$ to ensure a proper evaluation. The loss of performance in the full-sized window occurs because the source signal is absent in some regions of the window.
Analyzing the results in Table I, the FLOS-PHAT algorithm exhibits a lower RMSE than the classic GCC-PHAT solution, although FLOS-PHAT has lower resolution. The NLT-PHAT, in turn, has the lowest RMSE and the highest probability of resolution among the methods. With our proposed approach, one can reach the minimum RMSE and maximum PR in the outdoor scenario with an appropriate window. We examined the signals carefully to investigate why the methods perform similarly. The analyzed signal suggests that the impulsive noise is occasionally correlated between the channels because it is produced by an acoustic source, as illustrated in Figure 3. In this case, the methods perform similarly because the assumed model, in which the noise is uncorrelated between the received signals, is inappropriate. Thus, the performance of non-robust estimators improves in the presence of correlated impulsive noise. Nevertheless, these results reveal better performance of the robust methods even when correlated impulsive noise is occasionally present.
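The two figures of merit reported in this section can be computed from a set of angle estimates as sketched below. The letter does not spell out the PR formula, so this sketch assumes that the PR is the fraction of estimates whose absolute error does not exceed the threshold ξ; the function names are hypothetical, and angle wrap-around is ignored.

```python
import numpy as np

def rmse_deg(estimates, true_angle):
    """Root-mean-square error, in degrees, of a set of angle estimates."""
    err = np.asarray(estimates, dtype=float) - true_angle
    return float(np.sqrt(np.mean(err ** 2)))

def prob_resolution(estimates, true_angle, xi=6.0):
    """Assumed PR: fraction of estimates within xi degrees of the true angle."""
    err = np.abs(np.asarray(estimates, dtype=float) - true_angle)
    return float(np.mean(err <= xi))
```

For example, with a source at 20 degrees and estimates of 20, 24, 30, and 15 degrees, three of the four estimates fall within 6 degrees of the true angle, so `prob_resolution` returns 0.75, while the single 10-degree error dominates the RMSE.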

V. CONCLUSION
The proposed method has the same low computational cost as the classical solutions in the literature. It shows advantages over existing robust approaches for TDE, such as enhanced performance in impulsive noise (even when correlated), computational feasibility (with no need for parametrization), and the ability to use second-order statistics even in unbounded-variance scenarios. Our method can be a beneficial complementary approach in current attempts to solve the localization problem in non-Gaussian noise scenarios. It is a feasible solution to the trade-off between model complexity and performance.