Iterative Error Decimation for Syndrome-Based Neural Network Decoders

In this letter, we introduce a new syndrome-based decoder where a deep neural network (DNN) estimates the error pattern from the reliability and syndrome of the received vector. The proposed algorithm works by iteratively selecting the most confident positions to be the error bits of the error pattern, updating the received vector each time a new position of the error pattern is selected. Simulation results for the (63,45) and (63,36) BCH codes show that the proposed approach outperforms existing neural network decoders. In addition, the new decoder is flexible in that it can be applied on top of any existing syndrome-based DNN decoder without retraining.


I. INTRODUCTION
In recent years, investigations into the design of short-length channel codes have gained prominence, particularly due to the applications that newer technologies aim to support. 5G technology, in particular, aims to guarantee services that require ultra-reliable low-latency communication (URLLC) [1]. For example, intelligent transport systems and process automation demand reliability on the order of $10^{-3}$ to $10^{-6}$ and latency between 1 ms and 100 ms. Communication under these conditions is challenging, since the requirements themselves are strict and conflicting [2], [3].
This scenario has motivated the evaluation of possible candidate codes in terms of reliability and complexity for a given (short) blocklength [2], [4], [5]. Among many candidates, which include polar codes, LDPC codes and convolutional codes, BCH codes stand out as having excellent performance, very close to the fundamental limits in the short blocklength regime. This is achieved by the use of an ordered statistics decoder (OSD), which delivers near-maximum-likelihood (ML) performance; however, this comes at the price of a high complexity, which grows quickly as the blocklength increases.
An alternative that has increasingly been explored in recent work is the use of decoders based on deep neural networks (DNNs). Although the use of neural networks (NNs) for the task of decoding is not recent [6], the success of deep learning in several applications has renewed interest in this approach [7]. Recently, in [8], Nachmani et al. proposed a neural decoder based on belief propagation (BP) decoding over the code's underlying Tanner graph. In subsequent works [9]-[12], other architectures based on [8] are presented. Unlike approaches based on BP decoding, Bennatan et al. proposed in [13] a new decoder structure, where the NN is fed the reliability and syndromes of the received sequences and acts on noise estimation. Their approach can be regarded as a soft-decision extension of the syndrome-based approach of [6]. A great advantage of this structure is that the NN can be designed freely, i.e., without the restrictions present in architectures based on the BP decoder. Subsequently, the vanilla DNN proposed in [13] was simplified in [14], [15]; specifically, the architecture in [15] has fewer parameters and achieves a better performance than the original one.
A common limitation in many previous works is their focus on the bit error rate (BER) as a measure of performance, presumably because it maps more directly to the NN training objective. However, when evaluated by the block error rate (BLER), some of these works fail to significantly improve upon a hard-decision bounded-distance decoder (HD-BDD) that would conventionally be used to decode BCH codes.
In this paper, we present a strategy to improve the performance of any syndrome-based neural decoder (i.e., any decoder following the approach in [13]), at the expense of a moderate increase in complexity.
Our approach is to take the unquantized estimate of the error vector that is output by a neural decoder and iteratively select its most confident position, which is then decimated (subtracted) from the received vector before a new decoding attempt is made. Our results show that this proposed approach significantly improves the BLER achieved by the decoder in [13], outperforming previous results for the BCH(63,36) and BCH(63,45) codes.
Notation: We use $x_i$ for the $i$th element of a vector $x$. Let 0 and 1 be the all-zeros and the all-ones vectors, respectively, with lengths implied by the context. If $x \in \mathbb{R}^n$ and $a \in \mathbb{R}$, then $1[x > a]$ denotes the vector $y \in \{0, 1\}^n$ such that $y_i = 1$ if and only if $x_i > a$. We use a similar notation for $1[x < a]$.

B. Syndrome-Based Neural Decoding
Let $y_b = 1[y < 0] \in \{0, 1\}^n$ be the vector of hard decisions² and let $e = y_b + c \bmod 2 \in \{0, 1\}^n$ be the corresponding error vector. Clearly, $c$ can be easily found given $y_b$ and $e$. Thus, the decoding problem reduces to that of estimating $e$. As shown in [13], a sufficient statistic for the estimation of $e$ is the pair $(s, |y|)$, where $s = y_b H^T \bmod 2$ is the syndrome of the error vector (i.e., $s = e H^T \bmod 2$) and $|y| = (|y_1|, \ldots, |y_n|)$ is the vector of channel reliabilities.
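The computation of the decoder inputs can be illustrated with a minimal sketch. It assumes a row-vector convention in which each row of the parity-check matrix defines one parity check, and it uses the (7,4) Hamming code purely as a small stand-in for the BCH codes considered in this letter. The function names and the received values are hypothetical.

```python
# Parity-check matrix of the (7,4) Hamming code (illustrative stand-in only;
# column j holds the binary expansion of j + 1, least significant bit first).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def hard_decisions(y):
    """Bit 1 wherever the received value is negative, bit 0 otherwise."""
    return [1 if v < 0 else 0 for v in y]

def syndrome(bits, H):
    """One parity check per row of H, computed modulo 2."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def decoder_inputs(y, H):
    """Return the pair (syndrome, reliabilities) that is fed to the network."""
    return syndrome(hard_decisions(y), H), [abs(v) for v in y]

# All-zero codeword sent as +1 on every position (assumed mapping);
# position 2 (0-based) is received with the wrong sign.
y = [0.9, 1.2, -0.3, 0.8, 1.1, 0.7, 1.4]
s, reliab = decoder_inputs(y, H)
print(s, reliab)
```

Here the syndrome equals column 2 of the parity-check matrix, and the erroneous position is also the least reliable one.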
The approach proposed in [13] is to design an NN to estimate $e$ from $(s, |y|)$. More precisely, the network is trained to minimize the empirical risk under the loss
$$\mathcal{L}(e, \tilde{e}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ e_i \log \tilde{e}_i + (1 - e_i) \log (1 - \tilde{e}_i) \right],$$
where $\mathcal{L}$ is the binary cross-entropy (BCE) loss function and $\tilde{e} \in [0, 1]^n$ is the NN output, produced with a sigmoid output activation function.³ The binary estimate of $e$ is then obtained as $\hat{e} = 1[\tilde{e} > 0.5] \in \{0, 1\}^n$. The complete decoder, which we refer to as a syndrome-based neural decoder (SBND), is shown in Fig. 1.
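As a small illustration of the training objective and of the final thresholding step, the following sketch evaluates the BCE loss for one example and quantizes a soft output. Averaging over the n positions and clipping the probabilities with `eps` are assumptions made here for numerical convenience; the function names are hypothetical.

```python
import math

def bce_loss(e, e_tilde, eps=1e-12):
    """Binary cross-entropy between a target error vector and a soft estimate.

    The average over positions and the clipping by eps are assumptions of
    this sketch, not details taken from the text.
    """
    total = 0.0
    for target, prob in zip(e, e_tilde):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total += target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
    return -total / len(e)

def threshold(e_tilde):
    """Binary estimate: 1 where the soft output exceeds 0.5."""
    return [1 if prob > 0.5 else 0 for prob in e_tilde]

target = [0, 0, 1, 0]
soft = [0.1, 0.2, 0.7, 0.4]
print(bce_loss(target, soft), threshold(soft))
```

A confident and correct soft output gives a loss close to zero, whereas an uninformative output of 0.5 on every position gives a loss of log 2.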
As argued in [13], the inputs $(s, |y|)$ and the target $e$ are all independent of $c$; thus, the zero codeword assumption $c = 0$ can be used for both training and performance evaluation of the decoder. This avoids the risk of overfitting to the subset of codewords used during training. Moreover, as with any neural decoder, since the channel model is known, a potentially unlimited number of examples can be used for training and testing without risk of overfitting to the noise.
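The zero codeword assumption makes on-the-fly example generation simple, as the following sketch suggests. It assumes unit-energy BPSK with bit 0 mapped to +1 and the usual relation between noise variance, code rate and the energy-per-bit to noise ratio; the (7,4) Hamming matrix and all function names are illustrative stand-ins, not taken from the original work.

```python
import math
import random

# (7,4) Hamming parity-check matrix, used only as a small stand-in code.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def noise_std(ebn0_db, n, k):
    """Noise standard deviation for unit-energy BPSK (assumed normalization)."""
    rate = k / n
    ebn0 = 10 ** (ebn0_db / 10)
    return math.sqrt(1.0 / (2.0 * rate * ebn0))

def training_example(H, sigma, rng):
    """Draw one (input, target) pair with the all-zero codeword transmitted."""
    n = len(H[0])
    y = [1.0 + rng.gauss(0.0, sigma) for _ in range(n)]
    e = [1 if v < 0 else 0 for v in y]  # with c = 0, errors equal hard decisions
    s = [sum(h * b for h, b in zip(row, e)) % 2 for row in H]
    return (s, [abs(v) for v in y]), e

rng = random.Random(0)
sigma = noise_std(4.0, 7, 4)
(s, reliab), e = training_example(H, sigma, rng)
print(s, e)
```

Because every example is drawn fresh from the channel model, no fixed dataset needs to be stored.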

A. Motivation
A main issue in training a syndrome-based neural decoder according to the procedure in Section II-B is the potential presence of inconsistent (or "noisy") training examples, namely, training examples with the same (or very similar) inputs but different targets. This phenomenon, called disturbance in [6], is most clearly seen in a decoder where the input component $|y|$ is removed from the neural network, i.e., the neural network is trained to predict the target error vector $e$ solely from its syndrome $s$. Note that this corresponds to degrading the BI-AWGN channel into a binary symmetric channel (BSC), which is the channel originally considered in [6]. In this case, multiple target error vectors with the same syndrome are likely to appear during training, producing a "noisy" output that tends to be a superposition of those error vectors.
² Note that $y_i = -1 + z_i$ (with $z_i$ the noise sample) when $c_i = 1$.
³ The original description in [13] uses a [−1, 1] mapping and a hyperbolic tangent output activation function, which is mathematically equivalent to the description given here.
For simplicity, consider the BSC case in the following. Ideally, the neural network should be trained to emulate the performance of a maximum-likelihood decoder; thus, every syndrome s should be paired with a single lowest-weight error vector e corresponding to that syndrome, in order to form the training set. Any distinct error vector with the same syndrome, if used as a training example, will drive the network to deviate from the desired prediction and thus can only hurt performance. However, generating such an optimal training set requires performing maximum-likelihood decoding for every possible syndrome (or, equivalently, generating and storing a full syndrome table) which can be computationally infeasible.
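The effect can be made concrete on a toy code. The sketch below, which assumes the (7,4) Hamming code and a BSC with crossover probability 0.1 (both chosen only for illustration), groups all error vectors by syndrome, picks the lowest-weight vector that an ML decoder would choose, and computes the probability-weighted bitwise average of all error vectors sharing that syndrome, which is the kind of superposition discussed above.

```python
from itertools import product

# (7,4) Hamming parity-check matrix, used only as a small stand-in code.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(bits):
    return tuple(sum(h * b for h, b in zip(row, bits)) % 2 for row in H)

# Group all 2^7 error vectors by their syndrome.
cosets = {}
for e in product((0, 1), repeat=7):
    cosets.setdefault(syndrome(e), []).append(e)

def bitwise_posterior(s, p):
    """Per-bit error probability on a BSC(p), given only the syndrome s."""
    members = cosets[s]
    weights = [p ** sum(e) * (1 - p) ** (7 - sum(e)) for e in members]
    total = sum(weights)
    return [sum(w * e[j] for w, e in zip(weights, members)) / total
            for j in range(7)]

s = (1, 1, 0)
leader = min(cosets[s], key=sum)        # lowest-weight error vector for s
posterior = bitwise_posterior(s, p=0.1)
print(len(cosets[s]), leader, [round(q, 3) for q in posterior])
```

Every position receives a nonzero probability because higher-weight error vectors with the same syndrome contribute to the average, even though only the lowest-weight one is the desired target.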
A simple approach proposed in [6] to avoid disturbance is to restrict the training set to only target error vectors of weight up to the guaranteed error-correction capability of the code, $t = \lfloor (d_{\min} - 1)/2 \rfloor$, where $d_{\min}$ is the minimum distance of the code. This set is guaranteed to have a single error vector for each syndrome. However, under this approach, the neural network is unlikely to learn to predict error vectors of larger weights, which is precisely what is needed in order to outperform a bounded-distance decoder.
[...] 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0). This prediction is always incorrect, as it does not even correspond to the input syndrome.
An explanation for this behavior is that, under the architecture and training approach of Section II-B, the NN is modeling the bitwise posterior probability $\tilde{e}_i = P[e_i = 1 \mid s, |y|]$. While this approach can potentially lead to a low bit-error rate (BER), it is clearly unsuited to obtaining low BLER. On the other hand, regarding the problem as a multiclass classification among all possible error vectors (e.g., using softmax output activation with categorical cross-entropy loss) [16] is clearly computationally infeasible unless the number of classes is very small.

B. Iterative Error Decimation
Rather than modifying the training procedure to avoid disturbance as in [6], we propose to modify the decoder so as to make it robust to the superposition of error patterns.
Our approach is to perform $T - 1$ iterations, in each of which the single bit that is most likely (as estimated by the neural network) to be in error is selected; this bit is then flipped in the received vector and the decoding is repeated, until the $T$th iteration, where thresholding at 0.5 is applied. We call this procedure iterative error decimation (IED). The underlying idea is that, after a bit error is (correctly) eliminated, the resulting problem becomes easier to solve, leading to more confident estimates. Note that IED can be applied to any syndrome-based neural decoder, without requiring any changes in the training stage.
A detailed description of the decoder is given in Algorithm 1. Note that, in line 8, we assume that the NN outputs probability estimates. In line 10, we select the position $i$ of the largest (thus, most confident) element of the vector $\tilde{e}$. The decimation step occurs at line 11, where the sign of the received vector is flipped at the position estimated to be in error. Since we assume certainty that the chosen position is in error, in principle we could also set the magnitude $|y_i|$ to infinity (or to a very large value). However, in our experiments we observed that setting $|y_i|$ to too high a value actually hurts performance, possibly because such values were not observed during training. In practice, we found that the best results are obtained when we do not change the magnitude $|y_i|$.
The algorithm stops when a zero syndrome has been obtained (line 4) or when $T$ iterations have been performed, at which point thresholding is applied to the remaining error estimate.
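The procedure described above can be sketched as follows. This is one reading of the textual description, not a transcription of Algorithm 1: the name `ied_decode`, the accumulation of flips by XOR, and the argument `predict` (any callable returning per-bit error probabilities from the syndrome and reliabilities) are assumptions. To keep the sketch runnable without a trained network, the stand-in predictor computes exact bitwise posteriors by brute force for the (7,4) Hamming code, with an arbitrarily chosen noise level.

```python
import math
from itertools import product

H = [  # (7,4) Hamming parity-check matrix, illustrative stand-in only
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(bits, H):
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def ied_decode(y, H, predict, T):
    """Iterative error decimation with at most T calls to `predict`."""
    y = list(y)
    n = len(y)
    e_hat = [0] * n
    for t in range(1, T + 1):
        s = syndrome([1 if v < 0 else 0 for v in y], H)
        if not any(s):                     # zero syndrome: stop early
            break
        e_tilde = predict(s, [abs(v) for v in y])
        if t < T:
            i = max(range(n), key=lambda j: e_tilde[j])  # most confident bit
            y[i] = -y[i]                   # decimation: flip sign, keep magnitude
            e_hat[i] ^= 1
        else:
            for j, prob in enumerate(e_tilde):           # final thresholding
                if prob > 0.5:
                    e_hat[j] ^= 1
    return e_hat

def brute_force_predict(s, reliab, sigma=0.6):
    """Hypothetical stand-in for the NN: exact bitwise posterior by enumeration."""
    n = len(reliab)
    num, total = [0.0] * n, 0.0
    for e in product((0, 1), repeat=n):
        if syndrome(e, H) != list(s):
            continue
        w = math.exp(-sum(2.0 * r / sigma ** 2 for r, b in zip(reliab, e) if b))
        total += w
        for j, b in enumerate(e):
            num[j] += w * b
    return [v / total for v in num]

y = [0.9, 1.2, -0.3, 0.8, 1.1, 0.7, 1.4]  # position 2 received with wrong sign
print(ied_decode(y, H, brute_force_predict, T=3))
```

In this example the first iteration flips position 2, the syndrome then becomes zero, and the loop exits before the remaining iterations are used. Only the sign of the selected position is changed, in line with the observation that leaving the magnitude untouched worked best.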
Clearly, the complexity of one iteration of the IED decoder is dominated by that of the NN inference step. Since the number of iterations is at most $T$, the maximum latency is at most $T$ times that of a conventional SBND. On the other hand, the average number of iterations admits an upper bound in terms of $P[E_t]$, the block error probability of an IED decoder with $t$ iterations. Thus, compared to a conventional SBND, the relative increase in the average complexity is typically very small and becomes negligible for high $E_b/N_0$.

IV. EXPERIMENTS AND RESULTS
In this section, we investigate the BLER performance of the decoders described in Sections II-B and III-B for the linear codes BCH(63,45) and BCH(63,36), where BCH($n$, $k$) denotes a primitive narrow-sense binary BCH code of length $n$ and dimension $k$. For comparison purposes, we use the best results obtained in [11], [12], [15] as well as the HD-BDD and ML [17] performances. With respect to BER performance, we compare specifically with [12], [15] and [18] (note that [11] presents only BLER performance). All simulations were performed using the Keras API with the TensorFlow backend.
For the training of the DNNs, we used $10^7$ examples (generated in real time) at $E_b/N_0 = 4$ dB. This value of $E_b/N_0$ is suggested in [7] as a well-balanced choice for training.
In the inference stage, the BLER was estimated by running Monte Carlo simulations until the occurrence of at least 100 block errors for each value of $E_b/N_0$.
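The stopping rule used for BLER estimation can be sketched as a simple loop. The function `estimate_bler`, the safety cap `max_blocks`, and the Bernoulli `toy_trial` that stands in for "transmit, decode, compare" are all hypothetical and serve only to show the rule itself.

```python
import random

def estimate_bler(block_error_trial, min_errors=100, max_blocks=10**7, rng=None):
    """Simulate blocks until `min_errors` block errors have been observed.

    `block_error_trial(rng)` must return True when the decoded block is wrong.
    `max_blocks` is a safety cap added for this sketch.
    """
    rng = rng or random.Random()
    errors = blocks = 0
    while errors < min_errors and blocks < max_blocks:
        blocks += 1
        if block_error_trial(rng):
            errors += 1
    return errors / blocks, errors, blocks

def toy_trial(rng, p=0.2):
    """Stand-in for a full channel-plus-decoder simulation of one block."""
    return rng.random() < p

bler, errors, blocks = estimate_bler(toy_trial, rng=random.Random(1))
print(bler, errors, blocks)
```

Stopping on a fixed number of errors rather than a fixed number of blocks keeps the relative accuracy of the estimate roughly constant across operating points.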

A. BCH(63,45) code
For the BCH(63,45) code, we use the DNN architecture presented in [15], which has seven fully connected layers. The six hidden layers have 300 units each and use a rectified linear unit (ReLU) as activation function [19].
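To give a sense of the size of such a network, the following sketch counts the weights and biases of a fully connected network with the layer widths described above. The input size is an assumption: it is taken here as the syndrome length plus the block length, i.e., the concatenation of the syndrome and the reliabilities, which may differ from the exact input encoding used in [15].

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given widths."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

n, k = 63, 45
input_size = (n - k) + n                 # assumed: syndrome bits + reliabilities
layer_sizes = [input_size] + [300] * 6 + [n]   # six hidden layers, one output layer
total = dense_param_count(layer_sizes)
print(input_size, total)                 # 81 495063 under these assumptions
```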
Following the same procedure described in [15], for this architecture the learning rate is initialized to $10^{-3}$ and is multiplied by $10^{-1}$ when the validation loss stops decreasing for 5 epochs. Fig. 2 shows the performance achieved with the SBND proposed in [13] and the IED decoder using the DNN designed in [15]. It is observed that the result obtained in [15] already exceeds the performances shown in [11], [12]. In turn, with the same DNN and using the proposed IED decoder we achieve even better performance. For $T \in [2, 5]$, we observe a gradual improvement, reaching up to 0.7 dB (for $T = 5$) compared to the result obtained in [15], at BLER = $10^{-3}$. Our tests indicate that, for $T > 5$, the improvement is not significant.

B. BCH(63,36) code
For the BCH(63,36) code, we propose an 8-layer architecture, with seven fully connected hidden layers, each of which has $8n = 504$ units and uses the logistic sigmoid activation function. We also include a single skip connection (concatenation) from the first to the fourth layer. All hidden layers are followed by batch normalization layers to help with the stability and acceleration of the learning process [19].
[Fig. 2. Performance of the SBND [13] and the IED decoder for the BCH(63,45) code, using the DNN in [15]. Legend entries: HD-BDD, Lugosch and Gross [11], Be'ery et al. [12], SBND [15].]
For the learning rate, we obtained our best results by applying a triangular cyclic schedule [20] with minimum at $10^{-5}$, maximum at $10^{-3}$, and a half-cycle of 64 iterations. Fig. 3 shows the performance of the proposed DNN with the decoder in [13] and the IED decoder. In Fig. 3(a), "(w/o BN)" indicates a version with the batch normalization layers removed and "(relu, w/o BN)" indicates a further modification where the sigmoid activation of the hidden layers is replaced by ReLU. We can see that the combined use of the sigmoid activation and the batch normalization layers significantly improves the performance.
Again, it can be seen that the IED decoder achieves better performance than the results in the literature, including those of [13]. As in the case of the BCH(63,45) code, our best result is obtained when $T = 5$, providing a gain of approximately 0.8 dB at BLER = $10^{-3}$.
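The triangular cyclic schedule mentioned above can be sketched in a few lines. The function name is hypothetical, and the interpretation of an "iteration" as one training step, with the cycle starting at the minimum, is an assumption.

```python
def triangular_lr(iteration, lr_min=1e-5, lr_max=1e-3, half_cycle=64):
    """Learning rate that rises linearly for `half_cycle` steps, then falls back."""
    position = iteration % (2 * half_cycle)
    if position <= half_cycle:
        fraction = position / half_cycle          # rising edge
    else:
        fraction = 2.0 - position / half_cycle    # falling edge
    return lr_min + (lr_max - lr_min) * fraction

for step in (0, 32, 64, 96, 128):
    print(step, triangular_lr(step))
```

With these settings the rate starts at the minimum, reaches the maximum after 64 steps, and returns to the minimum after 128 steps, after which the pattern repeats.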

C. Comparison with the Syndrome Loss
To investigate whether the problem of disturbance discussed in Section III-A could be solved by simply penalizing syndrome violations (without IED), we have trained the DNNs of Sections IV-A and IV-B using the decoder of [13] and the hybrid loss function proposed in [11], which incorporates a syndrome loss component besides the BCE loss. We experimented with training from scratch and with training after pretraining with the BCE loss. However, in both cases, the results were worse than using only the BCE loss and therefore were not included in the figures. This is not surprising since the syndrome loss was proposed in the context of belief-propagation decoding. Moreover, it ideally implies committing to a single rather than multiple superimposed error vectors, which may simply be too hard to learn under an inconsistent training set. In contrast, the BCE loss makes no such commitment, allowing the first iteration of the IED decoder to find and flip the single bit that is most likely to be in error.

V. CONCLUSION
In this letter, we proposed a new decoder that uses the knowledge of the syndrome vector to feed a DNN designed to estimate the error pattern, where a stage of selecting the most confident positions to correspond to errors is used in order to improve estimation of the transmitted codeword. In addition, we designed a new DNN for decoding the BCH(63,36) code.
The results obtained for the BCH(63,45) and BCH(63,36) codes show that the new decoding algorithm improves the performance of the SBND presented in [13], at the price of a moderate increase in complexity. The IED decoder is flexible in the sense that it can be directly applied to any syndrome-based neural decoder without retraining.