A SORTING-BASED APPROXIMATION OF THE SUM-PRODUCT ALGORITHM

A complexity- and delay-efficient simplification of the sum-product algorithm (SPA) for decoding low-density parity-check (LDPC) codes is presented. The key feature of the new algorithm consists of a modification of the complexity-intensive and delay-causing update equations at the check nodes of the factor graph of the LDPC code. The modified update equations at a check node are based on ordering the reliability values of the incoming messages and on using a balanced tree topology to achieve optimum parallel processing. Furthermore, the complexity of the new algorithm can be adjusted: the least complex version of the algorithm corresponds to the so-called min-sum approximation, and the most complex version gives the full SPA.


INTRODUCTION
Owing to their outstanding performance, low-density parity-check (LDPC) codes [1] are currently recognized as the best class of codes for approaching the Shannon limit efficiently using iterative decoding [2]. Efficient implementation of the decoding algorithm in hardware has become an area of increased interest (see, e.g., [3, 4] and references therein). In particular, substantial reductions of the total decoding complexity have been obtained by considering simplifications of the core operations in the sum-product algorithm (SPA).

Xiao-Yu Hu and Thomas Mittelholzer are with IBM Research, Zurich Research Laboratory, 8803 Rüschlikon, Switzerland (e-mails: {xhu,tmi}@zurich.ibm.com). The results of this paper have been presented at ITS'02 in Natal, Brazil.
Here we focus on various binary tree representations associated with the check nodes in the factor graph of an LDPC code. We propose a modification of the highly complex and delay-causing update equations at the check nodes that is based on ordering the reliability values of the incoming messages. Specifically, these messages are split into the set of a few least reliable messages and the set of all other messages, which are treated as being fully reliable. In this way, the complexity of the update equations can essentially be reduced to the case of only a few incoming messages. This reduction in complexity is most pronounced for high-rate LDPC codes, where each check node is connected to a large number of symbol nodes. The proposed simplification does not rely on reduced-complexity approximations of the core operations; therefore, one can achieve additional complexity reductions by applying such approximate core operations [3, 4, 5].
A critical performance issue of all turbo-like codes is the decoding delay that is inherent in iterative decoding of block codes. Following [4, 6], we devote special attention to low-delay implementations, which are based on balanced tree topologies for maximum parallel processing.
The paper is organized as follows. In Section 2, the SPA is reviewed, and the complexity of its different parts is discussed briefly. In Section 3, different implementations of the check-node updates are considered in terms of special factor graphs that are binary trees. Section 4 is devoted to the use of ordered statistics of the incoming messages at the check nodes and to simplifications of the factor graph, both of which reduce the delay as well as the total complexity. In Section 5, simulation results comparing the full SPA with the proposed simplified versions are shown. Finally, in Section 6, the key features of the proposed simplified algorithm are summarized.

THE SUM-PRODUCT ALGORITHM IN THE LOG-LIKELIHOOD DOMAIN
The SPA is an efficient iterative algorithm for decoding LDPC codes. In particular, given a received word $y = (y_1, y_2, \ldots, y_N)$, which corresponds to some transmitted binary codeword $x = (x_1, x_2, \ldots, x_N)$, the SPA updates in each iteration step its "temporal beliefs" about the components $x_n$ of the transmitted codeword given $y$ and, eventually, decides which codeword components were sent. We describe the SPA in the log-likelihood domain using notation similar to that in [7].
For a given binary sparse parity-check matrix $H$, we denote by $\mathcal{M}(n)$ the set of check nodes that are connected to symbol node $n$, i.e., the set of indices of the entries equal to "1" in the $n$-th column of $H$, and by $\mathcal{L}(m)$ the set of symbol nodes that participate in the $m$-th parity-check equation, i.e., the set of indices of the entries equal to "1" in the $m$-th row of $H$. We denote by $\mathcal{L}(m)\setminus n$ the set $\mathcal{L}(m)$ from which the $n$-th symbol node is excluded. Similarly, for the check-node sets, we introduce the set-theoretic difference $\mathcal{M}(n)\setminus m$. We denote by $L(q_{n\to m})$ the message that symbol node $n$ sends to check node $m$, which is the log-likelihood ratio (LLR) of some temporal likelihood that the $n$-th symbol is "0" or "1". Similarly, $L(r_{m\to n})$ denotes the message that the $m$-th check node sends to the $n$-th symbol node, which again is an LLR expressing the temporal belief that the $m$-th check node has about the $n$-th symbol being "0" or "1". The SPA based on log-likelihood messages can be stated as follows.
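As an illustration of this notation, the following Python sketch builds the index sets $\mathcal{M}(n)$ and $\mathcal{L}(m)$ from a small parity-check matrix. The matrix `H_example` and the function name `neighbor_sets` are hypothetical, and indices are 0-based rather than 1-based.

```python
from typing import Dict, List, Set, Tuple

def neighbor_sets(H: List[List[int]]) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Return (M, L) with M[n] = checks connected to symbol n and
    L[m] = symbols participating in check m (0-based indices)."""
    M: Dict[int, Set[int]] = {n: set() for n in range(len(H[0]))}
    L: Dict[int, Set[int]] = {m: set() for m in range(len(H))}
    for m, row in enumerate(H):
        for n, h in enumerate(row):
            if h == 1:          # entry "1" in row m, column n
                M[n].add(m)
                L[m].add(n)
    return M, L

# Hypothetical 3 x 6 parity-check matrix, for illustration only.
H_example = [[1, 1, 0, 1, 0, 0],
             [0, 1, 1, 0, 1, 0],
             [1, 0, 0, 0, 1, 1]]
M, L = neighbor_sets(H_example)
print(M[1])        # M(1): checks touching symbol 1 -> {0, 1}
print(L[0] - {1})  # L(0) \ 1 -> {0, 3}
```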

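1. Initialization

Each symbol node $n$ is assigned an initial LLR $L(p_n)$. In the case of equiprobable inputs on a memoryless AWGN channel with noise variance $\sigma^2$ and the mapping $0 \mapsto +1$, $1 \mapsto -1$, the initial LLR is $L(p_n) = 2y_n/\sigma^2$. For every $n$ and every $m \in \mathcal{M}(n)$, the symbol-to-check messages are initialized as $L(q_{n\to m}) = L(p_n)$.

2. Check-node update

Each check node $m$ computes, for every $n \in \mathcal{L}(m)$,

$$L(r_{m\to n}) = 2\tanh^{-1}\Big(\prod_{n' \in \mathcal{L}(m)\setminus n} \tanh\frac{L(q_{n'\to m})}{2}\Big). \qquad (1)$$

The outgoing message $L(r_{m\to n})$ is referred to as extrinsic information because it does not depend on the incoming message $L(q_{n\to m})$.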
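A minimal sketch of the initialization and of the check-node update (1) for a single check node, written directly from the tanh rule. It assumes the BPSK mapping $0 \mapsto +1$, $1 \mapsto -1$ and LLRs defined as $\ln(P(0)/P(1))$; the names `channel_llr` and `check_node_update_tanh` are hypothetical. The product is clamped slightly inside $(-1, 1)$ because `math.atanh` raises an error at $\pm 1$, which floating-point rounding can reach for very large LLRs.

```python
import math
from typing import List

def channel_llr(y: float, sigma2: float) -> float:
    """Initial LLR L(p_n) = 2 y_n / sigma^2 (assumes BPSK 0 -> +1, 1 -> -1)."""
    return 2.0 * y / sigma2

def check_node_update_tanh(q: List[float]) -> List[float]:
    """Extrinsic messages of one check node via the tanh rule (1).

    q[i] is the incoming message L(q_{n_i -> m}); output i excludes q[i].
    """
    limit = 1.0 - 1e-15                 # keep atanh's argument inside (-1, 1)
    r = []
    for i in range(len(q)):
        prod = 1.0
        for j, qj in enumerate(q):
            if j != i:                  # extrinsic: skip the target symbol
                prod *= math.tanh(qj / 2.0)
        prod = max(-limit, min(limit, prod))
        r.append(2.0 * math.atanh(prod))
    return r

print(channel_llr(0.5, 0.25))           # 4.0
print(check_node_update_tanh([1.0, -2.0, 3.0]))
```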

3. Symbol-node update
Each symbol node $n$ propagates its likelihood information to all the check nodes that connect to it. The outgoing messages are calculated as

$$L(q_{n\to m}) = L(p_n) + \sum_{m' \in \mathcal{M}(n)\setminus m} L(r_{m'\to n}).$$

The decoder obtains the total temporal a-posteriori information for symbol $n$ by summing the likelihoods from all the check nodes that connect to symbol $n$:

$$\lambda_n = L(p_n) + \sum_{m \in \mathcal{M}(n)} L(r_{m\to n}).$$

The algorithm iterates until a valid codeword has been found, i.e., the hard decision of the temporal a-posteriori vector $(\lambda_1, \ldots, \lambda_N)$ satisfies all parity checks of $H$, or a preset maximum number of iterations has been reached.
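The stopping rule can be sketched as follows, assuming LLRs defined as $\ln(P(0)/P(1))$ so that a negative $\lambda_n$ decides "1". The names `hard_decision`, `is_codeword`, and `should_stop` are hypothetical, and `max_iter` stands for the preset maximum number of iterations.

```python
from typing import List

def hard_decision(lam: List[float]) -> List[int]:
    """Hard decision on the a-posteriori LLRs lambda_n (negative -> 1)."""
    return [0 if v >= 0 else 1 for v in lam]

def is_codeword(H: List[List[int]], x: List[int]) -> bool:
    """True if every parity check (row of H) is satisfied by x."""
    return all(sum(h * xi for h, xi in zip(row, x)) % 2 == 0 for row in H)

def should_stop(H: List[List[int]], lam: List[float], iteration: int, max_iter: int) -> bool:
    """Stop when a valid codeword is found or the iteration budget is spent."""
    return is_codeword(H, hard_decision(lam)) or iteration >= max_iter
```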
Obtaining an efficient implementation of the symbol-node updates is straightforward. By first forming the a-posteriori information $\lambda_n$, the extrinsic information terms are given by $L(q_{n\to m}) = \lambda_n - L(r_{m\to n})$. Thus, the total computational load for a symbol-node update is only $2|\mathcal{M}(n)|$ additions, where the cardinality $|\mathcal{M}(n)|$ denotes the degree of the variable node $n$.
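The sum-then-subtract evaluation just described can be sketched for one symbol node as below; the function name and the dictionary representation of the incoming messages are assumptions made for the example.

```python
from typing import Dict, Tuple

def symbol_node_update(p: float, r_in: Dict[int, float]) -> Tuple[float, Dict[int, float]]:
    """Symbol-node update with 2|M(n)| additions/subtractions.

    p    : channel LLR L(p_n)
    r_in : {m: L(r_{m->n})} for all m in M(n)
    returns (lambda_n, {m: L(q_{n->m})})
    """
    lam = p
    for r in r_in.values():                        # |M(n)| additions
        lam += r
    q_out = {m: lam - r for m, r in r_in.items()}  # |M(n)| subtractions
    return lam, q_out

lam, q_out = symbol_node_update(1.0, {0: 0.5, 2: -2.0, 5: 1.5})
print(lam, q_out)  # 1.0 {0: 0.5, 2: 3.0, 5: -0.5}
```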
The check-node update is the most complex part of the SPA. Two issues influence its computational complexity: the topology used to compute the multiple outgoing messages and the implementation of the core operation. Consider a regular LDPC code of rate $R \ge 1 - j/k$, where $j$ and $k$ are the column and the row weight of its parity-check matrix, respectively. If this code is of high rate, its check nodes are connected to many variables. For instance, if $1 - j/k \ge 0.9$ and the column weight is $j = 4$, the check nodes are connected to at least 40 variables, i.e., $k \ge 40$. For an irregular LDPC code, the largest row weight can be even larger.
The core operation of the check-node update in (1) is the hyperbolic tangent function, which is difficult to implement in digital hardware. Analog realizations of the hyperbolic tangent function have been investigated in [8, 9] with the aim of facilitating extremely high-speed applications. The main difficulty in analog circuits is not the complexity of the hyperbolic tangent function but rather stability and synchronization issues. In Gallager's approach [1], the core operation of the check-node update in (1) was transformed into another form, i.e.,

$$L(r_{m\to n}) = \Big(\prod_{n' \in \mathcal{L}(m)\setminus n} \operatorname{sign}\big(L(q_{n'\to m})\big)\Big) \cdot f\Big(\sum_{n' \in \mathcal{L}(m)\setminus n} f\big(|L(q_{n'\to m})|\big)\Big),$$

where

$$f(x) = \ln\frac{e^x + 1}{e^x - 1} = -\ln\tanh\frac{x}{2}$$

is an involution for $x > 0$, i.e., $f(f(x)) = x$. By introducing the involution $f(\cdot)$, one can run check-node updates efficiently from both the computational- and time-complexity viewpoints, in a way similar to the efficient implementation of symbol-node updates described above. However, it is not evident how to implement the involution function $f(\cdot)$ efficiently in either analog or digital circuits because of its singularity at 0.
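A sketch of Gallager's form, showing how the involution allows the same sum-then-subtract trick as the symbol-node update: all $f(|L(q_{n'\to m})|)$ are summed once, and each extrinsic term is obtained by subtraction. The clamp constant `EPS` is an assumption introduced here to step around the singularity of $f$ at 0 (it caps the output magnitude at roughly 28); it is not part of the original formulation, and the function names are hypothetical.

```python
import math
from typing import List

EPS = 1e-12  # assumed floor that keeps f away from its singularity at 0

def f(x: float) -> float:
    """Gallager's involution f(x) = ln((e^x + 1)/(e^x - 1)) = -ln tanh(x/2), x > 0."""
    return -math.log(math.tanh(max(x, EPS) / 2.0))

def check_node_update_gallager(q: List[float]) -> List[float]:
    """Extrinsic check-node messages: one f per input, one sum, k subtractions."""
    mags = [f(abs(v)) for v in q]
    total = sum(mags)
    neg = sum(1 for v in q if v < 0)          # number of negative inputs
    out = []
    for v, fv in zip(q, mags):
        # sign = product of the other signs = parity of the other negatives
        sign = -1.0 if (neg - (v < 0)) % 2 else 1.0
        out.append(sign * f(total - fv))
    return out
```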

BINARY TREES FOR THE CHECK-NODE CONSTRAINTS
The motivation to consider binary trees is based on the fact that the complexity-intensive computation of the check-node update can be obtained by repeated application of the identity

$$L(U \oplus V) = 2\tanh^{-1}\Big(\tanh\frac{L(U)}{2}\,\tanh\frac{L(V)}{2}\Big), \qquad (2)$$

which holds for two independent binary random variables $U$ and $V$. Moreover, as noted in [10], the right-hand side in (2) can be written as

$$L(U \oplus V) = \ln\frac{1 + e^{L(U)}e^{L(V)}}{e^{L(U)} + e^{L(V)}}. \qquad (3)$$

For the right-hand side in (3), which is a function of $L(U)$ and $L(V)$, we will use the well-established notation $L(U) \boxplus L(V)$ [10]. The domain of the $\boxplus$-operation can be extended to the set of real numbers together with $\pm\infty$. In this way, one obtains a monoid with $+\infty$ as neutral element [10]. With this notation, the check-node update (1) can be rewritten as

$$L(r_{m\to n}) = \mathop{\boxplus}_{n' \in \mathcal{L}(m)\setminus n} L(q_{n'\to m}),$$

where $\boxplus_{i=1}^{k} L(U_i)$ denotes the $k$-fold $\boxplus$-sum.
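The $\boxplus$-operation (3), including its extension to $\pm\infty$, can be sketched as follows. The evaluation uses the algebraically equivalent form $\operatorname{sign}(a)\operatorname{sign}(b)\min(|a|,|b|) + \ln(1+e^{-|a+b|}) - \ln(1+e^{-|a-b|})$, chosen here only to avoid overflow in $e^{a+b}$; the function names are hypothetical.

```python
import math
from functools import reduce
from typing import Iterable

INF = math.inf

def boxplus(a: float, b: float) -> float:
    """a boxplus b = ln((1 + e^a e^b) / (e^a + e^b)), Eq. (3), extended to +-inf."""
    if a == INF:                 # +inf is the neutral element of the monoid
        return b
    if b == INF:
        return a
    if a == -INF:                # -inf only flips the sign of the other argument
        return -b
    if b == -INF:
        return -a
    s = math.copysign(1.0, a) * math.copysign(1.0, b)
    return (s * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def boxplus_sum(values: Iterable[float]) -> float:
    """k-fold boxplus-sum; the empty sum is the neutral element +inf."""
    return reduce(boxplus, values, INF)
```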
Without loss of essential generality, we will focus on a single check node, say $m = 1$, and assume that it checks the binary input symbols $x_1, x_2, \ldots, x_k$. This check node can be represented by a $k$-ary tree, in which the root corresponds to check node $m = 1$ and the $k$ symbols represent the leaves. To reduce the check-node updates (1), each operating on $k - 1$ arguments, to a repeated application of the 2-argument formula (3), we transform the $k$-ary tree into a factor graph, which is a binary tree, by using suitable state variables $S_v$. Two trees that achieve this task are shown in Fig. 1.
In Fig. 1, the double circles represent binary state variables. Each state variable, say $S_v$, is the modulo-2 sum of the two symbol or state nodes, say $U_v$ and $V_v$, checked by the binary check node leading towards state node $S_v$ (starting from the leaves towards the root), i.e., $S_v = U_v \oplus V_v$. The check-node updates (1) can be computed by making a forward and a backward pass on either of these factor graphs using the $\boxplus$-function (3) or some approximation thereof [3, 4, 5]. For both factor graphs in Fig. 1, the total computational load for check node $m = 1$ consists of the forward recursive computation of $L(S_v)$ for the $k - 2$ state variables, the backward recursion for the latter, and the final backward recursion step to the leaves, which amounts to $3(k-2)$ core operations $L(U) \boxplus L(V)$ (see [11] for the complexity computation on the maximally unbalanced tree).
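The forward-backward computation on the maximally unbalanced tree of Fig. 1(a) can be sketched as below; it also counts the $\boxplus$-operations so that the total of $3(k-2)$ can be checked. For brevity, the exact core operation is evaluated through the equivalent tanh rule with a small clamp; the function names are hypothetical.

```python
import math
from typing import List, Tuple

def boxplus(a: float, b: float) -> float:
    """Exact 2-argument core operation (3), evaluated via the equivalent tanh rule."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    limit = 1.0 - 1e-15                       # keep atanh's argument inside (-1, 1)
    return 2.0 * math.atanh(max(-limit, min(limit, p)))

def extrinsic_chain(L: List[float]) -> Tuple[List[float], int]:
    """All k extrinsic outputs of one check node (k >= 2) on the
    maximally unbalanced tree, plus the number of boxplus operations."""
    k = len(L)
    ops = 0
    fwd = list(L)   # fwd[i] becomes L[0] boxplus ... boxplus L[i]
    bwd = list(L)   # bwd[i] becomes L[i] boxplus ... boxplus L[k-1]
    for i in range(1, k - 1):                 # forward pass: k-2 state variables
        fwd[i] = boxplus(fwd[i - 1], L[i])
        ops += 1
    for i in range(k - 2, 0, -1):             # backward pass: k-2 state variables
        bwd[i] = boxplus(bwd[i + 1], L[i])
        ops += 1
    out = [bwd[1]] + [0.0] * (k - 2) + [fwd[k - 2]]
    for i in range(1, k - 1):                 # final step to the leaves: k-2
        out[i] = boxplus(fwd[i - 1], bwd[i + 1])
        ops += 1
    return out, ops
```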
The factor graph in Fig. 1(a) will be referred to as the maximally unbalanced tree (note that this factor graph is topologically equivalent to a binary tree obtained by deleting all state variables). Various simplifications of the SPA have been derived from this maximally unbalanced tree [3, 5]. The other factor graph, Fig. 1(b), which is a balanced tree, has been proposed in [6] and implicitly in [4]. For the design of parallel algorithms, the balanced tree clearly results in a much smaller delay, i.e., about $2\log_2 k$ core operations, compared with a delay of about $k$ consecutive core operations for the maximally unbalanced tree (when running the forward and backward passes in parallel).
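To illustrate the delay argument, the sketch below computes the same extrinsic messages on a balanced tree (an upward pass followed by a downward pass) and reports the length of the longest chain of consecutive $\boxplus$-operations leading to any output. The recursive formulation and all names are assumptions for illustration. For $k = 8$ it reports a depth of 4, within the "about $2\log_2 k$" estimate, whereas the forward-backward pass on the maximally unbalanced tree needs a chain of about $k$ operations.

```python
import math
from typing import List, Optional, Tuple

def boxplus(a: float, b: float) -> float:
    """Exact 2-argument core operation (3), evaluated via the equivalent tanh rule."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    limit = 1.0 - 1e-15
    return 2.0 * math.atanh(max(-limit, min(limit, p)))

def _up(L: List[float], lo: int, hi: int) -> tuple:
    """Upward pass over leaves L[lo:hi]: (value, left, right, depth).
    The root value is not needed later but is computed for simplicity."""
    if hi - lo == 1:
        return (L[lo], None, None, 0)
    mid = (lo + hi) // 2
    left, right = _up(L, lo, mid), _up(L, mid, hi)
    return (boxplus(left[0], right[0]), left, right, max(left[3], right[3]) + 1)

def _down(node: tuple, outside: Optional[Tuple[float, int]], out: list) -> None:
    """Downward pass; outside = (value, depth) of the boxplus-sum of all
    leaves outside this subtree, or None at the root."""
    _, left, right, _ = node
    if left is None:                       # leaf: store its extrinsic message
        out.append(outside)
        return
    for child, sibling in ((left, right), (right, left)):
        if outside is None:                # child of the root: sibling is everything else
            msg = (sibling[0], sibling[3])
        else:
            msg = (boxplus(outside[0], sibling[0]), max(outside[1], sibling[3]) + 1)
        _down(child, msg, out)

def extrinsic_balanced(L: List[float]) -> Tuple[List[float], int]:
    """Extrinsic messages (k >= 2) and the critical-path depth in boxplus ops."""
    out: list = []
    _down(_up(L, 0, len(L)), None, out)
    return [v for v, _ in out], max(d for _, d in out)
```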
In this paper, we propose an algorithm that operates on a strongly reduced factor graph for each single parity-check equation. This simplified balanced factor graph is obtained by searching for the $z$ ($z \ge 2$) least reliable symbol nodes in each parity-check equation.

ORDERED STATISTICS FOR CHECK-NODE UPDATES
This section is devoted to simplifying the balanced-tree factor graph by using the ordered statistics of the reliability values of the symbol nodes. An example of such a simplification is given by a parallel min-sum version of the SPA, as described in the following subsection.

PARALLEL MIN-SUM CHECK-NODE UPDATES
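We consider the SPA on the balanced tree with a simplified $\boxplus$-operation,

$$L(U) \boxplus L(V) \approx \operatorname{sign}\big(L(U)\big)\operatorname{sign}\big(L(V)\big)\min\{|L(U)|, |L(V)|\}.$$

This simplified SPA is known as the "min-sum algorithm" [6]. By performing the min-sum update rule on the factor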
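The min-sum check-node update only needs the smallest and second-smallest incoming magnitudes plus the sign parity. A minimal sketch, with a hypothetical function name and a plain two-pass search standing in for the partial-sorting tree:

```python
from typing import List

def min_sum_check_update(q: List[float]) -> List[float]:
    """Min-sum check-node update from the two smallest magnitudes (k >= 2)."""
    i1 = min(range(len(q)), key=lambda i: abs(q[i]))        # least reliable input
    min1 = abs(q[i1])
    min2 = min(abs(v) for i, v in enumerate(q) if i != i1)  # second smallest
    neg = sum(1 for v in q if v < 0)                        # number of negative inputs
    out = []
    for i, v in enumerate(q):
        mag = min2 if i == i1 else min1        # exclude the target's own magnitude
        sign = -1.0 if (neg - (v < 0)) % 2 else 1.0
        out.append(sign * mag)
    return out

print(min_sum_check_update([1.0, -2.0, 3.0, 0.5]))  # [-0.5, 0.5, -0.5, -1.0]
```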

SORTING-BASED CHECK-NODE UPDATES
By carrying the sorting idea of the preceding subsection further, we will obtain a family of reduced-complexity algorithms that approximate the check-node updates in the SPA. The balanced tree for the forward recursion can also be considered as a diagram for the sorting algorithm often referred to as merge-sorting [12], which is based on merging two ordered lists. Starting with the magnitudes

The complexity of the SPA on the partially balanced factor graph is mainly determined by the complexity on the left subtree emanating from state node $S'$, because full soft reliability values are computed only on this subtree. On the right subtree, which emanates from state node $S''$, the forward pass of the SPA needs only sign computations because all leaves have LLRs that are $\pm\infty$ (note that $+\infty$ is the neutral element with respect to the $\boxplus$-operation).
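To obtain a low-latency implementation of the check-node updates, we simplify the partially balanced tree. Let $i_1, \ldots, i_k$ index the symbols in order of increasing reliability, $|L(x_{i_1})| \le \cdots \le |L(x_{i_k})|$. The sign information is then given by

$$\operatorname{sign}\big(L(S'')\big) = \prod_{l=z+1}^{k} \operatorname{sign}\big(L(x_{i_l})\big), \qquad \operatorname{sign}\big(L(S)\big) = \prod_{l=1}^{k} \operatorname{sign}\big(L(x_{i_l})\big). \qquad (9)$$

The resulting factor graph is shown in Fig. 3. Note that this graph can be further simplified by removing the state node $S'$ and by incorporating its constraint into the check node at the root. The SPA on this simplified factor graph can either be carried out in its full version or by using some simplification of the $\boxplus$-operations (see, e.g., [4]).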
For $z = 2$, the algorithm essentially corresponds to the min-sum approximation. However, there is the following slight difference: in the min-sum algorithm, the reliability of the messages for the hard-decision nodes is approximated

Note that the partial-sorting idea can also be applied to Gallager's decoding approach as described in Section 2. The reliability values for the outgoing messages are then computed from the $z$-fold $\boxplus$-sum $L(x_{i_1}) \boxplus L(x_{i_2}) \boxplus \cdots \boxplus L(x_{i_z})$.
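The following Python sketch illustrates one reading of the sorting-based check-node update described in this subsection: the $z$ least reliable inputs keep their exact LLRs, and all other inputs are treated as fully reliable ($\pm\infty$), so they contribute only their signs. The function name is hypothetical, a full sort stands in for the fast partial sort, and each extrinsic magnitude among the $z$ least reliable symbols is recomputed directly instead of on a balanced subtree. The outputs are the same, but the operation count is not the one given in the complexity analysis. With $z = k$ the sketch reproduces the full SPA, and with $z = 2$ the messages to the two least reliable symbols coincide with min-sum.

```python
import math
from functools import reduce
from typing import List

def boxplus(a: float, b: float) -> float:
    """Exact 2-argument core operation (3), evaluated via the equivalent tanh rule."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    limit = 1.0 - 1e-15
    return 2.0 * math.atanh(max(-limit, min(limit, p)))

def sb_check_update(q: List[float], z: int) -> List[float]:
    """Sketch of the z-SB check-node update for one check node (2 <= z <= k)."""
    k = len(q)
    order = sorted(range(k), key=lambda i: abs(q[i]))   # increasing reliability
    least = order[:z]                                   # z least reliable symbols
    least_set = set(least)
    neg = sum(1 for v in q if v < 0)                    # sign info from all k inputs
    # Magnitude sent to every "fully reliable" symbol: boxplus of the z least
    # reliable magnitudes (the +inf leaves are neutral and drop out).
    common = reduce(boxplus, (abs(q[i]) for i in least))
    out = []
    for i in range(k):
        if i in least_set:
            # exact SPA among the z least reliable, excluding symbol i itself
            mag = reduce(boxplus, (abs(q[j]) for j in least if j != i))
        else:
            mag = common
        sign = -1.0 if (neg - (q[i] < 0)) % 2 else 1.0   # product of the other signs
        out.append(sign * mag)
    return out
```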

FAST PARTIAL SORTING
The partial-sorting problem, which consists of finding the $z$ smallest elements in a set of $k$ real values, is a problem whose worst-case complexity is difficult to analyze [12]. Here, we give a simple algorithm that provides a simple upper bound on the number of comparisons needed. The algorithm is based on merge-sorting as described above, but with the following modification: in the ordered list obtained from merging two smaller ordered lists, only the $z$ smallest values are kept, and all other (larger) values are deleted from the list. As a result, the maximum list size at each stage of the algorithm is $z$. A balanced tree with $k$ leaves contains a total of $k - 1$ inner nodes (nodes that are not leaves). At each of these inner nodes, at most $z$ comparisons have to be done. Hence, the total complexity of partial sorting is upper-bounded by $z(k-1)$ comparisons.
Note that when the comparisons are done in parallel at each level of the tree, the run time of the algorithm corresponds to about $\lceil \log_2 k \rceil$ comparison operations.
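The modified merge-sort can be sketched as follows. Each merge stops as soon as $z$ elements have been emitted, so it uses at most $z$ comparisons, and the $k - 1$ merges of the balanced tree give the bound $z(k-1)$. The sketch sorts plain numbers; in a decoder one would sort (magnitude, index) pairs so that the positions of the least reliable symbols are retained. Function names are hypothetical.

```python
from typing import List, Tuple

def merge_keep_z(a: List[float], b: List[float], z: int) -> Tuple[List[float], int]:
    """Merge two sorted lists, keep only the z smallest, count comparisons."""
    out: List[float] = []
    i = j = comps = 0
    while len(out) < z and i < len(a) and j < len(b):
        comps += 1
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One input is exhausted (or z values were emitted): the rest needs no comparisons.
    return (out + a[i:] + b[j:])[:z], comps

def partial_sort(vals: List[float], z: int) -> Tuple[List[float], int]:
    """z smallest values via a balanced merge tree; also returns #comparisons."""
    if len(vals) == 1:
        return list(vals), 0
    mid = len(vals) // 2
    left, c_left = partial_sort(vals[:mid], z)
    right, c_right = partial_sort(vals[mid:], z)
    merged, c_merge = merge_keep_z(left, right, z)
    return merged, c_left + c_right + c_merge

vals = [3.2, 0.7, 5.1, 1.9, 0.2, 4.4, 2.8, 6.0, 1.1]
smallest, comps = partial_sort(vals, 3)
print(smallest)                        # [0.2, 0.7, 1.1]
print(comps <= 3 * (len(vals) - 1))    # True
```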

COMPLEXITY ANALYSIS
A complexity comparison of the full SPA and the proposed sorting-based (SB) version with $z$ least reliable symbol nodes, the $z$-SB-SPA, will depend on the particular implementation of the core operations, which are the $\boxplus$-operation, the addition of real numbers $+$, the comparison of real numbers $<$, and the sign operation. The sign operation $(-1)^a(-1)^b = (-1)^{a+b}$ can be realized by the XOR-sum of the exponents; it is much less complex than the other three operations, which operate on real numbers, and, therefore, it will be neglected in the complexity analysis.

Let $T(\boxplus)$, $T(+)$, and $T(<)$ denote the time complexity in seconds of the three considered operations. In terms of run time, the three operations have the ranking

$$T(\boxplus) > T(+) > T(<),$$
and a similar ranking applies to the complexity of a hardware implementation, e.g., in terms of number of gates or chip area. Typically, $T(\boxplus)$ is much larger than $T(+)$ and $T(<)$.
The full SPA, the min-sum algorithm, and the SB-SPA make identical symbol-node updates, but they differ in the way the check-node updates are computed. Therefore, we will restrict the complexity analysis to the check-node update. In the sequel, we will consider a single check node, which is connected to $k$ symbol nodes.
For the min-sum algorithm, we compute the minimum and the second smallest value of the $k$ incoming reliability magnitudes, which requires $2(k-1)$ comparisons when using the partial-sorting algorithm described above. From these two least reliable values, all the $k$ outgoing messages can be determined. Thus, the total complexity amounts to $2(k-1)$ comparisons, and the run time (in seconds) is $\lceil \log_2 k \rceil\, T(<)$, where $\lceil c \rceil$ denotes the smallest integer larger than or equal to $c$.
The check-node update for the full SPA is computed using the balanced factor graph as shown in Fig. 1(b), which results in a total complexity of $3(k-2)$ $\boxplus$-operations and a run time (in seconds) of $2\lceil \log_2 k \rceil\, T(\boxplus)$.
The $z$-SB-SPA proceeds in two steps to compute the check-node updates. The first step consists of fast partial sorting, which identifies the $z$ least reliable symbol nodes. In the same step, one can also obtain all the sign information (the complexity of which is negligible, as explained above). Once the $z$ least reliable symbol nodes have been identified, they can be grouped in a balanced subtree as shown in Fig. 3. In a second step, one runs the SPA on this subtree to obtain the outgoing messages that will be passed to the $z$ least reliable symbol nodes. In parallel with this second step, the outgoing messages to all the other symbol nodes are computed. These messages all have the same magnitude, which is determined by the message that is passed to the state node $S'$ as a result of the SPA on the balanced subtree. The total complexity of the $z$-SB-SPA amounts to $z(k-1)$ comparisons and $3(z-2) + 1$ $\boxplus$-operations, and the run time (in seconds) is $\lceil \log_2 k \rceil\, T(<) + 2\lceil \log_2 z \rceil\, T(\boxplus)$.

Table 1 contains a summary of the complexity computations for the three considered algorithms. The simulation results in the following section suggest that appropriate values for $z$ are 3 and 4. It is evident from Table 1 that for high-rate LDPC codes, substantial savings in complexity and delay can be achieved with $z = 3$ or 4 using the $z$-SB-SPA.
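The operation counts above can be tabulated with a few lines of Python. The function only evaluates the formulas stated in this section (it does not measure anything), and $k = 64$, $z = 3$ in the usage line are hypothetical values chosen for illustration. Run times are returned as pairs (number of sequential $T(<)$ steps, number of sequential $T(\boxplus)$ steps).

```python
def ceil_log2(x: int) -> int:
    """Smallest integer c with 2**c >= x, for integers x >= 1."""
    return (x - 1).bit_length()

def check_node_complexity(k: int, z: int) -> dict:
    """Per-check-node counts from the text: comparisons, boxplus operations,
    and run time as (#T(<) steps, #T(boxplus) steps)."""
    return {
        "min-sum": {"comparisons": 2 * (k - 1), "boxplus": 0,
                    "run_time": (ceil_log2(k), 0)},
        "full SPA": {"comparisons": 0, "boxplus": 3 * (k - 2),
                     "run_time": (0, 2 * ceil_log2(k))},
        "z-SB-SPA": {"comparisons": z * (k - 1), "boxplus": 3 * (z - 2) + 1,
                     "run_time": (ceil_log2(k), 2 * ceil_log2(z))},
    }

for name, row in check_node_complexity(k=64, z=3).items():
    print(name, row)
```

The integer `bit_length` trick avoids floating-point rounding in $\lceil \log_2 k \rceil$.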

SIMULATION RESULTS
For simulations on the binary-input AWGN channel, we have considered an LDPC code based on the array-code construction of length $N = 4489$ and rate $4158/4489$, which is defined by $M = 335$ parity checks [13]. Fig. 4 shows the bit-error-rate performance of this code. The following LDPC decoding algorithms have been used: the full SPA and the SB-SPA with $z = 2$, 3, and 4 least reliable values. The results are obtained using Monte Carlo simulations, in which the maximum number of iterations is fixed to 80 in all cases. For the core operation $\boxplus$, we use Eq. (3) in its full accuracy.
We observe that the simple min-sum approximation, which essentially corresponds to the SB-SPA with only $z = 2$ least reliable values, suffers a performance penalty of about 0.3 dB at a bit-error rate of $10^{-6}$. It is apparent from Fig. 4 that the loss in performance is recovered by increasing the number $z$.

Fig. 5 shows the bit-error-rate performance of an LDPC code of length $N = 1008$ and rate 1/2 from the online repository [14] on the binary-input AWGN channel. Again we see that the simple min-sum approximation suffers a non-negligible performance loss relative to the full SPA, which can be regained by using $z = 3$ or $z = 4$.

CONCLUSIONS
A family of low-complexity and low-latency algorithms that approximate the sum-product algorithm has been proposed. The main feature of the simplification consists in ordering the reliabilities of the incoming messages at each check node. The complexity of the simplified algorithm depends on the number $z$ of least reliable messages that are selected. When keeping only $z = 2$ least reliable messages, the SB-SPA essentially reduces to the well-known min-sum algorithm, which has the least complexity. Simulation results have shown that for increasing values of $z$, the performance of the algorithm quickly approaches the performance of the full SPA. By suitably selecting the parameter $z$, the SB-SPA provides the flexibility to improve performance at the cost of increased complexity. Moreover, one can simplify the $\boxplus$ core operation by some reduced-complexity implementation of one's own choice.