On the Use of Graph Fourier Transform for Light-Field Compression

This work proposes the use and analyzes the viability of graph Fourier transform (GFT) for light-field compression. GFT is employed in place of discrete-cosine transform (DCT) in a simplified compression system based on high-efficiency video coding (HEVC). The effect on GFT efficiency of different implementations for prediction procedure is analyzed, as well as different methods for computing GFT given residual images. Results indicate that the prediction scheme is sensitive to the type of light field being compressed, and a preliminary method for selecting the best prediction scheme is explored. Moreover, considering multiple residual images when computing GFT, instead of only one central image, improves compression rate and makes compression more uniform across multiple views. GFT achieves reduction of up to 21.92% in number of transform coefficients when compared to DCT-based compression, while providing better or equal mean squared reconstruction error.


I. INTRODUCTION
Light field imaging is a promising technology that opens a variety of new possibilities to entertainment industries, such as photography and cinema, by capturing 4D data from a scene [1]- [7].Light field technology is based on the 5D plenoptic function L(x, y, z, θ, φ), which describes the amount of light L, denominated radiance, along every position (x, y, z) in space and in any direction (θ, φ).Theoretically, if the plenoptic function for a region of interest is known, any image associated with that region can be recreated, from every perspective.This motivates the use of light field in entertainment industries, mainly photography and cinema [1].Other application for light fields reside in medical imaging, such as microscopy [8] and brain imaging [9].In practice, determining the plenoptic function is unfeasible, so light field cameras capture a 4D parametrization of the plenoptic function that consists of multiple photographs of a scene.This can be done moving a digital camera in a grid of various positions and taking photographs at each position, by using an array with multiple cameras, or by adding a microlens array in front of the camera sensor [3].
As light field data consists of multiple photographs, data size may increase drastically depending on the configuration of the light-field recording setup, making the manipulation of the resulting data a challenging task [10]- [15].The "JPEG Pleno" initiative, conducted by the JPEG standardization committee, aims at providing solutions for framework and data manipulation considering several multiview image techniques, such as light field [6].The delivery of a complete set of tools, including framework, coding, tests, and software, is set to 2018 [6], [16].This requires in-depth research in order to develop and improve the various tools.
The use of graphs is specially relevant when dealing with an irregular domain or any domain that is not well represented by traditional time series [17].In the current stage of the information era, the necessity of dealing with data from enormous networks, such as social networks, sensor networks, transport networks, among many others, increases daily.Given the non-ordered nature of these networks, using graphs as an underlying domain for the associated data becomes an interesting alternative to standard analyses [18].Data from these networks become signals on graphs and, in order to manipulate these data, tools from classic digital signal processing (DSP) are adapted to signals on graphs, yielding the emerging field of digital signal processing on graphs (DSP G ) [17], [19]- [23].
Two important concepts that serve as basis for a signal processing framework for signals on graphs are the definitions of shift operator and frequency domain.As an emerging field, there are no consensus regarding the proper definitions of these concepts, giving rise to many researches addressing the approach that best fits each particular application [24].One approach is based on the spectral graph theory [25], which uses the graph Laplacian L as shift operator and its eigenvectors as spectrum of the graph.This approach is usually restricted to undirected graphs, for which relations between two different elements are symmetrical, i.e., an edge from element i to element j has the same value as an edge from j to i.A second approach, valid for both directed and undirected graphs, uses the adjacency matrix of the graph A as shift operator [19], [26], [27].In this case, the spectrum of the graph is defined as the eigenvalues of A. This approach is the one adopted throughout this work, as it allows the use of more general classes of graphs.
This work is an extended version of the work presented in [28], where the application of graph Fourier transform (GFT) was proposed and studied as an alternative to the discretecosine transform (DCT) in the compression of light-field data.The objective of this work is to provide an improvement for light-field compression systems based on high-efficiency video coding (HEVC) [14], [29].In HEVC, DCT and discrete-sine transform (DST) are used as block transforms, with the objective of mapping data into a frequency-related domain where quantization (and thus compression) is more efficient.This increase in efficiency is due to the energy compaction property related to these trigonometric transforms when applied to images.It has been shown in [30], [31] that GFT is able to concentrate energy in fewer coefficients when compared to DCT, decreasing compression distortion when using the same number of coefficients.GFT usually depends on the original data and, thus, is not a fixed transform.Transmitting the transform basis from encoder to decoder is required, increasing transmission rate, and the impact of this task must be dealt with in order for GFT to be more efficient than DCT in the rate-distortion sense.

A. Scope and Contributions
This work begins by providing a review on both light field and DSP G theories and an overview on how both these concepts are employed in this work.This includes: presentation of introductory concepts on both topics, motivation of the proposed approaches, notation, and database adopted throughout this work.The remaining part of this work focuses on analyzing the viability of using GFT in place of DCT under different analysis methods.We investigate forms of improving the performance of GFT by studying some of its parameters for which no consensus has been reached.The main contributions of this work are: • Proposal and investigation of real applications for the developing field of DSP G , given real and practical light field data.• Performance comparison between GFT and traditional and broadly used DCT, analyzing viability of using GFT in the proposed application.
• Study of the effects of different settings for graph representation on GFT.

B. Outline
In Section II, background review on both light field and DSP G is provided, including theory, applications, and motivation.Section III presents the proposed approach for using GFT light-field compression in an HEVC-based system.Section IV describes the entire methodology regarding database, definitions, and other concepts adopted throughout this work.Simulations and results are presented in Section V. Section VI presents a brief discussion of the results and future works.Section VII presents a conclusion for this work.

II. LIGHT FIELD AND DSP G : A REVIEW
This section reviews the main concepts related to both light fields and DSP G .It begins by presenting light-field theory, focusing on recent implementations and how light-field data is generated.Then basic graph concepts and notations adopted in this work are presented, along with recent advances in the area.

A. Light field
Early notions of interpreting light as a field and conceiving a vector function to represent the amount of light present at (and passing through) points in space date back to the beginning of the 20th century.In 1936, Andrey Gershun introduced the term light field [32] and an early version of the function that would later be called the plenoptic function.In its standard interpretation, the 5D plenoptic function L(x, y, z, θ, φ), which is a scalar field, describes light intensity that goes through a given point in space as a function of its position and the direction toward which the light ray is headed.Light intensity is denominated radiance and is given in W ⁄sr•m 2 (watts per steradian per meter squared, i.e., power per solid angle per area).The function L(•) may be extended to higher dimensionality, for instance, by also considering time or wavelength.The idea of this function is to convey the complete information about a scene 1 associated with electromagnetic radiation.If L(•) is known, then every possible view 2 associated with a scene can be reconstructed by correctly arranging evaluations of the function for different points and directions in space, having several applications in imaging, photography, rendering, and other areas.
In practice, the plenoptic function is not available or obtainable in a feasible way.If free space is assumed, that is, the space associated with the region of interest is free of obstacles, the plenoptic function may be represented in lower dimensionality, considering a light ray sustains its radiance for different points along a given direction.The assumption of free space may be generalized to keeping the region of interest limited to the convex hull of any object.A straightforward parametrization of the plenoptic function in four dimensions is composed by two planes as shown in Figure 1.This representation of plenoptic function in four dimensions leads to current implementations of light-field-capturing devices.In devices used for capturing scenes and creating a light-field composition, the uv plane is taken as the camera plane and the st plane as the focal plane.That is, multiple light rays from the scene located at plane st travel along the space and hit a sensor region in plane uv, creating a view of the scene [1].Common implementations are: cameras, but requiring a static scene.An example of light field captured by a moving camera is shown in Figure 2; • microlens array inside a conventional digital single-lens reflex (DSLR) camera, where each microlens captures light from a different direction rendering different perspectives of the scene.Light-field technology comes with several applications, most of them in the entertainment field.With light-field data captured by systems such as the aforementioned ones, features otherwise unfeasible become direct applications.For instance, synthetic aperture photography allows changing the focal point of a picture after it was taken.Light-field rendering allows the creation of novel views not previously captured.Light field displays may improve virtual-reality displays by using full light-field data rather than simple stereoscopic views.Lightfield applications, however, are very data-intensive, since a single traditional image is now represented by a set of multiple images.Recent researches are dedicated to dealing with the high amount of data from light field [10], [11], [14].

B. Digital signal processing on graphs
Graphs are commonly defined as mathematical structures composed by two different sets: set V = {v 0 , v 1 , . . ., v N −1 } composed of N vertices (also known as nodes) and set E = {e 00 , e 01 , . . ., e (N −1)(N −1) } of N 2 edges.Vertices are basic units and are interpreted as objects of a graph G = {V, E}, which can be used to model objects in diverse systems, e.g., points in R 2 , sensor locations in a network, social-network users, or chemical elements on a molecule, among many other applications.Edges e i j , whose meaning and (possibly complex) value rely on the application of the graph, represent pairwise relations between vertices v i and v j , being equal to zero if there is no relation.The neighborhood of a vertex v i is defined as the set of all vertices directly connected to v i by a non-zero edge.These assumptions consider that there are no multiple edges between two vertices, but there are no restrictions to self-loops, which means a vertex can be directly related to itself.In this context, relation between elements does not have a fixed definition and depends on the application.If the relation between vertices v i and v j is the same as the relation between vertices v j and v i for every pair of vertices, i.e., e i j = e ji , ∀i, j, the graph is denominated undirected graph.Otherwise, if the direction of the edge is relevant and e i j e ji for some pair of vertices, the graph is denominated directed graph.An example for an undirected graph is shown in Figure 3, with N = 4 vertices.This graph is not fully connected, since many edges are equal to zero.Another form of representing the relations between vertices is the adjacency matrix A ∈ C N ×N , whose element [A] i j = e i j .The graph is undirected if, and only if, A is symmetric.Throughout the rest of this paper, graphs will be represented by pairs G = {V, A}.
Graphs are traditionally used as tools for data visualization and system modeling, whereas classical digital signal processing (DSP) is traditionally constructed around well-structured domains, such as time or space.Time domain is interesting for DSP as it holds properties that are particularly useful in the analysis of discrete-time signals.Consider a discrete-time finite-duration signal s[n] as a function s : {0, 1, . . ., N − 1} → C that maps instants n ∈ {0, 1, . . ., N − 1} in time domain into the complex plane.Time domain is well-structured, as comparisons such as n 1 < n 2 and n 1 = n 2 are feasible for any two points n 1 , n 2 within {0, 1, . . ., N − 1}, and it is a totally ordered domain.For many applications that emerge with recent advances and necessities in technology, treating signals associated with unstructured and more general domains is required.These applications are usually associated with networks, such as social, transport, sensor, and biological networks, for which representing the underlying domain with time or space would waste part of the information regarding connections among elements in the network.Graphs provide the suitable discrete domain for signals extracted from these types of network.Moreover, these applications are usually data-intensive, and graphs are a natural tool for representation of Big Data [20].
The concept of signals on graphs uses the set of N vertices V of a graph G as the domain of a dataset of N elements, equivalently to the use of N time instants n ∈ {n 0 , n 1 , . . ., n N −1 }, as shown in Figure 4.The set of edges E of the graph is used to encode any relevant relationship between elements of the signal that could not be represented in the time domain.A classic example is a sensor network that measures local temperature for N sensors distributed across several points of a country.Each location is represented by a vertex of the graph and the locally measured temperature is the signal on the vertex.Edges may be used to indicate distance between sensors, rendering an undirected graph.Another example is the measurement of user activity on a social network.Vertices would indicate each user account, for which an online-time is measured, and users are connected to each other via "following" tags, rendering a directed graph.For both cases, representing signals in time domain discards pieces of information that could be of paramount importance when processing these data.
The notation for signals on graph adopted throughout this paper is as follows: a graph signal given by s : V → C is referred to as a vector s.The n-th entry of vector s is Once graph domain and the definition of a signal over this domain are formally stated, one can build tools to process signals on graphs, which lead to two major approaches developed in the last years.The first approach is based on graph spectral theory [25] and on the graph Laplacian, being restricted to undirected graphs with non-negative edge values.This approach has received great attention and much effort was put into developing tools with these concepts [17].Tools for DSP G are mostly translations from already-consolidated classical DSP tools, which was mostly exploited by the second approach proposed by Sandryhaila and Moura [19], [20], [26], whose concepts are adopted and reviewed in the following definitions.
The first and most fundamental tool translated from classical DSP is the unit-delay or unit-shift operator, denoted as T −1 , which consists of an essential block in filter design.In DSP, when a unit shift T −1 is applied to a length-N discrete-time signal s[n], the signal is shifted in time resulting in a signal The unit-shift operator T −1 is a linear transformation, implying that it can be associated with a matrix.Indeed, when One can interpret the relation in Equation ( 2) within a graph framework.Indeed, consider the directed cyclic graph in Figure 5.Given all edges equal to 1, this graph can be interpreted as a graph generalization of the discrete-time domain, where each vertex v n represents a time instant n ∈ {0, 1, . . ., N − 1}.The adjacency matrix of this graph is the cyclic-shift matrix C appearing in Equation ( 2).
One can bring these ideias to the graph domain by considering a graph G = {V, A} as the underlying structure for a signal s, and by identifying the graph-shift operator with the graph adjacency matrix A. That is, a shifted signal s on a graph is given by s = As. ( This definition for graph shift means that shifting a signal on graph domain is equivalent to replacing each signal sample s n by a linear combination, given by the n-th row of A, of its neighborhood.This approach is not restricted to undirected graphs, allowing the use of directed graphs with complexvalued edges.A straightforward property of this definition is that it generalizes the unit-shift operator from classical DSP. Given a formal definition for unit-shift in the graph domain, defining filters is the next natural step and it is performed by translating filtering concepts from classical DSP.In discretetime domain, the output from a finite-duration impulse response (FIR) filter with length P is defined by the linear combination of its P most recent inputs, i.e., where the time-invariant coefficients h 0 , h 1 , . . ., h P−1 define the impulse response of the filter and each term s[n− p] results from shifting s[n] with a shift operator T −p .For a signal with finite duration N, applying an FIR causal filter of length P ≤ N, that is, h p = 0 for p < 0 and p ≥ P, induces the following circular convolution which shows that the filter is equivalent to a length-P polynomial over the cyclic-shift matrix C. Analogously, the linear, shift-invariant graph filter is defined as a polynomial over the adjacency matrix A, i.e., Once signals, shift, and filters on graphs are defined, concepts of spectral decomposition and Fourier transform can be extended to graph domain.For a signal space S, spectral decomposition of S is the identification of W filtering-invariant subspaces S 0 , . . ., S W −1 of S. Being invariant to filtering means that, for a signal s w ∈ S w , the output of filtering this signal is sw = H(A)s w ∈ S w .The spectral decomposition is uniquely determined for every signal s ∈ S if, and only if: • Each S w is irreducible to smaller subspaces, and, in this case, Given S as defined in Equation ( 7), satisfying the above conditions, any signal s ∈ S is univocally represented as The diagonalization of the adjacency matrix A leads to a spectral decomposition of the signal space S on the graph domain.Nonetheless, given the arbitrary nature of A, as allowed in this DSP G approach, it is not always diagonalizable.It is shown in [19] that the Jordan decomposition A = VJV −1 is used to conduct spectral decomposition of S on graphs.J is the Jordan normal form and V is the matrix whose columns are the generalized eigenvectors of A, which are the bases of the subspaces of S. Hence, Equation ( 8) can be written as where ŝ is the vector of coefficients that expand s into the subspaces of S. The union of these subspaces is the graph Fourier basis.The graph Fourier transform (GFT), which provides the coefficients of the expansion of a signal over the graph Fourier basis, is defined as such that ŝ = Fs.The inverse graph Fourier transform (IGFT) is given by If the graph is undirected, A is a symmetric matrix and it is diagonalizable.The graph Fourier transform is then obtainable from the eigenvectors of A. In this case, the eigenvectors are orthogonal and V −1 = V T , which makes computation of the transform matrix F less intensive.

III. PROPOSED APPROACH TO LIGHT FIELD COMPRESSION
The application of HEVC-based methods for compression of light-field data has been intensively researched over the past years [14], [29], [33], [34].HEVC presents a complex scheme composed by intra-frame and inter-frame prediction, motion estimation and compensation, transformation, quantization, coding, and other procedures, for which several configurations are available.These procedures are applied to coding tree units, which are blocks of up to 64×64 pixels into which video frames are divided.Notable procedures considered in this work are inter-frame prediction, transformation, and quantization, whose general concepts are explained below.
• Inter-frame prediction: When encoding a block of pixels of the current frame, the algorithm searches for a similar block, denominated reference block, from the previously encoded frame.Instead of encoding the raw values of pixels of the current block, the algorithm encodes only the difference between current and reference blocks.This difference is denominated prediction residual.The prediction procedure may be a complex process, using, for example, algorithms to estimate and compensate movement of blocks between different frames.Residual blocks should have less entropy than raw blocks, which makes compression in transformation, quantization, and coding stages more efficient.It must be noted that, in order to make inter-frame prediction possible, at least one frame that was previously encoded must have been encoded without inter-frame prediction.This frame is referred to as intra frame.Quantization is a lossy process, i.e., information is permanently lost once coefficients are quantized.The loss of information is called distortion, for which several metrics are available.Compression processes must consider the trade-off between rate and distortion.This work proposes and analyzes the viability of using GFT in place of DCT in HEVC-based light-field encoders, while exploiting the similarity among light-field images.The use of GFT within data compression context, and specifically image compression, is not new.The competence of GFT for concentrating information in few transform coefficients in a competitive manner when compared to other transforms is known and has been approached in other works [30], [31], [35].Notwithstanding GFT inducing relatively high energy concentration, the transform and its inverse IGFT depend on the adjacency matrix A, which has no fixed structure and depends on the application and on the data.The impact of storing or transmitting A or the transform matrix F must be considered during compression.The method proposed in this work aims at reducing the impact of the extra data related to graph structure by exploring the redundancy that exists among images near to each other in light fields.

IV. METHODOLOGY
In order to assess the performance of using GFT for light-field compression, a simplified compression process is defined, as presented in Figure which is detailed in the next subsections.A database composed by 7 light fields is used.Three of them, namely Humvee, Knights, and Tarot, are obtained from the Stanford Light Field Archive [36] and some sample views are shown in Figure 7.These light fields are captured from real scenes using a moving camera on a rectangular grid with 16 × 16 positions, yielding 256 total images for each light field.The other four light fields are generated synthetically, obtained from the HCI 4D Light Field Dataset [37], [38]; sample views for boxes, cotton, dino, and sideboard are presented in Figure 8.For these light fields, views are captured over a grid of 9 × 9 positions, for a total of 81 images for each light field.Database information is summarized in Table I.Only the luminance component from  these light fields is used throughout this work, despite the fact that RGB versions are depicted here.

A. Prediction
The input of video codecs is a stream of frames ordered according to their time stamps.It is reasonable to assume that similarity between frames decays when two frames are selected further apart in time if compared to similarity between two consecutive frames.Thus, prediction for video streams can be implemented by selecting the frame that comes right before the current frame.It is worth noting that complex prediction schemes are not usually limited to only one frame.For light fields, a prediction order is not straightforward.It is expected that views close to each other should be more similar.However, there is no consensus on how to determine the optimal selection of views or the boundaries for spatial neighborhood used for prediction in light fields.Considering the light field humvee as example, with a grid of 16 × 16 positions, three prediction schemes are considered in this work: • Rows: Prediction is performed over each row with 1 × 16 images, independently from other rows.The first image from each line is assumed to be an intra image, i.e., no prediction is used when coding this image.For the remaining 15 images from each line, prediction residuals are calculated.A simple prediction scheme is adopted.The prediction image I • Columns: Prediction using columns is similar to prediction using rows.Columns with 16 × 1 images are treated independently, and the first image from each column is an intra image, whereas the remaining are inter images.Computation of residual images R k is analogous to the one described for rows.• Blocks: When using a block scheme to perform prediction, a 3 × 3 block of views is selected.The central view of the block is the intra image and the prediction image for every inter image is the central view.In other words, a block is composed by K = 9 views on a 3 × 3 grid.The central image I c is intra-encoded, for some c ∈ {1, 2, . . ., K }.Per group, K − 1 residual images are computed as R k = I k − I c , for k ∈ {1, 2, . . ., K } and k c.Given one of the prediction schemes described, the set of views selected for prediction procedure, i.e, views from a row, column, or block, will be referred to as prediction group.

B. Transformation
As stated, block transform is used to map data from residual image blocks into a frequency-related domain.This allows better compression of the data.HEVC uses DCT for residual blocks from size 4 × 4 up to 32 × 32, and DST for some cases of 4 × 4 blocks.In this work, GFT is used to transform blocks of size 32 × 32 and results are compared to those of DCT.If smaller blocks, such as 4 × 4 or 8 × 8, are used, it is expected that blocks at the same position for different residual images should have low correlation with each other, given the parallax between adjacent views.For large 32 × 32 blocks, the impact of parallax is reduced.High correlation among blocks in the same position from several views in a prediction group is beneficial for the proposed compression scheme, as will be further explained in this section.
Up to this point, images are treated as sets of pixels in 2D space.In order to make the use of GFT possible, the signal associated with a residual block must be represented as a signal on a graph, previously defined as a vector s, such that the n-th entry s n is a function of the vertex v n ∈ V. Let the signal associated with a pixel from an M 1 × M 2 residual block be r : I M1×M2 → R, where I M1×M2 represents the set of integer indexes for the positions of pixels on the M 1 × M 2 block.That is, for each position on the M 1 × M 2 block, a residual-related real value is assigned.The signal on graph is defined such that That is, the graph signal s is defined as a column vector formed by stacking the columns of the residual block.
Let a residual block B k,t , k ∈ {1, . . ., K }, t ∈ {1, . . ., T }, be the M 1 × M 2 block from the k-th residual image R k (in a prediction group with K − 1 residual images) that was divided into T blocks.The graph signal associated with this block is s k,t .The corresponding adjacency matrix is denoted by A k,t and the GFT matrix by F k,t .Note that the transform matrix, and consequently its inverse, depend on the signal, unlike the DCT, which is the same for every M 1 × M 2 block.The first consideration adopted in this work in order to reduce the impact of transmitting the transform matrix is to build a sparse adjacency matrix and transmit A k,t instead of F k,t .The adjacency matrix A k,t is built according to the nearestneighbor (NN) image model [39], which is shown to offer an efficient image representation whilst providing a sparse and fixed graph structure.This model defines an image as a 2D nearest-neighbor graph.An NN graph is a graph for which a vertex v i is connected to v j if, and only if, the distance d(v i , v j ) is minimum among the distance between v i and all other vertices.For a regular structure like an image, the minimum distance exists for more than one pixel, as depicted in Figure 9.Using NN image model implies that each vertex of the graph will have at most four non-zero edges, and pixels at the corner, border, or interior of the block have different number of edges.The model also assumes that an image is a 2D NN graph constructed as a Cartesian product of two 1D NN graphs.A 1D NN graph is a possibly-directed line graph similar to the one presented in Figure 5, apart from the loop edge.This generates a structure where multiple edges assume the same value, indicated by coefficients a 0 , . . ., a M 1 −2 and b 0 , . . ., b M 2 −2 in Figure 9.As a result, considering an  defined so as to minimize the 2 distortion introduced by the shift operation, i.e., A k,t s k,t − s k,t 2 .As described in [39], this minimization is solved as an overdetermined leastsquares problem.This entire reasoning eventually implies that the adjacency matrix A k,t is transmitted in place of the graph Fourier transform matrix F k,t .While this saves bandwidth, it adds complexity to the decoder, as the eigenvectors of A k,t must be computed.Note that A k,t is symmetric and, thus, diagonalizable.Finally, it is worth pointing out that other schemes rather than the NN image model could have been employed as well, which might induce different performances; however, the NN model proved viable, as corroborated by the results achieved in this work (see Section V).
The second consideration employed to reduce the impact of A k,t , besides forcing sparsity and fixed structure via NN image model, is to exploit the redundancy among the many views in the light field in order to avoid transmitting A k,t with every single block.Considering that every view is equally divided into T blocks, only one A t 0 is transmitted for a given block position t 0 across the entire prediction group.Figure 10 shows an example of block position t 0 across views from a prediction group.This consideration assumes that blocks in the same position are highly correlated among several residual images.In this work, two similar methods for computing matrix A t 0 are considered.The first is using only adjacency matrices associated with one of the K − 1 residual images R k .For rows or columns prediction schemes, using the central residual image (for example R 8 when K = 16) is an intuitive choice, since other views are symmetrically similar to it.For blocks prediction scheme, there is no defined choice.The second method is to use multiple residual images R k and compute the coefficients of A t 0 by minimizing That is, the distortion introduced by the shift operator is minimized jointly for multiple, possibly all, residual images in a prediction group.For both methods, using an adjacency matrix which is not specifically computed for a given block may degrade the efficiency of the GFT, but the impact of transmitting the matrix is slightly reduced.
Once the adjacency matrix is computed, the GFT matrix for each block position is given by F t , whose columns are the eigenvectors of A t -the reader should keep in mind that the index k can now be dropped from A k,t and F k,t since it is assumed that adjacency and transform matrices do not depend on the residual image, given that only one matrix is considered for a given block position across the entire prediction group.The transform coefficients for each block from residual images in the prediction group are computed as ŝk,t = F t s k,t , where s k,t is the graph signal corresponding to each block.

C. Coefficient selection
A heuristic technique is adopted to assess the performance of GFT against DCT for light-field compression when employed in an HEVC-based compression system.The IGFT is given by the transpose of F t , since eigenvectors from A t are orthogonal.If IGFT is applied to transform coefficients ŝk,t , the signal s k,t is perfectly recovered.In practical applications, compression occurs when transform coefficients are quantized, resulting also in loss of information.In this work, a simplified compression process is conducted by setting Q smallest transform coefficients to zero, resulting in compressed transform coefficients ŝQ k,t .When IGFT is applied to these coefficients, the signal s Q k,t , which is recovered by inverse transform, is an approximation of the original signal s k,t .A compressed version B GFT k,t of the original block B k,t can be constructed from the signal recovered.For the case of DCT, the 2D DCT is applied directly to block B k,t and by setting the smallest coefficients from the transform block to zero, a compressed block B DCT k,t is recovered via inverse discrete-cosine transform (IDCT).

V. SIMULATIONS AND RESULTS
Simulations were conducted in order to compare GFT against DCT when employed in the proposed compression system.The basic concept underlying all simulations presented in the next subsections is to set GFT coefficients to zero as much as possible while still recovering blocks with less distortion when compared to a specific DCT compression.The number of compressed DCT coefficients is fixed at Q DCT = 924, i.e., only the 100 largest out of 1024 coefficients are kept and DCT is fixed at approximately 10:1 compression ratio.Distortion D DCT is evaluated for DCT.For each residual image, the simulation searches for the largest number of compressed GFT coefficients Q GFT for which the corresponding distortion D GFT is still smaller or equal to D DCT .It is important to note that both Q DCT and Q GFT are set for an entire residual image and, thus, every block in each residual image will be represented by the same number of coefficients.The figure of merit used to characterize distortion is the mean squared error (MSE) between compressed and original residual images.For some simulations, the structural similarity (SSIM) index [40] is also considered as figure of merit for distortion.While MSE represents an indication of absolute error between images, the SSIM index provides information related to changes in structural information between images.Different simulation setups are considered given the options described in Section IV.Three prediction methods were proposed, namely: rows, columns, and blocks.Moreover, two methods for building the adjacency matrix are considered.The first uses only one reference residual image, whereas the second uses multiple residual images when computing the coefficients of A t .The effects of these different setups are analyzed in this section.The database presented in Section IV and detailed in Table I is used.

A. Transform-setup analysis
As presented in Section IV-B, the coefficients of A t may be computed either for a single reference residual image or jointly for multiple residual images.For this simulation, using the rows prediction scheme, three setups are considered for transform computation: • Using only one central residual image as reference.The 8-th residual image R 8 for real light fields, where K = 16 images per line, and the 5-th residual image R 5 for synthetic light fields, with K = 9 images per line; • Using part of the prediction group.Residual images from R 5 to R 10 for real light fields and from R 3 to R 6 for synthetic light fields; • Using all residual images from the prediction group.Table II shows the results obtained for simulations considering these three setups.Results show the reduction in number of coefficients used by GFT when compared to DCT for the entire light field, so that GFT is still able to yield better or equal distortion for every residual image.Reduction values for the total number of coefficients (#) for each light field are computed as It is worth highlighting that the number of coefficients associated with the adjacency matrices is included in # GFT coefficients and, thus, the impact of transmitting A t is considered.GFT shows slight improvement over DCT for most cases, yielding up to 21.92% of reduction in number of coefficients.
The analysis shows that using multiple residual images when building A t improved the results for all cases when compared to results obtained using only one residual image as reference.This result can be observed in Table II by considering each light field independently, which is represented by each row.
For each light field, an increasing trend in the reduction value can be noted when going from "Central residual" to "Entire group" sections, with few exceptions, indicating the overall improvement when using multiple residual images.A relevant analysis given different transform setups is to observe the standard deviation of the number of coefficients used by the GFT across the residual images.The standard deviation of Q GFT is estimated for each light field, using the number of compressed GFT coefficients Q GFT from each residual image as sample for the standard deviation estimator.From Table II, it is notable that using the entire prediction group reduces the standard deviation of Q GFT .When GFT is built using only the central residual image, its efficiency is high for the central residual image, but decays as residual images get further apart form the central reference.This is expected, since correlation is reduced and the impact of using a single  transform matrix is increased, requiring more coefficients.
Constructing the transform while considering multiple images reduces the efficiency decay across the prediction group.This effect is depicted in Figure 11, where the difference ∆Q = Q GFT − Q DCT in number of compressed coefficients for one row of the humvee light field is presented.In this case, the coefficients of A t are not considered.The three proposed transform setups are considered.The peak for ∆Q at R 8 is notable when this residual image is the only one used for transform computation.When using all residual images, this effect is no longer present, allowing for a more uniform compression across all images.This simulation was replicated using SSIM as metric when searching for Q GFT .Only the transform setup based on all residual images for the construction of A t was used, considering it achieved the best results in the previous simulation.Results are presented in Table III.Values for reduction in number of coefficients are lower than the ones obtained when using MSE, but GFT is still competitive when compared to DCT.Moreover, small values for standard deviation are achieved, as expected.

B. Prediction-setup analysis
In this simulation, the three proposd prediction methods, namely rows, columns, and blocks, are tested.The transform matrix is built using all residual images from each group when computing the matrix coefficients.Results are shown in Table IV.For real light fields, using the rows prediction scheme yields the best results, followed by blocks, which increases the standard deviation of Q GFT across residual images.For synthetic light fields, the discrepancy in results among different methods is reduced and the efficiency of columns prediction scheme slightly increases.
These results indicate that different prediction methods may be better suitable for some specific type of light field.Video encoders usually work with several possible configurations for each processing stage.This opens the possibility of searching for the best prediction method when compressing light field images in a more complex system.An analysis of how the similarity between images in a prediction group affects the compression efficiency in that group was conducted.For each light field, prediction groups based on the three proposed methods were constructed.For each group, the SSIM index is Figure 12: Analysis of the correlation between average similarity in a prediction group and the resulting efficiency of using that group for light-field compression.computed for every pairwise combination of residual images in that group and the average SSIM index value is computed.That is, for each prediction group, the corresponding average structural similarity is computed.Figure 12 shows the average SSIM results for every group for all light fields in the available database, along with the average reduction in number of coefficients used per group.This simulation is conducted for the three prediction methods.Results indicate high correlation between intra-group similarity and compression efficiency.In a more complex compression system, similarity could be used as a metric for the selection of the best prediction method.

C. Transform coding gain
The transform coding gain is a criterion commonly used in order to assess the effectiveness of a transform, by comparing the transform quantization against direct quantization in the original domain [41].For orthogonal block transforms, which is the case for both GFT and DCT, and assuming high-rate, i.e., every coefficient contributes equally to distortion after optimal bit allocation, the transform coding gain is given by where σ 2 i is the variance of the i-th transform coefficient across all blocks.The transform coding gain is the ratio between arithmetic and geometric means of coefficient variances.When estimating the transform coding gain from data from a light field, it must be considered that GFT is not a single transform, since it was defined as a data-dependent transform.In order to compute the transform coding gain G GFT associated with GFT, blocks are treated according to their position t, for which a single A t is defined.That is, given a prediction group, an independent G GFT,t is computed for each block position, since the transform F t is restricted to that block position in that prediction group.For each light field, the final gain G GFT is given by the average gain across all block positions for all prediction groups.For DCT, the transform coding gain G DCT is computed in the same way as G GFT to make comparison possible.Results for the estimation of transform coding gain, using rows prediction method and using all residual images for computing A t , are presented in Table V. Transform coding gain shows better efficiency for GFT when compared to DCT for all light fields but one (Tarot).For some block positions from Humvee light field, G GFT,t could not be computed due to zero variance encountered in some coefficients, resulting in zero geometric mean.For Humvee, the dynamic range of G GFT,t was set to 10 by limiting the maximum value.
Table VI shows results for transform coding gain if the entire light-field data is considered at once, i.e., assuming that GFT is a unique data-independent transform.As expected, these results are worse for GFT.

VI. DISCUSSION AND FUTURE WORK
The simulations show mixed results in the comparison between GFT and DCT for light-field compression in an HEVCbased system.When comparing the reduction in number of coefficients, GFT is rather promising, being capable of reducing the number of transform coefficients by up to 21.18% in some cases, while keeping equal or better distortion when compared to DCT.Transform coding gain was used in order to provide an insight of how well transform coefficients may be coded, but results may be biased due to GFT not being a data-independent transform.The compression system employed is a simplified model based on HEVC, therefore several possibly relevant optimization procedures were not considered.Employing GFT in a more complex system is required so as to allow practical operation analysis.Moreover, several possible methods for implementing GFT were proposed, but some analyses were restricted to a single setup using rows prediction scheme and computing A t from all residual images.A broader analysis could offer better understanding of GFT behavior.

VII. CONCLUSIONS
This work proposed and analyzed the use of GFT for lightfield compression in an HEVC-based compression system.The comparison of the proposed method against the traditionally used DCT shows that GFT greatly reduces the number of coefficients required to represent light-field residual images while providing smaller distortion when compared to DCT.Different methods for constructing and employing GFT were tested for real and synthetic light fields.Using multiple images when computing the coefficients of the adjacency matrix provides a more uniform compression across residual images within a prediction group and improves reduction when compared to computing coefficients from a single reference residual image.An analysis of different prediction methods was conducted, as well as an analysis of how similarity between images within a prediction group affects the performance.When estimated from coefficients from the entire light field, transform coding gain favors DCT.Given the fact that GFT is data-dependent and, thus, not a fixed transform, a transform coding gain analysis regarding blocks for which GFT is unique was conducted.This analysis yields better transform coding gain for GFT.The compression system adopted may be improved in order to allow the comparison with practical coding systems.

Figure 1 :
Figure 1: Planes st and uv, which serve as 4D parametrization for plenoptic function.

Figure 2 :
Figure 2: Example of light-field data, consisting of multiple views of a scene, captured by a moving camera.

Figure 4 :
Figure 4: Relation between a signal represented in time domain and in graph domain.

Figure 6 :
Figure 6: Block diagram describing the simplified compression process adopted throughout this work.

Figure 7 :
Figure 7: Sample views from light fields captured from real scenes.Humvee (top), Knights (bottom left), and Tarot (bottom right).

Figure 8 :
Figure 8: Sample views from light fields captured from synthetic scenes.Boxes (top left), Cotton (top right), Dino (bottom left), and Sideboard (bottom right).

pk
for the k-th image I k in a light field row, where k ∈ {2, 3, . . ., K } and K = 16 in this example, is given by I p k = I k−1 .That is, each image is assumed to be equal to the previous image in the line, given the high similarity among adjacent views in light fields.Finally, the residual image R k is computed as the difference between current image and its prediction, i.e., R k = I k − I p k = I k − I k−1 .A total of K − 1 residual images are computed for each row.

Figure 9 :
Figure 9: Relation edges according to the NN image model.Edges connect only pixels at minimum distance among all pixels.

Figure 10 :
Figure 10: Representation of a block position t 0 for residual images from a prediction group.

Figure 11 :
Figure 11: Number of compressed coefficients Q according to residual image position for the three proposed methods for computing A t .

Table I :
Database information

Table II :
Simulation results for transform-setup analysis

Table III :
Results for simulation using SSIM index and A t computed from all residual images

Table IV :
Simulation results for prediction-setup analysis

Table V :
Transform coding gain for GFT and DCT computed for each light field, considering independent transforms A t

Table VI :
Transform coding gain for GFT and DCT computed for each light field, considering all A t as a single transform