Deep Reinforcement Learning for QoS-Constrained Resource Allocation in Multiservice Networks

In this article, we study a Radio Resource Allocation (RRA) problem formulated as a non-convex optimization problem whose main aim is to maximize the spectral efficiency subject to satisfaction guarantees in multiservice wireless systems. This problem has already been investigated in the literature and efficient heuristics have been proposed. However, in order to assess the performance of Machine Learning (ML) algorithms when solving optimization problems in the context of RRA, we revisit that problem and propose a solution based on a Reinforcement Learning (RL) framework. Specifically, a distributed optimization method based on multi-agent deep RL is developed, where each agent makes its decisions to find a policy by interacting with the local environment until reaching convergence. Thus, this article focuses on an application of RL, and our main proposal consists of a new deep RL-based approach to jointly deal with RRA, satisfaction guarantees and Quality of Service (QoS) constraints in multiservice cellular networks. Lastly, through computational simulations we compare the state-of-the-art solutions in the literature with our proposal and show a near-optimal performance of the latter in terms of throughput and outage rate.


I. INTRODUCTION
The overall performance of wireless telecommunications systems directly depends on how efficiently the available resources are managed, e.g., subcarriers, time slots, transmit power and antennas, among others. Consequently, optimal Radio Resource Allocation (RRA) is one of the fundamental challenges and a key requirement for the design of efficient mobile networks. RRA problems are in general formulated as optimization problems, and different objective functions and constraints have been considered in the literature. Two examples of classical objective functions are maximizing the system throughput, as considered in [1] and [2], and guaranteeing system fairness, as done in [3]. However, with the advent of the Fifth Generation (5G) of mobile wireless telecommunications, which shall integrate new technology components, RRA problems can become even harder to tackle, with larger optimization domains and a series of practical considerations. Moreover, these problems also need to deal with the advanced RRA functionalities, the growing variety of scenarios and sophisticated types of services and/or Quality of Service (QoS) constraints of the users [4].
This work was supported by Ericsson Research, Sweden, and Ericsson Innovation Center, Brazil, under UFC.46 and UFC.47 Technical Cooperation Contracts Ericsson/UFC. Iran M. Braga Jr. would like to acknowledge CAPES for its financial support under the grant 88887.474363/2020-00.
The authors are with the Wireless Telecommunications Research Group (GTEL), Federal University of Ceará (UFC), Fortaleza, Ceará, Brazil.
Although it is possible to apply optimal methods to some of these problems, such as exhaustive search and the Branch and Bound (BB) algorithm, their high computational complexity is prohibitive and, therefore, these methods are not appealing for large-scale mobile networks. In contrast, techniques such as Lagrangian relaxations, iterative distributed optimization and heuristic algorithms normally have reduced computational costs, but they fail to achieve the maximum performance and are usually tailored to specific network configurations. Furthermore, the convergence behavior and optimality gaps of these solutions can be unknown as well [5]. As a result, conventional optimization methods can be quite inadequate for many problems in the context of RRA. There already exists a very rich literature on this topic, as can be seen in [6], where an extensive survey on these techniques is presented.
Machine Learning (ML) techniques are leading the advent of the fourth industrial revolution due to their capability of solving complex problems. In fact, this has aroused the interest of many researchers in the mobile communications field. More specifically, in learning-based resource allocation, a branch of ML called deep learning has gained notoriety and shown its potential in this type of context. Moreover, in order to further improve the performance of this technique with regard to learning from high-dimensional raw input data and making intelligent decisions, a sophisticated approach is proposed in [7] where deep learning and Reinforcement Learning (RL) are combined, resulting in a promising and powerful technique known as deep RL. In summary, in deep RL, a Deep Neural Network (DNN) is used along with other techniques, e.g., replay memory, in order to carry out a stable and efficient training.
Applying deep RL to cellular mobile networks can lead to the following main advantages: (1) a DNN of moderate size can quickly perform predictions, as only a small number of simple operations are needed to obtain an output. This also helps the deep RL agent to get to know its environment faster; (2) the fact that the deep RL agent learns directly from raw, high-dimensional network data collected in large environments is not a problem, owing to the powerful representation capabilities of DNNs; (3) by exploiting distributed and/or parallel computing with multiple machines and multiple cores, the response time of deep RL-based schemes can be greatly reduced and their performance increased; (4) deep RL-based schemes can also improve over time, since the deep RL agent aims at optimizing a long-term performance, considering the impact of actions on future rewards. This makes deep RL efficient in dealing with imprecise input data such as the Channel State Information (CSI) and makes it capable of learning how to behave in an unknown environment [8]; (5) deep RL is a model-free approach, i.e., it does not rely on a specific system model and, therefore, it can be easily extended to different contexts [9].
Motivated by the benefits of deep RL, in this paper we revisit the problem of maximizing system throughput subject to minimum satisfaction constraints per service, as in [1] and [2], and we propose a new near-optimal RRA solution based on multi-agent deep RL.
The remainder of this paper is organized as follows. In Section II, previous related works are reviewed and the main contributions of our work are highlighted. In Section III and Section IV, we present the system modeling and formulate the optimization problem to be solved, respectively. Section V presents a deep RL-based method to solve the problem formulated in the previous section. Section VI and Section VII show simulation results and the main conclusions of this study, respectively.

II. STATE-OF-THE-ART AND MAIN CONTRIBUTIONS
RL techniques have recently been used in a variety of wireless resource management problems, such as channel and power allocation, throughput maximization and spectrum sharing. In [10], for example, a self-organizing method to allocate power to millimeter Wave (mmW) Base Stations (BSs) is proposed based on the Q-learning technique. Q-learning is the most popular RL algorithm, where an agent interacts over time with its environment based on trial and error in order to learn a policy to achieve a given goal [11]. Thus, based on this technique, in [10], each BS acts as an independent agent taking actions such as choosing a transmit power. In other words, each BS sees the others as part of the environment and the BSs do not communicate with each other. In this case, the environment, for a given BS, is seen as a source of interfering signals. The problem with this solution is that, in a non-stationary environment such as a mobile wireless network, it takes a long time to converge due to the lack of cooperation between the BSs.
Q-learning based resource allocation models are also proposed in [12] and [13]. Specifically, in [12], Q-learning is used to select which network node a User Equipment (UE) should connect to in order to minimize the total transmit power. It is considered that a UE can either directly connect to a BS or to another UE, which relays the traffic from a BS. According to that solution, each UE autonomously selects the node to which it will be connected and keeps a record of its experience when using that node. The records, called rewards, reflect the degree of fulfillment of the optimization target, e.g., total transmit power used by BSs, and also of other constraints such as the required bit rate. In [13], we have proposed a Q-learning based solution to the same problem that we address in the present paper, i.e., scheduling frequency resources to UEs in order to maximize the system throughput subject to users' QoS requirements in terms of UE throughput. However, in general, Q-learning based solutions, and especially the one proposed in [13], present a scalability problem, since the agent's experiences are stored in a look-up table, called Q-table. This becomes an issue when the set of possible experiences increases with the dimensions of the environment, which is the case in [13], where the size of the Q-table is proportional to the number of frequency resources and the number of UEs.
As presented in the previous section, when a DNN is used by an agent, the technique is referred to as deep RL, and it has been shown to be efficient for RRA in large cellular networks. In [14], for example, a model-free deep RL method is applied to perform dynamic transmit power allocation. The solution presented in that work achieves near-optimal performance and is suitable for practical scenarios where the system model is inaccurate, thus overcoming some issues of classical and heuristic solutions. A decentralized band and power allocation problem for a Vehicle-to-Vehicle (V2V) communication system is solved based on deep RL in [15]. In detail, the goal is to minimize interference under latency constraints, where each V2V link operates as an agent making its own allocation decisions. Moreover, many other papers also successfully apply deep RL in several wireless telecommunications research areas, such as resource scheduling [16] and mobile edge computing/caching [17], [18].
The above-mentioned works consider either a completely centralized solution [17], [18] or a decentralized solution [14]-[16]. On one hand, in a centralized solution, a deep RL agent is located in a central node and is responsible for the entire processing. Unfortunately, this processing consumes a lot of computational resources, since the size of the employed DNN is proportional to the wireless network dimension. On the other hand, in a decentralized solution, multiple agents are considered, each one running in a different node and taking actions independently. In this last option, the agents can work either independently or in cooperation, at the cost of a longer convergence time or of a higher signalling overhead and data updating procedures, respectively. In the present work, we propose another option, where we assume a centralized node, but with multiple agents running in parallel, each one related to a frequency resource. In addition, this structure is executed independently in a distributed way on different cores or machines in order to improve the performance of the system.
In summary, our main contributions are the following:

III. SYSTEM MODELING
We consider a Single Input Single Output (SISO) downlink cellular system composed of a number of sectored cells so that in a given sector there are $J$ UEs grouped in the set $\mathcal{J} = \{1, \ldots, J\}$ and connected to an Evolved Node B (eNB).
We assume that the intra-cell interference, i.e., interference between terminals of the same cell, is controlled by employing the combination of Orthogonal Frequency Division Multiple Access (OFDMA) and Time Division Multiple Access (TDMA) with the assignment of orthogonal resources. Thus, we define a Resource Block (RB) as the basic scheduling unit composed of a group of subcarriers in the frequency domain and a number of consecutive Orthogonal Frequency Division Multiplexing (OFDM) symbols in the time domain, whose total duration represents a Transmission Time Interval (TTI). In addition, $\mathcal{N} = \{1, \ldots, N\}$ is the set of available RBs. Regarding inter-cell interference, we assume that it is added to the thermal noise in the Signal to Noise Ratio (SNR) expression, defined later.
In this work, we also assume a multiservice scenario with $L$ service plans contained in the set $\mathcal{L} = \{1, \ldots, L\}$ and supported by the system operator. In each TTI, the $J$ UEs compete for the available RBs in order to meet their throughput requirements, which are defined by their service plans. Each service plan $l \in \mathcal{L}$ requires a minimum number of UEs that should be satisfied. The set of all UEs from service $l \in \mathcal{L}$ is $\mathcal{J}_l$, with $|\mathcal{J}_l| = J_l$, where $|\cdot|$ denotes the cardinality of a set. Besides, each UE subscribes to only a single service plan, i.e., $\mathcal{J}_{l_1} \cap \mathcal{J}_{l_2} = \emptyset$, $\forall l_1, l_2 \in \mathcal{L}$ with $l_1 \neq l_2$.
The SNR $\gamma_{j,n}$ of UE $j \in \mathcal{J}$ in RB $n \in \mathcal{N}$ is given by
$$\gamma_{j,n} = \frac{p_n \, \alpha_j \, |h_{j,n}|^2}{\sigma^2}, \tag{1}$$
where $p_n$ is the transmit power allocated to UE $j$ on RB $n$; $\alpha_j$ models the joint effect of the path loss and shadowing of the link between the eNB and UE $j$; $|h_{j,n}|$ represents the magnitude of the complex channel frequency response of RB $n$ when assigned to UE $j$; and, finally, $\sigma^2$ is the noise power at the receiver in the bandwidth of a given RB.
Similar to [1], [2] and [13], power allocation is not optimized herein and we employ Equal Power Allocation (EPA) among RBs, which is the most basic and common power allocation scheme. Hence, the power $p_n$ allocated to each RB $n$ is fixed and equal to $P/N$, where $P$ is the available power at the eNB.
We assume $f(\cdot)$ as the link adaptation function responsible for mapping the achieved SNR to the transmit rate. It is a discrete and monotonically increasing function that models the Modulation and Coding Scheme (MCS) levels so that the transmission parameters at the physical layer are adapted according to the current channel state. Thus, we consider that the transmit rate when RB $n$ is assigned to UE $j$ is $r_{j,n}$ such that $r_{j,n} = f(\gamma_{j,n})$.
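As an illustration of the system model, the sketch below computes the SNR of every UE on every RB under EPA and maps it to a rate with a step function. It assumes the SNR takes the form $p_n \alpha_j |h_{j,n}|^2 / \sigma^2$ built from the quantities defined above. The MCS thresholds and rates in the table are hypothetical placeholders, not values from this work.

```python
import numpy as np

def snr(P, alpha, h, sigma2):
    """SNR of every UE j on every RB n under Equal Power Allocation.

    P: total eNB power; alpha: (J,) path loss and shadowing gains;
    h: (J, N) complex channel frequency responses; sigma2: noise power per RB.
    """
    N = h.shape[1]
    p_n = P / N  # EPA: the same power on every RB
    return p_n * alpha[:, None] * np.abs(h) ** 2 / sigma2

# Hypothetical MCS table (NOT from the paper): thresholds in dB, rates in kbps.
MCS_THRESHOLDS_DB = np.array([0.0, 5.0, 10.0, 15.0])
MCS_RATES_KBPS = np.array([100.0, 200.0, 400.0, 600.0])

def link_adaptation(gamma):
    """Discrete, monotonically increasing map f from linear SNR to rate."""
    gamma_db = 10.0 * np.log10(np.maximum(gamma, 1e-12))
    # Number of thresholds not exceeding the SNR; 0 means no MCS level fits.
    level = np.searchsorted(MCS_THRESHOLDS_DB, gamma_db, side="right")
    rates = np.concatenate(([0.0], MCS_RATES_KBPS))
    return rates[level]
```

With this layout, `link_adaptation(snr(P, alpha, h, sigma2))` returns the full $J \times N$ rate matrix of entries $r_{j,n}$ in one call.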

IV. PROBLEM FORMULATION AND OPTIMAL SOLUTION
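As presented in Section II, the problem investigated herein aims to maximize the system throughput constrained by a per-service minimum number of satisfied UEs in a given TTI. For that problem, we define $x_{j,n}$ as the binary decision variable that assumes the value 1 when RB $n$ is assigned to UE $j$ and 0 otherwise. Furthermore, let $R_j$ be the total throughput allocated to UE $j$, i.e., $R_j = \sum_{n \in \mathcal{N}} r_{j,n} x_{j,n}$. Therefore, the resource assignment problem can be formulated as:
\begin{align}
\max_{\mathbf{X}} \quad & \sum_{j \in \mathcal{J}} \sum_{n \in \mathcal{N}} r_{j,n} x_{j,n}, \tag{2a} \\
\text{s.t.} \quad & \sum_{j \in \mathcal{J}} x_{j,n} = 1, \quad \forall n \in \mathcal{N}, \tag{2b} \\
& \sum_{j \in \mathcal{J}_l} u(R_j, \xi_j) \geq \eta_l, \quad \forall l \in \mathcal{L}, \tag{2c} \\
& x_{j,n} \in \{0, 1\}, \quad \forall j \in \mathcal{J} \text{ and } \forall n \in \mathcal{N}, \tag{2d}
\end{align}
where $\mathbf{X}$ is the matrix of optimization variables composed of $x_{j,n}$ and $u(a, b)$ in (2c) denotes the Heaviside step function, which assumes the value 1 if $a \geq b$ and 0 otherwise. With this, $\eta_l$ is the minimum number of UEs from service $l$ that should be satisfied and $\xi_j$ represents the required throughput for a UE to be considered satisfied, i.e., $\xi_j$ consists of a QoS requirement for each UE $j$ in terms of throughput. Regarding constraints (2b) and (2d), they guarantee that each RB is assigned to a single UE.
Notice that (2) is a combinatorial optimization problem with a non-convex constraint (2c). In order to simplify the analysis of the optimal solution, we linearize (2c) by introducing some new variables. Let $\rho_j$ be a binary selection variable that assumes the value 1 if UE $j$ is selected to be satisfied and 0 otherwise. Thus, (2c) can be replaced by (3c) and (3d), where $\rho_j = 1$ in (3c) implies that UE $j$ is satisfied and (3d) means that for every service $l$ there are at least $\eta_l$ satisfied UEs. Hence, problem (2) can be equivalently reformulated as
\begin{align}
\max_{\mathbf{X}, \boldsymbol{\rho}} \quad & \sum_{j \in \mathcal{J}} \sum_{n \in \mathcal{N}} r_{j,n} x_{j,n}, \tag{3a} \\
\text{s.t.} \quad & \sum_{j \in \mathcal{J}} x_{j,n} = 1, \quad \forall n \in \mathcal{N}, \tag{3b} \\
& \sum_{n \in \mathcal{N}} r_{j,n} x_{j,n} \geq \xi_j \rho_j, \quad \forall j \in \mathcal{J}, \tag{3c} \\
& \sum_{j \in \mathcal{J}_l} \rho_j \geq \eta_l, \quad \forall l \in \mathcal{L}, \tag{3d} \\
& x_{j,n}, \rho_j \in \{0, 1\}, \quad \forall j \in \mathcal{J} \text{ and } \forall n \in \mathcal{N}. \tag{3e}
\end{align}
Thus, we have transformed (2) into an Integer Linear Problem (ILP), which can be solved by standard methods such as the BB algorithm [1], [2].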
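For very small instances, the optimal solution of problem (2) can also be checked by brute force. The sketch below is only an illustration of the formulation, not the BB method used in [1], [2]: it enumerates all $J^N$ ways of giving each RB to exactly one UE and keeps the feasible assignment with the highest total rate. The function name and argument layout are assumptions made for this example.

```python
from itertools import product
import numpy as np

def exhaustive_search(r, xi, services, eta):
    """Solve problem (2) by enumerating all J**N assignments.

    r: (J, N) rates r[j, n]; xi: (J,) required throughputs;
    services: list of lists with the UE indices of each service plan;
    eta: minimum number of satisfied UEs per service plan.
    Returns (best_throughput, best_assignment) or (None, None) on outage.
    """
    J, N = r.shape
    best_value, best_assignment = None, None
    for a in product(range(J), repeat=N):  # a[n] = UE that receives RB n
        R = np.zeros(J)
        for n, j in enumerate(a):
            R[j] += r[j, n]
        # Per-service satisfaction constraint: enough UEs reach their target.
        feasible = all(
            sum(R[j] >= xi[j] for j in ues) >= eta_l
            for ues, eta_l in zip(services, eta)
        )
        if feasible and (best_value is None or R.sum() > best_value):
            best_value, best_assignment = R.sum(), a
    return best_value, best_assignment
```

The cost grows as $J^N$, which is why this approach is only viable for toy scenarios and why lower-complexity methods are of interest.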

V. PROPOSED SOLUTION
In this section, in order to make our proposed solution easier to follow, we first give a brief review of RL, including the Q-learning and deep RL techniques.

A. An Overview of Reinforcement Learning
1) Q-Learning technique: One of the most common RL techniques is the Q-learning model, which consists of an agent, a set of states $\mathcal{S}$ and a set of actions per state $\mathcal{A}(s)$, $s \in \mathcal{S}$. By performing actions and, consequently, transitions from state to state, the agent aims to learn an optimal policy or an optimal path to a given goal. Each of these states can be defined as a tuple of values that characterizes the environment for the agent, while each action represents the change that the agent applies to this environment. Thereby, the idea is that the agent perceives the environment state and selects an action according to a particular strategy or decision policy [11]. This strategy or policy can be implemented using a variety of techniques, such as the $\epsilon$-greedy decision policy. In particular, the $\epsilon$-greedy policy is simple but very efficient: the agent explores or exploits its environment by taking random (non-greedy) or greedy actions, respectively, according to a given probability distribution. Normally, for this decision policy, a random action is chosen with probability $\epsilon \in (0, 1)$, while a greedy action is taken with probability $1 - \epsilon$.
On one hand, when greedy actions are taken, the objective is to exploit the acquired knowledge to improve performance. Generally, for a (partially) unknown environment, these actions lead to locally optimal solutions. On the other hand, when the agent decides to take random actions, it explores its environment for the sake of acquiring experience and knowledge about the environment and, therefore, there is no concern with the immediate effects of these actions. However, random actions allow the agent to escape locally optimal policies and to reach the globally optimal one instead. Consequently, one of the challenges that arise in RL techniques is the trade-off between exploration and exploitation, which should be carefully balanced so that the benefits of both can be properly harvested [11].
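The $\epsilon$-greedy rule described above can be sketched in a few lines. This is a minimal illustration; the function name and the use of a NumPy random generator are choices made for this example.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: highest Q-value
```

Setting `epsilon` close to 1 favors exploration, while a value close to 0 makes the agent almost purely greedy.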
Once an action $a \in \mathcal{A}(s)$ is taken, the system state changes from $s$ to $s'$ and this change generates a signal or indicator that evaluates the effect of the taken action. This feedback or message from the environment is called reward, $\phi$, which is a numerical score used to estimate the expected value of taking an action $a$ in a particular state $s$, also known as the Q-value of a state/action pair $(s, a)$. In detail, the Q-value is calculated by a Q-function such that $Q : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ and, for a given state/action pair $(s, a)$, it can be estimated according to Bellman's equation [11]:
$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ \phi + \gamma \max_{a'} Q(s', a') \right], \tag{4}$$
where $0 < \alpha \leq 1$ and $0 \leq \gamma < 1$ are constants called learning rate and discount factor, respectively, and $\max_{a'} Q(s', a')$ is the best estimated Q-value given the next state $s'$ and all possible actions at $s'$. Basically, $\alpha$ determines how quickly the learning process occurs, while $\gamma$ controls the value placed on future Q-values. Then, over several iterations, the state/action pairs are defined and their respective Q-values are estimated and updated by Bellman's equation. A set of these iterations from an initial state $s_0$ to a final state $s_f$ is called an episode.
Thereby, each state/action pair and its respective performed action allow the agent to interact with the environment, and this interaction after several episodes produces valuable information about the consequences of actions, mainly about what to do or not to do in order to achieve goals. This information is represented in the Q-values, and the set of all of them is stored in the Q-table, which is where all the experience or knowledge acquired by the agent is kept. Therefore, the basic idea in Q-learning is that the agent finds and learns an optimal policy for the desired problem by benefiting from the experience gathered in the Q-table.
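As a small worked example of the tabular technique described in this subsection, the sketch below performs one update of a Q-table stored as a NumPy array. It assumes the standard Q-learning form of the Bellman update with learning rate `alpha` and discount factor `gamma`; the default values are arbitrary.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on a (num_states, num_actions) array Q."""
    td_target = reward + gamma * np.max(Q[s_next])  # best value reachable from s'
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * td_target
    return Q
```

For instance, with `Q[s_next] = [1, 3]`, `reward = 2`, `alpha = 0.5` and `gamma = 0.5`, an entry that starts at 0 moves to $0.5 \cdot 0 + 0.5 \cdot (2 + 0.5 \cdot 3) = 1.75$.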
2) Deep Q-Learning technique: Although the Q-learning technique is very simple, it is quite a powerful algorithm for building a useful set of experiences, a kind of cheat sheet for the agent. Indeed, this is fundamental and helps the agent to figure out which action to perform until it converges to an optimal policy. Nevertheless, as highlighted in [13], Q-learning has two serious problems: (1) the amount of memory required to save and update the Q-table can increase exponentially as the number of states and actions increases; (2) many states are rarely visited and, consequently, the amount of time required to explore all these possibilities (state/action pairs) in order to create a good estimate of the Q-table would be unrealistic or impractical in a real setting [14].
As a result, producing and updating a Q-table can become ineffective in large environments, i.e., with a large number of states and actions. Nevertheless, these limitations can be overcome with the emerging deep RL, e.g., deep Q-learning, which is considered a promising technique to solve complex control problems, especially those with high-dimensional solution spaces [16]. Basically, this technique can be cast as an extension of the classical Q-learning algorithm that uses a DNN to approximate the Q-function in lieu of a lookup table.
Specifically, in the deep Q-learning algorithm, a DNN called Deep Q-Network (DQN) is defined as a parameterized value function $Q_\theta : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ that is used to estimate the Q-function, where the state is given as its input, the Q-values of all possible actions are generated as its output and, finally, $\theta$ represents the parameters that define the Q-values. Therefore, the key idea of the deep Q-learning technique is that the function $Q_\theta$ is completely determined by $\theta$. Consequently, the task of finding the best Q-function is essentially reduced to the search for the best parameters of finite dimension [14].
Although the algorithmic and statistical properties as well as the performance of the classical Q-learning algorithm are well known and studied, the same is not true for deep Q-learning, which remains less well understood in theory. Thereby, simply approximating the Q-function by a DNN can often lead to learning instability. Fortunately, these instabilities can be greatly reduced by the following two aspects [19], [20]. Firstly, similar to classical Q-learning, the agent interacts with its environment following an $\epsilon$-greedy policy over $\mathcal{S} \times \mathcal{A}$. However, the experiences obtained by the agent, each composed of the current state ($s$), action ($a$), reward ($\phi$) and next state ($s'$), are gathered in a memory $\mathcal{D}$ with limited capacity $D$. In addition, the DQN training is performed on a mini-batch $\mathcal{B}$ of $B$ tuples or experiences selected randomly from $\mathcal{D}$. This method is known as memory replay. This strategy can reduce the correlation among the training examples, which helps prevent the policy from being driven to a local minimum [20].
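A limited-capacity experience memory with random mini-batch sampling can be sketched as follows. This is a minimal illustration using only the Python standard library; the class and method names are hypothetical, and the first-in-first-out eviction is one simple way to enforce the capacity limit.

```python
import random
from collections import deque

class ReplayMemory:
    """FIFO experience memory with uniform random mini-batch sampling."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sample without replacement; never ask for more than is stored.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

Because the mini-batch is drawn uniformly from the whole buffer, consecutive (and therefore correlated) experiences are unlikely to dominate a training step.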
Moreover, when only a single DQN is employed, the same values are used to select and evaluate an action and, thereby, the Q-function may be over-optimistically estimated. Therefore, the second important aspect consists of using two DQNs with the same architecture: the target DQN, $Q_{\theta_{\text{target}}}$, with parameters $\theta_{\text{target}}$, and the training DQN, $Q_{\theta_{\text{train}}}$, with parameters $\theta_{\text{train}}$. The training DQN is responsible for learning the values of $\theta_{\text{train}}$, while the target DQN is used to take actions. The values of $\theta_{\text{target}}$ are updated every $\tau$ iterations and set equal to $\theta_{\text{train}}$. Put another way, the weights $\theta_{\text{target}}$ are fixed for a number of iterations while the weights $\theta_{\text{train}}$ are constantly updated. This strategy is commonly called Double DQN (DDQN) and it has shown improvements in the learning process compared to the single DQN strategy [21]. Therefore, for each episode, the least squares loss of the training DQN for a random mini-batch $\mathcal{B} \subset \mathcal{D}$ with $B$ samples is
$$L(\theta_{\text{train}}) = \sum_{(s, a, \phi, s') \in \mathcal{B}} \left( y - Q_{\theta_{\text{train}}}(s, a) \right)^2, \tag{5}$$
where the target is
$$y = \phi + \gamma \max_{a'} Q_{\theta_{\text{target}}}(s', a'). \tag{6}$$
Finally, the loss function (5) is minimized by a stochastic gradient algorithm over the mini-batch $\mathcal{B}$. Then, the training DQN updates its parameters with the new parameters provided by training. According to [14], the convergence to a set of good parameters occurs quickly.
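To make the roles of the two networks concrete, the sketch below evaluates a least squares loss for one mini-batch using plain NumPy arrays in place of actual network outputs. It assumes the common fixed-target form, in which the target is the reward plus the discounted maximum of the target DQN output at the next state; the function names are hypothetical.

```python
import numpy as np

def td_targets(rewards, q_next_target, gamma):
    """Targets y = reward + gamma * max_a' Q_target(s', a') for a mini-batch.

    rewards: (B,) rewards; q_next_target: (B, A) target-DQN outputs at s'.
    """
    return rewards + gamma * q_next_target.max(axis=1)

def least_squares_loss(q_train, actions, targets):
    """Sum of squared errors between the targets and Q_train(s, a).

    q_train: (B, A) training-DQN outputs at s; actions: (B,) chosen actions.
    """
    q_sa = q_train[np.arange(len(actions)), actions]  # Q-value of the taken action
    return float(np.sum((targets - q_sa) ** 2))
```

Only the training DQN would receive gradients from this loss; the target DQN outputs are treated as constants until its parameters are refreshed.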

B. Proposed Multi-Agent Deep Q-learning Solution
In this section, we present a multi-agent DQN-based dynamic resource allocation framework to solve problem (2). Firstly, we define the concepts of agent, state, action and reward for this approach.
• Agents: we propose a multi-agent deep reinforcement learning scheme with each RB as an agent. As a result, there are $N$ agents in this approach.
• Action of agent $n$: consists of choosing a UE $j$, i.e., an action $a^{(n)} = j$ means that RB $n$ is assigned to UE $j$. A tuple or vector $\mathbf{a}$, composed of the elements $a^{(n)}$, therefore represents a given assignment pattern or association among UEs and RBs.
• State of agent $n$: we describe the state of agent $n$, $s^{(n)}$, as a composition of two parts. The first part is a piece of information common to all agents: an $N$-tuple $\mathbf{a}$ that represents a possible assignment for the system. The second part is specific to agent $n$: a $J$-tuple $\mathbf{u}$, where $u^{(n)}_j = \gamma_{j,n}$, $\forall j$, i.e., each agent or RB $n$ knows the SNR value of all users of the system.
• Reward: the reward function should be designed to maximize the objective (2a) of problem (2). To do that, we use Alg. 1.
The main idea of Alg. 1 is to define a reward value, $\phi$, capable of reporting what is possible to achieve in terms of satisfaction and system throughput for a given assignment. This value tries to measure how close one is to meeting the requirements of problem (2), without disregarding its objective function. Note that if all constraints of problem (2) are met, meaning that the chosen assignment is a feasible solution of problem (2), then $\phi$ is equal to $\sum_{j \in \mathcal{J}} R_j$, which is the objective function (2a). Otherwise, $\phi$ is equal to $\vartheta / \sum_{j \in \mathcal{J}} R_j$. The variable $\vartheta$ is responsible for quantifying how close the chosen assignment is to a feasible solution of problem (2). Notice that if any constraint is not met, then $\vartheta$ is negative, and this represents a punishment or a negative reward.

Algorithm 1 Set reward value
Require: $R_j$ and $\xi_j$, $\forall j \in \mathcal{J}$;
1: $\phi \leftarrow \sum_{j \in \mathcal{J}} R_j$ and $\vartheta \leftarrow 0$;
2: for $l \in \mathcal{L}$ do
3:   if $\sum_{j \in \mathcal{J}_l} u(R_j, \xi_j) < \eta_l$ then
4:     for $j \in \mathcal{J}_l$ do
5:       if $R_j < \xi_j$ then
…

Our proposed solution to problem (2) based on deep Q-learning is shown in Alg. 2. Basically, the idea of this algorithm is to use the concepts of DQN and those defined in Section V-A to approximate the Q-table by a function and, therefore, avoid the main disadvantages of the proposal addressed in [13], such as high memory cost.

Algorithm 2 Deep Q-learning based solution to problem (2)
1: Initialize the replay memory $\mathcal{D}$ with capacity $D$ and the target update period $\tau$;
2: Randomly initialize the target DQN $Q_{\theta_{\text{target}}}$;
3: Randomly initialize the training DQN $Q_{\theta_{\text{train}}}$;
4: for each episode do
5:   Observe current state of all agents $s^{(n)}$;
6:   Each agent $n$ chooses an action $a^{(n)}$ using the $\epsilon$-greedy policy from $Q_{\theta_{\text{target}}}$;
7:   Execute the action of each agent, i.e., $\mathbf{a} \leftarrow [a^{(1)}, \cdots, a^{(N)}]$;
8:   Obtain $\phi$ using Alg. 1;
9:   Observe the next state of each agent ($s'^{(n)}$);
10:  Store experience $(s^{(n)}, a^{(n)}, \phi, s'^{(n)})$ of each agent in $\mathcal{D}$;
11:  Sample a set of random experiences, i.e., a mini-batch $\mathcal{B}$ from $\mathcal{D}$;
12:  Perform the gradient descent step on (5) with respect to the weights $\theta_{\text{train}}$;
13:  At every $\tau$ episodes replace the target parameters, i.e., $\theta_{\text{target}} \leftarrow \theta_{\text{train}}$;
14:  Update the $\epsilon$-greedy decision policy;
15:  Set the current state of each agent to its next state, i.e., $s^{(n)} \leftarrow s'^{(n)}$;
16: end for
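The reward rule can be sketched as below. The way the penalty $\vartheta$ is accumulated is not recoverable from the listing of Alg. 1, so this sketch ASSUMES that $\vartheta$ sums the throughput deficits $R_j - \xi_j$ of the unsatisfied UEs in every service plan that violates its constraint. That choice is only one option consistent with the description (zero when feasible, negative otherwise) and should not be read as the authors' exact rule.

```python
def reward(R, xi, services, eta):
    """Reward in the spirit of Alg. 1 (penalty term is an assumption).

    R: throughput per UE; xi: required throughput per UE;
    services: list of lists of UE indices per service plan;
    eta: minimum number of satisfied UEs per service plan.
    """
    phi = sum(R)   # objective (2a): system throughput
    theta = 0.0    # stays 0 when every service constraint holds
    for ues, eta_l in zip(services, eta):
        satisfied = sum(R[j] >= xi[j] for j in ues)
        if satisfied < eta_l:
            for j in ues:
                if R[j] < xi[j]:
                    # ASSUMPTION: penalize by the throughput deficit of UE j.
                    theta += R[j] - xi[j]
    if theta < 0:
        # Infeasible assignment: negative reward theta / sum(R).
        return theta / phi if phi > 0 else theta
    return phi
```

A feasible assignment therefore returns the system throughput itself, while an infeasible one returns a negative value whose magnitude grows with the distance from feasibility.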
Regarding Alg. 2, in lines 1, 2 and 3 we define the main structures for our approach. More specifically, in line 1, we reserve a limited amount of memory, $\mathcal{D}$, and define a certain number of iterations, $\tau$, that represents the period for updating the target DQN weights. In lines 2 and 3, we randomly initialize the target and training DQNs responsible for the learning process, where both DQNs are fully-connected DNNs that consist of four layers: an input layer, an output layer and two hidden layers in between. The input layer is fed by the state vector of length $(N + J)$ and the output layer has a dimensionality corresponding to the number of possible actions, in this case, $J$.
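The input and output dimensions described above can be illustrated with a small NumPy forward pass. This is a sketch, not the network actually used: the hidden width, the weight scale and the helper names are hypothetical, and a ReLU activation is assumed on the hidden layers.

```python
import numpy as np

def build_state(assignment, snr_n):
    """State of agent n: the N-tuple a followed by the J SNR values on RB n."""
    return np.concatenate([np.asarray(assignment, float), np.asarray(snr_n, float)])

def init_dqn(n_rb, n_ue, hidden, rng):
    """Random weights for input (N + J) -> hidden -> hidden -> output (J)."""
    sizes = [n_rb + n_ue, hidden, hidden, n_ue]
    return [(rng.normal(0.0, 0.1, (m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def q_values(params, state):
    """Forward pass: ReLU on the hidden layers, linear output of J Q-values."""
    x = state
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x
```

Each agent would feed its own state through the same set of weights and pick the UE index with the largest output, so a single parameter set serves all $N$ agents.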
The loop between lines 4 and 16 represents the learning process, which is responsible for adjusting the weights of the training and target DQNs, where each iteration is defined as an episode. We assume an approach in which the actions are taken in parallel by each agent while the training is performed by a central module, according to Fig. 1. In this figure, on one hand, it can be seen that the decisions of the $N$ agents can be chosen in parallel using a DQN whose input and output depend on the current states and possible actions of each agent, respectively. On the other hand, the training phase is centralized and, therefore, the experiences of all agents are constantly collected to adjust the weights of another DQN. This framework eases implementation and improves stability. Moreover, this strategy can also significantly reduce the amount of memory and computational resources required by training [14]. Therefore, each agent has the same copy of $\theta_{\text{target}}$, while $Q_{\theta_{\text{train}}}$ is located at the central module. Thus, in each episode, all agents observe their respective states and are synchronized to take their actions at the same time based on the $\epsilon$-greedy policy from $Q_{\theta_{\text{target}}}$, according to lines 5 and 6. Next, an assignment pattern, $\mathbf{a}$, is defined from the agents' actions, the reward, $\phi$, is calculated and each agent $n$ observes its next state, according to lines 7, 8 and 9, respectively. Note that the reward is common to all agents so that they can benefit from each other's experiences to try to learn the optimal policy. In other words, the agents work collaboratively to maximize the obtained reward and, consequently, the objective in (2a).
After that, in line 10, we define an experience sample as a tuple, $(s^{(n)}, a^{(n)}, \phi, s'^{(n)})$, consisting of the current state $s^{(n)}$, chosen action $a^{(n)}$, reward $\phi$ and next state $s'^{(n)}$ of each agent. In addition, in order to avoid oscillations and divergence in the parameters, we use the concept of memory replay so that the tuples of experiences of all agents are stored in memory $\mathcal{D}$. We consider that this memory is a First In First Out (FIFO) queue where a new experience replaces the oldest experience in the queue when the number of experiences exceeds the capacity, $D$. In order to train the parameters $\theta_{\text{train}}$, a mini-batch of experiences, $\mathcal{B}$, is sampled randomly from $\mathcal{D}$ and the stochastic gradient descent method is performed by the central module to minimize the cost function in (5), in lines 11 and 12, respectively. Furthermore, the process of updating the parameters $\theta_{\text{target}}$ is periodic and, therefore, in line 13, only at every $\tau$ episodes are the new parameters $\theta_{\text{train}}$ made available to the target DQN. Finally, in line 14, the $\epsilon$-greedy policy is updated, the current state of each agent changes to the next state (line 15) and another episode starts. The complexity of Alg. 2 can be evaluated by quantifying the complexity of obtaining the Q-function from the DQN and of training the weights of the DQN, since these are the main operations of this algorithm. It highly depends on the structure of the employed DQN and its parameters. As discussed, in our case, the DQN is composed of fully-connected layers and, thereby, the complexity of the algorithm is given by $O(wm \log m)$, where $w$ is the number of layers and $m$ is the number of units per layer [22].
An interesting property of Alg. 2 is that, depending on the initialization of the DQN weights or parameters, the algorithm can converge to a solution that is more or less accurate relative to the optimal solution of problem (2), given a fixed number of episodes. Indeed, this can be exploited by letting Alg. 2 run multiple times on different cores and, therefore, with totally independent weight initializations. As a result, since the runtime of each core is the same, the idea is to choose the best output as the solution to problem (2), as depicted in Fig. 2. Note that the parallel execution on multiple cores does not necessarily need to be computed on the eNB itself. Due to possible limitations of this infrastructure, such as overhead, small storage space and low computing ability, the data storage and processing can be moved to decentralized and powerful computing platforms located in a cloud. In terms of performance, the cloud utilizes distributed system architectures and can offer excellent computation speeds. Besides, cloud computing provides many other advantages, such as quick deployment, easy integration, resiliency, redundancy, backup and disaster recovery, among others [23]. Thus, the eNB is limited to deciding the best solution after data processing.
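The best-of-several-runs idea can be sketched as follows. Here `run_once` is a hypothetical callable standing in for one complete execution of Alg. 2 with its own random initialization. A thread pool is used only to keep the example short and portable; in practice the runs would be placed on separate cores or machines, for example with a process pool.

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_runs(run_once, seeds, max_workers=4):
    """Run `run_once(seed)` once per seed and keep the best outcome.

    `run_once` is a hypothetical callable returning (throughput, assignment),
    with throughput set to None when no feasible assignment was found.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_once, seeds))
    feasible = [res for res in results if res[0] is not None]
    # Highest throughput among the feasible runs; outage if none is feasible.
    return max(feasible, key=lambda res: res[0]) if feasible else (None, None)
```

Since every run uses the same number of episodes, the wall-clock time is roughly that of a single run, while the chance of landing close to the optimum increases with the number of independent initializations.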

VI. PERFORMANCE EVALUATION
In this section, we evaluate our proposed solution and compare it with the optimal solution and with the solutions of [1], [2] and [13]. We first present the main simulation parameters and then the results and their discussion.

A. Simulation Assumptions
We consider 6 RBs (N = 6), 4 UEs (J = 4) and 2 service plans (L = 2), and we assume that UEs from service plan 2 demand a throughput 150 kbps higher than that of the UEs from service plan 1. In each service plan, we consider only two UEs, where η_1 = 2 and η_2 = 1. We assume 11 QoS levels in kbps such that ξ_{j∈J_1} = (150, 220, ..., 850), i.e., the required data rates for service plan 1 vary between 150 kbps and 850 kbps in steps of 70 kbps. Consequently, the requirements for service plan 2 vary between 300 kbps and 1000 kbps with the same step. The DQN was implemented using TensorFlow [24], assuming two hidden layers. We use the rectified linear unit (ReLU) as the DQN's activation function and Adam's algorithm [25] for the optimization. Moreover, we consider that the ε of the ε-greedy policy varies over the episodes following an exponential decay. All the important simulation parameters are shown in Table I and Table II.
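As a quick check of the QoS levels above, the following Python sketch enumerates the required rates of both service plans and shows one possible form of an exponentially decaying ε. The decay function epsilon_at and its default values (eps_start, eps_min, decay) are hypothetical and are not taken from Table I or Table II.

```python
import math

NUM_LEVELS = 11
STEP_KBPS = 70
PLAN1_START_KBPS = 150
PLAN_OFFSET_KBPS = 150  # service plan 2 demands 150 kbps more than service plan 1

plan1_levels = [PLAN1_START_KBPS + STEP_KBPS * i for i in range(NUM_LEVELS)]
plan2_levels = [rate + PLAN_OFFSET_KBPS for rate in plan1_levels]


def epsilon_at(episode, eps_start=1.0, eps_min=0.01, decay=0.001):
    """Exponentially decaying exploration rate; all three defaults are hypothetical."""
    return eps_min + (eps_start - eps_min) * math.exp(-decay * episode)
```

The two lists contain 11 entries each, ranging from 150 to 850 kbps for service plan 1 and from 300 to 1000 kbps for service plan 2.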
To perform qualitative comparisons with our proposed algorithm (deep Q-RA), we simulate the optimal solution of problem (2) (OPT) as well as the algorithms Reallocation-based Assignment for Improved Spectral Efficiency and Satisfaction (RAISES) [1], Rate Maximization under Experience Constraints (RMEC) [2] and Q-learning based Resource Assignment (Q-RA) [13]. On the one hand, RAISES and RMEC are traditional rule-based algorithms, which use resource reallocation strategies to define the best assignment pattern for the system. On the other hand, Q-RA, as its name suggests, is an algorithm based on the Q-learning technique for resource allocation. Therefore, the Q-RA algorithm is a tabular learning method, where a single agent accumulates all its experience in a Q-table over several episodes.
Regarding the performance metrics, we consider the outage rate and the system throughput. An outage event happens when an algorithm cannot find a feasible solution, i.e., the algorithm does not find a solution fulfilling the constraints of problem (2). The outage rate is then defined as the ratio between the number of instances with outage events and the total number of simulated instances. The system throughput is the sum of the data rates obtained by all the UEs in a given instance. The results were obtained by running 1,000 feasible instances of problem (2) in order to get statistically valid results, and the channel realizations were the same for all the simulated algorithms to ensure fair comparisons.
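The two metrics defined above translate directly into code. The sketch below is only illustrative: the function names and the example values are assumptions, and the numbers are not results from the simulations.

```python
def system_throughput(ue_rates):
    """Sum of the data rates obtained by all UEs in one instance."""
    return sum(ue_rates)


def outage_rate(feasible_flags):
    """Fraction of instances in which no feasible solution was found."""
    flags = list(feasible_flags)
    if not flags:
        raise ValueError("at least one simulated instance is required")
    return sum(1 for ok in flags if not ok) / len(flags)


# Illustrative values only, not results from the paper.
example_rates_kbps = [400.0, 350.0, 620.0, 510.0]
example_flags = [True, True, False, True]
```

For these example values, the system throughput is the sum of the four rates and the outage rate is one infeasible instance out of four.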

B. Numerical Results
Fig. 3 shows the system throughput versus the number of episodes for the OPT and deep Q-RA algorithms, considering the first and the last QoS levels described in Section VI-A. Looking at the performance of the deep Q-RA solution, we can observe that it converges to the OPT solution as the number of episodes increases for both investigated QoS levels. This is an expected result since the more episodes we have, the more accurate the estimation of the Q-function is and, consequently, the more favorable it is for the agents to converge to the optimal solution of problem (2). Moreover, note in Fig. 3 that the convergence time to the optimal solution may vary depending on the required QoS level. This is because at low QoS levels there are several possible solutions and, as a result, it can be more difficult to converge to the optimal solution of problem (2). For scenarios with high required QoS levels, possible solutions are rarer but, once found, they are near optimal; consequently, the deep Q-RA algorithm tends to focus on them. Indeed, this can lead to faster convergence.
In Fig. 4 and Fig. 5, we plot the system throughput and the outage rate versus the number of parallel cores in the system, respectively, in order to show the advantages of the structure illustrated in Fig. 2. From here on, all results are presented with confidence intervals at a 95% confidence level. Firstly, in Fig. 4 and Fig. 5, note that as the number of cores in the system increases, there is a considerable increase in the performance of the proposed solution. In addition, due to the characteristics of the deep Q-learning technique, this structure does not require high memory consumption and, as shown in these figures, a relatively low number of cores is enough to ensure excellent performance. Note, for example, that with fewer than 10 cores there is practically no outage in the system for the investigated scenario.
Now we compare our approach with other proposals from the literature. In Fig. 6 and Fig. 7, we plot, respectively, the system throughput and the outage rate in the considered scenario versus the QoS level for the OPT, RAISES, RMEC, Q-RA and deep Q-RA algorithms. For the Q-RA and deep Q-RA algorithms, we consider 3,000 episodes in the plots of these figures. Besides, for the deep Q-RA algorithm we use 10 cores. We first observe a near optimal performance of our proposed solution to problem (2), both in terms of outage rate and system throughput. In fact, we highlight Fig. 7, which shows that the outage curves are considerably better for the solutions based on RL, with even better performance for the deep Q-RA solution. In this figure, notice that the outage rates are smaller than 1% and 0.1% for Q-RA and deep Q-RA, respectively.
On the other hand, the RAISES and RMEC solutions have much higher outage rates, approximately 7% and 10% for the highest QoS level, respectively. This shows that solutions based on ML algorithms may perform better than traditional heuristics and, therefore, they can be considered a promising tool to solve resource allocation problems in modern networks. However, as highlighted in [13], the Q-RA solution may require a high memory cost to build and store the Q-table because it directly depends on the size of the space S × A. This makes its use more difficult in interesting and realistic scenarios. As discussed earlier, this is not a problem for the deep Q-RA solution, which may in fact become a more attractive and less problematic solution in larger scenarios.
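The 95% confidence intervals assumed for the results of this section can be obtained in several ways, and the exact procedure is not detailed here. As one possible sketch, the function below uses the normal approximation for the sample mean, which is reasonable for a large number of instances such as the 1,000 considered; the function name and the choice of approximation are assumptions.

```python
import math
import statistics


def confidence_interval_95(samples):
    """Normal-approximation 95% confidence interval for the sample mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    # 1.96 is the two-sided 95% quantile of the standard normal distribution.
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width
```

For small sample sizes, a Student's t quantile would be a more appropriate multiplier than 1.96.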

VII. CONCLUSIONS AND PERSPECTIVES
In this paper, we have investigated the problem of maximizing the system throughput subject to user satisfaction ratio constraints in a multiservice scenario. This problem was previously studied in [1], [2] and [13], where traditional heuristics or machine learning based methods were proposed. To tackle this problem, we have proposed a new decentralized radio resource allocation mechanism employing multi-agent deep reinforcement learning. From the simulation results, we have shown that each agent can learn how to jointly deal with resource allocation and QoS guarantees while maximizing the system throughput. As a result, our proposal can provide better performance than the other benchmark approaches simulated in this article.
Regarding future work, we believe that the framework proposed in this paper can be improved by taking into consideration the channel correlation over time and by redefining the system state, in order to considerably decrease the need for training when applied in dynamic contexts. Finally, other approaches in which learning-based techniques are jointly responsible for allocating power and resources can also be analyzed in the future.

Fig. 4. System throughput of deep Q-RA algorithm versus the number of parallel cores in the system.

Fig. 5. Outage rate of deep Q-RA algorithm versus the number of parallel cores in the system.

Fig. 6. System throughput versus QoS level for OPT, RAISES, RMEC, Q-RA and deep Q-RA algorithms in the considered scenario.

Fig. 7. Outage rate versus QoS level for OPT, RAISES, RMEC, Q-RA and deep Q-RA algorithms in the considered scenario.
QoS in different types of service plans, something that few papers consider in the literature.