In each iteration k, the agent observes the current state s, chooses and executes an action a from the available action set A, and then updates the Q-factor according to the obtained reward r(s,a) and the transition to the next state s'. [...] The simplest way to extend RL to MARL is to consider the local state and local action of each agent, assuming a stationary environment in which the agent's own policy is the prime factor affecting the environment. [...] The agent is the learner and decision-maker that interacts with the environment by first receiving the system's state and the reward and then selecting an action accordingly. [...] The state space and the action space are distributed such that each agent learns a joint policy with one of its neighbours at a time, following the principle of modular Q-learning. [...]

Reward Definition: The Reduction in the Total Cumulative Delay

The immediate reward for a given agent is defined as the reduction (saving) in the total cumulative delay associated with that agent, i.e., the difference between the total cumulative delays at two successive decision points.
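The per-iteration loop described above (observe s, select a from A, receive r(s,a), transition to s', update the Q-factor) can be sketched as tabular Q-learning. The learning rate, discount factor, and exploration rate below are assumed placeholder values, not parameters taken from the source:

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate (assumed value)
GAMMA = 0.9    # discount factor (assumed value)
EPSILON = 0.1  # exploration rate (assumed value)

def q_update(Q, s, a, r, s_next, actions):
    """One Q-factor update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def choose_action(Q, s, actions):
    """Epsilon-greedy selection over the available action set A."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

A `defaultdict(float)` serves as the Q-table, so unseen state-action pairs start at zero.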
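The modular Q-learning idea above (one joint policy learned with one neighbour at a time) can be sketched as one Q-module per neighbour, keyed on the joint state of the agent and that neighbour. The class interface and the sum-of-modules action selection are assumptions for illustration, not the source's exact formulation:

```python
from collections import defaultdict

class ModularQAgent:
    """Sketch of modular Q-learning: one Q-table per neighbour, each
    defined over the joint (own state, neighbour state); the agent
    updates with one neighbour at a time."""

    def __init__(self, neighbours, actions, alpha=0.1, gamma=0.9):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        # one module per neighbour: keys are (joint_state, action)
        self.Q = {n: defaultdict(float) for n in neighbours}

    def update(self, nbr, joint_s, a, r, joint_s_next):
        """Q-learning update on the module shared with neighbour `nbr`."""
        q = self.Q[nbr]
        best = max(q[(joint_s_next, a2)] for a2 in self.actions)
        q[(joint_s, a)] += self.alpha * (r + self.gamma * best - q[(joint_s, a)])

    def select(self, joint_states):
        """Pick the action maximizing the sum of module Q-values
        (a common way to combine modules; an assumption here)."""
        def score(a):
            return sum(self.Q[n][(joint_states[n], a)] for n in joint_states)
        return max(self.actions, key=score)
```

Summing module values keeps the per-module tables small while still letting every neighbour influence the chosen action.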
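The reward definition reduces to a difference of two delay totals; a minimal sketch, with a hypothetical helper name:

```python
def delay_saving_reward(prev_total_delay, curr_total_delay):
    """Immediate reward: the reduction (saving) in the agent's total
    cumulative delay between two successive decision points.
    Positive when delay decreased, negative when it grew."""
    return prev_total_delay - curr_total_delay
```

With this sign convention, the agent is rewarded for decisions that shrink the cumulative delay it is responsible for.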