Third Journal Summary
Literature Review
(txyz.ai) The paper addresses the problem of large-scale traffic signal control (TSC), which is a challenging task due to the difficulty in effectively utilizing limited road resources and the dynamic nature of traffic conditions. Reinforcement learning (RL) has been identified as a promising adaptive decision-making method for TSC, as it can make real-time decisions based on traffic flow and can also predict future traffic flow. Recent advancements in deep learning have further enhanced the capabilities of RL, allowing it to handle large-scale state and action spaces. However, when dealing with multiple signalized intersections, a straightforward centralized RL approach may not be feasible due to the high latency and failure rate associated with collecting global traffic data. Therefore, researchers have explored decentralized multi-agent reinforcement learning (MARL) approaches, where each traffic signal is controlled by an independent agent. While MARL shows promise, there is still room for improvement in scaling to large-scale problems and effectively modeling the behaviors of other agents. The proposed Cooperative Double Q-Learning (Co-DQL) algorithm in this paper aims to address these challenges by using a scalable independent double Q-learning method and modeling agent interactions through mean-field approximation. Additionally, the paper introduces new reward allocation and local state sharing mechanisms to improve the stability and robustness of the learning process. The experimental results demonstrate that Co-DQL outperforms state-of-the-art decentralized MARL algorithms for TSC.
(humanized) The paper focuses on large-scale traffic signal control (TSC), a challenging problem because limited road resources are difficult to use effectively and the traffic conditions that drive signal phasing fluctuate constantly. Reinforcement learning (RL) has shown promise as an adaptive decision-making approach to TSC, since it can make signal control decisions in real time based on observed traffic flow and can also anticipate future traffic flow, both short term and long term. Advances in deep learning have further strengthened RL, allowing it to handle the large state and action spaces that arise in TSC, which is elaborated later in the paper. Even so, a simple centralized RL model becomes ineffective once multiple signalized intersections must be managed together: collecting global traffic data introduces delay, and as that delay grows the failure rate of the centralized method climbs, making it infeasible at scale. Researchers have therefore turned to decentralized multi-agent reinforcement learning (MARL) approaches, in which each traffic signal is controlled by an independent agent; nevertheless, these solutions still need to scale better to large problems and to model the behavior of other agents more effectively. To that end, Cooperative Double Q-Learning (Co-DQL) combines a scalable independent double Q-learning method with mean-field approximation for modeling the interactions among agents. The paper also presents new reward allocation and local state sharing mechanisms to improve the stability and robustness of learning. In the experimental work, the Co-DQL algorithm bested a number of state-of-the-art decentralized MARL algorithms for TSC.
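To make the core idea concrete, the sketch below shows a tabular double Q-learning update in which each intersection agent's Q-value is also conditioned on the mean action of its neighbours, in the spirit of the mean-field approximation described above. This is an illustrative sketch, not the authors' implementation; the action set, state encoding, and hyperparameters (alpha, gamma) are placeholder assumptions.

```python
import random

def double_q_update(Q_a, Q_b, s, a, mean_a, r, s_next, mean_a_next,
                    actions=(0, 1), alpha=0.1, gamma=0.95):
    """One illustrative double Q-learning step for a single intersection agent.

    Q_a, Q_b : two independent estimators, dicts keyed by (state, action, mean_action)
    mean_a   : average action of neighbouring agents (the mean-field term)
    actions  : placeholder phase choices, e.g. 0 = keep phase, 1 = switch phase
    """
    # Randomly choose which estimator to update, so that action selection and
    # action evaluation are decoupled (the core idea of double Q-learning).
    if random.random() < 0.5:
        Q_sel, Q_eval = Q_a, Q_b
    else:
        Q_sel, Q_eval = Q_b, Q_a

    # Greedy next action under the selected estimator, valued by the other one.
    a_star = max(actions, key=lambda x: Q_sel.get((s_next, x, mean_a_next), 0.0))
    target = r + gamma * Q_eval.get((s_next, a_star, mean_a_next), 0.0)

    key = (s, a, mean_a)
    Q_sel[key] = Q_sel.get(key, 0.0) + alpha * (target - Q_sel.get(key, 0.0))
```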
Literature Review table
Objective
Address the problem of large-scale traffic signal control (TSC) using a novel multi-agent reinforcement learning (MARL) approach called Cooperative Double Q-Learning (Co-DQL).
Contribution
Proposed a multi-agent reinforcement learning technique, Cooperative Double Q-Learning (Co-DQL), for traffic signal control.
Research Method
Cooperative Double Q-Learning (Co-DQL)
Research Methodology
This approach uses multi-agent reinforcement learning (MARL). Reinforcement learning is applied to traffic signal control and tested on various traffic flow scenarios in a traffic signal control simulation, with each agent improving its adaptive decision making while taking the other signals into account.
Data collection
The agents use this data to learn an optimal joint policy that maximizes the global reward of the road network, while considering the dynamic interactions between the agents and the environment.
Journal [3] used two traffic signal control (TSC) simulators to gather data. The first is a grid TSC simulation platform based on OpenAI Gym. Three scenarios are used in the experiments: 1) global random traffic flow; 2) double-ring traffic flow; and 3) four-ring traffic flow. The simulation time of each episode is 60 min and there are four groups of traffic flows. The second simulator is more realistic: the road network of some areas of Xi’an is taken as a prototype of a real road network to design a TSC simulator based on SUMO, with 49 signalized intersections.
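The following sketch illustrates how experience could be collected from a gym-style multi-intersection environment such as the grid platform described above. The dictionary-based environment interface and the collect_episode helper are assumptions made for illustration; they are not taken from the paper's simulators.

```python
def collect_episode(env, policies):
    """Roll out one episode; each intersection agent acts with its own policy.

    Assumes a gym-like environment whose reset()/step() use dictionaries keyed
    by agent id (one agent per signalized intersection) -- a placeholder
    interface, not the paper's actual simulator API.
    """
    transitions = []
    obs = env.reset()
    done = False
    while not done:
        actions = {agent_id: policies[agent_id](o) for agent_id, o in obs.items()}
        next_obs, rewards, done, info = env.step(actions)
        transitions.append((obs, actions, rewards, next_obs))
        obs = next_obs
    return transitions
```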
Data preprocessing
In the first simulator, the normal driving time between two intersections (that is, the distance between two intersections) is such that a normally driving vehicle needs five time steps to travel from one intersection to an adjacent one. The shortest length is 2 and the longest is 20. The action time interval of the signal agent is 4, meaning that a signal agent must keep an action for at least four time steps before it can change it.
For the second simulator, the simulation time of each episode is 60 min and four traffic flow groups are set up. The four groups are generated as multiples of the unit flows 1100, 660, 920, and 552 veh/h. The first two traffic flows are simulated during the first 40 min, as [0.4, 0.7, 0.9, 1.0, 0.75, 0.5, 0.25] unit flows at 5-min intervals, while the last two traffic flows are generated during a shifted time window from 15 to 55 min, as [0.3, 0.8, 0.9, 1.0, 0.8, 0.6, 0.2] unit flows at 5-min intervals.
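As a small worked example, the snippet below reconstructs the demand schedule described above from the unit flows and the per-interval multipliers given in the paper; the helper function itself is illustrative and not the authors' code.

```python
UNIT_FLOWS = [1100, 660, 920, 552]                      # veh/h for the four flow groups
MULT_FIRST = [0.4, 0.7, 0.9, 1.0, 0.75, 0.5, 0.25]      # first two groups, 0-40 min
MULT_LAST  = [0.3, 0.8, 0.9, 1.0, 0.8, 0.6, 0.2]        # last two groups, 15-55 min

def flow_schedule(unit_flow, multipliers, start_min, interval_min=5):
    """Return (start_minute, flow_in_veh_per_h) pairs for one flow group."""
    return [(start_min + i * interval_min, unit_flow * m)
            for i, m in enumerate(multipliers)]

# Example: 5-min demand schedule for the first flow group (1100 veh/h unit flow).
for t, f in flow_schedule(UNIT_FLOWS[0], MULT_FIRST, start_min=0):
    print(f"minute {t:2d}: {f:.0f} veh/h")
```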
Model training
In order to analyze the performance of the proposed algorithm, it is compared with several popular RL methods in the same traffic scenarios. For the implementation of Co-DQL, a multilayer fully connected neural network is used to approximate the Q-function of each agent, with ReLU activations between the hidden layers and applied to the final output of the Q-network. MA2C is a state-of-the-art decentralized MARL algorithm for large-scale TSC. IQL has almost the same hyperparameter settings as Co-DQL; its network architecture is identical to Co-DQL's, except that the mean action and the shared joint state are not fed as additional inputs to the Q-network. IDQL has almost the same settings as IQL; the main difference is that double estimators are used when calculating the target value. DDPG is an off-policy algorithm consisting of two parts: 1) an actor and 2) a critic. Each agent is trained with the DDPG algorithm; the critic is shared among all agents in each experiment while all of the actors are kept separate.
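The sketch below shows one plausible way to realize the Q-network described above: a fully connected network with ReLU hidden layers whose input concatenates the agent's local observation with the shared neighbour state and the neighbours' mean action. The layer sizes, input dimensions, and the use of PyTorch are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class CoDQLQNet(nn.Module):
    """Illustrative Q-network: local observation + shared state + mean action in,
    one Q-value per signal phase action out."""

    def __init__(self, obs_dim, shared_dim, mean_action_dim, n_actions, hidden=128):
        super().__init__()
        in_dim = obs_dim + shared_dim + mean_action_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, shared_state, mean_action):
        x = torch.cat([obs, shared_state, mean_action], dim=-1)
        return self.net(x)

# An IQL/IDQL-style baseline would use the same architecture but feed only
# `obs`, without the mean action or the shared joint state.
```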
Model evaluation
In the global random traffic flow scenario, all models of each algorithm are evaluated over 100 episodes. The average delay time is calculated from the total delay time of vehicles in the road network during an episode, and the standard deviation is given in parentheses after the mean value. Co-DQL greatly reduces the average delay time compared with the other methods. (The test results are basically consistent with the trained model performance, which shows the validity of the trained model.)
For the double-ring traffic flow scenario, Co-DQL outperforms all the other methods. MA2C obtains a better result than IQL and IDQL, although the final training results are similar. This may be because the double-ring traffic flow problem is relatively simple, so the three methods can achieve fairly consistent results.
For the four-ring traffic flow scenario, the learning process of Co-DQL is relatively stable and its standard deviation in the evaluation process is smaller than that of the other methods. Ultimately, Co-DQL achieves the shortest average delay time by means of mean-field approximation for opponent modeling and local information sharing.
When using the more realistic simulator, Co-DQL and MA2C show more robust test performance than IQL, IDQL, and DDPG. Ultimately, Co-DQL achieves the best average performance across multiple measures, which shows the advantage of the mean-field approximation in agent behavior modeling.
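A tiny sketch of the evaluation protocol used throughout this section is given below: roll out a trained policy for 100 episodes and report the mean and standard deviation of the per-episode delay. The run_episode stub is a placeholder; in the paper it would correspond to a full 60-min rollout in the TSC simulator.

```python
import random
import statistics

def run_episode(policy, env):
    # Placeholder: a real rollout would step the simulator for 60 simulated
    # minutes and accumulate the delay of every vehicle in the road network.
    return random.uniform(100.0, 200.0)

def evaluate(policy, env, episodes=100):
    delays = [run_episode(policy, env) for _ in range(episodes)]
    return statistics.mean(delays), statistics.stdev(delays)

mean_delay, std_delay = evaluate(policy=None, env=None)
print(f"average delay: {mean_delay:.1f} ({std_delay:.1f})")
```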
Conclusion
Based on the research from all the journals referenced, there are several reinforcement learning techniques that can be implemented for traffic signal control. To reduce traffic congestion, multi-agent reinforcement learning has been used to regulate the traffic flow of a road network at its intersections. This is far superior to traditional traffic signal control systems, where static timing plans cause ineffective traffic management. There are, however, a few limitations and challenges in traffic signal control; one of them is how to respond to the dynamic interaction between each individual agent and the environment.
To build a better TSC system, some future enhancements can be made. First, multi-agent reinforcement learning can be explored further to balance various performance metrics, such as reducing the average waiting time of traffic and lowering queue lengths. Second, the TSC system can be enhanced to cope well with a variety of environments and scenarios, such as unexpected rain disturbances, recurring and non-recurring traffic congestion, and more complex road networks.