DRL - SPaT
Restriction
Hyperparameters
State
Objective Function
Environment
Training / Results
Pros
Terms
Questions
*8. Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control / 2020
Chu, Tianshu; Wang, Jie; Codeca, Lara; Li, Zhaojian
10.1109/TITS.2019.2901791
Restriction
A recent work suggested an interaction period of t = 10 s and yellow time ty = 5 s [14].
In this study, t = 5 s and ty = 2 s.
Hyperparameters
Episode = 1400 / Evaluate: 10 episodes = 7200s * 10 times
Horizon T = 720 steps || 10 sec/step
State
wait [s]: waiting time of the first vehicle on each incoming lane
wave [veh]: total number of approaching vehicles along each incoming lane, within 50 m of the intersection
lane-based
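A minimal sketch of how the lane-based wave/wait observation could be measured, assuming a running SUMO simulation controlled through the TraCI Python API; the 50 m range follows the note, the function and variable names are illustrative.

import traci

def lane_observation(lane_id, detection_range=50.0):
    """Return (wave, wait) for one incoming lane.
    wave [veh]: approaching vehicles within `detection_range` metres of the stop line.
    wait [s]:   waiting time of the first (closest) vehicle on the lane."""
    lane_length = traci.lane.getLength(lane_id)
    wave, first_wait, closest = 0, 0.0, float("inf")
    for veh_id in traci.lane.getLastStepVehicleIDs(lane_id):
        dist_to_stop = lane_length - traci.vehicle.getLanePosition(veh_id)
        if dist_to_stop <= detection_range:
            wave += 1
        if dist_to_stop < closest:          # first vehicle = closest to the intersection
            closest = dist_to_stop
            first_wait = traci.vehicle.getWaitingTime(veh_id)
    return wave, first_wait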
Environment
4-phase, thru and left
Training / Results
Pros
Terms
Refer to the standard action definitions under Action (below).
Action
There are several standard action definitions, such as phase switch [38], phase duration [14], and phase itself [26].
next Phase
Reward
-(queue length + waiting time of the first vehicle)
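A minimal sketch of this reward, assuming a sum over incoming lanes and a waiting-time weight a; both the summation and the weight are assumptions beyond the note.

def reward(queues, first_waits, a=1.0):
    """queues: queue length [veh] per incoming lane; first_waits: first-vehicle waiting time [s] per lane."""
    return -sum(q + a * w for q, w in zip(queues, first_waits))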
Model
MA2C
Adaptive Traffic Signal Control with Deep Recurrent Q-learning / 2018
Zeng, Jinghong; Hu, Jianming; Zhang, Yi
10.1109/IVS.2018.8500414
Restriction
Max. G = 60 sec
Min.G = 6 sec
Hyperparameters
ε decayed within [0.1, 1]
1e6 steps
Adam, learning rate α = 0.00025
Huber loss instead of the squared loss function
Input: 6 x 4 x 2 state tensor concatenated with a 1 x 4 phase vector
State
3 vectors, one-hot coded: S = {D, V, P}
D: presence of a vehicle in each cell
V: vehicle speed of each cell
P: current signal phase
Shape: 6 x 4 x 2 (channels)
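A minimal sketch of assembling the state tensor with the shapes listed above (6 x 4 x 2 plus a 1 x 4 one-hot phase vector); variable and function names are illustrative, not from the paper.

import numpy as np

def build_state(cell_occupied, cell_speed, phase_index, n_phases=4):
    """cell_occupied, cell_speed: arrays of shape (6, 4), one entry per cell and approach."""
    D = np.asarray(cell_occupied, dtype=np.float32)       # 0/1 vehicle presence per cell
    V = np.asarray(cell_speed, dtype=np.float32)          # vehicle speed per cell
    S = np.stack([D, V], axis=-1)                         # shape (6, 4, 2)
    P = np.eye(n_phases, dtype=np.float32)[phase_index]   # one-hot phase vector, shape (4,)
    return S, P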
Previous researchers have shown that the current phase information (e.g. the duration of the phase) is useful for learning a proper control policy.
It is more attractive to let the agent learn the state information from the raw traffic data by itself.
Training / Results
Method
DQN
Recurrent - DQN
Vehicle / SPaT info are separated
Environment
Geometry
4-phase, 4-leg, 2-way 4-lane thru/left traffic
500-m length / 256-m perception
Each approach of length l is divided into c segments, with c = 16.
similar to 2
Action
2 actions: extend current / switch to next
Reward
Halting-vehicle difference: r = N(t) - N(t-1)
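A small sketch of this reward exactly as written, r = N(t) - N(t-1) with N the number of halting vehicles; whether the sign should be flipped so that fewer halting vehicles give a higher reward is left as in the note.

class HaltingDiffReward:
    def __init__(self):
        self.prev = None                    # halting-vehicle count from the previous step
    def __call__(self, n_halting_now):
        r = 0.0 if self.prev is None else n_halting_now - self.prev
        self.prev = n_halting_now
        return r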
DTSE
Adam
DTSE: The authors refer to this representation as discrete traffic state encoding (DTSE).
The model is detailed
Deep Deterministic Policy Gradient for Urban Traffic Light Control / 2017
Casas, Noe
http://arxiv.org/abs/1703.09035
Restriction
Fixed Cycle Length = 120 sec
Hyperparameters
State
Detector data (measurable with detectors):
vehicle count
normalized speed
occupancy
Objective Function
Training / Results
Pros
Terms
Questions
Action
Each Phase duration
Reward
α x vehs x (Speed Score - baseline) / all vehs --> [-1, 1]
α = 0.02
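One possible reading of the reward above, written as a hedged sketch: α times the summed (speed score - baseline) over observed vehicles, normalised by the total vehicle count and clipped to [-1, 1]; the exact normalisation is not fully clear from the note.

def speed_score_reward(speed_scores, baseline, total_vehicles, alpha=0.02):
    raw = alpha * sum(s - baseline for s in speed_scores) / max(total_vehicles, 1)
    return max(-1.0, min(1.0, raw))         # bound the reward to [-1, 1]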
*7. A Deep Reinforcement Learning Network for Traffic Light Cycle Control / 2018
Liang, Xiaoyuan; Du, Xunsheng; Wang, Guiling; Han, Zhu. 10.1109/TVT.2018.2890726 (arXiv: Deep Reinforcement Learning for Traffic Light Control in Vehicular Networks)
Restriction
Fixed Cycle Length = 120 sec
Hyperparameters
State
2 channel matrix
1st layer: presence
2nd layer: speed in m/s
Objective Function
Environment
4-phase, thru and left
Pros
Terms
Refer to the Action node below.
Action (?)
The duration of one of the phases can be increased or decreased by 5 sec.
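A hedged sketch of the +/-5 s phase-duration action under the fixed 120 s cycle; how the remaining phase durations are adjusted to keep the cycle constant (proportional rescaling here) and the minimum green are assumptions, not stated in the note.

def apply_action(durations, action, step=5, cycle=120, min_green=5):
    """durations: current phase durations [s]; action in [0, 2*len(durations)-1]."""
    phase, sign = divmod(action, 2)
    delta = step if sign == 0 else -step
    new = list(durations)
    new[phase] = max(min_green, new[phase] + delta)
    # rescale the other phases so the durations still sum to `cycle` seconds
    rest = sum(d for i, d in enumerate(new) if i != phase)
    scale = (cycle - new[phase]) / max(rest, 1e-6)
    return [d if i == phase else max(min_green, d * scale) for i, d in enumerate(new)]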
Reward
accumulated delay during the latest cycle
*9. Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events / 2017
Aslani, Mohammad
Mesgari, Mohammad Saadi
Wiering, Marco
10.1016/j.trc.2017.09.020
Restriction
In this study, t = 5 s and ty = 2 s.
A recent work suggested an interaction period of t = 10 s and yellow time ty = 5 s [14].
Hyperparameters
Episode = 1400 / Evaluate: 10 episodes = 7200s * 10 times
Horizon T = 720 steps || 10 sec/step
State
wait [s]: waiting time of the first vehicle on each incoming lane
wave [veh]: total number of approaching vehicles along each incoming lane, within 50 m of the intersection
lane-based
Environment
4-phase, thru and left
Training / Results
Pros
Terms
Action
Sparse action set {0, 10, 20, 30, 40, 50, 60, 70, 80, 90} for the current phase duration
Reward
-(queue length + waiting time of the first vehicle)
Model
MA2C
Adaptive Traffic Signal Control: Deep Reinforcement Learning Algorithm with Experience Replay and Target Network / 2017
Gao, Juntao; Shen, Yulong; Liu, Jia; Ito, Minoru; Shiratori, Norio
ArXiv: 1705.02755v1
Model
DQN
Reward
R = TT(begin) - TT(end)
Action
2 actions: N-S green / W-E green
Fixed green: 10 sec
Terms
Pros
Training / Results
Environment
4-phase; thru, left, and right movements
State
presence of the cell
speed of the cell
segment-based
Hyperparameters
Episode = 2000
Horizon T = 1.5 hr / episode
Restriction
A recent work suggested an interaction period of t = 10 s and yellow time ty = 5 s [14].
In this study, t = 5 s and ty = 2 s.
Traffic light control using deep policy-gradient and value-function-based reinforcement learning / 2017
Mousavi, Seyed Sajad; Schukat, Michael; Howley, Enda
10.1049/iet-its.2017.0153
Restriction
no red, immediate switch
min. G = 15 s
Hyperparameters
γ = 0.99
layer: 2 Conv, 1 fully, 2 outputs
Input: 128 x 128 x 4 image
Activation function: tanh
α = 0.00001
batch size = 32
State
Screenshots from the SUMO GUI; 1/0 represents the presence of a vehicle.
Using the last several pictures, speed and direction can then be inferred.
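A small sketch of the image state, assuming the last four 128 x 128 grayscale snapshots are stacked into a 128 x 128 x 4 input (matching the size listed under Hyperparameters); how each frame is captured from the SUMO GUI is left abstract.

from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, size=(128, 128), depth=4):
        self.frames = deque([np.zeros(size, dtype=np.float32)] * depth, maxlen=depth)
    def push(self, frame):
        """Add the newest snapshot and return the stacked observation, shape (128, 128, 4)."""
        self.frames.append(np.asarray(frame, dtype=np.float32))
        return np.stack(self.frames, axis=-1)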
Training / Results
Performance: Delay, Stops
Method
PG method
Value-based
CNN extracts features
Environment
Geometry
2-phase, 4-leg, 2-way 4-lane only thru traffic
Action
E-W G / N-S G
Reward
Accumulated delay difference (?)
Question: can cumulative delay differences be positive?
Deep Deterministic Policy Gradient for Traffic Signal Control of Single Intersection / 2019
Pang, Hali; Gao, Weilong
10.1109/CCDC.2019.8832406
Restriction
min. G = 5 s
Hyperparameters
γ = 0.99
layer: 2 Conv, 1 fully, 2 outputs
Input: 128 x 128 x 4 image
Activation function: tanh
α = 0.0005
batch size = 32
State
Screenshots from the SUMO GUI; 1/0 represents the presence of a vehicle.
Using the last several pictures, speed and direction can then be inferred.
Method
DDPG
DQN
Environment
Geometry
2-phase, 4-leg, 2-way 4-lane only thru traffic
Action
Next phase green time selected by DDPG
If the chosen green time is below the minimum green, the phase is skipped.
This is more flexible and adapts better to dynamic traffic flow at the intersection.
Directly selects the next phase
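A small sketch of the action handling described above; the assumption that the actor output lies in [0, 1] and the 60 s upper bound are illustrative, not taken from the paper.

def next_phase_green(actor_output, min_green=5.0, max_green=60.0):
    green = float(actor_output) * max_green   # scale the DDPG actor output to seconds
    if green < min_green:
        return None                           # below the minimum green: skip this phase
    return min(green, max_green)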
Reward
R = K x (D0 - Dt), where D0 is the expected delay and Dt is the actual delay
Question: can cumulative delay differences be positive?
Training / Results
Traffic signal timing via deep reinforcement learning / 2016
Li, Li; Lv, Yisheng; Wang, Fei-Yue
10.1109/JAS.2016.7508798
Restriction
no red, immediate switch
min. G = 15 s
Hyperparameters
ε = max(0.001, 1 - t/500)
γ = 0.1: future returns heavily discounted --> focus on current reward
layer : 32-16-4-2
Activation function: Sigmoid
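A one-line sketch of the ε schedule quoted above: exploration decays linearly over the first 500 steps and then stays at 0.001.

def epsilon(t):
    return max(0.001, 1.0 - t / 500.0)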
State
Possible state representation
For traffic signal timing problems for intersections, traffic sensors measure the states (speed, queueing length, etc.) of traffic flows.
Queue length, collected every 5 sec
Last 4 collections used as the state
Training / Results
Performance: Delay, Stops
Method
Deep stacked autoencoders (SAE) neural network
Off-policy (greedy prediction / ε-greedy action selection)
Environment
Geometry
2-phase, 4-leg, 2-way 4-lane only thru traffic
Action
Remain (always remain if min. G is not yet met)
Switch
Reward
Absolute queue-length difference (regardless of time)
Using a Deep Reinforcement Learning Agent for Traffic Signal Control / 2016
Genders, Wade; Razavi, Saiedeh
http://arxiv.org/abs/1611.01142
Restriction
Hyperparameters
ε = max(0.001, 1 - n/N)
γ = 0.95: future returns mildly discounted --> focus on both current and future rewards
learning rate - α = 0.00025
State
3 vectors:
presence of vehicle at that cell
vehicle speed of that cell
current signal
The motivation behind the DTSE is to retain useful information. If the agent is to discover the optimal policy, it must discover the optimal actions for any given state;
having knowledge of approaching vehicles' speeds and positions is conjectured to be superior to knowing only the number of queued vehicles.
Training / Results
Method
Q-learning
A CNN extracts features; the real-valued, boolean, and current-phase vector P (4 x 1) are concatenated as the input of the DRL model.
Environment
Geometry
4-phase, 4-leg, 2-way 4-lane thru/left traffic
750-m length
Each approach of length l is divided into c segments.
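A small sketch of the segment (cell) indexing implied above: an approach of length l is divided into c equal segments, and a vehicle's distance to the stop line maps to one cell of the DTSE vectors; the distance convention is an assumption.

def cell_index(dist_to_stop_line, approach_length, c):
    """Return the DTSE cell index in [0, c-1] for a vehicle on the approach."""
    seg_len = approach_length / c
    idx = int(dist_to_stop_line // seg_len)
    return min(max(idx, 0), c - 1)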
Action
Episode-based, which means a fixed table is used per simulation
4 actions: E-W thru/left, N-S thru/left
Reward
Cumulative vehicle delay between actions.
Episode-based, 1.25 hrs per epoch.
Featured Literature