A Coggle Diagram about Change
- In the proposed approach, objectives are defined as specific states or sets of states (features). This is perhaps a bit limiting, since in RL there are not necessarily goal states, only reward, although reward is indeed usually tied to specific regions of the state space.
- The way these secondary goal states (means of clusters) are assigned a reward signal is quite arbitrary, though, and may not be useful. Say you actually discover a secondary objective that in a following task is your primary objective. The authors assigned it a reward of 100 at the goal state and learned a value function for it, but perhaps the reward function for that task is actually -1 everywhere and 0 at the goal. The learned value function would probably not be very useful for that task if used as is. More sophisticated techniques that can reuse this value function independently of the arbitrarily defined reward signal would likely be more useful, for example probabilistic policy reuse with a policy based on the learned value function (sketched below).
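  As a hedged sketch of this suggestion, the snippet below implements the core of probabilistic policy reuse in the style of π-reuse (Fernández and Veloso): with probability psi the agent follows the policy that is greedy in the old task's value function, otherwise it acts epsilon-greedily on the new task's own estimates. The old value function then serves only as an exploration bias, independent of the reward scale it was learned with. All names and the decay schedule are illustrative, not from the paper.

  ```python
  import random

  def greedy_action(values, state, actions):
      """Pick the action with the highest value in `state` (ties broken arbitrarily)."""
      return max(actions, key=lambda a: values[(state, a)])

  def pi_reuse_action(q_old, q_new, state, actions, psi, epsilon=0.1):
      """Probabilistic policy reuse, sketched: with probability psi exploit
      the policy greedy in the OLD task's value function, otherwise act
      epsilon-greedily on the NEW task's value function. The old value
      function only biases exploration, so its absolute reward scale
      (100 at the goal vs. -1 per step) no longer matters."""
      if random.random() < psi:
          return greedy_action(q_old, state, actions)
      if random.random() < epsilon:
          return random.choice(actions)
      return greedy_action(q_new, state, actions)

  # Typical usage inside an episode: decay psi so the reuse bias fades
  # as the new task's own value estimates become reliable, e.g.
  # psi *= 0.95  # per-episode decay (illustrative value)
  ```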
- It looks like once a feature vector is assigned to a specific cluster, it never leaves that cluster, even though that cluster might change significantly. Say you start with one sample and then observe a second sample very far away: I assume it gets added to the first cluster, since that cluster has no variance yet? (See the toy sketch below.)
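  To illustrate the concern, here is a toy incremental nearest-mean clusterer with irrevocable assignments; this is an assumed simplification, not necessarily the paper's exact algorithm. With only one existing cluster (and no variance estimate to reject outliers), the distant second sample is absorbed, drags the mean far from both points, and neither point is ever reassigned.

  ```python
  import numpy as np

  class IncrementalClusters:
      """Toy incremental clustering with irrevocable assignments
      (illustrative, not the paper's algorithm)."""
      def __init__(self, new_cluster_threshold=1.0):
          self.means, self.counts = [], []
          self.threshold = new_cluster_threshold

      def add(self, x):
          x = np.asarray(x, dtype=float)
          if self.means:
              dists = [np.linalg.norm(x - m) for m in self.means]
              i = int(np.argmin(dists))
              if dists[i] <= self.threshold:
                  # Irrevocable: x joins cluster i and shifts its mean,
                  # but is never re-examined afterwards.
                  self.counts[i] += 1
                  self.means[i] += (x - self.means[i]) / self.counts[i]
                  return i
          self.means.append(x.copy())
          self.counts.append(1)
          return len(self.means) - 1

  c = IncrementalClusters(new_cluster_threshold=100.0)  # permissive threshold
  print(c.add([0.0, 0.0]))    # cluster 0
  print(c.add([50.0, 50.0]))  # also cluster 0: mean jumps to (25, 25)
  ```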
- Can you provide more evidence that the clustering algorithm proposed here works well beyond the single problem it was tested on?
- The authors claim that the convergence percentage goes up marginally with the number of episodes, but that is clearly not true for the primary objective, which is very strange. You should definitely explain, or find out, why the learner has a better policy after 100 episodes of learning than after 1000 episodes.
- Why are the sensors assumed to be free of noise? Is that necessary for your approach to work?
- Why do you need the x and y location variables in the function-approximation setting but not in the tabular case? And how are the x and y locations represented? Are they discrete? It seems so from the somewhat confusing explanation (a feature equal to 1 for the agent's position and 0 for all other positions), which sounds like just a tabular representation. Where does the function approximation come in, and which function approximator do you use? The description of the first part of the features relating to the environment and the second part to the localisation of the agent is also unclear. A sketch of this concern follows.
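  To make the concern concrete, here is a minimal sketch, assuming a linear function approximator (which the paper does not specify): with a one-hot feature per (x, y) cell, the linear value estimate w·φ(s) selects exactly one weight per state, i.e. a tabular representation in disguise. Grid dimensions and names are illustrative.

  ```python
  import numpy as np

  GRID_W, GRID_H = 5, 5
  N_STATES = GRID_W * GRID_H

  def one_hot_position(x, y):
      """Feature vector that is 1 at the agent's (x, y) cell and 0 at all
      other positions, as the paper seems to describe."""
      phi = np.zeros(N_STATES)
      phi[y * GRID_W + x] = 1.0
      return phi

  # A linear approximator V(s) = w . phi(s) over these features...
  w = np.random.randn(N_STATES)
  value = w @ one_hot_position(2, 3)

  # ...is identical to a table lookup, one weight per state:
  assert value == w[3 * GRID_W + 2]  # same entry one_hot_position(2, 3) selects
  ```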
- Specify the distinctive contributions.
- Cite the Horde paper by Sutton et al.
- Furthermore, the way the authors now conclude that their approach is twice as fast as the baseline is beside the point. They cannot simply divide by the number of objectives the approach learned and claim it learned faster; perhaps what it learned is useless. The authors should instead perform experiments where the learner has to learn a second and a third task; then they could say something about how much faster their approach is than the baseline. As it stands, the baseline is twice as fast.
- The way the authors evaluate their technique also does not allow them to draw the conclusions formulated in the paper. Instead, the learning scenario should be:
  - learning for a first objective for a certain number of episodes
  - learning for a different second objective for a certain number of episodes
  - learning for a different third objective for a certain number of episodes
  With this setting, the authors would be able to show that the baseline learns the first task and then has to learn the second task completely from scratch, while the proposed approach learns the second task much faster because it has already collected information about that task during the first one, and learns the third task faster still. A sketch of this protocol is given below.
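  As a hedged illustration only, the loop below sketches this sequential-task protocol. The helpers `make_agent` and `run_episodes` are hypothetical stand-ins for the paper's learner and training loop, not its actual API.

  ```python
  def make_agent(method):
      """Hypothetical factory for a learner ('baseline' or 'proposed'); stub."""
      return {"method": method}

  def run_episodes(agent, objective, max_episodes):
      """Hypothetical training run; should return the number of episodes the
      agent needed to converge on `objective`. Stubbed out for illustration."""
      return 0

  OBJECTIVES = ["objective_1", "objective_2", "objective_3"]

  for method in ["baseline", "proposed"]:
      agent = make_agent(method)  # keeps its knowledge across the three tasks
      for objective in OBJECTIVES:
          episodes = run_episodes(agent, objective, max_episodes=1000)
          # The interesting comparison: the proposed approach should need far
          # fewer episodes on objectives 2 and 3 if its secondary-objective
          # value functions transfer as claimed; the baseline should not.
          print(method, objective, episodes)
  ```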
- The proof is not complete.
Checks:
- The sentences of the paper are too long to understand; I hope the authors can use simple sentences as much as possible.
- In Algorithm 1, line 11: u should be \mu.
- "it has to BE kept in mind that ...": the "be" is missing in the paper.
- Please improve the wording.
- The algorithms should be self-contained. For example, Algorithm 1 should contain Equations 2 and 3, not just references to them.
- The experimental results should be sufficiently explained. For example, the content of Figure 6 is not commented on.
- In Algorithm 2, line 18, explain the roundoff and why it is necessary.