Trajectory Sampling
The main disadvantage of Dyna-Q is that, in the simulation (planning) loop, the states are selected at random. This is inefficient: rather than spending updates on arbitrary states, we should focus on the states we are most likely to encounter while interacting with the environment. Trajectory sampling improves the simulation loop accordingly. Instead of starting from a randomly chosen, previously visited state, the simulated trajectory starts at the current state and continues through the learned model, with actions chosen according to the current action-value function. The action-value function is then updated along this simulated trajectory.
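The simulation loop described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. It assumes a tabular setting, a deterministic learned model stored as a dictionary from (state, action) to (reward, next_state, done), epsilon-greedy action selection as one way of acting on the current action-value function, and the usual Q-learning update. The names trajectory_planning and epsilon_greedy, and all parameter names, are hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    # Break ties between equally good actions at random.
    return rng.choice([a for a in actions if Q[(state, a)] == best])


def trajectory_planning(Q, model, start_state, actions, n_steps,
                        alpha=0.1, gamma=0.95, epsilon=0.1, rng=random):
    """Update Q along one simulated trajectory that starts at start_state.

    Assumption: model maps (state, action) -> (reward, next_state, done)
    and was filled in from real experience. Returns the number of updates.
    """
    state = start_state
    updates = 0
    for _ in range(n_steps):
        # Act according to the current action-value function.
        action = epsilon_greedy(Q, state, actions, epsilon, rng)
        if (state, action) not in model:
            break  # the model has no experience for this pair; stop simulating
        reward, next_state, done = model[(state, action)]
        # Q-learning target computed from the simulated transition.
        if done:
            target = reward
        else:
            target = reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        updates += 1
        if done:
            break
        state = next_state  # continue the trajectory through the model
    return updates


# Usage: a tiny hand-written model of a two-step chain (illustrative data only).
Q = defaultdict(float)
model = {(0, "right"): (0.0, 1, False), (1, "right"): (1.0, 2, True)}
trajectory_planning(Q, model, start_state=0, actions=["right"], n_steps=10)
```

In a full agent, this function would be called after each real environment step, with start_state set to the state the agent is currently in. The only change from the random-state planning loop of Dyna-Q is where the simulated updates are spent: along the path the current policy would follow from the current state.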