1. YongDuk

    A comment or maybe a question.
    Your setting is as follows:
    At line 55, if prev_norm > 1.5, then prev_norm cannot be smaller than 0.5, so the if-statement in line 59 is always false, and therefore the code always executes line 62.
    In that case (prev_norm > 1.5) the code always chooses the inverse of the previous action. The LSTM therefore learns the inverse action only when norm(prev_observation) > 1.5, and otherwise it does not have to learn anything.
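The branch structure being discussed can be sketched as follows. This is a hypothetical reconstruction based only on the comment: the function name, the way the norm is computed, and the default thresholds are assumptions, not the post's actual code.

```python
import math

def correction(prev_observation, prev_action, lstm_action, low=0.5, high=1.5):
    # Hypothetical sketch of the logic described above.
    # CartPole actions are 0 or 1, so the "inverse action" is 1 - action.
    prev_norm = math.sqrt(sum(x * x for x in prev_observation))
    if prev_norm > high:          # the check at "line 55"
        if prev_norm < low:       # "line 59": can never be true once norm > high
            return lstm_action
        return 1 - prev_action    # "line 62": always reached in this branch
    return lstm_action
```

With the norm above 1.5, the inner check is dead code, which is exactly the observation in the comment.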

    Maybe my understanding is wrong and I missed something.
    Above all, I found your algorithm very interesting, and could not help but ask why you chose the two thresholds 1.5 and 0.5.

    In addition, may I ask the theoretical background of the approach? Any related reference is also very welcome!

    • First of all, thanks for your comment. Your observation is correct, and it follows from several hyperparameter adjustments I made. My initial idea is based on the fact that a reward (+1) is given only when the pole is almost vertical, with no difference until it exceeds the limits. So I decided to create a feedback mechanism (an LSTM) that should preserve the target state, keeping the oscillations under a very strict threshold.

      Indeed, the LSTM has little to learn. The important corrections happen only when a state is reaching (or has overcome) a certain limit (in terms of positions and speeds), so the learning steps take place only in those cases and don't reinforce an output that is already correct. Of course, this is a toy problem with several odd restrictions (for example, concerning the reward), and my approach is certainly not the best one. I'm still studying other non-conventional ways to optimize these kinds of problems.
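The "learn only on correction" idea can be sketched like this. This is a hypothetical illustration, not the author's actual code: the `maybe_train` gate, the `threshold` default, and the model interface are all assumptions.

```python
import math

def maybe_train(model, observation, target_action, threshold=1.5):
    """Train only when the state approaches the limit; outputs that are
    already correct are left unreinforced (hypothetical sketch)."""
    norm = math.sqrt(sum(x * x for x in observation))
    if norm > threshold:
        model.fit(observation, target_action)  # single corrective update
        return True
    return False

class RecordingModel:
    # Stand-in model that just records training calls.
    def __init__(self):
        self.calls = []
    def fit(self, observation, target_action):
        self.calls.append((observation, target_action))

m = RecordingModel()
maybe_train(m, [0.1, 0.0, 0.0, 0.0], 1)  # norm small: no training step
maybe_train(m, [2.0, 0.0, 0.0, 0.0], 0)  # limit exceeded: one update
```

The point of the gate is that correct behavior generates no gradient signal at all, which matches the claim that the network "doesn't reinforce an output that is already correct."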

      There's no theoretical background beyond the standard literature on LSTMs and reinforcement learning. I'd like to write a short paper showing some results, but I'm still collecting data.

  2. Kurt Peek

    Hi Giuseppe,

    Impressive result! According to the OpenAI gym evaluation (https://gym.openai.com/evaluations/eval_JxPKNwd1QjaofWkaE4aLfQ), your algorithm takes 0 episodes to solve – that is, it manages to balance the cart pole at the first try, without ever falling.

    Do I interpret this result correctly, and if so, can you perhaps explain at a high level how this works? How can the system ‘learn’ without ever receiving any (negative) reinforcement?

    • Hi Kurt,

      with the current (and default, after some tuning) parameters, corrections happen only after a negative reinforcement. The system develops a high level of inertia, keeping the norm of each observation below a very small value (1.5, as the default condition). Of course, it's fine if the cart stops in a position different from 0 and makes no movements: that means the two speed components are almost null, while there's a constant “bias” due to the position.

      Considering that the real goal is to reduce any oscillation (so both speed components must be as small as possible), the LSTM will train its memory so as to reduce the magnitude of all movements. If you stop the monitor and let the system evolve for a longer time, you can see that during the first steps it can easily fail after 500-600 steps (so the episode is solved, but the cart may reach one of the boundaries); starting from the second or third attempt, however, it will remain in a stable position indefinitely.

      Try to change the values of both parameters: it can be interesting to compare different results!

      • An addendum: I'd like to test this solution with different starting conditions (unfortunately I can't…). In those cases, I'm afraid the learning speed becomes much slower, and it may even be impossible to reach convergence while keeping the “done” flag on.
