Deep neural networks introduce challenges due to the non-stationary nature of RL training. One source of these challenges is that network predictions can churn: after each batch update, predictions change in an uncontrolled way for states not included in the batch. Although such churn occurs at every step of network training, how it arises and how it impacts RL remain under-explored.
In this work, we aim to answer how churn occurs and influences the learning process of DRL agents, and how to deal with churn for better learning performance. Our contributions are summarized below:
In general, churn is defined as the difference between the network's predictions before and after an update, evaluated on data NOT in the batch used to make that update:
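(using notation we introduce here for illustration, which may differ from the paper's) for a network $f_\theta$ updated on a batch $B$, yielding parameters $\theta'$, the churn on an out-of-batch point $\bar{x}$ is

$$c(\bar{x}) \;=\; f_{\theta'}(\bar{x}) - f_{\theta}(\bar{x}), \qquad \bar{x} \notin B.$$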
In the context of deep RL, we care about the value churn and the policy churn:
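(again with our illustrative notation) for a Q-network $Q_\theta$ and a policy network $\pi_\phi$ updated on a batch $B$ to parameters $\theta'$ and $\phi'$, the value churn and the policy churn on out-of-batch data are

$$c_Q(\bar{s},\bar{a}) = Q_{\theta'}(\bar{s},\bar{a}) - Q_{\theta}(\bar{s},\bar{a}), \qquad c_\pi(\bar{s}) = \pi_{\phi'}(\cdot \mid \bar{s}) - \pi_{\phi}(\cdot \mid \bar{s}), \qquad (\bar{s},\bar{a}) \notin B,$$

where the policy difference can be measured with any suitable distance between action distributions.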
We explicitly model churn in a general form within the Generalized Policy Iteration (GPI) framework. In contrast to standard GPI, where churn is neglected, churn now enters the iterative training process of both the value function and the policy function.
As illustrated, we consider both the exact updates for the data in training batches and the implicit changes caused by churn for out-of-batch data.
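To make the out-of-batch effect concrete, here is a minimal PyTorch-style sketch of how one could measure value churn around a single update. It is our own illustration, not code from the paper; names such as `q_net`, `train_batch`, and `ref_batch` are hypothetical.

```python
import torch

def measure_value_churn(q_net, optimizer, loss_fn, train_batch, ref_batch):
    """Measure value churn: how much Q-values change on a reference batch
    that is NOT part of the gradient update (illustrative sketch).

    train_batch: (states, actions, targets) used for the update.
    ref_batch:   (states, actions) held out from the update.
    """
    ref_states, ref_actions = ref_batch

    # Q-values on the reference batch BEFORE the update.
    with torch.no_grad():
        q_before = q_net(ref_states).gather(1, ref_actions.unsqueeze(1))

    # One ordinary TD-style update on the training batch only.
    states, actions, targets = train_batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Q-values on the reference batch AFTER the update.
    with torch.no_grad():
        q_after = q_net(ref_states).gather(1, ref_actions.unsqueeze(1))

    # Churn: prediction change on data the update never touched.
    return (q_after - q_before).abs().mean().item()
```

The same recipe applies to policy churn by comparing the policy's action distributions on the reference states before and after the update.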
This interplay creates a chain effect: a cycle that lasts throughout learning, in which churn and the resulting biases can accumulate and amplify each other.
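As a rough paraphrase of this cycle for a value-based agent: an update on batch B_t → value churn on out-of-batch (s̄, ā) → deviation of the greedy action → biased bootstrap targets → a deviated update on batch B_{t+1} → further churn, and so on; an analogous cycle runs through the policy churn and the policy update in actor-critic agents.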
The Issues Caused by Churn in Popular Deep RL Agents

For DoubleDQN on MinAtar environments, churn can induce a decrease in the value of the greedy action, causing the agent to degrade to inferior actions; while for PPO on continuous control tasks in MuJoCo and DMC, policy churn perturbs the action-likelihood ratio used by PPO's clipping mechanism, so the trust region is implicitly violated for states outside the update batch.
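For the question of how to deal with churn, the paper proposes to reduce it directly via regularization (CHAIN, Churn Approximated ReductIoN). Below is a rough sketch of the general idea under our own simplifying assumptions, not the paper's exact loss: a frozen, periodically refreshed copy of the Q-network provides reference predictions, and an extra loss term penalizes drift away from them on a held-out batch. Names such as `q_ref_net` and `lam` are ours.

```python
import torch
import torch.nn.functional as F

def update_with_churn_regularization(q_net, q_ref_net, optimizer,
                                     train_batch, ref_batch, lam=1.0):
    """One Q-update with an added churn-reduction penalty (illustrative).

    q_ref_net is a frozen, periodically refreshed copy of q_net; the
    penalty keeps predictions on a held-out reference batch close to
    that copy, discouraging uncontrolled drift on out-of-batch data.
    """
    states, actions, targets = train_batch
    ref_states, ref_actions = ref_batch

    # Standard TD regression on the training batch.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_pred, targets)

    # Churn-reduction penalty on out-of-batch reference data.
    with torch.no_grad():
        q_old = q_ref_net(ref_states).gather(1, ref_actions.unsqueeze(1))
    q_new = q_net(ref_states).gather(1, ref_actions.unsqueeze(1))
    churn_loss = F.mse_loss(q_new, q_old)

    loss = td_loss + lam * churn_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return td_loss.item(), churn_loss.item()
```

Refreshing `q_ref_net` every K updates (rather than every step) keeps the penalty's gradient non-trivial between refreshes, and `lam` trades off the TD objective against churn reduction.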
@inproceedings{htang2024improving,
  title     = {Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn},
  author    = {Hongyao Tang and Glen Berseth},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024},
  url       = {https://openreview.net/pdf?id=cQoAgPBARc}
}