NeurIPS 2024

Improving Deep Reinforcement Learning
by Reducing the Chain Effect of Value and Policy Churn

Hongyao Tang, Glen Berseth
Mila - Quebec AI Institute, Université de Montréal

Abstract

Deep neural networks introduce challenges in RL due to the non-stationary nature of training. One source of these challenges is that network predictions can churn: after each batch update, they change in an uncontrolled way for states not included in the batch. Although this churn phenomenon occurs at every step of network training, how churn arises and how it impacts RL remain under-explored.

In this work, we aim to answer how churn occurs, how it influences the learning process of deep RL agents, and how to deal with churn for better learning performance. Our contributions are summarized below:

  1. We formally define and study policy and value churn under the Generalized Policy Iteration (GPI) framework and present the chain effect of churn, which induces compounding churn and biases throughout learning.
  2. We show how churn results in three learning issues in typical DRL settings: Greedy-action Deviation, Trust Region Violation, and Dual Bias of Policy Value.
  3. We propose a general method to reduce the chain effect, called CHurn Approximated ReductIoN (CHAIN), which can be easily plugged into most existing DRL algorithms. In our experiments, we show that CHAIN effectively alleviates the learning issues, improves efficiency and final scores in various settings, and facilitates the scaling of PPO.

TL;DR

What is Churn?

In general, churn is defined as the difference between the network's predictions before and after an update, evaluated on data NOT in the batch used to make the update:
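In symbols (our notation; the paper's exact formulation may differ slightly): for a network $f_\theta$ updated on a batch $B$ into $f_{\theta'}$,

$$
c(x) \;=\; f_{\theta'}(x) - f_{\theta}(x), \qquad x \notin B .
$$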

In the context of deep RL, we care about the value churn and the policy churn:
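For a Q-network $Q_\theta$ and a policy network $\pi_\phi$, one way to write them (again our notation, a sketch rather than the paper's exact definition) is

$$
c_Q(s,a) \;=\; Q_{\theta'}(s,a) - Q_{\theta}(s,a), \qquad
c_\pi(s) \;=\; D\big(\pi_{\phi'}(\cdot \mid s),\, \pi_{\phi}(\cdot \mid s)\big), \qquad (s,a) \notin B,
$$

where $D$ is some distance between action distributions (e.g., the change in action probabilities or action means).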

Where is Churn in Deep RL?

Let's explicitly model churn in a general form within the Generalized Policy Iteration (GPI) framework. In contrast to standard GPI, where churn is neglected, churn now enters the iterative training process of both the value function and the policy.

As illustrated, we consider both the exact updates for the data in training batches and the implicit changes caused by churn for out-of-batch data.

Standard GPI [Sutton and Barto, 2018] (left) vs. GPI under churn (right).
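One way to make this concrete (our notation, simplifying the paper's formulation): each update on a batch $B_t$ changes the value network both where it is trained and where it is not,

$$
Q_{\theta_{t+1}}(s,a) \;=\; Q_{\theta_t}(s,a) +
\begin{cases}
u_t(s,a), & (s,a) \in B_t \quad \text{(intended update)},\\
c_t(s,a), & (s,a) \notin B_t \quad \text{(churn)},
\end{cases}
$$

and analogously for the policy network, so both the evaluation and the improvement steps of GPI operate on functions perturbed by churn.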

How do policy and value churn occur and influence learning?

The Chain Effect of Churn
  1. The network parameter updates cause value and policy churn.
  2. The value and policy churn further lead to deviations in the action gradient and the policy value.
  3. The churn and the deviations then bias the subsequent parameter updates.

The chain effect forms a cycle that lasts throughout learning, in which churn and biases can accumulate and amplify each other.
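Schematically (our paraphrase of the cycle above):

$$
\theta_t \;\xrightarrow{\text{update on } B_t}\; \theta_{t+1}
\;\Longrightarrow\; \underbrace{c_Q,\ c_\pi}_{\text{churn on }(s,a)\notin B_t}
\;\Longrightarrow\; \underbrace{\text{deviated gradients / biased values}}_{\text{deviations}}
\;\Longrightarrow\; \text{biased update on } B_{t+1} \;\Longrightarrow\; \cdots
$$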

The Issues Caused by Churn in Popular Deep RL Agents
  • Greedy action deviation in value-based methods
  • Trust region violation in policy gradient methods
  • Dual bias of policy value in AC methods

Regularizing Churn for Better Deep RL

We propose a simple regularization method, called CHurn Approximated ReductIoN (CHAIN), which reduces churn by taking a conservative stance toward it, explicitly penalizing churn during training (a minimal code sketch follows the list below).
  1. Sample a separate batch alongside the conventional training batch, called the reference batch.
  2. Compute the churn amount with the reference batch as the churn regularization term.
  3. (Optional) Adjust the regularization coefficient according to the relative scale between the regularization term and the conventional deep RL objectives.
  4. Minimize the churn regularization terms together with the conventional RL objectives.
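A minimal PyTorch-style sketch of these steps for a DoubleDQN-like agent. The names (q_net, anchor_net, lam) and the choice of a one-step-lagged frozen copy as the "pre-update" reference are our assumptions; the paper's implementation details may differ.

import copy
import torch
import torch.nn.functional as F

def make_anchor(q_net):
    """Frozen copy of the Q-network; it lags one update behind and serves as
    the pre-update reference for measuring churn."""
    anchor = copy.deepcopy(q_net)
    for p in anchor.parameters():
        p.requires_grad_(False)
    return anchor

def chain_ddqn_update(q_net, target_net, anchor_net, optimizer, batch, ref_batch,
                      gamma=0.99, lam=0.1):
    s, a, r, s_next, done = batch          # conventional training batch
    s_ref, a_ref = ref_batch               # step 1: a separately sampled reference batch

    # Conventional DoubleDQN TD loss on the training batch.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        td_target = r + gamma * (1.0 - done) * target_net(s_next).gather(1, next_a).squeeze(1)
    td_loss = F.mse_loss(q_sa, td_target)

    # Step 2: churn regularization term -- how far the current predictions on the
    # reference batch have drifted from the (lagging) pre-update predictions.
    with torch.no_grad():
        q_ref_old = anchor_net(s_ref).gather(1, a_ref.unsqueeze(1)).squeeze(1)
    q_ref_now = q_net(s_ref).gather(1, a_ref.unsqueeze(1)).squeeze(1)
    churn_loss = F.mse_loss(q_ref_now, q_ref_old)

    # Step 4: minimize both objectives; lam plays the role of the coefficient in step 3.
    loss = td_loss + lam * churn_loss
    optimizer.zero_grad()
    loss.backward()

    # Refresh the anchor with the pre-step parameters, then apply the update.
    anchor_net.load_state_dict(q_net.state_dict())
    optimizer.step()
    return td_loss.item(), churn_loss.item()

Here anchor_net = make_anchor(q_net) is created once at initialization, and lam corresponds to the (optionally adjusted) coefficient in step 3.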

Experiments

CHAIN Reduces the Deviation of Greedy Action

For DoubleDQN in MinAtar environments, churn induces a decrease in the value of the greedy action, causing the agent to deviate to inferior actions; CHAIN largely prevents this deviation and leads to higher sample efficiency and better final scores.

DoubleDQN (DDQN) vs. CHAIN-DDQN in MinAtar tasks.
DDQN vs. CHAIN-DDQN in terms of (from left to right): greedy-action deviation (%), value change of all actions, and value change of the greedy action.
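As a rough illustration of the greedy-action deviation metric above (our own helper, not the paper's evaluation code), it can be measured as the fraction of held-out states whose greedy action changes across a single update:

import torch

@torch.no_grad()
def greedy_action_deviation(q_before, q_after, held_out_states):
    # Fraction of held-out states whose greedy (argmax) action changes after one update.
    a_before = q_before(held_out_states).argmax(dim=1)
    a_after = q_after(held_out_states).argmax(dim=1)
    return (a_before != a_after).float().mean().item()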

CHAIN Reduces Policy Churn and Clip Range Violation

For PPO on continuous-control tasks in MuJoCo and DMC, CHAIN effectively reduces the policy churn and suppresses violations of the clip range, leading to better sample efficiency and higher final scores.

PPO vs. CHAIN-PPO in MuJoCo and DMC tasks.
CHAIN-PPO with different static regularization coefficients. From left to right: episode return, policy churn, policy loss, and the regularization term. (The policy churn diminishes as the learning rate decays.)
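A minimal sketch of how such a policy-churn regularizer can be attached to the PPO objective. The distance measure (a KL penalty here), the names (old_policy, ref_states, lam), and the use of the pre-update policy as the reference are our assumptions; the paper's exact form may differ.

import torch

def chain_ppo_policy_loss(policy, old_policy, minibatch, ref_states, clip_eps=0.2, lam=0.1):
    # PPO clipped surrogate plus a policy-churn penalty on reference states outside the minibatch.
    s, a, old_logp, adv = minibatch

    # Standard PPO clipped surrogate on the training minibatch.
    dist = policy(s)  # assumed to return a torch.distributions object over actions
    ratio = torch.exp(dist.log_prob(a) - old_logp)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()

    # Churn penalty: keep the updated policy close to the pre-update policy on reference states.
    with torch.no_grad():
        ref_dist_old = old_policy(ref_states)
    ref_dist_new = policy(ref_states)
    churn_penalty = torch.distributions.kl_divergence(ref_dist_old, ref_dist_new).mean()

    return -surrogate + lam * churn_penalty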

CHAIN Facilitates the Scaling of PPO

CHAIN alleviates the performance deterioration of PPO with larger networks, whether the MLP-based PPO agent is scaled by widening or by deepening. We find this appealing, as it suggests some promise for addressing the scalability issue by controlling churn or generalization.

Scaling PPO with CHAIN across different scale and learning rate configurations.

BibTeX


@inproceedings{
  htang2024improving,
  title={Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn},
  author={Hongyao Tang and Glen Berseth},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/pdf?id=cQoAgPBARc}
}