
[RLlib] Add support for multi-agent off-policy algorithms in the new API stack. #45182

Conversation

@simonsays1980 (Collaborator) commented May 7, 2024

Why are these changes needed?

Off-policy algorithms were moved from the old to the new stack but so far worked only in single-agent mode. We were missing a standard Learner API for the new stack, which is now available: any LearnerGroup now receives List[EpisodeType] for updates.

This PR adds support for multi-agent setups in off-policy algorithms using the new MultiAgentEpisodeReplayBuffer. It includes all modifications necessary for "independent" sampling and adds a multi-agent SAC example to the learning_tests.
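For illustration, a minimal sketch of what a multi-agent off-policy setup could look like after this change. The env id, policy ids, and mapping function are placeholders (not from this PR), and exact config flags may vary by Ray version:

```python
from ray.rllib.algorithms.sac import SACConfig

# Hypothetical multi-agent SAC config sketch on the new API stack.
# "my_multi_agent_env", "p0"/"p1", and the mapping fn are illustrative
# assumptions, not taken from this PR.
config = (
    SACConfig()
    .environment("my_multi_agent_env")  # placeholder env id
    .multi_agent(
        policies={"p0", "p1"},
        # Map agent ids 0/1 to module ids "p0"/"p1".
        policy_mapping_fn=lambda agent_id, episode, **kw: f"p{agent_id}",
    )
    .training(
        # The multi-agent episode buffer added by this PR; passing it via
        # the type string is an assumption here.
        replay_buffer_config={"type": "MultiAgentEpisodeReplayBuffer"},
    )
)
algo = config.build()
results = algo.train()
```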

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

sven1977 and others added 17 commits April 29, 2024 13:28
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
…ge_episode_buffers_to_return_episode_lists_from_sample
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
…hat held DQN off from learning. In addition fixed some minor bugs.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…rror occurred in CI tests.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…r readability of the test code for users (we describe the connector to add the 'NEXT_OBS' to the batch).

Signed-off-by: Simon Zehnder <[email protected]>
…ependent'-mode sampling. Added multi-agent example for SAC and modified 'compute_gradients' in 'SACTorchLearner' to deal with MARLModules. Commented 2 assertions in connectors that avoided multi-agent setups with 'SingleAgentEpisode's.

Signed-off-by: Simon Zehnder <[email protected]>
```diff
@@ -33,7 +33,10 @@ def __call__(
         # to a batch structure of:
         # [module_id] -> [col0] -> [list of items]
         if is_marl_module and column in rl_module:
-            assert is_multi_agent
+            # assert is_multi_agent
+            # TODO (simon, sven): Check, if we need for other cases this check.
```
This is a good point. There are still some "weird" assumptions left in some connectors' logic. We should comb these out and make it clearer when to enter which loop with SingleAgentEpisodes vs. MultiAgentEpisodes.

Some of this has to do with the fact that EnvRunners can have either a SingleAgentRLModule or a MultiAgentRLModule, but Learners always(!) have a MultiAgentRLModule. Maybe we should have Learners that operate on SingleAgentRLModules, for simplicity and more transparency. It shouldn't be too hard to fix that on the Learner side.
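To make the two code paths concrete, here is an illustrative sketch of the batch layouts this comment refers to (hypothetical, not the actual connector code):

```python
# Single-agent layout (EnvRunner with a SingleAgentRLModule):
#   {column: [item_0, item_1, ...]}
sa_batch = {"obs": ["o0", "o1"]}

# Multi-agent layout (Learner, which always wraps modules in a
# MultiAgentRLModule):
#   {module_id: {column: [item_0, item_1, ...]}}
ma_batch = {"p0": {"obs": ["o0"]}, "p1": {"obs": ["o1"]}}

def is_ma_batch(batch: dict, module_ids: set) -> bool:
    # A batch is "multi-agent shaped" if its top-level keys are module ids.
    return all(key in module_ids for key in batch)
```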

@sven1977 sven1977 changed the title [RLlib] - Add support for multi-agent off-policy algorithms in the new API stack [RLlib] Add support for multi-agent off-policy algorithms in the new API stack. May 10, 2024
```python
        # If no episodes at all, log NaN stats.
        if len(self._done_episodes_for_metrics) == 0:
            self._log_episode_metrics(np.nan, np.nan, np.nan)
        # TODO (simon): This results in hundreds of warnings in the logs
```
We'll have to see. This might lead to Tune errors: at the beginning, when no episode is done yet, Tune may complain that none of the stop criteria (e.g. num_env_steps_sampled_lifetime) can be found in the result dict.

@sven1977 left a comment
LGTM now.

I do have one concern about removing the NaN from the MultiAgentEnvRunner, but we can move it back or find a better solution later (e.g. initialize the most common stop keys in the algo up front).
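A possible direction for the "initialize the most common stop keys" idea, as a hedged sketch (the key names and the helper are illustrative, not a fixed RLlib API):

```python
import numpy as np

# Hypothetical sketch: pre-populate common stop-criteria keys so Tune can
# always evaluate its stopping conditions, even before any episode is done.
DEFAULT_RESULT_KEYS = {
    "num_env_steps_sampled_lifetime": 0,
    "episode_return_mean": np.nan,
}

def with_default_stop_keys(results: dict) -> dict:
    # Only fill in keys that the runner has not reported yet.
    for key, default in DEFAULT_RESULT_KEYS.items():
        results.setdefault(key, default)
    return results
```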

rllib/BUILD (outdated; resolved)
sven1977 and others added 10 commits May 14, 2024 11:07
…footprint of the class. Changes to 'MultiAgentEpisodeReplayBuffer' to reduce memory usage and increase performance.

Signed-off-by: Simon Zehnder <[email protected]>
…irection single-agent buffer. Memory leak should be fixed with this commit.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…:simonsays1980/ray into change_ma_buffer_to_use_list_of_episodes

Signed-off-by: Simon Zehnder <[email protected]>
@sven1977 enabled auto-merge (squash) May 16, 2024 14:55
@github-actions bot added the go (add ONLY when ready to merge, run all tests) label May 16, 2024
@github-actions bot disabled auto-merge May 17, 2024 04:07
@sven1977 enabled auto-merge (squash) May 17, 2024 05:50
…:simonsays1980/ray into change_ma_buffer_to_use_list_of_episodes

Signed-off-by: Simon Zehnder <[email protected]>
@github-actions bot disabled auto-merge May 17, 2024 14:42
@sven1977 merged commit 7fb0ce1 into ray-project:master May 24, 2024
6 checks passed
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024