"""Example on how to define and run with an RLModule with a dependent action space. This examples: - Shows how to write a custom RLModule outputting autoregressive actions. The RLModule class used here implements a prior distribution for the first couple of actions and then uses the sampled actions to compute the parameters for and sample from a posterior distribution. - Shows how to configure a PPO algorithm to use the custom RLModule. - Stops the training after 100k steps or when the mean episode return exceeds -0.012 in evaluation, i.e. if the agent has learned to synchronize its actions. For details on the environment used, take a look at the `CorrelatedActionsEnv` class. To receive an episode return over 100, the agent must learn how to synchronize its actions. How to run this script ---------------------- `python [script file name].py --enable-new-api-stack --num-env-runners 2` Control the number of `EnvRunner`s with the `--num-env-runners` flag. This will increase the sampling speed. For debugging, use the following additional command line options `--no-tune --num-env-runners=0` which should allow you to set breakpoints anywhere in the RLlib code and have the execution stop there for inspection and debugging. For logging to your WandB account, use: `--wandb-key=[your WandB API key] --wandb-project=[some project name] --wandb-run-name=[optional: WandB run name (within the defined project)]` Results to expect ----------------- You should reach an episode return of better than -0.5 quickly through a simple PPO policy. The logic behind beating the env is roughly: OBS: optimal a1: r1: optimal a2: r2: -1 2 0 -1.0 0 -0.5 1/2 -0.5 -0.5/-1.5 0 0 1 0 -1.0 0 0.5 0/1 -0.5 -0.5/-1.5 0 1 0 0 -1.0 0 Meaning, most of the time, you would receive a reward better than -0.5, but worse than 0.0. +--------------------------------------+------------+--------+------------------+ | Trial name | status | iter | total time (s) | | | | | | |--------------------------------------+------------+--------+------------------+ | PPO_CorrelatedActionsEnv_6660d_00000 | TERMINATED | 76 | 132.438 | +--------------------------------------+------------+--------+------------------+ +------------------------+------------------------+------------------------+ | episode_return_mean | num_env_steps_sample | ...env_steps_sampled | | | d_lifetime | _lifetime_throughput | |------------------------+------------------------+------------------------| | -0.43 | 152000 | 1283.48 | +------------------------+------------------------+------------------------+ """ from ray.rllib.algorithms.ppo import PPOConfig from ray.rllib.core.rl_module.rl_module import RLModuleSpec from ray.rllib.examples.envs.classes.correlated_actions_env import CorrelatedActionsEnv from ray.rllib.examples.rl_modules.classes.autoregressive_actions_rlm import ( AutoregressiveActionsRLM, ) from ray.rllib.utils.test_utils import ( add_rllib_example_script_args, run_rllib_example_script_experiment, ) parser = add_rllib_example_script_args( default_iters=1000, default_timesteps=2000000, default_reward=-0.45, ) parser.set_defaults(enable_new_api_stack=True) if __name__ == "__main__": args = parser.parse_args() if args.algo != "PPO": raise ValueError( "This example script only runs with PPO! Set --algo=PPO on the command " "line." ) base_config = ( PPOConfig() .environment(CorrelatedActionsEnv) .training( train_batch_size_per_learner=2000, num_epochs=12, minibatch_size=256, entropy_coeff=0.005, lr=0.0003, ) # Specify the RLModule class to be used. 
if __name__ == "__main__":
    args = parser.parse_args()

    if args.algo != "PPO":
        raise ValueError(
            "This example script only runs with PPO! Set --algo=PPO on the command "
            "line."
        )

    base_config = (
        PPOConfig()
        .environment(CorrelatedActionsEnv)
        .training(
            train_batch_size_per_learner=2000,
            num_epochs=12,
            minibatch_size=256,
            entropy_coeff=0.005,
            lr=0.0003,
        )
        # Specify the custom, autoregressive RLModule class to be used.
        .rl_module(
            rl_module_spec=RLModuleSpec(module_class=AutoregressiveActionsRLM),
        )
    )

    run_rllib_example_script_experiment(base_config, args)