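In the following paragraphs, we first describe RLlib's default behavior for automatically constructing models (if you don't set up a custom model), then dive into how you can customize your models by changing these settings or writing your own model classes.
By default, RLlib will use the following config settings for your models. These include options for the FullyConnectedNetworks (fcnet_hiddens and fcnet_activation), VisionNetworks (conv_filters and conv_activation), auto-RNN wrapping, auto-Attention (GTrXL) wrapping, and some special options for Atari environments: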
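from ray.rllib.utils.deprecation import DEPRECATED_VALUE
from ray.rllib.utils.typing import ModelConfigDict

MODEL_DEFAULTS: ModelConfigDict = {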
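# Experimental flag.
# If True, the user specified that no preprocessor should be created
# (via config._disable_preprocessor_api=True). If True, observations
# arrive in the model exactly as they are returned by the env.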
"_disable_preprocessor_api": False,
# Experimental flag.
# If True, RLlib will no longer flatten the policy-computed actions into
# a single tensor (for storage in SampleCollectors/output files/etc..),
# but leave (possibly nested) actions as-is. Disabling flattening affects:
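# - SampleCollectors: Have to store possibly nested action structs.
# - Models that have the previous action(s) as part of their input.
# - Algorithms reading from offline files (incl. action information).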
"_disable_action_flattening": False,
# === Built-in options ===
# FullyConnectedNetwork (tf and torch): rllib.models.tf|torch.fcnet.py
# These are used if no custom model is specified and the input space is 1D.
# Sizes of the hidden layers, one entry per layer.
"fcnet_hiddens": [256, 256],
# Activation function descriptor.
# Supported values are: "tanh", "relu", "swish" (or "silu", which is the same),
# "linear" (or None).
"fcnet_activation": "tanh",
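# VisionNetwork (tf and torch): rllib.models.tf|torch.visionnet.py
# These are used if no custom model is specified and the input space is 2D.
# Filter config: List of [out_channels, kernel, stride] for each filter.
# Example (for 42x42 observations):
# [[16, [4, 4], 2], [32, [4, 4], 2], [512, [11, 11], 1]]
# Use None to make RLlib try to find a default filter setup given the
# observation space.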
"conv_filters": None,
# Activation function descriptor.
# Supported values are: "tanh", "relu", "swish" (or "silu", which is the same),
# "linear" (or None).
"conv_activation": "relu",
# Some default models support a final FC stack of n Dense layers with given
# activation:
# - Complex observation spaces: Image components are fed through
# VisionNets, flat Boxes are left as-is, Discrete are one-hot'd, then
# everything is concatenated and pushed through this final FC stack.
# - VisionNets (CNNs), e.g. after the CNN stack, there may be
# additional Dense layers.
# - FullyConnectedNetworks will have this additional FCStack as well
# (that's why it's empty by default).
"post_fcnet_hiddens": [],
"post_fcnet_activation": "relu",
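# For DiagGaussian action distributions, make the second half of the model
# outputs floating bias variables instead of state-dependent ones.
# This only has an effect when using the default fully connected net.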
"free_log_std": False,
# Whether to skip the final linear layer used to resize the hidden layer
# outputs to size `num_outputs`. If True, then the last hidden layer
# should already match num_outputs.
"no_final_linear": False,
# Whether layers should be shared for the value function.
"vf_share_layers": True,
# == LSTM ==
# Whether to wrap the model with an LSTM.
"use_lstm": False,
# Max seq len for training the LSTM, defaults to 20.
"max_seq_len": 20,
# Size of the LSTM cell.
"lstm_cell_size": 256,
# Whether to feed a_{t-1} to LSTM (one-hot encoded if discrete).
"lstm_use_prev_action": False,
# Whether to feed r_{t-1} to LSTM.
"lstm_use_prev_reward": False,
# Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
"_time_major": False,
# == Attention Nets (transformer) (experimental: torch-version is untested) ==
# Whether to use a GTrXL ("Gru transformer XL"; attention net) as the
# wrapper Model around the default Model.
"use_attention": False,
# The number of transformer units within GTrXL.
# A transformer unit in GTrXL consists of a) MultiHeadAttention module and
# b) a position-wise MLP.
"attention_num_transformer_units": 1,
# The input and output size of each transformer unit.
"attention_dim": 64,
# The number of attention heads within the MultiHeadAttention units.
"attention_num_heads": 1,
# The dim of a single head (within the MultiHeadAttention units).
"attention_head_dim": 32,
# The memory sizes for inference and training.
"attention_memory_inference": 50,
"attention_memory_training": 50,
# The output dim of the position-wise MLP.
"attention_position_wise_mlp_dim": 32,
# The initial bias values for the 2 GRU gates within a transformer unit.
"attention_init_gru_gate_bias": 2.0,
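# Number n of previous actions a_{t-n:t-1} to feed to GTrXL (one-hot
# encoded if discrete). 0 feeds no previous actions.
"attention_use_n_prev_actions": 0,
# Number n of previous rewards r_{t-n:t-1} to feed to GTrXL.
# 0 feeds no previous rewards.
"attention_use_n_prev_rewards": 0,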
# == Atari ==
# Set to True to enable 4x stacking behavior.
"framestack": True,
# Final resized frame dimension
"dim": 84,
# (deprecated) Converts ATARI frame to 1 Channel Grayscale image
"grayscale": False,
# (deprecated) Changes frame to range from [-1, 1] if true
"zero_mean": True,
# === Options for custom models ===
# Name of a custom model to use
"custom_model": None,
# Extra options to pass to the custom classes. These will be available to
# the Model's constructor in the model_config field. Also, they will be
# attempted to be passed as **kwargs to ModelV2 models. For an example,
# see rllib/models/[tf|torch]/attention_net.py.
"custom_model_config": {},
# Name of a custom action distribution to use.
"custom_action_dist": None,
# Custom preprocessors are deprecated. Please use a wrapper class around
# your environment instead to preprocess observations.
"custom_preprocessor": None,
# === Options for ModelConfigs in RLModules ===
# The latent dimension to encode into.
# Since most RLModules have an encoder and heads, this establishes an agreement
# on the dimensionality of the latent space they share.
# This has no effect for models outside RLModule.
# If None, model_config['fcnet_hiddens'][-1] value will be used to guarantee
# backward compatibility to old configs. This yields different models than past
# versions of RLlib.
"encoder_latent_dim": None,
# Whether to always check the inputs and outputs of RLlib's default models for
# their specifications. Input specifications are checked on failed forward passes
# of the models regardless of this flag. If this flag is set to `True`, inputs and
# outputs are checked on every call. This leads to a slow-down and should only be
# used for debugging. Note that this flag is only relevant for instances of
# RLlib's Model class. These are commonly generated from ModelConfigs in RLModules.
"always_check_shapes": False,
# Deprecated keys:
# Use `lstm_use_prev_action` or `lstm_use_prev_reward` instead.
"lstm_use_prev_action_reward": DEPRECATED_VALUE,
# Deprecated in anticipation of RLModules API
"_use_default_native_models": DEPRECATED_VALUE,
}
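To change a few of these defaults, only the keys that differ need to be supplied. The following is a minimal sketch that uses plain dictionaries rather than RLlib itself: `DEFAULTS_SUBSET` copies a handful of entries from the listing above, and `merge_model_config` is a hypothetical helper written for illustration, not an RLlib function.

```python
from copy import deepcopy

# A small subset of the defaults listed above (values copied verbatim).
DEFAULTS_SUBSET = {
    "fcnet_hiddens": [256, 256],
    "fcnet_activation": "tanh",
    "conv_filters": None,
    "conv_activation": "relu",
    "vf_share_layers": True,
    "use_lstm": False,
}


def merge_model_config(overrides, defaults=DEFAULTS_SUBSET):
    """Return a copy of `defaults` updated with `overrides`.

    Hypothetical helper for illustration only. Unknown keys are rejected
    so that typos such as "fcnet_hidden" are caught early.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown model config keys: {sorted(unknown)}")
    merged = deepcopy(defaults)  # never mutate the shared defaults
    merged.update(overrides)
    return merged


# Use a smaller fully connected net with ReLU activations.
model_config = merge_model_config(
    {"fcnet_hiddens": [64, 64], "fcnet_activation": "relu"}
)
```

All keys that are not overridden (for example `vf_share_layers`) keep the default values shown in the listing.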
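After the raw environment outputs have been preprocessed (if applicable), the processed observations are fed through the policy's model. If no custom model is specified (see further below on how to customize models), RLlib picks a default model based on simple heuristics: a fully connected network when the input space is 1D, and a vision network when the input space is 2D (image-like), as noted in the config comments above.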
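These default model types can be further configured via the model config key inside your algorithm config (as discussed above). The available settings are listed above and are also documented in the model catalog file.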
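Going by the comments in the config listing (1D input space: fully connected network; 2D image input space: vision network), the default choice can be sketched roughly as follows. `pick_default_model` is a hypothetical helper and not RLlib's actual selection code. Reading "2D" as an image shape of the form (height, width, channels) is an assumption of this sketch, and other shapes are deliberately left uncovered.

```python
def pick_default_model(obs_shape):
    """Name the default model family for an observation shape.

    Hypothetical helper: it mirrors only the two cases named in the
    config comments and is not RLlib's real selection logic.
    """
    rank = len(obs_shape)
    if rank == 1:
        # Flat vector observations: configured via the fcnet_* keys.
        return "fully_connected"
    if rank == 3:
        # Assumption: an image observation has shape (H, W, C).
        # Configured via the conv_* keys.
        return "vision"
    raise ValueError(f"Shape {obs_shape} is not covered by this sketch")


print(pick_default_model((4,)))         # flat vector observation
print(pick_default_model((84, 84, 3)))  # image observation
```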
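Note that for the vision network case, you will probably have to configure conv_filters if your environment observations have a custom size. For example, for 42x42 observations: "model": {"dim": 42, "conv_filters": [[16, [4, 4], 2], [32, [4, 4], 2], [512, [11, 11], 1]]}. Always make sure that the last Conv2D output has a shape of [B, 1, 1, X], where B is the batch size and X is the number of filters of the last Conv2D layer, so that RLlib can flatten it. An informative error is thrown if this is not the case.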
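In addition, if you set "use_lstm": True or "use_attention": True in your model config, your model's output will be further processed by an LSTM cell (TF or Torch) or an attention (GTrXL) network (TF or Torch), respectively. More generally, RLlib supports the use of recurrent/attention models for all its policy-gradient algorithms (A3C, PPO, PG, IMPALA), and the necessary sequence processing support is built into its policy evaluation utilities.
See above for the additional config keys that configure these two auto-wrappers in more detail (e.g., you can specify the size of the LSTM layer via lstm_cell_size or the attention dim via attention_dim).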
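As an illustration, the two auto-wrappers might be switched on with override dictionaries like the ones below. The key names are taken from the config listing; the values are illustrative choices, not recommendations. `active_wrapper` is a hypothetical helper, and treating the two wrappers as alternatives to each other is an assumption of this sketch.

```python
# Overrides that turn on the automatic LSTM wrapper.
lstm_model_config = {
    "use_lstm": True,
    "lstm_cell_size": 128,         # size of the LSTM cell
    "max_seq_len": 20,             # max sequence length for training
    "lstm_use_prev_action": True,  # also feed a_{t-1} to the LSTM
}

# Overrides that turn on the automatic attention (GTrXL) wrapper.
attention_model_config = {
    "use_attention": True,
    "attention_dim": 64,       # input/output size of each unit
    "attention_num_heads": 2,  # heads per MultiHeadAttention unit
    "attention_head_dim": 32,  # dim of a single head
    "attention_memory_inference": 50,
    "attention_memory_training": 50,
}


def active_wrapper(model_config):
    """Report which auto-wrapper a model config asks for.

    Hypothetical helper. Assumption of this sketch: the two wrappers
    are alternatives, so requesting both is treated as an error.
    """
    use_lstm = model_config.get("use_lstm", False)
    use_attention = model_config.get("use_attention", False)
    if use_lstm and use_attention:
        raise ValueError("This sketch expects at most one wrapper")
    if use_lstm:
        return "lstm"
    if use_attention:
        return "attention"
    return None


print(active_wrapper(lstm_model_config))
print(active_wrapper(attention_model_config))
```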
For fully customized RNN/LSTM/Attention-Net setups, see the Recurrent Models and Attention Networks/Transformers sections below.