Environment documentation

Here you will find documentation for the VCMI environment.

Note

Note that this document is about the VCMI-v3 gym environment, which (as of Aug 2024) is the newest available version. Other environment versions may use different actions, observations and rewards, so please refer to the source code if you need more information about them.

vcmi-gym implements the Gym API spec. Please refer to the Gymnasium documentation for details.

Starting

VCMI is written in C++ and has an ML extension which must be compiled as a dynamic library, libmlclient (compilation instructions here). This library is linked to the main Python process at runtime as part of the gym environment's boot process.

Starting the RL environment automatically starts VCMI in a separate thread. Communication between the two threads is made possible via the connector component (for details on how it works, see the Connector doc).

Starting the environment is as simple as:

import gymnasium as gym
import vcmi_gym

env = gym.make("VCMI-v3", mapname="gym/A1.vmap")

There are more than 30 optional startup arguments which can be used to customize the environment. Information about them can be found in VcmiEnv's __init__ docstring here.

👁️ Observations

On each timestep, the environment returns an observation from a flat (1-D) Box observation space with a total of 12685 floats.

These floats represent the encoded equivalents of various aspects of the current environment state. Three main encoding types are used in the observation:

  • Normalized - input value is normalized between 0 and 1
  • Categorical - input value is one-hot encoded
  • Binary - input value is represented as a binary number

Depending on how NULL values are handled, there are different variations of those encodings. The table below contains examples which illustrate the differences:

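| Encoding | NULL handling | Abbreviation | Input (example attribute, vmax=5) | Output |
|---|---|---|---|---|
| Categorical | Explicit | CE | v=5 | [0, 0, 0, 0, 0, 0, 1] |
| | | | v=3 | [0, 0, 0, 0, 1, 0, 0] |
| | | | v=0 | [0, 1, 0, 0, 0, 0, 0] |
| | | | null | [1, 0, 0, 0, 0, 0, 0] |
| Categorical | Strict | CS | v=5 | [0, 0, 0, 0, 0, 1] |
| | | | v=3 | [0, 0, 0, 1, 0, 0] |
| | | | v=0 | [1, 0, 0, 0, 0, 0] |
| | | | null | error |
| Binary | Explicit | BE | v=5 | [0, 1, 0, 1] |
| | | | v=3 | [0, 0, 1, 1] |
| | | | v=0 | [0, 0, 0, 0] |
| | | | null | [1, 0, 0, 0] |
| Binary | Zero | BZ | v=5 | [1, 0, 1] |
| | | | v=3 | [0, 1, 1] |
| | | | v=0 | [0, 0, 0] |
| | | | null | [0, 0, 0] |
| Binary | Strict | BS | v=5 | [1, 0, 1] |
| | | | v=3 | [0, 1, 1] |
| | | | v=0 | [0, 0, 0] |
| | | | null | error |
| Normalized | Explicit | NE | v=5 | [0, 1] |
| | | | v=3 | [0, 0.6] |
| | | | v=0 | [0, 0] |
| | | | null | [1, 0] |
| Normalized | Strict | NS | v=5 | [1] |
| | | | v=3 | [0.6] |
| | | | v=0 | [0] |
| | | | null | error |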
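
The encodings in the table can be reproduced with a few lines of plain Python. The sketch below is only an illustration: the function names and the `null` argument are hypothetical and not part of vcmi-gym, and the bit order (most significant bit first) is inferred from the example outputs in the table.

```python
def encode_categorical(v, vmax, null="explicit"):
    """One-hot encode v in 0..vmax. null: "explicit" (CE) or "strict" (CS)."""
    explicit = null == "explicit"
    offset = 1 if explicit else 0          # slot 0 is reserved for null in CE
    out = [0] * (vmax + 1 + offset)
    if v is None:
        if not explicit:
            raise ValueError("null is not allowed with strict NULL handling")
        out[0] = 1
    else:
        out[v + offset] = 1
    return out


def encode_binary(v, vmax, null="explicit"):
    """Binary digits of v. null: "explicit" (BE), "zero" (BZ) or "strict" (BS)."""
    nbits = vmax.bit_length()
    if v is None:
        if null == "strict":
            raise ValueError("null is not allowed with strict NULL handling")
        bits = [0] * nbits
    else:
        # Most significant bit first, as in the table (assumption).
        bits = [(v >> i) & 1 for i in reversed(range(nbits))]
    if null == "explicit":
        return [1 if v is None else 0] + bits   # leading null flag
    return bits


def encode_normalized(v, vmax, null="explicit"):
    """Scale v to 0..1. null: "explicit" (NE) or "strict" (NS)."""
    if v is None:
        if null == "strict":
            raise ValueError("null is not allowed with strict NULL handling")
        return [1, 0]
    x = v / vmax
    return [0, x] if null == "explicit" else [x]
```

For example, `encode_binary(None, 5, "zero")` yields all zeros, which is why BZ cannot distinguish a null from `v=0`, while BE spends one extra float to keep them apart.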

All 12685 floats in the observation represent encoded information for a total of 20 stacks (10 attacker stacks, 10 defender stacks) and 165 hexes (the number of hexes on the battlefield).

[Image: observation_space]

Stacks

The first 1960 floats of the observation provide information about the stacks on the battlefield and are distributed as follows:

Stack ID Index Description
1 0 ... 97 Red army slot #1
2 98 ... 195 Red army slot #2
3 196 ... 293 Red army slot #3
4 294 ... 391 Red army slot #4
5 392 ... 489 Red army slot #5
6 490 ... 587 Red army slot #6
7 588 ... 685 Red army slot #7
8 686 ... 783 Red army extra slot #1 *
9 784 ... 881 Red army extra slot #2 *
10 882 ... 979 Red army extra slot #3 *
11 980 ... 1077 Blue army slot #1
12 1078 ... 1175 Blue army slot #2
13 1176 ... 1273 Blue army slot #3
14 1274 ... 1371 Blue army slot #4
15 1372 ... 1469 Blue army slot #5
16 1470 ... 1567 Blue army slot #6
17 1568 ... 1665 Blue army slot #7
18 1666 ... 1763 Blue army extra slot #1 *
19 1764 ... 1861 Blue army extra slot #2 *
20 1862 ... 1959 Blue army extra slot #3 *

* An "extra" slot is used for summoned creatures and war machines. If there are more than 10 alive stacks in the hero's army, some of them will remain "hidden" from the agent.

Note

The terms "attacker", "left" and "red" are used interchangeably here. The same applies for "defender", "right" and "blue". This is because in VCMI, the attacking army is always on the left side and AI training maps are designed such that the red player always attacks the blue player.

Each stack is represented by a total of 98 floats which encode the creature's attributes, similar to what the player would see when right-clicking on the creature: owner, quantity, creature type, attack, defense, etc.:

[Image: creature_stats]

Stack Attribute Index Encoding Description
ID 0 ... 20 CE Stack number
Y_COORD 21 ... 32 CE Y coordinate of stack's front hex
X_COORD 33 ... 48 CE X coordinate of stack's front hex
SIDE 49 ... 51 CE Side in battle (attacker/defender)
QUANTITY 52 ... 53 NE Stack quantity
ATTACK 54 ... 55 NE Attack
DEFENSE 56 ... 57 NE Defense
SHOTS 58 ... 59 NE Shots remaining
DMG_MIN 60 ... 61 NE Dmg (min)
DMG_MAX 62 ... 63 NE Dmg (max)
HP 64 ... 65 NE Hit points
HP_LEFT 66 ... 67 NE Hit points left
SPEED 68 ... 69 NE Speed
WAITED 70 ... 71 NE Waited this turn?
QUEUE_POS 72 ... 73 NE Turn order queue position
RETALIATIONS_LEFT 74 ... 75 NE Retaliations left
IS_WIDE 76 ... 77 NE Is it a two-hex stack?
AI_VALUE 78 ... 79 NE AI value
MORALE 80 ... 81 NE Morale
LUCK 82 ... 83 NE Luck
FLYING 84 ... 85 NE Can fly?
BLIND_LIKE_ATTACK 86 ... 87 NE Chance to blind/paralyze/petrify
ADDITIONAL_ATTACK 88 ... 89 NE Attacks twice?
NO_MELEE_PENALTY 90 ... 91 NE Has no melee penalty?
TWO_HEX_ATTACK_BREATH 92 ... 93 NE Has dragon breath?
NON_LIVING 94 ... 95 NE Is undead or otherwise non-living?
BLOCKS_RETALIATION 96 ... 97 NE Has no enemy retaliation?
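
To locate a given attribute of a given stack in the flat observation, the two tables above can be combined. The following is a minimal sketch with hypothetical helper names (not part of vcmi-gym). It uses the 1-based stack numbering of the first table and copies only a few attribute ranges for brevity.

```python
STACK_SIZE = 98   # floats per stack
N_STACKS = 20     # 10 attacker + 10 defender stacks

# (first index, last index) within a single stack, copied from the table above.
STACK_ATTR_RANGES = {
    "ID": (0, 20),
    "Y_COORD": (21, 32),
    "X_COORD": (33, 48),
    "SIDE": (49, 51),
    "QUANTITY": (52, 53),
    "HP_LEFT": (66, 67),
    "SPEED": (68, 69),
}


def stack_attr_slice(stack_id, attr):
    """Slice of the flat observation holding `attr` of stack `stack_id` (1..20)."""
    if not 1 <= stack_id <= N_STACKS:
        raise ValueError(f"stack_id must be in 1..{N_STACKS}")
    base = (stack_id - 1) * STACK_SIZE
    first, last = STACK_ATTR_RANGES[attr]
    return slice(base + first, base + last + 1)


# Usage with a stand-in observation (a real one would come from env.reset()).
obs = [0.0] * 12685
quantity = obs[stack_attr_slice(11, "QUANTITY")]   # 2 floats, NE-encoded
```

In practice, the .decode() convenience method is the easier route; index arithmetic like this is mainly useful when feeding slices of the raw observation to a model.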

Hexes

The remaining 10725 floats of the observation carry information about the 165 battlefield hexes:

[Image: hexes]

Each hex is represented by a total of 65 floats with information about several key characteristics of that hex:

Hex Attribute Index Encoding Description
Y_COORD 0 ... 10 CS Y coordinate of this hex
X_COORD 11 ... 25 CS X coordinate of this hex
STATE_MASK 26 ... 29 BS Hex state flags *
ACTION_MASK 30 ... 43 BZ Hex action flags for the currently active stack **
STACK_ID 44 ... 64 CE Stack ID (see Stacks)

* The hex state flags are used to describe a (combination of) hex properties:

State flag Explanation
PASSABLE empty/mine/firewall/gate(open)/gate(closed,defender)
STOPPING moat/quicksand
DAMAGING_L moat/mine/firewall that would damage left (i.e. attacker) stacks
DAMAGING_R moat/mine/firewall that would damage right (i.e. defender) stacks

** These hex action flags indicate which actions the currently active stack can perform on that specific hex (see Action space).
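
The position of a hex inside the flat observation follows from the sizes given above. The sketch below uses hypothetical helper names and assumes that hexes are numbered row by row on an 11 x 15 grid, which is consistent with the coordinate ranges in the table and with the hex numbers used in the examples in this document (hex 46 is Y=3, X=1).

```python
STACKS_TOTAL = 20 * 98   # 1960 floats of stack data precede the hex data
HEX_SIZE = 65            # floats per hex
ROWS, COLS = 11, 15      # 11 * 15 = 165 hexes


def hex_id(y, x):
    """Row-major hex number for coordinates (y, x) (assumed numbering)."""
    if not (0 <= y < ROWS and 0 <= x < COLS):
        raise ValueError("coordinates out of range")
    return y * COLS + x


def hex_slice(hid):
    """Slice of the flat observation holding the 65 floats of hex `hid`."""
    if not 0 <= hid < ROWS * COLS:
        raise ValueError("hex id out of range")
    start = STACKS_TOTAL + hid * HEX_SIZE
    return slice(start, start + HEX_SIZE)
```

The last hex ends exactly at index 12685, which matches the total observation size.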

The example below illustrates how to extract information from the observation via the .decode() convenience method:

import gymnasium as gym
import vcmi_gym

env = gym.make("VCMI-v3", mapname="gym/A1.vmap")
observation, _info = env.reset()

""" Raw observation is a numpy array of 12685 floats """
observation
# => array([1., 0., 0., ..., 0., 0., 0.], dtype=float32)

"""
Using env.decode() returns a Battlefield object -- a decoded version of the raw
observation with a minimalistic API allowing to easily explore the information
carried within.
This method is specific to VCMI and is not part of the gym API spec.
"""
bf = env.decode()

""" Get hex 46 (Y=3, X=1). """
h = bf.get_hex(46)  # same as bf.get_hex(3, 1)

""" Print hex data in a human-friendly format. """
h.dump()
# Y_COORD     | 3
# X_COORD     | 1
# STATE_MASK  | PASSABLE
# ACTION_MASK | MOVE

""" Get stack 4. """
s = bf.get_stack(4)

""" Print stack data in a human-friendly format. """
s.dump()
# ID                    | 4
# Y_COORD               | 5
# X_COORD               | 0
# SIDE                  | LEFT
# QUANTITY              | 1
# ATTACK                | 2
# DEFENSE               | 2
# SHOTS                 | 0
# DMG_MIN               | 1
# DMG_MAX               | 1
# HP                    | 1
# ...

🕹️ Actions

vcmi-gym uses a Discrete action space with a total of 2312 actions, which is better thought of as 2 non-hex actions + 2310 hex actions.

The non-hex actions are RETREAT (=0) and WAIT (=1). The remaining values are used for the 14 actions on each hex (there are 165 hexes on the battlefield => 165 * 14 = 2310 hex actions).

For a given hex (0..164), the action value is: 2 + hex_id * 14 + action_index, where 2 is the number of non-hex actions:

Action index Description
0..11 Move to hex and attack at direction 0..11*
12 Move to hex
13 Shoot at hex

e.g. Moving to hex #2 (X=2, Y=0) corresponds to action 2 + 2 * 14 + 12 = 42.

* The 12 attack directions are as follows: 0..5 are the hexes that surround the current unit, while 6..11 are special cases for 2-hex units (3 per side):

[Images: attacks1, attacks3, attacks2]

The example below shows how to obtain the integer value of an action and execute it:

import gymnasium as gym
from vcmi_gym import HexAction

env = gym.make("VCMI-v3", mapname="gym/A2.vmap")
env.reset()  # the env must be reset before it can be decoded or stepped

""" Decode the observation. """
bf = env.decode()

""" Get the integer representation of 'move to hex 46' """
action = bf.get_hex(46).action(HexAction.MOVE)
# => 658

""" Execute the action. """
env.step(action)
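
The same integer can also be computed by hand. The sketch below uses hypothetical helper names and assumes that hex actions start right after the two non-hex actions (an offset of 2). This assumption is inferred from the total of 2312 actions and from the value 658 in the example above.

```python
N_NONHEX_ACTIONS = 2    # RETREAT (=0) and WAIT (=1)
N_HEX_ACTIONS = 14      # action indexes 0..13 per hex
N_HEXES = 165


def encode_action(hex_id, action_index):
    """Integer action for `action_index` (0..13) on hex `hex_id` (0..164)."""
    if not 0 <= hex_id < N_HEXES:
        raise ValueError("hex_id out of range")
    if not 0 <= action_index < N_HEX_ACTIONS:
        raise ValueError("action_index out of range")
    return N_NONHEX_ACTIONS + hex_id * N_HEX_ACTIONS + action_index


def decode_action(action):
    """Return "RETREAT"/"WAIT" or a (hex_id, action_index) tuple."""
    if not 0 <= action < N_NONHEX_ACTIONS + N_HEXES * N_HEX_ACTIONS:
        raise ValueError("action out of range")
    if action < N_NONHEX_ACTIONS:
        return ("RETREAT", "WAIT")[action]
    return divmod(action - N_NONHEX_ACTIONS, N_HEX_ACTIONS)


move_to_46 = encode_action(46, 12)   # action index 12 is "Move to hex"
```

Here `move_to_46` equals 658, and `decode_action(658)` returns `(46, 12)`.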

Action masking

The env object exposes the action_mask() method which is not part of the Gym API spec, but is useful for certain Reinforcement Learning scenarios where invalid actions are masked in order to improve learning performance.

The method returns an np.array with 2312 bool values, indicating the validity of the corresponding action (True means the action is valid).
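
One common use of such a mask is to sample only from valid actions, for example in a random baseline agent. The helper below is a hypothetical sketch, not part of vcmi-gym. It works with any sequence of booleans, including the array returned by action_mask().

```python
import random


def sample_valid_action(mask, rng=random):
    """Pick a random action index whose mask entry is truthy."""
    valid = [i for i, ok in enumerate(mask) if ok]
    if not valid:
        raise ValueError("no valid actions in mask")
    return rng.choice(valid)


# Stand-in mask for illustration: only WAIT (1) and action 658 are valid.
mask = [False] * 2312
mask[1] = True
mask[658] = True
action = sample_valid_action(mask)
```

With a real environment, the mask would be obtained via `env.action_mask()` after each step and the sampled action passed to `env.step()`.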

🍩 Rewards

Rewards are returned on each step based on the calculations below.

Base reward

The base reward on each step is:

$$R_{base} = p_1 * (p_2 + p_3 * D_{net} + V_{net}) + \sigma * p_4 * V_{diff}$$

where:

  • Rbase is the base reward (optionally modified, see below)
  • Dnet is the net damage since the last step (Ddealt - Dreceived)
  • Vnet is the net value of units which died since the last step (Vkilled - Vlost)
  • Vdiff is the difference in the total army value of the two armies
  • σ is a term which evaluates to 1 at battle end, 0 otherwise
  • p1 is a configurable parameter step_reward_mult. It provides control over the "weight" of per-step rewards. If sparse rewards are desired, this parameter can be set to 0 while p4 is set to any non-zero value.
  • p2 is a configurable parameter step_reward_fixed. A negative value can be set for punishing agents who keep running away from the enemy troops to avoid damage.
  • p3 is a configurable parameter reward_dmg_factor. Adjusts the role of the damage in the reward calculations. If set to 0, the reward will only depend on the actual units killed, regardless of the damage dealt.
  • p4 is a configurable parameter term_reward_mult. It provides control over the "weight" of the terminal reward.
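
The base reward calculation can be written out as a small function. This is a sketch of the formula as described here, not the environment's actual implementation. It takes the four configurable parameters in the order they are listed, and the default values are placeholders rather than vcmi-gym's real defaults.

```python
def base_reward(dmg_dealt, dmg_received, value_killed, value_lost,
                value_diff, battle_ended,
                step_reward_mult=1.0, step_reward_fixed=0.0,
                reward_dmg_factor=1.0, term_reward_mult=1.0):
    """Base reward for one step, following the formula above."""
    d_net = dmg_dealt - dmg_received          # net damage since the last step
    v_net = value_killed - value_lost         # net value of units that died
    sigma = 1 if battle_ended else 0          # terminal term is active only at battle end
    step_part = step_reward_mult * (
        step_reward_fixed + reward_dmg_factor * d_net + v_net
    )
    return step_part + sigma * term_reward_mult * value_diff


# Mid-battle step: 100 damage dealt, 40 received, 500 value killed, 200 lost.
r = base_reward(100, 40, 500, 200, value_diff=3000, battle_ended=False,
                step_reward_mult=1, step_reward_fixed=-1,
                reward_dmg_factor=5, term_reward_mult=2)
```

With these illustrative numbers, `r` is `1 * (-1 + 5 * 60 + 300) = 599`. The same step at battle end would add `2 * 3000`, and setting `step_reward_mult=0` leaves only that terminal term (sparse rewards).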

The resulting value (base reward) is further modified by the following "global" reward modifiers:

Clipped reward

Reward clipping is sometimes advised for more stable learning. It is controlled by the reward_clip_tanh_army_frac parameter:

$$C = p_5 * V_{mean}$$ $$R_{clip} = C * \tanh(R_{base} / C)$$

where:

  • C is an intermediate variable used for clarity
  • Rclip is the clipped reward
  • Rbase is the base reward (see above)
  • Vmean is the mean of the total starting values of the two armies
  • p5 is a configurable parameter reward_clip_tanh_army_frac

A value of 0 disables clipping (i.e. Rclip = Rbase).
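
As an illustration of the clipping formula, the sketch below implements it with a hypothetical helper (not the environment's code). The tanh squashes large rewards towards ±C while leaving small rewards almost unchanged.

```python
import math


def clip_reward(r_base, v_mean, reward_clip_tanh_army_frac):
    """Soft-clip r_base to the range (-C, C), where C = frac * v_mean."""
    if reward_clip_tanh_army_frac == 0:
        return r_base                      # clipping disabled
    c = reward_clip_tanh_army_frac * v_mean
    return c * math.tanh(r_base / c)


small = clip_reward(1.0, v_mean=1000, reward_clip_tanh_army_frac=1)
large = clip_reward(10_000.0, v_mean=1000, reward_clip_tanh_army_frac=1)
```

Here `small` stays very close to 1, whereas `large` is squashed to just under 1000 (the value of C).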

Scaled reward

Scaling rewards based on the initial starting army values can be achieved via the reward_army_value_ref parameter.

$$R_{scale} = R_{clip} * p_6 / V_{mean}$$

where:

  • Rscale is the scaled reward
  • Rclip is the clipped reward (see above)
  • p6 is a configurable parameter reward_army_value_ref

The effect of scaled rewards can be explained via an example:

Consider these two VCMI battles:

  • (A) armies with total starting army value = 1K (early game army)
  • (B) armies with total starting army value = 100K (late game army)

Without scaling, the rewards in battle A would be 100 times smaller than the rewards in battle B.

With an army ref of 10K, A's and B's rewards will be multiplied by 10 and 0.1 respectively, effectively negating this discrepancy and ensuring the RL agent perceives early-game and late-game battles as equally significant.
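
The numbers in this example can be checked with a short sketch. It uses a hypothetical helper and assumes the scaling factor is reward_army_value_ref divided by Vmean, which is what the multipliers 10 and 0.1 imply.

```python
def scale_reward(r_clip, v_mean, reward_army_value_ref):
    """Scale a reward so that battles of different army sizes are comparable."""
    return r_clip * reward_army_value_ref / v_mean


ARMY_REF = 10_000

# Battle A (1K armies) yields a reward of 50, battle B (100K armies) yields 5000.
scaled_a = scale_reward(50, v_mean=1_000, reward_army_value_ref=ARMY_REF)
scaled_b = scale_reward(5_000, v_mean=100_000, reward_army_value_ref=ARMY_REF)
```

Both `scaled_a` and `scaled_b` come out as 500, so a proportionally similar outcome produces the same reward in either battle.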

🖼️ Rendering

This gym environment supports only one type of rendering: the ANSI render.

The output should be printed in a terminal with ANSI color code support, Unicode support and a monospaced font (e.g. Ubuntu Mono):

import gymnasium as gym
import vcmi_gym

env = gym.make("VCMI-v3", mapname="gym/A1.vmap")
env.reset()  # the env must be reset before it can be rendered
print(env.render())

[Image: render]

Tip

If your output looks unaligned, try changing the font of your terminal

Test Helper

If you want to test the env by playing manually, a convenient helper is provided:

from vcmi_gym import VcmiEnv_v3 as VcmiEnv, HexAction, TestHelper

env = VcmiEnv("gym/A1.vmap")
h = TestHelper(env)

h.move(5, 3)
h.defend()
h.wait()
h.amove(5, 4, HexAction.AMOVE_R)